06 Apr, 2019

1 commit

  • [ Upstream commit 058fdecc6de7cdecbf4c59b851e80eb2d6c5295f ]

    When a new I/O request arrives for a bfq_queue, say Q, bfq checks
    whether that request is close to
    (a) the head request of some other queue waiting to be served, or
    (b) the last request dispatched for the in-service queue (in case Q
    itself is not the in-service queue)

    If a queue, say Q2, is found for which the above condition holds, then
    bfq merges Q and Q2, to hopefully get a more sequential I/O in the
    resulting merged queue, and thus a possibly higher throughput.

    Case (b) is checked by comparing the new request for Q with the last
    request dispatched, assuming that the latter necessarily belonged to the
    in-service queue. Unfortunately, this assumption is no longer always
    correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
    into seeky idle queues on NCQ flash").

    When the assumption does not hold, queues that must not be merged may be
    merged, causing unexpected loss of control on per-queue service
    guarantees.

    This commit solves this problem by adding an extra field, which stores
    the actual last request dispatched for the in-service queue, and by
    using this new field to correctly check case (b).
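
    In sketch form (field and helper names here are illustrative, not
    necessarily those of the actual patch), the fix records the position
    of the last request dispatched for the in-service queue, and case (b)
    then compares against that value instead of the global last position:

        /* sketch: remember where the in-service queue last did I/O */
        static void bfq_update_dispatch_pos(struct bfq_data *bfqd,
                                            struct bfq_queue *bfqq,
                                            struct request *rq)
        {
                bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
                if (bfqq == bfqd->in_service_queue)
                        /* extra field added by this fix */
                        bfqd->in_serv_last_pos = bfqd->last_position;
        }

        /* case (b): compare the new request against in_serv_last_pos,
         * not against the position of whatever was dispatched last */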

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     

17 Aug, 2018

2 commits

  • When a sync request is dispatched, the queue that contains that
    request, and all the ancestor entities of that queue, are charged with
the number of sectors of the request. In contrast, if the request is
async, then the queue and its ancestor entities are charged with the
number of sectors of the request, multiplied by an overcharge
factor. This throttles the bandwidth for async I/O, w.r.t. sync
I/O, and it is done to counter the tendency of async writes to steal
I/O throughput from reads.

On the opposite end, the lower this overcharge factor, the more
stable the I/O control, in the following respect: the lower the
factor is, the less the bandwidth enjoyed by a group decreases
- when the group does writes, w.r.t. when it does reads;
- when other groups do reads, w.r.t. when they do writes.

    The fixes "block, bfq: always update the budget of an entity when
    needed" and "block, bfq: readd missing reset of parent-entity service"
    improved I/O control in bfq to such an extent that it has been
    possible to revise this overcharge factor downwards. This commit
    introduces the resulting, new value.
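
    A minimal sketch of the charging rule described above (helper and
    constant names as in the bfq sources, to the best of my reading; the
    real code also exempts weight-raised queues from overcharging, and
    this commit only lowers the value of the factor):

        static unsigned long bfq_serv_to_charge(struct request *rq,
                                                struct bfq_queue *bfqq)
        {
                if (bfq_bfqq_sync(bfqq))
                        /* sync I/O: charge exactly the request size */
                        return blk_rq_sectors(rq);

                /* async I/O: overcharge, to keep writes from stealing
                 * throughput from reads */
                return blk_rq_sectors(rq) * bfq_async_charge_factor;
        }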

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The received-service counter needs to be equal to 0 when an entity is
    set in service. Unfortunately, commit "block, bfq: fix service being
    wrongly set to zero in case of preemption" mistakenly removed the
    resetting of this counter for the parent entities of the bfq_queue
    being set in service. This commit fixes this issue by resetting
    service for parent entities, directly on the expiration of the
    in-service bfq_queue.
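
    In sketch form, assuming BFQ's for_each_entity() macro, which walks
    from an entity up through all its parent entities:

        /* on expiration of the in-service bfq_queue: clear the
         * received-service counter along the whole parent chain */
        struct bfq_entity *entity = &bfqq->entity;

        for_each_entity(entity)
                entity->service = 0;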

    Fixes: 9fae8dd59ff3 ("block, bfq: fix service being wrongly set to zero in case of preemption")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 Jul, 2018

4 commits

  • The actual goal of the function bfq_bfqq_may_idle is to tell whether
    it is better to perform device idling (more precisely: I/O-dispatch
    plugging) for the input bfq_queue, either to boost throughput or to
    preserve service guarantees. This commit improves the name of the
    function accordingly.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • If
    - a bfq_queue Q preempts another queue, because one request of Q
    arrives in time,
    - but, after this preemption, Q is not the queue that is set in service,
    then Q->entity.service is set to 0 when Q is eventually set in
    service. But Q should have continued receiving service with its old
    budget (which is why preemption has occurred) and its old service.

    This commit addresses this issue by resetting service on queue real
    expiration.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • For some bfq_queues, BFQ plugs I/O dispatching when the queue becomes
    idle, and keeps the plug until a new request of the queue arrives, or
    a timeout fires. BFQ does so either to boost throughput or to preserve
    service guarantees for the queue.

    More precisely, for such a queue, plugging starts when the queue
    happens to have either no request enqueued, or no request in flight,
    that is, no request already dispatched but not yet completed.

    On the opposite end, BFQ may happen to expire a queue with no request
    enqueued, without doing any plugging, if the queue still has some
    request in flight. Unfortunately, such a premature expiration causes
    the queue to lose its chance to enjoy dispatch plugging a moment
    later, i.e., when its in-flight requests finally get completed. This
    breaks service guarantees for the queue.

    This commit prevents BFQ from expiring an empty queue if the latter
    still has in-flight requests.
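
    Conceptually, the queue-selection path gains a condition of the
    following shape (a sketch; the predicate deciding whether plugging is
    deserved is simplified here to an illustrative flag):

        /* do not expire bfqq if, although empty, it still has
         * in-flight I/O and will deserve dispatch plugging when
         * that I/O completes */
        if (!RB_EMPTY_ROOT(&bfqq->sort_list) ||
            (bfqq->dispatched > 0 && idling_deserved))
                goto keep_queue;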

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To keep I/O throughput high as often as possible, BFQ performs
I/O-dispatch plugging (aka device idling) only when actually beneficial
for throughput, or when needed for service guarantees (low latency,
    fairness). An important case where the latter condition holds is when
    the scenario is 'asymmetric' in terms of weights: i.e., when some
    bfq_queue or whole group of queues has a higher weight, and thus has
    to receive more service, than other queues or groups. Without dispatch
plugging, lower-weight queues/groups may unjustly steal bandwidth from
higher-weight queues/groups.

    To detect asymmetric scenarios, BFQ checks some sufficient
    conditions. One of these conditions is that active groups have
    different weights. BFQ controls this condition by maintaining a
    special set of unique weights of active groups
(group_weights_tree). For this purpose, in the functions
bfq_active_insert/bfq_active_extract, BFQ adds/removes the weight of a
group to/from this set.

    Unfortunately, the function bfq_active_extract may happen to be
    invoked also for a group that is still active (to preserve the correct
    update of the next queue to serve, see comments in function
    bfq_no_longer_next_in_service() for details). In this case, removing
    the weight of the group makes the set group_weights_tree
    inconsistent. Service-guarantee violations follow.

    This commit addresses this issue by moving group_weights_tree
    insertions from their previous location (in bfq_active_insert) into
    the function __bfq_activate_entity, and by moving group_weights_tree
    extractions from bfq_active_extract to when the entity that represents
a group remains thoroughly idle, i.e., with no request either enqueued
    or dispatched.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

31 May, 2018

7 commits

  • BFQ can deem a bfq_queue as soft real-time only if the queue
    - periodically becomes completely idle, i.e., empty and with
    no still-outstanding I/O request;
    - after becoming idle, gets new I/O only after a special reference
    time soft_rt_next_start.

    In this respect, after commit "block, bfq: consider also past I/O in
    soft real-time detection", the value of soft_rt_next_start can never
    decrease. This causes a problem with the following special updating
    case for soft_rt_next_start: to prevent queues that are not completely
idle from being wrongly detected as soft real-time (when they become
    non-empty again), soft_rt_next_start is temporarily set to infinity
    for empty queues with still outstanding I/O requests. But, if such an
    update is actually performed, then, because of the above commit,
    soft_rt_next_start will be stuck at infinity forever, and the queue
    will have no more chance to be considered soft real-time.

On slow systems, this problem does cause actual soft real-time
applications to occasionally go undetected as such.

    This commit addresses this issue by eliminating the pushing of
    soft_rt_next_start to infinity, and by changing the way non-empty
    queues are prevented from being wrongly detected as soft
    real-time. Simply, a queue that becomes non-empty again can now be
    detected as soft real-time only if it has no outstanding I/O request.
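
    A sketch of the new update rule, on the request-completion path
    (simplified; surrounding conditions dropped):

        /* update the reference time only when the queue is
         * thoroughly idle: nothing enqueued and nothing in flight;
         * the old "push to infinity" branch is gone */
        if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfqq->dispatched == 0)
                bfqq->soft_rt_next_start =
                        bfq_bfqq_softrt_next_start(bfqd, bfqq);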

    Signed-off-by: Davide Sapienza
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Davide Sapienza
     
  • The maximum possible duration of the weight-raising period for
    interactive applications is limited to 13 seconds, as this is the time
    needed to load the largest application that we considered when tuning
    weight raising. Unfortunately, in such an evaluation, we did not
    consider the case of very slow virtual machines.

    For example, on a QEMU/KVM virtual machine
    - running in a slow PC;
    - with a virtual disk stacked on a slow low-end 5400rpm HDD;
    - serving a heavy I/O workload, such as the sequential reading of
    several files;
    mplayer takes 23 seconds to start, if constantly weight-raised.

    To address this issue, this commit conservatively sets the upper limit
    for weight-raising duration to 25 seconds.

    Signed-off-by: Davide Sapienza
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Davide Sapienza
     
  • BFQ computes the duration of weight raising for interactive
    applications automatically, using some reference parameters. In
    particular, BFQ uses the best durations (see comments in the code for
    how these durations have been assessed) for two classes of systems:
    slow and fast ones. Examples of slow systems are old phones or systems
    using micro HDDs. Fast systems are all the remaining ones. Using these
    parameters, BFQ computes the actual duration of the weight raising,
    for the system at hand, as a function of the relative speed of the
    system w.r.t. the speed of a reference system, belonging to the same
    class of systems as the system at hand.

    This slow vs fast differentiation proved to be useful in the past, but
    happens to have little meaning with current hardware. Even worse, it
    does cause problems in virtual systems, where the speed of the system
can vary frequently, and so widely as to confuse the class-detection
mechanism and, as we have verified experimentally, to cause BFQ to
compute nonsensical weight-raising durations.

    This commit addresses this issue by removing the slow class and the
    class-detection mechanism.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • A description of how weight raising works is missing in BFQ
    sources. In addition, the code for handling weight raising is
    scattered across a few functions. This makes it rather hard to
understand the mechanism and its rationale. This commit adds such a
    description at the beginning of the main source file.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since bfq_finish_request() is always called on the request 'next',
    after bfq_requests_merged() is finished, and bfq_finish_request()
    removes 'next' from its bfq_queue if needed, it isn't necessary to do
such a removal in advance in bfq_requests_merged().

    This commit removes such a useless 'next' removal.

    Signed-off-by: Filippo Muzzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Filippo Muzzini
     
  • The request rq passed to the function bfq_requests_merged is always in
    a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the
    beginning of bfq_requests_merged always succeeds, and the control
    flow systematically skips to the end of the function. This implies
    that the body of the function is never executed, i.e., the
    repositioning of rq is never performed.

On the opposite end, a check is missing in the body of the function:
    'next' must be removed only if it is inside a bfq_queue.

    This commit removes the wrong check on rq, and adds the missing check
    on 'next'. In addition, this commit adds comments on
    bfq_requests_merged.

    Signed-off-by: Filippo Muzzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In bfq_requests_merged(), there is a deadlock: bfqq->bfqd->lock is
    already held by the calling function, but the code of
    bfq_requests_merged() tries to grab it again.

    This deadlock is currently hidden by another bug (fixed by the next
    commit for this source file), which causes the body of
    bfq_requests_merged() never to be executed.

    This commit removes the deadlock by removing the lock/unlock pair.

    Signed-off-by: Filippo Muzzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Filippo Muzzini
     

11 May, 2018

5 commits

  • If our shallow depth is smaller than the wake batching of sbitmap,
    we can introduce hangs. Ensure that sbitmap knows how low we'll go.

    Acked-by: Paolo Valente
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • An attempt was made to use bfqd->sb_shift as a cache for the sbitmap
    queue shift, but we don't need it, as it never changes. Kill it with
    fire.

    Acked-by: Paolo Valente
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It doesn't change, so don't put it in the per-IO hot path.

    Acked-by: Paolo Valente
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Reserved tags are used for error handling, we don't need to
    care about them for regular IO. The core won't call us for these
    anyway.

    Acked-by: Paolo Valente
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When invoked for an I/O request rq, the prepare_request hook of bfq
    increments reference counters in the destination bfq_queue for rq. In
    this respect, after this hook has been invoked, rq may still be
    transformed into a request with no icq attached, i.e., for bfq, a
request not associated with any bfq_queue. No further hook is invoked
to signal this transformation to bfq (in general, to the destination
elevator for rq). This leads bfq into an inconsistent state, because
bfq has no chance to correctly lower these counters back. This
inconsistency may in turn cause incorrect scheduling and hangs. It
    certainly causes memory leaks, by making it impossible for bfq to free
    the involved bfq_queue.

On the bright side, no such transformation can happen any longer after rq
    has been inserted into bfq, or merged with another, already inserted,
    request. Exploiting this fact, this commit addresses the above issue
    by delaying the preparation of an I/O request to when the request is
    inserted or merged.

    This change also gives a performance bonus: a lock-contention point
    gets removed. To prepare a request, bfq needs to hold its scheduler
    lock. After postponing request preparation to insertion or merging, no
    lock needs to be grabbed any longer in the prepare_request hook, while
    the lock already taken to perform insertion or merging is used to
prepare the request as well.

    Tested-by: Oleksandr Natalenko
    Tested-by: Bart Van Assche
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 May, 2018

1 commit

  • Currently, struct request has four timestamp fields:

    - A start time, set at get_request time, in jiffies, used for iostats
    - An I/O start time, set at start_request time, in ktime nanoseconds,
    used for blk-stats (i.e., wbt, kyber, hybrid polling)
    - Another start time and another I/O start time, used for cfq and bfq

    These can all be consolidated into one start time and one I/O start
    time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
    request depending on the kernel config.
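
    After the consolidation, the relevant part of struct request reduces
    to something like:

        struct request {
                /* ... */
                u64 start_time_ns;     /* set at allocation, for iostats */
                u64 io_start_time_ns;  /* set at dispatch, for blk-stats
                                        * (wbt, kyber, hybrid polling) */
                /* ... */
        };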

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

18 Apr, 2018

1 commit

  • Even if we don't have an IO context attached to a request, we still
    need to clear the priv[0..1] pointers, as they could be pointing
    to previously used bic/bfqq structures. If we don't do so, we'll
    either corrupt memory on dispatching a request, or cause an
    imbalance in counters.
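
    The fix boils down to clearing the two elevator-private pointers in
    the prepare path even when no icq is attached, roughly:

        /* no I/O context: make sure no stale bic/bfqq pointers,
         * left over by a previous user of this request, survive */
        rq->elv.priv[0] = rq->elv.priv[1] = NULL;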

    Inspired by a fix from Kees.

    Reported-by: Oleksandr Natalenko
    Reported-by: Kees Cook
    Cc: stable@vger.kernel.org
    Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Mar, 2018

1 commit

  • If a storage device handled by BFQ happens to be slower than 7.5 KB/s
    for a certain amount of time (in the order of a second), then the
    estimated peak rate of the device, maintained in BFQ, becomes equal to
    0. The reason is the limited precision with which the rate is
    represented (details on the range of representable values in the
    comments introduced by this commit). This leads to a division-by-zero
error where the estimated peak rate is used as a divisor. A failure of
this type has been reported in [1].

    This commit addresses this issue by:
    1. Lower-bounding the estimated peak rate to 1
    2. Adding and improving comments on the range of rates representable

    [1] https://www.spinics.net/lists/kernel/msg2739205.html
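
    Point 1 is essentially a one-line clamp in the rate-update code,
    along these lines:

        /* never let the estimated peak rate drop to 0: it is later
         * used as a divisor */
        bfqd->peak_rate = max_t(u32, 1, bfqd->peak_rate);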

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

08 Feb, 2018

1 commit

  • Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
    RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
    be re-inserted into the active I/O scheduler for that device. As a
    consequence, I/O schedulers may get the same request inserted again,
    even several times, without a finish_request invoked on that request
    before each re-insertion.

    This fact is the cause of the failure reported in [1]. For an I/O
    scheduler, every re-insertion of the same re-prepared request is
    equivalent to the insertion of a new request. For schedulers like
    mq-deadline or kyber, this fact causes no harm. In contrast, it
    confuses a stateful scheduler like BFQ, which keeps state for an I/O
    request, until the finish_request hook is invoked on the request. In
    particular, BFQ may get stuck, waiting forever for the number of
dispatches of the same request to be balanced by an equal
number of completions (while there will be just one completion for
that request). In this state, BFQ may refuse to serve I/O requests
    from other bfq_queues. The hang reported in [1] then follows.

    However, the above re-prepared requests undergo a requeue, thus the
    requeue_request hook of the active elevator is invoked for these
    requests, if set. This commit then addresses the above issue by
    properly implementing the hook requeue_request in BFQ.

    [1] https://marc.info/?l=linux-block&m=151211117608676
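
    In terms of the blk-mq elevator interface, the change boils down to
    installing a requeue_request hook; a sketch (assuming, as in the bfq
    sources, that one handler serves both the finish and the requeue
    case):

        static struct elevator_type iosched_bfq_mq = {
                .ops.mq = {
                        /* ... */
                        /* undo per-request scheduler state both on
                         * completion and on requeue */
                        .requeue_request = bfq_finish_requeue_request,
                        .finish_request  = bfq_finish_requeue_request,
                        /* ... */
                },
                /* ... */
        };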

    Reported-by: Ivan Kozik
    Reported-by: Alban Browaeys
    Tested-by: Mike Galbraith
    Signed-off-by: Paolo Valente
    Signed-off-by: Serena Ziviani
    Signed-off-by: Jens Axboe

    Paolo Valente
     

18 Jan, 2018

2 commits

  • To maximise responsiveness, BFQ raises the weight, and performs device
    idling, for bfq_queues associated with processes deemed as
    interactive. In particular, weight raising has a maximum duration,
    equal to the time needed to start a large application. If a
    weight-raised process goes on doing I/O beyond this maximum duration,
    it loses weight-raising.

    This mechanism is evidently vulnerable to the following false
    positives: I/O-bound applications that will go on doing I/O for much
    longer than the duration of weight-raising. These applications have
    basically no benefit from being weight-raised at the beginning of
    their I/O. On the opposite end, while being weight-raised, these
    applications
a) unjustly steal throughput from applications that may truly need
    low latency;
    b) make BFQ uselessly perform device idling; device idling results
    in loss of device throughput with most flash-based storage, and may
    increase latencies when used purposelessly.

    This commit adds a countermeasure to reduce both the above
    problems. To introduce this countermeasure, we provide the following
    extra piece of information (full details in the comments added by this
    commit). During the start-up of the large application used as a
    reference to set the duration of weight-raising, involved processes
    transfer at most ~110K sectors each. Accordingly, a process initially
    deemed as interactive has no right to be weight-raised any longer,
    once transferred 110K sectors or more.

Based on this consideration, this commit early-ends weight-raising
    for a bfq_queue if the latter happens to have received an amount of
    service at least equal to 110K sectors (actually, a little bit more,
    to keep a safety margin). I/O-bound applications that reach a high
    throughput, such as file copy, get to this threshold much before the
    allowed weight-raising period finishes. Thus this early ending of
    weight-raising reduces the amount of time during which these
    applications cause the problems described above.
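
    In sketch form (the constant value and the field names are
    illustrative): the service received while weight-raised is
    accumulated, and raising is ended once it crosses the threshold:

        /* ~110K sectors, plus a safety margin (illustrative value) */
        static const unsigned long max_service_from_wr = 120000;

        /* in the weight-raising update path */
        if (bfqq->wr_coeff > 1 &&
            bfqq->service_from_wr > max_service_from_wr)
                bfq_bfqq_end_wr(bfqq); /* interactive raising over */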

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Asynchronous I/O can easily starve synchronous I/O (both sync reads
    and sync writes), by consuming all request tags. Similarly, storms of
    synchronous writes, such as those that sync(2) may trigger, can starve
    synchronous reads. In their turn, these two problems may also cause
BFQ to lose control of latency for interactive and soft real-time
    applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
    Writer takes 0.6 seconds to start if the device is idle, but it takes
    more than 45 seconds (!) if there are sequential writes in the
    background.

    This commit addresses this issue by limiting the maximum percentage of
    tags that asynchronous I/O requests and synchronous write requests can
    consume. In particular, this commit grants a higher threshold to
    synchronous writes, to prevent the latter from being starved by
    asynchronous I/O.
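
    Mechanically, this works through the limit_depth elevator hook, which
    caps the sbitmap depth seen by a tag allocation; a simplified sketch
    (the actual thresholds, and their dependence on device state, are
    omitted and the values below are made up for illustration):

        static void bfq_limit_depth(unsigned int op,
                                    struct blk_mq_alloc_data *data)
        {
                /* sync reads keep the full depth */
                if (op_is_sync(op) && !op_is_write(op))
                        return;

                /* illustrative caps: 75% of tags for sync writes,
                 * 25% for async I/O */
                data->shallow_depth = op_is_sync(op) ?
                        3 * data->q->nr_requests / 4 :
                        data->q->nr_requests / 4;
        }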

    According to the above test, LibreOffice Writer now starts in about
    1.2 seconds on average, regardless of the background workload, and
apart from some rare outliers. To check this improvement, run, e.g.,
    sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
    for the comm_startup_lat benchmark in the S suite [1].

    [1] https://github.com/Algodev-github/S

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

10 Jan, 2018

2 commits

  • Commit '7b9e93616399' ("blk-mq-sched: unify request finished methods")
    renamed the method now called bfq_finish_request, but left the old
    name in place elsewhere in the code (in related comments and as part
    of the function name bfq_put_rq_priv_body).

    This commit fixes all occurrences of the old name of this method by
    changing them into the current name.

    Fixes: 7b9e93616399 ("blk-mq-sched: unify request finished methods")
    Reviewed-by: Paolo Valente
    Signed-off-by: Federico Motta
    Signed-off-by: Chiara Bruschi
    Signed-off-by: Jens Axboe

    Chiara Bruschi
     
  • It's not available if we don't have group io scheduling set, and
    there's no need to call it.

    Fixes: 0d52af590552 ("block, bfq: release oom-queue ref to root group on exit")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

09 Jan, 2018

1 commit

  • On scheduler init, a reference to the root group, and a reference to
    its corresponding blkg are taken for the oom queue. Yet these
    references are not released on scheduler exit, which prevents these
objects from being freed. This commit adds the missing reference
    releases.

    Reported-by: Davide Ferrari
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

06 Jan, 2018

7 commits

  • Commit a33801e8b473 ("block, bfq: move debug blkio stats behind
    CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs:
    one reported in [1], plus a similar one in another function. This
    commit removes both batches, in the way suggested in [1].

    [1] https://www.spinics.net/lists/linux-block/msg20043.html

    Fixes: a33801e8b473 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
    Reported-by: Linus Torvalds
    Tested-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ privileges the I/O of soft real-time applications, such as video
players, to guarantee these applications a high bandwidth and a low
    latency. In this respect, it is not easy to correctly detect when an
    application is soft real-time. A particularly nasty false positive is
    that of an I/O-bound application that occasionally happens to meet all
    requirements to be deemed as soft real-time. After being detected as
    soft real-time, such an application monopolizes the device. Fortunately,
BFQ soon realizes that the application is actually not soft
real-time and suspends its privileges. Yet, the application may happen
to be wrongly detected as soft real-time again, and so on.

    As highlighted by our tests, this problem causes BFQ to occasionally
    fail to guarantee a high responsiveness, in the presence of heavy
    background I/O workloads. The reason is that the background workload
    happens to be detected as soft real-time, more or less frequently,
    during the execution of the interactive task under test. To give an
    idea, because of this problem, Libreoffice Writer occasionally takes 8
    seconds, instead of 3, to start up, if there are sequential reads and
    writes in the background, on a Kingston SSDNow V300.

    This commit addresses this issue by leveraging the following facts.

    The reason why some applications are detected as soft real-time despite
all BFQ checks to avoid false positives is simply that, during high
    CPU or storage-device load, I/O-bound applications may happen to do
    I/O slowly enough to meet all soft real-time requirements, and pass
    all BFQ extra checks. Yet, this happens only for limited time periods:
    slow-speed time intervals are usually interspersed between other time
    intervals during which these applications do I/O at a very high speed.
    To exploit these facts, this commit introduces a little change, in the
    detection of soft real-time behavior, to systematically consider also
    the recent past: the higher the speed was in the recent past, the
    later next I/O should arrive for the application to be considered as
    soft real-time. At the beginning of a slow-speed interval, the minimum
    arrival time allowed for the next I/O usually happens to still be so
high as to fall *after* the end of the slow-speed period itself. As a
consequence, the application does not risk being deemed soft
real-time during the slow-speed interval. Then, during the next
    high-speed interval, the application cannot, evidently, be deemed as
    soft real-time (exactly because of its speed), and so on.
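
    The resulting computation of the reference time can be sketched as a
    maximum of three terms (names approximate the bfq sources; the middle
    term is where the recent-past speed enters):

        bfqq->soft_rt_next_start =
                max3(bfqq->soft_rt_next_start,  /* never decrease */
                     /* the higher the recent service rate, the later
                      * the next I/O must arrive */
                     bfqq->last_idle_bklogged +
                     HZ * bfqq->service_from_backlogged /
                     bfqd->bfq_wr_max_softrt_rate,
                     jiffies + nsecs_to_jiffies(bfqd->bfq_slice_idle) + 4);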

    This extra filtering proved to be rather effective: in the above test,
    the frequency of false positives became so low that the start-up time
    was 3 seconds in all iterations (apart from occasional outliers,
    caused by page-cache-management issues, which are out of the scope of
    this commit, and cannot be solved by an I/O scheduler).

    Tested-by: Lee Tibbert
    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • When two or more processes do I/O in such a way that their requests
    are sequential with respect to one another, BFQ merges the bfq_queues
    associated with the processes. This way the overall I/O pattern
    becomes sequential, and thus there is a boost in throughput.
    These cooperating processes usually start or restart to do I/O shortly
    after each other. So, in order to avoid merging non-cooperating processes,
    BFQ ensures that none of these queues has been in weight raising for too
    long.

In this respect, since commit "block, bfq-sq, bfq-mq: let a queue be
merged only shortly after being created", BFQ checks whether any queue
(and not only weight-raised ones) has been doing I/O continuously for
too long to be merged.

    This new additional check makes the first one useless: a queue doing
I/O for long enough, if weight-raised, is also a queue in
    weight raising for too long to be merged. Accordingly, this commit
    removes the first check.

    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Angelo Ruocco
     
  • In BFQ and CFQ, two processes are said to be cooperating if they do
    I/O in such a way that the union of their I/O requests yields a
    sequential I/O pattern. To get such a sequential I/O pattern out of
    the non-sequential pattern of each cooperating process, BFQ and CFQ
    merge the queues associated with these processes. In more detail,
    cooperating processes, and thus their associated queues, usually
    start, or restart, to do I/O shortly after each other. This is the
    case, e.g., for the I/O threads of KVM/QEMU and of the dump
utility. Based on this assumption, this commit allows a bfq_queue to
    be merged only during a short time interval (100ms) after it starts,
    or re-starts, to do I/O. This filtering provides two important
    benefits.

    First, it greatly reduces the probability that two non-cooperating
    processes have their queues merged by mistake, if they just happen to
    do I/O close to each other for a short time interval. These spurious
    merges cause loss of service guarantees. A low-weight bfq_queue may
    unjustly get more than its expected share of the throughput: if such a
    low-weight queue is merged with a high-weight queue, then the I/O for
    the low-weight queue is served as if the queue had a high weight. This
    may damage other high-weight queues unexpectedly. For instance,
    because of this issue, lxterminal occasionally took 7.5 seconds to
    start, instead of 6.5 seconds, when some sequential readers and
    writers did I/O in the background on a FUJITSU MHX2300BT HDD. The
    reason is that the bfq_queues associated with some of the readers or
    the writers were merged with the high-weight queues of some processes
    that had to do some urgent but little I/O. The readers then exploited
    the inherited high weight for all or most of their I/O, during the
start-up of the terminal. The filtering introduced by this commit
    eliminated any outlier caused by spurious queue merges in our start-up
    time tests.

    This filtering also provides a little boost of the throughput
    sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
    once a bfq_queue cannot be merged any longer, this commit makes BFQ
    stop updating the data needed to handle merging for the queue.
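
    The 100ms window described above can be sketched as follows (constant
    and helper names approximate the bfq sources):

        static const unsigned long bfq_merge_time_limit = HZ/10; /* ~100ms */

        /* a queue that started doing I/O too long ago can no longer
         * be merged; merge bookkeeping for it can also be dropped */
        static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
        {
                return bfqq->service_from_backlogged > 0 &&
                        time_is_before_jiffies(bfqq->first_IO_time +
                                               bfq_merge_time_limit);
        }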

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • A just-created bfq_queue will certainly be deemed as interactive on
    the arrival of its first I/O request, if the low_latency flag is
    set. Yet, if the queue is merged with another queue on the arrival of
    its first I/O request, it will not have the chance to be flagged as
    interactive. Nevertheless, if the queue is then split soon enough, it
    has to be flagged as interactive after the split.

    To handle this early-merge scenario correctly, BFQ saves the state of
    the queue, on the merge, as if the latter had already been deemed
    interactive. So, if the queue is split soon, it will get
    weight-raised, because the previous state of the queue is resumed on
    the split.

    Unfortunately, in the act of saving the state of the newly-created
    queue, BFQ doesn't check whether the low_latency flag is set, and this
    causes early-merged queues to be then weight-raised, on queue splits,
    even if low_latency is off. This commit addresses this problem by
    adding the missing check.

    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Angelo Ruocco
     
  • If two processes do I/O close to each other, then BFQ merges the
    bfq_queues associated with these processes, to get a more sequential
    I/O, and thus a higher throughput. In this respect, to detect whether
    two processes are doing I/O close to each other, BFQ keeps a list of
    the head-of-line I/O requests of all active bfq_queues. The list is
    ordered by initial sectors, and implemented through a red-black tree
    (rq_pos_tree).

    Unfortunately, the update of the rq_pos_tree was incomplete, because
    the tree was not updated on the removal of the head-of-line I/O
    request of a bfq_queue, in case the queue did not remain empty. This
    commit adds the missing update.
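
    A sketch of the added update (helper names follow the bfq sources):
    when the removed request was the head-of-line one and the queue stays
    non-empty, the queue is re-keyed in the tree on its new head request:

        /* in bfq_remove_request(), after rq leaves bfqq->sort_list */
        if (rq == bfqq->next_rq) {
                bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
                if (bfqq->next_rq)
                        /* reposition bfqq in the rq_pos_tree, keyed
                         * on its new head-of-line request */
                        bfq_pos_tree_add_move(bfqd, bfqq);
        }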

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • If two processes do I/O close to each other, i.e., are cooperating
processes in BFQ's (and CFQ's) nomenclature, then BFQ merges their
    associated bfq_queues, so as to get sequential I/O from the union of
    the I/O requests of the processes, and thus reach a higher
    throughput. A merged queue is then split if its I/O stops being
    sequential. In this respect, BFQ deems the I/O of a bfq_queue as
    (mostly) sequential only if less than 4 I/O requests are random, out
    of the last 32 requests inserted into the queue.

    Unfortunately, extensive testing (with the interleaved_io benchmark of
    the S suite [1], and with real applications spawning cooperating
    processes) has clearly shown that, with such a low threshold, only a
    rather low I/O throughput may be reached when several cooperating
    processes do I/O. In particular, the outcome of each test run was
    bimodal: if queue merging occurred and was stable during the test,
    then the throughput was close to the peak rate of the storage device,
    otherwise the throughput was arbitrarily low (usually around 1/10 of
the peak rate with a rotational device). The probability of getting the
unlucky outcome grew with the number of cooperating processes: it was
    already significant with 5 processes, and close to one with 7 or more
    processes.

    The cause of the low throughput in the unlucky runs was that the
    merged queues containing the I/O of these cooperating processes were
    soon split, because they contained more random I/O requests than those
    tolerated by the 4/32 threshold, but
- that I/O would however have allowed the storage device to reach
peak throughput or almost peak throughput;
- in contrast, the I/O of these processes, if served individually
(from separate queues), yielded a rather low throughput.

    So we repeated our tests with increasing values of the threshold,
    until we found the minimum value (19) for which we obtained maximum
    throughput, reliably, with at least up to 9 cooperating
    processes. Then we checked that the use of that higher threshold value
    did not cause any regression for any other benchmark in the suite [1].
    This commit raises the threshold to such a higher value.

    [1] https://github.com/Algodev-github/S
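
    In the code, seekiness is tracked as a 32-bit history with one bit
    per request (1 = random), so the change is essentially one constant;
    a sketch:

        /* old threshold: more than 32/8 = 4 random requests out of
         * the last 32 made a queue seeky; now up to 19 are tolerated */
        #define BFQQ_SEEKY(bfqq)  (hweight32(bfqq->seek_history) > 19)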

    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Nov, 2017

3 commits

  • BFQ currently creates, and updates, its own instance of the whole
    set of blkio statistics that cfq creates. Yet, from the comments
    of Tejun Heo in [1], it turned out that most of these statistics
    are meant/useful only for debugging. This commit makes BFQ create
    the latter, debugging statistics only if the option
    CONFIG_DEBUG_BLK_CGROUP is set.

By doing so, this commit also enables BFQ to enjoy a high performance
    boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
    BFQ has to update far fewer statistics, and, in particular, not the
    heaviest to update. To give an idea of the benefits, if
    CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
    with 8 threads doing random I/O in parallel on null_blk (configured
    with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
    (+30%). We have measured similar or even much higher boosts with other
    CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
    been obtained and can be reproduced very easily with the script in [1].

    [1] https://www.spinics.net/lists/linux-block/msg18943.html
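
    The gating is plain conditional compilation: each debugging-stats
    helper gets an empty inline stub when CONFIG_DEBUG_BLK_CGROUP is off.
    A sketch with one representative helper:

        #ifdef CONFIG_DEBUG_BLK_CGROUP
        void bfqg_stats_update_io_merged(struct bfq_group *bfqg,
                                         unsigned int op)
        {
                blkg_rwstat_add(&bfqg->stats.merged, op, 1);
        }
        #else
        static inline void
        bfqg_stats_update_io_merged(struct bfq_group *bfqg,
                                    unsigned int op) { }
        #endif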

    Suggested-by: Tejun Heo
    Suggested-by: Ulf Hansson
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Luca Miccio
     
  • bfq invokes various blkg_*stats_* functions to update the statistics
    contained in the special files blkio.bfq.* in the blkio controller
    groups, i.e., the I/O accounting related to the proportional-share
    policy provided by bfq. The execution of these functions takes a
    considerable percentage, about 40%, of the total per-request execution
    time of bfq (i.e., of the sum of the execution time of all the bfq
    functions that have to be executed to process an I/O request from its
    creation to its destruction). This reduces the request-processing
    rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
    the bfq functions that invoke blkg_*stats_* functions cannot be
    executed in parallel with the rest of the code of bfq, because both
are executed under the same per-device scheduler lock.

    To reduce this slowdown, this commit moves, wherever possible, the
    invocation of these functions (more precisely, of the bfq functions
    that invoke blkg_*stats_* functions) outside the critical sections
    protected by the scheduler lock.

    With this change, and with all blkio.bfq.* statistics enabled, the
    throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
    i7-4850HQ, in case of 8 threads doing random I/O in parallel on
    null_blk, with the latter configured with 0 latency. We obtained the
    same or higher throughput boosts, up to +30%, with other processors
    (some figures are reported in the documentation). For our tests, we
    used the script [1], with which our results can be easily reproduced.

    NOTE. This commit still protects the invocation of blkg_*stats_*
    functions with the request_queue lock, because the group these
    functions are invoked on may otherwise disappear before or while these
    functions are executed. Fortunately, tests without even this lock
show, by difference, that the serialization caused by this lock has
little impact (at most a ~5% throughput reduction).

    [1] https://github.com/Algodev-github/IOSpeed

    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • bfqg_stats_update_io_add and bfqg_stats_update_io_remove are to be
    invoked, respectively, when an I/O request enters and when an I/O
    request exits the scheduler. Unfortunately, bfq does not fully comply
    with this scheme, because it does not invoke these functions for
    requests that are inserted into or extracted from its priority
    dispatch list. This commit fixes this mistake.

    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio
    Signed-off-by: Jens Axboe

    Luca Miccio
     

09 Oct, 2017

2 commits

  • The commit "block, bfq: decrease burst size when queues in burst
    exit" introduced the decrement of burst_size on the removal of a
    bfq_queue from the burst list. Unfortunately, this decrement can
    happen to be performed even when burst size is already equal to 0,
    because of unbalanced decrements. A description follows of the cause
of these unbalanced decrements, namely a wrong assumption, and of
how this wrong assumption leads to unbalanced decrements.

    The wrong assumption is that a bfq_queue can exit only if the process
    associated with the bfq_queue has exited. This is false, because a
    bfq_queue, say Q, may exit also as a consequence of a merge with
    another bfq_queue. In this case, Q exits because the I/O of its
    associated process has been redirected to another bfq_queue.

    The decrement unbalance occurs because Q may then be re-created after
    a split, and added back to the current burst list, *without*
    incrementing burst_size. burst_size is not incremented because Q is
    not a new bfq_queue added to the burst list, but a bfq_queue only
    temporarily removed from the list, and, before the commit "bfq-sq,
    bfq-mq: decrease burst size when queues in burst exit", burst_size was
    not decremented when Q was removed.

    This commit addresses this issue by just checking whether the exiting
    bfq_queue is a merged bfq_queue, and, in that case, not decrementing
    burst_size. Unfortunately, this still leaves room for unbalanced
    decrements, in the following rarer case: on a split, the bfq_queue
happens to be inserted into a different burst list than the one it was
    removed from when merged. If this happens, the number of elements in
    the new burst list becomes higher than burst_size (by one). When the
    bfq_queue then exits, it is of course not in a merged state any
    longer, thus burst_size is decremented, which results in an unbalanced
    decrement. To handle this sporadic, unlucky case in a simple way,
    this commit also checks that burst_size is larger than 0 before
    decrementing it.

Finally, this commit removes a useless extra check: the check that
    the bfq_queue is sync, performed before checking whether the bfq_queue
    is in the burst list. This extra check is redundant, because only sync
    bfq_queues can be inserted into the burst list.
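
    Both countermeasures fit in the exit path of a bfq_queue, roughly as
    follows (assuming, as in the bfq sources, that a non-NULL bfqq->bic
    means the queue is not currently merged):

        /* on bfq_queue exit: leave the burst list, if in it */
        if (!hlist_unhashed(&bfqq->burst_list_node)) {
                hlist_del_init(&bfqq->burst_list_node);
                /* decrement only for non-merged queues, and never
                 * below 0 (covers the rare split/merge imbalance) */
                if (bfqq->bic && bfqq->bfqd->burst_size > 0)
                        bfqq->bfqd->burst_size--;
        }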

    Fixes: 7cb04004fa37 ("block, bfq: decrease burst size when queues in burst exit")
    Reported-by: Philip Müller
    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Tested-by: Philip Müller
    Tested-by: Oleksandr Natalenko
    Tested-by: Lee Tibbert
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Similarly to CFQ, BFQ has its write-throttling heuristics, and it
    is better not to combine them with further write-throttling
    heuristics of a different nature.
    So this commit disables write-back throttling for a device if BFQ
    is used as I/O scheduler for that device.
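
    The change itself is one call in BFQ's init path; wbt_disable_default()
    is the block layer's opt-out for write-back throttling:

        /* in bfq_init_queue(): BFQ has its own write-throttling
         * heuristics, so switch off the generic wbt mechanism */
        wbt_disable_default(q);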

    Signed-off-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Tested-by: Oleksandr Natalenko
    Tested-by: Lee Tibbert
    Signed-off-by: Jens Axboe

    Luca Miccio