03 Sep, 2020

1 commit

  • [ Upstream commit 2de791ab4918969d8108f15238a701968375f235 ]

    Changes from v1:
    - update commit description with proper ref-accounting justification

    commit db37a34c563b ("block, bfq: get a ref to a group when adding
    it to a service tree") introduced a leak of bfq_group and blkcg_gq
    objects because of a get/put imbalance.
    In fact, the whole idea of the original commit is wrong, because a
    bfq_group entity cannot disappear under us: it is referenced by the
    entities of its child bfq_queues, from here:
    -> bfq_init_entity()
       -> bfqg_and_blkg_get(bfqg)
       -> entity->parent = bfqg->my_entity

    to here:
    -> bfq_put_queue(bfqq)
       FINAL_PUT
       -> bfqg_and_blkg_put(bfqq_group(bfqq))
       -> kmem_cache_free(bfq_pool, bfqq)

    So the parent entity cannot disappear while a child entity is in a
    tree, and child entities already have proper protection.
    This patch reverts commit db37a34c563b ("block, bfq: get a ref to a
    group when adding it to a service tree").
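
    A minimal user-space model of the invariant the revert relies on
    (hypothetical names and types, not the kernel code): the child
    queue pins its parent group from init to final put, so no extra
    get/put around service-tree insertion is needed.

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy stand-ins for bfq_group/bfq_queue; illustrative only. */
    struct group { int ref; };
    struct queue { struct group *parent; };

    static void group_get(struct group *g) { g->ref++; }

    static void group_put(struct group *g)
    {
        if (--g->ref == 0) {
            printf("group freed\n");
            free(g);
        }
    }

    /* Mirrors bfq_init_entity(): the child pins its parent. */
    static struct queue *queue_create(struct group *parent)
    {
        struct queue *q = malloc(sizeof(*q));

        group_get(parent);
        q->parent = parent;
        return q;
    }

    /* Mirrors the FINAL_PUT path of bfq_put_queue(). */
    static void queue_destroy(struct queue *q)
    {
        group_put(q->parent);
        free(q);
    }

    int main(void)
    {
        struct group *g = calloc(1, sizeof(*g));
        struct queue *q;

        g->ref = 1;           /* creator's reference */
        q = queue_create(g);  /* ref == 2 */

        /* While q exists (e.g., while queued in a service tree),
         * g->ref >= 2: no extra get/put around tree insertion is
         * needed, which is why the reverted commit only unbalanced
         * the counter. */
        assert(g->ref == 2);

        queue_destroy(q);     /* ref == 1 */
        group_put(g);         /* ref == 0: freed */
        return 0;
    }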

    bfq_group leak trace caused by the bad commit:
    -> blkg_alloc
       -> bfq_pd_alloc
          -> bfqg_get (+1)
    -> bfq_activate_bfqq
       -> bfq_activate_requeue_entity
          -> __bfq_activate_entity
             -> bfq_get_entity
                -> bfqg_and_blkg_get (+1)        <==== (Note1)
    -> bfq_del_bfqq_busy
       -> bfq_deactivate_entity+0x53/0xc0 [bfq]
          -> __bfq_deactivate_entity+0x1b8/0x210 [bfq]
             -> bfq_forget_entity(is_in_service = true)
                entity->on_st_or_in_serv = false <==== (Note2)
                do not touch reference
    -> blkcg_css_offline
       -> blkcg_destroy_blkgs
          -> blkg_destroy
             -> bfq_pd_offline
                -> __bfq_deactivate_entity
                   if (!entity->on_st_or_in_serv) /* true, because of (Note2) */
                       return false;
             -> bfq_pd_free
                -> bfqg_put() (-1, but bfqg->ref == 2, because of (Note1))
    So bfq_group and blkcg_gq will leak forever; see the test case below.

    ##TESTCASE_BEGIN:
    #!/bin/bash

    max_iters=${1:-100}
    #prep cgroup mounts
    mount -t tmpfs cgroup_root /sys/fs/cgroup
    mkdir /sys/fs/cgroup/blkio
    mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

    # Prepare blkdev
    grep blkio /proc/cgroups
    truncate -s 1M img
    losetup /dev/loop0 img
    echo bfq > /sys/block/loop0/queue/scheduler

    grep blkio /proc/cgroups
    for ((i=0;i<max_iters;i++))
    do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
    done
    ##TESTCASE_END:

    Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Dmitry Monakhov
     

12 Mar, 2020

1 commit

  • commit db37a34c563bf4692b36990ae89005c031385e52 upstream.

    BFQ schedules generic entities, which may represent either bfq_queues
    or groups of bfq_queues. When an entity is inserted into a service
    tree, a reference must be taken, to make sure that the entity does
    not disappear while it is still referred to in the tree.
    Unfortunately, such a
    reference is mistakenly taken only if the entity represents a
    bfq_queue. This commit takes a reference also in case the entity
    represents a group.

    Tested-by: Oleksandr Natalenko
    Tested-by: Chris Evich
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

07 Sep, 2019

1 commit


01 May, 2019

1 commit


22 Apr, 2019

1 commit

  • Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
    comment, and is trivial. The other one is a conflict due to a later fix
    in the bio multi-page work, and needs a bit more care.

    * tag 'v5.1-rc6': (770 commits)
    Linux 5.1-rc6
    block: make sure that bvec length can't be overflow
    block: kill all_q_node in request_queue
    x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
    coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
    mm/kmemleak.c: fix unused-function warning
    init: initialize jump labels before command line option parsing
    kernel/watchdog_hld.c: hard lockup message should end with a newline
    kcov: improve CONFIG_ARCH_HAS_KCOV help text
    mm: fix inactive list balancing between NUMA nodes and cgroups
    mm/hotplug: treat CMA pages as unmovable
    proc: fixup proc-pid-vm test
    proc: fix map_files test on F29
    mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
    mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
    mm: swapoff: shmem_unuse() stop eviction without igrab()
    mm: swapoff: take notice of completion sooner
    mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
    mm: swapoff: shmem_find_swap_entries() filter out other types
    slab: store tagged freelist for off-slab slabmgmt
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • The function bfq_bfqq_expire() invokes the function
    __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
    If this happens, then no other instruction of bfq_bfqq_expire() must
    be executed, or a use-after-free will occur.

    Based on the assumption that __bfq_bfqq_expire() invokes
    bfq_put_queue() on the in-service bfq-queue exactly once, the queue
    is assumed to be freed if its refcounter is equal to one right
    before __bfq_bfqq_expire() is invoked.

    But, since commit 9dee8b3b057e ("block, bfq: fix queue removal from
    weights tree"), this assumption is false: __bfq_bfqq_expire() may
    also invoke bfq_weights_tree_remove(), and the latter function may
    itself invoke bfq_put_queue(). So __bfq_bfqq_expire() may invoke
    bfq_put_queue() twice, and this is the actual case in which the
    in-service queue may happen to be freed.

    To address this issue, this commit moves the check on the refcounter
    of the queue right around the last bfq_put_queue() that may be invoked
    on the queue.

    Fixes: 9dee8b3b057e ("block, bfq: fix queue removal from weights tree")
    Reported-by: Dmitrii Tcvetkov
    Reported-by: Douglas Anderson
    Tested-by: Dmitrii Tcvetkov
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 Apr, 2019

1 commit


01 Apr, 2019

2 commits

  • In most cases, it is detrimental for throughput to plug I/O dispatch
    when the in-service bfq_queue becomes temporarily empty (plugging is
    performed to wait for the possible arrival, soon, of new I/O from the
    in-service queue). There is however a case where plugging is needed
    for service guarantees. If a bfq_queue, say Q, has a higher weight
    than some other active bfq_queue, and is sync, i.e., contains sync
    I/O, then, to guarantee that Q does receive a higher share of the
    throughput than other lower-weight queues, it is necessary to plug I/O
    dispatch when Q remains temporarily empty while being served.

    For this reason, BFQ performs I/O plugging when some active
    bfq_queue has a higher weight than some other active bfq_queue. But
    this is overkill. In fact, if the in-service bfq_queue actually has
    a weight lower than or equal to the other queues, then the queue
    *must not* be guaranteed a higher share of the throughput than the
    other queues. So not plugging I/O cannot cause any harm to the
    queue, and can boost throughput.

    Taking advantage of this fact, this commit does not plug I/O for sync
    bfq_queues with a weight lower than or equal to the weights of the
    other queues. Here is an example of the resulting throughput boost
    with the dbench workload, which is particularly nasty for BFQ. With
    the dbench test in the Phoronix suite, BFQ reaches its lowest total
    throughput with 6 clients on a filesystem with journaling, in case the
    journaling daemon has a higher weight than normal processes. Before
    this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
    after this commit it is ~100 MB/sec.
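
    A sketch of the resulting decision rule (illustrative types and
    names, assuming a flat view of the active queues' weights): plug
    only when the in-service queue is sync and strictly heavier than
    some other active queue.

    #include <stdbool.h>
    #include <stdio.h>

    struct queue_state {
        int  weight;
        bool is_sync;
    };

    static bool should_plug(const struct queue_state *in_service,
                            const int *other_weights, int n_others)
    {
        if (!in_service->is_sync)
            return false;
        for (int i = 0; i < n_others; i++)
            if (in_service->weight > other_weights[i])
                return true;  /* must protect its larger share */
        return false;         /* not heavier than anyone: no plugging */
    }

    int main(void)
    {
        struct queue_state q = { .weight = 100, .is_sync = true };
        int others[] = { 100, 300 };

        /* q is not heavier than any other active queue: before this
         * commit BFQ could still plug here, costing throughput. */
        printf("plug? %s\n", should_plug(&q, others, 2) ? "yes" : "no");
        return 0;
    }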

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
    Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED.
    Code under these ifdefs never worked, so something might be broken.

    Fixes: 0471559c2fbd ("block, bfq: add/remove entity weights correctly")
    Fixes: 73d58118498b ("block, bfq: consider also ioprio classes in symmetry detection")
    Reviewed-by: Holger Hoffstätte
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

01 Feb, 2019

2 commits

    bfq maintains an ordered list, through a red-black tree, of the
    unique weights of the active bfq_queues. This list is used to
    detect whether there are active queues with differentiated weights.
    The weight of a queue is removed from the list when both of the
    following conditions become true:

    (1) the bfq_queue is flagged as inactive;
    (2) the bfq_queue no longer has any in-flight request.

    Unfortunately, in the rare cases where condition (2) becomes true before
    condition (1), the removal fails, because the function to remove the
    weight of the queue (bfq_weights_tree_remove) is rightly invoked in the
    path that deactivates the bfq_queue, but mistakenly invoked *before* the
    function that actually performs the deactivation (bfq_deactivate_bfqq).

    This commit moves the invocation of bfq_weights_tree_remove for
    condition (1) to after bfq_deactivate_bfqq. As a consequence of this
    move, it is necessary to add a further reference to the queue when the
    weight of a queue is added, because the queue might otherwise be freed
    before bfq_weights_tree_remove is invoked. This commit adds this
    reference and makes all related modifications.
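
    A minimal sketch of the two coupled mechanisms (made-up names; a
    flat array stands in for the red-black tree): counting a queue's
    weight takes a reference on the queue, dropped only when the weight
    is removed, so the now-later removal cannot touch a freed queue.

    #include <stdio.h>
    #include <stdlib.h>

    struct queue {
        int ref;
        int weight;
        int weight_counted;   /* is our weight currently counted? */
    };

    static int weights[1024]; /* weights[w]: queues of weight w */

    static void put(struct queue *q)
    {
        if (--q->ref == 0) {
            printf("queue freed\n");
            free(q);
        }
    }

    static void weights_add(struct queue *q)
    {
        if (q->weight_counted)
            return;
        weights[q->weight]++;
        q->weight_counted = 1;
        q->ref++;             /* the extra reference this commit adds */
    }

    static void weights_remove(struct queue *q)
    {
        if (!q->weight_counted)
            return;
        weights[q->weight]--;
        q->weight_counted = 0;
        put(q);               /* paired with the get in weights_add() */
    }

    int main(void)
    {
        struct queue *q = calloc(1, sizeof(*q));

        q->ref = 1;
        q->weight = 10;
        weights_add(q);
        put(q);               /* creator's ref gone: q stays alive, */
                              /* pinned by the weight accounting    */
        printf("queues of weight 10: %d\n", weights[10]);
        weights_remove(q);    /* only now is q actually freed */
        return 0;
    }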

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
    In asymmetric scenarios, i.e., when some bfq_queue or bfq_group
    needs to be guaranteed a different bandwidth than other bfq_queues
    or bfq_groups, these service guarantees can be provided only by
    plugging I/O dispatch, completely or partially, when the queue in
    service remains temporarily empty. A case where asymmetry is
    particularly strong is when some active
    bfq_queues belong to a higher-priority class than some other active
    bfq_queues. Unfortunately, this important case is not considered at all
    in the code for detecting asymmetric scenarios. This commit adds the
    missing logic.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

14 Jan, 2019

1 commit

    The comments on the function __bfq_deactivate_entity contain two
    imprecise or wrong statements:
    1) The function performs the deactivation of the entity.
    2) The function must be invoked only if the entity is on a service
    tree.

    This commit replaces both statements with the correct ones:
    1) The function updates sched_data and the service trees for the
    entity, so as to represent the entity as inactive (which is only
    part of the steps needed for the deactivation of the entity).
    2) The function must be invoked on every entity being deactivated.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

07 Dec, 2018

1 commit

    Since commit 2d29c9f89fcd ("block, bfq: improve asymmetric
    scenarios detection"), if there are process groups with I/O
    requests waiting for completion, then BFQ tags the scenario as
    'asymmetric'. This detection is needed for preserving service
    guarantees (for details, see the comments on the computation of
    the variable asymmetric_scenario in the function
    bfq_better_to_idle).

    Unfortunately, commit 2d29c9f89fcd ("block, bfq: improve
    asymmetric scenarios detection") contains an error exactly in the
    updating of the number of groups with I/O requests waiting for
    completion: if a
    group has more than one descendant process, then the above number of
    groups, which is renamed from num_active_groups to a more appropriate
    num_groups_with_pending_reqs by this commit, may happen to be wrongly
    decremented multiple times, namely every time one of the descendant
    processes gets all its pending I/O requests completed.

    A correct, complete solution should work as follows. Consider a group
    that is inactive, i.e., that has no descendant process with pending
    I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
    is still accounting for this group, because the group still has some
    descendant process with some I/O request still in
    flight. num_groups_with_pending_reqs should be decremented when the
    in-flight request of the last descendant process is finally completed
    (assuming that nothing else has changed for the group in the meantime,
    in terms of composition of the group and active/inactive state of
    child groups and processes). To accomplish this, an additional
    pending-request counter must be added to entities, and must be
    updated correctly.

    To avoid this additional field and operations, this commit resorts to
    the following tradeoff between simplicity and accuracy: for an
    inactive group that is still counted in num_groups_with_pending_reqs,
    this commit decrements num_groups_with_pending_reqs when the first
    descendant process of the group remains with no request waiting for
    completion.

    This simplified scheme provides a fix to the unbalanced decrements
    introduced by 2d29c9f89fcd. Since this error was also caused by lack
    of comments on this non-trivial issue, this commit also adds related
    comments.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Reported-by: Steven Barrett
    Tested-by: Steven Barrett
    Tested-by: Lucjan Lucjanov
    Reviewed-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

26 Oct, 2018

1 commit

    Since commit 2d29c9f89fcd ("block, bfq: improve asymmetric
    scenarios detection"), a scenario is defined as asymmetric when one
    of the following conditions holds:
    - active bfq_queues have different weights
    - one or more groups of entities (bfq_queues or other groups of
    entities) are active
    bfq guarantees fairness and low latency also in such asymmetric
    scenarios, by plugging the dispatching of I/O if the bfq_queue in
    service happens to be temporarily idle. This plugging may lower
    throughput, so it is important to do it only when strictly needed.

    By mistake, in commit 2d29c9f89fcd ("block, bfq: improve asymmetric
    scenarios detection") the num_active_groups counter was first
    incremented and subsequently decremented on any entity (group or
    bfq_queue) weight change.

    This is useless, because only transitions from active to inactive and
    vice versa matter for that counter. Unfortunately this is also
    incorrect in the following case: the entity at issue is a bfq_queue
    and it is under weight raising. In fact in this case there is a
    spurious increment of the num_active_groups counter.

    This spurious increment may cause scenarios to be wrongly detected as
    asymmetric, thus causing useless plugging and loss of throughput.

    This commit fixes this issue by simply removing the above useless and
    wrong increments and decrements.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Federico Motta
     

14 Oct, 2018

1 commit

  • bfq defines as asymmetric a scenario where an active entity, say E
    (representing either a single bfq_queue or a group of other entities),
    has a higher weight than some other entities. If the entity E does sync
    I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
    other entities in the following situation: E is in service but
    temporarily has no pending I/O request. In fact, without this plugging,
    all the times that E stops being temporarily idle, it may find the
    internal queues of the storage device already filled with an
    out-of-control number of extra requests, from other entities. So E may
    have to wait for the service of these extra requests, before finally
    having its own requests served. This may easily break service
    guarantees, with E getting less than its fair share of the device
    throughput. Usually, the end result is that E gets the same fraction of
    the throughput as the other entities, instead of getting more, according
    to its higher weight.

    Yet there are two other more subtle cases where E, even if its weight is
    actually equal to or even lower than the weight of any other active
    entities, may get less than its fair share of the throughput in case the
    above I/O plugging is not performed:
    1. other entities issue larger requests than E;
    2. other entities contain more active child entities than E (or in
    general tend to have more backlog than E).

    In the first case, other entities may get more service than E
    because they get larger requests than those of E served during E's
    temporary idle periods. In the second case, other entities get more
    service because, by having many child entities, they have many
    requests ready for dispatching while E is temporarily idle.

    This commit addresses this issue by extending the definition of
    asymmetric scenario: a scenario is asymmetric when
    - active entities representing bfq_queues have differentiated weights,
    as in the original definition
    or (inclusive)
    - one or more entities representing groups of entities are active.

    This broader definition makes sure that I/O plugging will be performed
    in all the above cases, provided that there is at least one active
    group. Of course, this definition is very coarse, so it will trigger
    I/O plugging also in cases where it is not needed, such as, e.g.,
    multiple active entities with just one child each, and all with the same
    I/O-request size. The reason for this coarse definition is just that a
    finer-grained definition would be rather heavy to compute.

    On the opposite end, even this new definition does not trigger I/O
    plugging in all cases where there is no active group, and all bfq_queues
    have the same weight. So, in these cases some unfairness may occur if
    there are asymmetries in I/O-request sizes. We made this choice because
    I/O plugging may lower throughput, and probably a user that has not
    created any group cares more about throughput than about perfect
    fairness. At any rate, as for possible applications that may care about
    service guarantees, bfq already guarantees a high responsiveness and a
    low latency to soft real-time applications automatically.

    Signed-off-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Federico Motta
     

15 Sep, 2018

1 commit

  • BFQ schedules entities (which represent either per-process queues or
    groups of queues) as a function of their timestamps. In particular, as
    a function of their (virtual) finish times. The finish time of an
    entity is computed as a function of the budget assigned to the entity,
    assuming, tentatively, that the entity, once in service, will receive
    an amount of service equal to its budget. Then, when the entity is
    expired because it has finished being served, this finish time is
    updated as a function of the actual service received by the entity.
    This
    allows the entity to be correctly charged with only the service
    received, and then to be correctly re-scheduled.

    Yet an entity may receive service also while not being the entity in
    service (in the scheduling environment of its parent entity), for
    several reasons. If the entity remains with no backlog while receiving
    this 'unofficial' service, then it is expired. Also on such an
    expiration, the finish time of the entity should be updated to account
    for only the service actually received by the entity. Unfortunately,
    such an update is not performed for an entity expiring without being
    the entity in service.

    In a similar vein, the service counter of the entity in service is
    reset when the entity is expired, so as to be ready for the next
    service cycle. This reset should also be performed when an entity
    is expired because it remains empty after receiving service while
    not being the entity in service. But in this case the reset is not
    performed.

    This commit performs the above update of the finish time and reset of
    the service received, also for an entity expiring while not being the
    entity in service.
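
    A worked toy example of the timestamp rule involved (units, values
    and names illustrative): the virtual finish time is provisionally
    computed from the full budget, and must be recomputed from the
    service actually received when the entity expires.

    #include <stdio.h>

    struct entity {
        double start;    /* virtual start time  */
        double finish;   /* virtual finish time */
        double weight;
    };

    static void calc_finish(struct entity *e, double service)
    {
        e->finish = e->start + service / e->weight;
    }

    int main(void)
    {
        struct entity e = { .start = 100.0, .weight = 10.0 };

        calc_finish(&e, 80.0);  /* tentative: full budget of 80 */
        printf("tentative finish: %.1f\n", e.finish);  /* 108.0 */

        /* The entity empties after only 20 units of service, possibly
         * while not being the entity in service: recharge with the
         * actual service, or it would be rescheduled as if it had
         * consumed all 80. */
        calc_finish(&e, 20.0);
        printf("corrected finish: %.1f\n", e.finish);  /* 102.0 */
        return 0;
    }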

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

17 Aug, 2018

2 commits


09 Jul, 2018

2 commits

  • If
    - a bfq_queue Q preempts another queue, because one request of Q
    arrives in time,
    - but, after this preemption, Q is not the queue that is set in service,
    then Q->entity.service is set to 0 when Q is eventually set in
    service. But Q should have continued receiving service with its old
    budget (which is why preemption has occurred) and its old service.

    This commit addresses this issue by resetting service on queue real
    expiration.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To keep I/O throughput high as often as possible, BFQ performs
    I/O-dispatch plugging (aka device idling) only when beneficial exactly
    for throughput, or when needed for service guarantees (low latency,
    fairness). An important case where the latter condition holds is when
    the scenario is 'asymmetric' in terms of weights: i.e., when some
    bfq_queue or whole group of queues has a higher weight, and thus has
    to receive more service than other queues or groups. Without
    dispatch plugging, lower-weight queues/groups may unjustly steal
    bandwidth from higher-weight queues/groups.

    To detect asymmetric scenarios, BFQ checks some sufficient
    conditions. One of these conditions is that active groups have
    different weights. BFQ controls this condition by maintaining a
    special set of unique weights of active groups
    (group_weights_tree). To this purpose, in the functions
    bfq_active_insert/bfq_active_extract, BFQ adds/removes the weight
    of a group to/from this set.

    Unfortunately, the function bfq_active_extract may happen to be
    invoked also for a group that is still active (to preserve the correct
    update of the next queue to serve, see comments in function
    bfq_no_longer_next_in_service() for details). In this case, removing
    the weight of the group makes the set group_weights_tree
    inconsistent. Service-guarantee violations follow.

    This commit addresses this issue by moving group_weights_tree
    insertions from their previous location (in bfq_active_insert) into
    the function __bfq_activate_entity, and by moving group_weights_tree
    extractions from bfq_active_extract to when the entity that represents
    a group remains thoroughly idle, i.e., with no request either enqueued
    or dispatched.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

18 Jan, 2018

1 commit

  • To maximise responsiveness, BFQ raises the weight, and performs device
    idling, for bfq_queues associated with processes deemed as
    interactive. In particular, weight raising has a maximum duration,
    equal to the time needed to start a large application. If a
    weight-raised process goes on doing I/O beyond this maximum duration,
    it loses weight-raising.

    This mechanism is evidently vulnerable to the following false
    positives: I/O-bound applications that will go on doing I/O for much
    longer than the duration of weight-raising. These applications have
    basically no benefit from being weight-raised at the beginning of
    their I/O. On the opposite end, while being weight-raised, these
    applications
    a) unjustly steal throughput from applications that may truly need
    low latency;
    b) make BFQ uselessly perform device idling; device idling results
    in loss of device throughput with most flash-based storage, and may
    increase latencies when used purposelessly.

    This commit adds a countermeasure to reduce both the above
    problems. To introduce this countermeasure, we provide the following
    extra piece of information (full details in the comments added by this
    commit). During the start-up of the large application used as a
    reference to set the duration of weight-raising, involved processes
    transfer at most ~110K sectors each. Accordingly, a process initially
    deemed as interactive has no right to be weight-raised any longer,
    once transferred 110K sectors or more.

    Based on this consideration, this commit ends weight-raising early
    for a bfq_queue if the latter happens to have received an amount of
    service at least equal to 110K sectors (actually, a little bit more,
    to keep a safety margin). I/O-bound applications that reach a high
    throughput, such as file copy, get to this threshold much before the
    allowed weight-raising period finishes. Thus this early ending of
    weight-raising reduces the amount of time during which these
    applications cause the problems described above.
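
    A sketch of the cut-off (names and the exact safety margin are
    illustrative; the ~110K-sector figure is from the commit): stop
    weight-raising once the queue has received that much service.

    #include <stdbool.h>
    #include <stdio.h>

    /* ~110K sectors plus a safety margin (margin illustrative). */
    #define WR_MAX_SERVICE_SECTORS (110 * 1024 + 8 * 1024)

    struct queue {
        bool          weight_raised;
        unsigned long service_from_wr_start;  /* sectors */
    };

    static void account_service(struct queue *q, unsigned long sectors)
    {
        if (!q->weight_raised)
            return;
        q->service_from_wr_start += sectors;
        if (q->service_from_wr_start >= WR_MAX_SERVICE_SECTORS) {
            /* A truly interactive start-up would have stopped doing
             * I/O long before reaching this amount of service. */
            q->weight_raised = false;
            printf("weight-raising ended early\n");
        }
    }

    int main(void)
    {
        struct queue q = { .weight_raised = true };

        /* A file copy blows through the threshold long before the
         * weight-raising period would have expired on its own. */
        for (int i = 0; i < 200 && q.weight_raised; i++)
            account_service(&q, 1024);
        return 0;
    }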

    Tested-by: Oleksandr Natalenko
    Tested-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

06 Jan, 2018

1 commit

  • In BFQ and CFQ, two processes are said to be cooperating if they do
    I/O in such a way that the union of their I/O requests yields a
    sequential I/O pattern. To get such a sequential I/O pattern out of
    the non-sequential pattern of each cooperating process, BFQ and CFQ
    merge the queues associated with these processes. In more detail,
    cooperating processes, and thus their associated queues, usually
    start, or restart, to do I/O shortly after each other. This is the
    case, e.g., for the I/O threads of KVM/QEMU and of the dump
    utility. Based on this assumption, this commit allows a bfq_queue to
    be merged only during a short time interval (100ms) after it starts,
    or re-starts, to do I/O. This filtering provides two important
    benefits.

    First, it greatly reduces the probability that two non-cooperating
    processes have their queues merged by mistake, if they just happen to
    do I/O close to each other for a short time interval. These spurious
    merges cause loss of service guarantees. A low-weight bfq_queue may
    unjustly get more than its expected share of the throughput: if such a
    low-weight queue is merged with a high-weight queue, then the I/O for
    the low-weight queue is served as if the queue had a high weight. This
    may damage other high-weight queues unexpectedly. For instance,
    because of this issue, lxterminal occasionally took 7.5 seconds to
    start, instead of 6.5 seconds, when some sequential readers and
    writers did I/O in the background on a FUJITSU MHX2300BT HDD. The
    reason is that the bfq_queues associated with some of the readers or
    the writers were merged with the high-weight queues of some processes
    that had to do some urgent but little I/O. The readers then exploited
    the inherited high weight for all or most of their I/O during the
    start-up of the terminal. The filtering introduced by this commit
    eliminated any outlier caused by spurious queue merges in our start-up
    time tests.

    This filtering also provides a little boost of the throughput
    sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
    once a bfq_queue cannot be merged any longer, this commit makes BFQ
    stop updating the data needed to handle merging for the queue.
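
    A sketch of the time filter (names illustrative; the 100ms window
    is from the commit): a queue is mergeable only while the window
    after its last (re)start of I/O is still open.

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define MERGE_WINDOW_NS (100 * 1000 * 1000LL)  /* 100 ms */

    struct queue {
        long long io_start_ns;  /* when the queue (re)started I/O */
    };

    static long long now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    static bool merge_allowed(const struct queue *q)
    {
        return now_ns() - q->io_start_ns < MERGE_WINDOW_NS;
    }

    int main(void)
    {
        struct queue q = { .io_start_ns = now_ns() };

        printf("just (re)started, mergeable: %s\n",
               merge_allowed(&q) ? "yes" : "no");

        q.io_start_ns -= 2 * MERGE_WINDOW_NS;  /* pretend 200ms pass */
        printf("window closed, mergeable: %s\n",
               merge_allowed(&q) ? "yes" : "no");
        return 0;
    }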

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Nov, 2017

1 commit

  • bfq invokes various blkg_*stats_* functions to update the statistics
    contained in the special files blkio.bfq.* in the blkio controller
    groups, i.e., the I/O accounting related to the proportional-share
    policy provided by bfq. The execution of these functions takes a
    considerable percentage, about 40%, of the total per-request execution
    time of bfq (i.e., of the sum of the execution time of all the bfq
    functions that have to be executed to process an I/O request from its
    creation to its destruction). This reduces the request-processing
    rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
    the bfq functions that invoke blkg_*stats_* functions cannot be
    executed in parallel with the rest of the code of bfq, because both
    are executed under the same per-device scheduler lock.

    To reduce this slowdown, this commit moves, wherever possible, the
    invocation of these functions (more precisely, of the bfq functions
    that invoke blkg_*stats_* functions) outside the critical sections
    protected by the scheduler lock.

    With this change, and with all blkio.bfq.* statistics enabled, the
    throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
    i7-4850HQ, in case of 8 threads doing random I/O in parallel on
    null_blk, with the latter configured with 0 latency. We obtained the
    same or higher throughput boosts, up to +30%, with other processors
    (some figures are reported in the documentation). For our tests, we
    used the script [1], with which our results can be easily reproduced.

    NOTE. This commit still protects the invocation of blkg_*stats_*
    functions with the request_queue lock, because the group these
    functions are invoked on may otherwise disappear before or while these
    functions are executed. Fortunately, tests run without even this
    lock show, by difference, that the serialization caused by this
    lock has little impact (at most a ~5% throughput reduction).

    [1] https://github.com/Algodev-github/IOSpeed
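
    The general pattern, as a hedged user-space sketch (pthread mutexes
    stand in for the kernel locks; names are made up): compute what to
    account under the hot scheduler lock, but perform the accounting
    afterwards under a separate, less contended lock.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

    static long dispatched_total;

    static void dispatch_one(void)
    {
        int stats_delta;

        pthread_mutex_lock(&sched_lock);
        /* ... pick a request, update scheduler state ... */
        stats_delta = 1;          /* only *record* what to account */
        pthread_mutex_unlock(&sched_lock);

        /* Accounting happens off the hot path, under another lock
         * (the request_queue lock, in the bfq case). */
        pthread_mutex_lock(&stats_lock);
        dispatched_total += stats_delta;
        pthread_mutex_unlock(&stats_lock);
    }

    int main(void)
    {
        for (int i = 0; i < 1000; i++)
            dispatch_one();
        printf("dispatched_total = %ld\n", dispatched_total);
        return 0;
    }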

    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Luca Miccio
    Signed-off-by: Jens Axboe

    Paolo Valente
     

31 Aug, 2017

3 commits

  • If the function bfq_update_next_in_service is invoked as a consequence
    of the activation or requeueing of an entity, say E, then it doesn't
    invoke bfq_lookup_next_entity to get the next-in-service entity. In
    contrast, it follows a shorter path: if E happens to be eligible (see
    commit "bfq-sq-mq: make lookup_next_entity push up vtime on
    expirations" for details on eligibility) and to have a lower virtual
    finish time than the current candidate as next-in-service entity, then
    E directly becomes the next-in-service entity. Unfortunately, there is
    a corner case for which this shorter path makes
    bfq_update_next_in_service choose a non eligible entity: it occurs if
    both E and the current next-in-service entity happen to be non
    eligible when bfq_update_next_in_service is invoked. In this case, E
    is not set as next-in-service, and, since bfq_lookup_next_entity is
    not invoked, the state of the parent entity is not updated so as to
    end up with an eligible entity as the proper next-in-service entity.

    In this respect, next-in-service is actually allowed to be non
    eligible while some queue is in service: since no system-virtual-time
    push-up can be performed in that case (see again commit "bfq-sq-mq:
    make lookup_next_entity push up vtime on expirations" for details),
    next-in-service is chosen, speculatively, as a function of the
    possible value that the system virtual time may get after a push
    up. But the correctness of the schedule breaks if next-in-service is
    still a non eligible entity when it is time to set in service the next
    entity. Unfortunately, this may happen in the above corner case.

    This commit fixes this problem by making bfq_update_next_in_service
    invoke bfq_lookup_next_entity not only if the above shorter path
    cannot be taken, but also if the shorter path is taken but fails to
    yield an eligible next-in-service entity.
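
    A minimal model of the fixed decision (illustrative types; the full
    scan stands in for bfq_lookup_next_entity): take the shorter path
    when it yields an eligible candidate, otherwise fall back to the
    full lookup.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct entity {
        double start;    /* virtual start time  */
        double finish;   /* virtual finish time */
    };

    static bool eligible(const struct entity *e, double vtime)
    {
        return e->start <= vtime;
    }

    /* Stand-in for bfq_lookup_next_entity(): eligible entity with the
     * smallest virtual finish time. */
    static struct entity *full_lookup(struct entity **tree, int n,
                                      double vtime)
    {
        struct entity *best = NULL;

        for (int i = 0; i < n; i++)
            if (eligible(tree[i], vtime) &&
                (!best || tree[i]->finish < best->finish))
                best = tree[i];
        return best;
    }

    static struct entity *update_next(struct entity *cur,
                                      struct entity *activated,
                                      struct entity **tree, int n,
                                      double vtime)
    {
        /* Shorter path: adopt the activated entity directly. */
        if (eligible(activated, vtime) &&
            (!cur || activated->finish < cur->finish))
            return activated;
        /* The fallback this commit adds: if we are left with a
         * non-eligible candidate, do the full lookup anyway. */
        if (!cur || !eligible(cur, vtime))
            return full_lookup(tree, n, vtime);
        return cur;
    }

    int main(void)
    {
        struct entity a = { .start = 5, .finish = 9 };   /* not eligible */
        struct entity b = { .start = 1, .finish = 12 };  /* eligible     */
        struct entity *tree[] = { &a, &b };
        struct entity *next = update_next(&a, &a, tree, 2, 2.0);

        printf("next is eligible: %s\n",
               next && eligible(next, 2.0) ? "yes" : "no");
        return 0;
    }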

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • If the function bfq_update_next_in_service is invoked as a consequence
    of the activation or requeueing of an entity, say E, and finds out
    that E belongs to a higher-priority class than that of the current
    next-in-service entity, then it sets next_in_service directly to
    E. But this may lead to anomalous schedules, because E may happen
    not to be eligible for service, its virtual start time being higher
    than the system virtual time for its service tree.

    This commit addresses this issue by simply removing this direct
    switch.

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To provide a very smooth service, bfq starts to serve a bfq_queue
    only if the queue is 'eligible', i.e., if the same queue would
    have started to be served in the ideal, perfectly fair system that
    bfq simulates internally. This is obtained by associating each
    queue with a virtual start time, and by computing a special system
    virtual time quantity: a queue is eligible only if the system
    virtual time has reached the virtual start time of the
    queue. Finally, bfq guarantees that, when a new queue must be set
    in service, there is always at least one eligible entity for each
    active parent entity in the scheduler. To provide this guarantee,
    the function __bfq_lookup_next_entity pushes up, for each parent
    entity on which it is invoked, the system virtual time to the
    minimum among the virtual start times of the entities in the
    active tree for the parent entity (more precisely, the push up
    occurs if the system virtual time happens to be lower than all
    such virtual start times).

    There is however a circumstance in which __bfq_lookup_next_entity
    cannot push up the system virtual time for a parent entity, even
    if the system virtual time is lower than the virtual start times
    of all the child entities in the active tree. It happens if one of
    the child entities is in service. In fact, in such a case, there
    is already an eligible entity, the in-service one, even if it may
    not be present in the active tree (because in-service entities
    may be removed from the active tree).
    Unfortunately, in the last re-design of the
    hierarchical-scheduling engine, the reset of the pointer to the
    in-service entity for a given parent entity--reset to be done as a
    consequence of the expiration of the in-service entity--always
    happens after the function __bfq_lookup_next_entity has been
    invoked. This causes the function to think that there is still an
    entity in service for the parent entity, and then that the system
    virtual time cannot be pushed up, even if actually such a
    no-more-in-service entity has already been properly reinserted
    into the active tree (or in some other tree if no more
    active). Yet, the system virtual time *had* to be pushed up, to be
    ready to correctly choose the next queue to serve. Because of the
    lack of this push up, bfq may wrongly set in service a queue that
    had been speculatively pre-computed as the possible
    next-in-service queue, but that would no more be the one to serve
    after the expiration and the reinsertion into the active trees of
    the previously in-service entities.

    This commit addresses this issue by making
    __bfq_lookup_next_entity properly push up the system virtual time
    if an expiration is occurring.
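
    A worked sketch of the push-up itself (values illustrative): on an
    expiration, the system virtual time of a service tree is raised to
    the minimum virtual start time among the active entities, so that
    at least one of them becomes eligible (start <= vtime).

    #include <stdio.h>

    static double push_up_vtime(double vtime, const double *starts,
                                int n)
    {
        double min_start = starts[0];

        for (int i = 1; i < n; i++)
            if (starts[i] < min_start)
                min_start = starts[i];
        return vtime < min_start ? min_start : vtime;
    }

    int main(void)
    {
        double starts[] = { 7.0, 9.5, 8.0 };
        double vtime = 4.0;  /* below every start: nobody eligible */

        vtime = push_up_vtime(vtime, starts, 3);
        printf("vtime pushed up to %.1f\n", vtime);  /* 7.0 */
        return 0;
    }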

    Signed-off-by: Paolo Valente
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Jens Axboe

    Paolo Valente
     

30 Jul, 2017

2 commits

  • Groups of BFQ queues are represented by generic entities in BFQ. When
    a queue belonging to a parent entity is deactivated, the parent entity
    may need to be deactivated too, in case the deactivated queue was the
    only active queue for the parent entity. This deactivation may need to
    be propagated upwards if the entity belongs, in its turn, to a further
    higher-level entity, and so on. In particular, the upward propagation
    of deactivation stops at the first parent entity that remains active
    even if one of its child entities has been deactivated.

    To decide whether a parent entity has remained active after the
    deactivation of one of its child entities, BFQ checks whether the
    field next_in_service is still not NULL for the parent entity. If
    it is not NULL, then there certainly are other active entities in
    the parent entity, and the deactivations can stop.

    Unfortunately, this check misses a corner case: if in_service_entity
    is not NULL, then next_in_service may happen to be NULL, although the
    parent entity is evidently active. This happens if: 1) the entity
    pointed by in_service_entity is the only active entity in the parent
    entity, and 2) according to the definition of next_in_service, the
    in_service_entity cannot be considered as next_in_service. See the
    comments on the definition of next_in_service for details on this
    second point.

    Hitting the above corner case causes crashes.

    To address this issue, this commit:
    1) Extends the above check from next_in_service alone to both
    next_in_service and in_service_entity (if either of them is not
    NULL, then no further deactivation is performed), as sketched
    below;
    2) Improves the (important) comments on how next_in_service is
    defined and updated; in particular, it fixes a few rather obscure
    paragraphs.
    Reported-by: Eric Wheeler
    Reported-by: Rick Yiu
    Reported-by: Tom X Nguyen
    Signed-off-by: Paolo Valente
    Tested-by: Eric Wheeler
    Tested-by: Rick Yiu
    Tested-by: Laurentiu Nicola
    Tested-by: Tom X Nguyen
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ implements hierarchical scheduling by representing each group of
    queues with a generic parent entity. For each parent entity, BFQ
    maintains an in_service_entity pointer: if one of the child entities
    happens to be in service, in_service_entity points to it. The
    resetting of these pointers happens only on queue expirations: when
    the in-service queue is expired, i.e., stops to be the queue in
    service, BFQ resets all in_service_entity pointers along the
    parent-entity path from this queue to the root entity.

    Functions handling the scheduling of entities assume, naturally, that
    in-service entities are active, i.e., have pending I/O requests (or,
    as a special case, even if they have no pending requests, they are
    expected to receive a new request very soon, with the scheduler idling
    the storage device while waiting for such an event). Unfortunately,
    the above resetting scheme of the in_service_entity pointers may cause
    this assumption to be violated. For example, the in-service queue may
    happen to remain without requests because of a request merge. In this
    case the queue does become idle, and all related data structures are
    updated accordingly. But in_service_entity still points to the queue
    in the parent entity. This inconsistency may even propagate to
    higher-level parent entities, if they happen to become idle as well,
    as a consequence of the leaf queue becoming idle. For this queue and
    parent entities, scheduling functions have an undefined behaviour,
    and, as reported, may easily lead to kernel crashes or hangs.

    This commit addresses this issue by simply resetting the
    in_service_entity field also when it is detected to point to an entity
    becoming idle (regardless of why the entity becomes idle).

    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Laurentiu Nicola
    Signed-off-by: Jens Axboe

    Paolo Valente
     

12 Jul, 2017

1 commit


04 Jul, 2017

1 commit

  • On each deactivation or re-scheduling (after being served) of a
    bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
    to perform pending updates of ioprio, weight and ioprio class for the
    bfq_queue. BFQ also invokes this function on I/O-request dispatches,
    to raise or lower weights more quickly when needed, thereby improving
    latency. However, the entity representing the bfq_queue may be on the
    active (sub)tree of a service tree when this happens, and, although
    with a very low probability, the bfq_queue may happen to also have a
    pending change of its ioprio class. If both conditions hold when
    __bfq_entity_update_weight_prio() is invoked, then the entity moves to
    a sort of hybrid state: the new service tree for the entity, as
    returned by bfq_entity_service_tree(), differs from the service
    tree on which the entity still sits. The functions that handle
    activations and
    deactivations of entities do not cope with such a hybrid state (and
    would need to become more complex to cope).

    This commit addresses this issue by just making
    __bfq_entity_update_weight_prio() not perform also a possible pending
    change of ioprio class, when invoked on an I/O-request dispatch for a
    bfq_queue. Such a change is thus postponed to when
    __bfq_entity_update_weight_prio() is invoked on deactivation or
    re-scheduling of the bfq_queue.

    Reported-by: Marco Piazza
    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
     

10 May, 2017

1 commit

  • In the function __bfq_deactivate_entity, the pointer
    entity->sched_data could happen to be used before being properly
    initialized. This led to a NULL pointer dereference. This commit fixes
    this bug by using this pointer only where it is safe to do so.

    Reported-by: Tom Harrison
    Tested-by: Tom Harrison
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

19 Apr, 2017

1 commit

  • The BFQ I/O scheduler features an optimal fair-queuing
    (proportional-share) scheduling algorithm, enriched with several
    mechanisms to boost throughput and reduce latency for interactive and
    real-time applications. This makes BFQ a large and complex piece of
    code. This commit addresses this issue by splitting BFQ into three
    main, independent components, and by moving each component into a
    separate source file:
    1. Main algorithm: handles the interaction with the kernel, and
    decides which requests to dispatch; it uses the following two further
    components to achieve its goals.
    2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
    computes the schedule, using weights and budgets provided by the above
    component.
    3. cgroups support: handles group operations (creation, destruction,
    move, ...).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente