22 Apr, 2019

1 commit

  • Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
    comment, and is trivial. The other one is a conflict due to a later fix
    in the bio multi-page work, and needs a bit more care.

    * tag 'v5.1-rc6': (770 commits)
    Linux 5.1-rc6
    block: make sure that bvec length can't be overflow
    block: kill all_q_node in request_queue
    x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
    coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
    mm/kmemleak.c: fix unused-function warning
    init: initialize jump labels before command line option parsing
    kernel/watchdog_hld.c: hard lockup message should end with a newline
    kcov: improve CONFIG_ARCH_HAS_KCOV help text
    mm: fix inactive list balancing between NUMA nodes and cgroups
    mm/hotplug: treat CMA pages as unmovable
    proc: fixup proc-pid-vm test
    proc: fix map_files test on F29
    mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
    mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
    mm: swapoff: shmem_unuse() stop eviction without igrab()
    mm: swapoff: take notice of completion sooner
    mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
    mm: swapoff: shmem_find_swap_entries() filter out other types
    slab: store tagged freelist for off-slab slabmgmt
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Apr, 2019

1 commit

  • A previous commit moved the shallow depth and BFQ depth map calculations
    to init time, taking them out of the hotter IO path. This potentially
    causes hangs if the user changes the depth of the scheduler map by
    writing to the 'nr_requests' sysfs file for that device.

    Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
    the depth changes, so that the scheduler can update its internal state.
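
    As a rough illustration, the shape of such a hook might be the
    following (a minimal sketch with illustrative names, not the actual
    blk-mq interface):

        /* Scheduler-side callback, invoked by blk-mq after the queue
         * depth has been changed via the 'nr_requests' sysfs file. */
        struct hw_ctx;
        struct sched_ops {
                void (*depth_updated)(struct hw_ctx *hctx);
        };
        struct hw_ctx {
                const struct sched_ops *ops;
        };

        static void update_nr_requests(struct hw_ctx *hctx, unsigned int nr)
        {
                /* ... resize the scheduler tags to the new depth nr ... */
                if (hctx->ops->depth_updated)
                        hctx->ops->depth_updated(hctx); /* e.g., BFQ recomputes
                                                           its depth maps here */
        }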

    Tested-by: Kai Krakow
    Reported-by: Paolo Valente
    Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • The function bfq_bfqq_expire() invokes the function
    __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
    If this happens, then no other instruction of bfq_bfqq_expire() must
    be executed, or a use-after-free will occur.

    Based on the assumption that __bfq_bfqq_expire() invokes
    bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
    deemed freed if its refcounter is equal to one right before
    __bfq_bfqq_expire() is invoked.

    But, since commit 9dee8b3b057e ("block, bfq: fix queue removal from
    weights tree"), this assumption no longer holds: __bfq_bfqq_expire()
    may also invoke bfq_weights_tree_remove(), and the latter function may
    in turn invoke bfq_put_queue(). So __bfq_bfqq_expire() may invoke
    bfq_put_queue() twice, and this is the actual case in which the
    in-service queue may happen to be freed.

    To address this issue, this commit moves the check on the refcounter
    of the queue right around the last bfq_put_queue() that may be invoked
    on the queue.
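
    In pseudo-C, the essence of the fix is to sample the refcounter right
    around that last put (a minimal sketch with illustrative names, not
    the actual patch):

        struct bfq_queue { int ref; /* ... */ };

        void bfq_put_queue(struct bfq_queue *bfqq); /* may free bfqq */

        /* Returns nonzero if this put freed the queue; the caller must
         * then execute no further instruction that touches bfqq. */
        static int bfq_put_queue_was_last(struct bfq_queue *bfqq)
        {
                int last = (bfqq->ref == 1); /* checked just before the put */
                bfq_put_queue(bfqq);
                return last;
        }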

    Fixes: 9dee8b3b057e ("block, bfq: fix queue removal from weights tree")
    Reported-by: Dmitrii Tcvetkov
    Reported-by: Douglas Anderson
    Tested-by: Dmitrii Tcvetkov
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

01 Apr, 2019

9 commits

  • bfq saves the state of a queue each time a merge occurs, to be
    able to resume such a state when the queue is associated again
    with its original process, on a split.

    Unfortunately, bfq does not also save and restore the weight of the
    queue. If the weight is not correctly restored when the queue is
    recycled, then the weight of the recycled queue may differ from the
    weight of the original queue.

    This commit adds the missing save and restore of the weight.
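
    A minimal sketch of the idea (field and function names simplified,
    not the actual patch):

        struct bfq_entity { int weight, new_weight; };
        struct bfq_queue  { struct bfq_entity entity; };
        struct bfq_io_cq  { int saved_weight; /* the missing state */ };

        /* on a merge: save the queue state into the per-process context */
        static void bfq_save_state(struct bfq_io_cq *bic, struct bfq_queue *q)
        {
                bic->saved_weight = q->entity.weight;
        }

        /* on a split: put the saved weight back on the recycled queue */
        static void bfq_resume_state(struct bfq_io_cq *bic, struct bfq_queue *q)
        {
                q->entity.new_weight = bic->saved_weight;
        }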

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Francesco Pollicino
     
  • The function "bfq_log_bfqq" prints the pid of the process
    associated with the queue passed as input.

    Unfortunately, if the queue is shared, then more than one process
    is associated with the queue. The pid that gets printed in this
    case is the pid of one of the associated processes.
    Which process gets printed depends on the exact sequence of merge
    events the queue underwent. So printing such a pid is of little use,
    and above all is often confusing, because it reports an essentially
    random pid from among those of the associated processes.

    This commit addresses this issue by printing SHARED instead of a pid
    if the queue is shared.
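
    A minimal sketch of the resulting logging logic (illustrative names):

        #include <stdio.h>

        struct bfq_queue { int pid; int shared; };

        static const char *bfqq_label(const struct bfq_queue *q,
                                      char *buf, size_t len)
        {
                if (q->shared)  /* any single pid would be misleading */
                        return "SHARED";
                snprintf(buf, len, "%d", q->pid);
                return buf;
        }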

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Francesco Pollicino
     
  • If many bfq_queues belonging to the same group happen to be created
    shortly after each other, then the processes associated with these
    queues typically have a common goal. In particular, bursts of queue
    creations are usually caused by services or applications that spawn
    many parallel threads/processes. Examples are systemd during boot, or
    git grep. If there are no other active queues, then, to help these
    processes get their job done as soon as possible, the best thing to do
    is to reach a high throughput. To this goal, it is usually better to
    not grant either weight-raising or device idling to the queues
    associated with these processes. And this is exactly what BFQ
    currently does.

    There is however a drawback: if, in contrast, some other queues are
    already active, then the newly created queues must be protected from
    the I/O flowing through the already existing queues. In this case, the
    best thing to do is the opposite of the other case: it is much
    better to grant weight-raising and device idling to the newly-created
    queues, if they deserve it. This commit addresses this issue by doing
    so if there are already other active queues.

    This change also helps eliminate false positives, which occur when
    the newly-created queues do not belong to an actual large burst of
    creations, but some background task (e.g., a service) happens to
    trigger the creation of new queues in the middle, i.e., very close to
    when the victim queues are created. Such false positives may cause a
    total loss of control over process latencies.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Sync random I/O is likely to be confused with soft real-time I/O,
    because it is characterized by limited throughput and an apparently
    isochronous arrival pattern. To avoid false positives, this commit
    prevents bfq_queues containing only random (seeky) I/O from being
    tagged as soft real-time.
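
    A minimal sketch of the added filter (names simplified):

        struct bfq_queue { int seeky; };

        /* random (seeky) I/O also shows limited throughput and a roughly
         * isochronous arrival pattern, so exclude it explicitly */
        static int bfqq_may_be_tagged_soft_rt(const struct bfq_queue *q)
        {
                return !q->seeky;
        }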

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To boost throughput with a set of processes doing interleaved I/O
    (i.e., a set of processes whose individual I/O is random, but whose
    merged cumulative I/O is sequential), BFQ merges the queues associated
    with these processes, i.e., redirects the I/O of these processes into a
    common, shared queue. In the shared queue, I/O requests are ordered by
    their position on the medium, thus sequential I/O gets dispatched to
    the device when the shared queue is served.

    Queue merging costs execution time, because, to detect which queues to
    merge, BFQ must maintain a list of the head I/O requests of active
    queues, ordered by request positions. Measurements showed that this
    costs about 10% of BFQ's total per-request processing time.

    Request processing time becomes more and more critical as the speed of
    the underlying storage device grows. Yet, fortunately, queue merging
    is basically useless on the very devices that are fast enough to make
    request processing time critical. To reach a high throughput, these
    devices must have many requests queued at the same time. But, in this
    configuration, the internal scheduling algorithms of these devices
    also do the job of queue merging: they reorder requests so as to
    obtain an I/O pattern that is as sequential as possible. As a
    consequence, with processes doing interleaved I/O, the throughput
    reached by one such device is likely to be the same with and without
    queue merging.

    In view of this fact, this commit disables queue merging, and all
    related housekeeping, for non-rotational devices with internal
    queueing. The total, single-lock-protected, per-request processing
    time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
    (time measured with simple code instrumentation, and using the
    throughput-sync.sh script of the S suite [1], in performance-profiling
    mode). To put this result into context, the total,
    single-lock-protected, per-request execution time of the lightest I/O
    scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
    ~800 LOC, against ~10500 LOC for BFQ).
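
    The gating condition boils down to something like the following (a
    minimal sketch; the flag would be recomputed whenever the queueing
    detection changes):

        struct bfq_data {
                int nonrot_device;  /* non-rotational medium */
                int hw_tag;         /* command queueing detected */
        };

        /* the drive's internal scheduler already does the reordering
         * that queue merging would provide, so skip all merge work */
        static int bfq_skip_queue_merging(const struct bfq_data *bfqd)
        {
                return bfqd->nonrot_device && bfqd->hw_tag;
        }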

    Disabling merging provides a further, remarkable benefit in terms of
    throughput. Merging tends to make many workloads artificially more
    uneven, mainly because shared queues remain non-empty for
    incomparably more time than normal queues.
    queues in a set of merged queues has a higher weight than a normal
    queue, then the shared queue may inherit such a high weight and, by
    staying almost always active, may force BFQ to perform I/O plugging
    most of the time. This evidently makes it harder for BFQ to let the
    device reach a high throughput.

    As a practical example of this problem, and of the benefits of this
    commit, we measured again the throughput in the nasty scenario
    considered in previous commit messages: dbench test (in the Phoronix
    suite), with 6 clients, on a filesystem with journaling, and with the
    journaling daemon enjoying a higher weight than normal processes. With
    this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
    PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
    of the other I/O schedulers. As such, this is also likely to be the
    maximum possible throughput reachable with this workload on this
    device, because I/O is mostly random, and the other schedulers
    basically just pass I/O requests to the drive as fast as possible.

    [1] https://github.com/Algodev-github/S

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Tested-by: Francesco Pollicino
    Signed-off-by: Alessio Masola
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The processes associated with a bfq_queue, say Q, may happen to
    generate their cumulative I/O at a lower rate than the rate at which
    the device could serve the same I/O. This is rather probable, e.g., if
    only one process is associated with Q and the device is an SSD. It
    results in Q often becoming empty while in service. If BFQ is not
    allowed to switch to another queue when Q becomes empty, then, during
    the service of Q, there will be frequent "service holes", i.e., time
    intervals during which Q gets empty and the device can only consume
    the I/O already queued in its hardware queues. This easily causes
    considerable losses of throughput.

    To counter this problem, BFQ implements a request injection mechanism,
    which tries to fill the above service holes with I/O requests taken
    from other bfq_queues. The hard part in this mechanism is finding the
    right amount of I/O to inject, so as to both boost throughput and not
    break Q's bandwidth and latency guarantees. To this goal, the current
    version of this mechanism measures the bandwidth enjoyed by Q while it
    is being served, and tries to inject the maximum possible amount of
    extra service that does not cause Q's bandwidth to decrease too
    much.

    This solution has an important shortcoming. For bandwidth measurements
    to be stable and reliable, Q must remain in service for a much longer
    time than that needed to serve a single I/O request. Unfortunately,
    this does not hold with many workloads. This commit addresses this
    issue by changing the way the amount of injection allowed is
    dynamically computed. It tunes injection as a function of the service
    times of single I/O requests of Q, instead of Q's
    bandwidth. Single-request service times are evidently meaningful even
    if Q gets very few I/O requests completed while it is in service.
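
    A rough sketch of the tuning rule (thresholds and names illustrative,
    not the actual patch):

        struct bfq_queue {
                unsigned int inject_limit;        /* extra rqs allowed in flight */
                unsigned long long base_serv_ns;  /* service time w/o injection */
        };

        static void update_inject_limit(struct bfq_queue *q,
                                        unsigned long long serv_ns)
        {
                if (serv_ns < q->base_serv_ns)
                        q->inject_limit++;      /* injection looks harmless */
                else if (serv_ns > q->base_serv_ns * 3 / 2 && q->inject_limit)
                        q->inject_limit--;      /* injection is hurting Q */
        }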

    As a testbed for this new solution, we measured the throughput reached
    by BFQ for one of the nastiest workloads and configurations for this
    scheduler: the workload generated by the dbench test (in the Phoronix
    suite), with 6 clients, on a filesystem with journaling, and with the
    journaling daemon enjoying a higher weight than normal processes.
    With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
    a PLEXTOR PX-256M5.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Tested-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In most cases, it is detrimental for throughput to plug I/O dispatch
    when the in-service bfq_queue becomes temporarily empty (plugging is
    performed to wait for the possible arrival, soon, of new I/O from the
    in-service queue). There is however a case where plugging is needed
    for service guarantees. If a bfq_queue, say Q, has a higher weight
    than some other active bfq_queue, and is sync, i.e., contains sync
    I/O, then, to guarantee that Q does receive a higher share of the
    throughput than other lower-weight queues, it is necessary to plug I/O
    dispatch when Q remains temporarily empty while being served.

    For this reason, BFQ performs I/O plugging when some active bfq_queue
    has a higher weight than some other active bfq_queue. But this is
    overkill. In fact, if the in-service bfq_queue actually
    has a weight lower than or equal to that of the other queues, then it
    need not be guaranteed a higher share of the throughput than the
    other queues. So, not plugging I/O cannot cause any harm to the
    queue, and it can boost throughput.
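
    In boolean form, the new plugging condition is roughly (a sketch with
    simplified names):

        struct bfq_queue { int sync; int weight; };

        /* plug dispatch on temporary emptiness only if q's larger share
         * actually needs protecting from lower-weight queues */
        static int plugging_needed(const struct bfq_queue *q,
                                   int max_weight_of_other_queues)
        {
                return q->sync && q->weight > max_weight_of_other_queues;
        }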

    Taking advantage of this fact, this commit does not plug I/O for sync
    bfq_queues with a weight lower than or equal to the weights of the
    other queues. Here is an example of the resulting throughput boost
    with the dbench workload, which is particularly nasty for BFQ. With
    the dbench test in the Phoronix suite, BFQ reaches its lowest total
    throughput with 6 clients on a filesystem with journaling, in case the
    journaling daemon has a higher weight than normal processes. Before
    this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
    after this commit it is ~100 MB/sec.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • If a sync bfq_queue has a higher weight than some other queue, and
    remains temporarily empty while in service, then, to preserve the
    bandwidth share of the queue, it is necessary to plug I/O dispatching
    until a new request arrives for the queue. In addition, a timeout
    needs to be set, to avoid waiting forever if the process associated
    with the queue has actually finished its I/O.

    Even with the above timeout, the device is however not fed with new
    I/O for a while, if the process has finished its I/O. If this happens
    often, then throughput drops and latencies grow. For this reason, the
    timeout is kept rather low: 8 ms is the current default.

    Unfortunately, such a low value may cause, on the opposite end, a
    violation of bandwidth guarantees for a process that happens to issue
    new I/O too late. The higher the system load, the higher the
    probability that this happens to some process. This is a problem in
    scenarios where service guarantees matter more than throughput. One
    important case is that of weight-raised queues, which need to be granted a
    very high fraction of the bandwidth.

    To address this issue, this commit lower-bounds the plugging timeout
    for weight-raised queues to 20 ms. This simple change provides
    relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
    gnome-terminal starts in 0.6 seconds if there is no other I/O in
    progress, the same application starts in
    - 0.8 seconds, instead of 1.2 seconds, if ten files are being read
    sequentially in parallel
    - 1 second, instead of 2 seconds, if, in parallel, five files are
    being read sequentially, and five more files are being written
    sequentially
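
    The change itself amounts to a floor on the idling timeout (a minimal
    sketch, with the values taken from this message):

        /* 8 ms default; weight-raised queues get at least 20 ms */
        static unsigned long long idle_timeout_ns(int weight_raised)
        {
                unsigned long long sl = 8ULL * 1000 * 1000;
                if (weight_raised && sl < 20ULL * 1000 * 1000)
                        sl = 20ULL * 1000 * 1000;
                return sl;
        }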

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED.
    Code under these ifdefs never worked, so something might be broken.

    Fixes: 0471559c2fbd ("block, bfq: add/remove entity weights correctly")
    Fixes: 73d58118498b ("block, bfq: consider also ioprio classes in symmetry detection")
    Reviewed-by: Holger Hoffstätte
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

01 Feb, 2019

14 commits

  • When a new I/O request arrives for a bfq_queue, say Q, bfq checks
    whether that request is close to
    (a) the head request of some other queue waiting to be served, or
    (b) the last request dispatched for the in-service queue (in case Q
    itself is not the in-service queue)

    If a queue, say Q2, is found for which the above condition holds, then
    bfq merges Q and Q2, to hopefully get a more sequential I/O in the
    resulting merged queue, and thus a possibly higher throughput.

    Case (b) is checked by comparing the new request for Q with the last
    request dispatched, assuming that the latter necessarily belonged to the
    in-service queue. Unfortunately, this assumption is no longer always
    correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
    into seeky idle queues on NCQ flash").

    When the assumption does not hold, queues that must not be merged may be
    merged, causing unexpected loss of control on per-queue service
    guarantees.

    This commit solves this problem by adding an extra field, which stores
    the actual last request dispatched for the in-service queue, and by
    using this new field to correctly check case (b).
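
    A minimal sketch of the bookkeeping (the field name follows the
    description above; details simplified):

        struct request_pos { unsigned long long pos, sectors; };

        struct bfq_data {
                unsigned long long last_position;    /* any queue */
                unsigned long long in_serv_last_pos; /* in-service queue only */
        };

        static void note_dispatch(struct bfq_data *bfqd,
                                  const struct request_pos *rq,
                                  int from_in_service_queue)
        {
                bfqd->last_position = rq->pos + rq->sectors;
                if (from_in_service_queue)  /* only then valid for check (b) */
                        bfqd->in_serv_last_pos = bfqd->last_position;
        }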

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Writes tend to starve reads. bfq counters this problem by overcharging
    writes with an inflated service w.r.t. the actual service (number of
    sectors written) they receive.

    Yet this overcharging is useless, and actually causes unfairness in the
    opposite direction, when bfq happens to be enforcing strong I/O control.
    bfq does this enforcing when the scenario is asymmetric, i.e., when some
    bfq_queue or group of bfq_queues is to be granted a different bandwidth
    than some other bfq_queue or group of bfq_queues. So, in such a
    scenario, this commit disables write overcharging.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The original commit is commit 1a1238a7dd48 ("cfq-iosched: improve hw_tag
    detection") and has the following commit message:

    If active queue hasn't enough requests and idle window opens, cfq will
    not dispatch sufficient requests to hardware. In such situation, current
    code will zero hw_tag. But this is because cfq doesn't dispatch enough
    requests instead of hardware queue doesn't work. Don't zero hw_tag in
    such case.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • bfq borrows from cfq a simple heuristic for detecting whether the drive
    performs command queueing: check whether the average number of in-flight
    requests is above a given threshold. Unfortunately this heuristic fails
    to detect queueing (on drives with queueing) if the processes doing I/O
    are few and issue I/O with a low depth.

    To reduce false negatives, this commit lowers the threshold.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • bfq maintains an ordered list, through a red-black tree, of unique
    weights of active bfq_queues. This list is used to detect whether there
    are active queues with differentiated weights. The weight of a queue is
    removed from the list when both of the following conditions become
    true:

    (1) the bfq_queue is flagged as inactive;
    (2) the queue no longer has any in-flight request.

    Unfortunately, in the rare cases where condition (2) becomes true before
    condition (1), the removal fails, because the function to remove the
    weight of the queue (bfq_weights_tree_remove) is rightly invoked in the
    path that deactivates the bfq_queue, but mistakenly invoked *before* the
    function that actually performs the deactivation (bfq_deactivate_bfqq).

    This commit moves the invocation of bfq_weights_tree_remove for
    condition (1) to after bfq_deactivate_bfqq. As a consequence of this
    move, it is necessary to add a further reference to the queue when the
    weight of a queue is added, because the queue might otherwise be freed
    before bfq_weights_tree_remove is invoked. This commit adds this
    reference and makes all related modifications.
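
    A minimal sketch of the reference discipline (illustrative, not the
    actual patch):

        struct bfq_queue { int ref; };

        static void weights_tree_add(struct bfq_queue *q)
        {
                q->ref++;  /* keep q alive until its weight is removed */
                /* ... insert q's weight into the tree ... */
        }

        static void weights_tree_remove(struct bfq_queue *q)
        {
                /* ... remove q's weight from the tree ... */
                q->ref--;  /* may be the last reference: q freed here,
                              which is why removal must come last */
        }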

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In bfq_update_peak_rate, to check whether an I/O request rq is
    sequential, only the seek distance of rq w.r.t. the last request
    dispatched is checked. This is not sufficient for non-rotational
    storage, where the size of rq is at least as relevant. This commit adds
    the missing check.
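
    A minimal sketch of the extended test (threshold illustrative; the
    idea is that small requests on non-rotational storage do not count
    as sequential):

        struct rq_info { unsigned long long pos, sectors; };

        static int rq_is_sequential(const struct rq_info *rq,
                                    unsigned long long last_pos, int nonrot)
        {
                int close = (rq->pos == last_pos); /* simplified seek test */
                if (!nonrot)
                        return close;
                return close && rq->sectors >= 256; /* size matters too */
        }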

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • bfq detects the creation of multiple bfq_queues shortly after each
    other, namely a burst of queue creations in the terminology used in the
    code. If the burst is large, then no queue in the burst is granted
    - either I/O-dispatch plugging when the queue remains temporarily idle
    while in service;
    - or weight raising, because it causes even longer plugging.

    In fact, such a plugging tends to lower throughput, while these bursts
    are typically due to applications or services that spawn multiple
    processes, to reach a common goal as soon as possible. Examples are a
    "git grep" or the booting of a system.

    Unfortunately, disabling plugging may cause a loss of service guarantees
    in asymmetric scenarios, i.e., if queue weights are differentiated or if
    more than one group is active.

    This commit addresses this issue by no longer disabling I/O-dispatch
    plugging for queues in large bursts.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • If the in-service bfq_queue is sync and remains temporarily idle, then
    I/O dispatching (from other queues) may be plugged. This may be done for
    two reasons: either to boost throughput, or to preserve the bandwidth
    share of the in-service queue. In the first case, if the I/O of the
    in-service queue, when it finally arrives, consists only of one small
    I/O request, then it makes sense to plug even the I/O of the in-service
    queue. In fact, serving such a small request immediately is likely to
    lower throughput instead of boosting it, whereas waiting a little bit is
    likely to let that request grow, thanks to request merging, and become
    more profitable in terms of throughput (this is likely to happen exactly
    because the I/O of the queue has been detected to boost throughput).

    On the opposite end, if I/O dispatching is being plugged only to
    preserve the bandwidth of the in-service queue, then it would be better
    not to also plug the I/O of the in-service queue, because such a
    plugging is likely to cause only loss of bandwidth for the queue.

    Unfortunately, no distinction is made between the two cases, and the I/O
    of the in-service queue is always plugged in case just a small I/O
    request arrives. This commit draws this missing distinction and does not
    perform harmful plugging.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • This is a preparatory commit for commits that need to check only one of
    the two main reasons for idling. This change should also improve the
    quality of the code a little bit, by splitting a function that contains
    very long, non-trivial and little related comments.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to
    be guaranteed a different bandwidth than other bfq_queues or bfq_groups,
    these service guarantees can be provided only by plugging I/O dispatch,
    completely or partially, when the queue in service remains temporarily
    empty. A case where asymmetry is particularly strong is when some active
    bfq_queues belong to a higher-priority class than some other active
    bfq_queues. Unfortunately, this important case is not considered at all
    in the code for detecting asymmetric scenarios. This commit adds the
    missing logic.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Before commit 18e5a57d7987 ("block, bfq: postpone rq preparation to
    insert or merge"), the destination queue for a request was chosen by a
    different hook than the one that then inserted the request. So, between
    the execution of the two hooks, the bic of the process generating the
    request could happen to be redirected to a different bfq_queue. As a
    consequence, the destination bfq_queue stored in the request could be
    wrong. Such an event does not need to be handled any longer.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • With some unlucky sequences of events, the function bfq_updated_next_req
    updates the current budget of a bfq_queue to a lower value than the
    service received by the queue using such a budget. Unfortunately, if
    this happens, then the return value of the function bfq_bfqq_budget_left
    becomes inconsistent. This commit solves this problem by lower-bounding
    the budget computed in bfq_updated_next_req to the service currently
    charged to the queue.
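
    The fix reduces to a clamp (a minimal sketch):

        /* the remaining budget is budget - service, so the budget must
         * never drop below the service already received on it */
        static unsigned long next_budget(unsigned long computed_budget,
                                         unsigned long service_received)
        {
                return computed_budget > service_received ?
                        computed_budget : service_received;
        }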

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To boost throughput on devices with internal queueing and in scenarios
    where device idling is not strictly needed, bfq immediately starts
    serving a new bfq_queue if the in-service bfq_queue remains without
    pending I/O, even if new I/O may arrive soon for the latter queue. Then,
    if such I/O actually arrives soon, bfq preempts the new in-service
    bfq_queue so as to give the previous queue a chance to go on being
    served (in case the previous queue should actually be the one to be
    served, according to its timestamps).

    However, the in-service bfq_queue, say Q, may also be left without
    further budget when it remains without pending I/O. Since bfq changes budgets
    dynamically to fit the needs of bfq_queues, this happens more often than
    one may expect. If this happens, then there is no point in trying to go
    on serving Q when new I/O arrives for it soon: Q would be expired
    immediately after being selected for service. This would only cause
    useless overhead. This commit avoids such a useless selection.
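
    A minimal sketch of the added eligibility test (names simplified):

        struct bfq_queue { unsigned long budget, service; };

        /* re-selecting a queue that would be expired at once for lack
         * of budget is pure overhead: require leftover budget */
        static int worth_waiting_for_new_io(const struct bfq_queue *q)
        {
                return q->budget > q->service;
        }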

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The speed at which a bfq_queue receives I/O is one of the parameters by
    which bfq decides whether the queue is soft real-time (i.e., whether the
    queue contains the I/O of a soft real-time application). In particular,
    when a bfq_queue remains without outstanding I/O requests, bfq computes
    the minimum time instant, named soft_rt_next_start, at which the next
    request of the queue may arrive for the queue to be deemed as soft real
    time.

    Unfortunately this filtering may cause problems with a queue in
    interactive weight raising. In fact, such a queue may be conveying the
    I/O needed to load a soft real-time application. The latter will
    actually exhibit a soft real-time I/O pattern after it finally starts
    doing its job. But, if soft_rt_next_start is updated for an interactive
    bfq_queue, and the queue has received a lot of service before remaining
    with no outstanding request (likely to happen on a fast device), then
    soft_rt_next_start is assigned such a high value that, for a very long
    time, the queue is prevented from being possibly considered as soft real
    time.

    This commit removes the updating of soft_rt_next_start for bfq_queues in
    interactive weight raising.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

10 Dec, 2018

1 commit

  • Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
    two corruption fixes. We're going to be overhauling the direct dispatch
    path, and we need to do that on top of the changes we made for that
    in mainline.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Dec, 2018

1 commit

  • The bio_blkcg() function turns out to be inconsistent and consequently
    dangerous to use. The first part returns a blkcg where a reference is
    owned by the bio, meaning it does not need to be rcu protected. However,
    the third case, the last line, is problematic:

    return css_to_blkcg(task_css(current, io_cgrp_id));

    This can race against task migration and the cgroup dying. It is also
    semantically different as it must be called rcu protected and is
    susceptible to failure when trying to get a reference to it.

    This patch adds association ahead of calling bio_blkcg() rather than
    after. This makes association a required and explicit step along the
    code paths for calling bio_blkcg(). In blk-iolatency, association is
    moved above the bio_blkcg() call to ensure it will not return %NULL.

    BFQ uses the old bio_blkcg() function, but I do not want to address it
    in this series due to the complexity. I have created a private version
    documenting the inconsistency and noting not to use it.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

07 Dec, 2018

1 commit

  • Since commit '2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
    detection")', if there are process groups with I/O requests waiting for
    completion, then BFQ tags the scenario as 'asymmetric'. This detection
    is needed for preserving service guarantees (for details, see comments
    on the computation of the variable asymmetric_scenario in the
    function bfq_better_to_idle).

    Unfortunately, commit '2d29c9f89fcd ("block, bfq: improve asymmetric
    scenarios detection")' contains an error exactly in the updating of
    the number of groups with I/O requests waiting for completion: if a
    group has more than one descendant process, then the above number of
    groups, which is renamed from num_active_groups to a more appropriate
    num_groups_with_pending_reqs by this commit, may happen to be wrongly
    decremented multiple times, namely every time one of the descendant
    processes gets all its pending I/O requests completed.

    A correct, complete solution should work as follows. Consider a group
    that is inactive, i.e., that has no descendant process with pending
    I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
    is still accounting for this group, because the group still has some
    descendant process with some I/O request still in
    flight. num_groups_with_pending_reqs should be decremented when the
    in-flight request of the last descendant process is finally completed
    (assuming that nothing else has changed for the group in the meantime,
    in terms of composition of the group and active/inactive state of
    child groups and processes). To accomplish this, an additional
    pending-request counter must be added to entities, and must be
    updated correctly.

    To avoid this additional field and operations, this commit resorts to
    the following tradeoff between simplicity and accuracy: for an
    inactive group that is still counted in num_groups_with_pending_reqs,
    this commit decrements num_groups_with_pending_reqs when the first
    descendant process of the group remains with no request waiting for
    completion.

    This simplified scheme provides a fix to the unbalanced decrements
    introduced by 2d29c9f89fcd. Since this error was also caused by lack
    of comments on this non-trivial issue, this commit also adds related
    comments.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Reported-by: Steven Barrett
    Tested-by: Steven Barrett
    Tested-by: Lucjan Lucjanov
    Reviewed-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

16 Nov, 2018

1 commit

  • With the legacy request path gone there is no good reason to keep
    queue_lock as a pointer, we can always use the embedded lock now.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Fixed floppy and blk-cgroup missing conversions and half done edits.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Nov, 2018

2 commits

  • This is a remnant of when we had ops for both SQ and MQ
    schedulers. Now it's just MQ, so get rid of the union.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This removes a bunch of core and elevator related code. On the core
    front, we remove anything related to queue running, draining,
    initialization, plugging, and congestions. We also kill anything
    related to request allocation, merging, retrieval, and completion.

    Remove any checking for single queue IO schedulers, as they no
    longer exist. This means we can also delete a bunch of code related
    to request issue, adding, completion, etc - and all the SQ related
    ops and helpers.

    Also kill the load_default_modules(), as all that did was provide
    for a way to load the default single queue elevator.

    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Nov, 2018

1 commit

  • This reverts a series committed earlier, due to the null pointer
    exception bug report in [1]. It seems there are edge case interactions that I did
    not consider and will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

14 Oct, 2018

1 commit

  • bfq defines as asymmetric a scenario where an active entity, say E
    (representing either a single bfq_queue or a group of other entities),
    has a higher weight than some other entities. If the entity E does sync
    I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
    other entities in the following situation: E is in service but
    temporarily has no pending I/O request. In fact, without this plugging,
    every time E stops being temporarily idle, it may find the
    internal queues of the storage device already filled with an
    out-of-control number of extra requests, from other entities. So E may
    have to wait for the service of these extra requests, before finally
    having its own requests served. This may easily break service
    guarantees, with E getting less than its fair share of the device
    throughput. Usually, the end result is that E gets the same fraction of
    the throughput as the other entities, instead of getting more, according
    to its higher weight.

    Yet there are two other more subtle cases where E, even if its weight is
    actually equal to or even lower than the weight of any other active
    entities, may get less than its fair share of the throughput in case the
    above I/O plugging is not performed:
    1. other entities issue larger requests than E;
    2. other entities contain more active child entities than E (or in
    general tend to have more backlog than E).

    In the first case, other entities may get more service than E because
    they get larger requests, than those of E, served during the temporary
    idle periods of E. In the second case, other entities get more service
    because, by having many child entities, they have many requests ready
    for dispatching while E is temporarily idle.

    This commit addresses this issue by extending the definition of
    asymmetric scenario: a scenario is asymmetric when
    - active entities representing bfq_queues have differentiated weights,
    as in the original definition
    or (inclusive)
    - one or more entities representing groups of entities are active.

    This broader definition makes sure that I/O plugging will be performed
    in all the above cases, provided that there is at least one active
    group. Of course, this definition is very coarse, so it will trigger
    I/O plugging also in cases where it is not needed, such as, e.g.,
    multiple active entities with just one child each, and all with the same
    I/O-request size. The reason for this coarse definition is just that a
    finer-grained definition would be rather heavy to compute.
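
    In boolean form, the broadened test reads roughly as follows (a
    sketch; both quantities would be tracked incrementally in the code):

        struct bfq_state {
                int varied_queue_weights; /* queues with different weights */
                int num_active_groups;
        };

        static int asymmetric_scenario(const struct bfq_state *s)
        {
                return s->varied_queue_weights || s->num_active_groups > 0;
        }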

    On the opposite end, even this new definition does not trigger I/O
    plugging in all cases where there is no active group, and all bfq_queues
    have the same weight. So, in these cases some unfairness may occur if
    there are asymmetries in I/O-request sizes. We made this choice because
    I/O plugging may lower throughput, and probably a user that has not
    created any group cares more about throughput than about perfect
    fairness. At any rate, as for possible applications that may care about
    service guarantees, bfq already guarantees a high responsiveness and a
    low latency to soft real-time applications automatically.

    Signed-off-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Federico Motta
     

22 Sep, 2018

1 commit

  • The accessor function bio_blkcg either returns the blkcg associated with
    the bio or finds one in the current context. This can cause an issue
    when trying to associate a bio with a blkcg. Particularly, it's the
    third case that is problematic:

    return css_to_blkcg(task_css(current, io_cgrp_id));

    As the above may race against task migration and the cgroup exiting, it
    is not always ok to take a reference on the blkcg returned from
    bio_blkcg.

    This patch adds association ahead of calling bio_blkcg rather than
    after. This makes association a required and explicit step along the
    code paths for calling bio_blkcg. blk_get_rl is modified as well to get
    a reference to the blkcg it may use and blk_put_rl will always put the
    reference back. Association is also moved above the bio_blkcg call to
    ensure it will not return NULL in blk-iolatency.

    BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
    to address this in this series. I've created a private version of the
    function with notes not to use it describing the flaw. Hopefully soon,
    that code can be cleaned up.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

15 Sep, 2018

2 commits

  • To reduce latency for interactive and soft real-time applications, bfq
    privileges the bfq_queues containing the I/O of these
    applications. These privileged queues, referred to as weight-raised
    queues, get a much higher share of the device throughput
    w.r.t. non-privileged queues. To preserve this higher share, the I/O
    of any non-weight-raised queue must be plugged whenever a sync
    weight-raised queue, while being served, remains temporarily empty. To
    attain this goal, bfq simply plugs any I/O (from any queue), if a sync
    weight-raised queue remains empty while in service.

    Unfortunately, this plugging typically lowers throughput with random
    I/O, on devices with internal queueing (because it reduces the filling
    level of the internal queues of the device).

    This commit addresses this issue by restricting the cases where
    plugging is performed: if a sync weight-raised queue remains empty
    while in service, then I/O plugging is performed only if some of the
    active bfq_queues are *not* weight-raised (which is actually the only
    circumstance where plugging is needed to preserve the higher share of
    the throughput of weight-raised queues). This restriction proved able
    to boost throughput in really many use cases needing only maximum
    throughput.
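
    A minimal sketch of the restricted condition (counter names
    simplified):

        struct bfq_counts { int busy_queues, wr_busy_queues; };

        /* plug only if some active queue is NOT weight-raised, i.e.,
         * only if there is actually someone to protect against */
        static int plug_for_wr_queue(const struct bfq_counts *c)
        {
                return c->wr_busy_queues < c->busy_queues;
        }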

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The Achilles' heel of BFQ is its failing to reach a high throughput
    with sync random I/O on flash storage with internal queueing, in case
    the processes doing I/O have differentiated weights.

    The cause of this failure is as follows. If at least two processes do
    sync I/O, and have a different weight from each other, then BFQ plugs
    I/O dispatching every time one of these processes, while it is being
    served, remains temporarily without pending I/O requests. This
    plugging is necessary to guarantee that every process enjoys a
    bandwidth proportional to its weight; but it empties the internal
    queue(s) of the drive. And this kills throughput with random I/O. So,
    if some processes have differentiated weights and do both sync and
    random I/O, the end result is a throughput collapse.

    This commit tries to counter this problem by injecting the service of
    other processes, in a controlled way, while the process in service
    happens to have no I/O. This injection is performed only if the medium
    is non rotational and performs internal queueing, and the process in
    service does random I/O (service injection might be beneficial for
    sequential I/O too, we'll work on that).

    As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
    SSD, and with five processes having differentiated weights and doing
    sync random 4KB I/O, this commit makes the throughput with bfq grow by
    400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than
    that reached with none (i.e., with no I/O scheduler). As some less
    random I/O is added to the mix, the throughput becomes equal to or
    higher than that with none.

    This commit is a very first attempt to recover throughput without
    losing control, and certainly has many limitations. One is, e.g., that
    the processes whose service is injected are not chosen so as to
    distribute the extra bandwidth they receive in accordance to their
    weights. Thus there might be loss of weighted fairness in some
    cases. Anyway, this loss concerns extra service, which would not have
    been received at all without this commit. Other limitations and issues
    will probably show up with usage.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

17 Aug, 2018

1 commit

  • When a sync request is dispatched, the queue that contains that
    request, and all the ancestor entities of that queue, are charged with
    the number of sectors of the request. In contrast, if the request is
    async, then the queue and its ancestor entities are charged with the
    number of sectors of the request, multiplied by an overcharge
    factor. This throttles the bandwidth for async I/O, w.r.t. sync
    I/O, and it is done to counter the tendency of async writes to steal
    I/O throughput from reads.

    On the opposite end, the lower this overcharge factor, the more stable
    the I/O control, in the following respect. The lower this factor is,
    the less the bandwidth enjoyed by a group decreases
    - when the group does writes, w.r.t. when it does reads;
    - when other groups do reads, w.r.t. when they do writes.
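
    The charging rule itself is simple (a minimal sketch; the commit
    lowers the value of the factor, see the code for the actual number):

        static unsigned long serv_to_charge(unsigned long sectors, int sync,
                                            unsigned int async_charge_factor)
        {
                return sync ? sectors : sectors * async_charge_factor;
        }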

    The fixes "block, bfq: always update the budget of an entity when
    needed" and "block, bfq: readd missing reset of parent-entity service"
    improved I/O control in bfq to such an extent that it has been
    possible to revise this overcharge factor downwards. This commit
    introduces the resulting, new value.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente