24 Feb, 2020

1 commit

  • [ Upstream commit f718b093277df582fbf8775548a4f163e664d282 ]

    Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not
    referred by any process") fixed commit 3726112ec731 ("block, bfq:
    re-schedule empty queues if they deserve I/O plugging") by
    descheduling an empty bfq_queue when it remains with not process
    reference. Yet, this still left a case uncovered: an empty bfq_queue
    with not process reference that remains in service. This happens for
    an in-service sync bfq_queue that is deemed to deserve I/O-dispatch
    plugging when it remains empty. Yet no new requests will arrive for
    such a bfq_queue if no process sends requests to it any longer. Even
    worse, the bfq_queue may happen to be prematurely freed while still in
    service (because there may remain no reference to it any longer).

    This commit solves this problem by preventing I/O dispatch from being
    plugged for the in-service bfq_queue, if the latter has no process
    reference (the bfq_queue is then prevented from remaining in service).

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Tested-by: Oleksandr Natalenko
    Reported-by: Patrick Dung
    Tested-by: Patrick Dung
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     

14 Nov, 2019

1 commit

  • Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if
    they deserve I/O plugging"), to prevent the service guarantees of a
    bfq_queue from being violated, the bfq_queue may be left busy, i.e.,
    scheduled for service, even if empty (see comments in
    __bfq_bfqq_expire() for details). But, if no process will send
    requests to the bfq_queue any longer, then there is no point in
    keeping the bfq_queue scheduled for service.

    In addition, keeping the bfq_queue scheduled for service, but with no
    process reference any longer, may cause the bfq_queue to be freed when
    descheduled from service. But this is assumed to never happen, and
    causes a UAF if it happens. This, in turn, caused crashes [1, 2].

    This commit fixes this issue by descheduling an empty bfq_queue when
    it remains with not process reference.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539
    [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Reported-by: Chris Evich
    Reported-by: Patrick Dung
    Reported-by: Thorsten Schubert
    Tested-by: Thorsten Schubert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

18 Sep, 2019

4 commits

  • If equal to 0, the injection limit for a bfq_queue is pushed to 1
    after a first sample of the total service time of the I/O requests of
    the queue is computed (to allow injection to start). Yet, because of a
    mistake in the branch that performs this action, the push may happen
    also in some other case. This commit fixes this issue.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The update period of the injection limit has been tentatively set to
    100 ms, to reduce fluctuations. This value however proved to cause,
    occasionally, the limit to be decremented for some bfq_queue only
    after the queue underwent excessive injection for a lot of time. This
    commit reduces the period to 10 ms.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Upon an increment attempt of the injection limit, the latter is
    constrained not to become higher than twice the maximum number
    max_rq_in_driver of I/O requests that have happened to be in service
    in the drive. This high bound allows the injection limit to grow
    beyond max_rq_in_driver, which may then cause max_rq_in_driver itself
    to grow.

    However, since the limit is incremented by only one unit at a time,
    there is no need for such a high bound, and just max_rq_in_driver+1 is
    enough.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ updates the injection limit of each bfq_queue as a function of how
    much the limit inflates the service times experienced by the I/O
    requests of the queue. So only service times affected by injection
    must be taken into account. Unfortunately, in the current
    implementation of this update scheme, the service time of an I/O
    request rq not affected by injection may happen to be considered in
    the following case: there is no I/O request in service when rq
    arrives.

    This commit fixes this issue by making sure that only service times
    affected by injection are considered for updating the injection
    limit. In particular, the service time of an I/O request rq is now
    considered only if at least one of the following two conditions holds:
    - the destination bfq_queue for rq underwent injection before rq
    arrival, and there is still I/O in service in the drive on rq arrival
    (the service of such unfinished I/O may delay the service of rq);
    - injection occurs between the arrival and the completion time of rq.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

08 Aug, 2019

3 commits

  • As reported in [1], the call bfq_init_rq(rq) may return NULL in case
    of OOM (in particular, if rq->elv.icq is NULL because memory
    allocation failed in failed in ioc_create_icq()).

    This commit handles this circumstance.

    [1] https://lkml.org/lkml/2019/7/22/824

    Cc: Hsin-Yi Wang
    Cc: Nicolas Boichat
    Cc: Doug Anderson
    Reported-by: Guenter Roeck
    Reported-by: Hsin-Yi Wang
    Reviewed-by: Guenter Roeck
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
    unconditionally inject their I/O"), every bfq_queue has a pointer to a
    waker bfq_queue and a list of the bfq_queues it may wake. In this
    respect, when a bfq_queue, say Q, remains with no I/O source attached
    to it, Q cannot be woken by any other bfq_queue, and cannot wake any
    other bfq_queue. Then Q must be removed from the woken list of its
    possible waker bfq_queue, and all bfq_queues in the woken list of Q
    must stop having a waker bfq_queue.

    Q remains with no I/O source in two cases: when the last process
    associated with Q exits or when such a process gets associated with a
    different bfq_queue. Unfortunately, commit 13a857a4c4e8 ("block, bfq:
    detect wakers and unconditionally inject their I/O") performed the
    above updates only in the first case.

    This commit fixes this bug by moving these updates to when Q gets
    freed. This is a simple and safe way to handle all cases, as both the
    above events, process exit and re-association, lead to Q being freed
    soon, and because dangling references would come out only after Q gets
    freed (if no update were performed).

    Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
    Reported-by: Douglas Anderson
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
    unconditionally inject their I/O"), BFQ stores, in a per-device
    pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O
    request completed. If some bfq_queue receives new I/O right after the
    last request of last_completed_rq_bfqq has been completed, then
    last_completed_rq_bfqq may be a waker bfq_queue.

    But if the bfq_queue last_completed_rq_bfqq points to is freed, then
    last_completed_rq_bfqq becomes a dangling reference. This commit
    resets last_completed_rq_bfqq if the pointed bfq_queue is freed.

    Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
    Reported-by: Douglas Anderson
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

27 Jul, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - Several io_uring fixes/improvements:
    - Blocking fix for O_DIRECT (me)
    - Latter page slowness for registered buffers (me)
    - Fix poll hang under certain conditions (me)
    - Defer sequence check fix for wrapped rings (Zhengyuan)
    - Mismatch in async inc/dec accounting (Zhengyuan)
    - Memory ordering issue that could cause stall (Zhengyuan)
    - Track sequential defer in bytes, not pages (Zhengyuan)

    - NVMe pull request from Christoph

    - Set of hang fixes for wbt (Josef)

    - Redundant error message kill for libahci (Ding)

    - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

    - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

    - blkcg ->pd_stat() non-debug print (Tejun)

    - bcache memory leak fix (Wei)

    - Comment fix (Akinobu)

    - BFQ perf regression fix (Paolo)

    * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
    io_uring: ensure ->list is initialized for poll commands
    Revert "nvme-pci: don't create a read hctx mapping without read queues"
    nvme: fix multipath crash when ANA is deactivated
    nvme: fix memory leak caused by incorrect subsystem free
    nvme: ignore subnqn for ADATA SX6000LNP
    drbd: dynamically allocate shash descriptor
    block: blk-mq: Remove blk_mq_sched_started_request and started_request
    bcache: fix possible memory leak in bch_cached_dev_run()
    io_uring: track io length in async_list based on bytes
    io_uring: don't use iov_iter_advance() for fixed buffers
    block: properly handle IOCB_NOWAIT for async O_DIRECT IO
    blk-mq: allow REQ_NOWAIT to return an error inline
    io_uring: add a memory barrier before atomic_read
    rq-qos: use a mb for got_token
    rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
    rq-qos: don't reset has_sleepers on spurious wakeups
    rq-qos: fix missed wake-ups in rq_qos_throttle
    wait: add wq_has_single_sleeper helper
    block, bfq: check also in-flight I/O in dispatch plugging
    block: fix sysfs module parameters directory path in comment
    ...

    Linus Torvalds
     

18 Jul, 2019

1 commit

  • Consider a sync bfq_queue Q that remains empty while in service, and
    suppose that, when this happens, there is a fair amount of already
    in-flight I/O not belonging to Q. In such a situation, I/O dispatching
    may need to be plugged (until new I/O arrives for Q), for the
    following reason.

    The drive may decide to serve in-flight non-Q's I/O requests before
    Q's ones, thereby delaying the arrival of new I/O requests for Q
    (recall that Q is sync). If I/O-dispatching is not plugged, then,
    while Q remains empty, a basically uncontrolled amount of I/O from
    other queues may be dispatched too, possibly causing the service of
    Q's I/O to be delayed even longer in the drive. This problem gets more
    and more serious as the speed and the queue depth of the drive grow,
    because, as these two quantities grow, the probability to find no
    queue busy but many requests in flight grows too.

    If Q has the same weight and priority as the other queues, then the
    above delay is unlikely to cause any issue, because all queues tend to
    undergo the same treatment. So, since not plugging I/O dispatching is
    convenient for throughput, it is better not to plug. Things change in
    case Q has a higher weight or priority than some other queue, because
    Q's service guarantees may simply be violated. For this reason,
    commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric
    scenarios") does plug I/O in such an asymmetric scenario. Plugging
    minimizes the delay induced by already in-flight I/O, and enables Q to
    recover the bandwidth it may lose because of this delay.

    Yet the above commit does not cover the case of weight-raised queues,
    for efficiency concerns. For weight-raised queues, I/O-dispatch
    plugging is activated simply if not all bfq_queues are
    weight-raised. But this check does not handle the case of in-flight
    requests, because a bfq_queue may become non busy *before* all its
    in-flight requests are completed.

    This commit performs I/O-dispatch plugging for weight-raised queues if
    there are some in-flight requests.

    As a practical example of the resulting recover of control, under
    write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5
    seconds after this fix, against 15 seconds before the fix (as a
    reference, gnome-terminal takes about 35 seconds to start with any of
    the other I/O schedulers).

    Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Jul, 2019

1 commit

  • Rename the block documentation files to ReST, add an
    index for them and adjust in order to produce a nice html
    output via the Sphinx build system.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

10 Jul, 2019

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main block updates for 5.3. Nothing earth shattering or
    major in here, just fixes, additions, and improvements all over the
    map. This contains:

    - Series of documentation fixes (Bart)

    - Optimization of the blk-mq ctx get/put (Bart)

    - null_blk removal race condition fix (Bob)

    - req/bio_op() cleanups (Chaitanya)

    - Series cleaning up the segment accounting, and request/bio mapping
    (Christoph)

    - Series cleaning up the page getting/putting for bios (Christoph)

    - block cgroup cleanups and moving it to where it is used (Christoph)

    - block cgroup fixes (Tejun)

    - Series of fixes and improvements to bcache, most notably a write
    deadlock fix (Coly)

    - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

    - Series of improvements and fixes to BFQ (Douglas, Paolo)

    - debugfs_create() return value check removal for drbd (Greg)

    - Use struct_size(), where appropriate (Gustavo)

    - Two lighnvm fixes (Heiner, Geert)

    - MD fixes, including a read balance and corruption fix (Guoqing,
    Marcos, Xiao, Yufen)

    - block opal shadow mbr additions (Jonas, Revanth)

    - sbitmap compare-and-exhange improvemnts (Pavel)

    - Fix for potential bio->bi_size overflow (Ming)

    - NVMe pull requests:
    - improved PCIe suspent support (Keith Busch)
    - error injection support for the admin queue (Akinobu Mita)
    - Fibre Channel discovery improvements (James Smart)
    - tracing improvements including nvmetc tracing support (Minwoo Im)
    - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
    Kulkarni)"

    - Various little fixes and improvements to drivers and core"

    * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
    blk-iolatency: fix STS_AGAIN handling
    block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
    blk-mq: simplify blk_mq_make_request()
    blk-mq: remove blk_mq_put_ctx()
    sbitmap: Replace cmpxchg with xchg
    block: fix .bi_size overflow
    block: sed-opal: check size of shadow mbr
    block: sed-opal: ioctl for writing to shadow mbr
    block: sed-opal: add ioctl for done-mark of shadow mbr
    block: never take page references for ITER_BVEC
    direct-io: use bio_release_pages in dio_bio_complete
    block_dev: use bio_release_pages in bio_unmap_user
    block_dev: use bio_release_pages in blkdev_bio_end_io
    iomap: use bio_release_pages in iomap_dio_bio_end_io
    block: use bio_release_pages in bio_map_user_iov
    block: use bio_release_pages in bio_unmap_user
    block: optionally mark pages dirty in bio_release_pages
    block: move the BIO_NO_PAGE_REF check into bio_release_pages
    block: skd_main.c: Remove call to memset after dma_alloc_coherent
    block: mtip32xx: Remove call to memset after dma_alloc_coherent
    ...

    Linus Torvalds
     

28 Jun, 2019

1 commit

  • In reboot tests on several devices we were seeing a "use after free"
    when slub_debug or KASAN was enabled. The kernel complained about:

    Unable to handle kernel paging request at virtual address 6b6b6c2b

    ...which is a classic sign of use after free under slub_debug. The
    stack crawl in kgdb looked like:

    0 test_bit (addr=, nr=)
    1 bfq_bfqq_busy (bfqq=)
    2 bfq_select_queue (bfqd=)
    3 __bfq_dispatch_request (hctx=)
    4 bfq_dispatch_request (hctx=)
    5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
    6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
    7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
    8 0xc0568d94 in blk_mq_run_work_fn (work=)
    9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
    10 0xc024cff4 in worker_thread (__worker=0xec6d4640)

    Digging in kgdb, it could be found that, though bfqq looked fine,
    bfqq->bic had been freed.

    Through further digging, I postulated that perhaps it is illegal to
    access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
    because the "bic" can be freed at some point in time after this call
    is made. I confirmed that there certainly were cases where the exact
    crashing code path would access the "bic" after bfq_exit_icq() had
    been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and
    saw that the bic was 0x7 at the time of the crash.

    To understand a bit more about why this crash was fairly uncommon (I
    saw it only once in a few hundred reboots), you can see that much of
    the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't
    access the ->bic anymore. The only case it doesn't is if
    bfq_put_queue() sees a reference still held.

    However, even in the case when bfqq isn't freed, the crash is still
    rare. Why? I tracked what happened to the "bic" after the exit
    routine. It doesn't get freed right away. Rather,
    put_io_context_active() eventually called put_io_context() which
    queued up freeing on a workqueue. The freeing then actually happened
    later than that through call_rcu(). Despite all these delays, some
    extra debugging showed that all the hoops could be jumped through in
    time and the memory could be freed causing the original crash. Phew!

    To make a long story short, assuming it truly is illegal to access an
    icq after the "exit_icq" callback is finished, this patch is needed.

    Cc: stable@vger.kernel.org
    Reviewed-by: Paolo Valente
    Signed-off-by: Douglas Anderson
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

27 Jun, 2019

1 commit

  • Some debug code suggested by Paolo was tripping when I did reboot
    stress tests. Specifically in bfq_bfqq_resume_state()
    "bic->saved_wr_start_at_switch_to_srt" was later than the current
    value of "jiffies". A bit of debugging showed that
    "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more
    debugging showed that was because we had run through the "unlikely"
    case in the bfq_bfqq_save_state() function.

    Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to
    something sane.

    NOTE: this fixes no known real-world errors.

    Reviewed-by: Paolo Valente
    Reviewed-by: Guenter Roeck
    Signed-off-by: Douglas Anderson
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

26 Jun, 2019

1 commit

  • By mistake, there is a '&' instead of a '==' in the definition of the
    macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
    the correct one.

    Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

25 Jun, 2019

7 commits

  • Consider, on one side, a bfq_queue Q that remains empty while in
    service, and, on the other side, the pending I/O of bfq_queues that,
    according to their timestamps, have to be served after Q. If an
    uncontrolled amount of I/O from the latter bfq_queues were dispatched
    while Q is waiting for its new I/O to arrive, then Q's bandwidth
    guarantees would be violated. To prevent this, I/O dispatch is plugged
    until Q receives new I/O (except for a properly controlled amount of
    injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
    for the following reason.

    Preemption is performed in two steps. First, Q is expired and
    re-scheduled. Second, the new bfq_queue to serve is chosen. The first
    step is needed by the second, as the second can be performed only
    after Q's timestamps have been properly updated (done in the
    expiration step), and Q has been re-queued for service. This
    dependency is a consequence of the way how BFQ's scheduling algorithm
    is currently implemented.

    But Q is not re-scheduled at all in the first step, because Q is
    empty. As a consequence, an uncontrolled amount of I/O may be
    dispatched until Q becomes non empty again. This breaks Q's service
    guarantees.

    This commit addresses this issue by re-scheduling Q even if it is
    empty. This in turn breaks the assumption that all scheduled queues
    are non empty. Then a few extra checks are now needed.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ enqueues the I/O coming from each process into a separate
    bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
    served for at most timeout_sync milliseconds (default: 125 ms). This
    service scheme is prone to the following inaccuracy.

    While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
    receive I/O, and, according to BFQ's scheduling policy, may become the
    right bfq_queue to serve, in place of the currently in-service
    bfq_queue. In this respect, postponing the service of Q2 to after the
    service of Q1 finishes may delay the completion of Q2's I/O, compared
    with an ideal service in which all non-empty bfq_queues are served in
    parallel, and every non-empty bfq_queue is served at a rate
    proportional to the bfq_queue's weight. This additional delay is equal
    at most to the time Q1 may unjustly remain in service before switching
    to Q2.

    If Q1 and Q2 have the same weight, then this time is most likely
    negligible compared with the completion time to be guaranteed to Q2's
    I/O. In addition, first, one of the reasons why BFQ may want to serve
    Q1 for a while is that this boosts throughput and, second, serving Q1
    longer reduces BFQ's overhead. As a conclusion, it is usually better
    not to preempt Q1 if both Q1 and Q2 have the same weight.

    In contrast, as Q2's weight or priority becomes higher and higher
    compared with that of Q1, the above delay becomes larger and larger,
    compared with the I/O completion times that have to be guaranteed to
    Q2 according to Q2's weight. So reducing this delay may be more
    important than avoiding the costs of preempting Q1.

    Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
    higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
    triggers a new choice of the next bfq_queue to serve. If Q2 really is
    the next bfq_queue to serve, then Q2 will be set in service
    immediately.

    This change reduces the component of the I/O latency caused by the
    above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
    SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
    a process doing sporadic random reads while another process is doing
    continuous sequential reads.

    Signed-off-by: Nicola Bottura
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • A bfq_queue Q may happen to be synchronized with another
    bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
    receive new I/O. We call Q2 "waker queue".

    If I/O plugging is being performed for Q, and Q is not receiving any
    more I/O because of the above synchronization, then, thanks to BFQ's
    injection mechanism, the waker queue is likely to get served before
    the I/O-plugging timeout fires.

    Unfortunately, this fact may not be sufficient to guarantee a high
    throughput during the I/O plugging, because the inject limit for Q may
    be too low to guarantee a lot of injected I/O. In addition, the
    duration of the plugging, i.e., the time before Q finally receives new
    I/O, may not be minimized, because the waker queue may happen to be
    served only after other queues.

    To address these issues, this commit introduces the explicit detection
    of the waker queue, and the unconditional injection of a pending I/O
    request of the waker queue on each invocation of
    bfq_dispatch_request().

    One may be concerned that this systematic injection of I/O from the
    waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
    the contrary, next Q's I/O is brought forward dramatically, for it is
    not blocked for milliseconds.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Until the base value for request service times gets finally computed
    for a bfq_queue, the inject limit for that queue does depend on the
    think-time state (short|long) of the queue. A timely update of the
    think time then guarantees a quicker activation or deactivation of the
    injection. Fortunately, the think time of a bfq_queue is updated in
    the same code path as the inject limit; but after the inject limit.

    This commits moves the update of the think time before the update of
    the inject limit. For coherence, it moves the update of the seek time
    too.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • I/O injection gets reduced if it increases the request service times
    of the victim queue beyond a certain threshold. The threshold, in its
    turn, is computed as a function of the base service time enjoyed by
    the queue when it undergoes no injection.

    As a consequence, for injection to work properly, the above base value
    has to be accurate. In this respect, such a value may vary over
    time. For example, it varies if the size or the spatial locality of
    the I/O requests in the queue change. It is then important to update
    this value whenever possible. This commit performs this update.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • One of the cases where the parameters for injection may be updated is
    when there are no more in-flight I/O requests. The number of in-flight
    requests is stored in the field bfqd->rq_in_driver of the descriptor
    bfqd of the device. So, the controlled condition is
    bfqd->rq_in_driver == 0.

    Unfortunately, this is wrong because, the instruction that checks this
    condition is in the code path that handles the completion of a
    request, and, in particular, the instruction is executed before
    bfqd->rq_in_driver is decremented in such a code path.

    This commit fixes this issue by just replacing 0 with 1 in the
    comparison.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Until the base value of the request service times gets finally
    computed for a bfq_queue, the inject limit does depend on the
    think-time state (short|long). The limit must be 0 or 1 if the think
    time is deemed, respectively, as short or long. However, such a check
    and possible limit update is performed only periodically, once per
    second. So, to make the injection mechanism much more reactive, this
    commit performs the update also every time the think-time state
    changes.

    In addition, in the following special case, this commit lets the
    inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's
    think time is short: bfqq's I/O is synchronized with that of some
    other queue, i.e., bfqq may receive new I/O only after the I/O of the
    other queue is completed. Keeping the inject limit to 1 allows the
    blocking I/O to be served while bfqq is in service. And this is very
    convenient both for bfqq and for the total throughput, as explained
    in detail in the comments in bfq_update_has_short_ttime().

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

21 Jun, 2019

2 commits

  • This option is entirely bfq specific, give it an appropinquate name.

    Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all
    the functionality already does so anyway.

    Acked-by: Tejun Heo
    Acked-by: Paolo Valente
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead of that it can passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this how path down remove
    pointless struct request_queue arguments from any of the functions
    that had it and grew a nr_segs argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 May, 2019

1 commit


22 Apr, 2019

1 commit

  • Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
    comment, and is trivial. The other one is a conflict due to a later fix
    in the bio multi-page work, and needs a bit more care.

    * tag 'v5.1-rc6': (770 commits)
    Linux 5.1-rc6
    block: make sure that bvec length can't be overflow
    block: kill all_q_node in request_queue
    x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
    coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
    mm/kmemleak.c: fix unused-function warning
    init: initialize jump labels before command line option parsing
    kernel/watchdog_hld.c: hard lockup message should end with a newline
    kcov: improve CONFIG_ARCH_HAS_KCOV help text
    mm: fix inactive list balancing between NUMA nodes and cgroups
    mm/hotplug: treat CMA pages as unmovable
    proc: fixup proc-pid-vm test
    proc: fix map_files test on F29
    mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
    mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
    mm: swapoff: shmem_unuse() stop eviction without igrab()
    mm: swapoff: take notice of completion sooner
    mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
    mm: swapoff: shmem_find_swap_entries() filter out other types
    slab: store tagged freelist for off-slab slabmgmt
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Apr, 2019

1 commit

  • A previous commit moved the shallow depth and BFQ depth map calculations
    to be done at init time, moving it outside of the hotter IO path. This
    potentially causes hangs if the users changes the depth of the scheduler
    map, by writing to the 'nr_requests' sysfs file for that device.

    Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
    the depth changes, so that the scheduler can update its internal state.

    Tested-by: Kai Krakow
    Reported-by: Paolo Valente
    Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • The function bfq_bfqq_expire() invokes the function
    __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
    If this happens, then no other instruction of bfq_bfqq_expire() must
    be executed, or a use-after-free will occur.

    Basing on the assumption that __bfq_bfqq_expire() invokes
    bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
    assumed to be freed if its refcounter is equal to one right before
    invoking __bfq_bfqq_expire().

    But, since commit 9dee8b3b057e ("block, bfq: fix queue removal from
    weights tree") this assumption is false. __bfq_bfqq_expire() may also
    invoke bfq_weights_tree_remove() and, since commit 9dee8b3b057e
    ("block, bfq: fix queue removal from weights tree"), also
    the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire()
    may invoke bfq_put_queue() twice, and this is the actual case where
    the in-service queue may happen to be freed.

    To address this issue, this commit moves the check on the refcounter
    of the queue right around the last bfq_put_queue() that may be invoked
    on the queue.

    Fixes: 9dee8b3b057e ("block, bfq: fix queue removal from weights tree")
    Reported-by: Dmitrii Tcvetkov
    Reported-by: Douglas Anderson
    Tested-by: Dmitrii Tcvetkov
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

09 Apr, 2019

1 commit


01 Apr, 2019

9 commits

  • bfq saves the state of a queue each time a merge occurs, to be
    able to resume such a state when the queue is associated again
    with its original process, on a split.

    Unfortunately bfq does not save & restore also the weight of the
    queue. If the weight is not correctly resumed when the queue is
    recycled, then the weight of the recycled queue could differ
    from the weight of the original queue.

    This commit adds the missing save & resume of the weight.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Francesco Pollicino
     
  • The function "bfq_log_bfqq" prints the pid of the process
    associated with the queue passed as input.

    Unfortunately, if the queue is shared, then more than one process
    is associated with the queue. The pid that gets printed in this
    case is the pid of one of the associated processes.
    Which process gets printed depends on the exact sequence of merge
    events the queue underwent. So printing such a pid is rather
    useless and above all is often rather confusing because it
    reports a random pid between those of the associated processes.

    This commit addresses this issue by printing SHARED instead of a pid
    if the queue is shared.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Francesco Pollicino
     
  • If many bfq_queues belonging to the same group happen to be created
    shortly after each other, then the processes associated with these
    queues have typically a common goal. In particular, bursts of queue
    creations are usually caused by services or applications that spawn
    many parallel threads/processes. Examples are systemd during boot, or
    git grep. If there are no other active queues, then, to help these
    processes get their job done as soon as possible, the best thing to do
    is to reach a high throughput. To this goal, it is usually better to
    not grant either weight-raising or device idling to the queues
    associated with these processes. And this is exactly what BFQ
    currently does.

    There is however a drawback: if, in contrast, some other queues are
    already active, then the newly created queues must be protected from
    the I/O flowing through the already existing queues. In this case, the
    best thing to do is the opposite as in the other case: it is much
    better to grant weight-raising and device idling to the newly-created
    queues, if they deserve it. This commit addresses this issue by doing
    so if there are already other active queues.

    This change also helps eliminating false positives, which occur when
    the newly-created queues do not belong to an actual large burst of
    creations, but some background task (e.g., a service) happens to
    trigger the creation of new queues in the middle, i.e., very close to
    when the victim queues are created. These false positive may cause
    total loss of control on process latencies.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Sync random I/O is likely to be confused with soft real-time I/O,
    because it is characterized by limited throughput and apparently
    isochronous arrival pattern. To avoid false positives, this commits
    prevents bfq_queues containing only random (seeky) I/O from being
    tagged as soft real-time.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To boost throughput with a set of processes doing interleaved I/O
    (i.e., a set of processes whose individual I/O is random, but whose
    merged cumulative I/O is sequential), BFQ merges the queues associated
    with these processes, i.e., redirects the I/O of these processes into a
    common, shared queue. In the shared queue, I/O requests are ordered by
    their position on the medium, thus sequential I/O gets dispatched to
    the device when the shared queue is served.

    Queue merging costs execution time, because, to detect which queues to
    merge, BFQ must maintain a list of the head I/O requests of active
    queues, ordered by request positions. Measurements showed that this
    costs about 10% of BFQ's total per-request processing time.

    Request processing time becomes more and more critical as the speed of
    the underlying storage device grows. Yet, fortunately, queue merging
    is basically useless on the very devices that are so fast to make
    request processing time critical. To reach a high throughput, these
    devices must have many requests queued at the same time. But, in this
    configuration, the internal scheduling algorithms of these devices do
    also the job of queue merging: they reorder requests so as to obtain
    as much as possible a sequential I/O pattern. As a consequence, with
    processes doing interleaved I/O, the throughput reached by one such
    device is likely to be the same, with and without queue merging.

    In view of this fact, this commit disables queue merging, and all
    related housekeeping, for non-rotational devices with internal
    queueing. The total, single-lock-protected, per-request processing
    time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
    (time measured with simple code instrumentation, and using the
    throughput-sync.sh script of the S suite [1], in performance-profiling
    mode). To put this result into context, the total,
    single-lock-protected, per-request execution time of the lightest I/O
    scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
    ~800 LOC, against ~10500 LOC for BFQ).

    Disabling merging provides a further, remarkable benefit in terms of
    throughput. Merging tends to make many workloads artificially more
    uneven, mainly because of shared queues remaining non empty for
    incomparably more time than normal queues. So, if, e.g., one of the
    queues in a set of merged queues has a higher weight than a normal
    queue, then the shared queue may inherit such a high weight and, by
    staying almost always active, may force BFQ to perform I/O plugging
    most of the time. This evidently makes it harder for BFQ to let the
    device reach a high throughput.

    As a practical example of this problem, and of the benefits of this
    commit, we measured again the throughput in the nasty scenario
    considered in previous commit messages: dbench test (in the Phoronix
    suite), with 6 clients, on a filesystem with journaling, and with the
    journaling daemon enjoying a higher weight than normal processes. With
    this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
    PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
    of the other I/O schedulers. As such, this is also likely to be the
    maximum possible throughput reachable with this workload on this
    device, because I/O is mostly random, and the other schedulers
    basically just pass I/O requests to the drive as fast as possible.

    [1] https://github.com/Algodev-github/S

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Tested-by: Francesco Pollicino
    Signed-off-by: Alessio Masola
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The processes associated with a bfq_queue, say Q, may happen to
    generate their cumulative I/O at a lower rate than the rate at which
    the device could serve the same I/O. This is rather probable, e.g., if
    only one process is associated with Q and the device is an SSD. It
    results in Q becoming often empty while in service. If BFQ is not
    allowed to switch to another queue when Q becomes empty, then, during
    the service of Q, there will be frequent "service holes", i.e., time
    intervals during which Q gets empty and the device can only consume
    the I/O already queued in its hardware queues. This easily causes
    considerable losses of throughput.

    To counter this problem, BFQ implements a request injection mechanism,
    which tries to fill the above service holes with I/O requests taken
    from other bfq_queues. The hard part in this mechanism is finding the
    right amount of I/O to inject, so as to both boost throughput and not
    break Q's bandwidth and latency guarantees. To this goal, the current
    version of this mechanism measures the bandwidth enjoyed by Q while it
    is being served, and tries to inject the maximum possible amount of
    extra service that does not cause Q's bandwidth to decrease too
    much.

    This solution has an important shortcoming. For bandwidth measurements
    to be stable and reliable, Q must remain in service for a much longer
    time than that needed to serve a single I/O request. Unfortunately,
    this does not hold with many workloads. This commit addresses this
    issue by changing the way the amount of injection allowed is
    dynamically computed. It tunes injection as a function of the service
    times of single I/O requests of Q, instead of Q's
    bandwidth. Single-request service times are evidently meaningful even
    if Q gets very few I/O requests completed while it is in service.

    As a testbed for this new solution, we measured the throughput reached
    by BFQ for one of the nastiest workloads and configurations for this
    scheduler: the workload generated by the dbench test (in the Phoronix
    suite), with 6 clients, on a filesystem with journaling, and with the
    journaling daemon enjoying a higher weight than normal processes.
    With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
    a PLEXTOR PX-256M5.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Tested-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In most cases, it is detrimental for throughput to plug I/O dispatch
    when the in-service bfq_queue becomes temporarily empty (plugging is
    performed to wait for the possible arrival, soon, of new I/O from the
    in-service queue). There is however a case where plugging is needed
    for service guarantees. If a bfq_queue, say Q, has a higher weight
    than some other active bfq_queue, and is sync, i.e., contains sync
    I/O, then, to guarantee that Q does receive a higher share of the
    throughput than other lower-weight queues, it is necessary to plug I/O
    dispatch when Q remains temporarily empty while being served.

    For this reason, BFQ performs I/O plugging when some active bfq_queue
    has a higher weight than some other active bfq_queue. But this is
    unnecessarily overkill. In fact, if the in-service bfq_queue actually
    has a weight lower than or equal to the other queues, then the queue
    *must not* be guaranteed a higher share of the throughput than the
    other queues. So, not plugging I/O cannot cause any harm to the
    queue. And can boost throughput.

    Taking advantage of this fact, this commit does not plug I/O for sync
    bfq_queues with a weight lower than or equal to the weights of the
    other queues. Here is an example of the resulting throughput boost
    with the dbench workload, which is particularly nasty for BFQ. With
    the dbench test in the Phoronix suite, BFQ reaches its lowest total
    throughput with 6 clients on a filesystem with journaling, in case the
    journaling daemon has a higher weight than normal processes. Before
    this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
    after this commit it is ~100 MB/sec.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • If a sync bfq_queue has a higher weight than some other queue, and
    remains temporarily empty while in service, then, to preserve the
    bandwidth share of the queue, it is necessary to plug I/O dispatching
    until a new request arrives for the queue. In addition, a timeout
    needs to be set, to avoid waiting for ever if the process associated
    with the queue has actually finished its I/O.

    Even with the above timeout, the device is however not fed with new
    I/O for a while, if the process has finished its I/O. If this happens
    often, then throughput drops and latencies grow. For this reason, the
    timeout is kept rather low: 8 ms is the current default.

    Unfortunately, such a low value may cause, on the opposite end, a
    violation of bandwidth guarantees for a process that happens to issue
    new I/O too late. The higher the system load, the higher the
    probability that this happens to some process. This is a problem in
    scenarios where service guarantees matter more than throughput. One
    important case are weight-raised queues, which need to be granted a
    very high fraction of the bandwidth.

    To address this issue, this commit lower-bounds the plugging timeout
    for weight-raised queues to 20 ms. This simple change provides
    relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
    gnome-terminal starts in 0.6 seconds if there is no other I/O in
    progress, the same applications starts in
    - 0.8 seconds, instead of 1.2 seconds, if ten files are being read
    sequentially in parallel
    - 1 second, instead of 2 seconds, if, in parallel, five files are
    being read sequentially, and five more files are being written
    sequentially

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED.
    Code under these ifdefs never worked, something might be broken.

    Fixes: 0471559c2fbd ("block, bfq: add/remove entity weights correctly")
    Fixes: 73d58118498b ("block, bfq: consider also ioprio classes in symmetry detection")
    Reviewed-by: Holger Hoffstätte
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

01 Feb, 2019

1 commit

  • When a new I/O request arrives for a bfq_queue, say Q, bfq checks
    whether that request is close to
    (a) the head request of some other queue waiting to be served, or
    (b) the last request dispatched for the in-service queue (in case Q
    itself is not the in-service queue)

    If a queue, say Q2, is found for which the above condition holds, then
    bfq merges Q and Q2, to hopefully get a more sequential I/O in the
    resulting merged queue, and thus a possibly higher throughput.

    Case (b) is checked by comparing the new request for Q with the last
    request dispatched, assuming that the latter necessarily belonged to the
    in-service queue. Unfortunately, this assumption is no longer always
    correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
    into seeky idle queues on NCQ flash").

    When the assumption does not hold, queues that must not be merged may be
    merged, causing unexpected loss of control on per-queue service
    guarantees.

    This commit solves this problem by adding an extra field, which stores
    the actual last request dispatched for the in-service queue, and by
    using this new field to correctly check case (b).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente