23 Sep, 2020

1 commit

  • [ Upstream commit e8a8a185051a460e3eb0617dca33f996f4e31516 ]

    Yang Yang reported the following crash caused by requeueing a flush
    request in Kyber:

    [ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
    ...
    [ 2.517468] pc : clear_bit+0x18/0x2c
    [ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
    [ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
    ...
    [ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
    [ 2.517602] Call trace:
    [ 2.517606] clear_bit+0x18/0x2c
    [ 2.517619] kyber_finish_request+0x74/0x80
    [ 2.517627] blk_mq_requeue_request+0x3c/0xc0
    [ 2.517637] __scsi_queue_insert+0x11c/0x148
    [ 2.517640] scsi_softirq_done+0x114/0x130
    [ 2.517643] blk_done_softirq+0x7c/0xb0
    [ 2.517651] __do_softirq+0x208/0x3bc
    [ 2.517657] run_ksoftirqd+0x34/0x60
    [ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
    [ 2.517667] kthread+0x110/0x120
    [ 2.517669] ret_from_fork+0x10/0x18

    This happens because Kyber doesn't track flush requests, so
    kyber_finish_request() reads a garbage domain token. Only call the
    scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
    the finish_request() hook in blk_mq_free_request()). Now that we're
    handling it in blk-mq, also remove the check from BFQ.
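The guard can be modeled in a few lines of plain C. This is an illustrative sketch, not the actual blk-mq code: the structure name, the helper, and the flag's bit position are all hypothetical; only the idea (gate the scheduler's requeue hook on RQF_ELVPRIV) comes from the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Flush requests never get elevator-private data, so RQF_ELVPRIV is
 * the reliable sign that the scheduler tracked this request.
 * The bit value here is hypothetical. */
#define RQF_ELVPRIV (1u << 12)

struct mock_request {
        unsigned int rq_flags;
};

/* Mirror of the fix: call the scheduler's requeue_request() hook only
 * for requests that actually went through the scheduler, just as
 * blk_mq_free_request() already does for finish_request(). */
static bool sched_requeue_allowed(const struct mock_request *rq)
{
        return (rq->rq_flags & RQF_ELVPRIV) != 0;
}
```

With this check in place, a requeued flush request never reaches kyber_finish_request(), so no garbage domain token is read.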

    Reported-by: Yang Yang
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Omar Sandoval
     

23 Apr, 2020

1 commit

  • commit c8997736650060594845e42c5d01d3118aec8d25 upstream.

    A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
    goal of this put is to release a process reference to a bfq_queue. But
    process-reference releases may trigger also some extra operation, and,
    to this goal, are handled through bfq_release_process_ref(). So, turn
    the invocation of bfq_put_queue() into an invocation of
    bfq_release_process_ref().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

17 Apr, 2020

1 commit

  • [ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ]

    In bfq_idle_slice_timer(), the read bfqq = bfqd->in_service_queue
    happens outside the bfqd->lock critical section. A bfqq that is
    non-NULL in bfq_idle_slice_timer() may therefore be freed before it
    is used in bfq_idle_slice_timer_body(), in which case we access
    freed memory.

    In addition, since such a race is possible, bfq_idle_slice_timer_body()
    should first check whether bfqq is still in service before doing
    anything with it. If a racing bfqq is no longer in service, it has
    already been expired through __bfq_bfqq_expire(), and its
    wait_request flag has already been cleared in
    __bfq_bfqd_reset_in_service(), so there is no need to clear it again.
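A minimal single-threaded sketch of the reworked call shape follows. The struct names and fields are simplified stand-ins, and the locking is elided; in the kernel the body re-reads the in-service queue inside the bfqd->lock critical section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structs. */
struct mock_bfqq {
        bool wait_request;
};

struct mock_bfqd {
        struct mock_bfqq *in_service_queue;
};

/* Sketch of the fixed shape: the timer body receives bfqd, reads the
 * in-service queue itself (in the kernel, under bfqd->lock), and
 * bails out if no queue is in service any longer, so a concurrently
 * expired queue is never dereferenced. */
static void idle_slice_timer_body(struct mock_bfqd *bfqd)
{
        struct mock_bfqq *bfqq = bfqd->in_service_queue;

        if (bfqq == NULL)
                return;         /* already expired: nothing to clear */

        bfqq->wait_request = false;
}
```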

    KASAN log is given as follows:
    [13058.354613] ==================================================================
    [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
    [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
    [13058.354646]
    [13058.354655] CPU: 96 PID: 19767 Comm: fork13
    [13058.354661] Call trace:
    [13058.354667] dump_backtrace+0x0/0x310
    [13058.354672] show_stack+0x28/0x38
    [13058.354681] dump_stack+0xd8/0x108
    [13058.354687] print_address_description+0x68/0x2d0
    [13058.354690] kasan_report+0x124/0x2e0
    [13058.354697] __asan_load8+0x88/0xb0
    [13058.354702] bfq_idle_slice_timer+0xac/0x290
    [13058.354707] __hrtimer_run_queues+0x298/0x8b8
    [13058.354710] hrtimer_interrupt+0x1b8/0x678
    [13058.354716] arch_timer_handler_phys+0x4c/0x78
    [13058.354722] handle_percpu_devid_irq+0xf0/0x558
    [13058.354731] generic_handle_irq+0x50/0x70
    [13058.354735] __handle_domain_irq+0x94/0x110
    [13058.354739] gic_handle_irq+0x8c/0x1b0
    [13058.354742] el1_irq+0xb8/0x140
    [13058.354748] do_wp_page+0x260/0xe28
    [13058.354752] __handle_mm_fault+0x8ec/0x9b0
    [13058.354756] handle_mm_fault+0x280/0x460
    [13058.354762] do_page_fault+0x3ec/0x890
    [13058.354765] do_mem_abort+0xc0/0x1b0
    [13058.354768] el0_da+0x24/0x28
    [13058.354770]
    [13058.354773] Allocated by task 19731:
    [13058.354780] kasan_kmalloc+0xe0/0x190
    [13058.354784] kasan_slab_alloc+0x14/0x20
    [13058.354788] kmem_cache_alloc_node+0x130/0x440
    [13058.354793] bfq_get_queue+0x138/0x858
    [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328
    [13058.354801] bfq_init_rq+0x1f4/0x1180
    [13058.354806] bfq_insert_requests+0x264/0x1c98
    [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488
    [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0
    [13058.354826] blk_flush_plug_list+0x230/0x548
    [13058.354830] blk_finish_plug+0x60/0x80
    [13058.354838] read_pages+0xec/0x2c0
    [13058.354842] __do_page_cache_readahead+0x374/0x438
    [13058.354846] ondemand_readahead+0x24c/0x6b0
    [13058.354851] page_cache_sync_readahead+0x17c/0x2f8
    [13058.354858] generic_file_buffered_read+0x588/0xc58
    [13058.354862] generic_file_read_iter+0x1b4/0x278
    [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4]
    [13058.354972] __vfs_read+0x238/0x320
    [13058.354976] vfs_read+0xbc/0x1c0
    [13058.354980] ksys_read+0xdc/0x1b8
    [13058.354984] __arm64_sys_read+0x50/0x60
    [13058.354990] el0_svc_common+0xb4/0x1d8
    [13058.354994] el0_svc_handler+0x50/0xa8
    [13058.354998] el0_svc+0x8/0xc
    [13058.354999]
    [13058.355001] Freed by task 19731:
    [13058.355007] __kasan_slab_free+0x120/0x228
    [13058.355010] kasan_slab_free+0x10/0x18
    [13058.355014] kmem_cache_free+0x288/0x3f0
    [13058.355018] bfq_put_queue+0x134/0x208
    [13058.355022] bfq_exit_icq_bfqq+0x164/0x348
    [13058.355026] bfq_exit_icq+0x28/0x40
    [13058.355030] ioc_exit_icq+0xa0/0x150
    [13058.355035] put_io_context_active+0x250/0x438
    [13058.355038] exit_io_context+0xd0/0x138
    [13058.355045] do_exit+0x734/0xc58
    [13058.355050] do_group_exit+0x78/0x220
    [13058.355054] __wake_up_parent+0x0/0x50
    [13058.355058] el0_svc_common+0xb4/0x1d8
    [13058.355062] el0_svc_handler+0x50/0xa8
    [13058.355066] el0_svc+0x8/0xc
    [13058.355067]
    [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70 which belongs to the cache bfq_queue of size 464
    [13058.355075] The buggy address is located 264 bytes inside of 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
    [13058.355077] The buggy address belongs to the page:
    [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
    [13058.366175] flags: 0x2ffffe0000008100(slab|head)
    [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
    [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
    [13058.370789] page dumped because: kasan: bad access detected
    [13058.370791]
    [13058.370792] Memory state around the buggy address:
    [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
    [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370808] ^
    [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    [13058.370817] ==================================================================
    [13058.370820] Disabling lock debugging due to kernel taint

    To fix this, we pass bfqd directly to bfq_idle_slice_timer_body(),
    which then reads the in-service queue itself.

    Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Acked-by: Paolo Valente
    Reported-by: Wang Wang
    Signed-off-by: Zhiqiang Liu
    Signed-off-by: Feilong Lin
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Zhiqiang Liu
     

12 Mar, 2020

2 commits

  • commit 4d8340d0d4d90e7ca367d18ec16c2fefa89a339c upstream.

    ifdefs around gets and puts of bfq groups reduce readability, remove them.

    Tested-by: Oleksandr Natalenko
    Reported-by: Jens Axboe
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     
  • [ Upstream commit 32c59e3a9a5a0b180dd015755d6d18ca31e55935 ]

    BFQ maintains an ordered list, implemented with an RB tree, of
    head-request positions of non-empty bfq_queues. This position tree,
    inherited from CFQ, is used to find bfq_queues that contain I/O close
    to each other. BFQ merges these bfq_queues into a single shared queue,
    if this boosts throughput on the device at hand.

    There is however a special-purpose bfq_queue that does not participate
    in queue merging, the oom bfq_queue. Yet, also this bfq_queue could be
    wrongly added to the position tree. So bfqq_find_close() could return
    the oom bfq_queue, which is a source of further troubles in an
    out-of-memory situation. This commit prevents the oom bfq_queue from
    being inserted into the position tree.
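The check itself is a one-line identity comparison; here is a hedged distillation with invented names (the real code compares against bfqd->oom_bfqq inside bfq_pos_tree_add_move()):

```c
#include <assert.h>
#include <stdbool.h>

struct mock_bfqq { int dummy; };

/* The per-device oom queue is a last-resort fallback that must never
 * take part in queue merging, so it must never be inserted into the
 * position tree in the first place. */
static bool may_insert_in_pos_tree(const struct mock_bfqq *bfqq,
                                   const struct mock_bfqq *oom_bfqq)
{
        return bfqq != oom_bfqq;
}
```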

    Tested-by: Patrick Dung
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     

24 Feb, 2020

1 commit

  • [ Upstream commit f718b093277df582fbf8775548a4f163e664d282 ]

    Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not
    referred by any process") fixed commit 3726112ec731 ("block, bfq:
    re-schedule empty queues if they deserve I/O plugging") by
    descheduling an empty bfq_queue when it remains with no process
    reference. Yet this still left a case uncovered: an empty bfq_queue
    with no process reference that remains in service. This happens for
    an in-service sync bfq_queue that is deemed to deserve I/O-dispatch
    plugging when it remains empty. Yet no new requests will arrive for
    such a bfq_queue if no process sends requests to it any longer. Even
    worse, the bfq_queue may happen to be prematurely freed while still in
    service (because there may remain no reference to it any longer).

    This commit solves this problem by preventing I/O dispatch from being
    plugged for the in-service bfq_queue, if the latter has no process
    reference (the bfq_queue is then prevented from remaining in service).
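The added condition can be sketched as a boolean predicate. This is a loose model with invented names; the real check involves bfqq_process_refs() and the full set of plugging conditions, which are far richer than shown here.

```c
#include <assert.h>
#include <stdbool.h>

struct mock_bfqq {
        int process_refs;       /* process references held on the queue */
        bool sync;
};

/* Sketch: even if a sync in-service queue would normally deserve
 * I/O-dispatch plugging when it empties, plugging is pointless - and
 * dangerous - when no process holds a reference to it: no new I/O can
 * arrive, and the queue may be freed while still in service. */
static bool may_plug_dispatch(const struct mock_bfqq *in_service)
{
        return in_service->sync && in_service->process_refs > 0;
}
```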

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Tested-by: Oleksandr Natalenko
    Reported-by: Patrick Dung
    Tested-by: Patrick Dung
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     

14 Nov, 2019

1 commit

  • Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if
    they deserve I/O plugging"), to prevent the service guarantees of a
    bfq_queue from being violated, the bfq_queue may be left busy, i.e.,
    scheduled for service, even if empty (see comments in
    __bfq_bfqq_expire() for details). But, if no process will send
    requests to the bfq_queue any longer, then there is no point in
    keeping the bfq_queue scheduled for service.

    In addition, keeping the bfq_queue scheduled for service, but with no
    process reference any longer, may cause the bfq_queue to be freed when
    descheduled from service. But this is assumed to never happen, and
    causes a UAF if it happens. This, in turn, caused crashes [1, 2].

    This commit fixes this issue by descheduling an empty bfq_queue when
    it remains with no process reference.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539
    [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Reported-by: Chris Evich
    Reported-by: Patrick Dung
    Reported-by: Thorsten Schubert
    Tested-by: Thorsten Schubert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

18 Sep, 2019

4 commits

  • If equal to 0, the injection limit for a bfq_queue is pushed to 1
    after a first sample of the total service time of the I/O requests of
    the queue is computed (to allow injection to start). Yet, because of a
    mistake in the branch that performs this action, the push may happen
    also in some other case. This commit fixes this issue.
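The intended branch can be reconstructed as follows. All names and the use of a zero sentinel for "no sample yet" are illustrative assumptions, not the actual bfq code:

```c
#include <assert.h>
#include <stdbool.h>

struct mock_bfqq {
        unsigned int inject_limit;
        unsigned long last_serv_time_ns;   /* 0 => no sample taken yet */
};

/* Push the limit from 0 to 1 only right after the *first*
 * total-service-time sample is computed, and in no other case. */
static void update_inject_limit(struct mock_bfqq *bfqq,
                                unsigned long new_sample_ns)
{
        bool first_sample = (bfqq->last_serv_time_ns == 0);

        bfqq->last_serv_time_ns = new_sample_ns;

        if (first_sample && bfqq->inject_limit == 0)
                bfqq->inject_limit = 1;   /* allow injection to start */
}
```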

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The update period of the injection limit has been tentatively set to
    100 ms, to reduce fluctuations. This value however proved to cause,
    occasionally, the limit to be decremented for some bfq_queue only
    after the queue underwent excessive injection for a lot of time. This
    commit reduces the period to 10 ms.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Upon an increment attempt of the injection limit, the latter is
    constrained not to become higher than twice the maximum number
    max_rq_in_driver of I/O requests that have happened to be in service
    in the drive. This high bound allows the injection limit to grow
    beyond max_rq_in_driver, which may then cause max_rq_in_driver itself
    to grow.

    However, since the limit is incremented by only one unit at a time,
    there is no need for such a high bound, and just max_rq_in_driver+1 is
    enough.
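The tightened bound amounts to a small clamp on the increment; a sketch with illustrative names:

```c
#include <assert.h>

/* The limit moves one unit at a time, so capping it at
 * max_rq_in_driver + 1 (instead of twice that value) still lets it
 * grow gradually past max_rq_in_driver, which is all that is needed. */
static unsigned int try_increase_limit(unsigned int inject_limit,
                                       unsigned int max_rq_in_driver)
{
        if (inject_limit < max_rq_in_driver + 1)
                inject_limit++;
        return inject_limit;
}
```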

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ updates the injection limit of each bfq_queue as a function of how
    much the limit inflates the service times experienced by the I/O
    requests of the queue. So only service times affected by injection
    must be taken into account. Unfortunately, in the current
    implementation of this update scheme, the service time of an I/O
    request rq not affected by injection may happen to be considered in
    the following case: there is no I/O request in service when rq
    arrives.

    This commit fixes this issue by making sure that only service times
    affected by injection are considered for updating the injection
    limit. In particular, the service time of an I/O request rq is now
    considered only if at least one of the following two conditions holds:
    - the destination bfq_queue for rq underwent injection before rq
    arrival, and there is still I/O in service in the drive on rq arrival
    (the service of such unfinished I/O may delay the service of rq);
    - injection occurs between the arrival and the completion time of rq.
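The two conditions reduce to a boolean expression; parameter names below are invented for readability and do not match the kernel's variables:

```c
#include <assert.h>
#include <stdbool.h>

/* A service-time sample for rq counts toward the injection-limit
 * update only if injection could actually have affected it. */
static bool sample_affected_by_injection(bool bfqq_injected_before_rq_arrival,
                                         bool io_in_flight_on_rq_arrival,
                                         bool injection_during_rq_service)
{
        return (bfqq_injected_before_rq_arrival &&
                io_in_flight_on_rq_arrival) ||
               injection_during_rq_service;
}
```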

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

08 Aug, 2019

3 commits

  • As reported in [1], the call bfq_init_rq(rq) may return NULL in case
    of OOM (in particular, if rq->elv.icq is NULL because memory
    allocation failed in ioc_create_icq()).

    This commit handles this circumstance.

    [1] https://lkml.org/lkml/2019/7/22/824

    Cc: Hsin-Yi Wang
    Cc: Nicolas Boichat
    Cc: Doug Anderson
    Reported-by: Guenter Roeck
    Reported-by: Hsin-Yi Wang
    Reviewed-by: Guenter Roeck
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
    unconditionally inject their I/O"), every bfq_queue has a pointer to a
    waker bfq_queue and a list of the bfq_queues it may wake. In this
    respect, when a bfq_queue, say Q, remains with no I/O source attached
    to it, Q cannot be woken by any other bfq_queue, and cannot wake any
    other bfq_queue. Then Q must be removed from the woken list of its
    possible waker bfq_queue, and all bfq_queues in the woken list of Q
    must stop having a waker bfq_queue.

    Q remains with no I/O source in two cases: when the last process
    associated with Q exits or when such a process gets associated with a
    different bfq_queue. Unfortunately, commit 13a857a4c4e8 ("block, bfq:
    detect wakers and unconditionally inject their I/O") performed the
    above updates only in the first case.

    This commit fixes this bug by moving these updates to when Q gets
    freed. This is a simple and safe way to handle all cases, as both the
    above events, process exit and re-association, lead to Q being freed
    soon, and because dangling references would come out only after Q gets
    freed (if no update were performed).

    Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
    Reported-by: Douglas Anderson
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
    unconditionally inject their I/O"), BFQ stores, in a per-device
    pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O
    request completed. If some bfq_queue receives new I/O right after the
    last request of last_completed_rq_bfqq has been completed, then
    last_completed_rq_bfqq may be a waker bfq_queue.

    But if the bfq_queue last_completed_rq_bfqq points to is freed, then
    last_completed_rq_bfqq becomes a dangling reference. This commit
    resets last_completed_rq_bfqq if the pointed bfq_queue is freed.

    Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
    Reported-by: Douglas Anderson
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

27 Jul, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - Several io_uring fixes/improvements:
        - Blocking fix for O_DIRECT (me)
        - Latter page slowness for registered buffers (me)
        - Fix poll hang under certain conditions (me)
        - Defer sequence check fix for wrapped rings (Zhengyuan)
        - Mismatch in async inc/dec accounting (Zhengyuan)
        - Memory ordering issue that could cause stall (Zhengyuan)
        - Track sequential defer in bytes, not pages (Zhengyuan)

    - NVMe pull request from Christoph

    - Set of hang fixes for wbt (Josef)

    - Redundant error message kill for libahci (Ding)

    - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

    - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

    - blkcg ->pd_stat() non-debug print (Tejun)

    - bcache memory leak fix (Wei)

    - Comment fix (Akinobu)

    - BFQ perf regression fix (Paolo)

    * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
    io_uring: ensure ->list is initialized for poll commands
    Revert "nvme-pci: don't create a read hctx mapping without read queues"
    nvme: fix multipath crash when ANA is deactivated
    nvme: fix memory leak caused by incorrect subsystem free
    nvme: ignore subnqn for ADATA SX6000LNP
    drbd: dynamically allocate shash descriptor
    block: blk-mq: Remove blk_mq_sched_started_request and started_request
    bcache: fix possible memory leak in bch_cached_dev_run()
    io_uring: track io length in async_list based on bytes
    io_uring: don't use iov_iter_advance() for fixed buffers
    block: properly handle IOCB_NOWAIT for async O_DIRECT IO
    blk-mq: allow REQ_NOWAIT to return an error inline
    io_uring: add a memory barrier before atomic_read
    rq-qos: use a mb for got_token
    rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
    rq-qos: don't reset has_sleepers on spurious wakeups
    rq-qos: fix missed wake-ups in rq_qos_throttle
    wait: add wq_has_single_sleeper helper
    block, bfq: check also in-flight I/O in dispatch plugging
    block: fix sysfs module parameters directory path in comment
    ...

    Linus Torvalds
     

18 Jul, 2019

1 commit

  • Consider a sync bfq_queue Q that remains empty while in service, and
    suppose that, when this happens, there is a fair amount of already
    in-flight I/O not belonging to Q. In such a situation, I/O dispatching
    may need to be plugged (until new I/O arrives for Q), for the
    following reason.

    The drive may decide to serve in-flight non-Q's I/O requests before
    Q's ones, thereby delaying the arrival of new I/O requests for Q
    (recall that Q is sync). If I/O-dispatching is not plugged, then,
    while Q remains empty, a basically uncontrolled amount of I/O from
    other queues may be dispatched too, possibly causing the service of
    Q's I/O to be delayed even longer in the drive. This problem gets more
    and more serious as the speed and the queue depth of the drive grow,
    because, as these two quantities grow, the probability to find no
    queue busy but many requests in flight grows too.

    If Q has the same weight and priority as the other queues, then the
    above delay is unlikely to cause any issue, because all queues tend to
    undergo the same treatment. So, since not plugging I/O dispatching is
    convenient for throughput, it is better not to plug. Things change in
    case Q has a higher weight or priority than some other queue, because
    Q's service guarantees may simply be violated. For this reason,
    commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric
    scenarios") does plug I/O in such an asymmetric scenario. Plugging
    minimizes the delay induced by already in-flight I/O, and enables Q to
    recover the bandwidth it may lose because of this delay.

    Yet the above commit does not cover the case of weight-raised queues,
    for efficiency concerns. For weight-raised queues, I/O-dispatch
    plugging is activated simply if not all bfq_queues are
    weight-raised. But this check does not handle the case of in-flight
    requests, because a bfq_queue may become non busy *before* all its
    in-flight requests are completed.

    This commit performs I/O-dispatch plugging for weight-raised queues if
    there are some in-flight requests.
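The extended condition can be sketched as a predicate. The names are illustrative; the real code combines this with several other symmetry checks in bfq_asymmetric_scenario() and friends.

```c
#include <assert.h>
#include <stdbool.h>

/* For a weight-raised queue, I/O-dispatch plugging is now activated
 * not only when some bfq_queue is not weight-raised, but also when
 * requests are still in flight in the drive, because a bfq_queue can
 * stop being busy before its in-flight requests complete. */
static bool plug_for_wr_queue(bool all_queues_weight_raised,
                              unsigned int rq_in_driver)
{
        return !all_queues_weight_raised || rq_in_driver > 0;
}
```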

    As a practical example of the resulting recovery of control, under
    write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5
    seconds after this fix, against 15 seconds before the fix (as a
    reference, gnome-terminal takes about 35 seconds to start with any of
    the other I/O schedulers).

    Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Jul, 2019

1 commit

  • Rename the block documentation files to ReST, add an
    index for them and adjust in order to produce a nice html
    output via the Sphinx build system.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

10 Jul, 2019

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main block updates for 5.3. Nothing earth shattering or
    major in here, just fixes, additions, and improvements all over the
    map. This contains:

    - Series of documentation fixes (Bart)

    - Optimization of the blk-mq ctx get/put (Bart)

    - null_blk removal race condition fix (Bob)

    - req/bio_op() cleanups (Chaitanya)

    - Series cleaning up the segment accounting, and request/bio mapping
    (Christoph)

    - Series cleaning up the page getting/putting for bios (Christoph)

    - block cgroup cleanups and moving it to where it is used (Christoph)

    - block cgroup fixes (Tejun)

    - Series of fixes and improvements to bcache, most notably a write
    deadlock fix (Coly)

    - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

    - Series of improvements and fixes to BFQ (Douglas, Paolo)

    - debugfs_create() return value check removal for drbd (Greg)

    - Use struct_size(), where appropriate (Gustavo)

    - Two lightnvm fixes (Heiner, Geert)

    - MD fixes, including a read balance and corruption fix (Guoqing,
    Marcos, Xiao, Yufen)

    - block opal shadow mbr additions (Jonas, Revanth)

    - sbitmap compare-and-exchange improvements (Pavel)

    - Fix for potential bio->bi_size overflow (Ming)

    - NVMe pull requests:
    - improved PCIe suspend support (Keith Busch)
    - error injection support for the admin queue (Akinobu Mita)
    - Fibre Channel discovery improvements (James Smart)
    - tracing improvements including nvmet tracing support (Minwoo Im)
    - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
    Kulkarni)

    - Various little fixes and improvements to drivers and core"

    * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
    blk-iolatency: fix STS_AGAIN handling
    block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
    blk-mq: simplify blk_mq_make_request()
    blk-mq: remove blk_mq_put_ctx()
    sbitmap: Replace cmpxchg with xchg
    block: fix .bi_size overflow
    block: sed-opal: check size of shadow mbr
    block: sed-opal: ioctl for writing to shadow mbr
    block: sed-opal: add ioctl for done-mark of shadow mbr
    block: never take page references for ITER_BVEC
    direct-io: use bio_release_pages in dio_bio_complete
    block_dev: use bio_release_pages in bio_unmap_user
    block_dev: use bio_release_pages in blkdev_bio_end_io
    iomap: use bio_release_pages in iomap_dio_bio_end_io
    block: use bio_release_pages in bio_map_user_iov
    block: use bio_release_pages in bio_unmap_user
    block: optionally mark pages dirty in bio_release_pages
    block: move the BIO_NO_PAGE_REF check into bio_release_pages
    block: skd_main.c: Remove call to memset after dma_alloc_coherent
    block: mtip32xx: Remove call to memset after dma_alloc_coherent
    ...

    Linus Torvalds
     

28 Jun, 2019

1 commit

  • In reboot tests on several devices we were seeing a "use after free"
    when slub_debug or KASAN was enabled. The kernel complained about:

    Unable to handle kernel paging request at virtual address 6b6b6c2b

    ...which is a classic sign of use after free under slub_debug. The
    stack crawl in kgdb looked like:

    0 test_bit (addr=, nr=)
    1 bfq_bfqq_busy (bfqq=)
    2 bfq_select_queue (bfqd=)
    3 __bfq_dispatch_request (hctx=)
    4 bfq_dispatch_request (hctx=)
    5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
    6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
    7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
    8 0xc0568d94 in blk_mq_run_work_fn (work=)
    9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
    10 0xc024cff4 in worker_thread (__worker=0xec6d4640)

    Digging in kgdb showed that, though bfqq looked fine, bfqq->bic had
    been freed.

    Through further digging, I postulated that perhaps it is illegal to
    access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
    because the "bic" can be freed at some point in time after this call
    is made. I confirmed that there certainly were cases where the exact
    crashing code path would access the "bic" after bfq_exit_icq() had
    been called. Specifically, I set "bfqq->bic" to (void *)0x7 and
    saw that the bic was 0x7 at the time of the crash.

    To understand a bit more about why this crash was fairly uncommon (I
    saw it only once in a few hundred reboots), you can see that much of
    the time bfq_exit_icq_bfqq() fully frees the bfqq and thus it can't
    access the ->bic anymore. The only case it doesn't is if
    bfq_put_queue() sees a reference still held.

    However, even in the case when bfqq isn't freed, the crash is still
    rare. Why? I tracked what happened to the "bic" after the exit
    routine. It doesn't get freed right away. Rather,
    put_io_context_active() eventually called put_io_context() which
    queued up freeing on a workqueue. The freeing then actually happened
    later than that through call_rcu(). Despite all these delays, some
    extra debugging showed that all the hoops could be jumped through in
    time and the memory could be freed causing the original crash. Phew!

    To make a long story short, assuming it truly is illegal to access an
    icq after the "exit_icq" callback is finished, this patch is needed.

    Cc: stable@vger.kernel.org
    Reviewed-by: Paolo Valente
    Signed-off-by: Douglas Anderson
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

27 Jun, 2019

1 commit

  • Some debug code suggested by Paolo was tripping when I did reboot
    stress tests. Specifically in bfq_bfqq_resume_state()
    "bic->saved_wr_start_at_switch_to_srt" was later than the current
    value of "jiffies". A bit of debugging showed that
    "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more
    debugging showed that was because we had run through the "unlikely"
    case in the bfq_bfqq_save_state() function.

    Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to
    something sane.

    NOTE: this fixes no known real-world errors.

    Reviewed-by: Paolo Valente
    Reviewed-by: Guenter Roeck
    Signed-off-by: Douglas Anderson
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

26 Jun, 2019

1 commit

  • By mistake, there is a '&' instead of a '==' in the definition of the
    macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
    the correct one.
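The one-character bug is easy to reproduce in isolation. The threshold value below is a stand-in, not BFQ's actual seek-history encoding; the point is only the operator:

```c
#include <assert.h>
#include <stdbool.h>

/* With '&' the macro is a bitwise test that fires whenever the two
 * values share any set bit; with '==' it fires only on an exact
 * match, i.e. only when the whole seek history says "seeky". */
#define HISTORY_FULL 0xffu      /* illustrative threshold */

static bool totally_seeky_buggy(unsigned int seek_history)
{
        return seek_history & HISTORY_FULL;     /* wrong: '&' */
}

static bool totally_seeky_fixed(unsigned int seek_history)
{
        return seek_history == HISTORY_FULL;    /* right: '==' */
}
```

The buggy form wrongly tags a queue with even a single seeky sample, which is what broke the soft-real-time detection the macro feeds.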

    Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

25 Jun, 2019

7 commits

  • Consider, on one side, a bfq_queue Q that remains empty while in
    service, and, on the other side, the pending I/O of bfq_queues that,
    according to their timestamps, have to be served after Q. If an
    uncontrolled amount of I/O from the latter bfq_queues were dispatched
    while Q is waiting for its new I/O to arrive, then Q's bandwidth
    guarantees would be violated. To prevent this, I/O dispatch is plugged
    until Q receives new I/O (except for a properly controlled amount of
    injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
    for the following reason.

    Preemption is performed in two steps. First, Q is expired and
    re-scheduled. Second, the new bfq_queue to serve is chosen. The first
    step is needed by the second, as the second can be performed only
    after Q's timestamps have been properly updated (done in the
    expiration step), and Q has been re-queued for service. This
    dependency is a consequence of the way BFQ's scheduling algorithm
    is currently implemented.

    But Q is not re-scheduled at all in the first step, because Q is
    empty. As a consequence, an uncontrolled amount of I/O may be
    dispatched until Q becomes non empty again. This breaks Q's service
    guarantees.

    This commit addresses this issue by re-scheduling Q even if it is
    empty. This in turn breaks the assumption that all scheduled queues
    are non empty. Then a few extra checks are now needed.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ enqueues the I/O coming from each process into a separate
    bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
    served for at most timeout_sync milliseconds (default: 125 ms). This
    service scheme is prone to the following inaccuracy.

    While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
    receive I/O, and, according to BFQ's scheduling policy, may become the
    right bfq_queue to serve, in place of the currently in-service
    bfq_queue. In this respect, postponing the service of Q2 to after the
    service of Q1 finishes may delay the completion of Q2's I/O, compared
    with an ideal service in which all non-empty bfq_queues are served in
    parallel, and every non-empty bfq_queue is served at a rate
    proportional to the bfq_queue's weight. This additional delay is at
    most equal to the time Q1 may unjustly remain in service before
    switching to Q2.

    If Q1 and Q2 have the same weight, then this time is most likely
    negligible compared with the completion time to be guaranteed to Q2's
    I/O. In addition, first, one of the reasons why BFQ may want to serve
    Q1 for a while is that this boosts throughput and, second, serving Q1
    longer reduces BFQ's overhead. In conclusion, it is usually better
    not to preempt Q1 if both Q1 and Q2 have the same weight.

    In contrast, as Q2's weight or priority becomes higher and higher
    compared with that of Q1, the above delay becomes larger and larger,
    compared with the I/O completion times that have to be guaranteed to
    Q2 according to Q2's weight. So reducing this delay may be more
    important than avoiding the costs of preempting Q1.

    Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
    higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
    triggers a new choice of the next bfq_queue to serve. If Q2 really is
    the next bfq_queue to serve, then Q2 will be set in service
    immediately.

    This change reduces the component of the I/O latency caused by the
    above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
    SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
    a process doing sporadic random reads while another process is doing
    continuous sequential reads.

    Signed-off-by: Nicola Bottura
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • A bfq_queue Q may happen to be synchronized with another
    bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
    receive new I/O. We call Q2 the "waker queue".

    If I/O plugging is being performed for Q, and Q is not receiving any
    more I/O because of the above synchronization, then, thanks to BFQ's
    injection mechanism, the waker queue is likely to get served before
    the I/O-plugging timeout fires.

    Unfortunately, this fact may not be sufficient to guarantee a high
    throughput during the I/O plugging, because the inject limit for Q may
    be too low to guarantee a lot of injected I/O. In addition, the
    duration of the plugging, i.e., the time before Q finally receives new
    I/O, may not be minimized, because the waker queue may happen to be
    served only after other queues.

    To address these issues, this commit introduces the explicit detection
    of the waker queue, and the unconditional injection of a pending I/O
    request of the waker queue on each invocation of
    bfq_dispatch_request().

    One may be concerned that this systematic injection of I/O from the
    waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
    the contrary, Q's next I/O is brought forward dramatically, because it
    is no longer blocked for milliseconds.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Until the base value for request service times gets finally computed
    for a bfq_queue, the inject limit for that queue does depend on the
    think-time state (short|long) of the queue. A timely update of the
    think time then guarantees a quicker activation or deactivation of the
    injection. Fortunately, the think time of a bfq_queue is updated in
    the same code path as the inject limit, but after it.

    This commit moves the update of the think time before the update of
    the inject limit. For coherence, it moves the update of the seek time
    too.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • I/O injection gets reduced if it increases the request service times
    of the victim queue beyond a certain threshold. The threshold, in
    turn, is computed as a function of the base service time enjoyed by
    the queue when it undergoes no injection.

    As a consequence, for injection to work properly, the above base value
    has to be accurate. In this respect, such a value may vary over
    time. For example, it varies if the size or the spatial locality of
    the I/O requests in the queue change. It is then important to update
    this value whenever possible. This commit performs this update.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • One of the cases where the parameters for injection may be updated is
    when there are no more in-flight I/O requests. The number of in-flight
    requests is stored in the field bfqd->rq_in_driver of the descriptor
    bfqd of the device. So, the controlled condition is
    bfqd->rq_in_driver == 0.

    Unfortunately, this is wrong because the instruction that checks this
    condition is in the code path that handles the completion of a
    request, and it is executed before bfqd->rq_in_driver is decremented
    in that path.

    This commit fixes this issue by just replacing 0 with 1 in the
    comparison.
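
    The ordering described above can be condensed into a tiny,
    self-contained model (all names below are illustrative, not the
    actual BFQ code): because the completion handler samples the counter
    before decrementing it, the request being completed is still
    counted, so "no other request in flight" shows up as 1, not 0.

```c
/* Minimal model of the completion path described above. */
struct fake_bfqd { int rq_in_driver; };

static int last_request_completing(const struct fake_bfqd *bfqd)
{
    /* The request being completed is still counted here, so the
     * correct "device idle after this completion" test is == 1. */
    return bfqd->rq_in_driver == 1;
}

static void complete_request(struct fake_bfqd *bfqd, int *update_params)
{
    *update_params = last_request_completing(bfqd); /* check first... */
    bfqd->rq_in_driver--;                           /* ...then decrement */
}
```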

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Until the base value of the request service times gets finally
    computed for a bfq_queue, the inject limit does depend on the
    think-time state (short|long). The limit must be 0 or 1 if the think
    time is deemed, respectively, as short or long. However, such a check
    and possible limit update is performed only periodically, once per
    second. So, to make the injection mechanism much more reactive, this
    commit performs the update also every time the think-time state
    changes.

    In addition, in the following special case, this commit lets the
    inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's
    think time is short: bfqq's I/O is synchronized with that of some
    other queue, i.e., bfqq may receive new I/O only after the I/O of the
    other queue is completed. Keeping the inject limit to 1 allows the
    blocking I/O to be served while bfqq is in service. And this is very
    convenient both for bfqq and for the total throughput, as explained
    in detail in the comments in bfq_update_has_short_ttime().
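
    The rule above can be summarized in a short sketch (function and
    parameter names are assumptions for illustration, not the kernel's
    identifiers):

```c
/* Until the base service time is known, the inject limit follows the
 * think-time state: 0 for a short think time (the queue's own I/O
 * arrives soon anyway), 1 for a long one. A queue whose I/O waits on
 * another queue keeps limit 1 even with a short think time, so the
 * blocking I/O can be injected while the queue is in service. */
static int initial_inject_limit(int has_short_ttime, int waits_on_other_queue)
{
    if (waits_on_other_queue)
        return 1;
    return has_short_ttime ? 0 : 1;
}
```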

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

21 Jun, 2019

2 commits

  • This option is entirely bfq specific, so give it an appropriate name.

    Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all
    the functionality already does so anyway.

    Acked-by: Tejun Heo
    Acked-by: Paolo Valente
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead, so that it can be passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments, except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove
    pointless struct request_queue arguments from any of the functions
    that had one and grew a nr_segs argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

22 Apr, 2019

1 commit

  • Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
    comment, and is trivial. The other one is a conflict due to a later fix
    in the bio multi-page work, and needs a bit more care.

    * tag 'v5.1-rc6': (770 commits)
    Linux 5.1-rc6
    block: make sure that bvec length can't be overflow
    block: kill all_q_node in request_queue
    x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
    coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
    mm/kmemleak.c: fix unused-function warning
    init: initialize jump labels before command line option parsing
    kernel/watchdog_hld.c: hard lockup message should end with a newline
    kcov: improve CONFIG_ARCH_HAS_KCOV help text
    mm: fix inactive list balancing between NUMA nodes and cgroups
    mm/hotplug: treat CMA pages as unmovable
    proc: fixup proc-pid-vm test
    proc: fix map_files test on F29
    mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
    mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
    mm: swapoff: shmem_unuse() stop eviction without igrab()
    mm: swapoff: take notice of completion sooner
    mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
    mm: swapoff: shmem_find_swap_entries() filter out other types
    slab: store tagged freelist for off-slab slabmgmt
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Apr, 2019

1 commit

  • A previous commit moved the shallow depth and BFQ depth map calculations
    to be done at init time, moving it outside of the hotter IO path. This
    potentially causes hangs if the user changes the depth of the
    scheduler map by writing to the 'nr_requests' sysfs file for that
    device.

    Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
    the depth changes, so that the scheduler can update its internal state.

    Tested-by: Kai Krakow
    Reported-by: Paolo Valente
    Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Apr, 2019

1 commit

  • The function bfq_bfqq_expire() invokes the function
    __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
    If this happens, then no other instruction of bfq_bfqq_expire() must
    be executed, or a use-after-free will occur.

    Based on the assumption that __bfq_bfqq_expire() invokes
    bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
    assumed to be freed if its refcounter is equal to one right before
    invoking __bfq_bfqq_expire().

    But this assumption has been false since commit 9dee8b3b057e ("block,
    bfq: fix queue removal from weights tree"): __bfq_bfqq_expire() may
    also invoke bfq_weights_tree_remove(), and, since that commit, the
    latter function may itself invoke bfq_put_queue(). So
    __bfq_bfqq_expire() may invoke bfq_put_queue() twice, and this is the
    actual case in which the in-service queue may happen to be freed.

    To address this issue, this commit moves the check on the refcounter
    of the queue right around the last bfq_put_queue() that may be invoked
    on the queue.
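
    The idea of the fix can be modeled with a minimal refcount sketch
    (names and structure are illustrative assumptions, not the real
    kernel code): sample the refcount immediately before the final put,
    instead of assuming the expire path performs exactly one put.

```c
/* Minimal model of the pattern described above. put_queue() mimics
 * bfq_put_queue(): it drops one reference; the queue would be freed
 * when the count reaches zero (freeing is elided here). */
struct fake_bfqq { int ref; };

static int put_queue(struct fake_bfqq *bfqq)   /* 1 if queue was freed */
{
    return --bfqq->ref == 0;
}

/* The expire path may perform extra puts (e.g. via the weights-tree
 * removal), so the only safe point to ask "will the final put free
 * the queue?" is right before that final put. */
static int expire_and_check_freed(struct fake_bfqq *bfqq, int extra_puts)
{
    while (extra_puts-- > 0)
        put_queue(bfqq);
    int freed_by_last_put = (bfqq->ref == 1);
    put_queue(bfqq);
    return freed_by_last_put;
}
```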

    Fixes: 9dee8b3b057e ("block, bfq: fix queue removal from weights tree")
    Reported-by: Dmitrii Tcvetkov
    Reported-by: Douglas Anderson
    Tested-by: Dmitrii Tcvetkov
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

01 Apr, 2019

5 commits

  • bfq saves the state of a queue each time a merge occurs, to be
    able to resume such a state when the queue is associated again
    with its original process, on a split.

    Unfortunately, bfq does not also save and restore the weight of the
    queue. If the weight is not correctly restored when the queue is
    recycled, then the weight of the recycled queue could differ from
    the weight of the original queue.

    This commit adds the missing save and restore of the weight.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Francesco Pollicino
     
  • The function "bfq_log_bfqq" prints the pid of the process
    associated with the queue passed as input.

    Unfortunately, if the queue is shared, then more than one process
    is associated with the queue. The pid that gets printed in this
    case is the pid of one of the associated processes.
    Which process gets printed depends on the exact sequence of merge
    events the queue underwent. So printing such a pid is rather
    useless, and above all often confusing, because it reports a random
    pid from among those of the associated processes.

    This commit addresses this issue by printing SHARED instead of a pid
    if the queue is shared.
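
    The logging change can be sketched as follows (field and function
    names here are illustrative, not the exact bfq_queue members):

```c
#include <stdio.h>

/* Report "SHARED" instead of a single, possibly misleading pid when
 * the queue is shared by several processes. */
struct fake_bfqq { int pid; int shared; };

static void format_queue_id(const struct fake_bfqq *bfqq,
                            char *buf, size_t len)
{
    if (bfqq->shared)
        snprintf(buf, len, "SHARED");
    else
        snprintf(buf, len, "%d", bfqq->pid);
}
```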

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Francesco Pollicino
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Francesco Pollicino
     
  • If many bfq_queues belonging to the same group happen to be created
    shortly after each other, then the processes associated with these
    queues have typically a common goal. In particular, bursts of queue
    creations are usually caused by services or applications that spawn
    many parallel threads/processes. Examples are systemd during boot, or
    git grep. If there are no other active queues, then, to help these
    processes get their job done as soon as possible, the best thing to do
    is to reach a high throughput. To this goal, it is usually better to
    not grant either weight-raising or device idling to the queues
    associated with these processes. And this is exactly what BFQ
    currently does.

    There is however a drawback: if, in contrast, some other queues are
    already active, then the newly created queues must be protected from
    the I/O flowing through the already existing queues. In this case, the
    best thing to do is the opposite as in the other case: it is much
    better to grant weight-raising and device idling to the newly-created
    queues, if they deserve it. This commit addresses this issue by doing
    so if there are already other active queues.

    This change also helps eliminating false positives, which occur when
    the newly-created queues do not belong to an actual large burst of
    creations, but some background task (e.g., a service) happens to
    trigger the creation of new queues in the middle, i.e., very close to
    when the victim queues are created. These false positives may cause
    total loss of control over process latencies.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Sync random I/O is likely to be confused with soft real-time I/O,
    because it is characterized by limited throughput and apparently
    isochronous arrival pattern. To avoid false positives, this commit
    prevents bfq_queues containing only random (seeky) I/O from being
    tagged as soft real-time.

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • To boost throughput with a set of processes doing interleaved I/O
    (i.e., a set of processes whose individual I/O is random, but whose
    merged cumulative I/O is sequential), BFQ merges the queues associated
    with these processes, i.e., redirects the I/O of these processes into a
    common, shared queue. In the shared queue, I/O requests are ordered by
    their position on the medium, thus sequential I/O gets dispatched to
    the device when the shared queue is served.

    Queue merging costs execution time, because, to detect which queues to
    merge, BFQ must maintain a list of the head I/O requests of active
    queues, ordered by request positions. Measurements showed that this
    costs about 10% of BFQ's total per-request processing time.

    Request processing time becomes more and more critical as the speed of
    the underlying storage device grows. Yet, fortunately, queue merging
    is basically useless on the very devices that are fast enough to make
    request processing time critical. To reach a high throughput, these
    devices must have many requests queued at the same time. But, in this
    configuration, the internal scheduling algorithms of these devices
    also do the job of queue merging: they reorder requests so as to
    obtain an I/O pattern that is as sequential as possible. As a
    consequence, with
    processes doing interleaved I/O, the throughput reached by one such
    device is likely to be the same, with and without queue merging.

    In view of this fact, this commit disables queue merging, and all
    related housekeeping, for non-rotational devices with internal
    queueing. The total, single-lock-protected, per-request processing
    time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
    (time measured with simple code instrumentation, and using the
    throughput-sync.sh script of the S suite [1], in performance-profiling
    mode). To put this result into context, the total,
    single-lock-protected, per-request execution time of the lightest I/O
    scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
    ~800 LOC, against ~10500 LOC for BFQ).

    Disabling merging provides a further, remarkable benefit in terms of
    throughput. Merging tends to make many workloads artificially more
    uneven, mainly because shared queues remain non-empty for
    incomparably longer than normal queues. So, if, e.g., one of the
    queues in a set of merged queues has a higher weight than a normal
    queue, then the shared queue may inherit such a high weight and, by
    staying almost always active, may force BFQ to perform I/O plugging
    most of the time. This evidently makes it harder for BFQ to let the
    device reach a high throughput.

    As a practical example of this problem, and of the benefits of this
    commit, we measured again the throughput in the nasty scenario
    considered in previous commit messages: dbench test (in the Phoronix
    suite), with 6 clients, on a filesystem with journaling, and with the
    journaling daemon enjoying a higher weight than normal processes. With
    this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
    PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
    of the other I/O schedulers. As such, this is also likely to be the
    maximum possible throughput reachable with this workload on this
    device, because I/O is mostly random, and the other schedulers
    basically just pass I/O requests to the drive as fast as possible.

    [1] https://github.com/Algodev-github/S
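
    The decision rule introduced by this commit can be paraphrased as a
    one-line predicate (a sketch under the stated assumptions; the real
    code inspects device properties, e.g. whether the drive is
    rotational and whether it has internal queueing, rather than these
    illustrative parameters):

```c
/* Queue merging only pays off where the device cannot reorder
 * interleaved requests by itself: rotational devices, or devices
 * without internal queueing. */
static int queue_merging_worthwhile(int is_rotational, int has_internal_queueing)
{
    return is_rotational || !has_internal_queueing;
}
```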

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Tested-by: Francesco Pollicino
    Signed-off-by: Alessio Masola
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente