20 Jan, 2021

1 commit

  • [ Upstream commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a ]

    BFQ computes the number of tags it allows to be allocated for each request
    type based on the tag bitmap. However, it uses 1 << bitmap.shift as the
    number of available tags, which is wrong. 'shift' is just an internal
    bitmap value containing the logarithm of how many bits the bitmap uses in
    each bitmap word. Thus the number of tags allowed for some request types
    can be far too low. Use the proper bitmap.depth, which holds the number of
    tags, instead.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jan Kara
     

14 Oct, 2020

1 commit

  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

12 Sep, 2020

1 commit

  • Pull block fixes from Jens Axboe:

    - Fix a regression in bdev partition locking (Christoph)

    - NVMe pull request from Christoph:
    - cancel async events before freeing them (David Milburn)
    - revert a broken race fix (James Smart)
    - fix command processing during resets (Sagi Grimberg)

    - Fix a kyber crash with requeued flushes (Omar)

    - Fix __bio_try_merge_page() same_page error for no merging (Ritesh)

    * tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block:
    block: Set same_page to false in __bio_try_merge_page if ret is false
    nvme-fabrics: allow to queue requests for live queues
    block: only call sched requeue_request() for scheduled requests
    nvme-tcp: cancel async events before freeing event struct
    nvme-rdma: cancel async events before freeing event struct
    nvme-fc: cancel async events before freeing event struct
    nvme: Revert: Fix controller creation races with teardown flow
    block: restore a specific error code in bdev_del_partition

    Linus Torvalds
     

09 Sep, 2020

1 commit

  • Yang Yang reported the following crash caused by requeueing a flush
    request in Kyber:

    [ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
    ...
    [ 2.517468] pc : clear_bit+0x18/0x2c
    [ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
    [ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
    ...
    [ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
    [ 2.517602] Call trace:
    [ 2.517606] clear_bit+0x18/0x2c
    [ 2.517619] kyber_finish_request+0x74/0x80
    [ 2.517627] blk_mq_requeue_request+0x3c/0xc0
    [ 2.517637] __scsi_queue_insert+0x11c/0x148
    [ 2.517640] scsi_softirq_done+0x114/0x130
    [ 2.517643] blk_done_softirq+0x7c/0xb0
    [ 2.517651] __do_softirq+0x208/0x3bc
    [ 2.517657] run_ksoftirqd+0x34/0x60
    [ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
    [ 2.517667] kthread+0x110/0x120
    [ 2.517669] ret_from_fork+0x10/0x18

    This happens because Kyber doesn't track flush requests, so
    kyber_finish_request() reads a garbage domain token. Only call the
    scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
    the finish_request() hook in blk_mq_free_request()). Now that we're
    handling it in blk-mq, also remove the check from BFQ.

    Reported-by: Yang Yang
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

04 Sep, 2020

2 commits

  • High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
    contention is possible for mq-deadline and bfq IO schedulers
    when nr_hw_queues is more than one.

    This is because the kblockd workqueue can submit IO from all online CPUs
    (through blk_mq_run_hw_queues()) even though only one hctx has pending
    commands.

    The .has_work elevator callback for the mq-deadline and bfq schedulers
    reports pending work if there are any IOs on the request queue, but it
    does not account for the hctx context.

    Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering
    the elevator when no requests are queued on that hctx.

    [jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]

    Signed-off-by: Kashyap Desai
    Signed-off-by: Hannes Reinecke
    Signed-off-by: John Garry
    Tested-by: Douglas Gilbert
    Signed-off-by: Jens Axboe

    Kashyap Desai
     
  • Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
    with the goal of later being able to use a common shared tag bitmap across
    all HW contexts in a set.

    Signed-off-by: John Garry
    Tested-by: Don Brace #SCSI resv cmds patches used
    Tested-by: Douglas Gilbert
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    John Garry
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through
    markings where they are unnecessary.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

01 Aug, 2020

1 commit


30 May, 2020

1 commit


10 May, 2020

1 commit

  • Use the common interface bdi_dev_name() to get device name.

    Signed-off-by: Yufen Yu
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Jan Kara
    Reviewed-by: Bart Van Assche

    Add a missing include for BFQ

    Signed-off-by: Jens Axboe

    Yufen Yu
     

22 Mar, 2020

2 commits

  • A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
    goal of this put is to release a process reference to a bfq_queue. But
    process-reference releases may also trigger some extra operations and,
    to this end, are handled through bfq_release_process_ref(). So turn
    the invocation of bfq_put_queue() into an invocation of
    bfq_release_process_ref().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In the bfq_idle_slice_timer function, bfqq = bfqd->in_service_queue is
    not read within the bfqd->lock critical section. The bfqq, which is not
    NULL in bfq_idle_slice_timer, may be freed after being passed
    to bfq_idle_slice_timer_body, so we may access freed memory.

    In addition, since the bfqq may be subject to such a race, we should
    first check whether bfqq is still in service before doing anything
    with it in the bfq_idle_slice_timer_body function. If the bfqq in the
    race is no longer in service, it has been expired through the
    __bfq_bfqq_expire function, and its wait_request flag has been cleared
    in __bfq_bfqd_reset_in_service. So we do not need to re-clear the
    wait_request flag of a bfqq that is not in service.

    KASAN log is given as follows:
    [13058.354613] ==================================================================
    [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
    [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
    [13058.354646]
    [13058.354655] CPU: 96 PID: 19767 Comm: fork13
    [13058.354661] Call trace:
    [13058.354667] dump_backtrace+0x0/0x310
    [13058.354672] show_stack+0x28/0x38
    [13058.354681] dump_stack+0xd8/0x108
    [13058.354687] print_address_description+0x68/0x2d0
    [13058.354690] kasan_report+0x124/0x2e0
    [13058.354697] __asan_load8+0x88/0xb0
    [13058.354702] bfq_idle_slice_timer+0xac/0x290
    [13058.354707] __hrtimer_run_queues+0x298/0x8b8
    [13058.354710] hrtimer_interrupt+0x1b8/0x678
    [13058.354716] arch_timer_handler_phys+0x4c/0x78
    [13058.354722] handle_percpu_devid_irq+0xf0/0x558
    [13058.354731] generic_handle_irq+0x50/0x70
    [13058.354735] __handle_domain_irq+0x94/0x110
    [13058.354739] gic_handle_irq+0x8c/0x1b0
    [13058.354742] el1_irq+0xb8/0x140
    [13058.354748] do_wp_page+0x260/0xe28
    [13058.354752] __handle_mm_fault+0x8ec/0x9b0
    [13058.354756] handle_mm_fault+0x280/0x460
    [13058.354762] do_page_fault+0x3ec/0x890
    [13058.354765] do_mem_abort+0xc0/0x1b0
    [13058.354768] el0_da+0x24/0x28
    [13058.354770]
    [13058.354773] Allocated by task 19731:
    [13058.354780] kasan_kmalloc+0xe0/0x190
    [13058.354784] kasan_slab_alloc+0x14/0x20
    [13058.354788] kmem_cache_alloc_node+0x130/0x440
    [13058.354793] bfq_get_queue+0x138/0x858
    [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328
    [13058.354801] bfq_init_rq+0x1f4/0x1180
    [13058.354806] bfq_insert_requests+0x264/0x1c98
    [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488
    [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0
    [13058.354826] blk_flush_plug_list+0x230/0x548
    [13058.354830] blk_finish_plug+0x60/0x80
    [13058.354838] read_pages+0xec/0x2c0
    [13058.354842] __do_page_cache_readahead+0x374/0x438
    [13058.354846] ondemand_readahead+0x24c/0x6b0
    [13058.354851] page_cache_sync_readahead+0x17c/0x2f8
    [13058.354858] generic_file_buffered_read+0x588/0xc58
    [13058.354862] generic_file_read_iter+0x1b4/0x278
    [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4]
    [13058.354972] __vfs_read+0x238/0x320
    [13058.354976] vfs_read+0xbc/0x1c0
    [13058.354980] ksys_read+0xdc/0x1b8
    [13058.354984] __arm64_sys_read+0x50/0x60
    [13058.354990] el0_svc_common+0xb4/0x1d8
    [13058.354994] el0_svc_handler+0x50/0xa8
    [13058.354998] el0_svc+0x8/0xc
    [13058.354999]
    [13058.355001] Freed by task 19731:
    [13058.355007] __kasan_slab_free+0x120/0x228
    [13058.355010] kasan_slab_free+0x10/0x18
    [13058.355014] kmem_cache_free+0x288/0x3f0
    [13058.355018] bfq_put_queue+0x134/0x208
    [13058.355022] bfq_exit_icq_bfqq+0x164/0x348
    [13058.355026] bfq_exit_icq+0x28/0x40
    [13058.355030] ioc_exit_icq+0xa0/0x150
    [13058.355035] put_io_context_active+0x250/0x438
    [13058.355038] exit_io_context+0xd0/0x138
    [13058.355045] do_exit+0x734/0xc58
    [13058.355050] do_group_exit+0x78/0x220
    [13058.355054] __wake_up_parent+0x0/0x50
    [13058.355058] el0_svc_common+0xb4/0x1d8
    [13058.355062] el0_svc_handler+0x50/0xa8
    [13058.355066] el0_svc+0x8/0xc
    [13058.355067]
    [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70
     which belongs to the cache bfq_queue of size 464
    [13058.355075] The buggy address is located 264 bytes inside of
     464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
    [13058.355077] The buggy address belongs to the page:
    [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
    [13058.366175] flags: 0x2ffffe0000008100(slab|head)
    [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
    [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
    [13058.370789] page dumped because: kasan: bad access detected
    [13058.370791]
    [13058.370792] Memory state around the buggy address:
    [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
    [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370808] ^
    [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    [13058.370817] ==================================================================
    [13058.370820] Disabling lock debugging due to kernel taint

    Here, we pass bfqd directly to the bfq_idle_slice_timer_body function.
    --
    V2->V3: rewrite the comment as suggested by Paolo Valente
    V1->V2: add one comment, and add Fixes and Reported-by tag.

    Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Acked-by: Paolo Valente
    Reported-by: Wang Wang
    Signed-off-by: Zhiqiang Liu
    Signed-off-by: Feilong Lin
    Signed-off-by: Jens Axboe

    Zhiqiang Liu
     

03 Feb, 2020

5 commits

  • The exact, general goal of the function bfq_split_bfqq() is not that
    apparent. Add a comment to make it clear.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • ifdefs around gets and puts of bfq groups reduce readability, remove them.

    Tested-by: Oleksandr Natalenko
    Reported-by: Jens Axboe
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The flag on_st in the bfq_entity data structure is true if the entity
    is on a service tree or is in service. Yet the name of the field,
    confusingly, does not mention the second, very important case. Extend
    the name to mention the second case too.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ maintains an ordered list, implemented with an RB tree, of
    head-request positions of non-empty bfq_queues. This position tree,
    inherited from CFQ, is used to find bfq_queues that contain I/O close
    to each other. BFQ merges these bfq_queues into a single shared queue,
    if this boosts throughput on the device at hand.

    There is however a special-purpose bfq_queue that does not participate
    in queue merging: the oom bfq_queue. Yet even this bfq_queue could be
    wrongly added to the position tree. So bfqq_find_close() could return
    the oom bfq_queue, which is a source of further troubles in an
    out-of-memory situation. This commit prevents the oom bfq_queue from
    being inserted into the position tree.

    Tested-by: Patrick Dung
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not
    referred by any process") fixed commit 3726112ec731 ("block, bfq:
    re-schedule empty queues if they deserve I/O plugging") by
    descheduling an empty bfq_queue when it remains with no process
    reference. Yet this still left a case uncovered: an empty bfq_queue
    with no process reference that remains in service. This happens for
    an in-service sync bfq_queue that is deemed to deserve I/O-dispatch
    plugging when it remains empty. Yet no new requests will arrive for
    such a bfq_queue if no process sends requests to it any longer. Even
    worse, the bfq_queue may happen to be prematurely freed while still in
    service (because there may remain no reference to it any longer).

    This commit solves this problem by preventing I/O dispatch from being
    plugged for the in-service bfq_queue, if the latter has no process
    reference (the bfq_queue is then prevented from remaining in service).

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Tested-by: Oleksandr Natalenko
    Reported-by: Patrick Dung
    Tested-by: Patrick Dung
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

23 Jan, 2020

1 commit

  • This macro has never been used since it was introduced in commit
    aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an
    extra scheduler"), so remove it.

    Signed-off-by: Alex Shi
    Cc: Paolo Valente
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Jens Axboe

    Alex Shi
     

26 Nov, 2019

1 commit

  • Pull core block updates from Jens Axboe:
    "Due to more granular branches, this one is small and will be followed
    by other core branches that add specific features. I meant to just
    have a core and a drivers branch, but due to external dependencies we
    ended up adding a few more that are also core.

    The changes are:

    - Fixes and improvements for the zoned device support (Ajay, Damien)

    - sed-opal table writing and datastore UID (Revanth)

    - blk-cgroup (and bfq) blk-cgroup stat fixes (Tejun)

    - Improvements to the block stats tracking (Pavel)

    - Fix for overrunning sysfs buffer for a large number of CPUs (Ming)

    - Optimization for small IO (Ming, Christoph)

    - Fix typo in RWH lifetime hint (Eugene)

    - Dead code removal and documentation (Bart)

    - Reduction in memory usage for queue and tag set (Bart)

    - Kerneldoc header documentation (André)

    - Device/partition revalidation fixes (Jan)

    - Stats tracking for flush requests (Konstantin)

    - Various other little fixes here and there (et al)"

    * tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits)
    Revert "block: split bio if the only bvec's length is > SZ_4K"
    block: add iostat counters for flush requests
    block,bfq: Skip tracing hooks if possible
    block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64'
    blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1
    block: Don't disable interrupts in trigger_softirq()
    sbitmap: Delete sbitmap_any_bit_clear()
    blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()
    block: split bio if the only bvec's length is > SZ_4K
    block: still try to split bio if the bvec crosses pages
    blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT
    blk-cgroup: reimplement basic IO stats using cgroup rstat
    blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive()
    blk-throtl: stop using blkg->stat_bytes and ->stat_ios
    bfq-iosched: stop using blkg->stat_bytes and ->stat_ios
    bfq-iosched: relocate bfqg_*rwstat*() helpers
    block: add zone open, close and finish ioctl support
    block: add zone open, close and finish operations
    block: Simplify REQ_OP_ZONE_RESET_ALL handling
    block: Remove REQ_OP_ZONE_RESET plugging
    ...

    Linus Torvalds
     

14 Nov, 2019

1 commit

  • Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if
    they deserve I/O plugging"), to prevent the service guarantees of a
    bfq_queue from being violated, the bfq_queue may be left busy, i.e.,
    scheduled for service, even if empty (see comments in
    __bfq_bfqq_expire() for details). But, if no process will send
    requests to the bfq_queue any longer, then there is no point in
    keeping the bfq_queue scheduled for service.

    In addition, keeping the bfq_queue scheduled for service, but with no
    process reference any longer, may cause the bfq_queue to be freed when
    descheduled from service. But this was assumed never to happen, and it
    causes a UAF when it does. This, in turn, caused crashes [1, 2].

    This commit fixes the issue by descheduling an empty bfq_queue when
    it remains with no process reference.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539
    [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Reported-by: Chris Evich
    Reported-by: Patrick Dung
    Reported-by: Thorsten Schubert
    Tested-by: Thorsten Schubert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

08 Nov, 2019

1 commit

  • When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios
    from blk-cgroup core to populate six stat knobs. blk-cgroup core is
    moving away from blkg_rwstat to improve scalability and won't be able
    to support this usage.

    It isn't like the sharing gains all that much. Let's break it out to
    dedicated rwstat counters which are updated when on cgroup1. This
    makes use of bfqg_*rwstat*() helpers outside of
    CONFIG_BFQ_CGROUP_DEBUG. Move them out.

    v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG.

    Signed-off-by: Tejun Heo
    Cc: Paolo Valente
    Signed-off-by: Jens Axboe

    Tejun Heo
     

18 Sep, 2019

4 commits

  • If equal to 0, the injection limit for a bfq_queue is pushed to 1
    after a first sample of the total service time of the I/O requests of
    the queue is computed (to allow injection to start). Yet, because of a
    mistake in the branch that performs this action, the push may also
    happen in other cases. This commit fixes this issue.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The update period of the injection limit had been tentatively set to
    100 ms, to reduce fluctuations. This value, however, proved to
    occasionally cause the limit to be decremented for some bfq_queue only
    after the queue had undergone excessive injection for a long time.
    This commit reduces the period to 10 ms.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Upon an increment attempt of the injection limit, the latter is
    constrained not to become higher than twice the maximum number
    max_rq_in_driver of I/O requests that have happened to be in service
    in the drive. This high bound allows the injection limit to grow
    beyond max_rq_in_driver, which may then cause max_rq_in_driver itself
    to grow.

    However, since the limit is incremented by only one unit at a time,
    there is no need for such a high bound, and just max_rq_in_driver+1 is
    enough.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ updates the injection limit of each bfq_queue as a function of how
    much the limit inflates the service times experienced by the I/O
    requests of the queue. So only service times affected by injection
    must be taken into account. Unfortunately, in the current
    implementation of this update scheme, the service time of an I/O
    request rq not affected by injection may happen to be considered in
    the following case: there is no I/O request in service when rq
    arrives.

    This commit fixes this issue by making sure that only service times
    affected by injection are considered for updating the injection
    limit. In particular, the service time of an I/O request rq is now
    considered only if at least one of the following two conditions holds:
    - the destination bfq_queue for rq underwent injection before rq
    arrival, and there is still I/O in service in the drive on rq arrival
    (the service of such unfinished I/O may delay the service of rq);
    - injection occurs between the arrival and the completion time of rq.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

08 Aug, 2019

3 commits

  • As reported in [1], the call bfq_init_rq(rq) may return NULL in case
    of OOM (in particular, if rq->elv.icq is NULL because memory
    allocation failed in ioc_create_icq()).

    This commit handles this circumstance.

    [1] https://lkml.org/lkml/2019/7/22/824

    Cc: Hsin-Yi Wang
    Cc: Nicolas Boichat
    Cc: Doug Anderson
    Reported-by: Guenter Roeck
    Reported-by: Hsin-Yi Wang
    Reviewed-by: Guenter Roeck
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
    unconditionally inject their I/O"), every bfq_queue has a pointer to a
    waker bfq_queue and a list of the bfq_queues it may wake. In this
    respect, when a bfq_queue, say Q, remains with no I/O source attached
    to it, Q cannot be woken by any other bfq_queue, and cannot wake any
    other bfq_queue. Then Q must be removed from the woken list of its
    possible waker bfq_queue, and all bfq_queues in the woken list of Q
    must stop having a waker bfq_queue.

    Q remains with no I/O source in two cases: when the last process
    associated with Q exits or when such a process gets associated with a
    different bfq_queue. Unfortunately, commit 13a857a4c4e8 ("block, bfq:
    detect wakers and unconditionally inject their I/O") performed the
    above updates only in the first case.

    This commit fixes this bug by moving these updates to when Q gets
    freed. This is a simple and safe way to handle all cases, as both the
    above events, process exit and re-association, lead to Q being freed
    soon, and because dangling references would come out only after Q gets
    freed (if no update were performed).

    Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
    Reported-by: Douglas Anderson
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
    unconditionally inject their I/O"), BFQ stores, in a per-device
    pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O
    request completed. If some bfq_queue receives new I/O right after the
    last request of last_completed_rq_bfqq has been completed, then
    last_completed_rq_bfqq may be a waker bfq_queue.

    But if the bfq_queue last_completed_rq_bfqq points to is freed, then
    last_completed_rq_bfqq becomes a dangling reference. This commit
    resets last_completed_rq_bfqq if the pointed bfq_queue is freed.

    Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
    Reported-by: Douglas Anderson
    Tested-by: Douglas Anderson
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

27 Jul, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - Several io_uring fixes/improvements:
    - Blocking fix for O_DIRECT (me)
    - Latter page slowness for registered buffers (me)
    - Fix poll hang under certain conditions (me)
    - Defer sequence check fix for wrapped rings (Zhengyuan)
    - Mismatch in async inc/dec accounting (Zhengyuan)
    - Memory ordering issue that could cause stall (Zhengyuan)
    - Track sequential defer in bytes, not pages (Zhengyuan)

    - NVMe pull request from Christoph

    - Set of hang fixes for wbt (Josef)

    - Redundant error message kill for libahci (Ding)

    - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

    - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

    - blkcg ->pd_stat() non-debug print (Tejun)

    - bcache memory leak fix (Wei)

    - Comment fix (Akinobu)

    - BFQ perf regression fix (Paolo)

    * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
    io_uring: ensure ->list is initialized for poll commands
    Revert "nvme-pci: don't create a read hctx mapping without read queues"
    nvme: fix multipath crash when ANA is deactivated
    nvme: fix memory leak caused by incorrect subsystem free
    nvme: ignore subnqn for ADATA SX6000LNP
    drbd: dynamically allocate shash descriptor
    block: blk-mq: Remove blk_mq_sched_started_request and started_request
    bcache: fix possible memory leak in bch_cached_dev_run()
    io_uring: track io length in async_list based on bytes
    io_uring: don't use iov_iter_advance() for fixed buffers
    block: properly handle IOCB_NOWAIT for async O_DIRECT IO
    blk-mq: allow REQ_NOWAIT to return an error inline
    io_uring: add a memory barrier before atomic_read
    rq-qos: use a mb for got_token
    rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
    rq-qos: don't reset has_sleepers on spurious wakeups
    rq-qos: fix missed wake-ups in rq_qos_throttle
    wait: add wq_has_single_sleeper helper
    block, bfq: check also in-flight I/O in dispatch plugging
    block: fix sysfs module parameters directory path in comment
    ...

    Linus Torvalds
     

18 Jul, 2019

1 commit

  • Consider a sync bfq_queue Q that remains empty while in service, and
    suppose that, when this happens, there is a fair amount of already
    in-flight I/O not belonging to Q. In such a situation, I/O dispatching
    may need to be plugged (until new I/O arrives for Q), for the
    following reason.

    The drive may decide to serve in-flight non-Q's I/O requests before
    Q's ones, thereby delaying the arrival of new I/O requests for Q
    (recall that Q is sync). If I/O-dispatching is not plugged, then,
    while Q remains empty, a basically uncontrolled amount of I/O from
    other queues may be dispatched too, possibly causing the service of
    Q's I/O to be delayed even longer in the drive. This problem gets more
    and more serious as the speed and the queue depth of the drive grow,
    because, as these two quantities grow, the probability to find no
    queue busy but many requests in flight grows too.

    If Q has the same weight and priority as the other queues, then the
    above delay is unlikely to cause any issue, because all queues tend to
    undergo the same treatment. So, since not plugging I/O dispatching is
    convenient for throughput, it is better not to plug. Things change in
    case Q has a higher weight or priority than some other queue, because
    Q's service guarantees may simply be violated. For this reason,
    commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric
    scenarios") does plug I/O in such an asymmetric scenario. Plugging
    minimizes the delay induced by already in-flight I/O, and enables Q to
    recover the bandwidth it may lose because of this delay.

    Yet the above commit does not cover the case of weight-raised queues,
    for efficiency concerns. For weight-raised queues, I/O-dispatch
    plugging is activated simply if not all bfq_queues are
    weight-raised. But this check does not handle the case of in-flight
    requests, because a bfq_queue may become non-busy *before* all its
    in-flight requests are completed.

    This commit performs I/O-dispatch plugging for weight-raised queues if
    there are some in-flight requests.

    As a practical example of the resulting recovery of control, under
    write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5
    seconds after this fix, against 15 seconds before the fix (as a
    reference, gnome-terminal takes about 35 seconds to start with any of
    the other I/O schedulers).

    Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Jul, 2019

1 commit

  • Rename the block documentation files to ReST, add an
    index for them and adjust in order to produce a nice HTML
    output via the Sphinx build system.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

10 Jul, 2019

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main block updates for 5.3. Nothing earth shattering or
    major in here, just fixes, additions, and improvements all over the
    map. This contains:

    - Series of documentation fixes (Bart)

    - Optimization of the blk-mq ctx get/put (Bart)

    - null_blk removal race condition fix (Bob)

    - req/bio_op() cleanups (Chaitanya)

    - Series cleaning up the segment accounting, and request/bio mapping
    (Christoph)

    - Series cleaning up the page getting/putting for bios (Christoph)

    - block cgroup cleanups and moving it to where it is used (Christoph)

    - block cgroup fixes (Tejun)

    - Series of fixes and improvements to bcache, most notably a write
    deadlock fix (Coly)

    - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

    - Series of improvements and fixes to BFQ (Douglas, Paolo)

    - debugfs_create() return value check removal for drbd (Greg)

    - Use struct_size(), where appropriate (Gustavo)

    - Two lightnvm fixes (Heiner, Geert)

    - MD fixes, including a read balance and corruption fix (Guoqing,
    Marcos, Xiao, Yufen)

    - block opal shadow mbr additions (Jonas, Revanth)

    - sbitmap compare-and-exchange improvements (Pavel)

    - Fix for potential bio->bi_size overflow (Ming)

    - NVMe pull requests:
    - improved PCIe suspend support (Keith Busch)
    - error injection support for the admin queue (Akinobu Mita)
    - Fibre Channel discovery improvements (James Smart)
    - tracing improvements including nvmet tracing support (Minwoo Im)
    - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
    Kulkarni)

    - Various little fixes and improvements to drivers and core"

    * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
    blk-iolatency: fix STS_AGAIN handling
    block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
    blk-mq: simplify blk_mq_make_request()
    blk-mq: remove blk_mq_put_ctx()
    sbitmap: Replace cmpxchg with xchg
    block: fix .bi_size overflow
    block: sed-opal: check size of shadow mbr
    block: sed-opal: ioctl for writing to shadow mbr
    block: sed-opal: add ioctl for done-mark of shadow mbr
    block: never take page references for ITER_BVEC
    direct-io: use bio_release_pages in dio_bio_complete
    block_dev: use bio_release_pages in bio_unmap_user
    block_dev: use bio_release_pages in blkdev_bio_end_io
    iomap: use bio_release_pages in iomap_dio_bio_end_io
    block: use bio_release_pages in bio_map_user_iov
    block: use bio_release_pages in bio_unmap_user
    block: optionally mark pages dirty in bio_release_pages
    block: move the BIO_NO_PAGE_REF check into bio_release_pages
    block: skd_main.c: Remove call to memset after dma_alloc_coherent
    block: mtip32xx: Remove call to memset after dma_alloc_coherent
    ...

    Linus Torvalds
     

28 Jun, 2019

1 commit

  • In reboot tests on several devices we were seeing a "use after free"
    when slub_debug or KASAN was enabled. The kernel complained about:

    Unable to handle kernel paging request at virtual address 6b6b6c2b

    ...which is a classic sign of use after free under slub_debug. The
    stack crawl in kgdb looked like:

    #0 test_bit (addr=, nr=)
    #1 bfq_bfqq_busy (bfqq=)
    #2 bfq_select_queue (bfqd=)
    #3 __bfq_dispatch_request (hctx=)
    #4 bfq_dispatch_request (hctx=)
    #5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
    #6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
    #7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
    #8 0xc0568d94 in blk_mq_run_work_fn (work=)
    #9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
    #10 0xc024cff4 in worker_thread (__worker=0xec6d4640)

    Digging in kgdb, it could be found that, though bfqq looked fine,
    bfqq->bic had been freed.

    Through further digging, I postulated that perhaps it is illegal to
    access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
    because the "bic" can be freed at some point in time after this call
    is made. I confirmed that there certainly were cases where the exact
    crashing code path would access the "bic" after bfq_exit_icq() had
    been called. Specifically, I set "bfqq->bic" to (void *)0x7 and
    saw that the bic was 0x7 at the time of the crash.

    To understand a bit more about why this crash was fairly uncommon (I
    saw it only once in a few hundred reboots), you can see that much of
    the time bfq_exit_icq_bfqq() fully frees the bfqq and thus it can't
    access the ->bic anymore. The only case it doesn't is if
    bfq_put_queue() sees a reference still held.

    However, even in the case when bfqq isn't freed, the crash is still
    rare. Why? I tracked what happened to the "bic" after the exit
    routine. It doesn't get freed right away. Rather,
    put_io_context_active() eventually called put_io_context() which
    queued up freeing on a workqueue. The freeing then actually happened
    later than that through call_rcu(). Despite all these delays, some
    extra debugging showed that all the hoops could be jumped through in
    time and the memory could be freed causing the original crash. Phew!

    To make a long story short, assuming it truly is illegal to access an
    icq after the "exit_icq" callback is finished, this patch is needed.

    Cc: stable@vger.kernel.org
    Reviewed-by: Paolo Valente
    Signed-off-by: Douglas Anderson
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

27 Jun, 2019

1 commit

  • Some debug code suggested by Paolo was tripping when I did reboot
    stress tests. Specifically in bfq_bfqq_resume_state()
    "bic->saved_wr_start_at_switch_to_srt" was later than the current
    value of "jiffies". A bit of debugging showed that
    "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more
    debugging showed that was because we had run through the "unlikely"
    case in the bfq_bfqq_save_state() function.

    Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to
    something sane.

    NOTE: this fixes no known real-world errors.

    Reviewed-by: Paolo Valente
    Reviewed-by: Guenter Roeck
    Signed-off-by: Douglas Anderson
    Signed-off-by: Jens Axboe

    Douglas Anderson
     

26 Jun, 2019

1 commit

  • By mistake, there is a '&' instead of a '==' in the definition of the
    macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
    the correct one.

    Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt")
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

25 Jun, 2019

5 commits

  • Consider, on one side, a bfq_queue Q that remains empty while in
    service, and, on the other side, the pending I/O of bfq_queues that,
    according to their timestamps, have to be served after Q. If an
    uncontrolled amount of I/O from the latter bfq_queues were dispatched
    while Q is waiting for its new I/O to arrive, then Q's bandwidth
    guarantees would be violated. To prevent this, I/O dispatch is plugged
    until Q receives new I/O (except for a properly controlled amount of
    injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
    for the following reason.

    Preemption is performed in two steps. First, Q is expired and
    re-scheduled. Second, the new bfq_queue to serve is chosen. The first
    step is needed by the second, as the second can be performed only
    after Q's timestamps have been properly updated (done in the
    expiration step), and Q has been re-queued for service. This
    dependency is a consequence of the way BFQ's scheduling algorithm
    is currently implemented.

    But Q is not re-scheduled at all in the first step, because Q is
    empty. As a consequence, an uncontrolled amount of I/O may be
    dispatched until Q becomes non-empty again. This breaks Q's service
    guarantees.

    This commit addresses this issue by re-scheduling Q even if it is
    empty. This in turn breaks the assumption that all scheduled queues
    are non-empty, so a few extra checks are now needed.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • BFQ enqueues the I/O coming from each process into a separate
    bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
    served for at most timeout_sync milliseconds (default: 125 ms). This
    service scheme is prone to the following inaccuracy.

    While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
    receive I/O, and, according to BFQ's scheduling policy, may become the
    right bfq_queue to serve, in place of the currently in-service
    bfq_queue. In this respect, postponing the service of Q2 to after the
    service of Q1 finishes may delay the completion of Q2's I/O, compared
    with an ideal service in which all non-empty bfq_queues are served in
    parallel, and every non-empty bfq_queue is served at a rate
    proportional to the bfq_queue's weight. This additional delay is at
    most equal to the time Q1 may unjustly remain in service before switching
    to Q2.

    If Q1 and Q2 have the same weight, then this time is most likely
    negligible compared with the completion time to be guaranteed to Q2's
    I/O. In addition, first, one of the reasons why BFQ may want to serve
    Q1 for a while is that this boosts throughput and, second, serving Q1
    longer reduces BFQ's overhead. As a conclusion, it is usually better
    not to preempt Q1 if both Q1 and Q2 have the same weight.

    In contrast, as Q2's weight or priority becomes higher and higher
    compared with that of Q1, the above delay becomes larger and larger,
    compared with the I/O completion times that have to be guaranteed to
    Q2 according to Q2's weight. So reducing this delay may be more
    important than avoiding the costs of preempting Q1.

    Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
    higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
    triggers a new choice of the next bfq_queue to serve. If Q2 really is
    the next bfq_queue to serve, then Q2 will be set in service
    immediately.

    This change reduces the component of the I/O latency caused by the
    above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
    SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
    a process doing sporadic random reads while another process is doing
    continuous sequential reads.

    Signed-off-by: Nicola Bottura
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • A bfq_queue Q may happen to be synchronized with another
    bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
    receive new I/O. We call Q2 "waker queue".

    If I/O plugging is being performed for Q, and Q is not receiving any
    more I/O because of the above synchronization, then, thanks to BFQ's
    injection mechanism, the waker queue is likely to get served before
    the I/O-plugging timeout fires.

    Unfortunately, this fact may not be sufficient to guarantee a high
    throughput during the I/O plugging, because the inject limit for Q may
    be too low to guarantee a lot of injected I/O. In addition, the
    duration of the plugging, i.e., the time before Q finally receives new
    I/O, may not be minimized, because the waker queue may happen to be
    served only after other queues.

    To address these issues, this commit introduces the explicit detection
    of the waker queue, and the unconditional injection of a pending I/O
    request of the waker queue on each invocation of
    bfq_dispatch_request().

    One may be concerned that this systematic injection of I/O from the
    waker queue delays the service of Q's I/O. Fortunately, it doesn't.
    On the contrary, Q's next I/O is brought forward dramatically,
    because it is no longer blocked for milliseconds.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Until the base value for request service times gets finally computed
    for a bfq_queue, the inject limit for that queue does depend on the
    think-time state (short|long) of the queue. A timely update of the
    think time then guarantees a quicker activation or deactivation of the
    injection. Fortunately, the think time of a bfq_queue is updated in
    the same code path as the inject limit; but after the inject limit.

    This commit moves the update of the think time before the update of
    the inject limit. For coherence, it moves the update of the seek time
    too.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • I/O injection gets reduced if it increases the request service times
    of the victim queue beyond a certain threshold. The threshold, in its
    turn, is computed as a function of the base service time enjoyed by
    the queue when it undergoes no injection.

    As a consequence, for injection to work properly, the above base value
    has to be accurate. In this respect, such a value may vary over
    time. For example, it varies if the size or the spatial locality of
    the I/O requests in the queue change. It is then important to update
    this value whenever possible. This commit performs this update.

    Reported-by: Srivatsa S. Bhat (VMware)
    Tested-by: Srivatsa S. Bhat (VMware)
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente