25 Mar, 2020

3 commits

  • Column "time_in_queue" in diskstats is supposed to show the total waiting
    time of all requests, i.e. its value should equal the sum of the times in
    the other columns. But this does not hold, because "time_in_queue" is
    counted separately in jiffies rather than in nanoseconds like the other
    times.

    This patch removes the redundant counter for "time_in_queue" and instead
    shows the total time of read, write, discard and flush requests.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     
  • Reading /proc/diskstats iterates over all CPUs once for each field being
    summed. It is faster to sum all fields in a single pass.
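    The shape of the change can be sketched with a toy model (not the
    kernel's actual per-cpu machinery):

```c
#include <stdint.h>

#define NR_FIELDS 4
#define NR_CPUS   8

/* Toy stand-in for the per-cpu disk statistics. */
struct disk_stats {
    uint64_t f[NR_FIELDS];
};

/* Before: one full walk over the per-cpu copies for every field.
 * After: a single walk that accumulates all fields at once. */
static void stat_read_all(const struct disk_stats per_cpu[NR_CPUS],
                          struct disk_stats *sum)
{
    for (int i = 0; i < NR_FIELDS; i++)
        sum->f[i] = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        for (int i = 0; i < NR_FIELDS; i++)
            sum->f[i] += per_cpu[cpu].f[i];
}
```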

    Hammering /proc/diskstats with fio shows 2x performance improvement:

    fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \
    --size=1k --bs=1k --fallocate=none --create_on_open=1 \
    --time_based=1 --runtime=10 --invalidate=0 --group_reporting

                JOBS=1     JOBS=10
    Before:    7k iops    64k iops
    After:    18k iops   120k iops

    The code is also more compact this way:

    add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346)
    Function                 old     new   delta
    part_stat_read_all         -     194    +194
    diskstats_show          1344     631    -713
    part_stat_show          1219     392    -827
    Total: Before=14966947, After=14965601, chg -0.01%

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     
  • Currently io_ticks is approximated by adding one at each start and end of
    a request if the jiffies counter has changed. This works perfectly for
    requests shorter than a jiffy, or when at least one request starts or
    ends in every jiffy.

    But if the disk executes just one request at a time and each request is
    longer than two jiffies, only the first and last jiffies of each request
    are accounted.

    The fix is simple: at the end of a request, add to io_ticks the jiffies
    that passed since the last update, rather than just one jiffy.
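    A toy model of the two accounting schemes (not kernel code; one request
    at a time, back to back, durations in whole jiffies) shows the
    undercount:

```c
/* Old scheme: add one tick at each start/end where the jiffy counter
 * changed since the last update. */
static int old_ticks(int nreq, int len)
{
    int now = 0, stamp = 0, ticks = 0;

    for (int r = 0; r < nreq; r++) {
        if (now != stamp) { ticks++; stamp = now; }  /* request start */
        now += len;                                  /* request runs  */
        if (now != stamp) { ticks++; stamp = now; }  /* request end   */
    }
    return ticks;
}

/* New scheme: at completion, add all jiffies passed since the last
 * update. */
static int new_ticks(int nreq, int len)
{
    int now = 0, stamp = 0, ticks = 0;

    for (int r = 0; r < nreq; r++) {
        now += len;
        ticks += now - stamp;
        stamp = now;
    }
    return ticks;
}
```

    With 12-jiffy requests the old scheme counts roughly one tick per
    request (about 1/12 of the busy time, matching the ~8% "%util" below),
    while the new scheme counts all of it.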

    Example: a common HDD executes random 4k read requests in around 12ms
    each.

    fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
    iostat -x 10 sdb

    Note the change in iostat's "%util", 8,43% -> 99,99%, before/after the
    patch:

    Before:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sdb 0,00 0,00 82,60 0,00 330,40 0,00 8,00 0,96 12,09 12,09 0,00 1,02 8,43

    After:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sdb 0,00 0,00 82,50 0,00 330,00 0,00 8,00 1,00 12,10 12,10 0,00 12,12 99,99

    Now io_ticks does not lose time between the start and end of requests,
    but for queue depth > 1 some I/O time between adjacent starts might
    still be lost.

    For load estimation "%util" is not as useful as the average queue
    length, but it clearly shows how often the disk queue is completely
    empty.

    Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

24 Mar, 2020

21 commits


22 Mar, 2020

5 commits

  • In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
    flush the rb tree that contains all idle entities belonging to the pd
    (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
    invoked before bfq_reparent_active_queues(). Yet the latter may happen
    to add some entities to the idle tree. It happens if, in some of the
    calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(),
    the queue to move is empty and gets expired.

    This commit simply reverses the invocation order between
    bfq_flush_idle_tree() and bfq_reparent_active_queues().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
    entity represents just a bfq_queue in an entity tree). Yet, the input
    entity is guaranteed to always be a leaf entity only in two-level
    entity trees. In this respect, because of the error fixed by
    commit 14afc5936197 ("block, bfq: fix overwrite of bfq_group pointer
    in bfq_find_set_group()"), all (wrongly collapsed) entity trees used to
    have only two levels. After that commit, this no longer holds.

    This commit fixes this problem by modifying
    bfq_reparent_leaf_entity(), so that it searches an active leaf entity
    down the path that stems from the input entity. Such a leaf entity is
    guaranteed to exist when bfq_reparent_leaf_entity() is invoked.

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
    goal of this put is to release a process reference to a bfq_queue. But
    process-reference releases may also trigger extra operations, and, to
    this end, are handled through bfq_release_process_ref(). So turn the
    invocation of bfq_put_queue() into an invocation of
    bfq_release_process_ref().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Commit ecedd3d7e199 ("block, bfq: get extra ref to prevent a queue
    from being freed during a group move") gets an extra reference to a
    bfq_queue before possibly deactivating it (temporarily), in
    bfq_bfqq_move(). This prevents the bfq_queue from disappearing before
    being reactivated in its new group.

    Yet, the bfq_queue may also be expired (i.e., its service may be
    stopped) before the bfq_queue is deactivated, and such an expiration
    may also lead to a premature freeing. This commit fixes the issue by
    simply moving earlier the getting of the extra reference already
    introduced by commit ecedd3d7e199 ("block, bfq: get extra ref to
    prevent a queue from being freed during a group move").

    Reported-by: cki-project@redhat.com
    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In bfq_idle_slice_timer(), bfqq = bfqd->in_service_queue is read
    outside the bfqd->lock critical section. The bfqq, even if non-NULL
    in bfq_idle_slice_timer(), may be freed after being passed to
    bfq_idle_slice_timer_body(), so we may access freed memory.

    In addition, since the bfqq may be subject to a race, we should
    first check whether bfqq is still in service before doing anything
    with it in bfq_idle_slice_timer_body(). If a racing bfqq is no
    longer in service, it has already been expired through
    __bfq_bfqq_expire(), and its wait_request flag has been cleared in
    __bfq_bfqd_reset_in_service(). So there is no need to re-clear the
    wait_request flag of a bfqq that is not in service.
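    The resulting pattern can be sketched with a toy model (simplified
    types; the real code takes bfqd->lock around the whole body, elided
    here as comments):

```c
#include <stdbool.h>
#include <stddef.h>

struct bfqq { bool wait_request; };
struct bfqd { struct bfqq *in_service_queue; };

/* The timer body receives bfqd, re-reads in_service_queue (in the real
 * code: under bfqd->lock), and clears wait_request only if the queue is
 * still in service. */
static bool idle_slice_timer_body(struct bfqd *d)
{
    bool handled = false;

    /* spin_lock_irqsave(&d->lock, flags) in the real code */
    struct bfqq *q = d->in_service_queue;   /* read inside the lock */
    if (q && q->wait_request) {             /* still in service?    */
        q->wait_request = false;
        handled = true;
    }
    /* spin_unlock_irqrestore(&d->lock, flags) */
    return handled;
}
```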

    KASAN log is given as follows:
    [13058.354613] ==================================================================
    [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
    [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
    [13058.354646]
    [13058.354655] CPU: 96 PID: 19767 Comm: fork13
    [13058.354661] Call trace:
    [13058.354667] dump_backtrace+0x0/0x310
    [13058.354672] show_stack+0x28/0x38
    [13058.354681] dump_stack+0xd8/0x108
    [13058.354687] print_address_description+0x68/0x2d0
    [13058.354690] kasan_report+0x124/0x2e0
    [13058.354697] __asan_load8+0x88/0xb0
    [13058.354702] bfq_idle_slice_timer+0xac/0x290
    [13058.354707] __hrtimer_run_queues+0x298/0x8b8
    [13058.354710] hrtimer_interrupt+0x1b8/0x678
    [13058.354716] arch_timer_handler_phys+0x4c/0x78
    [13058.354722] handle_percpu_devid_irq+0xf0/0x558
    [13058.354731] generic_handle_irq+0x50/0x70
    [13058.354735] __handle_domain_irq+0x94/0x110
    [13058.354739] gic_handle_irq+0x8c/0x1b0
    [13058.354742] el1_irq+0xb8/0x140
    [13058.354748] do_wp_page+0x260/0xe28
    [13058.354752] __handle_mm_fault+0x8ec/0x9b0
    [13058.354756] handle_mm_fault+0x280/0x460
    [13058.354762] do_page_fault+0x3ec/0x890
    [13058.354765] do_mem_abort+0xc0/0x1b0
    [13058.354768] el0_da+0x24/0x28
    [13058.354770]
    [13058.354773] Allocated by task 19731:
    [13058.354780] kasan_kmalloc+0xe0/0x190
    [13058.354784] kasan_slab_alloc+0x14/0x20
    [13058.354788] kmem_cache_alloc_node+0x130/0x440
    [13058.354793] bfq_get_queue+0x138/0x858
    [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328
    [13058.354801] bfq_init_rq+0x1f4/0x1180
    [13058.354806] bfq_insert_requests+0x264/0x1c98
    [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488
    [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0
    [13058.354826] blk_flush_plug_list+0x230/0x548
    [13058.354830] blk_finish_plug+0x60/0x80
    [13058.354838] read_pages+0xec/0x2c0
    [13058.354842] __do_page_cache_readahead+0x374/0x438
    [13058.354846] ondemand_readahead+0x24c/0x6b0
    [13058.354851] page_cache_sync_readahead+0x17c/0x2f8
    [13058.354858] generic_file_buffered_read+0x588/0xc58
    [13058.354862] generic_file_read_iter+0x1b4/0x278
    [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4]
    [13058.354972] __vfs_read+0x238/0x320
    [13058.354976] vfs_read+0xbc/0x1c0
    [13058.354980] ksys_read+0xdc/0x1b8
    [13058.354984] __arm64_sys_read+0x50/0x60
    [13058.354990] el0_svc_common+0xb4/0x1d8
    [13058.354994] el0_svc_handler+0x50/0xa8
    [13058.354998] el0_svc+0x8/0xc
    [13058.354999]
    [13058.355001] Freed by task 19731:
    [13058.355007] __kasan_slab_free+0x120/0x228
    [13058.355010] kasan_slab_free+0x10/0x18
    [13058.355014] kmem_cache_free+0x288/0x3f0
    [13058.355018] bfq_put_queue+0x134/0x208
    [13058.355022] bfq_exit_icq_bfqq+0x164/0x348
    [13058.355026] bfq_exit_icq+0x28/0x40
    [13058.355030] ioc_exit_icq+0xa0/0x150
    [13058.355035] put_io_context_active+0x250/0x438
    [13058.355038] exit_io_context+0xd0/0x138
    [13058.355045] do_exit+0x734/0xc58
    [13058.355050] do_group_exit+0x78/0x220
    [13058.355054] __wake_up_parent+0x0/0x50
    [13058.355058] el0_svc_common+0xb4/0x1d8
    [13058.355062] el0_svc_handler+0x50/0xa8
    [13058.355066] el0_svc+0x8/0xc
    [13058.355067]
    [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70#012 which belongs to the cache bfq_queue of size 464
    [13058.355075] The buggy address is located 264 bytes inside of#012 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
    [13058.355077] The buggy address belongs to the page:
    [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
    [13058.366175] flags: 0x2ffffe0000008100(slab|head)
    [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
    [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
    [13058.370789] page dumped because: kasan: bad access detected
    [13058.370791]
    [13058.370792] Memory state around the buggy address:
    [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
    [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370808] ^
    [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    [13058.370817] ==================================================================
    [13058.370820] Disabling lock debugging due to kernel taint

    Here, we pass bfqd directly to bfq_idle_slice_timer_body().
    --
    V2->V3: rewrite the comment as suggested by Paolo Valente
    V1->V2: add one comment, and add Fixes and Reported-by tag.

    Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Acked-by: Paolo Valente
    Reported-by: Wang Wang
    Signed-off-by: Zhiqiang Liu
    Signed-off-by: Feilong Lin
    Signed-off-by: Jens Axboe

    Zhiqiang Liu
     

19 Mar, 2020

5 commits


18 Mar, 2020

2 commits

  • submit_bio_wait() can be called from ioctl(BLKSECDISCARD), which may
    take a long time to complete. As Salman mentioned, a 4K BLKSECDISCARD
    can take up to 100 seconds on some devices. Also, any block I/O
    operation issued after the BLKSECDISCARD is submitted may be affected
    by the hung-task timeout as well.

    Another report is that a task hang can be observed when running mkfs
    over raid10, which has a small max discard sectors limit because of
    its chunk size.

    So prevent the hung-task check from firing by taking the same approach
    used in blk_execute_rq(): the wake-up interval is set to half the
    hung-task timer period, which keeps the overhead low enough.
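    The idea can be sketched with a toy model (hypothetical helper; the
    real code repeatedly sleeps in a timed completion wait instead of one
    unbounded wait, so the hung-task detector never sees a task blocked
    for longer than half its timeout):

```c
/* Toy model: count how many timed wake-ups a long I/O causes when the
 * wait interval is half the hung-task timeout (durations in seconds). */
#define HUNG_TASK_TIMEOUT 120   /* typical default */

static int waits_needed(int io_duration)
{
    int interval = HUNG_TASK_TIMEOUT / 2;   /* wake-up period */
    int waited = 0, wakeups = 0;

    while (waited < io_duration) {  /* timed wait expires, I/O not done */
        waited += interval;
        wakeups++;                  /* wake up briefly, sleep again */
    }
    return wakeups;
}
```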

    Cc: Salman Qazi
    Cc: Jesse Barnes
    Cc: Bart Van Assche
    Link: https://lkml.org/lkml/2020/2/12/1193
    Reported-by: Salman Qazi
    Reviewed-by: Jesse Barnes
    Reviewed-by: Salman Qazi
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Historically we only set the capacity to zero for devices that support
    partitions (independent of whether partitions are actually created).
    Doing that is rather inconsistent, but changing it broke legacy udisks
    polling for legacy ide-cdrom devices. Use a crude check for devices
    that are either non-removable or partitionable to get sane behavior
    for most devices while not breaking userspace for this particular
    setup.

    Fixes: a1548b674403 ("block: move rescan_partitions to fs/block_dev.c")
    Reported-by: He Zhe
    Signed-off-by: Christoph Hellwig
    Tested-by: He Zhe
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Mar, 2020

4 commits

  • Check for overflow in addition before checking for end-of-block-device.

    Steps to reproduce:

    #define _GNU_SOURCE 1
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <unistd.h>

    typedef unsigned long long __u64;

    struct blk_zone_range {
            __u64 sector;
            __u64 nr_sectors;
    };

    #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)

    int main(void)
    {
            int fd = open("/dev/nullb0", O_RDWR|O_DIRECT);
            struct blk_zone_range zr = {4096, 0xfffffffffffff000ULL};
            ioctl(fd, BLKRESETZONE, &zr);
            return 0;
    }

    BUG: KASAN: null-ptr-deref in submit_bio_wait+0x74/0xe0
    Write of size 8 at addr 0000000000000040 by task a.out/1590

    CPU: 8 PID: 1590 Comm: a.out Not tainted 5.6.0-rc1-00019-g359c92c02bfa #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
    Call Trace:
    dump_stack+0x76/0xa0
    __kasan_report.cold+0x5/0x3e
    kasan_report+0xe/0x20
    submit_bio_wait+0x74/0xe0
    blkdev_zone_mgmt+0x26f/0x2a0
    blkdev_zone_mgmt_ioctl+0x14b/0x1b0
    blkdev_ioctl+0xb28/0xe60
    block_ioctl+0x69/0x80
    ksys_ioctl+0x3af/0xa50
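    The fix's ordering can be sketched as follows (hypothetical helper
    name; with 64-bit sector counts, sector + nr_sectors can wrap around,
    so the wrap must be tested before the capacity comparison):

```c
#include <stdbool.h>
#include <stdint.h>

/* Validate a zone management range against the device capacity,
 * checking for unsigned overflow in the addition first. */
static bool zone_range_valid(uint64_t sector, uint64_t nr_sectors,
                             uint64_t capacity)
{
    if (nr_sectors == 0)
        return false;
    if (sector + nr_sectors < sector)   /* overflow check comes first */
        return false;
    return sector + nr_sectors <= capacity;
}
```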

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alexey Dobriyan (SK hynix)
    Signed-off-by: Jens Axboe

    Alexey Dobriyan
     
  • Acked-by: Tejun Heo
    Signed-off-by: Weiping Zhang
    Signed-off-by: Jens Axboe

    Weiping Zhang
     
  • This patch changes the check condition for the validity/authentication
    of the session.

    1. The Host Session Number (HSN) in the response should match the HSN
       for the session.
    2. The TPER Session Number (TSN) can never be less than 4096 for a
       regular session.

    Reference:
    Section 3.2.2.1 of https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage_Opal_SSC_Application_Note_1-00_1-00-Final.pdf
    Section 3.3.7.1.1 of https://trustedcomputinggroup.org/wp-content/uploads/TCG_Storage_Architecture_Core_Spec_v2.01_r1.00.pdf
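    The two conditions amount to a check like this (hypothetical function
    and constant names, not the driver's actual identifiers):

```c
#include <stdbool.h>
#include <stdint.h>

#define RSVD_TPER_SESSION_NUM 4096  /* TSNs below this are reserved */

/* A regular session's response is valid only if its HSN matches the
 * HSN we opened the session with, and its TSN is outside the range
 * reserved by the TCG core spec. */
static bool session_valid(uint32_t resp_hsn, uint32_t session_hsn,
                          uint32_t tsn)
{
    return resp_hsn == session_hsn && tsn >= RSVD_TPER_SESSION_NUM;
}
```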

    Co-developed-by: Andrzej Jakowski
    Signed-off-by: Andrzej Jakowski
    Signed-off-by: Revanth Rajashekar
    Signed-off-by: Jens Axboe

    Revanth Rajashekar
     
  • The kernel documentation includes a brief section about genhd
    capabilities, but it turns out that the only documented
    capability (GENHD_FL_MEDIA_CHANGE_NOTIFY) isn't used any more.

    This patch removes that flag, and documents the rest, based on my
    understanding of the current uses of these flags in the kernel. The
    documentation is kept in the header file, alongside the declarations,
    in the hope that it will be kept up-to-date in future; the kernel
    documentation is changed to include the documentation generated from
    the header file.

    Because the ultimate goal is to provide some end-user
    documentation (or end-administrator documentation), the comments are
    perhaps more user-oriented than might be expected. Since the values
    are shown to users in hexadecimal, the documentation lists them in
    hexadecimal, and the constant declarations are adjusted to match.

    Reviewed-by: Matthew Wilcox (Oracle)
    Signed-off-by: Stephen Kitt
    Signed-off-by: Jens Axboe

    Stephen Kitt