03 Sep, 2020

2 commits

  • commit d7d8535f377e9ba87edbf7fbbd634ac942f3f54f upstream.

    The SCHED_RESTART code path is relied on to re-run the queue and dispatch
    the requests in hctx->dispatch. Meanwhile, the SCHED_RESTART flag is checked
    when adding requests to hctx->dispatch.

    Memory barriers have to be used to order the following two pairs of
    operations:

    1) adding requests to hctx->dispatch and checking SCHED_RESTART in
    blk_mq_dispatch_rq_list()

    2) clearing SCHED_RESTART and checking whether there is a request in
    hctx->dispatch in blk_mq_sched_restart().

    Without the added memory barriers, either:

    1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
    blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not re-run the
    queue on the dispatch side,

    or

    2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not re-run
    the queue on the dispatch side, while the check for requests in
    hctx->dispatch in blk_mq_sched_restart() is missed.

    An IO hang in the ltp/fs_fill test was reported by the kernel test robot:

    https://lkml.org/lkml/2020/7/26/77

    It turns out to be caused by the above out-of-order operations, and the IO
    hang can no longer be observed after applying this patch.
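
    To make the required pairing concrete, here is a minimal userspace model of
    the two paths (C11 atomics standing in for the kernel's barriers; all names
    are illustrative, this is not the patched code):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Illustrative model of the two racing paths, not the kernel code. */
    static atomic_bool dispatch_list_nonempty;   /* "requests in hctx->dispatch" */
    static atomic_bool sched_restart;            /* "SCHED_RESTART is set"       */

    /* Dispatch side: add request to hctx->dispatch, then check SCHED_RESTART.
     * Returns true when this side must re-run the queue itself. */
    static bool dispatch_side_must_rerun(void)
    {
        atomic_store_explicit(&dispatch_list_nonempty, true, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* pairs with fence below */
        return !atomic_load_explicit(&sched_restart, memory_order_relaxed);
    }

    /* Restart side: clear SCHED_RESTART, then check hctx->dispatch.
     * Returns true when this side must re-run the queue itself. */
    static bool restart_side_must_rerun(void)
    {
        atomic_store_explicit(&sched_restart, false, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* pairs with fence above */
        return atomic_load_explicit(&dispatch_list_nonempty, memory_order_relaxed);
    }

    Without the two fences this is the classic store-buffering pattern: both
    loads can observe the stale values, both functions return false, neither
    side re-runs the queue, and the IO hangs exactly as described above.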

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: David Jeffery
    Cc:
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit db03f88fae8a2c8007caafa70287798817df2875 ]

    Commit c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") is
    supposed to add a request that has been through ->queue_rq() to the hw queue
    dispatch list; however, it also adds requests that merely ran out of budget
    or driver tag to the hw queue. This basically bypasses request merging and
    causes too many requests to be dispatched to the LLD, so %system is
    unnecessarily increased.

    Fix this issue by adding requests that have not been through ->queue_rq()
    to the sw/scheduler queue instead; this is safe because ->queue_rq() has not
    been called on these requests yet.

    High %system, and even soft lockups, can be observed on Azure storvsc
    devices. This patch reduces %system during heavy sequential IO and
    decreases the soft lockup risk.
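
    A minimal sketch of the resulting decision, with illustrative names only
    (not the actual blk-mq functions):

    /* Where a request should go when direct issue cannot proceed;
     * called only for requests that were not successfully issued. */
    enum issue_result  { ISSUED, NO_BUDGET, NO_DRIVER_TAG, DEV_BUSY };
    enum insert_target { HW_DISPATCH_LIST, SW_SCHED_QUEUE };

    static enum insert_target where_to_requeue(enum issue_result res)
    {
        switch (res) {
        case NO_BUDGET:
        case NO_DRIVER_TAG:
            /* ->queue_rq() has not been called yet, so it is safe to go
             * back to the sw/scheduler queue and keep merging possible. */
            return SW_SCHED_QUEUE;
        default:
            /* The request has been through ->queue_rq(); punt it to the
             * hw queue dispatch list as commit c616cbee97ae intended. */
            return HW_DISPATCH_LIST;
        }
    }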

    Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")
    Signed-off-by: Ming Lei
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

16 Jul, 2020

1 commit

  • commit 05a4fed69ff00a8bd83538684cb602a4636b07a7 upstream.

    dm-multipath is the only user of blk_mq_queue_inflight(). When
    dm-multipath calls blk_mq_queue_inflight() to check if it has
    outstanding IO it can get a false negative. The reason for this is
    blk_mq_rq_inflight() doesn't consider requests that are no longer
    MQ_RQ_IN_FLIGHT but that are now MQ_RQ_COMPLETE (->complete isn't
    called or finished yet) as "inflight".

    This causes request-based dm-multipath's dm_wait_for_completion() to
    return before all outstanding dm-multipath requests have actually
    completed. This breaks DM multipath's suspend functionality because
    blk-mq requests complete after DM's suspend has finished -- which
    shouldn't happen.

    Fix this by considering any request not in the MQ_RQ_IDLE state
    (so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in
    blk_mq_rq_inflight().
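
    A small sketch of the changed check, with an illustrative enum rather than
    the kernel's actual definitions:

    #include <stdbool.h>

    /* Illustrative request states; not the kernel's actual definitions. */
    enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

    /* Before the fix: a request already in MQ_RQ_COMPLETE (->complete not
     * finished yet) is wrongly reported as not inflight. */
    static bool rq_inflight_old(enum mq_rq_state state)
    {
        return state == MQ_RQ_IN_FLIGHT;
    }

    /* After the fix: anything that has left MQ_RQ_IDLE still counts. */
    static bool rq_inflight_new(enum mq_rq_state state)
    {
        return state != MQ_RQ_IDLE;
    }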

    Fixes: 3c94d83cb3526 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

01 Jul, 2020

1 commit

  • [ Upstream commit fe35ec58f0d339221643287bbb7cee15c93a5389 ]

    There is an issue when tuning the number of read and write queues while
    the total queue count stays the same: hctx->type cannot be updated,
    because __blk_mq_update_nr_hw_queues() returns directly if the total
    queue count has not changed.

    Reproduce:

    dmesg | grep "default/read/poll"
    [ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues
    cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
    48 default

    tune the write queues to 24:
    echo 24 > /sys/module/nvme/parameters/write_queues
    echo 1 > /sys/block/nvme0n1/device/reset_controller

    dmesg | grep "default/read/poll"
    [ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues

    cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
    48 default

    The driver's hardware queue mapping is no longer the same as the block
    layer's.
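
    A rough sketch of the shape of the early-return check (simplified,
    hypothetical types; the real logic lives in __blk_mq_update_nr_hw_queues()):

    /* Simplified sketch with hypothetical types, not the real code. */
    struct tag_set {
        unsigned int nr_hw_queues;   /* total hardware queues             */
        unsigned int nr_maps;        /* number of maps: default/read/poll */
    };

    static void rebuild_queue_maps(struct tag_set *set) { (void)set; }

    static void update_nr_hw_queues(struct tag_set *set, unsigned int nr_hw_queues)
    {
        /*
         * Returning early whenever the total is unchanged means the per-type
         * split (and hence hctx->type) is never refreshed.  Only skip the
         * work when a single map is in use, so re-tuning read/write queues
         * with the same total still rebuilds the maps.
         */
        if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
            return;

        set->nr_hw_queues = nr_hw_queues;
        rebuild_queue_maps(set);
        /* ... re-map software queues so hctx->type matches the new split ... */
    }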

    Signed-off-by: Weiping Zhang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Weiping Zhang
     

22 Jun, 2020

2 commits

  • [ Upstream commit aa880ad690ab6d4c53934af85fb5a43e69ecb0f5 ]

    When we increase the hardware queue count, blk_mq_update_queue_map()
    resets the mapping between CPUs and hardware queues based on the hardware
    queue count (set->nr_hw_queues). The mapping cannot be reset if
    blk_mq_realloc_hw_ctxs() encounters an error, but the fallback flow will
    continue using it, so blk_mq_map_swqueue() touches invalid memory because
    the mapping points to the wrong hctx.

    blktest block/030:

    null_blk: module loaded
    Increasing nr_hw_queues to 8 fails, fallback to 1
    ==================================================================
    BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
    Read of size 8 at addr 0000000000000128 by task nproc/8541

    CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0xa5/0xe6
    __kasan_report.cold+0x65/0xbb
    kasan_report+0x45/0x60
    check_memory_region+0x15e/0x1c0
    __kasan_check_read+0x15/0x20
    blk_mq_map_swqueue+0x2f2/0x830
    __blk_mq_update_nr_hw_queues+0x3df/0x690
    blk_mq_update_nr_hw_queues+0x32/0x50
    nullb_device_submit_queues_store+0xde/0x160 [null_blk]
    configfs_write_file+0x1c4/0x250 [configfs]
    __vfs_write+0x4c/0x90
    vfs_write+0x14b/0x2d0
    ksys_write+0xdd/0x180
    __x64_sys_write+0x47/0x50
    do_syscall_64+0x6f/0x310
    entry_SYSCALL_64_after_hwframe+0x49/0xb3
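
    The fallback ordering described above, as a rough sketch with illustrative
    helpers (not the actual __blk_mq_update_nr_hw_queues()):

    /* Illustrative helpers; not the actual kernel functions. */
    struct tag_set { unsigned int nr_hw_queues; };

    static void rebuild_cpu_to_hctx_map(struct tag_set *set) { (void)set; }
    static int  realloc_hw_ctxs(struct tag_set *set) { (void)set; return -1; }

    static void update_nr_hw_queues(struct tag_set *set, unsigned int new_nr)
    {
        unsigned int prev_nr = set->nr_hw_queues;

        set->nr_hw_queues = new_nr;
        rebuild_cpu_to_hctx_map(set);          /* map now sized for new_nr */

        if (realloc_hw_ctxs(set) < 0) {
            /* Fall back to the previous count and, crucially, rebuild the
             * map too, so it no longer points at hctxs that were never
             * allocated -- otherwise blk_mq_map_swqueue() dereferences a
             * bogus hctx, as in the KASAN splat above. */
            set->nr_hw_queues = prev_nr;
            rebuild_cpu_to_hctx_map(set);
        }
    }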

    Signed-off-by: Weiping Zhang
    Tested-by: Bart van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Weiping Zhang
     
  • [ Upstream commit fd689871bbfbb41cd77379d3e9e5f4def0f7d6c6 ]

    Allocate a new map and requests for each new hardware queue when
    increasing the hardware queue count. Before this patch, a warning was
    shown for each new hardware queue, but that is not enough: these hctxs
    have no maps and no requests, so when a bio is mapped to one of these
    hardware queues, getting a request from that hctx triggers a kernel panic.

    Test environment:
    * A NVMe disk supports 128 io queues
    * 96 cpus in system

    A corner case can always trigger this panic: 96 io queues are allocated
    for the HCTX_TYPE_DEFAULT type, with the corresponding kernel log
    "nvme nvme0: 96/0/0 default/read/poll queues". Now we set the nvme write
    queues to 96, so nvme allocates other (32) queues for read, but
    blk_mq_update_nr_hw_queues() does not allocate maps and requests for these
    newly added io queues. So when a process reads the nvme disk, getting a
    request from these hardware contexts triggers a kernel panic.
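
    Conceptually, the missing step looks like this (a sketch with made-up
    helper names, not the actual patch):

    /* Sketch with made-up helper names; not the actual patch. */
    struct tag_set {
        unsigned int prev_nr_hw_queues;
        unsigned int nr_hw_queues;
    };

    /* Allocate the tag bitmap and statically preallocated requests that a
     * hardware queue needs before any bio can be mapped to it. */
    static int alloc_map_and_requests(struct tag_set *set, unsigned int hctx_idx)
    {
        (void)set; (void)hctx_idx;
        return 0;
    }

    static int grow_hw_queues(struct tag_set *set)
    {
        for (unsigned int i = set->prev_nr_hw_queues; i < set->nr_hw_queues; i++) {
            if (alloc_map_and_requests(set, i) < 0)
                return -1;   /* caller should fall back to the old count */
        }
        return 0;
    }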

    Reproduce script:

    nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
    echo $nr > /sys/module/nvme/parameters/write_queues
    echo 1 > /sys/block/nvme0n1/device/reset_controller
    dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1

    [ 8040.805626] ------------[ cut here ]------------
    [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
    [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
    ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
    tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
    cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
    rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
    [ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
    [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
    [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
    [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
    [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
    [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
    [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
    [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
    [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
    [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
    [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
    [ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
    [ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
    [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 8040.805647] PKRU: 55555554
    [ 8040.805647] Call Trace:
    [ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
    [ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
    [ 8040.805651] process_one_work+0x1a7/0x370
    [ 8040.805652] worker_thread+0x1c9/0x380
    [ 8040.805653] ? max_active_store+0x80/0x80
    [ 8040.805655] kthread+0x112/0x130
    [ 8040.805656] ? __kthread_parkme+0x70/0x70
    [ 8040.805657] ret_from_fork+0x35/0x40
    [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
    [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
    [ 8229.365165] #PF: supervisor read access in kernel mode
    [ 8229.365178] #PF: error_code(0x0000) - not-present page
    [ 8229.365191] PGD 0 P4D 0
    [ 8229.365201] Oops: 0000 [#1] SMP PTI
    [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
    [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
    [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
    [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
    [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
    [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
    [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
    [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
    [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
    [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
    [ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
    [ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
    [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 8229.365476] PKRU: 55555554
    [ 8229.365484] Call Trace:
    [ 8229.365498] ? finish_wait+0x80/0x80
    [ 8229.365512] blk_mq_get_request+0xcb/0x3f0
    [ 8229.365525] blk_mq_make_request+0x143/0x5d0
    [ 8229.365538] generic_make_request+0xcf/0x310
    [ 8229.365553] ? scan_shadow_nodes+0x30/0x30
    [ 8229.365564] submit_bio+0x3c/0x150
    [ 8229.365576] mpage_readpages+0x163/0x1a0
    [ 8229.365588] ? blkdev_direct_IO+0x490/0x490
    [ 8229.365601] read_pages+0x6b/0x190
    [ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
    [ 8229.365626] ondemand_readahead+0x182/0x2f0
    [ 8229.365639] generic_file_buffered_read+0x590/0xab0
    [ 8229.365655] new_sync_read+0x12a/0x1c0
    [ 8229.365666] vfs_read+0x8a/0x140
    [ 8229.365676] ksys_read+0x59/0xd0
    [ 8229.365688] do_syscall_64+0x55/0x1d0
    [ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Signed-off-by: Ming Lei
    Signed-off-by: Weiping Zhang
    Tested-by: Weiping Zhang
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

02 May, 2020

1 commit

  • [ Upstream commit 5fe56de799ad03e92d794c7936bf363922b571df ]

    If in blk_mq_dispatch_rq_list() we find no budget, then we break out of
    the dispatch loop, but the request may keep the driver tag, evaluated
    in 'nxt' in the previous loop iteration.

    Fix by putting the driver tag for that request.
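
    A small sketch of the corner case, with userspace stand-ins rather than
    the real blk_mq_dispatch_rq_list():

    #include <stdbool.h>

    /* Userspace stand-ins for illustration only. */
    struct request { bool has_driver_tag; };

    static bool get_budget(void) { return false; }  /* pretend budget ran out */
    static void put_driver_tag(struct request *rq) { rq->has_driver_tag = false; }

    static void dispatch_rq_list(struct request **rqs, int nr)
    {
        for (int i = 0; i < nr; i++) {
            struct request *rq = rqs[i];

            if (!get_budget()) {
                /* 'rq' may already hold the driver tag grabbed while it was
                 * peeked at as 'nxt' in the previous iteration; release it
                 * before breaking out, or the tag leaks. */
                if (rq->has_driver_tag)
                    put_driver_tag(rq);
                break;
            }
            /* ... acquire driver tag and issue rq to the driver ... */
        }
    }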

    Reviewed-by: Ming Lei
    Signed-off-by: John Garry
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    John Garry
     

13 Apr, 2020

1 commit

  • commit 6e66b49392419f3fe134e1be583323ef75da1e4b upstream.

    blk_mq_map_queues() and multiple .map_queues() implementations expect that
    set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the number of hardware
    queues. Hence set .nr_queues before calling these functions. This patch
    fixes the following kernel warning:

    WARNING: CPU: 0 PID: 2501 at include/linux/cpumask.h:137
    Call Trace:
    blk_mq_run_hw_queue+0x19d/0x350 block/blk-mq.c:1508
    blk_mq_run_hw_queues+0x112/0x1a0 block/blk-mq.c:1525
    blk_mq_requeue_work+0x502/0x780 block/blk-mq.c:775
    process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
    worker_thread+0x98/0xe40 kernel/workqueue.c:2415
    kthread+0x361/0x430 kernel/kthread.c:255
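
    Roughly, the required ordering is the following (simplified sketch with a
    hypothetical field layout):

    /* Simplified sketch; hypothetical field layout. */
    enum { HCTX_TYPE_DEFAULT = 0, HCTX_MAX_TYPES = 3 };

    struct queue_map { unsigned int nr_queues; };
    struct tag_set {
        unsigned int     nr_hw_queues;
        struct queue_map map[HCTX_MAX_TYPES];
    };

    /* Stand-in for blk_mq_map_queues()/driver .map_queues(): spreads CPUs over
     * map[...].nr_queues hardware queues, so nr_queues must already be set. */
    static void map_queues(struct tag_set *set) { (void)set; }

    static void update_queue_map(struct tag_set *set, unsigned int nr_hw_queues)
    {
        set->nr_hw_queues = nr_hw_queues;
        set->map[HCTX_TYPE_DEFAULT].nr_queues = nr_hw_queues;  /* set first ... */
        map_queues(set);                                       /* ... then map  */
    }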

    Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0
    Reported-by: syzbot+d44e1b26ce5c3e77458d@syzkaller.appspotmail.com
    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Chaitanya Kulkarni
    Cc: Johannes Thumshirn
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

21 Mar, 2020

1 commit

  • [ Upstream commit 01e99aeca3979600302913cef3f89076786f32c8 ]

    For some reason, a device may get into a state in which it can't handle
    FS requests, so STS_RESOURCE is always returned and the FS request is
    added to hctx->dispatch. However, a passthrough request may be required
    at that time to fix the problem. If the passthrough request is added to
    the scheduler queue, there is no chance for blk-mq to dispatch it, given
    that we prioritize requests in hctx->dispatch. The FS IO request may then
    never complete, and an IO hang results.

    So passthrough requests have to be added to hctx->dispatch directly to
    fix the IO hang.

    Fix this issue by inserting passthrough requests into hctx->dispatch
    directly, together with adding FS requests to the tail of hctx->dispatch
    in blk_mq_dispatch_rq_list(). Actually we add FS requests to the tail of
    hctx->dispatch by default, see blk_mq_request_bypass_insert().

    This becomes consistent with the original legacy IO request path, in
    which passthrough requests were always added to q->queue_head.
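
    A minimal sketch of the insertion policy (illustrative types; the real
    change is in blk_mq_request_bypass_insert() and the scheduler insert path):

    #include <stdbool.h>

    /* Illustrative types only. */
    struct request { bool is_passthrough; };

    enum insert_target { HCTX_DISPATCH_LIST, SCHEDULER_QUEUE };

    static enum insert_target choose_insert_target(const struct request *rq)
    {
        /* Passthrough requests must still get a chance to be dispatched even
         * while FS requests keep failing with STS_RESOURCE, so they bypass
         * the scheduler queue and go to hctx->dispatch directly.  (FS
         * requests re-added from the dispatch path go to the tail of that
         * list; first-time inserts go through the scheduler.) */
        if (rq->is_passthrough)
            return HCTX_DISPATCH_LIST;

        return SCHEDULER_QUEUE;
    }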

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

28 Sep, 2019

2 commits

    Some HDD drives may expose multiple hardware queues, such as MegaRAID.
    Let's apply normal plugging for such devices, because sequential IO may
    benefit a lot from plug merging.

    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Dave Chinner
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • If a device is using multiple queues, the IO scheduler may be bypassed.
    This may hurt performance for some slow MQ devices, and it also breaks
    zoned devices which depend on mq-deadline for respecting the write order
    in one zone.

    Don't bypass the io scheduler if we have one set up.

    This patch can basically double sequential write performance on MQ
    scsi_debug when mq-deadline is applied.

    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Dave Chinner
    Reviewed-by: Javier González
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

27 Sep, 2019

1 commit

    We got a null pointer dereference BUG in blk_mq_rq_timed_out()
    as follows:

    [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
    [ 108.827059] PGD 0 P4D 0
    [ 108.827313] Oops: 0000 [#1] SMP PTI
    [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
    [ 108.829503] Workqueue: kblockd blk_mq_timeout_work
    [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
    [ 108.838191] Call Trace:
    [ 108.838406] bt_iter+0x74/0x80
    [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
    [ 108.839074] ? __switch_to_asm+0x34/0x70
    [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
    [ 108.840732] blk_mq_timeout_work+0x74/0x200
    [ 108.841151] process_one_work+0x297/0x680
    [ 108.841550] worker_thread+0x29c/0x6f0
    [ 108.841926] ? rescuer_thread+0x580/0x580
    [ 108.842344] kthread+0x16a/0x1a0
    [ 108.842666] ? kthread_flush_work+0x170/0x170
    [ 108.843100] ret_from_fork+0x35/0x40

    The bug is caused by a race between the timeout handler and completion of
    a flush request.

    When the timeout handler blk_mq_rq_timed_out() tries to read
    'req->q->mq_ops', 'req' has already been completed and reinitialized by
    the next flush request, which calls blk_rq_init() to clear 'req' to 0.

    After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"), the
    lifetime of normal requests is protected by a refcount: only when
    'rq->ref' drops to zero can the request really be freed, so these
    requests cannot be reused before the timeout handler finishes.

    However, the flush request defines .end_io, and rq->end_io() is still
    called even if 'rq->ref' has not dropped to zero. After that, 'flush_rq'
    can be reused by the next flush request handling, resulting in the null
    pointer dereference BUG.

    Fix this problem by covering the flush request with 'rq->ref': if the
    refcount is not zero, flush_end_io() returns and waits for the last
    holder to call it. To record the request status, add a new field
    'rq_status', which is used in flush_end_io().
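
    The idea, as a stripped-down sketch with userspace types and illustrative
    names:

    #include <stdbool.h>

    /* Userspace sketch of the idea; illustrative names only. */
    struct flush_queue {
        unsigned int flush_rq_ref;   /* models the flush request's rq->ref */
        int          rq_status;      /* recorded completion status         */
    };

    static bool ref_dec_and_test(unsigned int *ref) { return --(*ref) == 0; }

    static void flush_end_io(struct flush_queue *fq, int error)
    {
        if (!ref_dec_and_test(&fq->flush_rq_ref)) {
            /* Someone (e.g. the timeout handler) still holds a reference:
             * record the status and bail out; the last holder to drop its
             * reference gets past this check and finishes the flush. */
            if (fq->rq_status == 0)
                fq->rq_status = error;
            return;
        }

        /* Last reference dropped: it is now safe to complete pending flushes
         * and reuse flush_rq for the next flush request. */
    }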

    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Bart Van Assche
    Cc: stable@vger.kernel.org # v4.18+
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Signed-off-by: Yufen Yu

    -------
    v2:
    - move rq_status from struct request to struct blk_flush_queue
    v3:
    - remove unnecessary '{}' pair.
    v4:
    - let spinlock to protect 'fq->rq_status'
    v5:
    - move rq_status after flush_running_idx member of struct blk_flush_queue
    Signed-off-by: Jens Axboe

    Yufen Yu
     

18 Sep, 2019

3 commits

    Currently the t10_pi_prepare/t10_pi_complete functions are called during
    command preparation/completion in the NVMe and SCSI layers, but their
    actual place should be the block layer, since T10-PI is a general data
    integrity feature that is used by block storage protocols. Introduce
    .prepare_fn and .complete_fn callbacks within the integrity profile that
    each type can implement according to its needs.
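
    A rough sketch of the callback shape, using hypothetical struct and field
    names:

    /* Hypothetical struct/field names; a sketch of the callback shape only. */
    struct request;

    struct integrity_profile {
        const char *name;
        void (*prepare_fn)(struct request *rq);
        void (*complete_fn)(struct request *rq, unsigned int nr_bytes);
    };

    /* Block layer calls the profile's hook before issuing the request ... */
    static void integrity_prepare(struct request *rq,
                                  const struct integrity_profile *profile)
    {
        if (profile && profile->prepare_fn)
            profile->prepare_fn(rq);     /* e.g. T10-PI ref-tag remapping */
    }

    /* ... and after (partial) completion. */
    static void integrity_complete(struct request *rq,
                                   const struct integrity_profile *profile,
                                   unsigned int nr_bytes)
    {
        if (profile && profile->complete_fn)
            profile->complete_fn(rq, nr_bytes);
    }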

    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Suggested-by: Martin K. Petersen
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Max Gurtovoy

    Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
    isn't defined in the config.

    Signed-off-by: Jens Axboe

    Max Gurtovoy
     
  • Pull block updates from Jens Axboe:

    - Two NVMe pull requests:
        - ana log parse fix from Anton
        - nvme quirks support for Apple devices from Ben
        - fix missing bio completion tracing for multipath stack devices
          from Hannes and Mikhail
        - IP TOS settings for nvme rdma and tcp transports from Israel
        - rq_dma_dir cleanups from Israel
        - tracing for Get LBA Status command from Minwoo
        - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
        - Some consolidation between the fabrics transports for handling
          the CAP register
        - reset race with ns scanning fix for fabrics (move fabrics
          commands to a dedicated request queue with a different lifetime
          from the admin request queue)."
        - controller reset and namespace scan races fixes
        - nvme discovery log change uevent support
        - naming improvements from Keith
        - multiple discovery controllers reject fix from James
        - some regular cleanups from various people

    - Series fixing (and re-fixing) null_blk debug printing and nr_devices
    checks (André)

    - A few pull requests from Song, with fixes from Andy, Guoqing,
    Guilherme, Neil, Nigel, and Yufen.

    - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

    - Bio merge handling unification (Christoph)

    - Pick default elevator correctly for devices with special needs
    (Damien)

    - Block stats fixes (Hou)

    - Timeout and support devices nbd fixes (Mike)

    - Series fixing races around elevator switching and device add/remove
    (Ming)

    - sed-opal cleanups (Revanth)

    - Per device weight support for BFQ (Fam)

    - Support for blk-iocost, a new model that can properly account cost of
    IO workloads. (Tejun)

    - blk-cgroup writeback fixes (Tejun)

    - paride queue init fixes (zhengbin)

    - blk_set_runtime_active() cleanup (Stanley)

    - Block segment mapping optimizations (Bart)

    - lightnvm fixes (Hans/Minwoo/YueHaibing)

    - Various little fixes and cleanups

    * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
    null_blk: format pr_* logs with pr_fmt
    null_blk: match the type of parameter nr_devices
    null_blk: do not fail the module load with zero devices
    block: also check RQF_STATS in blk_mq_need_time_stamp()
    block: make rq sector size accessible for block stats
    bfq: Fix bfq linkage error
    raid5: use bio_end_sector in r5_next_bio
    raid5: remove STRIPE_OPS_REQ_PENDING
    md: add feature flag MD_FEATURE_RAID0_LAYOUT
    md/raid0: avoid RAID0 data corruption due to layout confusion.
    raid5: don't set STRIPE_HANDLE to stripe which is in batch list
    raid5: don't increment read_errors on EILSEQ return
    nvmet: fix a wrong error status returned in error log page
    nvme: send discovery log page change events to userspace
    nvme: add uevent variables for controller devices
    nvme: enable aen regardless of the presence of I/O queues
    nvme-fabrics: allow discovery subsystems accept a kato
    nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
    nvme: Remove redundant assignment of cq vector
    nvme: Assign subsys instance from first ctrl
    ...

    Linus Torvalds
     
  • Pull core timer updates from Thomas Gleixner:
    "Timers and timekeeping updates:

    - A large overhaul of the posix CPU timer code which is a preparation
    for moving the CPU timer expiry out into task work so it can be
    properly accounted on the task/process.

    An update to the bogus permission checks will come later during the
    merge window as feedback was not complete before heading of for
    travel.

    - Switch the timerqueue code to use cached rbtrees and get rid of the
    homebrewn caching of the leftmost node.

    - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
    single function

    - Implement the separation of hrtimers to be forced to expire in hard
    interrupt context even when PREEMPT_RT is enabled and mark the
    affected timers accordingly.

    - Implement a mechanism for hrtimers and the timer wheel to protect
    RT against priority inversion and live lock issues when a (hr)timer
    which should be canceled is currently executing the callback.
    Instead of infinitely spinning, the task which tries to cancel the
    timer blocks on a per cpu base expiry lock which is held and
    released by the (hr)timer expiry code.

    - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
    resulting in faster access to timekeeping functions.

    - Updates to various clocksource/clockevent drivers and their device
    tree bindings.

    - The usual small improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    posix-cpu-timers: Fix permission check regression
    posix-cpu-timers: Always clear head pointer on dequeue
    hrtimer: Add a missing bracket and hide `migration_base' on !SMP
    posix-cpu-timers: Make expiry_active check actually work correctly
    posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
    tick: Mark sched_timer to expire in hard interrupt context
    hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
    x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
    posix-cpu-timers: Utilize timerqueue for storage
    posix-cpu-timers: Move state tracking to struct posix_cputimers
    posix-cpu-timers: Deduplicate rlimit handling
    posix-cpu-timers: Remove pointless comparisons
    posix-cpu-timers: Get rid of 64bit divisions
    posix-cpu-timers: Consolidate timer expiry further
    posix-cpu-timers: Get rid of zero checks
    rlimit: Rewrite non-sensical RLIMIT_CPU comment
    posix-cpu-timers: Respect INFINITY for hard RTTIME limit
    posix-cpu-timers: Switch thread group sampling to array
    posix-cpu-timers: Restructure expiry array
    posix-cpu-timers: Remove cputime_expires
    ...

    Linus Torvalds
     

16 Sep, 2019

2 commits

    In __blk_mq_end_request(), if the block stats need updating, we should
    ensure 'now' is valid instead of 0 even when iostat is disabled.

    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
     
  • Currently rq->data_len will be decreased by partial completion or
    zeroed by completion, so when blk_stat_add() is invoked, data_len
    will be zero and there will never be samples in poll_cb because
    blk_mq_poll_stats_bkt() will return -1 if data_len is zero.

    We could move blk_stat_add() back to __blk_mq_complete_request(),
    but that would make the effort of trying to call ktime_get_ns()
    once in vain. Instead we can reuse throtl_size field, and use
    it for both block stats and block throttle, and adjust the
    logic in blk_mq_poll_stats_bkt() accordingly.

    Fixes: 4bc6339a583c ("block: move blk_stat_add() to __blk_mq_end_request()")
    Tested-by: Pavel Begunkov
    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
     

06 Sep, 2019

3 commits

  • When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
    the only information known about the device is the number of hardware
    queues as the block device scan by the device driver is not completed
    yet for most drivers. The device type and elevator required features
    are not set yet, preventing the correct selection of the default elevator
    most suitable for the device.

    This currently affects all multi-queue zoned block devices which default
    to the "none" elevator instead of the required "mq-deadline" elevator.
    These drives currently include host-managed SMR disks connected to a
    smartpqi HBA and null_blk block devices with zoned mode enabled.
    Upcoming NVMe Zoned Namespace devices will also be affected.

    Fix this by adding the boolean elevator_init argument to
    blk_mq_init_allocated_queue() to control the execution of
    elevator_init_mq(). Two cases exist:
    1) elevator_init = false is used for calls to
    blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
    case, a call to elevator_init_mq() is added to __device_add_disk(),
    resulting in the delayed initialization of the queue elevator
    after the device driver finished probing the device information. This
    effectively allows elevator_init_mq() access to more information
    about the device.
    2) elevator_init = true preserves the current behavior of initializing
    the elevator directly from blk_mq_init_allocated_queue(). This case
    is used for the special request based DM devices where the device
    gendisk is created before the queue initialization and device
    information (e.g. queue limits) is already known when the queue
    initialization is executed.

    Additionally, to make sure that the elevator initialization is never
    done while requests are in-flight (there should be none when the device
    driver calls device_add_disk()), freeze and quiesce the device request
    queue before calling blk_mq_init_sched() in elevator_init_mq().
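
    The two cases, as a simplified sketch with illustrative types and helpers:

    #include <stdbool.h>

    /* Illustrative types and helpers; not the kernel code. */
    struct request_queue { const char *elevator; };

    static void elevator_init_mq(struct request_queue *q)
    {
        /* pick mq-deadline, another scheduler, or "none" based on the
         * device's required elevator features */
        q->elevator = "mq-deadline";
    }

    static void init_allocated_queue(struct request_queue *q, bool elevator_init)
    {
        /* ... hardware queue and map setup ... */

        if (elevator_init)
            elevator_init_mq(q);   /* case 2: request-based DM, limits known */
        /* case 1 (elevator_init == false): blk_mq_init_queue() callers get
         * the elevator later, from __device_add_disk(), after the driver has
         * finished probing and features such as the zoned model are known. */
    }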

    Reviewed-by: Ming Lei
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • If the default elevator chosen is mq-deadline, elevator_init_mq() may
    return an error if mq-deadline initialization fails, leading to
    blk_mq_init_allocated_queue() returning an error, which in turn will
    cause the block device initialization to fail and the device not being
    exposed.

    Instead of taking such extreme measure, handle mq-deadline
    initialization failures in the same manner as when mq-deadline is not
    available (no module to load), that is, default to the "none" scheduler.
    With this change, elevator_init_mq() return type can be changed to void.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
    calling elevator_init_mq() to make sure that the queue supports IO
    scheduling, use the elevator.c function elv_support_iosched() in
    elevator_init_mq(). This does not introduce any functional change but
    ensure that elevator_init_mq() does the right thing based on the queue
    settings.

    Reviewed-by: Ming Lei
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

29 Aug, 2019

1 commit

    There are currently two start time timestamps - start_time_ns and
    io_start_time_ns. The former marks the request allocation time and the
    latter the issue-to-device time. The planned io.weight controller needs
    to measure the total time bios take to execute after leaving rq_qos,
    including the time spent waiting for a request to become available,
    which can easily dominate on saturated devices.

    This patch adds request->alloc_time_ns which records when the request
    allocation attempt started. As it isn't used for the usual stats,
    make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
    QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
    no users and it's active only on queues which need it even when
    compiled in.

    v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
    gating as suggested by Jens.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Aug, 2019

1 commit

  • blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
    and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
    isn't exposed to userspace yet. For the latter caller, hctx sysfs entries
    and debugfs are un-registered before updating nr_hw_queues.

    On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
    request queue is freed") moves freeing hctx into queue's release
    handler, so there won't be race with queue release path too.

    So don't hold q->sysfs_lock in blk_mq_map_swqueue().

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Cc: Bart Van Assche
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

16 Aug, 2019

1 commit

    We had a few issues with this code, and there's still a problem around
    how we deal with error handling for chained/split bios. For now, just
    revert the code and we'll try again with a thorough solution. This
    reverts commits:

    e15c2ffa1091 ("block: fix O_DIRECT error handling for bio fragments")
    0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
    6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
    893a1c97205a ("blk-mq: allow REQ_NOWAIT to return an error inline")

    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Aug, 2019

2 commits

    If blk_mq_init_allocated_queue->elevator_init_mq fails, we need to release
    the previously requested resources.

    Fixes: d34849913819 ("blk-mq-sched: allow setting of default IO scheduler")
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
     
  • blk_exit_queue will free elevator_data, while blk_mq_requeue_work
    will access it. Move cancel of requeue_work to the front of
    blk_exit_queue to avoid use-after-free.

    blk_exit_queue                    blk_mq_requeue_work
      __elevator_exit                   blk_mq_run_hw_queues
        blk_mq_exit_sched                 blk_mq_run_hw_queue
          dd_exit_queue                     blk_mq_hctx_has_pending
            kfree(elevator_data)              blk_mq_sched_has_work
                                                dd_has_work

    Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release")
    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
     

05 Aug, 2019

2 commits

    blk_mq_tagset_wait_completed_request() has been applied for waiting until
    a completed request's completion fn has run, so it is not necessary to
    use blk_mq_complete_request_sync() any more.

    Cc: Max Gurtovoy
    Cc: Sagi Grimberg
    Cc: Keith Busch
    Cc: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    NVMe needs this function to decide whether a request that is to be
    aborted has already been completed in the normal IO path.

    So introduce it.
    So introduce it.

    Cc: Max Gurtovoy
    Cc: Sagi Grimberg
    Cc: Keith Busch
    Cc: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Aug, 2019

2 commits

  • hrtimer_sleepers will gain a scheduling class dependent treatment on
    PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make
    that possible.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
    object which is embedded into the hrtimer_sleeper.

    Combine the initialization and spare a function call. Fixup all call sites.

    This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
    specific initializations of the embedded hrtimer without modifying any of
    the call sites.

    No functional change.

    [ anna-maria: Minor cleanups ]
    [ tglx: Adopted to the removal of the task argument of
    hrtimer_init_sleeper() and trivial polishing.
    Folded a fix from Stephen Rothwell for the vsoc code ]

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de

    Sebastian Andrzej Siewior
     

31 Jul, 2019

1 commit


23 Jul, 2019

1 commit


22 Jul, 2019

1 commit

  • By default, if a caller sets REQ_NOWAIT and we need to block, we'll
    return -EAGAIN through the bio->bi_end_io() callback. For some use
    cases, this makes it hard to use.

    Allow a caller to ask for inline return of errors related to
    blocking by also setting REQ_NOWAIT_INLINE.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jul, 2019

1 commit

    Simultaneously writing to a sequential zone of a zoned block device
    from multiple contexts requires mutual exclusion for BIO issuing to
    ensure that writes happen sequentially. However, even for a well
    behaved user correctly implementing such synchronization, BIO plugging
    may interfere and result in BIOs from the different contexts being
    reordered if plugging is done outside of the mutual exclusion section,
    e.g. the plug was started by a function higher in the call chain than
    the function issuing BIOs.

    Context A                          Context B

    | blk_start_plug()
    | ...
    | seq_write_zone()
    |   mutex_lock(zone)
    |   bio-0->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-0)
    |   submit_bio(bio-0)
    |   bio-1->bi_iter.bi_sector = zone->wp
    |   zone->wp += bio_sectors(bio-1)
    |   submit_bio(bio-1)
    |   mutex_unlock(zone)
    |   return
    | -----------------------> | seq_write_zone()
    |                          |   mutex_lock(zone)
    |                          |   bio-2->bi_iter.bi_sector = zone->wp
    |                          |   zone->wp += bio_sectors(bio-2)
    |                          |   submit_bio(bio-2)
    |                          |   mutex_unlock(zone)
    |
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

03 Jul, 2019

2 commits

  • Move the blk_mq_bio_to_request() call in front of the if-statement.

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Minwoo Im
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends
    on preemption being disabled for its correctness. Since removing the CPU
    preemption calls does not measurably affect performance, simplify the
    blk-mq code by removing the blk_mq_put_ctx() function and also by not
    disabling preemption in blk_mq_get_ctx().

    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Jun, 2019

2 commits

  • We only need the number of segments in the blk-mq submission path.
    Remove the field from struct bio, and return it from a variant of
    blk_queue_split instead of that it can passed as an argument to
    those functions that need the value.

    This also means we stop recounting segments except for cloning
    and partial segments.

    To keep the number of arguments in this hot path down, remove
    pointless struct request_queue arguments from any of the functions
    that had it and grew a nr_segs argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • lightnvm should have never used this function, as it is sending
    passthrough requests, so switch it to blk_rq_append_bio like all the
    other passthrough request users. Inline blk_init_request_from_bio into
    the only remaining caller.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Minwoo Im
    Reviewed-by: Javier González
    Reviewed-by: Matias Bjørling
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 May, 2019

1 commit


24 May, 2019

1 commit

  • The following is a description of a hang in blk_mq_freeze_queue_wait().
    The hang happens on attempt to freeze a queue while another task does
    queue unfreeze.

    The root cause is an incorrect sequence of percpu_ref_resurrect() and
    percpu_ref_kill() and as a result those two can be swapped:

    CPU#0                              CPU#1
    ----------------                   -----------------
    q1 = blk_mq_init_queue(shared_tags)

                                       q2 = blk_mq_init_queue(shared_tags):
                                         blk_mq_add_queue_tag_set(shared_tags):
                                           blk_mq_update_tag_set_depth(shared_tags):
                                             list_for_each_entry()
                                               blk_mq_freeze_queue(q1)
                                                > percpu_ref_kill()
                                                > blk_mq_freeze_queue_wait()

    blk_cleanup_queue(q1)
      blk_mq_freeze_queue(q1)
       > percpu_ref_kill()
         ^^^^^^ freeze_depth can't guarantee the order

                                       blk_mq_unfreeze_queue()
                                        > percpu_ref_resurrect()

       > blk_mq_freeze_queue_wait()
         ^^^^^^ Hang here!!!!

    This wrong sequence raises kernel warning:
    percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
    WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

    But the most unpleasant effect is a hang of a blk_mq_freeze_queue_wait(),
    which waits for a zero of a q_usage_counter, which never happens
    because percpu-ref was reinited (instead of being killed) and stays in
    PERCPU state forever.

    How to reproduce:
    - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
    - cpu0: python Script.py 0; taskset the corresponding process running on cpu0
    - cpu1: python Script.py 1; taskset the corresponding process running on cpu1

    Script.py:
    ------
    #!/usr/bin/python3

    import os
    import sys

    while True:
        on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
        os.system(on)
        os.system(off)
    ------

    This bug was first reported and fixed by Roman, previous discussion:
    [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
    [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
    [3] https://patchwork.kernel.org/patch/9268199/

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Roman Pen
    Signed-off-by: Bob Liu
    Signed-off-by: Jens Axboe

    Bob Liu
     

04 May, 2019

1 commit

  • In normal queue cleanup path, hctx is released after request queue
    is freed, see blk_mq_release().

    However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed because
    the hw queue count shrinks. This easily causes use-after-free: one
    implicit rule is that it is safe to call almost all block layer APIs
    while the request queue is alive; an hctx may be retrieved through one
    such API and then freed by blk_mq_update_nr_hw_queues(), and a
    use-after-free is triggered.

    Fix this issue by always freeing hctx after the request queue is
    released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
    introduce a per-queue list to hold them and try to reuse these hctxs if
    the numa node matches.

    Cc: Dongli Zhang
    Cc: James Smart
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E. J. Bottomley
    Reviewed-by: Hannes Reinecke
    Tested-by: James Smart
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei