31 Aug, 2022

1 commit

  • commit 65fac0d54f374625b43a9d6ad1f2c212bd41f518 upstream.

    Currently, in virtio_scsi, if 'bd->last' is not set to true while
    dispatching a request, that io will stay in the driver's queue, and the
    driver will wait for the block layer to dispatch more requests. However,
    if the block layer fails to dispatch more requests, it should trigger
    commit_rqs to inform the driver.

    There is a problem in blk_mq_try_issue_list_directly() where commit_rqs
    won't be called:

    // assume that queue_depth is set to 1, list contains two rq
    blk_mq_try_issue_list_directly
     blk_mq_request_issue_directly
      // dispatch first rq
      // last is false
      __blk_mq_try_issue_directly
       blk_mq_get_dispatch_budget
        // succeed to get first budget
       __blk_mq_issue_directly
        scsi_queue_rq
         cmd->flags |= SCMD_LAST
         virtscsi_queuecommand
          kick = (sc->flags & SCMD_LAST) != 0
          // kick is false, first rq won't issue to disk
      queued++

     blk_mq_request_issue_directly
      // dispatch second rq
      __blk_mq_try_issue_directly
       blk_mq_get_dispatch_budget
        // failed to get second budget
      ret == BLK_STS_RESOURCE
       blk_mq_request_bypass_insert
      // errors is still 0

     if (!list_empty(list) || errors && ...)
      // won't pass, commit_rqs won't be called

    In this situation, the first rq relies on the second rq to trigger the
    dispatch, while the second rq relies on the first rq to complete; thus
    both hang.

    Fix the problem by also treating 'BLK_STS_*RESOURCE' as 'errors', since it
    means the request was not queued successfully.

    The same problem exists in blk_mq_dispatch_rq_list(); there
    'BLK_STS_*RESOURCE' can't be treated as 'errors', so fix it by calling
    commit_rqs if queue_rq returns 'BLK_STS_*RESOURCE'.
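
    A minimal sketch of the resulting logic in blk_mq_try_issue_list_directly()
    (hedged; condensed from the behavior described above, not the verbatim
    upstream diff):

        static void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
                        struct list_head *list)
        {
                int queued = 0;
                int errors = 0;

                while (!list_empty(list)) {
                        struct request *rq = list_first_entry(list,
                                        struct request, queuelist);
                        blk_status_t ret;

                        list_del_init(&rq->queuelist);
                        ret = blk_mq_request_issue_directly(rq, list_empty(list));
                        if (ret != BLK_STS_OK) {
                                errors++; /* now also counts BLK_STS_*RESOURCE */
                                if (ret == BLK_STS_RESOURCE ||
                                    ret == BLK_STS_DEV_RESOURCE) {
                                        blk_mq_request_bypass_insert(rq, false,
                                                        list_empty(list));
                                        break;
                                }
                                blk_mq_end_request(rq, ret);
                        } else {
                                queued++;
                        }
                }

                /*
                 * With 'errors' counting resource failures too, a partially
                 * issued list now reaches commit_rqs and kicks the driver,
                 * so the first rq no longer waits forever for the second.
                 */
                if ((!list_empty(list) || errors) &&
                    hctx->queue->mq_ops->commit_rqs && queued)
                        hctx->queue->mq_ops->commit_rqs(hctx);
        }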

    Fixes: d666ba98f849 ("blk-mq: add mq_ops->commit_rqs()")
    Signed-off-by: Yu Kuai
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Yu Kuai
     

22 Jun, 2022

1 commit

  • [ Upstream commit 14dc7a18abbe4176f5626c13c333670da8e06aa1 ]

    This patch prevents test nvme/004 from triggering the following:

    UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
    index 512 is out of range for type 'long unsigned int [512]'
    Call Trace:
    show_stack+0x52/0x58
    dump_stack_lvl+0x49/0x5e
    dump_stack+0x10/0x12
    ubsan_epilogue+0x9/0x3b
    __ubsan_handle_out_of_bounds.cold+0x44/0x49
    blk_mq_alloc_request_hctx+0x304/0x310
    __nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
    nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
    nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
    nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
    nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
    nvmf_dev_write+0xae/0x111 [nvme_fabrics]
    vfs_write+0x144/0x560
    ksys_write+0xb7/0x140
    __x64_sys_write+0x42/0x50
    do_syscall_64+0x35/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
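
    A hedged sketch of the guard this adds in blk_mq_alloc_request_hctx()
    (reconstructed from the out-of-bounds report above; not the verbatim diff):

        /*
         * If no online CPU is mapped to this hctx, cpumask_first_and()
         * returns nr_cpu_ids. Indexing the per-cpu ctx array with that
         * value is exactly the reported out-of-bounds access, so bail
         * out instead of dereferencing it.
         */
        cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
        if (cpu >= nr_cpu_ids)
                goto out_queue_exit;
        data.ctx = __blk_mq_get_ctx(q, cpu);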

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Bart Van Assche
     

15 Jun, 2022

1 commit

  • [ Upstream commit 5d05426e2d5fd7df8afc866b78c36b37b00188b7 ]

    blk_mq_run_hw_queues() can run when there are no queued requests and even
    after the queue has been cleaned up. At that point the tagset may already
    be freed, because the tagset's lifetime is controlled by the driver and it
    is often freed after blk_cleanup_queue() returns.

    So don't touch ->tagset to figure out the current default hctx; use the
    mapping built into the request queue instead, so that a use-after-free on
    the tagset is avoided. This should also be faster than retrieving the
    mapping from the tagset.
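
    A hedged sketch of the change (assuming the blk_mq_get_sq_hctx() helper
    introduced by the Fixes: commit; not the verbatim diff):

        static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
        {
                struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
                /*
                 * Use the per-cpu ctx -> hctx mapping owned by the request
                 * queue itself instead of set->map[], which may already be
                 * freed once the driver has released the tagset.
                 */
                struct blk_mq_hw_ctx *hctx = ctx->hctxs[HCTX_TYPE_DEFAULT];

                if (!blk_mq_hctx_stopped(hctx))
                        return hctx;
                return NULL;
        }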

    Cc: "yukuai (C)"
    Cc: Jan Kara
    Fixes: b6e68ee82585 ("blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues")
    Signed-off-by: Ming Lei
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20220522122350.743103-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

01 Dec, 2021

1 commit

  • commit 2a19b28f7929866e1cec92a3619f4de9f2d20005 upstream.

    To avoid slowing down queue destruction, we don't call
    blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling the
    dispatch work is delayed to blk_release_queue().

    However, this has caused a kernel oops [1], reported by Changhui. The log
    shows that the scsi_device can be freed before blk_release_queue() runs,
    which is expected too, since the scsi_device is released after the scsi
    disk is closed and the scsi_device is removed.

    Fix the issue by canceling the blk-mq dispatch work in both
    blk_cleanup_queue() and disk_release(); a sketch of such a cancel helper
    appears after the two points below:

    1) when disk_release() is run, the disk has been closed and any sync
    dispatch activity has finished, so canceling the dispatch work is enough
    to quiesce filesystem I/O dispatch activity.

    2) in blk_cleanup_queue(), we only care about passthrough requests, and a
    passthrough request is always explicitly allocated & freed by its caller,
    so once the queue is frozen, all sync dispatch activity for passthrough
    requests has finished. It is then enough to just cancel the dispatch work
    to avoid any further dispatch activity.
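
    A hedged sketch of such a cancel helper (the exact upstream helper name
    may differ):

        void blk_mq_cancel_work_sync(struct request_queue *q)
        {
                if (queue_is_mq(q)) {
                        struct blk_mq_hw_ctx *hctx;
                        int i;

                        cancel_delayed_work_sync(&q->requeue_work);

                        queue_for_each_hw_ctx(q, hctx, i)
                                cancel_delayed_work_sync(&hctx->run_work);
                }
        }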

    [1] kernel panic log
    [12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
    [12622.777186] #PF: supervisor read access in kernel mode
    [12622.782918] #PF: error_code(0x0000) - not-present page
    [12622.788649] PGD 0 P4D 0
    [12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
    [12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
    [12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
    [12622.813321] Workqueue: kblockd blk_mq_run_work_fn
    [12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
    [12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
    [12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
    [12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
    [12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
    [12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
    [12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
    [12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
    [12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
    [12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
    [12622.913328] Call Trace:
    [12622.916055]
    [12622.918394] scsi_mq_get_budget+0x1a/0x110
    [12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
    [12622.928404] ? pick_next_task_fair+0x39/0x390
    [12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
    [12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
    [12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
    [12622.949593] process_one_work+0x1e8/0x3c0
    [12622.954059] worker_thread+0x50/0x3b0
    [12622.958144] ? rescuer_thread+0x370/0x370
    [12622.962616] kthread+0x158/0x180
    [12622.966218] ? set_kthread_struct+0x40/0x40
    [12622.970884] ret_from_fork+0x22/0x30
    [12622.974875]
    [12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]

    Reported-by: ChanghuiZhong
    Cc: Christoph Hellwig
    Cc: "Martin K. Petersen"
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

19 Nov, 2021

2 commits

  • [ Upstream commit 037057a5a979c7eeb2ee5d12cf4c24b805192c75 ]

    This check is meant to catch cases where a requeue is attempted on a
    request that is still inserted. It's never really been useful to catch any
    misuse, and now it's actively wrong. Outside of that, this should not be a
    BUG_ON() to begin with.

    Remove the check as it's now causing active harm, as requeue off the plug
    path will trigger it even though the request state is just fine.

    Reported-by: Yi Zhang
    Link: https://lore.kernel.org/linux-block/CAHj4cs80zAUc2grnCZ015-2Rvd-=gXRfB_dFKy=RTm+wRo09HQ@mail.gmail.com/
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     
  • [ Upstream commit ba0ffdd8ce48ad7f7e85191cd29f9674caca3745 ]

    Particularly for NVMe with efficient deferred submission for many
    requests, there are nice benefits to be seen by bumping the default max
    plug count from 16 to 32. This is especially true for virtualized setups,
    where the submit part is more expensive, but it can be noticed even on
    native hardware.

    Reduce the multiple queue factor from 4 to 2, since we're changing the
    default size.

    While changing it, move the defines into the block layer private header.
    These aren't values that anyone outside of the block layer uses, or
    should use.
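
    A hedged sketch of the resulting limits in the block-layer private header
    (the helper name is an assumption; the values follow the message above):

        /* block/blk.h */
        #define BLK_MAX_REQUEST_COUNT	32

        static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
        {
                if (plug->multiple_queues)
                        return BLK_MAX_REQUEST_COUNT * 2;
                return BLK_MAX_REQUEST_COUNT;
        }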

    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

27 Oct, 2021

1 commit

    When dispatching a zone append write request to a SCSI zoned block device,
    if the target zone of the request is already locked, the device driver will
    return BLK_STS_ZONE_RESOURCE and the request will be pushed back to the
    hctx dispatch queue. The queue will be marked as RESTART in
    dd_finish_request() and restarted in __blk_mq_free_request(). However, this
    restart applies to the hctx of the completed request. If the requeued
    request is on a different hctx, dispatch will not be retried until another
    request is submitted or the next periodic queue run triggers, leading to up
    to 30 seconds of latency for the requeued request.

    Fix this problem by scheduling a queue restart similarly to the
    BLK_STS_RESOURCE case or when we cannot get the budget.

    Also, consolidate the checks into the "need_resource" variable to simplify
    the condition.
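
    A hedged sketch of the consolidation in blk_mq_dispatch_rq_list()
    (variable name per the message above; surrounding logic abridged):

        bool needs_resource = false;

        /* inside the dispatch loop's status switch: */
        switch (ret) {
        case BLK_STS_RESOURCE:
                needs_resource = true;
                fallthrough;
        case BLK_STS_DEV_RESOURCE:
                blk_mq_handle_dev_resource(rq, list);
                goto out;
        case BLK_STS_ZONE_RESOURCE:
                /* zone-locked writes now also count as a resource shortage */
                blk_mq_handle_zone_resource(rq, &zone_list);
                needs_resource = true;
                break;
        default:
                break;
        }

        /* after the loop: one flag now covers every "retry later" reason */
        if (needs_restart && (needs_resource || no_budget_avail))
                blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);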

    Signed-off-by: Naohiro Aota
    Reviewed-by: Christoph Hellwig
    Cc: Niklas Cassel
    Link: https://lore.kernel.org/r/20211026165127.4151055-1-naohiro.aota@wdc.com
    Signed-off-by: Jens Axboe

    Naohiro Aota
     

16 Oct, 2021

1 commit

    Don't switch back to percpu mode, to avoid the double RCU grace period
    when tearing down SCSI devices. After removing the disk, only passthrough
    commands can be sent anyway.

    Suggested-by: Ming Lei
    Signed-off-by: Christoph Hellwig
    Tested-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de
    Tested-by: Yi Zhang
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Sep, 2021

1 commit

  • Pull block fixes from Jens Axboe:

    - NVMe pull request from Christoph:
    - fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
    - update a MAINTAINERS email address (Chaitanya Kulkarni)
    - set QUEUE_FLAG_NOWAIT for nvme-multipath (me)
    - handle errors from add_disk() (Luis Chamberlain)
    - update the keep alive interval when kato is modified (Tatsuya Sasaki)
    - fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
    - do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
    - only call synchronize_srcu when clearing current path (Daniel Wagner)
    - revalidate paths during rescan (Hannes Reinecke)

    - Split out fs/block_dev.c into block/fops.c and block/bdev.c, which
    has been long overdue. Do this now, before -rc1, to avoid annoying
    conflicts (Christoph)

    - blk-throtl use-after-free fix (Li)

    - Improve plug depth for multi-device plugs, greatly increasing md
    resync performance (Song)

    - blkdev_show() locking fix (Tetsuo)

    - n64cart error check fix (Yang)

    * tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
    n64cart: fix return value check in n64cart_probe()
    blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
    block: move fs/block_dev.c to block/bdev.c
    block: split out operations on block special files
    blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
    block: genhd: don't call blkdev_show() with major_names_lock held
    nvme: update MAINTAINERS email address
    nvme: add error handling support for add_disk()
    nvme: only call synchronize_srcu when clearing current path
    nvme: update keep alive interval when kato is modified
    nvme-tcp: Do not reset transport on data digest errors
    nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
    nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
    nvmet: looks at the passthrough controller when initializing CAP
    nvme: move nvme_multi_css into nvme.h
    nvme-multipath: revalidate paths during rescan
    nvme-multipath: set QUEUE_FLAG_NOWAIT

    Linus Torvalds
     

08 Sep, 2021

1 commit

    Limiting the number of requests to BLK_MAX_REQUEST_COUNT at blk_plug hurts
    performance for large md arrays. [1] shows that the resync speed of an md
    array drops for arrays with more than 16 HDDs.

    Fix this by allowing more requests in the plug queue. The multiple_queues
    flag is used to apply the higher limit only to the multiple-queue case.

    [1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/
    Tested-by: Marcin Wanat
    Signed-off-by: Song Liu
    Signed-off-by: Jens Axboe

    Song Liu
     

03 Sep, 2021

1 commit

  • Pull SCSI updates from James Bottomley:
    "This series consists of the usual driver updates (ufs, qla2xxx,
    target, smartpqi, lpfc, mpt3sas).

    The core change causing the most churn was replacing the SCSI command's
    'request' field with a macro, allowing us to offset-map to it and remove
    the redundant field; the same was also done for the tag field.

    The most impactful change is the final removal of scsi_ioctl, which
    has been deprecated for over a decade"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
    scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
    scsi: ufs: ufs-exynos: Fix static checker warning
    scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
    scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
    scsi: lpfc: Copyright updates for 14.0.0.1 patches
    scsi: lpfc: Update lpfc version to 14.0.0.1
    scsi: lpfc: Add bsg support for retrieving adapter cmf data
    scsi: lpfc: Add cmf_info sysfs entry
    scsi: lpfc: Add debugfs support for cm framework buffers
    scsi: lpfc: Add support for maintaining the cm statistics buffer
    scsi: lpfc: Add rx monitoring statistics
    scsi: lpfc: Add support for the CM framework
    scsi: lpfc: Add cmfsync WQE support
    scsi: lpfc: Add support for cm enablement buffer
    scsi: lpfc: Add cm statistics buffer support
    scsi: lpfc: Add EDC ELS support
    scsi: lpfc: Expand FPIN and RDF receive logging
    scsi: lpfc: Add MIB feature enablement support
    scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
    scsi: fc: Add EDC ELS definition
    ...

    Linus Torvalds
     

31 Aug, 2021

2 commits

  • Pull block updates from Jens Axboe:
    "Nothing major in here - lots of good cleanups and tech debt handling,
    which is also evident in the diffstats. In particular:

    - Add disk sequence numbers (Matteo)

    - Discard merge fix (Ming)

    - Relax disk zoned reporting restrictions (Niklas)

    - Bio error handling zoned leak fix (Pavel)

    - Start of proper add_disk() error handling (Luis, Christoph)

    - blk crypto fix (Eric)

    - Non-standard GPT location support (Dmitry)

    - IO priority improvements and cleanups (Damien)

    - blk-throtl improvements (Chunguang)

    - diskstats_show() stack reduction (Abd-Alrhman)

    - Loop scheduler selection (Bart)

    - Switch block layer to use kmap_local_page() (Christoph)

    - Remove obsolete disk_name helper (Christoph)

    - block_device refcounting improvements (Christoph)

    - Ensure gendisk always has a request queue reference (Christoph)

    - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

    * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
    sg: pass the device name to blk_trace_setup
    block, bfq: cleanup the repeated declaration
    blk-crypto: fix check for too-large dun_bytes
    blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
    blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
    block: mark blkdev_fsync static
    block: refine the disk_live check in del_gendisk
    mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
    mmc: block: Support alternative_gpt_sector() operation
    partitions/efi: Support non-standard GPT location
    block: Add alternative_gpt_sector() operation
    bio: fix page leak bio_add_hw_page failure
    block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
    block: remove a pointless call to MINOR() in device_add_disk
    null_blk: add error handling support for add_disk()
    virtio_blk: add error handling support for add_disk()
    block: add error handling for device_add_disk / add_disk
    block: return errors from disk_alloc_events
    block: return errors from blk_integrity_add
    block: call blk_register_queue earlier in device_add_disk
    ...

    Linus Torvalds
     
  • Pull irq updates from Thomas Gleixner:
    "Updates to the interrupt core and driver subsystems:

    Core changes:

    - The usual set of small fixes and improvements all over the place,
    but nothing stands out

    MSI changes:

    - Further consolidation of the PCI/MSI interrupt chip code

    - Make MSI sysfs code independent of PCI/MSI and expose the MSI
    interrupts of platform devices in the same way as PCI exposes them.

    Driver changes:

    - Support for ARM GICv3 EPPI partitions

    - Treewide conversion to generic_handle_domain_irq() for all chained
    interrupt controllers

    - Conversion to bitmap_zalloc() throughout the irq chip drivers

    - The usual set of small fixes and improvements"

    * tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
    platform-msi: Add ABI to show msi_irqs of platform devices
    genirq/msi: Move MSI sysfs handling from PCI to MSI core
    genirq/cpuhotplug: Demote debug printk to KERN_DEBUG
    irqchip/qcom-pdc: Trim unused levels of the interrupt hierarchy
    irqdomain: Export irq_domain_disconnect_hierarchy()
    irqchip/gic-v3: Fix priority comparison when non-secure priorities are used
    irqchip/apple-aic: Fix irq_disable from within irq handlers
    pinctrl/rockchip: drop the gpio related codes
    gpio/rockchip: drop irq_gc_lock/irq_gc_unlock for irq set type
    gpio/rockchip: support next version gpio controller
    gpio/rockchip: use struct rockchip_gpio_regs for gpio controller
    gpio/rockchip: add driver for rockchip gpio
    dt-bindings: gpio: change items restriction of clock for rockchip,gpio-bank
    pinctrl/rockchip: add pinctrl device to gpio bank struct
    pinctrl/rockchip: separate struct rockchip_pin_bank to a head file
    pinctrl/rockchip: always enable clock for gpio controller
    genirq: Fix kernel doc indentation
    EDAC/altera: Convert to generic_handle_domain_irq()
    powerpc: Bulk conversion to generic_handle_domain_irq()
    nios2: Bulk conversion to generic_handle_domain_irq()
    ...

    Linus Torvalds
     

24 Aug, 2021

4 commits

  • Replace the magic lookup through the kobject tree with an explicit
    backpointer, given that the device model links are set up and torn
    down at times when I/O is still possible, leading to potential
    NULL or invalid pointer dereferences.

    Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
    Reported-by: syzbot
    Signed-off-by: Christoph Hellwig
    Tested-by: Sven Schnelle
    Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pass in a request_queue and assign disk->queue in __blk_alloc_disk to
    ensure struct gendisk always has a valid ->queue pointer.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210816131910.615153-8-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This was a leftover from the legacy alloc_disk interface. Switch
    the scsi ULPs and dasd to set ->minors directly like all other
    drivers and remove the argument.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Stefan Haberland [dasd]
    Link: https://lore.kernel.org/r/20210816131910.615153-7-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Pass the lockdep name to the low-level __blk_alloc_disk helper and
    hardcode the name there, given that the number of minors and the node_id
    are not very useful information. While this passes a pointless argument
    for non-lockdep builds, that is not really an issue, as disk allocation
    is a probe-time-only slow path.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210816131910.615153-5-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Aug, 2021

1 commit

  • is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the
    following check:

    hctx->fq->flush_rq == req

    but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because:

    1) memory re-order in blk_mq_rq_ctx_init():

    rq->mq_hctx = data->hctx;
    ...
    refcount_set(&rq->ref, 1);

    OR

    2) tag re-use and ->rqs[] isn't updated with new request.

    Fix the issue by re-writing is_flush_rq() as:

    return rq->end_io == flush_end_io;

    which turns out to be simpler to follow and immune to the data race, since
    the WRITE of rq->end_io is ordered before refcount_set(&rq->ref, 1).
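
    A hedged sketch of the rewritten helper, with the ordering argument as a
    comment:

        bool is_flush_rq(struct request *rq)
        {
                /*
                 * Safe without a valid hctx: rq->end_io is written before
                 * refcount_set(&rq->ref, 1) publishes the request, so any
                 * iterator that sees a live ref also sees end_io.
                 */
                return rq->end_io == flush_end_io;
        }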

    Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
    Cc: "Blank-Burian, Markus, Dr."
    Cc: Yufen Yu
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     

17 Aug, 2021

1 commit

    Inside blk_mq_queue_tag_busy_iter() we already grab the request's
    refcount before calling ->fn(), so there is no need to grab it one more
    time in blk_mq_check_expired().

    Also, remove the extra request expiry check in blk_mq_check_expired().

    Cc: Keith Busch
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Reviewed-by: John Garry
    Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     

13 Aug, 2021

1 commit

    We run a test that deletes and recovers devices frequently (two devices
    on the same host), and we found that 'active_queues' grows very large
    after a period of time.

    If device a and device b share a tag set, and a is deleted, then
    blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there
    is only one queue left using the tag set. However, if b is still active,
    the active_queues of b might never be cleared, even once b is deleted.

    Thus, clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.
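
    A hedged sketch of the fix (assuming the queue_set_hctx_shared() helper
    from the blk-mq code of that era):

        static void queue_set_hctx_shared(struct request_queue *q, bool shared)
        {
                struct blk_mq_hw_ctx *hctx;
                int i;

                queue_for_each_hw_ctx(q, hctx, i) {
                        if (shared) {
                                hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
                        } else {
                                /*
                                 * Drop this queue's active_queues count while
                                 * the shared flag is still set.
                                 */
                                blk_mq_tag_idle(hctx);
                                hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
                        }
                }
        }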

    Signed-off-by: Yu Kuai
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com
    Signed-off-by: Jens Axboe

    Yu Kuai
     

11 Aug, 2021

1 commit

  • With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads
    could incur a cache line miss in invoke_softirq() and other places.

    Replace the test with a static key to avoid the potential cache miss.
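
    A hedged sketch of the conversion (key and setup names assumed):

        /* kernel/irq/manage.c */
        DEFINE_STATIC_KEY_FALSE(force_irqthreads_key);

        static int __init setup_forced_irqthreads(char *arg)
        {
                static_branch_enable(&force_irqthreads_key);
                return 0;
        }
        early_param("threadirqs", setup_forced_irqthreads);

        /* include/linux/interrupt.h */
        #define force_irqthreads()	(static_branch_unlikely(&force_irqthreads_key))

    With the static key, the test compiles down to a patched jump instead of a
    memory load, so hot paths such as invoke_softirq() no longer touch the
    variable's cache line.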

    [ tglx: Dropped the IDE part, removed the export and updated blk-mq ]

    Suggested-by: Eric Dumazet
    Signed-off-by: Tanner Love
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Eric Dumazet
    Reviewed-by: Kees Cook
    Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com

    Tanner Love
     

10 Aug, 2021

1 commit


31 Jul, 2021

1 commit

  • Move the sg_timeout and sg_reserved_size fields into the bsg_device and
    scsi_device structures as they have nothing to do with generic block I/O.
    Note that these values are now separate for bsg vs. SCSI device node
    access, but that just matches how /dev/sg vs the other nodes has always
    behaved.

    Link: https://lore.kernel.org/r/20210729064845.1044147-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Martin K. Petersen

    Christoph Hellwig
     

01 Jul, 2021

2 commits

  • All driver uses are gone now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210624081012.256464-1-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pull core block updates from Jens Axboe:

    - disk events cleanup (Christoph)

    - gendisk and request queue allocation simplifications (Christoph)

    - bdev_disk_changed cleanups (Christoph)

    - IO priority improvements (Bart)

    - Chained bio completion trace fix (Edward)

    - blk-wbt fixes (Jan)

    - blk-wbt enable/disable fix (Zhang)

    - Scheduler dispatch improvements (Jan, Ming)

    - Shared tagset scheduler improvements (John)

    - BFQ updates (Paolo, Luca, Pietro)

    - BFQ lock inversion fix (Jan)

    - Documentation improvements (Kir)

    - CLONE_IO block cgroup fix (Tejun)

    - Removal of the ancient and deprecated block dump feature (zhangyi)

    - Discard merge fix (Ming)

    - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
    Yang)

    * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
    block: fix discard request merge
    block/mq-deadline: Remove a WARN_ON_ONCE() call
    blk-mq: update hctx->dispatch_busy in case of real scheduler
    blk: Fix lock inversion between ioc lock and bfqd lock
    bfq: Remove merged request already in bfq_requests_merged()
    block: pass a gendisk to bdev_disk_changed
    block: move bdev_disk_changed
    block: add the events* attributes to disk_attrs
    block: move the disk events code to a separate file
    block: fix trace completion for chained bio
    block/partitions/msdos: Fix typo inidicator -> indicator
    block, bfq: reset waker pointer with shared queues
    block, bfq: check waker only for queues with no in-flight I/O
    block, bfq: avoid delayed merge of async queues
    block, bfq: boost throughput by extending queue-merging times
    block, bfq: consider also creation time in delayed stable merge
    block, bfq: fix delayed stable merge check
    block, bfq: let also stably merged queues enjoy weight raising
    blk-wbt: make sure throttle is enabled properly
    blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
    ...

    Linus Torvalds
     

25 Jun, 2021

1 commit

    Commit 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
    started to support batched io submission by using hctx->dispatch_busy.

    However, blk_mq_update_dispatch_busy() wasn't changed to update
    hctx->dispatch_busy in that commit, so fix the issue by updating
    hctx->dispatch_busy also when a real scheduler is in use.
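
    A hedged sketch of the fix (EWMA constants assumed from the blk-mq.c of
    that era):

        static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
        {
                unsigned int ewma;

                /* the early 'if (hctx->queue->elevator) return;' is dropped */

                ewma = hctx->dispatch_busy;
                if (!ewma && !busy)
                        return;

                ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
                if (busy)
                        ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
                ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;

                hctx->dispatch_busy = ewma;
        }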

    Reported-by: Jan Kara
    Reviewed-by: Jan Kara
    Fixes: 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io")
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20210625020248.1630497-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     

18 Jun, 2021

3 commits

  • Change the type and name of task_struct::state. Drop the volatile and
    shrink it to an 'unsigned int'. Rename it in order to find all uses
    such that we can use READ_ONCE/WRITE_ONCE as appropriate.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Acked-by: Will Deacon
    Acked-by: Daniel Thompson
    Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org

    Peter Zijlstra
     
  • Remove yet another few p->state accesses.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Link: https://lore.kernel.org/r/20210611082838.347475156@infradead.org

    Peter Zijlstra
     
  • Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
    task_is_running(p).

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Davidlohr Bueso
    Acked-by: Geert Uytterhoeven
    Acked-by: Will Deacon
    Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org

    Peter Zijlstra
     

12 Jun, 2021

4 commits

  • All users are gone now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Link: https://lore.kernel.org/r/20210602065345.355274-16-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Add a new API to allocate a gendisk including the request_queue for use
    with blk-mq based drivers. This is to avoid boilerplate code in drivers.
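
    A hedged usage sketch ('my_mq_ops', 'my_fops' and 'driver_data' are
    hypothetical driver names):

        struct blk_mq_tag_set set = {
                .ops		= &my_mq_ops,	/* hypothetical driver ops */
                .nr_hw_queues	= 1,
                .nr_maps	= 1,
                .queue_depth	= 128,
                .numa_node	= NUMA_NO_NODE,
        };
        struct gendisk *disk;
        int ret;

        ret = blk_mq_alloc_tag_set(&set);
        if (ret)
                return ret;

        disk = blk_mq_alloc_disk(&set, driver_data);
        if (IS_ERR(disk))
                return PTR_ERR(disk);
        disk->fops = &my_fops;
        /* disk->queue is already allocated and bound to the tag set */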

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Link: https://lore.kernel.org/r/20210602065345.355274-4-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Don't return the passed in request_queue but a normal error code, and
    drop the elevator_init argument in favor of just calling elevator_init_mq
    directly from dm-rq.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Link: https://lore.kernel.org/r/20210602065345.355274-3-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
    Factor out a helper to initialize a simple single-hw-queue tag_set from
    blk_mq_init_sq_queue. This will allow phasing out blk_mq_init_sq_queue
    in favor of a more symmetric and general API.
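
    A hedged sketch of such a helper (name and defaults assumed from the
    description):

        int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
                        const struct blk_mq_ops *ops, unsigned int queue_depth,
                        unsigned int set_flags)
        {
                memset(set, 0, sizeof(*set));
                set->ops = ops;
                set->nr_hw_queues = 1;		/* the "sq" in the name */
                set->nr_maps = 1;
                set->queue_depth = queue_depth;
                set->numa_node = NUMA_NO_NODE;
                set->flags = set_flags;
                return blk_mq_alloc_tag_set(set);
        }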

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Link: https://lore.kernel.org/r/20210602065345.355274-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jun, 2021

1 commit

    Provided the device driver does not implement dispatch budget accounting
    (which only SCSI does), the loop in __blk_mq_do_dispatch_sched() pulls
    requests from the IO scheduler as long as it is willing to give out any.
    This defeats the scheduling heuristics inside the scheduler by creating
    the false impression that the device can take more IO when it in fact
    cannot.

    For example, with the BFQ IO scheduler on top of a virtio-blk device,
    setting the blkio cgroup weight has barely any impact on the observed
    throughput of async IO, because __blk_mq_do_dispatch_sched() always sucks
    all the IO queued in BFQ out of it. BFQ first submits IO from
    higher-weight cgroups, but when that is all dispatched, it gives out the
    IO of lower-weight cgroups as well. And then we have to wait for all this
    IO to be dispatched to the disk (which means a lot of it actually has to
    complete) before the IO scheduler is queried again for dispatching more
    requests. This completely destroys any service differentiation.

    So grab the request tag already when a request is pulled out of the IO
    scheduler in __blk_mq_do_dispatch_sched(), and do not pull any more
    requests if we cannot get it, because we are unlikely to be able to
    dispatch them. That way only a single request waits in the dispatch list
    for some tag to free up.
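
    A hedged sketch of the early tag grab inside the dispatch loop (abridged;
    budget handling omitted):

        rq = e->type->ops.dispatch_request(hctx);
        if (!rq)
                break;

        /*
         * Grab a tag for the request right away. If none is available,
         * park this single request in the dispatch list and stop pulling
         * more from the scheduler: we could not dispatch them anyway.
         */
        if (!blk_mq_get_driver_tag(rq)) {
                list_add(&rq->queuelist, &rq_list);
                break;
        }
        list_add_tail(&rq->queuelist, &rq_list);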

    Reviewed-by: Ming Lei
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
    Signed-off-by: Jens Axboe

    Jan Kara
     

24 May, 2021

5 commits

  • The tags used for an IO scheduler are currently per hctx.

    As such, when q->nr_hw_queues grows, so does the request queue total IO
    scheduler tag depth.

    This may cause problems for SCSI MQ HBAs whose total driver depth is
    fixed.

    Ming and Yanhui report higher CPU usage and lower throughput in scenarios
    where the fixed total driver tag depth is appreciably lower than the total
    scheduler tag depth:
    https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b

    In that scenario, since the scheduler tag is obtained first, much
    contention is introduced, as a driver tag may not be available after we
    have got the sched tag.

    Improve this scenario by introducing request-queue-wide tags for when a
    tagset-wide sbitmap is used. The static sched requests are still allocated
    per hctx, as requests are initialised per hctx, as in
    blk_mq_init_request(..., hctx_idx, ...) ->
    set->ops->init_request(..., hctx_idx, ...).

    For simplicity of resizing the request queue sbitmap when updating the
    request queue depth, just init it at the max possible size, so we don't
    need to deal with swapping in a new sbitmap for the old one if we need
    to grow.

    Signed-off-by: John Garry
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe

    John Garry
     
    The tag allocation code that allocates the sbitmap pairs is common to
    regular bitmap tags and the shared sbitmap, so refactor it into a common
    function.

    Also remove the superfluous "flags" argument from
    blk_mq_init_shared_sbitmap().

    Signed-off-by: John Garry
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/1620907258-30910-2-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe

    John Garry
     
    Before we free the request queue, clear the flush request reference in
    tags->rqs[], so that a potential use-after-free can be avoided.

    Based on one patch written by David Jeffery.

    Tested-by: John Garry
    Reviewed-by: Bart Van Assche
    Reviewed-by: David Jeffery
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20210511152236.763464-5-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     
    refcount_inc_not_zero() in bt_tags_iter() may still read a freed request.

    Fix the issue by the following approach:

    1) hold a per-tags spinlock when reading ->rqs[tag] and calling
    refcount_inc_not_zero() in bt_tags_iter();

    2) clear stale requests referred to via ->rqs[tag] before freeing the
    request pool, holding the per-tags spinlock while clearing the stale
    ->rqs[tag].

    So after we have cleared the stale requests, bt_tags_iter() won't observe
    a freed request any more; the clearing also waits for any pending request
    references to be dropped.

    The idea of clearing ->rqs[] is borrowed from John Garry's previous patch
    and a recent patch from David.
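
    A hedged sketch of the locked lookup described in points 1) and 2) (the
    helper name is an assumption):

        static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
                        unsigned int bitnr)
        {
                struct request *rq;
                unsigned long flags;

                spin_lock_irqsave(&tags->lock, flags);
                rq = tags->rqs[bitnr];
                if (!rq || !refcount_inc_not_zero(&rq->ref))
                        rq = NULL;
                spin_unlock_irqrestore(&tags->lock, flags);
                return rq;
        }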

    Tested-by: John Garry
    Reviewed-by: David Jeffery
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20210511152236.763464-4-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     
    Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter(); this
    prevents the request from being re-used while ->fn is running. The
    approach is the same as what we do when handling timeouts.

    Fix request use-after-free (UAF) related to completion races or queue
    releasing:

    - If a rq is referred to before rq->q is frozen, then the queue won't be
    frozen before the request is released during iteration.

    - If a rq is referred to after rq->q is frozen, refcount_inc_not_zero()
    will return false, and we won't iterate over this request.

    However, one request UAF is still not covered: refcount_inc_not_zero() may
    read an already freed request; that is handled in the next patch.

    Tested-by: John Garry
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     

14 May, 2021

1 commit

  • If a tag set is shared across request queues (e.g. SCSI LUNs) then the
    block layer core keeps track of the number of active request queues in
    tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
    atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
    sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
    cleared by blk_mq_del_queue_tag_set().
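
    A hedged sketch of the resulting ordering in blk_mq_exit_queue() (comments
    per the reasoning above; not the verbatim diff):

        void blk_mq_exit_queue(struct request_queue *q)
        {
                struct blk_mq_tag_set *set = q->tag_set;

                /*
                 * Checks hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED and calls
                 * blk_mq_tag_idle(), so it must run first.
                 */
                blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
                /* May clear BLK_MQ_F_TAG_QUEUE_SHARED in hctx->flags. */
                blk_mq_del_queue_tag_set(q);
        }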

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Fixes: 0d2602ca30e4 ("blk-mq: improve support for shared tags maps")
    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche