27 Sep, 2022

1 commit

  • This is the 5.15.70 stable release

    * tag 'v5.15.70': (2444 commits)
    Linux 5.15.70
    ALSA: hda/sigmatel: Fix unused variable warning for beep power change
    cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6ul.dtsi
    arch/arm/mm/mmu.c
    arch/arm64/boot/dts/freescale/imx8mp-evk.dts
    drivers/gpu/drm/imx/dcss/dcss-kms.c
    drivers/media/platform/nxp/imx-jpeg/mxc-jpeg.c
    drivers/media/platform/nxp/imx-jpeg/mxc-jpeg.h
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/soc/fsl/Kconfig
    drivers/soc/imx/gpcv2.c
    drivers/usb/dwc3/host.c
    net/dsa/slave.c
    sound/soc/fsl/imx-card.c

    Jason Liu
     

23 Sep, 2022

1 commit

  • [ Upstream commit 56f99b8d06ef1ed1c9730948f9f05ac2b930a20b ]

    Today blk_queue_enter() and __bio_queue_enter() return -EBUSY for the
    nowait code path. This is not correct: they should return -EAGAIN
    instead.

    This problem was detected by fio. The following command exposed the
    above problem:

    t/io_uring -p0 -d128 -b4096 -s32 -c32 -F1 -B0 -R0 -X1 -n24 -P1 -u1 -O0 /dev/ng0n1

    By applying the patch, the retry case is handled correctly in the slow
    path.
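    As a hedged illustration (not the kernel code itself), the nowait contract can be sketched in Python: a nonblocking submitter treats -EAGAIN as "try again later" but any other error, including -EBUSY, as a hard failure, which is why the wrong return value broke the retry slow path. All names below are made up for the sketch.

```python
import errno

def submit_nowait(queue_enter_result):
    # Hypothetical nonblocking submitter: -EAGAIN means "try again later",
    # any other negative value (e.g. -EBUSY) aborts the IO outright.
    if queue_enter_result == -errno.EAGAIN:
        return "retry"       # the slow path re-attempts the request
    if queue_enter_result < 0:
        return "fail"        # hard failure, no retry
    return "submitted"

# Before the fix blk_queue_enter() returned -EBUSY on the nowait path,
# so the retry case was never taken; -EAGAIN makes it work.
print(submit_nowait(-errno.EBUSY))   # fail
print(submit_nowait(-errno.EAGAIN))  # retry
```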

    Signed-off-by: Stefan Roesch
    Fixes: bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Stefan Roesch
     

31 Aug, 2022

1 commit

  • commit 65fac0d54f374625b43a9d6ad1f2c212bd41f518 upstream.

    Currently, in virtio_scsi, if 'bd->last' is not set to true while
    dispatching a request, such IO will stay in the driver's queue, and the
    driver will wait for the block layer to dispatch more requests. However,
    if the block layer fails to dispatch more requests, it should trigger
    commit_rqs to inform the driver.

    There is a problem in blk_mq_try_issue_list_directly() that commit_rqs
    won't be called:

    // assume that queue_depth is set to 1, and that list contains two rqs
    blk_mq_try_issue_list_directly
        blk_mq_request_issue_directly
        // dispatch first rq, last is false
            __blk_mq_try_issue_directly
                blk_mq_get_dispatch_budget
                // succeed to get first budget
                __blk_mq_issue_directly
                    scsi_queue_rq
                        cmd->flags |= SCMD_LAST
                        virtscsi_queuecommand
                            kick = (sc->flags & SCMD_LAST) != 0
                            // kick is false, first rq won't issue to disk
        queued++

        blk_mq_request_issue_directly
        // dispatch second rq
            __blk_mq_try_issue_directly
                blk_mq_get_dispatch_budget
                // failed to get second budget
            ret == BLK_STS_RESOURCE
            blk_mq_request_bypass_insert
        // errors is still 0

        if (!list_empty(list) || errors && ...)
        // won't pass, commit_rqs won't be called

    In this situation, the first rq relies on the second rq to be
    dispatched, while the second rq relies on the first rq to complete,
    thus they will both hang.

    Fix the problem by also treating 'BLK_STS_*RESOURCE' as 'errors', since
    it means the request was not queued successfully.

    The same problem exists in blk_mq_dispatch_rq_list();
    'BLK_STS_*RESOURCE' can't be treated as 'errors' there, so fix the
    problem by calling commit_rqs if queue_rq returns 'BLK_STS_*RESOURCE'.
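    The deadlock logic can be modeled with a small Python sketch (illustrative only; the names and the budget model are simplified assumptions, not kernel APIs): counting a failed budget grab as an 'error' is what lets commit_rqs fire and kick the already-queued request.

```python
def issue_list_directly(budgets, rqs, resource_counts_as_error):
    # Toy model of blk_mq_try_issue_list_directly(): each request needs one
    # budget; on a failed grab the rq is bypass-inserted (removed from the
    # list), so afterwards only a non-zero 'errors' can trigger commit_rqs.
    errors = 0
    for _rq in rqs:
        if budgets > 0:
            budgets -= 1      # issued, but not kicked (last == false)
        else:
            # budget grab failed -> BLK_STS_RESOURCE, bypass insert
            if resource_counts_as_error:
                errors += 1   # the fix: *RESOURCE now counts as an error
    return errors > 0         # True means commit_rqs gets called

# queue_depth 1, two requests: without the fix commit_rqs never runs and
# the first request is never kicked to the device.
print(issue_list_directly(1, ["rq1", "rq2"], resource_counts_as_error=False))
print(issue_list_directly(1, ["rq1", "rq2"], resource_counts_as_error=True))
```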

    Fixes: d666ba98f849 ("blk-mq: add mq_ops->commit_rqs()")
    Signed-off-by: Yu Kuai
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Yu Kuai
     

17 Aug, 2022

5 commits

  • [ Upstream commit 14a6e2eb7df5c7897c15b109cba29ab0c4a791b6 ]

    In our test of iocost, we encountered some list add/del corruptions of
    inner_walk list in ioc_timer_fn.

    The reason can be described as follows:

    cpu 0                               cpu 1
    ioc_qos_write                       ioc_qos_write

    ioc = q_to_ioc(queue);
    if (!ioc) {
        ioc = kzalloc();
                                        ioc = q_to_ioc(queue);
                                        if (!ioc) {
                                            ioc = kzalloc();
                                            ...
                                            rq_qos_add(q, rqos);
                                        }
        ...
        rq_qos_add(q, rqos);
        ...
    }

    When the io.cost.qos file is written by two CPUs concurrently, rq_qos may
    be added to one disk twice. In that case, there will be two iocs enabled
    and running on one disk. They own different iocgs on their active lists.
    In the ioc_timer_fn function, because the iocgs from the two iocs have
    the same root iocg, the root iocg's walk_list may be overwritten by each
    other, and this leads to list add/del corruptions when building or
    destroying the inner_walk list.

    So far, the blk-rq-qos framework works on the assumption that there is
    one instance of a given rq_qos type per queue. This patch makes that
    assumption explicit and also fixes the crash above.
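    A minimal Python sketch of the rule the patch makes explicit (hypothetical dict-based structures, not the kernel's rq_qos API): a second rq_qos of the same type is rejected instead of being added to the queue twice.

```python
def rq_qos_add(queue_rqos_list, rqos):
    # One rq_qos instance per type per queue: if an instance with the same
    # id already exists, reject the add so the losing writer can free its
    # own allocation instead of enabling a duplicate controller.
    if any(existing["id"] == rqos["id"] for existing in queue_rqos_list):
        return False
    queue_rqos_list.append(rqos)
    return True

q = []
print(rq_qos_add(q, {"id": "iocost"}))  # first concurrent writer wins
print(rq_qos_add(q, {"id": "iocost"}))  # second writer is rejected
print(len(q))                           # exactly one ioc on the disk
```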

    Signed-off-by: Jinke Han
    Reviewed-by: Muchun Song
    Acked-by: Tejun Heo
    Cc:
    Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jinke Han
     
  • [ Upstream commit 325347d965e7ccf5424a05398807a6d801846612 ]

    There are cases where a bio may not accept additional pages, and the iov
    needs to advance to the last data length that was accepted. The zone
    append path used to handle this correctly, but it was inadvertently
    broken when its setup was made common with the normal r/w case.

    Fixes: 576ed9135489c ("block: use bio_add_page in bio_iov_iter_get_pages")
    Fixes: c58c0074c54c2 ("block/bio: remove duplicate append pages code")
    Signed-off-by: Keith Busch
    Link: https://lore.kernel.org/r/20220712153256.2202024-1-kbusch@fb.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Keith Busch
     
  • [ Upstream commit c58c0074c54c2e2bb3bb0d5a4d8896bb660cc8bc ]

    The getting pages setup for zone append and normal IO are identical. Use
    common code for each.

    Signed-off-by: Keith Busch
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220610195830.3574005-3-kbusch@fb.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Keith Busch
     
  • [ Upstream commit f3ec5d11554778c24ac8915e847223ed71d104fc ]

    blk_mq_debugfs_register_hctx() can be called by blk_mq_update_nr_hw_queues
    when the gendisk isn't added yet, such as for nvme tcp.

    This fixes the warning 'debugfs: Directory 'hctx0' with parent '/' already
    present!', which can be observed reliably when running blktests nvme/005.

    Fixes: 6cfc0081b046 ("blk-mq: no need to check return value of debugfs_create functions")
    Reported-by: Yi Zhang
    Signed-off-by: Ming Lei
    Tested-by: Yi Zhang
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220711090808.259682-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     
  • [ Upstream commit b82d9fa257cb3725c49d94d2aeafc4677c34448a ]

    Returning 0 early from __bio_iov_append_get_pages() for the
    max_append_sectors warning just creates an infinite loop since 0 means
    success, and the bio will never fill from the unadvancing iov_iter. We
    could turn the return into an error value, but it will already be turned
    into an error value later on, so just remove the warning. Clearly no one
    ever hit it anyway.

    Fixes: 0512a75b98f84 ("block: Introduce REQ_OP_ZONE_APPEND")
    Signed-off-by: Keith Busch
    Reviewed-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20220610195830.3574005-2-kbusch@fb.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Keith Busch
     

11 Aug, 2022

1 commit

  • commit e589f46445960c274cc813a1cc8e2fc73b2a1849 upstream.

    Commit e70344c05995 ("block: fix default IO priority handling")
    introduced an inconsistency in get_current_ioprio(): tasks without an
    IO context return IOPRIO_DEFAULT priority while tasks with a freshly
    allocated IO context return 0 (IOPRIO_CLASS_NONE/0) IO priority.
    Tasks without an IO context used to be rare before 5a9d041ba2f6 ("block:
    move io_context creation into where it's needed") but after this commit
    they became common, because now only the BFQ IO scheduler sets up a
    task's IO context. A similar inconsistency exists for get_task_ioprio(),
    so this inconsistency is now exposed to userspace, and userspace will see
    different IO priority for tasks operating on devices with BFQ compared
    to devices without BFQ. Furthermore, the changes done by commit
    e70344c05995 change the behavior when no IO priority is set for the BFQ
    IO scheduler, which is also documented in the ioprio_set(2) manpage:

    "If no I/O scheduler has been set for a thread, then by default the I/O
    priority will follow the CPU nice value (setpriority(2)). In Linux
    kernels before version 2.6.24, once an I/O priority had been set using
    ioprio_set(), there was no way to reset the I/O scheduling behavior to
    the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
    reset to the default I/O scheduling behavior."

    So make sure we default to IOPRIO_CLASS_NONE as was the case
    before commit e70344c05995. Also clean up alloc_io_context() to
    explicitly set this IO priority for the allocated IO context to avoid
    future surprises. Note that we tweak ioprio_best() to maintain
    ioprio_get(2) behavior and make this commit easily backportable.
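    The restored default can be sketched in Python (hedged: the constants and helper are simplified stand-ins for the kernel's ioprio encoding, not its actual definitions): with the fix, "no IO context" and "freshly allocated IO context" report the same IOPRIO_CLASS_NONE value.

```python
IOPRIO_CLASS_NONE = 0
IOPRIO_CLASS_BE = 2
IOPRIO_CLASS_SHIFT = 13  # class lives in the top bits of the ioprio value

def ioprio_value(cls, data):
    return (cls << IOPRIO_CLASS_SHIFT) | data

def get_current_ioprio(io_context):
    # Fixed behaviour: a missing IO context and a fresh one (allocated with
    # the explicit IOPRIO_CLASS_NONE default) yield the same answer.
    default = ioprio_value(IOPRIO_CLASS_NONE, 0)
    if io_context is None:
        return default
    return io_context.get("ioprio", default)

print(get_current_ioprio(None) == get_current_ioprio({}))  # consistent now
print(get_current_ioprio({"ioprio": ioprio_value(IOPRIO_CLASS_BE, 4)}))
```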

    CC: stable@vger.kernel.org
    Fixes: e70344c05995 ("block: fix default IO priority handling")
    Reviewed-by: Damien Le Moal
    Tested-by: Damien Le Moal
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

12 Jul, 2022

3 commits

  • [ Upstream commit aa1b46dcdc7baaf5fec0be25782ef24b26aa209e ]

    a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't
    tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set.
    While this fixed a potential oops, it also broke blk-iocost by skipping the
    done_bio callback for merged bios.

    Before, whether a bio went through rq_qos_throttle() or rq_qos_merge(),
    rq_qos_done_bio() would be called on the bio on completion, with
    BIO_TRACKED distinguishing the former from the latter. After the commit,
    rq_qos_done_bio() is no longer called for bios which went through
    rq_qos_merge(). This royally confuses blk-iocost, as the merged bios
    never finish and are considered perpetually in-flight.

    One reliably reproducible failure mode is an intermediate cgroup getting
    stuck active, preventing its children from being activated due to the
    leaf-only rule, leading to loss of control. The following is from the
    resctl-bench protection scenario, which emulates isolating a web-server-
    like workload from a memory bomb, run on an iocost configuration which
    should yield a reasonable level of protection.

    # cat /sys/block/nvme2n1/device/model
    Samsung SSD 970 PRO 512GB
    # cat /sys/fs/cgroup/io.cost.model
    259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025
    # cat /sys/fs/cgroup/io.cost.qos
    259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00
    # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
    ...
    Memory Hog Summary
    ==================

    IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m
    W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m

    Isolation and Request Latency Impact Distributions:

    min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev
    isol% 15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82
    lat-imp% 0 0 0 0 0 4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6

    Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96%

    The isolation result of 58.12% is close to what this device would show
    without any IO control.

    Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and
    calling rq_qos_done_bio() on them too. For consistency and clarity, rename
    BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into
    rq_qos_done_bio() so that it's next to the code paths that set the flags.

    With the patch applied, the above same benchmark shows:

    # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
    ...
    Memory Hog Summary
    ==================

    IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m
    W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m

    Isolation and Request Latency Impact Distributions:

    min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev
    isol% 84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42 2.81
    lat-imp% 0 0 0 0 0 2.81 5.73 11.11 13.92 17.53 22.61 4.10 4.68

    Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0%
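    As a hedged Python sketch of the flag logic described above (the flag values are illustrative, not the kernel's): either BIO_QOS flag now routes the completion to the qos policies, so merged bios are no longer lost.

```python
BIO_QOS_THROTTLED = 1 << 0  # set by rq_qos_throttle()
BIO_QOS_MERGED    = 1 << 1  # set by rq_qos_merge() (new with the fix)

def rq_qos_done_bio(bio_flags):
    # The flag check now lives next to the paths that set the flags:
    # a bio that was either throttled or merged must report completion.
    return bool(bio_flags & (BIO_QOS_THROTTLED | BIO_QOS_MERGED))

print(rq_qos_done_bio(BIO_QOS_MERGED))     # merged bios now complete
print(rq_qos_done_bio(BIO_QOS_THROTTLED))  # throttled bios, as before
print(rq_qos_done_bio(0))                  # untracked bios still skip qos
```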

    Signed-off-by: Tejun Heo
    Fixes: a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked")
    Cc: stable@vger.kernel.org # v5.15+
    Cc: Ming Lei
    Cc: Yu Kuai
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tejun Heo
     
  • [ Upstream commit 90b8faa0e8de1b02b619fb33f6c6e1e13e7d1d70 ]

    We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even
    though we may not even have an rq_qos handler. Only mark it as TRACKED if
    it really is potentially tracked.

    This saves considerable time for the case where the bio isn't tracked:

    2.64% -1.65% [kernel.vmlinux] [k] bio_endio

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     
  • [ Upstream commit 3caee4634be68e755d2fb130962f1623661dbd5b ]

    Convert bdev->bd_disk->queue to bdev_get_queue(); it uses a cached
    queue pointer and so is faster.

    Signed-off-by: Pavel Begunkov
    Link: https://lore.kernel.org/r/85c36ea784d285a5075baa10049e6b59e15fb484.1634219547.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     

30 Jun, 2022

2 commits

  • This is the 5.15.50 stable release

    * tag 'v5.15.50': (1395 commits)
    Linux 5.15.50
    arm64: mm: Don't invalidate FROM_DEVICE buffers at start of DMA transfer
    serial: core: Initialize rs485 RTS polarity already on probe
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/bus/fsl-mc/fsl-mc-bus.c
    drivers/crypto/caam/ctrl.c
    drivers/pci/controller/dwc/pci-imx6.c
    drivers/spi/spi-fsl-qspi.c
    drivers/tty/serial/fsl_lpuart.c
    include/uapi/linux/dma-buf.h

    Jason Liu
     
  • This is the 5.15.41 stable release

    * tag 'v5.15.41': (1977 commits)
    Linux 5.15.41
    usb: gadget: uvc: allow for application to cleanly shutdown
    usb: gadget: uvc: rename function to be more consistent
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm64/boot/dts/freescale/fsl-ls1043a.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi
    arch/arm64/configs/defconfig
    drivers/clk/imx/clk-imx8qxp-lpcg.c
    drivers/dma/imx-sdma.c
    drivers/gpu/drm/bridge/nwl-dsi.c
    drivers/mailbox/imx-mailbox.c
    drivers/net/phy/at803x.c
    drivers/tty/serial/fsl_lpuart.c
    security/keys/trusted-keys/trusted_core.c

    Jason Liu
     

22 Jun, 2022

1 commit

  • [ Upstream commit 14dc7a18abbe4176f5626c13c333670da8e06aa1 ]

    This patch prevents test nvme/004 from triggering the following:

    UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
    index 512 is out of range for type 'long unsigned int [512]'
    Call Trace:
    show_stack+0x52/0x58
    dump_stack_lvl+0x49/0x5e
    dump_stack+0x10/0x12
    ubsan_epilogue+0x9/0x3b
    __ubsan_handle_out_of_bounds.cold+0x44/0x49
    blk_mq_alloc_request_hctx+0x304/0x310
    __nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
    nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
    nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
    nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
    nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
    nvmf_dev_write+0xae/0x111 [nvme_fabrics]
    vfs_write+0x144/0x560
    ksys_write+0xb7/0x140
    __x64_sys_write+0x42/0x50
    do_syscall_64+0x35/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
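    The shape of the fix can be sketched in Python (hypothetical helper; the point is that the CPU picked for the hctx must be validated against nr_cpu_ids before being used as an array index): an empty mask yields the nr_cpu_ids sentinel, which must fail the allocation rather than index the array.

```python
def pick_hctx_cpu(online_cpus_in_mask, nr_cpu_ids=512):
    # cpumask_first_and()-style helper: returns the lowest eligible CPU,
    # or the nr_cpu_ids sentinel when the intersection is empty.
    cpu = min(online_cpus_in_mask, default=nr_cpu_ids)
    if cpu >= nr_cpu_ids:
        return None  # fail the request allocation, don't index [512]
    return cpu

print(pick_hctx_cpu([]))    # None: the out-of-bounds case is now rejected
print(pick_hctx_cpu([3]))   # 3
```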

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Bart Van Assche
     

15 Jun, 2022

3 commits

  • [ Upstream commit 605f7415ecfb426610195dd6c7577b30592b3369 ]

    Most of bioset_exit() is fine being called twice, as it clears the
    various allocations etc. when they are freed. The exception is
    bio_alloc_cache_destroy(), which does not clear ->cache when it has
    freed it.

    This isn't necessarily a bug, but it can become one if a buggy user
    calls the exit path more than once, or calls it on a bioset that has
    merely been memset() and never been initialized. dm appears to be one
    such user.
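    A small Python sketch of the idempotency fix (a hypothetical dict-based bioset stands in for the kernel struct): clearing ->cache after freeing makes a second exit call a no-op.

```python
def bio_alloc_cache_destroy(bioset):
    # After the fix: free the cache and clear the pointer, so calling the
    # exit path again (or on a zeroed bioset) finds nothing to tear down.
    if bioset.get("cache") is None:
        return
    bioset["cache"] = None  # stands in for "free, then NULL the pointer"

bs = {"cache": object()}
bio_alloc_cache_destroy(bs)
bio_alloc_cache_destroy(bs)  # second call is now safe (no double free)
print(bs["cache"])           # None
```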

    Fixes: be4d234d7aeb ("bio: add allocation cache abstraction")
    Link: https://lore.kernel.org/linux-block/YpK7m+14A+pZKs5k@casper.infradead.org/
    Reported-by: Matthew Wilcox
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     
  • [ Upstream commit 403d50341cce6b5481a92eb481e6df60b1f49b55 ]

    Apparently bcache can copy into bios that do not just contain fresh
    pages but can have offsets into the bio_vecs. Restore support for that
    in bio_copy_data_iter.

    Fixes: f8b679a070c5 ("block: rewrite bio_copy_data_iter to use bvec_kmap_local and memcpy_to_bvec")
    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220524143919.1155501-1-hch@lst.de
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     
  • [ Upstream commit 5d05426e2d5fd7df8afc866b78c36b37b00188b7 ]

    blk_mq_run_hw_queues() can be run when there is no queued request and
    after the queue has been cleaned up; at that point the tagset is freed,
    because tagset lifetime is covered by the driver and it is often freed
    after blk_cleanup_queue() returns.

    So don't touch ->tagset for figuring out the current default hctx; use
    the mapping built in the request queue instead, so a use-after-free on
    the tagset can be avoided. Meanwhile this way should be faster than
    retrieving the mapping from the tagset.

    Cc: "yukuai (C)"
    Cc: Jan Kara
    Fixes: b6e68ee82585 ("blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues")
    Signed-off-by: Ming Lei
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20220522122350.743103-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

09 Jun, 2022

13 commits

  • commit 22b106e5355d6e7a9c3b5cb5ed4ef22ae585ea94 upstream.

    Commit d92c370a16cb ("block: really clone the block cgroup in
    bio_clone_blkg_association") changed bio_clone_blkg_association() to
    just clone bio->bi_blkg reference from source to destination bio. This
    is however wrong if the source and destination bios are against
    different block devices because struct blkcg_gq is different for each
    bdev-blkcg pair. This will result in IOs being accounted (and throttled
    as a result) multiple times against the same device (src bdev) while
    throttling of the other device (dst bdev) is ignored. In case of BFQ the
    inconsistency can even result in crashes in bfq_bic_update_cgroup().
    Fix the problem by looking up correct blkcg_gq for the cloned bio.

    Reported-by: Logan Gunthorpe
    Reported-and-tested-by: Donald Buczek
    Fixes: d92c370a16cb ("block: really clone the block cgroup in bio_clone_blkg_association")
    CC: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20220602081242.7731-1-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 8a177a36da6c54c98b8685d4f914cb3637d53c0d upstream.

    iolatency needs to track the number of inflight IOs per cgroup. As this
    tracking can be expensive, it is disabled when no cgroup has iolatency
    configured for the device. To ensure that the inflight counters stay
    balanced, iolatency_set_limit() freezes the request_queue while manipulating
    the enabled counter, which ensures that no IO is in flight and thus all
    counters are zero.

    Unfortunately, iolatency_set_limit() isn't the only place where the
    enabled counter is manipulated. iolatency_pd_offline() can also
    decrement the counter and trigger disabling. As this disabling happens
    without freezing the queue, it can easily happen while some IOs are in
    flight and thus leak the counts.

    This can be easily demonstrated by turning on iolatency in one empty
    cgroup while IOs are in flight in other cgroups and then removing the
    cgroup. Note that iolatency shouldn't have been enabled elsewhere in the
    system, to ensure that removing the cgroup disables iolatency for the
    whole device.

    The following keeps flipping on and off iolatency on sda:

    echo +io > /sys/fs/cgroup/cgroup.subtree_control
    while true; do
        mkdir -p /sys/fs/cgroup/test
        echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
        sleep 1
        rmdir /sys/fs/cgroup/test
        sleep 1
    done

    and there's concurrent fio generating direct rand reads:

    fio --name test --filename=/dev/sda --direct=1 --rw=randread \
        --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

    while monitoring with the following drgn script:

    while True:
        for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
            for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
                blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
                pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
                if pd.value_() == 0:
                    continue
                iolat = container_of(pd, 'struct iolatency_grp', 'pd')
                inflight = iolat.rq_wait.inflight.counter.value_()
                if inflight:
                    print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                          f'{cgroup_path(css.cgroup).decode("utf-8")}')
        time.sleep(1)

    The monitoring output looks like the following:

    inflight=1 sda /user.slice
    inflight=1 sda /user.slice
    ...
    inflight=14 sda /user.slice
    inflight=13 sda /user.slice
    inflight=17 sda /user.slice
    inflight=15 sda /user.slice
    inflight=18 sda /user.slice
    inflight=17 sda /user.slice
    inflight=20 sda /user.slice
    inflight=19 sda /user.slice

    Cc: Josef Bacik
    Cc: Liu Bo
    Fixes: 8c772a9bfc7c ("blk-iolatency: fix IO hang due to negative inflight counter")
    Cc: stable@vger.kernel.org # v5.0+
    Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 075a53b78b815301f8d3dd1ee2cd99554e34f0dd upstream.

    Bios queued into the BFQ IO scheduler can be associated with a cgroup
    that was already offlined. This may then cause insertion of this
    bfq_group into a service tree. But this bfq_group will get freed as soon
    as the last bio associated with it completes, leading to use-after-free
    issues for service tree users. Fix the problem by making sure we always
    operate on an online bfq_group. If the bfq_group associated with the bio
    is not online, we pick the first online parent.
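    The parent walk can be sketched in Python (hedged: dict-based groups stand in for struct bfq_group, and the online flag is tracked as a plain boolean): an offline group is skipped in favor of its first online ancestor.

```python
def first_online_group(bfqg):
    # Walk up the parent chain until we find a group still marked online;
    # returns None if the whole chain is already offline.
    while bfqg is not None and not bfqg["online"]:
        bfqg = bfqg["parent"]
    return bfqg

root = {"name": "root", "online": True, "parent": None}
dead = {"name": "dead", "online": False, "parent": root}
print(first_online_group(dead)["name"])  # root: offline group is skipped
print(first_online_group(root)["name"])  # root: online group used as-is
```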

    CC: stable@vger.kernel.org
    Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Tested-by: "yukuai (C)"
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-9-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 4e54a2493e582361adc3bfbf06c7d50d19d18837 upstream.

    BFQ's usage of __bio_blkcg() is a relic of the past. Furthermore, if
    a bio were not associated with any blkcg, the usage of __bio_blkcg() in
    BFQ would be prone to races with the task being migrated between
    cgroups, as __bio_blkcg() calls at different places could return
    different blkcgs.

    Convert BFQ to the new situation where bio->bi_blkg is initialized in
    bio_set_dev() and thus practically always valid. This allows us to save
    blkcg_gq lookup and noticeably simplify the code.

    CC: stable@vger.kernel.org
    Fixes: 0fe061b9f03c ("blkcg: fix ref count issue with bio_blkcg() using task_css")
    Tested-by: "yukuai (C)"
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-8-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 09f871868080c33992cd6a9b72a5ca49582578fa upstream.

    Track whether a bfq_group is still online. We cannot rely on
    blkcg_gq->online because that gets cleared only after all policies are
    offlined, and we need something that gets updated already under
    bfqd->lock when we are cleaning up our bfq_group, to be able to
    guarantee that when we see an online bfq_group, it will stay online
    while we are holding bfqd->lock.

    CC: stable@vger.kernel.org
    Tested-by: "yukuai (C)"
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-7-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 5f550ede5edf846ecc0067be1ba80514e6fe7f8e upstream.

    We call bfq_init_rq() from request merging functions where the requests
    we get should have already gone through bfq_init_rq() during insert,
    and anyway we want to do anything only if the request is already
    tracked by BFQ. So replace calls to bfq_init_rq() with RQ_BFQQ() to
    simply skip requests untracked by BFQ. We move the bfq_init_rq() call
    in bfq_insert_request() a bit earlier to cover request merging and can
    thus transfer the FIFO position in case of a merge.

    CC: stable@vger.kernel.org
    Tested-by: "yukuai (C)"
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-6-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit fc84e1f941b91221092da5b3102ec82da24c5673 upstream.

    In bfq_insert_request() we unlock bfqd->lock only to call
    trace_block_rq_insert() and then lock bfqd->lock again. This is really
    pointless, since tracing is disabled if we really care about
    performance, and even if the tracepoint is enabled it is a quick call.

    CC: stable@vger.kernel.org
    Tested-by: "yukuai (C)"
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-5-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit ea591cd4eb270393810e7be01feb8fde6a34fbbe upstream.

    When the process is migrated to a different cgroup (or in case of
    writeback just starts submitting bios associated with a different
    cgroup) bfq_merge_bio() can operate with stale cgroup information in
    bic. Thus the bio can be merged to a request from a different cgroup or
    it can result in merging of bfqqs for different cgroups or bfqqs of
    already dead cgroups and causing possible use-after-free issues. Fix the
    problem by updating cgroup information in bfq_merge_bio().

    CC: stable@vger.kernel.org
    Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Tested-by: "yukuai (C)"
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-4-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 3bc5e683c67d94bd839a1da2e796c15847b51b69 upstream.

    When a bfqq is shared by multiple processes, it can happen that one of
    the processes gets moved to a different cgroup (or just starts
    submitting IO for a different cgroup). In case that happens, we need to
    split the merged bfqq, as otherwise we will have IO for multiple cgroups
    in one bfqq and we will just account IO time to the wrong entities etc.

    Similarly, if the bfqq is scheduled to merge with another bfqq but the
    merge hasn't happened yet, cancel the merge, as it need not be valid
    anymore.

    CC: stable@vger.kernel.org
    Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Tested-by: "yukuai (C)"
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-3-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit c1cee4ab36acef271be9101590756ed0c0c374d9 upstream.

    It can happen that the parent of a bfqq changes between the moment we
    decide two queues are worth merging (and set bic->stable_merge_bfqq)
    and the moment bfq_setup_merge() is called. This can happen e.g. because
    the process submitted IO for a different cgroup and thus the bfqq got
    reparented. It can even happen that the bfqq we are merging with has a
    parent cgroup that is already offline and going to be destroyed, in
    which case the merge can lead to use-after-free issues such as:

    BUG: KASAN: use-after-free in __bfq_deactivate_entity+0x9cb/0xa50
    Read of size 8 at addr ffff88800693c0c0 by task runc:[2:INIT]/10544

    CPU: 0 PID: 10544 Comm: runc:[2:INIT] Tainted: G E 5.15.2-0.g5fb85fd-default #1 openSUSE Tumbleweed (unreleased) f1f3b891c72369aebecd2e43e4641a6358867c70
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
    Call Trace:

    dump_stack_lvl+0x46/0x5a
    print_address_description.constprop.0+0x1f/0x140
    ? __bfq_deactivate_entity+0x9cb/0xa50
    kasan_report.cold+0x7f/0x11b
    ? __bfq_deactivate_entity+0x9cb/0xa50
    __bfq_deactivate_entity+0x9cb/0xa50
    ? update_curr+0x32f/0x5d0
    bfq_deactivate_entity+0xa0/0x1d0
    bfq_del_bfqq_busy+0x28a/0x420
    ? resched_curr+0x116/0x1d0
    ? bfq_requeue_bfqq+0x70/0x70
    ? check_preempt_wakeup+0x52b/0xbc0
    __bfq_bfqq_expire+0x1a2/0x270
    bfq_bfqq_expire+0xd16/0x2160
    ? try_to_wake_up+0x4ee/0x1260
    ? bfq_end_wr_async_queues+0xe0/0xe0
    ? _raw_write_unlock_bh+0x60/0x60
    ? _raw_spin_lock_irq+0x81/0xe0
    bfq_idle_slice_timer+0x109/0x280
    ? bfq_dispatch_request+0x4870/0x4870
    __hrtimer_run_queues+0x37d/0x700
    ? enqueue_hrtimer+0x1b0/0x1b0
    ? kvm_clock_get_cycles+0xd/0x10
    ? ktime_get_update_offsets_now+0x6f/0x280
    hrtimer_interrupt+0x2c8/0x740

    Fix the problem by checking that the parent of the two bfqqs we are
    merging in bfq_setup_merge() is the same.

    Link: https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
    CC: stable@vger.kernel.org
    Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
    Tested-by: "yukuai (C)"
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-2-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 70456e5210f40ffdb8f6d905acfdcec5bd5fad9e upstream.

    bfq_setup_cooperator() can mark bic as stably merged even though it
    decides to not merge its bfqqs (when bfq_setup_merge() returns NULL).
    Make sure to mark bic as stably merged only if we are really going to
    merge bfqqs.

    CC: stable@vger.kernel.org
    Tested-by: "yukuai (C)"
    Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220401102752.8599-1-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
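    The "mark only on a real merge" rule can be sketched in userspace;
    the struct bic field and try_setup_merge below are made-up stand-ins
    for the kernel's bfq_io_cq and bfq_setup_merge():

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative model: bic carries a "stably merged" flag that must
     * only be set when a merge target was actually found. */
    struct bic { bool stably_merged; };

    /* Stand-in for bfq_setup_merge(): returns a target queue or NULL. */
    static int *try_setup_merge(int *candidate) { return candidate; }

    static int *setup_cooperator(struct bic *bic, int *candidate)
    {
        int *new_bfqq = try_setup_merge(candidate);
        /* The fix: mark the bic after the merge decision succeeded,
         * not unconditionally before it. */
        if (new_bfqq)
            bic->stably_merged = true;
        return new_bfqq;
    }

    int main(void)
    {
        struct bic bic = { false };
        int q;

        assert(setup_cooperator(&bic, NULL) == NULL);
        assert(!bic.stably_merged);        /* no merge: flag stays clear */
        assert(setup_cooperator(&bic, &q) == &q);
        assert(bic.stably_merged);         /* merged: flag set */
        puts("stable-merge flag ok");
        return 0;
    }
    ```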
     
  • [ Upstream commit c5ac56bb6110e42e79d3106866658376b2e48ab9 ]

    The code in bfq_check_waker() ignores wake up events from the current
    waker. This makes it more likely that we select a new tentative waker
    even though the current one is generating more wake up events. Treat
    the current waker the same way as any other process and allow it to
    reset the waker detection logic.

    Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust")
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20220519105235.31397-2-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jan Kara
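    The detection logic can be modeled as a simple consecutive-event
    counter. The three-event threshold and the struct below are toy
    stand-ins, not BFQ's actual bookkeeping:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Toy waker detector: three wake-up events in a row from the same
     * process confirm it as the waker. */
    struct detector { int candidate; int count; };

    static void observe(struct detector *d, int waker)
    {
        /* After the fix, events from the currently detected waker are
         * not discarded; they go through the same accounting as any
         * other process. */
        if (waker == d->candidate) {
            d->count++;
        } else {
            d->candidate = waker;   /* new tentative waker: restart */
            d->count = 1;
        }
    }

    static int confirmed(const struct detector *d)
    {
        return d->count >= 3 ? d->candidate : -1;
    }

    int main(void)
    {
        struct detector d = { -1, 0 };

        observe(&d, 7); observe(&d, 7); observe(&d, 7);
        assert(confirmed(&d) == 7);    /* process 7 confirmed as waker */

        /* One event from 9 resets tentative detection; continued events
         * from 7 win it back instead of being ignored. */
        observe(&d, 9);
        assert(confirmed(&d) == -1);
        observe(&d, 7); observe(&d, 7); observe(&d, 7);
        assert(confirmed(&d) == 7);
        puts("waker-detect ok");
        return 0;
    }
    ```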
     
  • [ Upstream commit f950667356ce90a41b446b726d4595a10cb65415 ]

    Currently we look for a waker only if the current queue has no
    requests. This makes sense for bfq queues with a single process.
    However, for shared queues with a larger number of processes, the
    condition that the queue has no requests is difficult to meet,
    because often at least one process has a request in flight while all
    the others are waiting for the waker to do the work, and this harms
    throughput. Relax the "no queued request for bfq queue" condition to
    "the current task has no queued requests yet". For this, we also need
    to start tracking the number of requests in flight for each task.

    This patch (together with the following one) restores the performance
    for dbench with 128 clients that regressed with commit c65e6fd460b4
    ("bfq: Do not let waker requests skip proper accounting") because
    this commit makes requests of wakers properly enter BFQ queues and thus
    these queues become ineligible for the old waker detection logic.
    Dbench results:

              Vanilla 5.18-rc3    5.18-rc3 + revert    5.18-rc3 patched
    Mean      1237.36 ( 0.00%)     950.16 * 23.21%*     988.35 * 20.12%*

    Numbers are time to complete workload so lower is better.

    Fixes: c65e6fd460b4 ("bfq: Do not let waker requests skip proper accounting")
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20220519105235.31397-1-jack@suse.cz
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jan Kara
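    The relaxed condition amounts to a per-task in-flight counter. The
    sketch below is a userspace illustration with invented names, not the
    kernel's accounting:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Illustrative per-task accounting: bumped at dispatch, dropped at
     * completion, so waker detection can ask "does *this* task have
     * requests in flight?" instead of "does the whole shared queue?". */
    struct task_io { int requests_in_flight; };

    static void dispatch(struct task_io *t) { t->requests_in_flight++; }
    static void complete(struct task_io *t) { t->requests_in_flight--; }

    /* Relaxed condition from the patch description: look for a waker
     * when the current task is idle, even if siblings sharing the bfq
     * queue still have requests queued. */
    static int may_check_waker(const struct task_io *cur)
    {
        return cur->requests_in_flight == 0;
    }

    int main(void)
    {
        struct task_io a = {0}, b = {0};

        dispatch(&a);                  /* task a has a request in flight */
        assert(!may_check_waker(&a));
        assert(may_check_waker(&b));   /* b is idle: check allowed even
                                          though the shared queue is busy */
        complete(&a);
        assert(may_check_waker(&a));
        puts("per-task accounting ok");
        return 0;
    }
    ```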
     

09 May, 2022

1 commit

  • commit 8c936f9ea11ec4e35e288810a7503b5c841a355f upstream.

    When an iocg is in debt, its inuse weight is owned by debt handling and
    should stay at 1. This invariant was broken when determining the amount of
    surpluses at the beginning of donation calculation - when an iocg's
    hierarchical weight is too low, the iocg is excluded from donation
    calculation and its inuse is reset to its active regardless of its
    indebtedness, triggering warnings like the following:

    WARNING: CPU: 5 PID: 0 at block/blk-iocost.c:1416 iocg_kick_waitq+0x392/0x3a0
    ...
    RIP: 0010:iocg_kick_waitq+0x392/0x3a0
    Code: 00 00 be ff ff ff ff 48 89 4d a8 e8 98 b2 70 00 48 8b 4d a8 85 c0 0f 85 4a fe ff ff 0f 0b e9 43 fe ff ff 0f 0b e9 4d fe ff ff 0b e9 50 fe ff ff e8 a2 ae 70 00 66 90 0f 1f 44 00 00 55 48 89
    RSP: 0018:ffffc90000200d08 EFLAGS: 00010016
    ...

    ioc_timer_fn+0x2e0/0x1470
    call_timer_fn+0xa1/0x2c0
    ...

    As this happens only when an iocg's hierarchical weight is negligible, its
    impact likely is limited to triggering the warnings. Fix it by skipping
    resetting inuse of under-weighted debtors.

    Signed-off-by: Tejun Heo
    Reported-by: Rik van Riel
    Fixes: c421a3eb2e27 ("blk-iocost: revamp debt handling")
    Cc: stable@vger.kernel.org # v5.10+
    Link: https://lore.kernel.org/r/YmjODd4aif9BzFuO@slm.duckdns.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
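    The invariant ("inuse stays at 1 while an iocg is in debt") can be
    sketched with a toy struct. Field names loosely follow blk-iocost;
    the logic is a simplified illustration, not the driver code:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy iocg: while abs_vdebt > 0, inuse is owned by debt handling
     * and pinned at 1. */
    struct iocg { uint64_t abs_vdebt; int inuse; int active; };

    /* Donation pass excluding an under-weighted iocg: before the fix it
     * reset inuse to active unconditionally; the fix leaves indebted
     * iocgs alone. */
    static void exclude_from_donation(struct iocg *g)
    {
        if (g->abs_vdebt)
            return;             /* indebted: keep inuse == 1 invariant */
        g->inuse = g->active;
    }

    int main(void)
    {
        struct iocg debtor = { .abs_vdebt = 100, .inuse = 1, .active = 50 };
        struct iocg clean  = { .abs_vdebt = 0,   .inuse = 1, .active = 50 };

        exclude_from_donation(&debtor);
        exclude_from_donation(&clean);
        assert(debtor.inuse == 1);   /* invariant kept for the debtor */
        assert(clean.inuse == 50);   /* non-debtor reset as before */
        puts("iocost invariant ok");
        return 0;
    }
    ```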
     

27 Apr, 2022

3 commits

  • commit ccf16413e520164eb718cf8b22a30438da80ff23 upstream.

    The kernel's unsigned long and compat_ulong_t may not be the same
    width. Use the compat type directly to eliminate the mismatch.

    The mismatch would result in truncation rather than EFBIG in 32-bit
    mode for large disks.

    Reviewed-by: Bart Van Assche
    Signed-off-by: Khazhismel Kumykov
    Reviewed-by: Chaitanya Kulkarni
    Link: https://lore.kernel.org/r/20220414224056.2875681-1-khazhy@google.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Khazhismel Kumykov
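    The width mismatch is easy to reproduce in userspace with fixed-width
    integers standing in for unsigned long and compat_ulong_t. The helper
    name below is invented for illustration:

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>

    /* 32-bit compat ioctls pass a compat_ulong_t (32 bits) even when
     * the kernel's unsigned long is 64 bits; silently storing a 64-bit
     * sector count through it truncates. This sketch returns -EFBIG
     * instead when the value does not fit. */
    static int put_compat_ulong(uint64_t val, uint32_t *uarg)
    {
        if (val > UINT32_MAX)
            return -EFBIG;      /* too big to represent: refuse */
        *uarg = (uint32_t)val;
        return 0;
    }

    int main(void)
    {
        uint32_t out = 0;
        uint64_t small = 1048576;      /* fits in 32 bits */
        uint64_t huge  = 5ULL << 32;   /* e.g. sectors of a large disk */

        assert(put_compat_ulong(small, &out) == 0 && out == 1048576);
        assert(put_compat_ulong(huge, &out) == -EFBIG);
        /* the buggy path would have stored huge's low 32 bits: 0 */
        assert((uint32_t)huge == 0);
        puts("compat width check ok");
        return 0;
    }
    ```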
     
  • [ Upstream commit 1e03a36bdff4709c1bbf0f57f60ae3f776d51adf ]

    Get rid of the indirections and just provide a sync_bdevs
    helper for the generic sync code.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.de
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     
  • [ Upstream commit 70164eb6ccb76ab679b016b4b60123bf4ec6c162 ]

    Instead offer a new sync_blockdev_nowait helper for the !wait case.
    This new helper is exported as it will grow modular callers in a bit.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.de
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     

20 Apr, 2022

1 commit

  • [ Upstream commit 8535c0185d14ea41f0efd6a357961b05daf6687e ]

    The unit of bio->bi_iter.bi_size is bytes, but the unit of
    offset/size is sectors.

    Fix this in the offset/size check in bio_trim().

    Fixes: e83502ca5f1e ("block: fix argument type of bio_trim()")
    Cc: Chaitanya Kulkarni
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20220414084443.1736850-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
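    The unit mixup can be shown with a standalone check. The function
    below is an illustrative reconstruction of the corrected bounds
    check, not the kernel's bio_trim() itself:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* bio->bi_iter.bi_size is in bytes while bio_trim()'s offset/size
     * arguments are in 512-byte sectors; the sanity check must convert
     * to a common unit before comparing. */
    #define SECTOR_SHIFT 9

    static int trim_args_valid(uint32_t bi_size_bytes,
                               uint64_t offset_sectors,
                               uint64_t size_sectors)
    {
        uint64_t sectors = bi_size_bytes >> SECTOR_SHIFT;

        /* fixed check: everything compared in sectors */
        return offset_sectors <= sectors &&
               size_sectors <= sectors - offset_sectors;
    }

    int main(void)
    {
        uint32_t bi_size = 8 << SECTOR_SHIFT;  /* an 8-sector (4 KiB) bio */

        assert(trim_args_valid(bi_size, 2, 6));   /* sectors [2,8): ok */
        assert(!trim_args_valid(bi_size, 2, 7));  /* runs past the bio */
        /* the buggy comparison mixed bytes with sectors, so an in-range
         * trim on a 4 KiB bio could be misjudged */
        puts("bio_trim units ok");
        return 0;
    }
    ```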
     

08 Apr, 2022

4 commits

  • commit d1868328dec5ae2cf210111025fcbc71f78dd5ca upstream.

    ida_alloc_range(..., min, max, ...) returns values from min to max,
    inclusive.

    So, NR_EXT_DEVT is a valid idx returned by blk_alloc_ext_minor().

    This is an issue because in device_add_disk(), this value is used in:
    ddev->devt = MKDEV(disk->major, disk->first_minor);
    and NR_EXT_DEVT is '(1 << MINORBITS)'.

    So, should 'disk->first_minor' be NR_EXT_DEVT, it would overflow.

    Fixes: 22ae8ce8b892 ("block: simplify bdev/disk lookup in blkdev_get")
    Signed-off-by: Christophe JAILLET
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/cc17199798312406b90834e433d2cefe8266823d.1648306232.git.christophe.jaillet@wanadoo.fr
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Christophe JAILLET
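    The inclusive-bound pitfall can be demonstrated with a toy allocator
    standing in for ida_alloc_range() (which returns ids from min to max
    inclusive). The allocator below is a made-up illustration:

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define MINORBITS   20
    #define NR_EXT_DEVT (1 << MINORBITS)

    static int next_id;     /* toy stand-in for the ida's cursor */

    /* Like ida_alloc_range(): hands out ids in [min, max] inclusive. */
    static int toy_ida_alloc_range(int min, int max)
    {
        if (next_id < min)
            next_id = min;
        if (next_id > max)
            return -1;      /* range exhausted */
        return next_id++;
    }

    int main(void)
    {
        int id;

        /* Buggy bound: max == NR_EXT_DEVT can return NR_EXT_DEVT itself,
         * which does not fit in the MINORBITS-wide minor field of
         * MKDEV(). */
        next_id = NR_EXT_DEVT;
        id = toy_ida_alloc_range(0, NR_EXT_DEVT);
        assert(id == NR_EXT_DEVT);
        assert((id & (NR_EXT_DEVT - 1)) != id);  /* minor overflows to 0 */

        /* Fixed bound: NR_EXT_DEVT - 1 keeps every id representable. */
        next_id = NR_EXT_DEVT;
        assert(toy_ida_alloc_range(0, NR_EXT_DEVT - 1) == -1);
        puts("minor range ok");
        return 0;
    }
    ```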
     
  • [ Upstream commit 15729ff8143f8135b03988a100a19e66d7cb7ecd ]

    A crash [1] happened to be triggered in conjunction with commit
    2d52c58b9c9b ("block, bfq: honor already-setup queue merges"). The
    latter was then reverted by commit ebc69e897e17 ("Revert "block, bfq:
    honor already-setup queue merges""). Yet, the reverted commit was not
    the one introducing the bug. In fact, it actually triggered a UAF
    introduced by a different commit, and now fixed by commit d29bd41428cf
    ("block, bfq: reset last_bfqq_created on group change").

    So, there is no point in keeping commit 2d52c58b9c9b ("block, bfq:
    honor already-setup queue merges") out. This commit restores it.

    [1] https://bugzilla.kernel.org/show_bug.cgi?id=214503

    Reported-by: Holger Hoffstätte
    Signed-off-by: Paolo Valente
    Link: https://lore.kernel.org/r/20211125181510.15004-1-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     
  • [ Upstream commit ab552fcb17cc9e4afe0e4ac4df95fc7b30e8490a ]

    KASAN reports a use-after-free report when doing normal scsi-mq test

    [69832.239032] ==================================================================
    [69832.241810] BUG: KASAN: use-after-free in bfq_dispatch_request+0x1045/0x44b0
    [69832.243267] Read of size 8 at addr ffff88802622ba88 by task kworker/3:1H/155
    [69832.244656]
    [69832.245007] CPU: 3 PID: 155 Comm: kworker/3:1H Not tainted 5.10.0-10295-g576c6382529e #8
    [69832.246626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    [69832.249069] Workqueue: kblockd blk_mq_run_work_fn
    [69832.250022] Call Trace:
    [69832.250541] dump_stack+0x9b/0xce
    [69832.251232] ? bfq_dispatch_request+0x1045/0x44b0
    [69832.252243] print_address_description.constprop.6+0x3e/0x60
    [69832.253381] ? __cpuidle_text_end+0x5/0x5
    [69832.254211] ? vprintk_func+0x6b/0x120
    [69832.254994] ? bfq_dispatch_request+0x1045/0x44b0
    [69832.255952] ? bfq_dispatch_request+0x1045/0x44b0
    [69832.256914] kasan_report.cold.9+0x22/0x3a
    [69832.257753] ? bfq_dispatch_request+0x1045/0x44b0
    [69832.258755] check_memory_region+0x1c1/0x1e0
    [69832.260248] bfq_dispatch_request+0x1045/0x44b0
    [69832.261181] ? bfq_bfqq_expire+0x2440/0x2440
    [69832.262032] ? blk_mq_delay_run_hw_queues+0xf9/0x170
    [69832.263022] __blk_mq_do_dispatch_sched+0x52f/0x830
    [69832.264011] ? blk_mq_sched_request_inserted+0x100/0x100
    [69832.265101] __blk_mq_sched_dispatch_requests+0x398/0x4f0
    [69832.266206] ? blk_mq_do_dispatch_ctx+0x570/0x570
    [69832.267147] ? __switch_to+0x5f4/0xee0
    [69832.267898] blk_mq_sched_dispatch_requests+0xdf/0x140
    [69832.268946] __blk_mq_run_hw_queue+0xc0/0x270
    [69832.269840] blk_mq_run_work_fn+0x51/0x60
    [69832.278170] process_one_work+0x6d4/0xfe0
    [69832.278984] worker_thread+0x91/0xc80
    [69832.279726] ? __kthread_parkme+0xb0/0x110
    [69832.280554] ? process_one_work+0xfe0/0xfe0
    [69832.281414] kthread+0x32d/0x3f0
    [69832.282082] ? kthread_park+0x170/0x170
    [69832.282849] ret_from_fork+0x1f/0x30
    [69832.283573]
    [69832.283886] Allocated by task 7725:
    [69832.284599] kasan_save_stack+0x19/0x40
    [69832.285385] __kasan_kmalloc.constprop.2+0xc1/0xd0
    [69832.286350] kmem_cache_alloc_node+0x13f/0x460
    [69832.287237] bfq_get_queue+0x3d4/0x1140
    [69832.287993] bfq_get_bfqq_handle_split+0x103/0x510
    [69832.289015] bfq_init_rq+0x337/0x2d50
    [69832.289749] bfq_insert_requests+0x304/0x4e10
    [69832.290634] blk_mq_sched_insert_requests+0x13e/0x390
    [69832.291629] blk_mq_flush_plug_list+0x4b4/0x760
    [69832.292538] blk_flush_plug_list+0x2c5/0x480
    [69832.293392] io_schedule_prepare+0xb2/0xd0
    [69832.294209] io_schedule_timeout+0x13/0x80
    [69832.295014] wait_for_common_io.constprop.1+0x13c/0x270
    [69832.296137] submit_bio_wait+0x103/0x1a0
    [69832.296932] blkdev_issue_discard+0xe6/0x160
    [69832.297794] blk_ioctl_discard+0x219/0x290
    [69832.298614] blkdev_common_ioctl+0x50a/0x1750
    [69832.304715] blkdev_ioctl+0x470/0x600
    [69832.305474] block_ioctl+0xde/0x120
    [69832.306232] vfs_ioctl+0x6c/0xc0
    [69832.306877] __se_sys_ioctl+0x90/0xa0
    [69832.307629] do_syscall_64+0x2d/0x40
    [69832.308362] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [69832.309382]
    [69832.309701] Freed by task 155:
    [69832.310328] kasan_save_stack+0x19/0x40
    [69832.311121] kasan_set_track+0x1c/0x30
    [69832.311868] kasan_set_free_info+0x1b/0x30
    [69832.312699] __kasan_slab_free+0x111/0x160
    [69832.313524] kmem_cache_free+0x94/0x460
    [69832.314367] bfq_put_queue+0x582/0x940
    [69832.315112] __bfq_bfqd_reset_in_service+0x166/0x1d0
    [69832.317275] bfq_bfqq_expire+0xb27/0x2440
    [69832.318084] bfq_dispatch_request+0x697/0x44b0
    [69832.318991] __blk_mq_do_dispatch_sched+0x52f/0x830
    [69832.319984] __blk_mq_sched_dispatch_requests+0x398/0x4f0
    [69832.321087] blk_mq_sched_dispatch_requests+0xdf/0x140
    [69832.322225] __blk_mq_run_hw_queue+0xc0/0x270
    [69832.323114] blk_mq_run_work_fn+0x51/0x60
    [69832.323942] process_one_work+0x6d4/0xfe0
    [69832.324772] worker_thread+0x91/0xc80
    [69832.325518] kthread+0x32d/0x3f0
    [69832.326205] ret_from_fork+0x1f/0x30
    [69832.326932]
    [69832.338297] The buggy address belongs to the object at ffff88802622b968
    [69832.338297] which belongs to the cache bfq_queue of size 512
    [69832.340766] The buggy address is located 288 bytes inside of
    [69832.340766] 512-byte region [ffff88802622b968, ffff88802622bb68)
    [69832.343091] The buggy address belongs to the page:
    [69832.344097] page:ffffea0000988a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802622a528 pfn:0x26228
    [69832.346214] head:ffffea0000988a00 order:2 compound_mapcount:0 compound_pincount:0
    [69832.347719] flags: 0x1fffff80010200(slab|head)
    [69832.348625] raw: 001fffff80010200 ffffea0000dbac08 ffff888017a57650 ffff8880179fe840
    [69832.354972] raw: ffff88802622a528 0000000000120008 00000001ffffffff 0000000000000000
    [69832.356547] page dumped because: kasan: bad access detected
    [69832.357652]
    [69832.357970] Memory state around the buggy address:
    [69832.358926] ffff88802622b980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [69832.360358] ffff88802622ba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [69832.361810] >ffff88802622ba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [69832.363273] ^
    [69832.363975] ffff88802622bb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
    [69832.375960] ffff88802622bb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [69832.377405] ==================================================================

    In the bfq_dispatch_request function, the following call chain may occur:

    bfq_dispatch_request
      __bfq_dispatch_request
        bfq_select_queue
          bfq_bfqq_expire
            __bfq_bfqd_reset_in_service
              bfq_put_queue
                kmem_cache_free
    In this call chain, in_serv_queue has been expired and meets the
    conditions to be freed. By the time bfq_dispatch_request reads
    idle_timer_disabled from the flags of the object in_serv_queue points
    to, that memory has already been released, so a use-after-free
    occurs.

    Fix the problem by checking in_serv_queue == bfqd->in_service_queue
    and only reading idle_timer_disabled when they are equal. If the
    memory in_serv_queue points to has been released, this check avoids
    the use-after-free. And if in_serv_queue has been expired or
    finished, idle_timer_disabled will be false, which does not affect
    bfq_update_dispatch_stats.

    Reported-by: Hulk Robot
    Signed-off-by: Zhang Wensheng
    Link: https://lore.kernel.org/r/20220303070334.3020168-1-zhangwensheng5@huawei.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Zhang Wensheng
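    The guard described above can be sketched in userspace; the structs
    are simplified stand-ins for bfq_data/bfq_queue, and the helper name
    is invented for illustration:

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Sketch of the fix: a locally cached in_serv_queue pointer may
     * have been expired and freed inside the dispatch path, so only
     * dereference it while it still equals bfqd->in_service_queue;
     * otherwise report idle_timer_disabled as false. */
    struct bfqq { bool idle_timer_disabled; };
    struct bfqd { struct bfqq *in_service_queue; };

    static bool read_idle_timer_disabled(struct bfqd *bfqd,
                                         struct bfqq *in_serv_queue)
    {
        if (in_serv_queue != bfqd->in_service_queue)
            return false;   /* queue changed (possibly freed): no deref */
        return in_serv_queue->idle_timer_disabled;
    }

    int main(void)
    {
        struct bfqq q = { .idle_timer_disabled = true };
        struct bfqd d = { .in_service_queue = &q };
        struct bfqq *cached = d.in_service_queue;

        assert(read_idle_timer_disabled(&d, cached));   /* in service */

        d.in_service_queue = NULL;                      /* expired */
        assert(!read_idle_timer_disabled(&d, cached));  /* stale: safe */
        puts("stale-queue guard ok");
        return 0;
    }
    ```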
     
  • [ Upstream commit 8410f70977734f21b8ed45c37e925d311dfda2e7 ]

    Our test report a UAF:

    [ 2073.019181] ==================================================================
    [ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168
    [ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584
    [ 2073.019192]
    [ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5
    [ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
    [ 2073.019200] Call trace:
    [ 2073.019203] dump_backtrace+0x0/0x310
    [ 2073.019206] show_stack+0x28/0x38
    [ 2073.019210] dump_stack+0xec/0x15c
    [ 2073.019216] print_address_description+0x68/0x2d0
    [ 2073.019220] kasan_report+0x238/0x2f0
    [ 2073.019224] __asan_store8+0x88/0xb0
    [ 2073.019229] __bfq_put_async_bfqq+0xa0/0x168
    [ 2073.019233] bfq_put_async_queues+0xbc/0x208
    [ 2073.019236] bfq_pd_offline+0x178/0x238
    [ 2073.019240] blkcg_deactivate_policy+0x1f0/0x420
    [ 2073.019244] bfq_exit_queue+0x128/0x178
    [ 2073.019249] blk_mq_exit_sched+0x12c/0x160
    [ 2073.019252] elevator_exit+0xc8/0xd0
    [ 2073.019256] blk_exit_queue+0x50/0x88
    [ 2073.019259] blk_cleanup_queue+0x228/0x3d8
    [ 2073.019267] null_del_dev+0xfc/0x1e0 [null_blk]
    [ 2073.019274] null_exit+0x90/0x114 [null_blk]
    [ 2073.019278] __arm64_sys_delete_module+0x358/0x5a0
    [ 2073.019282] el0_svc_common+0xc8/0x320
    [ 2073.019287] el0_svc_handler+0xf8/0x160
    [ 2073.019290] el0_svc+0x10/0x218
    [ 2073.019291]
    [ 2073.019294] Allocated by task 14163:
    [ 2073.019301] kasan_kmalloc+0xe0/0x190
    [ 2073.019305] kmem_cache_alloc_node_trace+0x1cc/0x418
    [ 2073.019308] bfq_pd_alloc+0x54/0x118
    [ 2073.019313] blkcg_activate_policy+0x250/0x460
    [ 2073.019317] bfq_create_group_hierarchy+0x38/0x110
    [ 2073.019321] bfq_init_queue+0x6d0/0x948
    [ 2073.019325] blk_mq_init_sched+0x1d8/0x390
    [ 2073.019330] elevator_switch_mq+0x88/0x170
    [ 2073.019334] elevator_switch+0x140/0x270
    [ 2073.019338] elv_iosched_store+0x1a4/0x2a0
    [ 2073.019342] queue_attr_store+0x90/0xe0
    [ 2073.019348] sysfs_kf_write+0xa8/0xe8
    [ 2073.019351] kernfs_fop_write+0x1f8/0x378
    [ 2073.019359] __vfs_write+0xe0/0x360
    [ 2073.019363] vfs_write+0xf0/0x270
    [ 2073.019367] ksys_write+0xdc/0x1b8
    [ 2073.019371] __arm64_sys_write+0x50/0x60
    [ 2073.019375] el0_svc_common+0xc8/0x320
    [ 2073.019380] el0_svc_handler+0xf8/0x160
    [ 2073.019383] el0_svc+0x10/0x218
    [ 2073.019385]
    [ 2073.019387] Freed by task 72584:
    [ 2073.019391] __kasan_slab_free+0x120/0x228
    [ 2073.019394] kasan_slab_free+0x10/0x18
    [ 2073.019397] kfree+0x94/0x368
    [ 2073.019400] bfqg_put+0x64/0xb0
    [ 2073.019404] bfqg_and_blkg_put+0x90/0xb0
    [ 2073.019408] bfq_put_queue+0x220/0x228
    [ 2073.019413] __bfq_put_async_bfqq+0x98/0x168
    [ 2073.019416] bfq_put_async_queues+0xbc/0x208
    [ 2073.019420] bfq_pd_offline+0x178/0x238
    [ 2073.019424] blkcg_deactivate_policy+0x1f0/0x420
    [ 2073.019429] bfq_exit_queue+0x128/0x178
    [ 2073.019433] blk_mq_exit_sched+0x12c/0x160
    [ 2073.019437] elevator_exit+0xc8/0xd0
    [ 2073.019440] blk_exit_queue+0x50/0x88
    [ 2073.019443] blk_cleanup_queue+0x228/0x3d8
    [ 2073.019451] null_del_dev+0xfc/0x1e0 [null_blk]
    [ 2073.019459] null_exit+0x90/0x114 [null_blk]
    [ 2073.019462] __arm64_sys_delete_module+0x358/0x5a0
    [ 2073.019467] el0_svc_common+0xc8/0x320
    [ 2073.019471] el0_svc_handler+0xf8/0x160
    [ 2073.019474] el0_svc+0x10/0x218
    [ 2073.019475]
    [ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00
    which belongs to the cache kmalloc-1024 of size 1024
    [ 2073.019484] The buggy address is located 552 bytes inside of
    1024-byte region [ffff8000ccf63f00, ffff8000ccf64300)
    [ 2073.019486] The buggy address belongs to the page:
    [ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0
    [ 2073.020123] flags: 0x7ffff0000008100(slab|head)
    [ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00
    [ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
    [ 2073.020411] page dumped because: kasan: bad access detected
    [ 2073.020412]
    [ 2073.020414] Memory state around the buggy address:
    [ 2073.020420] ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 2073.020424] ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 2073.020430] ^
    [ 2073.020434] ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 2073.020438] ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 2073.020439] ==================================================================

    The same problem exist in mainline as well.

    This is because oom_bfqq is moved to a non-root group, and thus
    root_group is freed earlier.

    Fix the problem by not moving oom_bfqq.

    Signed-off-by: Yu Kuai
    Reviewed-by: Jan Kara
    Acked-by: Paolo Valente
    Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Yu Kuai
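    The "don't move oom_bfqq" rule can be illustrated with a minimal
    model; the structs and move_to_group helper are simplified stand-ins,
    not BFQ's bfq_bfqq_move():

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Sketch of the fix: when re-parenting queues to a cgroup, the
     * global fallback oom_bfqq must stay in the root group; moving it
     * lets the root group be freed while oom_bfqq still points into
     * it. */
    struct group { int id; };
    struct queue { struct group *parent; };

    static struct queue oom_bfqq;   /* stand-in for the global queue */

    static void move_to_group(struct queue *q, struct group *g)
    {
        if (q == &oom_bfqq)
            return;             /* never re-parent the oom queue */
        q->parent = g;
    }

    int main(void)
    {
        struct group root = {0}, child = {1};
        struct queue q = { &root };

        oom_bfqq.parent = &root;
        move_to_group(&q, &child);
        move_to_group(&oom_bfqq, &child);
        assert(q.parent == &child);        /* ordinary queue moved */
        assert(oom_bfqq.parent == &root);  /* oom queue stayed in root */
        puts("oom_bfqq pinned ok");
        return 0;
    }
    ```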