08 Oct, 2020

1 commit

  • * tag 'v5.4.70': (3051 commits)
    Linux 5.4.70
    netfilter: ctnetlink: add a range check for l3/l4 protonum
    ep_create_wakeup_source(): dentry name can change under you...
    ...

    Conflicts:
    arch/arm/mach-imx/pm-imx6.c
    arch/arm64/boot/dts/freescale/imx8mm-evk.dts
    arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts
    drivers/crypto/caam/caamalg.c
    drivers/gpu/drm/imx/dw_hdmi-imx.c
    drivers/gpu/drm/imx/imx-ldb.c
    drivers/gpu/drm/imx/ipuv3/ipuv3-crtc.c
    drivers/mmc/host/sdhci-esdhc-imx.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/net/ethernet/freescale/enetc/enetc.c
    drivers/net/ethernet/freescale/enetc/enetc_pf.c
    drivers/thermal/imx_thermal.c
    drivers/usb/cdns3/ep0.c
    drivers/xen/swiotlb-xen.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c

    Signed-off-by: Jason Liu

    Jason Liu
     

07 Oct, 2020

1 commit

  • commit 2b8bd423614c595540eaadcfbc702afe8e155e50 upstream.

    Currently io_ticks is approximated by adding one at each start and end of
    requests if the jiffies counter has changed. This works perfectly for
    requests shorter than a jiffy, or if at least one request starts/ends in
    each jiffy.

    If the disk executes just one request at a time and requests are longer
    than two jiffies, then only the first and last jiffies will be accounted.

    The fix is simple: at the end of a request, add to io_ticks the jiffies
    that passed since the last update, rather than just one jiffy.
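
    A minimal sketch of the idea (helper and stat names as in the 5.4-era
    block layer; an illustration, not the exact upstream hunk):

        static void update_io_ticks(struct hd_struct *part, unsigned long now,
                                    bool end)
        {
            unsigned long stamp;
        again:
            stamp = READ_ONCE(part->stamp);
            if (unlikely(stamp != now)) {
                if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
                    /*
                     * On request end, credit all jiffies since the last
                     * update instead of a single jiffy.
                     */
                    __part_stat_add(part, io_ticks, end ? now - stamp : 1);
            }
            if (part->partno) {
                part = &part_to_disk(part)->part0;
                goto again;
            }
        }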

    Example: a common HDD executes random 4k read requests at around 12ms each.

    fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
    iostat -x 10 sdb

    Note the change in iostat's "%util", 8,43% -> 99,99%, before/after the patch:

    Before:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sdb 0,00 0,00 82,60 0,00 330,40 0,00 8,00 0,96 12,09 12,09 0,00 1,02 8,43

    After:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sdb 0,00 0,00 82,50 0,00 330,00 0,00 8,00 1,00 12,10 12,10 0,00 12,12 99,99

    Now io_ticks does not lose time between the start and end of requests, but
    for queue-depth > 1 some I/O time between adjacent starts might be lost.

    For load estimation, "%util" is not as useful as the average queue length,
    but it clearly shows how often the disk queue is completely empty.

    Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    From: "Banerjee, Debabrata"
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

23 Sep, 2020

1 commit

  • [ Upstream commit e8a8a185051a460e3eb0617dca33f996f4e31516 ]

    Yang Yang reported the following crash caused by requeueing a flush
    request in Kyber:

    [ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
    ...
    [ 2.517468] pc : clear_bit+0x18/0x2c
    [ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
    [ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
    ...
    [ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
    [ 2.517602] Call trace:
    [ 2.517606] clear_bit+0x18/0x2c
    [ 2.517619] kyber_finish_request+0x74/0x80
    [ 2.517627] blk_mq_requeue_request+0x3c/0xc0
    [ 2.517637] __scsi_queue_insert+0x11c/0x148
    [ 2.517640] scsi_softirq_done+0x114/0x130
    [ 2.517643] blk_done_softirq+0x7c/0xb0
    [ 2.517651] __do_softirq+0x208/0x3bc
    [ 2.517657] run_ksoftirqd+0x34/0x60
    [ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
    [ 2.517667] kthread+0x110/0x120
    [ 2.517669] ret_from_fork+0x10/0x18

    This happens because Kyber doesn't track flush requests, so
    kyber_finish_request() reads a garbage domain token. Only call the
    scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
    the finish_request() hook in blk_mq_free_request()). Now that we're
    handling it in blk-mq, also remove the check from BFQ.
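
    A hedged sketch of the gating, modelled on blk_mq_sched_requeue_request()
    in blk-mq-sched.h (treat details as approximate):

        static inline void blk_mq_sched_requeue_request(struct request *rq)
        {
            struct request_queue *q = rq->q;
            struct elevator_queue *e = q->elevator;

            /* flush rq's never go through the scheduler, skip them */
            if ((rq->rq_flags & RQF_ELVPRIV) && e &&
                e->type->ops.requeue_request)
                e->type->ops.requeue_request(rq);
        }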

    Reported-by: Yang Yang
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Omar Sandoval
     

17 Sep, 2020

1 commit

  • [ Upstream commit 2cd896a5e86fc326bda8614b96c0401dcc145868 ]

    If we hit the UINT_MAX limit of bio->bi_iter.bi_size, and so we are not
    going to merge this page into this bio anyway, then it makes sense to also
    set same_page to false before returning.
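
    A hedged sketch of the corresponding check in __bio_try_merge_page(),
    reconstructed from the description above:

        if (page_is_mergeable(bv, page, len, off, same_page)) {
            if (bio->bi_iter.bi_size > UINT_MAX - len) {
                /* not merging at all: report pages as not-same too */
                *same_page = false;
                return false;
            }
            bv->bv_len += len;
            bio->bi_iter.bi_size += len;
            return true;
        }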

    Without this patch, we hit the below WARNING in iomap.
    This mostly happens on systems with very large memory and/or after
    tweaking the vm dirty threshold params to delay writeback of dirty data.

    WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
    CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6
    Call Trace:
    __remove_mapping+0x154/0x320 (unreliable)
    iomap_releasepage+0x80/0x180
    try_to_release_page+0x94/0xe0
    invalidate_inode_page+0xc8/0x110
    invalidate_mapping_pages+0x1dc/0x540
    generic_fadvise+0x3c8/0x450
    xfs_file_fadvise+0x2c/0xe0 [xfs]
    vfs_fadvise+0x3c/0x60
    ksys_fadvise64_64+0x68/0xe0
    sys_fadvise64+0x28/0x40
    system_call_exception+0xf8/0x1c0
    system_call_common+0xf0/0x278

    Fixes: cc90bc68422 ("block: fix "check bi_size overflow before merge"")
    Reported-by: Shivaprasad G Bhat
    Suggested-by: Christoph Hellwig
    Signed-off-by: Anju T Sudhakar
    Signed-off-by: Ritesh Harjani
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ritesh Harjani
     

10 Sep, 2020

2 commits

  • commit 5aeac7c4b16069aae49005f0a8d4526baa83341b upstream.

    ioc_pd_free() grabs the irq-safe ioc->lock without disabling irqs, even
    though it can be called with irqs either disabled or enabled. This has a
    small chance of causing A-A deadlocks and triggers lockdep splats. Use
    irqsave operations instead.
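
    A minimal sketch of the irqsave conversion (body abridged; pd_to_iocg()
    as in blk-iocost.c):

        static void ioc_pd_free(struct blkg_policy_data *pd)
        {
            struct ioc_gq *iocg = pd_to_iocg(pd);
            struct ioc *ioc = iocg->ioc;
            unsigned long flags;

            if (ioc) {
                /* was spin_lock_irq(): unsafe if irqs are already off */
                spin_lock_irqsave(&ioc->lock, flags);
                /* deactivate and unlink the iocg under the lock */
                spin_unlock_irqrestore(&ioc->lock, flags);
            }
            kfree(iocg);
        }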

    Signed-off-by: Tejun Heo
    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit de1b0ee490eafdf65fac9eef9925391a8369f2dc upstream.

    If a driver leaves the limit settings as the defaults, then we don't
    initialize bdi->io_pages. This means that file systems may need to
    work around bdi->io_pages == 0, which is somewhat messy.

    Initialize the default value just like we do for ->ra_pages.
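
    The gist, hedged: assuming the default is assigned next to ->ra_pages
    when the queue's backing_dev_info is set up:

        q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;
        q->backing_dev_info->io_pages = VM_READAHEAD_PAGES;  /* the fix */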

    Cc: stable@vger.kernel.org
    Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
    Reported-by: OGAWA Hirofumi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

03 Sep, 2020

7 commits

  • commit d7d8535f377e9ba87edbf7fbbd634ac942f3f54f upstream.

    The SCHED_RESTART code path is relied on to re-run the queue for
    dispatching requests left in hctx->dispatch. Meanwhile, the SCHED_RESTART
    flag is checked when adding requests to hctx->dispatch.

    Memory barriers have to be used for ordering the following two pairs of
    operations:

    1) adding requests to hctx->dispatch and checking SCHED_RESTART in
    blk_mq_dispatch_rq_list()

    2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
    in blk_mq_sched_restart().

    Without the added memory barrier, either:

    1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
    blk_mq_dispatch_rq_list() observes SCHED_RESTART, and so does not re-run
    the queue on the dispatch side,

    or

    2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART, and so does not
    re-run the queue on the dispatch side, while the check for requests in
    hctx->dispatch from blk_mq_sched_restart() is missed.
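
    A hedged sketch of the barrier pairing (placement in blk-mq is
    approximate; the comments carry the ordering intent):

        /* dispatch side: blk_mq_dispatch_rq_list() */
        spin_lock(&hctx->lock);
        list_splice_tail_init(list, &hctx->dispatch);
        spin_unlock(&hctx->lock);
        /* order: requests visible in ->dispatch before reading SCHED_RESTART */
        smp_mb();
        if (!blk_mq_sched_needs_restart(hctx))
            blk_mq_run_hw_queue(hctx, true);

        /* restart side: blk_mq_sched_restart() */
        clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
        /* order: clear SCHED_RESTART before checking ->dispatch for requests */
        smp_mb();
        blk_mq_run_hw_queue(hctx, true);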

    IO hang in ltp/fs_fill test is reported by kernel test robot:

    https://lkml.org/lkml/2020/7/26/77

    It turns out to be caused by the above out-of-order operations, and the
    IO hang can no longer be observed after applying this patch.

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: David Jeffery
    Cc:
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit e4b469c66f3cbb81c2e94d31123d7bcdf3c1dabd upstream.

    A previous commit aligning splits to physical block sizes inadvertently
    modified one return case such that it now returns 0-length splits when
    the number of sectors doesn't exceed the physical offset. This later hits
    a BUG in bio_split(). Restore the previous working behavior.

    Fixes: 9cc5169cd478b ("block: Improve physical block alignment of split bios")
    Reported-by: Eric Deal
    Signed-off-by: Keith Busch
    Cc: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     
  • [ Upstream commit 27029b4b18aa5d3b060f0bf2c26dae254132cfce ]

    Normally, blkcg_iolatency_exit() frees the related iolatency memory when
    the queue is cleaned up. But if blk_throtl_init() returns an error and
    queue init fails, blkcg_iolatency_exit() will not do that for us, causing
    a memory leak.

    Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Yufen Yu
     
  • [ Upstream commit db03f88fae8a2c8007caafa70287798817df2875 ]

    c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") was
    supposed to add requests which have been through ->queue_rq() to the hw
    queue dispatch list, however it also adds requests that merely ran out of
    budget or driver tag to the hw queue. This basically bypasses request
    merging and causes too many requests to be dispatched to the LLD,
    needlessly increasing %system.

    Fix this issue by adding requests that have not been through ->queue_rq()
    into the sw/scheduler queue; this is safe because ->queue_rq() has not
    been called on such a request yet.

    High %system, and even soft lockups, can be observed on Azure storvsc
    devices. This patch reduces %system during heavy sequential IO and
    decreases the soft lockup risk.
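
    A hedged sketch of the repaired fallback in __blk_mq_try_issue_directly()
    (abridged; the 'insert' label is reached only when ->queue_rq() has not
    been called):

        if (!blk_mq_get_dispatch_budget(hctx))
            goto insert;
        if (!blk_mq_get_driver_tag(rq)) {
            blk_mq_put_dispatch_budget(hctx);
            goto insert;
        }
        return __blk_mq_issue_directly(hctx, rq, cookie, last);
    insert:
        if (bypass_insert)
            return BLK_STS_RESOURCE;
        /* was blk_mq_request_bypass_insert(): keep merging possible */
        blk_mq_sched_insert_request(rq, false, run_queue, false);
        return BLK_STS_OK;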

    Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")
    Signed-off-by: Ming Lei
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     
  • [ Upstream commit 2de791ab4918969d8108f15238a701968375f235 ]

    Changes from v1:
    - update commit description with proper ref-accounting justification

    commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
    introduced a leak of bfq_group and blkcg_gq objects because of a get/put
    imbalance.
    In fact, the whole idea of the original commit is wrong, because a
    bfq_group entity cannot disappear under us: it is referenced by its child
    bfq_queues' entities from here:

    -> bfq_init_entity()
       -> bfqg_and_blkg_get(bfqg);
       -> entity->parent = bfqg->my_entity

    -> bfq_put_queue(bfqq)
       FINAL_PUT
       -> bfqg_and_blkg_put(bfqq_group(bfqq))
       -> kmem_cache_free(bfq_pool, bfqq);

    So the parent entity cannot disappear while a child entity is in the
    tree, and child entities already have proper protection.
    This patch reverts commit db37a34c563b ("block, bfq: get a ref to a group
    when adding it to a service tree").

    bfq_group leak trace caused by the bad commit:

    -> blkg_alloc
       -> bfq_pq_alloc
          -> bfqg_get (+1)
    -> bfq_activate_bfqq
       -> bfq_activate_requeue_entity
          -> __bfq_activate_entity
             -> bfq_get_entity
                -> bfqg_and_blkg_get (+1)        (Note1)
    -> bfq_del_bfqq_busy
       -> bfq_deactivate_entity+0x53/0xc0 [bfq]
          -> __bfq_deactivate_entity+0x1b8/0x210 [bfq]
             -> bfq_forget_entity(is_in_service = true)
                entity->on_st_or_in_serv = false (Note2)
                do not touch reference
    -> blkcg_css_offline
       -> blkcg_destroy_blkgs
          -> blkg_destroy
             -> bfq_pd_offline
                -> __bfq_deactivate_entity
                   if (!entity->on_st_or_in_serv) /* true, because of (Note2) */
                       return false;
    -> bfq_pd_free
       -> bfqg_put() (-1, but bfqg->ref == 2) because of (Note2)

    So bfq_group and blkcg_gq will leak forever; see the test case below.

    ##TESTCASE_BEGIN:
    #!/bin/bash

    max_iters=${1:-100}
    #prep cgroup mounts
    mount -t tmpfs cgroup_root /sys/fs/cgroup
    mkdir /sys/fs/cgroup/blkio
    mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

    # Prepare blkdev
    grep blkio /proc/cgroups
    truncate -s 1M img
    losetup /dev/loop0 img
    echo bfq > /sys/block/loop0/queue/scheduler

    grep blkio /proc/cgroups
    for ((i=0;i<max_iters;i++))
    do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
    done
    ##TESTCASE_END:

    Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Dmitry Monakhov
     
  • [ Upstream commit d81665198b83e55a28339d1f3e4890ed8a434556 ]

    If we pass in an offset which is larger than PAGE_SIZE, then
    page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
    leading to a large number of bio_vecs being used. Use a slightly more
    obvious test that the two pages are compatible with each other.

    Fixes: 52d52d1c98a9 ("block: only allow contiguous page structs in a bio_vec")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Matthew Wilcox (Oracle)
     
  • [ Upstream commit 943b40c832beb71115e38a1c4d99b640b5342738 ]

    When queue_max_discard_segments(q) is 1, blk_discard_mergable() returns
    false for discard requests, so normal request merging is applied.
    However, only queue_max_segments() is checked, so the max discard segment
    limit isn't respected.

    Check the max discard segment limit in the request merge code to fix the
    issue; see the helper sketched below.

    This fixes discard request failures on virtio_blk.
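
    A hedged sketch of the per-op segment limit helper named in the fix
    (approximate):

        static inline unsigned short blk_rq_get_max_segments(struct request *rq)
        {
            /* discard requests have their own, separate segment limit */
            if (req_op(rq) == REQ_OP_DISCARD)
                return queue_max_discard_segments(rq->q);
            return queue_max_segments(rq->q);
        }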

    Fixes: 69840466086d ("block: fix the DISCARD request merge")
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Stefano Garzarella
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

19 Aug, 2020

1 commit

  • [ Upstream commit d9012a59db54442d5b2fcfdfcded35cf566397d3 ]

    We shouldn't skip iocg when its abs_vdebt is not zero.

    Fixes: 0b80f9866e6b ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock")
    Signed-off-by: Chengming Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Chengming Zhou
     

22 Jul, 2020

3 commits

  • commit 4a2f704eb2d831a2d73d7f4cdd54f45c49c3c353 upstream.

    Commit 429120f3df2d started taking the segment's start dma address into
    account when computing the max segment size, using the 'unsigned long'
    data type for the computation. However, the segment mask may be
    0xffffffff, so the computed segment size may overflow in case of a zero
    physical address on 32-bit archs.

    Fix the issue by returning queue_max_segment_size() directly when that
    happens.
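
    A hedged sketch of the guarded helper (signature per the 5.4-era code,
    approximate):

        static inline unsigned get_max_segment_size(const struct request_queue *q,
                                                    struct page *start_page,
                                                    unsigned long offset)
        {
            unsigned long mask = queue_segment_boundary(q);

            offset = mask & (page_to_phys(start_page) + offset);

            /*
             * Overflow may be triggered by a zero page physical address on
             * 32-bit archs; fall back to the queue's max segment size then.
             */
            return min_not_zero(mask - offset + 1,
                                (unsigned long)queue_max_segment_size(q));
        }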

    Fixes: 429120f3df2d ("block: fix splitting segments on boundary masks")
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Cc: Christoph Hellwig
    Tested-by: Steven Rostedt (VMware)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 429120f3df2dba2bf3a4a19f4212a53ecefc7102 upstream.

    We ran into a problem with an mpt3sas based controller, where we would
    see random (and hard to reproduce) file corruption. The issue seemed
    specific to this controller, but wasn't specific to the file system.
    After a lot of debugging, we found out that it's caused by segments
    spanning a 4G memory boundary. This shouldn't happen, as the default
    setting for segment boundary masks is 4G.

    Turns out there are two issues in get_max_segment_size():

    1) The default segment boundary mask is bypassed

    2) The segment start address isn't taken into account when checking
    segment boundary limit

    Fix these two issues by removing the bypass of the segment boundary
    check even if the mask is set to the default value, and taking into
    account the actual start address of the request when checking if a
    segment needs splitting.

    Cc: stable@vger.kernel.org # v5.1+
    Reviewed-by: Chris Mason
    Tested-by: Chris Mason
    Fixes: dcebd755926b ("block: use bio_for_each_bvec() to compute multi-page bvec count")
    Signed-off-by: Ming Lei
    Signed-off-by: Greg Kroah-Hartman

    Dropped const on the page pointer, ppc page_to_phys() doesn't mark the
    page as const...

    Signed-off-by: Jens Axboe

    Ming Lei
     
  • [ Upstream commit bfe373f608cf81b7626dfeb904001b0e867c5110 ]

    Else there may be magic numbers in /sys/kernel/debug/block/*/state.

    Signed-off-by: Hou Tao
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Hou Tao
     

16 Jul, 2020

2 commits

  • commit 05a4fed69ff00a8bd83538684cb602a4636b07a7 upstream.

    dm-multipath is the only user of blk_mq_queue_inflight(). When
    dm-multipath calls blk_mq_queue_inflight() to check if it has
    outstanding IO, it can get a false negative. The reason is that
    blk_mq_rq_inflight() doesn't consider requests that are no longer
    MQ_RQ_IN_FLIGHT but are now MQ_RQ_COMPLETE (->complete isn't called or
    hasn't finished yet) as "inflight".

    This causes request-based dm-multipath's dm_wait_for_completion() to
    return before all outstanding dm-multipath requests have actually
    completed. This breaks DM multipath's suspend functionality because
    blk-mq requests complete after DM's suspend has finished -- which
    shouldn't happen.

    Fix this by considering any request not in the MQ_RQ_IDLE state
    (so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in
    blk_mq_rq_inflight().
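
    A hedged sketch of the iterator callback after the fix, assuming
    blk_mq_request_started() returns true for any request that has left
    MQ_RQ_IDLE:

        static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx,
                                       struct request *rq, void *priv,
                                       bool reserved)
        {
            bool *busy = priv;

            /* count IN_FLIGHT and COMPLETE, i.e. anything not IDLE */
            if (blk_mq_request_started(rq)) {
                *busy = true;
                return false;  /* stop iterating, queue is busy */
            }
            return true;
        }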

    Fixes: 3c94d83cb3526 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 0b8eb629a700c0ef15a437758db8255f8444e76c ]

    Release bip using kfree() in the error path when it was allocated by
    kmalloc().

    Signed-off-by: Chengguang Xu
    Reviewed-by: Christoph Hellwig
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Chengguang Xu
     

01 Jul, 2020

2 commits

  • [ Upstream commit fe35ec58f0d339221643287bbb7cee15c93a5389 ]

    There is an issue when tuning the number of read and write queues while
    the total queue count is unchanged: hctx->type cannot be updated, since
    __blk_mq_update_nr_hw_queues() returns directly if the total queue count
    has not changed.

    Reproduce:

    dmesg | grep "default/read/poll"
    [ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues
    cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
    48 default

    tune the write queues to 24:
    echo 24 > /sys/module/nvme/parameters/write_queues
    echo 1 > /sys/block/nvme0n1/device/reset_controller

    dmesg | grep "default/read/poll"
    [ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues

    cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
    48 default

    The driver's hardware queue mapping is no longer the same as the block layer's.

    Signed-off-by: Weiping Zhang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Weiping Zhang
     
  • commit a75ca9303175d36af93c0937dd9b1a6422908b8d upstream.

    commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug") added
    a kfree() for 'buf' if bio_integrity_add_page() returns '0'. However, the
    object will already be freed in bio_integrity_free(), since 'bio->bi_opf'
    and 'bio->bi_integrity' were set previously in bio_integrity_alloc().

    Fixes: commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug")
    Signed-off-by: yu kuai
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    yu kuai
     

22 Jun, 2020

3 commits

  • [ Upstream commit 81ca627a933063fa63a6d4c66425de822a2ab7f5 ]

    When the QoS targets are met and nothing is being throttled, there's no
    way to tell how saturated the underlying device is - it could be almost
    entirely idle, at the cusp of saturation, or anywhere in between. Given
    that there's no information, it's best to keep vrate as-is in this state.
    Before 7cd806a9a953 ("iocost: improve nr_lagging handling"), this was the
    case - if the device wasn't missing QoS targets and nothing was being
    throttled, busy_level was reset to zero.

    While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve
    nr_lagging handling") broke this. Now, while the device is hitting QoS
    targets and nothing is being throttled, vrate keeps getting adjusted
    according to the existing busy_level.

    This led to vrate climbing to its max whenever there's an IO issuer with
    limited request concurrency, if vrate started low: vrate gets adjusted
    upwards until the issuer can issue IOs without being throttled; from then
    on, QoS targets keep getting met, nothing on the system needs throttling,
    and vrate keeps getting increased due to the existing busy_level.

    This patch makes the following changes to the busy_level logic.

    * Reset busy_level if nr_shortages is zero, to avoid the above
    scenario.

    * Make non-zero nr_lagging block lowering busy_level, but still clear
    positive busy_level if there's a clear non-saturation signal - QoS
    targets are met and nr_shortages is zero. nr_lagging's role is to
    prevent adjusting vrate upwards while there are long-running commands,
    and it shouldn't keep busy_level positive while there's a clear
    non-saturation signal.

    * Restructure code for clarity and add comments.

    Signed-off-by: Tejun Heo
    Reported-by: Andy Newell
    Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling")
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tejun Heo
     
  • [ Upstream commit aa880ad690ab6d4c53934af85fb5a43e69ecb0f5 ]

    When we increase the hardware queue count, blk_mq_update_queue_map() will
    reset the mapping between cpus and hardware queues based on the hardware
    queue count (set->nr_hw_queues). The mapping cannot be reset if
    blk_mq_realloc_hw_ctxs() encounters an error, but the fallback flow will
    continue using it; blk_mq_map_swqueue() will then touch invalid memory,
    because the mapping points to a wrong hctx.

    blktest block/030:

    null_blk: module loaded
    Increasing nr_hw_queues to 8 fails, fallback to 1
    ==================================================================
    BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
    Read of size 8 at addr 0000000000000128 by task nproc/8541

    CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0xa5/0xe6
    __kasan_report.cold+0x65/0xbb
    kasan_report+0x45/0x60
    check_memory_region+0x15e/0x1c0
    __kasan_check_read+0x15/0x20
    blk_mq_map_swqueue+0x2f2/0x830
    __blk_mq_update_nr_hw_queues+0x3df/0x690
    blk_mq_update_nr_hw_queues+0x32/0x50
    nullb_device_submit_queues_store+0xde/0x160 [null_blk]
    configfs_write_file+0x1c4/0x250 [configfs]
    __vfs_write+0x4c/0x90
    vfs_write+0x14b/0x2d0
    ksys_write+0xdd/0x180
    __x64_sys_write+0x47/0x50
    do_syscall_64+0x6f/0x310
    entry_SYSCALL_64_after_hwframe+0x49/0xb3

    Signed-off-by: Weiping Zhang
    Tested-by: Bart van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Weiping Zhang
     
  • [ Upstream commit fd689871bbfbb41cd77379d3e9e5f4def0f7d6c6 ]

    Allocate a new map and requests for each new hardware queue when
    increasing the hardware queue count. Before this patch, only a warning
    was shown for each new hardware queue, but that's not enough: these hctxs
    have no maps and requests, so when a bio is mapped to one of these
    hardware queues, it triggers a kernel panic when getting a request from
    that hctx.

    Test environment:
    * A NVMe disk supports 128 io queues
    * 96 cpus in system

    A corner case can always trigger this panic: 96 io queues are allocated
    for the HCTX_TYPE_DEFAULT type, with the corresponding kernel log: nvme
    nvme0: 96/0/0 default/read/poll queues. Now we set nvme write queues to
    96, so nvme allocates the other (32) queues for read, but
    blk_mq_update_nr_hw_queues() does not allocate maps and requests for
    these newly added io queues. So when a process reads the nvme disk, it
    triggers a kernel panic when getting a request from these hardware
    contexts.

    Reproduce script:

    nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
    echo $nr > /sys/module/nvme/parameters/write_queues
    echo 1 > /sys/block/nvme0n1/device/reset_controller
    dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1

    [ 8040.805626] ------------[ cut here ]------------
    [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
    [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfnetlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
    [ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
    [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
    [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
    [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
    [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
    [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
    [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
    [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
    [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
    [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
    [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
    [ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
    [ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
    [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 8040.805647] PKRU: 55555554
    [ 8040.805647] Call Trace:
    [ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
    [ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
    [ 8040.805651] process_one_work+0x1a7/0x370
    [ 8040.805652] worker_thread+0x1c9/0x380
    [ 8040.805653] ? max_active_store+0x80/0x80
    [ 8040.805655] kthread+0x112/0x130
    [ 8040.805656] ? __kthread_parkme+0x70/0x70
    [ 8040.805657] ret_from_fork+0x35/0x40
    [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
    [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
    [ 8229.365165] #PF: supervisor read access in kernel mode
    [ 8229.365178] #PF: error_code(0x0000) - not-present page
    [ 8229.365191] PGD 0 P4D 0
    [ 8229.365201] Oops: 0000 [#1] SMP PTI
    [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
    [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
    [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
    [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
    [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
    [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
    [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
    [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
    [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
    [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
    [ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
    [ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
    [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 8229.365476] PKRU: 55555554
    [ 8229.365484] Call Trace:
    [ 8229.365498] ? finish_wait+0x80/0x80
    [ 8229.365512] blk_mq_get_request+0xcb/0x3f0
    [ 8229.365525] blk_mq_make_request+0x143/0x5d0
    [ 8229.365538] generic_make_request+0xcf/0x310
    [ 8229.365553] ? scan_shadow_nodes+0x30/0x30
    [ 8229.365564] submit_bio+0x3c/0x150
    [ 8229.365576] mpage_readpages+0x163/0x1a0
    [ 8229.365588] ? blkdev_direct_IO+0x490/0x490
    [ 8229.365601] read_pages+0x6b/0x190
    [ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
    [ 8229.365626] ondemand_readahead+0x182/0x2f0
    [ 8229.365639] generic_file_buffered_read+0x590/0xab0
    [ 8229.365655] new_sync_read+0x12a/0x1c0
    [ 8229.365666] vfs_read+0x8a/0x140
    [ 8229.365676] ksys_read+0x59/0xd0
    [ 8229.365688] do_syscall_64+0x55/0x1d0
    [ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Signed-off-by: Ming Lei
    Signed-off-by: Weiping Zhang
    Tested-by: Weiping Zhang
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

19 Jun, 2020

1 commit

  • * tag 'v5.4.47': (2193 commits)
    Linux 5.4.47
    KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
    KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
    ...

    Conflicts:
    arch/arm/boot/dts/imx6qdl.dtsi
    arch/arm/mach-imx/Kconfig
    arch/arm/mach-imx/common.h
    arch/arm/mach-imx/suspend-imx6.S
    arch/arm64/boot/dts/freescale/imx8qxp-mek.dts
    arch/powerpc/include/asm/cacheflush.h
    drivers/cpufreq/imx6q-cpufreq.c
    drivers/dma/imx-sdma.c
    drivers/edac/synopsys_edac.c
    drivers/firmware/imx/imx-scu.c
    drivers/net/ethernet/freescale/fec.h
    drivers/net/ethernet/freescale/fec_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/phy_device.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/usb/cdns3/gadget.c
    drivers/usb/dwc3/gadget.c
    include/uapi/linux/dma-buf.h

    Signed-off-by: Jason Liu

    Jason Liu
     

03 Jun, 2020

1 commit

  • [ Upstream commit b0beb28097fa04177b3769f4bb7a0d0d9c4ae76e ]

    This reverts commit c58c1f83436b501d45d4050fd1296d71a9760bcb.

    io_uring does do the right thing for this case, and we're still returning
    -EAGAIN to userspace for the cases we don't support. Revert this change
    to avoid doing endless spins of resubmits.

    Cc: stable@vger.kernel.org # v5.6
    Reported-by: Bijan Mottahedeh
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

14 May, 2020

1 commit

  • commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.

    abs_vdebt is an atomic64_t which tracks how much over budget a given
    cgroup is, and controls the activation of the use_delay mechanism. Once a
    cgroup goes over budget from forced IOs, it has to pay it back with its
    future budget. The progress guarantee on debt paying comes from the iocg
    being active - active iocgs are processed by the periodic timer, which
    ensures that as time passes the debts dissipate and the iocg returns to
    normal operation.

    However, both iocg activation and vdebt handling are asynchronous and a
    sequence like the following may happen.

    1. The iocg is in the process of being deactivated by the periodic timer.

    2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
    without anything because it still sees that the iocg is already active.

    3. The iocg is deactivated.

    4. The bio from #2 is over budget but needs to be forced. It increases
    abs_vdebt and goes over the threshold and enables use_delay.

    5. IO control is enabled for the iocg's subtree and now IOs are attributed
    to the descendant cgroups and the iocg itself no longer issues IOs.

    This leaves the iocg with stuck abs_vdebt - it has debt but is inactive,
    with no further IOs which can activate it. This can end up unduly
    punishing all the descendant cgroups.

    The usual throttling path has the same issue - the iocg must be active
    while throttled to ensure that future events will wake it up - and solves
    the problem by synchronizing the throttling path with a spinlock.
    abs_vdebt handling is another form of overage handling and shares a lot
    of characteristics, including the fact that it isn't in the hottest path.

    This patch fixes the above and other possible races by strictly
    synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.

    Signed-off-by: Tejun Heo
    Reported-by: Vlad Dmitriev
    Cc: stable@vger.kernel.org # v5.4+
    Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

02 May, 2020

2 commits

  • [ Upstream commit 5fe56de799ad03e92d794c7936bf363922b571df ]

    If in blk_mq_dispatch_rq_list() we find no budget, then we break out of
    the dispatch loop, but the request may keep the driver tag evaluated in
    'nxt' in the previous loop iteration.

    Fix by putting the driver tag for that request.
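
    A hedged sketch of the fix at the top of the dispatch loop (abridged):

        rq = list_first_entry(list, struct request, queuelist);
        hctx = rq->mq_hctx;
        if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
            /* rq may still hold the driver tag taken as 'nxt' last round */
            blk_mq_put_driver_tag(rq);
            break;
        }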

    Reviewed-by: Ming Lei
    Signed-off-by: John Garry
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    John Garry
     
  • commit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream.

    Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
    argument of the iocost_ioc_vrate_adj trace entry defined in
    include/trace/events/iocost.h leading to the following error:

    /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
    error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
    , u32[]* __tracepoint_arg_missed_ppm

    That argument type is indeed rather complex and hard to read. Looking at
    block/blk-iocost.c, it is just a 2-entry u32 array. By simplifying the
    argument to a simple "u32 *missed_ppm" and adjusting the trace entry
    accordingly, the compilation error is gone.
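
    The type change at a glance (an illustration; the real edit lives in the
    trace event's TP_PROTO()/TP_fast_assign()):

        u32 (*missed_ppm)[2];  /* pointer to a 2-element array: trips Systemtap */
        u32 *missed_ppm;       /* plain pointer: equivalent here, parses fine */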

    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Tejun Heo
    Signed-off-by: Waiman Long
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

23 Apr, 2020

3 commits

  • commit 4d38a87fbb77fb9ff2ff4e914162a8ae6453eff5 upstream.

    In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
    flush the rb tree that contains all idle entities belonging to the pd
    (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
    invoked before bfq_reparent_active_queues(). Yet the latter may happen
    to add some entities to the idle tree. It happens if, in some of the
    calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(),
    the queue to move is empty and gets expired.

    This commit simply reverses the invocation order between
    bfq_flush_idle_tree() and bfq_reparent_active_queues().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     
  • commit 576682fa52cbd95deb3773449566274f206acc58 upstream.

    bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
    entity represents just a bfq_queue in an entity tree). Yet, the input
    entity is guaranteed to always be a leaf entity only in two-level
    entity trees. In this respect, because of the error fixed by
    commit 14afc5936197 ("block, bfq: fix overwrite of bfq_group pointer
    in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened
    to actually have only two levels. After the latter commit, this does not
    hold any longer.

    This commit fixes this problem by modifying
    bfq_reparent_leaf_entity(), so that it searches an active leaf entity
    down the path that stems from the input entity. Such a leaf entity is
    guaranteed to exist when bfq_reparent_leaf_entity() is invoked.

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     
  • commit c8997736650060594845e42c5d01d3118aec8d25 upstream.

    A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
    goal of this put is to release a process reference to a bfq_queue. But
    process-reference releases may trigger also some extra operation, and,
    to this goal, are handled through bfq_release_process_ref(). So, turn
    the invocation of bfq_put_queue() into an invocation of
    bfq_release_process_ref().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

17 Apr, 2020

4 commits

  • [ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ]

    In the bfq_idle_slice_timer func, the read of bfqq =
    bfqd->in_service_queue is not in a bfqd->lock critical section. The bfqq,
    which is not equal to NULL in bfq_idle_slice_timer, may be freed after
    being passed to bfq_idle_slice_timer_body, so we may access freed memory.

    In addition, considering that bfqq may be subject to this race, we should
    first check whether bfqq is in service before doing anything to it in
    bfq_idle_slice_timer_body. If the racing bfqq is not in service, it means
    the bfqq has been expired through __bfq_bfqq_expire, and its wait_request
    flag has been cleared in __bfq_bfqd_reset_in_service. So we do not need
    to re-clear the wait_request flag of a bfqq which is not in service.
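
    A hedged sketch of the repaired shape - take bfqd->lock first, then
    re-check the in-service queue (details approximate):

        static void bfq_idle_slice_timer_body(struct bfq_data *bfqd,
                                              struct bfq_queue *bfqq)
        {
            unsigned long flags;

            spin_lock_irqsave(&bfqd->lock, flags);
            /* bfqq may have raced with expiry: only act if still in service */
            if (bfqq != bfqd->in_service_queue) {
                spin_unlock_irqrestore(&bfqd->lock, flags);
                return;
            }
            bfq_clear_bfqq_wait_request(bfqq);
            /* ... expire the queue or schedule a dispatch, as before ... */
            spin_unlock_irqrestore(&bfqd->lock, flags);
        }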

    KASAN log is given as follows:
    [13058.354613] ==================================================================
    [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
    [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
    [13058.354646]
    [13058.354655] CPU: 96 PID: 19767 Comm: fork13
    [13058.354661] Call trace:
    [13058.354667] dump_backtrace+0x0/0x310
    [13058.354672] show_stack+0x28/0x38
    [13058.354681] dump_stack+0xd8/0x108
    [13058.354687] print_address_description+0x68/0x2d0
    [13058.354690] kasan_report+0x124/0x2e0
    [13058.354697] __asan_load8+0x88/0xb0
    [13058.354702] bfq_idle_slice_timer+0xac/0x290
    [13058.354707] __hrtimer_run_queues+0x298/0x8b8
    [13058.354710] hrtimer_interrupt+0x1b8/0x678
    [13058.354716] arch_timer_handler_phys+0x4c/0x78
    [13058.354722] handle_percpu_devid_irq+0xf0/0x558
    [13058.354731] generic_handle_irq+0x50/0x70
    [13058.354735] __handle_domain_irq+0x94/0x110
    [13058.354739] gic_handle_irq+0x8c/0x1b0
    [13058.354742] el1_irq+0xb8/0x140
    [13058.354748] do_wp_page+0x260/0xe28
    [13058.354752] __handle_mm_fault+0x8ec/0x9b0
    [13058.354756] handle_mm_fault+0x280/0x460
    [13058.354762] do_page_fault+0x3ec/0x890
    [13058.354765] do_mem_abort+0xc0/0x1b0
    [13058.354768] el0_da+0x24/0x28
    [13058.354770]
    [13058.354773] Allocated by task 19731:
    [13058.354780] kasan_kmalloc+0xe0/0x190
    [13058.354784] kasan_slab_alloc+0x14/0x20
    [13058.354788] kmem_cache_alloc_node+0x130/0x440
    [13058.354793] bfq_get_queue+0x138/0x858
    [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328
    [13058.354801] bfq_init_rq+0x1f4/0x1180
    [13058.354806] bfq_insert_requests+0x264/0x1c98
    [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488
    [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0
    [13058.354826] blk_flush_plug_list+0x230/0x548
    [13058.354830] blk_finish_plug+0x60/0x80
    [13058.354838] read_pages+0xec/0x2c0
    [13058.354842] __do_page_cache_readahead+0x374/0x438
    [13058.354846] ondemand_readahead+0x24c/0x6b0
    [13058.354851] page_cache_sync_readahead+0x17c/0x2f8
    [13058.354858] generic_file_buffered_read+0x588/0xc58
    [13058.354862] generic_file_read_iter+0x1b4/0x278
    [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4]
    [13058.354972] __vfs_read+0x238/0x320
    [13058.354976] vfs_read+0xbc/0x1c0
    [13058.354980] ksys_read+0xdc/0x1b8
    [13058.354984] __arm64_sys_read+0x50/0x60
    [13058.354990] el0_svc_common+0xb4/0x1d8
    [13058.354994] el0_svc_handler+0x50/0xa8
    [13058.354998] el0_svc+0x8/0xc
    [13058.354999]
    [13058.355001] Freed by task 19731:
    [13058.355007] __kasan_slab_free+0x120/0x228
    [13058.355010] kasan_slab_free+0x10/0x18
    [13058.355014] kmem_cache_free+0x288/0x3f0
    [13058.355018] bfq_put_queue+0x134/0x208
    [13058.355022] bfq_exit_icq_bfqq+0x164/0x348
    [13058.355026] bfq_exit_icq+0x28/0x40
    [13058.355030] ioc_exit_icq+0xa0/0x150
    [13058.355035] put_io_context_active+0x250/0x438
    [13058.355038] exit_io_context+0xd0/0x138
    [13058.355045] do_exit+0x734/0xc58
    [13058.355050] do_group_exit+0x78/0x220
    [13058.355054] __wake_up_parent+0x0/0x50
    [13058.355058] el0_svc_common+0xb4/0x1d8
    [13058.355062] el0_svc_handler+0x50/0xa8
    [13058.355066] el0_svc+0x8/0xc
    [13058.355067]
    [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70
    which belongs to the cache bfq_queue of size 464
    [13058.355075] The buggy address is located 264 bytes inside of
    464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
    [13058.355077] The buggy address belongs to the page:
    [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
    [13058.366175] flags: 0x2ffffe0000008100(slab|head)
    [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
    [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
    [13058.370789] page dumped because: kasan: bad access detected
    [13058.370791]
    [13058.370792] Memory state around the buggy address:
    [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
    [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370808] ^
    [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    [13058.370817] ==================================================================
    [13058.370820] Disabling lock debugging due to kernel taint

    Here, we directly pass the bfqd to bfq_idle_slice_timer_body func.
    --
    V2->V3: rewrite the comment as suggested by Paolo Valente
    V1->V2: add one comment, and add Fixes and Reported-by tag.

    Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Acked-by: Paolo Valente
    Reported-by: Wang Wang
    Signed-off-by: Zhiqiang Liu
    Signed-off-by: Feilong Lin
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Zhiqiang Liu
     
  • [ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]

    There is a potential race between ioc_release_fn() and ioc_clear_queue(),
    as shown below, due to which the kernel crash below is observed. It can
    also result in a use-after-free issue.

    context#1                          context#2
    ioc_release_fn()                   __ioc_clear_queue() gets the same icq
    ->spin_lock(&ioc->lock);           ->spin_lock(&ioc->lock);
    ->ioc_destroy_icq(icq);
      ->list_del_init(&icq->q_node);
      ->call_rcu(&icq->__rcu_head,
                 icq_free_icq_rcu);
    ->spin_unlock(&ioc->lock);
                                       ->ioc_destroy_icq(icq);
                                         ->hlist_del_init(&icq->ioc_node);
                                         This results in the crash below, as
                                         this memory is now used by
                                         icq->__rcu_head in context#1. There
                                         is a chance that the icq could be
                                         free'd as well.

    22150.386550: Unable to handle kernel write to read-only memory
    at virtual address ffffffaa8d31ca50
    ...
    Call trace:
    22150.607350: ioc_destroy_icq+0x44/0x110
    22150.611202: ioc_clear_queue+0xac/0x148
    22150.615056: blk_cleanup_queue+0x11c/0x1a0
    22150.619174: __scsi_remove_device+0xdc/0x128
    22150.623465: scsi_forget_host+0x2c/0x78
    22150.627315: scsi_remove_host+0x7c/0x2a0
    22150.631257: usb_stor_disconnect+0x74/0xc8
    22150.635371: usb_unbind_interface+0xc8/0x278
    22150.639665: device_release_driver_internal+0x198/0x250
    22150.644897: device_release_driver+0x24/0x30
    22150.649176: bus_remove_device+0xec/0x140
    22150.653204: device_del+0x270/0x460
    22150.656712: usb_disable_device+0x120/0x390
    22150.660918: usb_disconnect+0xf4/0x2e0
    22150.664684: hub_event+0xd70/0x17e8
    22150.668197: process_one_work+0x210/0x480
    22150.672222: worker_thread+0x32c/0x4c8

    Fix this by adding a new ICQ_DESTROYED flag, set in ioc_destroy_icq(), to
    indicate that this icq has already been destroyed. Also, ensure
    __ioc_clear_queue() accesses the icq within rcu_read_lock/unlock, so that
    the icq doesn't get freed while it is still being used.

    Signed-off-by: Sahitya Tummala
    Co-developed-by: Pradeep P V K
    Signed-off-by: Pradeep P V K
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Sahitya Tummala
     
  • [ Upstream commit fd1bb3ae54a9a2e0c42709de861c69aa146b8955 ]

    Commit ecedd3d7e199 ("block, bfq: get extra ref to prevent a queue
    from being freed during a group move") gets an extra reference to a
    bfq_queue before possibly deactivating it (temporarily), in
    bfq_bfqq_move(). This prevents the bfq_queue from disappearing before
    being reactivated in its new group.

    Yet, the bfq_queue may also be expired (i.e., its service may be stopped)
    before the bfq_queue is deactivated, and an expiration may also lead to a
    premature freeing. This commit fixes this issue by simply moving earlier
    the taking of the extra reference already introduced by commit
    ecedd3d7e199 ("block, bfq: get extra ref to prevent a queue from being
    freed during a group move").

    Reported-by: cki-project@redhat.com
    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     
  • [ Upstream commit e74d93e96d721c4297f2a900ad0191890d2fc2b0 ]

    Field bdi->io_pages, added in commit 9491ae4aade6 ("mm: don't cap request
    size based on read-ahead setting"), removes the unneeded splitting of
    read requests.

    Stacked drivers do not call blk_queue_max_hw_sectors(). Instead they set
    the limits of their devices by blk_set_stacking_limits() +
    disk_stack_limits(). Field bdi->io_pages stays zero until the user sets
    max_sectors_kb via sysfs.

    This patch updates io_pages after merging limits in disk_stack_limits().

    Commit c6d6e9b0f6b4 ("dm: do not allow readahead to limit IO size") fixed
    the same problem for device-mapper devices; this one fixes MD RAIDs.
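
    The gist of the change, hedged (assuming it lands at the end of
    disk_stack_limits(), after blk_stack_limits() has merged the limits):

        t->backing_dev_info->io_pages =
                t->limits.max_sectors >> (PAGE_SHIFT - 9);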

    Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
    Reviewed-by: Paul Menzel
    Reviewed-by: Bob Liu
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Konstantin Khlebnikov
     

13 Apr, 2020

1 commit

  • commit 6e66b49392419f3fe134e1be583323ef75da1e4b upstream.

    blk_mq_map_queues() and multiple .map_queues() implementations expect that
    set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the number of hardware
    queues. Hence set .nr_queues before calling these functions. This patch
    fixes the following kernel warning:

    WARNING: CPU: 0 PID: 2501 at include/linux/cpumask.h:137
    Call Trace:
    blk_mq_run_hw_queue+0x19d/0x350 block/blk-mq.c:1508
    blk_mq_run_hw_queues+0x112/0x1a0 block/blk-mq.c:1525
    blk_mq_requeue_work+0x502/0x780 block/blk-mq.c:775
    process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
    worker_thread+0x98/0xe40 kernel/workqueue.c:2415
    kthread+0x361/0x430 kernel/kthread.c:255
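
    A hedged sketch of the ordering fix in blk_mq_update_queue_map()
    (approximate):

        static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
        {
            /*
             * blk_mq_map_queues() and .map_queues() implementations expect
             * map[HCTX_TYPE_DEFAULT].nr_queues to hold the hw queue count,
             * so sync it before remapping.
             */
            if (set->nr_maps == 1)
                set->map[HCTX_TYPE_DEFAULT].nr_queues = set->nr_hw_queues;

            if (set->ops->map_queues && !is_kdump_kernel())
                return set->ops->map_queues(set);
            return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
        }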

    Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0
    Reported-by: syzbot+d44e1b26ce5c3e77458d@syzkaller.appspotmail.com
    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Chaitanya Kulkarni
    Cc: Johannes Thumshirn
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

25 Mar, 2020

1 commit

  • [ Upstream commit 14afc59361976c0ba39e3a9589c3eaa43ebc7e1d ]

    The bfq_find_set_group() function takes as input a blkcg (which represents
    a cgroup) and retrieves the corresponding bfq_group, then it updates the
    bfq internal group hierarchy (see comments inside the function for why
    this is needed) and finally it returns the bfq_group.
    In the hierarchy update cycle, the pointer holding the correct bfq_group
    that has to be returned is mistakenly used to traverse the hierarchy
    bottom to top, meaning that in each iteration it gets overwritten with the
    parent of the current group. Since the update cycle stops at root's
    children (depth = 2), the overwrite becomes a problem only if the blkcg
    describes a cgroup at a hierarchy level deeper than that (depth > 2). In
    this case the root's child that happens to be also an ancestor of the
    correct bfq_group is returned. The main consequence is that processes
    contained in a cgroup at depth greater than 2 are wrongly placed in the
    group described above by BFQ.

    This commit fixes the problem by using a different bfq_group pointer in
    the update cycle, avoiding the overwrite of the variable holding the
    original group reference, as sketched below.
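
    A hedged sketch of the corrected traversal - a cursor variable keeps the
    return value intact (names approximate):

        entity = &bfqg->entity;
        for_each_entity(entity) {
            struct bfq_group *curr_bfqg = container_of(entity,
                                            struct bfq_group, entity);
            if (curr_bfqg != bfqd->root_group) {
                parent = bfqg_parent(curr_bfqg);
                if (!parent)
                    parent = bfqd->root_group;
                /* update curr_bfqg, not the bfqg we will return */
                bfq_group_set_parent(curr_bfqg, parent);
            }
        }
        return bfqg;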

    Reported-by: Kwon Je Oh
    Signed-off-by: Carlo Nonato
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Carlo Nonato
     

21 Mar, 2020

2 commits

  • [ Upstream commit cc3200eac4c5eb11c3f34848a014d1f286316310 ]

    commit 01e99aeca397 ("blk-mq: insert passthrough request into
    hctx->dispatch directly") may change flush requests to be added to the
    tail of the dispatch queue, via the 'add_head' parameter of
    blk_mq_sched_insert_request.

    It turns out this causes a performance regression on NCQ controllers,
    because flush is a non-NCQ command which can't be queued while any NCQ
    command is in flight. When adding the flush rq to the front of
    hctx->dispatch, it is easier to add extra time to the flush rq's latency,
    compared with adding it to the tail of the dispatch queue, because of
    S_SCHED_RESTART; the chance of flush merging is then increased, and fewer
    flush requests may be issued to the controller.

    So always insert flush requests at the front of the dispatch queue, just
    like before commit 01e99aeca397 ("blk-mq: insert passthrough request into
    hctx->dispatch directly") was applied.

    Cc: Damien Le Moal
    Cc: Shinichiro Kawasaki
    Reported-by: Shinichiro Kawasaki
    Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     
  • [ Upstream commit 01e99aeca3979600302913cef3f89076786f32c8 ]

    For some reason, a device may be in a situation where it can't handle FS
    requests, so STS_RESOURCE is always returned and the FS request is added
    to hctx->dispatch. However, a passthrough request may be required at that
    time to fix the problem. If the passthrough request is added to the
    scheduler queue, there isn't any chance for blk-mq to dispatch it, given
    that we prioritize requests in hctx->dispatch. Then the FS IO request may
    never be completed, and an IO hang is caused.

    So the passthrough request has to be added to hctx->dispatch directly to
    fix the IO hang.

    Fix this issue by inserting passthrough requests into hctx->dispatch
    directly, together with adding FS requests to the tail of hctx->dispatch
    in blk_mq_dispatch_rq_list(). Actually we add FS requests to the tail of
    hctx->dispatch by default; see blk_mq_request_bypass_insert().

    This makes it consistent with the original legacy IO request path, in
    which passthrough requests were always added to q->queue_head.

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei