08 Mar, 2020

1 commit

  • Merge Linux stable release v5.4.24 into imx_5.4.y

    * tag 'v5.4.24': (3306 commits)
    Linux 5.4.24
    blktrace: Protect q->blk_trace with RCU
    kvm: nVMX: VMWRITE checks unsupported field before read-only field
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6sll-evk.dts
    arch/arm/boot/dts/imx7ulp.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
    drivers/clk/imx/clk-composite-8m.c
    drivers/gpio/gpio-mxc.c
    drivers/irqchip/Kconfig
    drivers/mmc/host/sdhci-of-esdhc.c
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/can/flexcan.c
    drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
    drivers/net/ethernet/mscc/ocelot.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/realtek.c
    drivers/pci/controller/mobiveil/pcie-mobiveil-host.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/tee/optee/shm_pool.c
    drivers/usb/cdns3/gadget.c
    kernel/sched/cpufreq.c
    net/core/xdp.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c
    sound/soc/sof/core.c
    sound/soc/sof/imx/Kconfig
    sound/soc/sof/loader.c

    Jason Liu
     

24 Feb, 2020

1 commit

  • [ Upstream commit f718b093277df582fbf8775548a4f163e664d282 ]

    Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not
    referred by any process") fixed commit 3726112ec731 ("block, bfq:
    re-schedule empty queues if they deserve I/O plugging") by
    descheduling an empty bfq_queue when it remains with no process
    reference. Yet, this still left a case uncovered: an empty bfq_queue
    with no process reference that remains in service. This happens for
    an in-service sync bfq_queue that is deemed to deserve I/O-dispatch
    plugging when it remains empty. Yet no new requests will arrive for
    such a bfq_queue if no process sends requests to it any longer. Even
    worse, the bfq_queue may happen to be prematurely freed while still in
    service (because there may remain no reference to it any longer).

    This commit solves this problem by preventing I/O dispatch from being
    plugged for the in-service bfq_queue, if the latter has no process
    reference (the bfq_queue is then prevented from remaining in service).

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Tested-by: Oleksandr Natalenko
    Reported-by: Patrick Dung
    Tested-by: Patrick Dung
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     

26 Jan, 2020

1 commit

  • [ Upstream commit ece841abbed2da71fa10710c687c9ce9efb6bf69 ]

    7c20f11680a4 ("bio-integrity: stop abusing bi_end_io") moves
    bio_integrity_free from bio_uninit() to bio_integrity_verify_fn()
    and bio_endio(). This way looks wrong because bio may be freed
    without calling bio_endio(), for example, blk_rq_unprep_clone() is
    called from dm_mq_queue_rq() when the underlying queue of dm-mpath
    is busy.

    So commit 7c20f11680a4 causes a memory leak of bio integrity data.

    Fix this issue by re-adding bio_integrity_free() to bio_uninit().

    Fixes: 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Justin Tee

    Add commit log, and simplify/fix the original patch written by Justin.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Justin Tee
     

23 Jan, 2020

2 commits

  • commit c44a4edb20938c85b64a256661443039f5bffdea upstream.

    This patch fixes the following sparse warnings:

    block/bsg-lib.c:269:19: warning: incorrect type in initializer (different base types)
    block/bsg-lib.c:269:19: expected int sts
    block/bsg-lib.c:269:19: got restricted blk_status_t [usertype]
    block/bsg-lib.c:286:16: warning: incorrect type in return expression (different base types)
    block/bsg-lib.c:286:16: expected restricted blk_status_t
    block/bsg-lib.c:286:16: got int [assigned] sts

    Cc: Martin Wilck
    Fixes: d46fe2cb2dce ("block: drop device references in bsg_queue_rq()")
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     
  • commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.

    Logical block size has type unsigned short. That means that it can be at
    most 32768. However, there are architectures that can run with 64k pages
    (for example arm64) and on these architectures, it may be possible to
    create block devices with 64k block size.

    For example (run this on an architecture with 64k pages):

    Mount will fail with this error because it tries to read the superblock using 2-sector
    access:
    device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
    EXT4-fs (dm-0): unable to read superblock

    This patch changes the logical block size from unsigned short to unsigned
    int to avoid the overflow.
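
    The truncation described above can be reproduced in user space. This is a
    minimal sketch, not the kernel change itself: the type and function names
    are hypothetical, and uint16_t/uint32_t stand in for the kernel's
    unsigned short and unsigned int (an assumption about their widths).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the old and new field types. */
typedef uint16_t old_lbs_t; /* models a 16-bit unsigned short */
typedef uint32_t new_lbs_t; /* models a 32-bit unsigned int */

/* Store a block size in the old, narrow field: values >= 65536 wrap. */
old_lbs_t store_old(uint32_t size)
{
    return (old_lbs_t)size;
}

/* Store a block size in the widened field: 65536 is preserved. */
new_lbs_t store_new(uint32_t size)
{
    return (new_lbs_t)size;
}
```

    With these helpers, a 64k block size stored through store_old() wraps to
    zero, while store_new() keeps the value intact.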

    Cc: stable@vger.kernel.org
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Ming Lei
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

18 Jan, 2020

1 commit

  • commit 83c9c547168e8b914ea6398430473a4de68c52cc upstream.

    Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    adds bio_truncate() for handling bio EOD. However, bio_truncate()
    doesn't use the passed 'op' parameter from guard_bio_eod's callers.

    So bio_truncate() may retrieve the wrong 'op', and zeroing pages may
    not be done for a READ bio.

    Fix this issue by moving guard_bio_eod() after bio_set_op_attrs()
    in submit_bh_wbc() so that bio_truncate() can always retrieve correct
    op info.

    Meanwhile, remove the 'op' parameter from guard_bio_eod() because it
    isn't used any more.

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    Signed-off-by: Ming Lei
    Signed-off-by: Greg Kroah-Hartman

    Fold in kerneldoc and bio_op() change.

    Signed-off-by: Jens Axboe

    Ming Lei
     

12 Jan, 2020

3 commits

  • [ Upstream commit 3b7995a98ad76da5597b488fa84aa5a56d43b608 ]

    While fuzz testing, I got the following memleak report:

    BUG: memory leak
    unreferenced object 0xffff88837af80000 (size 4096):
    comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00 ...............
    backtrace:
    bio_alloc_bioset+0x393/0x590
    bio_copy_user_iov+0x300/0xcd0
    blk_rq_map_user_iov+0x2f1/0x5f0
    blk_rq_map_user+0xf2/0x160
    sg_common_write.isra.21+0x1094/0x1870
    sg_write.part.25+0x5d9/0x950
    sg_write+0x5f/0x8c
    __vfs_write+0x7c/0x100
    vfs_write+0x1c3/0x500
    ksys_write+0xf9/0x200
    do_syscall_64+0x9f/0x4f0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    If __blk_rq_map_user_iov() fails in blk_rq_map_user_iov(),
    the bio(s) allocated before the failure will leak. The
    refcount of each bio is initialized to 1 and increased to 2 by calling
    bio_get(), but __blk_rq_unmap_user() only decreases it to 1, so
    the bio cannot be freed. Fix it by calling blk_rq_unmap_user().
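
    The counting argument above can be illustrated with a toy refcount model.
    This is only a sketch: struct toy_bio and its helpers are hypothetical and
    are not the kernel's bio API. It assumes, as the message implies, that the
    error path must drop two references before the object can be freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy object; not the kernel's struct bio. */
struct toy_bio {
    int refcount;
    bool freed;
};

/* A new object starts with one reference. */
void toy_bio_init(struct toy_bio *b)
{
    b->refcount = 1;
    b->freed = false;
}

/* Take an extra reference (1 -> 2 in the scenario above). */
void toy_bio_get(struct toy_bio *b)
{
    assert(!b->freed);
    b->refcount++;
}

/* Drop one reference; the object is freed only when the count hits zero. */
void toy_bio_put(struct toy_bio *b)
{
    assert(b->refcount > 0);
    if (--b->refcount == 0)
        b->freed = true;
}
```

    After init and one get, a single put leaves the count at 1 and the object
    alive (the leak); only a second put brings it to zero.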

    Reviewed-by: Bob Liu
    Reported-by: Hulk Robot
    Signed-off-by: Yang Yingliang
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Yang Yingliang
     
  • [ Upstream commit b3c6a59975415bde29cfd76ff1ab008edbf614a9 ]

    Avoid that running test nvme/012 from the blktests suite triggers the
    following false positive lockdep complaint:

    ============================================
    WARNING: possible recursive locking detected
    5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
    --------------------------------------------
    ksoftirqd/1/16 is trying to acquire lock:
    000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    but task is already holding lock:
    00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&fq->mq_flush_lock)->rlock);
    lock(&(&fq->mq_flush_lock)->rlock);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    1 lock held by ksoftirqd/1/16:
    #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    stack backtrace:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    dump_stack+0x67/0x90
    __lock_acquire.cold.45+0x2b4/0x313
    lock_acquire+0x98/0x160
    _raw_spin_lock_irqsave+0x3b/0x80
    flush_end_io+0x4e/0x1d0
    blk_mq_complete_request+0x76/0x110
    nvmet_req_complete+0x15/0x110 [nvmet]
    nvmet_bio_done+0x27/0x50 [nvmet]
    blk_update_request+0xd7/0x2d0
    blk_mq_end_request+0x1a/0x100
    blk_flush_complete_seq+0xe5/0x350
    flush_end_io+0x12f/0x1d0
    blk_done_softirq+0x9f/0xd0
    __do_softirq+0xca/0x440
    run_ksoftirqd+0x24/0x50
    smpboot_thread_fn+0x113/0x1e0
    kthread+0x121/0x140
    ret_from_fork+0x3a/0x50

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Bart Van Assche
     
  • [ Upstream commit c58c1f83436b501d45d4050fd1296d71a9760bcb ]

    Non-mq devs do not honor REQ_NOWAIT so give a chance to the caller to repeat
    request gracefully on -EAGAIN error.

    The problem is well reproduced using io_uring:

    mkfs.ext4 /dev/ram0
    mount /dev/ram0 /mnt

    # Preallocate a file
    dd if=/dev/zero of=/mnt/file bs=1M count=1

    # Start fio with io_uring and get -EIO
    fio --rw=write --ioengine=io_uring --size=1M --direct=1 --name=job --filename=/mnt/file

    Signed-off-by: Roman Penyaev
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Roman Penyaev
     

09 Jan, 2020

4 commits

  • commit 21d37340912d74b1222d43c11aa9dd0687162573 upstream.

    These were added to blkdev_ioctl() in v4.20 but not blkdev_compat_ioctl,
    so add them now.

    Cc: # v4.20+
    Fixes: 72cd87576d1d ("block: Introduce BLKGETZONESZ ioctl")
    Fixes: 65e4e3eee83d ("block: Introduce BLKGETNRZONES ioctl")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • commit 673bdf8ce0a387ef585c13b69a2676096c6edfe9 upstream.

    These were added to blkdev_ioctl() but not blkdev_compat_ioctl,
    so add them now.

    Cc: # v4.10+
    Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • commit b2c0fcd28772f99236d261509bcd242135677965 upstream.

    These were added to blkdev_ioctl() in linux-5.5 but not
    blkdev_compat_ioctl, so add them now.

    Cc: # v4.4+
    Fixes: bbd3e064362e ("block: add an API for Persistent Reservations")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Greg Kroah-Hartman

    Fold in followup patch from Arnd with missing pr.h header include.

    Signed-off-by: Jens Axboe

    Arnd Bergmann
     
  • [ Upstream commit 85a8ce62c2eabe28b9d76ca4eecf37922402df93 ]

    Some filesystems, such as vfat, may send a bio which crosses the device
    boundary, and worse, an IO request starting within the device boundary
    can contain more than one segment past EOD.

    Commit dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    tries to fix this issue by returning -EIO for this situation. However,
    this approach denies fs user code the chance to handle -EIO, so
    sync_inodes_sb() may hang forever.

    Also, the current truncation of the last segment is dangerous because it
    updates the last bvec: the bvec table is then no longer immutable, and fs
    bio users may not be able to retrieve the truncated pages via
    bio_for_each_segment_all() in their .end_io callback.

    Fix this issue by supporting multi-segment truncation. The approach is
    also simpler:

    - just update bio size since block layer can make correct bvec with
    the updated bio size. Then bvec table becomes really immutable.

    - zero all truncated segments for read bio

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixed-by: dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

31 Dec, 2019

1 commit

  • commit d7bd15a138aef3be227818aad9c501e43c89c8c5 upstream.

    When over-budget IOs are force-issued through root cgroup,
    iocg_kick_delay() adjusts the async delay accordingly but doesn't
    actually schedule async throttle for the issuing task. This bug is
    pretty well masked because sooner or later the offending threads are
    gonna get directly throttled on regular IOs or have async delay
    scheduled by mem_cgroup_throttle_swaprate().

    However, it can affect control quality on filesystem metadata heavy
    operations. Let's fix it by invoking blkcg_schedule_throttle() when
    iocg_kick_delay() says async delay is needed.

    Signed-off-by: Tejun Heo
    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Cc: stable@vger.kernel.org
    Reported-by: Josef Bacik
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

21 Dec, 2019

1 commit

  • commit cc90bc68422318eb8e75b15cd74bc8d538a7df29 upstream.

    This partially reverts commit e3a5d8e386c3fb973fa75f2403622a8f3640ec06.

    Commit e3a5d8e386c3 ("check bi_size overflow before merge") adds a bio_full
    check to __bio_try_merge_page. This will cause __bio_try_merge_page to fail
    when the last bi_io_vec has been reached. Instead, what we want here is only
    the bi_size overflow check.

    Fixes: e3a5d8e386c3 ("block: check bi_size overflow before merge")
    Cc: stable@vger.kernel.org # v5.4+
    Reviewed-by: Ming Lei
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     

18 Dec, 2019

2 commits

  • commit d2c9be89f8ebe7ebcc97676ac40f8dec1cf9b43a upstream.

    8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
    avoids sysfs buffer overflow, and reserves one character for line break.
    However, the last snprintf() doesn't get the correct 'size' parameter
    passed in, so fix it.

    Fixes: 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Cc: Nobuhiro Iwamatsu
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 8962842ca5abdcf98e22ab3b2b45a103f0408b95 upstream.

    It is reported that a sysfs buffer overflow can be triggered if the system
    has too many CPU cores (>841 on 4K PAGE_SIZE) when showing the CPUs of an
    hctx via /sys/block/$DEV/mq/$N/cpu_list.

    Use snprintf to avoid the potential buffer overflow.

    This version doesn't change the attribute format, and simply stops
    showing CPU numbers if the buffer is going to overflow.
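
    The bounded formatting described above might look roughly like this in
    user space. It is a sketch under stated assumptions, not the kernel code:
    format_cpu_list() and its parameters are hypothetical, and a plain array
    stands in for the hctx CPU mask. One byte is reserved for the trailing
    line break, and output stops as soon as the next entry would not fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "c0, c1, ...\n" into buf without ever
 * exceeding buf_size. Returns the number of bytes written, not counting
 * the terminating NUL.
 */
size_t format_cpu_list(char *buf, size_t buf_size,
                       const unsigned int *cpus, size_t n)
{
    size_t limit, pos = 0, i;
    int ret;

    assert(buf_size >= 2);  /* need room for '\n' and the NUL */
    limit = buf_size - 1;   /* reserve one byte for the line break */

    for (i = 0; i < n; i++) {
        ret = snprintf(buf + pos, limit - pos, i ? ", %u" : "%u", cpus[i]);
        /* snprintf() reports the length it wanted; stop if it did not fit */
        if (ret < 0 || (size_t)ret >= limit - pos)
            break;
        pos += (size_t)ret;
    }

    /* The reserved byte guarantees space for "\n" plus the NUL here. */
    ret = snprintf(buf + pos, buf_size - pos, "\n");
    return pos + (size_t)ret;
}
```

    Note that the final call is given buf_size - pos rather than limit - pos;
    passing the smaller size there would drop the line break when the buffer
    is nearly full.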

    Cc: stable@vger.kernel.org
    Fixes: 676141e48af7 ("blk-mq: don't dump CPU -> hw queue map on driver load")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

25 Nov, 2019

1 commit

  • errata:
    When a read command returns less data than specified in the PRDs (for
    example, there are two PRDs for this command, but the device returns a
    number of bytes which is less than in the first PRD), the second PRD of
    this command is not read out of the PRD FIFO, causing the next command
    to use this PRD erroneously.

    workaround:
    - Force sg_tablesize = 1.
    - Modify the sg_io function in block/scsi_ioctl.c to use a 64k buffer
      allocated with dma_alloc_coherent during the probe in ahci_imx.
    - To fix the scsi/sata hang that occurs when CD_ROM and HDD are
      accessed simultaneously after the workaround is applied, do not go
      to sleep in scsi_eh_handler when there is a failed host.

    Signed-off-by: Richard Zhu

    Richard Zhu
     

15 Nov, 2019

1 commit


14 Nov, 2019

1 commit

  • Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if
    they deserve I/O plugging"), to prevent the service guarantees of a
    bfq_queue from being violated, the bfq_queue may be left busy, i.e.,
    scheduled for service, even if empty (see comments in
    __bfq_bfqq_expire() for details). But, if no process will send
    requests to the bfq_queue any longer, then there is no point in
    keeping the bfq_queue scheduled for service.

    In addition, keeping the bfq_queue scheduled for service, but with no
    process reference any longer, may cause the bfq_queue to be freed when
    descheduled from service. But this is assumed to never happen, and
    causes a UAF if it happens. This, in turn, caused crashes [1, 2].

    This commit fixes this issue by descheduling an empty bfq_queue when
    it remains with no process reference.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539
    [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Reported-by: Chris Evich
    Reported-by: Patrick Dung
    Reported-by: Thorsten Schubert
    Tested-by: Thorsten Schubert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

12 Nov, 2019

1 commit

  • __bio_try_merge_page() may merge a page to bio without bio_full() check
    and cause bi_size overflow.

    The overflow typically ends up with sd_init_command() warning on zero
    segment request with call trace like this:

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1986 at drivers/scsi/scsi_lib.c:1025 scsi_init_io+0x156/0x180
    CPU: 2 PID: 1986 Comm: kworker/2:1H Kdump: loaded Not tainted 5.4.0-rc7 #1
    Workqueue: kblockd blk_mq_run_work_fn
    RIP: 0010:scsi_init_io+0x156/0x180
    RSP: 0018:ffffa11487663bf0 EFLAGS: 00010246
    RAX: 00000000002be0a0 RBX: ffff8e6e9ff30118 RCX: 0000000000000000
    RDX: 00000000ffffffe1 RSI: 0000000000000000 RDI: ffff8e6e9ff30118
    RBP: ffffa11487663c18 R08: ffffa11487663d28 R09: ffff8e6e9ff30150
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff8e6e9ff30000
    R13: 0000000000000001 R14: ffff8e74a1cf1800 R15: ffff8e6e9ff30000
    FS: 0000000000000000(0000) GS:ffff8e6ea7680000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fff18cf0fe8 CR3: 0000000659f0a001 CR4: 00000000001606e0
    Call Trace:
    sd_init_command+0x326/0xb40 [sd_mod]
    scsi_queue_rq+0x502/0xaa0
    ? blk_mq_get_driver_tag+0xe7/0x120
    blk_mq_dispatch_rq_list+0x256/0x5a0
    ? elv_rb_del+0x24/0x30
    ? deadline_remove_request+0x7b/0xc0
    blk_mq_do_dispatch_sched+0xa3/0x140
    blk_mq_sched_dispatch_requests+0xfb/0x170
    __blk_mq_run_hw_queue+0x81/0x130
    blk_mq_run_work_fn+0x1b/0x20
    process_one_work+0x179/0x390
    worker_thread+0x4f/0x3e0
    kthread+0x105/0x140
    ? max_active_store+0x80/0x80
    ? kthread_bind+0x20/0x20
    ret_from_fork+0x35/0x40
    ---[ end trace f9036abf5af4a4d3 ]---
    blk_update_request: I/O error, dev sdd, sector 2875552 op 0x1:(WRITE) flags 0x0 phys_seg 0 prio class 0
    XFS (sdd1): writeback error on sector 2875552

    __bio_try_merge_page() should check the overflow before actually doing
    merge.
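
    A check of this kind can be sketched as follows. The struct and function
    names are hypothetical and only model an unsigned int size counter; this
    is an illustration of testing for overflow before adding, not the actual
    __bio_try_merge_page() code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model of a bio's size counter. */
struct toy_bio_size {
    unsigned int bi_size;
};

/* Add len to bi_size only if the sum still fits in an unsigned int. */
bool toy_try_merge(struct toy_bio_size *b, unsigned int len)
{
    /* Rearranged so the comparison itself cannot overflow. */
    if (b->bi_size > UINT_MAX - len)
        return false;
    b->bi_size += len;
    return true;
}
```

    Comparing against UINT_MAX - len rather than computing bi_size + len
    first is the point: the sum would wrap silently, while the subtraction
    cannot.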

    Fixes: 07173c3ec276c ("block: enable multipage bvecs")
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

07 Nov, 2019

1 commit

  • blkcg_print_stat() iterates blkgs under RCU and doesn't test whether
    the blkg is online. This can call into pd_stat_fn() on a pd which is
    still being initialized leading to an oops.

    The heaviest operation - recursively summing up rwstat counters - is
    already done while holding the queue_lock. Expand queue_lock to cover
    the other operations and skip the blkg if it isn't online yet. The
    online state is protected by both blkcg and queue locks, so this
    guarantees that only online blkgs are processed.

    Signed-off-by: Tejun Heo
    Reported-by: Roman Gushchin
    Cc: Josef Bacik
    Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Nov, 2019

1 commit

  • This code causes a static analysis warning:

    block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq'

    We disable IRQs in blkg_conf_prep() and re-enable them in
    blkg_conf_finish(). IRQ disable/enable should not be nested because
    that means the IRQs will be enabled at the first unlock instead of the
    second one.
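
    The nesting problem can be shown with a toy model of a CPU's interrupt
    state. Everything here is hypothetical (struct toy_cpu and the toy_*
    helpers are not kernel APIs); the model only captures that an
    unconditional "unlock and enable IRQs" re-enables them regardless of how
    many locks are still held. Using a plain lock for the inner section is
    one assumed way to avoid this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the illustration. */
struct toy_cpu {
    bool irqs_enabled;
    int locks_held;
};

/* Models a lock that also disables IRQs. */
void toy_lock_irq(struct toy_cpu *c)
{
    c->irqs_enabled = false;
    c->locks_held++;
}

/* Models an unlock that unconditionally re-enables IRQs. */
void toy_unlock_irq(struct toy_cpu *c)
{
    assert(c->locks_held > 0);
    c->locks_held--;
    c->irqs_enabled = true;
}

/* Models a plain lock/unlock pair that leaves the IRQ state alone. */
void toy_lock(struct toy_cpu *c)
{
    c->locks_held++;
}

void toy_unlock(struct toy_cpu *c)
{
    assert(c->locks_held > 0);
    c->locks_held--;
}
```

    Nesting two toy_lock_irq() sections leaves IRQs enabled after the inner
    unlock while the outer lock is still held; nesting toy_lock() inside
    toy_lock_irq() keeps them disabled until the outer unlock.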

    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Acked-by: Tejun Heo
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe

    Dan Carpenter
     

16 Oct, 2019

2 commits

  • rq_qos_del() incorrectly assigns the node being deleted to the head if
    it was the first on the list in the !prev path. Fix it by iterating
    with ** instead.
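
    Iterating with a pointer-to-pointer removes the need for a separate
    "previous node" case. The following is a generic sketch of that idiom on
    a hypothetical singly linked list (struct toy_node and toy_list_del() are
    illustration names, not the rq_qos code).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked list node. */
struct toy_node {
    int id;
    struct toy_node *next;
};

/*
 * Unlink victim from the list. cur always points at the link that leads
 * to the current node, so the head and interior nodes are handled alike.
 */
void toy_list_del(struct toy_node **head, struct toy_node *victim)
{
    struct toy_node **cur;

    assert(head != NULL && victim != NULL);

    for (cur = head; *cur; cur = &(*cur)->next) {
        if (*cur == victim) {
            *cur = victim->next; /* bypass the victim */
            victim->next = NULL;
            break;
        }
    }
}
```

    When the victim is the first element, cur equals head, so the head is
    updated to the victim's successor rather than to the victim itself.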

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg_activate_policy() has the following bugs.

    * cf09a8ee19ad ("blkcg: pass @q and @blkcg into
    blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
    blkcg_activate_policy() ends up using pd's allocated for the root
    blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
    can be passed in pd's which are allocated for the root blkcg.

    For blk-iocost, this means that ->pd_init_fn() can write beyond the
    end of the allocated object as it determines the length of the flex
    array at the end based on the blkcg's nesting level.

    * Each pd is initialized as they get allocated. If alloc fails, the
    policy will get freed with pd's initialized on it.

    * After the above partial failure, the partial pds are not freed.

    This patch fixes all the above issues by

    * Restructuring blkcg_activate_policy() so that alloc and init passes
    are separate. Init takes place only after all allocs succeeded and
    on failure all allocated pds are freed.

    * Unifying and fixing the cleanup of the remaining pd_prealloc.
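
    The two-pass structure described above can be sketched generically. This
    is not blkcg_activate_policy() itself: struct toy_pd, the allocator hooks
    and toy_activate() are hypothetical names, and the failing allocator only
    exists to exercise the error path.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-group policy data. */
struct toy_pd {
    int initialized;
};

/* Allocator hook so that a failure can be simulated. */
typedef struct toy_pd *(*toy_alloc_fn)(size_t idx);

struct toy_pd *toy_alloc_ok(size_t idx)
{
    (void)idx;
    return calloc(1, sizeof(struct toy_pd));
}

struct toy_pd *toy_alloc_fail_at_2(size_t idx)
{
    return idx == 2 ? NULL : toy_alloc_ok(idx);
}

/* Returns 0 on success, -1 if any allocation failed. */
int toy_activate(struct toy_pd **slots, size_t n, toy_alloc_fn alloc)
{
    size_t i;

    assert(slots != NULL && alloc != NULL);

    /* Pass 1: allocate only, nothing is initialized yet. */
    for (i = 0; i < n; i++) {
        slots[i] = alloc(i);
        if (!slots[i])
            goto fail;
    }

    /* Pass 2: initialize only after every allocation succeeded. */
    for (i = 0; i < n; i++)
        slots[i]->initialized = 1;
    return 0;

fail:
    /* Free everything allocated before the failing index. */
    while (i-- > 0) {
        free(slots[i]);
        slots[i] = NULL;
    }
    return -1;
}
```

    Because initialization is deferred to the second pass, the failure path
    never has to undo a partially initialized object; it only frees memory.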

    Signed-off-by: Tejun Heo
    Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Oct, 2019

1 commit

  • A BIO based request queue does not have a tag_set, which prevent testing
    for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not
    require an elevator. This leads to an incorrect initialization of a
    default elevator in some cases such as BIO based null_blk
    (queue_mode == BIO) with zoned mode enabled as the default elevator in
    this case is mq-deadline instead of "none".

    Fix this by testing for a NULL queue mq_ops field which indicates that
    the queue is BIO based and should not have an elevator.

    Reported-by: Shinichiro Kawasaki
    Reviewed-by: Bob Liu
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

06 Oct, 2019

1 commit

    scale_up wakes up waiters after scaling up. But after scaling to the max,
    it should not wake up more waiters, as the waiters will not have anything
    to do. This patch fixes this by making scale_up (and also scale_down)
    return when the threshold is reached.

    This bug causes increased fdatasync latency when fdatasync and dd
    conv=sync are performed in parallel on 4.19 compared to 4.14. This
    bug was introduced during refactoring of blk-wbt code.

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Cc: stable@vger.kernel.org
    Cc: Josef Bacik
    Signed-off-by: Harshad Shirwadkar
    Signed-off-by: Jens Axboe

    Harshad Shirwadkar
     

04 Oct, 2019

2 commits

  • sparse warns about incorrect type when using __be64 data.
    It is not being converted to CPU-endian but it should be.

    Fixes these sparse warnings:

    ../block/sed-opal.c:375:20: warning: incorrect type in assignment (different base types)
    ../block/sed-opal.c:375:20: expected unsigned long long [usertype] align
    ../block/sed-opal.c:375:20: got restricted __be64 const [usertype] alignment_granularity
    ../block/sed-opal.c:376:25: warning: incorrect type in assignment (different base types)
    ../block/sed-opal.c:376:25: expected unsigned long long [usertype] lowest_lba
    ../block/sed-opal.c:376:25: got restricted __be64 const [usertype] lowest_aligned_lba
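
    In the kernel, be64_to_cpu() performs this kind of conversion. What a
    big-endian to CPU-endian conversion does can be sketched portably in user
    space as below; toy_be64_to_host() is a hypothetical helper for
    illustration and is not the code used in sed-opal.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Interpret 8 bytes stored most-significant-byte first as a host-order
 * 64-bit integer. Works the same on little- and big-endian hosts.
 */
uint64_t toy_be64_to_host(const unsigned char bytes[8])
{
    uint64_t v = 0;
    int i;

    for (i = 0; i < 8; i++)
        v = (v << 8) | bytes[i];
    return v;
}
```

    Assigning the raw big-endian field without such a conversion would yield
    a byte-swapped value on little-endian CPUs, which is what the sparse
    annotations are designed to catch.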

    Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
    Cc: Scott Bauer
    Cc: Rafael Antognolli
    Cc: linux-block@vger.kernel.org
    Reviewed-by: Jon Derrick
    Signed-off-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Randy Dunlap
     
  • Fix sparse warning: (missing '=')
    ../block/sed-opal.c:133:17: warning: obsolete array initializer, use C99 syntax

    Fixes: ff91064ea37c ("block: sed-opal: check size of shadow mbr")
    Cc: linux-block@vger.kernel.org
    Cc: Jonas Rabenstein
    Cc: David Kozub
    Reviewed-by: Scott Bauer
    Reviewed-by: Revanth Rajashekar
    Signed-off-by: Randy Dunlap
    Signed-off-by: Jens Axboe

    Randy Dunlap
     

28 Sep, 2019

2 commits

    Some HDD drives may expose multiple hardware queues, such as MegaRaid.
    Let's apply the normal plugging for such devices because sequential IO
    may benefit a lot from plug merging.

    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Dave Chinner
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • If a device is using multiple queues, the IO scheduler may be bypassed.
    This may hurt performance for some slow MQ devices, and it also breaks
    zoned devices which depend on mq-deadline for respecting the write order
    in one zone.

    Don't bypass the IO scheduler if we have one set up.

    This patch can roughly double sequential write performance on MQ
    scsi_debug when mq-deadline is applied.

    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Dave Chinner
    Reviewed-by: Javier González
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

27 Sep, 2019

2 commits

    We got a NULL pointer dereference BUG_ON in blk_mq_rq_timed_out()
    as follows:

    [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
    [ 108.827059] PGD 0 P4D 0
    [ 108.827313] Oops: 0000 [#1] SMP PTI
    [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
    [ 108.829503] Workqueue: kblockd blk_mq_timeout_work
    [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
    [ 108.838191] Call Trace:
    [ 108.838406] bt_iter+0x74/0x80
    [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
    [ 108.839074] ? __switch_to_asm+0x34/0x70
    [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
    [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
    [ 108.840732] blk_mq_timeout_work+0x74/0x200
    [ 108.841151] process_one_work+0x297/0x680
    [ 108.841550] worker_thread+0x29c/0x6f0
    [ 108.841926] ? rescuer_thread+0x580/0x580
    [ 108.842344] kthread+0x16a/0x1a0
    [ 108.842666] ? kthread_flush_work+0x170/0x170
    [ 108.843100] ret_from_fork+0x35/0x40

    The bug is caused by a race between the timeout handler and completion
    of the flush request.

    When the timeout handler blk_mq_rq_timed_out() tries to read
    'req->q->mq_ops', 'req' has already completed and been reinitialized by
    the next flush request, which calls blk_rq_init() to zero 'req'.

    After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
    the lifetime of normal requests is protected by a refcount. A request
    can only be freed once 'rq->ref' drops to zero. Thus, these requests
    cannot be reused before the timeout handler finishes.

    However, a flush request has .end_io defined, and rq->end_io() is still
    called even if 'rq->ref' has not dropped to zero. After that, 'flush_rq'
    can be reused by the next flush request, resulting in the NULL
    pointer dereference.

    We fix this problem by covering the flush request with 'rq->ref'.
    If the refcount is not zero, flush_end_io() returns and waits for the
    last holder to call it again. To record the request status, we add a new
    entry 'rq_status', which is used in flush_end_io().

    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Bart Van Assche
    Cc: stable@vger.kernel.org # v4.18+
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Signed-off-by: Yufen Yu

    -------
    v2:
    - move rq_status from struct request to struct blk_flush_queue
    v3:
    - remove unnecessary '{}' pair.
    v4:
    - let spinlock to protect 'fq->rq_status'
    v5:
    - move rq_status after flush_running_idx member of struct blk_flush_queue
    Signed-off-by: Jens Axboe

    Yufen Yu
     
  • We have updated limits after calling wbt_set_min_lat(). No need to
    update again.

    Reviewed-by: Bob Liu
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

26 Sep, 2019

5 commits

  • The default hard disk param sets latency targets at 50ms. As the
    default target percentiles are zero, these don't directly regulate
    vrate; however, they're still used to calculate the period length -
    100ms in this case.

    This is excessively low. A SATA drive with QD32 saturated with random
    IOs can easily reach avg completion latency of several hundred msecs.
    A period duration which is substantially lower than avg completion
    latency can lead to wildly fluctuating vrate.

    Let's bump up the default latency targets to 250ms so that the period
    duration is sufficiently long.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Some IOs may span multiple periods. As latencies are collected on
    completion, the inbetween periods won't register them and may
    incorrectly decide to increase vrate. nr_lagging tracks these IOs to
    avoid those situations. Currently, whenever there are IOs which are
    spanning from the previous period, busy_level is reset to 0 if
    negative thus suppressing vrate increase.

    This has the following two problems.

    * When latency target percentiles aren't set, vrate adjustment should
    only be governed by queue depth depletion; however, the current code
    keeps nr_lagging active which pulls in latency results and can keep
    down vrate unexpectedly.

    * When lagging condition is detected, it resets the entire negative
    busy_level. This turned out to be way too aggressive on some
    devices which sometimes experience extended latencies on a small
    subset of commands. In addition, a lagging IO will be accounted as
    latency target miss on completion anyway and resetting busy_level
    amplifies its impact unnecessarily.

    This patch fixes the above two problems by disabling nr_lagging
    counting when latency target percentiles aren't set and blocking vrate
    increases when there are lagging IOs while leaving busy_level as-is.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
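
    The before/after behaviour described in the commit above can be modelled
    roughly like this. It is a simplified user-space sketch, not the kernel
    implementation: the struct, its field layout and both function names are
    hypothetical, and only the rules stated in the message are encoded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the state the message talks about. */
struct vrate_state {
    int busy_level;     /* negative values ask for a vrate increase */
    int nr_lagging;     /* IOs still in flight from an earlier period */
    bool lat_pct_set;   /* are latency target percentiles configured? */
};

/* Old rule: lagging IOs wipe out a negative busy_level entirely. */
static bool may_increase_vrate_old(struct vrate_state *s)
{
    if (s->nr_lagging > 0 && s->busy_level < 0)
        s->busy_level = 0;
    return s->busy_level < 0;
}

/*
 * New rule: lagging IOs are ignored when no percentiles are set, and
 * otherwise only block the increase while busy_level is left as-is.
 */
static bool may_increase_vrate_new(const struct vrate_state *s)
{
    int lagging = s->lat_pct_set ? s->nr_lagging : 0;

    return s->busy_level < 0 && lagging == 0;
}
```

    In this model a state of busy_level = -3 with two lagging IOs blocks the
    increase under both rules, but only the old rule throws away the
    accumulated -3.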
     
  • The vrate_adj tracepoint traces vrate changes; however, it does so only
    when busy_level is non-zero. busy_level turning to zero can sometimes
    be as interesting an event. This patch also enables the vrate_adj
    tracepoint on other vrate related events - busy_level changes and
    non-zero nr_lagging.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
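
    The widened trigger condition from the commit above can be written as a
    single predicate. This is an illustrative sketch; the function name and
    its parameters are hypothetical and do not mirror the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical predicate: emit the vrate_adj trace event when busy_level is
 * non-zero (the old condition), when busy_level changed since the previous
 * period, or when there are lagging IOs.
 */
static bool should_trace_vrate_adj(int prev_busy_level, int busy_level,
                                   int nr_lagging)
{
    return busy_level != 0 ||
           busy_level != prev_busy_level ||
           nr_lagging != 0;
}
```

    With this predicate a transition from -1 to 0 is traced, which the old
    "busy_level is non-zero" check alone would have missed.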
     
  • cecf5d87ff20 ("block: split .sysfs_lock into two locks") started to
    release & re-acquire sysfs_lock around registering/un-registering the
    elevator queue while switching elevators, in order to avoid a potential
    deadlock between showing & storing 'queue/iosched' attributes and
    removing the elevator's kobject.

    It turns out there is no such deadlock, because 'q->sysfs_lock' isn't
    required in .show & .store of queue/iosched's attributes, and only the
    elevator's sysfs lock is acquired in elv_iosched_store() and
    elv_iosched_show(). So it is safe to hold the queue's sysfs lock when
    registering/un-registering the elevator queue.

    The biggest issue is that commit cecf5d87ff20 assumes that concurrent
    writes to 'queue/scheduler' can't happen. However, this assumption isn't
    true, because kernfs_fop_write() only guarantees that concurrent writes
    aren't issued on the same open file, but the writes could come from
    different opens of the file. So we can't release & re-acquire the queue's
    sysfs lock while switching elevators, otherwise a use-after-free on the
    elevator could be triggered.

    Fix the issue by not releasing the queue's sysfs lock while switching
    elevators.

    Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Commit c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq")
    removed q->sysfs_lock from elevator_init_mq(), but forgot to deal with
    the lockdep_assert_held() called in blk_mq_sched_free_requests(), which
    is run in the failure path of elevator_init_mq().

    blk_mq_sched_free_requests() is called in the following 3 functions:

    elevator_init_mq()
    elevator_exit()
    blk_cleanup_queue()

    In blk_cleanup_queue(), blk_mq_sched_free_requests() is immediately
    followed by 'mutex_lock(&q->sysfs_lock)'.

    So move the lockdep_assert_held() from blk_mq_sched_free_requests()
    into elevator_exit() to fix the issue reported by syzbot.

    Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
    Fixes: c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq")
    Reviewed-by: Bart Van Assche
    Reviewed-by: Damien Le Moal
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
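
    The effect of moving the assertion, as described in the commit above, can
    be illustrated with a small user-space model. Everything here is a
    stand-in: the struct, the model_* function names and the boolean that
    plays the role of q->sysfs_lock are hypothetical, and a plain assert()
    substitutes for lockdep_assert_held().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the request queue state involved. */
struct queue_model {
    bool sysfs_lock_held;   /* models whether q->sysfs_lock is held */
    int sched_requests;     /* models scheduler-owned requests */
};

/* No lock assertion here any more, so lock-free callers are allowed. */
static void model_sched_free_requests(struct queue_model *q)
{
    q->sched_requests = 0;
}

/* Failure path of elevator init: runs without the sysfs lock. */
static void model_elevator_init_failed(struct queue_model *q)
{
    model_sched_free_requests(q);
}

/* The assertion now lives in the exit path, which must hold the lock. */
static void model_elevator_exit(struct queue_model *q)
{
    assert(q->sysfs_lock_held);
    model_sched_free_requests(q);
}
```

    In this model the init failure path no longer trips the assertion, while
    a caller of model_elevator_exit() that does not hold the lock still
    would.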
     

25 Sep, 2019

1 commit

  • Pull more block updates from Jens Axboe:
    "Some later additions that weren't quite done for the first pull
    request, and also a few fixes that have arrived since.

    This contains:

    - Kill silly pktcdvd warning on attempting to register a non-scsi
    passthrough device (me)

    - Use symbolic constants for the block t10 protection types, and
    switch to handling it in core rather than in the drivers (Max)

    - libahci platform missing node put fix (Nishka)

    - Small series of fixes for BFQ (Paolo)

    - Fix possible nbd crash (Xiubo)"

    * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
    block: drop device references in bsg_queue_rq()
    block: t10-pi: fix -Wswitch warning
    pktcdvd: remove warning on attempting to register non-passthrough dev
    ata: libahci_platform: Add of_node_put() before loop exit
    nbd: fix possible page fault for nbd disk
    nbd: rename the runtime flags as NBD_RT_ prefixed
    block, bfq: push up injection only after setting service time
    block, bfq: increase update frequency of inject limit
    block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
    block, bfq: update inject limit only after injection occurred
    block: centralize PI remapping logic to the block layer
    block: use symbolic constants for t10_pi type

    Linus Torvalds
     

24 Sep, 2019

1 commit