08 Mar, 2020
1 commit
-
Merge Linux stable release v5.4.24 into imx_5.4.y
* tag 'v5.4.24': (3306 commits)
Linux 5.4.24
blktrace: Protect q->blk_trace with RCU
kvm: nVMX: VMWRITE checks unsupported field before read-only field
...Signed-off-by: Jason Liu
Conflicts:
arch/arm/boot/dts/imx6sll-evk.dts
arch/arm/boot/dts/imx7ulp.dtsi
arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
drivers/clk/imx/clk-composite-8m.c
drivers/gpio/gpio-mxc.c
drivers/irqchip/Kconfig
drivers/mmc/host/sdhci-of-esdhc.c
drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
drivers/net/can/flexcan.c
drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
drivers/net/ethernet/mscc/ocelot.c
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
drivers/net/phy/realtek.c
drivers/pci/controller/mobiveil/pcie-mobiveil-host.c
drivers/perf/fsl_imx8_ddr_perf.c
drivers/tee/optee/shm_pool.c
drivers/usb/cdns3/gadget.c
kernel/sched/cpufreq.c
net/core/xdp.c
sound/soc/fsl/fsl_esai.c
sound/soc/fsl/fsl_sai.c
sound/soc/sof/core.c
sound/soc/sof/imx/Kconfig
sound/soc/sof/loader.c
24 Feb, 2020
1 commit
-
[ Upstream commit f718b093277df582fbf8775548a4f163e664d282 ]
Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not
referred by any process") fixed commit 3726112ec731 ("block, bfq:
re-schedule empty queues if they deserve I/O plugging") by
descheduling an empty bfq_queue when it remains with not process
reference. Yet, this still left a case uncovered: an empty bfq_queue
with not process reference that remains in service. This happens for
an in-service sync bfq_queue that is deemed to deserve I/O-dispatch
plugging when it remains empty. Yet no new requests will arrive for
such a bfq_queue if no process sends requests to it any longer. Even
worse, the bfq_queue may happen to be prematurely freed while still in
service (because there may remain no reference to it any longer).This commit solves this problem by preventing I/O dispatch from being
plugged for the in-service bfq_queue, if the latter has no process
reference (the bfq_queue is then prevented from remaining in service).Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
Tested-by: Oleksandr Natalenko
Reported-by: Patrick Dung
Tested-by: Patrick Dung
Signed-off-by: Paolo Valente
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
26 Jan, 2020
1 commit
-
[ Upstream commit ece841abbed2da71fa10710c687c9ce9efb6bf69 ]
7c20f11680a4 ("bio-integrity: stop abusing bi_end_io") moves
bio_integrity_free from bio_uninit() to bio_integrity_verify_fn()
and bio_endio(). This way looks wrong because bio may be freed
without calling bio_endio(), for example, blk_rq_unprep_clone() is
called from dm_mq_queue_rq() when the underlying queue of dm-mpath
is busy.So memory leak of bio integrity data is caused by commit 7c20f11680a4.
Fixes this issue by re-adding bio_integrity_free() to bio_uninit().
Fixes: 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io")
Reviewed-by: Christoph Hellwig
Signed-off-by Justin TeeAdd commit log, and simplify/fix the original patch wroten by Justin.
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
23 Jan, 2020
2 commits
-
commit c44a4edb20938c85b64a256661443039f5bffdea upstream.
This patch fixes the following sparse warnings:
block/bsg-lib.c:269:19: warning: incorrect type in initializer (different base types)
block/bsg-lib.c:269:19: expected int sts
block/bsg-lib.c:269:19: got restricted blk_status_t [usertype]
block/bsg-lib.c:286:16: warning: incorrect type in return expression (different base types)
block/bsg-lib.c:286:16: expected restricted blk_status_t
block/bsg-lib.c:286:16: got int [assigned] stsCc: Martin Wilck
Fixes: d46fe2cb2dce ("block: drop device references in bsg_queue_rq()")
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman -
commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.
Logical block size has type unsigned short. That means that it can be at
most 32768. However, there are architectures that can run with 64k pages
(for example arm64) and on these architectures, it may be possible to
create block devices with 64k block size.For exmaple (run this on an architecture with 64k pages):
Mount will fail with this error because it tries to read the superblock using 2-sector
access:
device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
EXT4-fs (dm-0): unable to read superblockThis patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen
Reviewed-by: Ming Lei
Signed-off-by: Mikulas Patocka
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
18 Jan, 2020
1 commit
-
commit 83c9c547168e8b914ea6398430473a4de68c52cc upstream.
Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
adds bio_truncate() for handling bio EOD. However, bio_truncate()
doesn't use the passed 'op' parameter from guard_bio_eod's callers.So bio_trunacate() may retrieve wrong 'op', and zering pages may
not be done for READ bio.Fixes this issue by moving guard_bio_eod() after bio_set_op_attrs()
in submit_bh_wbc() so that bio_truncate() can always retrieve correct
op info.Meantime remove the 'op' parameter from guard_bio_eod() because it isn't
used any more.Cc: Carlos Maiolino
Cc: linux-fsdevel@vger.kernel.org
Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
Signed-off-by: Ming Lei
Signed-off-by: Greg Kroah-HartmanFold in kerneldoc and bio_op() change.
Signed-off-by: Jens Axboe
12 Jan, 2020
3 commits
-
[ Upstream commit 3b7995a98ad76da5597b488fa84aa5a56d43b608 ]
When I doing fuzzy test, get the memleak report:
BUG: memory leak
unreferenced object 0xffff88837af80000 (size 4096):
comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00 ...............
backtrace:
[] bio_alloc_bioset+0x393/0x590
[] bio_copy_user_iov+0x300/0xcd0
[] blk_rq_map_user_iov+0x2f1/0x5f0
[] blk_rq_map_user+0xf2/0x160
[] sg_common_write.isra.21+0x1094/0x1870
[] sg_write.part.25+0x5d9/0x950
[] sg_write+0x5f/0x8c
[] __vfs_write+0x7c/0x100
[] vfs_write+0x1c3/0x500
[] ksys_write+0xf9/0x200
[] do_syscall_64+0x9f/0x4f0
[] entry_SYSCALL_64_after_hwframe+0x49/0xbeIf __blk_rq_map_user_iov() is failed in blk_rq_map_user_iov(),
the bio(s) which is allocated before this failing will leak. The
refcount of the bio(s) is init to 1 and increased to 2 by calling
bio_get(), but __blk_rq_unmap_user() only decrease it to 1, so
the bio cannot be freed. Fix it by calling blk_rq_unmap_user().Reviewed-by: Bob Liu
Reported-by: Hulk Robot
Signed-off-by: Yang Yingliang
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin -
[ Upstream commit b3c6a59975415bde29cfd76ff1ab008edbf614a9 ]
Avoid that running test nvme/012 from the blktests suite triggers the
following false positive lockdep complaint:============================================
WARNING: possible recursive locking detected
5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
--------------------------------------------
ksoftirqd/1/16 is trying to acquire lock:
000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0but task is already holding lock:
00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0other info that might help us debug this:
Possible unsafe locking scenario:CPU0
----
lock(&(&fq->mq_flush_lock)->rlock);
lock(&(&fq->mq_flush_lock)->rlock);*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by ksoftirqd/1/16:
#0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0stack backtrace:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
dump_stack+0x67/0x90
__lock_acquire.cold.45+0x2b4/0x313
lock_acquire+0x98/0x160
_raw_spin_lock_irqsave+0x3b/0x80
flush_end_io+0x4e/0x1d0
blk_mq_complete_request+0x76/0x110
nvmet_req_complete+0x15/0x110 [nvmet]
nvmet_bio_done+0x27/0x50 [nvmet]
blk_update_request+0xd7/0x2d0
blk_mq_end_request+0x1a/0x100
blk_flush_complete_seq+0xe5/0x350
flush_end_io+0x12f/0x1d0
blk_done_softirq+0x9f/0xd0
__do_softirq+0xca/0x440
run_ksoftirqd+0x24/0x50
smpboot_thread_fn+0x113/0x1e0
kthread+0x121/0x140
ret_from_fork+0x3a/0x50Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Hannes Reinecke
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin -
[ Upstream commit c58c1f83436b501d45d4050fd1296d71a9760bcb ]
Non-mq devs do not honor REQ_NOWAIT so give a chance to the caller to repeat
request gracefully on -EAGAIN error.The problem is well reproduced using io_uring:
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt# Preallocate a file
dd if=/dev/zero of=/mnt/file bs=1M count=1# Start fio with io_uring and get -EIO
fio --rw=write --ioengine=io_uring --size=1M --direct=1 --name=job --filename=/mnt/fileSigned-off-by: Roman Penyaev
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
09 Jan, 2020
4 commits
-
commit 21d37340912d74b1222d43c11aa9dd0687162573 upstream.
These were added to blkdev_ioctl() in v4.20 but not blkdev_compat_ioctl,
so add them now.Cc: # v4.20+
Fixes: 72cd87576d1d ("block: Introduce BLKGETZONESZ ioctl")
Fixes: 65e4e3eee83d ("block: Introduce BLKGETNRZONES ioctl")
Reviewed-by: Damien Le Moal
Signed-off-by: Arnd Bergmann
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman -
commit 673bdf8ce0a387ef585c13b69a2676096c6edfe9 upstream.
These were added to blkdev_ioctl() but not blkdev_compat_ioctl,
so add them now.Cc: # v4.10+
Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Reviewed-by: Damien Le Moal
Signed-off-by: Arnd Bergmann
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman -
commit b2c0fcd28772f99236d261509bcd242135677965 upstream.
These were added to blkdev_ioctl() in linux-5.5 but not
blkdev_compat_ioctl, so add them now.Cc: # v4.4+
Fixes: bbd3e064362e ("block: add an API for Persistent Reservations")
Signed-off-by: Arnd Bergmann
Signed-off-by: Greg Kroah-HartmanFold in followup patch from Arnd with missing pr.h header include.
Signed-off-by: Jens Axboe
-
[ Upstream commit 85a8ce62c2eabe28b9d76ca4eecf37922402df93 ]
Some filesystem, such as vfat, may send bio which crosses device boundary,
and the worse thing is that the IO request starting within device boundaries
can contain more than one segment past EOD.Commit dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
tries to fix this issue by returning -EIO for this situation. However,
this way lets fs user code lose chance to handle -EIO, then sync_inodes_sb()
may hang for ever.Also the current truncating on last segment is dangerous by updating the
last bvec, given bvec table becomes not immutable any more, and fs bio
users may not retrieve the truncated pages via bio_for_each_segment_all() in
its .end_io callback.Fixes this issue by supporting multi-segment truncating. And the
approach is simpler:- just update bio size since block layer can make correct bvec with
the updated bio size. Then bvec table becomes really immutable.- zero all truncated segments for read bio
Cc: Carlos Maiolino
Cc: linux-fsdevel@vger.kernel.org
Fixed-by: dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
31 Dec, 2019
1 commit
-
commit d7bd15a138aef3be227818aad9c501e43c89c8c5 upstream.
When over-budget IOs are force-issued through root cgroup,
iocg_kick_delay() adjusts the async delay accordingly but doesn't
actually schedule async throttle for the issuing task. This bug is
pretty well masked because sooner or later the offending threads are
gonna get directly throttled on regular IOs or have async delay
scheduled by mem_cgroup_throttle_swaprate().However, it can affect control quality on filesystem metadata heavy
operations. Let's fix it by invoking blkcg_schedule_throttle() when
iocg_kick_delay() says async delay is needed.Signed-off-by: Tejun Heo
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org
Reported-by: Josef Bacik
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
21 Dec, 2019
1 commit
-
commit cc90bc68422318eb8e75b15cd74bc8d538a7df29 upstream.
This partially reverts commit e3a5d8e386c3fb973fa75f2403622a8f3640ec06.
Commit e3a5d8e386c3 ("check bi_size overflow before merge") adds a bio_full
check to __bio_try_merge_page. This will cause __bio_try_merge_page to fail
when the last bi_io_vec has been reached. Instead, what we want here is only
the bi_size overflow check.Fixes: e3a5d8e386c3 ("block: check bi_size overflow before merge")
Cc: stable@vger.kernel.org # v5.4+
Reviewed-by: Ming Lei
Signed-off-by: Andreas Gruenbacher
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
18 Dec, 2019
2 commits
-
commit d2c9be89f8ebe7ebcc97676ac40f8dec1cf9b43a upstream.
8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
avoids sysfs buffer overflow, and reserves one character for line break.
However, the last snprintf() doesn't get correct 'size' parameter passed
in, so fixed it.Fixes: 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores")
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
Cc: Nobuhiro Iwamatsu
Signed-off-by: Greg Kroah-Hartman -
commit 8962842ca5abdcf98e22ab3b2b45a103f0408b95 upstream.
It is reported that sysfs buffer overflow can be triggered if the system
has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of
hctx via /sys/block/$DEV/mq/$N/cpu_list.Use snprintf to avoid the potential buffer overflow.
This version doesn't change the attribute format, and simply stops
showing CPU numbers if the buffer is going to overflow.Cc: stable@vger.kernel.org
Fixes: 676141e48af7("blk-mq: don't dump CPU -> hw queue map on driver load")
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman
25 Nov, 2019
1 commit
-
errata:
When a read command returns less data than specified in the PRDs (for
example, there are two PRDs for this command, but the device returns a
number of bytes which is less than in the first PRD), the second PRD of
this command is not read out of the PRD FIFO, causing the next command
to use this PRD erroneously.workaround
- forces sg_tablesize = 1
- modified the sg_io function in block/scsi_ioctl.c to use a 64k buffer
allocated with dma_alloc_coherent during the probe in ahci_imx
- In order to fix the scsi/sata hang, when CD_ROM and HDD are
accessed simultaneously after the workaround is applied.
Do not go to sleep in scsi_eh_handler, when there is host failed.Signed-off-by: Richard Zhu
15 Nov, 2019
1 commit
-
There is a bug that checking the same active_list over and over again
in iocg_activate(). The intention of the code was checking whether all
the ancestors and self have already been activated. So fix it.Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo
Signed-off-by: Jiufei Xue
Signed-off-by: Jens Axboe
14 Nov, 2019
1 commit
-
Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if
they deserve I/O plugging"), to prevent the service guarantees of a
bfq_queue from being violated, the bfq_queue may be left busy, i.e.,
scheduled for service, even if empty (see comments in
__bfq_bfqq_expire() for details). But, if no process will send
requests to the bfq_queue any longer, then there is no point in
keeping the bfq_queue scheduled for service.In addition, keeping the bfq_queue scheduled for service, but with no
process reference any longer, may cause the bfq_queue to be freed when
descheduled from service. But this is assumed to never happen, and
causes a UAF if it happens. This, in turn, caused crashes [1, 2].This commit fixes this issue by descheduling an empty bfq_queue when
it remains with not process reference.[1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539
[2] https://bugzilla.kernel.org/show_bug.cgi?id=205447Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
Reported-by: Chris Evich
Reported-by: Patrick Dung
Reported-by: Thorsten Schubert
Tested-by: Thorsten Schubert
Tested-by: Oleksandr Natalenko
Signed-off-by: Paolo Valente
Signed-off-by: Jens Axboe
12 Nov, 2019
1 commit
-
__bio_try_merge_page() may merge a page to bio without bio_full() check
and cause bi_size overflow.The overflow typically ends up with sd_init_command() warning on zero
segment request with call trace like this:------------[ cut here ]------------
WARNING: CPU: 2 PID: 1986 at drivers/scsi/scsi_lib.c:1025 scsi_init_io+0x156/0x180
CPU: 2 PID: 1986 Comm: kworker/2:1H Kdump: loaded Not tainted 5.4.0-rc7 #1
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:scsi_init_io+0x156/0x180
RSP: 0018:ffffa11487663bf0 EFLAGS: 00010246
RAX: 00000000002be0a0 RBX: ffff8e6e9ff30118 RCX: 0000000000000000
RDX: 00000000ffffffe1 RSI: 0000000000000000 RDI: ffff8e6e9ff30118
RBP: ffffa11487663c18 R08: ffffa11487663d28 R09: ffff8e6e9ff30150
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8e6e9ff30000
R13: 0000000000000001 R14: ffff8e74a1cf1800 R15: ffff8e6e9ff30000
FS: 0000000000000000(0000) GS:ffff8e6ea7680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fff18cf0fe8 CR3: 0000000659f0a001 CR4: 00000000001606e0
Call Trace:
sd_init_command+0x326/0xb40 [sd_mod]
scsi_queue_rq+0x502/0xaa0
? blk_mq_get_driver_tag+0xe7/0x120
blk_mq_dispatch_rq_list+0x256/0x5a0
? elv_rb_del+0x24/0x30
? deadline_remove_request+0x7b/0xc0
blk_mq_do_dispatch_sched+0xa3/0x140
blk_mq_sched_dispatch_requests+0xfb/0x170
__blk_mq_run_hw_queue+0x81/0x130
blk_mq_run_work_fn+0x1b/0x20
process_one_work+0x179/0x390
worker_thread+0x4f/0x3e0
kthread+0x105/0x140
? max_active_store+0x80/0x80
? kthread_bind+0x20/0x20
ret_from_fork+0x35/0x40
---[ end trace f9036abf5af4a4d3 ]---
blk_update_request: I/O error, dev sdd, sector 2875552 op 0x1:(WRITE) flags 0x0 phys_seg 0 prio class 0
XFS (sdd1): writeback error on sector 2875552__bio_try_merge_page() should check the overflow before actually doing
merge.Fixes: 07173c3ec276c ("block: enable multipage bvecs")
Reviewed-by: Christoph Hellwig
Reviewed-by: Ming Lei
Reviewed-by: Hannes Reinecke
Signed-off-by: Jun'ichi Nomura
Signed-off-by: Jens Axboe
07 Nov, 2019
1 commit
-
blkcg_print_stat() iterates blkgs under RCU and doesn't test whether
the blkg is online. This can call into pd_stat_fn() on a pd which is
still being initialized leading to an oops.The heaviest operation - recursively summing up rwstat counters - is
already done while holding the queue_lock. Expand queue_lock to cover
the other operations and skip the blkg if it isn't online yet. The
online state is protected by both blkcg and queue locks, so this
guarantees that only online blkgs are processed.Signed-off-by: Tejun Heo
Reported-by: Roman Gushchin
Cc: Josef Bacik
Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Jens Axboe
01 Nov, 2019
1 commit
-
This code causes a static analysis warning:
block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq'
We disable IRQs in blkg_conf_prep() and re-enable them in
blkg_conf_finish(). IRQ disable/enable should not be nested because
that means the IRQs will be enabled at the first unlock instead of the
second one.Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo
Signed-off-by: Dan Carpenter
Signed-off-by: Jens Axboe
16 Oct, 2019
2 commits
-
rq_qos_del() incorrectly assigns the node being deleted to the head if
it was the first on the list in the !prev path. Fix it by iterating
with ** instead.Signed-off-by: Tejun Heo
Cc: Josef Bacik
Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Jens Axboe -
blkcg_activate_policy() has the following bugs.
* cf09a8ee19ad ("blkcg: pass @q and @blkcg into
blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
blkcg_activate_policy() ends up using pd's allocated for the root
blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
can be passed in pd's which are allocated for the root blkcg.For blk-iocost, this means that ->pd_init_fn() can write beyond the
end of the allocated object as it determines the length of the flex
array at the end based on the blkcg's nesting level.* Each pd is initialized as they get allocated. If alloc fails, the
policy will get freed with pd's initialized on it.* After the above partial failure, the partial pds are not freed.
This patch fixes all the above issues by
* Restructuring blkcg_activate_policy() so that alloc and init passes
are separate. Init takes place only after all allocs succeeded and
on failure all allocated pds are freed.* Unifying and fixing the cleanup of the remaining pd_prealloc.
Signed-off-by: Tejun Heo
Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
Signed-off-by: Jens Axboe
15 Oct, 2019
1 commit
-
A BIO based request queue does not have a tag_set, which prevent testing
for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not
require an elevator. This leads to an incorrect initialization of a
default elevator in some cases such as BIO based null_blk
(queue_mode == BIO) with zoned mode enabled as the default elevator in
this case is mq-deadline instead of "none".Fix this by testing for a NULL queue mq_ops field which indicates that
the queue is BIO based and should not have an elevator.Reported-by: Shinichiro Kawasaki
Reviewed-by: Bob Liu
Signed-off-by: Damien Le Moal
Signed-off-by: Jens Axboe
06 Oct, 2019
1 commit
-
scale_up wakes up waiters after scaling up. But after scaling max, it
should not wake up more waiters as waiters will not have anything to
do. This patch fixes this by making scale_up (and also scale_down)
return when threshold is reached.This bug causes increased fdatasync latency when fdatasync and dd
conv=sync are performed in parallel on 4.19 compared to 4.14. This
bug was introduced during refactoring of blk-wbt code.Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Cc: stable@vger.kernel.org
Cc: Josef Bacik
Signed-off-by: Harshad Shirwadkar
Signed-off-by: Jens Axboe
04 Oct, 2019
2 commits
-
sparse warns about incorrect type when using __be64 data.
It is not being converted to CPU-endian but it should be.Fixes these sparse warnings:
../block/sed-opal.c:375:20: warning: incorrect type in assignment (different base types)
../block/sed-opal.c:375:20: expected unsigned long long [usertype] align
../block/sed-opal.c:375:20: got restricted __be64 const [usertype] alignment_granularity
../block/sed-opal.c:376:25: warning: incorrect type in assignment (different base types)
../block/sed-opal.c:376:25: expected unsigned long long [usertype] lowest_lba
../block/sed-opal.c:376:25: got restricted __be64 const [usertype] lowest_aligned_lbaFixes: 455a7b238cd6 ("block: Add Sed-opal library")
Cc: Scott Bauer
Cc: Rafael Antognolli
Cc: linux-block@vger.kernel.org
Reviewed-by: Jon Derrick
Signed-off-by: Randy Dunlap
Signed-off-by: Jens Axboe -
Fix sparse warning: (missing '=')
../block/sed-opal.c:133:17: warning: obsolete array initializer, use C99 syntaxFixes: ff91064ea37c ("block: sed-opal: check size of shadow mbr")
Cc: linux-block@vger.kernel.org
Cc: Jonas Rabenstein
Cc: David Kozub
Reviewed-by: Scott Bauer
Reviewed-by: Revanth Rajashekar
Signed-off-by: Randy Dunlap
Signed-off-by: Jens Axboe
28 Sep, 2019
2 commits
-
Some HDD drive may expose multiple hardware queues, such as MegraRaid.
Let's apply the normal plugging for such devices because sequential IO
may benefit a lot from plug merging.Cc: Bart Van Assche
Cc: Hannes Reinecke
Cc: Dave Chinner
Reviewed-by: Damien Le Moal
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe -
If a device is using multiple queues, the IO scheduler may be bypassed.
This may hurt performance for some slow MQ devices, and it also breaks
zoned devices which depend on mq-deadline for respecting the write order
in one zone.Don't bypass io scheduler if we have one setup.
This patch can double sequential write performance basically on MQ
scsi_debug when mq-deadline is applied.Cc: Bart Van Assche
Cc: Hannes Reinecke
Cc: Dave Chinner
Reviewed-by: Javier González
Reviewed-by: Damien Le Moal
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
27 Sep, 2019
2 commits
-
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40The bug is caused by the race between timeout handle and completion for
flush request.When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().Cc: Christoph Hellwig
Cc: Keith Busch
Cc: Bart Van Assche
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei
Reviewed-by: Bob Liu
Signed-off-by: Yufen Yu-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe -
We have updated limits after calling wbt_set_min_lat(). No need to
update again.Reviewed-by: Bob Liu
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe
26 Sep, 2019
5 commits
-
The default hard disk param sets latency targets at 50ms. As the
default target percentiles are zero, these don't directly regulate
vrate; however, they're still used to calculate the period length -
100ms in this case.This is excessively low. A SATA drive with QD32 saturated with random
IOs can easily reach avg completion latency of several hundred msecs.
A period duration which is substantially lower than avg completion
latency can lead to wildly fluctuating vrate.Let's bump up the default latency targets to 250ms so that the period
duration is sufficiently long.Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe -
Some IOs may span multiple periods. As latencies are collected on
completion, the inbetween periods won't register them and may
incorrectly decide to increase vrate. nr_lagging tracks these IOs to
avoid those situations. Currently, whenever there are IOs which are
spanning from the previous period, busy_level is reset to 0 if
negative thus suppressing vrate increase.This has the following two problems.
* When latency target percentiles aren't set, vrate adjustment should
only be governed by queue depth depletion; however, the current code
keeps nr_lagging active which pulls in latency results and can keep
down vrate unexpectedly.* When lagging condition is detected, it resets the entire negative
busy_level. This turned out to be way too aggressive on some
devices which sometimes experience extended latencies on a small
subset of commands. In addition, a lagging IO will be accounted as
latency target miss on completion anyway and resetting busy_level
amplifies its impact unnecessarily.This patch fixes the above two problems by disabling nr_lagging
counting when latency target percentiles aren't set and blocking vrate
increases when there are lagging IOs while leaving busy_level as-is.Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe -
vrate_adj tracepoint traces vrate changes; however, it does so only
when busy_level is non-zero. busy_level turning to zero can sometimes
be as interesting an event. This patch also enables vrate_adj
tracepoint on other vrate related events - busy_level changes and
non-zero nr_lagging.Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe -
cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to
release & acquire sysfs_lock before registering/un-registering elevator
queue during switching elevator for avoiding potential deadlock from
showing & storing 'queue/iosched' attributes and removing elevator's
kobject.Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
required in .show & .store of queue/iosched's attributes, and just
elevator's sysfs lock is acquired in elv_iosched_store() and
elv_iosched_show(). So it is safe to hold queue's sysfs lock when
registering/un-registering elevator queue.The biggest issue is that commit cecf5d87ff20 assumes that concurrent
write on 'queue/scheduler' can't happen. However, this assumption isn't
true, because kernfs_fop_write() only guarantees that concurrent write
aren't called on the same open file, but the write could be from
different open on the file. So we can't release & re-acquire queue's
sysfs lock during switching elevator, otherwise use-after-free on
elevator could be triggered.Fixes the issue by not releasing queue's sysfs lock during switching
elevator.Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Greg KH
Cc: Mike Snitzer
Reviewed-by: Bart Van Assche
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe -
Commit c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq")
removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with
lockdep_assert_held() called in blk_mq_sched_free_requests() which is
run in failure path of elevator_init_mq().blk_mq_sched_free_requests() is called in the following 3 functions:
elevator_init_mq()
elevator_exit()
blk_cleanup_queue()In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
by 'mutex_lock(&q->sysfs_lock)'.So moving the lockdep_assert_held() from blk_mq_sched_free_requests()
into elevator_exit() for fixing the report by syzbot.Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
Fixed: c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq")
Reviewed-by: Bart Van Assche
Reviewed-by: Damien Le Moal
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
25 Sep, 2019
1 commit
-
Pull more block updates from Jens Axboe:
"Some later additions that weren't quite done for the first pull
request, and also a few fixes that have arrived since.This contains:
- Kill silly pktcdvd warning on attempting to register a non-scsi
passthrough device (me)- Use symbolic constants for the block t10 protection types, and
switch to handling it in core rather than in the drivers (Max)- libahci platform missing node put fix (Nishka)
- Small series of fixes for BFQ (Paolo)
- Fix possible nbd crash (Xiubo)"
* tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block:
block: drop device references in bsg_queue_rq()
block: t10-pi: fix -Wswitch warning
pktcdvd: remove warning on attempting to register non-passthrough dev
ata: libahci_platform: Add of_node_put() before loop exit
nbd: fix possible page fault for nbd disk
nbd: rename the runtime flags as NBD_RT_ prefixed
block, bfq: push up injection only after setting service time
block, bfq: increase update frequency of inject limit
block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1
block, bfq: update inject limit only after injection occurred
block: centralize PI remapping logic to the block layer
block: use symbolic constants for t10_pi type
24 Sep, 2019
1 commit
-
Make sure that bsg_queue_rq() calls put_device() if an error is
encountered after get_device() was successful.Fixes: cd2f076f1d7a ("bsg: convert to use blk-mq")
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe