12 Feb, 2019

5 commits

  • bio_check_eod() should check the partition size, not the whole disk,
    if bio->bi_partno is non-zero. Do this by moving the call
    to bio_check_eod() into blk_partition_remap().

    Based on an earlier patch from Jiufei Xue.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reported-by: Jiufei Xue
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 52c5e62d4c4beecddc6e1b8045ce1d695fca1ba7)

    Christoph Hellwig
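
    As a rough illustration of the check being moved (a standalone sketch
    with stand-in types, not the kernel code): an I/O runs past the end of
    the device when its start sector plus length exceeds the capacity, and
    after this change the capacity compared against is the partition's,
    not the whole disk's.

        /* Minimal end-of-device check, in plain C. */
        #include <stdbool.h>
        #include <stdint.h>

        struct io_span {
            uint64_t sector;      /* first 512-byte sector of the I/O */
            uint32_t nr_sectors;  /* length of the I/O in sectors */
        };

        static bool spans_eod(const struct io_span *io, uint64_t maxsector)
        {
            if (io->nr_sectors == 0)
                return false;                 /* empty bios never span EOD */
            if (io->sector >= maxsector)
                return true;                  /* starts at or past the end */
            return maxsector - io->sector < io->nr_sectors;
        }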
     
  • Regular block device writes go through blkdev_write_iter(), which
    checks bdev_read_only(), while zeroout/discard/etc requests are never
    checked, whether triggered from userspace or from the kernel. Add a
    generic catch-all check to generic_make_request_checks() to actually
    enforce ioctl(BLKROSET) and set_disk_ro(), which is used by quite a
    few drivers for things like snapshots, read-only backing files/images,
    etc.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit 721c7fc701c71f693307d274d2b346a1ecd4a534)

    Ilya Dryomov
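
    Sketched in standalone C under assumed op codes (the real patch works
    on bio op flags), the catch-all boils down to rejecting any
    media-modifying operation once the disk or partition is marked
    read-only:

        /* Illustrative only: classify ops and reject writes to RO devices. */
        enum req_op { OP_READ, OP_WRITE, OP_DISCARD, OP_WRITE_ZEROES };

        static int bio_rejected_as_ro(enum req_op op, int dev_read_only)
        {
            int modifies_media = (op != OP_READ);

            /* the catch-all in generic_make_request_checks() fails such
             * bios with an error instead of handing them to the driver */
            return modifies_media && dev_read_only;
        }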
     
  • So that we can also poll non-blk-mq queues. This is mostly needed for
    the NVMe multipath code, but could also be useful elsewhere.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ea435e1b9392a33deceaea2a16ebaa3397bead93)

    Christoph Hellwig
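
    The shape of the change, sketched with hypothetical names: polling
    goes through a callback on the queue, so a non-blk-mq queue (such as
    the NVMe multipath device) can plug in its own implementation:

        /* Sketch of a per-queue poll indirection; names are illustrative. */
        struct queue_sketch {
            int (*poll_fn)(struct queue_sketch *q, unsigned int cookie);
        };

        static int queue_poll(struct queue_sketch *q, unsigned int cookie)
        {
            if (!q->poll_fn)
                return 0;                    /* polling unsupported */
            return q->poll_fn(q, cookie);    /* blk-mq or multipath hook */
        }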
     
  • This helper allows stealing the uncompleted bios from a request so
    that they can be reissued on another path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef71de8b15d891b27b8c983a9a8972b11cb4576a)

    Christoph Hellwig
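
    Conceptually the helper detaches the request's chain of uncompleted
    bios and splices it onto a caller-owned list; a simplified standalone
    sketch (types are stand-ins, not the kernel's):

        /* Simplified sketch of stealing a request's bio chain. */
        struct bio_sk { struct bio_sk *next; };
        struct req_sk { struct bio_sk *bio; };

        static void steal_bios(struct bio_sk **list, struct req_sk *rq)
        {
            struct bio_sk *tail = rq->bio;

            if (!tail)
                return;
            while (tail->next)
                tail = tail->next;
            tail->next = *list;   /* splice chain ahead of the list */
            *list = rq->bio;
            rq->bio = NULL;       /* the request no longer owns them */
        }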
     
  • This helper allows reinserting a bio into a new queue without much
    overhead, but requires all queue limits to be the same for the upper
    and lower queues, and it does not provide any recursion prevention.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Javier González
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit f421e1d9ade4e1b88183e54425cf50e390d16a7f)

    Christoph Hellwig
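
    A sketch of why this is low-overhead, with hypothetical helper names:
    the bio goes straight to the lower queue's make_request function, with
    none of the splitting or on-stack recursion handling of
    generic_make_request(), which is exactly why the two queues' limits
    must match:

        /* Sketch only: hand a bio directly to a lower queue. */
        struct bio_s;
        struct queue_s {
            void (*make_request_fn)(struct queue_s *q, struct bio_s *bio);
        };

        static int  enter_queue(struct queue_s *q) { (void)q; return 0; }
        static void exit_queue(struct queue_s *q)  { (void)q; }

        static int direct_make_request_sk(struct queue_s *q, struct bio_s *bio)
        {
            if (enter_queue(q))              /* pin the queue while submitting */
                return -1;
            q->make_request_fn(q, bio);      /* no checks, no recursion guard */
            exit_queue(q);
            return 0;
        }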
     

21 Nov, 2018

1 commit

  • commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

    c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
    already fixed this race, however the implied synchronize_rcu()
    in blk_mq_quiesce_queue() can slow down LUN probing a lot, which
    caused a performance regression.

    Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    tried to quiesce the queue, to avoid the unnecessary synchronize_rcu(),
    only when queue initialization is done, because it is common to see
    lots of nonexistent LUNs which need to be probed.

    However, it turns out it isn't safe to quiesce the queue only when
    queue initialization is done. When one SCSI command is completed,
    the submitter of the command may be woken up immediately, and the
    scsi device may then be removed while the run queue in scsi_end_request()
    is still in progress, so a kernel panic can be caused.

    In the Red Hat QE lab, there are several reports of this kind of kernel
    panic triggered during kernel boot.

    This patch addresses the issue by grabbing one queue usage counter
    reference while freeing a request and doing the following run queue.

    Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    Cc: Andrew Jones
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E.J. Bottomley
    Cc: stable
    Cc: jianchao.wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
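
    The pattern described above, as a kernel-style fragment (reconstructed
    from the description, not quoted from the diff): the completion path
    pins the queue's usage counter across freeing the request and the
    subsequent run queue:

        /* Fragment illustrating the described fix (not the literal patch). */
        percpu_ref_get(&q->q_usage_counter);  /* pin queue against teardown */
        blk_mq_free_request(rq);              /* may drop the last request */
        blk_mq_run_hw_queues(q, true);        /* run queue is now safe */
        percpu_ref_put(&q->q_usage_counter);  /* allow cleanup to proceed */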
     

26 Sep, 2018

1 commit

  • [ Upstream commit 1311326cf4755c7ffefd20f576144ecf46d9906b ]

    SCSI probing may synchronously create and destroy a lot of request_queues
    for nonexistent devices. Any synchronize_rcu() in the queue creation or
    destruction path may introduce long latency during booting; see the
    detailed description in the comment of blk_register_queue().

    This patch removes one synchronize_rcu() inside blk_cleanup_queue()
    for this case. Commit c2856ae2f315d75 ("blk-mq: quiesce queue before
    freeing queue") needs synchronize_rcu() for implementing
    blk_mq_quiesce_queue(), but when the queue isn't initialized, that
    isn't necessary since only pass-through requests are involved and
    there is no such issue in scsi_execute() at all.

    Without this patch and the previous one, it may take more than 20
    seconds for virtio-scsi to complete disk probing. With the two patches,
    the time becomes less than 100ms.

    Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
    Reported-by: Andrew Jones
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
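
    A plausible sketch of the resulting conditional (a hedged
    reconstruction; blk_queue_init_done() is the kernel's predicate for a
    fully initialized queue): skip the quiesce, and with it the
    synchronize_rcu(), for queues that never finished initialization:

        /* Sketch: only quiesce fully initialized blk-mq queues. */
        if (q->mq_ops && blk_queue_init_done(q))
                blk_mq_quiesce_queue(q);   /* implies synchronize_rcu() */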
     

10 Sep, 2018

2 commits

  • commit b233f127042dba991229e3882c6217c80492f6ef upstream.

    Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
    disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
    that approach can't take effect since user space can still switch
    it on via 'echo auto > /sys/block/sdN/device/power/control'.

    This patch really disables runtime PM for blk-mq via pm_runtime_disable()
    and fixes all kinds of PM-related kernel crashes.

    Cc: Tomas Janousek
    Cc: Przemek Socha
    Cc: Alan Stern
    Cc:
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Tested-by: Patrick Steinhardt
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
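
    Sketched from the description (close to, but not quoted from, the
    patch): block runtime PM at initialization time for blk-mq queues, so
    the sysfs knob can no longer re-enable it:

        /* Sketch: hard-disable runtime PM for blk-mq queues. */
        void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
        {
                if (q->mq_ops) {
                        pm_runtime_disable(dev);  /* not just unused: disabled */
                        return;
                }
                /* ... legacy-queue runtime PM setup continues as before ... */
        }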
     
  • commit 54648cf1ec2d7f4b6a71767799c45676a138ca24 upstream.

    We found a use-after-free issue in __blk_drain_queue()
    on kernel 4.14. After reading the latest kernel 4.18-rc6 we
    think it has the same problem.

    Memory is allocated for q->fq in blk_init_allocated_queue().
    If the elevator init function returns an error, the failure
    path frees q->fq.

    __blk_drain_queue() then uses that same memory after the free
    of q->fq, which leads to unpredictable behavior.

    This patch sets q->fq to NULL in the failure path of
    blk_init_allocated_queue().

    Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
    Cc:
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: xiao jin
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    xiao jin
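
    The fix's shape as a fragment (reconstructed from the description):
    clear the pointer when the failure path frees the flush queue, so the
    later drain cannot dereference freed memory:

        /* Fragment: failure path of blk_init_allocated_queue() after the fix. */
        blk_free_flush_queue(q->fq);
        q->fq = NULL;    /* __blk_drain_queue() must not see a stale pointer */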
     

22 Jul, 2018

1 commit

  • commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 upstream.

    When blk_queue_enter() waits for a queue to unfreeze, or unset the
    PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

    The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
    ("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
    device is resumed asynchronously, i.e. after un-freezing userspace tasks.

    So that commit exposed the bug as a regression in v4.15. A mysterious
    SIGBUS (or -EIO) sometimes happened during the time the device was being
    resumed. Most frequently, there was no kernel log message, and we saw Xorg
    or Xwayland killed by SIGBUS.[1]

    [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

    Without this fix, I get an IO error in this test:

    # dd if=/dev/sda of=/dev/null iflag=direct & \
    while killall -SIGUSR1 dd; do sleep 0.1; done & \
    echo mem > /sys/power/state ; \
    sleep 5; killall dd # stop after 5 seconds

    The interruptible wait was added to blk_queue_enter in
    commit 3ef28e83ab15 ("block: generic request_queue reference counting").
    Before then, the interruptible wait was only in blk-mq, but I don't think
    it could ever have been correct.

    Reviewed-by: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Alan Jenkins
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Alan Jenkins
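
    The change amounts to swapping the interruptible wait in
    blk_queue_enter() for an uninterruptible one, roughly as follows (a
    reconstruction, with the wait condition elided as `ready`):

        /* Before: a signal aborted the wait and the I/O failed with -EIO. */
        ret = wait_event_interruptible(q->mq_freeze_wq, ready);
        if (ret)
                return ret;

        /* After: wait out the freeze/PREEMPT_ONLY window regardless of
         * pending signals. */
        wait_event(q->mq_freeze_wq, ready);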
     

03 Jul, 2018

1 commit

  • commit 297ba57dcdec7ea37e702bcf1a577ac32a034e21 upstream.

    This patch ensures that removing a path controlled by the dm-mpath
    driver while mkfs is running no longer triggers the following kernel
    bug:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:

    dm_softirq_done+0x326/0x3d0 [dm_mod]
    blk_done_softirq+0x19b/0x1e0
    __do_softirq+0x128/0x60d
    irq_exit+0x100/0x110
    smp_call_function_single_interrupt+0x90/0x330
    call_function_single_interrupt+0xf/0x20

    Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc:
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

26 Apr, 2018

1 commit

  • [ Upstream commit 445251d0f4d329aa061f323546cd6388a3bb7ab5 ]

    I ran into an issue on my laptop that triggered a bug on the
    discard path:

    WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
    Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
    CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ #176
    Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
    RIP: 0010:nvme_setup_cmd+0x3d3/0x430
    RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
    RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
    RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
    RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
    R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
    R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
    FS: 0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    nvme_queue_rq+0x40/0xa00
    ? __sbitmap_queue_get+0x24/0x90
    ? blk_mq_get_tag+0xa3/0x250
    ? wait_woken+0x80/0x80
    ? blk_mq_get_driver_tag+0x97/0xf0
    blk_mq_dispatch_rq_list+0x7b/0x4a0
    ? deadline_remove_request+0x49/0xb0
    blk_mq_do_dispatch_sched+0x4f/0xc0
    blk_mq_sched_dispatch_requests+0x106/0x170
    __blk_mq_run_hw_queue+0x53/0xa0
    __blk_mq_delay_run_hw_queue+0x83/0xa0
    blk_mq_run_hw_queue+0x6c/0xd0
    blk_mq_sched_insert_request+0x96/0x140
    __blk_mq_try_issue_directly+0x3d/0x190
    blk_mq_try_issue_directly+0x30/0x70
    blk_mq_make_request+0x1a4/0x6a0
    generic_make_request+0xfd/0x2f0
    ? submit_bio+0x5c/0x110
    submit_bio+0x5c/0x110
    ? __blkdev_issue_discard+0x152/0x200
    submit_bio_wait+0x43/0x60
    ext4_process_freed_data+0x1cd/0x440
    ? account_page_dirtied+0xe2/0x1a0
    ext4_journal_commit_callback+0x4a/0xc0
    jbd2_journal_commit_transaction+0x17e2/0x19e0
    ? kjournald2+0xb0/0x250
    kjournald2+0xb0/0x250
    ? wait_woken+0x80/0x80
    ? commit_timeout+0x10/0x10
    kthread+0x111/0x130
    ? kthread_create_worker_on_cpu+0x50/0x50
    ? do_group_exit+0x3a/0xa0
    ret_from_fork+0x1f/0x30
    Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
    ---[ end trace 50d361cc444506c8 ]---
    print_req_error: I/O error, dev nvme0n1, sector 847167488

    Decoding the assembly, the request claims to have 0xffff segments,
    while nvme counts two. This turns out to be because we don't check
    for a data carrying request on the mq scheduler path, and since
    blk_phys_contig_segment() returns true for a non-data request,
    we decrement the initial segment count of 0 and end up with
    0xffff in the unsigned short.

    There are a few issues here:

    1) We should initialize the segment count for a discard to 1.
    2) The discard merging is currently using the data limits for
    segments and sectors.

    Fix this up by having attempt_merge() correctly identify the
    request, and by initializing the segment count correctly
    for discards.

    This can only be triggered with mq-deadline on discard capable
    devices right now, which isn't a common configuration.

    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
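
    The 0xffff artifact is ordinary unsigned wraparound, demonstrable in a
    few lines of standalone C: decrementing an unsigned short that holds 0
    yields 0xffff, which is exactly what the decoded request showed.

        /* Standalone demo of the wraparound described above. */
        #include <stdio.h>

        int main(void)
        {
            unsigned short nr_segments = 0;  /* discard started at 0 pre-fix */

            nr_segments--;                   /* bogus merge-time decrement */
            printf("segments = 0x%x\n", nr_segments);  /* prints 0xffff */
            return 0;
        }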
     

09 Mar, 2018

1 commit

  • commit 7c5a0dcf557c6511a61e092ba887de28882fe857 upstream.

    The vm counters are counted in sectors, so we should do the conversion
    in submit_bio().

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Cc: stable@vger.kernel.org
    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jiufei Xue
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jiufei Xue
     

03 Mar, 2018

1 commit

  • [ Upstream commit 454be724f6f99cc7e7bbf15067128be9868186c6 ]

    Now we track legacy requests with .q_usage_counter in commit 055f6e18e08f
    ("block: Make q_usage_counter also track legacy requests"), but that
    commit never runs and drains the legacy queue before waiting for this
    counter to become zero, so an IO hang is caused in the test of pulling
    a disk during IO.

    This patch fixes the issue by draining requests before waiting for
    q_usage_counter to become zero. Both Mauricio and chenxiang reported
    this issue, and observed that it can be fixed by this patch.

    Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
    Fixes: 055f6e18e08f ("block: Make q_usage_counter also track legacy requests")
    Cc: Wen Xiong
    Tested-by: "chenxiang (M)"
    Tested-by: Mauricio Faria de Oliveira
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
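
    One plausible shape of the fix, heavily hedged since the message above
    states only the ordering: drain outstanding legacy requests first,
    then wait for the usage counter to reach zero:

        /* Sketch of the described ordering in queue cleanup. */
        __blk_drain_queue(q, true);   /* flush queued legacy requests */
        blk_freeze_queue(q);          /* waits for q_usage_counter == 0 */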
     

17 Feb, 2018

1 commit

  • commit c2856ae2f315d754a0b6a268e4c6745b332b42e7 upstream.

    After a queue is frozen, dispatch may still happen, for example:

    1) requests are submitted from several contexts
    2) requests from all these contexts are inserted into the queue, but
    may be dispatched to the LLD in just one of these paths, while other
    paths still need to move on even after all these requests are
    completed (that means blk_mq_freeze_queue_wait() returns at that time)
    3) dispatch after queue freezing still moves on and causes
    use-after-free, because the request queue has been freed

    This patch quiesces the queue after it is frozen, and makes sure all
    in-progress dispatch is completed.

    This patch fixes the following kernel crash when running heavy IOs vs.
    deleting device:

    [ 36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    [ 36.720318] IP: kyber_has_work+0x14/0x40
    [ 36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0
    [ 36.721584] Oops: 0000 [#1] PREEMPT SMP
    [ 36.722105] Dumping ftrace buffer:
    [ 36.722570] (ftrace buffer empty)
    [ 36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
    [ 36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714
    [ 36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
    [ 36.736688] RIP: 0010:kyber_has_work+0x14/0x40
    [ 36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202
    [ 36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0
    [ 36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8
    [ 36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480
    [ 36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8
    [ 36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000
    [ 36.744189] FS: 00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
    [ 36.746617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0
    [ 36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 36.752796] Call Trace:
    [ 36.753992] blk_mq_do_dispatch_sched+0x7f/0xe0
    [ 36.755110] blk_mq_sched_dispatch_requests+0x119/0x190
    [ 36.756179] __blk_mq_run_hw_queue+0x83/0x90
    [ 36.757144] __blk_mq_delay_run_hw_queue+0xaf/0x110
    [ 36.758046] blk_mq_run_hw_queue+0x24/0x70
    [ 36.758845] blk_mq_flush_plug_list+0x1e7/0x270
    [ 36.759676] blk_flush_plug_list+0xd6/0x240
    [ 36.760463] blk_finish_plug+0x27/0x40
    [ 36.761195] do_io_submit+0x19b/0x780
    [ 36.761921] ? entry_SYSCALL_64_fastpath+0x1a/0x7d
    [ 36.762788] entry_SYSCALL_64_fastpath+0x1a/0x7d
    [ 36.763639] RIP: 0033:0x7f9b9699f697
    [ 36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
    [ 36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697
    [ 36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000
    [ 36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420
    [ 36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70
    [ 36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000
    [ 36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10
    [ 36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0
    [ 36.776012] CR2: 0000000000000008
    [ 36.776690] ---[ end trace 4045cbce364ff2a4 ]---
    [ 36.777527] Kernel panic - not syncing: Fatal exception
    [ 36.778526] Dumping ftrace buffer:
    [ 36.779313] (ftrace buffer empty)
    [ 36.780081] Kernel Offset: disabled
    [ 36.780877] ---[ end Kernel panic - not syncing: Fatal exception

    Reviewed-by: Christoph Hellwig
    Tested-by: Yi Zhang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
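
    The ordering the patch establishes, as a fragment (reconstructed from
    the description): freeze first, then quiesce, so that dispatch paths
    still running after the freeze completes are also waited out before
    the queue is freed:

        /* Fragment: teardown ordering after the fix. */
        blk_freeze_queue(q);            /* all requests completed */
        if (q->mq_ops)
                blk_mq_quiesce_queue(q);  /* in-flight dispatch finished too */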
     

17 Dec, 2017

1 commit

  • [ Upstream commit aba7afc5671c23beade64d10caf86e24a9105dab ]

    Prevent removal of a request queue from sporadically triggering the
    following warning:

    list_del corruption. next->prev should be ffff8807d649b970, but was 6b6b6b6b6b6b6b6b
    WARNING: CPU: 3 PID: 342 at lib/list_debug.c:56 __list_del_entry_valid+0x92/0xa0
    Call Trace:
    process_one_work+0x11b/0x660
    worker_thread+0x3d/0x3b0
    kthread+0x129/0x140
    ret_from_fork+0x27/0x40

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

14 Dec, 2017

1 commit

  • [ Upstream commit 34d9715ac1edd50285168dd8d80c972739a4f6a4 ]

    Once blk_set_queue_dying() is done in blk_cleanup_queue(), we call
    blk_freeze_queue() and wait for q->q_usage_counter to become zero. But
    if there are tasks blocked in get_request(), q->q_usage_counter can
    never become zero. So we have to wake up all these tasks in
    blk_set_queue_dying() first.

    Fixes: 3ef28e83ab157997 ("block: generic request_queue reference counting")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
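
    Sketched from the description (a reconstruction, not the quoted hunk):
    walk the legacy request lists and wake every sleeper, in both the sync
    and async directions:

        /* Sketch: wake tasks blocked in get_request() on a dying queue. */
        if (q->request_fn) {                 /* legacy request path only */
                struct request_list *rl;

                blk_queue_for_each_rl(rl, q) {
                        if (rl->rq_pool) {
                                wake_up_all(&rl->wait[BLK_RW_SYNC]);
                                wake_up_all(&rl->wait[BLK_RW_ASYNC]);
                        }
                }
        }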
     

30 Nov, 2017

1 commit

  • commit 4e9b6f20828ac880dbc1fa2fdbafae779473d1af upstream.

    Make sure that if the timeout timer fires after a queue has been
    marked "dying", the affected requests are finished.

    Reported-by: chenxiang (M)
    Fixes: commit 287922eb0b18 ("block: defer timeouts to a workqueue")
    Signed-off-by: Bart Van Assche
    Tested-by: chenxiang (M)
    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

25 Sep, 2017

1 commit

  • The lockdep code had reported the following unsafe locking scenario:

    CPU0                                CPU1
    ----                                ----
    lock(s_active#228);
                                        lock(&bdev->bd_mutex/1);
                                        lock(s_active#228);
    lock(&bdev->bd_mutex);

    *** DEADLOCK ***

    The deadlock may happen when one task (CPU1) is trying to delete a
    partition in a block device and another task (CPU0) is accessing
    tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
    partition.

    The s_active isn't an actual lock. It is a reference count (kn->count)
    on the sysfs (kernfs) file. Removal of a sysfs file, however, requires
    waiting until all the references are gone. The reference count is
    treated like a rwsem by the lockdep instrumentation code.

    The fact that a thread is in a sysfs callback method or in an
    ioctl call means there is a reference to the opened sysfs or device
    file. That should prevent the underlying block structure from being
    removed.

    Instead of using bd_mutex in the block_device structure, a new
    blk_trace_mutex is now added to the request_queue structure to protect
    access to the blk_trace structure.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Waiman Long
    Acked-by: Steven Rostedt (VMware)

    Fix typo in patch subject line, and prune a comment detailing how
    the code used to work.

    Signed-off-by: Jens Axboe

    Waiman Long
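
    The resulting locking pattern in the blktrace ioctl path, sketched
    (argument names follow the kernel's blk_trace_setup(); the call site
    is reconstructed, not quoted):

        /* Sketch: blk_trace access now serialized on a queue-local mutex. */
        mutex_lock(&q->blk_trace_mutex);          /* was bdev->bd_mutex */
        ret = blk_trace_setup(q, name, dev, bdev, arg);
        mutex_unlock(&q->blk_trace_mutex);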
     

12 Sep, 2017

1 commit

  • A NULL pointer crash was reported for the case of having the BFQ IO
    scheduler attached to the underlying blk-mq paths of a DM multipath
    device. The crash occurred in blk_mq_sched_insert_request()'s call to
    e->type->ops.mq.insert_requests().

    Paolo Valente correctly summarized why the crash occurred with:
    "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
    blk_rq_prep_clone) creates a cloned request without invoking
    e->type->ops.mq.prepare_request for the target elevator e. The cloned
    request is therefore not initialized for the scheduler, but it is
    however inserted into the scheduler by blk_mq_sched_insert_request."

    All said, a request-based DM multipath device's IO scheduler should be
    the only one used -- when the original requests are issued to the
    underlying paths as cloned requests they are inserted directly in the
    underlying dispatch queue(s) rather than through an additional elevator.

    But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
    schedulers") switched blk_insert_cloned_request() from using
    blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
    incorrectly added elevator machinery into a call chain that isn't
    supposed to have any.

    To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
    that blk_insert_cloned_request() calls to insert the request without
    involving any elevator that may be attached to the cloned request's
    request_queue.

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reported-by: Bart Van Assche
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
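
    A reconstruction of what such a bypass insert looks like (hedged;
    based on the description above, not quoted from the patch): the
    request goes straight onto the hardware context's dispatch list,
    never touching elevator callbacks:

        /* Sketch: insert a (cloned) request, bypassing any elevator. */
        void blk_mq_request_bypass_insert(struct request *rq)
        {
                struct blk_mq_hw_ctx *hctx =
                        blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);

                spin_lock(&hctx->lock);
                list_add_tail(&rq->queuelist, &hctx->dispatch);
                spin_unlock(&hctx->lock);

                blk_mq_run_hw_queue(hctx, false);
        }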
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue, and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
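
    The new addressing model in bio terms (field names as introduced by
    this commit; the snippet is illustrative, not a quoted hunk):

        /* After this change a bio addresses its device like so: */
        bio->bi_disk   = disk;    /* the gendisk, one per block device */
        bio->bi_partno = partno;  /* 0 = whole disk, else partition index */
        /* generic_make_request() remaps bi_sector by the partition start
         * when bi_partno is non-zero. */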
     

18 Aug, 2017

1 commit

  • Calling blk_start_queue() from interrupt context with the queue
    lock held and without disabling IRQs, as the skd driver does, is
    safe. This patch prevents loading the skd driver from triggering
    the following warning:

    WARNING: CPU: 11 PID: 1348 at block/blk-core.c:283 blk_start_queue+0x84/0xa0
    RIP: 0010:blk_start_queue+0x84/0xa0
    Call Trace:
    skd_unquiesce_dev+0x12a/0x1d0 [skd]
    skd_complete_internal+0x1e7/0x5a0 [skd]
    skd_complete_other+0xc2/0xd0 [skd]
    skd_isr_completion_posted.isra.30+0x2a5/0x470 [skd]
    skd_isr+0x14f/0x180 [skd]
    irq_forced_thread_fn+0x2a/0x70
    irq_thread+0x144/0x1a0
    kthread+0x125/0x140
    ret_from_fork+0x2a/0x40

    Fixes: commit a038e2536472 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning")
    Signed-off-by: Bart Van Assche
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
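
    A hedged reconstruction of the relaxed check in blk_start_queue():
    interrupt context is accepted, and only a process-context caller with
    IRQs enabled keeps warning:

        /* Before: any caller with IRQs enabled warned. */
        WARN_ON(!irqs_disabled());

        /* After (reconstructed): interrupt context is fine too. */
        WARN_ON_ONCE(!in_interrupt() && !irqs_disabled());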
     

24 Jul, 2017

1 commit

  • The blk-mq code lacks support for looking at the rpm_status field,
    tracking active requests, and the RQF_PM flag.

    Due to the default switch to blk-mq for SCSI, people have started to run
    into suspend/resume issues because of this, so make sure we disable the
    runtime PM functionality until it is properly implemented.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jul, 2017

1 commit

  • Currently all integrity prep hooks are open-coded, and if prepare fails
    we ignore its error code and fail the bio with EIO. Let's return the
    real error to the upper layer, so later the caller may react
    accordingly.

    In fact, no one wants to use bio_integrity_prep() without
    bio_integrity_enabled(), so it is reasonable to fold them into one
    function.

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    [hch: merged with the latest block tree,
    return bool from bio_integrity_prep]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
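
    The resulting calling convention for submitters after this change (a
    sketch of a call site, not a quoted hunk):

        /* bio_integrity_prep() now returns bool and has already completed
         * the bio with the real error on failure. */
        if (!bio_integrity_prep(bio))
                return;    /* do not submit; error already propagated */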
     

21 Jun, 2017

4 commits

  • Some functions in block/blk-core.c must only be used on blk-sq queues
    while others are safe to use against any queue type. Document which
    functions are intended for blk-sq queues and issue a warning if the
    blk-sq API is misused. This not only helps block driver authors
    but will also make it easier to remove the blk-sq code once that code
    is declared obsolete.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of documenting the locking assumptions of most block layer
    functions as a comment, use lockdep_assert_held() to verify locking
    assumptions at runtime.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
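
    The pattern in question, shown on one example signature (the specific
    function chosen here is an assumption; the patch applied this across
    many of blk-core.c's legacy helpers):

        /* Runtime-verified locking documentation: */
        void blk_requeue_request(struct request_queue *q, struct request *rq)
        {
                lockdep_assert_held(q->queue_lock);  /* caller must hold it */
                /* ... existing body unchanged ... */
        }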
     
  • Several block drivers need to initialize the driver-private request
    data after having called blk_get_request() and before .prep_rq_fn()
    is called, e.g. when submitting a REQ_OP_SCSI_* request. Avoid that
    this initialization code has to be repeated after every
    blk_get_request() call by adding new callback functions to struct
    request_queue and to struct blk_mq_ops.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of declaring the second argument of blk_*_get_request()
    as int and passing it to functions that expect an unsigned int,
    declare that second argument as unsigned int. Also, for consistency,
    rename that second argument from 'rw' to 'op'.
    This patch does not change any functionality.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Jun, 2017

1 commit

  • A new bio operation flag REQ_NOWAIT is introduced to identify bios
    originating from an iocb with IOCB_NOWAIT. This flag indicates that
    we should return immediately if a request cannot be made, instead
    of retrying.

    Stacked devices such as md (the ones with make_request_fn hooks)
    are currently not supported because they may block for housekeeping.
    For example, an md device can have a part of the device suspended.
    For this reason, only request-based devices are supported.
    In the future, this feature will be expanded to stacked devices
    by teaching them how to handle the REQ_NOWAIT flag.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
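
    How the flag is derived, sketched from the description (this mirrors
    what a direct-I/O submitter does; the snippet is illustrative, not a
    quoted hunk):

        /* Propagate the iocb's nowait intent onto the bio. */
        if (iocb->ki_flags & IOCB_NOWAIT)
                bio->bi_opf |= REQ_NOWAIT;   /* fail fast, don't sleep */
        /* a submitter that cannot get a request then sees -EAGAIN */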
     

19 Jun, 2017

2 commits

  • A rescuing bioset is only useful if there might be bios from
    that same bioset on the bio_list_on_stack queue at a time
    when bio_alloc_bioset() is called. This never applies to
    q->bio_split.

    Allocations from q->bio_split are only ever made from
    blk_queue_split() which is only ever called early in each of
    various make_request_fn()s. The original bio (call this A)
    is then passed to generic_make_request() and is placed on
    the bio_list_on_stack queue, and the bio that was allocated
    from q->bio_split (B) is processed.

    The processing of this may cause other bios to be passed to
    generic_make_request(), or may even cause the bio B itself to
    be passed on, possibly after some prefix has been split off
    (using some other bioset).

    generic_make_request() now guarantees that all of these bios
    (B and its dependants) will be fully processed before the tail
    of the original bio A gets handled. None of these early bios
    can possibly trigger an allocation from the original
    q->bio_split as they are either too small to require
    splitting or (more likely) are destined for a different queue.

    The next time that the original q->bio_split might be used
    by this thread is when A is processed again, as it might
    still be too big to handle directly. By this time there
    cannot be any other bios allocated from q->bio_split in the
    generic_make_request() queue. So no rescuing will ever be
    needed.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • This patch converts bioset_create() to not create a workqueue by
    default, so allocations will never trigger punt_bios_to_rescuer(). It
    also introduces a new flag BIOSET_NEED_RESCUER which tells
    bioset_create() to preserve the old behavior.

    All callers of bioset_create() that are inside block device drivers,
    are given the BIOSET_NEED_RESCUER flag.

    biosets used by filesystems or other top-level users do not
    need rescuing as the bio can never be queued behind other
    bios. This includes fs_bio_set, blkdev_dio_pool,
    btrfs_bioset, xfs_ioend_bioset, and one allocated by
    target_core_iblock.c.

    biosets used by md/raid do not need rescuing as
    their usage was recently audited and revised to never
    risk deadlock.

    It is hoped that most, if not all, of the remaining biosets
    can end up being the non-rescued version.

    Reviewed-by: Christoph Hellwig
    Credit-to: Ming Lei (minor fixes)
    Reviewed-by: Ming Lei
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
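
    Usage after this change, as the commit describes it (BIO_POOL_SIZE
    stands in for whatever pool size a driver chooses): a block-driver
    bioset opts back into the rescuer explicitly, while top-level users
    simply omit the flag:

        /* Driver-side bioset keeping the old rescuer behavior: */
        struct bio_set *bs = bioset_create(BIO_POOL_SIZE, 0,
                        BIOSET_NEED_BVECS | BIOSET_NEED_RESCUER);

        /* Filesystem-style bioset that never needs rescuing: */
        struct bio_set *fs_bs = bioset_create(BIO_POOL_SIZE, 0,
                        BIOSET_NEED_BVECS);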