29 Oct, 2018

1 commit


13 Oct, 2018

1 commit

  • commit 587562d0c7cd6861f4f90a2eb811cccb1a376f5f upstream.

    trace_block_unplug() takes true for explicit unplugs and false for
    implicit unplugs. schedule() unplugs are implicit and should be
    reported as timer unplugs. While correct in the legacy code, this has
    been inverted in blk-mq since 4.11.
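
    A minimal sketch of the fix, assuming the blk-mq plug flush path mirrors
    the legacy blk_flush_plug_list() variable names:

        /*
         * In blk_mq_flush_plug_list(): from_schedule == true means an
         * implicit (schedule()) unplug, which trace_block_unplug() expects
         * as 'false', so the flag has to be inverted just like in the
         * legacy code.
         */
        trace_block_unplug(this_q, depth, !from_schedule);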

    Cc: stable@vger.kernel.org
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

26 Jun, 2018

1 commit

  • commit a347c7ad8edf4c5685154f3fdc3c12fc1db800ba upstream.

    It is not allowed to reinit the q->tag_set_list list entry while an RCU
    grace period has not yet completed, otherwise the following soft lockup
    in blk_mq_sched_restart() happens:

    [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
    [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
    [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
    [ 1064.256510] Call Trace:
    [ 1064.256664]
    [ 1064.256824] blk_mq_free_request+0xea/0x100
    [ 1064.256987] msg_io_conf+0x59/0xd0 [ibnbd_client]
    [ 1064.257175] complete_rdma_req+0xf2/0x230 [ibtrs_client]
    [ 1064.257340] ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
    [ 1064.257502] ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
    [ 1064.257669] ib_create_qp+0x321/0x380 [ib_core]
    [ 1064.257841] ib_process_cq_direct+0xbd/0x120 [ib_core]
    [ 1064.258007] irq_poll_softirq+0xb7/0xe0
    [ 1064.258165] __do_softirq+0x106/0x2a2
    [ 1064.258328] irq_exit+0x92/0xa0
    [ 1064.258509] do_IRQ+0x4a/0xd0
    [ 1064.258660] common_interrupt+0x7a/0x7a
    [ 1064.258818]

    Meanwhile another context frees another queue, but one with the same set
    of shared tags:

    [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
    [ 1288.201833] bash D 0 5910 5820 0x00000000
    [ 1288.202016] Call Trace:
    [ 1288.202315] schedule+0x32/0x80
    [ 1288.202462] schedule_timeout+0x1e5/0x380
    [ 1288.203838] wait_for_completion+0xb0/0x120
    [ 1288.204137] __wait_rcu_gp+0x125/0x160
    [ 1288.204287] synchronize_sched+0x6e/0x80
    [ 1288.204770] blk_mq_free_queue+0x74/0xe0
    [ 1288.204922] blk_cleanup_queue+0xc7/0x110
    [ 1288.205073] ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
    [ 1288.205389] ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
    [ 1288.205548] kernfs_fop_write+0x109/0x180
    [ 1288.206328] vfs_write+0xb3/0x1a0
    [ 1288.206476] SyS_write+0x52/0xc0
    [ 1288.206624] do_syscall_64+0x68/0x1d0
    [ 1288.206774] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    What happened is the following:

    1. There are several MQ queues with shared tags.
    2. One queue is about to be freed and now task is in
    blk_mq_del_queue_tag_set().
    3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
    tag list in order to find hctx to restart.

    Because the linked list entry was modified in blk_mq_del_queue_tag_set()
    without properly waiting for a grace period, blk_mq_sched_restart()
    never ends, spinning in list_for_each_entry_rcu_rr(), thus the soft lockup.

    The fix is simple: reinit the list entry after an RCU grace period has
    elapsed.
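
    A minimal sketch of that fix, assuming the deletion path looks like the
    simplified blk_mq_del_queue_tag_set() below (the BLK_MQ_F_TAG_SHARED
    bookkeeping is omitted):

        static void blk_mq_del_queue_tag_set(struct request_queue *q)
        {
                struct blk_mq_tag_set *set = q->tag_set;

                mutex_lock(&set->tag_list_lock);
                list_del_rcu(&q->tag_set_list);
                mutex_unlock(&set->tag_list_lock);

                /*
                 * Readers such as blk_mq_sched_restart() may still be
                 * walking the old links under RCU; only reinit the entry
                 * once a grace period has elapsed.
                 */
                synchronize_rcu();
                INIT_LIST_HEAD(&q->tag_set_list);
        }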

    Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
    Cc: stable@vger.kernel.org
    Cc: Sagi Grimberg
    Cc: linux-block@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Roman Pen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Roman Pen
     

21 Jun, 2018

1 commit

  • [ Upstream commit bf0ddaba65ddbb2715af97041da8e7a45b2d8628 ]

    When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.
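
    A sketch of such a helper, assuming the existing
    blk_mq_queue_tag_busy_iter() callback style; the names below are
    illustrative rather than the exact ones used upstream:

        struct mq_inflight {
                struct hd_struct *part;
                unsigned int *inflight;
        };

        /* Count in-flight requests for one partition, split by READ/WRITE. */
        static void blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
                                             struct request *rq, void *priv,
                                             bool reserved)
        {
                struct mq_inflight *mi = priv;

                if (rq->part == mi->part)
                        mi->inflight[rq_data_dir(rq)]++;
        }

        void blk_mq_in_flight_rw(struct request_queue *q,
                                 struct hd_struct *part,
                                 unsigned int inflight[2])
        {
                struct mq_inflight mi = { .part = part, .inflight = inflight, };

                inflight[0] = inflight[1] = 0;
                blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
        }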

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     

26 Apr, 2018

1 commit

  • [ Upstream commit 7df938fbc4ee641e70e05002ac67c24b19e86e74 ]

    We know this WARN_ON is harmless and in reality it may be triggered,
    so convert it to printk() and dump_stack() to avoid confusing
    people.

    Also add a comment about the two related races here.
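
    A sketch of the resulting check, assuming the run-queue path verifies the
    current CPU against hctx->cpumask (the message text is illustrative):

        /*
         * In __blk_mq_run_hw_queue(): running on a CPU outside
         * hctx->cpumask can legitimately happen around CPU hotplug,
         * so warn loudly but don't treat it as a kernel bug.
         */
        if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
            cpu_online(hctx->next_cpu)) {
                printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
                       raw_smp_processor_id(),
                       cpumask_empty(hctx->cpumask) ? "inactive" : "active");
                dump_stack();
        }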

    Cc: Christian Borntraeger
    Cc: Stefan Haberland
    Cc: Christoph Hellwig
    Cc: Thomas Gleixner
    Cc: "jianchao.wang"
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

12 Apr, 2018

3 commits

  • [ Upstream commit 8ab0b7dc73e1b3e2987d42554b2bff503f692772 ]

    HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
    so we need to check for that before calling blk_mq_tag_idle(); otherwise
    the following kernel oops can be triggered. Fix it by checking whether
    the hw queue is unmapped, since it doesn't make sense to idle the tags
    any more after hw queues are unmapped.

    [ 440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
    [ 440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
    [ 440.788359] RIP: 0010:[] [] __blk_mq_tag_idle+0x24/0x40
    [ 440.798697] RSP: 0018:ffff893bf9bcbd10 EFLAGS: 00010286
    [ 440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
    [ 440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
    [ 440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
    [ 440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
    [ 440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
    [ 440.849988] FS: 0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
    [ 440.859955] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
    [ 440.876169] Call Trace:
    [ 440.879818] [] blk_mq_exit_hctx+0xd8/0xe0
    [ 440.887051] [] blk_mq_free_queue+0xf0/0x160
    [ 440.894465] [] blk_cleanup_queue+0xd9/0x150
    [ 440.901881] [] nvme_ns_remove+0x5b/0xb0 [nvme_core]
    [ 440.910068] [] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
    [ 440.919026] [] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
    [ 440.928079] [] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
    [ 440.937126] [] process_one_work+0x17a/0x440
    [ 440.944517] [] worker_thread+0x278/0x3c0
    [ 440.951607] [] ? manage_workers.isra.24+0x2a0/0x2a0
    [ 440.959760] [] kthread+0xcf/0xe0
    [ 440.966055] [] ? insert_kthread_work+0x40/0x40
    [ 440.973715] [] ret_from_fork+0x58/0x90
    [ 440.980586] [] ? insert_kthread_work+0x40/0x40
    [ 440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
    [ 441.011620] RIP [] __blk_mq_tag_idle+0x24/0x40
    [ 441.019301] RSP
    [ 441.024052] CR2: 0000000000000008
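
    A minimal sketch of the check, assuming it sits in the hctx teardown path
    (blk_mq_exit_hctx()):

        /* Only idle the tags for hw queues that are actually mapped. */
        if (blk_mq_hw_queue_mapped(hctx))
                blk_mq_tag_idle(hctx);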

    Reported-by: Zhang Yi
    Tested-by: Zhang Yi
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit fb350e0ad99359768e1e80b4784692031ec340e4 ]

    In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
    can be allocated and q->nr_hw_queues is used, so a race is inevitable; for
    example, blk_mq_init_sched() may trigger a use-after-free on an hctx that
    is freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.

    This patch fixes the race by holding q->sysfs_lock.
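
    A sketch of the intended locking, assuming the nr_hw_queues update walks
    every queue sharing the tag set (simplified; error handling and the
    per-queue reinit are omitted):

        list_for_each_entry(q, &set->tag_list, tag_set_list) {
                /*
                 * Serialize against elevator_switch_mq(), which already
                 * runs under q->sysfs_lock via the sysfs store path.
                 */
                mutex_lock(&q->sysfs_lock);
                blk_mq_realloc_hw_ctxs(set, q);
                mutex_unlock(&q->sysfs_lock);
        }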

    Reviewed-by: Christoph Hellwig
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 7d4901a90d02500c8011472a060f9b2e60e6e605 ]

    blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
    previous map isn't cleared yet and may point to a stale hw queue index.

    This patch fixes the issue by clearing the mapping table before setting
    it up in blk_mq_pci_map_queues().

    The following oops reported by Zhang Yi is fixed by this patch:

    [ 101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
    [ 101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
    [ 101.216346] PGD 0 P4D 0
    [ 101.219171] Oops: 0000 [#1] SMP
    [ 101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
    [ 101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
    [ 101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
    [ 101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
    [ 101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
    [ 101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
    [ 101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
    [ 101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
    [ 101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
    [ 101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
    [ 101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
    [ 101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
    [ 101.364314] FS: 0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
    [ 101.373342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
    [ 101.387714] Call Trace:
    [ 101.390445] blk_mq_update_nr_hw_queues+0xbf/0x130
    [ 101.395791] nvme_reset_work+0x6f4/0xc06 [nvme]
    [ 101.400848] ? pick_next_task_fair+0x290/0x5f0
    [ 101.405807] ? __switch_to+0x1f5/0x430
    [ 101.409988] ? put_prev_entity+0x2f/0xd0
    [ 101.414365] process_one_work+0x141/0x340
    [ 101.418836] worker_thread+0x47/0x3e0
    [ 101.422921] kthread+0xf5/0x130
    [ 101.426424] ? rescuer_thread+0x380/0x380
    [ 101.430896] ? kthread_associate_blkcg+0x90/0x90
    [ 101.436048] ret_from_fork+0x1f/0x30
    [ 101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
    [ 101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
    [ 101.468205] CR2: 0000000000000098
    [ 101.471907] ---[ end trace 5fe710f98228a3ca ]---
    [ 101.482489] Kernel panic - not syncing: Fatal exception
    [ 101.488505] Kernel Offset: disabled
    [ 101.497752] ---[ end Kernel panic - not syncing: Fatal exception
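
    A sketch of the assumed shape of the fix: reset every CPU's entry before
    the per-hw-queue assignment loop runs, so CPUs that end up unmapped don't
    keep a stale index:

        /*
         * In blk_mq_pci_map_queues(), before walking the IRQ affinity
         * masks: clear any mapping left over from a previous setup.
         */
        for_each_possible_cpu(cpu)
                set->mq_map[cpu] = 0;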

    Reviewed-by: Christoph Hellwig
    Suggested-by: Christoph Hellwig
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

09 Mar, 2018

1 commit

  • commit 105976f517791aed3b11f8f53b308a2069d42055 upstream.

    __blk_mq_requeue_request() covers two cases:

    - one is that the requeued request is added to hctx->dispatch, such as
    blk_mq_dispatch_rq_list()

    - another case is that the request is requeued to io scheduler, such as
    blk_mq_requeue_request().

    We should call the I/O scheduler's .requeue_request callback only for the
    second case.
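
    A sketch of the resulting split, assuming blk_mq_requeue_request() keeps
    its current shape:

        void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
        {
                __blk_mq_requeue_request(rq);

                /*
                 * Only here, where the request really goes back to the I/O
                 * scheduler (not to hctx->dispatch), call the elevator's
                 * .requeue_request hook.
                 */
                blk_mq_sched_requeue_request(rq);

                BUG_ON(blk_queued_rq(rq));
                blk_mq_add_to_requeue_list(rq, true, kick_requeue_list);
        }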

    Cc: Paolo Valente
    Cc: Omar Sandoval
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reviewed-by: Bart Van Assche
    Acked-by: Paolo Valente
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

03 Mar, 2018

1 commit

  • [ Upstream commit 454be724f6f99cc7e7bbf15067128be9868186c6 ]

    Now we track legacy requests with .q_usage_counter in commit 055f6e18e08f
    ("block: Make q_usage_counter also track legacy requests"), but that
    commit never runs and drains the legacy queue before waiting for this
    counter to become zero, so an IO hang is caused in the test of pulling a
    disk during IO.

    This patch fixes the issue by draining requests before waiting for
    q_usage_counter to become zero. Both Mauricio and chenxiang reported this
    issue and observed that it is fixed by this patch.
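
    A sketch of the assumed shape of the fix in the freeze path (the helper
    name blk_drain_queue() and its placement are assumptions, not a verbatim
    copy of the patch):

        void blk_freeze_queue(struct request_queue *q)
        {
                blk_freeze_queue_start(q);

                /*
                 * Legacy queues are not run by the freeze machinery, so
                 * actively drain queued requests before sleeping on the
                 * q_usage_counter, otherwise it may never reach zero.
                 */
                if (!q->mq_ops)
                        blk_drain_queue(q);

                blk_mq_freeze_queue_wait(q);
        }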

    Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
    Fixes: 055f6e18e08f ("block: Make q_usage_counter also track legacy requests")
    Cc: Wen Xiong
    Tested-by: "chenxiang (M)"
    Tested-by: Mauricio Faria de Oliveira
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

12 Sep, 2017

1 commit

  • A NULL pointer crash was reported for the case of having the BFQ IO
    scheduler attached to the underlying blk-mq paths of a DM multipath
    device. The crash occurred in blk_mq_sched_insert_request()'s call to
    e->type->ops.mq.insert_requests().

    Paolo Valente correctly summarized why the crash occurred with:
    "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
    blk_rq_prep_clone) creates a cloned request without invoking
    e->type->ops.mq.prepare_request for the target elevator e. The cloned
    request is therefore not initialized for the scheduler, but it is
    however inserted into the scheduler by blk_mq_sched_insert_request."

    All said, a request-based DM multipath device's IO scheduler should be
    the only one used -- when the original requests are issued to the
    underlying paths as cloned requests they are inserted directly in the
    underlying dispatch queue(s) rather than through an additional elevator.

    But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
    schedulers") switched blk_insert_cloned_request() from using
    blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
    incorrectly added elevator machinery into a call chain that isn't
    supposed to have any.

    To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
    that blk_insert_cloned_request() calls to insert the request without
    involving any elevator that may be attached to the cloned request's
    request_queue.
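
    A sketch of such a bypass helper, assuming it simply parks the cloned
    request on the hardware queue's dispatch list and kicks the queue:

        /*
         * Insert a request directly into hctx->dispatch, bypassing any
         * elevator attached to the cloned request's request_queue.
         */
        void blk_mq_request_bypass_insert(struct request *rq)
        {
                struct blk_mq_ctx *ctx = rq->mq_ctx;
                struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);

                spin_lock(&hctx->lock);
                list_add_tail(&rq->queuelist, &hctx->dispatch);
                spin_unlock(&hctx->lock);

                blk_mq_run_hw_queue(hctx, false);
        }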

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reported-by: Bart Van Assche
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

18 Aug, 2017

1 commit

  • Since patch "blk-mq: switch .queue_rq return value to blk_status_t"
    .queue_rq() returns a BLK_STS_* value instead of a BLK_MQ_RQ_*
    value. Hence refer to the former in comments about .queue_rq()
    return values.

    Fixes: commit 39a70c76b89b ("blk-mq: clarify dispatch may not be drained/blocked by stopping queue")
    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Ming Lei
    Cc: Christoph Hellwig
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

15 Aug, 2017

1 commit

  • blk_mq_get_request() does not release the caller's queue usage counter
    when allocation fails. The caller still needs to account for its own
    queue usage when it is unable to allocate a request.
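
    A sketch of the fixed failure path, assuming blk_mq_get_request() itself
    entered the queue via blk_queue_enter_live() (simplified):

        tag = blk_mq_get_tag(data);
        if (tag == BLK_MQ_TAG_FAIL) {
                /*
                 * Balance the queue usage reference taken on entry, so the
                 * caller no longer has to drop it on allocation failure.
                 */
                blk_queue_exit(q);
                return NULL;
        }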

    Fixes: 1ad43c0078b7 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")

    Reported-by: Max Gurtovoy
    Reviewed-by: Ming Lei
    Reviewed-by: Sagi Grimberg
    Tested-by: Max Gurtovoy
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

10 Aug, 2017

3 commits

  • The blk_mq_delay_kick_requeue_list() function is used by the device
    mapper and only by the device mapper to rerun the queue and requeue
    list after a delay. This function is called once per request that
    gets requeued. Modify this function such that the queue is run once
    per path change event instead of once per request that is requeued.
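
    A sketch of the assumed change: switch to a modifying delayed-work helper
    so that a burst of calls keeps pushing one timer back and the requeue work
    runs once after the burst, instead of being scheduled per request. The
    kblockd_mod_delayed_work_on() wrapper is assumed to be the new kblockd
    front end for mod_delayed_work_on():

        void blk_mq_delay_kick_requeue_list(struct request_queue *q,
                                            unsigned long msecs)
        {
                /*
                 * mod_delayed_work() re-arms the timer on every call, so a
                 * stream of requeued requests results in a single run of
                 * the requeue work once the stream has ended.
                 */
                kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work,
                                            msecs_to_jiffies(msecs));
        }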

    Fixes: commit 2849450ad39d ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
    Signed-off-by: Bart Van Assche
    Cc: Mike Snitzer
    Cc: Laurence Oberman
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Modify blk_mq_in_flight() to count both a partition and root at
    the same time. Then we only have to call it once, instead of
    potentially looping the tags twice.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't have to inc/dec some counter, since we can just
    iterate the tags. That makes inc/dec a noop, but means we
    have to iterate busy tags to get an in-flight count.
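
    A sketch of what the tag-iteration based counting from these two commits
    looks like, assuming the blk_mq_queue_tag_busy_iter() callback style
    (field and helper names are illustrative):

        struct mq_inflight {
                struct hd_struct *part;
                unsigned int *inflight;
        };

        static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
                                          struct request *rq, void *priv,
                                          bool reserved)
        {
                struct mq_inflight *mi = priv;

                /*
                 * inflight[0]: the partition asked for; inflight[1]: the
                 * whole device (the "root"), counted in the same pass.
                 */
                if (rq->part == mi->part)
                        mi->inflight[0]++;
                if (mi->part->partno)
                        mi->inflight[1]++;
        }

        void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
                              unsigned int inflight[2])
        {
                struct mq_inflight mi = { .part = part, .inflight = inflight, };

                inflight[0] = inflight[1] = 0;
                blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
        }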

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Aug, 2017

1 commit


01 Aug, 2017

1 commit

  • We recently had a bug in the IPR SCSI driver, where it would end up
    making the SCSI mid layer run the mq hardware queue with interrupts
    disabled. This isn't legal, since the software queue locking relies
    on never being grabbed from interrupt context. Additionally, drivers
    that set BLK_MQ_F_BLOCKING may schedule from this context.

    Add a WARN_ON_ONCE() to catch bad users up front.
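
    A minimal sketch of the check, assuming it goes at the top of the
    hardware-queue run path:

        /*
         * In __blk_mq_run_hw_queue(): the software-queue locks are not
         * IRQ-safe and BLK_MQ_F_BLOCKING drivers may sleep, so being
         * called from interrupt context is a caller bug.
         */
        WARN_ON_ONCE(in_interrupt());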

    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Jul, 2017

1 commit


12 Jul, 2017

1 commit

  • Pull more block updates from Jens Axboe:
    "This is a followup for block changes, that didn't make the initial
    pull request. It's a bit of a mixed bag, this contains:

    - A followup pull request from Sagi for NVMe. Outside of fixups for
    NVMe, it also includes a series for ensuring that we properly
    quiesce hardware queues when browsing live tags.

    - Set of integrity fixes from Dmitry (mostly), fixing various issues
    for folks using DIF/DIX.

    - Fix for a bug introduced in cciss, with the req init changes. From
    Christoph.

    - Fix for a bug in BFQ, from Paolo.

    - Two followup fixes for lightnvm/pblk from Javier.

    - Depth fix from Ming for blk-mq-sched.

    - Also from Ming, performance fix for mtip32xx that was introduced
    with the dynamic initialization of commands"

    * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
    block: call bio_uninit in bio_endio
    nvmet: avoid unneeded assignment of submit_bio return value
    nvme-pci: add module parameter for io queue depth
    nvme-pci: compile warnings in nvme_alloc_host_mem()
    nvmet_fc: Accept variable pad lengths on Create Association LS
    nvme_fc/nvmet_fc: revise Create Association descriptor length
    lightnvm: pblk: remove unnecessary checks
    lightnvm: pblk: control I/O flow also on tear down
    cciss: initialize struct scsi_req
    null_blk: fix error flow for shared tags during module_init
    block: Fix __blkdev_issue_zeroout loop
    nvme-rdma: unconditionally recycle the request mr
    nvme: split nvme_uninit_ctrl into stop and uninit
    virtio_blk: quiesce/unquiesce live IO when entering PM states
    mtip32xx: quiesce request queues to make sure no submissions are inflight
    nbd: quiesce request queues to make sure no submissions are inflight
    nvme: kick requeue list when requeueing a request instead of when starting the queues
    nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
    ...

    Linus Torvalds
     

04 Jul, 2017

3 commits

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside of consolidating code this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and reenable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: This contains also the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devmem (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     
  • Currently all integrity prep hooks are open-coded, and if prepare fails
    we ignore its code and fail the bio with EIO. Let's return the real error
    to the upper layer, so the caller may react accordingly.

    In fact no one wants to use bio_integrity_prep() without
    bio_integrity_enabled, so it is reasonable to fold them into one function.
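
    A sketch of the resulting caller pattern in the blk-mq submit path,
    assuming bio_integrity_prep() now returns a bool and ends the bio with
    the real error itself:

        /*
         * In blk_mq_make_request(): no separate bio_integrity_enabled()
         * check is needed any more.
         */
        if (!bio_integrity_prep(bio))
                return BLK_QC_T_NONE;   /* bio already completed with the error */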

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    [hch: merged with the latest block tree,
    return bool from bio_integrity_prep]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     

29 Jun, 2017

1 commit

  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     

28 Jun, 2017

2 commits


24 Jun, 2017

1 commit


23 Jun, 2017

1 commit


22 Jun, 2017

3 commits

  • The hw ctx's queue_num has already been set prior to the call to
    blk_mq_init_hctx(), so there is no need to set it again.

    Signed-off-by: weiping
    Signed-off-by: Jens Axboe

    weiping
     
  • Since blk_mq_quiesce_queue_nowait() can be called from interrupt
    context, make this safe. Since this function is not in the hot
    path, uninline it.

    Fixes: commit f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
    Signed-off-by: Bart Van Assche
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Cc: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • If we have shared tags enabled, then every IO completion will trigger
    a full loop of every queue belonging to a tag set, and every hardware
    queue for each of those queues, even if nothing needs to be done.
    This causes a massive performance regression if you have a lot of
    shared devices.

    Instead of doing this huge full scan on every IO, add an atomic
    counter to the main queue that tracks how many hardware queues have
    been marked as needing a restart. With that, we can avoid looking for
    restartable queues, if we don't have to.

    Max reports that this restores performance. Before this patch, 4K
    IOPS was limited to 22-23K IOPS. With the patch, we are running at
    950-970K IOPS.
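
    A sketch of the early-out this enables, assuming the counter lives on the
    request_queue (the field name shown here is an assumption about the fix):

        /*
         * In blk_mq_sched_restart(), before walking every queue and hw
         * queue that shares this tag set: bail out early if no hctx has
         * been marked as needing a restart.
         */
        if ((hctx->flags & BLK_MQ_F_TAG_SHARED) &&
            !atomic_read(&hctx->queue->shared_hctx_restart))
                return;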

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy
    Tested-by: Max Gurtovoy
    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jun, 2017

5 commits

  • A queue must be frozen while the mapped state of a hardware queue
    is changed. Additionally, any change of the mapped state is
    followed by a call to blk_mq_map_swqueue() (see also
    blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues()).
    Since blk_mq_map_swqueue() does not map any unmapped hardware
    queue onto any software queue, no attempt will be made to run
    an unmapped hardware queue. Hence issue a warning upon attempts
    to run an unmapped hardware queue.
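
    A minimal sketch of the warning, assuming it sits in the common run path
    (__blk_mq_delay_run_hw_queue()):

        /*
         * An unmapped hw queue has no software queues behind it, so any
         * attempt to run it points at a caller bug.
         */
        if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
                return;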

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Document the locking assumptions in functions that modify
    blk_mq_ctx.rq_list to make it easier for humans to verify
    this code.
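
    A sketch of the documented pattern: every function that touches
    blk_mq_ctx.rq_list states its locking requirement with a lockdep
    assertion, for example:

        /* ctx->rq_list is only ever manipulated with ctx->lock held. */
        lockdep_assert_held(&ctx->lock);
        list_add_tail(&rq->queuelist, &ctx->rq_list);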

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init()
    is called after a value has been assigned to .rq_flags and .rq_flags
    is initialized in __blk_mq_finish_request(). Initialize .rq_flags in
    blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request().
    Moving the initialization of .rq_flags is fine because all changes
    and tests of .rq_flags occur between blk_get_request() and finishing
    a request.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of declaring the second argument of blk_*_get_request()
    as int and passing it to functions that expect an unsigned int,
    declare that second argument as unsigned int. Also, for
    consistency, rename that second argument from 'rw' to 'op'.
    This patch does not change any functionality.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since the srcu structure is rather large (184 bytes on an x86-64
    system with kernel debugging disabled), only allocate it if needed.

    Reported-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Jun, 2017

3 commits

  • A new bio operation flag REQ_NOWAIT is introduced to identify bios
    originating from an iocb with IOCB_NOWAIT. This flag indicates
    that the submitter should get an immediate return if a request cannot
    be made, instead of retrying.

    Stacked devices such as md (the ones with make_request_fn hooks)
    are currently not supported because they may block for housekeeping.
    For example, an md can have a part of the device suspended.
    For this reason, only request-based devices are supported.
    In the future, this feature will be expanded to stacked devices
    by teaching them how to handle the REQ_NOWAIT flag.
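
    A sketch of the checks this implies in the submission path (the helper
    names are what I'd expect from the surrounding code and should be read
    as assumptions):

        /*
         * In generic_make_request_checks(): a REQ_NOWAIT bio aimed at a
         * device that cannot honour it (not request based) must fail
         * immediately rather than sleep.
         */
        if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
                goto not_supported;

        /*
         * And in blk_queue_enter(): don't sleep on a frozen queue when the
         * submitter asked for non-blocking behaviour.
         */
        if (nowait)
                return -EBUSY;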

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
     
  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... was pretty unclear (to me) in what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar