17 Aug, 2022
1 commit
-
[ Upstream commit 14a6e2eb7df5c7897c15b109cba29ab0c4a791b6 ]
In our test of iocost, we encountered some list add/del corruptions of
inner_walk list in ioc_timer_fn.The reason can be described as follows:
cpu 0 cpu 1
ioc_qos_write ioc_qos_writeioc = q_to_ioc(queue);
if (!ioc) {
ioc = kzalloc();
ioc = q_to_ioc(queue);
if (!ioc) {
ioc = kzalloc();
...
rq_qos_add(q, rqos);
}
...
rq_qos_add(q, rqos);
...
}When the io.cost.qos file is written by two cpus concurrently, rq_qos may
be added to one disk twice. In that case, there will be two iocs enabled
and running on one disk. They own different iocgs on their active list. In
the ioc_timer_fn function, because of the iocgs from two iocs have the
same root iocg, the root iocg's walk_list may be overwritten by each other
and this leads to list add/del corruptions in building or destroying the
inner_walk list.And so far, the blk-rq-qos framework works in case that one instance for
one type rq_qos per queue by default. This patch make this explicit and
also fix the crash above.Signed-off-by: Jinke Han
Reviewed-by: Muchun Song
Acked-by: Tejun Heo
Cc:
Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
19 Nov, 2021
1 commit
-
[ Upstream commit 480d42dc001bbfe953825a92073012fcd5a99161 ]
The timer callback used to evaluate if the latency is exceeded can be
executed after the corresponding disk has been released, causing the
following NULL pointer dereference:[ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ 119.987617] #PF: supervisor read access in kernel mode
[ 119.987971] #PF: error_code(0x0000) - not-present page
[ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
[ 119.988697] Oops: 0000 [#1] SMP NOPTI
[ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi #rc5+arighi
[ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
[ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
[ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
[ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
[ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
[ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
[ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
[ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
[ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
[ 119.995906] Call Trace:
[ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
[ 119.996505] blk_stat_timer_fn+0x138/0x140
[ 119.996830] call_timer_fn+0x2b/0x100
[ 119.997136] __run_timers.part.0+0x1d1/0x240
[ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
[ 119.997826] ? ktime_get+0x3e/0xa0
[ 119.998110] ? native_apic_msr_write+0x2c/0x30
[ 119.998456] ? lapic_next_event+0x20/0x30
[ 119.998779] ? clockevents_program_event+0x94/0xf0
[ 119.999150] run_timer_softirq+0x2a/0x50
[ 119.999465] __do_softirq+0xcb/0x26f
[ 119.999764] irq_exit_rcu+0x8c/0xb0
[ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
[ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
[ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20In this case simply return from the timer callback (no action
required) to prevent the NULL pointer dereference.BugLink: https://bugs.launchpad.net/bugs/1947557
Link: https://lore.kernel.org/linux-mm/YWRNVTk9N8K0RMst@arighi-desktop/
Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Andrea Righi
Link: https://lore.kernel.org/r/YW6N2qXpBU3oc50q@arighi-desktop
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
24 Aug, 2021
1 commit
-
Replace the magic lookup through the kobject tree with an explicit
backpointer, given that the device model links are set up and torn
down at times when I/O is still possible, leading to potential
NULL or invalid pointer dereferences.Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
Reported-by: syzbot
Signed-off-by: Christoph Hellwig
Tested-by: Sven Schnelle
Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
Signed-off-by: Jens Axboe
10 Aug, 2021
1 commit
-
The backing device information only makes sense for file system I/O,
and thus belongs into the gendisk and not the lower level request_queue
structure. Move it there.Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
Signed-off-by: Jens Axboe
22 Jun, 2021
2 commits
-
After commit a79050434b45 ("blk-rq-qos: refactor out common elements of
blk-wbt"), if throttle was disabled by wbt_disable_default(), we could
not enable again, fix this by set enable_state back to
WBT_STATE_ON_DEFAULT.Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Signed-off-by: Zhang Yi
Link: https://lore.kernel.org/r/20210619093700.920393-3-yi.zhang@huawei.com
Signed-off-by: Jens Axboe -
Now that we disable wbt by simply zero out rwb->wb_normal in
wbt_disable_default() when switch elevator to bfq, but it's not safe
because it will become false positive if we change queue depth. If it
become false positive between wbt_wait() and wbt_track() when submit
write request, it will lead to drop rqw->inflight to -1 in wbt_done(),
which will end up trigger IO hung. Fix this issue by introduce a new
state which mean the wbt was disabled.Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Signed-off-by: Zhang Yi
Link: https://lore.kernel.org/r/20210619093700.920393-2-yi.zhang@huawei.com
Signed-off-by: Jens Axboe
19 Jun, 2021
1 commit
-
Now wbt_wait() returns void, so remove now outdated comment.
Signed-off-by: lijiazi
Link: https://lore.kernel.org/r/1623986240-13878-1-git-send-email-lijiazi@xiaomi.com
Signed-off-by: Jens Axboe
27 Jan, 2021
1 commit
-
The first parameter rwb is not used for this function.
So just remove it.Signed-off-by: Lei Chen
Signed-off-by: Jens Axboe
01 Dec, 2020
1 commit
-
It's unnecessary to call wbt_update_limits explicitly within wbt_init,
because it will be called in the following function wbt_queue_depth_changed.Signed-off-by: Lei Chen
Signed-off-by: Jens Axboe
24 Aug, 2020
1 commit
-
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva
30 May, 2020
2 commits
-
Now let's rename __wbt_update_limits to wbt_update_limits after the
previous one is deleted.Signed-off-by: Guoqing Jiang
Signed-off-by: Jens Axboe -
No one call this function after commit 2af2783f2ea4f ("rq-qos: get rid of
redundant wbt_update_limits()"), so remove it.Signed-off-by: Guoqing Jiang
Signed-off-by: Jens Axboe
17 Apr, 2020
1 commit
-
Use tracepoint_string() for string literals that are used in the
wbt_step tracepoint, so that userspace tools can display the string
content.Signed-off-by: Tommi Rantala
Signed-off-by: Jens Axboe
06 Oct, 2019
1 commit
-
scale_up wakes up waiters after scaling up. But after scaling max, it
should not wake up more waiters as waiters will not have anything to
do. This patch fixes this by making scale_up (and also scale_down)
return when threshold is reached.This bug causes increased fdatasync latency when fdatasync and dd
conv=sync are performed in parallel on 4.19 compared to 4.14. This
bug was introduced during refactoring of blk-wbt code.Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
Cc: stable@vger.kernel.org
Cc: Josef Bacik
Signed-off-by: Harshad Shirwadkar
Signed-off-by: Jens Axboe
29 Aug, 2019
1 commit
-
wbt already gets queue depth changed notification through
wbt_set_queue_depth(). Generalize it into
rq_qos_ops->queue_depth_changed() so that other rq_qos policies can
easily hook into the events too.Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe
28 Aug, 2019
1 commit
-
There are 4 users which check if queue is registered, so add one helper
to check it.Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Greg KH
Cc: Mike Snitzer
Cc: Bart Van Assche
Reviewed-by: Bart Van Assche
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
01 May, 2019
1 commit
-
Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
25 Jan, 2019
1 commit
-
This patch avoids that sparse reports the following warnings:
CHECK block/blk-wbt.c
block/blk-wbt.c:600:6: warning: symbol 'wbt_issue' was not declared. Should it be static?
block/blk-wbt.c:620:6: warning: symbol 'wbt_requeue' was not declared. Should it be static?
CC block/blk-wbt.o
block/blk-wbt.c:600:6: warning: no previous prototype for wbt_issue [-Wmissing-prototypes]
void wbt_issue(struct rq_qos *rqos, struct request *rq)
^~~~~~~~~
block/blk-wbt.c:620:6: warning: no previous prototype for wbt_requeue [-Wmissing-prototypes]
void wbt_requeue(struct rq_qos *rqos, struct request *rq)
^~~~~~~~~~~Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
17 Dec, 2018
1 commit
-
This information is helpful to either investigate issues, or understand
wbt's internal behaviour.Cc: Bart Van Assche
Cc: Omar Sandoval
Cc: Christoph Hellwig
Cc: Josef Bacik
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
12 Dec, 2018
1 commit
-
rwb_enabled() can't be changed when there is any inflight IO.
wbt_disable_default() may set rwb->wb_normal as zero, however the
blk_stat timer may still be pending, and the timer function will update
wrb->wb_normal again.This patch introduces blk_stat_deactivate() and applies it in
wbt_disable_default(), then the following IO hang triggered when running
parted & switching io scheduler can be fixed:[ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
[ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
[ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 369.940768] parted D 0 3645 3239 0x00000000
[ 369.941500] Call Trace:
[ 369.941874] ? __schedule+0x6d9/0x74c
[ 369.942392] ? wbt_done+0x5e/0x5e
[ 369.942864] ? wbt_cleanup_cb+0x16/0x16
[ 369.943404] ? wbt_done+0x5e/0x5e
[ 369.943874] schedule+0x67/0x78
[ 369.944298] io_schedule+0x12/0x33
[ 369.944771] rq_qos_wait+0xb5/0x119
[ 369.945193] ? karma_partition+0x1c2/0x1c2
[ 369.945691] ? wbt_cleanup_cb+0x16/0x16
[ 369.946151] wbt_wait+0x85/0xb6
[ 369.946540] __rq_qos_throttle+0x23/0x2f
[ 369.947014] blk_mq_make_request+0xe6/0x40a
[ 369.947518] generic_make_request+0x192/0x2fe
[ 369.948042] ? submit_bio+0x103/0x11f
[ 369.948486] ? __radix_tree_lookup+0x35/0xb5
[ 369.949011] submit_bio+0x103/0x11f
[ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
[ 369.949962] submit_bio_wait+0x53/0x7f
[ 369.950469] blkdev_issue_flush+0x8a/0xae
[ 369.951032] blkdev_fsync+0x2f/0x3a
[ 369.951502] do_fsync+0x2e/0x47
[ 369.951887] __x64_sys_fsync+0x10/0x13
[ 369.952374] do_syscall_64+0x89/0x149
[ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 369.953492] RIP: 0033:0x7f95a1e729d4
[ 369.953996] Code: Bad RIP value.
[ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
[ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
[ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
[ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
[ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008Cc: stable@vger.kernel.org
Cc: Paolo Valente
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
08 Dec, 2018
1 commit
-
Now that we have rq_qos_wait() in place, convert wbt_wait() over to
using it with it's specific callbacks.Signed-off-by: Josef Bacik
Signed-off-by: Jens Axboe
16 Nov, 2018
4 commits
-
Various spots check for q->mq_ops being non-NULL, but provide
a helper to do this instead.Where the ->mq_ops != NULL check is redundant, remove it.
Since mq == rq-based now that legacy is gone, get rid of the
queue_is_rq_based() and just use queue_is_mq() everywhere.Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
This isn't unused, if BFQ is modular we get into trouble.
Fixes: b6676f653f13 ("block: remove a few unused exports")
Signed-off-by: Jens Axboe -
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Unused now that the legacy request path is gone.
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
08 Nov, 2018
1 commit
-
Everything is blk-mq at this point, so it doesn't make any sense
to have this option available as it does nothing.Reviewed-by: Hannes Reinecke
Tested-by: Ming Lei
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
12 Oct, 2018
1 commit
-
Tetsuo brought to my attention that I screwed up the scale_up/scale_down
helpers when I factored out the rq-qos code. We need to wake up all the
waiters when we add slots for requests to make, not when we shrink the
slots. Otherwise we'll end up things waiting forever. This was a
mistake and simply puts everything back the way it was.cc: stable@vger.kernel.org
Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
eported-by: Tetsuo Handa
Signed-off-by: Josef Bacik
Signed-off-by: Jens Axboe
28 Aug, 2018
3 commits
-
We already note and mark discard and swap IO from bio_to_wbt_flags().
Signed-off-by: Jens Axboe
-
We have two potential issues:
1) After commit 2887e41b910b, we only wake one process at the time when
we finish an IO. We really want to wake up as many tasks as can
queue IO. Before this commit, we woke up everyone, which could cause
a thundering herd issue.2) A task can potentially consume two wakeups, causing us to (in
practice) miss a wakeup.Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right amount of tasks.Based on a patch from Jianchao Wang .
Tested-by: Agarwal, Anchal
Signed-off-by: Jens Axboe -
Prep patch for calling the handler from a different context,
no functional changes in this patch.Tested-by: Agarwal, Anchal
Signed-off-by: Jens Axboe
23 Aug, 2018
4 commits
-
A previous commit removed the ability to have per-rq flags. We used
those flags to maintain inflight counts. Since we don't have those
anymore, we have to always maintain inflight counts, even if wbt is
disabled. This is clearly suboptimal.Add a queue quiesce around changing the wbt latency settings from sysfs
to work around this. With that, we can reliably put the enabled check in
our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
consistent for the lifetime of the request.Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
Reviewed-by: Josef Bacik
Signed-off-by: Jens Axboe -
We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.Tested-by: Anchal Agarwal
Signed-off-by: Jens Axboe -
We need the memory barrier before checking the list head,
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().Tested-by: Anchal Agarwal
Signed-off-by: Jens Axboe -
Check it in one place, instead of in multiple places.
Tested-by: Anchal Agarwal
Signed-off-by: Jens Axboe
15 Aug, 2018
1 commit
-
On wbt invariant is that if one IO is tracked via WBT_TRACKED, rqw->inflight
should be updated for tracking this IO.But commit c1c80384c8f ("block: remove external dependency on wbt_flags")
forgets to remove the early handling of !rwb_enabled(rwb) inside wbt_wait(),
then the inflight counter may not be increased in wbt_wait(), but decreased
in wbt_done() for this kind of IO, so this counter may become negative, then
wbt_wait() may wait forever.This patch fixes the report in the following link:
https://marc.info/?l=linux-block&m=153221542021033&w=2
Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
Cc: Josef Bacik
Reported-by: Ming Lei
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
08 Aug, 2018
1 commit
-
I am currently running a large bare metal instance (i3.metal)
on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
4.18 kernel. I have a workload that simulates a database
workload and I am running into lockup issues when writeback
throttling is enabled,with the hung task detector also
kicking in.Crash dumps show that most CPUs (up to 50 of them) are
all trying to get the wbt wait queue lock while trying to add
themselves to it in __wbt_wait (see stack traces below).[ 0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[ 0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[ 0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
[ 0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
[ 0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
[ 0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
[ 0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
[ 0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
[ 0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
[ 0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[ 0.948132] FS: 0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
[ 0.948134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
[ 0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.948138] Call Trace:
[ 0.948139]
[ 0.948142] do_raw_spin_lock+0xad/0xc0
[ 0.948145] _raw_spin_lock_irqsave+0x44/0x4b
[ 0.948149] ? __wake_up_common_lock+0x53/0x90
[ 0.948150] __wake_up_common_lock+0x53/0x90
[ 0.948155] wbt_done+0x7b/0xa0
[ 0.948158] blk_mq_free_request+0xb7/0x110
[ 0.948161] __blk_mq_complete_request+0xcb/0x140
[ 0.948166] nvme_process_cq+0xce/0x1a0 [nvme]
[ 0.948169] nvme_irq+0x23/0x50 [nvme]
[ 0.948173] __handle_irq_event_percpu+0x46/0x300
[ 0.948176] handle_irq_event_percpu+0x20/0x50
[ 0.948179] handle_irq_event+0x34/0x60
[ 0.948181] handle_edge_irq+0x77/0x190
[ 0.948185] handle_irq+0xaf/0x120
[ 0.948188] do_IRQ+0x53/0x110
[ 0.948191] common_interrupt+0x87/0x87
[ 0.948192]
....
[ 0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
[ 0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
[ 0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
[ 0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
[ 0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
[ 0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
[ 0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
[ 0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
[ 0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
[ 0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
[ 0.311149] FS: 000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[ 0.311150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
[ 0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.311154] Call Trace:
[ 0.311157] do_raw_spin_lock+0xad/0xc0
[ 0.311160] _raw_spin_lock_irqsave+0x44/0x4b
[ 0.311162] ? prepare_to_wait_exclusive+0x28/0xb0
[ 0.311164] prepare_to_wait_exclusive+0x28/0xb0
[ 0.311167] wbt_wait+0x127/0x330
[ 0.311169] ? finish_wait+0x80/0x80
[ 0.311172] ? generic_make_request+0xda/0x3b0
[ 0.311174] blk_mq_make_request+0xd6/0x7b0
[ 0.311176] ? blk_queue_enter+0x24/0x260
[ 0.311178] ? generic_make_request+0xda/0x3b0
[ 0.311181] generic_make_request+0x10c/0x3b0
[ 0.311183] ? submit_bio+0x5c/0x110
[ 0.311185] submit_bio+0x5c/0x110
[ 0.311197] ? __ext4_journal_stop+0x36/0xa0 [ext4]
[ 0.311210] ext4_io_submit+0x48/0x60 [ext4]
[ 0.311222] ext4_writepages+0x810/0x11f0 [ext4]
[ 0.311229] ? do_writepages+0x3c/0xd0
[ 0.311239] ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
[ 0.311240] do_writepages+0x3c/0xd0
[ 0.311243] ? _raw_spin_unlock+0x24/0x30
[ 0.311245] ? wbc_attach_and_unlock_inode+0x165/0x280
[ 0.311248] ? __filemap_fdatawrite_range+0xa3/0xe0
[ 0.311250] __filemap_fdatawrite_range+0xa3/0xe0
[ 0.311253] file_write_and_wait_range+0x34/0x90
[ 0.311264] ext4_sync_file+0x151/0x500 [ext4]
[ 0.311267] do_fsync+0x38/0x60
[ 0.311270] SyS_fsync+0xc/0x10
[ 0.311272] do_syscall_64+0x6f/0x170
[ 0.311274] entry_SYSCALL_64_after_hwframe+0x42/0xb7In the original patch, wbt_done is waking up all the exclusive
processes in the wait queue, which can cause a thundering herd
if there is a large number of writer threads in the queue. The
original intention of the code seems to be to wake up one thread
only however, it uses wake_up_all() in __wbt_done(), and then
uses the following check in __wbt_wait to have only one thread
actually get out of the wait loop:if (waitqueue_active(&rqw->wait) &&
rqw->wait.head.next != &wait->entry)
return false;The problem with this is that the wait entry in wbt_wait is
define with DEFINE_WAIT, which uses the autoremove wakeup function.
That means that the above check is invalid - the wait entry will
have been removed from the queue already by the time we hit the
check in the loop.Secondly, auto-removing the wait entries also means that the wait
queue essentially gets reordered "randomly" (e.g. threads re-add
themselves in the order they got to run after being woken up).
Additionally, new requests entering wbt_wait might overtake requests
that were queued earlier, because the wait queue will be
(temporarily) empty after the wake_up_all, so the waitqueue_active
check will not stop them. This can cause certain threads to starve
under high load.The fix is to leave the woken up requests in the queue and remove
them in finish_wait() once the current thread breaks out of the
wait loop in __wbt_wait. This will ensure new requests always
end up at the back of the queue, and they won't overtake requests
that are already in the wait queue. With that change, the loop
in wbt_wait is also in line with many other wait loops in the kernel.
Waking up just one thread drastically reduces lock contention, as
does moving the wait queue add/remove out of the loop.A significant drop in lockdep's lock contention numbers is seen when
running the test application on the patched kernel.Signed-off-by: Anchal Agarwal
Signed-off-by: Frank van der Linden
Signed-off-by: Jens Axboe
09 Jul, 2018
2 commits
-
We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.Signed-off-by: Josef Bacik
Signed-off-by: Jens Axboe -
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.Signed-off-by: Josef Bacik
Signed-off-by: Jens Axboe
09 May, 2018
2 commits
-
struct blk_issue_stat squashes three things into one u64:
- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttlingIt turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe -
issue_stat is going to go away, so first make writeback throttling take
the containing request, update the internal wbt helpers accordingly, and
change rwb->sync_cookie to be the request pointer instead of the
issue_stat pointer. No functional change.Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe