17 Aug, 2022

1 commit

  • [ Upstream commit 14a6e2eb7df5c7897c15b109cba29ab0c4a791b6 ]

    In our test of iocost, we encountered some list add/del corruptions of
    inner_walk list in ioc_timer_fn.

    The reason can be described as follows:

    cpu 0                                  cpu 1
    ioc_qos_write                          ioc_qos_write

    ioc = q_to_ioc(queue);
    if (!ioc) {
            ioc = kzalloc();
                                           ioc = q_to_ioc(queue);
                                           if (!ioc) {
                                                   ioc = kzalloc();
                                                   ...
                                                   rq_qos_add(q, rqos);
                                           }
            ...
            rq_qos_add(q, rqos);
            ...
    }

    When the io.cost.qos file is written by two cpus concurrently, rq_qos may
    be added to one disk twice. In that case, there will be two iocs enabled
    and running on one disk, each owning different iocgs on its active list.
    In ioc_timer_fn, because the iocgs from the two iocs share the same root
    iocg, the root iocg's walk_list may be overwritten by either side, which
    leads to list add/del corruption when building or destroying the
    inner_walk list.

    So far, the blk-rq-qos framework has implicitly worked on the assumption
    of one rq_qos instance per type per queue. This patch makes that
    assumption explicit and also fixes the crash above.
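The one-instance-per-type rule can be sketched in plain C. The names below (rq_qos_add(), the id field) are illustrative stand-ins for the kernel code, and the kernel performs this check under the queue lock, whereas this model is single-threaded:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the fix: rq_qos_add() checks for an
 * existing instance of the same type before linking, so a second racing
 * caller fails instead of attaching a duplicate. */
struct rqos {
    int id;                 /* rq-qos type, e.g. a cost/wbt/latency policy */
    struct rqos *next;
};

struct queue {
    struct rqos *rqos_list; /* singly linked list of attached policies */
};

/* Returns false if a policy with the same id is already attached.
 * In the kernel this check would run under the queue lock. */
static bool rq_qos_add(struct queue *q, struct rqos *rqos)
{
    for (struct rqos *cur = q->rqos_list; cur; cur = cur->next)
        if (cur->id == rqos->id)
            return false;   /* one instance per type per queue */
    rqos->next = q->rqos_list;
    q->rqos_list = rqos;
    return true;
}
```

With this shape, the second of two concurrent writers simply gets a failure back rather than leaving two iocs running on one disk.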

    Signed-off-by: Jinke Han
    Reviewed-by: Muchun Song
    Acked-by: Tejun Heo
    Cc:
    Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jinke Han
     

19 Nov, 2021

1 commit

  • [ Upstream commit 480d42dc001bbfe953825a92073012fcd5a99161 ]

    The timer callback used to evaluate if the latency is exceeded can be
    executed after the corresponding disk has been released, causing the
    following NULL pointer dereference:

    [ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
    [ 119.987617] #PF: supervisor read access in kernel mode
    [ 119.987971] #PF: error_code(0x0000) - not-present page
    [ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
    [ 119.988697] Oops: 0000 [#1] SMP NOPTI
    [ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi #rc5+arighi
    [ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
    [ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
    [ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
    [ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
    [ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
    [ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
    [ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
    [ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
    [ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
    [ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
    [ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
    [ 119.995906] Call Trace:
    [ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
    [ 119.996505] blk_stat_timer_fn+0x138/0x140
    [ 119.996830] call_timer_fn+0x2b/0x100
    [ 119.997136] __run_timers.part.0+0x1d1/0x240
    [ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
    [ 119.997826] ? ktime_get+0x3e/0xa0
    [ 119.998110] ? native_apic_msr_write+0x2c/0x30
    [ 119.998456] ? lapic_next_event+0x20/0x30
    [ 119.998779] ? clockevents_program_event+0x94/0xf0
    [ 119.999150] run_timer_softirq+0x2a/0x50
    [ 119.999465] __do_softirq+0xcb/0x26f
    [ 119.999764] irq_exit_rcu+0x8c/0xb0
    [ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
    [ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
    [ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20

    In this case simply return from the timer callback (no action
    required) to prevent the NULL pointer dereference.
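A minimal user-space model of the fix, with invented structure names standing in for the kernel's: the timer callback returns early when the backing data has already been released, so nothing is dereferenced:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only: the real wb_timer_fn() operates on blk-stat
 * callbacks; here a NULL stats pointer models the released disk. */
struct disk_stats { long total; };
struct rwb { struct disk_stats *stats; long evaluated; };

static void wb_timer_fn(struct rwb *rwb)
{
    if (!rwb->stats)        /* disk already gone: nothing to evaluate */
        return;
    rwb->evaluated += rwb->stats->total;
}
```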

    BugLink: https://bugs.launchpad.net/bugs/1947557
    Link: https://lore.kernel.org/linux-mm/YWRNVTk9N8K0RMst@arighi-desktop/
    Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
    Signed-off-by: Andrea Righi
    Link: https://lore.kernel.org/r/YW6N2qXpBU3oc50q@arighi-desktop
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Andrea Righi
     

24 Aug, 2021

1 commit

  • Replace the magic lookup through the kobject tree with an explicit
    backpointer, given that the device model links are set up and torn
    down at times when I/O is still possible, leading to potential
    NULL or invalid pointer dereferences.
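The design choice can be illustrated with a hedged sketch (bdi_owner() and the field names are invented for this example): a backpointer set at allocation time stays valid for the object's whole lifetime, whereas a lookup through sysfs/kobject links has a window where it returns NULL while I/O is still in flight:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: an explicit backpointer replaces the "magic" tree lookup. */
struct gendisk { int id; };
struct bdi {
    struct gendisk *owner;  /* explicit backpointer (the fix) */
};

static struct gendisk *bdi_owner(struct bdi *bdi)
{
    return bdi->owner;      /* no kobject walk, no NULL window */
}
```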

    Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
    Reported-by: syzbot
    Signed-off-by: Christoph Hellwig
    Tested-by: Sven Schnelle
    Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Aug, 2021

1 commit


22 Jun, 2021

2 commits

    After commit a79050434b45 ("blk-rq-qos: refactor out common elements of
    blk-wbt"), once throttling was disabled by wbt_disable_default() it
    could not be enabled again. Fix this by setting enable_state back to
    WBT_STATE_ON_DEFAULT.

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Signed-off-by: Zhang Yi
    Link: https://lore.kernel.org/r/20210619093700.920393-3-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    Zhang Yi
     
    Currently we disable wbt by simply zeroing rwb->wb_normal in
    wbt_disable_default() when switching the elevator to bfq, but this is
    not safe: a later queue-depth change can turn the check into a false
    positive. If wbt appears enabled again between wbt_wait() and
    wbt_track() while a write request is submitted, rqw->inflight drops to
    -1 in wbt_done(), which ends up triggering an IO hang. Fix this issue
    by introducing a new state that explicitly marks wbt as disabled.
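The state machine can be modeled in a few lines of user-space C. The state names follow the commits on this page; the helper bodies are simplified sketches, not the kernel's exact functions:

```c
#include <assert.h>
#include <stdbool.h>

/* Disabling sets a dedicated state instead of zeroing wb_normal, so a
 * queue-depth change can no longer make wbt look enabled again. */
enum wbt_state {
    WBT_STATE_ON_DEFAULT,
    WBT_STATE_ON_MANUAL,
    WBT_STATE_OFF_DEFAULT,  /* new: wbt disabled by default policy */
};

struct rwb { enum wbt_state enable_state; };

static bool wbt_enabled(const struct rwb *rwb)
{
    return rwb->enable_state != WBT_STATE_OFF_DEFAULT;
}

static void wbt_disable_default(struct rwb *rwb)
{
    if (rwb->enable_state == WBT_STATE_ON_DEFAULT)
        rwb->enable_state = WBT_STATE_OFF_DEFAULT;
}

/* Companion fix from the sibling commit: re-enabling restores the state. */
static void wbt_enable_default(struct rwb *rwb)
{
    if (rwb->enable_state == WBT_STATE_OFF_DEFAULT)
        rwb->enable_state = WBT_STATE_ON_DEFAULT;
}
```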

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Signed-off-by: Zhang Yi
    Link: https://lore.kernel.org/r/20210619093700.920393-2-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    Zhang Yi
     

19 Jun, 2021

1 commit

  • Now wbt_wait() returns void, so remove the now-outdated comment.

    Signed-off-by: lijiazi
    Link: https://lore.kernel.org/r/1623986240-13878-1-git-send-email-lijiazi@xiaomi.com
    Signed-off-by: Jens Axboe

    lijiazi
     

27 Jan, 2021

1 commit


01 Dec, 2020

1 commit


24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and its variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings when it is the case.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

30 May, 2020

2 commits


17 Apr, 2020

1 commit


06 Oct, 2019

1 commit

  • scale_up wakes up waiters after scaling up. But after scaling to the
    max, it should not wake up more waiters, as they will have nothing to
    do. This patch fixes this by making scale_up (and also scale_down)
    return immediately when the threshold is reached.

    This bug causes increased fdatasync latency when fdatasync and dd
    conv=sync are performed in parallel on 4.19 compared to 4.14. This
    bug was introduced during refactoring of blk-wbt code.
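A user-space sketch of the idea; rq_depth_scale_up(), the field names, and the wakeups counter are illustrative, not the kernel's exact code:

```c
#include <assert.h>
#include <stdbool.h>

/* scale_up() reports whether the limit actually grew; the caller only
 * wakes waiters when it did, so hitting the max stops useless wakeups. */
struct rqd { int limit, max; int wakeups; };

static bool scale_up(struct rqd *d)
{
    if (d->limit >= d->max)
        return false;       /* already at max: no new slots, no wake */
    d->limit++;
    return true;
}

static void rq_depth_scale_up(struct rqd *d)
{
    if (scale_up(d))
        d->wakeups++;       /* stand-in for wake_up_all(&rqw->wait) */
}
```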

    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Cc: stable@vger.kernel.org
    Cc: Josef Bacik
    Signed-off-by: Harshad Shirwadkar
    Signed-off-by: Jens Axboe

    Harshad Shirwadkar
     

29 Aug, 2019

1 commit


28 Aug, 2019

1 commit


01 May, 2019

1 commit


25 Jan, 2019

1 commit

  • This patch avoids that sparse reports the following warnings:

    CHECK block/blk-wbt.c
    block/blk-wbt.c:600:6: warning: symbol 'wbt_issue' was not declared. Should it be static?
    block/blk-wbt.c:620:6: warning: symbol 'wbt_requeue' was not declared. Should it be static?
    CC block/blk-wbt.o
    block/blk-wbt.c:600:6: warning: no previous prototype for wbt_issue [-Wmissing-prototypes]
    void wbt_issue(struct rq_qos *rqos, struct request *rq)
    ^~~~~~~~~
    block/blk-wbt.c:620:6: warning: no previous prototype for wbt_requeue [-Wmissing-prototypes]
    void wbt_requeue(struct rq_qos *rqos, struct request *rq)
    ^~~~~~~~~~~

    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

17 Dec, 2018

1 commit


12 Dec, 2018

1 commit

  • rwb_enabled() can't be changed when there is any inflight IO.

    wbt_disable_default() may set rwb->wb_normal as zero, however the
    blk_stat timer may still be pending, and the timer function will update
    wrb->wb_normal again.

    This patch introduces blk_stat_deactivate() and applies it in
    wbt_disable_default(), then the following IO hang triggered when running
    parted & switching io scheduler can be fixed:

    [ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
    [ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
    [ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 369.940768] parted D 0 3645 3239 0x00000000
    [ 369.941500] Call Trace:
    [ 369.941874] ? __schedule+0x6d9/0x74c
    [ 369.942392] ? wbt_done+0x5e/0x5e
    [ 369.942864] ? wbt_cleanup_cb+0x16/0x16
    [ 369.943404] ? wbt_done+0x5e/0x5e
    [ 369.943874] schedule+0x67/0x78
    [ 369.944298] io_schedule+0x12/0x33
    [ 369.944771] rq_qos_wait+0xb5/0x119
    [ 369.945193] ? karma_partition+0x1c2/0x1c2
    [ 369.945691] ? wbt_cleanup_cb+0x16/0x16
    [ 369.946151] wbt_wait+0x85/0xb6
    [ 369.946540] __rq_qos_throttle+0x23/0x2f
    [ 369.947014] blk_mq_make_request+0xe6/0x40a
    [ 369.947518] generic_make_request+0x192/0x2fe
    [ 369.948042] ? submit_bio+0x103/0x11f
    [ 369.948486] ? __radix_tree_lookup+0x35/0xb5
    [ 369.949011] submit_bio+0x103/0x11f
    [ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
    [ 369.949962] submit_bio_wait+0x53/0x7f
    [ 369.950469] blkdev_issue_flush+0x8a/0xae
    [ 369.951032] blkdev_fsync+0x2f/0x3a
    [ 369.951502] do_fsync+0x2e/0x47
    [ 369.951887] __x64_sys_fsync+0x10/0x13
    [ 369.952374] do_syscall_64+0x89/0x149
    [ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 369.953492] RIP: 0033:0x7f95a1e729d4
    [ 369.953996] Code: Bad RIP value.
    [ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    [ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
    [ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
    [ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
    [ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
    [ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008

    Cc: stable@vger.kernel.org
    Cc: Paolo Valente
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Dec, 2018

1 commit


16 Nov, 2018

4 commits


08 Nov, 2018

1 commit


12 Oct, 2018

1 commit

  • Tetsuo brought to my attention that I screwed up the scale_up/scale_down
    helpers when I factored out the rq-qos code. We need to wake up all the
    waiters when we add slots for requests, not when we shrink the slots.
    Otherwise we'll end up with things waiting forever. This was a mistake,
    and this patch simply puts everything back the way it was.

    cc: stable@vger.kernel.org
    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Reported-by: Tetsuo Handa
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

28 Aug, 2018

3 commits

  • We already note and mark discard and swap IO from bio_to_wbt_flags().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We have two potential issues:

    1) After commit 2887e41b910b, we only wake one process at a time when
    we finish an IO. We really want to wake up as many tasks as can
    queue IO. Before this commit, we woke up everyone, which could cause
    a thundering herd issue.

    2) A task can potentially consume two wakeups, causing us to (in
    practice) miss a wakeup.

    Fix both by providing our own wakeup function, which stops
    __wake_up_common() from waking up more tasks if we fail to get a
    queueing token. With the strict ordering we have on the wait list, this
    wakes the right tasks and the right amount of tasks.

    Based on a patch from Jianchao Wang .
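The token-gated wakeup can be sketched in user space. wake_all_with_tokens() and the fixed-size array are invented stand-ins for the kernel's ordered wait-list walk inside __wake_up_common():

```c
#include <assert.h>
#include <stdbool.h>

/* Walk waiters in order and stop as soon as a queueing token cannot be
 * granted: every wakeup is matched by exactly one token, so none are
 * wasted and none are missed. */
#define NWAIT 4

struct rqw { int free_tokens; bool woken[NWAIT]; };

static int wake_all_with_tokens(struct rqw *w)
{
    int n = 0;
    for (int i = 0; i < NWAIT; i++) {
        if (w->free_tokens == 0)
            break;          /* no token: stop waking further tasks */
        w->free_tokens--;
        w->woken[i] = true;
        n++;
    }
    return n;
}
```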

    Tested-by: Agarwal, Anchal
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Prep patch for calling the handler from a different context,
    no functional changes in this patch.

    Tested-by: Agarwal, Anchal
    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 Aug, 2018

4 commits

  • A previous commit removed the ability to have per-rq flags. We used
    those flags to maintain inflight counts. Since we don't have those
    anymore, we have to always maintain inflight counts, even if wbt is
    disabled. This is clearly suboptimal.

    Add a queue quiesce around changing the wbt latency settings from sysfs
    to work around this. With that, we can reliably put the enabled check in
    our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
    consistent for the lifetime of the request.

    Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We need to do this inside the loop as well, or we can allow new
    IO to supersede previous IO.

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We need the memory barrier before checking the list head,
    use the appropriate helper for this. The matching queue
    side memory barrier is provided by set_current_state().
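Loosely modeled with C11 acquire/release atomics standing in for the kernel's barrier pairing (the function names are invented for the sketch): the waiter publishes its list update with release semantics, and the waker reads the list head with acquire semantics before deciding nobody is queued:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* "queued" models the wait-list head being non-empty. */
static _Atomic bool queued;

static void waiter_prepare(void)
{
    /* the kernel's set_current_state() provides this side's barrier */
    atomic_store_explicit(&queued, true, memory_order_release);
}

static bool waker_should_wake(void)
{
    /* barrier before reading the list head (the helper this commit uses) */
    return atomic_load_explicit(&queued, memory_order_acquire);
}
```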

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Check it in one place, instead of in multiple places.

    Tested-by: Anchal Agarwal
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Aug, 2018

1 commit

  • One wbt invariant is that if an IO is tracked via WBT_TRACKED,
    rqw->inflight should be updated to account for that IO.

    But commit c1c80384c8f ("block: remove external dependency on wbt_flags")
    forgot to remove the early handling of !rwb_enabled(rwb) inside
    wbt_wait(), so the inflight counter may not be increased in wbt_wait()
    yet is still decreased in wbt_done() for this kind of IO. The counter
    can therefore become negative, and wbt_wait() may wait forever.

    This patch fixes the report in the following link:

    https://marc.info/?l=linux-block&m=153221542021033&w=2
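The invariant can be modeled in a few lines; wbt_track()/wbt_done() here are simplified sketches of the kernel functions, keyed purely on the WBT_TRACKED flag:

```c
#include <assert.h>
#include <stdbool.h>

/* inflight is only decremented for I/O that was actually tracked, so the
 * counter can never go negative and wbt_wait() cannot stall on it. */
#define WBT_TRACKED 1u

struct rqw { int inflight; };

static unsigned int wbt_track(struct rqw *w, bool throttled)
{
    if (!throttled)
        return 0;            /* not tracked: no inflight accounting */
    w->inflight++;
    return WBT_TRACKED;
}

static void wbt_done(struct rqw *w, unsigned int flags)
{
    if (flags & WBT_TRACKED) /* only undo what wbt_track() counted */
        w->inflight--;
}
```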

    Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
    Cc: Josef Bacik
    Reported-by: Ming Lei
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Aug, 2018

1 commit

  • I am currently running a large bare metal instance (i3.metal)
    on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
    4.18 kernel. I have a workload that simulates a database
    workload and I am running into lockup issues when writeback
    throttling is enabled, with the hung task detector also
    kicking in.

    Crash dumps show that most CPUs (up to 50 of them) are
    all trying to get the wbt wait queue lock while trying to add
    themselves to it in __wbt_wait (see stack traces below).

    [ 0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
    [ 0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
    [ 0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
    [ 0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
    [ 0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
    [ 0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
    [ 0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
    [ 0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
    [ 0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
    [ 0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
    [ 0.948132] FS: 0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
    [ 0.948134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
    [ 0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 0.948138] Call Trace:
    [ 0.948139]
    [ 0.948142] do_raw_spin_lock+0xad/0xc0
    [ 0.948145] _raw_spin_lock_irqsave+0x44/0x4b
    [ 0.948149] ? __wake_up_common_lock+0x53/0x90
    [ 0.948150] __wake_up_common_lock+0x53/0x90
    [ 0.948155] wbt_done+0x7b/0xa0
    [ 0.948158] blk_mq_free_request+0xb7/0x110
    [ 0.948161] __blk_mq_complete_request+0xcb/0x140
    [ 0.948166] nvme_process_cq+0xce/0x1a0 [nvme]
    [ 0.948169] nvme_irq+0x23/0x50 [nvme]
    [ 0.948173] __handle_irq_event_percpu+0x46/0x300
    [ 0.948176] handle_irq_event_percpu+0x20/0x50
    [ 0.948179] handle_irq_event+0x34/0x60
    [ 0.948181] handle_edge_irq+0x77/0x190
    [ 0.948185] handle_irq+0xaf/0x120
    [ 0.948188] do_IRQ+0x53/0x110
    [ 0.948191] common_interrupt+0x87/0x87
    [ 0.948192]
    ....
    [ 0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
    [ 0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
    [ 0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
    [ 0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
    [ 0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
    [ 0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
    [ 0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
    [ 0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
    [ 0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
    [ 0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
    [ 0.311149] FS: 000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
    [ 0.311150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
    [ 0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 0.311154] Call Trace:
    [ 0.311157] do_raw_spin_lock+0xad/0xc0
    [ 0.311160] _raw_spin_lock_irqsave+0x44/0x4b
    [ 0.311162] ? prepare_to_wait_exclusive+0x28/0xb0
    [ 0.311164] prepare_to_wait_exclusive+0x28/0xb0
    [ 0.311167] wbt_wait+0x127/0x330
    [ 0.311169] ? finish_wait+0x80/0x80
    [ 0.311172] ? generic_make_request+0xda/0x3b0
    [ 0.311174] blk_mq_make_request+0xd6/0x7b0
    [ 0.311176] ? blk_queue_enter+0x24/0x260
    [ 0.311178] ? generic_make_request+0xda/0x3b0
    [ 0.311181] generic_make_request+0x10c/0x3b0
    [ 0.311183] ? submit_bio+0x5c/0x110
    [ 0.311185] submit_bio+0x5c/0x110
    [ 0.311197] ? __ext4_journal_stop+0x36/0xa0 [ext4]
    [ 0.311210] ext4_io_submit+0x48/0x60 [ext4]
    [ 0.311222] ext4_writepages+0x810/0x11f0 [ext4]
    [ 0.311229] ? do_writepages+0x3c/0xd0
    [ 0.311239] ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
    [ 0.311240] do_writepages+0x3c/0xd0
    [ 0.311243] ? _raw_spin_unlock+0x24/0x30
    [ 0.311245] ? wbc_attach_and_unlock_inode+0x165/0x280
    [ 0.311248] ? __filemap_fdatawrite_range+0xa3/0xe0
    [ 0.311250] __filemap_fdatawrite_range+0xa3/0xe0
    [ 0.311253] file_write_and_wait_range+0x34/0x90
    [ 0.311264] ext4_sync_file+0x151/0x500 [ext4]
    [ 0.311267] do_fsync+0x38/0x60
    [ 0.311270] SyS_fsync+0xc/0x10
    [ 0.311272] do_syscall_64+0x6f/0x170
    [ 0.311274] entry_SYSCALL_64_after_hwframe+0x42/0xb7

    In the original patch, wbt_done is waking up all the exclusive
    processes in the wait queue, which can cause a thundering herd
    if there is a large number of writer threads in the queue. The
    original intention of the code seems to have been to wake up only one
    thread; however, it uses wake_up_all() in __wbt_done(), and then
    uses the following check in __wbt_wait() to have only one thread
    actually get out of the wait loop:

    if (waitqueue_active(&rqw->wait) &&
        rqw->wait.head.next != &wait->entry)
            return false;

    The problem with this is that the wait entry in wbt_wait is
    defined with DEFINE_WAIT, which uses the autoremove wakeup function.
    That means that the above check is invalid - the wait entry will
    have been removed from the queue already by the time we hit the
    check in the loop.

    Secondly, auto-removing the wait entries also means that the wait
    queue essentially gets reordered "randomly" (e.g. threads re-add
    themselves in the order they got to run after being woken up).
    Additionally, new requests entering wbt_wait might overtake requests
    that were queued earlier, because the wait queue will be
    (temporarily) empty after the wake_up_all, so the waitqueue_active
    check will not stop them. This can cause certain threads to starve
    under high load.

    The fix is to leave the woken up requests in the queue and remove
    them in finish_wait() once the current thread breaks out of the
    wait loop in __wbt_wait. This will ensure new requests always
    end up at the back of the queue, and they won't overtake requests
    that are already in the wait queue. With that change, the loop
    in wbt_wait is also in line with many other wait loops in the kernel.
    Waking up just one thread drastically reduces lock contention, as
    does moving the wait queue add/remove out of the loop.

    A significant drop in lockdep's lock contention numbers is seen when
    running the test application on the patched kernel.
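The ordering property can be sketched with a tiny FIFO model (wq_add/wq_finish and may_inline_issue are invented names): woken waiters stay on the queue until they remove themselves, so a newcomer that sees a non-empty queue always lines up behind them instead of overtaking:

```c
#include <assert.h>

#define QMAX 8

struct waitq { int entries[QMAX]; int head, tail; };

static int wq_empty(const struct waitq *q) { return q->head == q->tail; }

/* Newcomers may issue directly only when nobody is queued ahead. */
static int may_inline_issue(const struct waitq *q) { return wq_empty(q); }

static void wq_add(struct waitq *q, int id) { q->entries[q->tail++] = id; }

/* Models finish_wait(): the waiter removes itself only after it has
 * broken out of the wait loop, preserving strict FIFO order. */
static int wq_finish(struct waitq *q) { return q->entries[q->head++]; }
```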

    Signed-off-by: Anchal Agarwal
    Signed-off-by: Frank van der Linden
    Signed-off-by: Jens Axboe

    Anchal Agarwal
     

09 Jul, 2018

2 commits


09 May, 2018

2 commits

  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4 byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit.
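Packing several fields into one u64 works roughly like this sketch. The 48-bit split and the helper names are invented for illustration; the kernel's actual layout differs:

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit word carrying a timestamp in the low bits and a small
 * size/flags field above it, in the spirit of blk_issue_stat. */
#define TIME_BITS 48
#define TIME_MASK ((1ULL << TIME_BITS) - 1)

static uint64_t pack(uint64_t time_ns, uint16_t extra)
{
    return (time_ns & TIME_MASK) | ((uint64_t)extra << TIME_BITS);
}

static uint64_t unpack_time(uint64_t v)  { return v & TIME_MASK; }
static uint16_t unpack_extra(uint64_t v) { return (uint16_t)(v >> TIME_BITS); }
```

The commit goes the other way, un-squashing these fields into a spare hole in struct request, but the packed form shows what blk_issue_stat had been doing.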

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • issue_stat is going to go away, so first make writeback throttling take
    the containing request, update the internal wbt helpers accordingly, and
    change rwb->sync_cookie to be the request pointer instead of the
    issue_stat pointer. No functional change.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval