Eric Lee / smarc-fsl-linux-kernel

10 Sep, 2020

1 commit

9f4ab0172 blk-iocost: ioc_pd_free() shouldn't assume irq disabled ... Browse Code »

commit 5aeac7c4b16069aae49005f0a8d4526baa83341b upstream.

ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled
when it can be called with irq disabled or enabled. This has a small chance
of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations
instead.

Signed-off-by: Tejun Heo
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2020-09-10 01:12:35 +0800

19 Aug, 2020

1 commit

2f53a4b54 iocost: Fix check condition of iocg abs_vdebt ... Browse Code »

[ Upstream commit d9012a59db54442d5b2fcfdfcded35cf566397d3 ]

We shouldn't skip iocg when its abs_vdebt is not zero.

Fixes: 0b80f9866e6b ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock")
Signed-off-by: Chengming Zhou
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin

Chengming Zhou
2020-08-19 14:15:58 +0800

22 Jun, 2020

1 commit

894d9cd52 iocost: don't let vrate run wild while there's no saturation signal ... Browse Code »

[ Upstream commit 81ca627a933063fa63a6d4c66425de822a2ab7f5 ]

When the QoS targets are met and nothing is being throttled, there's
no way to tell how saturated the underlying device is - it could be
almost entirely idle, at the cusp of saturation or anywhere inbetween.
Given that there's no information, it's best to keep vrate as-is in
this state. Before 7cd806a9a953 ("iocost: improve nr_lagging
handling"), this was the case - if the device isn't missing QoS
targets and nothing is being throttled, busy_level was reset to zero.

While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve
nr_lagging handling") broke this. Now, while the device is hitting
QoS targets and nothing is being throttled, vrate keeps getting
adjusted according to the existing busy_level.

This led to vrate keeping climing till it hits max when there's an IO
issuer with limited request concurrency if the vrate started low.
vrate starts getting adjusted upwards until the issuer can issue IOs
w/o being throttled. From then on, QoS targets keeps getting met and
nothing on the system needs throttling and vrate keeps getting
increased due to the existing busy_level.

This patch makes the following changes to the busy_level logic.

* Reset busy_level if nr_shortages is zero to avoid the above
scenario.

* Make non-zero nr_lagging block lowering nr_level but still clear
positive busy_level if there's clear non-saturation signal - QoS
targets are met and nr_shortages is non-zero. nr_lagging's role is
preventing adjusting vrate upwards while there are long-running
commands and it shouldn't keep busy_level positive while there's
clear non-saturation signal.

* Restructure code for clarity and add comments.

Signed-off-by: Tejun Heo
Reported-by: Andy Newell
Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling")
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin

Tejun Heo
2020-06-22 15:31:06 +0800

14 May, 2020

1 commit

34ca08008 iocost: protect iocg->abs_vdebt with iocg->waitq.lock ... Browse Code »

commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.

abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
is and controls the activation of use_delay mechanism. Once a cgroup goes
over budget from forced IOs, it has to pay it back with its future budget.
The progress guarantee on debt paying comes from the iocg being active -
active iocgs are processed by the periodic timer, which ensures that as time
passes the debts dissipate and the iocg returns to normal operation.

However, both iocg activation and vdebt handling are asynchronous and a
sequence like the following may happen.

1. The iocg is in the process of being deactivated by the periodic timer.

2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
without anything because it still sees that the iocg is already active.

3. The iocg is deactivated.

4. The bio from #2 is over budget but needs to be forced. It increases
abs_vdebt and goes over the threshold and enables use_delay.

5. IO control is enabled for the iocg's subtree and now IOs are attributed
to the descendant cgroups and the iocg itself no longer issues IOs.

This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no
further IOs which can activate it. This can end up unduly punishing all the
descendants cgroups.

The usual throttling path has the same issue - the iocg must be active while
throttled to ensure that future event will wake it up - and solves the
problem by synchronizing the throttling path with a spinlock. abs_vdebt
handling is another form of overage handling and shares a lot of
characteristics including the fact that it isn't in the hottest path.

This patch fixes the above and other possible races by strictly
synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.

Signed-off-by: Tejun Heo
Reported-by: Vlad Dmitriev
Cc: stable@vger.kernel.org # v5.4+
Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future")
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2020-05-14 13:58:27 +0800

02 May, 2020

1 commit

9c5c94c50 blk-iocost: Fix error on iocost_ioc_vrate_adj ... Browse Code »

commit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream.

Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
argument of the iocost_ioc_vrate_adj trace entry defined in
include/trace/events/iocost.h leading to the following error:

/tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
, u32[]* __tracepoint_arg_missed_ppm

That argument type is indeed rather complex and hard to read. Looking
at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying
the argument to a simple "u32 *missed_ppm" and adjusting the trace
entry accordingly, the compilation error was gone.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Steven Rostedt (VMware)
Acked-by: Tejun Heo
Signed-off-by: Waiman Long
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Waiman Long
2020-05-02 14:48:53 +0800

18 Mar, 2020

1 commit

b7e54dd75 blk-iocost: fix incorrect vtime comparison in iocg_is_idle() ... Browse Code »

commit dcd6589b11d3b1e71f516a87a7b9646ed356b4c0 upstream.

vtimes may wrap and time_before/after64() should be used to determine
whether a given vtime is before or after another. iocg_is_idle() was
incorrectly using plain "
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2020-03-18 14:17:52 +0800

31 Dec, 2019

1 commit

377a8744c iocost: over-budget forced IOs should schedule async delay ... Browse Code »

commit d7bd15a138aef3be227818aad9c501e43c89c8c5 upstream.

When over-budget IOs are force-issued through root cgroup,
iocg_kick_delay() adjusts the async delay accordingly but doesn't
actually schedule async throttle for the issuing task. This bug is
pretty well masked because sooner or later the offending threads are
gonna get directly throttled on regular IOs or have async delay
scheduled by mem_cgroup_throttle_swaprate().

However, it can affect control quality on filesystem metadata heavy
operations. Let's fix it by invoking blkcg_schedule_throttle() when
iocg_kick_delay() says async delay is needed.

Signed-off-by: Tejun Heo
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org
Reported-by: Josef Bacik
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2019-12-31 23:46:19 +0800

15 Nov, 2019

1 commit

8b37bc277 iocost: check active_list of all the ancestors in iocg_activate() ... Browse Code »

There is a bug that checking the same active_list over and over again
in iocg_activate(). The intention of the code was checking whether all
the ancestors and self have already been activated. So fix it.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo
Signed-off-by: Jiufei Xue
Signed-off-by: Jens Axboe

Jiufei Xue
2019-11-15 04:56:54 +0800

01 Nov, 2019

1 commit

41591a51f iocost: don't nest spin_lock_irq in ioc_weight_write() ... Browse Code »

This code causes a static analysis warning:

block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq'

We disable IRQs in blkg_conf_prep() and re-enable them in
blkg_conf_finish(). IRQ disable/enable should not be nested because
that means the IRQs will be enabled at the first unlock instead of the
second one.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo
Signed-off-by: Dan Carpenter
Signed-off-by: Jens Axboe

Dan Carpenter
2019-11-01 01:40:57 +0800

26 Sep, 2019

3 commits

7afcccafa iocost: bump up default latency targets for hard disks ... Browse Code »

The default hard disk param sets latency targets at 50ms. As the
default target percentiles are zero, these don't directly regulate
vrate; however, they're still used to calculate the period length -
100ms in this case.

This is excessively low. A SATA drive with QD32 saturated with random
IOs can easily reach avg completion latency of several hundred msecs.
A period duration which is substantially lower than avg completion
latency can lead to wildly fluctuating vrate.

Let's bump up the default latency targets to 250ms so that the period
duration is sufficiently long.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-09-26 15:12:01 +0800
7cd806a9a iocost: improve nr_lagging handling ... Browse Code »

Some IOs may span multiple periods. As latencies are collected on
completion, the inbetween periods won't register them and may
incorrectly decide to increase vrate. nr_lagging tracks these IOs to
avoid those situations. Currently, whenever there are IOs which are
spanning from the previous period, busy_level is reset to 0 if
negative thus suppressing vrate increase.

This has the following two problems.

* When latency target percentiles aren't set, vrate adjustment should
only be governed by queue depth depletion; however, the current code
keeps nr_lagging active which pulls in latency results and can keep
down vrate unexpectedly.

* When lagging condition is detected, it resets the entire negative
busy_level. This turned out to be way too aggressive on some
devices which sometimes experience extended latencies on a small
subset of commands. In addition, a lagging IO will be accounted as
latency target miss on completion anyway and resetting busy_level
amplifies its impact unnecessarily.

This patch fixes the above two problems by disabling nr_lagging
counting when latency target percentiles aren't set and blocking vrate
increases when there are lagging IOs while leaving busy_level as-is.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-09-26 15:12:00 +0800
25d41e4aa iocost: better trace vrate changes ... Browse Code »

vrate_adj tracepoint traces vrate changes; however, it does so only
when busy_level is non-zero. busy_level turning to zero can sometimes
be as interesting an event. This patch also enables vrate_adj
tracepoint on other vrate related events - busy_level changes and
non-zero nr_lagging.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-09-26 15:11:58 +0800

11 Sep, 2019

4 commits

7c1ee704a iocost_monitor: Report debt ... Browse Code »

Report debt and rename del_ms row to delay for consistency.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-09-11 02:31:39 +0800
e1518f63f blk-iocost: Don't let merges push vtime into the future ... Browse Code »

Merges have the same problem that forced-bios had which is fixed by
the previous patch. The cost of a merge is calculated at the time of
issue and force-advances vtime into the future. Until global vtime
catches up, how the cgroup's hweight changes in the meantime doesn't
matter and it often leads to situations where the cost is calculated
at one hweight and paid at a very different one. See the previous
patch for more details.

Fix it by never advancing vtime into the future for merges. If budget
is available, vtime is advanced. Otherwise, the cost is charged as
debt.

This brings merge cost handling in line with issue cost handling in
ioc_rqos_throttle().

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-09-11 02:31:39 +0800
36a524814 blk-iocost: Account force-charged overage in absolute vtime ... Browse Code »

Currently, when a bio needs to be force-charged and there isn't enough
budget, vtime is simply pushed into the future. This means that the
cost of the whole bio is scaled using the current hweight and then
charged immediately. Until the global vtime advances beyond this
future vtime, the cgroup won't be allowed to issue normal IOs.

This is incorrect and can lead to, for example, exploding vrate or
extended stalls if vrate range is constrained. Consider the following
scenario.

1. A cgroup with a very low hweight runs out of budget.

2. A storm of swap-out happens on it. All of them are scaled
according to the current low hweight and charged to vtime pushing
it to a far future.

3. All other cgroups go idle and now the above cgroup has access to
the whole device. However, because vtime is already wound using
the past low hweight, what its current hweight is doesn't matter
until global vtime catches up to the local vtime.

4. As a result, either vrate gets ramped up extremely or the IOs stall
while the underlying device is idle.

This is because the hweight the overage is calculated at is different
from the hweight that it's being paid at.

Fix it by remembering the overage in absoulte vtime and continuously
paying with the actual budget according to the current hweight at each
period.

Note that non-forced bios which wait already remembers the cost in
absolute vtime. This brings forced-bio accounting in line.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-09-11 02:31:39 +0800
e036c4cab blk-iocost: Fix incorrect operation order during iocg free ... Browse Code »

ioc_pd_free() first cancels the hrtimers and then deactivates the
iocg. However, the iocg timer can run inbetween and reschedule the
hrtimers which will end up running after the iocg is freed leading to
crashes like the following.

general protection fault: 0000 [#1] SMP
...
RIP: 0010:iocg_kick_delay+0xbe/0x1b0
RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
FS: 0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:

iocg_delay_timer_fn+0x3d/0x60
__hrtimer_run_queues+0xfe/0x270
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x5e/0x120
apic_timer_interrupt+0xf/0x20

Fix it by canceling hrtimers after deactivating the iocg.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Reported-by: Dave Jones
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-09-11 02:17:04 +0800

30 Aug, 2019

1 commit

e916ad29d blkcg: add missing NULL check in ioc_cpd_alloc() ... Browse Code »

ioc_cpd_alloc() forgot to check NULL return from kzalloc(). Add it.

Signed-off-by: Tejun Heo
Reported-by: kbuild test robot
Signed-off-by: Jens Axboe

Tejun Heo
2019-08-30 21:16:19 +0800

29 Aug, 2019

4 commits

3532e7227 blkcg: fix missing free on error path of blk_iocost_init() ... Browse Code »

blk_iocost_init() forgot to free its percpu stat on the error path.
Fix it.

Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Reported-by: Hillf Danton
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-08-29 23:59:14 +0800
8504dea78 blkcg: add tools/cgroup/iocost_coef_gen.py ... Browse Code »

Add a script which can be used to generate device-specific iocost
linear model coefficients.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2019-08-29 11:17:17 +0800
6954ff185 blkcg: add tools/cgroup/iocost_monitor.py ... Browse Code »

Instead of mucking with debugfs and ->pd_stat(), add drgn based
monitoring script.

Signed-off-by: Tejun Heo
Cc: Omar Sandoval
Signed-off-by: Jens Axboe

Tejun Heo
2019-08-29 11:17:14 +0800
7caa47151 blkcg: implement blk-iocost ... Browse Code »

This patchset implements IO cost model based work-conserving
proportional controller.

While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others. In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.

One challenge of controlling IO resources is the lack of trivially
observable cost metric. The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern. However, the cost isn't a complete mystery. Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.

The function which determines the cost of a given IO is the IO cost
model for the device. This controller distributes IO capacity based
on the costs estimated by such model. The more accurate the cost
model the better but the controller adapts based on IO completion
latency and as long as the relative costs across differents IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.

Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well. All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.

Please see the top comment in blk-iocost.c and documentation for
more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
for a divide-by-zero bug in current_hweight() triggered by zero
inuse_sum.

Signed-off-by: Tejun Heo
Cc: Andy Newell
Cc: Josef Bacik
Cc: Rik van Riel
Signed-off-by: Jens Axboe

Tejun Heo
2019-08-29 11:17:12 +0800