17 Aug, 2022

1 commit

  • [ Upstream commit 14a6e2eb7df5c7897c15b109cba29ab0c4a791b6 ]

    In our test of iocost, we encountered some list add/del corruptions of
    inner_walk list in ioc_timer_fn.

    The reason can be described as follows:

    cpu 0                                   cpu 1
    ioc_qos_write                           ioc_qos_write

    ioc = q_to_ioc(queue);
    if (!ioc) {
            ioc = kzalloc();
                                            ioc = q_to_ioc(queue);
                                            if (!ioc) {
                                                    ioc = kzalloc();
                                                    ...
                                                    rq_qos_add(q, rqos);
                                            }
            ...
            rq_qos_add(q, rqos);
            ...
    }

    When the io.cost.qos file is written by two cpus concurrently, rq_qos may
    be added to one disk twice. In that case, there will be two iocs enabled
    and running on one disk. They own different iocgs on their active lists.
    In the ioc_timer_fn function, because the iocgs from the two iocs share
    the same root iocg, the root iocg's walk_list may be overwritten by each
    other, which leads to list add/del corruptions when building or
    destroying the inner_walk list.

    So far, the blk-rq-qos framework has implicitly assumed one rq_qos
    instance of each type per queue. This patch makes that assumption
    explicit and also fixes the crash described above.
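
    To illustrate the shape of that fix, here is a minimal user-space sketch
    (simplified types and hypothetical names, not the kernel code): rq_qos_add()
    checks under a lock whether an rq_qos of the same type is already attached
    and returns -EBUSY, so the losing writer frees its allocation instead of
    installing a duplicate.

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified stand-ins for struct request_queue and struct rq_qos. */
    struct rq_qos { int id; struct rq_qos *next; };
    struct request_queue { pthread_mutex_t lock; struct rq_qos *rqos_list; };

    /* Refuse to attach a second rq_qos of the same id to one queue. */
    static int rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
    {
        struct rq_qos *cur;
        int ret = 0;

        pthread_mutex_lock(&q->lock);
        for (cur = q->rqos_list; cur; cur = cur->next) {
            if (cur->id == rqos->id) {          /* already attached */
                ret = -EBUSY;
                goto out;
            }
        }
        rqos->next = q->rqos_list;
        q->rqos_list = rqos;
    out:
        pthread_mutex_unlock(&q->lock);
        return ret;
    }

    int main(void)
    {
        struct request_queue q = { PTHREAD_MUTEX_INITIALIZER, NULL };
        struct rq_qos *a = calloc(1, sizeof(*a));
        struct rq_qos *b = calloc(1, sizeof(*b));
        int ret;

        a->id = b->id = 1;                      /* both pretend to be blk-iocost */
        printf("first add:  %d\n", rq_qos_add(&q, a));  /* 0 */
        ret = rq_qos_add(&q, b);
        printf("second add: %d\n", ret);                /* -EBUSY */
        if (ret)
            free(b);                            /* the racing loser backs out */
        free(a);
        return 0;
    }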

    Signed-off-by: Jinke Han
    Reviewed-by: Muchun Song
    Acked-by: Tejun Heo
    Cc:
    Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jinke Han
     

12 Jul, 2022

1 commit

  • [ Upstream commit aa1b46dcdc7baaf5fec0be25782ef24b26aa209e ]

    a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't
    tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set.
    While this fixed a potential oops, it also broke blk-iocost by skipping the
    done_bio callback for merged bios.

    Before, a bio that went through either rq_qos_throttle() or
    rq_qos_merge() would have rq_qos_done_bio() called on it at completion,
    with BIO_TRACKED distinguishing the former from the latter. After the
    change, rq_qos_done_bio() is no longer called for bios which went
    through rq_qos_merge(). This royally confuses blk-iocost, as the merged
    bios never finish and are considered perpetually in-flight.

    One reliably reproducible failure mode is an intermediate cgroup getting
    stuck active, preventing its children from being activated due to the
    leaf-only rule, leading to loss of control. The following is from the
    resctl-bench protection scenario, which emulates isolating a web-server-like
    workload from a memory bomb, run on an iocost configuration that should
    yield a reasonable level of protection.

    # cat /sys/block/nvme2n1/device/model
    Samsung SSD 970 PRO 512GB
    # cat /sys/fs/cgroup/io.cost.model
    259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025
    # cat /sys/fs/cgroup/io.cost.qos
    259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00
    # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
    ...
    Memory Hog Summary
    ==================

    IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m
                W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m

    Isolation and Request Latency Impact Distributions:

                min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
    isol%     15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82
    lat-imp%      0     0     0     0     0  4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6

    Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96%

    The isolation result of 58.12% is close to what this device would show
    without any IO control.

    Fix it by introducing a new flag, BIO_QOS_MERGED, to mark merged bios and
    calling rq_qos_done_bio() on them too. For consistency and clarity, rename
    BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into
    rq_qos_done_bio() so that they sit next to the code paths that set the flags.
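
    A rough user-space model of the resulting completion check (flag names and
    counters simplified, not the kernel structures): both the throttle and the
    merge path mark the bio, and the single check in the done path runs the
    accounting for either flag.

    #include <stdio.h>

    enum { QOS_THROTTLED = 1 << 0, QOS_MERGED = 1 << 1 };

    struct bio { unsigned int flags; };

    static int inflight;        /* roughly what blk-iocost tracks per group */

    static void qos_throttle(struct bio *bio) { bio->flags |= QOS_THROTTLED; inflight++; }
    static void qos_merge(struct bio *bio)    { bio->flags |= QOS_MERGED;    inflight++; }

    /* Completion accounting runs for throttled *or* merged bios. */
    static void qos_done_bio(struct bio *bio)
    {
        if (bio->flags & (QOS_THROTTLED | QOS_MERGED))
            inflight--;
    }

    int main(void)
    {
        struct bio issued = { 0 }, merged = { 0 };

        qos_throttle(&issued);
        qos_merge(&merged);                     /* previously never "finished" */
        qos_done_bio(&issued);
        qos_done_bio(&merged);
        printf("inflight = %d\n", inflight);    /* 0, nothing stuck */
        return 0;
    }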

    With the patch applied, the above same benchmark shows:

    # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
    ...
    Memory Hog Summary
    ==================

    IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m
                W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m

    Isolation and Request Latency Impact Distributions:

                min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
    isol%     84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42  2.81
    lat-imp%      0     0     0     0     0  2.81  5.73 11.11 13.92 17.53 22.61  4.10  4.68

    Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0%

    Signed-off-by: Tejun Heo
    Fixes: a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked")
    Cc: stable@vger.kernel.org # v5.15+
    Cc: Ming Lei
    Cc: Yu Kuai
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tejun Heo
     

09 Jun, 2022

1 commit

  • commit 8a177a36da6c54c98b8685d4f914cb3637d53c0d upstream.

    iolatency needs to track the number of inflight IOs per cgroup. As this
    tracking can be expensive, it is disabled when no cgroup has iolatency
    configured for the device. To ensure that the inflight counters stay
    balanced, iolatency_set_limit() freezes the request_queue while manipulating
    the enabled counter, which ensures that no IO is in flight and thus all
    counters are zero.

    Unfortunately, iolatency_set_limit() isn't the only place where the enabled
    counter is manipulated. iolatency_pd_offline() can also decrement the
    counter and trigger disabling. As this disabling happens without freezing
    the queue, it can easily happen while some IOs are in flight and thus leak
    the counts.

    This can be easily demonstrated by turning on iolatency in one empty
    cgroup while IOs are in flight in other cgroups and then removing the
    cgroup. Note that iolatency shouldn't have been enabled elsewhere in the
    system, to ensure that removing the cgroup disables iolatency for the
    whole device.

    The following keeps flipping on and off iolatency on sda:

    echo +io > /sys/fs/cgroup/cgroup.subtree_control
    while true; do
        mkdir -p /sys/fs/cgroup/test
        echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
        sleep 1
        rmdir /sys/fs/cgroup/test
        sleep 1
    done

    and there's concurrent fio generating direct rand reads:

    fio --name test --filename=/dev/sda --direct=1 --rw=randread \
    --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

    while monitoring with the following drgn script:

    while True:
        for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
            for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
                blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
                pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
                if pd.value_() == 0:
                    continue
                iolat = container_of(pd, 'struct iolatency_grp', 'pd')
                inflight = iolat.rq_wait.inflight.counter.value_()
                if inflight:
                    print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                          f'{cgroup_path(css.cgroup).decode("utf-8")}')
        time.sleep(1)

    The monitoring output looks like the following:

    inflight=1 sda /user.slice
    inflight=1 sda /user.slice
    ...
    inflight=14 sda /user.slice
    inflight=13 sda /user.slice
    inflight=17 sda /user.slice
    inflight=15 sda /user.slice
    inflight=18 sda /user.slice
    inflight=17 sda /user.slice
    inflight=20 sda /user.slice
    inflight=19 sda /user.slice

    Cc: Josef Bacik
    Cc: Liu Bo
    Fixes: 8c772a9bfc7c ("blk-iolatency: fix IO hang due to negative inflight counter")
    Cc: stable@vger.kernel.org # v5.0+
    Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

31 Aug, 2021

1 commit

  • Pull block updates from Jens Axboe:
    "Nothing major in here - lots of good cleanups and tech debt handling,
    which is also evident in the diffstats. In particular:

    - Add disk sequence numbers (Matteo)

    - Discard merge fix (Ming)

    - Relax disk zoned reporting restrictions (Niklas)

    - Bio error handling zoned leak fix (Pavel)

    - Start of proper add_disk() error handling (Luis, Christoph)

    - blk crypto fix (Eric)

    - Non-standard GPT location support (Dmitry)

    - IO priority improvements and cleanups (Damien)

    - blk-throtl improvements (Chunguang)

    - diskstats_show() stack reduction (Abd-Alrhman)

    - Loop scheduler selection (Bart)

    - Switch block layer to use kmap_local_page() (Christoph)

    - Remove obsolete disk_name helper (Christoph)

    - block_device refcounting improvements (Christoph)

    - Ensure gendisk always has a request queue reference (Christoph)

    - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

    * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
    sg: pass the device name to blk_trace_setup
    block, bfq: cleanup the repeated declaration
    blk-crypto: fix check for too-large dun_bytes
    blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
    blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
    block: mark blkdev_fsync static
    block: refine the disk_live check in del_gendisk
    mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
    mmc: block: Support alternative_gpt_sector() operation
    partitions/efi: Support non-standard GPT location
    block: Add alternative_gpt_sector() operation
    bio: fix page leak bio_add_hw_page failure
    block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
    block: remove a pointless call to MINOR() in device_add_disk
    null_blk: add error handling support for add_disk()
    virtio_blk: add error handling support for add_disk()
    block: add error handling for device_add_disk / add_disk
    block: return errors from disk_alloc_events
    block: return errors from blk_integrity_add
    block: call blk_register_queue earlier in device_add_disk
    ...

    Linus Torvalds
     

17 Aug, 2021

1 commit

  • seq_get_buf is a crutch that undoes all the memory safety of the
    seq_file interface. Use the normal seq_printf interfaces instead.

    Signed-off-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/r/20210810152623.1796144-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Aug, 2021

1 commit

  • If queue is dying while iolatency_set_limit() is in progress,
    blk_get_queue() won't increment the refcount of the queue. However,
    blk_put_queue() will still decrement the refcount later, which will
    cause the refcount to become unbalanced.

    Thus, error out in this case to fix the problem.
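
    The shape of the fix, as a small self-contained sketch (simplified
    reference counting and hypothetical names, not the kernel code): take the
    put only when the matching get actually succeeded, and bail out with an
    error otherwise.

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct queue { int refs; bool dying; };

    /* Like blk_get_queue(): fails (returns false) once the queue is dying. */
    static bool queue_get(struct queue *q)
    {
        if (q->dying)
            return false;
        q->refs++;
        return true;
    }

    static void queue_put(struct queue *q) { q->refs--; }

    /* The fix in spirit: error out instead of falling through to an unmatched put. */
    static int set_limit(struct queue *q)
    {
        if (!queue_get(q))
            return -ENODEV;     /* do not reach queue_put() below */
        /* ... update limits ... */
        queue_put(q);
        return 0;
    }

    int main(void)
    {
        struct queue q = { .refs = 1, .dying = true };

        printf("set_limit: %d\n", set_limit(&q));       /* -ENODEV */
        printf("refs: %d\n", q.refs);                   /* still 1, balanced */
        return 0;
    }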

    Fixes: 8c772a9bfc7c ("blk-iolatency: fix IO hang due to negative inflight counter")
    Signed-off-by: Yu Kuai
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com
    Signed-off-by: Jens Axboe

    Yu Kuai
     

08 Oct, 2020

1 commit


01 Jul, 2020

1 commit

    ktime_to_ns(ktime_get()), which is expensive, does not need to be called
    if blk_iolatency_enabled() returns false in blkcg_iolatency_done_bio().
    Postponing the ktime_to_ns(ktime_get()) call reduces CPU usage when
    blk_iolatency is disabled.
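
    A user-space illustration of the reordering (clock_gettime() stands in for
    ktime_get(); the names are hypothetical): the cheap enabled test runs
    first, and the clock is read only on the enabled path.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static bool iolatency_enabled = false;

    /* Stand-in for ktime_to_ns(ktime_get()): the relatively expensive call. */
    static uint64_t now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void done_bio(uint64_t issue_ns)
    {
        uint64_t now;

        if (!iolatency_enabled)
            return;             /* cheap early exit, no clock read */
        now = now_ns();         /* only pay for the clock when it matters */
        printf("latency: %llu ns\n", (unsigned long long)(now - issue_ns));
    }

    int main(void)
    {
        done_bio(0);                    /* disabled: returns immediately */
        iolatency_enabled = true;
        done_bio(now_ns());             /* enabled: measures latency */
        return 0;
    }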

    Signed-off-by: Hongnan Li
    Signed-off-by: Jens Axboe

    Hongnan Li
     

29 Aug, 2019

2 commits


17 Jul, 2019

1 commit

    Currently, ->pd_stat() is called only when the module parameter
    blkcg_debug_stats is set, which prevents it from printing non-debug
    policy-specific statistics. Let's move the debug test down so that
    ->pd_stat() can print non-debug stats too. This patch doesn't cause
    any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 Jul, 2019

1 commit

    The iolatency controller is based on rq_qos. It increments on
    rq_qos_throttle() and decrements on either rq_qos_cleanup() or
    rq_qos_done_bio(). a3fb01ba5af0 fixed the double accounting issue where
    blk_mq_make_request() may call both rq_qos_cleanup() and
    rq_qos_done_bio() for REQ_NOWAIT bios. Checking STS_AGAIN there prevents
    the double decrement.

    The above works upstream as the only way we can get STS_AGAIN is from
    blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
    problem as bio_endio() skipping only happens on reserved tag allocation
    failures which can only be caused by driver bugs and already triggers
    WARN.

    However, the fix creates a not-so-great dependency on how STS_AGAIN can
    be propagated. Internally, we (Facebook) carry a patch that kills
    readahead if a cgroup is io congested or a fatal signal is pending.
    Combined with chained bios propagating their bi_status to the parent
    when it is not already set, this can cause the parent bio to not clean
    up properly even though it was successful. This consequently leaks the
    inflight counter and can hang all IOs under that blkg.

    To nip the adverse interaction early, this removes the rq_qos_cleanup()
    callback in iolatency in favor of cleaning up always on the
    rq_qos_done_bio() path.
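
    A compact model of the accounting hazard (simplified to a bare counter,
    not the real callbacks): when both a cleanup path and a completion path
    may decrement, correctness hinges on how the status propagates; keeping a
    single decrement site on the done_bio path removes that dependency.

    #include <stdbool.h>
    #include <stdio.h>

    static int inflight;

    static void throttle(void) { inflight++; }

    /* Old scheme: decrement on cleanup and again on completion unless a
     * status check happened to skip one of them; correctness depended on
     * how the error status propagated through chained bios. */
    static void old_cleanup(void)                 { inflight--; }
    static void old_done_bio(bool skipped_by_sts) { if (!skipped_by_sts) inflight--; }

    /* New scheme: a single decrement site on the done_bio path. */
    static void new_done_bio(void)                { inflight--; }

    int main(void)
    {
        /* Old scheme with the status not propagated as expected. */
        throttle();
        old_cleanup();
        old_done_bio(false);
        printf("old scheme inflight = %d\n", inflight);    /* -1 */

        /* New scheme: exactly one decrement per throttled bio. */
        inflight = 0;
        throttle();
        new_done_bio();
        printf("new scheme inflight = %d\n", inflight);    /* 0 */
        return 0;
    }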

    Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios")
    Debugged-by: Tejun Heo
    Debugged-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

20 Jun, 2019

1 commit

    As is, iolatency recognizes done_bio and cleanup as ending paths. If a
    bio is marked REQ_NOWAIT and fails to get a request, it is cleaned up
    via rq_qos_cleanup() and then ended in bio_wouldblock_error(). This
    results in underflowing the inflight counter. Fix this by only
    accounting bios that were actually submitted.

    Signed-off-by: Dennis Zhou
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

16 Jun, 2019

1 commit

    If use_delay is non-zero when the latency target of a cgroup is set to
    zero, it stays stuck until io.latency is enabled on the cgroup again.
    This keeps readahead disabled for the cgroup, negatively impacting
    performance.
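
    A sketch of the intended behavior (hypothetical helper and field names,
    not the kernel code): clearing the latency target should also clear any
    accumulated use_delay, rather than leaving the old delay applied forever.

    #include <stdio.h>

    struct group { long latency_target_ns; int use_delay; };

    static void set_latency_target(struct group *grp, long target_ns)
    {
        grp->latency_target_ns = target_ns;
        /* The fix in spirit: a cleared target must also drop any delay
         * accumulated while the target was active. */
        if (target_ns == 0)
            grp->use_delay = 0;
    }

    int main(void)
    {
        struct group grp = { .latency_target_ns = 1000000, .use_delay = 4 };

        set_latency_target(&grp, 0);
        printf("use_delay = %d\n", grp.use_delay);      /* 0, not stuck at 4 */
        return 0;
    }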

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 May, 2019

1 commit


21 Mar, 2019

1 commit

    This patch avoids the following warning being reported when building
    with W=1:

    block/blk-iolatency.c:734:5: warning: no previous prototype for 'blk_iolatency_init' [-Wmissing-prototypes]
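
    The usual shape of such a -Wmissing-prototypes fix, as a self-contained
    sketch (hypothetical names; in the kernel the prototype normally lives in
    a shared header such as block/blk.h, which the defining file then
    includes): give the non-static definition a prior prototype so the
    warning has nothing to complain about.

    /* A minimal, single-file model: without the declaration below,
     * "gcc -W -Wmissing-prototypes" warns on the definition of init_thing(). */
    struct queue;                           /* opaque, as callers see it */

    int init_thing(struct queue *q);        /* the prototype the warning wants */

    int init_thing(struct queue *q)
    {
        (void)q;
        return 0;
    }

    int main(void)
    {
        return init_thing((struct queue *)0);
    }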

    Cc: Josef Bacik
    Fixes: d70675121546 ("block: introduce blk-iolatency io controller") # v4.19
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Feb, 2019

2 commits

    This is to catch any unexpected negative value of the inflight IO counter.

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     
    Our test reported the following stack, and the vmcore showed that the
    ->inflight counter is -1.

    [ffffc9003fcc38d0] __schedule at ffffffff8173d95d
    [ffffc9003fcc3958] schedule at ffffffff8173de26
    [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
    [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
    [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
    [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
    [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
    [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
    [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
    [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
    [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
    [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
    [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
    [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
    [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
    [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
    [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
    [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
    [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e

    The ->inflight counter may be negative (-1) if

    1) blk-iolatency was disabled when the IO was issued,

    2) blk-iolatency was enabled before this IO reached its endio,

    3) the ->inflight counter is decreased from 0 to -1 in endio()

    In fact the hang can be easily reproduced by the below script,

    H=/sys/fs/cgroup/unified/
    P=/sys/fs/cgroup/unified/test

    echo "+io" > $H/cgroup.subtree_control
    mkdir -p $P

    echo $$ > $P/cgroup.procs

    xfs_io -f -d -c "pwrite 0 4k" /dev/sdg

    echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency

    xfs_io -f -d -c "pwrite 0 4k" /dev/sdg

    This fixes the problem by freezing the queue so that no request is in
    flight while iolatency is being enabled or disabled.

    Note that quiescing the queue is not needed, as this only updates the
    iolatency configuration, which the dispatch path of the request_queue
    doesn't care about.
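
    A rough single-threaded model of why the drain matters (a busy-wait
    stands in for blk_mq_freeze_queue(); the counters and names are
    simplified): if the enabled setting flips while an IO is in flight, issue
    and completion can see different values and the counter goes negative;
    draining first keeps them balanced.

    #include <stdbool.h>
    #include <stdio.h>

    static bool enabled;
    static int inflight;        /* per-cgroup rq_wait counter, roughly */
    static int outstanding;     /* IOs issued but not yet completed */

    static void issue(void) { if (enabled) inflight++; outstanding++; }
    static void endio(void) { if (enabled) inflight--; outstanding--; }

    /* Drain everything before flipping the setting, so no IO sees two
     * different values of "enabled" (blk_mq_freeze_queue() in the kernel). */
    static void set_enabled_safely(bool value)
    {
        while (outstanding)
            ;                   /* never spins in this single-threaded model */
        enabled = value;
    }

    int main(void)
    {
        /* Unsafe: toggled mid-flight, endio decrements what was never counted. */
        issue();                /* issued while disabled */
        enabled = true;         /* flipped without draining */
        endio();
        printf("unsafe: inflight = %d\n", inflight);    /* -1 */

        /* Safe: drain, then enable; every IO is counted symmetrically. */
        inflight = 0;
        enabled = false;
        issue();
        endio();
        set_enabled_safely(true);
        issue();
        endio();
        printf("safe:   inflight = %d\n", inflight);    /* 0 */
        return 0;
    }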

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     

18 Dec, 2018

1 commit

  • The blk-iolatency controller measures the time from rq_qos_throttle() to
    rq_qos_done_bio() and attributes this time to the first bio that needs
    to create the request. This means if a bio is plug-mergeable or
    bio-mergeable, it gets to bypass the blk-iolatency controller.

    The recent series [1], which tags all bios with blkgs, undermined how
    iolatency determined which bios it was charging and should process in
    rq_qos_done_bio(). Because all bios are now tagged, the atomic_t
    inflight count in struct rq_wait could underflow and result in a stall.

    This patch adds a new flag BIO_TRACKED to let controllers know that a
    bio is going through the rq_qos path. blk-iolatency now checks if this
    flag is set to see if it should process the bio in rq_qos_done_bio().

    Overloading BIO_QUEUE_ENTERED works, but makes the flag rules confusing.
    BIO_THROTTLED was another candidate, but that flag is set for all bios
    that have gone through the blk-throttle code. Overloading a flag comes
    with the burden of making sure that when either implementation changes,
    a change in the setting rules for one doesn't cause a bug in the other.
    So here, we unfortunately opt for adding a new flag.

    [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/

    Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
    Signed-off-by: Dennis Zhou
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

08 Dec, 2018

8 commits

  • Now that we have this common helper, convert io-latency over to use it
    as well.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or %NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • bio_issue_init among other things initializes the timestamp for an IO.
    Rather than have this logic handled by policies, this consolidates it to
    be on the init paths (normal, clone, bounce clone).

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • Previously, blkg association was handled by controller specific code in
    blk-throttle and blk-iolatency. However, because a blkg represents a
    relationship between a blkcg and a request_queue, it makes sense to keep
    the blkg->q and bio->bi_disk->queue consistent.

    This patch moves association into the bio_set_dev() macro. This should
    cover the majority of cases where the device is set or changed, keeping
    the two pointers consistent. Fallback code is added to
    blkcg_bio_issue_check() to catch any missing paths.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • There are 3 ways blkg association can happen: association with the
    current css, with the page css (swap), or from the wbc css (writeback).

    This patch handles how association is done for the first case, where we
    are associating based on the current css. If a blkg is already
    associated, the css will be reused and association will be redone, as
    the request_queue may have changed.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
    There are several scenarios where blkg_lookup_create() can fail, such as
    the blkcg dying, the request_queue dying, or simply being out of memory.
    Most callers handle this by simply falling back to the q->root_blkg and
    calling it a day.

    This patch implements the notion of closest blkg. During
    blkg_lookup_create(), if it fails to create, return the closest blkg
    found or the q->root_blkg. blkg_try_get_closest() is introduced and used
    during association so a bio is always attached to a blkg.
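
    A sketch of the lookup-with-fallback idea (simplified tree and
    hypothetical names, not the kernel structures): create toward the leaf,
    and if creation fails at some level, return the closest existing ancestor
    rather than NULL, so the caller always gets something to attach the bio to.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct blkg {
        const char *name;
        struct blkg *parent;    /* NULL for the root */
    };

    static struct blkg *try_create(const char *name, struct blkg *parent)
    {
        struct blkg *g;

        if (strcmp(name, "unlucky") == 0)       /* pretend e.g. OOM here */
            return NULL;
        g = malloc(sizeof(*g));
        g->name = name;
        g->parent = parent;
        return g;
    }

    /* Create blkgs top-down along root -> ... -> leaf.  On failure, return
     * the closest blkg created so far instead of giving up entirely. */
    static struct blkg *lookup_create_closest(struct blkg *root,
                                              const char **path, int depth)
    {
        struct blkg *closest = root;
        int i;

        for (i = 0; i < depth; i++) {
            struct blkg *g = try_create(path[i], closest);

            if (!g)
                return closest;         /* fall back to nearest ancestor */
            closest = g;
        }
        return closest;
    }

    int main(void)
    {
        struct blkg root = { "root", NULL };
        const char *path[] = { "parent", "unlucky", "leaf" };
        struct blkg *g = lookup_create_closest(&root, path, 3);

        printf("bio associated with: %s\n", g->name);   /* "parent" */
        return 0;       /* allocations intentionally not freed in this demo */
    }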

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • To know when to create a blkg, the general pattern is to do a
    blkg_lookup() and if that fails, lock and do the lookup again, and if
    that fails finally create. It doesn't make much sense for everyone who
    wants to do creation to write this themselves.

    This changes blkg_lookup_create() to do locking and implement this
    pattern. The old blkg_lookup_create() is renamed to
    __blkg_lookup_create(). If a call site wants to do its own error
    handling or already owns the queue lock, they can use
    __blkg_lookup_create(). This will be used in upcoming patches.
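
    The pattern the reworked blkg_lookup_create() centralizes, sketched in
    user space with a plain mutex and hypothetical names (the kernel's
    lockless fast path additionally relies on RCU; this demo only runs
    single-threaded): an unlocked lookup, then, on a miss, take the lock,
    look up again in case someone raced us, and only then create.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct entry { int key; struct entry *next; };

    static struct entry *table;
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    static struct entry *lookup(int key)
    {
        for (struct entry *e = table; e; e = e->next)
            if (e->key == key)
                return e;
        return NULL;
    }

    static struct entry *create_locked(int key)
    {
        struct entry *e = calloc(1, sizeof(*e));

        e->key = key;
        e->next = table;
        table = e;
        return e;
    }

    /* The common pattern: fast-path lookup, then lock + re-lookup + create. */
    static struct entry *lookup_create(int key)
    {
        struct entry *e = lookup(key);          /* fast path, no lock */

        if (e)
            return e;
        pthread_mutex_lock(&table_lock);
        e = lookup(key);                        /* re-check under the lock */
        if (!e)
            e = create_locked(key);             /* we really are first */
        pthread_mutex_unlock(&table_lock);
        return e;
    }

    int main(void)
    {
        struct entry *a = lookup_create(42);
        struct entry *b = lookup_create(42);

        printf("same entry: %s\n", a == b ? "yes" : "no");      /* yes */
        return 0;
    }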

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • The bio_blkcg() function turns out to be inconsistent and consequently
    dangerous to use. The first part returns a blkcg where a reference is
    owned by the bio meaning it does not need to be rcu protected. However,
    the third case, the last line, is problematic:

    return css_to_blkcg(task_css(current, io_cgrp_id));

    This can race against task migration and the cgroup dying. It is also
    semantically different as it must be called rcu protected and is
    susceptible to failure when trying to get a reference to it.

    This patch adds association ahead of calling bio_blkcg() rather than
    after. This makes association a required and explicit step along the
    code paths for calling bio_blkcg(). In blk-iolatency, association is
    moved above the bio_blkcg() call to ensure it will not return %NULL.

    BFQ uses the old bio_blkcg() function, but I do not want to address it
    in this series due to the complexity. I have created a private version
    documenting the inconsistency and noting not to use it.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

16 Nov, 2018

2 commits


03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

1 commit

    This reverts a series committed earlier due to a null pointer exception
    bug report in [1]. It seems there are edge-case interactions that I did
    not consider, and I will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

27 Oct, 2018

1 commit

  • There are several definitions of those functions/macros in places that
    mess with fixed-point load averages. Provide an official version.

    [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
    Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Suren Baghdasaryan
    Tested-by: Daniel Drake
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

29 Sep, 2018

5 commits

    We apply a smoothing to the scale changes in order to keep sawtooth
    behavior from occurring. However, our window for checking if we've
    missed our target can sometimes be shorter than the smoothing interval
    (500ms), especially on faster drives like SSDs. In order to deal with
    this, keep a running tally of the previous intervals that we threw away
    because we had already done a scale event recently.

    This is needed for the SSD case, as these low-latency drives will have
    bursts of latency; if latency happens to be ok for the window that
    directly follows the opening of the scale window, we could unthrottle
    even though we were missing our target in previous windows.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We use an average latency approach for determining if we're missing our
    latency target. This works well for rotational storage where we have
    generally consistent latencies, but for ssd's and other low latency
    devices you have more of a spikey behavior, which means we often won't
    throttle misbehaving groups because a lot of IO completes at drastically
    faster times than our latency target. Instead keep track of how many
    IO's miss our target and how many IO's are done in our time window. If
    the p(90) latency is above our target then we know we need to throttle.
    With this change in place we are seeing the same throttling behavior
    with our testcase on ssd's as we see with rotational drives.
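
    A toy calculation of why the percentile cut-off reacts where an average
    does not (the numbers are invented for illustration): with mostly
    sub-millisecond completions and a couple of large spikes, the mean stays
    under the target while more than 10% of the IOs in the window miss it,
    i.e. the p90 sits above the target.

    #include <stdio.h>

    int main(void)
    {
        /* Made-up completion latencies (us) for one window on a fast SSD:
         * mostly ~100us with two multi-millisecond spikes. */
        const long lat[] = { 90, 95, 100, 100, 105, 110, 115, 120, 8000, 9000 };
        const int n = sizeof(lat) / sizeof(lat[0]);
        const long target = 2000;       /* 2ms latency target */
        long sum = 0;
        int missed = 0;

        for (int i = 0; i < n; i++) {
            sum += lat[i];
            if (lat[i] > target)
                missed++;
        }

        /* Average-based check: the spikes get averaged away. */
        printf("mean = %ldus -> %s\n", sum / n,
               sum / n > target ? "throttle" : "ok");

        /* Window miss count: p90 is above the target iff >10% of IOs missed. */
        printf("missed %d/%d -> %s\n", missed, n,
               missed * 10 > n ? "throttle" : "ok");
        return 0;
    }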

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • There is logic to keep cgroups that haven't done a lot of IO in the most
    recent scale window from being punished for over-active higher priority
    groups. However for things like ssd's where the windows are pretty
    short we'll end up with small numbers of samples, so 5% of samples will
    come out to 0 if there aren't enough. Make the floor 1 sample to keep
    us from improperly bailing out of scaling down.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
    Hitting the case where blk_queue_depth() returned 1 uncovered the fact
    that iolatency doesn't actually handle this case properly: it simply
    doesn't scale down anybody. For this case we should go straight into
    applying the time delay, which we weren't doing. Since we already limit
    the floor at 1 request, this if statement is not needed; removing it
    allows us to set our depth to 1, which lets us apply the delay if
    needed.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
    We were using blk_queue_depth() assuming that it would return
    nr_requests, but we hit a case in production on drives that had to have
    NCQ turned off in order for them to not shit the bed, which resulted in
    a qd of 1 even though nr_requests was much larger. iolatency really
    only cares about requests we are allowed to queue up, as any io that
    gets onto the request list is going to be serviced soonish, so we want
    to be throttling before the bio gets onto the request list. To make
    iolatency work as expected, simply use q->nr_requests instead of
    blk_queue_depth(), as that is what we actually care about.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

22 Sep, 2018

3 commits

  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • bio_issue_init among other things initializes the timestamp for an IO.
    Rather than have this logic handled by policies, this consolidates it to
    be on the init paths (normal, clone, bounce clone).

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • Previously, blkg's were only assigned as needed by blk-iolatency and
    blk-throttle. bio->css was also always being associated while blkg was
    being looked up and then thrown away in blkcg_bio_issue_check.

    This patch begins the cleanup of bio->css and bio->bi_blkg by always
    associating a blkg in blkcg_bio_issue_check. This tries to create the
    blkg, but if it is not possible, falls back to using the root_blkg of
    the request_queue. Therefore, a bio will always be associated with a
    blkg. The duplicate association logic is removed from blk-throttle and
    blk-iolatency.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)