26 Oct, 2020

2 commits

  • Similar to commit 457e490f2b741 ("blkcg: allocate struct blkcg_gq
    outside request queue spinlock"), blkg_create can also trigger
    occasional -ENOMEM failures at the radix insertion because any
    allocation inside blkg_create has to be non-blocking, making it more
    likely to fail. This causes trouble for userspace tools trying to
    configure io weights, which need to deal with this condition.

    This patch reduces the occurrence of -ENOMEMs on this path by preloading
    the radix tree element in a GFP_KERNEL context, such that we guarantee
    the later non-blocking insertion won't fail.

    A similar solution exists in blkcg_init_queue for the same situation.
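
    A minimal sketch of the pattern, assuming the insertion happens under
    the queue lock (details differ from the actual patch):

    /* reserve radix tree nodes while we may still sleep */
    ret = radix_tree_preload(GFP_KERNEL);
    if (ret)
            return ERR_PTR(ret);

    spin_lock_irq(&q->queue_lock);
    /* cannot hit -ENOMEM: nodes come from the preloaded per-cpu pool */
    ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
    spin_unlock_irq(&q->queue_lock);

    radix_tree_preload_end();       /* re-enables preemption */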

    Acked-by: Tejun Heo
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
     
  • If the new_blkg allocation races with a blk_policy change and
    blkg_lookup_check() fails, new_blkg is leaked.
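
    A sketch of the fix in blkg_conf_prep(), freeing the preallocated blkg
    on the error path (simplified):

    blkg = blkg_lookup_check(pos, pol, q);
    if (IS_ERR(blkg)) {
            ret = PTR_ERR(blkg);
            blkg_free(new_blkg);    /* previously leaked here */
            goto fail_unlock;
    }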

    Acked-by: Tejun Heo
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
     

10 Sep, 2020

1 commit

  • The test and the explanation of the patch are as below.

    Before the test, we added more debug code in blkg_async_bio_workfn():

    int count = 0;
    if (bios.head && bios.head->bi_next) {
            need_plug = true;
            blk_start_plug(&plug);
    }
    while ((bio = bio_list_pop(&bios))) {
            /* io_punt is a sysctl user interface to control the print */
            if (io_punt) {
                    printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
                           current->comm, current->pid, bio->bi_iter.bi_sector,
                           (bio->bi_iter.bi_size) >> 9, count++, need_plug);
            }
            submit_bio(bio);
    }
    if (need_plug)
            blk_finish_plug(&plug);

    Setup steps needed to trigger *PUNT* io before testing:
    mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
    mount -t cgroup2 nodev /cgroup2
    mkdir /cgroup2/cg3
    echo "+io" > /cgroup2/cgroup.subtree_control
    echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
    echo $$ > /cgroup2/cg3/cgroup.procs

    Then use the dd command to test btrfs PUNT io in the current shell:
    dd if=/dev/zero of=/btrfs/file bs=64K count=100000

    The test hardware environment is as below:
    [root@localhost btrfs]# lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 32
    On-line CPU(s) list: 0-31
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 2
    NUMA node(s): 2
    Vendor ID: GenuineIntel

    With the above debug code, test command and test environment, I ran the
    tests under 3 different system loads, triggered by stress:
    1, Run 64 threads by command "stress -c 64 &"
    [53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
    [53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
    [53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
    [53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
    [53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
    [53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
    ... ...
    [53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
    [53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
    [53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
    [53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
    [53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
    [53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
    [53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

    2, Run 32 threads by command "stress -c 32 &"
    [50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
    [50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
    [50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
    [50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
    [50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
    [50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
    ... ...
    [50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
    [50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
    [50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
    [50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
    [50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
    [50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
    [50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
    [50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

    3, Don't run thread by stress
    [50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
    [50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
    [50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
    [50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
    [50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
    [50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
    [50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
    [50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
    [50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
    [50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
    [50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
    [50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
    [50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
    [50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
    [50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
    [50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
    [50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
    [50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
    [50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
    [50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
    [50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

    Analysis of the above 3 test results under different system loads:
    From the above tests, we can see that more continuous bios get plugged
    as the system load increases. When running "stress -c 64 &", 310
    continuous bios are plugged; when running "stress -c 32 &", 260
    continuous bios are plugged; when stress is not run, at most only 2
    continuous bios are plugged, and in most cases bio_list contains only
    a single bio.

    How to explain the above phenomenon:
    We know that in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it
    queues a work item on the workqueue blkcg_punt_bio_wq, and when that
    work runs depends on the system load. When the system load is low, the
    work is scheduled quickly and the bios in bio_list are quickly
    processed in blkg_async_bio_workfn(), so there is little chance for
    the same io submit thread to add multiple continuous bios to bio_list
    before the work runs. This analysis matches test "3" above.
    When the system load is high, there is some delay before the work can
    be scheduled to run; the higher the system load, the greater the
    delay. So there is more chance for the same io submit thread to add
    multiple continuous bios to bio_list, and when the work finally runs
    there are more continuous bios in bio_list to be processed in
    blkg_async_bio_workfn(). This analysis matches tests "1" and "2" above.
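
    For reference, the punt path looks roughly like this (a simplified
    sketch of __blkcg_punt_bio_submit(); details vary by kernel version):

    bool __blkcg_punt_bio_submit(struct bio *bio)
    {
            struct blkcg_gq *blkg = bio->bi_blkg;

            /* consume the flag so the bio isn't punted again */
            bio->bi_opf &= ~REQ_CGROUP_PUNT;

            /* the root cgroup never needs punting */
            if (!blkg->parent)
                    return false;

            spin_lock_bh(&blkg->async_bio_lock);
            bio_list_add(&blkg->async_bios, bio);
            spin_unlock_bh(&blkg->async_bio_lock);

            /* submit threads keep appending until this work gets to run */
            queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
            return true;
    }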

    According to the tests, io performance is improved with the patch,
    especially when the system load is higher. Another optimization is to
    use the plug only when bio_list contains at least 2 bios.

    Signed-off-by: Xianting Tian
    Signed-off-by: Jens Axboe

    Xianting Tian
     

02 Sep, 2020

1 commit

  • Currently, iocost syncs the delay duration to the outstanding debt amount,
    which seemed enough to protect the system from anon memory hogs. However,
    that was mostly because the delay calculation was using hweight_inuse,
    which quickly converges towards zero under debt, often punishing debtors
    overly harshly for longer than deserved.

    The previous patch fixed the delay calculation and now the protection against
    anonymous memory hogs isn't enough because the effect of delay is indirect
    and non-linear and a huge amount of future debt can accumulate abruptly
    while unthrottled.

    This patch implements delay hysteresis so that delay is decayed
    exponentially over time instead of getting cleared immediately as debt is
    paid off. While the overall behavior is similar to the blk-cgroup
    implementation used by blk-iolatency, a lot of the details are different and
    due to the empirical nature of the mechanism, it's challenging to adapt the
    mechanism for one controller without negatively impacting the other.

    As the delay is gradually decayed now, there's no point in running it from
    its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
    the dedicated hrtimer is removed.
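
    A minimal sketch of the decay idea, with hypothetical names (the real
    state and constants live in ioc_timer_fn() and struct ioc_gq):

    u64 new_delay;

    if (iocg->abs_vdebt)
            new_delay = delay_from_debt(iocg);   /* hypothetical helper */
    else
            new_delay = iocg->delay >> 1;        /* exponential decay */

    if (new_delay < MIN_DELAY)                   /* hypothetical floor */
            new_delay = 0;
    iocg->delay = new_delay;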

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

22 Aug, 2020

1 commit

  • Normally, blkcg_iolatency_exit() will free the related iolatency memory
    when the queue is cleaned up. But if blk_throtl_init() returns an error
    and queue init fails, blkcg_iolatency_exit() will not do that for us,
    which causes a memory leak.
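
    A sketch of the fix in blkcg_init_queue(), assuming the iolatency state
    is torn down via rq_qos_exit() on the throttling error path:

    ret = blk_iolatency_init(q);
    if (ret)
            goto err_destroy_all;

    ret = blk_throtl_init(q);
    if (ret) {
            rq_qos_exit(q);    /* undoes what blk_iolatency_init() set up */
            goto err_destroy_all;
    }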

    Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

18 Jul, 2020

2 commits

  • In order to improve consistency and usability in cgroup stat accounting,
    we would like to support the root cgroup's io.stat.

    Since the root cgroup has processes doing io even if the system has no
    explicitly created cgroups, we need to be careful to avoid overhead in
    that case. For that reason, the rstat algorithms don't handle the root
    cgroup, so just turning the file on wouldn't give correct statistics.

    To get around this, we simulate flushing the iostat struct by filling it
    out directly from global disk stats. The result is a root cgroup io.stat
    file consistent with both /proc/diskstats and io.stat.

    Note that in order to collect the disk stats, we needed to iterate over
    devices. To facilitate that, we had to change the linkage of a disk_type
    to external so that it can be used from blk-cgroup.c to iterate over
    disks.
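
    A rough sketch of the simulated flush, iterating disks via the
    now-external disk_type (structure approximate):

    static void blkcg_fill_root_iostats(void)
    {
            struct class_dev_iter iter;
            struct device *dev;

            class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
            while ((dev = class_dev_iter_next(&iter))) {
                    struct gendisk *disk = dev_to_disk(dev);

                    /* read the global part stats for this disk and copy
                     * them into the root blkg's iostat, keeping io.stat
                     * consistent with /proc/diskstats */
            }
            class_dev_iter_exit(&iter);
    }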

    Suggested-by: Tejun Heo
    Signed-off-by: Boris Burkov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Boris Burkov
     
  • Previously, the code which printed io.stat only needed access to the
    generic rstat flushing code, but since we plan to write some more
    specific code for preparing root cgroup stats, we need to manipulate
    iostat structs directly. Since forward-declaring static functions does
    not seem to be common practice in this file, simply move the iostat
    functions up. We only plan to use blkg_iostat_set, but it seems better to
    keep them
    all together.

    Signed-off-by: Boris Burkov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Boris Burkov
     

10 May, 2020

2 commits

  • Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with
    the blk-iocost changes, but we also need the base of the bdi
    use-after-free fixes, since we build on top of them.

    * block-5.7:
    nvme: fix possible hang when ns scanning fails during error recovery
    nvme-pci: fix "slimmer CQ head update"
    bdi: add a ->dev_name field to struct backing_dev_info
    bdi: use bdi_dev_name() to get device name
    bdi: move bdi_dev_name out of line
    vboxsf: don't use the source name in the bdi name
    iocost: protect iocg->abs_vdebt with iocg->waitq.lock
    block: remove the bd_openers checks in blk_drop_partitions
    nvme: prevent double free in nvme_alloc_ns() error handling
    null_blk: Cleanup zoned device initialization
    null_blk: Fix zoned command handling
    block: remove unused header
    blk-iocost: Fix error on iocost_ioc_vrate_adj
    bdev: Reduce time holding bd_mutex in sync in blkdev_close()
    buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Use the common interface bdi_dev_name() to get device name.

    Signed-off-by: Yufen Yu
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Jan Kara
    Reviewed-by: Bart Van Assche

    Add missing include for BFQ

    Signed-off-by: Jens Axboe

    Yufen Yu
     

01 May, 2020

1 commit

  • The use_delay mechanism was introduced by blk-iolatency to hold memory
    allocators accountable for the reclaim and other shared IOs they cause. The
    duration of the delay is dynamically balanced between iolatency increasing the
    value on each target miss and it auto-decaying as time passes and threads get
    delayed on it.

    While this works well for iolatency, iocost's control model isn't compatible
    with it. There are no repeated "violation" events which can be balanced
    against auto-decaying. iocost instead knows how much a given cgroup is over
    budget and wants to prevent that cgroup from issuing IOs while over budget.
    Until now, iocost has been adding the cost of force-issued IOs. However,
    this doesn't reflect the amount which is already over budget and is simply
    not enough to counter the auto-decaying, allowing an anon-memory-leaking
    low priority cgroup to go over its allotted share of IOs.

    As auto-decaying doesn't make much sense for iocost, this patch introduces a
    different mode of operation for use_delay - when blkcg_set_delay() is used
    instead of blkcg_add/use_delay(), the delay duration is not auto-decayed
    until it is explicitly cleared with blkcg_clear_delay(). iocost is updated
    to keep the delay duration synchronized to the budget overage amount.

    With this change, iocost can effectively police cgroups which generate
    a significant amount of force-issued IOs.
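
    A minimal sketch of the two modes from a policy's point of view, using
    the helper names above:

    /* iolatency-style: accumulates and auto-decays over time */
    blkcg_add_delay(blkg, now, delta_ns);

    /* iocost-style: pinned to the current overage, no auto-decay */
    blkcg_set_delay(blkg, overage_ns);
    /* ... */
    blkcg_clear_delay(blkg);    /* explicit clear once under budget */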

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Apr, 2020

2 commits

  • blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
    don't get offlined while there are active cgwbs on them. However, it
    ends up making offlining unordered, sometimes causing parents to be
    offlined before children.

    Let's fix this by making child blkcgs pin the parents' online states.

    Note that pin/unpin names are chosen over get/put intentionally
    because css uses get/put online for something different.
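
    A sketch of the pinning helpers (close to the description; exact
    details may differ):

    void blkcg_pin_online(struct blkcg *blkcg)
    {
            refcount_inc(&blkcg->online_pin);
    }

    void blkcg_unpin_online(struct blkcg *blkcg)
    {
            /* dropping the last pin offlines this blkcg, which in turn
             * releases the pin it holds on its parent */
            do {
                    if (!refcount_dec_and_test(&blkcg->online_pin))
                            break;
                    blkcg_destroy_blkgs(blkcg);
                    blkcg = blkcg_parent(blkcg);
            } while (blkcg);
    }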

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
    don't get offlined while there are active cgwbs on them. However, it
    ends up making offlining unordered, sometimes causing parents to be
    offlined before children.

    To fix it, we want child blkcgs to pin the parents' online states
    turning the refcnt into a more generic online pinning mechanism.

    In preparation:

    * blkcg->cgwb_refcnt -> blkcg->online_pin
    * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
    * Take them out of CONFIG_CGROUP_WRITEBACK

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Mar, 2020

1 commit

  • Current make_request based drivers use either blk_alloc_queue_node or
    blk_alloc_queue to allocate a queue, and then set up the make_request_fn
    function pointer and a few parameters using the blk_queue_make_request
    helper. Simplify this by passing the make_request pointer to
    blk_alloc_queue, and while at it merge the _node variant into the main
    helper by always passing a node_id, and remove the superfluous gfp_mask
    parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
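
    The resulting interface, roughly (signature per the description;
    my_make_request is a hypothetical driver callback):

    struct request_queue *blk_alloc_queue(make_request_fn make_request,
                                          int node_id);

    /* a make_request based driver now does: */
    q = blk_alloc_queue(my_make_request, NUMA_NO_NODE);
    if (!q)
            return -ENOMEM;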

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Nov, 2019

3 commits

  • blkg_rwstat is now only used by bfq-iosched and blk-throtl when on
    cgroup1. Let's move it into its own files and gate it behind a config
    option.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk-cgroup has been using blkg_rwstat to track basic IO stats.
    Unfortunately, reading recursive stats scales badly as it involves
    walking all descendants. On systems with a huge number of cgroups
    (dead or alive), this can lead to substantial CPU cost when reading IO
    stats.

    This patch reimplements basic IO stats using cgroup rstat which uses
    more memory but makes recursive stat reading O(# descendants which
    have been active since last reading) instead of O(# descendants).

    * blk-cgroup core no longer uses sync/async stats. Introduce new stat
    enums - BLKG_IOSTAT_{READ|WRITE|DISCARD}.

    * Add blkg_iostat[_set] which encapsulates byte and io stats, last
    values for propagation delta calculation and u64_stats_sync for
    correctness on 32bit archs (see the sketch below).

    * Update the new percpu stat counters directly and implement
    blkcg_rstat_flush() to implement propagation.

    * blkg_print_stat() can now bring the stats up to date by calling
    cgroup_rstat_flush() and print them instead of directly summing up
    all descendants.

    * It now allocates 96 bytes per cpu. It used to be 40 bytes.
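
    A sketch of the containers described above (field layout approximate):

    enum blkg_iostat_type {
            BLKG_IOSTAT_READ,
            BLKG_IOSTAT_WRITE,
            BLKG_IOSTAT_DISCARD,
            BLKG_IOSTAT_NR,
    };

    struct blkg_iostat {
            u64 bytes[BLKG_IOSTAT_NR];
            u64 ios[BLKG_IOSTAT_NR];
    };

    struct blkg_iostat_set {
            struct u64_stats_sync sync;     /* 32bit read consistency */
            struct blkg_iostat cur;         /* current counters */
            struct blkg_iostat last;        /* for propagation deltas */
    };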

    Signed-off-by: Tejun Heo
    Cc: Dan Schatzberg
    Cc: Daniel Xu
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • These don't have users anymore. Remove them.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Nov, 2019

1 commit

  • blkcg_print_stat() iterates blkgs under RCU and doesn't test whether
    the blkg is online. This can call into pd_stat_fn() on a pd which is
    still being initialized, leading to an oops.

    The heaviest operation - recursively summing up rwstat counters - is
    already done while holding the queue_lock. Expand queue_lock to cover
    the other operations and skip the blkg if it isn't online yet. The
    online state is protected by both blkcg and queue locks, so this
    guarantees that only online blkgs are processed.
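
    A sketch of the resulting loop in blkcg_print_stat() (simplified):

    hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
            spin_lock_irq(&blkg->q->queue_lock);

            /* online is protected by both locks, so this check is safe */
            if (!blkg->online) {
                    spin_unlock_irq(&blkg->q->queue_lock);
                    continue;
            }

            /* ... sum up rwstats and call ->pd_stat_fn() ... */

            spin_unlock_irq(&blkg->q->queue_lock);
    }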

    Signed-off-by: Tejun Heo
    Reported-by: Roman Gushchin
    Cc: Josef Bacik
    Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Oct, 2019

1 commit

  • blkcg_activate_policy() has the following bugs.

    * cf09a8ee19ad ("blkcg: pass @q and @blkcg into
    blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
    blkcg_activate_policy() ends up using pd's allocated for the root
    blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
    can be passed in pd's which are allocated for the root blkcg.

    For blk-iocost, this means that ->pd_init_fn() can write beyond the
    end of the allocated object as it determines the length of the flex
    array at the end based on the blkcg's nesting level.

    * Each pd is initialized as it gets allocated. If an alloc fails, the
    policy will get freed with some pd's already initialized on it.

    * After the above partial failure, the partial pds are not freed.

    This patch fixes all the above issues by

    * Restructuring blkcg_activate_policy() so that alloc and init passes
    are separate. Init takes place only after all allocs succeeded and
    on failure all allocated pds are freed.

    * Unifying and fixing the cleanup of the remaining pd_prealloc.
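
    The restructured flow, in outline (a hypothetical simplification of
    blkcg_activate_policy()):

    /* pass 1: allocate a pd for every blkg, passing each blkg's own
     * blkcg; on -ENOMEM, fall back to a GFP_KERNEL prealloc and retry */
    list_for_each_entry(blkg, &q->blkg_list, q_node) {
            pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q, blkg->blkcg);
            if (!pd)
                    goto enomem;    /* frees all pds allocated so far */
            blkg->pd[pol->plid] = pd;
    }

    /* pass 2: init only after every allocation has succeeded */
    list_for_each_entry(blkg, &q->blkg_list, q_node)
            if (pol->pd_init_fn)
                    pol->pd_init_fn(blkg->pd[pol->plid]);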

    Signed-off-by: Tejun Heo
    Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Sep, 2019

1 commit

  • Since commit 795fe54c2a828099e ("bfq: Add per-device weight"), bfq uses
    blkg_conf_prep() and blkg_conf_finish(), which are not exported. So, it
    causes a linkage error if bfq is compiled as a module.
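
    A sketch of the fix, exporting the two symbols bfq needs:

    EXPORT_SYMBOL_GPL(blkg_conf_prep);
    EXPORT_SYMBOL_GPL(blkg_conf_finish);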

    Fixes: 795fe54c2a828099e ("bfq: Add per-device weight")
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

17 Jul, 2019

1 commit

  • Currently, ->pd_stat() is called only when the module parameter
    blkcg_debug_stats is set, which prevents it from printing non-debug
    policy-specific statistics. Let's move the debug testing down so that
    ->pd_stat() can print non-debug stats too. This patch doesn't cause
    any visible behavior change.
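
    Conceptually, the gate moves from the core into the policies
    (hypothetical before/after sketch):

    /* before: the core skipped the callback entirely */
    if (blkcg_debug_stats && pol->pd_stat_fn)
            written = pol->pd_stat_fn(pd, buf, size);

    /* after: always call in; each policy checks blkcg_debug_stats for
     * its debug-only lines and can print regular stats regardless */
    if (pol->pd_stat_fn)
            written = pol->pd_stat_fn(pd, buf, size);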

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Jul, 2019

3 commits

  • When a shared kthread needs to issue a bio for a cgroup, doing so
    synchronously can lead to priority inversions as the kthread can be
    trapped waiting for that cgroup. This patch implements
    REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
    to a dedicated per-blkcg work item to avoid such priority inversions.

    This will be used to fix priority inversions in btrfs compression and
    should be generally useful as we grow filesystem support for
    comprehensive IO control.
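
    From the issuer's side, opting in is just setting the flag (sketch):

    /* in a shared kthread issuing io on behalf of some cgroup */
    bio->bi_opf |= REQ_CGROUP_PUNT;
    submit_bio(bio);    /* actually issued later from per-blkcg work */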

    Cc: Chris Mason
    Reviewed-by: Josef Bacik
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • btrfs is going to use css_put() and wbc helpers to improve cgroup
    writeback support. Add dummy css_get() definition and export wbc
    helpers to prepare for module and !CONFIG_CGROUP builds.

    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the psi stuff in place, we can use the memstall flag to indicate
    pressure that happens from throttling.
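
    A sketch of how the throttle delay might be wrapped (the psi API is
    real; the exact placement is assumed):

    unsigned long pflags;

    psi_memstall_enter(&pflags);    /* account this sleep as memstall */
    /* ... perform the blkcg throttling delay ... */
    psi_memstall_leave(&pflags);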

    Signed-off-by: Josef Bacik
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
