01 May, 2019
1 commit
-
Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
21 Mar, 2019
1 commit
-
Avoid that the following warnings are reported when building with W=1:
block/blk-cgroup.c:1755: warning: Function parameter or member 'q' not described in 'blkcg_schedule_throttle'
block/blk-cgroup.c:1755: warning: Function parameter or member 'use_memdelay' not described in 'blkcg_schedule_throttle'
block/blk-cgroup.c:1779: warning: Function parameter or member 'blkg' not described in 'blkcg_add_delay'
block/blk-cgroup.c:1779: warning: Function parameter or member 'now' not described in 'blkcg_add_delay'
block/blk-cgroup.c:1779: warning: Function parameter or member 'delta' not described in 'blkcg_add_delay'

Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe
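The warnings above are about parameters missing from kernel-doc comments. As a rough illustration of a comment that documents every parameter, here is a minimal sketch; `demo_add_delay()` and its parameters are hypothetical stand-ins, not the kernel's blkcg_add_delay().

```c
#include <stdint.h>

/**
 * demo_add_delay() - accumulate a delay into a running total
 * @total: running total of delay in nanoseconds (hypothetical parameter)
 * @delta: delay to add in nanoseconds (hypothetical parameter)
 *
 * Each parameter has its own "@name:" line. A parameter without such a
 * line is what a "not described in" warning complains about.
 *
 * Return: the new total.
 */
uint64_t demo_add_delay(uint64_t total, uint64_t delta)
{
	return total + delta;
}
```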
10 Feb, 2019
1 commit
-
Since 4cf6324b17e9, a portion of function blk_cleanup_queue was moved to
a newly created function called blk_exit_queue, including the call of
blkcg_exit_queue. So, adjust the documentation accordingly.

Reviewed-by: Bart Van Assche
Signed-off-by: Marcos Paulo de Souza
Signed-off-by: Jens Axboe
21 Dec, 2018
1 commit
-
An earlier commit 7fcf2b033b84 ("blkcg: change blkg reference counting
to use percpu_ref") moved around the release call from blkg_put() to be
a part of the percpu_ref cleanup. Remove the additional unused code
which should have been removed earlier.

Signed-off-by: Dennis Zhou
Signed-off-by: Jens Axboe
20 Dec, 2018
1 commit
-
blkg_lookup_create() may be called from pool_map(), where the irq state
has already been saved, so blkg_lookup_create() has to save and restore
the irq state as well.

Otherwise, the following lockdep warning can be triggered:
[ 104.258537] ================================
[ 104.259129] WARNING: inconsistent lock state
[ 104.259725] 4.20.0-rc6+ #545 Not tainted
[ 104.260268] --------------------------------
[ 104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
[ 104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool]
[ 104.263747] {SOFTIRQ-ON-W} state was registered at:
[ 104.264417] _raw_spin_unlock_irq+0x29/0x4c
[ 104.265014] blkg_lookup_create+0xdc/0xe6
[ 104.265609] bio_associate_blkg_from_css+0xd3/0x13f
[ 104.266312] bio_associate_blkg+0x15a/0x1bb
[ 104.266913] pool_map+0xe8/0x103 [dm_thin_pool]
[ 104.267572] __map_bio+0x98/0x29c [dm_mod]
[ 104.268162] __split_and_process_non_flush+0x29e/0x306 [dm_mod]
[ 104.269003] __split_and_process_bio+0x16a/0x25b [dm_mod]
[ 104.269971] __dm_make_request.isra.14+0xdc/0x124 [dm_mod]
[ 104.270973] generic_make_request+0x3f5/0x68b
[ 104.271676] process_prepared_mapping+0x166/0x1ef [dm_thin_pool]
[ 104.272531] schedule_zero+0x239/0x273 [dm_thin_pool]
[ 104.273245] process_cell+0x60c/0x6f1 [dm_thin_pool]
[ 104.273967] do_worker+0x60c/0xca8 [dm_thin_pool]
[ 104.274635] process_one_work+0x4eb/0x834
[ 104.275203] worker_thread+0x318/0x484
[ 104.275740] kthread+0x1d1/0x1e1
[ 104.276203] ret_from_fork+0x3a/0x50
[ 104.276714] irq event stamp: 170003
[ 104.277201] hardirqs last enabled at (170002): [] _raw_spin_unlock_irqrestore+0x44/0x6b
[ 104.278535] hardirqs last disabled at (170003): [] _raw_spin_lock_irqsave+0x20/0x55
[ 104.280273] softirqs last enabled at (169978): [] irq_enter+0x4c/0x73
[ 104.281617] softirqs last disabled at (169979): [] irq_exit+0x7e/0x11d
[ 104.282744]
[ 104.282744] other info that might help us debug this:
[ 104.283640] Possible unsafe locking scenario:
[ 104.283640]
[ 104.284452] CPU0
[ 104.284803] ----
[ 104.285150] lock(&(&pool->lock)->rlock#3);
[ 104.285762] <Interrupt>
[ 104.286130] lock(&(&pool->lock)->rlock#3);
[ 104.286750]
[ 104.286750] *** DEADLOCK ***
[ 104.286750]
[ 104.287564] no locks held by swapper/49/0.
[ 104.288129]
[ 104.288129] stack backtrace:
[ 104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545
[ 104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[ 104.290858] Call Trace:
[ 104.291204] <IRQ>
[ 104.291502] dump_stack+0x9a/0xe6
[ 104.291968] mark_lock+0x56c/0x7a6
[ 104.292442] ? check_usage_backwards+0x209/0x209
[ 104.293086] __lock_acquire+0x400/0x15bf
[ 104.293662] ? check_chain_key+0x150/0x1aa
[ 104.294236] lock_acquire+0x1a6/0x1e3
[ 104.294768] ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
[ 104.295444] ? _raw_spin_unlock_irqrestore+0x44/0x6b
[ 104.296143] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[ 104.297031] _raw_spin_lock_irqsave+0x46/0x55
[ 104.297659] ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
[ 104.298335] thin_endio+0xcf/0x2a3 [dm_thin_pool]
[ 104.298997] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[ 104.299886] ? check_flags+0x20a/0x20a
[ 104.300408] ? lock_acquire+0x1a6/0x1e3
[ 104.300954] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[ 104.301865] clone_endio+0x1bb/0x22d [dm_mod]
[ 104.302491] ? disable_write_zeroes+0x20/0x20 [dm_mod]
[ 104.303200] ? bio_disassociate_blkg+0xc6/0x15f
[ 104.303836] ? bio_endio+0x2b2/0x2da
[ 104.304349] clone_endio+0x1f3/0x22d [dm_mod]
[ 104.304978] ? disable_write_zeroes+0x20/0x20 [dm_mod]
[ 104.305709] ? bio_disassociate_blkg+0xc6/0x15f
[ 104.306333] ? bio_endio+0x2b2/0x2da
[ 104.306853] clone_endio+0x1f3/0x22d [dm_mod]
[ 104.307476] ? disable_write_zeroes+0x20/0x20 [dm_mod]
[ 104.308185] ? bio_disassociate_blkg+0xc6/0x15f
[ 104.308817] ? bio_endio+0x2b2/0x2da
[ 104.309319] blk_update_request+0x2de/0x4cc
[ 104.309927] blk_mq_end_request+0x2a/0x183
[ 104.310498] blk_done_softirq+0x16a/0x1a6
[ 104.311051] ? blk_softirq_cpu_dead+0xe2/0xe2
[ 104.311653] ? __lock_is_held+0x2a/0x87
[ 104.312186] __do_softirq+0x250/0x4e8
[ 104.312705] irq_exit+0x7e/0x11d
[ 104.313157] call_function_single_interrupt+0xf/0x20
[ 104.313860] </IRQ>
[ 104.314163] RIP: 0010:native_safe_halt+0x2/0x3
[ 104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8
[ 104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
[ 104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7
[ 104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c
[ 104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001
[ 104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf
[ 104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000
[ 104.323307] ? lockdep_hardirqs_on+0x26b/0x278
[ 104.323927] default_idle+0xd9/0x1a8
[ 104.324427] do_idle+0x162/0x2b2
[ 104.324891] ? arch_cpu_idle_exit+0x28/0x28
[ 104.325467] ? mark_held_locks+0x28/0x7f
[ 104.326031] ? _raw_spin_unlock_irqrestore+0x44/0x6b
[ 104.326719] cpu_startup_entry+0x1d/0x1f
[ 104.327261] start_secondary+0x2cb/0x308
[ 104.327806] ? set_cpu_sibling_map+0x8a3/0x8a3
[ 104.328421] secondary_startup_64+0xa4/0xb0

Fixes: b978962ad4f7f9 ("blkcg: update blkg_lookup_create() to do locking")
Cc: Mike Snitzer
Cc: Dennis Zhou
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe
13 Dec, 2018
1 commit
-
Between v3 [1] and v4 [2] of the blkg association series, the
association point moved from generic_make_request_checks(), which is
called after the request enters the queue, to bio_set_dev(), which is when
the bio is formed before submit_bio(). When the request_queue goes away,
the blkgs supporting the request_queue are destroyed and then the
q->root_blkg is set to %NULL.

This patch adds a %NULL check to blkg_tryget_closest() to prevent the
NULL pointer dereference caused by the above. It also adds a guard to see if the
request_queue is dying when creating a blkg to prevent creating a blkg
for a dead request_queue.

[1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[2] https://lore.kernel.org/lkml/20181126211946.77067-1-dennis@kernel.org/

Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
Reported-and-tested-by: Ming Lei
Reviewed-by: Bart Van Assche
Signed-off-by: Dennis Zhou
Signed-off-by: Jens Axboe
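A "closest" lookup with a NULL guard can be sketched in user space as follows. This is a simplified model under stated assumptions: `struct demo_blkg` keeps only a plain integer count and a parent pointer, and the `demo_*` names are hypothetical rather than the kernel's implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a blkg: only the fields this sketch needs. */
struct demo_blkg {
	int refs;                  /* 0 means the object is being torn down */
	struct demo_blkg *parent;  /* NULL at the root */
};

/* Take a reference unless the count has already dropped to zero. */
static bool demo_tryget(struct demo_blkg *blkg)
{
	if (blkg->refs == 0)
		return false;
	blkg->refs++;
	return true;
}

/*
 * Walk towards the root until a reference can be taken. The NULL test
 * in the loop condition covers both a NULL starting point (for example
 * a root pointer that was already cleared) and walking past the root.
 */
struct demo_blkg *demo_tryget_closest(struct demo_blkg *blkg)
{
	while (blkg && !demo_tryget(blkg))
		blkg = blkg->parent;
	return blkg;
}
```

Callers of such a helper still have to handle a NULL result, since every ancestor may already be gone.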
08 Dec, 2018
4 commits
-
blkg reference counting now uses percpu_ref rather than atomic_t. Let's
make this consistent with css_tryget. This renames blkg_try_get to
blkg_tryget, which now returns a bool rather than the blkg or %NULL.

Signed-off-by: Dennis Zhou
Reviewed-by: Josef Bacik
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
-
Every bio is now associated with a blkg, putting blkg_get, blkg_try_get,
and blkg_put on the hot path. Switch over the refcnt in blkg to use
percpu_ref.

Signed-off-by: Dennis Zhou
Acked-by: Tejun Heo
Reviewed-by: Josef Bacik
Signed-off-by: Jens Axboe
-
There are several scenarios where blkg_lookup_create() can fail such as
the blkcg dying, the request_queue dying, or simply being OOM. Most
handle this by simply falling back to the q->root_blkg and calling it a
day.

This patch implements the notion of closest blkg. During
blkg_lookup_create(), if it fails to create, return the closest blkg
found or the q->root_blkg. blkg_try_get_closest() is introduced and used
during association so a bio is always attached to a blkg.

Signed-off-by: Dennis Zhou
Acked-by: Tejun Heo
Reviewed-by: Josef Bacik
Signed-off-by: Jens Axboe
-
To know when to create a blkg, the general pattern is to do a
blkg_lookup() and if that fails, lock and do the lookup again, and if
that fails finally create. It doesn't make much sense for everyone who
wants to do creation to write this themselves.

This changes blkg_lookup_create() to do locking and implement this
pattern. The old blkg_lookup_create() is renamed to
__blkg_lookup_create(). If a call site wants to do its own error
handling or already owns the queue lock, they can use
__blkg_lookup_create(). This will be used in upcoming patches.

Signed-off-by: Dennis Zhou
Reviewed-by: Josef Bacik
Acked-by: Tejun Heo
Reviewed-by: Liu Bo
Signed-off-by: Jens Axboe
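The lookup, lock, lookup again, create pattern described in the last commit above can be sketched with a toy table. This is a single-threaded user-space model: the `locked` field only marks where a real lock would be held, and all `demo_*` names are hypothetical.

```c
#include <stddef.h>

#define DEMO_MAX 8

struct demo_table {
	int keys[DEMO_MAX];
	int count;
	int locked;      /* stand-in for the queue lock */
	int creations;   /* how many entries were created */
};

/* Fast path: lookup without taking the lock. Returns index or -1. */
static int demo_lookup(const struct demo_table *t, int key)
{
	for (int i = 0; i < t->count; i++)
		if (t->keys[i] == key)
			return i;
	return -1;
}

/* Slow path; the caller must hold the lock. Returns index, or -1 when full. */
static int demo_lookup_create_locked(struct demo_table *t, int key)
{
	int idx = demo_lookup(t, key);   /* re-check under the lock */

	if (idx >= 0)
		return idx;
	if (t->count == DEMO_MAX)
		return -1;
	t->keys[t->count] = key;
	t->creations++;
	return t->count++;
}

/* Wrapper that owns the locking, so callers do not repeat the pattern. */
int demo_lookup_create(struct demo_table *t, int key)
{
	int idx = demo_lookup(t, key);

	if (idx >= 0)
		return idx;
	t->locked = 1;
	idx = demo_lookup_create_locked(t, key);
	t->locked = 0;
	return idx;
}
```

Keeping the locked variant separate mirrors the split in the commit message: callers that already hold the lock, or want their own error handling, call the inner function directly.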
16 Nov, 2018
6 commits
-
Various spots check for q->mq_ops being non-NULL; provide
a helper to do this instead.

Where the ->mq_ops != NULL check is redundant, remove it.

Since mq == rq-based now that legacy is gone, get rid of
queue_is_rq_based() and just use queue_is_mq() everywhere.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
With the legacy request path gone there is no good reason to keep
queue_lock as a pointer, we can always use the embedded lock now.

Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig

Fixed floppy and blk-cgroup missing conversions and half done edits.
Signed-off-by: Jens Axboe
-
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
Use a goto label to merge two identical pieces of error handling code.
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
-
Unused since the removal of the legacy request code.
Reviewed-by: Hannes Reinecke
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
08 Nov, 2018
2 commits
-
It's now dead code, nobody uses it.
Reviewed-by: Hannes Reinecke
Tested-by: Ming Lei
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
-
We only support mq devices now.
Reviewed-by: Hannes Reinecke
Tested-by: Ming Lei
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
02 Nov, 2018
1 commit
-
This reverts a series committed earlier due to the NULL pointer
dereference bug report in [1]. It seems there are edge case interactions
that I did not consider, and I will need some time to understand what
causes the adverse interactions.

The original series can be found in [2] with a follow up series in [3].

[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

This reverts the following commits:
d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

Signed-off-by: Dennis Zhou
Signed-off-by: Jens Axboe
01 Oct, 2018
1 commit
-
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...

Signed-off-by: Jens Axboe
22 Sep, 2018
4 commits
-
blkg reference counting now uses percpu_ref rather than atomic_t. Let's
make this consistent with css_tryget. This renames blkg_try_get to
blkg_tryget, which now returns a bool rather than the blkg or NULL.

Signed-off-by: Dennis Zhou
Reviewed-by: Josef Bacik
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
-
Now that every bio is associated with a blkg, this puts the use of
blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over
the refcnt in blkg to use percpu_ref.

Signed-off-by: Dennis Zhou
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
-
There are several scenarios where blkg_lookup_create can fail. Examples
include the blkcg dying, the request_queue dying, or simply being OOM. At
the end of the day, most handle this by simply falling back to the
q->root_blkg and calling it a day.

This patch implements the notion of closest blkg. During
blkg_lookup_create, if it fails to create, return the closest blkg
found or the q->root_blkg. blkg_try_get_closest is introduced and used
during association so a bio is always attached to a blkg.

Acked-by: Tejun Heo
Signed-off-by: Dennis Zhou
Signed-off-by: Jens Axboe
-
To know when to create a blkg, the general pattern is to do a
blkg_lookup and if that fails, lock and then do a lookup again and if
that fails finally create. It doesn't make much sense for everyone who
wants to do creation to write this themselves.

This changes blkg_lookup_create to do locking and implement this
pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create.
If a call site wants to do its own error handling or already owns the
queue lock, they can use __blkg_lookup_create. This will be used in
upcoming patches.

Signed-off-by: Dennis Zhou
Reviewed-by: Josef Bacik
Acked-by: Tejun Heo
Reviewed-by: Liu Bo
Signed-off-by: Jens Axboe
12 Sep, 2018
1 commit
-
After merging the iolatency policy, we potentially now have 4 policies
being registered, but only support 3. This causes one of them to fail
loading. Takashi reports that BFQ no longer works for him, because it
fails to load due to policy registration failure.

Bump to 5 policies, and also add a warning for when we have exceeded
the global amount. If we have to touch this again, we should switch
to a dynamic scheme instead.

Reported-by: Takashi Iwai
Reviewed-by: Jeff Moyer
Tested-by: Takashi Iwai
Signed-off-by: Jens Axboe
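Registration into a fixed number of slots, with a warning once the limit is exceeded, can be sketched as below. This is an illustrative user-space model only: `DEMO_MAX_POLS`, `struct demo_policy` and `demo_policy_register()` are hypothetical names, and the warning is a plain message on stderr.

```c
#include <stdio.h>
#include <stddef.h>

#define DEMO_MAX_POLS 5   /* fixed limit; the commit above bumps it to 5 */

struct demo_policy {
	const char *name;
	int plid;          /* slot index assigned at registration */
};

static struct demo_policy *demo_pols[DEMO_MAX_POLS];

/* Returns 0 on success, -1 when every slot is already taken. */
int demo_policy_register(struct demo_policy *pol)
{
	for (int i = 0; i < DEMO_MAX_POLS; i++) {
		if (!demo_pols[i]) {
			demo_pols[i] = pol;
			pol->plid = i;
			return 0;
		}
	}
	/* Make the failure visible instead of failing silently. */
	fprintf(stderr, "demo: policy limit (%d) exceeded, cannot register %s\n",
		DEMO_MAX_POLS, pol->name);
	return -1;
}
```

A fixed array keeps lookups by slot index trivial, at the cost of the hard limit the commit message suggests replacing with a dynamic scheme if it has to be touched again.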
01 Sep, 2018
2 commits
-
Currently, blkcg destruction relies on a sequence of events:
1. Destruction starts. blkcg_css_offline() is called and blkgs
release their reference to the blkcg. This immediately destroys
the cgwbs (writeback).
2. With blkgs giving up their reference, the blkcg ref count should
become zero and eventually call blkcg_css_free() which finally
frees the blkcg.

Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction
related to writeback.

The new process for blkcg cleanup is as follows:
1. Destruction starts. blkcg_css_offline() is called which offlines
writeback. Blkg destruction is delayed on the cgwb_refcnt count to
avoid punting potentially large amounts of outstanding writeback
to root while maintaining any ongoing policies. Here, the base
cgwb_refcnt is put back.
2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
and handles destruction of blkgs. This is where the css reference
held by each blkg is released.
3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
This finally frees the blkcg.

It seems in the past blk-throttle didn't do the most understandable
things with taking data from a blkg while associating with current. So,
the simplification and unification of what blk-throttle is doing caused
this.

Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik
Signed-off-by: Dennis Zhou
Cc: Jiufei Xue
Cc: Joseph Qi
Cc: Tejun Heo
Cc: Josef Bacik
Cc: Jens Axboe
Signed-off-by: Jens Axboe
-
This reverts commit 4c6994806f708559c2812b73501406e21ae5dcd0.
Destroying blkgs is tricky because of the nature of the relationship. A
blkg should go away when either a blkcg or a request_queue goes away.
However, blkgs pin the blkcg to ensure they remain valid. To break this
cycle, when a blkcg is offlined, blkgs put back their css ref. This
eventually lets css_free() get called which frees the blkcg.

The above commit (4c6994806f70) breaks this order of events by trying to
destroy blkgs in css_free(). As the blkgs still hold references to the
blkcg, css_free() is never called.

The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
addressed in the following patch by delaying destruction of a blkg until
all writeback associated with the blkcg has been finished.

Fixes: 4c6994806f70 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
Reviewed-by: Josef Bacik
Signed-off-by: Dennis Zhou
Cc: Jiufei Xue
Cc: Joseph Qi
Cc: Tejun Heo
Cc: Jens Axboe
Signed-off-by: Jens Axboe
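The ordering described in the two commits above (blkgs pin the blkcg, and blkg destruction waits for writeback to finish) can be modelled with plain counters. This is a rough sketch under stated assumptions: the structure, its fields and the `demo_*` functions are hypothetical, `cgwb_refs` merely mirrors the cgwb_refcnt named in the commit message, and the css count is reduced to the references held by blkgs.

```c
#include <stdbool.h>

struct demo_blkcg {
	int css_refs;    /* references pinning the blkcg (one per blkg here) */
	int cgwb_refs;   /* base reference plus one per active writeback */
	int nr_blkgs;    /* live blkgs, each holding one css reference */
	bool freed;      /* set once the final css reference is dropped */
};

/* Dropping the last css reference stands in for css_free(). */
static void demo_css_put(struct demo_blkcg *c)
{
	if (--c->css_refs == 0)
		c->freed = true;
}

/* Destroy every blkg; each one gives back its css reference. */
static void demo_destroy_blkgs(struct demo_blkcg *c)
{
	while (c->nr_blkgs > 0) {
		c->nr_blkgs--;
		demo_css_put(c);
	}
}

/* Called when a writeback context finishes, and for the base reference. */
void demo_cgwb_put(struct demo_blkcg *c)
{
	if (--c->cgwb_refs == 0)
		demo_destroy_blkgs(c);
}

/* Offline step: only the base writeback reference is put back. */
void demo_css_offline(struct demo_blkcg *c)
{
	demo_cgwb_put(c);
}
```

In this model, destroying blkgs from the free path instead would never run: the free path is only reached after the blkgs have dropped their references, which is the cycle the revert describes.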
01 Aug, 2018
1 commit
-
The blkg lifetime is protected by the queue lifetime, so we need to put
the queue _after_ we're done using the blkg.

Signed-off-by: Josef Bacik
Signed-off-by: Jens Axboe
18 Jul, 2018
1 commit
-
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat. Two
fields, dbytes and dios, to respectively count the total bytes and
number of discards are added.

Signed-off-by: Tejun Heo
Cc: Andy Newell
Cc: Michael Callahan
Signed-off-by: Jens Axboe
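Reading the two new fields back from a stat line can be sketched as follows. This assumes a line made of space-separated `key=value` tokens after a device number; that layout and the sample values used in testing are assumptions for illustration, and `demo_stat_field()` is a hypothetical helper, not part of any kernel or libc interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the unsigned value of "key=<number>" from a space-separated
 * stat line. Returns 0 on success, -1 if the key is absent or malformed.
 */
int demo_stat_field(const char *line, const char *key, unsigned long long *out)
{
	size_t klen = strlen(key);
	const char *p = line;

	while ((p = strstr(p, key)) != NULL) {
		/* Accept whole tokens only, so "bytes" does not match "dbytes". */
		if ((p == line || p[-1] == ' ') && p[klen] == '=')
			return sscanf(p + klen + 1, "%llu", out) == 1 ? 0 : -1;
		p += klen;
	}
	return -1;
}
```

Matching on token boundaries matters here because the field names share suffixes (`rbytes`, `wbytes`, `dbytes`).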
09 Jul, 2018
3 commits
-
Current IO controllers for the block layer are less than ideal for our
use case. The io.max controller is great at hard limiting, but it is
not work conserving. This patch introduces io.latency. You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets. This makes use of
the rq-qos infrastructure and works much like the wbt stuff. There are
a few differences from wbt:

- It's bio based, so the latency covers the whole block layer in addition to
  the actual io.
- We will throttle all IO types that come in here if we need to.
- We use the mean latency over the 100ms window. This is because writes can
  be particularly fast, which could give us a false sense of the impact of
  other workloads on our protected workload.
- By default there's no throttling, we set the queue_depth to INT_MAX so that
  we can have as many outstanding bio's as we're allowed to. Only at
  throttle time do we pay attention to the actual queue depth.
- We backcharge cgroups for root cg issued IO and induce artificial
  delays in order to deal with cases like metadata only or swap heavy
  workloads.

In testing this has worked out relatively well. Protected workloads
will throttle noisy workloads down to 1 io at a time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).

Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group. We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.

Another test we run is a slow memory allocator in the unprotected group.
Before, this would eventually push us into swap and cause the whole box
to die and not recover at all. With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.

Signed-off-by: Josef Bacik
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
-
Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies. The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources. Do this by tracking throttling on a per-blkg
basis, and if we require throttling, flag the task so that it checks
before it returns to user space and possibly sleeps there.

This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP. We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.

Signed-off-by: Josef Bacik
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
-
blk-iolatency has a few stats that it would like to print out, and
instead of adding a bunch of crap to the generic code just provide a
helper so that controllers can add stuff to the stat line if they want
to.

Hide it behind a boot option since it changes the output of io.stat from
normal, and these stats are only interesting to developers.

Signed-off-by: Josef Bacik
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
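A helper that lets controllers append their own fields to a stat line might look roughly like the sketch below. It is a user-space illustration only: the `demo_stat_fn` hook type, `demo_print_stat()` and the example `depth` field are hypothetical, and the boot option mentioned in the last commit above is not modelled.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-controller hook: write into buf, return bytes stored. */
typedef size_t (*demo_stat_fn)(const void *pd, char *buf, size_t size);

/* Clamp an snprintf() result to the number of bytes actually stored. */
static size_t demo_clamp(int n, size_t size)
{
	if (n < 0 || size == 0)
		return 0;
	return (size_t)n < size ? (size_t)n : size - 1;
}

/*
 * Build one stat line: the generic field first, then whatever each
 * registered hook wants to append. Returns the line length.
 */
size_t demo_print_stat(char *buf, size_t size, unsigned long long rbytes,
		       const demo_stat_fn *fns, const void **pds, size_t nr)
{
	size_t off = demo_clamp(snprintf(buf, size, "rbytes=%llu", rbytes), size);

	for (size_t i = 0; i < nr; i++)
		if (fns[i])
			off += fns[i](pds[i], buf + off, size - off);
	return off;
}

/* Example hook: a made-up controller exposing a queue depth. */
size_t demo_lat_stat(const void *pd, char *buf, size_t size)
{
	const int *depth = pd;

	return demo_clamp(snprintf(buf, size, " depth=%d", *depth), size);
}
```

The generic code stays unaware of controller-specific fields; it only iterates over optional hooks.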
19 Apr, 2018
2 commits
-
The initialization of q->root_blkg is currently done outside of the queue
lock and rcu, so the blkg may be destroyed before the initialization, which
may cause dangling/null references. On the other hand, the destruction
of blkgs is protected by the queue lock or rcu. Put the initialization
inside the queue lock and rcu to make it safer.

Signed-off-by: Jiang Biao
Signed-off-by: Wen Yang
CC: Tejun Heo
CC: Jens Axboe
Signed-off-by: Jens Axboe
-
The comment before blkg_create() in blkcg_init_queue() was moved
from blkcg_activate_policy() by commit ec13b1d6f0a0457312e615, but
it does not suit the new context.

Signed-off-by: Jiang Biao
Signed-off-by: Wen Yang
CC: Tejun Heo
CC: Jens Axboe
Signed-off-by: Jens Axboe
18 Apr, 2018
1 commit
-
As described in the comment of blkcg_activate_policy(),
*Update of each blkg is protected by both queue and blkcg locks so
that holding either lock and testing blkcg_policy_enabled() is
always enough for dereferencing policy data.*
With the queue lock held, there is no need to hold the blkcg lock in
blkcg_deactivate_policy(). A similar case is
blkcg_activate_policy(), where holding the blkcg lock was removed in
commit 4c55f4f9ad3001ac1fefdd8d8ca7641d18558e23.

Signed-off-by: Jiang Biao
Signed-off-by: Wen Yang
CC: Tejun Heo
Signed-off-by: Jens Axboe
17 Mar, 2018
1 commit
-
We've triggered a WARNING in blk_throtl_bio() when throttling writeback
io, which complains blkg->refcnt is already 0 when calling blkg_get(),
and then kernel crashes with invalid page request.
After investigating this issue, we've found it is caused by a race
between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
below:

writeback kworker               cgroup_rmdir
                                  cgroup_destroy_locked
                                    kill_css
                                      css_killed_ref_fn
                                        css_killed_work_fn
                                          offline_css
                                            blkcg_css_offline
  blkcg_bio_issue_check
    rcu_read_lock
    blkg_lookup
                                              spin_trylock(q->queue_lock)
                                              blkg_destroy
                                              spin_unlock(q->queue_lock)
    blk_throtl_bio
    spin_lock_irq(q->queue_lock)
    ...
    spin_unlock_irq(q->queue_lock)
  rcu_read_unlock

Since rcu only prevents the blkg from being released while it is in use,
blkg->refcnt can drop to 0 during blkg_destroy(), which schedules the
blkg release.
Calling blkg_get() in blk_throtl_bio() then triggers the WARNING, and
the corresponding blkg_put() schedules the blkg release again, which
results in a double free.
This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before that commit, the code
looked up first and then tried to lookup/create again with queue_lock
held. Since reviving this logic is a bit drastic, fix it by only
offlining pd during blkcg_css_offline(), and move the rest of the
destruction (especially blkg_put()) into blkcg_css_free(), which should
be the right way as discussed.

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
Reported-by: Jiufei Xue
Signed-off-by: Joseph Qi
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe
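The interleaving described above can be replayed deterministically in a single thread to show why the release ends up scheduled twice. This is a simplified model: a plain integer stands in for blkg->refcnt, a flag stands in for the WARNING, and all `demo_*` names are hypothetical.

```c
#include <stdbool.h>

struct demo_blkg {
	int refcnt;
	int releases_scheduled;  /* how often a release was scheduled */
	bool warned;             /* stand-in for the WARNING */
};

/* Drop a reference; reaching zero schedules a release. */
static void demo_put(struct demo_blkg *b)
{
	if (--b->refcnt == 0)
		b->releases_scheduled++;
}

/* Take a reference; doing so on a zero count is the reported bug. */
static void demo_get(struct demo_blkg *b)
{
	if (b->refcnt == 0)
		b->warned = true;
	b->refcnt++;
}

/* Replay of the race, one step per line, in the order from the diagram. */
void demo_replay_race(struct demo_blkg *b)
{
	/* kworker: lookup found b under rcu, no reference taken yet */
	demo_put(b);   /* rmdir side: destroy drops the last reference */
	demo_get(b);   /* kworker: throttle path takes a reference on 0 */
	demo_put(b);   /* kworker: matching put schedules release again */
}
```

Starting from a count of 1, the replay ends with the warning flag set and two scheduled releases, which corresponds to the double free in the report.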
27 Feb, 2018
1 commit
-
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.

Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe
05 Nov, 2017
1 commit
-
A blkcg policy should keep cpd/pd's alloc_fn and free_fn in pairs,
otherwise policy registration fails.

Reviewed-by: Johannes Thumshirn
Signed-off-by: weiping zhang
Signed-off-by: Jens Axboe
10 Oct, 2017
1 commit
-
Check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when freeing cpd.
Reviewed-by: Johannes Thumshirn
Signed-off-by: weiping zhang
Signed-off-by: Jens Axboe
26 Aug, 2017
1 commit
-
This patch fixes two errors: first, avoid kfree of blk_root; second, do
not free blkcg if the blkcg allocation failed (blkcg == NULL), just
unlock the mutex.

Signed-off-by: weiping zhang
Signed-off-by: Jens Axboe