21 Mar, 2019

1 commit

  • Avoid the following warnings being reported when building with W=1:

    block/blk-cgroup.c:1755: warning: Function parameter or member 'q' not described in 'blkcg_schedule_throttle'
    block/blk-cgroup.c:1755: warning: Function parameter or member 'use_memdelay' not described in 'blkcg_schedule_throttle'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'blkg' not described in 'blkcg_add_delay'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'now' not described in 'blkcg_add_delay'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'delta' not described in 'blkcg_add_delay'
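
    The usual fix for such warnings is a kernel-doc header that documents every
    parameter. A minimal sketch (the parameter descriptions here are illustrative,
    not necessarily the exact wording that was merged):

    /**
     * blkcg_schedule_throttle - this task needs to check for throttling
     * @q: the request_queue the IO was submitted on
     * @use_memdelay: whether this delay is charged as a memory delay
     */
    void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay);

    /**
     * blkcg_add_delay - add delay to this blkg
     * @blkg: blkg of interest
     * @now: the current time in nanoseconds
     * @delta: how many nanoseconds of delay to add
     */
    void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta);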

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

21 Dec, 2018

1 commit

  • An earlier commit 7fcf2b033b84 ("blkcg: change blkg reference counting
    to use percpu_ref") moved the release call out of blkg_put() and into the
    percpu_ref cleanup. Remove the additional, now-unused code that should
    have been removed at that point.

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

20 Dec, 2018

1 commit

  • blkg_lookup_create() may be called from pool_map(), where irq state is
    saved, so blkg_lookup_create() has to save and restore irq state as well
    when taking the queue lock.

    Otherwise, the following lockdep warning can be triggered:

    [ 104.258537] ================================
    [ 104.259129] WARNING: inconsistent lock state
    [ 104.259725] 4.20.0-rc6+ #545 Not tainted
    [ 104.260268] --------------------------------
    [ 104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    [ 104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
    [ 104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.263747] {SOFTIRQ-ON-W} state was registered at:
    [ 104.264417] _raw_spin_unlock_irq+0x29/0x4c
    [ 104.265014] blkg_lookup_create+0xdc/0xe6
    [ 104.265609] bio_associate_blkg_from_css+0xd3/0x13f
    [ 104.266312] bio_associate_blkg+0x15a/0x1bb
    [ 104.266913] pool_map+0xe8/0x103 [dm_thin_pool]
    [ 104.267572] __map_bio+0x98/0x29c [dm_mod]
    [ 104.268162] __split_and_process_non_flush+0x29e/0x306 [dm_mod]
    [ 104.269003] __split_and_process_bio+0x16a/0x25b [dm_mod]
    [ 104.269971] __dm_make_request.isra.14+0xdc/0x124 [dm_mod]
    [ 104.270973] generic_make_request+0x3f5/0x68b
    [ 104.271676] process_prepared_mapping+0x166/0x1ef [dm_thin_pool]
    [ 104.272531] schedule_zero+0x239/0x273 [dm_thin_pool]
    [ 104.273245] process_cell+0x60c/0x6f1 [dm_thin_pool]
    [ 104.273967] do_worker+0x60c/0xca8 [dm_thin_pool]
    [ 104.274635] process_one_work+0x4eb/0x834
    [ 104.275203] worker_thread+0x318/0x484
    [ 104.275740] kthread+0x1d1/0x1e1
    [ 104.276203] ret_from_fork+0x3a/0x50
    [ 104.276714] irq event stamp: 170003
    [ 104.277201] hardirqs last enabled at (170002): [] _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.278535] hardirqs last disabled at (170003): [] _raw_spin_lock_irqsave+0x20/0x55
    [ 104.280273] softirqs last enabled at (169978): [] irq_enter+0x4c/0x73
    [ 104.281617] softirqs last disabled at (169979): [] irq_exit+0x7e/0x11d
    [ 104.282744]
    [ 104.282744] other info that might help us debug this:
    [ 104.283640] Possible unsafe locking scenario:
    [ 104.283640]
    [ 104.284452] CPU0
    [ 104.284803] ----
    [ 104.285150] lock(&(&pool->lock)->rlock#3);
    [ 104.285762]
    [ 104.286130] lock(&(&pool->lock)->rlock#3);
    [ 104.286750]
    [ 104.286750] *** DEADLOCK ***
    [ 104.286750]
    [ 104.287564] no locks held by swapper/49/0.
    [ 104.288129]
    [ 104.288129] stack backtrace:
    [ 104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545
    [ 104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
    [ 104.290858] Call Trace:
    [ 104.291204]
    [ 104.291502] dump_stack+0x9a/0xe6
    [ 104.291968] mark_lock+0x56c/0x7a6
    [ 104.292442] ? check_usage_backwards+0x209/0x209
    [ 104.293086] __lock_acquire+0x400/0x15bf
    [ 104.293662] ? check_chain_key+0x150/0x1aa
    [ 104.294236] lock_acquire+0x1a6/0x1e3
    [ 104.294768] ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.295444] ? _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.296143] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.297031] _raw_spin_lock_irqsave+0x46/0x55
    [ 104.297659] ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.298335] thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.298997] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.299886] ? check_flags+0x20a/0x20a
    [ 104.300408] ? lock_acquire+0x1a6/0x1e3
    [ 104.300954] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.301865] clone_endio+0x1bb/0x22d [dm_mod]
    [ 104.302491] ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.303200] ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.303836] ? bio_endio+0x2b2/0x2da
    [ 104.304349] clone_endio+0x1f3/0x22d [dm_mod]
    [ 104.304978] ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.305709] ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.306333] ? bio_endio+0x2b2/0x2da
    [ 104.306853] clone_endio+0x1f3/0x22d [dm_mod]
    [ 104.307476] ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.308185] ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.308817] ? bio_endio+0x2b2/0x2da
    [ 104.309319] blk_update_request+0x2de/0x4cc
    [ 104.309927] blk_mq_end_request+0x2a/0x183
    [ 104.310498] blk_done_softirq+0x16a/0x1a6
    [ 104.311051] ? blk_softirq_cpu_dead+0xe2/0xe2
    [ 104.311653] ? __lock_is_held+0x2a/0x87
    [ 104.312186] __do_softirq+0x250/0x4e8
    [ 104.312705] irq_exit+0x7e/0x11d
    [ 104.313157] call_function_single_interrupt+0xf/0x20
    [ 104.313860]
    [ 104.314163] RIP: 0010:native_safe_halt+0x2/0x3
    [ 104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8
    [ 104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
    [ 104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7
    [ 104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c
    [ 104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001
    [ 104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf
    [ 104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000
    [ 104.323307] ? lockdep_hardirqs_on+0x26b/0x278
    [ 104.323927] default_idle+0xd9/0x1a8
    [ 104.324427] do_idle+0x162/0x2b2
    [ 104.324891] ? arch_cpu_idle_exit+0x28/0x28
    [ 104.325467] ? mark_held_locks+0x28/0x7f
    [ 104.326031] ? _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.326719] cpu_startup_entry+0x1d/0x1f
    [ 104.327261] start_secondary+0x2cb/0x308
    [ 104.327806] ? set_cpu_sibling_map+0x8a3/0x8a3
    [ 104.328421] secondary_startup_64+0xa4/0xb0
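
    The essence of the fix, as a hedged sketch (the exact function body upstream
    may differ): take the queue lock with the irq-saving variants so that
    blkg_lookup_create() is safe to call with interrupts already disabled, as it
    is from pool_map():

    struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
                                        struct request_queue *q)
    {
        struct blkcg_gq *blkg = blkg_lookup(blkcg, q);
        unsigned long flags;

        if (unlikely(!blkg)) {
            /* save/restore irq state instead of spin_lock_irq()/unlock_irq() */
            spin_lock_irqsave(&q->queue_lock, flags);
            blkg = __blkg_lookup_create(blkcg, q);
            spin_unlock_irqrestore(&q->queue_lock, flags);
        }

        return blkg;
    }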

    Fixes: b978962ad4f7f9 ("blkcg: update blkg_lookup_create() to do locking")
    Cc: Mike Snitzer
    Cc: Dennis Zhou
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

13 Dec, 2018

1 commit

  • Between v3 [1] and v4 [2] of the blkg association series, the
    association point moved from generic_make_request_checks(), which is
    called after the request enters the queue, to bio_set_dev(), which is when
    the bio is formed before submit_bio(). When the request_queue goes away,
    the blkgs supporting the request_queue are destroyed and then the
    q->root_blkg is set to %NULL.

    This patch adds a %NULL check to blkg_tryget_closest() to prevent the
    NULL pointer dereference caused by the above. It also adds a guard to
    see if the request_queue is dying when creating a blkg, to prevent
    creating a blkg for a dead request_queue.

    [1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [2] https://lore.kernel.org/lkml/20181126211946.77067-1-dennis@kernel.org/
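
    A hedged sketch of the two guards described above (the exact upstream code
    and error path may differ slightly):

    /* tolerate a %NULL blkg once q->root_blkg has been torn down */
    static inline struct blkcg_gq *blkg_tryget_closest(struct blkcg_gq *blkg)
    {
        while (blkg && !blkg_tryget(blkg))
            blkg = blkg->parent;

        return blkg;
    }

    /* and in blkg creation: do not create a blkg for a dying request_queue */
    if (blk_queue_dying(q))
        return ERR_PTR(-ENODEV);    /* error path illustrative */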

    Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
    Reported-and-tested-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

08 Dec, 2018

4 commits

  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget, which now returns a bool rather than the blkg or %NULL.
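
    A minimal sketch of the renamed helper, mirroring css_tryget() and assuming
    the percpu_ref-based refcnt from the commit below:

    static inline bool blkg_tryget(struct blkcg_gq *blkg)
    {
        /* true if a reference was obtained, false otherwise */
        return percpu_ref_tryget(&blkg->refcnt);
    }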

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • Every bio is now associated with a blkg, putting blkg_get, blkg_try_get,
    and blkg_put on the hot path. Switch over the refcnt in blkg to use
    percpu_ref.
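
    Roughly, the conversion amounts to the following (a sketch; error handling,
    labels, and the release callback are elided):

    /* at blkg allocation time: */
    if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, GFP_KERNEL))
        goto err_free;

    /* at blkg destruction time, instead of a final atomic_dec: */
    percpu_ref_kill(&blkg->refcnt);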

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • There are several scenarios where blkg_lookup_create() can fail, such as
    the blkcg dying, the request_queue dying, or simply being OOM. Most
    callers handle this by simply falling back to q->root_blkg and calling it
    a day.

    This patch implements the notion of closest blkg. During
    blkg_lookup_create(), if it fails to create, return the closest blkg
    found or the q->root_blkg. blkg_try_get_closest() is introduced and used
    during association so a bio is always attached to a blkg.
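
    A hedged sketch of the closest-blkg idea: walk up the blkg hierarchy until a
    reference can be taken, so association always succeeds (the 13 Dec entry
    above later adds a %NULL check to this walk):

    static inline struct blkcg_gq *blkg_tryget_closest(struct blkcg_gq *blkg)
    {
        /* fall back toward q->root_blkg until a live blkg is found */
        while (!blkg_tryget(blkg))
            blkg = blkg->parent;

        return blkg;
    }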

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • To know when to create a blkg, the general pattern is to do a
    blkg_lookup() and, if that fails, lock and do the lookup again; if that
    still fails, finally create. It doesn't make much sense for everyone who
    wants to do creation to write this themselves.

    This changes blkg_lookup_create() to do locking and implement this
    pattern. The old blkg_lookup_create() is renamed to
    __blkg_lookup_create(). If a call site wants to do its own error
    handling or already owns the queue lock, they can use
    __blkg_lookup_create(). This will be used in upcoming patches.
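
    A hedged sketch of the pattern now folded into blkg_lookup_create() (the
    20 Dec entry above later switches these to the irq-saving lock variants):

    struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
                                        struct request_queue *q)
    {
        struct blkcg_gq *blkg = blkg_lookup(blkcg, q);

        if (unlikely(!blkg)) {
            /* lock and look up again; only create if it is still missing */
            spin_lock_irq(&q->queue_lock);
            blkg = __blkg_lookup_create(blkcg, q);
            spin_unlock_irq(&q->queue_lock);
        }

        return blkg;
    }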

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

02 Nov, 2018

1 commit

  • This reverts a series committed earlier due to a NULL pointer dereference
    bug report in [1]. It seems there are edge-case interactions that I did
    not consider, and it will take some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

01 Oct, 2018

1 commit

  • Merge -rc6 in, for two reasons:

    1) Resolve a trivial conflict in the blk-mq-tag.c documentation
    2) A few important regression fixes went into upstream directly, so
    they aren't in the 4.20 branch.

    Signed-off-by: Jens Axboe

    * tag 'v4.19-rc6': (780 commits)
    Linux 4.19-rc6
    MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
    cpufreq: qcom-kryo: Fix section annotations
    perf/core: Add sanity check to deal with pinned event failure
    xen/blkfront: correct purging of persistent grants
    Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
    selftests/powerpc: Fix Makefiles for headers_install change
    blk-mq: I/O and timer unplugs are inverted in blktrace
    dax: Fix deadlock in dax_lock_mapping_entry()
    x86/boot: Fix kexec booting failure in the SEV bit detection code
    bcache: add separate workqueue for journal_write to avoid deadlock
    drm/amd/display: Fix Edid emulation for linux
    drm/amd/display: Fix Vega10 lightup on S3 resume
    drm/amdgpu: Fix vce work queue was not cancelled when suspend
    Revert "drm/panel: Add device_link from panel device to DRM device"
    xen/blkfront: When purging persistent grants, keep them in the buffer
    clocksource/drivers/timer-atmel-pit: Properly handle error cases
    block: fix deadline elevator drain for zoned block devices
    ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Sep, 2018

4 commits

  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget, which now returns a bool rather than the blkg or NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • Now that every bio is associated with a blkg, this puts the use of
    blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over
    the refcnt in blkg to use percpu_ref.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • There are several scenarios where blkg_lookup_create can fail. Examples
    include the blkcg dying, the request_queue dying, or simply being OOM. At
    the end of the day, most handle this by simply falling back to the
    q->root_blkg and calling it a day.

    This patch implements the notion of closest blkg. During
    blkg_lookup_create, if it fails to create, return the closest blkg
    found or the q->root_blkg. blkg_try_get_closest is introduced and used
    during association so a bio is always attached to a blkg.

    Acked-by: Tejun Heo
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • To know when to create a blkg, the general pattern is to do a
    blkg_lookup and, if that fails, lock and then do a lookup again; if that
    still fails, finally create. It doesn't make much sense for everyone who
    wants to do creation to write this themselves.

    This changes blkg_lookup_create to do locking and implement this
    pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create.
    If a call site wants to do its own error handling or already owns the
    queue lock, they can use __blkg_lookup_create. This will be used in
    upcoming patches.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

12 Sep, 2018

1 commit

  • After merging the iolatency policy, we potentially now have 4 policies
    being registered, but only support 3. This causes one of them to fail
    loading. Takashi reports that BFQ no longer works for him, because it
    fails to load due to policy registration failure.

    Bump to 5 policies, and also add a warning for when we have exceeded
    the global amount. If we have to touch this again, we should switch
    to a dynamic scheme instead.
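
    In other words, the change is roughly the following (a sketch; the exact
    constant location and warning text may differ):

    #define BLKCG_MAX_POLS 5    /* bumped from 3 */

    /* in blkcg_policy_register(), warn instead of failing silently: */
    if (i >= BLKCG_MAX_POLS) {
        pr_warn("blkcg_policy_register: BLKCG_MAX_POLS too small\n");
        goto err_unlock;
    }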

    Reported-by: Takashi Iwai
    Reviewed-by: Jeff Moyer
    Tested-by: Takashi Iwai
    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Sep, 2018

2 commits

  • Currently, blkcg destruction relies on a sequence of events:
    1. Destruction starts. blkcg_css_offline() is called and blkgs
    release their reference to the blkcg. This immediately destroys
    the cgwbs (writeback).
    2. With blkgs giving up their reference, the blkcg ref count should
    become zero and eventually call blkcg_css_free() which finally
    frees the blkcg.

    Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
    and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
    on the completion of all writeback associated with the blkcg. A count of
    the number of cgwbs is maintained and once that goes to zero, blkg
    destruction can follow. This should prevent premature blkg destruction
    related to writeback.

    The new process for blkcg cleanup is as follows:
    1. Destruction starts. blkcg_css_offline() is called which offlines
    writeback. Blkg destruction is delayed on the cgwb_refcnt count to
    avoid punting potentially large amounts of outstanding writeback
    to root while maintaining any ongoing policies. Here, the base
    cgwb_refcnt is put back.
    2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
    and handles destruction of blkgs. This is where the css reference
    held by each blkg is released.
    3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
    This finally frees the blkg.
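
    A hedged sketch of the handoff in step 2; blkcg_destroy_blkgs() and the
    cgwb_refcnt come from the description above, while the put helper's name
    and body here are an assumption:

    static inline void blkcg_cgwb_put(struct blkcg *blkcg)
    {
        /* last cgwb gone: writeback has finished, blkgs may now be destroyed */
        if (refcount_dec_and_test(&blkcg->cgwb_refcnt))
            blkcg_destroy_blkgs(blkcg);
    }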

    It seems that in the past blk-throttle didn't do the most understandable
    things when taking data from a blkg while associating with current. So
    the simplification and unification of what blk-throttle is doing caused
    this.

    Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Cc: Jiufei Xue
    Cc: Joseph Qi
    Cc: Tejun Heo
    Cc: Josef Bacik
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • This reverts commit 4c6994806f708559c2812b73501406e21ae5dcd0.

    Destroying blkgs is tricky because of the nature of the relationship. A
    blkg should go away when either a blkcg or a request_queue goes away.
    However, blkg's pin the blkcg to ensure they remain valid. To break this
    cycle, when a blkcg is offlined, blkgs put back their css ref. This
    eventually lets css_free() get called which frees the blkcg.

    The above commit (4c6994806f70) breaks this order of events by trying to
    destroy blkgs in css_free(). As the blkgs still hold references to the
    blkcg, css_free() is never called.

    The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
    addressed in the following patch by delaying destruction of a blkg until
    all writeback associated with the blkcg has been finished.

    Fixes: 4c6994806f70 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Cc: Jiufei Xue
    Cc: Joseph Qi
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

09 Jul, 2018

3 commits

  • Current IO controllers for the block layer are less than ideal for our
    use case. The io.max controller is great at hard limiting, but it is
    not work conserving. This patch introduces io.latency. You provide a
    latency target for your group and we monitor the io in short windows to
    make sure we are not exceeding those latency targets. This makes use of
    the rq-qos infrastructure and works much like the wbt stuff. There are
    a few differences from wbt:

    - It's bio based, so the latency covers the whole block layer in addition to
    the actual io.
    - We will throttle all IO types that come in here if we need to.
    - We use the mean latency over the 100ms window. This is because writes can
    be particularly fast, which could give us a false sense of the impact of
    other workloads on our protected workload.
    - By default there's no throttling: we set the queue_depth to INT_MAX so that
    we can have as many outstanding bios as we're allowed to. Only at
    throttle time do we pay attention to the actual queue depth.
    - We backcharge cgroups for root cg issued IO and induce artificial
    delays in order to deal with cases like metadata only or swap heavy
    workloads.
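
    Purely for illustration, a sketch of a windowed latency check in the spirit
    of the description above; the struct, field names, and scaling policy here
    are invented and are not the blk-iolatency implementation:

    struct iolat_window {              /* invented for illustration */
        u64 lat_sum_ns;                /* sum of completion latencies this window */
        u64 nr_ios;                    /* completions seen this window */
        u64 target_ns;                 /* the group's latency target */
        unsigned int qd;               /* allowed queue depth; INT_MAX == unthrottled */
    };

    static void iolat_window_check(struct iolat_window *w)
    {
        u64 mean = w->nr_ios ? w->lat_sum_ns / w->nr_ios : 0;

        if (mean > w->target_ns && w->qd > 1)
            w->qd /= 2;                                    /* target missed: throttle harder */
        else if (w->qd < INT_MAX)
            w->qd = min_t(unsigned int, w->qd + w->qd / 4 + 1, INT_MAX);

        w->lat_sum_ns = 0;                                 /* start the next 100ms window */
        w->nr_ios = 0;
    }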

    In testing this has worked out relatively well. Protected workloads
    will throttle noisy workloads down to 1 io at a time if they are doing
    normal IO on their own, or induce up to a 1 second delay per syscall if
    they are doing a lot of root-issued IO (metadata/swap IO).

    Our testing has revolved mostly around our production web servers where
    we have hhvm (the web server application) in a protected group and
    everything else in another group. We see slightly higher requests per
    second (RPS) on the test tier vs the control tier, and much more stable
    RPS across all machines in the test tier vs the control tier.

    Another test we run is a slow memory allocator in the unprotected group.
    Before this would eventually push us into swap and cause the whole box
    to die and not recover at all. With these patches we see slight RPS
    drops (usually 10-15%) before the memory consumer is properly killed and
    things recover within seconds.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Since IO can be issued from literally anywhere, it's almost impossible to
    do throttling without having some sort of adverse effect somewhere else
    in the system because of locking or other dependencies. The best way to
    solve this is to do the throttling when we know we aren't holding any
    other kernel resources. Do this by tracking throttling on a per-blkg
    basis, and if we require throttling, flag the task so that it checks
    before returning to user space and possibly sleeps there.

    This is to address the case where a process is doing work that is
    generating IO that can't be throttled, whether that is directly with a
    lot of REQ_META IO, or indirectly by allocating so much memory that it
    is swamping the disk with REQ_SWAP. We can't use task_add_work as we
    don't want to induce a memory allocation in the IO path, so simply
    saving the request queue in the task and flagging it to do the
    notify_resume thing achieves the same result without the overhead of a
    memory allocation.
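
    A hedged sketch of the flag-now, check-on-return-to-userspace mechanism
    (field and helper names may differ slightly from the merged code, and
    taking a reference on the queue is elided):

    void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay)
    {
        if (unlikely(current->flags & PF_KTHREAD))
            return;                       /* only userspace tasks can sleep there */

        current->throttle_queue = q;      /* no allocation: just stash the queue */
        if (use_memdelay)
            current->use_memdelay = true;

        /* the throttle check then runs before returning to user space */
        set_notify_resume(current);
    }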

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blk-iolatency has a few stats that it would like to print out, and
    instead of adding a bunch of crap to the generic code just provide a
    helper so that controllers can add stuff to the stat line if they want
    to.

    Hide it behind a boot option since it changes the output of io.stat from
    normal, and these stats are only interesting to developers.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

19 Apr, 2018

2 commits

  • The initialization of q->root_blkg currently happens outside of the queue
    lock and rcu, so the blkg may be destroyed before the initialization
    completes, which may cause dangling/NULL references. On the other hand,
    blkg destruction is protected by the queue lock or rcu. Put the
    initialization inside the queue lock and rcu to make it safer.
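
    A hedged sketch of the reordering in blkcg_init_queue() (error handling
    elided):

    rcu_read_lock();
    spin_lock_irq(q->queue_lock);
    blkg = blkg_create(&blkcg_root, q, new_blkg);
    if (!IS_ERR(blkg))
        q->root_blkg = blkg;      /* previously assigned after dropping the locks */
    spin_unlock_irq(q->queue_lock);
    rcu_read_unlock();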

    Signed-off-by: Jiang Biao
    Signed-off-by: Wen Yang
    CC: Tejun Heo
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Jiang Biao
     
  • The comment before blkg_create() in blkcg_init_queue() was moved
    from blkcg_activate_policy() by commit ec13b1d6f0a0457312e615, but
    it no longer suits the new context.

    Signed-off-by: Jiang Biao
    Signed-off-by: Wen Yang
    CC: Tejun Heo
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Jiang Biao
     

18 Apr, 2018

1 commit

  • As described in the comment of blkcg_activate_policy(),
    *Update of each blkg is protected by both queue and blkcg locks so
    that holding either lock and testing blkcg_policy_enabled() is
    always enough for dereferencing policy data.*
    With the queue lock held, there is therefore no need to hold the blkcg
    lock in blkcg_deactivate_policy(). A similar case is
    blkcg_activate_policy(), where holding the blkcg lock was removed in
    commit 4c55f4f9ad3001ac1fefdd8d8ca7641d18558e23.
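
    A hedged sketch of blkcg_deactivate_policy() after the change; with the
    queue lock held, the per-blkcg lock around each blkg's policy-data teardown
    is no longer taken (exact code may differ):

    spin_lock_irq(q->queue_lock);
    list_for_each_entry(blkg, &q->blkg_list, q_node) {
        /* no spin_lock(&blkg->blkcg->lock) needed here anymore */
        if (blkg->pd[pol->plid]) {
            if (pol->pd_offline_fn)
                pol->pd_offline_fn(blkg->pd[pol->plid]);
            pol->pd_free_fn(blkg->pd[pol->plid]);
            blkg->pd[pol->plid] = NULL;
        }
    }
    spin_unlock_irq(q->queue_lock);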

    Signed-off-by: Jiang Biao
    Signed-off-by: Wen Yang
    CC: Tejun Heo
    Signed-off-by: Jens Axboe

    Jiang Biao
     

17 Mar, 2018

1 commit

  • We've triggered a WARNING in blk_throtl_bio() when throttling writeback
    io: it complains that blkg->refcnt is already 0 when calling blkg_get(),
    and the kernel then crashes with an invalid page request.
    After investigating this issue, we found it is caused by a race
    between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
    below:

    writeback kworker                       cgroup_rmdir
                                              cgroup_destroy_locked
                                                kill_css
                                                  css_killed_ref_fn
                                                    css_killed_work_fn
                                                      offline_css
                                                        blkcg_css_offline
    blkcg_bio_issue_check
      rcu_read_lock
      blkg_lookup
                                                        spin_trylock(q->queue_lock)
                                                        blkg_destroy
                                                        spin_unlock(q->queue_lock)
      blk_throtl_bio
        spin_lock_irq(q->queue_lock)
        ...
        spin_unlock_irq(q->queue_lock)
      rcu_read_unlock

    Since rcu can only prevent the blkg from being released while it is in
    use, blkg->refcnt can drop to 0 during blkg_destroy(), which schedules
    the blkg release.
    The subsequent blkg_get() in blk_throtl_bio() then triggers the WARNING,
    and the corresponding blkg_put() schedules the blkg release again,
    resulting in a double free.
    This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
    creation in blkcg_bio_issue_check()"). Before that commit, the code would
    look up first and then look up/create again with the queue_lock held.
    Since reviving this logic would be rather drastic, fix it by only
    offlining pd during blkcg_css_offline(), and move the rest of the
    destruction (especially blkg_put()) into blkcg_css_free(), which should
    be the right way as discussed.
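
    A hedged sketch of the split described above; the two helpers that offline
    policy data and finish blkg destruction are given illustrative names here:

    static void blkcg_css_offline(struct cgroup_subsys_state *css)
    {
        struct blkcg *blkcg = css_to_blkcg(css);

        /* only take per-policy data offline here; keep the blkgs around */
        blkcg_offline_pds(blkcg);           /* illustrative helper name */
        wb_blkcg_offline(blkcg);
    }

    static void blkcg_css_free(struct cgroup_subsys_state *css)
    {
        struct blkcg *blkcg = css_to_blkcg(css);

        /* drop the blkg references (blkg_put()) only once the css is freed */
        blkcg_destroy_all_blkgs(blkcg);     /* illustrative helper name */
        kfree(blkcg);
    }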

    Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
    Reported-by: Jiufei Xue
    Signed-off-by: Joseph Qi
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Joseph Qi
     
