03 Sep, 2020

1 commit

  • [ Upstream commit 27029b4b18aa5d3b060f0bf2c26dae254132cfce ]

    Normally, blkcg_iolatency_exit() frees the iolatency-related memory
    when the queue is cleaned up. But if blk_throtl_init() returns an error
    and queue init fails, blkcg_iolatency_exit() will not do that for us,
    which causes a memory leak.

    Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Yufen Yu
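
    The leak described above is the usual "later init step fails, earlier
    step is never unwound" pattern. A minimal userspace sketch, assuming two
    hypothetical init steps modelled with malloc(); the struct, the function
    names and the fail_throtl switch are illustrative and are not the kernel
    code:

```c
#include <stdlib.h>

/* Hypothetical stand-in for per-queue controller state. */
struct queue_sketch {
    void *iolat;    /* models memory set up by the iolatency init step */
    void *throtl;   /* models memory set up by the throttle init step */
};

/* fail_throtl lets the caller force the second step to fail. */
static int queue_init_sketch(struct queue_sketch *q, int fail_throtl)
{
    q->iolat = malloc(64);
    if (!q->iolat)
        return -1;

    q->throtl = fail_throtl ? NULL : malloc(64);
    if (!q->throtl)
        goto err_free_iolat;

    return 0;

err_free_iolat:
    /* Without this unwind, the first allocation would leak. */
    free(q->iolat);
    q->iolat = NULL;
    return -1;
}

/* Normal teardown path, used only after a successful init. */
static void queue_exit_sketch(struct queue_sketch *q)
{
    free(q->throtl);
    free(q->iolat);
    q->throtl = NULL;
    q->iolat = NULL;
}
```

    The point of the sketch is that the failure path must release what the
    earlier step allocated, because the normal exit path never runs for a
    queue whose init failed.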
     

07 Nov, 2019

1 commit

  • blkcg_print_stat() iterates blkgs under RCU and doesn't test whether
    the blkg is online. This can call into pd_stat_fn() on a pd which is
    still being initialized leading to an oops.

    The heaviest operation - recursively summing up rwstat counters - is
    already done while holding the queue_lock. Expand queue_lock to cover
    the other operations and skip the blkg if it isn't online yet. The
    online state is protected by both blkcg and queue locks, so this
    guarantees that only online blkgs are processed.

    Signed-off-by: Tejun Heo
    Reported-by: Roman Gushchin
    Cc: Josef Bacik
    Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Oct, 2019

1 commit

  • blkcg_activate_policy() has the following bugs.

    * cf09a8ee19ad ("blkcg: pass @q and @blkcg into
    blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
    blkcg_activate_policy() ends up using pd's allocated for the root
    blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
    can be passed in pd's which are allocated for the root blkcg.

    For blk-iocost, this means that ->pd_init_fn() can write beyond the
    end of the allocated object as it determines the length of the flex
    array at the end based on the blkcg's nesting level.

    * Each pd is initialized as it gets allocated. If an alloc fails, the
    policy will get freed with initialized pd's still on it.

    * After the above partial failure, the partially set up pds are not freed.

    This patch fixes all the above issues by

    * Restructuring blkcg_activate_policy() so that alloc and init passes
    are separate. Init takes place only after all allocs succeeded and
    on failure all allocated pds are freed.

    * Unifying and fixing the cleanup of the remaining pd_prealloc.

    Signed-off-by: Tejun Heo
    Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
    Signed-off-by: Jens Axboe

    Tejun Heo
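
    The restructuring described above (allocate everything first, initialize
    only once all allocations succeeded, free everything on failure) can be
    sketched in userspace C. The names pd_sketch, activate_sketch and the
    fail_at parameter are assumptions made for illustration; this is not the
    blkcg_activate_policy() implementation:

```c
#include <stdlib.h>

struct pd_sketch {
    int initialized;
};

/*
 * Pass 1 allocates one object per slot; pass 2 initializes them only
 * after every allocation has succeeded. fail_at >= 0 simulates an
 * allocation failure at that index.
 */
static int activate_sketch(struct pd_sketch **pds, int n, int fail_at)
{
    int i;

    for (i = 0; i < n; i++) {
        pds[i] = (i == fail_at) ? NULL : calloc(1, sizeof(*pds[i]));
        if (!pds[i])
            goto err_free;
    }

    for (i = 0; i < n; i++)
        pds[i]->initialized = 1;
    return 0;

err_free:
    /* Free everything allocated so far; nothing has been initialized. */
    while (--i >= 0) {
        free(pds[i]);
        pds[i] = NULL;
    }
    return -1;
}
```

    Keeping the two passes separate means a failure can never leave an
    initialized object attached to a policy that is about to be torn down.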
     

15 Sep, 2019

1 commit

  • Since commit 795fe54c2a828099e ("bfq: Add per-device weight"), bfq uses
    blkg_conf_prep() and blkg_conf_finish(), which are not exported. This
    causes a link error if bfq is compiled as a module.

    Fixes: 795fe54c2a828099e ("bfq: Add per-device weight")
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

29 Aug, 2019

3 commits


17 Jul, 2019

1 commit

  • Currently, ->pd_stat() is called only when the module parameter
    blkcg_debug_stats is set, which prevents it from printing non-debug
    policy-specific statistics. Let's move debug testing down so that
    ->pd_stat() can print non-debug stats too. This patch doesn't cause
    any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Jul, 2019

3 commits

  • When a shared kthread needs to issue a bio for a cgroup, doing so
    synchronously can lead to priority inversions as the kthread can be
    trapped waiting for that cgroup. This patch implements
    REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
    to a dedicated per-blkcg work item to avoid such priority inversions.

    This will be used to fix priority inversions in btrfs compression and
    should be generally useful as we grow filesystem support for
    comprehensive IO control.

    Cc: Chris Mason
    Reviewed-by: Josef Bacik
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • btrfs is going to use css_put() and wbc helpers to improve cgroup
    writeback support. Add dummy css_get() definition and export wbc
    helpers to prepare for module and !CONFIG_CGROUP builds.

    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the psi stuff in place we can use the memstall flag to indicate
    pressure that happens from throttling.

    Signed-off-by: Josef Bacik
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

21 Jun, 2019

4 commits


16 Jun, 2019

3 commits

  • When blkcg_activate_policy() is creating blkg_policy_data for existing
    blkgs, it did so in the wrong order - descendants first. Fix it. None
    of the existing controllers seem to be affected by this.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
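
    "Parents before descendants" is a pre-order walk of the hierarchy. A
    small sketch, assuming a hypothetical tree node type (node_sketch) that
    stands in for the blkg hierarchy; it only records the visiting order:

```c
#include <stddef.h>

#define MAX_CHILDREN 4

struct node_sketch {
    int id;
    struct node_sketch *children[MAX_CHILDREN];
    size_t nr_children;
};

/*
 * Pre-order walk: record the node itself before any of its descendants.
 * Returns the next free position in out.
 */
static size_t visit_parents_first(const struct node_sketch *n, int *out,
                                  size_t pos)
{
    out[pos++] = n->id;
    for (size_t i = 0; i < n->nr_children; i++)
        pos = visit_parents_first(n->children[i], out, pos);
    return pos;
}
```

    With this order, any per-node setup that looks at the parent's state
    finds the parent already set up.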
     
  • blkg alloc is performed as a separate step from the rest of blkg
    creation so that GFP_KERNEL allocations can be used when creating
    blkgs from configuration file writes because otherwise user actions
    may fail due to failures of opportunistic GFP_NOWAIT allocations.

    While making blkgs use percpu_ref, 7fcf2b033b84 ("blkcg: change blkg
    reference counting to use percpu_ref") incorrectly added unconditional
    opportunistic percpu_ref_init() to blkg_create() breaking this
    guarantee.

    This patch moves percpu_ref_init() to blkg_alloc() so that it uses the
    @gfp_mask that blkg_alloc() is called with. Also, percpu_ref_exit()
    is moved to blkg_free() for consistency.

    Signed-off-by: Tejun Heo
    Fixes: 7fcf2b033b84 ("blkcg: change blkg reference counting to use percpu_ref")
    Cc: Dennis Zhou
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Depending on the number of devices, blkcg stats can go over the
    default seqfile buf size. seqfile normally retries with a larger
    buffer but since the ->pd_stat() addition, blkcg_print_stat() doesn't
    tell seqfile that overflow has happened and the output gets printed
    truncated. Fix it by calling seq_commit() w/ -1 on possible
    overflows.

    Signed-off-by: Tejun Heo
    Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
    Cc: stable@vger.kernel.org # v4.19+
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
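
    The overflow handling described above follows a general pattern: the
    printer must report that the output did not fit so the caller can retry
    with a larger buffer, rather than silently truncating. A userspace
    sketch using snprintf(); print_stats_sketch, show_stats_sketch and the
    stat line format are assumptions for illustration and do not use the
    kernel seq_file API:

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Formats nr_devs stat lines into buf. Returns the number of bytes
 * written, or -1 if the output did not fit.
 */
static int print_stats_sketch(char *buf, size_t size, int nr_devs)
{
    size_t off = 0;

    if (size == 0)
        return -1;
    buf[0] = '\0';

    for (int i = 0; i < nr_devs; i++) {
        int n = snprintf(buf + off, size - off,
                         "8:%d rbytes=0 wbytes=0\n", i);
        if (n < 0 || (size_t)n >= size - off)
            return -1;  /* report the overflow instead of truncating */
        off += (size_t)n;
    }
    return (int)off;
}

/* Caller side: grow the buffer and retry until the output fits. */
static char *show_stats_sketch(int nr_devs, size_t initial_size)
{
    size_t size = initial_size ? initial_size : 16;

    for (;;) {
        char *buf = malloc(size);

        if (!buf)
            return NULL;
        if (print_stats_sketch(buf, size, nr_devs) >= 0)
            return buf;
        free(buf);
        size *= 2;
    }
}
```

    If the printer returned a truncated length instead of -1, the caller
    would have no reason to retry, which is the failure mode the commit
    message describes.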
     

05 Jun, 2019

1 commit


01 May, 2019

1 commit


21 Mar, 2019

1 commit

  • Avoid that the following warnings are reported when building with W=1:

    block/blk-cgroup.c:1755: warning: Function parameter or member 'q' not described in 'blkcg_schedule_throttle'
    block/blk-cgroup.c:1755: warning: Function parameter or member 'use_memdelay' not described in 'blkcg_schedule_throttle'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'blkg' not described in 'blkcg_add_delay'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'now' not described in 'blkcg_add_delay'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'delta' not described in 'blkcg_add_delay'

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

10 Feb, 2019

1 commit


21 Dec, 2018

1 commit

  • An earlier commit 7fcf2b033b84 ("blkcg: change blkg reference counting
    to use percpu_ref") moved around the release call from blkg_put() to be
    a part of the percpu_ref cleanup. Remove the additional unused code
    which should have been removed earlier.

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

20 Dec, 2018

1 commit

  • blkg_lookup_create() may be called from pool_map(), where the irq
    state has already been saved, so blkg_lookup_create() has to save and
    restore the irq state as well.

    Otherwise, the following lockdep warning can be triggered:

    [ 104.258537] ================================
    [ 104.259129] WARNING: inconsistent lock state
    [ 104.259725] 4.20.0-rc6+ #545 Not tainted
    [ 104.260268] --------------------------------
    [ 104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    [ 104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
    [ 104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.263747] {SOFTIRQ-ON-W} state was registered at:
    [ 104.264417] _raw_spin_unlock_irq+0x29/0x4c
    [ 104.265014] blkg_lookup_create+0xdc/0xe6
    [ 104.265609] bio_associate_blkg_from_css+0xd3/0x13f
    [ 104.266312] bio_associate_blkg+0x15a/0x1bb
    [ 104.266913] pool_map+0xe8/0x103 [dm_thin_pool]
    [ 104.267572] __map_bio+0x98/0x29c [dm_mod]
    [ 104.268162] __split_and_process_non_flush+0x29e/0x306 [dm_mod]
    [ 104.269003] __split_and_process_bio+0x16a/0x25b [dm_mod]
    [ 104.269971] __dm_make_request.isra.14+0xdc/0x124 [dm_mod]
    [ 104.270973] generic_make_request+0x3f5/0x68b
    [ 104.271676] process_prepared_mapping+0x166/0x1ef [dm_thin_pool]
    [ 104.272531] schedule_zero+0x239/0x273 [dm_thin_pool]
    [ 104.273245] process_cell+0x60c/0x6f1 [dm_thin_pool]
    [ 104.273967] do_worker+0x60c/0xca8 [dm_thin_pool]
    [ 104.274635] process_one_work+0x4eb/0x834
    [ 104.275203] worker_thread+0x318/0x484
    [ 104.275740] kthread+0x1d1/0x1e1
    [ 104.276203] ret_from_fork+0x3a/0x50
    [ 104.276714] irq event stamp: 170003
    [ 104.277201] hardirqs last enabled at (170002): [] _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.278535] hardirqs last disabled at (170003): [] _raw_spin_lock_irqsave+0x20/0x55
    [ 104.280273] softirqs last enabled at (169978): [] irq_enter+0x4c/0x73
    [ 104.281617] softirqs last disabled at (169979): [] irq_exit+0x7e/0x11d
    [ 104.282744]
    [ 104.282744] other info that might help us debug this:
    [ 104.283640] Possible unsafe locking scenario:
    [ 104.283640]
    [ 104.284452] CPU0
    [ 104.284803] ----
    [ 104.285150] lock(&(&pool->lock)->rlock#3);
    [ 104.285762]
    [ 104.286130] lock(&(&pool->lock)->rlock#3);
    [ 104.286750]
    [ 104.286750] *** DEADLOCK ***
    [ 104.286750]
    [ 104.287564] no locks held by swapper/49/0.
    [ 104.288129]
    [ 104.288129] stack backtrace:
    [ 104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545
    [ 104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
    [ 104.290858] Call Trace:
    [ 104.291204]
    [ 104.291502] dump_stack+0x9a/0xe6
    [ 104.291968] mark_lock+0x56c/0x7a6
    [ 104.292442] ? check_usage_backwards+0x209/0x209
    [ 104.293086] __lock_acquire+0x400/0x15bf
    [ 104.293662] ? check_chain_key+0x150/0x1aa
    [ 104.294236] lock_acquire+0x1a6/0x1e3
    [ 104.294768] ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.295444] ? _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.296143] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.297031] _raw_spin_lock_irqsave+0x46/0x55
    [ 104.297659] ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.298335] thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.298997] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.299886] ? check_flags+0x20a/0x20a
    [ 104.300408] ? lock_acquire+0x1a6/0x1e3
    [ 104.300954] ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.301865] clone_endio+0x1bb/0x22d [dm_mod]
    [ 104.302491] ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.303200] ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.303836] ? bio_endio+0x2b2/0x2da
    [ 104.304349] clone_endio+0x1f3/0x22d [dm_mod]
    [ 104.304978] ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.305709] ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.306333] ? bio_endio+0x2b2/0x2da
    [ 104.306853] clone_endio+0x1f3/0x22d [dm_mod]
    [ 104.307476] ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.308185] ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.308817] ? bio_endio+0x2b2/0x2da
    [ 104.309319] blk_update_request+0x2de/0x4cc
    [ 104.309927] blk_mq_end_request+0x2a/0x183
    [ 104.310498] blk_done_softirq+0x16a/0x1a6
    [ 104.311051] ? blk_softirq_cpu_dead+0xe2/0xe2
    [ 104.311653] ? __lock_is_held+0x2a/0x87
    [ 104.312186] __do_softirq+0x250/0x4e8
    [ 104.312705] irq_exit+0x7e/0x11d
    [ 104.313157] call_function_single_interrupt+0xf/0x20
    [ 104.313860]
    [ 104.314163] RIP: 0010:native_safe_halt+0x2/0x3
    [ 104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8
    [ 104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
    [ 104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7
    [ 104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c
    [ 104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001
    [ 104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf
    [ 104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000
    [ 104.323307] ? lockdep_hardirqs_on+0x26b/0x278
    [ 104.323927] default_idle+0xd9/0x1a8
    [ 104.324427] do_idle+0x162/0x2b2
    [ 104.324891] ? arch_cpu_idle_exit+0x28/0x28
    [ 104.325467] ? mark_held_locks+0x28/0x7f
    [ 104.326031] ? _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.326719] cpu_startup_entry+0x1d/0x1f
    [ 104.327261] start_secondary+0x2cb/0x308
    [ 104.327806] ? set_cpu_sibling_map+0x8a3/0x8a3
    [ 104.328421] secondary_startup_64+0xa4/0xb0

    Fixes: b978962ad4f7f9 ("blkcg: update blkg_lookup_create() to do locking")
    Cc: Mike Snitzer
    Cc: Dennis Zhou
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

13 Dec, 2018

1 commit

  • Between v3 [1] and v4 [2] of the blkg association series, the
    association point moved from generic_make_request_checks(), which is
    called after the request enters the queue, to bio_set_dev(), which is when
    the bio is formed before submit_bio(). When the request_queue goes away,
    the blkgs supporting the request_queue are destroyed and then the
    q->root_blkg is set to %NULL.

    This patch adds a %NULL check to blkg_tryget_closest() to prevent the
    NPE caused by the above. It also adds a guard to see if the
    request_queue is dying when creating a blkg to prevent creating a blkg
    for a dead request_queue.

    [1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [2] https://lore.kernel.org/lkml/20181126211946.77067-1-dennis@kernel.org/

    Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
    Reported-and-tested-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

08 Dec, 2018

4 commits

  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or %NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
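
    The bool-returning tryget convention can be illustrated with a plain
    atomic counter. This is only a sketch of the calling convention:
    ref_sketch, ref_tryget and ref_put are hypothetical names, and the code
    uses C11 atomics rather than the kernel's percpu_ref:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ref_sketch {
    atomic_int count;
};

/* Take a reference only if the object is still live (count > 0). */
static bool ref_tryget(struct ref_sketch *r)
{
    int old = atomic_load(&r->count);

    while (old > 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when the last one went away. */
static bool ref_put(struct ref_sketch *r)
{
    return atomic_fetch_sub(&r->count, 1) == 1;
}
```

    Returning a bool keeps the success test at the call site explicit, in
    the same way css_tryget() callers write it.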
     
  • Every bio is now associated with a blkg putting blkg_get, blkg_try_get,
    and blkg_put on the hot path. Switch over the refcnt in blkg to use
    percpu_ref.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • There are several scenarios where blkg_lookup_create() can fail, such as
    the blkcg dying, the request_queue dying, or simply being OOM. Most
    callers handle this by simply falling back to the q->root_blkg and
    calling it a day.

    This patch implements the notion of closest blkg. During
    blkg_lookup_create(), if it fails to create, return the closest blkg
    found or the q->root_blkg. blkg_try_get_closest() is introduced and used
    during association so a bio is always attached to a blkg.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
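
    The "closest blkg" idea amounts to walking up the hierarchy until a
    usable ancestor is found, with the root as the final fallback. A minimal
    sketch, assuming a hypothetical cg_sketch node with a parent pointer and
    a has_blkg flag; it does not model reference counting or locking:

```c
#include <stdbool.h>
#include <stddef.h>

struct cg_sketch {
    struct cg_sketch *parent;   /* NULL for the root */
    bool has_blkg;              /* models "a usable blkg exists here" */
};

/* Return the nearest node, starting at cg, that has a blkg; else root. */
static struct cg_sketch *closest_with_blkg(struct cg_sketch *cg,
                                           struct cg_sketch *root)
{
    for (; cg; cg = cg->parent)
        if (cg->has_blkg)
            return cg;
    return root;
}
```

    Because the walk always ends at the root, the caller always gets
    something to attach to.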
     
  • To know when to create a blkg, the general pattern is to do a
    blkg_lookup() and if that fails, lock and do the lookup again, and if
    that fails finally create. It doesn't make much sense for everyone who
    wants to do creation to write this themselves.

    This changes blkg_lookup_create() to do locking and implement this
    pattern. The old blkg_lookup_create() is renamed to
    __blkg_lookup_create(). If a call site wants to do its own error
    handling or already owns the queue lock, they can use
    __blkg_lookup_create(). This will be used in upcoming patches.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou
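
    The lookup / lock / lookup again / create pattern described above can be
    sketched as follows. To keep the example single-threaded and
    self-contained, the lock is modelled only as a counter of slow-path
    entries; table_sketch and both function names are hypothetical, and a
    real implementation would take an actual lock at the marked points:

```c
#include <stdlib.h>

#define NR_SLOTS 8

struct table_sketch {
    void *slots[NR_SLOTS];
    int lock_acquisitions;  /* counts slow-path entries; stands in for a lock */
};

static void *lookup_sketch(struct table_sketch *t, unsigned int key)
{
    return t->slots[key % NR_SLOTS];
}

static void *lookup_create_sketch(struct table_sketch *t, unsigned int key)
{
    void *obj = lookup_sketch(t, key);      /* fast path, no lock taken */

    if (obj)
        return obj;

    t->lock_acquisitions++;                 /* a real lock would be taken here */
    obj = lookup_sketch(t, key);            /* re-check under the lock */
    if (!obj) {
        obj = malloc(1);
        t->slots[key % NR_SLOTS] = obj;     /* stays NULL if malloc failed */
    }
    /* a real lock would be released here */
    return obj;
}
```

    Folding the pattern into one helper means callers that only want "find
    or create" no longer need to repeat the locking themselves.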
     

16 Nov, 2018

6 commits


08 Nov, 2018

2 commits


02 Nov, 2018

1 commit

  • This reverts a series committed earlier due to the null pointer
    exception bug report in [1]. It seems there are edge case interactions
    that I did not consider, and I will need some time to understand what
    causes the adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

01 Oct, 2018

1 commit

  • Merge -rc6 in, for two reasons:

    1) Resolve a trivial conflict in the blk-mq-tag.c documentation
    2) A few important regression fixes went into upstream directly, so
    they aren't in the 4.20 branch.

    Signed-off-by: Jens Axboe

    * tag 'v4.19-rc6': (780 commits)
    Linux 4.19-rc6
    MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
    cpufreq: qcom-kryo: Fix section annotations
    perf/core: Add sanity check to deal with pinned event failure
    xen/blkfront: correct purging of persistent grants
    Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
    selftests/powerpc: Fix Makefiles for headers_install change
    blk-mq: I/O and timer unplugs are inverted in blktrace
    dax: Fix deadlock in dax_lock_mapping_entry()
    x86/boot: Fix kexec booting failure in the SEV bit detection code
    bcache: add separate workqueue for journal_write to avoid deadlock
    drm/amd/display: Fix Edid emulation for linux
    drm/amd/display: Fix Vega10 lightup on S3 resume
    drm/amdgpu: Fix vce work queue was not cancelled when suspend
    Revert "drm/panel: Add device_link from panel device to DRM device"
    xen/blkfront: When purging persistent grants, keep them in the buffer
    clocksource/drivers/timer-atmel-pit: Properly handle error cases
    block: fix deadline elevator drain for zoned block devices
    ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Sep, 2018

1 commit

  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)