10 Sep, 2020

1 commit

  • commit 5aeac7c4b16069aae49005f0a8d4526baa83341b upstream.

    ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled
    when it can be called with irq disabled or enabled. This has a small chance
    of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations
    instead.

    Signed-off-by: Tejun Heo
    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Cc: stable@vger.kernel.org # v5.4+
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

19 Aug, 2020

1 commit

  • [ Upstream commit d9012a59db54442d5b2fcfdfcded35cf566397d3 ]

    We shouldn't skip iocg when its abs_vdebt is not zero.

    Fixes: 0b80f9866e6b ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock")
    Signed-off-by: Chengming Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Chengming Zhou
     

22 Jun, 2020

1 commit

  • [ Upstream commit 81ca627a933063fa63a6d4c66425de822a2ab7f5 ]

    When the QoS targets are met and nothing is being throttled, there's
    no way to tell how saturated the underlying device is - it could be
    almost entirely idle, at the cusp of saturation or anywhere inbetween.
    Given that there's no information, it's best to keep vrate as-is in
    this state. Before 7cd806a9a953 ("iocost: improve nr_lagging
    handling"), this was the case - if the device isn't missing QoS
    targets and nothing is being throttled, busy_level was reset to zero.

    While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve
    nr_lagging handling") broke this. Now, while the device is hitting
    QoS targets and nothing is being throttled, vrate keeps getting
    adjusted according to the existing busy_level.

    This led to vrate keeping climing till it hits max when there's an IO
    issuer with limited request concurrency if the vrate started low.
    vrate starts getting adjusted upwards until the issuer can issue IOs
    w/o being throttled. From then on, QoS targets keeps getting met and
    nothing on the system needs throttling and vrate keeps getting
    increased due to the existing busy_level.

    This patch makes the following changes to the busy_level logic.

    * Reset busy_level if nr_shortages is zero to avoid the above
    scenario.

    * Make non-zero nr_lagging block lowering nr_level but still clear
    positive busy_level if there's clear non-saturation signal - QoS
    targets are met and nr_shortages is non-zero. nr_lagging's role is
    preventing adjusting vrate upwards while there are long-running
    commands and it shouldn't keep busy_level positive while there's
    clear non-saturation signal.

    * Restructure code for clarity and add comments.

    Signed-off-by: Tejun Heo
    Reported-by: Andy Newell
    Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling")
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tejun Heo
     

14 May, 2020

1 commit

  • commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.

    abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
    is and controls the activation of use_delay mechanism. Once a cgroup goes
    over budget from forced IOs, it has to pay it back with its future budget.
    The progress guarantee on debt paying comes from the iocg being active -
    active iocgs are processed by the periodic timer, which ensures that as time
    passes the debts dissipate and the iocg returns to normal operation.

    However, both iocg activation and vdebt handling are asynchronous and a
    sequence like the following may happen.

    1. The iocg is in the process of being deactivated by the periodic timer.

    2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
    without anything because it still sees that the iocg is already active.

    3. The iocg is deactivated.

    4. The bio from #2 is over budget but needs to be forced. It increases
    abs_vdebt and goes over the threshold and enables use_delay.

    5. IO control is enabled for the iocg's subtree and now IOs are attributed
    to the descendant cgroups and the iocg itself no longer issues IOs.

    This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no
    further IOs which can activate it. This can end up unduly punishing all the
    descendants cgroups.

    The usual throttling path has the same issue - the iocg must be active while
    throttled to ensure that future event will wake it up - and solves the
    problem by synchronizing the throttling path with a spinlock. abs_vdebt
    handling is another form of overage handling and shares a lot of
    characteristics including the fact that it isn't in the hottest path.

    This patch fixes the above and other possible races by strictly
    synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.

    Signed-off-by: Tejun Heo
    Reported-by: Vlad Dmitriev
    Cc: stable@vger.kernel.org # v5.4+
    Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

02 May, 2020

1 commit

  • commit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream.

    Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
    argument of the iocost_ioc_vrate_adj trace entry defined in
    include/trace/events/iocost.h leading to the following error:

    /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
    error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
    , u32[]* __tracepoint_arg_missed_ppm

    That argument type is indeed rather complex and hard to read. Looking
    at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying
    the argument to a simple "u32 *missed_ppm" and adjusting the trace
    entry accordingly, the compilation error was gone.

    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Tejun Heo
    Signed-off-by: Waiman Long
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

18 Mar, 2020

1 commit


31 Dec, 2019

1 commit

  • commit d7bd15a138aef3be227818aad9c501e43c89c8c5 upstream.

    When over-budget IOs are force-issued through root cgroup,
    iocg_kick_delay() adjusts the async delay accordingly but doesn't
    actually schedule async throttle for the issuing task. This bug is
    pretty well masked because sooner or later the offending threads are
    gonna get directly throttled on regular IOs or have async delay
    scheduled by mem_cgroup_throttle_swaprate().

    However, it can affect control quality on filesystem metadata heavy
    operations. Let's fix it by invoking blkcg_schedule_throttle() when
    iocg_kick_delay() says async delay is needed.

    Signed-off-by: Tejun Heo
    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Cc: stable@vger.kernel.org
    Reported-by: Josef Bacik
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

15 Nov, 2019

1 commit


01 Nov, 2019

1 commit

  • This code causes a static analysis warning:

    block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq'

    We disable IRQs in blkg_conf_prep() and re-enable them in
    blkg_conf_finish(). IRQ disable/enable should not be nested because
    that means the IRQs will be enabled at the first unlock instead of the
    second one.

    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Acked-by: Tejun Heo
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe

    Dan Carpenter
     

26 Sep, 2019

3 commits

  • The default hard disk param sets latency targets at 50ms. As the
    default target percentiles are zero, these don't directly regulate
    vrate; however, they're still used to calculate the period length -
    100ms in this case.

    This is excessively low. A SATA drive with QD32 saturated with random
    IOs can easily reach avg completion latency of several hundred msecs.
    A period duration which is substantially lower than avg completion
    latency can lead to wildly fluctuating vrate.

    Let's bump up the default latency targets to 250ms so that the period
    duration is sufficiently long.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Some IOs may span multiple periods. As latencies are collected on
    completion, the inbetween periods won't register them and may
    incorrectly decide to increase vrate. nr_lagging tracks these IOs to
    avoid those situations. Currently, whenever there are IOs which are
    spanning from the previous period, busy_level is reset to 0 if
    negative thus suppressing vrate increase.

    This has the following two problems.

    * When latency target percentiles aren't set, vrate adjustment should
    only be governed by queue depth depletion; however, the current code
    keeps nr_lagging active which pulls in latency results and can keep
    down vrate unexpectedly.

    * When lagging condition is detected, it resets the entire negative
    busy_level. This turned out to be way too aggressive on some
    devices which sometimes experience extended latencies on a small
    subset of commands. In addition, a lagging IO will be accounted as
    latency target miss on completion anyway and resetting busy_level
    amplifies its impact unnecessarily.

    This patch fixes the above two problems by disabling nr_lagging
    counting when latency target percentiles aren't set and blocking vrate
    increases when there are lagging IOs while leaving busy_level as-is.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • vrate_adj tracepoint traces vrate changes; however, it does so only
    when busy_level is non-zero. busy_level turning to zero can sometimes
    be as interesting an event. This patch also enables vrate_adj
    tracepoint on other vrate related events - busy_level changes and
    non-zero nr_lagging.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Sep, 2019

4 commits

  • Report debt and rename del_ms row to delay for consistency.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Merges have the same problem that forced-bios had which is fixed by
    the previous patch. The cost of a merge is calculated at the time of
    issue and force-advances vtime into the future. Until global vtime
    catches up, how the cgroup's hweight changes in the meantime doesn't
    matter and it often leads to situations where the cost is calculated
    at one hweight and paid at a very different one. See the previous
    patch for more details.

    Fix it by never advancing vtime into the future for merges. If budget
    is available, vtime is advanced. Otherwise, the cost is charged as
    debt.

    This brings merge cost handling in line with issue cost handling in
    ioc_rqos_throttle().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, when a bio needs to be force-charged and there isn't enough
    budget, vtime is simply pushed into the future. This means that the
    cost of the whole bio is scaled using the current hweight and then
    charged immediately. Until the global vtime advances beyond this
    future vtime, the cgroup won't be allowed to issue normal IOs.

    This is incorrect and can lead to, for example, exploding vrate or
    extended stalls if vrate range is constrained. Consider the following
    scenario.

    1. A cgroup with a very low hweight runs out of budget.

    2. A storm of swap-out happens on it. All of them are scaled
    according to the current low hweight and charged to vtime pushing
    it to a far future.

    3. All other cgroups go idle and now the above cgroup has access to
    the whole device. However, because vtime is already wound using
    the past low hweight, what its current hweight is doesn't matter
    until global vtime catches up to the local vtime.

    4. As a result, either vrate gets ramped up extremely or the IOs stall
    while the underlying device is idle.

    This is because the hweight the overage is calculated at is different
    from the hweight that it's being paid at.

    Fix it by remembering the overage in absoulte vtime and continuously
    paying with the actual budget according to the current hweight at each
    period.

    Note that non-forced bios which wait already remembers the cost in
    absolute vtime. This brings forced-bio accounting in line.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • ioc_pd_free() first cancels the hrtimers and then deactivates the
    iocg. However, the iocg timer can run inbetween and reschedule the
    hrtimers which will end up running after the iocg is freed leading to
    crashes like the following.

    general protection fault: 0000 [#1] SMP
    ...
    RIP: 0010:iocg_kick_delay+0xbe/0x1b0
    RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
    RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
    RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
    R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
    FS: 0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    iocg_delay_timer_fn+0x3d/0x60
    __hrtimer_run_queues+0xfe/0x270
    hrtimer_interrupt+0xf4/0x210
    smp_apic_timer_interrupt+0x5e/0x120
    apic_timer_interrupt+0xf/0x20

    Fix it by canceling hrtimers after deactivating the iocg.

    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Reported-by: Dave Jones
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

30 Aug, 2019

1 commit


29 Aug, 2019

4 commits

  • blk_iocost_init() forgot to free its percpu stat on the error path.
    Fix it.

    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Reported-by: Hillf Danton
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add a script which can be used to generate device-specific iocost
    linear model coefficients.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Instead of mucking with debugfs and ->pd_stat(), add drgn based
    monitoring script.

    Signed-off-by: Tejun Heo
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • This patchset implements IO cost model based work-conserving
    proportional controller.

    While io.latency provides the capability to comprehensively prioritize
    and protect IOs depending on the cgroups, its protection is binary -
    the lowest latency target cgroup which is suffering is protected at
    the cost of all others. In many use cases including stacking multiple
    workload containers in a single system, it's necessary to distribute
    IO capacity with better granularity.

    One challenge of controlling IO resources is the lack of trivially
    observable cost metric. The most common metrics - bandwidth and iops
    - can be off by orders of magnitude depending on the device type and
    IO pattern. However, the cost isn't a complete mystery. Given
    several key attributes, we can make fairly reliable predictions on how
    expensive a given stream of IOs would be, at least compared to other
    IO patterns.

    The function which determines the cost of a given IO is the IO cost
    model for the device. This controller distributes IO capacity based
    on the costs estimated by such model. The more accurate the cost
    model the better but the controller adapts based on IO completion
    latency and as long as the relative costs across differents IO
    patterns are consistent and sensible, it'll adapt to the actual
    performance of the device.

    Currently, the only implemented cost model is a simple linear one with
    a few sets of default parameters for different classes of device.
    This covers most common devices reasonably well. All the
    infrastructure to tune and add different cost models is already in
    place and a later patch will also allow using bpf progs for cost
    models.

    Please see the top comment in blk-iocost.c and documentation for
    more details.

    v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

    Signed-off-by: Tejun Heo
    Cc: Andy Newell
    Cc: Josef Bacik
    Cc: Rik van Riel
    Signed-off-by: Jens Axboe

    Tejun Heo