29 Aug, 2019

2 commits


17 Jul, 2019

1 commit

  • Currently, ->pd_stat() is called only when moduleparam
    blkcg_debug_stats is set which prevents it from printing non-debug
    policy-specific statistics. Let's move debug testing down so that
    ->pd_stat() can print non-debug stat too. This patch doesn't cause
    any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 Jul, 2019

1 commit

  • The iolatency controller is based on rq_qos. It increments on
    rq_qos_throttle() and decrements on either rq_qos_cleanup() or
    rq_qos_done_bio(). a3fb01ba5af0 fixes the double accounting issue where
    blk_mq_make_request() may call both rq_qos_cleanup() and
    rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the
    double decrement.

    The above works upstream as the only way we can get STS_AGAIN is from
    blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
    problem as bio_endio() skipping only happens on reserved tag allocation
    failures which can only be caused by driver bugs and already triggers
    WARN.

    However, the fix creates a not so great dependency on how STS_AGAIN can
    be propagated. Internally, we (Facebook) carry a patch that kills read
    ahead if a cgroup is io congested or a fatal signal is pending. This
    combined with chained bios progagate their bi_status to the parent is
    not already set can can cause the parent bio to not clean up properly
    even though it was successful. This consequently leaks the inflight
    counter and can hang all IOs under that blkg.

    To nip the adverse interaction early, this removes the rq_qos_cleanup()
    callback in iolatency in favor of cleaning up always on the
    rq_qos_done_bio() path.

    Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios")
    Debugged-by: Tejun Heo
    Debugged-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

20 Jun, 2019

1 commit

  • As is, iolatency recognizes done_bio and cleanup as ending paths. If a
    request is marked REQ_NOWAIT and fails to get a request, the bio is
    cleaned up via rq_qos_cleanup() and ended in bio_wouldblock_error().
    This results in underflowing the inflight counter. Fix this by only
    accounting bios that were actually submitted.

    Signed-off-by: Dennis Zhou
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

16 Jun, 2019

1 commit

  • If use_delay was non-zero when the latency target of a cgroup was set
    to zero, it will stay stuck until io.latency is enabled on the cgroup
    again. This keeps readahead disabled for the cgroup impacting
    performance negatively.

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 May, 2019

1 commit


21 Mar, 2019

1 commit

  • This patch avoids that the following warning is reported when building
    with W=1:

    block/blk-iolatency.c:734:5: warning: no previous prototype for 'blk_iolatency_init' [-Wmissing-prototypes]

    Cc: Josef Bacik
    Fixes: d70675121546 ("block: introduce blk-iolatency io controller") # v4.19
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Feb, 2019

2 commits

  • This is to catch any unexpected negative value of inflight IO counter.

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     
  • Our test reported the following stack, and vmcore showed that
    ->inflight counter is -1.

    [ffffc9003fcc38d0] __schedule at ffffffff8173d95d
    [ffffc9003fcc3958] schedule at ffffffff8173de26
    [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
    [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
    [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
    [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
    [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
    [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
    [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
    [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
    [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
    [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
    [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
    [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
    [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
    [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
    [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
    [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
    [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e

    The ->inflight counter may be negative (-1) if

    1) blk-iolatency was disabled when the IO was issued,

    2) blk-iolatency was enabled before this IO reached its endio,

    3) the ->inflight counter is decreased from 0 to -1 in endio()

    In fact the hang can be easily reproduced by the below script,

    H=/sys/fs/cgroup/unified/
    P=/sys/fs/cgroup/unified/test

    echo "+io" > $H/cgroup.subtree_control
    mkdir -p $P

    echo $$ > $P/cgroup.procs

    xfs_io -f -d -c "pwrite 0 4k" /dev/sdg

    echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency

    xfs_io -f -d -c "pwrite 0 4k" /dev/sdg

    This fixes the problem by freezing the queue so that while
    enabling/disabling iolatency, there is no inflight rq running.

    Note that quiesce_queue is not needed as this only updating iolatency
    configuration about which dispatching request_queue doesn't care.

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     

18 Dec, 2018

1 commit

  • The blk-iolatency controller measures the time from rq_qos_throttle() to
    rq_qos_done_bio() and attributes this time to the first bio that needs
    to create the request. This means if a bio is plug-mergeable or
    bio-mergeable, it gets to bypass the blk-iolatency controller.

    The recent series [1], to tag all bios w/ blkgs undermined how iolatency
    was determining which bios it was charging and should process in
    rq_qos_done_bio(). Because all bios are being tagged, this caused the
    atomic_t for the struct rq_wait inflight count to underflow and result
    in a stall.

    This patch adds a new flag BIO_TRACKED to let controllers know that a
    bio is going through the rq_qos path. blk-iolatency now checks if this
    flag is set to see if it should process the bio in rq_qos_done_bio().

    Overloading BLK_QUEUE_ENTERED works, but makes the flag rules confusing.
    BIO_THROTTLED was another candidate, but the flag is set for all bios
    that have gone through blk-throttle code. Overloading a flag comes with
    the burden of making sure that when either implementation changes, a
    change in setting rules for one doesn't cause a bug in the other. So
    here, we unfortunately opt for adding a new flag.

    [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/

    Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
    Signed-off-by: Dennis Zhou
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

08 Dec, 2018

8 commits

  • Now that we have this common helper, convert io-latency over to use it
    as well.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or %NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • bio_issue_init among other things initializes the timestamp for an IO.
    Rather than have this logic handled by policies, this consolidates it to
    be on the init paths (normal, clone, bounce clone).

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • Previously, blkg association was handled by controller specific code in
    blk-throttle and blk-iolatency. However, because a blkg represents a
    relationship between a blkcg and a request_queue, it makes sense to keep
    the blkg->q and bio->bi_disk->queue consistent.

    This patch moves association into the bio_set_dev macro(). This should
    cover the majority of cases where the device is set/changed keeping the
    two pointers consistent. Fallback code is added to
    blkcg_bio_issue_check() to catch any missing paths.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • There are 3 ways blkg association can happen: association with the
    current css, with the page css (swap), or from the wbc css (writeback).

    This patch handles how association is done for the first case where we
    are associating bsaed on the current css. If there is already a blkg
    associated, the css will be reused and association will be redone as the
    request_queue may have changed.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • There are several scenarios where blkg_lookup_create() can fail such as
    the blkcg dying, request_queue is dying, or simply being OOM. Most
    handle this by simply falling back to the q->root_blkg and calling it a
    day.

    This patch implements the notion of closest blkg. During
    blkg_lookup_create(), if it fails to create, return the closest blkg
    found or the q->root_blkg. blkg_try_get_closest() is introduced and used
    during association so a bio is always attached to a blkg.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • To know when to create a blkg, the general pattern is to do a
    blkg_lookup() and if that fails, lock and do the lookup again, and if
    that fails finally create. It doesn't make much sense for everyone who
    wants to do creation to write this themselves.

    This changes blkg_lookup_create() to do locking and implement this
    pattern. The old blkg_lookup_create() is renamed to
    __blkg_lookup_create(). If a call site wants to do its own error
    handling or already owns the queue lock, they can use
    __blkg_lookup_create(). This will be used in upcoming patches.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • The bio_blkcg() function turns out to be inconsistent and consequently
    dangerous to use. The first part returns a blkcg where a reference is
    owned by the bio meaning it does not need to be rcu protected. However,
    the third case, the last line, is problematic:

    return css_to_blkcg(task_css(current, io_cgrp_id));

    This can race against task migration and the cgroup dying. It is also
    semantically different as it must be called rcu protected and is
    susceptible to failure when trying to get a reference to it.

    This patch adds association ahead of calling bio_blkcg() rather than
    after. This makes association a required and explicit step along the
    code paths for calling bio_blkcg(). In blk-iolatency, association is
    moved above the bio_blkcg() call to ensure it will not return %NULL.

    BFQ uses the old bio_blkcg() function, but I do not want to address it
    in this series due to the complexity. I have created a private version
    documenting the inconsistency and noting not to use it.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

16 Nov, 2018

2 commits


03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

1 commit

  • This reverts a series committed earlier due to null pointer exception
    bug report in [1]. It seems there are edge case interactions that I did
    not consider and will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

27 Oct, 2018

1 commit

  • There are several definitions of those functions/macros in places that
    mess with fixed-point load averages. Provide an official version.

    [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
    Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Suren Baghdasaryan
    Tested-by: Daniel Drake
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

29 Sep, 2018

5 commits

  • We apply a smoothing to the scale changes in order to keep sawtoothy
    behavior from occurring. However our window for checking if we've
    missed our target can sometimes be lower than the smoothing interval
    (500ms), especially on faster drives like ssd's. In order to deal with
    this keep track of the running tally of the previous intervals that we
    threw away because we had already done a scale event recently.

    This is needed for the ssd case as these low latency drives will have
    bursts of latency, and if it happens to be ok for the window that
    directly follows the opening of the scale window we could unthrottle
    when previous windows we were missing our target.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We use an average latency approach for determining if we're missing our
    latency target. This works well for rotational storage where we have
    generally consistent latencies, but for ssd's and other low latency
    devices you have more of a spikey behavior, which means we often won't
    throttle misbehaving groups because a lot of IO completes at drastically
    faster times than our latency target. Instead keep track of how many
    IO's miss our target and how many IO's are done in our time window. If
    the p(90) latency is above our target then we know we need to throttle.
    With this change in place we are seeing the same throttling behavior
    with our testcase on ssd's as we see with rotational drives.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • There is logic to keep cgroups that haven't done a lot of IO in the most
    recent scale window from being punished for over-active higher priority
    groups. However for things like ssd's where the windows are pretty
    short we'll end up with small numbers of samples, so 5% of samples will
    come out to 0 if there aren't enough. Make the floor 1 sample to keep
    us from improperly bailing out of scaling down.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Hitting the case where blk_queue_depth() returned 1 uncovered the fact
    that iolatency doesn't actually handle this case properly, it simply
    doesn't scale down anybody. For this case we should go straight into
    applying the time delay, which we weren't doing. Since we already limit
    the floor at 1 request this if statement is not needed, and this allows
    us to set our depth to 1 which allows us to apply the delay if needed.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We were using blk_queue_depth() assuming that it would return
    nr_requests, but we hit a case in production on drives that had to have
    NCQ turned off in order for them to not shit the bed which resulted in a
    qd of 1, even though the nr_requests was much larger. iolatency really
    only cares about requests we are allowed to queue up, as any io that
    get's onto the request list is going to be serviced soonish, so we want
    to be throttling before the bio gets onto the request list. To make
    iolatency work as expected, simply use q->nr_requests instead of
    blk_queue_depth() as that is what we actually care about.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

22 Sep, 2018

5 commits

  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • bio_issue_init among other things initializes the timestamp for an IO.
    Rather than have this logic handled by policies, this consolidates it to
    be on the init paths (normal, clone, bounce clone).

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • Previously, blkg's were only assigned as needed by blk-iolatency and
    blk-throttle. bio->css was also always being associated while blkg was
    being looked up and then thrown away in blkcg_bio_issue_check.

    This patch begins the cleanup of bio->css and bio->bi_blkg by always
    associating a blkg in blkcg_bio_issue_check. This tries to create the
    blkg, but if it is not possible, falls back to using the root_blkg of
    the request_queue. Therefore, a bio will always be associated with a
    blkg. The duplicate association logic is removed from blk-throttle and
    blk-iolatency.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • To know when to create a blkg, the general pattern is to do a
    blkg_lookup and if that fails, lock and then do a lookup again and if
    that fails finally create. It doesn't make much sense for everyone who
    wants to do creation to write this themselves.

    This changes blkg_lookup_create to do locking and implement this
    pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create.
    If a call site wants to do its own error handling or already owns the
    queue lock, they can use __blkg_lookup_create. This will be used in
    upcoming patches.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • The accessor function bio_blkcg either returns the blkcg associated with
    the bio or finds one in the current context. This can cause an issue
    when trying to associate a bio with a blkcg. Particularly, it's the
    third case that is problematic:

    return css_to_blkcg(task_css(current, io_cgrp_id));

    As the above may race against task migration and the cgroup exiting, it
    is not always ok to take a reference on the blkcg returned from
    bio_blkcg.

    This patch adds association ahead of calling bio_blkcg rather than
    after. This makes association a required and explicit step along the
    code paths for calling bio_blkcg. blk_get_rl is modified as well to get
    a reference to the blkcg it may use and blk_put_rl will always put the
    reference back. Association is also moved above the bio_blkcg call to
    ensure it will not return NULL in blk-iolatency.

    BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
    to address this in this series. I've created a private version of the
    function with notes not to use it describing the flaw. Hopefully soon,
    that code can be cleaned up.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

14 Sep, 2018

1 commit

  • Fixes gcc '-Wunused-but-set-variable' warning:

    block/blk-iolatency.c: In function 'scale_change':
    block/blk-iolatency.c:301:7: warning:
    variable 'changed' set but not used [-Wunused-but-set-variable]

    block/blk-iolatency.c: In function 'iolatency_set_limit':
    block/blk-iolatency.c:765:24: warning:
    variable 'blkiolat' set but not used [-Wunused-but-set-variable]

    Signed-off-by: YueHaibing
    Signed-off-by: Jens Axboe

    YueHaibing
     

02 Aug, 2018

1 commit

  • Currently, avg_lat is calculated by accumulating the mean of every
    window in a long running cumulative average. As time goes on, the metric
    becomes less and less useful due to the accumulated history.

    This patch reuses the same calculation done in load averages to make the
    avg_lat metric more lively. Unlike load averages, the avg only advances
    when a window elapses (due to an io). Idle periods extend the most
    recent window. Bucketing is used to limit the history of avg_lat by
    binding it to the window size. So, the window range for 1/exp (decay
    rate) is [1 min, 2.5 min) when windows elapse immediately.

    The current sample window size is exposed in the debug info to enable
    calculation of the window range.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

01 Aug, 2018

1 commit


17 Jul, 2018

2 commits

  • In our longer tests we noticed that some boxes would degrade to the
    point of uselessness. This is because we truncate the current time when
    saving it in our bio, but I was using the raw current time to subtract
    from. So once the box had been up a certain amount of time it would
    appear as if our IO's were taking several years to complete. Fix this
    by truncating the current time so it matches the issue time. Verified
    this worked by running with this patch for a week on our test tier.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Early versions of these patches had us waiting for seconds at a time
    during submission, so we had to adjust the timing window we monitored
    for latency. Now we don't do things like that so this is unnecessary
    code.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

11 Jul, 2018

1 commit

  • max_depth used to be a u64, but I changed it to a unsigned int but
    didn't convert my comparisons over everywhere. Fix by using UINT_MAX
    everywhere instead of (u64)-1.

    Reported-by: Dan Carpenter
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik