17 Aug, 2022

1 commit

  • [ Upstream commit 14a6e2eb7df5c7897c15b109cba29ab0c4a791b6 ]

    In our test of iocost, we encountered some list add/del corruptions of
    inner_walk list in ioc_timer_fn.

    The reason can be described as follows:

    cpu 0                                   cpu 1
    ioc_qos_write                           ioc_qos_write

    ioc = q_to_ioc(queue);
    if (!ioc) {
            ioc = kzalloc();
                                            ioc = q_to_ioc(queue);
                                            if (!ioc) {
                                                    ioc = kzalloc();
                                                    ...
                                                    rq_qos_add(q, rqos);
                                            }
            ...
            rq_qos_add(q, rqos);
            ...
    }

    When the io.cost.qos file is written by two cpus concurrently, rq_qos may
    be added to one disk twice. In that case, there will be two iocs enabled
    and running on one disk. They own different iocgs on their active lists.
    In the ioc_timer_fn function, because the iocgs from the two iocs share
    the same root iocg, the root iocg's walk_list may be overwritten by each
    other, which leads to list add/del corruptions when building or
    destroying the inner_walk list.

    So far, the blk-rq-qos framework has implicitly assumed one rq_qos
    instance of each type per queue. This patch makes that assumption
    explicit and also fixes the crash described above.
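
    To illustrate the shape of that fix, here is a minimal user-space sketch
    (simplified types and hypothetical names, not the kernel code): rq_qos_add()
    checks under a lock whether an rq_qos of the same type is already attached
    and returns -EBUSY, so the losing writer frees its allocation instead of
    installing a duplicate.

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified stand-ins for struct request_queue and struct rq_qos. */
    struct rq_qos { int id; struct rq_qos *next; };
    struct request_queue { pthread_mutex_t lock; struct rq_qos *rqos_list; };

    /* Refuse to attach a second rq_qos of the same id to one queue. */
    static int rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
    {
        struct rq_qos *cur;
        int ret = 0;

        pthread_mutex_lock(&q->lock);
        for (cur = q->rqos_list; cur; cur = cur->next) {
            if (cur->id == rqos->id) {          /* already attached */
                ret = -EBUSY;
                goto out;
            }
        }
        rqos->next = q->rqos_list;
        q->rqos_list = rqos;
    out:
        pthread_mutex_unlock(&q->lock);
        return ret;
    }

    int main(void)
    {
        struct request_queue q = { PTHREAD_MUTEX_INITIALIZER, NULL };
        struct rq_qos *a = calloc(1, sizeof(*a));
        struct rq_qos *b = calloc(1, sizeof(*b));
        int ret;

        a->id = b->id = 1;                      /* both pretend to be blk-iocost */
        printf("first add:  %d\n", rq_qos_add(&q, a));  /* 0 */
        ret = rq_qos_add(&q, b);
        printf("second add: %d\n", ret);                /* -EBUSY */
        if (ret)
            free(b);                            /* the racing loser backs out */
        free(a);
        return 0;
    }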

    Signed-off-by: Jinke Han
    Reviewed-by: Muchun Song
    Acked-by: Tejun Heo
    Cc:
    Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jinke Han
     

12 Jul, 2022

1 commit

  • [ Upstream commit aa1b46dcdc7baaf5fec0be25782ef24b26aa209e ]

    a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't
    tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set.
    While this fixed a potential oops, it also broke blk-iocost by skipping the
    done_bio callback for merged bios.

    Before, a bio that went through either rq_qos_throttle() or
    rq_qos_merge() would have rq_qos_done_bio() called on it at completion,
    with BIO_TRACKED distinguishing the former from the latter. After the
    change, rq_qos_done_bio() is no longer called for bios which went
    through rq_qos_merge(). This royally confuses blk-iocost, as the merged
    bios never finish and are considered perpetually in-flight.

    One reliably reproducible failure mode is an intermediate cgroup getting
    stuck active, preventing its children from being activated due to the
    leaf-only rule, leading to loss of control. The following is from the
    resctl-bench protection scenario, which emulates isolating a web-server-like
    workload from a memory bomb, run on an iocost configuration that should
    yield a reasonable level of protection.

    # cat /sys/block/nvme2n1/device/model
    Samsung SSD 970 PRO 512GB
    # cat /sys/fs/cgroup/io.cost.model
    259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025
    # cat /sys/fs/cgroup/io.cost.qos
    259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00
    # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
    ...
    Memory Hog Summary
    ==================

    IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m
                W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m

    Isolation and Request Latency Impact Distributions:

                min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
    isol%     15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82
    lat-imp%      0     0     0     0     0  4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6

    Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96%

    The isolation result of 58.12% is close to what this device would show
    without any IO control.

    Fix it by introducing a new flag, BIO_QOS_MERGED, to mark merged bios and
    calling rq_qos_done_bio() on them too. For consistency and clarity, rename
    BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into
    rq_qos_done_bio() so that they sit next to the code paths that set the flags.
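
    A rough user-space model of the resulting completion check (flag names and
    counters simplified, not the kernel structures): both the throttle and the
    merge path mark the bio, and the single check in the done path runs the
    accounting for either flag.

    #include <stdio.h>

    enum { QOS_THROTTLED = 1 << 0, QOS_MERGED = 1 << 1 };

    struct bio { unsigned int flags; };

    static int inflight;        /* roughly what blk-iocost tracks per group */

    static void qos_throttle(struct bio *bio) { bio->flags |= QOS_THROTTLED; inflight++; }
    static void qos_merge(struct bio *bio)    { bio->flags |= QOS_MERGED;    inflight++; }

    /* Completion accounting runs for throttled *or* merged bios. */
    static void qos_done_bio(struct bio *bio)
    {
        if (bio->flags & (QOS_THROTTLED | QOS_MERGED))
            inflight--;
    }

    int main(void)
    {
        struct bio issued = { 0 }, merged = { 0 };

        qos_throttle(&issued);
        qos_merge(&merged);                     /* previously never "finished" */
        qos_done_bio(&issued);
        qos_done_bio(&merged);
        printf("inflight = %d\n", inflight);    /* 0, nothing stuck */
        return 0;
    }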

    With the patch applied, the above same benchmark shows:

    # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
    ...
    Memory Hog Summary
    ==================

    IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m
                W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m

    Isolation and Request Latency Impact Distributions:

                min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
    isol%     84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42  2.81
    lat-imp%      0     0     0     0     0  2.81  5.73 11.11 13.92 17.53 22.61  4.10  4.68

    Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0%

    Signed-off-by: Tejun Heo
    Fixes: a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked")
    Cc: stable@vger.kernel.org # v5.15+
    Cc: Ming Lei
    Cc: Yu Kuai
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tejun Heo
     

09 Jun, 2022

1 commit

  • commit 8a177a36da6c54c98b8685d4f914cb3637d53c0d upstream.

    iolatency needs to track the number of inflight IOs per cgroup. As this
    tracking can be expensive, it is disabled when no cgroup has iolatency
    configured for the device. To ensure that the inflight counters stay
    balanced, iolatency_set_limit() freezes the request_queue while manipulating
    the enabled counter, which ensures that no IO is in flight and thus all
    counters are zero.

    Unfortunately, iolatency_set_limit() isn't the only place where the enabled
    counter is manipulated. iolatency_pd_offline() can also decrement the
    counter and trigger disabling. As this disabling happens without freezing
    the queue, it can easily happen while some IOs are in flight and thus leak
    the counts.

    This can be easily demonstrated by turning on iolatency in one empty
    cgroup while IOs are in flight in other cgroups and then removing the
    cgroup. Note that iolatency shouldn't have been enabled elsewhere in the
    system, to ensure that removing the cgroup disables iolatency for the
    whole device.

    The following keeps flipping on and off iolatency on sda:

    echo +io > /sys/fs/cgroup/cgroup.subtree_control
    while true; do
        mkdir -p /sys/fs/cgroup/test
        echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
        sleep 1
        rmdir /sys/fs/cgroup/test
        sleep 1
    done

    and there's concurrent fio generating direct rand reads:

    fio --name test --filename=/dev/sda --direct=1 --rw=randread \
    --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

    while monitoring with the following drgn script:

    while True:
        for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
            for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
                blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
                pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
                if pd.value_() == 0:
                    continue
                iolat = container_of(pd, 'struct iolatency_grp', 'pd')
                inflight = iolat.rq_wait.inflight.counter.value_()
                if inflight:
                    print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                          f'{cgroup_path(css.cgroup).decode("utf-8")}')
        time.sleep(1)

    The monitoring output looks like the following:

    inflight=1 sda /user.slice
    inflight=1 sda /user.slice
    ...
    inflight=14 sda /user.slice
    inflight=13 sda /user.slice
    inflight=17 sda /user.slice
    inflight=15 sda /user.slice
    inflight=18 sda /user.slice
    inflight=17 sda /user.slice
    inflight=20 sda /user.slice
    inflight=19 sda /user.slice

    Cc: Josef Bacik
    Cc: Liu Bo
    Fixes: 8c772a9bfc7c ("blk-iolatency: fix IO hang due to negative inflight counter")
    Cc: stable@vger.kernel.org # v5.0+
    Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

31 Aug, 2021

1 commit

  • Pull block updates from Jens Axboe:
    "Nothing major in here - lots of good cleanups and tech debt handling,
    which is also evident in the diffstats. In particular:

    - Add disk sequence numbers (Matteo)

    - Discard merge fix (Ming)

    - Relax disk zoned reporting restrictions (Niklas)

    - Bio error handling zoned leak fix (Pavel)

    - Start of proper add_disk() error handling (Luis, Christoph)

    - blk crypto fix (Eric)

    - Non-standard GPT location support (Dmitry)

    - IO priority improvements and cleanups (Damien)

    - blk-throtl improvements (Chunguang)

    - diskstats_show() stack reduction (Abd-Alrhman)

    - Loop scheduler selection (Bart)

    - Switch block layer to use kmap_local_page() (Christoph)

    - Remove obsolete disk_name helper (Christoph)

    - block_device refcounting improvements (Christoph)

    - Ensure gendisk always has a request queue reference (Christoph)

    - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

    * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
    sg: pass the device name to blk_trace_setup
    block, bfq: cleanup the repeated declaration
    blk-crypto: fix check for too-large dun_bytes
    blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
    blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
    block: mark blkdev_fsync static
    block: refine the disk_live check in del_gendisk
    mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
    mmc: block: Support alternative_gpt_sector() operation
    partitions/efi: Support non-standard GPT location
    block: Add alternative_gpt_sector() operation
    bio: fix page leak bio_add_hw_page failure
    block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
    block: remove a pointless call to MINOR() in device_add_disk
    null_blk: add error handling support for add_disk()
    virtio_blk: add error handling support for add_disk()
    block: add error handling for device_add_disk / add_disk
    block: return errors from disk_alloc_events
    block: return errors from blk_integrity_add
    block: call blk_register_queue earlier in device_add_disk
    ...

    Linus Torvalds
     

17 Aug, 2021

1 commit

  • seq_get_buf is a crutch that undoes all the memory safety of the
    seq_file interface. Use the normal seq_printf interfaces instead.

    Signed-off-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/r/20210810152623.1796144-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Aug, 2021

1 commit

  • If queue is dying while iolatency_set_limit() is in progress,
    blk_get_queue() won't increment the refcount of the queue. However,
    blk_put_queue() will still decrement the refcount later, which will
    cause the refcount to become unbalanced.

    Thus, error out in this case to fix the problem.
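
    The shape of the fix, as a small self-contained sketch (simplified
    reference counting and hypothetical names, not the kernel code): take the
    put only when the matching get actually succeeded, and bail out with an
    error otherwise.

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct queue { int refs; bool dying; };

    /* Like blk_get_queue(): fails (returns false) once the queue is dying. */
    static bool queue_get(struct queue *q)
    {
        if (q->dying)
            return false;
        q->refs++;
        return true;
    }

    static void queue_put(struct queue *q) { q->refs--; }

    /* The fix in spirit: error out instead of falling through to an unmatched put. */
    static int set_limit(struct queue *q)
    {
        if (!queue_get(q))
            return -ENODEV;     /* do not reach queue_put() below */
        /* ... update limits ... */
        queue_put(q);
        return 0;
    }

    int main(void)
    {
        struct queue q = { .refs = 1, .dying = true };

        printf("set_limit: %d\n", set_limit(&q));       /* -ENODEV */
        printf("refs: %d\n", q.refs);                   /* still 1, balanced */
        return 0;
    }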

    Fixes: 8c772a9bfc7c ("blk-iolatency: fix IO hang due to negative inflight counter")
    Signed-off-by: Yu Kuai
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com
    Signed-off-by: Jens Axboe

    Yu Kuai
     

08 Oct, 2020

1 commit


01 Jul, 2020

1 commit

    ktime_to_ns(ktime_get()), which is expensive, does not need to be called
    if blk_iolatency_enabled() returns false in blkcg_iolatency_done_bio().
    Postponing the ktime_to_ns(ktime_get()) call reduces CPU usage when
    blk_iolatency is disabled.
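
    A user-space illustration of the reordering (clock_gettime() stands in for
    ktime_get(); the names are hypothetical): the cheap enabled test runs
    first, and the clock is read only on the enabled path.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static bool iolatency_enabled = false;

    /* Stand-in for ktime_to_ns(ktime_get()): the relatively expensive call. */
    static uint64_t now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void done_bio(uint64_t issue_ns)
    {
        uint64_t now;

        if (!iolatency_enabled)
            return;             /* cheap early exit, no clock read */
        now = now_ns();         /* only pay for the clock when it matters */
        printf("latency: %llu ns\n", (unsigned long long)(now - issue_ns));
    }

    int main(void)
    {
        done_bio(0);                    /* disabled: returns immediately */
        iolatency_enabled = true;
        done_bio(now_ns());             /* enabled: measures latency */
        return 0;
    }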

    Signed-off-by: Hongnan Li
    Signed-off-by: Jens Axboe

    Hongnan Li
     

29 Aug, 2019

2 commits


17 Jul, 2019

1 commit

    Currently, ->pd_stat() is called only when the module parameter
    blkcg_debug_stats is set, which prevents it from printing non-debug
    policy-specific statistics. Let's move the debug test down so that
    ->pd_stat() can print non-debug stats too. This patch doesn't cause
    any visible behavior change.

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 Jul, 2019

1 commit

    The iolatency controller is based on rq_qos. It increments on
    rq_qos_throttle() and decrements on either rq_qos_cleanup() or
    rq_qos_done_bio(). a3fb01ba5af0 fixed the double accounting issue where
    blk_mq_make_request() may call both rq_qos_cleanup() and
    rq_qos_done_bio() for REQ_NOWAIT bios. Checking STS_AGAIN there prevents
    the double decrement.

    The above works upstream as the only way we can get STS_AGAIN is from
    blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
    problem as bio_endio() skipping only happens on reserved tag allocation
    failures which can only be caused by driver bugs and already triggers
    WARN.

    However, the fix creates a not-so-great dependency on how STS_AGAIN can
    be propagated. Internally, we (Facebook) carry a patch that kills
    readahead if a cgroup is io congested or a fatal signal is pending.
    Combined with chained bios propagating their bi_status to the parent
    when it is not already set, this can cause the parent bio to not clean
    up properly even though it was successful. This consequently leaks the
    inflight counter and can hang all IOs under that blkg.

    To nip the adverse interaction early, this removes the rq_qos_cleanup()
    callback in iolatency in favor of cleaning up always on the
    rq_qos_done_bio() path.
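
    A compact model of the accounting hazard (simplified to a bare counter,
    not the real callbacks): when both a cleanup path and a completion path
    may decrement, correctness hinges on how the status propagates; keeping a
    single decrement site on the done_bio path removes that dependency.

    #include <stdbool.h>
    #include <stdio.h>

    static int inflight;

    static void throttle(void) { inflight++; }

    /* Old scheme: decrement on cleanup and again on completion unless a
     * status check happened to skip one of them; correctness depended on
     * how the error status propagated through chained bios. */
    static void old_cleanup(void)                 { inflight--; }
    static void old_done_bio(bool skipped_by_sts) { if (!skipped_by_sts) inflight--; }

    /* New scheme: a single decrement site on the done_bio path. */
    static void new_done_bio(void)                { inflight--; }

    int main(void)
    {
        /* Old scheme with the status not propagated as expected. */
        throttle();
        old_cleanup();
        old_done_bio(false);
        printf("old scheme inflight = %d\n", inflight);    /* -1 */

        /* New scheme: exactly one decrement per throttled bio. */
        inflight = 0;
        throttle();
        new_done_bio();
        printf("new scheme inflight = %d\n", inflight);    /* 0 */
        return 0;
    }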

    Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios")
    Debugged-by: Tejun Heo
    Debugged-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

20 Jun, 2019

1 commit

    As is, iolatency recognizes done_bio and cleanup as ending paths. If a
    bio is marked REQ_NOWAIT and fails to get a request, it is cleaned up
    via rq_qos_cleanup() and then ended in bio_wouldblock_error(). This
    results in underflowing the inflight counter. Fix this by only
    accounting bios that were actually submitted.

    Signed-off-by: Dennis Zhou
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

16 Jun, 2019

1 commit

    If use_delay is non-zero when the latency target of a cgroup is set to
    zero, it stays stuck until io.latency is enabled on the cgroup again.
    This keeps readahead disabled for the cgroup, negatively impacting
    performance.
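
    A sketch of the intended behavior (hypothetical helper and field names,
    not the kernel code): clearing the latency target should also clear any
    accumulated use_delay, rather than leaving the old delay applied forever.

    #include <stdio.h>

    struct group { long latency_target_ns; int use_delay; };

    static void set_latency_target(struct group *grp, long target_ns)
    {
        grp->latency_target_ns = target_ns;
        /* The fix in spirit: a cleared target must also drop any delay
         * accumulated while the target was active. */
        if (target_ns == 0)
            grp->use_delay = 0;
    }

    int main(void)
    {
        struct group grp = { .latency_target_ns = 1000000, .use_delay = 4 };

        set_latency_target(&grp, 0);
        printf("use_delay = %d\n", grp.use_delay);      /* 0, not stuck at 4 */
        return 0;
    }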

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 May, 2019

1 commit


21 Mar, 2019

1 commit

    This patch avoids the following warning being reported when building
    with W=1:

    block/blk-iolatency.c:734:5: warning: no previous prototype for 'blk_iolatency_init' [-Wmissing-prototypes]
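
    The usual shape of such a -Wmissing-prototypes fix, as a self-contained
    sketch (hypothetical names; in the kernel the prototype normally lives in
    a shared header such as block/blk.h, which the defining file then
    includes): give the non-static definition a prior prototype so the
    warning has nothing to complain about.

    /* A minimal, single-file model: without the declaration below,
     * "gcc -W -Wmissing-prototypes" warns on the definition of init_thing(). */
    struct queue;                           /* opaque, as callers see it */

    int init_thing(struct queue *q);        /* the prototype the warning wants */

    int init_thing(struct queue *q)
    {
        (void)q;
        return 0;
    }

    int main(void)
    {
        return init_thing((struct queue *)0);
    }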

    Cc: Josef Bacik
    Fixes: d70675121546 ("block: introduce blk-iolatency io controller") # v4.19
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Feb, 2019

2 commits

    This is to catch any unexpected negative value of the inflight IO counter.

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     
    Our test reported the following stack, and the vmcore showed that the
    ->inflight counter is -1.

    [ffffc9003fcc38d0] __schedule at ffffffff8173d95d
    [ffffc9003fcc3958] schedule at ffffffff8173de26
    [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
    [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
    [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
    [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
    [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
    [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
    [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
    [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
    [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
    [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
    [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
    [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
    [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
    [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
    [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
    [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
    [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e

    The ->inflight counter may be negative (-1) if

    1) blk-iolatency was disabled when the IO was issued,

    2) blk-iolatency was enabled before this IO reached its endio,

    3) the ->inflight counter is decreased from 0 to -1 in endio()

    In fact the hang can be easily reproduced by the below script,

    H=/sys/fs/cgroup/unified/
    P=/sys/fs/cgroup/unified/test

    echo "+io" > $H/cgroup.subtree_control
    mkdir -p $P

    echo $$ > $P/cgroup.procs

    xfs_io -f -d -c "pwrite 0 4k" /dev/sdg

    echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency

    xfs_io -f -d -c "pwrite 0 4k" /dev/sdg

    This fixes the problem by freezing the queue so that no request is in
    flight while iolatency is being enabled or disabled.

    Note that quiescing the queue is not needed, as this only updates the
    iolatency configuration, which the dispatch path of the request_queue
    doesn't care about.
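
    A rough single-threaded model of why the drain matters (a busy-wait
    stands in for blk_mq_freeze_queue(); the counters and names are
    simplified): if the enabled setting flips while an IO is in flight, issue
    and completion can see different values and the counter goes negative;
    draining first keeps them balanced.

    #include <stdbool.h>
    #include <stdio.h>

    static bool enabled;
    static int inflight;        /* per-cgroup rq_wait counter, roughly */
    static int outstanding;     /* IOs issued but not yet completed */

    static void issue(void) { if (enabled) inflight++; outstanding++; }
    static void endio(void) { if (enabled) inflight--; outstanding--; }

    /* Drain everything before flipping the setting, so no IO sees two
     * different values of "enabled" (blk_mq_freeze_queue() in the kernel). */
    static void set_enabled_safely(bool value)
    {
        while (outstanding)
            ;                   /* never spins in this single-threaded model */
        enabled = value;
    }

    int main(void)
    {
        /* Unsafe: toggled mid-flight, endio decrements what was never counted. */
        issue();                /* issued while disabled */
        enabled = true;         /* flipped without draining */
        endio();
        printf("unsafe: inflight = %d\n", inflight);    /* -1 */

        /* Safe: drain, then enable; every IO is counted symmetrically. */
        inflight = 0;
        enabled = false;
        issue();
        endio();
        set_enabled_safely(true);
        issue();
        endio();
        printf("safe:   inflight = %d\n", inflight);    /* 0 */
        return 0;
    }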

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     

18 Dec, 2018

1 commit

  • The blk-iolatency controller measures the time from rq_qos_throttle() to
    rq_qos_done_bio() and attributes this time to the first bio that needs
    to create the request. This means if a bio is plug-mergeable or
    bio-mergeable, it gets to bypass the blk-iolatency controller.

    The recent series [1], which tags all bios with blkgs, undermined how
    iolatency determined which bios it was charging and should process in
    rq_qos_done_bio(). Because all bios are now tagged, the atomic_t
    inflight count in struct rq_wait could underflow and result in a stall.

    This patch adds a new flag BIO_TRACKED to let controllers know that a
    bio is going through the rq_qos path. blk-iolatency now checks if this
    flag is set to see if it should process the bio in rq_qos_done_bio().

    Overloading BIO_QUEUE_ENTERED works, but makes the flag rules confusing.
    BIO_THROTTLED was another candidate, but that flag is set for all bios
    that have gone through the blk-throttle code. Overloading a flag comes
    with the burden of making sure that when either implementation changes,
    a change in the setting rules for one doesn't cause a bug in the other.
    So here, we unfortunately opt for adding a new flag.

    [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/

    Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
    Signed-off-by: Dennis Zhou
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

08 Dec, 2018

8 commits

  • Now that we have this common helper, convert io-latency over to use it
    as well.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or %NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • bio_issue_init among other things initializes the timestamp for an IO.
    Rather than have this logic handled by policies, this consolidates it to
    be on the init paths (normal, clone, bounce clone).

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • Previously, blkg association was handled by controller specific code in
    blk-throttle and blk-iolatency. However, because a blkg represents a
    relationship between a blkcg and a request_queue, it makes sense to keep
    the blkg->q and bio->bi_disk->queue consistent.

    This patch moves association into the bio_set_dev() macro. This should
    cover the majority of cases where the device is set or changed, keeping
    the two pointers consistent. Fallback code is added to
    blkcg_bio_issue_check() to catch any missing paths.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • There are 3 ways blkg association can happen: association with the
    current css, with the page css (swap), or from the wbc css (writeback).

    This patch handles how association is done for the first case, where we
    are associating based on the current css. If a blkg is already
    associated, the css will be reused and association will be redone, as
    the request_queue may have changed.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
    There are several scenarios where blkg_lookup_create() can fail, such as
    the blkcg dying, the request_queue dying, or simply being out of memory.
    Most callers handle this by simply falling back to the q->root_blkg and
    calling it a day.

    This patch implements the notion of closest blkg. During
    blkg_lookup_create(), if it fails to create, return the closest blkg
    found or the q->root_blkg. blkg_try_get_closest() is introduced and used
    during association so a bio is always attached to a blkg.
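
    A sketch of the lookup-with-fallback idea (simplified tree and
    hypothetical names, not the kernel structures): create toward the leaf,
    and if creation fails at some level, return the closest existing ancestor
    rather than NULL, so the caller always gets something to attach the bio to.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct blkg {
        const char *name;
        struct blkg *parent;    /* NULL for the root */
    };

    static struct blkg *try_create(const char *name, struct blkg *parent)
    {
        struct blkg *g;

        if (strcmp(name, "unlucky") == 0)       /* pretend e.g. OOM here */
            return NULL;
        g = malloc(sizeof(*g));
        g->name = name;
        g->parent = parent;
        return g;
    }

    /* Create blkgs top-down along root -> ... -> leaf.  On failure, return
     * the closest blkg created so far instead of giving up entirely. */
    static struct blkg *lookup_create_closest(struct blkg *root,
                                              const char **path, int depth)
    {
        struct blkg *closest = root;
        int i;

        for (i = 0; i < depth; i++) {
            struct blkg *g = try_create(path[i], closest);

            if (!g)
                return closest;         /* fall back to nearest ancestor */
            closest = g;
        }
        return closest;
    }

    int main(void)
    {
        struct blkg root = { "root", NULL };
        const char *path[] = { "parent", "unlucky", "leaf" };
        struct blkg *g = lookup_create_closest(&root, path, 3);

        printf("bio associated with: %s\n", g->name);   /* "parent" */
        return 0;       /* allocations intentionally not freed in this demo */
    }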

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • To know when to create a blkg, the general pattern is to do a
    blkg_lookup() and if that fails, lock and do the lookup again, and if
    that fails finally create. It doesn't make much sense for everyone who
    wants to do creation to write this themselves.

    This changes blkg_lookup_create() to do locking and implement this
    pattern. The old blkg_lookup_create() is renamed to
    __blkg_lookup_create(). If a call site wants to do its own error
    handling or already owns the queue lock, they can use
    __blkg_lookup_create(). This will be used in upcoming patches.
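
    The pattern the reworked blkg_lookup_create() centralizes, sketched in
    user space with a plain mutex and hypothetical names (the kernel's
    lockless fast path additionally relies on RCU; this demo only runs
    single-threaded): an unlocked lookup, then, on a miss, take the lock,
    look up again in case someone raced us, and only then create.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct entry { int key; struct entry *next; };

    static struct entry *table;
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

    static struct entry *lookup(int key)
    {
        for (struct entry *e = table; e; e = e->next)
            if (e->key == key)
                return e;
        return NULL;
    }

    static struct entry *create_locked(int key)
    {
        struct entry *e = calloc(1, sizeof(*e));

        e->key = key;
        e->next = table;
        table = e;
        return e;
    }

    /* The common pattern: fast-path lookup, then lock + re-lookup + create. */
    static struct entry *lookup_create(int key)
    {
        struct entry *e = lookup(key);          /* fast path, no lock */

        if (e)
            return e;
        pthread_mutex_lock(&table_lock);
        e = lookup(key);                        /* re-check under the lock */
        if (!e)
            e = create_locked(key);             /* we really are first */
        pthread_mutex_unlock(&table_lock);
        return e;
    }

    int main(void)
    {
        struct entry *a = lookup_create(42);
        struct entry *b = lookup_create(42);

        printf("same entry: %s\n", a == b ? "yes" : "no");      /* yes */
        return 0;
    }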

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • The bio_blkcg() function turns out to be inconsistent and consequently
    dangerous to use. The first part returns a blkcg where a reference is
    owned by the bio meaning it does not need to be rcu protected. However,
    the third case, the last line, is problematic:

    return css_to_blkcg(task_css(current, io_cgrp_id));

    This can race against task migration and the cgroup dying. It is also
    semantically different as it must be called rcu protected and is
    susceptible to failure when trying to get a reference to it.

    This patch adds association ahead of calling bio_blkcg() rather than
    after. This makes association a required and explicit step along the
    code paths for calling bio_blkcg(). In blk-iolatency, association is
    moved above the bio_blkcg() call to ensure it will not return %NULL.

    BFQ uses the old bio_blkcg() function, but I do not want to address it
    in this series due to the complexity. I have created a private version
    documenting the inconsistency and noting not to use it.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

16 Nov, 2018

2 commits


03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

1 commit

    This reverts a series committed earlier due to a null pointer exception
    bug report in [1]. It seems there are edge-case interactions that I did
    not consider, and I will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

27 Oct, 2018

1 commit

  • There are several definitions of those functions/macros in places that
    mess with fixed-point load averages. Provide an official version.

    [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
    Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Suren Baghdasaryan
    Tested-by: Daniel Drake
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

29 Sep, 2018

5 commits

    We apply a smoothing to the scale changes in order to keep sawtooth
    behavior from occurring. However, our window for checking if we've
    missed our target can sometimes be shorter than the smoothing interval
    (500ms), especially on faster drives like SSDs. In order to deal with
    this, keep a running tally of the previous intervals that we threw away
    because we had already done a scale event recently.

    This is needed for the SSD case, as these low-latency drives will have
    bursts of latency; if latency happens to be ok for the window that
    directly follows the opening of the scale window, we could unthrottle
    even though we were missing our target in previous windows.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We use an average latency approach for determining if we're missing our
    latency target. This works well for rotational storage where we have
    generally consistent latencies, but for ssd's and other low latency
    devices you have more of a spikey behavior, which means we often won't
    throttle misbehaving groups because a lot of IO completes at drastically
    faster times than our latency target. Instead keep track of how many
    IO's miss our target and how many IO's are done in our time window. If
    the p(90) latency is above our target then we know we need to throttle.
    With this change in place we are seeing the same throttling behavior
    with our testcase on ssd's as we see with rotational drives.
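
    A toy calculation of why the percentile cut-off reacts where an average
    does not (the numbers are invented for illustration): with mostly
    sub-millisecond completions and a couple of large spikes, the mean stays
    under the target while more than 10% of the IOs in the window miss it,
    i.e. the p90 sits above the target.

    #include <stdio.h>

    int main(void)
    {
        /* Made-up completion latencies (us) for one window on a fast SSD:
         * mostly ~100us with two multi-millisecond spikes. */
        const long lat[] = { 90, 95, 100, 100, 105, 110, 115, 120, 8000, 9000 };
        const int n = sizeof(lat) / sizeof(lat[0]);
        const long target = 2000;       /* 2ms latency target */
        long sum = 0;
        int missed = 0;

        for (int i = 0; i < n; i++) {
            sum += lat[i];
            if (lat[i] > target)
                missed++;
        }

        /* Average-based check: the spikes get averaged away. */
        printf("mean = %ldus -> %s\n", sum / n,
               sum / n > target ? "throttle" : "ok");

        /* Window miss count: p90 is above the target iff >10% of IOs missed. */
        printf("missed %d/%d -> %s\n", missed, n,
               missed * 10 > n ? "throttle" : "ok");
        return 0;
    }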

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • There is logic to keep cgroups that haven't done a lot of IO in the most
    recent scale window from being punished for over-active higher priority
    groups. However for things like ssd's where the windows are pretty
    short we'll end up with small numbers of samples, so 5% of samples will
    come out to 0 if there aren't enough. Make the floor 1 sample to keep
    us from improperly bailing out of scaling down.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
    Hitting the case where blk_queue_depth() returned 1 uncovered the fact
    that iolatency doesn't actually handle this case properly: it simply
    doesn't scale down anybody. For this case we should go straight into
    applying the time delay, which we weren't doing. Since we already limit
    the floor at 1 request, this if statement is not needed; removing it
    allows us to set our depth to 1, which lets us apply the delay if
    needed.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
    We were using blk_queue_depth() assuming that it would return
    nr_requests, but we hit a case in production on drives that had to have
    NCQ turned off in order for them to not shit the bed, which resulted in
    a qd of 1 even though nr_requests was much larger. iolatency really
    only cares about requests we are allowed to queue up, as any io that
    gets onto the request list is going to be serviced soonish, so we want
    to be throttling before the bio gets onto the request list. To make
    iolatency work as expected, simply use q->nr_requests instead of
    blk_queue_depth(), as that is what we actually care about.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     

22 Sep, 2018

3 commits

  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • bio_issue_init among other things initializes the timestamp for an IO.
    Rather than have this logic handled by policies, this consolidates it to
    be on the init paths (normal, clone, bounce clone).

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • Previously, blkg's were only assigned as needed by blk-iolatency and
    blk-throttle. bio->css was also always being associated while blkg was
    being looked up and then thrown away in blkcg_bio_issue_check.

    This patch begins the cleanup of bio->css and bio->bi_blkg by always
    associating a blkg in blkcg_bio_issue_check. This tries to create the
    blkg, but if it is not possible, falls back to using the root_blkg of
    the request_queue. Therefore, a bio will always be associated with a
    blkg. The duplicate association logic is removed from blk-throttle and
    blk-iolatency.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)