26 Oct, 2020

2 commits

  • Similar to commit 457e490f2b741 ("blkcg: allocate struct blkcg_gq
    outside request queue spinlock"), blkg_create can also trigger
    occasional -ENOMEM failures at the radix insertion because any
    allocation inside blkg_create has to be non-blocking, making it more
    likely to fail. This causes trouble for userspace tools trying to
    configure io weights, which need to deal with this condition.

    This patch reduces the occurrence of -ENOMEMs on this path by preloading
    the radix tree element in a GFP_KERNEL context, such that we guarantee
    the later non-blocking insertion won't fail.

    A similar solution exists in blkcg_init_queue for the same situation.
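
    A minimal sketch of the pattern, assuming the insertion happens under
    the queue lock (details differ from the actual patch):

    /* reserve radix tree nodes while we may still sleep */
    ret = radix_tree_preload(GFP_KERNEL);
    if (ret)
            return ERR_PTR(ret);

    spin_lock_irq(&q->queue_lock);
    /* cannot hit -ENOMEM: nodes come from the preloaded per-cpu pool */
    ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
    spin_unlock_irq(&q->queue_lock);

    radix_tree_preload_end();       /* re-enables preemption */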

    Acked-by: Tejun Heo
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
     
  • If the new_blkg allocation races with a blk_policy change and
    blkg_lookup_check() fails, new_blkg is leaked.
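
    A sketch of the fix in blkg_conf_prep(), freeing the preallocated blkg
    on the error path (simplified):

    blkg = blkg_lookup_check(pos, pol, q);
    if (IS_ERR(blkg)) {
            ret = PTR_ERR(blkg);
            blkg_free(new_blkg);    /* previously leaked here */
            goto fail_unlock;
    }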

    Acked-by: Tejun Heo
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
     

10 Sep, 2020

1 commit

  • The test and the explanation of the patch are as below.

    Before the test, we added more debug code in blkg_async_bio_workfn():

    int count = 0;
    if (bios.head && bios.head->bi_next) {
            need_plug = true;
            blk_start_plug(&plug);
    }
    while ((bio = bio_list_pop(&bios))) {
            /* io_punt is a sysctl user interface to control the print */
            if (io_punt) {
                    printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
                           current->comm, current->pid, bio->bi_iter.bi_sector,
                           (bio->bi_iter.bi_size) >> 9, count++, need_plug);
            }
            submit_bio(bio);
    }
    if (need_plug)
            blk_finish_plug(&plug);

    Setup steps needed to trigger *PUNT* io before testing:
    mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
    mount -t cgroup2 nodev /cgroup2
    mkdir /cgroup2/cg3
    echo "+io" > /cgroup2/cgroup.subtree_control
    echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
    echo $$ > /cgroup2/cg3/cgroup.procs

    Then use the dd command to test btrfs PUNT io in the current shell:
    dd if=/dev/zero of=/btrfs/file bs=64K count=100000

    The test hardware environment is as below:
    [root@localhost btrfs]# lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 32
    On-line CPU(s) list: 0-31
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 2
    NUMA node(s): 2
    Vendor ID: GenuineIntel

    With the above debug code, test command and test environment, I ran the
    tests under 3 different system loads, triggered by stress:
    1, Run 64 threads by command "stress -c 64 &"
    [53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
    [53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
    [53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
    [53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
    [53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
    [53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
    ... ...
    [53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
    [53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
    [53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
    [53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
    [53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
    [53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
    [53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

    2, Run 32 threads by command "stress -c 32 &"
    [50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
    [50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
    [50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
    [50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
    [50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
    [50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
    ... ...
    [50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
    [50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
    [50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
    [50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
    [50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
    [50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
    [50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
    [50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

    3, Don't run thread by stress
    [50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
    [50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
    [50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
    [50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
    [50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
    [50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
    [50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
    [50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
    [50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
    [50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
    [50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
    [50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
    [50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
    [50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
    [50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
    [50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
    [50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
    [50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
    [50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
    [50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
    [50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

    Analysis of the above 3 test results under different system loads:
    From the above tests, we can see that more continuous bios get plugged
    as the system load increases. When running "stress -c 64 &", 310
    continuous bios are plugged; when running "stress -c 32 &", 260
    continuous bios are plugged; when stress is not run, at most only 2
    continuous bios are plugged, and in most cases bio_list contains only
    a single bio.

    How to explain the above phenomenon:
    We know that in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it
    queues a work item on the workqueue blkcg_punt_bio_wq, and when that
    work runs depends on the system load. When the system load is low, the
    work is scheduled quickly and the bios in bio_list are quickly
    processed in blkg_async_bio_workfn(), so there is little chance for
    the same io submit thread to add multiple continuous bios to bio_list
    before the work runs. This analysis matches test "3" above.
    When the system load is high, there is some delay before the work can
    be scheduled to run; the higher the system load, the greater the
    delay. So there is more chance for the same io submit thread to add
    multiple continuous bios to bio_list, and when the work finally runs
    there are more continuous bios in bio_list to be processed in
    blkg_async_bio_workfn(). This analysis matches tests "1" and "2" above.
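
    For reference, the punt path looks roughly like this (a simplified
    sketch of __blkcg_punt_bio_submit(); details vary by kernel version):

    bool __blkcg_punt_bio_submit(struct bio *bio)
    {
            struct blkcg_gq *blkg = bio->bi_blkg;

            /* consume the flag so the bio isn't punted again */
            bio->bi_opf &= ~REQ_CGROUP_PUNT;

            /* the root cgroup never needs punting */
            if (!blkg->parent)
                    return false;

            spin_lock_bh(&blkg->async_bio_lock);
            bio_list_add(&blkg->async_bios, bio);
            spin_unlock_bh(&blkg->async_bio_lock);

            /* submit threads keep appending until this work gets to run */
            queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
            return true;
    }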

    According to the tests, io performance is improved with the patch,
    especially when the system load is higher. Another optimization is to
    use the plug only when bio_list contains at least 2 bios.

    Signed-off-by: Xianting Tian
    Signed-off-by: Jens Axboe

    Xianting Tian
     

02 Sep, 2020

1 commit

  • Currently, iocost syncs the delay duration to the outstanding debt amount,
    which seemed enough to protect the system from anon memory hogs. However,
    that was mostly because the delay calculation was using hweight_inuse,
    which quickly converges towards zero under debt, often punishing debtors
    overly harshly for longer than deserved.

    The previous patch fixed the delay calculation and now the protection against
    anonymous memory hogs isn't enough because the effect of delay is indirect
    and non-linear and a huge amount of future debt can accumulate abruptly
    while unthrottled.

    This patch implements delay hysteresis so that delay is decayed
    exponentially over time instead of getting cleared immediately as debt is
    paid off. While the overall behavior is similar to the blk-cgroup
    implementation used by blk-iolatency, a lot of the details are different and
    due to the empirical nature of the mechanism, it's challenging to adapt the
    mechanism for one controller without negatively impacting the other.

    As the delay is gradually decayed now, there's no point in running it from
    its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
    the dedicated hrtimer is removed.
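
    A minimal sketch of the decay idea, with hypothetical names (the real
    state and constants live in ioc_timer_fn() and struct ioc_gq):

    u64 new_delay;

    if (iocg->abs_vdebt)
            new_delay = delay_from_debt(iocg);   /* hypothetical helper */
    else
            new_delay = iocg->delay >> 1;        /* exponential decay */

    if (new_delay < MIN_DELAY)                   /* hypothetical floor */
            new_delay = 0;
    iocg->delay = new_delay;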

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

22 Aug, 2020

1 commit

  • Normally, blkcg_iolatency_exit() will free the related iolatency memory
    when the queue is cleaned up. But if blk_throtl_init() returns an error
    and queue init fails, blkcg_iolatency_exit() will not do that for us,
    which causes a memory leak.
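
    A sketch of the fix in blkcg_init_queue(), assuming the iolatency state
    is torn down via rq_qos_exit() on the throttling error path:

    ret = blk_iolatency_init(q);
    if (ret)
            goto err_destroy_all;

    ret = blk_throtl_init(q);
    if (ret) {
            rq_qos_exit(q);    /* undoes what blk_iolatency_init() set up */
            goto err_destroy_all;
    }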

    Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
     

18 Jul, 2020

2 commits

  • In order to improve consistency and usability in cgroup stat accounting,
    we would like to support the root cgroup's io.stat.

    Since the root cgroup has processes doing io even if the system has no
    explicitly created cgroups, we need to be careful to avoid overhead in
    that case. For that reason, the rstat algorithms don't handle the root
    cgroup, so just turning the file on wouldn't give correct statistics.

    To get around this, we simulate flushing the iostat struct by filling it
    out directly from global disk stats. The result is a root cgroup io.stat
    file consistent with both /proc/diskstats and io.stat.

    Note that in order to collect the disk stats, we needed to iterate over
    devices. To facilitate that, we had to change the linkage of a disk_type
    to external so that it can be used from blk-cgroup.c to iterate over
    disks.
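
    A rough sketch of the simulated flush, iterating disks via the
    now-external disk_type (structure approximate):

    static void blkcg_fill_root_iostats(void)
    {
            struct class_dev_iter iter;
            struct device *dev;

            class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
            while ((dev = class_dev_iter_next(&iter))) {
                    struct gendisk *disk = dev_to_disk(dev);

                    /* read the global part stats for this disk and copy
                     * them into the root blkg's iostat, keeping io.stat
                     * consistent with /proc/diskstats */
            }
            class_dev_iter_exit(&iter);
    }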

    Suggested-by: Tejun Heo
    Signed-off-by: Boris Burkov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Boris Burkov
     
  • Previously, the code which printed io.stat only needed access to the
    generic rstat flushing code, but since we plan to write some more
    specific code for preparing root cgroup stats, we need to manipulate
    iostat structs directly. Since forward-declaring static functions does
    not seem to be common practice in this file, simply move the iostat
    functions up. We only plan to use blkg_iostat_set, but it seems better to
    keep them
    all together.

    Signed-off-by: Boris Burkov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Boris Burkov
     

10 May, 2020

2 commits

  • Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with
    the blk-iocost changes, but we also need the base of the bdi
    use-after-free fixes, since we build on top of them.

    * block-5.7:
    nvme: fix possible hang when ns scanning fails during error recovery
    nvme-pci: fix "slimmer CQ head update"
    bdi: add a ->dev_name field to struct backing_dev_info
    bdi: use bdi_dev_name() to get device name
    bdi: move bdi_dev_name out of line
    vboxsf: don't use the source name in the bdi name
    iocost: protect iocg->abs_vdebt with iocg->waitq.lock
    block: remove the bd_openers checks in blk_drop_partitions
    nvme: prevent double free in nvme_alloc_ns() error handling
    null_blk: Cleanup zoned device initialization
    null_blk: Fix zoned command handling
    block: remove unused header
    blk-iocost: Fix error on iocost_ioc_vrate_adj
    bdev: Reduce time holding bd_mutex in sync in blkdev_close()
    buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Use the common interface bdi_dev_name() to get device name.

    Signed-off-by: Yufen Yu
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Jan Kara
    Reviewed-by: Bart Van Assche

    Add missing include for BFQ

    Signed-off-by: Jens Axboe

    Yufen Yu
     

01 May, 2020

1 commit

  • The use_delay mechanism was introduced by blk-iolatency to hold memory
    allocators accountable for the reclaim and other shared IOs they cause. The
    duration of the delay is dynamically balanced between iolatency increasing the
    value on each target miss and it auto-decaying as time passes and threads get
    delayed on it.

    While this works well for iolatency, iocost's control model isn't compatible
    with it. There are no repeated "violation" events which can be balanced
    against auto-decaying. iocost instead knows how much a given cgroup is over
    budget and wants to prevent that cgroup from issuing IOs while over budget.
    Until now, iocost has been adding the cost of force-issued IOs. However,
    this doesn't reflect the amount which is already over budget and is simply
    not enough to counter the auto-decaying, allowing an anon-memory-leaking
    low priority cgroup to go over its allotted share of IOs.

    As auto-decaying doesn't make much sense for iocost, this patch introduces a
    different mode of operation for use_delay - when blkcg_set_delay() is used
    instead of blkcg_add/use_delay(), the delay duration is not auto-decayed
    until it is explicitly cleared with blkcg_clear_delay(). iocost is updated
    to keep the delay duration synchronized to the budget overage amount.

    With this change, iocost can effectively police cgroups which generate
    a significant amount of force-issued IOs.
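
    A minimal sketch of the two modes from a policy's point of view, using
    the helper names above:

    /* iolatency-style: accumulates and auto-decays over time */
    blkcg_add_delay(blkg, now, delta_ns);

    /* iocost-style: pinned to the current overage, no auto-decay */
    blkcg_set_delay(blkg, overage_ns);
    /* ... */
    blkcg_clear_delay(blkg);    /* explicit clear once under budget */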

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Apr, 2020

2 commits

  • blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
    don't get offlined while there are active cgwbs on them. However, it
    ends up making offlining unordered, sometimes causing parents to be
    offlined before children.

    Let's fix this by making child blkcgs pin the parents' online states.

    Note that pin/unpin names are chosen over get/put intentionally
    because css uses get/put online for something different.
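
    A sketch of the pinning helpers (close to the description; exact
    details may differ):

    void blkcg_pin_online(struct blkcg *blkcg)
    {
            refcount_inc(&blkcg->online_pin);
    }

    void blkcg_unpin_online(struct blkcg *blkcg)
    {
            /* dropping the last pin offlines this blkcg, which in turn
             * releases the pin it holds on its parent */
            do {
                    if (!refcount_dec_and_test(&blkcg->online_pin))
                            break;
                    blkcg_destroy_blkgs(blkcg);
                    blkcg = blkcg_parent(blkcg);
            } while (blkcg);
    }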

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
    don't get offlined while there are active cgwbs on them. However, it
    ends up making offlining unordered, sometimes causing parents to be
    offlined before children.

    To fix it, we want child blkcgs to pin the parents' online states
    turning the refcnt into a more generic online pinning mechanism.

    In preparation:

    * blkcg->cgwb_refcnt -> blkcg->online_pin
    * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
    * Take them out of CONFIG_CGROUP_WRITEBACK

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

28 Mar, 2020

1 commit

  • Current make_request based drivers use either blk_alloc_queue_node or
    blk_alloc_queue to allocate a queue, and then set up the make_request_fn
    function pointer and a few parameters using the blk_queue_make_request
    helper. Simplify this by passing the make_request pointer to
    blk_alloc_queue, and while at it merge the _node variant into the main
    helper by always passing a node_id, and remove the superfluous gfp_mask
    parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
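
    The resulting interface, roughly (signature per the description;
    my_make_request is a hypothetical driver callback):

    struct request_queue *blk_alloc_queue(make_request_fn make_request,
                                          int node_id);

    /* a make_request based driver now does: */
    q = blk_alloc_queue(my_make_request, NUMA_NO_NODE);
    if (!q)
            return -ENOMEM;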

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Nov, 2019

3 commits

  • blkg_rwstat is now only used by bfq-iosched and blk-throtl when on
    cgroup1. Let's move it into its own files and gate it behind a config
    option.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk-cgroup has been using blkg_rwstat to track basic IO stats.
    Unfortunately, reading recursive stats scales badly as it involves
    walking all descendants. On systems with a huge number of cgroups
    (dead or alive), this can lead to substantial CPU cost when reading IO
    stats.

    This patch reimplements basic IO stats using cgroup rstat which uses
    more memory but makes recursive stat reading O(# descendants which
    have been active since last reading) instead of O(# descendants).

    * blk-cgroup core no longer uses sync/async stats. Introduce new stat
    enums - BLKG_IOSTAT_{READ|WRITE|DISCARD}.

    * Add blkg_iostat[_set] which encapsulates byte and io stats, last
    values for propagation delta calculation and u64_stats_sync for
    correctness on 32bit archs (see the sketch below).

    * Update the new percpu stat counters directly and implement
    blkcg_rstat_flush() to implement propagation.

    * blkg_print_stat() can now bring the stats up to date by calling
    cgroup_rstat_flush() and print them instead of directly summing up
    all descendants.

    * It now allocates 96 bytes per cpu. It used to be 40 bytes.
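
    A sketch of the containers described above (field layout approximate):

    enum blkg_iostat_type {
            BLKG_IOSTAT_READ,
            BLKG_IOSTAT_WRITE,
            BLKG_IOSTAT_DISCARD,
            BLKG_IOSTAT_NR,
    };

    struct blkg_iostat {
            u64 bytes[BLKG_IOSTAT_NR];
            u64 ios[BLKG_IOSTAT_NR];
    };

    struct blkg_iostat_set {
            struct u64_stats_sync sync;     /* 32bit read consistency */
            struct blkg_iostat cur;         /* current counters */
            struct blkg_iostat last;        /* for propagation deltas */
    };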

    Signed-off-by: Tejun Heo
    Cc: Dan Schatzberg
    Cc: Daniel Xu
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • These don't have users anymore. Remove them.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Nov, 2019

1 commit

  • blkcg_print_stat() iterates blkgs under RCU and doesn't test whether
    the blkg is online. This can call into pd_stat_fn() on a pd which is
    still being initialized, leading to an oops.

    The heaviest operation - recursively summing up rwstat counters - is
    already done while holding the queue_lock. Expand queue_lock to cover
    the other operations and skip the blkg if it isn't online yet. The
    online state is protected by both blkcg and queue locks, so this
    guarantees that only online blkgs are processed.
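
    A sketch of the resulting loop in blkcg_print_stat() (simplified):

    hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
            spin_lock_irq(&blkg->q->queue_lock);

            /* online is protected by both locks, so this check is safe */
            if (!blkg->online) {
                    spin_unlock_irq(&blkg->q->queue_lock);
                    continue;
            }

            /* ... sum up rwstats and call ->pd_stat_fn() ... */

            spin_unlock_irq(&blkg->q->queue_lock);
    }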

    Signed-off-by: Tejun Heo
    Reported-by: Roman Gushchin
    Cc: Josef Bacik
    Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Oct, 2019

1 commit

  • blkcg_activate_policy() has the following bugs.

    * cf09a8ee19ad ("blkcg: pass @q and @blkcg into
    blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however,
    blkcg_activate_policy() ends up using pd's allocated for the root
    blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs
    can be passed in pd's which are allocated for the root blkcg.

    For blk-iocost, this means that ->pd_init_fn() can write beyond the
    end of the allocated object as it determines the length of the flex
    array at the end based on the blkcg's nesting level.

    * Each pd is initialized as it gets allocated. If an alloc fails, the
    policy will get freed with some pd's already initialized on it.

    * After the above partial failure, the partial pds are not freed.

    This patch fixes all the above issues by

    * Restructuring blkcg_activate_policy() so that alloc and init passes
    are separate. Init takes place only after all allocs succeeded and
    on failure all allocated pds are freed.

    * Unifying and fixing the cleanup of the remaining pd_prealloc.
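
    The restructured flow, in outline (a hypothetical simplification of
    blkcg_activate_policy()):

    /* pass 1: allocate a pd for every blkg, passing each blkg's own
     * blkcg; on -ENOMEM, fall back to a GFP_KERNEL prealloc and retry */
    list_for_each_entry(blkg, &q->blkg_list, q_node) {
            pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q, blkg->blkcg);
            if (!pd)
                    goto enomem;    /* frees all pds allocated so far */
            blkg->pd[pol->plid] = pd;
    }

    /* pass 2: init only after every allocation has succeeded */
    list_for_each_entry(blkg, &q->blkg_list, q_node)
            if (pol->pd_init_fn)
                    pol->pd_init_fn(blkg->pd[pol->plid]);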

    Signed-off-by: Tejun Heo
    Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Sep, 2019

1 commit

  • Since commit 795fe54c2a828099e ("bfq: Add per-device weight"), bfq uses
    blkg_conf_prep() and blkg_conf_finish(), which are not exported. So, it
    causes a linkage error if bfq is compiled as a module.
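
    A sketch of the fix, exporting the two symbols bfq needs:

    EXPORT_SYMBOL_GPL(blkg_conf_prep);
    EXPORT_SYMBOL_GPL(blkg_conf_finish);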

    Fixes: 795fe54c2a828099e ("bfq: Add per-device weight")
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

17 Jul, 2019

1 commit

  • Currently, ->pd_stat() is called only when the module parameter
    blkcg_debug_stats is set, which prevents it from printing non-debug
    policy-specific statistics. Let's move the debug testing down so that
    ->pd_stat() can print non-debug stats too. This patch doesn't cause
    any visible behavior change.
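
    Conceptually, the gate moves from the core into the policies
    (hypothetical before/after sketch):

    /* before: the core skipped the callback entirely */
    if (blkcg_debug_stats && pol->pd_stat_fn)
            written = pol->pd_stat_fn(pd, buf, size);

    /* after: always call in; each policy checks blkcg_debug_stats for
     * its debug-only lines and can print regular stats regardless */
    if (pol->pd_stat_fn)
            written = pol->pd_stat_fn(pd, buf, size);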

    Signed-off-by: Tejun Heo
    Cc: Josef Bacik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Jul, 2019

3 commits

  • When a shared kthread needs to issue a bio for a cgroup, doing so
    synchronously can lead to priority inversions as the kthread can be
    trapped waiting for that cgroup. This patch implements
    REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
    to a dedicated per-blkcg work item to avoid such priority inversions.

    This will be used to fix priority inversions in btrfs compression and
    should be generally useful as we grow filesystem support for
    comprehensive IO control.
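
    From the issuer's side, opting in is just setting the flag (sketch):

    /* in a shared kthread issuing io on behalf of some cgroup */
    bio->bi_opf |= REQ_CGROUP_PUNT;
    submit_bio(bio);    /* actually issued later from per-blkcg work */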

    Cc: Chris Mason
    Reviewed-by: Josef Bacik
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • btrfs is going to use css_put() and wbc helpers to improve cgroup
    writeback support. Add dummy css_get() definition and export wbc
    helpers to prepare for module and !CONFIG_CGROUP builds.

    Reported-by: kbuild test robot
    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the psi stuff in place, we can use the memstall flag to indicate
    pressure that happens from throttling.
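
    A sketch of how the throttle delay might be wrapped (the psi API is
    real; the exact placement is assumed):

    unsigned long pflags;

    psi_memstall_enter(&pflags);    /* account this sleep as memstall */
    /* ... perform the blkcg throttling delay ... */
    psi_memstall_leave(&pflags);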

    Signed-off-by: Josef Bacik
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
