01 Sep, 2018

1 commit

  • There is a very small chance that a bio gets caught up in a really
    unfortunate race between a task migration, cgroup exiting, and itself
    trying to associate with a blkg. This is due to css offlining being
    performed after the css->refcnt is killed, which triggers removal of
    blkgs once their blkg->refcnt reaches 0.

    To avoid this, association with a blkg should use tryget and fall back
    to using the root_blkg.
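    A minimal sketch of the association logic this implies (helper names
    approximate the blk-cgroup API of that era and are assumptions, not the
    literal patch):

    static void bio_associate_blkg_sketch(struct bio *bio,
                                          struct request_queue *q,
                                          struct blkcg_gq *blkg)
    {
            /* Try to take a reference; fails once blkg->refcnt hit 0. */
            if (blkg && blkg_try_get(blkg)) {
                    bio->bi_blkg = blkg;
                    return;
            }
            /* Lost the race with css offline/blkg teardown: use the root. */
            bio->bi_blkg = q->root_blkg;
            blkg_get(bio->bi_blkg);
    }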

    Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Cc: Jiufei Xue
    Cc: Joseph Qi
    Cc: Tejun Heo
    Cc: Josef Bacik
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

10 Aug, 2018

1 commit

  • When an application's iops exceeds its cgroup's iops limit, it is
    throttled and the kernel sets a timer for dispatching, so the IO latency
    includes this delay.

    However, the dispatch delay, which is calculated from the limit and the
    elapsed jiffies, is suboptimal. Since the delay is only calculated once
    the application's iops reaches (iops limit + 1), the bio doesn't need to
    wait any longer than the remaining time of the current slice.
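    A sketch of the capping this describes (the throtl_grp fields follow
    blk-throttle's naming of that era; the helper itself is illustrative,
    not the literal diff):

    static unsigned long tg_cap_wait_sketch(struct throtl_grp *tg, int rw,
                                            unsigned long jiffy_wait)
    {
            /* Never wait past the end of the current slice. */
            if (time_before(jiffies, tg->slice_end[rw]))
                    jiffy_wait = min(jiffy_wait,
                                     tg->slice_end[rw] - jiffies);
            return jiffy_wait;
    }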

    The difference can be demonstrated with the following fio job and cgroup
    iops settings:
    -----
    $ echo 4 > /mnt/config/nullb/disk1/mbps # limit nullb's bandwidth to 4MB/s for testing.
    $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
    $ cat r2.job
    [global]
    name=fio-rand-read
    filename=/dev/nullb1
    rw=randread
    bs=4k
    direct=1
    numjobs=1
    time_based=1
    runtime=60
    group_reporting=1

    [file1]
    size=4G
    ioengine=libaio
    iodepth=1
    rate_iops=50000
    norandommap=1
    thinktime=4ms
    -----

    w/o patch:
    file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
    fio-3.7-66-gedfc
    Starting 1 process

    read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
    slat (usec): min=10, max=336, avg=27.71, stdev=17.82
    clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
    lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
    clat percentiles (usec):
    | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
    | 30.00th=[ 4], 40.00th=[ 4], 50.00th=[ 6], 60.00th=[11731],
    | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
    | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
    | 99.99th=[28967]

    w/ patch:
    file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
    fio-3.7-66-gedfc
    Starting 1 process

    read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
    slat (usec): min=10, max=155, avg=23.24, stdev=16.79
    clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
    lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
    clat percentiles (usec):
    | 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 4], 20.00th=[ 4],
    | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 47], 60.00th=[11863],
    | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
    | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
    | 99.99th=[12387]

    Signed-off-by: Liu Bo

    Signed-off-by: Jens Axboe

    Liu Bo
     

09 Jul, 2018

2 commits

  • Currently io.low uses bi_cg_private to stash its private data for the
    blkg; however, other blkcg policies may want to use this as well. Since
    we can get the private data out of the blkg, move this to bi_blkg in the
    bio and make it generic, then we can use bio_associate_blkg() to attach
    the blkg to the bio.

    Theoretically we could simply replace bi_css with this, since we can
    get to all the same information from the blkg; however, you have to
    look up the blkg, so for example wbc_init_bio() would have to look up
    and possibly allocate the blkg for the css it was trying to attach to
    the bio. This could be problematic and result in us either not
    attaching the css to the bio at all, or falling back to the root blkcg
    if we are unable to allocate the corresponding blkg.

    So for now do this, and in the future if possible we could just replace
    the bi_css with bi_blkg and update the helpers to do the correct
    translation.
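    As a rough illustration of what the generic field buys a policy, a
    hypothetical wrapper for blk-throttle (blkg_to_tg() is the existing
    per-policy accessor; the wrapper name is made up):

    static inline struct throtl_grp *bio_to_tg_sketch(struct bio *bio)
    {
            /* Derive per-policy data from the blkg instead of stashing a
             * private pointer on the bio. */
            return bio->bi_blkg ? blkg_to_tg(bio->bi_blkg) : NULL;
    }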

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Once one cgroup has io.low configured, @low_valid becomes true and other
    cgroups won't switch it back whatsoever.

    Signed-off-by: Liu Bo
    Signed-off-by: Jens Axboe

    Liu Bo
     

31 May, 2018

2 commits


09 May, 2018

2 commits

  • struct blk_issue_stat squashes three things into one u64:

    - The time the driver started working on a request
    - The original size of the request (for the io.low controller)
    - Flags for writeback throttling

    It turns out that on x86_64, we have a 4 byte hole in struct request
    which we can fill with the non-timestamp fields from blk_issue_stat,
    simplifying things quite a bit.
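    For illustration, the general shape of squashing a timestamp plus small
    fields into a single u64 (the bit widths here are assumptions, not
    blk_issue_stat's actual encoding):

    #define STAT_TIME_BITS  51
    #define STAT_TIME_MASK  ((1ULL << STAT_TIME_BITS) - 1)

    static inline u64 stat_pack(u64 time_ns, u32 size_idx, u32 wbt_flags)
    {
            return (time_ns & STAT_TIME_MASK) |
                   ((u64)(size_idx & 0xff) << STAT_TIME_BITS) |
                   ((u64)(wbt_flags & 0x1f) << (STAT_TIME_BITS + 8));
    }

    static inline u64 stat_time(u64 stat)
    {
            return stat & STAT_TIME_MASK;
    }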

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • struct blk_issue_stat is going away, and bio->bi_issue_stat doesn't even
    use the blk-stats interface, so we can provide a separate implementation
    specific for bios. The helpers work the same way as the blk-stats
    helpers.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

30 Jan, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main pull request for block IO related changes for the
    4.16 kernel. Nothing major in this pull request, but a good amount of
    improvements and fixes all over the map. This contains:

    - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
    Paolo.

    - Support for SMR zones for deadline and mq-deadline from Damien and
    Christoph.

    - Set of fixes for bcache by way of Michael Lyle, including fixes
    from himself, Kent, Rui, Tang, and Coly.

    - Series from Matias for lightnvm with fixes from Hans Holmberg,
    Javier, and Matias. Mostly centered around pblk, and removing
    rrpc 1.2 in preparation for supporting 2.0.

    - A couple of NVMe pull requests from Christoph. Nothing major in
    here, just fixes and cleanups, and support for command tracing from
    Johannes.

    - Support for blk-throttle for tracking reads and writes separately.
    From Joseph Qi. A few cleanups/fixes also for blk-throttle from
    Weiping.

    - Series from Mike Snitzer that enables dm to register its queue more
    logically, something that's always been problematic on dm since
    it's a stacked device.

    - Series from Ming cleaning up some of the bio accessor use, in
    preparation for supporting multipage bvecs.

    - Various fixes from Ming closing up holes around queue mapping and
    quiescing.

    - BSD partition fix from Richard Narron, fixing a problem where we
    can't mount newer (10/11) FreeBSD partitions.

    - Series from Tejun reworking blk-mq timeout handling. The previous
    scheme relied on atomic bits, but it had races where we would think
    a request had timed out if it got reused at the wrong time.

    - null_blk now supports faking timeouts, to enable us to better
    exercise and test that functionality separately. From me.

    - Kill the separate atomic poll bit in the request struct. After
    this, we don't use the atomic bits on blk-mq anymore at all. From
    me.

    - sgl_alloc/free helpers from Bart.

    - Heavily contended tag case scalability improvement from me.

    - Various little fixes and cleanups from Arnd, Bart, Corentin,
    Douglas, Eryu, Goldwyn, and myself"

    * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
    block: remove smart1,2.h
    nvme: add tracepoint for nvme_complete_rq
    nvme: add tracepoint for nvme_setup_cmd
    nvme-pci: introduce RECONNECTING state to mark initializing procedure
    nvme-rdma: remove redundant boolean for inline_data
    nvme: don't free uuid pointer before printing it
    nvme-pci: Suspend queues after deleting them
    bsg: use pr_debug instead of hand crafted macros
    blk-mq-debugfs: don't allow write on attributes with seq_operations set
    nvme-pci: Fix queue double allocations
    block: Set BIO_TRACE_COMPLETION on new bio during split
    blk-throttle: use queue_is_rq_based
    block: Remove kblockd_schedule_delayed_work{,_on}()
    blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
    blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
    lib/scatterlist: Fix chaining support in sgl_alloc_order()
    blk-throttle: track read and write request individually
    block: add bdev_read_only() checks to common helpers
    block: fail op_is_write() requests to read-only partitions
    blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
    ...

    Linus Torvalds
     

20 Jan, 2018

1 commit


19 Jan, 2018

2 commits

  • In a mixed read/write workload on SSD, write latency is much lower than
    read latency. But now we only track and record read latency and then use
    it as the threshold base for both read and write IO latency accounting.
    As a result, write IO latency will always be considered good and
    bad_bio_cnt is much smaller than 20% of bio_cnt. That is to say, the tg
    to be checked will be treated as idle most of the time and will still
    let others dispatch more IOs, even if it is truly running under its low
    limit and wants its low limit to be guaranteed, which is not what we
    expect. So track read and write requests individually, which can bring
    more precise latency control for low limit idle detection.
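    A sketch of the data-structure change this implies (the array shape is
    inferred from the description; LATENCY_BUCKET_SIZE is blk-throttle's
    existing bucket count):

    /* One set of sampling buckets per direction (READ = 0, WRITE = 1)
     * instead of a single shared set. */
    struct latency_bucket {
            unsigned long total_latency;
            int samples;
    };

    struct latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];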

    Signed-off-by: Joseph Qi
    Reviewed-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Joseph Qi
     
  • Export these two interfaces for cgroup v1.

    Acked-by: Tejun Heo
    Signed-off-by: weiping zhang
    Signed-off-by: Jens Axboe

    weiping zhang
     

21 Dec, 2017

1 commit

  • If a bio is throttled and split after throttling, the bio could be
    resubmitted and enter throttling again. This will cause part of the bio
    to be charged multiple times. If the cgroup has an IO limit, the double
    charge will significantly harm performance. Bio splits have become quite
    common since the arbitrary bio size change.

    To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
    If the bio is cloned/split, we copy the flag to the new bio too to avoid
    a double charge. However, a cloned bio could be directed to a new disk,
    so keeping the flag would be a problem. The observation is that we
    always set a new disk for the bio in this case, so we can clear the flag
    in bio_set_dev().
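    A sketch of the bio_set_dev() behaviour described here (close to the
    4.15-era helper, but treat it as illustrative):

    #define bio_set_dev(bio, bdev)                                  \
    do {                                                            \
            if ((bio)->bi_disk != (bdev)->bd_disk)                  \
                    bio_clear_flag(bio, BIO_THROTTLED);             \
            (bio)->bi_disk   = (bdev)->bd_disk;                     \
            (bio)->bi_partno = (bdev)->bd_partno;                   \
    } while (0)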

    This issue has existed for a long time; the arbitrary bio size change
    just makes it worse, so this should go into stable, at least back to
    v4.2.

    V1-> V2: Not add extra field in bio based on discussion with Tejun

    Cc: Vivek Goyal
    Cc: stable@vger.kernel.org
    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

22 Nov, 2017

1 commit

  • This converts all remaining cases of the old setup_timer() API into using
    timer_setup(), where the callback argument is the structure already
    holding the struct timer_list. These should have no behavioral changes,
    since they just change which pointer is passed into the callback with
    the same available pointers after conversion. It handles the following
    examples, in addition to some other variations.

    Casting from unsigned long:

    void my_callback(unsigned long data)
    {
            struct something *ptr = (struct something *)data;
            ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

    and forced object casts:

    void my_callback(struct something *ptr)
    {
            ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

    become:

    void my_callback(struct timer_list *t)
    {
            struct something *ptr = from_timer(ptr, t, my_timer);
            ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

    Direct function assignments:

    void my_callback(unsigned long data)
    {
            struct something *ptr = (struct something *)data;
            ...
    }
    ...
    ptr->my_timer.function = my_callback;

    have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
            struct something *ptr = from_timer(ptr, t, my_timer);
            ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

    And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
            ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

    have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
            ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

    The conversion is done with the following Coccinelle script:

    spatch --very-quiet --all-includes --include-headers \
    -I ./arch/x86/include -I ./arch/x86/include/generated \
    -I ./include -I ./arch/x86/include/uapi \
    -I ./arch/x86/include/generated/uapi -I ./include/uapi \
    -I ./include/generated/uapi --include ./include/linux/kconfig.h \
    --dir . \
    --cocci-file ~/src/data/timer_setup.cocci

    @fix_address_of@
    expression e;
    @@

    setup_timer(
    -&(e)
    +&e
    , ...)

    // Update any raw setup_timer() usages that have a NULL callback, but
    // would otherwise match change_timer_function_usage, since the latter
    // will update all function assignments done in the face of a NULL
    // function initialization in setup_timer().
    @change_timer_function_usage_NULL@
    expression _E;
    identifier _timer;
    type _cast_data;
    @@

    (
    -setup_timer(&_E->_timer, NULL, _E);
    +timer_setup(&_E->_timer, NULL, 0);
    |
    -setup_timer(&_E->_timer, NULL, (_cast_data)_E);
    +timer_setup(&_E->_timer, NULL, 0);
    |
    -setup_timer(&_E._timer, NULL, &_E);
    +timer_setup(&_E._timer, NULL, 0);
    |
    -setup_timer(&_E._timer, NULL, (_cast_data)&_E);
    +timer_setup(&_E._timer, NULL, 0);
    )

    @change_timer_function_usage@
    expression _E;
    identifier _timer;
    struct timer_list _stl;
    identifier _callback;
    type _cast_func, _cast_data;
    @@

    (
    -setup_timer(&_E->_timer, _callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, &_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, &_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    _E->_timer@_stl.function = _callback;
    |
    _E->_timer@_stl.function = &_callback;
    |
    _E->_timer@_stl.function = (_cast_func)_callback;
    |
    _E->_timer@_stl.function = (_cast_func)&_callback;
    |
    _E._timer@_stl.function = _callback;
    |
    _E._timer@_stl.function = &_callback;
    |
    _E._timer@_stl.function = (_cast_func)_callback;
    |
    _E._timer@_stl.function = (_cast_func)&_callback;
    )

    // callback(unsigned long arg)
    @change_callback_handle_cast
    depends on change_timer_function_usage@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _origtype;
    identifier _origarg;
    type _handletype;
    identifier _handle;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *t
    )
    {
    (
    ... when != _origarg
    _handletype *_handle =
    -(_handletype *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle =
    -(void *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle;
    ... when != _handle
    _handle =
    -(_handletype *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle;
    ... when != _handle
    _handle =
    -(void *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    )
    }

    // callback(unsigned long arg) without existing variable
    @change_callback_handle_cast_no_arg
    depends on change_timer_function_usage &&
    !change_callback_handle_cast@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _origtype;
    identifier _origarg;
    type _handletype;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *t
    )
    {
    + _handletype *_origarg = from_timer(_origarg, t, _timer);
    +
    ... when != _origarg
    - (_handletype *)_origarg
    + _origarg
    ... when != _origarg
    }

    // Avoid already converted callbacks.
    @match_callback_converted
    depends on change_timer_function_usage &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg@
    identifier change_timer_function_usage._callback;
    identifier t;
    @@

    void _callback(struct timer_list *t)
    { ... }

    // callback(struct something *handle)
    @change_callback_handle_arg
    depends on change_timer_function_usage &&
    !match_callback_converted &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _handletype;
    identifier _handle;
    @@

    void _callback(
    -_handletype *_handle
    +struct timer_list *t
    )
    {
    + _handletype *_handle = from_timer(_handle, t, _timer);
    ...
    }

    // If change_callback_handle_arg ran on an empty function, remove
    // the added handler.
    @unchange_callback_handle_arg
    depends on change_timer_function_usage &&
    change_callback_handle_arg@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _handletype;
    identifier _handle;
    identifier t;
    @@

    void _callback(struct timer_list *t)
    {
    - _handletype *_handle = from_timer(_handle, t, _timer);
    }

    // We only want to refactor the setup_timer() data argument if we've found
    // the matching callback. This undoes changes in change_timer_function_usage.
    @unchange_timer_function_usage
    depends on change_timer_function_usage &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg &&
    !change_callback_handle_arg@
    expression change_timer_function_usage._E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type change_timer_function_usage._cast_data;
    @@

    (
    -timer_setup(&_E->_timer, _callback, 0);
    +setup_timer(&_E->_timer, _callback, (_cast_data)_E);
    |
    -timer_setup(&_E._timer, _callback, 0);
    +setup_timer(&_E._timer, _callback, (_cast_data)&_E);
    )

    // If we fixed a callback from a .function assignment, fix the
    // assignment cast now.
    @change_timer_function_assignment
    depends on change_timer_function_usage &&
    (change_callback_handle_cast ||
    change_callback_handle_cast_no_arg ||
    change_callback_handle_arg)@
    expression change_timer_function_usage._E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type _cast_func;
    typedef TIMER_FUNC_TYPE;
    @@

    (
    _E->_timer.function =
    -_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -(_cast_func)_callback;
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -(_cast_func)&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -&_callback;
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -(_cast_func)_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -(_cast_func)&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    )

    // Sometimes timer functions are called directly. Replace matched args.
    @change_timer_function_calls
    depends on change_timer_function_usage &&
    (change_callback_handle_cast ||
    change_callback_handle_cast_no_arg ||
    change_callback_handle_arg)@
    expression _E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type _cast_data;
    @@

    _callback(
    (
    -(_cast_data)_E
    +&_E->_timer
    |
    -(_cast_data)&_E
    +&_E._timer
    |
    -_E
    +&_E->_timer
    )
    )

    // If a timer has been configured without a data argument, it can be
    // converted without regard to the callback argument, since it is unused.
    @match_timer_function_unused_data@
    expression _E;
    identifier _timer;
    identifier _callback;
    @@

    (
    -setup_timer(&_E->_timer, _callback, 0);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, 0L);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, 0UL);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0L);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0UL);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0L);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0UL);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0);
    +timer_setup(_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0L);
    +timer_setup(_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0UL);
    +timer_setup(_timer, _callback, 0);
    )

    @change_callback_unused_data
    depends on match_timer_function_unused_data@
    identifier match_timer_function_unused_data._callback;
    type _origtype;
    identifier _origarg;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *unused
    )
    {
    ... when != _origarg
    }

    Signed-off-by: Kees Cook

    Kees Cook
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

11 Oct, 2017

1 commit

  • A null pointer dereference can occur when a blkcg is removed manually
    with writeback IOs in flight. This is caused by the following case:

    The writeback kworker submits the bio and sets bio->bi_cg_private to tg
    in blk_throtl_assoc_bio().
    Then we remove the block cgroup manually; the blkg and tg will be
    freed if there is no request in flight.
    When the submitted bio comes back, blk_throtl_bio_endio() fetches the
    tg, which was already freed.

    Fix this by increasing the refcount of the blkg in
    blk_throtl_assoc_bio() so that the blkg will not be freed until
    bio_endio() is called.
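    A sketch of the refcounting fix (the call site and names follow the
    description above, not the literal diff):

    static void blk_throtl_assoc_bio_sketch(struct throtl_grp *tg,
                                            struct bio *bio)
    {
    #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
            bio->bi_cg_private = tg;
            /* Pin the blkg; the reference is dropped in
             * blk_throtl_bio_endio(). */
            blkg_get(tg_to_blkg(tg));
    #endif
    }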

    Reviewed-by: Shaohua Li
    Signed-off-by: Jiufei Xue
    Signed-off-by: Jens Axboe

    Jiufei Xue
     

04 Oct, 2017

1 commit

  • There is a case which will lead to an IO stall. The case is described
    as follows:
    /test1
    |-subtest1
    /test2
    |-subtest2
    subtest1 and subtest2 each already have 32 queued bios.

    Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
    bios as follows:
    1) tg=subtest1, do nothing;
    2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
    left, no need to schedule next dispatch;
    3) tg=subtest2, do nothing;
    4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
    left, no need to schedule next dispatch;
    5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
    test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
    to /; note that test1 and test2 each still has 16 queued bios left;
    6) tg=/, try to schedule the next dispatch, but since disptime is now
    (updated in tg_update_disptime with wait=0), the pending timer is not
    actually scheduled;
    7) throtl_upgrade_state dispatches 32 queued bios in total, with 32
    left; test1 and test2 each have 16 queued bios;
    8) throtl_pending_timer_fn sees the left-over bios but can do nothing,
    because throtl_select_dispatch returns 0 and test1/test2 have no
    pending tg.

    The blktrace shows the following:
    8,32 0 0 2.539007641 0 m N throtl upgrade to max
    8,32 0 0 2.539072267 0 m N throtl /test2 dispatch nr_queued=16 read=0 write=16
    8,32 7 0 2.539077142 0 m N throtl /test1 dispatch nr_queued=16 read=0 write=16

    So force schedule dispatch if there are pending children.
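    A sketch of the fix, loosely modelled on the throtl_upgrade_state() loop
    of that era (the force flag is the point; blkg, pos_css and q come from
    the surrounding function and the exact context is an assumption):

    blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
            struct throtl_grp *tg = blkg_to_tg(blkg);
            struct throtl_service_queue *sq = &tg->service_queue;

            tg->disptime = jiffies - 1;
            throtl_select_dispatch(sq);
            /* Force scheduling so left-over children bios are picked up. */
            throtl_schedule_next_dispatch(sq, true);
    }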

    Reviewed-by: Shaohua Li
    Signed-off-by: Joseph Qi
    Signed-off-by: Jens Axboe

    Joseph Qi
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

24 Aug, 2017

1 commit

  • A discard request is usually very big and can easily use up a cgroup's
    entire bandwidth budget. The discard request size doesn't really mean
    the size of data written, so it doesn't make sense to account it against
    the bandwidth budget. Jens pointed out that treating the size as 0
    doesn't make sense either, because a discard request does have a cost.
    But it's not easy to find the actual cost, so this patch simply makes
    the size one sector.
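    A sketch of the accounting helper this implies (shown as an illustration
    of the idea rather than the exact patch):

    static inline unsigned int throtl_bio_data_size(struct bio *bio)
    {
            /* Discards are huge but cheap: charge them as one sector. */
            if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
                    return 512;
            return bio->bi_iter.bi_size;
    }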

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

29 Jul, 2017

2 commits

  • Currently cfq/bfq/blk-throttle output cgroup info in traces in their own
    way. Now we have a standard blktrace API for this, so convert them to
    use it.

    Note, this changes the behavior a little bit. cgroup info isn't output
    by default; we only do this with the 'blk_cgroup' option enabled. cgroup
    info isn't output as a string by default either; we only do this with
    the 'blk_cgname' option enabled. Also, cgroup info is output at a
    different position in the note string. I think these behavior changes
    aren't a big issue (actually we make the trace data shorter, which is
    good), since the blktrace note is solely for debugging.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • blkcg_bio_issue_check() already gets the blkcg for a bio, and
    bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
    operation. There is no reason not to attach the cgroup info to the bio
    in blkcg_bio_issue_check(). This also makes blktrace output the correct
    cgroup info.

    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

07 Jun, 2017

2 commits

  • Hard disk IO latency varies a lot depending on spindle movement. The
    latency range could be from several microseconds to several
    milliseconds, so it's pretty hard to get the baseline latency used by
    io.low.

    We use a different strategy here. The idea is to only use IO that
    involves spindle movement to determine if a cgroup's IO is in a good
    state. For HD, if the IO latency is small (< 1ms), we ignore the IO.
    Such IO is likely from sequential IO and doesn't help determine if a
    cgroup's IO is impacted by other cgroups. With this, we only account IO
    with big latency. Then we can choose a hardcoded baseline latency for HD
    (4ms, which is a typical IO latency with seek). With all these settings,
    the io.low latency works for both HD and SSD.
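    A sketch of the rotational-disk special-casing described above (the
    constants and control flow are assumptions based on this text):

    #define HD_FILTER_LATENCY       (1 * NSEC_PER_MSEC)     /* ignore < 1ms */
    #define HD_BASELINE_LATENCY     (4 * NSEC_PER_MSEC)     /* typical seek */

    if (rotational) {
            if (latency_ns < HD_FILTER_LATENCY)
                    return;        /* likely sequential, not a useful sample */
            threshold_ns = HD_BASELINE_LATENCY + tg->latency_target;
    }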

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • I have encountered a NULL pointer dereference in
    throtl_schedule_pending_timer:
    [ 413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
    [ 413.735535] IP: [] throtl_schedule_pending_timer+0x3f/0x210
    [ 413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0
    [ 413.735713] Oops: 0000 [#1] SMP
    ......

    This is caused by the following case: in blk_throtl_bio ->
    throtl_schedule_next_dispatch -> throtl_schedule_pending_timer, the
    service queue can be the top-level one with no parent throtl_grp, so
    sq_to_tg(sq) is NULL and dereferencing tg->td->throtl_slice oopses.

    To fix this issue, use sq_to_td() instead of sq_to_tg(sq)->td, which
    will always return a valid td.
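    The shape of the fix (illustrative; sq_to_td() copes with the top-level
    service queue that has no backing tg):

    /* before: sq_to_tg(sq)->td->throtl_slice, where tg is NULL for the
     * top-level sq */
    unsigned long max_expire = jiffies + 8 * sq_to_td(sq)->throtl_slice;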

    Fixes: 297e3d854784 ("blk-throttle: make throtl_slice tunable")
    Signed-off-by: Joseph Qi
    Reviewed-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Joseph Qi
     

23 May, 2017

4 commits

  • The default value of the io.low limit is 0. If the user doesn't
    configure the limit, the last patch throttles the cgroup to a very tiny
    bps/iops, which could stall the system. A cgroup with the default io.low
    settings really means nothing, so we force the user to configure all
    settings, otherwise the io.low limit doesn't take effect. With this
    strategy, the default settings of latency/idle aren't important, so just
    set them to very conservative and safe values.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • If a cgroup has a low limit of 0 for both bps and iops, the cgroup's low
    limit is ignored and we throttle the cgroup with its max limit. In this
    way, other cgroups with a low limit will not get protected. To fix this,
    we don't make that exception any more: a cgroup will be throttled to a
    limit of 0 if it uses the default setting. To avoid a complete stall, we
    give such a cgroup tiny IO resources.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • This info is important for understanding what's happening and helps
    debugging.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • For idle time, a child's setting should not be bigger than its parent's.
    For latency target, a child's setting should not be smaller than its
    parent's. The leaf nodes will adjust their settings according to the
    hierarchy, compare their IO against the settings, and do the
    upgrade/downgrade. Parent nodes don't need to track their IO
    latency/idle time.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     

20 Apr, 2017

1 commit

  • We trigger this warning:

    block/blk-throttle.c: In function ‘blk_throtl_bio’:
    block/blk-throttle.c:2042:6: warning: variable ‘ret’ set but not used [-Wunused-but-set-variable]
    int ret;
    ^~~

    since we only assign 'ret' if BLK_DEV_THROTTLING_LOW is off, we never
    check it.

    Reported-by: Bart Van Assche
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Mar, 2017

11 commits

  • One hard problem in adding the .low limit is detecting an idle cgroup.
    If a cgroup doesn't dispatch enough IO against its low limit, we must
    have a mechanism to determine if other cgroups can dispatch more IO. We
    added the think time detection mechanism before, but it doesn't work for
    all workloads. Here we add a latency-based approach.

    We already have a mechanism to calculate the latency threshold for each
    IO size. For every IO dispatched from a cgroup, we compare its latency
    against its threshold and record the result. If most IO latency is below
    the threshold (in the code I use 75%), the cgroup can be treated as idle
    and other cgroups can dispatch more IO.

    Currently this latency target check is only for SSD, as we can't
    calculate the latency target for a hard disk. And this is only for
    cgroup leaf nodes so far.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • User configures latency target, but the latency threshold for each
    request size isn't fixed. For a SSD, the IO latency highly depends on
    request size. To calculate latency threshold, we sample some data, eg,
    average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
    threshold of each request size will be the sample latency (I'll call it
    base latency) plus latency target. For example, the base latency for
    request size 4k is 80us and user configures latency target 60us. The 4k
    latency threshold will be 80 + 60 = 140us.

    To sample data, we calculate the order base 2 of rounded up IO sectors.
    If the IO size is bigger than 1M, it will be accounted as 1M. Since the
    calculation does round up, the base latency will be slightly smaller
    than actual value. Also if there isn't any IO dispatched for a specific
    IO size, we will use the base latency of smaller IO size for this IO
    size.
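    A sketch of the size bucketing this describes (essentially
    order_base_2() of the rounded-up sector count; the clamp and exact
    offsets are illustrative):

    static int request_bucket_index_sketch(sector_t sectors)
    {
            /* 4k IO (8 sectors) lands in bucket 0, 8k in bucket 1, and
             * anything >= 1M in the last bucket. */
            return clamp(order_base_2(sectors) - 3, 0,
                         LATENCY_BUCKET_SIZE - 1);
    }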

    But we shouldn't sample data at just any time. The base latency is
    supposed to be the latency when the disk isn't congested, because we use
    the latency threshold to schedule IOs between cgroups. If the disk is
    congested, the latency is higher and using it for scheduling is
    meaningless. Hence we only do the sampling when block throttling is at
    the LOW limit, with the assumption that the disk isn't congested in that
    state. If the assumption isn't true, e.g. the low limit is too high, the
    calculated latency threshold will be higher.

    Hard disk is completely different. Latency depends on spindle seek
    instead of request size. Currently this feature is SSD only, we probably
    can use a fixed threshold like 4ms for hard disk though.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Here we introduce a per-cgroup latency target. The target determines how
    much latency increase a cgroup can afford. We will use the target
    latency to calculate a threshold and use it to schedule IO for cgroups.
    If a cgroup's bandwidth is below its low limit but its average latency
    is below the threshold, other cgroups can safely dispatch more IO, even
    if their bandwidth is higher than their low limits. On the other hand,
    if the first cgroup's latency is higher than the threshold, other
    cgroups are throttled to their low limits. So the target latency
    determines how efficiently we utilize free disk resources without
    sacrificing the workload's IO latency.

    For example, assume the 4k IO average latency is 50us when the disk
    isn't congested. A cgroup sets the target latency to 30us. Then the
    cgroup can accept up to 50+30=80us IO latency. If the cgroup's average
    IO latency is 90us and its bandwidth is below its low limit, other
    cgroups are throttled to their low limits. If the cgroup's average IO
    latency is 60us, other cgroups are allowed to dispatch more IO. When
    other cgroups dispatch more IO, the first cgroup's IO latency will
    increase. If it increases to 81us, we then throttle the other cgroups.

    User will configure the interface in this way:
    echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low

    Latency is in microseconds.

    By default, latency target is 0, which means to guarantee IO latency.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Last patch introduces a way to detect idle cgroup. We use it to make
    upgrade/downgrade decision. And the new algorithm can detect completely
    idle cgroup too, so we can delete the corresponding code.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Add an interface to configure the threshold. The io.low interface will
    look like:
    echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low

    Idle time is in microseconds.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • A cgroup gets assigned a low limit, but the cgroup could never dispatch
    enough IO to cross the low limit. In such a case, the queue state
    machine will remain in the LIMIT_LOW state and all other cgroups will be
    throttled according to their low limit. This is unfair for the other
    cgroups. We should treat the cgroup as idle and upgrade the state
    machine to LIMIT_MAX.

    We also have a downgrade logic. If the state machine upgrades because of
    cgroup idle (real idle), the state machine will soon downgrade again
    because the cgroup is still below its low limit. This isn't what we
    want. A more complicated case is a cgroup that isn't idle when the queue
    is in LIMIT_LOW, but when the queue gets upgraded to LIMIT_MAX, other
    cgroups could dispatch more IO and this cgroup can't dispatch enough IO,
    so the cgroup is below its low limit and looks idle (fake idle). In this
    case, the queue should downgrade soon. The key to determining if we
    should downgrade is to detect whether the cgroup is truly idle.

    Unfortunately it's very hard to determine if a cgroup is really idle.
    This patch uses the 'think time check' idea from CFQ for the purpose.
    Please note, the idea doesn't work for all workloads. For example, a
    workload with io depth 8 has disk utilization 100%, hence its think time
    is 0 and it doesn't look idle. But the workload could run at higher
    bandwidth with io depth 16. Compared to io depth 16, the io depth 8
    workload is idle. We use the idea to roughly determine if a cgroup is
    idle.

    We treat a cgroup as idle if its think time is above a threshold (by
    default 1ms for SSD and 100ms for HD). The idea is that think time above
    the threshold will start to harm performance. HD is much slower, so a
    longer think time is ok.

    The patch (and the later patches) uses 'unsigned long' to track time.
    We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
    precision; that should not be a big deal.
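    A sketch of the resulting check (field names are assumptions drawn from
    this description):

    static bool tg_is_idle_sketch(struct throtl_grp *tg)
    {
            /* Times are kept in "approximate microseconds" via the cheap
             * ns >> 10 conversion mentioned above. */
            unsigned long now = ktime_get_ns() >> 10;
            unsigned long think = now - tg->last_finish_time;

            return think > tg->idletime_threshold;
    }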

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • When cgroups all reach their low limit, cgroups can dispatch more IO.
    This could make some cgroups dispatch more IO while others don't, and
    some cgroups could even dispatch less IO than their low limit. For
    example, cg1 has a low limit of 10MB/s, cg2 a limit of 80MB/s, and
    assume the disk's maximum bandwidth is 120M/s for the workload. Their
    bps could look something like this:

    cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

    At T1, all cgroups reach their low limit, so they can dispatch more IO
    later. Then cg1 dispatches more IO and cg2 has no room to dispatch
    enough IO. At T2, cg2 only dispatches 60M/s. Since we detect that cg2
    dispatches less IO than its low limit of 80M/s, we downgrade the queue
    from LIMIT_MAX to LIMIT_LOW, and then all cgroups are throttled to their
    low limit (T3). cg2 will have bandwidth below its low limit most of the
    time.

    The big problem here is we don't know the maximum bandwidth of the
    workload, so we can't make a smart decision to avoid the situation. This
    patch makes cgroup bandwidth changes smooth. After the disk upgrades
    from LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all the
    bandwidth up to their max limit immediately. Their bandwidth limit will
    be increased gradually to avoid the above situation. So the example
    above becomes something like:

    cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
    -> 45/75 -> 22/98

    In this way cgroup bandwidth will be above the low limit most of the
    time. This still doesn't fully utilize disk bandwidth, but that's
    something we pay for sharing.

    Scale up is linear: the limit scales up by 1/2 of the .low limit every
    throtl_slice after an upgrade, and the scale up stops once the adjusted
    limit hits the .max limit. Scale down is exponential: we cut the scale
    value in half if a cgroup doesn't hit its .low limit. If the scale
    becomes 0, we then fully downgrade the queue to the LIMIT_LOW state.
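    A sketch of the adjusted limit, mirroring the description (+1/2 of .low
    per throtl_slice since the upgrade, capped at .max; the arithmetic is
    illustrative):

    static u64 throtl_adjusted_limit_sketch(u64 low, u64 max,
                                            unsigned int scale)
    {
            u64 adjusted = low + (low >> 1) * scale; /* scale grows per slice */

            return min(adjusted, max);
    }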

    Note this doesn't completely prevent a cgroup from running under its low
    limit. The best way to guarantee a cgroup doesn't run under its limit is
    to set a max limit. For example, if we set cg1's max limit to 40, cg2
    will never run under its low limit.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
    cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
    we can't move the state machine to a higher level and all cgroups will
    be throttled to their lower limit, wasting bandwidth. Detecting an idle
    cgroup is hard. This patch handles a simple case: a cgroup that doesn't
    dispatch any IO. We ignore such a cgroup's limit, so other cgroups can
    use the bandwidth.

    Please note this will be replaced with a more sophisticated algorithm
    later, but it demonstrates the idea of how we handle idle cgroups, so I
    leave it here.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • throtl_slice is 100ms by default. This is a long time for an SSD; a lot
    of IO can run in it. To give cgroups smoother throughput, we choose a
    smaller value (20ms) for SSD.
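    The resulting defaults, roughly (the macro names are assumptions):

    #define DFL_THROTL_SLICE_HD     (HZ / 10)       /* 100ms */
    #define DFL_THROTL_SLICE_SSD    (HZ / 50)       /* 20ms */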

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • throtl_slice is important for blk-throttling. It's called a slice
    internally, but it really is the time window over which blk-throttling
    samples data. blk-throttling makes decisions based on these samples. An
    example is bandwidth measurement: a cgroup's bandwidth is measured over
    the time interval of throtl_slice.

    A small throtl_slice means cgroups have smoother throughput but burn
    more CPU. Its default value is 100ms, which is not appropriate for all
    disks. A fast SSD can dispatch a lot of IO in 100ms. This patch makes it
    tunable.

    Since throtl_slice isn't a time slice, the sysfs name
    'throttle_sample_time' reflects its character better.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • A cgroup could be throttled to a limit, but when all cgroups cross the
    high limit, the queue enters a higher state and the group should then be
    throttled to a higher limit. It's possible the cgroup is sleeping
    because of throttling while other cgroups don't dispatch IO any more. In
    this case, nobody can trigger the current downgrade/upgrade logic. To
    fix this issue, we could either set up a timer to wake up the cgroup if
    other cgroups are idle, or make sure this cgroup doesn't sleep too long.
    Setting up a timer means we must change the timer very frequently, so
    this patch chooses the latter. Capping the cgroup's sleep time doesn't
    change its bps/iops, but could make it wake up more frequently, which
    isn't a big issue because throtl_slice * 8 is already quite long.
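    A sketch of the cap (per this description, the pending timer is never
    scheduled more than 8 * throtl_slice out; the helper is shown for
    illustration only):

    static void throtl_schedule_pending_timer_sketch(struct throtl_service_queue *sq,
                                                     unsigned long expires)
    {
            unsigned long max_expire = jiffies + 8 * sq_to_td(sq)->throtl_slice;

            if (time_after(expires, max_expire))
                    expires = max_expire;
            mod_timer(&sq->pending_timer, expires);
    }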

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li