12 Feb, 2019

1 commit


30 Dec, 2017

1 commit

  • commit 111be883981748acc9a56e855c8336404a8e787c upstream.

    If a bio is throttled and split after throttling, the bio could be
    resubmitted and enter throttling again. This will cause part of the
    bio to be charged multiple times. If the cgroup has an IO limit, the
    double charge will significantly harm performance. Bio splits have
    become quite common since the arbitrary bio size change.

    To fix this, we always set the BIO_THROTTLED flag when a bio is throttled.
    If the bio is cloned/split, we copy the flag to the new bio too to avoid a
    double charge. However, a cloned bio could be directed to a new disk, in
    which case keeping the flag would be a problem. The observation is that we
    always set a new disk for the bio in this case, so we can clear the flag
    in bio_set_dev() (see the sketch after this entry).

    This issue has existed for a long time; the arbitrary bio size change just
    makes it worse, so this should go into stable at least back to v4.2.

    V1 -> V2: Do not add an extra field in the bio, based on discussion with Tejun

    Cc: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
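
The flag handling above can be modeled outside the kernel. Below is a minimal user-space C sketch under stated assumptions: fake_bio, throttle_charge, clone_bio and set_dev are hypothetical stand-ins, not the real bio API; only the copy-on-clone / clear-on-new-disk behavior is illustrated.

/*
 * Minimal user-space model of the BIO_THROTTLED handling described above.
 */
#include <stdbool.h>
#include <stdio.h>

struct fake_bio {
	int disk_id;
	bool throttled;	/* models BIO_THROTTLED */
};

/* Charging only happens when the flag is not yet set. */
static void throttle_charge(struct fake_bio *bio)
{
	if (bio->throttled)
		return;		/* already charged: avoid a double charge */
	printf("charging bio on disk %d\n", bio->disk_id);
	bio->throttled = true;
}

/* A clone/split inherits the flag so the fragment is not charged again. */
static struct fake_bio clone_bio(const struct fake_bio *src)
{
	return *src;
}

/* Redirecting to a new disk clears the flag, mirroring bio_set_dev(). */
static void set_dev(struct fake_bio *bio, int disk_id)
{
	bio->disk_id = disk_id;
	bio->throttled = false;
}

int main(void)
{
	struct fake_bio bio = { .disk_id = 1, .throttled = false };

	throttle_charge(&bio);		/* charged once */
	struct fake_bio split = clone_bio(&bio);
	throttle_charge(&split);	/* flag copied: no second charge */
	set_dev(&split, 2);		/* new disk: flag cleared */
	throttle_charge(&split);	/* charged against the new disk */
	return 0;
}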
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and where references to a
    license had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis, with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to each file. She confirmed any determination that was
    not immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Oct, 2017

1 commit

  • There is a case which will lead to an IO stall. The case is described as
    follows:
    /test1
    |-subtest1
    /test2
    |-subtest2
    And subtest1 and subtest2 each have 32 queued bios already.

    Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
    bios as follows:
    1) tg=subtest1, do nothing;
    2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
    left, no need to schedule next dispatch;
    3) tg=subtest2, do nothing;
    4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
    left, no need to schedule next dispatch;
    5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
    test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
    to /; note that test1 and test2 each still has 16 queued bios left;
    6) tg=/, try to schedule the next dispatch, but since disptime is now
    (updated in tg_update_disptime with wait=0), the pending timer is in
    fact not scheduled;
    7) In total, throtl_upgrade_state dispatches 32 queued bios, with 32
    left over; test1 and test2 each still have 16 queued bios;
    8) throtl_pending_timer_fn sees the left-over bios but can do nothing,
    because throtl_select_dispatch returns 0 and test1/test2 have no
    pending tg.

    The blktrace shows the following:
    8,32 0 0 2.539007641 0 m N throtl upgrade to max
    8,32 0 0 2.539072267 0 m N throtl /test2 dispatch nr_queued=16 read=0 write=16
    8,32 7 0 2.539077142 0 m N throtl /test1 dispatch nr_queued=16 read=0 write=16

    So force scheduling a dispatch if there are pending children (see the
    sketch after this entry).

    Reviewed-by: Shaohua Li
    Signed-off-by: Joseph Qi
    Signed-off-by: Jens Axboe

    Joseph Qi
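
A rough user-space C model of the behavior the fix restores, under stated assumptions: group, dispatch_pass and the 8-bio per-pass budget are hypothetical stand-ins for the blk-throttle structures, and the loop simply keeps dispatching as long as any child still has queued bios.

/*
 * Model of "force schedule dispatch if there are pending children".
 */
#include <stdio.h>

struct group { const char *name; int nr_queued; };

/* Each pass moves at most 'budget' bios per group toward the root. */
static int dispatch_pass(struct group *g, int n, int budget)
{
	int moved = 0;
	for (int i = 0; i < n; i++) {
		int take = g[i].nr_queued < budget ? g[i].nr_queued : budget;
		g[i].nr_queued -= take;
		moved += take;
	}
	return moved;
}

int main(void)
{
	struct group kids[] = { { "test1", 32 }, { "test2", 32 } };
	int pass = 0;

	do {
		int moved = dispatch_pass(kids, 2, 8);
		printf("pass %d: dispatched %d (left: %d + %d)\n",
		       ++pass, moved, kids[0].nr_queued, kids[1].nr_queued);
		/* The fix: keep (re)scheduling while children are pending. */
	} while (kids[0].nr_queued || kids[1].nr_queued);

	return 0;
}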
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

24 Aug, 2017

1 commit

  • A discard request is usually very big and can easily use up a cgroup's
    whole bandwidth budget. A discard request's size doesn't really reflect
    the amount of data written, so it doesn't make sense to account it against
    the bandwidth budget. Jens pointed out that treating the size as 0 doesn't
    make sense either, because a discard request does have a cost, but it's
    not easy to find the actual cost. This patch simply accounts each discard
    as one sector (see the sketch after this entry).

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
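
A minimal C sketch of the accounting rule described above. throtl_charge_size() is a hypothetical helper, not the kernel function; it only shows the "one sector per discard" idea.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define SECTOR_SIZE 512u

static uint64_t throtl_charge_size(uint64_t bio_bytes, bool is_discard)
{
	/* Discards are cheap per byte but not free: charge one sector. */
	if (is_discard)
		return SECTOR_SIZE;
	return bio_bytes;
}

int main(void)
{
	printf("write 1 MiB   -> charged %llu bytes\n",
	       (unsigned long long)throtl_charge_size(1 << 20, false));
	printf("discard 1 GiB -> charged %llu bytes\n",
	       (unsigned long long)throtl_charge_size(1ull << 30, true));
	return 0;
}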
     

29 Jul, 2017

2 commits

  • Currently cfq/bfq/blk-throttle each output cgroup info in traces in their
    own way. Now that we have a standard blktrace API for this, convert them
    to use it.

    Note, this changes the behavior a little bit. cgroup info isn't output by
    default; we only do this with the 'blk_cgroup' option enabled. cgroup
    info isn't output as a string by default either; we only do this with the
    'blk_cgname' option enabled. Also, cgroup info is output at a different
    position in the note string. I don't think these behavior changes are a
    big issue (we actually make the trace data shorter, which is good), since
    the blktrace note is solely for debugging.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • blkcg_bio_issue_check() already gets the blkcg for a bio.
    bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
    operation. There is no reason not to attach the cgroup info to the bio in
    blkcg_bio_issue_check. This also makes blktrace output the correct cgroup
    info.

    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

07 Jun, 2017

2 commits

  • Hard disk IO latency varies a lot depending on spindle movement; the
    range could be from several microseconds to several milliseconds. It's
    pretty hard to get the baseline latency used by io.low.

    We use a different strategy here. The idea is to only use IO that
    involves spindle movement to determine if a cgroup's IO is in a good
    state. For HD, if the IO latency is small (< 1ms), we ignore the IO. Such
    IO is likely sequential and is of little help in determining whether a
    cgroup's IO is impacted by other cgroups. With this, we only account IO
    with big latency, and we can then choose a hardcoded baseline latency for
    HD (4ms, which is a typical IO latency with a seek). With all these
    settings, the io.low latency works for both HD and SSD (a sketch follows
    this entry).

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
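
A minimal C sketch of the sampling rule described above, with hypothetical names (lat_tracker, account_latency, baseline_latency): on rotational disks, samples under 1 ms are ignored and the baseline is a fixed 4 ms; on SSD the measured average is used.

#include <stdbool.h>
#include <stdio.h>

#define HD_MIN_SAMPLE_US  1000u	/* ignore likely-sequential HD IO */
#define HD_BASELINE_US    4000u	/* typical seek latency */

struct lat_tracker { unsigned long long sum_us; unsigned long nr; };

static void account_latency(struct lat_tracker *t, unsigned long lat_us,
			    bool rotational)
{
	if (rotational && lat_us < HD_MIN_SAMPLE_US)
		return;			/* no spindle move: skip the sample */
	t->sum_us += lat_us;
	t->nr++;
}

static unsigned long baseline_latency(const struct lat_tracker *t,
				      bool rotational)
{
	if (rotational)
		return HD_BASELINE_US;	/* hardcoded HD baseline */
	return t->nr ? (unsigned long)(t->sum_us / t->nr) : 0;
}

int main(void)
{
	struct lat_tracker hd = { 0, 0 };

	account_latency(&hd, 200, true);	/* ignored */
	account_latency(&hd, 6500, true);	/* counted */
	printf("HD baseline: %lu us\n", baseline_latency(&hd, true));
	return 0;
}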
     
  • I have encountered a NULL pointer dereference in
    throtl_schedule_pending_timer:
    [ 413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
    [ 413.735535] IP: [] throtl_schedule_pending_timer+0x3f/0x210
    [ 413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0
    [ 413.735713] Oops: 0000 [#1] SMP
    ......

    This is caused by the following case:

    blk_throtl_bio
      throtl_schedule_next_dispatch        <= sq is the top-level service queue
        throtl_schedule_pending_timer
          sq_to_tg(sq)->td->throtl_slice   <= sq_to_tg(sq) returns NULL here

    Fix it by using sq_to_td() instead of sq_to_tg(sq)->td, which will always
    return a valid td.

    Fixes: 297e3d854784 ("blk-throttle: make throtl_slice tunable")
    Signed-off-by: Joseph Qi
    Reviewed-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Joseph Qi
     

23 May, 2017

4 commits

  • The default value of the io.low limit is 0. If the user doesn't configure
    the limit, the last patch makes the cgroup be throttled to a very tiny
    bps/iops, which could stall the system. A cgroup with the default io.low
    settings really means nothing, so we force the user to configure all
    settings, otherwise the io.low limit doesn't take effect. With this
    strategy, the default setting of latency/idle isn't important, so we just
    set them to very conservative and safe values.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • If a cgroup has a low limit of 0 for both bps and iops, the cgroup's low
    limit is ignored and we throttle the cgroup with its max limit. In this
    way, other cgroups with a low limit will not get protected. To fix this,
    we don't make that exception any more: the cgroup will be throttled to a
    limit of 0 if it uses the default setting. To avoid a complete stall, we
    give such a cgroup tiny IO resources.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • This info is important for understanding what's happening and helps with debugging.

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • For idle time, a child's setting should not be bigger than its parent's.
    For the latency target, a child's setting should not be smaller than its
    parent's. Leaf nodes will adjust their settings according to the
    hierarchy, compare their IO against the adjusted settings, and do the
    upgrade/downgrade. Parent nodes don't need to track their IO latency/idle
    time (a sketch follows this entry).

    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Shaohua Li
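
A minimal C sketch of the hierarchical clamping described above, using hypothetical types (tg_cfg, effective_cfg): walking from the root down, a child's effective idle time is capped by its parent's and its latency target is raised to at least its parent's.

#include <stdio.h>

struct tg_cfg {
	unsigned long idle_us;		/* configured idle time */
	unsigned long latency_us;	/* configured latency target */
};

static struct tg_cfg effective_cfg(struct tg_cfg parent, struct tg_cfg child)
{
	struct tg_cfg eff = child;

	if (eff.idle_us > parent.idle_us)
		eff.idle_us = parent.idle_us;		/* not bigger than parent */
	if (eff.latency_us < parent.latency_us)
		eff.latency_us = parent.latency_us;	/* not smaller than parent */
	return eff;
}

int main(void)
{
	struct tg_cfg parent = { .idle_us = 1000, .latency_us = 100 };
	struct tg_cfg child  = { .idle_us = 5000, .latency_us = 20 };
	struct tg_cfg eff = effective_cfg(parent, child);

	printf("effective idle=%lu us, latency=%lu us\n",
	       eff.idle_us, eff.latency_us);	/* prints 1000, 100 */
	return 0;
}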
     

20 Apr, 2017

1 commit

  • We trigger this warning:

    block/blk-throttle.c: In function ‘blk_throtl_bio’:
    block/blk-throttle.c:2042:6: warning: variable ‘ret’ set but not used [-Wunused-but-set-variable]
    int ret;
    ^~~

    We only assign 'ret' when BLK_DEV_THROTTLING_LOW is off, and then never
    check it.

    Reported-by: Bart Van Assche
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Mar, 2017

17 commits

  • One hard problem in adding the .low limit is detecting idle cgroups. If
    one cgroup doesn't dispatch enough IO against its low limit, we must have
    a mechanism to determine whether other cgroups can dispatch more IO. We
    added the think time detection mechanism before, but it doesn't work for
    all workloads. Here we add a latency-based approach.

    We already have a mechanism to calculate a latency threshold for each IO
    size. For every IO dispatched from a cgroup, we compare its latency
    against the threshold and record the result. If most IO latency is below
    the threshold (in the code I use 75%), the cgroup can be treated as idle
    and other cgroups can dispatch more IO (a sketch follows this entry).

    Currently this latency target check is only for SSD, as we can't
    calculate the latency target for hard disks. And this is only for cgroup
    leaf nodes so far.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
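
A minimal C sketch of the idle check described above, with hypothetical names (lat_stats, record_io, cgroup_is_idle): count how many completed IOs were below their latency threshold and treat the cgroup as idle when at least 75% of them were.

#include <stdbool.h>
#include <stdio.h>

struct lat_stats {
	unsigned long total;
	unsigned long below_threshold;
};

static void record_io(struct lat_stats *s, unsigned long lat_us,
		      unsigned long threshold_us)
{
	s->total++;
	if (lat_us <= threshold_us)
		s->below_threshold++;
}

static bool cgroup_is_idle(const struct lat_stats *s)
{
	if (!s->total)
		return true;	/* no IO at all: certainly idle */
	/* idle if >= 75% of IOs met their latency threshold */
	return s->below_threshold * 4 >= s->total * 3;
}

int main(void)
{
	struct lat_stats s = { 0, 0 };
	unsigned long samples[] = { 80, 90, 200, 85, 70, 95, 60, 400 };

	for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		record_io(&s, samples[i], 140);	/* threshold: 140 us */

	printf("idle: %s\n", cgroup_is_idle(&s) ? "yes" : "no");
	return 0;
}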
     
  • The user configures a latency target, but the latency threshold for each
    request size isn't fixed. For an SSD, the IO latency highly depends on
    request size. To calculate the latency threshold, we sample some data,
    e.g., the average latency for request sizes 4k, 8k, 16k, 32k .. 1M. The
    latency threshold of each request size will be the sampled latency (I'll
    call it the base latency) plus the latency target. For example, if the
    base latency for request size 4k is 80us and the user configures a
    latency target of 60us, the 4k latency threshold will be 80 + 60 = 140us.

    To sample the data, we calculate the base-2 order of the rounded-up IO
    sector count. If the IO size is bigger than 1M, it is accounted as 1M.
    Since the calculation rounds up, the base latency will be slightly
    smaller than the actual value. Also, if there isn't any IO dispatched for
    a specific IO size, we use the base latency of the next smaller IO size
    for it (a sketch follows this entry).

    But we shouldn't sample data at just any time. The base latency is
    supposed to be the latency when the disk isn't congested, because we use
    the latency threshold to schedule IO between cgroups. If the disk is
    congested, the latency is higher and using it for scheduling is
    meaningless. Hence we only do the sampling when block throttling is at
    the LOW limit, on the assumption that the disk isn't congested in that
    state. If the assumption isn't true, e.g., the low limit is too high, the
    calculated latency threshold will be higher than it should be.

    Hard disks are completely different: latency depends on spindle seek
    instead of request size. Currently this feature is SSD-only; we could
    probably use a fixed threshold like 4ms for hard disks though.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
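
A minimal C sketch of the bucketing described above, with hypothetical names and made-up base latencies (base_lat_us): IOs are grouped by the base-2 order of their rounded-up sector count (capped at 1M), and the per-bucket threshold is the base latency plus the configured latency target.

#include <stdio.h>

#define SECTOR_SHIFT	9
#define NR_BUCKETS	9	/* 4k, 8k, ..., 1M */

/* assumed sampled base latencies in us, one per size bucket */
static const unsigned long base_lat_us[NR_BUCKETS] = {
	80, 85, 95, 110, 130, 160, 200, 260, 340
};

static unsigned int size_to_bucket(unsigned long bytes)
{
	unsigned long sectors = (bytes + (1u << SECTOR_SHIFT) - 1) >> SECTOR_SHIFT;
	unsigned int order = 0;

	/* round the sector count up to a power of two (ceil of log2) */
	while ((1ul << (order + 1)) < sectors)
		order++;
	if ((1ul << order) < sectors)
		order++;
	/* bucket 0 is 4k (8 sectors, order 3); cap at the 1M bucket */
	if (order < 3)
		order = 3;
	unsigned int bucket = order - 3;
	return bucket >= NR_BUCKETS ? NR_BUCKETS - 1 : bucket;
}

static unsigned long latency_threshold_us(unsigned long bytes,
					  unsigned long target_us)
{
	return base_lat_us[size_to_bucket(bytes)] + target_us;
}

int main(void)
{
	/* 4k IO, 60 us target -> 80 + 60 = 140 us, as in the example above */
	printf("4k threshold: %lu us\n", latency_threshold_us(4096, 60));
	printf("2M threshold: %lu us\n", latency_threshold_us(2 << 20, 60));
	return 0;
}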
     
  • Here we introduce a per-cgroup latency target. The target determines how
    much latency increase a cgroup can afford. We use the target latency to
    calculate a threshold and use the threshold to schedule IO for cgroups.
    If a cgroup's bandwidth is below its low limit but its average latency is
    below the threshold, other cgroups can safely dispatch more IO, even if
    their bandwidth is higher than their low limits. On the other hand, if
    the first cgroup's latency is higher than the threshold, other cgroups
    are throttled to their low limits. So the target latency determines how
    efficiently we utilize free disk resources without sacrificing the
    workload's IO latency.

    For example, assume 4k IO average latency is 50us when the disk isn't
    congested. A cgroup sets the target latency to 30us; then the cgroup can
    accept 50 + 30 = 80us IO latency. If the cgroup's average IO latency is
    90us and its bandwidth is below its low limit, other cgroups are
    throttled to their low limits. If the cgroup's average IO latency is
    60us, other cgroups are allowed to dispatch more IO. When other cgroups
    dispatch more IO, the first cgroup's IO latency will increase; if it
    increases to 81us, we then throttle the other cgroups.

    The user will configure the interface in this way:
    echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low

    The latency is in microseconds.

    By default, the latency target is 0, which means IO latency is strictly
    guaranteed.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • The last patch introduced a way to detect an idle cgroup. We use it to
    make the upgrade/downgrade decision. The new algorithm can detect a
    completely idle cgroup too, so we can delete the corresponding code.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Add an interface to configure the threshold. The io.low interface will
    look like:
    echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low

    idle is in microseconds.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • A cgroup gets assigned a low limit, but the cgroup could never dispatch
    enough IO to cross the low limit. In such a case, the queue state machine
    will remain in the LIMIT_LOW state and all other cgroups will be
    throttled according to their low limits. This is unfair to the other
    cgroups. We should treat the cgroup as idle and upgrade the state machine
    (to LIMIT_MAX).

    We also have downgrade logic. If the state machine upgrades because of
    cgroup idleness (real idle), the state machine would otherwise downgrade
    soon because the cgroup stays below its low limit. This isn't what we
    want. A more complicated case is when a cgroup isn't idle while the queue
    is in LIMIT_LOW, but once the queue gets upgraded, other cgroups dispatch
    more IO and this cgroup can't dispatch enough IO, so the cgroup is below
    its low limit and looks idle (fake idle). In this case, the queue should
    downgrade soon. The key to deciding whether to downgrade is detecting
    whether the cgroup is truly idle.

    Unfortunately it's very hard to determine if a cgroup is really idle.
    This patch uses the 'think time check' idea from CFQ for this purpose.
    Please note, the idea doesn't work for all workloads. For example, a
    workload with IO depth 8 has 100% disk utilization, hence its think time
    is 0, i.e., not idle. But the workload could run at higher bandwidth with
    IO depth 16; compared to IO depth 16, the IO depth 8 workload is idle. We
    use the idea to roughly determine if a cgroup is idle.

    We treat a cgroup as idle if its think time is above a threshold (by
    default 1ms for SSD and 100ms for HD). The idea is that think time above
    the threshold will start to harm performance. HD is much slower, so a
    longer think time is OK (a sketch follows this entry).

    This patch (and the later patches) uses 'unsigned long' to track time.
    We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses a little
    precision, which should not be a big deal.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
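
A minimal C sketch of a think-time based idle check, with hypothetical names (think_time, io_finished, io_submitted, cgroup_is_idle): track the gap between an IO completing and the next IO being submitted, keep a simple running average, and treat the cgroup as idle when the average exceeds a per-device threshold. The ns >> 10 conversion mirrors the cheap ns-to-us approximation mentioned above.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SSD_IDLE_THRESHOLD_US	1000ul		/* 1 ms  */
#define HD_IDLE_THRESHOLD_US	100000ul	/* 100 ms */

struct think_time {
	unsigned long last_finish_us;	/* when the previous IO finished */
	unsigned long avg_think_us;	/* running average of the gaps */
};

static unsigned long ns_to_us(uint64_t ns)
{
	return (unsigned long)(ns >> 10);	/* ~1024 ns per "us" */
}

static void io_finished(struct think_time *tt, uint64_t now_ns)
{
	tt->last_finish_us = ns_to_us(now_ns);
}

static void io_submitted(struct think_time *tt, uint64_t now_ns)
{
	unsigned long gap = ns_to_us(now_ns) - tt->last_finish_us;

	/* exponential moving average: 7/8 old + 1/8 new */
	tt->avg_think_us = (tt->avg_think_us * 7 + gap) / 8;
}

static bool cgroup_is_idle(const struct think_time *tt, bool rotational)
{
	unsigned long thr = rotational ? HD_IDLE_THRESHOLD_US
				       : SSD_IDLE_THRESHOLD_US;
	return tt->avg_think_us > thr;
}

int main(void)
{
	struct think_time tt = { 0, 0 };
	uint64_t now = 0;

	for (int i = 0; i < 16; i++) {
		io_finished(&tt, now);
		now += 3000000ull;		/* 3 ms of "thinking" */
		io_submitted(&tt, now);
		now += 100000ull;		/* 0.1 ms of service time */
	}
	printf("avg think time ~%lu us, idle on SSD: %s\n",
	       tt.avg_think_us, cgroup_is_idle(&tt, false) ? "yes" : "no");
	return 0;
}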
     
  • When all cgroups reach their low limit, cgroups can dispatch more IO.
    This could let some cgroups dispatch more IO while others can't, and
    some cgroups could even dispatch less IO than their low limit. For
    example, say cg1 has a low limit of 10MB/s, cg2 a low limit of 80MB/s,
    and assume the disk's maximum bandwidth is 120MB/s for the workload.
    Their bps could look something like this:

    cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

    At T1, all cgroups reach their low limit, so they can dispatch more IO
    later. Then cg1 dispatches more IO and cg2 has no room to dispatch enough
    IO. At T2, cg2 only dispatches 60MB/s. Since we detect that cg2 dispatches
    less IO than its low limit of 80MB/s, we downgrade the queue from
    LIMIT_MAX to LIMIT_LOW, and all cgroups are throttled to their low limit
    (T3). cg2 will have bandwidth below its low limit most of the time.

    The big problem here is that we don't know the maximum bandwidth of the
    workload, so we can't make a smart decision to avoid the situation. This
    patch makes the cgroup bandwidth change smoothly. After the disk upgrades
    from LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all the
    bandwidth up to their max limit immediately. Their bandwidth limit is
    increased gradually to avoid the above situation. So the above example
    becomes something like:

    cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
    -> 45/75 -> 22/98

    In this way cgroup bandwidth will be above the low limit most of the
    time. This still doesn't fully utilize disk bandwidth, but that's
    something we pay for sharing.

    Scale-up is linear: the limit scales up by 1/2 of the .low limit every
    throtl_slice after the upgrade, and stops once the adjusted limit hits
    the .max limit. Scale-down is exponential: we cut the scale value in half
    whenever a cgroup doesn't hit its .low limit, and if the scale becomes 0
    we fully downgrade the queue to the LIMIT_LOW state (a sketch follows
    this entry).

    Note this doesn't completely avoid a cgroup running under its low limit.
    The best way to guarantee a cgroup doesn't run under its low limit is to
    set a max limit. For example, if we set cg1's max limit to 40, cg2 will
    never run under its low limit.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
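
A minimal C sketch of the scaling arithmetic described above, with hypothetical names (tg_limits, effective_limit): the effective limit after an upgrade is low + (low / 2) * scale, capped at max; scale grows by one per throtl_slice (linear up) and is halved when the cgroup misses its low limit (exponential down).

#include <stdio.h>

struct tg_limits {
	unsigned long low_bps;
	unsigned long max_bps;
	unsigned long scale;	/* shared adjustment factor */
};

static unsigned long effective_limit(const struct tg_limits *tg)
{
	unsigned long adj = tg->low_bps + (tg->low_bps / 2) * tg->scale;

	return adj < tg->max_bps ? adj : tg->max_bps;
}

int main(void)
{
	struct tg_limits cg2 = { .low_bps = 80, .max_bps = 200, .scale = 0 };

	/* linear scale-up: one step per throtl_slice after the upgrade */
	for (int slice = 0; slice < 4; slice++) {
		printf("slice %d: limit %lu MB/s\n",
		       slice, effective_limit(&cg2));
		cg2.scale++;
	}

	/* exponential scale-down while the low limit isn't being met */
	while (cg2.scale) {
		cg2.scale /= 2;
		printf("scale down -> limit %lu MB/s\n",
		       effective_limit(&cg2));
	}
	return 0;
}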
     
  • A cgroup could be assigned a limit but not dispatch enough IO, e.g., the
    cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
    we can't move the state machine to the higher level and all cgroups will
    be throttled to their low limit, wasting bandwidth. Detecting an idle
    cgroup is hard. This patch handles a simple case: a cgroup that doesn't
    dispatch any IO. We ignore such a cgroup's limit, so other cgroups can
    use the bandwidth.

    Please note this will be replaced with a more sophisticated algorithm
    later, but this demonstrates the idea of how we handle idle cgroups, so I
    leave it here.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • The throtl_slice is 100ms by default. This is a long time for an SSD; a
    lot of IO can run in that window. To give cgroups smoother throughput, we
    choose a smaller value (20ms) for SSD.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • throtl_slice is important for blk-throttling. It's called a slice
    internally, but it really is the time window over which blk-throttling
    samples data; blk-throttling makes decisions based on those samples. An
    example is bandwidth measurement: a cgroup's bandwidth is measured over
    the time interval of throtl_slice.

    A small throtl_slice means cgroups have smoother throughput but burn more
    CPU. It has a 100ms default value, which is not appropriate for all
    disks; a fast SSD can dispatch a lot of IO in 100ms. This patch makes it
    tunable.

    Since throtl_slice isn't a time slice, the sysfs name
    'throttle_sample_time' reflects its character better.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • A cgroup could be throttled to a limit, but when all cgroups cross the
    high limit the queue enters a higher state, so the cgroup should be
    throttled to a higher limit. It's possible the cgroup is sleeping because
    of throttling while the other cgroups don't dispatch IO any more. In that
    case, nobody can trigger the current downgrade/upgrade logic. To fix this
    issue, we could either set up a timer to wake up the cgroup when other
    cgroups are idle, or make sure this cgroup doesn't sleep too long.
    Setting up a timer means we must change the timer very frequently, so
    this patch chooses the latter. Capping a cgroup's sleep time doesn't
    change its bps/iops, but could make it wake up more frequently, which
    isn't a big issue because throtl_slice * 8 is already quite big (a sketch
    follows this entry).

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
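
A minimal C sketch of the capping described above, with a hypothetical helper (throtl_wait_ms): the computed throttle wait is clamped to throtl_slice * 8 so a sleeping cgroup re-evaluates the limits reasonably often.

#include <stdio.h>

#define THROTL_SLICE_MS	100ul

static unsigned long throtl_wait_ms(unsigned long computed_wait_ms)
{
	unsigned long max_wait = THROTL_SLICE_MS * 8;	/* 800 ms cap */

	return computed_wait_ms < max_wait ? computed_wait_ms : max_wait;
}

int main(void)
{
	printf("wait 300 ms  -> %lu ms\n", throtl_wait_ms(300));
	printf("wait 5000 ms -> %lu ms\n", throtl_wait_ms(5000));
	return 0;
}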
     
  • When the queue state machine is in the LIMIT_MAX state but a cgroup has
    been below its low limit for some time, the queue should be downgraded to
    the lower state, as one cgroup's low limit isn't being met.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • When the queue is in the LIMIT_LOW state and all cgroups with a low limit
    cross their bps/iops limit, we upgrade the queue's state to LIMIT_MAX. To
    determine if a cgroup exceeds its limit, we check if the cgroup has a
    pending request: since the cgroup is throttled according to the limit, a
    pending request means the cgroup has reached the limit.

    If a cgroup has limits set for both read and write, we consider the
    combination of them for the upgrade. The reason is that read IO and write
    IO can interfere with each other; if we based the upgrade on one
    direction of IO, the other direction could be severely harmed.

    For a cgroup hierarchy, there are two cases. If the children have lower
    low limits than the parent, the parent's low limit is meaningless: if the
    children's bps/iops cross their low limits, we can upgrade the queue
    state. The other case is that the children have higher low limits than
    the parent; then the children's low limits are meaningless: as long as
    the parent's bps/iops (which is the sum of the children's bps/iops) cross
    its low limit, we can upgrade the queue state (a sketch follows this
    entry).

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
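
A minimal C sketch of the upgrade condition described above, with hypothetical names (tg, tg_reached_low_limit, can_upgrade): a group "reaches" its low limit when it has pending requests, reads and writes are considered as a combination, and the queue can upgrade only when every group with a low limit has reached it.

#include <stdbool.h>
#include <stdio.h>

struct tg {
	bool has_low_limit;
	int pending_reads;
	int pending_writes;
};

/* Throttled groups only accumulate pending IO once they hit the limit. */
static bool tg_reached_low_limit(const struct tg *tg)
{
	return tg->pending_reads + tg->pending_writes > 0;
}

static bool can_upgrade(const struct tg *groups, int n)
{
	for (int i = 0; i < n; i++) {
		if (!groups[i].has_low_limit)
			continue;	/* no low limit: irrelevant here */
		if (!tg_reached_low_limit(&groups[i]))
			return false;
	}
	return true;
}

int main(void)
{
	struct tg groups[] = {
		{ .has_low_limit = true,  .pending_reads = 0, .pending_writes = 3 },
		{ .has_low_limit = true,  .pending_reads = 1, .pending_writes = 0 },
		{ .has_low_limit = false, .pending_reads = 0, .pending_writes = 0 },
	};

	printf("upgrade to LIMIT_MAX: %s\n",
	       can_upgrade(groups, 3) ? "yes" : "no");
	return 0;
}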
     
  • Each queue will have a state machine. Initially the queue is in the
    LIMIT_LOW state, which means all cgroups will be throttled according to
    their low limit. After all cgroups with a low limit cross that limit, the
    queue state gets upgraded to the LIMIT_MAX state.
    For the max limit, a cgroup uses the limit configured by the user.
    For the low limit, a cgroup uses the minimum of the low limit and the max
    limit configured by the user. If that minimum is 0, which means the
    cgroup doesn't configure a low limit, we use the max limit to throttle
    the cgroup and the cgroup is ready to upgrade to LIMIT_MAX (a sketch
    follows this entry).

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
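
A minimal C sketch of the per-state limit selection described above, with hypothetical names (throtl_state, tg_conf, effective_bps): in LIMIT_MAX the configured max is used; in LIMIT_LOW the effective limit is min(low, max), with 0 falling back to the max limit (the cgroup is then always ready to upgrade).

#include <stdio.h>

enum throtl_state { LIMIT_LOW, LIMIT_MAX };

struct tg_conf {
	unsigned long low_bps;	/* 0 means "not configured" */
	unsigned long max_bps;
};

static unsigned long effective_bps(const struct tg_conf *c,
				   enum throtl_state state)
{
	if (state == LIMIT_MAX)
		return c->max_bps;

	/* LIMIT_LOW: min(low, max), with 0 falling back to max */
	unsigned long low = c->low_bps < c->max_bps ? c->low_bps : c->max_bps;
	return low ? low : c->max_bps;
}

int main(void)
{
	struct tg_conf with_low = { .low_bps = 10 << 20, .max_bps = 50 << 20 };
	struct tg_conf no_low   = { .low_bps = 0,        .max_bps = 50 << 20 };

	printf("LIMIT_LOW, low set:   %lu\n", effective_bps(&with_low, LIMIT_LOW));
	printf("LIMIT_LOW, low unset: %lu\n", effective_bps(&no_low, LIMIT_LOW));
	printf("LIMIT_MAX:            %lu\n", effective_bps(&with_low, LIMIT_MAX));
	return 0;
}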
     
  • Add a low limit for cgroups and the corresponding cgroup interface. To be
    consistent with memcg, we allow users to configure a .low limit higher
    than the .max limit, but the internal logic always assumes the .low limit
    is lower than the .max limit. So we add extra bps_conf/iops_conf fields
    in throtl_grp for the userspace configuration; the old bps/iops fields in
    throtl_grp hold the actual limits we use for throttling.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • We are going to support low/max limits; each cgroup will have two limits
    after that. This patch prepares for the multiple-limits change.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Clean up the code to avoid using -1.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

28 Feb, 2017

1 commit


23 Jan, 2017

1 commit


28 Oct, 2016

1 commit


20 Sep, 2016

1 commit

  • Right now, if a slice has expired, we start a new slice. If a bio is
    queued, we keep extending the slice by the throtl_slice interval (100ms).

    This worked well as long as the pending timer function got executed
    within a few milliseconds of its scheduled time. But it looks like, with
    recent changes in the timer subsystem, the slack can be much longer,
    depending on the expiry time of the scheduled timer.

    commit 500462a9de65 ("timers: Switch to a non-cascading wheel")

    This means that by the time the timer function gets executed, the delay
    from the scheduled time can be more than 100ms. The current code will
    then conclude that the existing slice has expired and a new one needs to
    be started. The new slice will be 100ms by default, which will not be
    sufficient to meet the group's rate requirement given the bio size, so
    the bio will not be dispatched and we will start a new timer to wait.
    When that timer expires, the same process repeats and we wait again; this
    can easily become an infinite loop.

    Solve this issue by starting a new slice only if the throttle group is
    empty. If it is not empty, there should be an active slice going on.
    Ideally it should not have expired, but given the slack, it is possible
    that it has (a sketch follows this entry).

    Reported-by: Hou Tao
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
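
A minimal C sketch of the rule described above, with hypothetical names (tg_slice, throtl_start_or_extend_slice): a new slice is started only when the group has nothing queued; otherwise the existing slice is extended, even if timer slack made it look expired.

#include <stdbool.h>
#include <stdio.h>

#define THROTL_SLICE_MS 100ul

struct tg_slice {
	unsigned long start_ms;
	unsigned long end_ms;
	int nr_queued;
};

static void throtl_start_or_extend_slice(struct tg_slice *tg,
					  unsigned long now_ms)
{
	if (tg->nr_queued == 0) {
		/* empty group: safe to start a fresh slice */
		tg->start_ms = now_ms;
		tg->end_ms = now_ms + THROTL_SLICE_MS;
		return;
	}
	/* bios are queued: keep the slice history, just extend its end */
	while (tg->end_ms <= now_ms)
		tg->end_ms += THROTL_SLICE_MS;
}

int main(void)
{
	struct tg_slice tg = { .start_ms = 0, .end_ms = 100, .nr_queued = 1 };

	/* timer fired 250 ms late: extend instead of restarting */
	throtl_start_or_extend_slice(&tg, 350);
	printf("slice: start=%lu end=%lu\n", tg.start_ms, tg.end_ms);
	return 0;
}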
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portion. This means that old code
    that relies on manually setting bi_rw is most likely going to be broken.
    Instead of letting that brokenness linger, rename the member to force old
    and out-of-tree code to break at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

1 commit


10 May, 2016

1 commit


18 Sep, 2015

1 commit

  • cgroup_on_dfl() tests whether the cgroup's root is the default
    hierarchy; however, an individual controller is only interested in
    whether the controller is attached to the default hierarchy and never
    tests a cgroup which doesn't belong to the hierarchy that the
    controller is attached to.

    This patch replaces cgroup_on_dfl() tests in controllers with the faster
    static_key-based cgroup_subsys_on_dfl(). This leaves cgroup core as
    the only user of cgroup_on_dfl() and the function is moved from the
    header file to cgroup.c.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo