20 Jan, 2017

1 commit

  • commit ebc4ff661fbe76781c6b16dfb7b754a5d5073f8e upstream.

    cfq_cpd_alloc(), which is the cpd_alloc_fn implementation for cfq, was
    incorrectly hard-coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.
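
    A minimal sketch of the corrected callback (stand-in types so it compiles
    outside the kernel; the real function lives in block/cfq-iosched.c). The
    point is simply that the allocation honours the caller's @gfp mask instead
    of hard-coding GFP_KERNEL:

    #include <stdlib.h>

    typedef unsigned int gfp_t;                    /* stand-in for the kernel type */
    struct blkcg_policy_data { int plid; };        /* stand-in */
    struct cfq_group_data { struct blkcg_policy_data cpd; /* ... */ };

    /* stand-in for kzalloc(): a zeroing allocator that takes a gfp mask */
    static void *kzalloc_stub(size_t size, gfp_t gfp)
    {
            (void)gfp;
            return calloc(1, size);
    }

    static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
    {
            struct cfq_group_data *cgd;

            cgd = kzalloc_stub(sizeof(*cgd), gfp);  /* was hard-coded GFP_KERNEL */
            if (!cgd)
                    return NULL;
            return &cgd->cpd;
    }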

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

24 Sep, 2016

1 commit

  • While debugging timeouts happening in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging everywhere from 2 seconds -
    the first ones that are enough to time out my application - to more than 30
    seconds.

    The problem seems to happen because XFS may block on pending metadata updates
    under certain circumstances, and that's confirmed with the following backtrace
    taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
    high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
    the open() times I saw in that run.

    Still from the blktrace output, after searching a bit we can identify the
    request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself tests cfq_may_dispatch(), it can bail if it sees SYNC
    requests in flight. An incoming request from another queue can also preempt it
    in such a situation before we have the chance to execute anything (as seen in the
    trace above).

    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
    the queue for subsequent requests (unless they are themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.
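
    A toy user-space model of that behaviour (the names only loosely mirror the
    CFQ flags and this is not the kernel patch itself): once the active queue
    holds a FIFO-expired request it is pinned via must_dispatch, only an
    RT-class queue may still preempt it, and the pin is dropped after one
    dispatch round.

    #include <stdbool.h>
    #include <stdint.h>

    struct toy_queue {
            bool rt_class;          /* IOPRIO_CLASS_RT in CFQ terms */
            bool must_dispatch;     /* set once a queued request is past fifo_expire */
            uint64_t oldest_rq_ns;  /* arrival time of the oldest pending request */
            uint64_t fifo_expire_ns;
    };

    /* called when the queue is selected: pin it if its oldest request expired */
    static void toy_check_fifo(struct toy_queue *q, uint64_t now_ns)
    {
            if (now_ns - q->oldest_rq_ns > q->fifo_expire_ns)
                    q->must_dispatch = true;
    }

    /* preemption decision: RT still wins, a pinned queue is otherwise kept */
    static bool toy_may_preempt(const struct toy_queue *active,
                                const struct toy_queue *incoming)
    {
            if (incoming->rt_class && !active->rt_class)
                    return true;
            if (active->must_dispatch)
                    return false;
            return true;    /* fall through to the usual sync-over-async rules */
    }

    /* cleared after one dispatch round so the pin does not persist */
    static void toy_dispatched(struct toy_queue *q)
    {
            q->must_dispatch = false;
    }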

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.
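
    A toy illustration of the packing that makes the old pattern dangerous (the
    bit positions here are made up for the example and are not the kernel's
    actual layout; upstream the member became bi_opf): once the op lives in the
    high bits of the same word, legacy code that assigns a raw flag mask to the
    field silently clobbers the op, which is why a compile-time break is
    preferable.

    #include <assert.h>
    #include <stdint.h>

    /* hypothetical layout: low 24 bits = request flags, high 8 bits = op code */
    #define TOY_OP_SHIFT  24u
    #define TOY_FLAG_SYNC (1u << 0)
    #define TOY_OP_WRITE  1u

    static uint32_t toy_pack(uint32_t op, uint32_t flags)
    {
            return (op << TOY_OP_SHIFT) | flags;
    }

    int main(void)
    {
            uint32_t bi_opf = toy_pack(TOY_OP_WRITE, TOY_FLAG_SYNC);

            /* old-style direct assignment of flags loses the op (becomes a READ) */
            uint32_t legacy = TOY_FLAG_SYNC;

            assert((bi_opf >> TOY_OP_SHIFT) == TOY_OP_WRITE);
            assert((legacy >> TOY_OP_SHIFT) == 0);
            return 0;
    }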

    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jul, 2016

1 commit

    Before merging a bio into an existing request, the io scheduler is called to
    get its approval first. However, requests that come from a plug
    flush may get merged by the block layer without consulting the io
    scheduler.

    In the case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low-weight cgroup's request, the high-weight cgroup
    now depends on the low-weight cgroup to get scheduled. If the high-weight cgroup
    needs that io request to complete before submitting more requests, then it
    will also lose its timeslice.

    The following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
      mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
      if [ -d $i ]; then
        rmdir $i
      fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
        --rw read --size 64k --bs 64k --time_based \
        --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
        --rw read --size 64k --bs 64k --time_based \
        --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
        --rw read --size 64k --bs 64k --time_based \
        --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
      echo ---- $i ----
      cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

28 Jun, 2016

3 commits

  • Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds)
    could result in charging just 1 ns to a cgroup submitting IO instead of the 1
    jiffie we always charged before. It is arguable what the right amount to
    charge is, but for now let's retain the old behavior of always charging at
    least one jiffie.
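
    A worked sketch of that clamp (the constants are illustrative, not the exact
    kernel expression): with HZ=250 one jiffie is 4,000,000 ns, so a 1 ns charge
    gets rounded up to a full jiffie's worth of nanoseconds.

    #include <stdint.h>
    #include <stdio.h>

    #define TOY_HZ       250ULL
    #define NSEC_PER_SEC 1000000000ULL
    #define ONE_JIFFY_NS (NSEC_PER_SEC / TOY_HZ)   /* 4,000,000 ns at HZ=250 */

    static uint64_t charge_at_least_one_jiffy(uint64_t used_ns)
    {
            return used_ns > ONE_JIFFY_NS ? used_ns : ONE_JIFFY_NS;
    }

    int main(void)
    {
            /* a 1 ns slice is still charged as one jiffie: prints 4000000 */
            printf("%llu\n", (unsigned long long)charge_at_least_one_jiffy(1));
            return 0;
    }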

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds)
    broke the condition for detecting starved sync IO in
    cfq_completed_request() because rq->start_time remained in jiffies but
    we compared it with nanosecond values. This manifested as a regression
    in bonnie++ rewrite performance because we always ended up considering
    sync IO starved and thus never increased async IO queue depth.

    Since rq->start_time is used in a lot of places, converting it to ns
    values would be non-trivial. So just revert the condition in CFQ to use
    comparison with jiffies. This will lead to suboptimal results if
    cfq_fifo_expire[1] ever comes close to 1 jiffie, but so far we are
    relatively far from that with the storage used with CFQ (the default
    value is 128 ms).
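
    A small numeric illustration of the unit mismatch (made-up values): a
    jiffies timestamp is tiny compared to a nanosecond clock reading, so a
    mixed-unit "start_time + threshold < now" test is effectively always true
    and sync IO always looks starved.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t rq_start_jiffies = 100000;          /* ~100 s of uptime at HZ=1000 */
            uint64_t threshold_ns     = 125000000;       /* 125 ms expressed in ns */
            uint64_t now_ns           = 100023000000ULL; /* the same instant on an ns clock */

            /* mixed units: prints 1, i.e. "starved", no matter what */
            printf("%d\n", rq_start_jiffies + threshold_ns < now_ns);
            return 0;
    }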

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • slice_resid can be both positive and negative. Commit 9a7f38c42c2b
    (cfq-iosched: Convert from jiffies to nanoseconds) converted it from
    long to u64. Although this did not introduce any functional regression
    (the operations just overflow and the result was fine), it is certainly
    wrong and could cause issues in the future. So convert the type to the more
    appropriate s64.
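
    A short demonstration of why the u64 version "worked by accident" (toy
    values): subtracting a larger value from a u64 wraps around, but adding the
    wrapped residual back later cancels out under modular arithmetic; s64 just
    makes the negative residual explicit.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t resid_u = 0;
            resid_u -= 5;                     /* wraps to 0xFFFFFFFFFFFFFFFB */

            uint64_t slice_end = 1000;
            /* the wrap cancels out when the residual is applied: prints 995 */
            printf("%llu\n", (unsigned long long)(slice_end + resid_u));

            int64_t resid_s = -5;             /* the honest representation */
            printf("%lld\n", (long long)(slice_end + resid_s));   /* also 995 */
            return 0;
    }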

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

10 Jun, 2016

1 commit

  • If we're queuing REQ_PRIO IO and the task is running at an idle IO
    class, then temporarily boost the priority. This prevents livelocks
    due to priority inversion, when a low priority task is holding file
    system resources while attempting to do IO.

    An example of that is shown below. An ioniced idle task is holding
    the directory mutex, while a normal priority task is trying to do
    a directory lookup.

    [478381.198925] ------------[ cut here ]------------
    [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
    [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080
    [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
    [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
    [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
    [478381.206874] Call Trace:
    [478381.207253] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.208175] [] schedule+0x37/0x90
    [478381.208932] [] schedule_timeout+0x1dc/0x250
    [478381.209805] [] ? __blk_run_queue+0x37/0x50
    [478381.210706] [] ? ktime_get+0x45/0xb0
    [478381.211489] [] io_schedule_timeout+0xa7/0x110
    [478381.212402] [] ? prepare_to_wait+0x5b/0x90
    [478381.213280] [] bit_wait_io+0x36/0x50
    [478381.214063] [] __wait_on_bit+0x65/0x90
    [478381.214961] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.215872] [] out_of_line_wait_on_bit+0x7c/0x90
    [478381.216806] [] ? wake_atomic_t_function+0x40/0x40
    [478381.217773] [] __wait_on_buffer+0x2a/0x30
    [478381.218641] [] ext4_bread+0x57/0x70
    [478381.219425] [] __ext4_read_dirblock+0x3c/0x380
    [478381.220467] [] ext4_dx_find_entry+0x7d/0x170
    [478381.221357] [] ? find_get_entry+0x1e/0xa0
    [478381.222208] [] ext4_find_entry+0x484/0x510
    [478381.223090] [] ext4_lookup+0x52/0x160
    [478381.223882] [] lookup_real+0x1d/0x60
    [478381.224675] [] __lookup_hash+0x38/0x50
    [478381.225697] [] lookup_slow+0x45/0xab
    [478381.226941] [] link_path_walk+0x7ae/0x820
    [478381.227880] [] path_init+0xc2/0x430
    [478381.228677] [] ? security_file_alloc+0x16/0x20
    [478381.229776] [] path_openat+0x77/0x620
    [478381.230767] [] ? page_add_file_rmap+0x2e/0x70
    [478381.232019] [] do_filp_open+0x43/0xa0
    [478381.233016] [] ? creds_are_invalid+0x29/0x70
    [478381.234072] [] do_open_execat+0x70/0x170
    [478381.235039] [] do_execveat_common.isra.36+0x1b8/0x6e0
    [478381.236051] [] do_execve+0x2c/0x30
    [478381.236809] [] ? getname+0x12/0x20
    [478381.237564] [] SyS_execve+0x2e/0x40
    [478381.238338] [] stub_execve+0x6d/0xa0
    [478381.239126] ------------[ cut here ]------------
    [478381.239915] ------------[ cut here ]------------
    [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
    [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080
    [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
    [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
    [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
    [478381.248252] Call Trace:
    [478381.248630] [] schedule+0x37/0x90
    [478381.249382] [] schedule_preempt_disabled+0xe/0x10
    [478381.250465] [] __mutex_lock_slowpath+0x92/0x100
    [478381.251409] [] mutex_lock+0x1b/0x2f
    [478381.252199] [] lookup_slow+0x36/0xab
    [478381.253023] [] link_path_walk+0x7ae/0x820
    [478381.253877] [] ? try_charge+0xc1/0x700
    [478381.254690] [] path_init+0xc2/0x430
    [478381.255525] [] ? security_file_alloc+0x16/0x20
    [478381.256450] [] path_openat+0x77/0x620
    [478381.257256] [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
    [478381.258390] [] ? handle_mm_fault+0x13f3/0x1720
    [478381.259309] [] do_filp_open+0x43/0xa0
    [478381.260139] [] ? __alloc_fd+0x42/0x120
    [478381.260962] [] do_sys_open+0x13c/0x230
    [478381.261779] [] ? syscall_trace_enter_phase1+0x113/0x170
    [478381.262851] [] SyS_open+0x22/0x30
    [478381.263598] [] system_call_fastpath+0x12/0x17
    [478381.264551] ------------[ cut here ]------------
    [478381.265377] ------------[ cut here ]------------

    Signed-off-by: Jens Axboe
    Reviewed-by: Jeff Moyer

    Jens Axboe
     

08 Jun, 2016

6 commits


05 Apr, 2016

1 commit

    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it likely never will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Feb, 2016

4 commits

    Currently we don't allow the sync workload of one cgroup to preempt the sync
    workload of any other cgroup. This is because we want to achieve service
    separation between cgroups. However, in cases where the preempting cgroup is
    an ancestor of the current cgroup, there is no need for separation and
    idling introduces unnecessary overhead. This hurts, for example, the case
    where the workload is isolated within a cgroup but journalling threads are in
    the root cgroup. A simple way to demonstrate the issue is using:

    dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1

    on an ext4 filesystem on a plain SATA drive (mounted with barrier=0 to make
    the difference more visible). When all processes are in the root cgroup,
    reported throughput is 153.132 MB/sec. When the dbench process gets its own
    blkio cgroup, reported throughput drops to 26.1006 MB/sec.

    Fix the problem by making the check in cfq_should_preempt() more benevolent
    and allowing preemption by an ancestor cgroup. This improves the throughput
    reported by dbench4 to 48.9106 MB/sec.
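
    A toy sketch of the ancestor test that the relaxed check relies on
    (user-space model, not the kernel helper): walk the current queue's cgroup
    up through its parents and allow preemption if the preempting queue's
    cgroup is found along the way.

    #include <stdbool.h>

    struct toy_cgroup {
            struct toy_cgroup *parent;   /* NULL for the root */
    };

    /* true if 'anc' is 'cg' itself or one of its ancestors */
    static bool toy_is_ancestor(const struct toy_cgroup *anc,
                                const struct toy_cgroup *cg)
    {
            for (; cg; cg = cg->parent)
                    if (cg == anc)
                            return true;
            return false;
    }

    /* relaxed rule: a sync queue from an ancestor cgroup (e.g. the root,
     * where journalling threads live) may preempt the isolated workload */
    static bool toy_should_preempt(const struct toy_cgroup *preempting,
                                   const struct toy_cgroup *current_cg)
    {
            return toy_is_ancestor(preempting, current_cg);
    }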

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • The original idea with preemption of sync noidle queues (introduced in
    commit 718eee0579b8 "cfq-iosched: fairness for sync no-idle queues") was
    that we service all sync noidle queues together, we don't idle on any of
    the queues individually and we idle only if there is no sync noidle
    queue to be served. This intention also matches the original test:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
        && new_cfqq->service_tree == cfqq->service_tree)
            return true;

    However since at that time cfqq->service_tree was not set for idling
    queues, this test was unreliable and was replaced in commit e4a229196a7c
    "cfq-iosched: fix no-idle preemption logic" by:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
        cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
        new_cfqq->service_tree->count == 1)
            return true;

    That was a reliable test but was actually doing something different -
    now we preempt sync noidle queue only if the new queue is the only one
    busy in the service tree.

    These days the cfq queue is kept in the service tree even if it is idling and
    thus the original check would be safe again. But since we actually check
    that cfq queues are in the same cgroup, of the same priority class and
    workload type (sync noidle), we know that new_cfqq is fine to preempt
    cfqq. So just remove the service tree check.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Move check for preemption by rt class up. There is no functional change
    but it makes arguing about conditions simpler since we can be sure both
    cfq queues are from the same ioprio class.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
    There is no point in idling on a cfq group if the only cfq queue that is
    there has too big a thinktime.

    Signed-off-by: Jan Kara
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jan Kara
     

18 Sep, 2015

1 commit

  • cgroup_on_dfl() tests whether the cgroup's root is the default
    hierarchy; however, an individual controller is only interested in
    whether the controller is attached to the default hierarchy and never
    tests a cgroup which doesn't belong to the hierarchy that the
    controller is attached to.

    This patch replaces cgroup_on_dfl() tests in controllers with faster
    static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as
    the only user of cgroup_on_dfl() and the function is moved from the
    header file to cgroup.c.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

19 Aug, 2015

20 commits

    cgroup is trying to make the interface consistent across different
    controllers. For weight based resource control, the knob should have
    the range [1, 10000] and default to 100. This patch updates
    cfq-iosched so that the weight range conforms. The internal
    calculations have enough range and the widening of the weight range
    shouldn't cause any problem.

    * blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
    when blkcg is attached to a hierarchy.

    * cfq_cpd_init() is updated to use the new default value on the
    unified hierarchy.

    * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
    apply the default config matching the hierarchy type.

    * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
    is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
    now responsible for initializing the initial weights when blkcg is
    enabled.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is gonna switch to cgroup common weight range as defined by
    CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
    CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
    The blkcg interface grew to be the biggest of all controllers and
    unfortunately the most inconsistent too. The interface files are
    inconsistent, with a number of close duplicates. Some files have
    recursive variants while others don't. There's a distinction between
    normal and leaf weights which isn't intuitive, and there are a lot of
    stat knobs which don't make much sense outside of debugging and expose
    too many implementation details to userland.

    In the unified hierarchy, everything is always hierarchical and
    internal nodes can't have tasks, rendering moot the two structural issues
    twisting the current interface. The interface has to be updated in a
    significant way anyway, and this is a good chance to revamp it as a whole.
    This patch implements the blkcg interface for the unified hierarchy.

    * (from a previous patch) blkcg is identified by "io" instead of
    "blkio" on the unified hierarchy. Given that the whole interface is
    updated anyway, the rename shouldn't carry noticeable conversion
    overhead.

    * The original interface, which consisted of 27 files, is replaced with the
    following three files.

    blkio.stat : per-blkcg stats
    blkio.weight : per-cgroup and per-cgroup-queue weight settings
    blkio.max : per-cgroup-queue bps and iops max limits

    Documentation/cgroups/unified-hierarchy.txt updated accordingly.

    v2: blkcg_policy->dfl_cftypes wasn't removed on
    blkcg_policy_unregister() corrupting the cftypes list. Fixed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Export blkg_dev_name()

    * Drop unnecessary @cft from __cfq_set_weight().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_conf_prep() expects input to be of the following form

    MAJ:MIN NUM

    and reads the NUM part into blkg_conf_ctx->v. This is quite
    restrictive and gets in the way in implementing blkcg interface for
    the unified hierarchy. This patch updates blkg_conf_prep() so that it
    expects

    MAJ:MIN BODY_STR

    where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
    with ->body which is a char pointer pointing to the start of BODY_STR.
    Parsing of the body is moved to blkg_conf_prep()'s callers.

    To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
    non-const pointer, and to accommodate that, const is dropped from @input
    too.

    This doesn't cause any behavior changes.
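
    A user-space sketch of the looser input format (illustrative only; the real
    parsing lives in blkg_conf_prep() and its callers, and the body syntax shown
    is just an example): split off the MAJ:MIN prefix and hand the remainder to
    the caller as an opaque body string.

    #include <stdio.h>
    #include <string.h>

    /* parse "MAJ:MIN BODY_STR"; on success *body points into 'input' */
    static int toy_conf_prep(char *input, unsigned *maj, unsigned *min, char **body)
    {
            int consumed = 0;

            if (sscanf(input, "%u:%u%n", maj, min, &consumed) != 2)
                    return -1;                     /* syntactic error */
            *body = input + consumed;
            *body += strspn(*body, " \t");         /* skip whitespace before the body */
            return 0;
    }

    int main(void)
    {
            char line[] = "8:16 riops=1000 wiops=500";
            unsigned maj, min;
            char *body;

            if (toy_conf_prep(line, &maj, &min, &body) == 0)
                    printf("%u:%u -> '%s'\n", maj, min, body);  /* caller may strsep() body */
            return 0;
    }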

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is about to grow interface for the unified hierarchy. Add
    legacy to existing cftypes.

    * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
    * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
    * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
    * blk-throttle.c:throtl_files -> throtl_legacy_files

    Pure renames. No functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg currently returns -EINVAL for most errors which can be pretty
    confusing given that the failure modes are quite varied. Update the
    error returns so that

    * -EINVAL only for syntactic errors.
    * -ERANGE if the value is out of range.
    * -ENODEV if the target device can't be found.
    * -EOPNOTSUPP if the policy is not enabled on the target device.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
    enabled are guaranteed to return non-NULL and the counterpart in
    blk-throttle doesn't have these checks either. Remove the spurious
    NULL checks.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfq_stats->sectors is a blkg_stat which keeps track of the total
    number of sectors serviced; however, this can be trivially calculated
    from blkcg_gq->stat_bytes. The only thing necessary is adding up
    READs and WRITEs and then dividing by sector size.

    Remove cfqg_stats->sectors and make cfq print "sectors" and
    "sectors_recursive" from stat_bytes.

    While this is a bit more code, it removes duplicate stat allocations
    and updates and ensures that the reported stats stay in tune with each
    other.
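
    The derivation described above as a one-liner (assuming the block layer's
    512-byte sectors):

    #include <stdint.h>

    /* sectors serviced, derived from the byte counters instead of a separate stat */
    static inline uint64_t sectors_from_bytes(uint64_t read_bytes, uint64_t write_bytes)
    {
            return (read_bytes + write_bytes) >> 9;   /* divide by the 512-byte sector size */
    }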

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, both cfq-iosched and blk-throttle keep track of
    io_service_bytes and io_serviced stats. While keeping track of them
    separately may be useful during development, it doesn't make much
    sense otherwise. Also, blk-throttle was counting bios as IOs while
    cfq-iosched counted requests, which is more confusing than informative.

    This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
    removes the counterparts from cfq-iosched and blk-throttle and let
    them print from the common blkg counters. The common counters are
    incremented during bio issue in blkcg_bio_issue_check().

    The outputs are still filtered by whether the policy has
    blkg_policy_data on a given blkg, so cfq's output won't show up if it
    has never been used for a given blkg. The only times when the outputs
    would differ significantly are when policies are attached on the fly
    or elevators are switched back and forth. Those are quite exceptional
    operations and I don't think they warrant keeping separate counters.

    v3: Update blkio-controller.txt accordingly.

    v2: Account IOs during bio issues instead of request completions so
    that bio-based drivers can be handled the same way.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_[rw]stat_recursive_sum() assume that the target
    counter is located in pd (blkg_policy_data); however, some counters
    are planned to be moved to blkg (blkcg_gq).

    This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
    blkg_policy pointers instead of pd. If policy is NULL, it indexes
    into blkg. If non-NULL, into the blkg's pd of the policy.

    The existing usages are updated to maintain the current behaviors.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg_[rw]stat are used as stat counters for blkcg policies. It isn't
    per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
    it. This patch makes blkcg_[rw]stat per-cpu and drops the ad-hoc
    per-cpu wrapping in blk-throttle.

    * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
    percpu_counter. This makes syncp unnecessary as remote accesses are
    handled by percpu_counter itself.

    * blkg_[rw]stat_init() can now fail due to percpu allocation failure
    and thus are updated to return int.

    * percpu_counters need explicit freeing. blkg_[rw]stat_exit() added.

    * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
    and summing results are stored in ->aux_cnt[] instead.

    * Custom per-cpu stat implementation in blk-throttle is removed.

    This makes all blkcg stat counters per-cpu without complicating policy
    implementations.
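
    A sketch of the shape of that change (stand-in types and a user-space
    allocator so it compiles; the real code uses the kernel's percpu_counter
    API): init can now fail and therefore returns int, and an explicit exit
    pair is needed to free the per-cpu storage.

    #include <stdlib.h>

    /* stand-in for a per-cpu counter backed by dynamically allocated storage */
    struct pcpu_counter { long long *per_cpu; int nr_cpu; };

    static int pcpu_counter_init(struct pcpu_counter *c, int nr_cpu)
    {
            c->per_cpu = calloc((size_t)nr_cpu, sizeof(*c->per_cpu));
            c->nr_cpu = nr_cpu;
            return c->per_cpu ? 0 : -1;       /* the allocation can fail ... */
    }

    static void pcpu_counter_destroy(struct pcpu_counter *c)
    {
            free(c->per_cpu);
    }

    /* blkg_stat-like wrapper: init may fail, and an explicit exit is required */
    struct toy_blkg_stat { struct pcpu_counter cpu_cnt; long long aux_cnt; };

    static int toy_blkg_stat_init(struct toy_blkg_stat *s, int nr_cpu)
    {
            s->aux_cnt = 0;
            return pcpu_counter_init(&s->cpu_cnt, nr_cpu);   /* ... hence the int return */
    }

    static void toy_blkg_stat_exit(struct toy_blkg_stat *s)
    {
            pcpu_counter_destroy(&s->cpu_cnt);
    }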

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
    cgroup stats are local to each cgroup and don't propagate to
    ancestors by default. When recursive stats are necessary, the sum is
    calculated over all the descendants. This initially was for backward
    compatibility, to support both group-local and recursive stats, but this
    mode of operation makes general sense as stat updates are much hotter
    than reporting those stats.

    This however ends up losing recursive stats when a child is removed.
    To work around this, cfq-iosched adds its stats to its parent
    cfq_group->dead_stats which is summed up together when calculating
    recursive stats.

    It's planned that the core stats will be moved to blkcg_gq, so we want
    to move the mechanism for keeping track of the stats of dead children
    from cfq to blkcg core. This patch adds blkg_[rw]stat->aux_cnt which
    are atomic64_t's keeping track of auxiliary counts which are excluded
    when reading local counts but included for recursive.

    blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
    are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
    a dead cgroup to the aux counts of parent->stats instead of separate
    ->dead_stats.

    This will also help making blkg_[rw]stats per-cpu.
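
    A toy model of the aux-count scheme (user-space sketch, not the kernel
    structures): local reads exclude aux_cnt, recursive reads include it, and
    a cgroup being removed folds its totals into its parent's aux_cnt so they
    are not lost.

    #include <stdint.h>

    struct toy_stat {
            uint64_t cnt;       /* the group's own count */
            uint64_t aux_cnt;   /* counts inherited from removed children */
    };

    /* group-local value: aux is excluded */
    static uint64_t toy_read_local(const struct toy_stat *s)
    {
            return s->cnt;
    }

    /* this group's contribution to a recursive sum: aux is included */
    static uint64_t toy_read_recursive_term(const struct toy_stat *s)
    {
            return s->cnt + s->aux_cnt;
    }

    /* on child removal, forward its totals to the parent's aux count */
    static void toy_offline(struct toy_stat *child, struct toy_stat *parent)
    {
            parent->aux_cnt += child->cnt + child->aux_cnt;
    }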

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg (blkcg_gq) currently is created by blkcg policies invoking
    blkg_lookup_create() which ends up repeating about the same code in
    different policies. Theoretically, this can avoid the overhead of
    looking up and/or creating blkg's if blkcg is enabled but no policy is in
    use; however, the cost of blkg lookup / creation is very low
    especially if only the root blkcg is in use which is highly likely if
    no blkcg policy is in active use - it boils down to a single very
    predictable conditional and surrounding RCU protection.

    This patch consolidates blkg creation to a new function
    blkcg_bio_issue_check() which is called during bio issue from
    generic_make_request_checks(). blkcg_bio_issue_check() is now the
    only function which tries to create missing blkg's. The subsequent
    policy and request_list operations just perform blkg_lookup() and if
    missing falls back to the root.

    * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
    instead of blkg_lookup_create().

    * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
    read locked and blkg already looked up. Both throtl_lookup_tg() and
    throtl_lookup_create_tg() are dropped.

    * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
    cfq_lookup_cfqg() which uses blkg_lookup().

    This consolidates blkg handling and avoids unnecessary blkg creation
    retries under memory pressure. In addition, this provides a common
    bio entry point into blkcg where things like common accounting can be
    performed.

    v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
    !CONFIG_BLK_DEV_THROTTLING.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Each active policy has a cpd (blkcg_policy_data) on each blkcg. The
    cpd's were allocated by blkcg core and each policy could request to
    allocate extra space at the end by setting blkcg_policy->cpd_size
    larger than the size of cpd.

    This is a bit unusual but blkg (blkcg_gq) policy data used to be
    handled this way too so it made sense to be consistent; however, blkg
    policy data switched to alloc/free callbacks.

    This patch makes similar changes to cpd handling.
    blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As
    cpd allocation is now done from policy side, it can simply allocate a
    larger area which embeds cpd at the beginning.

    As ->cpd_alloc_fn() may be able to perform all necessary
    initializations, this patch makes ->cpd_init_fn() optional.
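
    A sketch of the embed-at-the-beginning pattern described above (stand-in
    types; the real policies recover their struct with container_of()): the
    policy allocates its own larger struct, hands the core a pointer to the
    embedded cpd, and converts back from that pointer when freeing.

    #include <stddef.h>
    #include <stdlib.h>

    struct cpd_stub { int plid; };               /* stand-in for blkcg_policy_data */

    struct toy_policy_cpd {
            struct cpd_stub cpd;                 /* embedded at the beginning */
            unsigned int weight;                 /* policy-private fields follow */
    };

    static struct cpd_stub *toy_cpd_alloc(void)
    {
            struct toy_policy_cpd *p = calloc(1, sizeof(*p));

            return p ? &p->cpd : NULL;           /* the core only sees the cpd part */
    }

    static void toy_cpd_free(struct cpd_stub *cpd)
    {
            /* recover the enclosing struct, container_of-style */
            struct toy_policy_cpd *p = (struct toy_policy_cpd *)
                    ((char *)cpd - offsetof(struct toy_policy_cpd, cpd));

            free(p);
    }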

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
    for blkcg_policy_data.

    * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
    blkcg. This makes it consistent with blkg_policy_data methods and
    to-be-added cpd alloc/free methods.

    * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
    cpd_init_fn() can determine the associated blkcg from
    blkcg_policy_data.

    v2: blkcg_policy_data->blkcg initializations were missing. Added.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
    (blkg_policy_data) while the older ones use blkg (blkcg_gq). As using
    blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
    can always be mapped to blkg and given that these are policy-specific
    methods, it makes sense to converge on pd.

    This patch makes all methods deal with pd instead of blkg. Most
    conversions are trivial. In blk-cgroup.c, a couple method invocation
    sites now test whether pd exists instead of policy state for
    consistency. This shouldn't cause any behavioral differences.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the recent addition of alloc and free methods, things became
    messier. This patch reorganizes them according to the following.

    * ->pd_alloc_fn()

    Responsible for allocation and static initializations - the ones
    which can be done independent of where the pd might be attached.

    * ->pd_init_fn()

    Initializations which require the knowledge of where the pd is
    attached.

    * ->pd_free_fn()

    The counter part of pd_alloc_fn(). Static de-init and freeing.

    This leaves ->pd_exit_fn() without any users. Removed.

    While at it, collapse the one-liner function throtl_pd_exit(), which
    has only one user, into its user.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • A blkg (blkcg_gq) represents the relationship between a cgroup and
    request_queue. Each active policy has a pd (blkg_policy_data) on each
    blkg. The pd's were allocated by blkcg core and each policy could
    request to allocate extra space at the end by setting
    blkcg_policy->pd_size larger than the size of pd.

    This is a bit unusual but was done this way mostly to simplify error
    handling and all the existing use cases could be handled this way;
    however, this is becoming too restrictive now that percpu memory can
    be allocated without blocking.

    This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
    and pd_free_fn() - which are used to allocate and release pd for a
    given policy. As pd allocation is now done from policy side, it can
    simply allocate a larger area which embeds pd at the beginning. This
    change makes ->pd_size pointless. Removed.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Up until now, all async IOs were queued to async queues which are
    shared across the whole request_queue, which means that blkcg resource
    control is completely void on async IOs including all writeback IOs.
    It was done this way because writeback wasn't cgroup aware and
    there was no way of telling which writeback IO belonged to which
    cgroup; however, writeback recently became cgroup aware and writeback
    bio's are sent down properly tagged with the blkcg's to charge them
    against.

    This patch makes async cfq_queues per-cfq_group instead of
    per-cfq_data so that each async IO is charged to the blkcg that it was
    tagged for instead of unconditionally attributing it to root.

    * cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group
    and alloc / destroy paths are updated accordingly.

    * cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async
    queues.

    * check_blkcg_changed() now also invalidates async queues as they no
    longer stay the same across cgroups.

    After this patch, cfq's proportional IO control through blkio.weight
    works correctly when cgroup writeback is in use.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jeff Moyer
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo