Commit 72f924f62a6eb375c7c237ecc911f95be0531d1a

Authored by Vivek Goyal
Committed by Jens Axboe
1 parent c04645e592

blkio: Documentation

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Showing 1 changed file with 135 additions and 0 deletions

Documentation/cgroups/blkio-controller.txt
  1 + Block IO Controller
  2 + ===================
  3 +Overview
  4 +========
  5 +cgroup subsys "blkio" implements the block IO controller. There seems to be
  6 +a need for various kinds of IO control policies (like proportional BW, max BW)
  7 +both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
  8 +The plan is to use the same cgroup based management interface for the blkio
  9 +controller and, based on user options, switch IO policies in the background.
  10 +
  11 +In the first phase, this patchset implements a proportional weight, time
  12 +based division of disk time policy. It is implemented in CFQ. Hence this
  13 +policy takes effect only on leaf nodes when CFQ is being used.
  14 +
  15 +HOWTO
  16 +=====
  17 +You can do a very simple test by running two dd threads in two different
  18 +cgroups. Here is what you can do.
  19 +
  20 +- Enable group scheduling in CFQ
  21 + CONFIG_CFQ_GROUP_IOSCHED=y
  22 +
  24 +- Compile and boot into the kernel and mount the IO controller (blkio).
  24 +
  25 + mount -t cgroup -o blkio none /cgroup
  26 +
  27 +- Create two cgroups
  28 + mkdir -p /cgroup/test1/ /cgroup/test2
  29 +
  30 +- Set weights of group test1 and test2
  31 + echo 1000 > /cgroup/test1/blkio.weight
  32 + echo 500 > /cgroup/test2/blkio.weight
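
 A quick way to confirm the new weights took effect is to read the files back:

	cat /cgroup/test1/blkio.weight
	cat /cgroup/test2/blkio.weight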
  33 +
  34 +- Create two same-size files (say 512MB each) on the same disk (zerofile1,
  35 + zerofile2) and launch two dd threads in different cgroups to read those files.
  36 +
  37 + sync
  38 + echo 3 > /proc/sys/vm/drop_caches
  39 +
  40 + dd if=/mnt/sdb/zerofile1 of=/dev/null &
  41 + echo $! > /cgroup/test1/tasks
  42 + cat /cgroup/test1/tasks
  43 +
  44 + dd if=/mnt/sdb/zerofile2 of=/dev/null &
  45 + echo $! > /cgroup/test2/tasks
  46 + cat /cgroup/test2/tasks
  47 +
  48 +- At a macro level, the first dd should finish first. To get more precise
  49 + data, keep looking (with the help of a script; see the sample loop below) at
  50 + the blkio.time and blkio.sectors files of both the test1 and test2 groups.
  51 + These tell how much disk time (in milliseconds) each group got and how many
  52 + sectors each group dispatched to the disk. We provide fairness in terms of
  53 + disk time, so ideally blkio.time should be in proportion to the weights.
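
 A minimal sketch of such a monitoring loop (shell; it assumes the /cgroup
 mount point and group names used in the steps above):

	# sample the per-device disk time and sector counts once per second
	while true; do
		for g in test1 test2; do
			echo "$g blkio.time:";    cat /cgroup/$g/blkio.time
			echo "$g blkio.sectors:"; cat /cgroup/$g/blkio.sectors
		done
		echo ----
		sleep 1
	done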
  54 +
  55 +Various user visible config options
  56 +===================================
  57 +CONFIG_CFQ_GROUP_IOSCHED
  58 + - Enables group scheduling in CFQ. Currently only 1 level of group
  59 + creation is allowed.
  60 +
  61 +CONFIG_DEBUG_CFQ_IOSCHED
  62 + - Enables some debugging messages in blktrace. Also creates an
  63 + extra cgroup file, blkio.dequeue.
  64 +
  65 +Config options selected automatically
  66 +=====================================
  67 +These config options are not user visible and are selected/deselected
  68 +automatically based on IO scheduler configuration.
  69 +
  70 +CONFIG_BLK_CGROUP
  71 + - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
  72 +
  73 +CONFIG_DEBUG_BLK_CGROUP
  74 + - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
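
As an illustration, a .config fragment with group scheduling and its debugging
aid enabled would contain something like the lines below (CONFIG_IOSCHED_CFQ,
the usual CFQ scheduler symbol, is assumed in addition to the options described
above):

	CONFIG_IOSCHED_CFQ=y
	CONFIG_CFQ_GROUP_IOSCHED=y
	CONFIG_DEBUG_CFQ_IOSCHED=y
	# selected automatically by the two options above
	CONFIG_BLK_CGROUP=y
	CONFIG_DEBUG_BLK_CGROUP=y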
  75 +
  76 +Details of cgroup files
  77 +=======================
  78 +- blkio.weight
  79 + - Specifies per cgroup weight.
  80 +
  81 + Currently allowed range of weights is from 100 to 1000.
  82 +
  83 +- blkio.time
  84 + - disk time allocated to the cgroup per device. First two fields
  85 + specify the major and minor number of the device and the third
  86 + field specifies the disk time allocated to the group, in
  87 + milliseconds.
  88 +
  89 +- blkio.sectors
  90 + - number of sectors transferred to/from disk by the group. First
  91 + two fields specify the major and minor number of the device and
  92 + third field specifies the number of sectors transferred by the
  93 + group to/from the device.
  94 +
  95 +- blkio.dequeue
  96 + - Debugging aid, only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
  97 + gives statistics about how many times a group was dequeued from
  98 + the service tree of the device. First two fields specify the major
  99 + and minor number of the device and the third field specifies the
  100 + number of times the group was dequeued from that particular device.
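
As a hypothetical example of the format described above (the 8 16 major/minor
pair and the values are made up purely for illustration):

	# cat /cgroup/test1/blkio.time
	8 16 2778
	# cat /cgroup/test1/blkio.sectors
	8 16 131072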
  101 +
  102 +CFQ sysfs tunable
  103 +=================
  104 +/sys/block/<disk>/queue/iosched/group_isolation
  105 +
  106 +If group_isolation=1, it provides stronger isolation between groups at the
  107 +expense of throughput. By default group_isolation is 0. In general this means
  108 +that with group_isolation=0, expect fairness for sequential workloads only;
  109 +set group_isolation=1 to see fairness for random IO workloads also.
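
For example (the device name sdb is only an assumption; substitute your disk):

	# check the current setting; the default is 0
	cat /sys/block/sdb/queue/iosched/group_isolation

	# trade some throughput for fairness on random IO workloads
	echo 1 > /sys/block/sdb/queue/iosched/group_isolation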
  110 +
  111 +Generally CFQ will put a random seeky workload in the sync-noidle category.
  112 +CFQ will disable idling on these queues and instead does collective idling
  113 +on a group of such queues. Generally these are slow moving queues and if
  114 +there is a sync-noidle service tree in each group, each group gets exclusive
  115 +access to the disk for a certain period. That will bring the throughput down
  116 +if a group does not have enough IO to drive deeper queue depths and utilize
  117 +disk capacity to the fullest in the slice allocated to it. But the flip side
  118 +is that even a random reader should get better latencies and overall
  119 +throughput if there are lots of sequential readers/sync-idle workloads
  120 +running in the system.
  121 +
  122 +If group_isolation=0, then CFQ automatically moves all the random seeky
  123 +queues to the root group. That means there will be no service differentiation
  124 +for that kind of workload. This leads to better throughput as we do collective
  125 +idling on the root sync-noidle tree.
  126 +
  127 +By default one should run with group_isolation=0. If that is not sufficient
  128 +and one wants stronger isolation between groups, then set group_isolation=1,
  129 +but this will come at the cost of reduced throughput.
  130 +
  131 +What works
  132 +==========
  133 +- Currently only sync IO queues are supported. All the buffered writes are
  134 + still system wide and not per group. Hence we will not see service
  135 + differentiation for buffered writes between groups.