Commit 72f924f62a6eb375c7c237ecc911f95be0531d1a

Authored by Vivek Goyal
Committed by Jens Axboe
1 parent c04645e592

blkio: Documentation

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Showing 1 changed file with 135 additions and 0 deletions

Documentation/cgroups/blkio-controller.txt
  1 + Block IO Controller
  2 + ===================
  3 +Overview
  4 +========
  5 +cgroup subsys "blkio" implements the block IO controller. There seems to be
  6 +a need for various kinds of IO control policies (like proportional BW, max BW)
  7 +both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
  8 +The plan is to use the same cgroup based management interface for the blkio
  9 +controller and, based on user options, switch IO policies in the background.
  10 +
  11 +In the first phase, this patchset implements a proportional weight, time
  12 +based division of disk time policy. It is implemented in CFQ. Hence this
  13 +policy takes effect only on leaf nodes when CFQ is being used.
  14 +
  15 +HOWTO
  16 +=====
  17 +You can do a very simple test by running two dd threads in two different
  18 +cgroups. Here is what you can do.
  19 +
  20 +- Enable group scheduling in CFQ
  21 + CONFIG_CFQ_GROUP_IOSCHED=y
  22 +
  24 +- Compile and boot into the kernel and mount the IO controller (blkio).
  24 +
  25 + mount -t cgroup -o blkio none /cgroup
  26 +
  27 +- Create two cgroups
  28 + mkdir -p /cgroup/test1/ /cgroup/test2
  29 +
  30 +- Set weights of group test1 and test2
  31 + echo 1000 > /cgroup/test1/blkio.weight
  32 + echo 500 > /cgroup/test2/blkio.weight
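
 A quick way to confirm the new weights took effect is to read the files back:

	cat /cgroup/test1/blkio.weight
	cat /cgroup/test2/blkio.weight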
  33 +
  34 +- Create two same-size files (say 512MB each) on the same disk (zerofile1,
  35 + zerofile2) and launch two dd threads in different cgroups to read those files.
  36 +
  37 + sync
  38 + echo 3 > /proc/sys/vm/drop_caches
  39 +
  40 + dd if=/mnt/sdb/zerofile1 of=/dev/null &
  41 + echo $! > /cgroup/test1/tasks
  42 + cat /cgroup/test1/tasks
  43 +
  44 + dd if=/mnt/sdb/zerofile2 of=/dev/null &
  45 + echo $! > /cgroup/test2/tasks
  46 + cat /cgroup/test2/tasks
  47 +
  48 +- At a macro level, the first dd should finish first. To get more precise
  49 + data, keep looking (with the help of a script; see the sample loop below) at
  50 + the blkio.time and blkio.sectors files of both the test1 and test2 groups.
  51 + These tell how much disk time (in milliseconds) each group got and how many
  52 + sectors each group dispatched to the disk. We provide fairness in terms of
  53 + disk time, so ideally blkio.time should be in proportion to the weights.
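
 A minimal sketch of such a monitoring loop (shell; it assumes the /cgroup
 mount point and group names used in the steps above):

	# sample the per-device disk time and sector counts once per second
	while true; do
		for g in test1 test2; do
			echo "$g blkio.time:";    cat /cgroup/$g/blkio.time
			echo "$g blkio.sectors:"; cat /cgroup/$g/blkio.sectors
		done
		echo ----
		sleep 1
	done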
  54 +
  55 +Various user visible config options
  56 +===================================
  57 +CONFIG_CFQ_GROUP_IOSCHED
  58 + - Enables group scheduling in CFQ. Currently only 1 level of group
  59 + creation is allowed.
  60 +
  61 +CONFIG_DEBUG_CFQ_IOSCHED
  62 + - Enables some debugging messages in blktrace. Also creates an
  63 + extra cgroup file, blkio.dequeue.
  64 +
  65 +Config options selected automatically
  66 +=====================================
  67 +These config options are not user visible and are selected/deselected
  68 +automatically based on IO scheduler configuration.
  69 +
  70 +CONFIG_BLK_CGROUP
  71 + - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
  72 +
  73 +CONFIG_DEBUG_BLK_CGROUP
  74 + - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
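
As an illustration, a .config fragment with group scheduling and its debugging
aid enabled would contain something like the lines below (CONFIG_IOSCHED_CFQ,
the usual CFQ scheduler symbol, is assumed in addition to the options described
above):

	CONFIG_IOSCHED_CFQ=y
	CONFIG_CFQ_GROUP_IOSCHED=y
	CONFIG_DEBUG_CFQ_IOSCHED=y
	# selected automatically by the two options above
	CONFIG_BLK_CGROUP=y
	CONFIG_DEBUG_BLK_CGROUP=y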
  75 +
  76 +Details of cgroup files
  77 +=======================
  78 +- blkio.weight
  79 + - Specifies per cgroup weight.
  80 +
  81 + Currently allowed range of weights is from 100 to 1000.
  82 +
  83 +- blkio.time
  84 + - disk time allocated to the cgroup per device. First two fields
  85 + specify the major and minor number of the device and the third
  86 + field specifies the disk time allocated to the group, in
  87 + milliseconds.
  88 +
  89 +- blkio.sectors
  90 + - number of sectors transferred to/from disk by the group. First
  91 + two fields specify the major and minor number of the device and
  92 + third field specifies the number of sectors transferred by the
  93 + group to/from the device.
  94 +
  95 +- blkio.dequeue
  96 + - Debugging aid, only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
  97 + gives statistics about how many times a group was dequeued from
  98 + the service tree of the device. First two fields specify the major
  99 + and minor number of the device and the third field specifies the
  100 + number of times the group was dequeued from that particular device.
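
As a hypothetical example of the format described above (the 8 16 major/minor
pair and the values are made up purely for illustration):

	# cat /cgroup/test1/blkio.time
	8 16 2778
	# cat /cgroup/test1/blkio.sectors
	8 16 131072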
  101 +
  102 +CFQ sysfs tunable
  103 +=================
  104 +/sys/block/<disk>/queue/iosched/group_isolation
  105 +
  106 +If group_isolation=1, it provides stronger isolation between groups at the
  107 +expense of throughput. By default group_isolation is 0. In general this means
  108 +that with group_isolation=0, expect fairness for sequential workloads only;
  109 +set group_isolation=1 to see fairness for random IO workloads also.
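
For example (the device name sdb is only an assumption; substitute your disk):

	# check the current setting; the default is 0
	cat /sys/block/sdb/queue/iosched/group_isolation

	# trade some throughput for fairness on random IO workloads
	echo 1 > /sys/block/sdb/queue/iosched/group_isolation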
  110 +
  111 +Generally CFQ will put a random seeky workload in the sync-noidle category.
  112 +CFQ will disable idling on these queues and instead does collective idling
  113 +on a group of such queues. Generally these are slow moving queues and if
  114 +there is a sync-noidle service tree in each group, each group gets exclusive
  115 +access to the disk for a certain period. That will bring the throughput down
  116 +if a group does not have enough IO to drive deeper queue depths and utilize
  117 +disk capacity to the fullest in the slice allocated to it. But the flip side
  118 +is that even a random reader should get better latencies and overall
  119 +throughput if there are lots of sequential readers/sync-idle workloads
  120 +running in the system.
  121 +
  122 +If group_isolation=0, then CFQ automatically moves all the random seeky
  123 +queues to the root group. That means there will be no service differentiation
  124 +for that kind of workload. This leads to better throughput as we do collective
  125 +idling on the root sync-noidle tree.
  126 +
  127 +By default one should run with group_isolation=0. If that is not sufficient
  128 +and one wants stronger isolation between groups, then set group_isolation=1,
  129 +but this will come at the cost of reduced throughput.
  130 +
  131 +What works
  132 +==========
  133 +- Currently only sync IO queues are supported. All the buffered writes are
  134 + still system wide and not per group. Hence we will not see service
  135 + differentiation for buffered writes between groups.