05 Apr, 2017

1 commit

  • Writeback throttling does not play well with CFQ since that also tries
    to throttle async writes. As a result async writeback can get starved in
    presence of readers. As an example, take a benchmark simulating a
    PostgreSQL database running over a standard rotating SATA drive. There
    are 16 processes doing random reads from a huge file (2*machine memory),
    1 process doing random writes to the huge file and calling fsync once
    per 50000 writes and 1 process doing sequential 8k writes to a
    relatively small file wrapping around at the end of the file and calling
    fsync every 5 writes. Under this load read latency easily exceeds the
    target latency of 75 ms (just because there are so many reads happening
    against a relatively slow disk) and thus writeback is throttled to a
    point where only 1 write request is allowed at a time. Blktrace data
    then looks like:

    8,0 1 0 8.347751764 0 m N cfq workload slice:40000000
    8,0 1 0 8.347755256 0 m N cfq293A / set_active wl_class: 0 wl_type:0
    8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
    8,0 1 3814 8.347763916 5839 UT N [kworker/u9:2] 1
    8,0 0 0 8.347777605 0 m N cfq293A / Not idling. st->count:1
    8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
    8,0 3 1596 8.354364057 0 C R 156109528 + 8 (6906954) [0]
    8,0 3 0 8.354383193 0 m N cfq6196SN / complete rqnoidle 0
    8,0 3 0 8.354386476 0 m N cfq schedule dispatch
    8,0 3 0 8.354399397 0 m N cfq293A / Not idling. st->count:1
    8,0 3 0 8.354404705 0 m N cfq293A / dispatch_insert
    8,0 3 0 8.354409454 0 m N cfq293A / dispatched a request
    8,0 3 0 8.354412527 0 m N cfq293A / activate rq, drv=1
    8,0 3 1597 8.354414692 0 D W 145961400 + 24 (6718452) [swapper/0]
    8,0 3 0 8.354484184 0 m N cfq293A / Not idling. st->count:1
    8,0 3 0 8.354487536 0 m N cfq293A / slice expired t=0
    8,0 3 0 8.354498013 0 m N / served: vt=5888102466265088 min_vt=5888074869387264
    8,0 3 0 8.354502692 0 m N cfq293A / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
    8,0 3 0 8.354505695 0 m N cfq293A / del_from_rr
    ...
    8,0 0 1810 8.354728768 0 C W 145961400 + 24 (314076) [0]
    8,0 0 0 8.354746927 0 m N cfq293A / complete rqnoidle 0
    ...
    8,0 1 3829 8.389886102 5839 G W 145962968 + 24 [kworker/u9:2]
    8,0 1 3830 8.389888127 5839 P N [kworker/u9:2]
    8,0 1 3831 8.389908102 5839 A W 145978336 + 24 count:1
    8,0 0 0 9.455395969 0 m N cfq293A / slice expired t=0
    8,0 0 0 9.455404210 0 m N / served: vt=5888958194597888 min_vt=5888941810597888
    8,0 0 0 9.455410077 0 m N cfq293A / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
    8,0 0 0 9.455416851 0 m N cfq293A / del_from_rr
    ...
    8,0 0 2045 9.455648515 0 C W 145962968 + 24 (332305) [0]
    8,0 0 0 9.455668350 0 m N cfq293A / complete rqnoidle 0
    ...
    8,0 1 4371 9.455710115 5839 G W 145978336 + 24 [kworker/u9:2]
    8,0 1 4372 9.455712350 5839 P N [kworker/u9:2]
    8,0 1 4373 9.455730159 5839 A W 145986616 + 24
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Mar, 2017

1 commit


22 Feb, 2017

1 commit

  • Pull block layer updates from Jens Axboe:

    - blk-mq scheduling framework from me and Omar, with a port of the
    deadline scheduler for this framework. A port of BFQ from Paolo is in
    the works, and should be ready for 4.12.

    - Various fixups and improvements to the above scheduling framework
    from Omar, Paolo, Bart, me, others.

    - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
    This allows us to export more information that helps debug hangs or
    performance issues, without cluttering or abusing the sysfs API.

    - Fixes for the sbitmap code, the scalable bitmap code that was
    migrated from blk-mq, from Omar.

    - Removal of the BLOCK_PC support in struct request, and refactoring of
    carrying SCSI payloads in the block layer. This cleans up the code
    nicely, and enables us to kill the SCSI specific parts of struct
    request, shrinking it down nicely. From Christoph mainly, with help
    from Hannes.

    - Support for ranged discard requests and discard merging, also from
    Christoph.

    - Support for OPAL in the block layer, and for NVMe as well. Mainly
    from Scott Bauer, with fixes/updates from various other folks.

    - Error code fixup for gdrom from Christophe.

    - cciss pci irq allocation cleanup from Christoph.

    - Making the cdrom device operations read only, from Kees Cook.

    - Fixes for duplicate bdi registrations and bdi/queue life time
    problems from Jan and Dan.

    - Set of fixes and updates for lightnvm, from Matias and Javier.

    - A few fixes for nbd from Josef, using idr to name devices and a
    workqueue deadlock fix on receive. Also marks Josef as the current
    maintainer of nbd.

    - Fix from Josef, overwriting queue settings when the number of
    hardware queues is updated for a blk-mq device.

    - NVMe fix from Keith, ensuring that we don't repeatedly mark an IO
    aborted, if we didn't end up aborting it.

    - SG gap merging fix from Ming Lei for block.

    - Loop fix also from Ming, fixing a race and crash between setting loop
    status and IO.

    - Two block race fixes from Tahsin, fixing request list iteration and
    fixing a race between device registration and udev device add
    notifications.

    - Double free fix in the cgroup writeback code, from Tejun.

    - Another double free fix in blkcg, from Hou Tao.

    - Partition overflow fix for EFI from Alden Tondettar.

    * tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
    nvme: Check for Security send/recv support before issuing commands.
    block/sed-opal: allocate struct opal_dev dynamically
    block/sed-opal: tone down not supported warnings
    block: don't defer flushes on blk-mq + scheduling
    blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
    blk-mq: don't special case flush inserts for blk-mq-sched
    blk-mq-sched: don't add flushes to the head of requeue queue
    blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
    block: do not allow updates through sysfs until registration completes
    lightnvm: set default lun range when no luns are specified
    lightnvm: fix off-by-one error on target initialization
    Maintainers: Modify SED list from nvme to block
    Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
    uapi: sed-opal fix IOW for activate lsp to use correct struct
    cdrom: Make device operations read-only
    elevator: fix loading wrong elevator type for blk-mq devices
    cciss: switch to pci_irq_alloc_vectors
    block/loop: fix race between I/O and set_status
    blk-mq-sched: don't hold queue_lock when calling exit_icq
    block: set make_request_fn manually in blk_mq_update_nr_hw_queues
    ...

    Linus Torvalds
     

16 Feb, 2017

1 commit


09 Feb, 2017

1 commit


23 Jan, 2017

2 commits

  • The script "checkpatch.pl" pointed out an issue like the following.

    ERROR: do not use assignment in if condition

    Thus fix the affected place in the source code.
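
    A minimal illustration of the pattern checkpatch complains about (the
    call below is hypothetical, not the hunk touched by this commit):

    /* flagged: assignment buried inside the if condition */
    if ((ret = setup_queue(q)) != 0)
        return ret;

    /* preferred: separate the assignment from the test */
    ret = setup_queue(q);
    if (ret)
        return ret;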

    Signed-off-by: Markus Elfring
    Signed-off-by: Jens Axboe

    Markus Elfring
     
  • KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
    uninitialized memory in cfq_init_cfqq():

    ==================================================================
    BUG: KMSAN: use of unitialized memory
    ...
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0x157/0x1d0 lib/dump_stack.c:51
    [] kmsan_report+0x205/0x360 ??:?
    [] __msan_warning+0x5b/0xb0 ??:?
    [< inline >] cfq_init_cfqq block/cfq-iosched.c:3754
    [] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857
    ...
    origin:
    [] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67
    [] kmsan_internal_poison_shadow+0xab/0x150 ??:?
    [] kmsan_poison_slab+0xbb/0x120 ??:?
    [< inline >] allocate_slab mm/slub.c:1627
    [] new_slab+0x3af/0x4b0 mm/slub.c:1641
    [< inline >] new_slab_objects mm/slub.c:2407
    [] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564
    [< inline >] __slab_alloc mm/slub.c:2606
    [< inline >] slab_alloc_node mm/slub.c:2669
    [] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746
    [] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850
    ...
    ==================================================================
    (the line numbers are relative to 4.8-rc6, but the bug persists
    upstream)

    The uninitialized struct cfq_queue is created by kmem_cache_alloc_node()
    and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class
    before it's initialized.
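
    Reduced to a sketch, the pattern is (the surrounding code is
    illustrative, not the exact cfq hunk):

    cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask, q->node);
    /* slab memory is not zeroed: every field holds garbage until stored to */
    ...
    if (cfqq->ioprio_class != ioprio_class)  /* read happens before any store */
        ...

    so the field needs to be written (or the allocation zeroed, e.g. with
    __GFP_ZERO) before that first read.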

    Signed-off-by: Alexander Potapenko
    Signed-off-by: Jens Axboe

    Alexander Potapenko
     

18 Jan, 2017

1 commit


29 Nov, 2016

1 commit


22 Nov, 2016

1 commit

  • blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
    when that fails falls back to operations which aren't specific to the
    cgroup. Occasional failures are expected under pressure and falling
    back to non-cgroup operation is the right thing to do.

    Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
    these expected failures end up creating a lot of noise. Add
    __GFP_NOWARN.
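
    The change amounts to adjustments of the following kind (the allocation
    site shown is illustrative):

    /* failure here is expected under memory pressure and is handled by
     * falling back to non-cgroup operation, so don't warn about it */
    blkg = kzalloc_node(sizeof(*blkg), GFP_NOWAIT | __GFP_NOWARN, q->node);
    if (!blkg)
        return NULL;   /* caller falls back to the root blkg */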

    Signed-off-by: Tejun Heo
    Reported-by: Marc MERLIN
    Reported-by: Vlastimil Babka
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Nov, 2016

2 commits

  • Enable throttling of buffered writeback to make it a lot
    smoother and have far less impact on other system activity.
    Background writeback should be, by definition, background
    activity. The fact that we flush huge bundles of it at the time
    means that it potentially has heavy impacts on foreground workloads,
    which isn't ideal. We can't easily limit the sizes of writes that
    we do, since that would impact file system layout in the presence
    of delayed allocation. So just throttle back buffered writeback,
    unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration in the
    CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply decrement the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.
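
    A rough sketch of the scaling step described above (names and the
    depth/window formulas are illustrative, not the actual blk-wbt code):

    struct wb_sketch {
        int scale_step;              /* > 0: throttling, < 0: write-mostly */
        unsigned int queue_depth;    /* write depth allowed at step 0 */
        unsigned int wb_depth;       /* currently allowed writeback depth */
        unsigned long win_usec;      /* current monitoring window length */
        unsigned long base_win_usec; /* window length at step <= 0 */
        unsigned long target_usec;   /* latency target ('wb_lat_usec') */
    };

    static void wb_window_end(struct wb_sketch *wb, unsigned long min_lat_usec)
    {
        if (min_lat_usec > wb->target_usec)
            wb->scale_step++;        /* missed the target: throttle harder */
        else
            wb->scale_step--;        /* good window: back off one step */

        /* keep the toy shifts in a sane range */
        if (wb->scale_step > 8)
            wb->scale_step = 8;
        else if (wb->scale_step < -8)
            wb->scale_step = -8;

        if (wb->scale_step > 0) {
            /* positive steps shrink both the allowed depth and the window */
            wb->wb_depth = wb->queue_depth >> wb->scale_step;
            if (!wb->wb_depth)
                wb->wb_depth = 1;
            wb->win_usec = wb->base_win_usec >> wb->scale_step;
        } else {
            /* negative steps allow more depth; the window size stays put */
            wb->wb_depth = wb->queue_depth << -wb->scale_step;
            wb->win_usec = wb->base_win_usec;
        }
    }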

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • cfq_cpd_alloc(), which is the cpd_alloc_fn implementation for cfq, was
    incorrectly hard-coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Nov, 2016

2 commits


28 Oct, 2016

1 commit

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
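
    Conceptually, the combined value is carved up as sketched below (the bit
    width here is only illustrative; see include/linux/blk_types.h for the
    real layout):

    #define OPF_OP_BITS  8
    #define OPF_OP_MASK  ((1u << OPF_OP_BITS) - 1)

    /* the low bits hold the REQ_OP_* value... */
    static inline unsigned int opf_op(unsigned int opf)
    {
        return opf & OPF_OP_MASK;
    }

    /* ...and everything above them holds the flag bits */
    static inline unsigned int opf_flags(unsigned int opf)
    {
        return opf & ~OPF_OP_MASK;
    }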

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2016

1 commit

  • While debugging timeouts happening in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging everywhere from 2 seconds -
    the first ones that are enough to time out my application - to more than 30
    seconds.

    The problem seems to happen because XFS may block on pending metadata updates
    under certain circumstances, and that's confirmed with the following backtrace
    taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
    high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
    the open() times I have seen in that run.

    Still from the blktrace output, we can, after searching a bit, identify the
    request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself tests whether it may dispatch (cfq_may_dispatch()), it can
    bail if it sees SYNC requests in flight. An incoming request from another queue
    can also preempt it in such a situation before we have the chance to execute
    anything (as seen in the trace above).

    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
    the queue for subsequent requests (unless they are themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jul, 2016

1 commit

  • Before merging a bio into an existing request, io scheduler is called to
    get its approval first. However, the requests that come from a plug
    flush may get merged by block layer without consulting with io
    scheduler.

    In the case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, the high weight cgroup
    now depends on the low weight cgroup to get scheduled. If the high weight cgroup
    needs that io request to complete before submitting more requests, then it
    will also lose its timeslice.

    The following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
      mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
      if [ -d $i ]; then
        rmdir $i
      fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
      echo ---- $i ----
      cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

28 Jun, 2016

3 commits

  • Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds)
    could result in charging just 1 ns to a cgroup submitting IO instead of the
    1 jiffie we always charged before. It is arguable what the right amount to
    charge is, but for now let's retain the old behavior of always charging at
    least one jiffie.

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds)
    broke the condition for detecting starved sync IO in
    cfq_completed_request() because rq->start_time remained in jiffies but
    we compared it with nanosecond values. This manifested as a regression
    in bonnie++ rewrite performance because we always ended up considering
    sync IO starved and thus never increased async IO queue depth.

    Since rq->start_time is used in a lot of places, converting it to ns
    values would be non-trivial. So just revert the condition in CFQ to use
    comparison with jiffies. This will lead to suboptimal results if
    cfq_fifo_expire[1] ever comes close to 1 jiffie, but so far we are
    relatively far from that with the storage used with CFQ (the default
    value is 128 ms).

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • slice_resid can be both positive and negative. Commit 9a7f38c42c2b
    (cfq-iosched: Convert from jiffies to nanoseconds) converted it from
    long to u64. Although this did not introduce any functional regression
    (the operations just overflow and the result was fine), it is certainly
    wrong and could cause issues in the future. So convert the type to the
    more appropriate s64.
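
    A tiny standalone illustration (using the stdint equivalents of the
    kernel's u64/s64) of why the unsigned type is a poor fit even though the
    arithmetic happens to wrap back to the right answer:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t resid = 0;   /* stand-in for a u64 slice_resid */

        resid -= 5;           /* wraps to 18446744073709551611 */
        /* sums involving resid still wrap back to the right result, but any
         * direct test or printout of the value itself is silently wrong: */
        printf("%d %lld\n", resid > 0, (long long)(int64_t)resid);
        /* prints "1 -5": the signed view expresses the intent directly */
        return 0;
    }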

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

10 Jun, 2016

1 commit

  • If we're queuing REQ_PRIO IO and the task is running at an idle IO
    class, then temporarily boost the priority. This prevents livelocks
    due to priority inversion, when a low priority task is holding file
    system resources while attempting to do IO.

    An example of that is shown below. An ioniced idle task is holding
    the directory mutex, while a normal priority task is trying to do
    a directory lookup.

    [478381.198925] ------------[ cut here ]------------
    [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
    [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080
    [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
    [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
    [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
    [478381.206874] Call Trace:
    [478381.207253] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.208175] [] schedule+0x37/0x90
    [478381.208932] [] schedule_timeout+0x1dc/0x250
    [478381.209805] [] ? __blk_run_queue+0x37/0x50
    [478381.210706] [] ? ktime_get+0x45/0xb0
    [478381.211489] [] io_schedule_timeout+0xa7/0x110
    [478381.212402] [] ? prepare_to_wait+0x5b/0x90
    [478381.213280] [] bit_wait_io+0x36/0x50
    [478381.214063] [] __wait_on_bit+0x65/0x90
    [478381.214961] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.215872] [] out_of_line_wait_on_bit+0x7c/0x90
    [478381.216806] [] ? wake_atomic_t_function+0x40/0x40
    [478381.217773] [] __wait_on_buffer+0x2a/0x30
    [478381.218641] [] ext4_bread+0x57/0x70
    [478381.219425] [] __ext4_read_dirblock+0x3c/0x380
    [478381.220467] [] ext4_dx_find_entry+0x7d/0x170
    [478381.221357] [] ? find_get_entry+0x1e/0xa0
    [478381.222208] [] ext4_find_entry+0x484/0x510
    [478381.223090] [] ext4_lookup+0x52/0x160
    [478381.223882] [] lookup_real+0x1d/0x60
    [478381.224675] [] __lookup_hash+0x38/0x50
    [478381.225697] [] lookup_slow+0x45/0xab
    [478381.226941] [] link_path_walk+0x7ae/0x820
    [478381.227880] [] path_init+0xc2/0x430
    [478381.228677] [] ? security_file_alloc+0x16/0x20
    [478381.229776] [] path_openat+0x77/0x620
    [478381.230767] [] ? page_add_file_rmap+0x2e/0x70
    [478381.232019] [] do_filp_open+0x43/0xa0
    [478381.233016] [] ? creds_are_invalid+0x29/0x70
    [478381.234072] [] do_open_execat+0x70/0x170
    [478381.235039] [] do_execveat_common.isra.36+0x1b8/0x6e0
    [478381.236051] [] do_execve+0x2c/0x30
    [478381.236809] [] ? getname+0x12/0x20
    [478381.237564] [] SyS_execve+0x2e/0x40
    [478381.238338] [] stub_execve+0x6d/0xa0
    [478381.239126] ------------[ cut here ]------------
    [478381.239915] ------------[ cut here ]------------
    [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
    [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080
    [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
    [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
    [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
    [478381.248252] Call Trace:
    [478381.248630] [] schedule+0x37/0x90
    [478381.249382] [] schedule_preempt_disabled+0xe/0x10
    [478381.250465] [] __mutex_lock_slowpath+0x92/0x100
    [478381.251409] [] mutex_lock+0x1b/0x2f
    [478381.252199] [] lookup_slow+0x36/0xab
    [478381.253023] [] link_path_walk+0x7ae/0x820
    [478381.253877] [] ? try_charge+0xc1/0x700
    [478381.254690] [] path_init+0xc2/0x430
    [478381.255525] [] ? security_file_alloc+0x16/0x20
    [478381.256450] [] path_openat+0x77/0x620
    [478381.257256] [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
    [478381.258390] [] ? handle_mm_fault+0x13f3/0x1720
    [478381.259309] [] do_filp_open+0x43/0xa0
    [478381.260139] [] ? __alloc_fd+0x42/0x120
    [478381.260962] [] do_sys_open+0x13c/0x230
    [478381.261779] [] ? syscall_trace_enter_phase1+0x113/0x170
    [478381.262851] [] SyS_open+0x22/0x30
    [478381.263598] [] system_call_fastpath+0x12/0x17
    [478381.264551] ------------[ cut here ]------------
    [478381.265377] ------------[ cut here ]------------

    Signed-off-by: Jens Axboe
    Reviewed-by: Jeff Moyer

    Jens Axboe
     

08 Jun, 2016

6 commits


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Feb, 2016

4 commits

  • Currently we don't allow sync workload of one cgroup to preempt sync
    workload of any other cgroup. This is because we want to achieve service
    separation between cgroups. However, in cases where the preempting cgroup
    is an ancestor of the current cgroup, there is no need for separation and
    idling introduces unnecessary overhead. This hurts, for example, the case
    when a workload is isolated within a cgroup but journalling threads are in
    the root cgroup. A simple way to demonstrate the issue is using:

    dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1

    on an ext4 filesystem on a plain SATA drive (mounted with barrier=0 to make
    the difference more visible). When all processes are in the root cgroup,
    reported throughput is 153.132 MB/sec. When dbench process gets its own
    blkio cgroup, reported throughput drops to 26.1006 MB/sec.

    Fix the problem by making the check in cfq_should_preempt() more benevolent
    and allowing preemption by an ancestor cgroup. This improves the throughput
    reported by dbench4 to 48.9106 MB/sec.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • The original idea with preemption of sync noidle queues (introduced in
    commit 718eee0579b8 "cfq-iosched: fairness for sync no-idle queues") was
    that we service all sync noidle queues together, we don't idle on any of
    the queues individually and we idle only if there is no sync noidle
    queue to be served. This intention also matches the original test:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
        && new_cfqq->service_tree == cfqq->service_tree)
            return true;

    However since at that time cfqq->service_tree was not set for idling
    queues, this test was unreliable and was replaced in commit e4a229196a7c
    "cfq-iosched: fix no-idle preemption logic" by:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
        cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
        new_cfqq->service_tree->count == 1)
            return true;

    That was a reliable test but was actually doing something different -
    now we preempt sync noidle queue only if the new queue is the only one
    busy in the service tree.

    These days a cfq queue is kept in the service tree even if it is idling and
    thus the original check would be safe again. But since we actually check
    that cfq queues are in the same cgroup, of the same priority class and
    workload type (sync noidle), we know that new_cfqq is fine to preempt
    cfqq. So just remove the service tree check.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Move check for preemption by rt class up. There is no functional change
    but it makes arguing about conditions simpler since we can be sure both
    cfq queues are from the same ioprio class.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • There is no point in idling on a cfq group if the only cfq queue that is
    there has too big a thinktime.

    Signed-off-by: Jan Kara
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jan Kara
     

18 Sep, 2015

1 commit

  • cgroup_on_dfl() tests whether the cgroup's root is the default
    hierarchy; however, an individual controller is only interested in
    whether the controller is attached to the default hierarchy and never
    tests a cgroup which doesn't belong to the hierarchy that the
    controller is attached to.

    This patch replaces cgroup_on_dfl() tests in controllers with faster
    static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as
    the only user of cgroup_on_dfl() and the function is moved from the
    header file to cgroup.c.
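
    In blkcg terms the change has the following shape (the call site and the
    argument on the first line are illustrative, not an exact hunk):

    -   if (cgroup_on_dfl(blkcg->css.cgroup))
    +   if (cgroup_subsys_on_dfl(io_cgrp_subsys))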

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

19 Aug, 2015

6 commits

  • cgroup is trying to make the interface consistent across different
    controllers. For weight based resource control, the knob should have
    the range [1, 10000] and default to 100. This patch updates
    cfq-iosched so that the weight range conforms. The internal
    calculations have enough range and the widening of the weight range
    shouldn't cause any problem.

    * blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
    when blkcg is attached to a hierarchy.

    * cfq_cpd_init() is updated to use the new default value on the
    unified hierarchy.

    * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
    apply the default config matching the hierarchy type.

    * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
    is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
    now responsible for initializing the initial weights when blkcg is
    enabled.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is gonna switch to cgroup common weight range as defined by
    CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
    CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg interface grew to be the biggest of all controllers and
    unfortunately most inconsistent too. The interface files are
    inconsistent with a number of close duplicates. Some files have
    recursive variants while others don't. There's a distinction between
    normal and leaf weights which isn't intuitive and there are a lot of
    stat knobs which don't make much sense outside of debugging and expose
    too many implementation details to userland.

    In the unified hierarchy, everything is always hierarchical and
    internal nodes can't have tasks, rendering moot the two structural issues
    twisting the current interface. The interface has to be updated in a
    significant way anyway and this is a good chance to revamp it as a whole.
    This patch implements blkcg interface for the unified hierarchy.

    * (from a previous patch) blkcg is identified by "io" instead of
    "blkio" on the unified hierarchy. Given that the whole interface is
    updated anyway, the rename shouldn't carry noticeable conversion
    overhead.

    * The original interface, which consisted of 27 files, is replaced with
    the following three files.

    blkio.stat : per-blkcg stats
    blkio.weight : per-cgroup and per-cgroup-queue weight settings
    blkio.max : per-cgroup-queue bps and iops max limits

    Documentation/cgroups/unified-hierarchy.txt updated accordingly.

    v2: blkcg_policy->dfl_cftypes wasn't removed on
    blkcg_policy_unregister() corrupting the cftypes list. Fixed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Export blkg_dev_name()

    * Drop unnecessary @cft from __cfq_set_weight().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_conf_prep() expects input to be of the following form

    MAJ:MIN NUM

    and reads the NUM part into blkg_conf_ctx->v. This is quite
    restrictive and gets in the way of implementing the blkcg interface for
    the unified hierarchy. This patch updates blkg_conf_prep() so that it
    expects

    MAJ:MIN BODY_STR

    where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
    with ->body which is a char pointer pointing to the start of BODY_STR.
    Parsing of the body is moved to blkg_conf_prep()'s callers.

    To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
    non-const pointer, and to accommodate that, const is dropped from @input
    too.

    This doesn't cause any behavior changes.
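
    A small userspace illustration of the new input shape (purely
    illustrative; the kernel function itself also resolves the device and
    looks up the blkg):

    #include <stdio.h>
    #include <string.h>

    /* split "MAJ:MIN BODY_STR" into its parts; *body points into input */
    static int split_conf(char *input, unsigned int *maj, unsigned int *min,
                          char **body)
    {
        int consumed = 0;

        if (sscanf(input, "%u:%u%n", maj, min, &consumed) != 2)
            return -1;
        *body = input + consumed;
        *body += strspn(*body, " \t");  /* body starts after the whitespace */
        return 0;
    }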

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is about to grow an interface for the unified hierarchy. Add a
    legacy prefix to the existing cftypes.

    * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
    * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
    * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
    * blk-throttle.c:throtl_files -> throtl_legacy_files

    Pure renames. No functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo