08 Aug, 2018

2 commits

  • This patch does not change any functionality but prevents gcc from
    reporting the following warnings when building with W=1:

    block/cfq-iosched.c: In function 'cfq_back_seek_max_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4756:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4759:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4760:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_low_latency_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4765:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4782:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4783:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~
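
    For context, a minimal macro in the same shape as STORE_FUNCTION (names
    here are illustrative, not the kernel code) shows why gcc complains:

    /* __data is unsigned, so when MIN is 0 the first comparison can never
     * be true; that always-false test is exactly what -Wtype-limits flags
     * under W=1. */
    #define CLAMP_STORE(__data, MIN, MAX)        \
    do {                                         \
        if ((__data) < (MIN))                    \
            (__data) = (MIN);                    \
        else if ((__data) > (MAX))               \
            (__data) = (MAX);                    \
    } while (0)

    Silencing the warning without changing behaviour generally means
    restructuring the bound check so the compiler no longer sees an
    always-false comparison; the exact approach taken by the patch is not
    reproduced here.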

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch prevents gcc from complaining about fall-through when building
    with W=1.
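
    For reference, a sketch of the usual annotation that keeps gcc's
    -Wimplicit-fallthrough warning (enabled at W=1) quiet; the helpers are
    hypothetical, not the patched code:

    void prepare(void);  /* hypothetical helpers */
    void run(void);

    static void handle_stage(int stage)
    {
        switch (stage) {
        case 0:
            prepare();
            /* fall through */
        case 1:
            run();
            break;
        default:
            break;
        }
    }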

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

25 May, 2018

1 commit

  • Convert the S_ symbolic permissions to their octal equivalents, as
    many find octal permissions more readable than the symbolic ones.

    see: https://lkml.org/lkml/2016/8/2/1945

    Done with automated conversion via:
    $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace
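
    As an illustration, the conversion typically looks like the following
    (the module parameter here is made up, not taken from the patch):

    -module_param(debug_level, int, S_IRUGO | S_IWUSR);  /* 0444 | 0200 */
    +module_param(debug_level, int, 0644);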

    Miscellanea:

    o Wrapped modified multi-line calls to a single line where appropriate
    o Realigned modified multi-line calls to the open parenthesis

    Signed-off-by: Joe Perches
    Signed-off-by: Jens Axboe

    Joe Perches
     

09 May, 2018

2 commits

  • Currently, struct request has four timestamp fields:

    - A start time, set at get_request time, in jiffies, used for iostats
    - An I/O start time, set at start_request time, in ktime nanoseconds,
    used for blk-stats (i.e., wbt, kyber, hybrid polling)
    - Another start time and another I/O start time, used for cfq and bfq

    These can all be consolidated into one start time and one I/O start
    time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
    request depending on the kernel config.
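
    A sketch of the consolidated fields (names follow this series as I
    recall it and are not verified against the tree; the rest of the struct
    is omitted):

    struct request {
        /* ... */
        u64 start_time_ns;     /* set when the request is allocated */
        u64 io_start_time_ns;  /* set when the request is issued to the device */
        /* ... */
    };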

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • cfq and bfq have some internal fields that use sched_clock() which can
    trivially use ktime_get_ns() instead. Their timestamp fields in struct
    request can also use ktime_get_ns(), which resolves the 8 year old
    comment added by commit 28f4197e5d47 ("block: disable preemption before
    using sched_clock()").

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

09 Sep, 2017

2 commits

  • For the same reasons we already cache the leftmost pointer, apply the same
    optimization for rb_last() calls. Users must explicitly do this as
    rb_root_cached only deals with the smallest node.
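
    A sketch of what doing this explicitly can look like on the caller's
    side (names are illustrative, not the cfq code):

    #include <linux/rbtree.h>

    struct cached_rbtree {
        struct rb_root_cached root;   /* caches the leftmost node for us */
        struct rb_node *rightmost;    /* maintained by hand on insert/erase */
    };

    static inline struct rb_node *cached_rb_last(struct cached_rbtree *t)
    {
        return t->rightmost;          /* O(1) instead of walking rb_last() */
    }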

    [dave@stgolabs.net: brain fart #1]
    Link: http://lkml.kernel.org/r/20170731155955.GD21328@linux-80c1.suse
    Link: http://lkml.kernel.org/r/20170719014603.19029-18-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... with the generic rbtree flavor instead. No changes
    in semantics whatsoever.

    Link: http://lkml.kernel.org/r/20170719014603.19029-11-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Jan Kara
    Acked-by: Peter Zijlstra (Intel)
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

29 Aug, 2017

1 commit


11 Aug, 2017

1 commit

  • In the scenario below, blkio cgroups do not honor their assigned
    weights:
    1. When the underlying device is nonrotational with a single HW queue
    with depth of >= CFQ_HW_QUEUE_MIN
    2. When the use case is forming two blkio cgroups cg1(weight 1000) &
    cg2(weight 100) and two processes (file1 and file2) doing sync IO in
    their respective blkio cgroups.

    fio results for the above use case (without this patch):
    file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan 1 19:41:49 1970
    write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)

    file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan 1 19:41:49 1970
    write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)

    // both processes get equal BW even though they belong to different
    cgroups with weights of 1000 (cg1) and 100 (cg2)

    In the above case (for non-rotational NCQ devices), as soon as the
    request from cg1 completes, the CFQ algorithm expires that group when
    the driver tries to fetch the next request, even though cg1 was given
    the larger slice (set_slice=10), without granting it any idle time or
    weight priority, and it schedules another cfq group (in this case cg2).
    Both cfq groups (cg1 & cg2) thus keep alternating to get the disk time,
    and the cgroup weight based scheduling is lost.

    The patch below gives the cfq algorithm (cfq_arm_slice_timer) a chance
    to arm the slice timer when group_idle is enabled. If group_idle is not
    wanted either (including for non-rotational NCQ drives), it needs to be
    explicitly set to 0 from sysfs (see the sketch after the fio results
    below).

    fio results with this patch (for the above use case):
    file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan 1 00:06:08 1970
    write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)

    file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan 1 00:06:08 1970
    write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)

    // Here each process's BW matches its respective cgroup weight.
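
    A sketch of the decision described above (not the literal patch; the
    helper and its parameters are made up, while cfq_slice_idle and
    cfq_group_idle are the tunables named elsewhere in this log):

    static bool want_idle_timer(unsigned int slice_idle,
                                unsigned int group_idle, bool nonrot_ncq)
    {
        if (group_idle)
            return true;     /* idle on the group's behalf so weights apply */
        if (nonrot_ncq)
            return false;    /* old behaviour: never idle on fast NCQ devices */
        return slice_idle != 0;
    }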

    Signed-off-by: Ritesh Harjani
    Signed-off-by: Jens Axboe

    Ritesh Harjani
     

29 Jul, 2017

1 commit

  • Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
    way. Now we have standard blktrace API for this, so convert them to use
    it.

    Note, this changes the behavior a little bit. cgroup info isn't output
    by default; we only do this with the 'blk_cgroup' option enabled. cgroup
    info isn't output as a string by default either; that requires the
    'blk_cgname' option. Also, cgroup info is output at a different
    position in the note string. I think these behavior changes aren't a big
    issue (actually we make trace data shorter which is good), since the
    blktrace note is solely for debugging.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

12 Jun, 2017

1 commit

  • We've already got a few conflicts and upcoming work depends on some of the
    changes that have gone into mainline as regression fixes for this series.

    Pull in 4.12-rc5 to resolve these conflicts and make it easier on downstream
    trees to continue working on 4.13 changes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 May, 2017

1 commit

  • When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
    as the delay of cfq_group's vdisktime if there have been other cfq_groups
    already.

    When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
    from jiffies to nanoseconds") could result in a large iops delay and
    lead to an abnormal io schedule delay for the added cfq_group. To fix
    it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
    when iops mode is enabled.

    Despite having the same value, the delay of a cfq_queue in idle class
    and the delay of cfq_group are different things, so I define two new
    macros for the delay of a cfq_group under time-slice mode and iops mode.
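
    A reconstruction of the two delays described above (macro names and
    values are inferred from the text, not verified against the tree):

    /* vdisktime is in nanoseconds under time-slice mode but is a plain
     * count under iops mode, so the two modes need different constants. */
    #define CFQ_SLICE_MODE_GROUP_DELAY  (NSEC_PER_SEC / 5)  /* time-based */
    #define CFQ_IOPS_MODE_GROUP_DELAY   (HZ / 5)            /* the old value */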

    Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds")
    Cc: # 4.8+
    Signed-off-by: Hou Tao
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Hou Tao
     

30 May, 2017

1 commit


05 Apr, 2017

1 commit

  • Writeback throttling does not play well with CFQ, since CFQ also tries
    to throttle async writes. As a result async writeback can get starved in
    the presence of readers. As an example, take a benchmark simulating a
    PostgreSQL database running over a standard rotating SATA drive. There
    are 16 processes doing random reads from a huge file (2*machine memory),
    1 process doing random writes to the huge file and calling fsync once
    per 50000 writes and 1 process doing sequential 8k writes to a
    relatively small file wrapping around at the end of the file and calling
    fsync every 5 writes. Under this load read latency easily exceeds the
    target latency of 75 ms (just because there are so many reads happening
    against a relatively slow disk) and thus writeback is throttled to a
    point where only 1 write request is allowed at a time. Blktrace data
    then looks like:

    8,0 1 0 8.347751764 0 m N cfq workload slice:40000000
    8,0 1 0 8.347755256 0 m N cfq293A / set_active wl_class: 0 wl_type:0
    8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
    8,0 1 3814 8.347763916 5839 UT N [kworker/u9:2] 1
    8,0 0 0 8.347777605 0 m N cfq293A / Not idling. st->count:1
    8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
    8,0 3 1596 8.354364057 0 C R 156109528 + 8 (6906954) [0]
    8,0 3 0 8.354383193 0 m N cfq6196SN / complete rqnoidle 0
    8,0 3 0 8.354386476 0 m N cfq schedule dispatch
    8,0 3 0 8.354399397 0 m N cfq293A / Not idling. st->count:1
    8,0 3 0 8.354404705 0 m N cfq293A / dispatch_insert
    8,0 3 0 8.354409454 0 m N cfq293A / dispatched a request
    8,0 3 0 8.354412527 0 m N cfq293A / activate rq, drv=1
    8,0 3 1597 8.354414692 0 D W 145961400 + 24 (6718452) [swapper/0]
    8,0 3 0 8.354484184 0 m N cfq293A / Not idling. st->count:1
    8,0 3 0 8.354487536 0 m N cfq293A / slice expired t=0
    8,0 3 0 8.354498013 0 m N / served: vt=5888102466265088 min_vt=5888074869387264
    8,0 3 0 8.354502692 0 m N cfq293A / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
    8,0 3 0 8.354505695 0 m N cfq293A / del_from_rr
    ...
    8,0 0 1810 8.354728768 0 C W 145961400 + 24 (314076) [0]
    8,0 0 0 8.354746927 0 m N cfq293A / complete rqnoidle 0
    ...
    8,0 1 3829 8.389886102 5839 G W 145962968 + 24 [kworker/u9:2]
    8,0 1 3830 8.389888127 5839 P N [kworker/u9:2]
    8,0 1 3831 8.389908102 5839 A W 145978336 + 24 count:1
    8,0 0 0 9.455395969 0 m N cfq293A / slice expired t=0
    8,0 0 0 9.455404210 0 m N / served: vt=5888958194597888 min_vt=5888941810597888
    8,0 0 0 9.455410077 0 m N cfq293A / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
    8,0 0 0 9.455416851 0 m N cfq293A / del_from_rr
    ...
    8,0 0 2045 9.455648515 0 C W 145962968 + 24 (332305) [0]
    8,0 0 0 9.455668350 0 m N cfq293A / complete rqnoidle 0
    ...
    8,0 1 4371 9.455710115 5839 G W 145978336 + 24 [kworker/u9:2]
    8,0 1 4372 9.455712350 5839 P N [kworker/u9:2]
    8,0 1 4373 9.455730159 5839 A W 145986616 + 24
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Mar, 2017

1 commit


22 Feb, 2017

1 commit

  • Pull block layer updates from Jens Axboe:

    - blk-mq scheduling framework from me and Omar, with a port of the
    deadline scheduler for this framework. A port of BFQ from Paolo is in
    the works, and should be ready for 4.12.

    - Various fixups and improvements to the above scheduling framework
    from Omar, Paolo, Bart, me, others.

    - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
    This allows us to export more information that helps debug hangs or
    performance issues, without cluttering or abusing the sysfs API.

    - Fixes for the sbitmap code, the scalable bitmap code that was
    migrated from blk-mq, from Omar.

    - Removal of the BLOCK_PC support in struct request, and refactoring of
    carrying SCSI payloads in the block layer. This cleans up the code
    nicely, and enables us to kill the SCSI specific parts of struct
    request, shrinking it down nicely. From Christoph mainly, with help
    from Hannes.

    - Support for ranged discard requests and discard merging, also from
    Christoph.

    - Support for OPAL in the block layer, and for NVMe as well. Mainly
    from Scott Bauer, with fixes/updates from various other folks.

    - Error code fixup for gdrom from Christophe.

    - cciss pci irq allocation cleanup from Christoph.

    - Making the cdrom device operations read only, from Kees Cook.

    - Fixes for duplicate bdi registrations and bdi/queue life time
    problems from Jan and Dan.

    - Set of fixes and updates for lightnvm, from Matias and Javier.

    - A few fixes for nbd from Josef, using idr to name devices and a
    workqueue deadlock fix on receive. Also marks Josef as the current
    maintainer of nbd.

    - Fix from Josef, overwriting queue settings when the number of
    hardware queues is updated for a blk-mq device.

    - NVMe fix from Keith, ensuring that we don't repeatedly mark an IO
    aborted if we didn't end up aborting it.

    - SG gap merging fix from Ming Lei for block.

    - Loop fix also from Ming, fixing a race and crash between setting loop
    status and IO.

    - Two block race fixes from Tahsin, fixing request list iteration and
    fixing a race between device registration and udev device add
    notifications.

    - Double free fix from cgroup writeback, from Tejun.

    - Another double free fix in blkcg, from Hou Tao.

    - Partition overflow fix for EFI from Alden Tondettar.

    * tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
    nvme: Check for Security send/recv support before issuing commands.
    block/sed-opal: allocate struct opal_dev dynamically
    block/sed-opal: tone down not supported warnings
    block: don't defer flushes on blk-mq + scheduling
    blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
    blk-mq: don't special case flush inserts for blk-mq-sched
    blk-mq-sched: don't add flushes to the head of requeue queue
    blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
    block: do not allow updates through sysfs until registration completes
    lightnvm: set default lun range when no luns are specified
    lightnvm: fix off-by-one error on target initialization
    Maintainers: Modify SED list from nvme to block
    Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
    uapi: sed-opal fix IOW for activate lsp to use correct struct
    cdrom: Make device operations read-only
    elevator: fix loading wrong elevator type for blk-mq devices
    cciss: switch to pci_irq_alloc_vectors
    block/loop: fix race between I/O and set_status
    blk-mq-sched: don't hold queue_lock when calling exit_icq
    block: set make_request_fn manually in blk_mq_update_nr_hw_queues
    ...

    Linus Torvalds
     

16 Feb, 2017

1 commit


09 Feb, 2017

1 commit


23 Jan, 2017

2 commits

  • The script "checkpatch.pl" pointed out information like the following.

    ERROR: do not use assignment in if condition

    Thus fix the affected source code place.
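
    For readers unfamiliar with this checkpatch rule, the complaint is about
    code of the following shape (function names are made up):

    int prepare(void);

    static int before(void)
    {
        int ret;

        if ((ret = prepare()) != 0)   /* assignment inside the condition */
            return ret;
        return 0;
    }

    static int after(void)
    {
        int ret;

        ret = prepare();              /* assignment pulled out of the if */
        if (ret != 0)
            return ret;
        return 0;
    }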

    Signed-off-by: Markus Elfring
    Signed-off-by: Jens Axboe

    Markus Elfring
     
  • KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
    uninitialized memory in cfq_init_cfqq():

    ==================================================================
    BUG: KMSAN: use of unitialized memory
    ...
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0x157/0x1d0 lib/dump_stack.c:51
    [] kmsan_report+0x205/0x360 ??:?
    [] __msan_warning+0x5b/0xb0 ??:?
    [< inline >] cfq_init_cfqq block/cfq-iosched.c:3754
    [] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857
    ...
    origin:
    [] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67
    [] kmsan_internal_poison_shadow+0xab/0x150 ??:?
    [] kmsan_poison_slab+0xbb/0x120 ??:?
    [< inline >] allocate_slab mm/slub.c:1627
    [] new_slab+0x3af/0x4b0 mm/slub.c:1641
    [< inline >] new_slab_objects mm/slub.c:2407
    [] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564
    [< inline >] __slab_alloc mm/slub.c:2606
    [< inline >] slab_alloc_node mm/slub.c:2669
    [] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746
    [] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850
    ...
    ==================================================================
    (the line numbers are relative to 4.8-rc6, but the bug persists
    upstream)

    The uninitialized struct cfq_queue is created by kmem_cache_alloc_node()
    and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class
    before it's initialized.

    Signed-off-by: Alexander Potapenko
    Signed-off-by: Jens Axboe

    Alexander Potapenko
     

18 Jan, 2017

1 commit


29 Nov, 2016

1 commit


22 Nov, 2016

1 commit

  • blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
    when that fails falls back to operations which aren't specific to the
    cgroup. Occasional failures are expected under pressure and falling
    back to non-cgroup operation is the right thing to do.

    Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
    these expected failures end up creating a lot of noise. Add
    __GFP_NOWARN.
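
    The shape of the fix (the struct here is a stand-in; only the gfp flags
    matter): allocations that are expected to fail under pressure pass
    __GFP_NOWARN so the failure stays silent and we quietly fall back.

    #include <linux/slab.h>

    struct per_cgroup_data { int weight; };   /* stand-in for the real structures */

    static struct per_cgroup_data *alloc_cgroup_data(int node)
    {
        return kzalloc_node(sizeof(struct per_cgroup_data),
                            GFP_NOWAIT | __GFP_NOWARN, node);
    }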

    Signed-off-by: Tejun Heo
    Reported-by: Marc MERLIN
    Reported-by: Vlastimil Babka
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Nov, 2016

2 commits

  • Enable throttling of buffered writeback to make it a lot
    smoother and have far less impact on other system activity.
    Background writeback should be, by definition, background
    activity. The fact that we flush huge bundles of it at a time
    means that it potentially has a heavy impact on foreground workloads,
    which isn't ideal. We can't easily limit the sizes of writes that
    we do, since that would impact file system layout in the presence
    of delayed allocation. So just throttle back buffered writeback,
    unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration from the
    CoDel network scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply increment the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
    incorrectly hard coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.
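
    A sketch of the one-line nature of the fix (surrounding kernel context
    trimmed, details approximate):

    static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
    {
        struct cfq_group_data *cgd;

        cgd = kzalloc(sizeof(*cgd), gfp);   /* previously hard coded GFP_KERNEL */
        return cgd ? &cgd->cpd : NULL;
    }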

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Nov, 2016

2 commits


28 Oct, 2016

1 commit

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later), and thus cleans up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
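
    A sketch of the resulting encoding (the bit width matches the REQ_OP_*
    layout as I recall it; the helpers are made up):

    #define REQ_OP_BITS  8
    #define REQ_OP_MASK  ((1u << REQ_OP_BITS) - 1)

    static inline unsigned int op_of(u32 cmd_flags)
    {
        return cmd_flags & REQ_OP_MASK;    /* op lives in the low bits, no shifting */
    }

    static inline u32 common_flags_of(u32 cmd_flags)
    {
        return cmd_flags & ~REQ_OP_MASK;   /* shared flags sit above the op */
    }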

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2016

1 commit

  • While debugging timeouts happening in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging everywhere from 2 seconds -
    the first ones that are enough to time out my application - to more than 30
    seconds.

    The problem seems to happen because XFS may block on pending metadata updates
    under certain circumstances, and that's confirmed with the following backtrace
    taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
    high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
    the open() times I saw in that run.

    Still from the blktrace output, we can, after searching a bit, identify the
    request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself checks cfq_may_dispatch(), it can bail if it sees SYNC
    requests in flight. An incoming request from another queue can also preempt it
    in such situation before we have the chance to execute anything (as seen in the
    trace above).

    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
    the queue for subsequent requests (unless they are themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.
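
    The shape of that change, as I read the description (a sketch, not the
    literal patch; the fifo-expiry helper is hypothetical, the must_dispatch
    flag is the one named above):

    static void maybe_pin_expired_queue(struct cfq_queue *cfqq)
    {
        if (queue_has_fifo_expired_request(cfqq))  /* hypothetical helper */
            cfq_mark_cfqq_must_dispatch(cfqq);     /* dropped after one dispatch round */
    }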

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jul, 2016

1 commit

  • Before merging a bio into an existing request, the io scheduler is called
    to get its approval first. However, requests that come from a plug
    flush may get merged by the block layer without consulting the io
    scheduler.

    In the case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, the high weight
    cgroup now depends on the low weight cgroup to get scheduled. If the high
    weight cgroup needs that io request to complete before submitting more
    requests, then it will also lose its timeslice.

    The following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
    mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
    if [ -d $i ]; then
    rmdir $i
    fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
    echo ---- $i ----
    cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

28 Jun, 2016

3 commits

  • Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds)
    could result in charging just 1 ns to a cgroup submitting IO instead of
    the 1 jiffie we always charged before. It is arguable what the right
    amount to charge is, but for now let's retain the old behavior of always
    charging at least one jiffie.
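
    The retained minimum amounts to something like the following clamp
    (variable name illustrative):

    static u64 clamp_group_charge(u64 charge_ns)
    {
        /* never bill less than one jiffy's worth, as before the conversion */
        return max_t(u64, charge_ns, jiffies_to_nsecs(1));
    }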

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds)
    broke the condition for detecting starved sync IO in
    cfq_completed_request() because rq->start_time remained in jiffies but
    we compared it with nanosecond values. This manifested as a regression
    in bonnie++ rewrite performance because we always ended up considering
    sync IO starved and thus never increased async IO queue depth.

    Since rq->start_time is used in a lot of places, converting it to ns
    values would be non-trivial. So just revert the condition in CFQ to use
    comparison with jiffies. This will lead to suboptimal results if
    cfq_fifo_expire[1] will ever come close to 1 jiffie but so far we are
    relatively far from that with the storage used with CFQ (the default
    value is 128 ms).

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • slice_resid can be both positive and negative. Commit 9a7f38c42c2b
    (cfq-iosched: Convert from jiffies to nanoseconds) converted it from
    long to u64. Although this did not introduce any functional regression
    (the operations just overflow and the result was fine), it is certainly
    wrong and could cause issues in future. So convert the type to more
    appropriate s64.

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

10 Jun, 2016

1 commit

  • If we're queuing REQ_PRIO IO and the task is running at an idle IO
    class, then temporarily boost the priority. This prevents livelocks
    due to priority inversion, when a low priority task is holding file
    system resources while attempting to do IO.
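
    Conceptually the boost amounts to something like the sketch below (not
    the literal patch; only the ioprio constants and REQ_PRIO are real):

    static unsigned short effective_ioprio_class(unsigned short task_class,
                                                 unsigned int op_flags)
    {
        if ((op_flags & REQ_PRIO) && task_class == IOPRIO_CLASS_IDLE)
            return IOPRIO_CLASS_BE;   /* temporary boost for this IO */
        return task_class;
    }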

    An example of such a livelock is shown below. An ioniced idle task is holding
    the directory mutex, while a normal priority task is trying to do
    a directory lookup.

    [478381.198925] ------------[ cut here ]------------
    [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
    [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080
    [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
    [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
    [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
    [478381.206874] Call Trace:
    [478381.207253] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.208175] [] schedule+0x37/0x90
    [478381.208932] [] schedule_timeout+0x1dc/0x250
    [478381.209805] [] ? __blk_run_queue+0x37/0x50
    [478381.210706] [] ? ktime_get+0x45/0xb0
    [478381.211489] [] io_schedule_timeout+0xa7/0x110
    [478381.212402] [] ? prepare_to_wait+0x5b/0x90
    [478381.213280] [] bit_wait_io+0x36/0x50
    [478381.214063] [] __wait_on_bit+0x65/0x90
    [478381.214961] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.215872] [] out_of_line_wait_on_bit+0x7c/0x90
    [478381.216806] [] ? wake_atomic_t_function+0x40/0x40
    [478381.217773] [] __wait_on_buffer+0x2a/0x30
    [478381.218641] [] ext4_bread+0x57/0x70
    [478381.219425] [] __ext4_read_dirblock+0x3c/0x380
    [478381.220467] [] ext4_dx_find_entry+0x7d/0x170
    [478381.221357] [] ? find_get_entry+0x1e/0xa0
    [478381.222208] [] ext4_find_entry+0x484/0x510
    [478381.223090] [] ext4_lookup+0x52/0x160
    [478381.223882] [] lookup_real+0x1d/0x60
    [478381.224675] [] __lookup_hash+0x38/0x50
    [478381.225697] [] lookup_slow+0x45/0xab
    [478381.226941] [] link_path_walk+0x7ae/0x820
    [478381.227880] [] path_init+0xc2/0x430
    [478381.228677] [] ? security_file_alloc+0x16/0x20
    [478381.229776] [] path_openat+0x77/0x620
    [478381.230767] [] ? page_add_file_rmap+0x2e/0x70
    [478381.232019] [] do_filp_open+0x43/0xa0
    [478381.233016] [] ? creds_are_invalid+0x29/0x70
    [478381.234072] [] do_open_execat+0x70/0x170
    [478381.235039] [] do_execveat_common.isra.36+0x1b8/0x6e0
    [478381.236051] [] do_execve+0x2c/0x30
    [478381.236809] [] ? getname+0x12/0x20
    [478381.237564] [] SyS_execve+0x2e/0x40
    [478381.238338] [] stub_execve+0x6d/0xa0
    [478381.239126] ------------[ cut here ]------------
    [478381.239915] ------------[ cut here ]------------
    [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
    [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080
    [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
    [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
    [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
    [478381.248252] Call Trace:
    [478381.248630] [] schedule+0x37/0x90
    [478381.249382] [] schedule_preempt_disabled+0xe/0x10
    [478381.250465] [] __mutex_lock_slowpath+0x92/0x100
    [478381.251409] [] mutex_lock+0x1b/0x2f
    [478381.252199] [] lookup_slow+0x36/0xab
    [478381.253023] [] link_path_walk+0x7ae/0x820
    [478381.253877] [] ? try_charge+0xc1/0x700
    [478381.254690] [] path_init+0xc2/0x430
    [478381.255525] [] ? security_file_alloc+0x16/0x20
    [478381.256450] [] path_openat+0x77/0x620
    [478381.257256] [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
    [478381.258390] [] ? handle_mm_fault+0x13f3/0x1720
    [478381.259309] [] do_filp_open+0x43/0xa0
    [478381.260139] [] ? __alloc_fd+0x42/0x120
    [478381.260962] [] do_sys_open+0x13c/0x230
    [478381.261779] [] ? syscall_trace_enter_phase1+0x113/0x170
    [478381.262851] [] SyS_open+0x22/0x30
    [478381.263598] [] system_call_fastpath+0x12/0x17
    [478381.264551] ------------[ cut here ]------------
    [478381.265377] ------------[ cut here ]------------

    Signed-off-by: Jens Axboe
    Reviewed-by: Jeff Moyer

    Jens Axboe
     

08 Jun, 2016

5 commits