Eric Lee / smarc-fsl-linux-kernel

27 Jul, 2016

2 commits

3fc9d6909 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
"This branch also contains core changes. I've come to the conclusion
that from 4.9 and forward, I'll be doing just a single branch. We
often have dependencies between core and drivers, and it's hard to
always split them up appropriately without pulling core into drivers
when that happens.

That said, this contains:

- separate secure erase type for the core block layer, from
Christoph.

- set of discard fixes, from Christoph.

- bio shrinking fixes from Christoph, as a followup to the
op/flags change in the core branch.

- map and append request fixes from Christoph.

- NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
exciting!

- nvme-loop fixes from Arnd.

- removal of ->driverfs_dev from Dan, after providing a
device_add_disk() helper.

- bcache fixes from Bhaktipriya and Yijing.

- cdrom subchannel read fix from Vchannaiah.

- set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

- set of drbd updates and fixes from Fabian, Lars, and Philipp.

- mg_disk error path fix from Bart.

- user notification for failed device add for loop, from Minfei.

- NVMe in general:
+ NVMe delay quirk from Guilherme.
+ SR-IOV support and command retry limits from Keith.
+ fix for memory-less NUMA node from Masayoshi.
+ use UINT_MAX for discard sectors, from Minfei.
+ cancel IO fixes from Ming.
+ don't allocate unused major, from Neil.
+ error code fixup from Dan.
+ use constants for PSDT/FUSE from James.
+ variable init fix from Jay.
+ fabrics fixes from Ming, Sagi, and Wei.
+ various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
nvme/pci: Provide SR-IOV support
nvme: initialize variable before logical OR'ing it
block: unexport various bio mapping helpers
scsi/osd: open code blk_make_request
target: stop using blk_make_request
block: simplify and export blk_rq_append_bio
block: ensure bios return from blk_get_request are properly initialized
virtio_blk: use blk_rq_map_kern
memstick: don't allow REQ_TYPE_BLOCK_PC requests
block: shrink bio size again
block: simplify and cleanup bvec pool handling
block: get rid of bio_rw and READA
block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
NVMe: don't allocate unused nvme_major
nvme: avoid crashes when node 0 is memoryless node.
nvme: Limit command retries
loop: Make user notify for adding loop device failed
nvme-loop: fix nvme-loop Kconfig dependencies
nvmet: fix return value check in nvmet_subsys_alloc()
...

Linus Torvalds
2016-07-27 06:37:51 +0800
d05d7f407 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts

- regression fix for the above for btrfs, from Vincent

- following up to the above, better packing of struct request from
Christoph

- a 2038 fix for blktrace from Arnd

- a few trivial/spelling fixes from Bart Van Assche

- a front merge check fix from Damien, which could cause issues on
SMR drives

- Atari partition fix from Gabriel

- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff

- CFQ priority boost fix for idle classes, from me

- cleanup series from Ming, improving our bio/bvec iteration

- a direct issue fix for blk-mq from Omar

- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin

- expose DAX type internally and through sysfs. From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...

Linus Torvalds
2016-07-27 06:03:07 +0800

26 Jul, 2016

1 commit

55392c4c0 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
"This update provides the following changes:

- The rework of the timer wheel which addresses the shortcomings of
the current wheel (cascading, slow search for next expiring timer,
etc). That's the first major change of the wheel in almost 20
years since Finn implemented it.

- A large overhaul of the clocksource drivers init functions to
consolidate the Device Tree initialization

- Some more Y2038 updates

- A capability fix for timerfd

- Yet another clock chip driver

- The usual pile of updates, comment improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
tick/nohz: Optimize nohz idle enter
clockevents: Make clockevents_subsys static
clocksource/drivers/time-armada-370-xp: Fix return value check
timers: Implement optimization for same expiry time in mod_timer()
timers: Split out index calculation
timers: Only wake softirq if necessary
timers: Forward the wheel clock whenever possible
timers/nohz: Remove pointless tick_nohz_kick_tick() function
timers: Optimize collect_expired_timers() for NOHZ
timers: Move __run_timers() function
timers: Remove set_timer_slack() leftovers
timers: Switch to a non-cascading wheel
timers: Reduce the CPU index space to 256k
timers: Give a few structs and members proper names
hlist: Add hlist_is_singular_node() helper
signals: Use hrtimer for sigtimedwait()
timers: Remove the deprecated mod_timer_pinned() API
timers, net/ipv4/inet: Initialize connection request timers as pinned
timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
timers, drivers/tty/metag_da: Initialize the poll timer as pinned
...

Linus Torvalds
2016-07-26 11:43:12 +0800

21 Jul, 2016

11 commits

17007f399 block: Fix front merge check

For a front merge, the maximum number of sectors of the
request must be checked against the front merge BIO sector,
not the current sector of the request.
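
A minimal, self-contained C sketch of why the sector passed to the limit check matters. The toy_* names and the chunk-boundary rule are assumptions made for illustration and are not the kernel's code: the size limit is modeled as the distance from a given sector to the next chunk boundary, loosely like a zoned (SMR) drive that forbids requests from crossing a zone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy types; not the kernel's struct request / struct bio. */
struct toy_request { uint64_t pos; uint32_t nr_sectors; };
struct toy_bio     { uint64_t sector; uint32_t nr_sectors; };

/* Assumed limit: a request starting at 'offset' may not cross a chunk boundary. */
static uint32_t toy_max_sectors(uint64_t offset, uint32_t chunk_sectors)
{
	return (uint32_t)(chunk_sectors - (offset % chunk_sectors));
}

/*
 * Front merge: the bio is prepended, so the merged request would start
 * at bio->sector. use_bio_sector selects which start sector the limit
 * is computed from (false models the check before the fix).
 */
static bool toy_front_merge_ok(const struct toy_request *rq,
			       const struct toy_bio *bio,
			       uint32_t chunk_sectors, bool use_bio_sector)
{
	uint64_t start = use_bio_sector ? bio->sector : rq->pos;

	return rq->nr_sectors + bio->nr_sectors <=
	       toy_max_sectors(start, chunk_sectors);
}
```

With 256-sector chunks, a request at sector 256 and a bio covering sectors 248..255, the limit computed from the request's sector allows the merge even though the result would straddle the boundary at 256; computed from the bio's sector, the merge is refused.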

Signed-off-by: Damien Le Moal
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Damien Le Moal
2016-07-21 11:40:47 +0800
72ef799b3 block: do not merge requests without consulting with io scheduler

Before merging a bio into an existing request, io scheduler is called to
get its approval first. However, the requests that come from a plug
flush may get merged by block layer without consulting with io
scheduler.

In case of CFQ, this can cause fairness problems. For instance, if a
request gets merged into a low weight cgroup's request, high weight cgroup
now will depend on low weight cgroup to get scheduled. If high weight cgroup
needs that io request to complete before submitting more requests, then it
will also lose its timeslice.

Following script demonstrates the problem. Group g1 has a low weight, g2
and g3 have equal high weights but g2's requests are adjacent to g1's
requests so they are subject to merging. Due to these merges, g2 gets
poor disk time allocation.

cat > cfq-merge-repro.sh << "EOF"
#!/bin/bash
set -e

IO_ROOT=/mnt-cgroup/io

mkdir -p $IO_ROOT

if ! mount | grep -qw $IO_ROOT; then
    mount -t cgroup none -oblkio $IO_ROOT
fi

cd $IO_ROOT

for i in g1 g2 g3; do
    if [ -d $i ]; then
        rmdir $i
    fi
done

mkdir g1 && echo 10 > g1/blkio.weight
mkdir g2 && echo 495 > g2/blkio.weight
mkdir g3 && echo 495 > g3/blkio.weight

RUNTIME=10

(echo $BASHPID > g1/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
        --rw read --size 64k --bs 64k --time_based \
        --runtime=$RUNTIME --offset=0k &> /dev/null)&

(echo $BASHPID > g2/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
        --rw read --size 64k --bs 64k --time_based \
        --runtime=$RUNTIME --offset=64k &> /dev/null)&

(echo $BASHPID > g3/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
        --rw read --size 64k --bs 64k --time_based \
        --runtime=$RUNTIME --offset=256k &> /dev/null)&

sleep $((RUNTIME+1))

for i in g1 g2 g3; do
    echo ---- $i ----
    cat $i/blkio.time
done

EOF
# ./cfq-merge-repro.sh
---- g1 ----
8:16 162
---- g2 ----
8:16 165
---- g3 ----
8:16 686

After applying the patch:

# ./cfq-merge-repro.sh
---- g1 ----
8:16 90
---- g2 ----
8:16 445
---- g3 ----
8:16 471

Signed-off-by: Tahsin Erdogan
Signed-off-by: Jens Axboe

Tahsin Erdogan
2016-07-21 11:35:12 +0800
68bdf1ac2 block: Fix spelling in a source code comment

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe

Bart Van Assche
2016-07-21 11:28:22 +0800
ea6ca600e block: expose QUEUE_FLAG_DAX in sysfs

Provides the ability to identify DAX enabled devices in userspace.

Signed-off-by: Yigal Korman
Signed-off-by: Toshi Kani
Acked-by: Dan Williams
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Yigal Korman
2016-07-21 11:01:08 +0800
f9cc4472c block: unexport various bio mapping helpers

They are unused and potential new users really should use the
blk_rq_map* versions.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-07-21 07:38:37 +0800
4613c5f1d scsi/osd: open code blk_make_request

I wish the OSD code could simply use blk_rq_map_* helpers like
everyone else, but the complex nature of deciding if we have
DATA IN and/or DATA OUT buffers might make this impossible
(at least for a mere human like me).

But using blk_rq_append_bio at least allows sharing the setup code
between requests with or without data buffers, and given that this
is the last user of blk_make_request it allows getting rid of that
somewhat awkward interface.

Signed-off-by: Christoph Hellwig
Acked-by: Boaz Harrosh
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-07-21 07:38:35 +0800
98d61d5b1 block: simplify and export blk_rq_append_bio

The target SCSI passthrough backend is much better served with the low-level
blk_rq_append_bio construct than the helpers built on top of it, so export it.

Also use the opportunity to remove the pointless request_queue argument and
make the code flow a little more readable.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-07-21 07:38:32 +0800
0c4de0f33 block: ensure bios return from blk_get_request are properly initialized

blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API. Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-07-21 07:38:30 +0800
ed996a52c block: simplify and cleanup bvec pool handling

Instead of a flag and an index just make sure an index of 0 means
no need to free the bvec array. Also move the constants related
to the bvec pools together and use a consistent naming scheme for
them.

Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Reviewed-by: Mike Christie
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-07-21 07:37:02 +0800
3f40bf2c8 block: don't ignore -EOPNOTSUPP blkdev_issue_write_same

WRITE SAME is a data integrity operation and we can't simply ignore
errors.

Signed-off-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-07-21 07:35:22 +0800
e950fdf71 block: introduce BLKDEV_DISCARD_ZERO to fix zeroout

Currently blkdev_issue_zeroout cascades down from discards (if the driver
guarantees that discards zero data), to WRITE SAME and then to a loop
writing zeroes. Unfortunately we ignore run-time EOPNOTSUPP errors in the
block layer blkdev_issue_discard helper to work around DM volumes that
may have mixed discard support underneath.

This patch introduces a new BLKDEV_DISCARD_ZERO flag to
blkdev_issue_discard that indicates we are called for a zeroing operation.
This allows us both to bypass the EOPNOTSUPP hack in that case and to
consolidate the discard_zeroes_data check into the function.
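
The cascade and the role of the new flag can be sketched in self-contained C. Everything named toy_* or TOY_* is a hypothetical stand-in, not the block layer's API; the sketch only assumes what the message states: plain discards swallow a run-time EOPNOTSUPP, while a zeroing caller must see the error so it can fall back to WRITE SAME and then to writing zeroes.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_DISCARD_ZERO 0x1	/* hypothetical stand-in for BLKDEV_DISCARD_ZERO */

struct toy_dev {
	bool discard_zeroes_data;	/* driver guarantees discarded blocks read as zero */
	bool discard_works;		/* false models a run-time EOPNOTSUPP */
	bool write_same_works;
};

static int toy_issue_discard(const struct toy_dev *dev, unsigned int flags)
{
	/* The zeroing guarantee is checked inside the helper itself. */
	if ((flags & TOY_DISCARD_ZERO) && !dev->discard_zeroes_data)
		return -EOPNOTSUPP;
	/* A plain discard is advisory, so its failure is swallowed. */
	if (!dev->discard_works)
		return (flags & TOY_DISCARD_ZERO) ? -EOPNOTSUPP : 0;
	return 0;
}

static int toy_write_same_zero(const struct toy_dev *dev)
{
	return dev->write_same_works ? 0 : -EOPNOTSUPP;
}

static int toy_write_zero_loop(const struct toy_dev *dev)
{
	(void)dev;	/* writing zeroes by hand is assumed to always work */
	return 0;
}

enum toy_method { TOY_DISCARD = 1, TOY_WRITE_SAME, TOY_ZERO_LOOP };

/* Try the cheapest method first and fall back on failure. */
static int toy_zeroout(const struct toy_dev *dev, enum toy_method *used)
{
	if (toy_issue_discard(dev, TOY_DISCARD_ZERO) == 0) {
		*used = TOY_DISCARD;
		return 0;
	}
	if (toy_write_same_zero(dev) == 0) {
		*used = TOY_WRITE_SAME;
		return 0;
	}
	*used = TOY_ZERO_LOOP;
	return toy_write_zero_loop(dev);
}
```

For a device that advertises zeroing discards but fails them at run time, a plain toy_issue_discard() call still reports success, whereas toy_zeroout() sees the error and moves on to WRITE SAME.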

Signed-off-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-07-21 07:35:20 +0800

14 Jul, 2016

1 commit

9c8747167 block: atari: Return early for unsupported sector size

For 4K LBA or very large disks, atari_partition can easily get tricked
into thinking it has found an Atari partition table. Depending on the
data in the disk, it ends up creating partitions with awkward lengths.

We saw logs like this while playing with fio.

[5.625867] nvme2n1: AHDI p2
[5.625872] nvme2n1: p2 size 2910030523 extends beyond EOD, truncated

People have had issues with misinterpreted AHDI partition tables for a long
time, see this BSD thread from 1995, for example.

https://mail-index.netbsd.org/port-atari/1995/11/19/0001.html

Since the Atari partition format, according to the spec, doesn't even support
sector sizes larger than 512 bytes, a quick sanity check is reasonable to
just bail out early, before even attempting to read sector 0.
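
The shape of such a sanity check, as a self-contained C sketch. The toy_disk structure and its fields are hypothetical and only model the situation described above: data in sector 0 that happens to resemble an AHDI table on a disk whose logical block size is not 512 bytes.

```c
#include <stdbool.h>

struct toy_disk {
	unsigned int logical_block_size;	/* bytes */
	bool sector0_looks_like_ahdi;		/* arbitrary data can match by accident */
	int sector0_reads;			/* counts reads of sector 0 */
};

/* Returns 1 if an AHDI table is "found", 0 otherwise. */
static int toy_atari_partition(struct toy_disk *disk)
{
	/* Bail out before touching the disk at all. */
	if (disk->logical_block_size != 512)
		return 0;

	disk->sector0_reads++;	/* stand-in for reading and parsing sector 0 */
	return disk->sector0_looks_like_ahdi ? 1 : 0;
}
```

A 4K-sector disk is rejected without a single read, so accidental matches such as the "AHDI p2" log above cannot occur in this model.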

Signed-off-by: Gabriel Krisman Bertazi
Signed-off-by: Jens Axboe

Gabriel Krisman Bertazi
2016-07-14 00:31:44 +0800

09 Jul, 2016

1 commit

41d512e51 Merge branch 'for-4.8/block' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm into for-4.8/drivers

Dan writes:

"The removal of ->driverfs_dev in favor of just passing the parent
device in as a parameter to add_disk(). See below, it has received a
"Reviewed-by" from Christoph, Bart, and Johannes.

It is also a pre-requisite for Fam Zheng's work to cleanup gendisk
uevents vs attribute visibility [1]. We would extend device_add_disk()
to take an attribute_group list.

This is based off a branch of block.git/for-4.8/drivers and has
received a positive build success notification from the kbuild robot
across several configs.

[1]: "gendisk: Generate uevent after attribute available"
http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"

Jens Axboe
2016-07-09 06:04:11 +0800

08 Jul, 2016

1 commit

486cf9899 blk-mq: Introduce blk_mq_reinit_tagset

The new nvme-rdma driver will need to reinitialize all the tags as part of
the error recovery procedure (realloc the tag memory region). Add a helper
in blk-mq for it that can iterate over all requests in a tagset to make
this easier.
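
A rough C sketch of what "iterate over all requests in a tagset" could look like. The toy_tag_set layout and the callback signature are assumptions made for illustration; the real helper's structures and signature are not shown in this message.

```c
#include <stddef.h>

struct toy_tags {
	void **rqs;		/* preallocated requests of one hardware queue */
	unsigned int nr;
};

struct toy_tag_set {
	struct toy_tags *tags;
	unsigned int nr_hw_queues;
	/* Driver hook, e.g. to re-register a memory region for one request. */
	int (*reinit_request)(void *driver_data, void *rq);
	void *driver_data;
};

/* Visit every request of every queue; stop at the first error. */
static int toy_reinit_tagset(struct toy_tag_set *set)
{
	if (!set->reinit_request)
		return 0;

	for (unsigned int q = 0; q < set->nr_hw_queues; q++) {
		struct toy_tags *tags = &set->tags[q];

		for (unsigned int i = 0; i < tags->nr; i++) {
			int ret;

			if (!tags->rqs[i])
				continue;
			ret = set->reinit_request(set->driver_data, tags->rqs[i]);
			if (ret)
				return ret;
		}
	}
	return 0;
}

/* Example callback: counts how many requests were revisited. */
static int toy_count_reinit(void *driver_data, void *rq)
{
	(void)rq;
	++*(int *)driver_data;
	return 0;
}
```

Returning on the first failure lets the caller abort its error recovery instead of continuing with a half-reinitialized set.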

Signed-off-by: Sagi Grimberg
Tested-by: Ming Lin
Reviewed-by: Stephen Bates
Signed-off-by: Christoph Hellwig
Reviewed-by: Steve Wise
Tested-by: Steve Wise
Signed-off-by: Jens Axboe

Sagi Grimberg
2016-07-08 22:38:49 +0800

07 Jul, 2016

1 commit

53bf837b7 timers: Remove set_timer_slack() leftovers

We now have implicit batching in the timer wheel. The slack API is no longer
used, so remove it.

Signed-off-by: Thomas Gleixner
Cc: Alan Stern
Cc: Andrew F. Davis
Cc: Arjan van de Ven
Cc: Chris Mason
Cc: David S. Miller
Cc: David Woodhouse
Cc: Dmitry Eremin-Solenikov
Cc: Eric Dumazet
Cc: Frederic Weisbecker
Cc: George Spelvin
Cc: Greg Kroah-Hartman
Cc: Jaehoon Chung
Cc: Jens Axboe
Cc: John Stultz
Cc: Josh Triplett
Cc: Len Brown
Cc: Linus Torvalds
Cc: Mathias Nyman
Cc: Pali Rohár
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Sebastian Reichel
Cc: Ulf Hansson
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
Signed-off-by: Ingo Molnar

Thomas Gleixner
2016-07-07 16:35:09 +0800

06 Jul, 2016

2 commits

9645c1a23 block: Export blk_poll

The new NVMe over Fabrics target will make use of this from a module,
outside the core block code.

Signed-off-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig
Reviewed-by: Steve Wise
Signed-off-by: Jens Axboe

Sagi Grimberg
2016-07-06 01:30:31 +0800
1f5bd336b blk-mq: add blk_mq_alloc_request_hctx

For some protocols like NVMe over Fabrics we need to be able to send
initialization commands to a specific queue.

Based on an earlier patch from Christoph Hellwig.

Signed-off-by: Ming Lin
[hch: disallow sleeping allocation, req_op fixes]
Signed-off-by: Christoph Hellwig
Reviewed-by: Keith Busch
Signed-off-by: Jens Axboe

Ming Lin
2016-07-06 01:28:07 +0800

01 Jul, 2016

1 commit

8ba868210 block: fix use-after-free in sys_ioprio_get()

get_task_ioprio() accesses the task->io_context without holding the task
lock and thus can race with exit_io_context(), leading to a
use-after-free. The reproducer below hits this within a few seconds on
my 4-core QEMU VM:

#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	pid_t pid, child;
	long nproc, i;

	/* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */
	syscall(SYS_ioprio_set, 1, 0, 0x6000);

	nproc = sysconf(_SC_NPROCESSORS_ONLN);

	for (i = 0; i < nproc; i++) {
		/* One child per CPU forks and reaps short-lived processes forever. */
		pid = fork();
		assert(pid != -1);
		if (pid == 0) {
			for (;;) {
				pid = fork();
				assert(pid != -1);
				if (pid == 0) {
					_exit(0);
				} else {
					child = wait(NULL);
					assert(child == pid);
				}
			}
		}

		/* A second child per CPU hammers ioprio_get() on the process group. */
		pid = fork();
		assert(pid != -1);
		if (pid == 0) {
			for (;;) {
				/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
				syscall(SYS_ioprio_get, 2, 0);
			}
		}
	}

	for (;;) {
		/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
		syscall(SYS_ioprio_get, 2, 0);
	}

	return 0;
}

This gets us KASAN dumps like this:

[ 35.526914] ==================================================================
[ 35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c
[ 35.530009] Read of size 2 by task ioprio-gpf/363
[ 35.530009] =============================================================================
[ 35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected
[ 35.530009] -----------------------------------------------------------------------------

[ 35.530009] Disabling lock debugging due to kernel taint
[ 35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360
[ 35.530009] ___slab_alloc+0x55d/0x5a0
[ 35.530009] __slab_alloc.isra.20+0x2b/0x40
[ 35.530009] kmem_cache_alloc_node+0x84/0x200
[ 35.530009] create_task_io_context+0x2b/0x370
[ 35.530009] get_task_io_context+0x92/0xb0
[ 35.530009] copy_process.part.8+0x5029/0x5660
[ 35.530009] _do_fork+0x155/0x7e0
[ 35.530009] SyS_clone+0x19/0x20
[ 35.530009] do_syscall_64+0x195/0x3a0
[ 35.530009] return_from_SYSCALL_64+0x0/0x6a
[ 35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060
[ 35.530009] __slab_free+0x27b/0x3d0
[ 35.530009] kmem_cache_free+0x1fb/0x220
[ 35.530009] put_io_context+0xe7/0x120
[ 35.530009] put_io_context_active+0x238/0x380
[ 35.530009] exit_io_context+0x66/0x80
[ 35.530009] do_exit+0x158e/0x2b90
[ 35.530009] do_group_exit+0xe5/0x2b0
[ 35.530009] SyS_exit_group+0x1d/0x20
[ 35.530009] entry_SYSCALL_64_fastpath+0x1a/0xa4
[ 35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080
[ 35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001
[ 35.530009] ==================================================================

Fix it by grabbing the task lock while we poke at the io_context.

Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2016-07-01 22:39:24 +0800

28 Jun, 2016

5 commits

0b31c10c6 cfq-iosched: Charge at least 1 jiffie instead of 1 ns

Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds)
could result in charging just 1 ns to a cgroup submitting IO instead of 1
jiffie we always charged before. It is arguable what is the right amount
to charge but for now let's retain the old behavior of always charging at
least one jiffie.
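
In other words, the charge is clamped from below. A small C sketch of that clamp, assuming a tick rate of HZ=250 for the numbers; the toy_* names are illustrative and are not the scheduler's own helpers.

```c
#include <stdint.h>

#define TOY_HZ			250u		/* assumed tick rate */
#define TOY_NSEC_PER_SEC	1000000000ull
#define TOY_JIFFY_NS		(TOY_NSEC_PER_SEC / TOY_HZ)

/* Time charged to a cgroup for one slice, never less than one jiffie. */
static uint64_t toy_slice_charge_ns(uint64_t start_ns, uint64_t now_ns)
{
	uint64_t used = now_ns - start_ns;

	return used < TOY_JIFFY_NS ? TOY_JIFFY_NS : used;
}
```

With these assumptions a slice that ran for 1 ns is charged 4,000,000 ns (one jiffie at HZ=250), while a slice that ran for 10 ms is charged its real duration.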

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2016-06-28 22:21:50 +0800
149321a61 cfq-iosched: Fix regression in bonnie++ rewrite performance

Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds)
broke the condition for detecting starved sync IO in
cfq_completed_request() because rq->start_time remained in jiffies but
we compared it with nanosecond values. This manifested as a regression
in bonnie++ rewrite performance because we always ended up considering
sync IO starved and thus never increased async IO queue depth.

Since rq->start_time is used in a lot of places, converting it to ns
values would be non-trivial. So just revert the condition in CFQ to use
comparison with jiffies. This will lead to suboptimal results if
cfq_fifo_expire[1] ever comes close to 1 jiffie but so far we are
relatively far from that with the storage used with CFQ (the default
value is 128 ms).

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2016-06-28 22:21:48 +0800
93fdf1478 cfq-iosched: Convert slice_resid from u64 to s64

slice_resid can be both positive and negative. Commit 9a7f38c42c2b
(cfq-iosched: Convert from jiffies to nanoseconds) converted it from
long to u64. Although this did not introduce any functional regression
(the operations just overflow and the result was fine), it is certainly
wrong and could cause issues in the future. So convert the type to the
more appropriate s64.
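
A self-contained C sketch of why the unsigned type happened to work and where it would stop working. The toy_* helpers and the "sort key" use are assumptions for illustration: subtraction is modular, so a wrapped u64 residual yields the same key as a negative s64, but any sign test treats the wrapped value as a huge positive leftover.

```c
#include <stdbool.h>
#include <stdint.h>

/* Residual stored in an unsigned field: wraps when the slice was overrun. */
static uint64_t toy_resid_u64(uint64_t slice_end_ns, uint64_t now_ns)
{
	return slice_end_ns - now_ns;
}

/* Residual stored in a signed field: negative when the slice was overrun. */
static int64_t toy_resid_s64(uint64_t slice_end_ns, uint64_t now_ns)
{
	return (int64_t)slice_end_ns - (int64_t)now_ns;
}

/* Adjusting a key by the residual is modular, so both types agree. */
static uint64_t toy_key_u64(uint64_t key, uint64_t resid)
{
	return key - resid;
}

static uint64_t toy_key_s64(uint64_t key, int64_t resid)
{
	return key - (uint64_t)resid;
}

/* A sign test is where the two types diverge. */
static bool toy_has_leftover_u64(uint64_t resid)
{
	return resid > 0;
}

static bool toy_has_leftover_s64(int64_t resid)
{
	return resid > 0;
}
```

For a slice that ended at 1000 ns but ran until 1300 ns, both key helpers turn 5000 into 5300, yet only the signed variant reports that no time was left over.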

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2016-06-28 22:21:46 +0800
9828c2c6c block: Convert fifo_time from ulong to u64

Currently rq->fifo_time is unsigned long but CFQ stores a nanosecond
timestamp in it which would overflow on 32-bit archs. Convert it to u64
to avoid the overflow. Since rq->fifo_time is unioned with struct
call_single_data, this does not change the size of struct request in
any way.

We have to slightly fixup block/deadline-iosched.c so that comparison
happens in the right types.
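
The overflow is easy to reproduce in plain C. In this sketch uint32_t stands in for unsigned long on a 32-bit architecture, and the toy_* names are illustrative only. A 32-bit counter of nanoseconds wraps every 2^32 ns, which is roughly 4.29 seconds.

```c
#include <stdint.h>

/* Expiry time stored in a 32-bit field: the upper bits are lost. */
static uint32_t toy_fifo_time_32(uint64_t now_ns, uint64_t expire_ns)
{
	return (uint32_t)(now_ns + expire_ns);
}

/* Expiry time stored in a 64-bit field: no truncation. */
static uint64_t toy_fifo_time_64(uint64_t now_ns, uint64_t expire_ns)
{
	return now_ns + expire_ns;
}
```

Five seconds after boot with a 125 ms expiry, the 64-bit value is 5,125,000,000 ns while the 32-bit value has wrapped to 830,032,704 ns, which lies in the past relative to the current time.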

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2016-06-28 22:21:44 +0800
52c44d93c block: remove ->driverfs_dev

Now that all drivers that specify a ->driverfs_dev have been converted
to device_add_disk(), the pointer can be removed from struct gendisk.

Cc: Jens Axboe
Cc: Ross Zwisler
Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Dan Williams

Dan Williams
2016-06-28 03:26:08 +0800

16 Jun, 2016

1 commit

e63a46bef block: introduce device_add_disk()

In preparation for removing the ->driverfs_dev member of a gendisk, add
an API that takes the parent device as a parameter to add_disk(). For
now this maintains the status quo of WARN()ing on failure, but not
returning an error code.

Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Reviewed-by: Bart Van Assche
Signed-off-by: Dan Williams

Dan Williams
2016-06-16 10:53:06 +0800

14 Jun, 2016

3 commits

e1f3b9412 block/blk-cgroup.c: Declare local symbols static

Detected by sparse.

Signed-off-by: Bart Van Assche
Cc: Tejun Heo
Signed-off-by: Jens Axboe

Bart Van Assche
2016-06-14 23:09:33 +0800
1179a5a08 block/bio-integrity.c: Add #include "blk.h"

This patch avoids that building with W=1 C=2 triggers the following
warning:

block/bio-integrity.c:35:6: warning: symbol 'blk_flush_integrity' was not declared. Should it be static?

Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2016-06-14 23:09:17 +0800
aa8d15bfe block/partition-generic.c: Remove a set-but-not-used variable

A value is assigned to the variable 'info' but that value is never
used. Hence remove the variable 'info'.

Signed-off-by: Bart Van Assche
Signed-off-by: Jens Axboe

Bart Van Assche
2016-06-14 23:09:15 +0800

10 Jun, 2016

1 commit

b8269db45 cfq-iosched: temporarily boost queue priority for idle classes

If we're queuing REQ_PRIO IO and the task is running at an idle IO
class, then temporarily boost the priority. This prevents livelocks
due to priority inversion, when a low priority task is holding file
system resources while attempting to do IO.

An example of that is shown below. An ioniced idle task is holding
the directory mutex, while a normal priority task is trying to do
a directory lookup.

[478381.198925] ------------[ cut here ]------------
[478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
[478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080
[478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
[478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
[478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
[478381.206874] Call Trace:
[478381.207253] [] ? bit_wait_io_timeout+0x80/0x80
[478381.208175] [] schedule+0x37/0x90
[478381.208932] [] schedule_timeout+0x1dc/0x250
[478381.209805] [] ? __blk_run_queue+0x37/0x50
[478381.210706] [] ? ktime_get+0x45/0xb0
[478381.211489] [] io_schedule_timeout+0xa7/0x110
[478381.212402] [] ? prepare_to_wait+0x5b/0x90
[478381.213280] [] bit_wait_io+0x36/0x50
[478381.214063] [] __wait_on_bit+0x65/0x90
[478381.214961] [] ? bit_wait_io_timeout+0x80/0x80
[478381.215872] [] out_of_line_wait_on_bit+0x7c/0x90
[478381.216806] [] ? wake_atomic_t_function+0x40/0x40
[478381.217773] [] __wait_on_buffer+0x2a/0x30
[478381.218641] [] ext4_bread+0x57/0x70
[478381.219425] [] __ext4_read_dirblock+0x3c/0x380
[478381.220467] [] ext4_dx_find_entry+0x7d/0x170
[478381.221357] [] ? find_get_entry+0x1e/0xa0
[478381.222208] [] ext4_find_entry+0x484/0x510
[478381.223090] [] ext4_lookup+0x52/0x160
[478381.223882] [] lookup_real+0x1d/0x60
[478381.224675] [] __lookup_hash+0x38/0x50
[478381.225697] [] lookup_slow+0x45/0xab
[478381.226941] [] link_path_walk+0x7ae/0x820
[478381.227880] [] path_init+0xc2/0x430
[478381.228677] [] ? security_file_alloc+0x16/0x20
[478381.229776] [] path_openat+0x77/0x620
[478381.230767] [] ? page_add_file_rmap+0x2e/0x70
[478381.232019] [] do_filp_open+0x43/0xa0
[478381.233016] [] ? creds_are_invalid+0x29/0x70
[478381.234072] [] do_open_execat+0x70/0x170
[478381.235039] [] do_execveat_common.isra.36+0x1b8/0x6e0
[478381.236051] [] do_execve+0x2c/0x30
[478381.236809] [] ? getname+0x12/0x20
[478381.237564] [] SyS_execve+0x2e/0x40
[478381.238338] [] stub_execve+0x6d/0xa0
[478381.239126] ------------[ cut here ]------------
[478381.239915] ------------[ cut here ]------------
[478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
[478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080
[478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
[478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
[478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
[478381.248252] Call Trace:
[478381.248630] [] schedule+0x37/0x90
[478381.249382] [] schedule_preempt_disabled+0xe/0x10
[478381.250465] [] __mutex_lock_slowpath+0x92/0x100
[478381.251409] [] mutex_lock+0x1b/0x2f
[478381.252199] [] lookup_slow+0x36/0xab
[478381.253023] [] link_path_walk+0x7ae/0x820
[478381.253877] [] ? try_charge+0xc1/0x700
[478381.254690] [] path_init+0xc2/0x430
[478381.255525] [] ? security_file_alloc+0x16/0x20
[478381.256450] [] path_openat+0x77/0x620
[478381.257256] [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
[478381.258390] [] ? handle_mm_fault+0x13f3/0x1720
[478381.259309] [] do_filp_open+0x43/0xa0
[478381.260139] [] ? __alloc_fd+0x42/0x120
[478381.260962] [] do_sys_open+0x13c/0x230
[478381.261779] [] ? syscall_trace_enter_phase1+0x113/0x170
[478381.262851] [] SyS_open+0x22/0x30
[478381.263598] [] system_call_fastpath+0x12/0x17
[478381.264551] ------------[ cut here ]------------
[478381.265377] ------------[ cut here ]------------
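
The boost described at the top of this commit message can be sketched in C as follows. This is a toy model, not the scheduler's code: the message does not say which class the queue is raised to, so best-effort at a default level of 4 is an assumption here, and all toy_* names are hypothetical.

```c
#include <stdbool.h>

enum toy_ioprio_class { TOY_CLASS_RT = 1, TOY_CLASS_BE = 2, TOY_CLASS_IDLE = 3 };

#define TOY_IOPRIO_NORM 4	/* assumed default best-effort level */

struct toy_queue {
	enum toy_ioprio_class org_cls;	/* what the task asked for */
	int org_prio;
	enum toy_ioprio_class cls;	/* what the scheduler currently uses */
	int prio;
};

/* Called per request; req_prio models the REQ_PRIO flag on that request. */
static void toy_apply_prio(struct toy_queue *q, bool req_prio)
{
	/* Start from the task's own settings so the boost is temporary. */
	q->cls = q->org_cls;
	q->prio = q->org_prio;

	if (req_prio && q->cls == TOY_CLASS_IDLE) {
		q->cls = TOY_CLASS_BE;
		q->prio = TOY_IOPRIO_NORM;
	}
}
```

Because the effective values are recomputed from the original ones on every request, the queue drops back to the idle class as soon as it stops issuing REQ_PRIO IO.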

Signed-off-by: Jens Axboe
Reviewed-by: Jeff Moyer

Jens Axboe
2016-06-10 06:15:01 +0800

09 Jun, 2016

2 commits

52b9c330c blk-mq: actually hook up defer list when running requests

If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
over the rest of the loop body. However, dptr is assigned later in the
loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
want it for.

NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
in-tree driver, but if the code's going to be there, it might as well
work.
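
The control-flow pitfall is generic C: a continue inside a switch applies to the enclosing loop and skips everything after the switch. The sketch below is a toy model with hypothetical toy_* names, not the blk-mq dispatch loop; it only counts how many requests are issued with a defer list available.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_status { TOY_QUEUE_OK, TOY_QUEUE_ERROR };

/* Returns how many requests were issued with a defer list available. */
static int toy_run_queue(int nr_requests, bool buggy)
{
	int defer_list_storage = 0;	/* stand-in for the driver-private list */
	int *dptr = NULL;
	int issued_with_list = 0;

	for (int i = 0; i < nr_requests; i++) {
		enum toy_status ret = TOY_QUEUE_OK;	/* toy driver always succeeds */

		if (dptr)
			issued_with_list++;

		switch (ret) {
		case TOY_QUEUE_OK:
			if (buggy)
				continue;	/* jumps to the next iteration, skipping the code below */
			break;
		case TOY_QUEUE_ERROR:
			break;
		}

		/* After the first request, hand later ones the defer list. */
		if (!dptr && nr_requests - i > 1)
			dptr = &defer_list_storage;
	}
	return issued_with_list;
}
```

With four requests the buggy variant never sets dptr, so no request sees the list; with break instead of continue the last three do.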

Fixes: 74c450521dd8 ("blk-mq: add a 'list' parameter to ->queue_rq()")
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2016-06-09 23:55:15 +0800
288dab8a3 block: add a separate operation type for secure erase

Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2016-06-09 23:52:25 +0800

08 Jun, 2016

7 commits

911483258 cfq-iosched: Convert to use highres timers

Reviewed-by: Jeff Moyer
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2016-06-08 22:56:06 +0800
d2d481d04 cfq-iosched: Expose microsecond interfaces

Expose interfaces to tune time slices of CFQ IO scheduler in
microseconds.

Signed-off-by: Jeff Moyer
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jeff Moyer
2016-06-08 22:56:03 +0800
9a7f38c42 cfq-iosched: Convert from jiffies to nanoseconds

Convert all time-keeping in CFQ IO scheduler from jiffies to nanoseconds
so that we can later make the intervals more fine-grained than jiffies.
One jiffie is several milliseconds and even for today's rotating disks
that is a noticeable amount of time and thus we leave disk unnecessarily
idle.

Signed-off-by: Jeff Moyer
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jeff Moyer
2016-06-08 22:55:34 +0800
28a8f0d31 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie
Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Mike Christie
2016-06-08 03:41:38 +0800
3a5e02ced block, drivers: add REQ_OP_FLUSH operation

This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.

Signed-off-by: Mike Christie
Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Mike Christie
2016-06-08 03:41:38 +0800
6296b9604 block, drivers, fs: shrink bi_rw from long to int

We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.

Signed-off-by: Mike Christie
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Mike Christie
2016-06-08 03:41:38 +0800
d9d8c5c48 block: convert is_sync helpers to use REQ_OPs.

This patch converts the is_sync helpers to use separate variables
for the operation and flags.

Signed-off-by: Mike Christie
Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Mike Christie
2016-06-08 03:41:38 +0800