22 Nov, 2016

3 commits

  • If a ZBC device is partitioned and operations are performed on the
    partition, the zone information is rebased to the partition; the zone
    reset operation, however, is not mapped from the partition to the device
    the way other operations are.

    This leaves the API (report zones / reset zone) unbalanced in this
    regard. Checking for the zone reset op code explicitly balances the
    API.

    Signed-off-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Shaun Tancheff
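
    A minimal sketch of the kind of check described above. The placement and
    the variable names are assumptions for illustration, not the literal
    patch: a zone reset bio carries no payload, so a remap helper keyed only
    on bio_sectors() would skip it unless the op code is tested explicitly.

    /*
     * Remap partition-relative sectors to the whole device.  Zone reset
     * bios have no data, so bio_sectors() alone would miss them; check
     * the op code as well (sketch only).
     */
    if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) {
            bio->bi_iter.bi_sector += part_start_sect;  /* rebase to device */
            bio->bi_bdev = whole_bdev;                  /* illustrative names */
    }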
     
  • Since commit 87374179 ("block: add a proper block layer data direction
    encoding") we only OR the new op and flags into bi_opf in bio_set_op_attrs
    instead of clearing the old value first. I've not seen any breakage with
    the new behavior, but it seems dangerous.

    Also convert it to an inline function to make the argument passing
    safer.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
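
    A sketch of the safer helper shape described here: an inline function
    that assigns bi_opf outright instead of OR-ing into whatever was there
    before (illustrative, not necessarily the exact upstream definition):

    static inline void bio_set_op_attrs(struct bio *bio, unsigned op,
                                        unsigned op_flags)
    {
            /* overwrite, so stale op/flag bits from an earlier value
             * cannot leak into the new setting */
            bio->bi_opf = op | op_flags;
    }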
     
  • This driver is both orphaned, and not really useful anymore. Mark
    it as such, and remove it in a future kernel after a release or
    two.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Nov, 2016

10 commits

  • With compilers that follow the C99 standard (like modern versions of
    gcc and clang), "extern inline" does the opposite of what it did with
    older versions of gcc: it emits code for an externally linkable version
    of the inline function.

    "static inline" gives the intended behavior in all cases.

    Description taken from commit 6d91857d4826 ("staging, rtl8192e,
    LLVMLinux: Change extern inline to static inline").

    This also fixes the following GCC warning when building with CONFIG_PM
    disabled:

    ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes]

    Fixes: d07ab6d11477 ("block: Add blk_set_runtime_active()")
    Reviewed-by: Mika Westerberg
    Signed-off-by: Tobias Klauser
    Signed-off-by: Jens Axboe

    Tobias Klauser
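
    A small stand-alone illustration of the difference; the header and
    function below are hypothetical, not the blkdev.h stub itself:

    /* example.h -- included from two .c files that are linked together */

    /*
     * Under C99 semantics (modern gcc, clang) this form emits an external
     * definition in every translation unit that includes the header, so
     * the link fails with a duplicate symbol error:
     *
     *     extern inline int twice(int x) { return 2 * x; }
     */

    /* This form gives each translation unit its own private copy and
     * emits no external symbol, so it is safe everywhere: */
    static inline int twice(int x)
    {
            return 2 * x;
    }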
     
  • Drop duplicate header scatterlist.h from skd_main.c.

    Signed-off-by: Geliang Tang
    Signed-off-by: Jens Axboe

    Geliang Tang
     
  • This was documented in the original commit, 64f1c21e86f7, but it
    never made it into the proper location for queue sysfs files.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Similar to the simple fast path, but we now need a dio structure to
    track multiple-bio completions. It's basically a cut-down version
    of the new iomap-based direct I/O code for filesystems, but without
    all the logic to call into the filesystem for extent lookup or
    allocation, and without the complex I/O completion workqueue handler
    for AIO - instead we just use the FUA bit on the bios to ensure
    data is flushed to stable storage.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
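
    An illustrative, cut-down shape of the multi-bio completion tracking
    described above; the struct and function names are assumptions for the
    sketch, not the actual block device direct I/O code:

    struct blkdev_dio_sketch {
            struct kiocb    *iocb;      /* NULL for synchronous requests */
            atomic_t        ref;        /* one per in-flight bio, plus the submitter */
            int             error;
            bool            is_sync;
    };

    static void blkdev_dio_sketch_end_io(struct bio *bio)
    {
            struct blkdev_dio_sketch *dio = bio->bi_private;

            if (bio->bi_error)
                    dio->error = bio->bi_error;
            if (atomic_dec_and_test(&dio->ref)) {
                    /* last bio in: complete the aio here, or wake the
                     * synchronous submitter */
            }
            bio_put(bio);
    }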
     
  • Split the op setting code into a helper, use it in both places.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Just alloc the bio_vec array if we exceed the inline limit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The previous commit introduced the hybrid sleep/poll mode. Take
    that one step further, and use the completion latencies to
    automatically sleep for half the mean completion time. This is
    a good approximation.

    This changes the 'io_poll_delay' sysfs file a bit to expose the
    various options. Depending on the value, the polling code will
    behave differently:

    -1 Never enter hybrid sleep mode
    0 Use half of the completion mean for the sleep delay
    >0 Use this specific value as the sleep delay

    Signed-off-by: Jens Axboe
    Tested-By: Stephen Bates
    Reviewed-By: Stephen Bates

    Jens Axboe
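
    A compact sketch of how the three settings map to a sleep time before
    polling; the function and variable names are illustrative:

    static u64 hybrid_sleep_ns(long io_poll_delay, u64 mean_completion_ns)
    {
            if (io_poll_delay < 0)
                    return 0;                       /* -1: never sleep, pure polling */
            if (io_poll_delay == 0)
                    return mean_completion_ns / 2;  /*  0: half the observed mean */
            return io_poll_delay * NSEC_PER_USEC;   /* >0: fixed delay, in usecs */
    }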
     
  • This patch enables a hybrid polling mode. Instead of polling after IO
    submission, we can induce an artificial delay, and then poll after that.
    For example, if the IO is presumed to complete in 8 usecs from now, we
    can sleep for 4 usecs, wake up, and then do our polling. This still puts
    a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
    after the IO has completed, it'll happen before. With this hybrid
    scheme, we can achieve big latency reductions while still using the same
    (or less) amount of CPU.

    Signed-off-by: Jens Axboe
    Tested-By: Stephen Bates
    Reviewed-By: Stephen Bates

    Jens Axboe
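
    The same flow, reduced to an illustrative sequence (the helpers are
    stand-ins, not blk-mq functions): sleep through roughly half of the
    expected completion window, then busy-poll for the remainder.

    /* IO expected to complete in ~8 usecs: sleep ~4 usecs, then poll */
    sleep_ns(expected_ns / 2);          /* illustrative sleep helper */
    while (!io_completed(cookie))       /* illustrative poll check   */
            cpu_relax();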
     
  • This patch adds a small and simple fast path for small direct I/O
    requests on block devices that don't use AIO. Between the neat
    bio_iov_iter_get_pages helper that avoids allocating a page array
    for get_user_pages and the on-stack bio and biovec, this avoids memory
    allocations and atomic operations entirely in the direct I/O code
    (lower levels might still do memory allocations and will usually
    have at least some atomic operations, though).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Tested-By: Stephen Bates
    Reviewed-By: Stephen Bates

    Christoph Hellwig
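
    A trimmed sketch of the on-stack idea; error handling is omitted and
    the constants are illustrative, so this shows the technique rather than
    the exact upstream fast path:

    struct bio_vec inline_vecs[8];      /* small array, lives on the stack */
    struct bio bio;

    bio_init(&bio);
    bio.bi_max_vecs = ARRAY_SIZE(inline_vecs);
    bio.bi_io_vec = inline_vecs;

    /* pin the user pages straight into the on-stack biovec,
     * no page array allocation needed */
    ret = bio_iov_iter_get_pages(&bio, iter);

    submit_bio(&bio);
    /* ... wait for completion in place; there is no AIO path here ... */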
     
  • For writes, a completion can come in while we're still iterating
    the request and bio chain. If that happens, we're reading freed
    memory and we can crash.

    Break out after the last segment and avoid having the iterator
    read freed memory.

    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Jens Axboe
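
    The pattern the fix relies on, in illustrative form (the send helper is
    an assumption): note whether this is the final segment before handing
    it off, because the handoff may complete the request and free the chain
    the iterator is walking.

    rq_for_each_segment(bvec, req, iter) {
            bool is_last = !iter.bio->bi_next &&
                           bio_iter_last(bvec, iter.iter);

            send_segment(&bvec);        /* may trigger completion and free */
            if (is_last)
                    break;              /* don't let the iterator touch freed memory */
    }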
     

16 Nov, 2016

5 commits

  • The newly added writeback throttling code causes a harmless warning in
    some configurations:

    block/blk-wbt.c:250:1: error: ‘inline’ is not at beginning of declaration [-Werror=old-style-declaration]
    static bool inline stat_sample_valid(struct blk_rq_stat *stat)

    This makes it use the expected format for the declaration.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
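
    For reference, the expected form simply moves 'inline' before the
    return type:

    /* warned about: */
    static bool inline stat_sample_valid(struct blk_rq_stat *stat);

    /* expected form: */
    static inline bool stat_sample_valid(struct blk_rq_stat *stat);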
     
  • If CONFIG_NVM is disabled, loading the null_blk module with use_lightnvm=1
    fails, but nothing is logged and the behavior is not documented, so the
    cause of the failure is not obvious.

    Add an appropriate error message.

    Signed-off-by: Yasuaki Ishimatsu

    Massaged the text a bit.

    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu
     
  • In both the legacy and mq paths, the request count of the plug list is
    computed before allocating a request, so the number can be stale if we
    fall back to a sleeping allocation; the newly introduced wbt can sleep
    too.

    This patch handles that case by checking whether the plug list has
    become empty, and fixes the 'BUG: KASAN: stack-out-of-bounds' report
    introduced by Shaohua's patches for dispatching big requests.

    Fixes: 600271d900002 ("blk-mq: immediately dispatch big size request")
    Fixes: 50d24c34403c6 ("block: immediately dispatch big size request")
    Cc: Shaohua Li
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
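
    The shape of the recheck described above, as an illustrative fragment
    with the surrounding submission code elided:

    request_count = blk_plug_queued_count(q);   /* counted up front */

    /* ... allocate the request: this may sleep, and wbt may sleep too ... */

    /* the plug may have been flushed while we slept, so the earlier count
     * can be stale -- recheck the list itself */
    if (request_count && list_empty(&plug->mq_list))
            request_count = 0;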
     
  • Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
    specific values. No functional change.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
    specific values. No functional change.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

15 Nov, 2016

3 commits

  • ->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not
    an errno.

    Fixes: f4aa4c7bbac6 ("block: loop: convert to per-device workqueue")
    Signed-off-by: Omar Sandoval
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Omar Sandoval
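
    The contract in one illustrative line: map internal failures to the
    named constants rather than returning a raw errno.

    /* wrong: leaks an errno where blk-mq expects a BLK_MQ_RQ_QUEUE_* value */
    return -EIO;

    /* right: */
    return BLK_MQ_RQ_QUEUE_ERROR;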
     
  • Normally, sd_read_capacity sets sdp->use_16_for_rw to 1 based on the
    disk capacity so that READ16/WRITE16 are used for large drives.
    However, for a zoned disk with RC_BASIS set to 0, the capacity reported
    through READ_CAPACITY may be very small, leading to use_16_for_rw not being
    set and READ10/WRITE10 commands being used, even after the actual zoned disk
    capacity is corrected in sd_zbc_read_zones. This causes LBA offset overflow for
    accesses beyond 2TB.

    As the ZBC standard makes it mandatory for ZBC drives to support
    the READ16/WRITE16 commands anyway, make sure that use_16_for_rw is set.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal
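
    A sketch of the kind of override described above; the exact statements
    and their placement are assumptions for illustration:

    /* ZBC mandates READ16/WRITE16 support, so force them regardless of
     * the (possibly RC_BASIS=0) capacity reported by READ CAPACITY */
    sdkp->device->use_16_for_rw = 1;
    sdkp->device->use_10_for_rw = 0;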
     
  • Avoid that sparse complains about unbalanced lock actions.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

12 Nov, 2016

6 commits


11 Nov, 2016

6 commits

  • Enable throttling of buffered writeback to make it a lot smoother,
    with far less impact on other system activity.
    Background writeback should be, by definition, background
    activity. The fact that we flush huge bundles of it at the time
    means that it potentially has heavy impacts on foreground workloads,
    which isn't ideal. We can't easily limit the sizes of writes that
    we do, since that would impact file system layout in the presence
    of delayed allocation. So just throttle back buffered writeback,
    unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration from the
    CoDel network scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply increment the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
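
    A rough, self-contained sketch of the per-window evaluation described
    above; the names and the shift-based depth scaling are assumptions, not
    blk-wbt internals:

    struct wb_sketch {
            int             scale_step;     /* may go negative for write-heavy periods */
            unsigned int    depth;          /* currently allowed writeback depth */
            unsigned int    max_depth;
    };

    static void wb_window_expired(struct wb_sketch *wb, u64 min_lat_ns,
                                  u64 target_ns)
    {
            if (min_lat_ns > target_ns)
                    wb->scale_step++;       /* target missed: throttle harder */

            /* re-derive the allowed depth for the current scale; positive
             * steps shrink it, non-positive steps leave it at maximum */
            if (wb->scale_step > 0) {
                    wb->depth = wb->max_depth >> wb->scale_step;
                    if (!wb->depth)
                            wb->depth = 1;
            } else {
                    wb->depth = wb->max_depth;
            }
    }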
     
  • We can hook this up to the block layer, to help throttle buffered
    writes.

    wbt registers a few trace points that can be used to track what is
    happening in the system:

    wbt_lat: 259:0: latency 2446318
    wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
    wmean=518866, wmin=15522, wmax=5330353, wsamples=57
    wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

    This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
    dumps the current read/write stats for that window, and wbt_step shows a
    step down event where we now scale back writes. Each trace includes the
    device, 259:0 in this case.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For legacy block, we simply track them in the request queue. For
    blk-mq, we track them on a per-sw queue basis, which we can then
    sum up through the hardware queues and finally to a per device
    state.

    The stats are tracked in, roughly, 0.1s interval windows.

    Add sysfs files to display the stats.

    The feature is off by default, to avoid any extra overhead. In-kernel
    users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
    flags. We currently don't turn it on if someone just reads any of
    the stats files, that is something we could add as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
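
    An illustrative accumulator for one such ~0.1s window; this is the shape
    of the idea, not the kernel's struct blk_rq_stat:

    struct stat_window_sketch {
            u64 min_ns, max_ns, sum_ns;
            u32 nr_samples;
    };

    static void window_add_sample(struct stat_window_sketch *s, u64 lat_ns)
    {
            if (!s->nr_samples || lat_ns < s->min_ns)
                    s->min_ns = lat_ns;
            if (lat_ns > s->max_ns)
                    s->max_ns = lat_ns;
            s->sum_ns += lat_ns;
            s->nr_samples++;    /* mean = sum_ns / nr_samples at window end */
    }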
     
  • cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
    incorrectly hard coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe

    Tejun Heo
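
    The fix boils down to honoring the caller-supplied mask; the variable
    name below is illustrative:

    /* before: the @gfp argument was ignored */
    cpd = kzalloc(sizeof(*cpd), GFP_KERNEL);

    /* after: respect whatever the blkcg core passed in */
    cpd = kzalloc(sizeof(*cpd), gfp);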
     
  • We only need the status and result fields, and passing them explicitly
    makes life a lot easier for the Fibre Channel transport which doesn't
    have a full CQE for the fast path case.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This adds a shared per-request structure for all NVMe I/O. The structure
    is embedded as the first member of every NVMe transport driver's
    request-private data and allows common functionality to be implemented
    across the drivers.

    The first use is to replace the current abuse of the SCSI command
    passthrough fields in struct request for the NVMe command passthrough,
    but it will grow more fields over time to allow implementing things
    like common abort handlers.

    The passthrough commands are handled by having a pointer to the SQE
    (struct nvme_command) in struct nvme_request, and the union of the
    possible result fields, which had to be turned from an anonymous
    into a named union for that purpose. This avoids having to pass
    a reference to a full CQE around and thus makes checking the result
    a lot more lightweight.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
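
    The description corresponds roughly to a structure of this shape; the
    layout below is a sketch, and the actual commit is authoritative:

    union nvme_result_sketch {
            __le16  u16;
            __le32  u32;
            __le64  u64;
    };

    struct nvme_request_sketch {
            struct nvme_command         *cmd;       /* SQE for passthrough commands */
            union nvme_result_sketch    result;     /* completion result, no full CQE */
    };

    /* each transport embeds this first in its per-request private data, so
     * common code can reach it without knowing the transport's layout */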
     

10 Nov, 2016

2 commits

  • Building with W=1 shows a harmless warning for the skd driver:

    drivers/block/skd_main.c:2959:1: error: ‘static’ is not at beginning of declaration [-Werror=old-style-declaration]

    This changes the prototype to the expected formatting.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     
  • As reported by gcc -Wmaybe-uninitialized, the cleanup path for
    skd_acquire_msix tries to free the already allocated msi-x vectors
    in reverse order, but the index variable may not have been
    used yet:

    drivers/block/skd_main.c: In function ‘skd_acquire_irq’:
    drivers/block/skd_main.c:3890:8: error: ‘i’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

    This changes the failure path to skip releasing the interrupts
    if we have not started requesting them yet.

    Fixes: 180b0ae77d49 ("skd: use pci_alloc_irq_vectors")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Arnd Bergmann
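
    The general shape of such an unwind fix, in illustrative form (not the
    skd code itself): only release interrupts that were actually requested,
    so the loop index is always well defined when the error path reads it.

    int i, rc;

    for (i = 0; i < n_vectors; i++) {
            rc = request_one_vector(i);     /* illustrative helper */
            if (rc)
                    goto err_undo;          /* i == vectors requested so far */
    }
    return 0;

    err_undo:
            while (--i >= 0)                /* no-op if nothing was requested yet */
                    free_one_vector(i);
            return rc;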
     

09 Nov, 2016

1 commit

  • If we insert a flush request, we clear REQ_PREFLUSH and/or REQ_FUA,
    depending on flush settings. Since op_is_sync() factors those flags
    in for deciding whether this request is sync or not, we should
    set REQ_SYNC to avoid screwing up this accounting.

    This should be less fragile.

    Reported-by: Logan Gunthorpe
    Fixes: b685d3d65ac ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Nov, 2016

3 commits


07 Nov, 2016

1 commit

  • Hi Peter, hi Jens,

    I've been looking over the multi page bio vec work again recently, and
    one of the stumbling blocks is raw biovec access in the pktcdvd.

    The first issue is that it directly sets up the page and offset pointers
    in the biovec just before calling bio_add_page. As bio_add_page already
    does the setup it's trivial to just switch it to stack variables for the
    arguments.

    The second issue is the copy code in pkt_make_local_copy, which
    is effectively an open-coded version of bio_copy_data, except that it
    skips pages that are already the same in the source and destination.
    But looking at the only caller, we just set up the bio using bio_add_page
    to point exactly at the page array that pkt_make_local_copy compares,
    so the pages will always be the same and we can simply remove this function.

    Note that all this is done based on code inspection, I don't have any
    packet writing hardware myself.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig