29 Oct, 2018

16 commits

  • bio_check_eod() should check partition size not the whole disk if
    bio->bi_partno is non-zero. Do this by moving the call
    to bio_check_eod() into blk_partition_remap().

    Based on an earlier patch from Jiufei Xue.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reported-by: Jiufei Xue
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 52c5e62d4c4beecddc6e1b8045ce1d695fca1ba7)

    Christoph Hellwig
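    The check described above can be sketched as a toy Python model (not the
    kernel C code; all function names and the partition geometry below are
    illustrative): bounds-check the bio against the partition size first,
    then remap its sector onto the whole disk.

```python
# Toy model of the fix: check end-of-device against the *partition*
# size (not the whole-disk capacity) before remapping the bio.

def bio_check_eod(sector, nr_sectors, maxsector):
    """Reject I/O that runs past `maxsector` of the target device."""
    return sector + nr_sectors <= maxsector

def blk_partition_remap(sector, nr_sectors, part_start, part_size):
    """Check against the partition size first, then remap onto the disk."""
    if not bio_check_eod(sector, nr_sectors, part_size):
        return None                  # I/O beyond the end of the partition
    return sector + part_start       # absolute sector on the whole disk
```

    With the old ordering, a bio that fit inside the disk but ran past the
    end of its partition would pass the check; doing the check inside the
    remap closes that hole.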
     
  • Similar to blkdev_write_iter(), return -EPERM if the partition is
    read-only. This covers ioctl(), fallocate() and most in-kernel users
    but isn't meant to be exhaustive -- everything else will be caught in
    generic_make_request_checks(), fail with -EIO and can be fixed later.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit a13553c777375009584741e7d9982e775c4b0744)

    Ilya Dryomov
     
  • Regular block device writes go through blkdev_write_iter(), which does
    bdev_read_only(), while zeroout/discard/etc requests are never checked,
    both userspace- and kernel-triggered. Add a generic catch-all check to
    generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
    set_disk_ro(), which is used by quite a few drivers for things like
    snapshots, read-only backing files/images, etc.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit 721c7fc701c71f693307d274d2b346a1ecd4a534)

    Ilya Dryomov
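    The two read-only patches above can be modeled in a few lines of toy
    Python (illustrative names and error values, not the kernel API): the
    syscall-level paths fail early with -EPERM, while the catch-all in
    generic_make_request_checks() fails anything that slips through with
    -EIO.

```python
# Toy model of the two layers of read-only enforcement described above.

EPERM, EIO = 1, 5

def blkdev_level_check(op_is_write, part_ro):
    """Early check in ioctl()/fallocate()-style entry points."""
    return -EPERM if op_is_write and part_ro else 0

def generic_make_request_checks(op_is_write, part_ro):
    """Catch-all for everything else, including kernel-internal writers."""
    return -EIO if op_is_write and part_ro else 0
```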
     
  • Export these two interfaces for cgroup-v1.

    Acked-by: Tejun Heo
    Signed-off-by: weiping zhang
    Signed-off-by: Jens Axboe
    (cherry picked from commit 17534c6f2c065ad8e34ff6f013e5afaa90428512)

    weiping zhang
     
  • The __blk_mq_register_dev(), blk_mq_unregister_dev(),
    elv_register_queue() and elv_unregister_queue() calls need to be
    protected with sysfs_lock, but the other code in these functions does
    not. Hence protect only this code with sysfs_lock. This patch fixes a
    locking inversion issue in blk_unregister_queue() and also in an
    error path of blk_register_queue(): it is not allowed to hold
    sysfs_lock around the kobject_del(&q->kobj) call.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 2c2086afc2b8b974fac32cb028e73dc27bfae442)

    Bart Van Assche
     
  • For as long as I can remember, DM has forced the block layer to allow
    the allocation and initialization of the request_queue to be distinct
    operations. The reason is that block/genhd.c:add_disk() requires
    that the request_queue (and associated bdi) be tied to the gendisk
    before add_disk() is called -- because add_disk() also deals with
    exposing the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known DM cannot know what type of request_queue
    it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
    that drivers may use to add a disk without also calling
    blk_register_queue(). Driver must call blk_register_queue() once its
    request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if driver used add_disk_no_queue_reg()
    but driver encounters an error and must del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for master/slave relationships DM must establish
    with subordinate devices referenced in DM tables that get loaded. Once
    all "slave" devices for a DM device are known its request_queue can be
    properly initialized and then advertised via sysfs -- important
    improvement being that no request_queue resource initialization
    performed by blk_register_queue() is missed for DM devices anymore.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit fa70d2e2c4a0a54ced98260c6a176cc94c876d27)

    Mike Snitzer
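    The registration flow above can be sketched as a toy Python state
    machine (a simplified illustration, not the kernel implementation;
    names mirror the changelog): the driver adds the disk without
    registering the queue, registers it once limits are known, and the
    early return protects the error path.

```python
# Toy state machine for the add_disk_no_queue_reg() flow described above.

class Queue:
    def __init__(self):
        self.registered = False      # models QUEUE_FLAG_REGISTERED

def add_disk_no_queue_reg(queue):
    """Add the gendisk without calling blk_register_queue()."""
    pass                             # disk is visible, queue is not

def blk_register_queue(queue):
    """Called by the driver once the request_queue is fully initialized."""
    queue.registered = True

def blk_unregister_queue(queue):
    if not queue.registered:         # driver erred out before registering
        return "early-return"
    queue.registered = False
    return "unregistered"
```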
     
  • The original commit e9a823fb34a8b (block: fix warning when I/O elevator
    is changed as request_queue is being removed) is pretty conflated.
    "conflated" because the resource being protected by q->sysfs_lock isn't
    the queue_flags (it is the 'queue' kobj).

    q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
    from racing with blk_unregister_queue():
    1) By holding q->sysfs_lock first, __elevator_change() can complete
    before a racing blk_unregister_queue().
    2) Conversely, __elevator_change() tests for QUEUE_FLAG_REGISTERED
    because, if elv_iosched_store() loses the race with
    blk_unregister_queue(), it needs a way to know the 'queue' kobj isn't
    there.

    Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
    held until after the 'queue' kobj is removed.

    To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
    rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

    Also, blk_unregister_queue() should use q->queue_lock to protect against
    any concurrent writes to q->queue_flags -- even though chances are the
    queue is being cleaned up so no concurrent writes are likely.

    Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit 667257e8b2988c0183ba23e2bcd6900e87961606)

    Mike Snitzer
     
  • device_add_disk() will only call bdi_register_owner() if
    !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
    bdi_unregister() if !GENHD_FL_HIDDEN.

    Found with code inspection. bdi_unregister() won't do any harm if
    bdi_register_owner() wasn't used, but it is best to avoid the
    unnecessary call to bdi_unregister().

    Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit bc8d062c36e3525e81ea8237ff0ab3264c2317b6)

    Mike Snitzer
     
  • So that we can also poll non-blk-mq queues. Mostly needed for
    the NVMe multipath code, but could also be useful elsewhere.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ea435e1b9392a33deceaea2a16ebaa3397bead93)

    Christoph Hellwig
     
  • With this flag a driver can create a gendisk that can be used for I/O
    submission inside the kernel, but which is not registered as user
    facing block device. This will be useful for the NVMe multipath
    implementation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 8ddcd653257c18a669fcb75ee42c37054908e0d6)

    Christoph Hellwig
     
  • The hidden gendisks introduced in the next patch need to keep the dev
    field in their struct device empty so that udev won't try to create
    block device nodes for them. To support that rewrite disk_devt to
    look at the major and first_minor fields in the gendisk itself instead
    of looking into the struct device.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit 517bf3c306bad4fe0da631f90ae2ec40924dee2b)

    Christoph Hellwig
     
  • This helper allows stealing the uncompleted bios from a request so
    that they can be reissued on another path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef71de8b15d891b27b8c983a9a8972b11cb4576a)

    Christoph Hellwig
     
  • This helper allows reinserting a bio into a new queue without much
    overhead, but requires all queue limits to be the same for the upper
    and lower queues, and it does not provide any recursion preventions.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Javier González
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit f421e1d9ade4e1b88183e54425cf50e390d16a7f)

    Christoph Hellwig
     
  • This patch does not change any functionality.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 14a23498ba97683c6790b1bcd8b2cdfe9ad99797)

    Bart Van Assche
     
  • These two functions are only called from inside the block layer so
    unexport them.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 83d016ac86428dbca8a62d3e4fdc29e3ea39e535)

    Bart Van Assche
     
  • errata:
    When a read command returns less data than specified in the PRDs (for
    example, there are two PRDs for this command, but the device returns a
    number of bytes which is less than in the first PRD), the second PRD of
    this command is not read out of the PRD FIFO, causing the next command
    to use this PRD erroneously.

    workaround:
    - force sg_tablesize = 1
    - modify the sg_io function in block/scsi_ioctl.c to use a 64k buffer
    allocated with dma_alloc_coherent during the probe in ahci_imx
    - to fix the SCSI/SATA hang when a CD-ROM and HDD are accessed
    simultaneously after the workaround is applied, do not go to sleep in
    scsi_eh_handler when there is a host failure.

    Signed-off-by: Richard Zhu

    Richard Zhu
     

13 Oct, 2018

1 commit

  • commit 587562d0c7cd6861f4f90a2eb811cccb1a376f5f upstream.

    trace_block_unplug() takes true for explicit unplugs and false for
    implicit unplugs. schedule() unplugs are implicit and should be
    reported as timer unplugs. While correct in the legacy code, this has
    been inverted in blk-mq since 4.11.

    Cc: stable@vger.kernel.org
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

26 Sep, 2018

3 commits

  • [ Upstream commit 1311326cf4755c7ffefd20f576144ecf46d9906b ]

    SCSI probing may synchronously create and destroy a lot of request_queues
    for non-existent devices. Any synchronize_rcu() in the queue creation
    or destruction path may introduce long latency during booting; see the
    detailed description in the comment of blk_register_queue().

    This patch removes one synchronize_rcu() inside blk_cleanup_queue()
    for this case. Commit c2856ae2f315d75 ("blk-mq: quiesce queue before
    freeing queue") needs synchronize_rcu() for implementing
    blk_mq_quiesce_queue(), but when the queue isn't initialized it isn't
    necessary to do that, since only pass-through requests are involved
    and there is no such issue in scsi_execute() at all.

    Without this patch and the previous one, it may take 20+ seconds for
    virtio-scsi to complete disk probing. With the two patches, the time
    drops to less than 100ms.

    Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
    Reported-by: Andrew Jones
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit b04f50ab8a74129b3041a2836c33c916be3c6667 ]

    Only attempt to merge a bio if ctx->rq_list isn't empty, because:

    1) for a high-performance SSD, dispatch usually succeeds, so there may
    be nothing left in ctx->rq_list; by not trying to merge over an empty
    sw queue we save one acquisition of ctx->lock

    2) we can't expect good merge performance on the per-cpu sw queue
    anyway, and missing one merge on the sw queue is no big deal since
    tasks can be scheduled from one CPU to another.

    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Tested-by: Kashyap Desai
    Reported-by: Kashyap Desai
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
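    The optimization reads naturally as a toy Python model (illustrative
    class and counter, not the kernel code): peek at the software queue
    without its lock, and only pay for the lock when there is something to
    merge against.

```python
# Toy model of the optimization: skip the merge attempt (and the lock
# acquisition) when the per-cpu sw queue is empty.

class Ctx:
    def __init__(self):
        self.rq_list = []            # pending requests in the sw queue
        self.lock_acquisitions = 0   # counts ctx->lock round trips

def blk_mq_attempt_merge(ctx, bio):
    if not ctx.rq_list:              # lockless emptiness check
        return False
    ctx.lock_acquisitions += 1       # ctx->lock taken only when worth it
    # ... scan ctx.rq_list for a mergeable request (elided) ...
    return True
```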
     
  • [ Upstream commit 42c9cdfe1e11e083dceb0f0c4977b758cf7403b9 ]

    Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
    that blk_stack_limits() can stack up this limit for stacked devices.

    before:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    1

    after:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    256

    Fixes: 1e739730c5b9e ("block: optionally merge discontiguous discard bios into a single request")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
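    The stacking rule can be sketched as a toy Python model (dict-based,
    illustrative, not the kernel structs): limits stack via min(), so the
    stacking device must start from the most permissive value or it will
    pin every stacked limit at 1.

```python
# Toy model of limit stacking for the max_discard_segments fix above.

USHRT_MAX = 0xFFFF

def blk_set_stacking_limits(limits):
    limits["max_discard_segments"] = USHRT_MAX   # the fix; was 1 before

def blk_stack_limits(top, bottom):
    """Stacked limit is the min over all underlying devices."""
    top["max_discard_segments"] = min(top["max_discard_segments"],
                                      bottom["max_discard_segments"])
```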
     

20 Sep, 2018

4 commits

  • [ Upstream commit 14cb2c8a6c5dae57ee3e2da10fa3db2b9087e39e ]

    The if-block that sets a successful return value in aix_partition()
    uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.

    For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
    initialized, but are used anyway if alloc_pvd() succeeds after it.

    So, make the alloc_pvd() call conditional on their initialization.

    This has been hit when attaching an apparently corrupted/stressed
    AIX LUN, leading the kernel to pr_warn() invalid data and hang.

    [...] partition (null) (11 pp's found) is not contiguous
    [...] partition (null) (2 pp's found) is not contiguous
    [...] partition (null) (3 pp's found) is not contiguous
    [...] partition (null) (64 pp's found) is not contiguous

    Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     
  • [ Upstream commit d43fdae7bac2def8c4314b5a49822cb7f08a45f1 ]

    Even if properly initialized, the lvname array (i.e., strings)
    is read from disk, and might contain corrupt data (e.g., lack
    the null terminating character for strings).

    So, make sure the partition name string used in pr_warn() has
    the null terminating character.

    Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
    Suggested-by: Daniel J. Axtens
    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
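    The defensive pattern can be modeled in toy Python (the field width and
    helper name below are illustrative, not the AIX partition code): treat
    the on-disk bytes as untrusted, bound the read at the field width, and
    stop at the first NUL if one exists.

```python
# Toy model: an on-disk name field may lack a terminating NUL, so bound
# the string at the field width before printing it.

LVNAME_LEN = 64                          # illustrative field width

def safe_lvname(raw: bytes) -> str:
    field = raw[:LVNAME_LEN]             # never read past the field
    nul = field.find(b"\x00")
    s = field if nul < 0 else field[:nul]
    return s.decode("ascii", "replace")
```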
     
  • [ Upstream commit 75d6e175fc511e95ae3eb8f708680133bc211ed3 ]

  • The passed 'nr' from userspace represents the total depth, while
    inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth
    and 'nr_reserved_tags' stores the reserved part.

    There are two issues in blk_mq_tag_update_depth() now:

    1) for growing tags, we should have used the passed 'nr' and kept
    the number of reserved tags unchanged.

    2) the passed 'nr' should have been used for checking against
    'tags->nr_tags', instead of number of the normal part.

    This patch fixes the above two cases, and avoids kernel crash caused
    by wrong resizing sbitmap queue.

    Cc: "Ewan D. Milne"
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Tested-by: Marco Patalano
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
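    The corrected resize logic can be sketched as a toy Python model
    (illustrative class, not 'struct blk_mq_tags' itself): 'nr' is the
    total depth, growth is detected by comparing totals, and the reserved
    count never changes.

```python
# Toy model of the two fixes in blk_mq_tag_update_depth() above.

class Tags:
    def __init__(self, nr_tags, nr_reserved_tags):
        self.nr_tags = nr_tags                    # total depth
        self.nr_reserved_tags = nr_reserved_tags  # reserved part

def blk_mq_tag_update_depth(tags, nr):
    if nr > tags.nr_tags:                         # fix 2: compare totals
        # fix 1: grow to the passed total; reserved count unchanged
        return Tags(nr, tags.nr_reserved_tags)
    tags.nr_tags = nr                             # shrink in place
    return tags
```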
     
  • commit d5274b3cd6a814ccb2f56d81ee87cbbf51bd4cf7 upstream.

    Fix trivial use-after-free. This could be last reference to bfqg.

    Fixes: 8f9bebc33dd7 ("block, bfq: access and cache blkg data only when safe")
    Acked-by: Paolo Valente
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

15 Sep, 2018

2 commits

  • [ Upstream commit f7ecb1b109da1006a08d5675debe60990e824432 ]

    This patch does not change any functionality but avoids that gcc
    reports the following warnings when building with W=1:

    block/cfq-iosched.c: In function 'cfq_back_seek_max_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4756:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4759:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4760:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_low_latency_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4765:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4782:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4783:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
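    Why gcc flags the comparison can be shown with a toy Python model of
    C's 32-bit unsigned conversion (illustrative; the kernel code is C):
    a negative value wraps to a huge positive one, so '__data < 0' on an
    unsigned type can never be true.

```python
# Toy model of why "comparison of unsigned expression < 0 is always
# false": conversion to unsigned wraps negatives around.

UINT_MAX = 0xFFFFFFFF

def as_unsigned(x):
    """Models C's conversion of x to a 32-bit unsigned integer."""
    return x & UINT_MAX

# -1 wraps to UINT_MAX, so "(unsigned)__data < 0" can never fire.
```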
     
  • [ Upstream commit d6c02a9beb67f13d5f14f23e72fa9981e8b84477 ]

    In commit ed996a52c868 ("block: simplify and cleanup bvec pool
    handling"), the value of the slab index is incremented by one in
    bvec_alloc() after the allocation is done to indicate an index value of
    0 does not need to be later freed.

    bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
    value. Decrement idx before performing the lookup.

    Fixes: ed996a52c868 ("block: simplify and cleanup bvec pool handling")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
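    The off-by-one can be sketched as a toy Python model (the slab-size
    table and helper names are illustrative): the allocator returns the
    slab index biased by one, so the lookup must subtract one again.

```python
# Toy model of the bvec_nr_vecs() fix: the stored index is slab + 1
# (so 0 can mean "nothing to free"); decrement before the lookup.

BVEC_SLABS = [1, 4, 16, 64, 128, 256]    # nr_vecs per slab, illustrative

def bvec_alloc(nr):
    for i, nvecs in enumerate(BVEC_SLABS):
        if nr <= nvecs:
            return i + 1                 # stored index is biased by one
    raise ValueError("too many vecs")

def bvec_nr_vecs(idx):
    return BVEC_SLABS[idx - 1]           # the fix: decrement idx first
```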
     

10 Sep, 2018

3 commits

  • commit fc8ebd01deeb12728c83381f6ec923e4a192ffd3 upstream.

    The value that struct cftype .write() method returns is then directly
    returned to userspace as the value returned by write() syscall, so it
    should be the number of bytes actually written (or consumed) and not zero.

    Returning zero from write() syscall makes programs like /bin/echo or bash
    spin.

    Signed-off-by: Maciej S. Szmigiero
    Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Maciej S. Szmigiero
     
  • commit b233f127042dba991229e3882c6217c80492f6ef upstream.

    Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
    disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
    that approach can't take effect since user space can still switch
    it on via 'echo auto > /sys/block/sdN/device/power/control'.

    This patch really disables runtime PM for blk-mq, via
    pm_runtime_disable(), and fixes all kinds of PM-related kernel crashes.

    Cc: Tomas Janousek
    Cc: Przemek Socha
    Cc: Alan Stern
    Cc:
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Tested-by: Patrick Steinhardt
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 54648cf1ec2d7f4b6a71767799c45676a138ca24 upstream.

    We found a memory use-after-free issue in __blk_drain_queue()
    on kernel 4.14. After reading the latest 4.18-rc6 we believe it
    has the same problem.

    Memory is allocated for q->fq in blk_init_allocated_queue().
    If the elevator init function returns an error, the failure path
    frees q->fq.

    __blk_drain_queue() then uses that memory after q->fq has been
    freed, which leads to unpredictable behavior.

    The patch sets q->fq to NULL in the failure path of
    blk_init_allocated_queue().

    Fixes: 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
    Cc:
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: xiao jin
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    xiao jin
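    The fix follows the classic clear-the-pointer-on-free pattern, sketched
    here as a toy Python model (illustrative names; the kernel code is C):
    the failure path both frees and clears q->fq, and the drain path guards
    on the cleared pointer instead of touching freed memory.

```python
# Toy model of the fix: clear q.fq on the failure path so a later drain
# cannot touch freed memory.

class Queue:
    def __init__(self):
        self.fq = object()               # flush queue allocated at init

def blk_init_allocated_queue(q, elevator_init_ok):
    if not elevator_init_ok:
        q.fq = None                      # the fix: free AND clear pointer
        return -1
    return 0

def blk_drain_queue(q):
    if q.fq is None:                     # guard instead of use-after-free
        return "skipped"
    return "drained"
```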
     

24 Aug, 2018

1 commit

  • [ Upstream commit ce042c183bcb94eb2919e8036473a1fc203420f9 ]

    resp->num is the number of tokens in resp->tok[]. It gets set in
    response_parse(). So if n == resp->num then we're reading beyond the
    end of the data.

    Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
    Reviewed-by: Scott Bauer
    Tested-by: Scott Bauer
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
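    The off-by-one bound is easy to model in toy Python (illustrative class
    and accessor, not the sed-opal code): with resp.num counting valid
    tokens, the valid indices are 0 .. num-1, so n == num is already past
    the end.

```python
# Toy model of the bounds check: resp.num counts valid tokens, so any
# index n with n >= resp.num is past the end of resp.tok.

class ParsedResp:
    def __init__(self, toks):
        self.tok = toks
        self.num = len(toks)

def response_get_token(resp, n):
    if n >= resp.num:                    # the fix: reject n == resp.num too
        return None
    return resp.tok[n]
```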
     

18 Aug, 2018

1 commit

  • commit 4baa8bb13f41307f3eb62fe91f93a1a798ebef53 upstream.

    This commit fixes a bug that causes bfq to fail to guarantee a high
    responsiveness on some drives, if there is heavy random read+write I/O
    in the background. More precisely, such a failure allowed this bug to
    be found [1], but the bug may well cause other yet unreported
    anomalies.

    BFQ raises the weight of the bfq_queues associated with soft real-time
    applications, to privilege the I/O, and thus reduce latency, for these
    applications. This mechanism is named soft-real-time weight raising in
    BFQ. A soft real-time period may happen to be nested into an
    interactive weight raising period, i.e., it may happen that, when a
    bfq_queue switches to a soft real-time weight-raised state, the
    bfq_queue is already being weight-raised because deemed interactive
    too. In this case, BFQ saves, in a special variable
    wr_start_at_switch_to_srt, the time instant when the interactive
    weight-raising period started for the bfq_queue, i.e., the time
    instant when BFQ started to deem the bfq_queue interactive. This value
    is then used to check whether the interactive weight-raising period
    would still be in progress when the soft real-time weight-raising
    period ends. If so, interactive weight raising is restored for the
    bfq_queue. This restore is useful, in particular, because it prevents
    bfq_queues from losing their interactive weight raising prematurely,
    as a consequence of spurious, short-lived soft real-time
    weight-raising periods caused by wrong detections as soft real-time.

    If, instead, a bfq_queue switches to soft-real-time weight raising
    while it *is not* already in an interactive weight-raising period,
    then the variable wr_start_at_switch_to_srt has no meaning during the
    following soft real-time weight-raising period. Unfortunately the
    handling of this case is wrong in BFQ: not only is the variable not
    flagged somehow as meaningless, but it is also set to the time when
    the switch to soft real-time weight-raising occurs. This may cause an
    interactive weight-raising period to be considered mistakenly as still
    in progress, and thus a spurious interactive weight-raising period to
    start for the bfq_queue, at the end of the soft-real-time
    weight-raising period. In particular the spurious interactive
    weight-raising period will be considered as still in progress, if the
    soft-real-time weight-raising period does not last very long. The
    bfq_queue will then be wrongly privileged and, if I/O bound, will
    unjustly steal bandwidth from truly interactive or soft real-time
    bfq_queues, harming responsiveness and low latency.

    This commit fixes this issue by just setting wr_start_at_switch_to_srt
    to minus infinity (farthest past time instant according to jiffies
    macros): when the soft-real-time weight-raising period ends, certainly
    no interactive weight-raising period will be considered as still in
    progress.

    [1] Background I/O Type: Random - Background I/O mix: Reads and writes
    - Application to start: LibreOffice Writer in
    http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Tested-by: Oleksandr Natalenko
    Tested-by: Lee Tibbert
    Tested-by: Mirko Montanari
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
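    The core of the fix can be sketched as a toy Python model (illustrative
    names and time units; the kernel uses jiffies arithmetic, not floats):
    recording "minus infinity" as the start instant guarantees the
    still-in-progress check can never pass.

```python
# Toy model of the BFQ fix: a soft-rt switch outside any interactive
# period records minus infinity, so no stale interactive period can
# ever be seen as "still in progress".

MINUS_INF = float("-inf")   # models the farthest past jiffies instant

def interactive_wr_still_in_progress(wr_start_at_switch_to_srt,
                                     wr_duration, now):
    return now < wr_start_at_switch_to_srt + wr_duration
```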
     

03 Aug, 2018

3 commits

  • commit 5151842b9d8732d4cbfa8400b40bff894f501b2f upstream.

    After the bio has been updated to represent the remaining sectors, reset
    bi_done so bio_rewind_iter() does not rewind further than it should.

    This resolves a bio_integrity_process() failure on reads where the
    original request was split.

    Fixes: 63573e359d05 ("bio-integrity: Restore original iterator on verify stage")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     
  • commit b403ea2404889e1227812fa9657667a1deb9c694 upstream.

    If the last page of the bio is not "full", the length of the last
    vector slot needs to be corrected. This slot has the index
    (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
    array, which is shifted by the value of bio->bi_vcnt at function
    invocation, the correct index is (nr_pages - 1).

    v2: improved readability following suggestions from Ming Lei.
    v3: followed a formatting suggestion from Christoph Hellwig.

    Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Martin Wilck
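    The index arithmetic can be checked with a toy Python model (PAGE_SIZE
    and the helper are illustrative, not the kernel code): the last slot
    holds the tail of the buffer, and its index differs between the local
    bv[] helper array and bio->bi_io_vec, which is shifted by the bi_vcnt
    at function entry.

```python
# Toy model of the last-vector fix in bio_iov_iter_get_pages().

PAGE_SIZE = 4096

def last_vec(total_len, bi_vcnt_at_entry):
    nr_pages = -(-total_len // PAGE_SIZE)          # ceiling division
    last_len = total_len - (nr_pages - 1) * PAGE_SIZE
    return {
        "bv_index": nr_pages - 1,                  # in the shifted bv[]
        "io_vec_index": bi_vcnt_at_entry + nr_pages - 1,  # in bi_io_vec
        "len": last_len,
    }
```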
     
  • [ Upstream commit a12bffebc0c9d6a5851f062aaea3aa7c4adc6042 ]

  • In bfq_requests_merged(), there is a deadlock because
    bfqq->bfqd->lock is held by the calling function, but the code of
    this function tries to grab the lock again.

    This deadlock is currently hidden by another bug (fixed by next commit
    for this source file), which causes the body of bfq_requests_merged()
    to be never executed.

    This commit removes the deadlock by removing the lock/unlock pair.

    Signed-off-by: Filippo Muzzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filippo Muzzini
     

22 Jul, 2018

1 commit

  • commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 upstream.

    When blk_queue_enter() waits for a queue to unfreeze, or unset the
    PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

    The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
    ("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
    device is resumed asynchronously, i.e. after un-freezing userspace tasks.

    So that commit exposed the bug as a regression in v4.15. A mysterious
    SIGBUS (or -EIO) sometimes happened during the time the device was being
    resumed. Most frequently, there was no kernel log message, and we saw Xorg
    or Xwayland killed by SIGBUS.[1]

    [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

    Without this fix, I get an IO error in this test:

    # dd if=/dev/sda of=/dev/null iflag=direct & \
    while killall -SIGUSR1 dd; do sleep 0.1; done & \
    echo mem > /sys/power/state ; \
    sleep 5; killall dd # stop after 5 seconds

    The interruptible wait was added to blk_queue_enter in
    commit 3ef28e83ab15 ("block: generic request_queue reference counting").
    Before then, the interruptible wait was only in blk-mq, but I don't think
    it could ever have been correct.

    Reviewed-by: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Alan Jenkins
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Alan Jenkins
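    The behavioral change can be reduced to a toy Python model
    (illustrative names and error values; the kernel code waits on real
    wait queues): with an interruptible wait, a pending signal during
    resume aborts queue entry and surfaces as -EIO/SIGBUS, while the
    uninterruptible wait simply outlasts the freeze.

```python
# Toy model of the blk_queue_enter() change: stop letting signals
# interrupt the wait for the queue to unfreeze.

EIO = 5

def blk_queue_enter(signal_pending, interruptible):
    if interruptible and signal_pending:
        return -EIO          # old behavior: spurious I/O error / SIGBUS
    return 0                 # new behavior: sleep until the queue thaws
```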
     

11 Jul, 2018

2 commits

  • commit d5ce4c31d6df518dd8f63bbae20d7423c5018a6c upstream.

    sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
    permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
    is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE
    SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO:

    $ fallocate -zn -l 1k /dev/sdg
    fallocate: fallocate failed: Remote I/O error
    $ fallocate -zn -l 1k /dev/sdg # OK
    $ fallocate -zn -l 1k /dev/sdg # OK

    The following calls succeed because sd_done() sets ->no_write_same in
    response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing
    __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios.

    This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing
    and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is
    specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if
    sd_done() has just set ->no_write_same thus indicating lack of offload
    support.

    Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing")
    Cc: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Cc: Janne Huttunen
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     
  • commit 425a4dba7953e35ffd096771973add6d2f40d2ed upstream.

    blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Cc: Janne Huttunen
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

03 Jul, 2018

1 commit

  • commit 297ba57dcdec7ea37e702bcf1a577ac32a034e21 upstream.

    This patch avoids that removing a path controlled by the dm-mpath driver
    while mkfs is running triggers the following kernel bug:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:

    dm_softirq_done+0x326/0x3d0 [dm_mod]
    blk_done_softirq+0x19b/0x1e0
    __do_softirq+0x128/0x60d
    irq_exit+0x100/0x110
    smp_call_function_single_interrupt+0x90/0x330
    call_function_single_interrupt+0xf/0x20

    Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc:
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

26 Jun, 2018

1 commit

  • commit a347c7ad8edf4c5685154f3fdc3c12fc1db800ba upstream.

    It is not allowed to reinit the q->tag_set_list list entry while an RCU
    grace period has not yet completed; otherwise the following soft lockup
    in blk_mq_sched_restart() happens:

    [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
    [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
    [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
    [ 1064.256510] Call Trace:
    [ 1064.256664]
    [ 1064.256824] blk_mq_free_request+0xea/0x100
    [ 1064.256987] msg_io_conf+0x59/0xd0 [ibnbd_client]
    [ 1064.257175] complete_rdma_req+0xf2/0x230 [ibtrs_client]
    [ 1064.257340] ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
    [ 1064.257502] ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
    [ 1064.257669] ib_create_qp+0x321/0x380 [ib_core]
    [ 1064.257841] ib_process_cq_direct+0xbd/0x120 [ib_core]
    [ 1064.258007] irq_poll_softirq+0xb7/0xe0
    [ 1064.258165] __do_softirq+0x106/0x2a2
    [ 1064.258328] irq_exit+0x92/0xa0
    [ 1064.258509] do_IRQ+0x4a/0xd0
    [ 1064.258660] common_interrupt+0x7a/0x7a
    [ 1064.258818]

    Meanwhile another context frees other queue but with the same set of
    shared tags:

    [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
    [ 1288.201833] bash D 0 5910 5820 0x00000000
    [ 1288.202016] Call Trace:
    [ 1288.202315] schedule+0x32/0x80
    [ 1288.202462] schedule_timeout+0x1e5/0x380
    [ 1288.203838] wait_for_completion+0xb0/0x120
    [ 1288.204137] __wait_rcu_gp+0x125/0x160
    [ 1288.204287] synchronize_sched+0x6e/0x80
    [ 1288.204770] blk_mq_free_queue+0x74/0xe0
    [ 1288.204922] blk_cleanup_queue+0xc7/0x110
    [ 1288.205073] ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
    [ 1288.205389] ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
    [ 1288.205548] kernfs_fop_write+0x109/0x180
    [ 1288.206328] vfs_write+0xb3/0x1a0
    [ 1288.206476] SyS_write+0x52/0xc0
    [ 1288.206624] do_syscall_64+0x68/0x1d0
    [ 1288.206774] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    What happened is the following:

    1. There are several MQ queues with shared tags.
    2. One queue is about to be freed, and the task is now in
    blk_mq_del_queue_tag_set().
    3. Another CPU is in blk_mq_sched_restart(), looping over all queues in
    the tag list in order to find an hctx to restart.

    Because the linked list entry was modified in blk_mq_del_queue_tag_set()
    without properly waiting for a grace period, blk_mq_sched_restart()
    never ends, spinning in list_for_each_entry_rcu_rr(), resulting in a
    soft lockup.

    The fix is simple: reinit the list entry after an RCU grace period has
    elapsed.
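    The required ordering can be sketched with a minimal userspace model of
    the kernel's list.h. The list helpers below are hypothetical userspace
    reimplementations, and synchronize_rcu_stub() merely stands in for
    synchronize_rcu(), which in the kernel blocks until every pre-existing
    RCU reader has finished; this is a sketch of the ordering, not the
    actual blk-mq code.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Minimal doubly-linked list modeled after the kernel's list.h. */
    struct list_head { struct list_head *next, *prev; };

    static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

    static void list_add(struct list_head *n, struct list_head *h)
    {
        n->next = h->next; n->prev = h;
        h->next->prev = n; h->next = n;
    }

    static void list_del_rcu(struct list_head *n)
    {
        /* Unlink, but deliberately leave n->next intact so concurrent
         * RCU readers currently standing on n can still advance. */
        n->next->prev = n->prev;
        n->prev->next = n->next;
    }

    static void synchronize_rcu_stub(void)
    {
        /* Stand-in for synchronize_rcu(): wait out all readers. */
    }

    int main(void)
    {
        struct list_head set_list, q1, q2;
        INIT_LIST_HEAD(&set_list);
        list_add(&q1, &set_list);
        list_add(&q2, &set_list);   /* list: set_list -> q2 -> q1 */

        /* Buggy order would be INIT_LIST_HEAD(&q2) immediately after
         * list_del_rcu(&q2): q2 then points to itself, and a reader
         * iterating through q2 spins forever. Correct order (the fix): */
        list_del_rcu(&q2);
        assert(q2.next == &q1);     /* in-flight readers can still escape */
        synchronize_rcu_stub();     /* grace period elapses */
        INIT_LIST_HEAD(&q2);        /* only now safe to reinit */
        assert(q2.next == &q2);
        return 0;
    }
    ```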

    Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
    Cc: stable@vger.kernel.org
    Cc: Sagi Grimberg
    Cc: linux-block@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Roman Pen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Roman Pen
     

21 Jun, 2018

1 commit

  • [ Upstream commit bf0ddaba65ddbb2715af97041da8e7a45b2d8628 ]

    When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.
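    The per-direction counting can be illustrated with a small userspace C
    model. The names here (struct req, inflight_rw, DIR_READ/DIR_WRITE) are
    hypothetical and only mirror the idea: walk the in-flight requests and
    bucket them by data direction, the two numbers that
    /sys/block/$dev/inflight reports as "reads writes".

    ```c
    #include <assert.h>
    #include <stddef.h>

    enum dir { DIR_READ = 0, DIR_WRITE = 1 };

    struct req {
        enum dir dir;    /* data direction of the request */
        int in_flight;   /* nonzero while the request is outstanding */
    };

    /* Count outstanding requests per direction: counts[0] = reads,
     * counts[1] = writes. */
    static void inflight_rw(const struct req *reqs, size_t n,
                            unsigned counts[2])
    {
        counts[DIR_READ] = counts[DIR_WRITE] = 0;
        for (size_t i = 0; i < n; i++)
            if (reqs[i].in_flight)
                counts[reqs[i].dir]++;   /* bucket by direction */
    }

    int main(void)
    {
        struct req reqs[] = {
            { DIR_READ,  1 },
            { DIR_WRITE, 1 },
            { DIR_WRITE, 0 },   /* completed: not counted */
            { DIR_WRITE, 1 },
        };
        unsigned c[2];

        inflight_rw(reqs, sizeof(reqs) / sizeof(reqs[0]), c);
        assert(c[DIR_READ] == 1 && c[DIR_WRITE] == 2);
        return 0;
    }
    ```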

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval