12 Feb, 2019

16 commits

  • bio_check_eod() should check the partition size, not the whole disk,
    when bio->bi_partno is non-zero. Do this by moving the call to
    bio_check_eod() into blk_partition_remap().
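
    A minimal sketch of the resulting logic, assuming 4.14-era block-layer
    helpers (the wrapper name remap_to_partition is illustrative, not the
    upstream function):

        static blk_status_t remap_to_partition(struct bio *bio, struct hd_struct *p)
        {
                /* validate against the partition's own size first */
                if (bio_check_eod(bio, part_nr_sects_read(p)))
                        return BLK_STS_IOERR;
                bio->bi_iter.bi_sector += p->start_sect;  /* now disk-relative */
                bio->bi_partno = 0;
                return BLK_STS_OK;
        }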

    Based on an earlier patch from Jiufei Xue.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reported-by: Jiufei Xue
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 52c5e62d4c4beecddc6e1b8045ce1d695fca1ba7)

    Christoph Hellwig
     
  • Similar to blkdev_write_iter(), return -EPERM if the partition is
    read-only. This covers ioctl(), fallocate() and most in-kernel users
    but isn't meant to be exhaustive -- everything else will be caught in
    generic_make_request_checks(), fail with -EIO and can be fixed later.
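
    The shape of the added check is roughly the following (a minimal
    sketch, not the exact hunk; bdev_read_only() is the existing helper):

        /* e.g. at the top of blkdev_fallocate() or a discard/zeroout ioctl */
        if (bdev_read_only(bdev))
                return -EPERM;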

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit a13553c777375009584741e7d9982e775c4b0744)

    Ilya Dryomov
     
  • Regular block device writes go through blkdev_write_iter(), which does
    bdev_read_only(), while zeroout/discard/etc requests are never checked,
    both userspace- and kernel-triggered. Add a generic catch-all check to
    generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
    set_disk_ro(), which is used by quite a few drivers for things like
    snapshots, read-only backing files/images, etc.
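
    Conceptually the catch-all amounts to a test like the one below
    (illustrative sketch; the helper name is hypothetical, the real patch
    adds its own):

        static inline bool bio_targets_ro_part(struct bio *bio, struct hd_struct *part)
        {
                /* part->policy is what BLKROSET / set_disk_ro() flip */
                return op_is_write(bio_op(bio)) && part->policy;
        }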

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit 721c7fc701c71f693307d274d2b346a1ecd4a534)

    Ilya Dryomov
     
  • Export these two interfaces for cgroup-v1.

    Acked-by: Tejun Heo
    Signed-off-by: weiping zhang
    Signed-off-by: Jens Axboe
    (cherry picked from commit 17534c6f2c065ad8e34ff6f013e5afaa90428512)

    weiping zhang
     
  • The __blk_mq_register_dev(), blk_mq_unregister_dev(),
    elv_register_queue() and elv_unregister_queue() calls need to be
    protected with sysfs_lock, but the other code in these functions does
    not. Hence protect only that code with sysfs_lock. This patch fixes a
    locking inversion issue in blk_unregister_queue() and also in an
    error path of blk_register_queue(): it is not allowed to hold
    sysfs_lock around the kobject_del(&q->kobj) call.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 2c2086afc2b8b974fac32cb028e73dc27bfae442)

    Bart Van Assche
     
  • For as long as I can remember DM has forced the block layer to allow
    the allocation and initialization of the request_queue to be distinct
    operations. The reason is that block/genhd.c:add_disk() requires that
    the request_queue (and associated bdi) be tied to the gendisk before
    add_disk() is called -- because add_disk() also deals with exposing
    the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known DM cannot know what type of request_queue
    it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
    that drivers may use to add a disk without also calling
    blk_register_queue(). Driver must call blk_register_queue() once its
    request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if driver used add_disk_no_queue_reg()
    but driver encounters an error and must del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for master/slave relationships DM must establish
    with subordinate devices referenced in DM tables that get loaded. Once
    all "slave" devices for a DM device are known, its request_queue can be
    properly initialized and then advertised via sysfs -- the important
    improvement being that no request_queue resource initialization
    performed by blk_register_queue() is missed for DM devices anymore.
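
    A sketch of the driver-side ordering this enables (names as in the
    summary above; "md->disk" is an illustrative DM-style handle and error
    handling is omitted):

        add_disk_no_queue_reg(md->disk);  /* gendisk exists, queue not in sysfs yet */

        /* ... DM table loads determine the queue type and queue_limits ... */

        blk_register_queue(md->disk);     /* expose the fully initialized queue */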

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit fa70d2e2c4a0a54ced98260c6a176cc94c876d27)

    Mike Snitzer
     
  • The original commit e9a823fb34a8b (block: fix warning when I/O elevator
    is changed as request_queue is being removed) is pretty conflated:
    the resource being protected by q->sysfs_lock isn't the queue_flags,
    it is the 'queue' kobj.

    q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
    from racing with blk_unregister_queue():
    1) By holding q->sysfs_lock first, __elevator_change() can complete
    before a racing blk_unregister_queue().
    2) Conversely, __elevator_change() tests for QUEUE_FLAG_REGISTERED so
    that, if elv_iosched_store() loses the race with blk_unregister_queue(),
    it has a way to know the 'queue' kobj isn't there.

    Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
    held until after the 'queue' kobj is removed.

    To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
    rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

    Also, blk_unregister_queue() should use q->queue_lock to protect against
    any concurrent writes to q->queue_flags -- even though chances are the
    queue is being cleaned up so no concurrent writes are likely.

    Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    (cherry picked from commit 667257e8b2988c0183ba23e2bcd6900e87961606)

    Mike Snitzer
     
  • device_add_disk() will only call bdi_register_owner() if
    !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
    bdi_unregister() if !GENHD_FL_HIDDEN.

    Found by code inspection. bdi_unregister() won't do any harm if
    bdi_register_owner() wasn't used, but it is best to avoid the
    unnecessary call to bdi_unregister().
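
    The resulting symmetry in del_gendisk() looks roughly like this
    (sketch):

        if (!(disk->flags & GENHD_FL_HIDDEN))
                bdi_unregister(disk->queue->backing_dev_info);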

    Fixes: 8ddcd65325 ("block: introduce GENHD_FL_HIDDEN")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit bc8d062c36e3525e81ea8237ff0ab3264c2317b6)

    Mike Snitzer
     
  • So that we can also poll non-blk-mq queues. Mostly needed for
    the NVMe multipath code, but could also be useful elsewhere.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ea435e1b9392a33deceaea2a16ebaa3397bead93)

    Christoph Hellwig
     
  • With this flag a driver can create a gendisk that can be used for I/O
    submission inside the kernel, but which is not registered as user
    facing block device. This will be useful for the NVMe multipath
    implementation.
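
    A minimal usage sketch, assuming the 4.14-era device_add_disk()
    signature:

        disk->flags |= GENHD_FL_HIDDEN;   /* keep the disk internal to the kernel */
        device_add_disk(parent, disk);    /* still usable for in-kernel I/O */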

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 8ddcd653257c18a669fcb75ee42c37054908e0d6)

    Christoph Hellwig
     
  • The hidden gendisks introduced in the next patch need to keep the dev
    field in their struct device empty so that udev won't try to create
    block device nodes for them. To support that, rewrite disk_devt() to
    look at the major and first_minor fields in the gendisk itself instead
    of looking into the struct device.
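
    The resulting helper is essentially (as described above):

        static inline dev_t disk_devt(struct gendisk *disk)
        {
                return MKDEV(disk->major, disk->first_minor);
        }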

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit 517bf3c306bad4fe0da631f90ae2ec40924dee2b)

    Christoph Hellwig
     
  • This helper allows stealing the uncompleted bios from a request so
    that they can be reissued on another path.
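
    A minimal usage sketch (the list and request variable names are
    illustrative):

        struct bio_list requeue_list;

        bio_list_init(&requeue_list);
        blk_steal_bios(&requeue_list, rq);  /* rq holds no bios after this */
        /* resubmit everything on requeue_list via the other path */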

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef71de8b15d891b27b8c983a9a8972b11cb4576a)

    Christoph Hellwig
     
  • This helper allows reinserting a bio into a new queue without much
    overhead, but requires all queue limits to be the same for the upper
    and lower queues, and it does not provide any recursion prevention.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Javier González
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit f421e1d9ade4e1b88183e54425cf50e390d16a7f)

    Christoph Hellwig
     
  • This patch does not change any functionality.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 14a23498ba97683c6790b1bcd8b2cdfe9ad99797)

    Bart Van Assche
     
  • These two functions are only called from inside the block layer so
    unexport them.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    (cherry picked from commit 83d016ac86428dbca8a62d3e4fdc29e3ea39e535)

    Bart Van Assche
     
  • Errata:
    When a read command returns less data than specified in the PRDs (for
    example, there are two PRDs for this command, but the device returns a
    number of bytes which is less than in the first PRD), the second PRD of
    this command is not read out of the PRD FIFO, causing the next command
    to use this PRD erroneously.

    Workaround:
    - force sg_tablesize = 1
    - modify the sg_io function in block/scsi_ioctl.c to use a 64k buffer
      allocated with dma_alloc_coherent during the probe in ahci_imx
    - do not go to sleep in scsi_eh_handler when there is a host failure,
      in order to fix the scsi/sata hang seen when CD_ROM and HDD are
      accessed simultaneously after the workaround is applied

    Signed-off-by: Richard Zhu

    Richard Zhu
     

29 Dec, 2018

2 commits

  • [ Upstream commit b88aef36b87c9787a4db724923ec4f57dfd513f3 ]

    If __blkdev_issue_discard is in progress and a device mapper device is
    reloaded with a table that doesn't support discard,
    q->limits.max_discard_sectors is set to zero. This results in an
    infinite loop in __blkdev_issue_discard().

    This patch checks if max_discard_sectors is zero and aborts with
    -EOPNOTSUPP.
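
    The essence of the fix is an early bail-out instead of looping
    (sketch; the upstream hunk differs in detail):

        if (!q->limits.max_discard_sectors)
                return -EOPNOTSUPP;   /* device (re)configured without discard */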

    Signed-off-by: Mikulas Patocka
    Tested-by: Zdenek Kabelac
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Mikulas Patocka
     
  • [ Upstream commit af097f5d199e2aa3ab3ef777f0716e487b8f7b08 ]

    Don't build discards bigger than what the user asked for, if the
    user decided to limit the size by writing to 'discard_max_bytes'.
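
    In other words, each discard bio gets clamped to the configured limit,
    along the lines of (sketch):

        sector_t req_sects = min_t(sector_t, nr_sects,
                                   q->limits.max_discard_sectors);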

    Reviewed-by: Darrick J. Wong
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

21 Dec, 2018

1 commit

  • [ Upstream commit 2527d99789e248576ac8081530cd4fd88730f8c7 ]

    If an IO scheduler is selected via elevator= and it doesn't match
    the driver in question wrt blk-mq support, then we fail to boot.

    The elevator= parameter is deprecated and only supported for
    non-mq devices. Augment the elevator lookup API so that we
    pass in if we're looking for an mq capable scheduler or not,
    so that we only ever return a valid type for the queue in
    question.
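
    The lookup-time filter boils down to something like this (sketch;
    treat the field name as approximate for this kernel era):

        /* only return a scheduler whose blk-mq support matches the queue */
        if (e && e->uses_mq != !!q->mq_ops)
                e = NULL;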

    Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

21 Nov, 2018

1 commit

  • commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

    c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
    already fixed this race, however the implied synchronize_rcu()
    in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused
    performance regression.

    Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    tried to quiesce queue for avoiding unnecessary synchronize_rcu()
    only when queue initialization is done, because it is usual to see
    lots of inexistent LUNs which need to be probed.

    However, turns out it isn't safe to quiesce queue only when queue
    initialization is done. Because when one SCSI command is completed,
    the user of sending command can be waken up immediately, then the
    scsi device may be removed, meantime the run queue in scsi_end_request()
    is still in-progress, so kernel panic can be caused.

    In Red Hat QE lab, there are several reports about this kind of kernel
    panic triggered during kernel booting.

    This patch tries to address the issue by grabing one queue usage
    counter during freeing one request and the following run queue.
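
    The pattern is to pin the queue across the request free and the
    subsequent run-queue, roughly (sketch based on the description above):

        percpu_ref_get(&q->q_usage_counter);  /* keep request_queue alive */
        blk_mq_end_request(req, error);
        blk_mq_run_hw_queues(q, true);
        percpu_ref_put(&q->q_usage_counter);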

    Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    Cc: Andrew Jones
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E.J. Bottomley
    Cc: stable
    Cc: jianchao.wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

14 Nov, 2018

1 commit

  • [ Upstream commit cbeb869a3d1110450186b738199963c5e68c2a71 ]

    BFQ schedules entities (which represent either per-process queues or
    groups of queues) as a function of their timestamps. In particular, as
    a function of their (virtual) finish times. The finish time of an
    entity is computed as a function of the budget assigned to the entity,
    assuming, tentatively, that the entity, once in service, will receive
    an amount of service equal to its budget. Then, when the entity is
    expired because it finishes being served, this finish time is updated
    as a function of the actual service received by the entity. This
    allows the entity to be correctly charged with only the service it
    actually received, and then to be correctly re-scheduled.

    Yet an entity may receive service also while not being the entity in
    service (in the scheduling environment of its parent entity), for
    several reasons. If the entity remains with no backlog while receiving
    this 'unofficial' service, then it is expired. Also on such an
    expiration, the finish time of the entity should be updated to account
    for only the service actually received by the entity. Unfortunately,
    such an update is not performed for an entity expiring without being
    the entity in service.

    In a similar vein, the service counter of the entity in service is
    reset when the entity is expired, to be ready for the next service
    cycle. This reset should also be performed when an entity is expired
    because it remains empty after receiving service while not being the
    entity in service. But in this case the reset is not performed.

    This commit performs the above update of the finish time and reset of
    the service received, also for an entity expiring while not being the
    entity in service.

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

13 Oct, 2018

1 commit

  • commit 587562d0c7cd6861f4f90a2eb811cccb1a376f5f upstream.

    trace_block_unplug() takes true for explicit unplugs and false for
    implicit unplugs. schedule() unplugs are implicit and should be
    reported as timer unplugs. While correct in the legacy code, this has
    been inverted in blk-mq since 4.11.

    Cc: stable@vger.kernel.org
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

26 Sep, 2018

3 commits

  • [ Upstream commit 1311326cf4755c7ffefd20f576144ecf46d9906b ]

    SCSI probing may synchronously create and destroy a lot of request_queues
    for non-existent devices. Any synchronize_rcu() in queue creation or
    destroy path may introduce long latency during booting, see detailed
    description in comment of blk_register_queue().

    This patch removes one synchronize_rcu() inside blk_cleanup_queue()
    for this case; commit c2856ae2f315d75 ("blk-mq: quiesce queue before
    freeing queue") needs synchronize_rcu() for implementing
    blk_mq_quiesce_queue(), but when the queue isn't initialized it isn't
    necessary to do that, since only pass-through requests are involved
    and there is no such issue in scsi_execute() at all.

    Without this patch and the previous one, it may take 20+ seconds for
    virtio-scsi to complete disk probing. With the two patches, the time
    becomes less than 100ms.

    Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
    Reported-by: Andrew Jones
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit b04f50ab8a74129b3041a2836c33c916be3c6667 ]

    Only attempt to merge a bio if ctx->rq_list isn't empty, because:

    1) for high-performance SSDs, most of the time dispatch may succeed and
    there may then be nothing left in ctx->rq_list, so don't try to merge
    over the sw queue if it is empty; that way we can save one acquisition
    of ctx->lock

    2) we can't expect good merge performance on the per-cpu sw queue, and
    missing one merge on the sw queue won't be a big deal since tasks can be
    scheduled from one CPU to another.
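
    The change is essentially a cheap emptiness test before taking
    ctx->lock (sketch):

        bool merged = false;

        if (!list_empty_careful(&ctx->rq_list)) {
                spin_lock(&ctx->lock);
                merged = blk_mq_attempt_merge(q, ctx, bio);
                spin_unlock(&ctx->lock);
        }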

    Cc: Laurence Oberman
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Tested-by: Kashyap Desai
    Reported-by: Kashyap Desai
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 42c9cdfe1e11e083dceb0f0c4977b758cf7403b9 ]

    Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
    that blk_stack_limits() can stack up this limit for stacked devices.

    before:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    1

    after:

    $ cat /sys/block/nvme0n1/queue/max_discard_segments
    256
    $ cat /sys/block/dm-0/queue/max_discard_segments
    256
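
    The change itself is essentially a one-liner in
    blk_set_stacking_limits() (as described above):

        lim->max_discard_segments = USHRT_MAX;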

    Fixes: 1e739730c5b9e ("block: optionally merge discontiguous discard bios into a single request")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

20 Sep, 2018

4 commits

  • [ Upstream commit 14cb2c8a6c5dae57ee3e2da10fa3db2b9087e39e ]

    The if-block that sets a successful return value in aix_partition()
    uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized.

    For example, if 'numlvs' is zero or alloc_lvn() fails, neither is
    initialized, but are used anyway if alloc_pvd() succeeds after it.

    So, make the alloc_pvd() call conditional on their initialization.

    This has been hit when attaching an apparently corrupted/stressed
    AIX LUN, misleading the kernel to pr_warn() invalid data and hang.

    [...] partition (null) (11 pp's found) is not contiguous
    [...] partition (null) (2 pp's found) is not contiguous
    [...] partition (null) (3 pp's found) is not contiguous
    [...] partition (null) (64 pp's found) is not contiguous

    Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     
  • [ Upstream commit d43fdae7bac2def8c4314b5a49822cb7f08a45f1 ]

    Even if properly initialized, the lvname array (i.e., strings)
    is read from disk, and might contain corrupt data (e.g., lack
    the null terminating character for strings).

    So, make sure the partition name string used in pr_warn() has
    the null terminating character.
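
    A sketch of the idea: copy the on-disk name into a bounded,
    NUL-terminated buffer before printing (the index variable and the
    simplified message are illustrative):

        char tmp[sizeof(n[i].name) + 1];

        snprintf(tmp, sizeof(tmp), "%s", n[i].name);
        pr_warn("partition %s is not contiguous\n", tmp);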

    Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
    Suggested-by: Daniel J. Axtens
    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     
  • [ Upstream commit 75d6e175fc511e95ae3eb8f708680133bc211ed3 ]

    The 'nr' passed from userspace represents the total depth, while
    inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth
    and 'nr_reserved_tags' stores the reserved part.

    There are two issues in blk_mq_tag_update_depth() now:

    1) for growing tags, we should have used the passed 'nr' and kept the
    number of reserved tags unchanged.

    2) the passed 'nr' should have been used for checking against
    'tags->nr_tags', instead of the number of the normal part.

    This patch fixes the above two cases, and avoids a kernel crash caused
    by wrongly resizing the sbitmap queue.

    Cc: "Ewan D. Milne"
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Tested by: Marco Patalano
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit d5274b3cd6a814ccb2f56d81ee87cbbf51bd4cf7 upstream.

    Fix a trivial use-after-free. This could be the last reference to the bfqg.

    Fixes: 8f9bebc33dd7 ("block, bfq: access and cache blkg data only when safe")
    Acked-by: Paolo Valente
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

15 Sep, 2018

2 commits

  • [ Upstream commit f7ecb1b109da1006a08d5675debe60990e824432 ]

    This patch does not change any functionality but avoids gcc reporting
    the following warnings when building with W=1:

    block/cfq-iosched.c: In function 'cfq_back_seek_max_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4756:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4759:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4760:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_low_latency_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4765:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4782:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4783:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     
  • [ Upstream commit d6c02a9beb67f13d5f14f23e72fa9981e8b84477 ]

    In commit ed996a52c868 ("block: simplify and cleanup bvec pool
    handling"), the value of the slab index is incremented by one in
    bvec_alloc() after the allocation is done to indicate an index value of
    0 does not need to be later freed.

    bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
    value. Decrement idx before performing the lookup.
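
    The resulting helper is roughly:

        unsigned int bvec_nr_vecs(unsigned short idx)
        {
                return bvec_slabs[--idx].nr_vecs;
        }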

    Fixes: ed996a52c868 ("block: simplify and cleanup bvec pool handling")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     

10 Sep, 2018

3 commits

  • commit fc8ebd01deeb12728c83381f6ec923e4a192ffd3 upstream.

    The value that the struct cftype .write() method returns is then
    directly returned to userspace as the value returned by the write()
    syscall, so it should be the number of bytes actually written (or
    consumed) and not zero.

    Returning zero from the write() syscall makes programs like /bin/echo
    or bash spin.
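
    The fix thus amounts to returning the consumed byte count on success,
    e.g. (sketch):

        return ret ?: nbytes;   /* on success (ret == 0) report all nbytes consumed */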

    Signed-off-by: Maciej S. Szmigiero
    Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Maciej S. Szmigiero
     
  • commit b233f127042dba991229e3882c6217c80492f6ef upstream.

    Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
    disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
    it can't take effect that way since user space can still switch
    it on via 'echo auto > /sys/block/sdN/device/power/control'.

    This patch really disables runtime PM for blk-mq, by calling
    pm_runtime_disable(), and fixes all kinds of PM-related kernel crashes.

    Cc: Tomas Janousek
    Cc: Przemek Socha
    Cc: Alan Stern
    Cc:
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Tested-by: Patrick Steinhardt
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 54648cf1ec2d7f4b6a71767799c45676a138ca24 upstream.

    We found a memory use-after-free issue in __blk_drain_queue()
    on kernel 4.14. After reading the latest kernel 4.18-rc6 we
    think it has the same problem.

    Memory is allocated for q->fq in blk_init_allocated_queue().
    If the elevator init function returns an error, we run into the
    failure case and free q->fq.

    Then __blk_drain_queue() uses that same memory after the free
    of q->fq, which leads to unpredictable behaviour.

    The patch sets q->fq to NULL in the failure case of
    blk_init_allocated_queue().
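
    The fix in the error path is essentially (as described above):

        blk_free_flush_queue(q->fq);
        q->fq = NULL;   /* so __blk_drain_queue() cannot touch freed memory */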

    Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
    Cc:
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: xiao jin
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    xiao jin
     

24 Aug, 2018

1 commit

  • [ Upstream commit ce042c183bcb94eb2919e8036473a1fc203420f9 ]

    resp->num is the number of tokens in resp->tok[]. It gets set in
    response_parse(). So if n == resp->num then we're reading beyond the
    end of the data.
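
    The corrected bounds test, written as a standalone predicate (sketch;
    the helper name is illustrative):

        static bool resp_tok_index_valid(const struct parsed_resp *resp, int n)
        {
                /* valid indexes are 0 .. resp->num - 1 */
                return n < resp->num;
        }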

    Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
    Reviewed-by: Scott Bauer
    Tested-by: Scott Bauer
    Signed-off-by: Dan Carpenter
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

18 Aug, 2018

1 commit

  • commit 4baa8bb13f41307f3eb62fe91f93a1a798ebef53 upstream.

    This commit fixes a bug that causes bfq to fail to guarantee a high
    responsiveness on some drives, if there is heavy random read+write I/O
    in the background. More precisely, such a failure allowed this bug to
    be found [1], but the bug may well cause other yet unreported
    anomalies.

    BFQ raises the weight of the bfq_queues associated with soft real-time
    applications, to privilege the I/O, and thus reduce latency, for these
    applications. This mechanism is named soft-real-time weight raising in
    BFQ. A soft real-time period may happen to be nested into an
    interactive weight raising period, i.e., it may happen that, when a
    bfq_queue switches to a soft real-time weight-raised state, the
    bfq_queue is already being weight-raised because deemed interactive
    too. In this case, BFQ saves, in a special variable
    wr_start_at_switch_to_srt, the time instant when the interactive
    weight-raising period started for the bfq_queue, i.e., the time
    instant when BFQ started to deem the bfq_queue interactive. This value
    is then used to check whether the interactive weight-raising period
    would still be in progress when the soft real-time weight-raising
    period ends. If so, interactive weight raising is restored for the
    bfq_queue. This restore is useful, in particular, because it prevents
    bfq_queues from losing their interactive weight raising prematurely,
    as a consequence of spurious, short-lived soft real-time
    weight-raising periods caused by wrong detections as soft real-time.

    If, instead, a bfq_queue switches to soft-real-time weight raising
    while it *is not* already in an interactive weight-raising period,
    then the variable wr_start_at_switch_to_srt has no meaning during the
    following soft real-time weight-raising period. Unfortunately the
    handling of this case is wrong in BFQ: not only is the variable not
    flagged somehow as meaningless, but it is also set to the time when
    the switch to soft real-time weight-raising occurs. This may cause an
    interactive weight-raising period to be considered mistakenly as still
    in progress, and thus a spurious interactive weight-raising period to
    start for the bfq_queue, at the end of the soft-real-time
    weight-raising period. In particular the spurious interactive
    weight-raising period will be considered as still in progress, if the
    soft-real-time weight-raising period does not last very long. The
    bfq_queue will then be wrongly privileged and, if I/O bound, will
    unjustly steal bandwidth from truly interactive or soft real-time
    bfq_queues, harming responsiveness and low latency.

    This commit fixes this issue by just setting wr_start_at_switch_to_srt
    to minus infinity (farthest past time instant according to jiffies
    macros): when the soft-real-time weight-raising period ends, certainly
    no interactive weight-raising period will be considered as still in
    progress.
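
    In jiffies terms, "minus infinity" is simply the farthest
    representable past instant, along the lines of (sketch; upstream wraps
    this in a small helper):

        bfqq->wr_start_at_switch_to_srt = jiffies - MAX_JIFFY_OFFSET;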

    [1] Background I/O Type: Random - Background I/O mix: Reads and writes
    - Application to start: LibreOffice Writer in
    http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop

    Signed-off-by: Paolo Valente
    Signed-off-by: Angelo Ruocco
    Tested-by: Oleksandr Natalenko
    Tested-by: Lee Tibbert
    Tested-by: Mirko Montanari
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

03 Aug, 2018

3 commits

  • commit 5151842b9d8732d4cbfa8400b40bff894f501b2f upstream.

    After the bio has been updated to represent the remaining sectors, reset
    bi_done so bio_rewind_iter() does not rewind further than it should.

    This resolves a bio_integrity_process() failure on reads where the
    original request was split.
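
    The reset itself is a one-liner on the iterator (sketch; the field is
    as in the 4.17-era struct bvec_iter):

        bio->bi_iter.bi_done = 0;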

    Fixes: 63573e359d05 ("bio-integrity: Restore original iterator on verify stage")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     
  • commit b403ea2404889e1227812fa9657667a1deb9c694 upstream.

    If the last page of the bio is not "full", the length of the last
    vector slot needs to be corrected. This slot has the index
    (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
    array, which is shifted by the value of bio->bi_vcnt at function
    invocation, the correct index is (nr_pages - 1).

    v2: improved readability following suggestions from Ming Lei.
    v3: followed a formatting suggestion from Christoph Hellwig.

    Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Martin Wilck
     
  • [ Upstream commit a12bffebc0c9d6a5851f062aaea3aa7c4adc6042 ]

    In bfq_requests_merged(), there is a deadlock because the lock on
    bfqq->bfqd->lock is held by the calling function, but the code of
    this function tries to grab the lock again.

    This deadlock is currently hidden by another bug (fixed by next commit
    for this source file), which causes the body of bfq_requests_merged()
    to be never executed.

    This commit removes the deadlock by removing the lock/unlock pair.

    Signed-off-by: Filippo Muzzini
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filippo Muzzini
     

22 Jul, 2018

1 commit

  • commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 upstream.

    When blk_queue_enter() waits for a queue to unfreeze, or unset the
    PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.
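
    Concretely, the wait in blk_queue_enter() stops using the
    _interruptible variant, roughly (sketch; the wake-up condition is
    simplified here):

        /* before: ret = wait_event_interruptible(...); -ERESTARTSYS could leak out */
        wait_event(q->mq_freeze_wq,
                   !atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q));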

    The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
    ("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
    device is resumed asynchronously, i.e. after un-freezing userspace tasks.

    So that commit exposed the bug as a regression in v4.15. A mysterious
    SIGBUS (or -EIO) sometimes happened during the time the device was being
    resumed. Most frequently, there was no kernel log message, and we saw Xorg
    or Xwayland killed by SIGBUS.[1]

    [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

    Without this fix, I get an IO error in this test:

    # dd if=/dev/sda of=/dev/null iflag=direct & \
    while killall -SIGUSR1 dd; do sleep 0.1; done & \
    echo mem > /sys/power/state ; \
    sleep 5; killall dd # stop after 5 seconds

    The interruptible wait was added to blk_queue_enter in
    commit 3ef28e83ab15 ("block: generic request_queue reference counting").
    Before then, the interruptible wait was only in blk-mq, but I don't think
    it could ever have been correct.

    Reviewed-by: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Alan Jenkins
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Alan Jenkins