Eric Lee / smarc-fsl-linux-kernel

22 Apr, 2019

1 commit

6fcc44d1d block: fix use-after-free on gendisk ... Browse Code »

commit 2da78092dda "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.

However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:

Process1 Worker Process2
md_free
blkdev_open
del_gendisk
add delete_partition_work_fn() to wq
__blkdev_get
get_gendisk
put_disk
disk_release
kfree(disk)
find part from ext_devt_idr
get_disk_and_module(disk)
cause use after free

delete_partition_work_fn
put_device(part)
part_release
remove part from ext_devt_idr

Before is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().

We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.

Thanks to Jan Kara for providing the solution and more clear comments
for the code.

Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro
Reviewed-by: Bart Van Assche
Reviewed-by: Keith Busch
Reviewed-by: Jan Kara
Suggested-by: Jan Kara
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe

Yufen Yu
2019-04-22 23:48:12 +0800

13 Apr, 2019

2 commits

c92e2f04b block: disk_events: introduce event flags ... Browse Code »

Currently, an empty disk->events field tells the block layer not to
forward media change events to user space. This was done in commit
7c88a168da80 ("block: don't propagate unlisted DISK_EVENTs to userland")
in order to avoid events from "fringe" drivers to be forwarded to user
space. By doing so, the block layer lost the information which events
were supported by a particular block device, and most importantly,
whether or not a given device supports media change events at all.

Prepare for not interpreting the "events" field this way in the future
any more. This is done by adding an additional field "event_flags" to
struct gendisk, and two flag bits that can be set to have the device
treated like one that had the "events" field set to a non-zero value
before. This applies only to the sd and sr drivers, which are changed to
set the new flags.

The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
blocklayer to generate udev events from kernel events.

In order to add the event_flags field to struct gendisk, the events
field is converted to an "unsigned short"; it doesn't need to hold
values bigger than 2 anyway.

This patch doesn't change behavior.

Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe

Martin Wilck
2019-04-13 03:35:24 +0800
673387a93 block: genhd: remove async_events field ... Browse Code »

The async_events field, intended to be used for drivers that support
asynchronous notifications about disk events (aka media change events),
isn't currently used by any driver, and apparently that has been that
way for a long time (if not forever). Remove it.

Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe

Martin Wilck
2019-04-13 03:35:22 +0800

07 Apr, 2019

1 commit

72deb455b block: remove CONFIG_LBDAF ... Browse Code »

Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures. These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time. Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.

Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2019-04-07 00:48:35 +0800

10 Dec, 2018

4 commits

e016b7820 block: return just one value from part_in_flight ... Browse Code »

The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.

Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.

This patch just refactors the code, there's no functional change.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mikulas Patocka
2018-12-10 23:30:38 +0800
1226b8dd0 block: switch to per-cpu in-flight counters ... Browse Code »

Now when part_round_stats is gone, we can switch to per-cpu in-flight
counters.

We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.

The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mikulas Patocka
2018-12-10 23:30:37 +0800
5b18b5a73 block: delete part_round_stats and switch to less precise counting ... Browse Code »

We want to convert to per-cpu in_flight counters.

The function part_round_stats needs the in_flight counter every jiffy, it
would be too costly to sum all the percpu variables every jiffy, so it
must be deleted. part_round_stats is used to calculate two counters -
time_in_queue and io_ticks.

time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).

io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.

Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mikulas Patocka
2018-12-10 23:30:37 +0800
112f158f6 block: stop passing 'cpu' to all percpu stats methods ... Browse Code »

All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them. Just call
smp_processor_id() as needed.

Suggested-by: Jens Axboe
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mike Snitzer
2018-12-10 23:30:37 +0800

29 Nov, 2018

1 commit

94a2c3a32 block: use rcu_work instead of call_rcu to avoid sleep in softirq ... Browse Code »

We recently got a stack by syzkaller like this:

BUG: sleeping function called from invalid context at mm/slab.h:361
in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
INFO: lockdep is turned off.
CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
0000000000000000 0000000000000001 0000000000000004 0000000000000000
Call Trace:
[] __dump_stack lib/dump_stack.c:15 [inline]
[] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
[] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
[] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
[] slab_pre_alloc_hook mm/slab.h:361 [inline]
[] slab_alloc_node mm/slub.c:2610 [inline]
[] slab_alloc mm/slub.c:2692 [inline]
[] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
[] kmalloc include/linux/slab.h:479 [inline]
[] kzalloc include/linux/slab.h:623 [inline]
[] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
[] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
[] kobject_cleanup lib/kobject.c:633 [inline]
[] kobject_release+0x229/0x440 lib/kobject.c:675
[] kref_sub include/linux/kref.h:73 [inline]
[] kref_put include/linux/kref.h:98 [inline]
[] kobject_put+0x72/0xd0 lib/kobject.c:692
[] put_device+0x25/0x30 drivers/base/core.c:1237
[] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
[] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
[] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
[] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
[] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
[] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
[] __do_softirq+0x299/0xe20 kernel/softirq.c:273
[] invoke_softirq kernel/softirq.c:350 [inline]
[] irq_exit+0x216/0x2c0 kernel/softirq.c:391
[] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
[] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
[] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
[] ? audit_kill_trees+0x180/0x180
[] fd_install+0x57/0x80 fs/file.c:626
[] do_sys_open+0x45e/0x550 fs/open.c:1043
[] SYSC_open fs/open.c:1055 [inline]
[] SyS_open+0x32/0x40 fs/open.c:1050
[] entry_SYSCALL_64_fastpath+0x1e/0x9a

In softirq context, we call rcu callback function delete_partition_rcu_cb(),
which may allocate memory by kzalloc with GFP_KERNEL flag. If the
allocation cannot be satisfied, it may sleep. However, That is not allowed
in softirq contex.

Although we found this problem on linux 4.4, the latest kernel version
seems to have this problem as well. And it is very similar to the
previous one:
https://lkml.org/lkml/2018/7/9/391

Fix it by using RCU workqueue, which allows sleep.

Reviewed-by: Paul E. McKenney
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe

Yufen Yu
2018-11-29 00:08:27 +0800

01 Oct, 2018

1 commit

c0aac682f Merge tag 'v4.19-rc6' into for-4.20/block ... Browse Code »

Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe

* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...

Signed-off-by: Jens Axboe

Jens Axboe
2018-10-01 22:58:57 +0800

28 Sep, 2018

1 commit

fef912bf8 block: genhd: add 'groups' argument to device_add_disk ... Browse Code »

Update device_add_disk() to take an 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().

Signed-off-by: Martin Wilck
Signed-off-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe

Hannes Reinecke
2018-09-28 22:30:28 +0800

22 Sep, 2018

1 commit

b57e99b4b block: use nanosecond resolution for iostat ... Browse Code »

Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.

Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2018-09-22 10:26:59 +0800

18 Jul, 2018

3 commits

bdca3c87f block: Track DISCARD statistics and output them in stat and diskstat ... Browse Code »

Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats. These are tracked with the same four stats as reads
and writes:

Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requests

This is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.

tj: Refreshed on top of v4.17 and other previous updates.

Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: Andy Newell
Signed-off-by: Jens Axboe

Michael Callahan
2018-07-18 22:44:22 +0800
dbae2c551 block: Define and use STAT_READ and STAT_WRITE ... Browse Code »

Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: "Theodore Ts'o"
Cc: Jaegeuk Kim
Signed-off-by: Jens Axboe

Michael Callahan
2018-07-18 22:44:18 +0800
59767fbd4 block: Add part_stat_read_accum to read across field entries. ... Browse Code »

Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries. For example to sum up the number read and write
sectors completed. In addition to being ar reasonable cleanup by
itself this will make it easier to add new stat fields in the future.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Michael Callahan
2018-07-18 22:44:16 +0800

26 Apr, 2018

1 commit

bf0ddaba6 blk-mq: fix sysfs inflight counter ... Browse Code »

When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe

Omar Sandoval
2018-04-26 23:02:01 +0800

27 Feb, 2018

3 commits

56c0908c8 genhd: Fix BUG in blkdev_open() ... Browse Code »

When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:

CPU0 CPU1 CPU2
del_gendisk()
bdev_unhash_inode(part1);

blkdev_open(part1, O_EXCL) blkdev_open(part1, O_EXCL)
bdev = bd_acquire() bdev = bd_acquire()
blkdev_get(bdev)
bd_start_claiming(bdev)
- finds old inode 'whole'
bd_prepare_to_claim() -> 0
bdev_unhash_inode(whole);

blkdev_get(bdev);
bd_start_claiming(bdev)
- finds new inode 'whole'
bd_prepare_to_claim()
- this also succeeds as we have
different 'whole' here...
- bad things happen now as we
have two exclusive openers of
the same bdev

The problem here is that block device opens can see various intermediate
states while gendisk is shutting down and then being recreated.

We fix the problem by introducing new lookup_sem in gendisk that
synchronizes gendisk deletion with get_gendisk() and furthermore by
making sure that get_gendisk() does not return gendisk that is being (or
has been) deleted. This makes sure that once we ever manage to look up
newly created bdev inode, we are also guaranteed that following
get_gendisk() will either return failure (and we fail open) or it
returns gendisk for the new device and following bdget_disk() will
return new bdev inode (i.e., blkdev_open() follows the path as if it is
completely run after new device is created).

Reported-and-analyzed-by: Hou Tao
Tested-by: Hou Tao
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2018-02-27 00:48:42 +0800
9df6c2991 genhd: Add helper put_disk_and_module() ... Browse Code »

Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.

Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2018-02-27 00:48:42 +0800
3079c22ea genhd: Rename get_disk() to get_disk_and_module() ... Browse Code »

Rename get_disk() to get_disk_and_module() to make sure what the
function does. It's not a great name but at least it is now clear that
put_disk() is not it's counterpart.

Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2018-02-27 00:48:42 +0800

15 Jan, 2018

1 commit

fa70d2e2c block: allow gendisk's request_queue registration to be deferred ... Browse Code »

Since I can remember DM has forced the block layer to allow the
allocation and initialization of the request_queue to be distinct
operations. Reason for this is block/genhd.c:add_disk() has requires
that the request_queue (and associated bdi) be tied to the gendisk
before add_disk() is called -- because add_disk() also deals with
exposing the request_queue via blk_register_queue().

DM's dynamic creation of arbitrary device types (and associated
request_queue types) requires the DM device's gendisk be available so
that DM table loads can establish a master/slave relationship with
subordinate devices that are referenced by loaded DM tables -- using
bd_link_disk_holder(). But until these DM tables, and their associated
subordinate devices, are known DM cannot know what type of request_queue
it needs -- nor what its queue_limits should be.

This chicken and egg scenario has created all manner of problems for DM
and, at times, the block layer.

Summary of changes:

- Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
that drivers may use to add a disk without also calling
blk_register_queue(). Driver must call blk_register_queue() once its
request_queue is fully initialized.

- Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
is not set. It won't be set if driver used add_disk_no_queue_reg()
but driver encounters an error and must del_gendisk() before calling
blk_register_queue().

- Export blk_register_queue().

These changes allow DM to use add_disk_no_queue_reg() to anchor its
gendisk as the "master" for master/slave relationships DM must establish
with subordinate devices referenced in DM tables that get loaded. Once
all "slave" devices for a DM device are known its request_queue can be
properly initialized and then advertised via sysfs -- important
improvement being that no request_queue resource initialization
performed by blk_register_queue() is missed for DM devices anymore.

Signed-off-by: Mike Snitzer
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe

Mike Snitzer
2018-01-15 23:41:38 +0800

15 Nov, 2017

1 commit

e2c5923c3 Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block ... Browse Code »

Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.

Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:

- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.

- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.

- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)

- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)

- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).

- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).

- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).

- Fix laptop mode on blk-mq (me).

- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).

- blktrace race fix, oopsing on sg load (me).

- blk-mq optimizations (me).

- Obscure waitqueue race fix for kyber (Omar).

- NBD fixes (Josef).

- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).

- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.

- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.

- BFQ updates (Paolo).

- blk-mq atomic flags memory ordering fixes (Peter Z).

- Loop cgroup support (Shaohua).

- Lots of minor fixes from lots of different folks, both for core and
driver code"

* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...

Linus Torvalds
2017-11-15 07:32:19 +0800

11 Nov, 2017

1 commit

67f2519fe fs: guard_bio_eod() needs to consider partitions ... Browse Code »

guard_bio_eod() needs to look at the partition capacity, not just the
capacity of the whole device, when determining if truncation is
necessary.

[ 60.268688] attempt to access beyond end of device
[ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
[ 60.268693] buffer_io_error: 2 callbacks suppressed
[ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read

Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Cc: stable@vger.kernel.org # v4.13
Reviewed-by: Christoph Hellwig
Signed-off-by: Greg Edwards
Signed-off-by: Jens Axboe

Greg Edwards
2017-11-11 10:55:57 +0800

07 Nov, 2017

1 commit

8c5db92a7 Merge branch 'linus' into locking/core, to resolve conflicts ... Browse Code »

Conflicts:
include/linux/compiler-clang.h
include/linux/compiler-gcc.h
include/linux/compiler-intel.h
include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar

Ingo Molnar
2017-11-07 17:32:44 +0800

04 Nov, 2017

2 commits

8ddcd6532 block: introduce GENHD_FL_HIDDEN ... Browse Code »

With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as user
facing block device. This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-11-04 00:31:48 +0800
517bf3c30 block: don't look at the struct device dev_t in disk_devt ... Browse Code »

The hidden gendisks introduced in the next patch need to keep the dev
field in their struct device empty so that udev won't try to create
block device nodes for them. To support that rewrite disk_devt to
look at the major and first_minor fields in the gendisk itself instead
of looking into the struct device.

Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-11-04 00:31:48 +0800

02 Nov, 2017

1 commit

b24413180 License cleanup: add SPDX GPL-2.0 license identifier to files with no license ... Browse Code »

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2017-11-02 18:10:55 +0800

26 Oct, 2017

1 commit

e319e1fbd block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion() ... Browse Code »

Darrick posted the following warning and Dave Chinner analyzed it:

> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: G W
> ------------------------------------------------------
> loop0/31693 is trying to acquire lock:
> (&(&ip->i_mmaplock)->mr_lock){++++}, at: [] xfs_ilock+0x23c/0x330 [xfs]
>
> but now in release context of a crosslock acquired at the following:
> ((complete)&ret.event){+.+.}, at: [] submit_bio_wait+0x7f/0xb0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)&ret.event){+.+.}:
> lock_acquire+0xab/0x200
> wait_for_completion_io+0x4e/0x1a0
> submit_bio_wait+0x7f/0xb0
> blkdev_issue_zeroout+0x71/0xa0
> xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
> xfs_bmapi_write+0x374/0x11f0 [xfs]
> xfs_iomap_write_direct+0x2ac/0x430 [xfs]
> xfs_file_iomap_begin+0x20d/0xd50 [xfs]
> iomap_apply+0x43/0xe0
> dax_iomap_rw+0x89/0xf0
> xfs_file_dax_write+0xcc/0x220 [xfs]
> xfs_file_write_iter+0xf0/0x130 [xfs]
> __vfs_write+0xd9/0x150
> vfs_write+0xc8/0x1c0
> SyS_write+0x45/0xa0
> entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #1 (&xfs_nondir_ilock_class){++++}:
> lock_acquire+0xab/0x200
> down_write_nested+0x4a/0xb0
> xfs_ilock+0x263/0x330 [xfs]
> xfs_setattr_size+0x152/0x370 [xfs]
> xfs_vn_setattr+0x6b/0x90 [xfs]
> notify_change+0x27d/0x3f0
> do_truncate+0x5b/0x90
> path_openat+0x237/0xa90
> do_filp_open+0x8a/0xf0
> do_sys_open+0x11c/0x1f0
> entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
> up_write+0x1c/0x40
> xfs_iunlock+0x1d0/0x310 [xfs]
> xfs_file_fallocate+0x8a/0x310 [xfs]
> loop_queue_work+0xb7/0x8d0
> kthread_worker_fn+0xb9/0x1f0
>
> Chain exists of:
> &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
>
> Possible unsafe locking scenario by crosslock:
>
> CPU0 CPU1
> ---- ----
> lock(&xfs_nondir_ilock_class);
> lock((complete)&ret.event);
> lock(&(&ip->i_mmaplock)->mr_lock);
> unlock((complete)&ret.event);
>
> *** DEADLOCK ***

The warning is a false positive, caused by the fact that all
wait_for_completion()s in submit_bio_wait() are waiting with the same
lock class.

However, some bios have nothing to do with others, for example in the case
of loop devices, there's no direct connection between the bios of an upper
device and the bios of a lower device(=loop device).

The safest way to assign different lock classes to different devices is
to do it for each gendisk. In other words, this patch assigns a
lockdep_map per gendisk and uses it when initializing completion in
submit_bio_wait().

Analyzed-by: Dave Chinner
Reported-by: Darrick J. Wong
Signed-off-by: Byungchul Park
Reviewed-by: Jens Axboe
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: amir73il@gmail.com
Cc: axboe@kernel.dk
Cc: david@fromorbit.com
Cc: hch@infradead.org
Cc: idryomov@gmail.com
Cc: johan@kernel.org
Cc: johannes.berg@intel.com
Cc: kernel-team@lge.com
Cc: linux-block@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Cc: oleg@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar

Byungchul Park
2017-10-26 13:54:17 +0800

10 Aug, 2017

3 commits

f299b7c7a blk-mq: provide internal in-flight variant ... Browse Code »

We don't have to inc/dec some counter, since we can just
iterate the tags. That makes inc/dec a noop, but means we
have to iterate busy tags to get an in-flight count.

Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe

Jens Axboe
2017-08-10 03:09:28 +0800
0609e0efc block: make part_in_flight() take an array of two ints ... Browse Code »

Instead of returning the count that matches the partition, pass
in an array of two ints. Index 0 will be filled with the inflight
count for the partition in question, and index 1 will filled
with the root inflight count, if the partition passed in is not the
root.

This is in preparation for being able to calculate both in one
go.

Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe

Jens Axboe
2017-08-10 03:09:20 +0800
d62e26b3f block: pass in queue to inflight accounting ... Browse Code »

No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.

Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe

Jens Axboe
2017-08-10 03:09:16 +0800

05 Jun, 2017

1 commit

1dd771eb0 block: remove blk_part_pack_uuid ... Browse Code »

This helper was only used by IMA of all things, which would get spurious
errors if CONFIG_BLOCK is disabled. Just opencode the call there.

Signed-off-by: Christoph Hellwig
Reviewed-by: Amir Goldstein
Acked-by: Mimi Zohar
Reviewed-by: Andy Shevchenko

Christoph Hellwig
2017-06-05 22:59:10 +0800

22 Apr, 2017

1 commit

19b7ccf86 block: get rid of blk_integrity_revalidate() ... Browse Code »

Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:

if (bi->profile)
disk->queue->backing_dev_info->capabilities |=
BDI_CAP_STABLE_WRITES;
else
disk->queue->backing_dev_info->capabilities &=
~BDI_CAP_STABLE_WRITES;

It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time. This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there. This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.

Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen"
Cc: Christoph Hellwig
Cc: Mike Snitzer
Cc: stable@vger.kernel.org # 4.4+, needs backporting
Tested-by: Dan Williams
Signed-off-by: Ilya Dryomov
Signed-off-by: Jens Axboe

Ilya Dryomov
2017-04-22 04:17:27 +0800

25 Mar, 2017

1 commit

869ab90f0 block: constify struct blk_integrity_profile ... Browse Code »

blk_integrity_profile's are never modified, so mark them 'const' so that
they are placed in .rodata and benefit from memory protection.

Signed-off-by: Eric Biggers
Signed-off-by: Jens Axboe

Eric Biggers
2017-03-25 10:34:39 +0800

09 Mar, 2017

1 commit

c01228db4 Revert "scsi, block: fix duplicate bdi name registration crashes" ... Browse Code »

This reverts commit 0dba1314d4f81115dce711292ec7981d17231064. It causes
leaking of device numbers for SCSI when SCSI registers multiple gendisks
for one request_queue in succession. It can be easily reproduced using
Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
Furthermore the protection provided by this commit is not needed anymore
as the problem it was fixing got also fixed by commit 165a5e22fafb
"block: Move bdi_unregister() to del_gendisk()".

[1]: http://marc.info/?l=linux-block&m=148554717109098&w=2

Signed-off-by: Jan Kara
Acked-by: Dan Williams
Tested-by: Omar Sandoval
Signed-off-by: Jens Axboe

Jan Kara
2017-03-09 01:55:17 +0800

02 Feb, 2017

1 commit

0dba1314d scsi, block: fix duplicate bdi name registration crashes ... Browse Code »

Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:

WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
? kernfs_path_from_node+0x4f/0x60
sysfs_warn_dup+0x62/0x80
sysfs_create_dir_ns+0x77/0x90
kobject_add_internal+0xb2/0x350
kobject_add+0x75/0xd0
device_add+0x15a/0x650
device_create_groups_vargs+0xe0/0xf0
device_create_vargs+0x1c/0x20
bdi_register+0x90/0x240
? lockdep_init_map+0x57/0x200
bdi_register_owner+0x36/0x60
device_add_disk+0x1bb/0x4e0
? __pm_runtime_use_autosuspend+0x5c/0x70
sd_probe_async+0x10d/0x1c0
async_run_entry_fn+0x39/0x170

This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().

Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.

[1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
[2]: http://marc.info/?l=linux-block&m=148554717109098&w=2

Cc: James Bottomley
Cc: Bart Van Assche
Cc: "Martin K. Petersen"
Cc: Jan Kara
Reported-by: Omar Sandoval
Tested-by: Omar Sandoval
Signed-off-by: Dan Williams
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe

Dan Williams
2017-02-02 23:23:19 +0800

23 Dec, 2016

1 commit

72c5296f9 genhd: remove dead and duplicated scsi code ... Browse Code »

blk_scsi_cmd_filter use was deprecated by 4beab5c6 and the SCSI macros
are duplicated in blkdev.h, both likely reintroduced by a bad merge from
540eed56.

Signed-off-by: Jon Derrick
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jon Derrick
2016-12-23 02:48:20 +0800

11 Oct, 2016

1 commit

0766f788e latent_entropy: Mark functions with __latent_entropy ... Browse Code »

The __latent_entropy gcc attribute can be used only on functions and
variables. If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents. The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy
[kees: expanded commit message]
Signed-off-by: Kees Cook

Emese Revfy
2016-10-11 05:51:45 +0800

28 Jun, 2016

1 commit

52c44d93c block: remove ->driverfs_dev ... Browse Code »

Now that all drivers that specify a ->driverfs_dev have been converted
to device_add_disk(), the pointer can be removed from struct gendisk.

Cc: Jens Axboe
Cc: Ross Zwisler
Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Dan Williams

Dan Williams
2016-06-28 03:26:08 +0800

16 Jun, 2016

1 commit

e63a46bef block: introduce device_add_disk() ... Browse Code »

In preparation for removing the ->driverfs_dev member of a gendisk, add
an api that takes the parent device as a parameter to add_disk(). For
now this maintains the status quo of WARN()ing on failure, but not
return a error code.

Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Reviewed-by: Bart Van Assche
Signed-off-by: Dan Williams

Dan Williams
2016-06-16 10:53:06 +0800

21 May, 2016

1 commit

635797857 include/linux/genhd.h: move to use generic UUID library ... Browse Code »

UUID library provides uuid_be type and uuid_be_to_bin() function. This
substitutes open coded variant by generic library calls.

Signed-off-by: Andy Shevchenko
Reviewed-by: Matt Fleming
Cc: Dmitry Kasatkin
Cc: Mimi Zohar
Cc: Rasmus Villemoes
Cc: Arnd Bergmann
Cc: "Theodore Ts'o"
Cc: Al Viro
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2016-05-21 08:58:30 +0800