22 Apr, 2019
1 commit
-
commit 2da78092dda "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:Process1 Worker Process2
md_free
blkdev_open
del_gendisk
add delete_partition_work_fn() to wq
__blkdev_get
get_gendisk
put_disk
disk_release
kfree(disk)
find part from ext_devt_idr
get_disk_and_module(disk)
cause use after freedelete_partition_work_fn
put_device(part)
part_release
remove part from ext_devt_idrBefore is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.Thanks to Jan Kara for providing the solution and more clear comments
for the code.Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro
Reviewed-by: Bart Van Assche
Reviewed-by: Keith Busch
Reviewed-by: Jan Kara
Suggested-by: Jan Kara
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe
13 Apr, 2019
2 commits
-
Currently, an empty disk->events field tells the block layer not to
forward media change events to user space. This was done in commit
7c88a168da80 ("block: don't propagate unlisted DISK_EVENTs to userland")
in order to avoid events from "fringe" drivers to be forwarded to user
space. By doing so, the block layer lost the information which events
were supported by a particular block device, and most importantly,
whether or not a given device supports media change events at all.Prepare for not interpreting the "events" field this way in the future
any more. This is done by adding an additional field "event_flags" to
struct gendisk, and two flag bits that can be set to have the device
treated like one that had the "events" field set to a non-zero value
before. This applies only to the sd and sr drivers, which are changed to
set the new flags.The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
blocklayer to generate udev events from kernel events.In order to add the event_flags field to struct gendisk, the events
field is converted to an "unsigned short"; it doesn't need to hold
values bigger than 2 anyway.This patch doesn't change behavior.
Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe -
The async_events field, intended to be used for drivers that support
asynchronous notifications about disk events (aka media change events),
isn't currently used by any driver, and apparently that has been that
way for a long time (if not forever). Remove it.Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe
07 Apr, 2019
1 commit
-
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures. These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time. Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
10 Dec, 2018
4 commits
-
The previous patches deleted all the code that needed the second value
returned from part_in_flight - now the kernel only uses the first value.Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that
it only returns one value.This patch just refactors the code, there's no functional change.
Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe -
Now when part_round_stats is gone, we can switch to per-cpu in-flight
counters.We use the local-atomic type local_t, so that if part_inc_in_flight or
part_dec_in_flight is reentrantly called from an interrupt, the value will
be correct.The other counters could be corrupted due to reentrant interrupt, but the
corruption only results in slight counter skew - the in_flight counter
must be exact, so it needs local_t.Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe -
We want to convert to per-cpu in_flight counters.
The function part_round_stats needs the in_flight counter every jiffy, it
would be too costly to sum all the percpu variables every jiffy, so it
must be deleted. part_round_stats is used to calculate two counters -
time_in_queue and io_ticks.time_in_queue can be calculated without part_round_stats, by adding the
duration of the I/O when the I/O ends (the value is almost as exact as the
previously calculated value, except that time for in-progress I/Os is not
counted).io_ticks can be approximated by increasing the value when I/O is started
or ended and the jiffies value has changed. If the I/Os take less than a
jiffy, the value is as exact as the previously calculated value. If the
I/Os take more than a jiffy, io_ticks can drift behind the previously
calculated value.Signed-off-by: Mikulas Patocka
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe -
All of part_stat_* and related methods are used with preempt disabled,
so there is no need to pass cpu around to allow of them. Just call
smp_processor_id() as needed.Suggested-by: Jens Axboe
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe
29 Nov, 2018
1 commit
-
We recently got a stack by syzkaller like this:
BUG: sleeping function called from invalid context at mm/slab.h:361
in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
INFO: lockdep is turned off.
CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
0000000000000000 0000000000000001 0000000000000004 0000000000000000
Call Trace:
[] __dump_stack lib/dump_stack.c:15 [inline]
[] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
[] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
[] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
[] slab_pre_alloc_hook mm/slab.h:361 [inline]
[] slab_alloc_node mm/slub.c:2610 [inline]
[] slab_alloc mm/slub.c:2692 [inline]
[] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
[] kmalloc include/linux/slab.h:479 [inline]
[] kzalloc include/linux/slab.h:623 [inline]
[] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
[] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
[] kobject_cleanup lib/kobject.c:633 [inline]
[] kobject_release+0x229/0x440 lib/kobject.c:675
[] kref_sub include/linux/kref.h:73 [inline]
[] kref_put include/linux/kref.h:98 [inline]
[] kobject_put+0x72/0xd0 lib/kobject.c:692
[] put_device+0x25/0x30 drivers/base/core.c:1237
[] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
[] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
[] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
[] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
[] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
[] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
[] __do_softirq+0x299/0xe20 kernel/softirq.c:273
[] invoke_softirq kernel/softirq.c:350 [inline]
[] irq_exit+0x216/0x2c0 kernel/softirq.c:391
[] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
[] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
[] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
[] ? audit_kill_trees+0x180/0x180
[] fd_install+0x57/0x80 fs/file.c:626
[] do_sys_open+0x45e/0x550 fs/open.c:1043
[] SYSC_open fs/open.c:1055 [inline]
[] SyS_open+0x32/0x40 fs/open.c:1050
[] entry_SYSCALL_64_fastpath+0x1e/0x9aIn softirq context, we call rcu callback function delete_partition_rcu_cb(),
which may allocate memory by kzalloc with GFP_KERNEL flag. If the
allocation cannot be satisfied, it may sleep. However, That is not allowed
in softirq contex.Although we found this problem on linux 4.4, the latest kernel version
seems to have this problem as well. And it is very similar to the
previous one:
https://lkml.org/lkml/2018/7/9/391Fix it by using RCU workqueue, which allows sleep.
Reviewed-by: Paul E. McKenney
Signed-off-by: Yufen Yu
Signed-off-by: Jens Axboe
01 Oct, 2018
1 commit
-
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.Signed-off-by: Jens Axboe
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...Signed-off-by: Jens Axboe
28 Sep, 2018
1 commit
-
Update device_add_disk() to take an 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().Signed-off-by: Martin Wilck
Signed-off-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
22 Sep, 2018
1 commit
-
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
18 Jul, 2018
3 commits
-
Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats. These are tracked with the same four stats as reads
and writes:Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requestsThis is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.tj: Refreshed on top of v4.17 and other previous updates.
Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: Andy Newell
Signed-off-by: Jens Axboe -
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.tj: Refreshed on top of v4.17.
Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Cc: "Theodore Ts'o"
Cc: Jaegeuk Kim
Signed-off-by: Jens Axboe -
Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries. For example to sum up the number read and write
sectors completed. In addition to being ar reasonable cleanup by
itself this will make it easier to add new stat fields in the future.tj: Refreshed on top of v4.17.
Signed-off-by: Michael Callahan
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe
26 Apr, 2018
1 commit
-
When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
27 Feb, 2018
3 commits
-
When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:CPU0 CPU1 CPU2
del_gendisk()
bdev_unhash_inode(part1);blkdev_open(part1, O_EXCL) blkdev_open(part1, O_EXCL)
bdev = bd_acquire() bdev = bd_acquire()
blkdev_get(bdev)
bd_start_claiming(bdev)
- finds old inode 'whole'
bd_prepare_to_claim() -> 0
bdev_unhash_inode(whole);
blkdev_get(bdev);
bd_start_claiming(bdev)
- finds new inode 'whole'
bd_prepare_to_claim()
- this also succeeds as we have
different 'whole' here...
- bad things happen now as we
have two exclusive openers of
the same bdevThe problem here is that block device opens can see various intermediate
states while gendisk is shutting down and then being recreated.We fix the problem by introducing new lookup_sem in gendisk that
synchronizes gendisk deletion with get_gendisk() and furthermore by
making sure that get_gendisk() does not return gendisk that is being (or
has been) deleted. This makes sure that once we ever manage to look up
newly created bdev inode, we are also guaranteed that following
get_gendisk() will either return failure (and we fail open) or it
returns gendisk for the new device and following bdget_disk() will
return new bdev inode (i.e., blkdev_open() follows the path as if it is
completely run after new device is created).Reported-and-analyzed-by: Hou Tao
Tested-by: Hou Tao
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe -
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe -
Rename get_disk() to get_disk_and_module() to make sure what the
function does. It's not a great name but at least it is now clear that
put_disk() is not it's counterpart.Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe
15 Jan, 2018
1 commit
-
Since I can remember DM has forced the block layer to allow the
allocation and initialization of the request_queue to be distinct
operations. Reason for this is block/genhd.c:add_disk() has requires
that the request_queue (and associated bdi) be tied to the gendisk
before add_disk() is called -- because add_disk() also deals with
exposing the request_queue via blk_register_queue().DM's dynamic creation of arbitrary device types (and associated
request_queue types) requires the DM device's gendisk be available so
that DM table loads can establish a master/slave relationship with
subordinate devices that are referenced by loaded DM tables -- using
bd_link_disk_holder(). But until these DM tables, and their associated
subordinate devices, are known DM cannot know what type of request_queue
it needs -- nor what its queue_limits should be.This chicken and egg scenario has created all manner of problems for DM
and, at times, the block layer.Summary of changes:
- Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
that drivers may use to add a disk without also calling
blk_register_queue(). Driver must call blk_register_queue() once its
request_queue is fully initialized.- Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
is not set. It won't be set if driver used add_disk_no_queue_reg()
but driver encounters an error and must del_gendisk() before calling
blk_register_queue().- Export blk_register_queue().
These changes allow DM to use add_disk_no_queue_reg() to anchor its
gendisk as the "master" for master/slave relationships DM must establish
with subordinate devices referenced in DM tables that get loaded. Once
all "slave" devices for a DM device are known its request_queue can be
properly initialized and then advertised via sysfs -- important
improvement being that no request_queue resource initialization
performed by blk_register_queue() is missed for DM devices anymore.Signed-off-by: Mike Snitzer
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
15 Nov, 2017
1 commit
-
Pull core block layer updates from Jens Axboe:
"This is the main pull request for block storage for 4.15-rc1.Nothing out of the ordinary in here, and no API changes or anything
like that. Just various new features for drivers, core changes, etc.
In particular, this pull request contains:- A patch series from Bart, closing the whole on blk/scsi-mq queue
quescing.- A series from Christoph, building towards hidden gendisks (for
multipath) and ability to move bio chains around.- NVMe
- Support for native multipath for NVMe (Christoph).
- Userspace notifications for AENs (Keith).
- Command side-effects support (Keith).
- SGL support (Chaitanya Kulkarni)
- FC fixes and improvements (James Smart)
- Lots of fixes and tweaks (Various)- bcache
- New maintainer (Michael Lyle)
- Writeback control improvements (Michael)
- Various fixes (Coly, Elena, Eric, Liang, et al)- lightnvm updates, mostly centered around the pblk interface
(Javier, Hans, and Rakesh).- Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
- Writeback series that fix the much discussed hundreds of millions
of sync-all units. This goes all the way, as discussed previously
(me).- Fix for missing wakeup on writeback timer adjustments (Yafang
Shao).- Fix laptop mode on blk-mq (me).
- {mq,name} tupple lookup for IO schedulers, allowing us to have
alias names. This means you can use 'deadline' on both !mq and on
mq (where it's called mq-deadline). (me).- blktrace race fix, oopsing on sg load (me).
- blk-mq optimizations (me).
- Obscure waitqueue race fix for kyber (Omar).
- NBD fixes (Josef).
- Disable writeback throttling by default on bfq, like we do on cfq
(Luca Miccio).- Series from Ming that enable us to treat flush requests on blk-mq
like any other request. This is a really nice cleanup.- Series from Ming that improves merging on blk-mq with schedulers,
getting us closer to flipping the switch on scsi-mq again.- BFQ updates (Paolo).
- blk-mq atomic flags memory ordering fixes (Peter Z).
- Loop cgroup support (Shaohua).
- Lots of minor fixes from lots of different folks, both for core and
driver code"* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
nvme: fix visibility of "uuid" ns attribute
blk-mq: fixup some comment typos and lengths
ide: ide-atapi: fix compile error with defining macro DEBUG
blk-mq: improve tag waiting setup for non-shared tags
brd: remove unused brd_mutex
blk-mq: only run the hardware queue if IO is pending
block: avoid null pointer dereference on null disk
fs: guard_bio_eod() needs to consider partitions
xtensa/simdisk: fix compile error
nvme: expose subsys attribute to sysfs
nvme: create 'slaves' and 'holders' entries for hidden controllers
block: create 'slaves' and 'holders' entries for hidden gendisks
nvme: also expose the namespace identification sysfs files for mpath nodes
nvme: implement multipath access to nvme subsystems
nvme: track shared namespaces
nvme: introduce a nvme_ns_ids structure
nvme: track subsystems
block, nvme: Introduce blk_mq_req_flags_t
block, scsi: Make SCSI quiesce and resume work reliably
block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
...
11 Nov, 2017
1 commit
-
guard_bio_eod() needs to look at the partition capacity, not just the
capacity of the whole device, when determining if truncation is
necessary.[ 60.268688] attempt to access beyond end of device
[ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
[ 60.268693] buffer_io_error: 2 callbacks suppressed
[ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page readFixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Cc: stable@vger.kernel.org # v4.13
Reviewed-by: Christoph Hellwig
Signed-off-by: Greg Edwards
Signed-off-by: Jens Axboe
07 Nov, 2017
1 commit
-
Conflicts:
include/linux/compiler-clang.h
include/linux/compiler-gcc.h
include/linux/compiler-intel.h
include/uapi/linux/stddef.hSigned-off-by: Ingo Molnar
04 Nov, 2017
2 commits
-
With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as user
facing block device. This will be useful for the NVMe multipath
implementation.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
The hidden gendisks introduced in the next patch need to keep the dev
field in their struct device empty so that udev won't try to create
block device nodes for them. To support that rewrite disk_devt to
look at the major and first_minor fields in the gendisk itself instead
of looking into the struct device.Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Signed-off-by: Jens Axboe
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
26 Oct, 2017
1 commit
-
Darrick posted the following warning and Dave Chinner analyzed it:
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: G W
> ------------------------------------------------------
> loop0/31693 is trying to acquire lock:
> (&(&ip->i_mmaplock)->mr_lock){++++}, at: [] xfs_ilock+0x23c/0x330 [xfs]
>
> but now in release context of a crosslock acquired at the following:
> ((complete)&ret.event){+.+.}, at: [] submit_bio_wait+0x7f/0xb0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)&ret.event){+.+.}:
> lock_acquire+0xab/0x200
> wait_for_completion_io+0x4e/0x1a0
> submit_bio_wait+0x7f/0xb0
> blkdev_issue_zeroout+0x71/0xa0
> xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
> xfs_bmapi_write+0x374/0x11f0 [xfs]
> xfs_iomap_write_direct+0x2ac/0x430 [xfs]
> xfs_file_iomap_begin+0x20d/0xd50 [xfs]
> iomap_apply+0x43/0xe0
> dax_iomap_rw+0x89/0xf0
> xfs_file_dax_write+0xcc/0x220 [xfs]
> xfs_file_write_iter+0xf0/0x130 [xfs]
> __vfs_write+0xd9/0x150
> vfs_write+0xc8/0x1c0
> SyS_write+0x45/0xa0
> entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #1 (&xfs_nondir_ilock_class){++++}:
> lock_acquire+0xab/0x200
> down_write_nested+0x4a/0xb0
> xfs_ilock+0x263/0x330 [xfs]
> xfs_setattr_size+0x152/0x370 [xfs]
> xfs_vn_setattr+0x6b/0x90 [xfs]
> notify_change+0x27d/0x3f0
> do_truncate+0x5b/0x90
> path_openat+0x237/0xa90
> do_filp_open+0x8a/0xf0
> do_sys_open+0x11c/0x1f0
> entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #0 (&(&ip->i_mmaplock)->mr_lock){++++}:
> up_write+0x1c/0x40
> xfs_iunlock+0x1d0/0x310 [xfs]
> xfs_file_fallocate+0x8a/0x310 [xfs]
> loop_queue_work+0xb7/0x8d0
> kthread_worker_fn+0xb9/0x1f0
>
> Chain exists of:
> &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)&ret.event
>
> Possible unsafe locking scenario by crosslock:
>
> CPU0 CPU1
> ---- ----
> lock(&xfs_nondir_ilock_class);
> lock((complete)&ret.event);
> lock(&(&ip->i_mmaplock)->mr_lock);
> unlock((complete)&ret.event);
>
> *** DEADLOCK ***The warning is a false positive, caused by the fact that all
wait_for_completion()s in submit_bio_wait() are waiting with the same
lock class.However, some bios have nothing to do with others, for example in the case
of loop devices, there's no direct connection between the bios of an upper
device and the bios of a lower device(=loop device).The safest way to assign different lock classes to different devices is
to do it for each gendisk. In other words, this patch assigns a
lockdep_map per gendisk and uses it when initializing completion in
submit_bio_wait().Analyzed-by: Dave Chinner
Reported-by: Darrick J. Wong
Signed-off-by: Byungchul Park
Reviewed-by: Jens Axboe
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: amir73il@gmail.com
Cc: axboe@kernel.dk
Cc: david@fromorbit.com
Cc: hch@infradead.org
Cc: idryomov@gmail.com
Cc: johan@kernel.org
Cc: johannes.berg@intel.com
Cc: kernel-team@lge.com
Cc: linux-block@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Cc: oleg@redhat.com
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1508921765-15396-10-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar
10 Aug, 2017
3 commits
-
We don't have to inc/dec some counter, since we can just
iterate the tags. That makes inc/dec a noop, but means we
have to iterate busy tags to get an in-flight count.Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe -
Instead of returning the count that matches the partition, pass
in an array of two ints. Index 0 will be filled with the inflight
count for the partition in question, and index 1 will filled
with the root inflight count, if the partition passed in is not the
root.This is in preparation for being able to calculate both in one
go.Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe -
No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe
05 Jun, 2017
1 commit
-
This helper was only used by IMA of all things, which would get spurious
errors if CONFIG_BLOCK is disabled. Just opencode the call there.Signed-off-by: Christoph Hellwig
Reviewed-by: Amir Goldstein
Acked-by: Mimi Zohar
Reviewed-by: Andy Shevchenko
22 Apr, 2017
1 commit
-
Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:if (bi->profile)
disk->queue->backing_dev_info->capabilities |=
BDI_CAP_STABLE_WRITES;
else
disk->queue->backing_dev_info->capabilities &=
~BDI_CAP_STABLE_WRITES;It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time. This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there. This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen"
Cc: Christoph Hellwig
Cc: Mike Snitzer
Cc: stable@vger.kernel.org # 4.4+, needs backporting
Tested-by: Dan Williams
Signed-off-by: Ilya Dryomov
Signed-off-by: Jens Axboe
25 Mar, 2017
1 commit
-
blk_integrity_profile's are never modified, so mark them 'const' so that
they are placed in .rodata and benefit from memory protection.Signed-off-by: Eric Biggers
Signed-off-by: Jens Axboe
09 Mar, 2017
1 commit
-
This reverts commit 0dba1314d4f81115dce711292ec7981d17231064. It causes
leaking of device numbers for SCSI when SCSI registers multiple gendisks
for one request_queue in succession. It can be easily reproduced using
Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
Furthermore the protection provided by this commit is not needed anymore
as the problem it was fixing got also fixed by commit 165a5e22fafb
"block: Move bdi_unregister() to del_gendisk()".[1]: http://marc.info/?l=linux-block&m=148554717109098&w=2
Signed-off-by: Jan Kara
Acked-by: Dan Williams
Tested-by: Omar Sandoval
Signed-off-by: Jens Axboe
02 Feb, 2017
1 commit
-
Warnings of the following form occur because scsi reuses a devt number
while the block layer still has it referenced as the name of the bdi
[1]:WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
[..]
Call Trace:
dump_stack+0x86/0xc3
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
? kernfs_path_from_node+0x4f/0x60
sysfs_warn_dup+0x62/0x80
sysfs_create_dir_ns+0x77/0x90
kobject_add_internal+0xb2/0x350
kobject_add+0x75/0xd0
device_add+0x15a/0x650
device_create_groups_vargs+0xe0/0xf0
device_create_vargs+0x1c/0x20
bdi_register+0x90/0x240
? lockdep_init_map+0x57/0x200
bdi_register_owner+0x36/0x60
device_add_disk+0x1bb/0x4e0
? __pm_runtime_use_autosuspend+0x5c/0x70
sd_probe_async+0x10d/0x1c0
async_run_entry_fn+0x39/0x170This is a brute-force fix to pass the devt release information from
sd_probe() to the locations where we register the bdi,
device_add_disk(), and unregister the bdi, blk_cleanup_queue().Thanks to Omar for the quick reproducer script [2]. This patch survives
where an unmodified kernel fails in a few seconds.[1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4
[2]: http://marc.info/?l=linux-block&m=148554717109098&w=2Cc: James Bottomley
Cc: Bart Van Assche
Cc: "Martin K. Petersen"
Cc: Jan Kara
Reported-by: Omar Sandoval
Tested-by: Omar Sandoval
Signed-off-by: Dan Williams
Reviewed-by: Christoph Hellwig
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
23 Dec, 2016
1 commit
-
blk_scsi_cmd_filter use was deprecated by 4beab5c6 and the SCSI macros
are duplicated in blkdev.h, both likely reintroduced by a bad merge from
540eed56.Signed-off-by: Jon Derrick
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
11 Oct, 2016
1 commit
-
The __latent_entropy gcc attribute can be used only on functions and
variables. If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents. The variable must
be an integer, an integer array type or a structure with integer fields.These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.Signed-off-by: Emese Revfy
[kees: expanded commit message]
Signed-off-by: Kees Cook
28 Jun, 2016
1 commit
-
Now that all drivers that specify a ->driverfs_dev have been converted
to device_add_disk(), the pointer can be removed from struct gendisk.Cc: Jens Axboe
Cc: Ross Zwisler
Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Signed-off-by: Dan Williams
16 Jun, 2016
1 commit
-
In preparation for removing the ->driverfs_dev member of a gendisk, add
an api that takes the parent device as a parameter to add_disk(). For
now this maintains the status quo of WARN()ing on failure, but not
return a error code.Reviewed-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
Reviewed-by: Bart Van Assche
Signed-off-by: Dan Williams
21 May, 2016
1 commit
-
UUID library provides uuid_be type and uuid_be_to_bin() function. This
substitutes open coded variant by generic library calls.Signed-off-by: Andy Shevchenko
Reviewed-by: Matt Fleming
Cc: Dmitry Kasatkin
Cc: Mimi Zohar
Cc: Rasmus Villemoes
Cc: Arnd Bergmann
Cc: "Theodore Ts'o"
Cc: Al Viro
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds