Eric Lee / smarc-fsl-linux-kernel

04 Jul, 2019

4 commits

867363429 btrfs: migrate the delalloc space stuff to it's own home ... Browse Code »

We have code for data and metadata reservations for delalloc. There's
quite a bit of code here, and it's used in a lot of places so I've
separated it out to it's own file. inode.c and file.c are already
pretty large, and this code is complicated enough to live in its own
space.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-04 23:26:17 +0800
fb6dea266 btrfs: migrate btrfs_trans_release_chunk_metadata ... Browse Code »

Move this into transaction.c with the rest of the transaction related
code.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-04 23:26:17 +0800
6ef03debd btrfs: migrate the delayed refs rsv code ... Browse Code »

These belong with the delayed refs related code, not in extent-tree.c.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-04 23:26:17 +0800
9978059be btrfs: Evaluate io_tree in find_lock_delalloc_range() ... Browse Code »

Simplification. No point passing the tree variable when it can be
evaluated from inode. The tests now use the io_tree from btrfs_inode as
opposed to creating one.

Signed-off-by: Goldwyn Rodrigues
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Goldwyn Rodrigues
2019-07-04 23:26:17 +0800

02 Jul, 2019

36 commits

67f9c2209 btrfs: migrate the global_block_rsv helpers to block-rsv.c ... Browse Code »

These helpers belong in block-rsv.c

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:55 +0800
550fa228e btrfs: migrate the block-rsv code to block-rsv.c ... Browse Code »

This moves everything out of extent-tree.c to block-rsv.c.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:54 +0800
424a47805 btrfs: stop using block_rsv_release_bytes everywhere ... Browse Code »

block_rsv_release_bytes() is the internal to the block_rsv code, and
shouldn't be called directly by anything else. Switch all users to the
exported helpers.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:54 +0800
fcec36224 btrfs: cleanup the target logic in __btrfs_block_rsv_release ... Browse Code »

This works for all callers already, but if we wanted to use the helper
for the global_block_rsv it would end up trying to refill itself. Fix
the logic to be able to be used no matter which block rsv is passed in
to this helper.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:54 +0800
fed14b323 btrfs: export __btrfs_block_rsv_release ... Browse Code »

The delalloc reserve stuff calls this directly because it cares about
the qgroup accounting stuff, so export it so we can move it around. Fix
btrfs_block_rsv_release() to just be a static inline since it just calls
__btrfs_block_rsv_release() with NULL for the qgroup stuff.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:54 +0800
0b50174ad btrfs: export btrfs_block_rsv_add_bytes ... Browse Code »

This is used in a few places, we need to make sure it's exported so we
can move it around.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:54 +0800
d12ffdd1a btrfs: move btrfs_block_rsv definitions into it's own header ... Browse Code »

Prep work for separating out all of the block_rsv related code into its
own file.

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:53 +0800
9b4851bc4 btrfs: Simplify update of space_info in __reserve_metadata_bytes() ... Browse Code »

We don't need an if-else-if chain where we can use a simple OR since
both conditions are performing the same action. The short-circuit for OR
will ensure that if the first condition is true, can_overcommit() is not
called.

Reviewed-by: Nikolay Borisov
Signed-off-by: Goldwyn Rodrigues
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Goldwyn Rodrigues
2019-07-02 18:30:53 +0800
83d731a5b btrfs: unexport can_overcommit ... Browse Code »

Now that we've moved all of the users to space-info.c, unexport it and
name it back to can_overcommit.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:53 +0800
0d9764f6d btrfs: move reserve_metadata_bytes and supporting code to space-info.c ... Browse Code »

This moves all of the metadata reservation code into space-info.c.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:53 +0800
5da6afeb3 btrfs: move dump_space_info to space-info.c ... Browse Code »

We'll need this exported so we can use it in all the various was we need
to use it. This is prep work to move reserve_metadata_bytes.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:53 +0800
c2a67a76e btrfs: export block_rsv_use_bytes ... Browse Code »

We are going to need this to move the metadata reservation stuff to
space_info.c.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:53 +0800
b338b013e btrfs: move btrfs_space_info_add_*_bytes to space-info.c ... Browse Code »

Now that we've moved all the pre-requisite stuff, move these two
functions.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:52 +0800
bb96c4e57 btrfs: move the space info update macro to space-info.h ... Browse Code »

Also rename it to btrfs_space_info_update_* so it's clear what we're
updating.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:52 +0800
41783ef24 btrfs: move and export can_overcommit ... Browse Code »

This is the first piece of moving the space reservation code to
space-info.c

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:52 +0800
280c29088 btrfs: move the space_info handling code to space-info.c ... Browse Code »

These are the basic init and lookup functions and some helper functions,
fairly straightforward before the bad stuff starts.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:52 +0800
d44b72aa1 btrfs: export space_info_add_*_bytes ... Browse Code »

Prep work for consolidating all of the space_info code into one file.
We need to export these so multiple files can use them.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:52 +0800
fc471cb0c btrfs: rename do_chunk_alloc to btrfs_chunk_alloc ... Browse Code »

Really we just need the enum, but as we break more things up it'll help
to have this external to extent-tree.c.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:51 +0800
8719aaae8 btrfs: move space_info to space-info.h ... Browse Code »

Migrate the struct definition and the one helper that's in ctree.h into
space-info.h

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:51 +0800
e749af443 btrfs: lift bio_set_dev from bio allocation helpers ... Browse Code »

The block device is passed around for the only purpose to set it in new
bios. Move the assignment one level up. This is a preparatory patch for
further bdev cleanups.

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:51 +0800
e1ea2beee btrfs: use raid_attr for minimum stripe count in btrfs_calc_avail_data_space ... Browse Code »

Minimum stripe count matches the minimum devices required for a given
profile. The open coded assignments match the raid_attr table.

What's changed here is the meaning for RAID5/6. Previously their
min_stripes would be 1, while newly it's devs_min. This however shold be
the same as before because it's not possible to create filesystem on
fewer devices than the raid_attr table allows.

There's no adjustment regarding the parity stripes (like
calc_data_stripes does), because we're interested in overall space that
would fit on the devices.

Missing devices make no difference for the whole calculation, we have
the size stored in the structures.

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:51 +0800
4f080f571 btrfs: use raid_attr to adjust minimal stripe size in btrfs_calc_avail_data_space ... Browse Code »

Special case for DUP can be replaced by lookup to the attribute table,
where the dev_stripes is the right coefficient.

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:51 +0800
f262fa8de btrfs: drop default value assignments in enums ... Browse Code »

A few more instances whre we don't need to specify the values as long as
they are the same that enum assigns automatically. All of the enums are
in-memory only and nothing relies on the exact values.

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:50 +0800
2792237d0 btrfs: use common helpers for extent IO state insertion messages ... Browse Code »

Print the error messages using the helpers that also print the
filesystem identification.

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:50 +0800
63611e738 btrfs: run delayed iput at unlink time ... Browse Code »

We have been seeing issues in production where a cleaner script will end
up unlinking a bunch of files that have pending iputs. This means they
will get their final iput's run at btrfs-cleaner time and thus are not
throttled, which impacts the workload.

Since we are unlinking these files we can just drop the delayed iput at
unlink time. We are already holding a reference to the inode so this
will not be the final iput and thus is completely safe to do at this
point. Doing this means we are more likely to be doing the final iput
at unlink time, and thus will get the IO charged to the caller and get
throttled appropriately without affecting the main workload.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-07-02 18:30:50 +0800
179006688 Btrfs: add missing inode version, ctime and mtime updates when punching hole ... Browse Code »

If the range for which we are punching a hole covers only part of a page,
we end up updating the inode item but we skip the update of the inode's
iversion, mtime and ctime. Fix that by ensuring we update those properties
of the inode.

A patch for fstests test case generic/059 that tests this as been sent
along with this fix.

Fixes: 2aaa66558172b0 ("Btrfs: add hole punching")
Fixes: e8c1c76e804b18 ("Btrfs: add missing inode update when punching hole")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba

Filipe Manana
2019-07-02 18:30:50 +0800
803f0f64d Btrfs: fix fsync not persisting dentry deletions due to inode evictions ... Browse Code »

In order to avoid searches on a log tree when unlinking an inode, we check
if the inode being unlinked was logged in the current transaction, as well
as the inode of its parent directory. When any of the inodes are logged,
we proceed to delete directory items and inode reference items from the
log, to ensure that if a subsequent fsync of only the inode being unlinked
or only of the parent directory when the other is not fsync'ed as well,
does not result in the entry still existing after a power failure.

That check however is not reliable when one of the inodes involved (the
one being unlinked or its parent directory's inode) is evicted, since the
logged_trans field is transient, that is, it is not stored on disk, so it
is lost when the inode is evicted and loaded into memory again (which is
set to zero on load). As a consequence the checks currently being done by
btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
return true if the inode was evicted before, regardless of the inode
having been logged or not before (and in the current transaction), this
results in the dentry being unlinked still existing after a log replay
if after the unlink operation only one of the inodes involved is fsync'ed.

Example:

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt

$ mkdir /mnt/dir
$ touch /mnt/dir/foo
$ xfs_io -c fsync /mnt/dir/foo

# Keep an open file descriptor on our directory while we evict inodes.
# We just want to evict the file's inode, the directory's inode must not
# be evicted.
$ ( cd /mnt/dir; while true; do :; done ) &
$ pid=$!

# Wait a bit to give time to background process to chdir to our test
# directory.
$ sleep 0.5

# Trigger eviction of the file's inode.
$ echo 2 > /proc/sys/vm/drop_caches

# Unlink our file and fsync the parent directory. After a power failure
# we don't expect to see the file anymore, since we fsync'ed the parent
# directory.
$ rm -f $SCRATCH_MNT/dir/foo
$ xfs_io -c fsync /mnt/dir

$ mount /dev/sdb /mnt
$ ls /mnt/dir
foo
$
--> file still there, unlink not persisted despite explicit fsync on dir

Fix this by checking if the inode has the full_sync bit set in its runtime
flags as well, since that bit is set everytime an inode is loaded from
disk, or for other less common cases such as after a shrinking truncate
or failure to allocate extent maps for holes, and gets cleared after the
first fsync. Also consider the inode as possibly logged only if it was
last modified in the current transaction (besides having the full_fsync
flag set).

Fixes: 3a5f1d458ad161 ("Btrfs: Optimize btree walking while logging inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba

Filipe Manana
2019-07-02 18:30:50 +0800
89b798ad1 btrfs: Use btrfs_get_io_geometry appropriately ... Browse Code »

Presently btrfs_map_block is used not only to do everything necessary to
map a bio to the underlying allocation profile but it's also used to
identify how much data could be written based on btrfs' stripe logic
without actually submitting anything. This is achieved by passing NULL
for 'bbio_ret' parameter.

This patch refactors all callers that require just the mapping length
by switching them to using btrfs_io_geometry instead of calling
btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
change.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-07-02 18:30:50 +0800
5f1411265 btrfs: Introduce btrfs_io_geometry infrastructure ... Browse Code »

Add a structure that holds various parameters for IO calculations and a
helper that fills the values. This will help further refactoring and
reduction of functions that in some way open-coded the calculations.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-07-02 18:30:49 +0800
c9d713d5b btrfs: improve messages when updating feature flags ... Browse Code »

Currently the messages printed after setting an incompat feature are
cryptis, we can easily make it better as the textual description is
passed to the helpers. Old:

setting 128 feature flag

updated:

setting incompat feature flag for RAID56 (0x80)

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:49 +0800
6c64460cd btrfs: shut up bogus -Wmaybe-uninitialized warning ... Browse Code »

gcc sometimes can't determine whether a variable has been initialized
when both the initialization and the use are conditional:

fs/btrfs/props.c: In function 'inherit_props':
fs/btrfs/props.c:389:4: error: 'num_bytes' may be used uninitialized in this function [-Werror=maybe-uninitialized]
btrfs_block_rsv_release(fs_info, trans->block_rsv,

This code is fine. Unfortunately, I cannot think of a good way to
rephrase it in a way that makes gcc understand this, so I add a bogus
initialization the way one should not.

Signed-off-by: Arnd Bergmann
Reviewed-by: David Sterba
[ gcc 8 and 9 don't emit the warning ]
Signed-off-by: David Sterba

Arnd Bergmann
2019-07-02 18:30:49 +0800
9e967495e Btrfs: prevent send failures and crashes due to concurrent relocation ... Browse Code »

Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leafs from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer' content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:

[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984

Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send is still using it, it
got COWed and got reused as a leaf while send is still using, producing
the following trace:

[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---

Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.

To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.

Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Filipe Manana
2019-07-02 18:30:49 +0800
71a9c4885 btrfs: document BTRFS_MAX_MIRRORS ... Browse Code »

The real meaning of that constant is not clear from the context due to
the target device inclusion.

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:49 +0800
a07e8a468 btrfs: use mask for RAID56 profiles ... Browse Code »

We don't need to enumerate the profiles, use the mask for consistency.

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:48 +0800
c7369b3fa btrfs: add mask for all RAID1 types ... Browse Code »

Preparatory patch for additional RAID1 profiles with more copies. The
mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
work the same for the other profiles.

Signed-off-by: David Sterba

David Sterba
2019-07-02 18:30:48 +0800
e88439deb btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit() ... Browse Code »

[BUG]
Lockdep will report the following circular locking dependency:

WARNING: possible circular locking dependency detected
5.2.0-rc2-custom #24 Tainted: G O
------------------------------------------------------
btrfs/8631 is trying to acquire lock:
000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]

but task is already holding lock:
000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&fs_info->tree_log_mutex){+.+.}:
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_commit_transaction+0x475/0xa00 [btrfs]
btrfs_commit_super+0x71/0x80 [btrfs]
close_ctree+0x2bd/0x320 [btrfs]
btrfs_put_super+0x15/0x20 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x16/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x80
deactivate_super+0x51/0x60
cleanup_mnt+0x3f/0x80
__cleanup_mnt+0x12/0x20
task_work_run+0x94/0xb0
exit_to_usermode_loop+0xd8/0xe0
do_syscall_64+0x210/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #1 (&fs_info->reloc_mutex){+.+.}:
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_commit_transaction+0x40d/0xa00 [btrfs]
btrfs_quota_enable+0x2da/0x730 [btrfs]
btrfs_ioctl+0x2691/0x2b40 [btrfs]
do_vfs_ioctl+0xa9/0x6d0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
lock_acquire+0xa7/0x190
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_qgroup_inherit+0x40/0x620 [btrfs]
create_pending_snapshot+0x9d7/0xe60 [btrfs]
create_pending_snapshots+0x94/0xb0 [btrfs]
btrfs_commit_transaction+0x415/0xa00 [btrfs]
btrfs_mksubvol+0x496/0x4e0 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0xa90/0x2b40 [btrfs]
do_vfs_ioctl+0xa9/0x6d0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
&fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&fs_info->tree_log_mutex);
lock(&fs_info->reloc_mutex);
lock(&fs_info->tree_log_mutex);
lock(&fs_info->qgroup_ioctl_lock#2);

*** DEADLOCK ***

6 locks held by btrfs/8631:
#0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
#1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
#2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
#3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
#4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
#5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]

[CAUSE]
Due to the delayed subvolume creation, we need to call
btrfs_qgroup_inherit() inside commit transaction code, with a lot of
other mutex hold.
This hell of lock chain can lead to above problem.

[FIX]
On the other hand, we don't really need to hold qgroup_ioctl_lock if
we're in the context of create_pending_snapshot().
As in that context, we're the only one being able to modify qgroup.

All other qgroup functions which needs qgroup_ioctl_lock are either
holding a transaction handle, or will start a new transaction:
Functions will start a new transaction():
* btrfs_quota_enable()
* btrfs_quota_disable()
Functions hold a transaction handler:
* btrfs_add_qgroup_relation()
* btrfs_del_qgroup_relation()
* btrfs_create_qgroup()
* btrfs_remove_qgroup()
* btrfs_limit_qgroup()
* btrfs_qgroup_inherit() call inside create_subvol()

So we have a higher level protection provided by transaction, thus we
don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().

Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
create_pending_snapshot() is already protected by transaction.

So the fix is to detect the context by checking
trans->transaction->state.
If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction
context and no need to get the mutex.

Reported-by: Nikolay Borisov
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba

Qu Wenruo
2019-07-02 18:30:48 +0800