26 Oct, 2020
1 commit
-
Dave reported a problem with my rwsem conversion patch where we got the
following lockdep splat:======================================================
WARNING: possible circular locking dependency detected
5.9.0-default+ #1297 Not tainted
------------------------------------------------------
kswapd0/76 is trying to acquire lock:
ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]but task is already holding lock:
ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (fs_reclaim){+.+.}-{0:0}:
__lock_acquire+0x582/0xac0
lock_acquire+0xca/0x430
fs_reclaim_acquire.part.0+0x25/0x30
kmem_cache_alloc+0x30/0x9c0
alloc_inode+0x81/0x90
iget_locked+0xcd/0x1a0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x210
sysfs_get_tree+0x1a/0x50
vfs_get_tree+0x1d/0xb0
path_mount+0x70f/0xa80
do_mount+0x75/0x90
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x2d/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9-> #3 (kernfs_mutex){+.+.}-{3:3}:
__lock_acquire+0x582/0xac0
lock_acquire+0xca/0x430
__mutex_lock+0xa0/0xaf0
kernfs_add_one+0x23/0x150
kernfs_create_dir_ns+0x58/0x80
sysfs_create_dir_ns+0x70/0xd0
kobject_add_internal+0xbb/0x2d0
kobject_add+0x7a/0xd0
btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
open_ctree+0x981/0x1108 [btrfs]
btrfs_mount_root.cold+0xe/0xb0 [btrfs]
legacy_get_tree+0x2d/0x60
vfs_get_tree+0x1d/0xb0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
legacy_get_tree+0x2d/0x60
vfs_get_tree+0x1d/0xb0
path_mount+0x70f/0xa80
do_mount+0x75/0x90
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x2d/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9-> #2 (btrfs-extent-00){++++}-{3:3}:
__lock_acquire+0x582/0xac0
lock_acquire+0xca/0x430
down_read_nested+0x45/0x220
__btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
__btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
btrfs_search_slot+0x6d4/0xfd0 [btrfs]
check_committed_ref+0x69/0x200 [btrfs]
btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
run_delalloc_nocow+0x446/0x9b0 [btrfs]
btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
writepage_delalloc+0xae/0x160 [btrfs]
__extent_writepage+0x262/0x420 [btrfs]
extent_write_cache_pages+0x2b6/0x510 [btrfs]
extent_writepages+0x43/0x90 [btrfs]
do_writepages+0x40/0xe0
__writeback_single_inode+0x62/0x610
writeback_sb_inodes+0x20f/0x500
wb_writeback+0xef/0x4a0
wb_do_writeback+0x49/0x2e0
wb_workfn+0x81/0x340
process_one_work+0x233/0x5d0
worker_thread+0x50/0x3b0
kthread+0x137/0x150
ret_from_fork+0x1f/0x30-> #1 (btrfs-fs-00){++++}-{3:3}:
__lock_acquire+0x582/0xac0
lock_acquire+0xca/0x430
down_read_nested+0x45/0x220
__btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
__btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
btrfs_search_slot+0x6d4/0xfd0 [btrfs]
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
__btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
__btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
__btrfs_run_delayed_items+0x8e/0x140 [btrfs]
btrfs_commit_transaction+0x367/0xbc0 [btrfs]
btrfs_mksubvol+0x2db/0x470 [btrfs]
btrfs_mksnapshot+0x7b/0xb0 [btrfs]
__btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
btrfs_ioctl+0xd0b/0x2690 [btrfs]
__x64_sys_ioctl+0x6f/0xa0
do_syscall_64+0x2d/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
check_prev_add+0x91/0xc60
validate_chain+0xa6e/0x2a20
__lock_acquire+0x582/0xac0
lock_acquire+0xca/0x430
__mutex_lock+0xa0/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x3cc/0x560 [btrfs]
evict+0xd6/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x121/0x1a0
do_shrink_slab+0x16d/0x3b0
shrink_slab+0xb1/0x2e0
shrink_node+0x230/0x6a0
balance_pgdat+0x325/0x750
kswapd+0x206/0x4d0
kthread+0x137/0x150
ret_from_fork+0x1f/0x30other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaimPossible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);*** DEADLOCK ***
3 locks held by kswapd0/76:
#0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
#2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50stack backtrace:
CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Call Trace:
dump_stack+0x77/0x97
check_noncircular+0xff/0x110
? save_trace+0x50/0x470
check_prev_add+0x91/0xc60
validate_chain+0xa6e/0x2a20
? save_trace+0x50/0x470
__lock_acquire+0x582/0xac0
lock_acquire+0xca/0x430
? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
__mutex_lock+0xa0/0xaf0
? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
? __lock_acquire+0x582/0xac0
? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
? btrfs_evict_inode+0x30b/0x560 [btrfs]
? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x3cc/0x560 [btrfs]
evict+0xd6/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x121/0x1a0
do_shrink_slab+0x16d/0x3b0
shrink_slab+0xb1/0x2e0
shrink_node+0x230/0x6a0
balance_pgdat+0x325/0x750
kswapd+0x206/0x4d0
? finish_wait+0x90/0x90
? balance_pgdat+0x750/0x750
kthread+0x137/0x150
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x1f/0x30This happens because we are still holding the path open when we start
adding the sysfs files for the block groups, which creates a dependency
on fs_reclaim via the tree lock. Fix this by dropping the path before
we start doing anything with sysfs.Reported-by: David Sterba
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Anand Jain
Reviewed-by: Filipe Manana
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba
07 Oct, 2020
6 commits
-
While running xfstests btrfs/177 I got the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc3+ #5 Not tainted
------------------------------------------------------
kswapd0/100 is trying to acquire lock:
ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330but task is already holding lock:
ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x65/0x80
slab_pre_alloc_hook.constprop.0+0x20/0x200
kmem_cache_alloc+0x37/0x270
alloc_inode+0x82/0xb0
iget_locked+0x10d/0x2c0
kernfs_get_inode+0x1b/0x130
kernfs_get_tree+0x136/0x240
sysfs_get_tree+0x16/0x40
vfs_get_tree+0x28/0xc0
path_mount+0x434/0xc00
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9-> #2 (kernfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
kernfs_add_one+0x23/0x150
kernfs_create_dir_ns+0x7a/0xb0
sysfs_create_dir_ns+0x60/0xb0
kobject_add_internal+0xc0/0x2c0
kobject_add+0x6e/0x90
btrfs_sysfs_add_block_group_type+0x102/0x160
btrfs_make_block_group+0x167/0x230
btrfs_alloc_chunk+0x54f/0xb80
btrfs_chunk_alloc+0x18e/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_insert_empty_items+0x64/0xb0
btrfs_new_inode+0x225/0x730
btrfs_create+0xab/0x1f0
lookup_open.isra.0+0x52d/0x690
path_openat+0x2a7/0x9e0
do_filp_open+0x75/0x100
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7e0
btrfs_chunk_alloc+0x125/0x3a0
find_free_extent+0xdf6/0x1210
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb0/0x310
alloc_tree_block_no_bg_flush+0x4a/0x60
__btrfs_cow_block+0x11a/0x530
btrfs_cow_block+0x104/0x220
btrfs_search_slot+0x52e/0x9d0
btrfs_lookup_inode+0x2a/0x8f
__btrfs_update_delayed_inode+0x80/0x240
btrfs_commit_inode_delayed_inode+0x119/0x120
btrfs_evict_inode+0x357/0x500
evict+0xcf/0x1f0
do_unlinkat+0x1a9/0x2b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
__mutex_lock+0x7e/0x7e0
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
kthread+0x138/0x160
ret_from_fork+0x1f/0x30other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> kernfs_mutex --> fs_reclaimPossible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(kernfs_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);*** DEADLOCK ***
3 locks held by kswapd0/100:
#0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
#2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0stack backtrace:
CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb8
check_noncircular+0x12d/0x150
__lock_acquire+0x119c/0x1fc0
lock_acquire+0xa7/0x3d0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
__mutex_lock+0x7e/0x7e0
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? __btrfs_release_delayed_node.part.0+0x3f/0x330
? lock_acquire+0xa7/0x3d0
? find_held_lock+0x2b/0x80
__btrfs_release_delayed_node.part.0+0x3f/0x330
btrfs_evict_inode+0x24c/0x500
evict+0xcf/0x1f0
dispose_list+0x48/0x70
prune_icache_sb+0x44/0x50
super_cache_scan+0x161/0x1e0
do_shrink_slab+0x178/0x3c0
shrink_slab+0x17c/0x290
shrink_node+0x2b2/0x6d0
balance_pgdat+0x30a/0x670
kswapd+0x213/0x4c0
? _raw_spin_unlock_irqrestore+0x41/0x50
? add_wait_queue_exclusive+0x70/0x70
? balance_pgdat+0x670/0x670
kthread+0x138/0x160
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30This happens because when we link in a block group with a new raid index
type we'll create the corresponding sysfs entries for it. This is
problematic because while restriping we're holding the chunk_mutex, and
while mounting we're holding the tree locks.Fixing this isn't pretty, we move the call to the sysfs stuff into the
btrfs_create_pending_block_groups() work, where we're not holding any
locks. This creates a slight race where other threads could see that
there's no sysfs kobj for that raid type, and race to create the
sysfs dir. Fix this by wrapping the creation in space_info->lock, so we
only get one thread calling kobject_add() for the new directory. We
don't worry about the lock on cleanup as it only gets deleted on
unmount.On mount it's more straightforward, we loop through the space_infos
already, just check every raid index in each space_info and added the
sysfs entries for the corresponding block groups.Signed-off-by: Josef Bacik
Signed-off-by: David Sterba -
We have this thing wrapped in an RCU lock, but it's really not needed.
We create all the space_info's on mount, and we destroy them on unmount.
The list never changes and we're protected from messing with it by the
normal mount/umount path, so kill the RCU stuff around it.Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Since it's inclusion on 9afc66498a0b ("btrfs: block-group: refactor how
we read one block group item") this function always returned 0, so there
is no need to check for the returned value.Reviewed-by: Nikolay Borisov
Signed-off-by: Marcos Paulo de Souza
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
If we have compression on we could free up more space than we reserved,
and thus be able to make a space reservation. Add the call for this
scenario.Reviewed-by: Nikolay Borisov
Tested-by: Nikolay Borisov
Reviewed-by: Johannes Thumshirn
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
We were missing a call to btrfs_try_granting_tickets in
btrfs_free_reserved_bytes, so add it to handle the case where we're able
to satisfy an allocation because we've freed a pending reservation.Reviewed-by: Nikolay Borisov
Tested-by: Nikolay Borisov
Reviewed-by: Johannes Thumshirn
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".Reviewed-by: Nikolay Borisov
Signed-off-by: Randy Dunlap
Reviewed-by: David Sterba
Signed-off-by: David Sterba
27 Aug, 2020
1 commit
-
[BUG]
After commit 9afc66498a0b ("btrfs: block-group: refactor how we read one
block group item"), cache->length is being assigned after calling
btrfs_create_block_group_cache. This causes a problem since
set_free_space_tree_thresholds calculates the free-space threshold to
decide if the free-space tree should convert from extents to bitmaps.The current code calls set_free_space_tree_thresholds with cache->length
being 0, which then makes cache->bitmap_high_thresh zero. This implies
the system will always use bitmap instead of extents, which is not
desired if the block group is not fragmented.This behavior can be seen by a test that expects to repair systems
with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only
created FREE_SPACE_BITMAP.[FIX]
Call set_free_space_tree_thresholds after setting cache->length. There
is now a WARN_ON in set_free_space_tree_thresholds to help preventing
the same mistake to happen again in the future.Link: https://github.com/kdave/btrfs-progs/issues/251
Fixes: 9afc66498a0b ("btrfs: block-group: refactor how we read one block group item")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Qu Wenruo
Reviewed-by: Filipe Manana
Signed-off-by: Marcos Paulo de Souza
Signed-off-by: David Sterba
27 Jul, 2020
9 commits
-
Previously we depended on some weird behavior in our chunk allocator to
force the allocation of new stripes, so by the time we got to doing the
reduce we would usually already have a chunk with the proper target.However that behavior causes other problems and needs to be removed.
First however we need to remove this check to only restripe if we
already have those available profiles, because if we're allocating our
first chunk it obviously will not be available. Simply use the target
as specified, and if that fails it'll be because we're out of space.Tested-by: Holger Hoffstätte
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba -
btrfs/061 has been failing consistently for me recently with a
transaction abort. We run out of space in the system chunk array, which
means we've allocated way too many system chunks than we need.Chris added this a long time ago for balance as a poor mans restriping.
If you had a single disk and then added another disk and then did a
balance, update_block_group_flags would then figure out which RAID level
you needed.Fast forward to today and we have restriping behavior, so we can
explicitly tell the fs that we're trying to change the raid level. This
is accomplished through the normal get_alloc_profile path.Furthermore this code actually causes btrfs/061 to fail, because we do
things like mkfs -m dup -d single with multiple devices. This trips
this checkalloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but
not actually restriping, we keep forcing chunk allocation of RAID1
chunks. This eventually causes us to run out of system space and the
file system aborts and flips read only.We don't need this poor mans restriping any more, simply use the normal
get_alloc_profile helper, which will get the correct alloc_flags and
thus make the right decision for chunk allocation. This keeps us from
allocating a billion system chunks and falling over.Tested-by: Holger Hoffstätte
Reviewed-by: Qu Wenruo
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba -
We have refcount_t now with the associated library to handle refcounts,
which gives us extra debugging around reference count mistakes that may
be made. For example it'll warn on any transition from 0->1 or 0->-1,
which is handy for noticing cases where we've messed up reference
counting. Convert the block group ref counting from an atomic_t to
refcount_t and use the appropriate helpers.Reviewed-by: Filipe Manana
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode
directly.Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Initially when the 'removed' flag was added to a block group to avoid
races between block group removal and fitrim, by commit 04216820fe83d5
("Btrfs: fix race between fs trimming and block group remove/allocation"),
we had to lock the chunks mutex because we could be moving the block
group from its current list, the pending chunks list, into the pinned
chunks list, or we could just be adding it to the pinned chunks if it was
not in the pending chunks list. Both lists were protected by the chunk
mutex.However we no longer have those lists since commit 1c11b63eff2a67
("btrfs: replace pending/pinned chunks lists with io tree"), and locking
the chunk mutex is no longer necessary because of that. The same happens
at btrfs_unfreeze_block_group(), we lock the chunk mutex because the block
group's extent map could be part of the pinned chunks list and the call
to remove_extent_mapping() could be deleting it from that list, which
used to be protected by that mutex.So just remove those lock and unlock calls as they are not needed anymore.
Reviewed-by: Nikolay Borisov
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba -
When find_first_block_group() finds a block group item in the extent-tree,
it does a lookup of the object in the extent mapping tree and does further
checks on the item.Factor out this step from find_first_block_group() so we can further
simplify the code.While we're at it, we can also just return early in
find_first_block_group(), if the tree slot isn't found.Signed-off-by: Johannes Thumshirn
Signed-off-by: David Sterba -
We already have an fs_info in our function parameters, there's no need
to do the maths again and get fs_info from the extent_root just to get
the mapping_tree.Instead directly grab the mapping_tree from fs_info.
Reviewed-by: Nikolay Borisov
Signed-off-by: Johannes Thumshirn
Signed-off-by: David Sterba -
Adresses held in 'logical' array are always guaranteed to fall within
the boundaries of the block group. That is, 'start' can never be
smaller than cache->start. This invariant follows from the way the
address are calculated in btrfs_rmap_block:stripe_nr = physical - map->stripes[i].physical;
stripe_nr = div64_u64(stripe_nr, map->stripe_len);
bytenr = chunk_start + stripe_nr * io_stripe_size;I.e it's always some IO stripe within the given chunk.
Exploit this invariant to simplify the body of the loop by removing the
unnecessary 'if' since its 'else' part is the one always executed.Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba -
extent_map::orig_block_len contains the size of a physical stripe when
it's used to describe block groups (calculated in read_one_chunk via
calc_stripe_length or calculated in decide_stripe_size and then assigned
to extent_map::orig_block_len in create_chunk). Exploit this fact to get
the size directly rather than opencoding the calculations. No functional
changes.Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba
17 Jun, 2020
2 commits
-
There is a race between block group removal and block group creation
when the removal is completed by a task running fitrim or scrub. When
this happens we end up failing the block group creation with an error
-EEXIST since we attempt to insert a duplicate block group item key
in the extent tree. That results in a transaction abort.The race happens like this:
1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
block group X with btrfs_freeze_block_group() (until very recently
that was named btrfs_get_block_group_trimming());2) Task B starts removing block group X, either because it's now unused
or due to relocation for example. So at btrfs_remove_block_group(),
while holding the chunk mutex and the block group's lock, it sets
the 'removed' flag of the block group and it sets the local variable
'remove_em' to false, because the block group is currently frozen
(its 'frozen' counter is > 0, until very recently this counter was
named 'trimming');3) Task B unlocks the block group and the chunk mutex;
4) Task A is done trimming the block group and unfreezes the block group
by calling btrfs_unfreeze_block_group() (until very recently this was
named btrfs_put_block_group_trimming()). In this function we lock the
block group and set the local variable 'cleanup' to true because we
were able to decrement the block group's 'frozen' counter down to 0 and
the flag 'removed' is set in the block group.Since 'cleanup' is set to true, it locks the chunk mutex and removes
the extent mapping representing the block group from the mapping tree;5) Task C allocates a new block group Y and it picks up the logical address
that block group X had as the logical address for Y, because X was the
block group with the highest logical address and now the second block
group with the highest logical address, the last in the fs mapping tree,
ends at an offset corresponding to block group X's logical address (this
logical address selection is done at volumes.c:find_next_chunk()).At this point the new block group Y does not have yet its item added
to the extent tree (nor the corresponding device extent items and
chunk item in the device and chunk trees). The new group Y is added to
the list of pending block groups in the transaction handle;6) Before task B proceeds to removing the block group item for block
group X from the extent tree, which has a key matching:(X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length)
task C while ending its transaction handle calls
btrfs_create_pending_block_groups(), which finds block group Y and
tries to insert the block group item for Y into the exten tree, which
fails with -EEXIST since logical offset is the same that X had and
task B hasn't yet deleted the key from the extent tree.
This failure results in a transaction abort, producing a stack like
the following:------------[ cut here ]------------
BTRFS: Transaction aborted (error -17)
WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
Code: ff ff ff 48 8b 55 50 f0 48 (...)
RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001
RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001
R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10
R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000
FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs]
btrfs_commit_transaction+0xd0/0xc50 [btrfs]
? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
? __ia32_sys_fdatasync+0x20/0x20
iterate_supers+0xdb/0x180
ksys_sync+0x60/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f2ce1d4d5b7
Code: 83 c4 08 48 3d 01 (...)
RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7
RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c
RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00
R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032
R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20
irq event stamp: 0
hardirqs last enabled at (0): [] 0x0
hardirqs last disabled at (0): [] copy_process+0x74f/0x2020
softirqs last enabled at (0): [] copy_process+0x74f/0x2020
softirqs last disabled at (0): [] 0x0
---[ end trace bd7c03622e0b0a9c ]---Fix this simply by making btrfs_remove_block_group() remove the block
group's item from the extent tree before it flags the block group as
removed. Also make the free space deletion from the free space tree
before flagging the block group as removed, to avoid a similar race
with adding and removing free space entries for the free space tree.Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba -
When removing a block group, if we fail to delete the block group's item
from the extent tree, we jump to the 'out' label and end up decrementing
the block group's reference count once only (by 1), resulting in a counter
leak because the block group at that point was already removed from the
block group cache rbtree - so we have to decrement the reference count
twice, once for the rbtree and once for our lookup at the start of the
function.There is a second bug where if removing the free space tree entries (the
call to remove_block_group_free_space()) fails we end up jumping to the
'out_put_group' label but end up decrementing the reference count only
once, when we should have done it twice, since we have already removed
the block group from the block group cache rbtree. This happens because
the reference count decrement for the rbtree reference happens after
attempting to remove the free space tree entries, which is far away from
the place where we remove the block group from the rbtree.To make things less error prone, decrement the reference count for the
rbtree immediately after removing the block group from it. This also
eleminates the need for two different exit labels on error, renaming
'out_put_label' to just 'out' and removing the old 'out'.Fixes: f6033c5e333238 ("btrfs: fix block group leak when removing fails")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov
Reviewed-by: Anand Jain
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
25 May, 2020
10 commits
-
disk-io.h is included more than once in block-group.c, remove it.
Reviewed-by: Johannes Thumshirn
Signed-off-by: Tiezhu Yang
Signed-off-by: David Sterba -
The name of this function contains the word "cache", which is left from
the times where btrfs_block_group was called btrfs_block_group_cache.Now this "cache" doesn't match anything, and we have better namings for
functions like read/insert/remove_block_group_item().Rename it to update_block_group_item().
Reviewed-by: Johannes Thumshirn
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Currently the block group item insert is pretty straight forward, fill
the block group item structure and insert it into extent tree.However the incoming skinny block group feature is going to change this,
so this patch will refactor insertion into a new function,
insert_block_group_item(), to make the incoming feature easier to add.Reviewed-by: Johannes Thumshirn
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
When deleting a block group item, it's pretty straight forward, just
delete the item pointed by the key. However it will not be that
straight-forward for incoming skinny block group item.So refactor the block group item deletion into a new function,
remove_block_group_item(), also to make the already lengthy
btrfs_remove_block_group() a little shorter.Reviewed-by: Johannes Thumshirn
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Structure btrfs_block_group has the following members which are
currently read from on-disk block group item and key:- length - from item key
- used
- flags - from block group itemHowever for incoming skinny block group tree, we are going to read those
members from different sources.This patch will refactor such read by:
- Don't initialize btrfs_block_group::length at allocation
Caller should initialize them manually.
Also to avoid possible (well, only two callers) missing
initialization, add extra ASSERT() in btrfs_add_block_group_cache().- Refactor length/used/flags initialization into one function
The new function, fill_one_block_group() will handle the
initialization of such members.- Use btrfs_block_group::length to replace key::offset
Since skinny block group item would have a different meaning for its
key offset.Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Regular block group items in extent tree are scattered inside the huge
tree, thus forward readahead makes no sense.Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
used to be named btrfs_get_block_group_trimming() and
btrfs_put_block_group_trimming() respectively.At the time they were added to free-space-cache.c, by commit e33e17ee1098
("btrfs: add missing discards when unpinning extents with -o discard")
because all the trimming related functions were in free-space-cache.c.Now that the helpers were renamed and are used in scrub context as well,
move them to block-group.c, a much more logical location for them.Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Back in 2014, commit 04216820fe83d5 ("Btrfs: fix race between fs trimming
and block group remove/allocation"), I added the 'trimming' member to the
block group structure. Its purpose was to prevent races between trimming
and block group deletion/allocation by pinning the block group in a way
that prevents its logical address and device extents from being reused
while trimming is in progress for a block group, so that if another task
deletes the block group and then another task allocates a new block group
that gets the same logical address and device extents while the trimming
task is still in progress.After the previous fix for scrub (patch "btrfs: fix a race between scrub
and block group removal/allocation"), scrub now also has the same needs that
trimming has, so the member name 'trimming' no longer makes sense.
Since there is already a 'pinned' member in the block group that refers
to space reservations (pinned bytes), rename the member to 'frozen',
add a comment on top of it to describe its general purpose and rename
the helpers to increment and decrement the counter as well, to match
the new member name.The next patch in the series will move the helpers into a more suitable
file (from free-space-cache.c to block-group.c).Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
At clean_pinned_extents(), whether we end up returning success or failure,
we pretty much have to do the same things:1) unlock unused_bg_unpin_mutex
2) decrement reference count on the previous transactionWe also call btrfs_dec_block_group_ro() in case of failure, but that is
better done in its caller, btrfs_delete_unused_bgs(), since its the
caller that calls inc_block_group_ro(), so it should be responsible for
the decrement operation, as it is in case any of the other functions it
calls fail.So move the call to btrfs_dec_block_group_ro() from clean_pinned_extents()
into btrfs_delete_unused_bgs() and unify the error and success return
paths for clean_pinned_extents(), reducing duplicated code and making it
simpler.Signed-off-by: Filipe Manana
Signed-off-by: David Sterba -
For unlink transactions and block group removal
btrfs_start_transaction_fallback_global_rsv will first try to start an
ordinary transaction and if it fails it will fall back to reserving the
required amount by stealing from the global reserve. This is problematic
because of all the same reasons we had with previous iterations of the
ENOSPC handling, thundering herd. We get a bunch of failures all at
once, everybody tries to allocate from the global reserve, some win and
some lose, we get an ENSOPC.Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
used to mark unlink reservation. To fix this we need to integrate this
logic into the normal ENOSPC infrastructure. We still go through all of
the normal flushing work, and at the moment we begin to fail all the
tickets we try to satisfy any tickets that are allowed to steal by
stealing from the global reserve. If this works we start the flushing
system over again just like we would with a normal ticket satisfaction.
This serializes our global reserve stealing, so we don't have the
thundering herd problem.Reviewed-by: Nikolay Borisov
Tested-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba
23 Apr, 2020
2 commits
-
btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
returns a local reference of the block group that contains the given
bytenr to "block_group" with increased refcount.When btrfs_remove_block_group() returns, "block_group" becomes invalid,
so the refcount should be decreased to keep refcount balanced.The reference counting issue happens in several exception handling paths
of btrfs_remove_block_group(). When those error scenarios occur such as
btrfs_alloc_path() returns NULL, the function forgets to decrease its
refcnt increased by btrfs_lookup_block_group() and will cause a refcnt
leak.Fix this issue by jumping to "out_put_group" label and calling
btrfs_put_block_group() when those error scenarios occur.CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Xiyu Yang
Signed-off-by: Xin Tan
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
When cleaning pinned extents right before deleting an unused block group,
we check if there's still a previous transaction running and if so we
increment its reference count before using it for cleaning pinned ranges
in its pinned extents iotree. However we ended up never decrementing the
reference count after using the transaction, resulting in a memory leak.Fix it by decrementing the reference count.
Fixes: fe119a6eeb6705 ("btrfs: switch to per-transaction pinned extents")
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
09 Apr, 2020
1 commit
-
Whenever we add a ticket to a space_info object we increment the object's
reclaim_size counter witht the ticket's bytes, and we decrement it with
the corresponding amount only when we are able to grant the requested
space to the ticket. When we are not able to grant the space to a ticket,
or when the ticket is removed due to a signal (e.g. an application has
received sigterm from the terminal) we never decrement the counter with
the corresponding bytes from the ticket. This leak can result in the
space reclaim code to later do much more work than necessary. So fix it
by decrementing the counter when those two cases happen as well.Fixes: db161806dc5615 ("btrfs: account ticket size at add/delete time")
Reviewed-by: Nikolay Borisov
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba
24 Mar, 2020
4 commits
-
The space_info list is normally RCU protected and should be traversed
with rcu_read_lock held. There's a warning[29.104756] WARNING: suspicious RCU usage
[29.105046] 5.6.0-rc4-next-20200305 #1 Not tainted
[29.105231] -----------------------------
[29.105401] fs/btrfs/block-group.c:2011 RCU-list traversed in non-reader section!!pointing out that the locking is missing in btrfs_read_block_groups.
However this is not necessary as the list traversal happens at mount
time when there's no other thread potentially accessing the list.To fix the warning and for consistency let's add the RCU lock/unlock,
the code won't be affected much as it's doing some lightweight
operations.Reported-by: Guenter Roeck
Signed-off-by: Madhuparna Bhowmik
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.Reviewed-by: Josef Bacik
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Next patch is going to refactor how pinned extents are tracked which
will necessitate changing this code. To ease that work and contain the
changes factor the code now in preparation, this will also help review.Reviewed-by: Josef Bacik
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
The status of aborted transaction can change between calls and it needs
to be accessed by READ_ONCE. Add a helper that also wraps the unlikely
hint.Reviewed-by: Josef Bacik
Signed-off-by: David Sterba
21 Mar, 2020
1 commit
-
We are incorrectly dropping the raid56 and raid1c34 incompat flags when
there are still raid56 and raid1c34 block groups, not when we do not any
of those anymore. The logic just got unintentionally broken after adding
the support for the raid1c34 modes.Fix this by clear the flags only if we do not have block groups with the
respective profiles.Fixes: 9c907446dce3 ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
31 Jan, 2020
2 commits
-
inc_block_group_ro does a calculation to see if we have enough room left
over if we mark this block group as read only in order to see if it's ok
to mark the block group as read only.The problem is this calculation _only_ works for data, where our used is
always less than our total. For metadata we will overcommit, so this
will almost always fail for metadata.Fix this by exporting btrfs_can_overcommit, and then see if we have
enough space to remove the remaining free space in the block group we
are trying to mark read only. If we do then we can mark this block
group as read only.Reviewed-by: Qu Wenruo
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
For some reason we've translated the do_chunk_alloc that goes into
btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are
two different things.force for inc_block_group_ro is used when we are forcing the block group
read only no matter what, for example when the underlying chunk is
marked read only. We need to not do the space check here as this block
group needs to be read only.btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
we need to pre-allocate a chunk before marking the block group read
only. This has nothing to do with forcing, and in fact we _always_ want
to do the space check in this case, so unconditionally pass false for
force in this case.Then fixup inc_block_group_ro to honor force as it's expected and
documented to do.Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba
24 Jan, 2020
1 commit
-
Move variables to appropriate scope. Remove last BUG_ON in the function
and rework error handling accordingly. Make the duplicate detection code
more straightforward. Use in_range macro. And give variables more
descriptive name by explicitly distinguishing between IO stripe size
(size recorded in the chunk item) and data stripe size (the size of
an actual stripe, constituting a logical chunk/block group).Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba