26 Oct, 2020

1 commit

  • Dave reported a problem with my rwsem conversion patch where we got the
    following lockdep splat:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0-default+ #1297 Not tainted
    ------------------------------------------------------
    kswapd0/76 is trying to acquire lock:
    ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]

    but task is already holding lock:
    ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #4 (fs_reclaim){+.+.}-{0:0}:
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    fs_reclaim_acquire.part.0+0x25/0x30
    kmem_cache_alloc+0x30/0x9c0
    alloc_inode+0x81/0x90
    iget_locked+0xcd/0x1a0
    kernfs_get_inode+0x1b/0x130
    kernfs_get_tree+0x136/0x210
    sysfs_get_tree+0x1a/0x50
    vfs_get_tree+0x1d/0xb0
    path_mount+0x70f/0xa80
    do_mount+0x75/0x90
    __x64_sys_mount+0x8e/0xd0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #3 (kernfs_mutex){+.+.}-{3:3}:
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    __mutex_lock+0xa0/0xaf0
    kernfs_add_one+0x23/0x150
    kernfs_create_dir_ns+0x58/0x80
    sysfs_create_dir_ns+0x70/0xd0
    kobject_add_internal+0xbb/0x2d0
    kobject_add+0x7a/0xd0
    btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
    btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
    open_ctree+0x981/0x1108 [btrfs]
    btrfs_mount_root.cold+0xe/0xb0 [btrfs]
    legacy_get_tree+0x2d/0x60
    vfs_get_tree+0x1d/0xb0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    legacy_get_tree+0x2d/0x60
    vfs_get_tree+0x1d/0xb0
    path_mount+0x70f/0xa80
    do_mount+0x75/0x90
    __x64_sys_mount+0x8e/0xd0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #2 (btrfs-extent-00){++++}-{3:3}:
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    down_read_nested+0x45/0x220
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot+0x6d4/0xfd0 [btrfs]
    check_committed_ref+0x69/0x200 [btrfs]
    btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
    run_delalloc_nocow+0x446/0x9b0 [btrfs]
    btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
    writepage_delalloc+0xae/0x160 [btrfs]
    __extent_writepage+0x262/0x420 [btrfs]
    extent_write_cache_pages+0x2b6/0x510 [btrfs]
    extent_writepages+0x43/0x90 [btrfs]
    do_writepages+0x40/0xe0
    __writeback_single_inode+0x62/0x610
    writeback_sb_inodes+0x20f/0x500
    wb_writeback+0xef/0x4a0
    wb_do_writeback+0x49/0x2e0
    wb_workfn+0x81/0x340
    process_one_work+0x233/0x5d0
    worker_thread+0x50/0x3b0
    kthread+0x137/0x150
    ret_from_fork+0x1f/0x30

    -> #1 (btrfs-fs-00){++++}-{3:3}:
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    down_read_nested+0x45/0x220
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot+0x6d4/0xfd0 [btrfs]
    btrfs_lookup_inode+0x3a/0xc0 [btrfs]
    __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
    __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
    __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
    btrfs_commit_transaction+0x367/0xbc0 [btrfs]
    btrfs_mksubvol+0x2db/0x470 [btrfs]
    btrfs_mksnapshot+0x7b/0xb0 [btrfs]
    __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
    btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
    btrfs_ioctl+0xd0b/0x2690 [btrfs]
    __x64_sys_ioctl+0x6f/0xa0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
    check_prev_add+0x91/0xc60
    validate_chain+0xa6e/0x2a20
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    __mutex_lock+0xa0/0xaf0
    __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    btrfs_evict_inode+0x3cc/0x560 [btrfs]
    evict+0xd6/0x1c0
    dispose_list+0x48/0x70
    prune_icache_sb+0x54/0x80
    super_cache_scan+0x121/0x1a0
    do_shrink_slab+0x16d/0x3b0
    shrink_slab+0xb1/0x2e0
    shrink_node+0x230/0x6a0
    balance_pgdat+0x325/0x750
    kswapd+0x206/0x4d0
    kthread+0x137/0x150
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

    Possible unsafe locking scenario:

    CPU0                              CPU1
    ----                              ----
    lock(fs_reclaim);
                                      lock(kernfs_mutex);
                                      lock(fs_reclaim);
    lock(&delayed_node->mutex);

    *** DEADLOCK ***

    3 locks held by kswapd0/76:
    #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
    #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
    #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50

    stack backtrace:
    CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0x77/0x97
    check_noncircular+0xff/0x110
    ? save_trace+0x50/0x470
    check_prev_add+0x91/0xc60
    validate_chain+0xa6e/0x2a20
    ? save_trace+0x50/0x470
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    __mutex_lock+0xa0/0xaf0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    ? __lock_acquire+0x582/0xac0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    ? btrfs_evict_inode+0x30b/0x560 [btrfs]
    ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    btrfs_evict_inode+0x3cc/0x560 [btrfs]
    evict+0xd6/0x1c0
    dispose_list+0x48/0x70
    prune_icache_sb+0x54/0x80
    super_cache_scan+0x121/0x1a0
    do_shrink_slab+0x16d/0x3b0
    shrink_slab+0xb1/0x2e0
    shrink_node+0x230/0x6a0
    balance_pgdat+0x325/0x750
    kswapd+0x206/0x4d0
    ? finish_wait+0x90/0x90
    ? balance_pgdat+0x750/0x750
    kthread+0x137/0x150
    ? kthread_mod_delayed_work+0xc0/0xc0
    ret_from_fork+0x1f/0x30

    This happens because we are still holding the path open when we start
    adding the sysfs files for the block groups, which creates a dependency
    on fs_reclaim via the tree lock. Fix this by dropping the path before
    we start doing anything with sysfs.
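
    A minimal sketch of the reordering, assuming the local variable names
    path and cache used in btrfs_read_block_groups():

    /* drop the extent tree path, and with it the tree locks, first */
    btrfs_release_path(path);

    /* only then do any sysfs work for the block group's raid type */
    btrfs_sysfs_add_block_group_type(cache);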

    Reported-by: David Sterba
    CC: stable@vger.kernel.org # 5.8+
    Reviewed-by: Anand Jain
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

07 Oct, 2020

6 commits

  • While running xfstests btrfs/177 I got the following lockdep splat

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0-rc3+ #5 Not tainted
    ------------------------------------------------------
    kswapd0/100 is trying to acquire lock:
    ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

    but task is already holding lock:
    ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (fs_reclaim){+.+.}-{0:0}:
    fs_reclaim_acquire+0x65/0x80
    slab_pre_alloc_hook.constprop.0+0x20/0x200
    kmem_cache_alloc+0x37/0x270
    alloc_inode+0x82/0xb0
    iget_locked+0x10d/0x2c0
    kernfs_get_inode+0x1b/0x130
    kernfs_get_tree+0x136/0x240
    sysfs_get_tree+0x16/0x40
    vfs_get_tree+0x28/0xc0
    path_mount+0x434/0xc00
    __x64_sys_mount+0xe3/0x120
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #2 (kernfs_mutex){+.+.}-{3:3}:
    __mutex_lock+0x7e/0x7e0
    kernfs_add_one+0x23/0x150
    kernfs_create_dir_ns+0x7a/0xb0
    sysfs_create_dir_ns+0x60/0xb0
    kobject_add_internal+0xc0/0x2c0
    kobject_add+0x6e/0x90
    btrfs_sysfs_add_block_group_type+0x102/0x160
    btrfs_make_block_group+0x167/0x230
    btrfs_alloc_chunk+0x54f/0xb80
    btrfs_chunk_alloc+0x18e/0x3a0
    find_free_extent+0xdf6/0x1210
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb0/0x310
    alloc_tree_block_no_bg_flush+0x4a/0x60
    __btrfs_cow_block+0x11a/0x530
    btrfs_cow_block+0x104/0x220
    btrfs_search_slot+0x52e/0x9d0
    btrfs_insert_empty_items+0x64/0xb0
    btrfs_new_inode+0x225/0x730
    btrfs_create+0xab/0x1f0
    lookup_open.isra.0+0x52d/0x690
    path_openat+0x2a7/0x9e0
    do_filp_open+0x75/0x100
    do_sys_openat2+0x7b/0x130
    __x64_sys_openat+0x46/0x70
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
    __mutex_lock+0x7e/0x7e0
    btrfs_chunk_alloc+0x125/0x3a0
    find_free_extent+0xdf6/0x1210
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb0/0x310
    alloc_tree_block_no_bg_flush+0x4a/0x60
    __btrfs_cow_block+0x11a/0x530
    btrfs_cow_block+0x104/0x220
    btrfs_search_slot+0x52e/0x9d0
    btrfs_lookup_inode+0x2a/0x8f
    __btrfs_update_delayed_inode+0x80/0x240
    btrfs_commit_inode_delayed_inode+0x119/0x120
    btrfs_evict_inode+0x357/0x500
    evict+0xcf/0x1f0
    do_unlinkat+0x1a9/0x2b0
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
    __lock_acquire+0x119c/0x1fc0
    lock_acquire+0xa7/0x3d0
    __mutex_lock+0x7e/0x7e0
    __btrfs_release_delayed_node.part.0+0x3f/0x330
    btrfs_evict_inode+0x24c/0x500
    evict+0xcf/0x1f0
    dispose_list+0x48/0x70
    prune_icache_sb+0x44/0x50
    super_cache_scan+0x161/0x1e0
    do_shrink_slab+0x178/0x3c0
    shrink_slab+0x17c/0x290
    shrink_node+0x2b2/0x6d0
    balance_pgdat+0x30a/0x670
    kswapd+0x213/0x4c0
    kthread+0x138/0x160
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

    Possible unsafe locking scenario:

    CPU0                              CPU1
    ----                              ----
    lock(fs_reclaim);
                                      lock(kernfs_mutex);
                                      lock(fs_reclaim);
    lock(&delayed_node->mutex);

    *** DEADLOCK ***

    3 locks held by kswapd0/100:
    #0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
    #1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
    #2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

    stack backtrace:
    CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
    dump_stack+0x8b/0xb8
    check_noncircular+0x12d/0x150
    __lock_acquire+0x119c/0x1fc0
    lock_acquire+0xa7/0x3d0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    __mutex_lock+0x7e/0x7e0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    ? lock_acquire+0xa7/0x3d0
    ? find_held_lock+0x2b/0x80
    __btrfs_release_delayed_node.part.0+0x3f/0x330
    btrfs_evict_inode+0x24c/0x500
    evict+0xcf/0x1f0
    dispose_list+0x48/0x70
    prune_icache_sb+0x44/0x50
    super_cache_scan+0x161/0x1e0
    do_shrink_slab+0x178/0x3c0
    shrink_slab+0x17c/0x290
    shrink_node+0x2b2/0x6d0
    balance_pgdat+0x30a/0x670
    kswapd+0x213/0x4c0
    ? _raw_spin_unlock_irqrestore+0x41/0x50
    ? add_wait_queue_exclusive+0x70/0x70
    ? balance_pgdat+0x670/0x670
    kthread+0x138/0x160
    ? kthread_create_worker_on_cpu+0x40/0x40
    ret_from_fork+0x1f/0x30

    This happens because when we link in a block group with a new raid index
    type we'll create the corresponding sysfs entries for it. This is
    problematic because while restriping we're holding the chunk_mutex, and
    while mounting we're holding the tree locks.

    Fixing this isn't pretty: we move the call to the sysfs stuff into the
    btrfs_create_pending_block_groups() work, where we're not holding any
    locks. This creates a slight race where other threads could see that
    there's no sysfs kobj for that raid type and race to create the
    sysfs dir. Fix this by wrapping the creation in space_info->lock, so we
    only get one thread calling kobject_add() for the new directory (see
    the sketch below). We don't worry about the lock on cleanup as it only
    gets deleted on unmount.

    On mount it's more straightforward: we already loop through the
    space_infos, so just check every raid index in each space_info and add
    the sysfs entries for the corresponding block groups.
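
    A rough sketch of the check-and-claim pattern described above, assuming
    the surrounding code of btrfs_sysfs_add_block_group_type() (rkobj being
    the raid kobject about to be added); note that only claiming the slot
    happens under the spinlock:

    int index = btrfs_bg_flags_to_raid_index(cache->flags);

    spin_lock(&space_info->lock);
    if (space_info->block_group_kobjs[index]) {
            /* another thread already created this raid directory */
            spin_unlock(&space_info->lock);
            kobject_put(&rkobj->kobj);
            return;
    }
    space_info->block_group_kobjs[index] = &rkobj->kobj;
    spin_unlock(&space_info->lock);

    /* kobject_add() can sleep, so it runs outside the lock */
    ret = kobject_add(&rkobj->kobj, &space_info->kobj, "%s",
                      btrfs_bg_type_to_raid_name(rkobj->flags));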

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We have this thing wrapped in an RCU lock, but it's really not needed.
    We create all the space_info's on mount, and we destroy them on unmount.
    The list never changes and we're protected from messing with it by the
    normal mount/umount path, so kill the RCU stuff around it.
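
    After this change the traversal is a plain list walk, with no RCU
    read-side section (sketch):

    /* the space_info list is static between mount and unmount */
    list_for_each_entry(space_info, &fs_info->space_info, list) {
            /* ... per space_info work ... */
    }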

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Since its inclusion in commit 9afc66498a0b ("btrfs: block-group: refactor
    how we read one block group item") this function has always returned 0,
    so there is no need to check the returned value.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Marcos Paulo de Souza
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Marcos Paulo de Souza
     
  • If we have compression on we could free up more space than we reserved,
    and thus be able to make a space reservation. Add the call for this
    scenario.

    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We were missing a call to btrfs_try_granting_tickets in
    btrfs_free_reserved_bytes, so add it to handle the case where we're able
    to satisfy an allocation because we've freed a pending reservation.
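
    A sketch of where the call lands, assuming the existing shape of
    btrfs_free_reserved_bytes(); the new line is the
    btrfs_try_granting_tickets() call, made while space_info->lock is still
    held:

    spin_lock(&space_info->lock);
    spin_lock(&cache->lock);
    if (cache->ro)
            space_info->bytes_readonly += num_bytes;
    cache->reserved -= num_bytes;
    space_info->bytes_reserved -= num_bytes;
    if (delalloc)
            cache->delalloc_bytes -= num_bytes;
    spin_unlock(&cache->lock);

    /* the freed reservation may now satisfy a waiting ticket */
    btrfs_try_granting_tickets(cache->fs_info, space_info);
    spin_unlock(&space_info->lock);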

    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Delete repeated words in fs/btrfs/.
    {to, the, a, and old}
    and change "into 2 part" to "into 2 parts".

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Randy Dunlap
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Randy Dunlap
     

27 Aug, 2020

1 commit

  • [BUG]
    After commit 9afc66498a0b ("btrfs: block-group: refactor how we read one
    block group item"), cache->length is being assigned after calling
    btrfs_create_block_group_cache. This causes a problem since
    set_free_space_tree_thresholds calculates the free-space threshold to
    decide if the free-space tree should convert from extents to bitmaps.

    The current code calls set_free_space_tree_thresholds with cache->length
    being 0, which then makes cache->bitmap_high_thresh zero. This implies
    the system will always use bitmaps instead of extents, which is not
    desired if the block group is not fragmented.

    This behavior can be seen with a test that expects to repair systems
    with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only
    creates FREE_SPACE_BITMAP.

    [FIX]
    Call set_free_space_tree_thresholds after setting cache->length. There
    is now a WARN_ON in set_free_space_tree_thresholds to help prevent the
    same mistake from happening again in the future.
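
    A minimal sketch of both halves of the fix (the key variable stands for
    the block group item key at the call site, and the exact warning text is
    illustrative):

    /* in set_free_space_tree_thresholds(): catch a zero length early */
    if (WARN_ON(cache->length == 0))
            btrfs_warn(cache->fs_info, "block group %llu length is zero",
                       cache->start);

    /* at the call site: set the length from the item key first */
    cache->length = key->offset;
    set_free_space_tree_thresholds(cache);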

    Link: https://github.com/kdave/btrfs-progs/issues/251
    Fixes: 9afc66498a0b ("btrfs: block-group: refactor how we read one block group item")
    CC: stable@vger.kernel.org # 5.8+
    Reviewed-by: Qu Wenruo
    Reviewed-by: Filipe Manana
    Signed-off-by: Marcos Paulo de Souza
    Signed-off-by: David Sterba

    Marcos Paulo de Souza
     

27 Jul, 2020

9 commits

  • Previously we depended on some weird behavior in our chunk allocator to
    force the allocation of new stripes, so by the time we got to doing the
    reduce we would usually already have a chunk with the proper target.

    However that behavior causes other problems and needs to be removed.
    First, though, we need to remove this check that only restripes if the
    target profile is already available, because if we're allocating our
    first chunk it obviously will not be available. Simply use the target
    as specified, and if that fails it'll be because we're out of space.

    Tested-by: Holger Hoffstätte
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • btrfs/061 has been failing consistently for me recently with a
    transaction abort. We run out of space in the system chunk array, which
    means we've allocated far more system chunks than we need.

    Chris added this a long time ago for balance as a poor man's restriping.
    If you had a single disk and then added another disk and then did a
    balance, update_block_group_flags would then figure out which RAID level
    you needed.

    Fast forward to today and we have restriping behavior, so we can
    explicitly tell the fs that we're trying to change the raid level. This
    is accomplished through the normal get_alloc_profile path.

    Furthermore this code actually causes btrfs/061 to fail, because we do
    things like mkfs -m dup -d single with multiple devices. This trips
    this check

    alloc_flags = update_block_group_flags(fs_info, cache->flags);
    if (alloc_flags != cache->flags) {
            ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);

    in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but
    not actually restriping, we keep forcing chunk allocation of RAID1
    chunks. This eventually causes us to run out of system space and the
    file system aborts and flips read only.

    We don't need this poor man's restriping any more; simply use the normal
    get_alloc_profile helper, which will get the correct alloc_flags and
    thus make the right decision for chunk allocation (see the sketch
    below). This keeps us from allocating a billion system chunks and
    falling over.
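
    Roughly, the check quoted above becomes (sketch, assuming
    btrfs_get_alloc_profile() keeps its current signature):

    alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
    if (alloc_flags != cache->flags) {
            ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
            /* ... */
    }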

    Tested-by: Holger Hoffstätte
    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We have refcount_t now with the associated library to handle refcounts,
    which gives us extra debugging around reference count mistakes that may
    be made. For example it'll warn on any transition from 0->1 or 0->-1,
    which is handy for noticing cases where we've messed up reference
    counting. Convert the block group ref counting from an atomic_t to
    refcount_t and use the appropriate helpers.
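
    A sketch of what the conversion looks like for the block group get/put
    helpers, assuming the refs member becomes a refcount_t:

    void btrfs_get_block_group(struct btrfs_block_group *cache)
    {
            refcount_inc(&cache->refs);             /* was atomic_inc() */
    }

    void btrfs_put_block_group(struct btrfs_block_group *cache)
    {
            /* was atomic_dec_and_test() */
            if (refcount_dec_and_test(&cache->refs)) {
                    /* last reference dropped, free the block group */
            }
    }

    /* at allocation time */
    refcount_set(&cache->refs, 1);                  /* was atomic_set() */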

    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode
    directly.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Initially, when the 'removed' flag was added to a block group to avoid
    races between block group removal and fitrim, by commit 04216820fe83d5
    ("Btrfs: fix race between fs trimming and block group remove/allocation"),
    we had to lock the chunk mutex because we could be moving the block
    group from its current list, the pending chunks list, into the pinned
    chunks list, or we could just be adding it to the pinned chunks list if
    it was not in the pending chunks list. Both lists were protected by the
    chunk mutex.

    However we no longer have those lists since commit 1c11b63eff2a67
    ("btrfs: replace pending/pinned chunks lists with io tree"), and locking
    the chunk mutex is no longer necessary because of that. The same happens
    at btrfs_unfreeze_block_group(): we lock the chunk mutex because the block
    group's extent map could be part of the pinned chunks list and the call
    to remove_extent_mapping() could be deleting it from that list, which
    used to be protected by that mutex.

    So just remove those lock and unlock calls as they are not needed anymore.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • When find_first_block_group() finds a block group item in the extent-tree,
    it does a lookup of the object in the extent mapping tree and does further
    checks on the item.

    Factor out this step from find_first_block_group() so we can further
    simplify the code.

    While we're at it, we can also just return early in
    find_first_block_group(), if the tree slot isn't found.

    Signed-off-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • We already have an fs_info in our function parameters, there's no need
    to do the maths again and get fs_info from the extent_root just to get
    the mapping_tree.

    Instead directly grab the mapping_tree from fs_info.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • Addresses held in the 'logical' array are always guaranteed to fall
    within the boundaries of the block group. That is, 'start' can never be
    smaller than cache->start. This invariant follows from the way the
    addresses are calculated in btrfs_rmap_block:

    stripe_nr = physical - map->stripes[i].physical;
    stripe_nr = div64_u64(stripe_nr, map->stripe_len);
    bytenr = chunk_start + stripe_nr * io_stripe_size;

    I.e., it's always some IO stripe within the given chunk.

    Exploit this invariant to simplify the body of the loop by removing the
    unnecessary 'if', since its 'else' branch is the one that is always
    executed.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • extent_map::orig_block_len contains the size of a physical stripe when
    it's used to describe block groups (calculated in read_one_chunk via
    calc_stripe_length or calculated in decide_stripe_size and then assigned
    to extent_map::orig_block_len in create_chunk). Exploit this fact to get
    the size directly rather than opencoding the calculations. No functional
    changes.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     

17 Jun, 2020

2 commits

  • There is a race between block group removal and block group creation
    when the removal is completed by a task running fitrim or scrub. When
    this happens we end up failing the block group creation with an error
    -EEXIST since we attempt to insert a duplicate block group item key
    in the extent tree. That results in a transaction abort.

    The race happens like this:

    1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
    block group X with btrfs_freeze_block_group() (until very recently
    that was named btrfs_get_block_group_trimming());

    2) Task B starts removing block group X, either because it's now unused
    or due to relocation for example. So at btrfs_remove_block_group(),
    while holding the chunk mutex and the block group's lock, it sets
    the 'removed' flag of the block group and it sets the local variable
    'remove_em' to false, because the block group is currently frozen
    (its 'frozen' counter is > 0, until very recently this counter was
    named 'trimming');

    3) Task B unlocks the block group and the chunk mutex;

    4) Task A is done trimming the block group and unfreezes the block group
    by calling btrfs_unfreeze_block_group() (until very recently this was
    named btrfs_put_block_group_trimming()). In this function we lock the
    block group and set the local variable 'cleanup' to true because we
    were able to decrement the block group's 'frozen' counter down to 0 and
    the flag 'removed' is set in the block group.

    Since 'cleanup' is set to true, it locks the chunk mutex and removes
    the extent mapping representing the block group from the mapping tree;

    5) Task C allocates a new block group Y and it picks up the logical address
    that block group X had as the logical address for Y, because X was the
    block group with the highest logical address and now the second block
    group with the highest logical address, the last in the fs mapping tree,
    ends at an offset corresponding to block group X's logical address (this
    logical address selection is done at volumes.c:find_next_chunk()).

    At this point the new block group Y does not have yet its item added
    to the extent tree (nor the corresponding device extent items and
    chunk item in the device and chunk trees). The new group Y is added to
    the list of pending block groups in the transaction handle;

    6) Before task B proceeds to removing the block group item for block
    group X from the extent tree, which has a key matching:

    (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length)

    task C while ending its transaction handle calls
    btrfs_create_pending_block_groups(), which finds block group Y and
    tries to insert the block group item for Y into the extent tree, which
    fails with -EEXIST since the logical offset is the same as X had and
    task B hasn't yet deleted the key from the extent tree.
    This failure results in a transaction abort, producing a stack like
    the following:

    ------------[ cut here ]------------
    BTRFS: Transaction aborted (error -17)
    WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
    Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
    CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
    Code: ff ff ff 48 8b 55 50 f0 48 (...)
    RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001
    RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10
    R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000
    FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs]
    btrfs_commit_transaction+0xd0/0xc50 [btrfs]
    ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
    ? __ia32_sys_fdatasync+0x20/0x20
    iterate_supers+0xdb/0x180
    ksys_sync+0x60/0xb0
    __ia32_sys_sync+0xa/0x10
    do_syscall_64+0x5c/0x280
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f2ce1d4d5b7
    Code: 83 c4 08 48 3d 01 (...)
    RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
    RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7
    RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c
    RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00
    R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032
    R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x74f/0x2020
    softirqs last enabled at (0): [] copy_process+0x74f/0x2020
    softirqs last disabled at (0): [] 0x0
    ---[ end trace bd7c03622e0b0a9c ]---

    Fix this simply by making btrfs_remove_block_group() remove the block
    group's item from the extent tree before it flags the block group as
    removed. Also do the free space deletion from the free space tree
    before flagging the block group as removed, to avoid a similar race
    with adding and removing free space entries for the free space tree.

    Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • When removing a block group, if we fail to delete the block group's item
    from the extent tree, we jump to the 'out' label and end up decrementing
    the block group's reference count once only (by 1), resulting in a counter
    leak because the block group at that point was already removed from the
    block group cache rbtree - so we have to decrement the reference count
    twice, once for the rbtree and once for our lookup at the start of the
    function.

    There is a second bug where if removing the free space tree entries (the
    call to remove_block_group_free_space()) fails we end up jumping to the
    'out_put_group' label but end up decrementing the reference count only
    once, when we should have done it twice, since we have already removed
    the block group from the block group cache rbtree. This happens because
    the reference count decrement for the rbtree reference happens after
    attempting to remove the free space tree entries, which is far away from
    the place where we remove the block group from the rbtree.

    To make things less error prone, decrement the reference count for the
    rbtree immediately after removing the block group from it. This also
    eliminates the need for two different exit labels on error, renaming
    'out_put_group' to just 'out' and removing the old 'out'.

    Fixes: f6033c5e333238 ("btrfs: fix block group leak when removing fails")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Anand Jain
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     

25 May, 2020

10 commits

  • disk-io.h is included more than once in block-group.c, so remove the
    duplicate include.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Tiezhu Yang
    Signed-off-by: David Sterba

    Tiezhu Yang
     
  • The name of this function contains the word "cache", which is left from
    the times where btrfs_block_group was called btrfs_block_group_cache.

    Now this "cache" doesn't match anything, and we have better namings for
    functions like read/insert/remove_block_group_item().

    Rename it to update_block_group_item().

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Currently the block group item insert is pretty straightforward: fill
    in the block group item structure and insert it into the extent tree.

    However the incoming skinny block group feature is going to change this,
    so this patch will refactor insertion into a new function,
    insert_block_group_item(), to make the incoming feature easier to add.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Deleting a block group item is pretty straightforward: just delete the
    item pointed to by the key. However, it will not be that
    straightforward for the incoming skinny block group item.

    So refactor the block group item deletion into a new function,
    remove_block_group_item(), also to make the already lengthy
    btrfs_remove_block_group() a little shorter.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Structure btrfs_block_group has the following members which are
    currently read from on-disk block group item and key:

    - length - from item key
    - used
    - flags - from block group item

    However for incoming skinny block group tree, we are going to read those
    members from different sources.

    This patch will refactor such read by:

    - Don't initialize btrfs_block_group::length at allocation
    Callers should initialize it manually.
    Also, to avoid a possible (well, only two callers) missing
    initialization, add an extra ASSERT() in btrfs_add_block_group_cache()
    (see the sketch after this list).

    - Refactor length/used/flags initialization into one function
    The new function, fill_one_block_group() will handle the
    initialization of such members.

    - Use btrfs_block_group::length to replace key::offset
    Since skinny block group item would have a different meaning for its
    key offset.
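
    For the first point, the extra sanity check is essentially (sketch):

    /* in btrfs_add_block_group_cache(): the caller must have set length */
    ASSERT(block_group->length != 0);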

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Regular block group items in the extent tree are scattered throughout
    the huge tree, thus forward readahead makes no sense.

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
    used to be named btrfs_get_block_group_trimming() and
    btrfs_put_block_group_trimming() respectively.

    They were added to free-space-cache.c by commit e33e17ee1098
    ("btrfs: add missing discards when unpinning extents with -o discard")
    because, at the time, all the trimming related functions were in
    free-space-cache.c.

    Now that the helpers were renamed and are used in scrub context as well,
    move them to block-group.c, a much more logical location for them.

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • Back in 2014, in commit 04216820fe83d5 ("Btrfs: fix race between fs
    trimming and block group remove/allocation"), I added the 'trimming'
    member to the block group structure. Its purpose was to prevent races
    between trimming and block group deletion/allocation by pinning the
    block group in a way that prevents its logical address and device
    extents from being reused while trimming is in progress for it, which
    could otherwise happen if another task deleted the block group and yet
    another task then allocated a new block group that got the same logical
    address and device extents while the trimming task was still running.

    After the previous fix for scrub (patch "btrfs: fix a race between scrub
    and block group removal/allocation"), scrub now also has the same needs that
    trimming has, so the member name 'trimming' no longer makes sense.
    Since there is already a 'pinned' member in the block group that refers
    to space reservations (pinned bytes), rename the member to 'frozen',
    add a comment on top of it to describe its general purpose and rename
    the helpers to increment and decrement the counter as well, to match
    the new member name.

    The next patch in the series will move the helpers into a more suitable
    file (from free-space-cache.c to block-group.c).

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • At clean_pinned_extents(), whether we end up returning success or failure,
    we pretty much have to do the same things:

    1) unlock unused_bg_unpin_mutex
    2) decrement reference count on the previous transaction

    We also call btrfs_dec_block_group_ro() in case of failure, but that is
    better done in its caller, btrfs_delete_unused_bgs(), since it's the
    caller that calls inc_block_group_ro(), so it should be responsible for
    the decrement operation, as it already is when any of the other
    functions it calls fail.

    So move the call to btrfs_dec_block_group_ro() from clean_pinned_extents()
    into btrfs_delete_unused_bgs() and unify the error and success return
    paths for clean_pinned_extents(), reducing duplicated code and making it
    simpler.
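
    With the paths unified, both exits funnel through a single label that
    does exactly the two steps listed above (sketch; the label name is
    illustrative):

    out:
            mutex_unlock(&fs_info->unused_bg_unpin_mutex);
            if (prev_trans)
                    btrfs_put_transaction(prev_trans);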

    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • For unlink transactions and block group removal
    btrfs_start_transaction_fallback_global_rsv will first try to start an
    ordinary transaction and if it fails it will fall back to reserving the
    required amount by stealing from the global reserve. This is problematic
    for all the same reasons we had with previous iterations of the ENOSPC
    handling: the thundering herd. We get a bunch of failures all at once,
    everybody tries to allocate from the global reserve, some win and some
    lose, and we get an ENOSPC.

    Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
    used to mark unlink reservations. To fix this we need to integrate this
    logic into the normal ENOSPC infrastructure. We still go through all of
    the normal flushing work, and at the moment we begin to fail all the
    tickets we try to satisfy any tickets that are allowed to steal by
    stealing from the global reserve. If this works we start the flushing
    system over again just like we would with a normal ticket satisfaction.
    This serializes our global reserve stealing, so we don't have the
    thundering herd problem.
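
    Conceptually, the failure path ends up doing something like the
    following (sketch only; the ticket->steal flag and the
    steal_from_global_rsv() helper are assumptions about how the stealing
    is wired up):

    /* while failing queued tickets after flushing has run out of ideas */
    if (ticket->steal &&
        steal_from_global_rsv(fs_info, space_info, ticket)) {
            /* satisfied from the global reserve: restart flushing */
            return true;
    }
    /* otherwise fail the ticket as before */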

    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

23 Apr, 2020

2 commits

  • btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
    returns in "block_group" a local reference to the block group that
    contains the given bytenr, with an increased refcount.

    When btrfs_remove_block_group() returns, "block_group" becomes invalid,
    so the refcount should be decreased to keep the refcount balanced.

    The reference counting issue happens in several exception handling paths
    of btrfs_remove_block_group(). When one of those error scenarios occurs,
    such as btrfs_alloc_path() returning NULL, the function forgets to
    decrease the refcount increased by btrfs_lookup_block_group(), which
    causes a refcount leak.

    Fix this issue by jumping to "out_put_group" label and calling
    btrfs_put_block_group() when those error scenarios occur.

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Xiyu Yang
    Signed-off-by: Xin Tan
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Xiyu Yang
     
  • When cleaning pinned extents right before deleting an unused block group,
    we check if there's still a previous transaction running and if so we
    increment its reference count before using it for cleaning pinned ranges
    in its pinned extents iotree. However we ended up never decrementing the
    reference count after using the transaction, resulting in a memory leak.

    Fix it by decrementing the reference count.

    Fixes: fe119a6eeb6705 ("btrfs: switch to per-transaction pinned extents")
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     

09 Apr, 2020

1 commit

  • Whenever we add a ticket to a space_info object we increment the object's
    reclaim_size counter with the ticket's bytes, and we decrement it with
    the corresponding amount only when we are able to grant the requested
    space to the ticket. When we are not able to grant the space to a ticket,
    or when the ticket is removed due to a signal (e.g. an application has
    received sigterm from the terminal) we never decrement the counter with
    the corresponding bytes from the ticket. This leak can result in the
    space reclaim code later doing much more work than necessary. So fix it
    by decrementing the counter when those two cases happen as well.
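
    The fix boils down to something like this in the two removal paths
    (sketch; done while holding space_info->lock):

    /* the ticket is being removed without having been granted */
    list_del_init(&ticket->list);
    space_info->reclaim_size -= ticket->bytes;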

    Fixes: db161806dc5615 ("btrfs: account ticket size at add/delete time")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

24 Mar, 2020

4 commits

  • The space_info list is normally RCU protected and should be traversed
    with rcu_read_lock held. There's a warning

    [29.104756] WARNING: suspicious RCU usage
    [29.105046] 5.6.0-rc4-next-20200305 #1 Not tainted
    [29.105231] -----------------------------
    [29.105401] fs/btrfs/block-group.c:2011 RCU-list traversed in non-reader section!!

    pointing out that the locking is missing in btrfs_read_block_groups.
    However this is not necessary as the list traversal happens at mount
    time when there's no other thread potentially accessing the list.

    To fix the warning and for consistency, let's add the RCU lock/unlock;
    the code won't be affected much as it's only doing some lightweight
    operations.
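
    The change amounts to wrapping the existing traversal (sketch):

    rcu_read_lock();
    list_for_each_entry_rcu(space_info, &info->space_info, list) {
            /* ... existing per space_info work ... */
    }
    rcu_read_unlock();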

    Reported-by: Guenter Roeck
    Signed-off-by: Madhuparna Bhowmik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Madhuparna Bhowmik
     
  • This commit flips the switch to start tracking/processing pinned extents
    on a per-transaction basis. It mostly replaces all references from
    btrfs_fs_info::(pinned_extents|freed_extents[]) to
    btrfs_transaction::pinned_extents.

    Two notable modifications warrant explicit mention. One is changing
    clean_pinned_extents to get a reference to the previously running
    transaction. The other is the removal of the call to
    btrfs_destroy_pinned_extent, since transactions are going to be cleaned
    in btrfs_cleanup_one_transaction.

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The next patch is going to refactor how pinned extents are tracked,
    which will necessitate changing this code. To ease that work and contain
    the changes, factor out the code now in preparation; this will also help
    review.

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The status of an aborted transaction can change between calls, so it
    needs to be accessed with READ_ONCE. Add a helper that also wraps the
    unlikely hint.
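
    The helper is essentially the following (sketch of its definition,
    assuming the aborted field on the transaction handle):

    #define TRANS_ABORTED(trans)    (unlikely(READ_ONCE((trans)->aborted)))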

    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    David Sterba
     

21 Mar, 2020

1 commit

  • We are incorrectly dropping the raid56 and raid1c34 incompat flags while
    there are still raid56 and raid1c34 block groups, not when we no longer
    have any of those. The logic just got unintentionally broken after
    adding support for the raid1c34 modes.

    Fix this by clearing the flags only if we do not have block groups with
    the respective profiles.
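
    A condensed sketch of the corrected logic (the found_* flags are set by
    walking the block group lists of each space_info):

    if (!found_raid56)
            btrfs_clear_fs_incompat(fs_info, RAID56);
    if (!found_raid1c34)
            btrfs_clear_fs_incompat(fs_info, RAID1C34);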

    Fixes: 9c907446dce3 ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     

31 Jan, 2020

2 commits

  • inc_block_group_ro does a calculation to see if we would have enough room
    left over if we marked this block group as read only, in order to decide
    whether it's OK to mark the block group as read only.

    The problem is this calculation _only_ works for data, where our used is
    always less than our total. For metadata we will overcommit, so this
    will almost always fail for metadata.

    Fix this by exporting btrfs_can_overcommit, and then see if we have
    enough space to remove the remaining free space in the block group we
    are trying to mark read only. If we do then we can mark this block
    group as read only.
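
    A sketch of the metadata branch of the new check, assuming
    btrfs_can_overcommit() takes (fs_info, space_info, bytes, flush) and
    that sinfo and num_bytes are the locals of inc_block_group_ro(), with
    num_bytes being the remaining free space in this block group:

    /*
     * Metadata overcommits, so ask the overcommit logic whether giving up
     * the remaining free space of this block group is fine, using NO_FLUSH
     * to give ourselves the most leeway.
     */
    if (btrfs_can_overcommit(cache->fs_info, sinfo, num_bytes,
                             BTRFS_RESERVE_NO_FLUSH))
            ret = 0;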

    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • For some reason we've translated the do_chunk_alloc that goes into
    btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are
    two different things.

    force for inc_block_group_ro is used when we are forcing the block group
    read only no matter what, for example when the underlying chunk is
    marked read only. We need to not do the space check here as this block
    group needs to be read only.

    btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
    we need to pre-allocate a chunk before marking the block group read
    only. This has nothing to do with forcing, and in fact we _always_ want
    to do the space check in this case, so unconditionally pass false for
    force in this case.

    Then fixup inc_block_group_ro to honor force as it's expected and
    documented to do.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

24 Jan, 2020

1 commit

  • Move variables to the appropriate scope. Remove the last BUG_ON in the
    function and rework error handling accordingly. Make the duplicate
    detection code more straightforward. Use the in_range macro. And give
    variables more descriptive names by explicitly distinguishing between IO
    stripe size (the size recorded in the chunk item) and data stripe size
    (the size of an actual stripe, constituting a logical chunk/block group).

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov