29 Feb, 2020

9 commits

  • commit a5ae50dea9111db63d30d700766dd5509602f7ad upstream.

    While logging the prealloc extents of an inode during a fast fsync we call
    btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
    holding a read lock on a leaf of the inode's root (not the log root, the
    fs/subvol root), and then that function locks the file range in the inode's
    iotree. This can lead to a deadlock when:

    * the fsync is ranged

    * the file has prealloc extents beyond eof

    * writeback for a range different from the fsync range starts
    during the fsync

    * the size of the file is not sector size aligned

    Because when finishing an ordered extent we lock first a file range and
    then try to COW the fs/subvol tree to insert an extent item.

    The following diagram shows how the deadlock can happen.

                 CPU 1                                    CPU 2

    btrfs_sync_file()
      --> for range [0, 1MiB)

      --> inode has a size of
          1MiB and has 1 prealloc
          extent beyond the
          i_size, starting at offset
          4MiB

      flushes all delalloc for the
      range [0MiB, 1MiB) and waits
      for the respective ordered
      extents to complete

                                           --> before task at CPU 1 locks the
                                               inode, a write into file range
                                               [1MiB, 2MiB + 1KiB) is made

                                           --> i_size is updated to 2MiB + 1KiB

                                           --> writeback is started for that
                                               range, [1MiB, 2MiB + 4KiB)
                                               --> end offset rounded up to
                                                   be sector size aligned

    btrfs_log_dentry_safe()
      btrfs_log_inode_parent()
        btrfs_log_inode()
          btrfs_log_changed_extents()
            btrfs_log_prealloc_extents()
              --> does a search on the
                  inode's root
              --> holds a read lock on
                  leaf X

                                           btrfs_finish_ordered_io()
                                             --> locks range [1MiB, 2MiB + 4KiB)
                                                 --> end offset rounded up
                                                     to be sector size aligned

                                             --> tries to cow leaf X, through
                                                 insert_reserved_file_extent()
                                                 --> already locked by the
                                                     task at CPU 1

            btrfs_truncate_inode_items()

              --> gets an i_size of
                  2MiB + 1KiB, which is
                  not sector size
                  aligned

              --> tries to lock file
                  range [2MiB, (u64)-1)
                  --> the start range
                      is rounded down
                      from 2MiB + 1K
                      to 2MiB to be sector
                      size aligned

                  --> but the subrange
                      [2MiB, 2MiB + 4KiB) is
                      already locked by
                      task at CPU 2 which
                      is waiting to get a
                      write lock on leaf X
                      for which we are
                      holding a read lock

                      *** deadlock ***
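
    The sector size rounding is what makes the two locked ranges collide; here
    is a small standalone illustration of that arithmetic (plain Python, not
    kernel code, assuming the 4KiB sector size used in the example above):

    SECTOR = 4096                      # 4KiB sector size assumed in the example
    MiB, KiB = 1024 * 1024, 1024

    def round_up(x, a):   return (x + a - 1) // a * a
    def round_down(x, a): return x // a * a

    i_size = 2 * MiB + 1 * KiB         # file size after the racing write on CPU 2

    # CPU 2: ordered extent completion locks up to the rounded-up end offset
    ordered_end = round_up(i_size, SECTOR)        # 2MiB + 4KiB
    # CPU 1: btrfs_truncate_inode_items() locks from the rounded-down i_size
    truncate_start = round_down(i_size, SECTOR)   # 2MiB

    # the subrange [truncate_start, ordered_end) is wanted by both tasks
    print(truncate_start, ordered_end, ordered_end - truncate_start)  # 4KiB overlap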

    This results in a stack trace like the following, triggered by test case
    generic/561 from fstests:

    [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
    [ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
    [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
    [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
    [ 2779.990466] Call Trace:
    [ 2779.990491] ? __schedule+0x384/0xa30
    [ 2779.990521] schedule+0x33/0xe0
    [ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
    [ 2779.990632] ? remove_wait_queue+0x60/0x60
    [ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
    [ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
    [ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
    [ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
    [ 2779.990987] ? mark_held_locks+0x6d/0xc0
    [ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
    [ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
    [ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
    [ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
    [ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
    [ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
    [ 2779.991432] process_one_work+0x26d/0x6a0
    [ 2779.991460] worker_thread+0x4f/0x3e0
    [ 2779.991481] ? process_one_work+0x6a0/0x6a0
    [ 2779.991489] kthread+0x103/0x140
    [ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
    [ 2779.991515] ret_from_fork+0x3a/0x50
    (...)
    [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
    [ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
    [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 2780.030035] fsstress D 0 17375 17373 0x00004000
    [ 2780.030038] Call Trace:
    [ 2780.030044] ? __schedule+0x384/0xa30
    [ 2780.030052] schedule+0x33/0xe0
    [ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
    [ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
    [ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
    [ 2780.030102] ? remove_wait_queue+0x60/0x60
    [ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
    [ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
    [ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
    [ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
    [ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
    [ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
    [ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
    [ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
    [ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
    [ 2780.030282] ? dget_parent+0xa1/0x370
    [ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
    [ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
    [ 2780.030339] do_fsync+0x38/0x60
    [ 2780.030343] __x64_sys_fdatasync+0x13/0x20
    [ 2780.030345] do_syscall_64+0x5c/0x280
    [ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
    [ 2780.030361] Code: Bad RIP value.
    [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
    [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
    [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
    [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
    [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
    [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90

    So fix this by making btrfs_truncate_inode_items() not lock the range in
    the inode's iotree when the target root is a log root, since locking the
    range is not needed for log roots: the inode's lock and the log_mutex
    already provide all the protection that's needed.

    Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 52e29e331070cd7d52a64cbf1b0958212a340e28 upstream.

    The only time we actually leave the path spinning is if we're truncating
    a small amount and don't actually free an extent, which is not a common
    occurrence. We have to set the path blocking in order to add the
    delayed ref anyway, so the first extent we find we set the path to
    blocking and stay blocking for the duration of the operation. With the
    upcoming file extent map stuff there will be another case that we have
    to have the path blocking, so just swap to blocking always.

    Note: this patch also fixes a warning after 28553fa992cb ("Btrfs: fix
    race between shrinking truncate and fiemap") got merged that inserts
    extent locks around truncation so the path must not leave spinning locks
    after btrfs_search_slot.

    [70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
    [70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
    [70.794863] 5 locks held by rsync/1141:
    [70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
    [70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
    [70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
    [70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
    [70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
    [70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
    [70.795362] Call Trace:
    [70.795374] dump_stack+0x71/0xa0
    [70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
    [70.795459] kmem_cache_alloc+0x1d3/0x290
    [70.795471] alloc_extent_state+0x22/0x1c0
    [70.795544] __clear_extent_bit+0x3ba/0x580
    [70.795557] ? _raw_spin_unlock_irq+0x24/0x30
    [70.795569] btrfs_truncate_inode_items+0x339/0xe50
    [70.795647] btrfs_evict_inode+0x269/0x540
    [70.795659] ? dput.part.38+0x29/0x460
    [70.795671] evict+0xcd/0x190
    [70.795682] __dentry_kill+0xd6/0x180
    [70.795754] dput.part.38+0x2ad/0x460
    [70.795765] do_renameat2+0x3cb/0x540
    [70.795777] __x64_sys_rename+0x1c/0x20

    Reported-by: Dave Jones
    Fixes: 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add note ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 28553fa992cb28be6a65566681aac6cafabb4f2d upstream.

    When there is a fiemap executing in parallel with a shrinking truncate
    we can end up in a situation where we have extent maps for which we no
    longer have corresponding file extent items. This is generally harmless
    and at the moment the only consequences are missing file extent items
    representing holes after we expand the file size again after the
    truncate operation removed the prealloc extent items, and stale
    information for future fiemap calls (reporting extents that no longer
    exist or may have been reallocated to other files for example).

    Consider the following example:

    1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
    and a 1MiB prealloc extent at file offset 128KiB;

    2) Task A starts doing a shrinking truncate of our inode to reduce it to
    a size of 64KiB. Before it searches the subvolume tree for file
    extent items to delete, it drops all the extent maps in the range
    from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();

    3) Task B starts doing a fiemap against our inode. When looking up for
    the inode's extent maps in the range from 128KiB to (u64)-1, it
    doesn't find any in the inode's extent map tree, since they were
    removed by task A. Because it didn't find any in the extent map
    tree, it scans the inode's subvolume tree for file extent items, and
    it finds the 1MiB prealloc extent at file offset 128KiB, then it
    creates an extent map based on that file extent item and adds it to
    the inode's extent map tree (this ends up being done by
    btrfs_get_extent()).
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit e75fd33b3f744f644061a4f9662bd63f5434f806 upstream.

    In btrfs_wait_ordered_range() once we find an ordered extent that has
    finished with an error we exit the loop and don't wait for any other
    ordered extents that might be still in progress.

    All the users of btrfs_wait_ordered_range() expect that there are no more
    ordered extents in progress after that function returns. So past fixes
    such as the ones from the two following commits:

    ff612ba7849964 ("btrfs: fix panic during relocation after ENOSPC before
    writeback happens")

    28aeeac1dd3080 ("Btrfs: fix panic when starting bg cache writeout after
    IO error")

    don't work when there are multiple ordered extents in the range.

    Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
    even after it finds one that had an error.
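
    A toy model of that behaviour change (plain Python, not the btrfs code;
    each callable below stands in for waiting on one ordered extent):

    def wait_ordered_range(waiters):
        ret = 0
        for wait in waiters:            # wait for every ordered extent in the range
            err = wait()
            if err and ret == 0:
                ret = err               # remember the first error, but keep waiting
        return ret

    done = []
    waiters = [lambda: done.append(1) or 0,
               lambda: done.append(2) or -5,     # an error in the middle of the range
               lambda: done.append(3) or 0]
    print(wait_ordered_range(waiters), done)     # -5 [1, 2, 3]: all extents waited for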

    Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Qu Wenruo
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 1e90315149f3fe148e114a5de86f0196d1c21fa5 upstream.

    btrfs_assert_delayed_root_empty() will check if the delayed root is
    completely empty, but this is a filesystem-wide check. On cleanup we
    may have allowed other transactions to begin, for whatever reason, and
    thus the delayed root is not empty.

    So remove this check from cleanup_one_transaction(). This however can
    stay in btrfs_cleanup_transaction(), because it checks only after all of
    the transactions have been properly cleaned up, and thus is valid.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 315bf8ef914f31d51d084af950703aa1e09a728c upstream.

    While running my error injection script I hit a panic when we tried to
    clean up the fs_root when freeing the fs_root. This is because
    fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by
    setting fs_info->fs_root = NULL; if we fail to read the root.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit b778cf962d71a0e737923d55d0432f3bd287258e upstream.

    I hit the following warning while running my error injection stress
    testing:

    WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
    RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
    Call Trace:
    btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
    __btrfs_prealloc_file_range+0x378/0x470 [btrfs]
    elfcorehdr_read+0x40/0x40
    ? elfcorehdr_read+0x40/0x40
    ? btrfs_commit_transaction+0xca/0xa50 [btrfs]
    ? dput+0xb4/0x2a0
    ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
    ? btrfs_sync_file+0x30e/0x420 [btrfs]
    ? do_fsync+0x38/0x70
    ? __x64_sys_fdatasync+0x13/0x20
    ? do_syscall_64+0x5b/0x1b0
    ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This happens if we fail to insert our reserved file extent. At this
    point we've already converted our reservation from ->bytes_may_use to
    ->bytes_reserved. However once we break we will attempt to free
    everything from [cur_offset, end] from ->bytes_may_use, but our extent
    reservation will overlap part of this.

    Fix this problem by adding ins.offset (our extent allocation size) to
    cur_offset so we remove the actual remaining part from ->bytes_may_use.
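
    A toy model of that accounting (plain Python, illustrative only; the
    variable names mirror the counters named in the description, not real
    btrfs structures):

    SZ_1M = 1024 * 1024
    bytes_may_use = 4 * SZ_1M      # reservation for the prealloc range [cur_offset, end)
    bytes_reserved = 0

    cur_offset, end = 0, 4 * SZ_1M
    ins_offset = SZ_1M             # one extent got allocated before the failure:
    bytes_may_use -= ins_offset    # its bytes moved from ->bytes_may_use ...
    bytes_reserved += ins_offset   # ... to ->bytes_reserved

    # the buggy cleanup freed [cur_offset, end) from ->bytes_may_use, i.e. 4MiB,
    # although only 3MiB is still accounted there; the fix skips the allocated
    # extent first:
    cur_offset += ins_offset
    bytes_may_use -= (end - cur_offset)
    print(bytes_may_use)           # 0 -> the accounting is balanced again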

    I validated this fix using my inject-error.py script

    python inject-error.py -o should_fail_bio -t cache_save_setup -t \
    __btrfs_prealloc_file_range \
    -t insert_reserved_file_extent.constprop.0 \
    -r "-5" ./run-fsstress.sh

    where run-fsstress.sh simply mounts and runs fsstress on a disk.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 81f7eb00ff5bb8326e82503a32809421d14abb8a upstream.

    We clean up the delayed references when we abort a transaction but we
    leave the pending qgroup extent records behind, leaking memory.

    This patch destroys the extent records when we destroy the delayed refs
    and makes sure they're gone before releasing the transaction.

    Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Jeff Mahoney
    [ Rebased to latest upstream, remove to_qgroup() helper, use
    rbtree_postorder_for_each_entry_safe() wrapper ]
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • commit bd727173e4432fe6cb70ba108dc1f3602c5409d7 upstream.

    If we're allocating a logged extent we attempt to insert an extent
    record for the file extent directly. We increase
    space_info->bytes_reserved, because the extent entry addition will call
    btrfs_update_block_group(), which will convert the ->bytes_reserved to
    ->bytes_used. However if we fail at any point while inserting the
    extent entry we will bail and leave space on ->bytes_reserved, which
    will trigger a WARN_ON() on umount. Fix this by pinning the space if we
    fail to insert, which is what happens in every other failure case that
    involves adding the extent entry.

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     

24 Feb, 2020

7 commits

  • [ Upstream commit f4b1363cae43fef7c86c993b7ca7fe7d546b3c68 ]

    We ran into a deadlock in production with the fixup worker. The stack
    traces were as follows:

    Thread responsible for the writeout, waiting on the page lock

    [] io_schedule+0x12/0x40
    [] __lock_page+0x109/0x1e0
    [] extent_write_cache_pages+0x206/0x360
    [] extent_writepages+0x40/0x60
    [] do_writepages+0x31/0xb0
    [] __writeback_single_inode+0x3d/0x350
    [] writeback_sb_inodes+0x19d/0x3c0
    [] __writeback_inodes_wb+0x5d/0xb0
    [] wb_writeback+0x231/0x2c0
    [] wb_workfn+0x308/0x3c0
    [] process_one_work+0x1e0/0x390
    [] worker_thread+0x2b/0x3c0
    [] kthread+0x113/0x130
    [] ret_from_fork+0x35/0x40
    [] 0xffffffffffffffff

    Thread of the fixup worker who is holding the page lock

    [] start_delalloc_inodes+0x241/0x2d0
    [] btrfs_start_delalloc_roots+0x179/0x230
    [] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
    [] btrfs_check_data_free_space+0x53/0xa0
    [] btrfs_delalloc_reserve_space+0x20/0x70
    [] btrfs_writepage_fixup_worker+0x1fc/0x2a0
    [] normal_work_helper+0x11c/0x360
    [] process_one_work+0x1e0/0x390
    [] worker_thread+0x2b/0x3c0
    [] kthread+0x113/0x130
    [] ret_from_fork+0x35/0x40
    [] 0xffffffffffffffff

    Thankfully the stars have to align just right to hit this. First you
    have to end up in the fixup worker, which is tricky by itself (my
    reproducer does DIO reads into a MMAP'ed region, so not a common
    operation). Then you have to have less than a page size of free data
    space and 0 unallocated space so you go down the "commit the transaction
    to free up pinned space" path. This was accomplished by a random
    balance that was running on the host. Then you get this deadlock.

    I'm still in the process of trying to force the deadlock to happen on
    demand, but I've hit other issues. I can still trigger the fixup worker
    path itself so this patch has been tested in that regard, so the normal
    case is fine.

    Fixes: 87826df0ec36 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 1362089d2ad7e20d16371b39d3c11990d4ec23e4 ]

    Current code doesn't correctly handle the situation which arises when
    a file system that has the METADATA_UUID_INCOMPAT flag set has its FSID
    changed to the one in metadata uuid. This causes the incompat flag to
    disappear.

    In case of a power failure we could end up in a situation where part of
    the disks in a multi-disk filesystem are correctly reverted to
    METADATA_UUID_INCOMPAT flag unset state, while others have
    METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.

    This patch corrects the behavior required to handle the case where a
    disk of the second type is scanned first, creating the necessary
    btrfs_fs_devices. Subsequently, when a disk which has already completed
    the transition is scanned it should overwrite the data in
    btrfs_fs_devices.

    Reported-by: Su Yue
    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Nikolay Borisov
     
  • [ Upstream commit 68c467cbb2f389b6c933e235bce0d1756fc8cc34 ]

    There's a report where objtool detects unreachable instructions, eg.:

    fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction

    This seems to be a false positive due to compiler version. The cause is
    in the ASSERT macro implementation that does the conditional check as
    IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.

    To avoid that, use the ifdefs directly.

    There are still 2 reports that aren't fixed:

    fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
    fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction

    Co-developed-by: Josh Poimboeuf
    Signed-off-by: Josh Poimboeuf
    Reported-by: Randy Dunlap
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    David Sterba
     
  • [ Upstream commit a69976bc69308aa475d0ba3b8b3efd1d013c0460 ]

    We had a report indicating that some read errors aren't reported by the
    device stats in the userland. It is important to have the errors
    reported in the device stat as user land scripts might depend on it to
    take the reasonable corrective actions. But to debug these issues we need
    to be really sure that the request to reset the device stat did not come
    from userland itself. So log an info message when device error reset
    happens.

    For example:
    BTRFS info (device sdc): device stats zeroed by btrfs(9223)

    Reported-by: philip@philip-seeger.de
    Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
    Reviewed-by: Josef Bacik
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Anand Jain
     
  • [ Upstream commit 4babad10198fa73fe73239d02c2e99e3333f5f5c ]

    Dan's smatch tool reports

    fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums()
    warn: should this be 'count == -1'

    which points to the while (count--) loop. With count == 0 the check
    itself could decrement it to -1. There's a WARN_ON a few lines below
    that has never been seen in practice though.

    It turns out that the value of page_bytes_left matches the count (by
    sectorsize multiples). The loop never reaches the state where count
    would go to -1, because page_bytes_left == 0 is found first and this
    breaks out.

    For clarity, use only plain check on count (and only for positive
    value), decrement safely inside the loop. Any other discrepancy after
    the whole bio list processing should be reported by the existing
    WARN_ON_ONCE as well.
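
    A small standalone illustration (plain Python, not the kernel code) of why
    the reported pattern can leave the counter at -1, and the shape of the fix:

    def buggy(count):
        while True:
            c = count
            count -= 1          # C's `while (count--)`: test the old value, decrement anyway
            if not c:
                break
        return count            # -1 when the loop is entered with count == 0

    def fixed(count):
        while count > 0:        # plain positive check, decrement inside the loop
            count -= 1
        return count            # never drops below 0

    print(buggy(0), fixed(0))   # -1 0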

    Reported-by: Dan Carpenter
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    David Sterba
     
  • [ Upstream commit 3dbd351df42109902fbcebf27104149226a4fcd9 ]

    A user reports a possible NULL-pointer dereference in
    btrfsic_process_superblock(). We are assigning state->fs_info to a local
    fs_info variable and afterwards checking for the presence of state.

    While we would BUG_ON() a NULL state anyways, we can also just remove
    the local fs_info copy, as fs_info is only used once as the first
    argument for btrfs_num_copies(). There we can just pass in
    state->fs_info as well.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Johannes Thumshirn
     
  • [ Upstream commit 25f3c5021985e885292980d04a1423fd83c967bb ]

    For COW, btrfs expects dirty pages to have been through a few setup
    steps. This includes reserving space for the new block allocations and marking
    the range in the state tree for delayed allocation.

    A few places outside btrfs will dirty pages directly, especially when unmapping
    mmap'd pages. In order for these to properly go through COW, we run them
    through a fixup worker to wait for stable pages, and do the delalloc prep.

    87826df0ec36 added a window where the dirty pages were cleaned, but pending
    more action from the fixup worker. We clear_page_dirty_for_io() before
    we call into writepage, so the page is no longer dirty. The commit
    changed it so now we leave the page clean between unlocking it here and
    the fixup worker starting at some point in the future.

    During this window, page migration can jump in and relocate the page. Once our
    fixup work actually starts, it finds page->mapping is NULL and we end up
    freeing the page without ever writing it.

    This leads to crc errors and other exciting problems, since it screws up the
    whole state machine for waiting for ordered extents. The fix here is to keep
    the page dirty while we're waiting for the fixup worker to get to work.
    This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
    if we queued the page up for fixup, which will cause the writepage
    function to redirty the page.

    Because we now expect the page to be dirty once it gets to the fixup
    worker we must adjust the error cases to call clear_page_dirty_for_io()
    on the page. That is the bulk of the patch, but it is not the fix, the
    fix is the -EAGAIN from btrfs_writepage_cow_fixup. We cannot separate
    these two changes out because the error conditions change with the new
    expectations.

    Signed-off-by: Chris Mason
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Chris Mason
     

20 Feb, 2020

4 commits

  • commit 10a3a3edc5b89a8cd095bc63495fb1e0f42047d9 upstream.

    A remount to a read-write filesystem is not safe when there's tree-log
    to be replayed. Files that could be opened until now might be affected
    by the changes in the tree-log.

    A regular mount is needed to replay the log so the filesystem presents
    the consistent view with the pending changes included.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Anand Jain
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     
  • commit e8294f2f6aa6208ed0923aa6d70cea3be178309a upstream.

    There's no logged information about tree-log replay although this is
    something that points to previous unclean unmount. Other filesystems
    report that as well.

    Suggested-by: Chris Murphy
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Anand Jain
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     
  • commit f311ade3a7adf31658ed882aaab9f9879fdccef7 upstream.

    In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
    kmalloc(), respectively. In the following code, if an error occurs, the
    execution will be redirected to 'out' or 'out_unlock' and the function will
    be exited. However, on some of the paths, 'ref' and 'ra' are not
    deallocated, leading to memory leaks. For example, if 'action' is
    BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
    value indicates an error, the execution will be redirected to 'out'. But,
    'ref' is not deallocated on this path, causing a memory leak.

    To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
    the function when an error is encountered.

    CC: stable@vger.kernel.org # 4.15+
    Signed-off-by: Wenwen Wang
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Wenwen Wang
     
  • commit ac05ca913e9f3871126d61da275bfe8516ff01ca upstream.

    We have a few cases where we allow an extent map that is in an extent map
    tree to be merged with other extents in the tree. Such cases include the
    unpinning of an extent after the respective ordered extent completed or
    after logging an extent during a fast fsync. This can lead to subtle and
    dangerous problems because when doing the merge some other task might be
    using the same extent map and as a consequence see an inconsistent state of
    the extent map - for example sees the new length but has seen the old start
    offset.

    With luck this triggers a BUG_ON(), and not some silent bug, such as the
    following one in __do_readpage():

    $ cat -n fs/btrfs/extent_io.c
     3061  static int __do_readpage(struct extent_io_tree *tree,
     3062                           struct page *page,
     (...)
     3127          em = __get_extent_map(inode, page, pg_offset, cur,
     3128                                end - cur + 1, get_extent, em_cached);
     3129          if (IS_ERR_OR_NULL(em)) {
     3130                  SetPageError(page);
     3131                  unlock_extent(tree, cur, end);
     3132                  break;
     3133          }
     3134          extent_offset = cur - em->start;
     3135          BUG_ON(extent_map_end(em) <= cur);
    Reported-by: Koki Mitani
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

11 Feb, 2020

12 commits

  • [ Upstream commit 4e19443da1941050b346f8fc4c368aa68413bc88 ]

    Sometimes when running generic/475 we would trip the
    WARN_ON(cache->reserved) check when free'ing the block groups on umount.
    This is because sometimes we don't commit the transaction because of IO
    errors and thus do not clean up the tree logs until umount time.

    These blocks are still reserved until they are cleaned up, but they
    aren't cleaned up until _after_ we do the free block groups work. Fix
    this by moving the free after free'ing the fs roots, that way all of the
    tree logs are cleaned up and we have a properly cleaned fs. A bunch of
    loops of generic/475 confirmed this fixes the problem.

    CC: stable@vger.kernel.org # 4.9+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 4273eaff9b8d5e141113a5bdf9628c02acf3afe5 ]

    We don't need an int argument in free_root_pointers(), a bool will do.
    Also rename the argument as it confused two people.

    Reviewed-by: Qu Wenruo
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Anand Jain
     
  • commit 5750c37523a2c8cbb450b9ef31e21c2ba876b05e upstream.

    Raviu reported that running his regular fs_trim segfaulted with the
    following backtrace:

    [ 237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
    [ 237.525984] ------------[ cut here ]------------
    [ 237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
    [ 237.525992] invalid opcode: 0000 [#1] SMP PTI
    [ 237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G U OE 5.4.14-8-vanilla #1
    [ 237.526001] Hardware name: ASUSTeK COMPUTER INC.
    [ 237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
    [ 237.526079] Call Trace:
    [ 237.526120] find_first_clear_extent_bit+0x13d/0x150 [btrfs]
    [ 237.526148] btrfs_trim_fs+0x211/0x3f0 [btrfs]
    [ 237.526184] btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
    [ 237.526219] btrfs_ioctl+0x129a/0x2ed0 [btrfs]
    [ 237.526227] ? filemap_map_pages+0x190/0x3d0
    [ 237.526232] ? do_filp_open+0xaf/0x110
    [ 237.526238] ? _copy_to_user+0x22/0x30
    [ 237.526242] ? cp_new_stat+0x150/0x180
    [ 237.526247] ? do_vfs_ioctl+0xa4/0x640
    [ 237.526278] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
    [ 237.526283] do_vfs_ioctl+0xa4/0x640
    [ 237.526288] ? __do_sys_newfstat+0x3c/0x60
    [ 237.526292] ksys_ioctl+0x70/0x80
    [ 237.526297] __x64_sys_ioctl+0x16/0x20
    [ 237.526303] do_syscall_64+0x5a/0x1c0
    [ 237.526310] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    That was due to btrfs_fs_device::aloc_tree being empty. Initially I
    thought this wasn't possible and as a precaution had put the assert in
    find_first_clear_extent_bit. Turns out this is indeed possible and could
    happen when a file system with SINGLE data/metadata profile has a 2nd
    device added. Until balance is run or a new chunk is allocated on this
    device it will be completely empty.

    In this case find_first_clear_extent_bit should return the full range
    [0, -1ULL] and let the caller handle this, i.e. for trim the end will be
    capped at the size of the actual device.

    Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
    Fixes: 45bfcfc168f8 ("btrfs: Implement find_first_clear_extent_bit")
    CC: stable@vger.kernel.org # 5.2+
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • commit 42ffb0bf584ae5b6b38f72259af1e0ee417ac77f upstream.

    There exists a deadlock with range_cyclic that has existed forever. If
    we loop around with a bio already built we could deadlock with a writer
    who has the page locked that we're attempting to write but is waiting on
    a page in our bio to be written out. The task traces are as follows

    PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5"
    #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
    #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
    #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
    #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
    #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
    #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
    #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
    #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
    #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
    #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd

    PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND:
    "aio-dio-invalid"
    #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
    #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
    #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
    #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
    #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
    #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
    #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
    #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
    #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
    #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032

    I used drgn to find the respective pages we were stuck on

    page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
    page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874

    As you can see the kworker is waiting for bit 0 (PG_locked) on index
    7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
    8148. aio-dio-invalid has 7680, and the kworker epd looks like the
    following

    crash> struct extent_page_data ffffc900297bbbb0
    struct extent_page_data {
      bio = 0xffff889f747ed830,
      tree = 0xffff889eed6ba448,
      extent_locked = 0,
      sync_io = 0
    }

    Probably worth mentioning as well that it waits for writeback of the
    page to complete while holding a lock on it (at prepare_pages()).

    Using drgn I walked the bio pages looking for page
    0xffffea00fbfc7500 which is the one we're waiting for writeback on

    bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
    for i in range(0, bio.bi_vcnt.value_()):
        bv = bio.bi_io_vec[i]
        if bv.bv_page.value_() == 0xffffea00fbfc7500:
            print("FOUND IT")

    which validated what I suspected.

    The fix for this is simple, flush the epd before we loop back around to
    the beginning of the file during writeout.

    Fixes: b293f02e1423 ("Btrfs: Add writepages support")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 7227ff4de55d931bbdc156c8ef0ce4f100c78a5b upstream.

    There is a race between adding and removing elements to the tree mod log
    list and rbtree that can lead to use-after-free problems.

    Consider the following example that explains how/why the problems happens:

    1) Task A has mod log element with sequence number 200. It currently is
    the only element in the mod log list;

    2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
    access the tree mod log. When it enters the function, it initializes
    'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
    before checking if there are other elements in the mod seq list.
    Since the list is empty, 'min_seq' remains set to (u64)-1. Then it
    unlocks the lock 'tree_mod_seq_lock';

    3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
    itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
    sequence number of 201;

    4) Some other task, name it task C, modifies a btree and because there
    are elements in the mod seq list, it adds a tree mod elem to the tree
    mod log rbtree. That node added to the mod log rbtree is assigned
    a sequence number of 202;

    5) Task B, which is doing fiemap and resolving indirect back references,
    calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
    calls tree_mod_log_search() - the search returns the mod log node
    from the rbtree with sequence number 202, created by task C;

    6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
    the mod log rbtree and finds the node with sequence number 202. Since
    202 is less than the previously computed 'min_seq', (u64)-1, it
    removes the node and frees it;

    7) Task B still has a pointer to the node with sequence number 202, and
    it dereferences the pointer itself and through the call to
    __tree_mod_log_rewind(), resulting in a use-after-free problem.

    This issue can be triggered sporadically with the test case generic/561
    from fstests, and it happens more frequently with a higher number of
    duperemove processes. When it happens to me, it either freezes the VM or
    it produces a trace like the following before crashing:

    [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
    [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
    [ 1245.321287] RIP: 0010:rb_next+0x16/0x50
    [ 1245.321307] Code: ....
    [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
    [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
    [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
    [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
    [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
    [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
    [ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
    [ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
    [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 1245.321706] Call Trace:
    [ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
    [ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
    [ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
    [ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
    [ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
    [ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
    [ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
    [ 1245.322066] do_vfs_ioctl+0x45a/0x750
    [ 1245.322081] ksys_ioctl+0x70/0x80
    [ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
    [ 1245.322113] __x64_sys_ioctl+0x16/0x20
    [ 1245.322126] do_syscall_64+0x5c/0x280
    [ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 1245.322155] RIP: 0033:0x7fdee3942dd7
    [ 1245.322177] Code: ....
    [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
    [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
    [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
    [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
    [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
    [ 1245.322423] Modules linked in: ....
    [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---

    Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
    sequence number and iterates the rbtree while holding the lock
    'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
    lock, since it is now redundant.

    Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
    Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 889bfa39086e86b52fcfaa04d72c95eaeb12f9a5 upstream.

    If we fsync on a subvolume and create a log root for that volume, and
    then later delete that subvolume we'll never clean up its log root. Fix
    this by making switch_commit_roots free the log for any dropped roots we
    encounter. The extra churn is because we need a btrfs_trans_handle, not
    the btrfs_transaction.

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit d62b23c94952e78211a383b7d90ef0afbd9a3717 upstream.

    If we abort a transaction we have the following sequence

    if (!trans->dirty && list_empty(&trans->new_bgs))
            return;
    WRITE_ONCE(trans->transaction->aborted, err);

    The idea being if we didn't modify anything with our trans handle then
    we don't really need to abort the whole transaction, maybe the other
    trans handles are fine and we can carry on.

    However in the case of create_snapshot we add a pending_snapshot object
    to our transaction and then commit the transaction. We don't actually
    modify anything. sync() behaves the same way: it attaches to an existing
    transaction and commits it. This means that if we have an IO error in
    the right places we could abort the committing transaction with our
    trans->dirty not set, and thus never set transaction->aborted.

    This is a problem because in the create_snapshot() case we depend on
    pending->error being set to something, or btrfs_commit_transaction
    returning an error.

    If we are not the trans handle that gets to commit the transaction, and
    we're waiting on the commit to happen we get our return value from
    cur_trans->aborted. If this was not set to anything because sync() hit
    an error in the transaction commit before it could modify anything then
    cur_trans->aborted would be 0. Thus we'd return 0 from
    btrfs_commit_transaction() in create_snapshot.

    This is a problem because we then try to do things with
    pending_snapshot->snap, which will be NULL because we didn't create the
    snapshot, and then we'll get a NULL pointer dereference like the
    following

    "BUG: kernel NULL pointer dereference, address: 00000000000001f0"
    RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
    Call Trace:
    ? btrfs_mksubvol.isra.31+0x3f2/0x510
    btrfs_mksubvol.isra.31+0x4bc/0x510
    ? __sb_start_write+0xfa/0x200
    ? mnt_want_write_file+0x24/0x50
    btrfs_ioctl_snap_create_transid+0x16c/0x1a0
    btrfs_ioctl_snap_create_v2+0x11e/0x1a0
    btrfs_ioctl+0x1534/0x2c10
    ? free_debug_processing+0x262/0x2a3
    do_vfs_ioctl+0xa6/0x6b0
    ? do_sys_open+0x188/0x220
    ? syscall_trace_enter+0x1f8/0x330
    ksys_ioctl+0x60/0x90
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x4a/0x1b0

    In order to fix this we need to make sure anybody who calls
    commit_transaction has trans->dirty set so that they properly set the
    trans->transaction->aborted value and any waiters know bad
    things happened.

    This was found while I was running generic/475 with my modified
    fsstress, it reproduced within a few runs. I ran with this patch all
    night and didn't see the problem again.

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit b5e4ff9d465da1233a2d9a47ebce487c70d8f4ab upstream.

    Recently fsstress (from fstests) sporadically started to trigger an
    infinite loop during fsync operations. This turned out to be because
    support for the rename exchange and whiteout operations was added to
    fsstress in fstests. These operations, unlike any others in fsstress,
    cause file names to be reused, whence triggering this issue. However
    it's not necessary to use rename exchange and rename whiteout operations
    to trigger this issue, simple rename operations and file creations are
    enough to trigger the issue.

    The issue boils down to when we are logging inodes that conflict (that
    had the name of any inode we need to log during the fsync operation), we
    keep logging them even if they were already logged before, and after
    that we check if there's any other inode that conflicts with them and
    then add it again to the list of inodes to log. Skipping already logged
    inodes fixes the issue.

    Consider the following example:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ mkdir /mnt/testdir # inode 257

    $ touch /mnt/testdir/zz # inode 258
    $ ln /mnt/testdir/zz /mnt/testdir/zz_link

    $ touch /mnt/testdir/a # inode 259

    $ sync

    # The following 3 renames achieve the same result as a rename exchange
    # operation (/mnt/testdir/zz_link to /mnt/testdir/a).

    $ mv /mnt/testdir/a /mnt/testdir/a/tmp
    $ mv /mnt/testdir/zz_link /mnt/testdir/a
    $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link

    # The following rename and file creation give the same result as a
    # rename whiteout operation (zz to a2).

    $ mv /mnt/testdir/zz /mnt/testdir/a2
    $ touch /mnt/testdir/zz # inode 260

    $ xfs_io -c fsync /mnt/testdir/zz
    --> results in the infinite loop

    The following steps happen:

    1) When logging inode 260, we find that its reference named "zz" was
    used by inode 258 in the previous transaction (through the commit
    root), so inode 258 is added to the list of conflicting inodes that
    need to be logged;

    2) After logging inode 258, we find that its reference named "a" was
    used by inode 259 in the previous transaction, and therefore we add
    inode 259 to the list of conflicting inodes to be logged;

    3) After logging inode 259, we find that its reference named "zz_link"
    was used by inode 258 in the previous transaction - we add inode 258
    to the list of conflicting inodes to log, again - we had already
    logged it before at step 2. After logging it again, we find again
    that inode 259 conflicts with it, and we add again 259 to the list,
    etc - we end up repeating all the previous steps.

    So fix this by skipping logging of conflicting inodes that were already
    logged.
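
    A toy model of that walk (plain Python, purely illustrative; the conflicts
    map mirrors the name reuse in the example above, not btrfs data structures):

    conflicts = {260: 258, 258: 259, 259: 258}   # "logging X finds a name reused by Y"

    def log_with_conflicts(start, skip_already_logged):
        logged, queue, steps = set(), [start], 0
        while queue and steps < 20:              # cap to make the unbounded case visible
            steps += 1
            ino = queue.pop()
            if skip_already_logged and ino in logged:
                continue
            logged.add(ino)
            if ino in conflicts:
                queue.append(conflicts[ino])
        return steps

    print(log_with_conflicts(260, skip_already_logged=False))   # 20: never terminates
    print(log_with_conflicts(260, skip_already_logged=True))    # 4: the loop is broken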

    Fixes: 6b5fc433a7ad67 ("Btrfs: fix fsync after succession of renames of different files")
    CC: stable@vger.kernel.org # 5.1+
    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 831d2fa25ab8e27592b1b0268dae6f2dfaf7cc43 upstream.

    Since btrfs was migrated to use the generic VFS helpers for clone and
    deduplication, it stopped allowing for the last block of a file to be
    deduplicated when the source file size is not sector size aligned (when
    eof is somewhere in the middle of the last block). There are two reasons
    for that:

    1) The generic code always rounds down, to a multiple of the block size,
    the range's length for deduplications. This means we end up never
    deduplicating the last block when the eof is not block size aligned,
    even for the safe case where the destination range's end offset matches
    the destination file's size. That rounding down operation is done at
    generic_remap_check_len();

    2) Because of that, the btrfs specific code no longer expects any
    non-aligned range lengths for deduplication and therefore does not
    work if such a non-aligned length is given.
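
    A standalone illustration of the rounding in item 1 (plain Python, assuming
    a 4KiB block size; not the VFS code):

    blocksize = 4096
    file_size = 10 * blocksize + 300       # eof in the middle of the last block
    requested_len = file_size              # deduplicate the whole file

    rounded_len = (requested_len // blocksize) * blocksize
    print(file_size - rounded_len)         # 300: the tail bytes were never deduplicated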

    This patch addresses that second part, and it depends on a patch that
    fixes generic_remap_check_len(), in the VFS, which was submitted earlier
    and has the following subject:

    "fs: allow deduplication of eof block into the end of the destination file"

    These two patches address reports from users that started seeing lower
    deduplication rates due to the last block never being deduplicated when
    the file size is not aligned to the filesystem's block size.

    Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
    CC: stable@vger.kernel.org # 5.1+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 0e56315ca147b3e60c7bf240233a301d3c7fb508 upstream.

    When using the NO_HOLES feature, if we punch a hole into a file and then
    fsync it, there are cases where a subsequent fsync will miss the fact that
    a hole was punched, resulting in the holes not existing after replaying
    the log tree.

    Essentially these cases all imply that, tree-log.c:copy_items(), is not
    invoked for the leafs that delimit holes, because nothing changed those
    leafs in the current transaction. And it's precisely copy_items() where
    we currently detect and log holes, which works as long as the holes are
    between file extent items in the input leaf or between the beginning of
    input leaf and the previous leaf or between the last item in the leaf
    and the next leaf.

    First example where we miss a hole:

    *) The extent items of the inode span multiple leafs;

    *) The punched hole covers a range that affects only the extent items of
    the first leaf;

    *) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
    is set in the inode's runtime flags).

    That results in the hole not existing after replaying the log tree.

    For example, if the fs/subvolume tree has the following layout for a
    particular inode:

    Leaf N, generation 10:

    [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]

    Leaf N + 1, generation 10:

    [ EXTENT_ITEM (128K 64K) ... ]

    If at transaction 11 we punch a hole covering the range [0, 128K[, we end
    up dropping the two extent items from leaf N, but we don't touch the other
    leaf, so we end up in the following state:

    Leaf N, generation 11:

    [ ... INODE_ITEM INODE_REF ]

    Leaf N + 1, generation 10:

    [ EXTENT_ITEM (128K 64K) ... ]

    A full fsync after punching the hole will only process leaf N because it
    was modified in the current transaction, but not leaf N + 1, since it
    was not modified in the current transaction (generation 10 and not 11).
    As a result the fsync will not log any holes, because it didn't process
    any leaf with extent items.

    Second example where we will miss a hole:

    *) An inode has its items spanning 5 (or more) leafs;

    *) A hole is punched and it covers only the extents items of the 3rd
    leaf. This results in deleting the entire leaf and not touching any
    of the other leafs.

    So the only leaf that is modified in the current transaction, when
    punching the hole, is the first leaf, which contains the inode item.
    During the full fsync, the only leaf that is passed to copy_items()
    is that first leaf, and that's not enough for the hole detection
    code in copy_items() to determine there's a hole between the last
    file extent item in the 2nd leaf and the first file extent item in
    the 3rd leaf (which was the 4th leaf before punching the hole).

    Fix this by scanning all leafs and punch holes as necessary when doing a
    full fsync (less common than a non-full fsync) when the NO_HOLES feature
    is enabled. The lack of explicit file extent items to mark holes makes it
    necessary to scan existing extents to determine if holes exist.

    A test case for fstests follows soon.

    Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 05840710149c7d1a78ea85a2db5723f706e97d8f upstream.

    There is one more case which isn't handled by the original metadata
    uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
    the user decides to change the FSID to the original one e.g. have
    metadata_uuid and fsid match. In case of power failure while this
    operation is in progress we could end up in a situation where some of
    the disks have the incompat bit removed and the other half have both
    METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.

    This patch handles the case where a disk that has successfully changed
    its FSID such that it equals METADATA_UUID is scanned first.
    Subsequently when a disk with both
    METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
    find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
    This is done by extending find_fsid_changed to correctly find
    btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
    the metadata_uuid of the currently scanned device.

    Fixes: cc5de4e70256 ("btrfs: Handle final split-brain possibility during fsid change")
    Reviewed-by: Josef Bacik
    Reported-by: Su Yue
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • commit 556755a8a99be8ca3cd9fbe36aaf9b3b0339a00d upstream.

    We noticed that we were having regular CG OOM kills in cases where there
    were still enough dirty pages to avoid OOM'ing. It turned out there's
    this corner case in btrfs's handling of range_cyclic where files that
    were being redirtied were not getting fully written out because of how
    we do range_cyclic writeback.

    We unconditionally were setting scanned = 1; the first time we found any
    pages in the inode. This isn't actually what we want, we want it to be
    set if we've scanned the entire file. For range_cyclic we could be
    starting in the middle or towards the end of the file, so we could write
    one page and then not write any of the other dirty pages in the file
    because we set scanned = 1.

    Fix this by not setting scanned = 1 if we find pages. The rules for
    setting scanned should be

    1) !range_cyclic. In this case we have a specified range to write out.
    2) range_cyclic && index == 0. In this case we've started at the
    beginning and there is no need to loop around a second time.
    3) range_cyclic && we started at index > 0 and we've reached the end of
    the file without satisfying our nr_to_write.

    This patch fixes both of our writepages implementations to make sure
    these rules hold true. This fixed our overzealous CG OOMs in
    production.
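
    To make rules 2) and 3) concrete, here is a small standalone model of a
    range_cyclic pass (not the kernel code; the page count, dirty pattern
    and start index are made up). It only sets scanned once the whole file
    has been covered, so a cycle that starts in the middle still wraps
    around and writes the dirty pages before the starting index:

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_PAGES 8

    static void writepages_cyclic(bool dirty[NR_PAGES], int start_index)
    {
            int index = start_index;
            bool scanned = (index == 0);    /* rule 2: started at the beginning */

    retry:
            for (; index < NR_PAGES; index++) {
                    if (dirty[index]) {
                            printf("writing page %d\n", index);
                            dirty[index] = false;
                    }
            }
            if (!scanned) {
                    /* rule 3: hit EOF after starting at index > 0, wrap once */
                    scanned = true;
                    index = 0;
                    goto retry;
            }
    }

    int main(void)
    {
            bool dirty[NR_PAGES] = { true, true, false, false,
                                     true, false, true, false };

            /* Start mid-file, as a stored writeback_index might. */
            writepages_cyclic(dirty, 4);
            return 0;
    }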

    Fixes: d1310b2e0cd9 ("Btrfs: Split the extent_map code into two parts")
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add comment ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     

06 Feb, 2020

1 commit

  • commit d55966c4279bfc6a0cf0b32bf13f5df228a1eeb6 upstream.

    There was some logic added a while ago to clear out f_bavail in statfs()
    if we did not have enough free metadata space to satisfy our global
    reserve. This was incorrect at the time, but it didn't really pose a
    problem for normal file systems because we would often allocate chunks
    when we got this low on free metadata space, and thus wouldn't really
    hit this case unless we were actually full.

    Fast forward to today and we are now much better about not allocating
    metadata chunks all of the time. Couple this with d792b0f19711 ("btrfs:
    always reserve our entire size for the global reserve"), which means
    we'll easily have a larger global reserve than our free space, and we
    are now more likely to trip over this while still having plenty of
    space.

    Fix this by skipping this logic if the global rsv's space_info is not
    full. space_info->full is 0 unless we've attempted to allocate a chunk
    for that space_info and that has failed. If this happens then the space
    for the global reserve is definitely sacred and we need to report
    f_bavail == 0, but before then we can just use our calculated f_bavail.
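
    The resulting check in btrfs_statfs() would then be gated on the
    space_info being full, roughly like this (an illustrative fragment
    condensed from the description above, not quoted from the patch; the
    variable names follow the existing statfs code, where block_rsv is the
    global block reserve and thresh/total_free_meta are the metadata
    estimates computed earlier in the function):

    /*
     * Only zero out f_bavail if the metadata space_info is actually full,
     * i.e. a chunk allocation for it was attempted and failed.  Otherwise
     * report the f_bavail value computed above.
     */
    if (!mixed && block_rsv->space_info->full &&
        total_free_meta - thresh < block_rsv->size)
            buf->f_bavail = 0;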

    Reported-by: Martin Steigerwald
    Fixes: ca8a51b3a979 ("btrfs: statfs: report zero available if metadata are exhausted")
    CC: stable@vger.kernel.org # 4.5+
    Reviewed-by: Qu Wenruo
    Tested-By: Martin Steigerwald
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     

23 Jan, 2020

7 commits

  • commit 5afe6ce748c1ea99e0d648153c05075e1ab93afb upstream.

    If scrub returns an error we are not copying back the scrub arguments
    structure to user space. This prevents user space from knowing how much
    progress scrub has made when an error happens - this includes
    -ECANCELED, which is returned when users ask for scrub to stop. A
    particular use case, used in btrfs-progs, is to resume scrub after it is
    canceled; in that case it relies on reading the progress from the scrub
    arguments structure and then using that progress in a call to resume
    scrub.

    So fix this by always copying the scrub arguments structure back to user
    space, overwriting the value returned to user space with -EFAULT only if
    copying the structure fails. That lets user space know that either the
    copy did not happen, and therefore the structure is stale, or it
    happened partially and the structure is probably not valid and corrupt
    due to the partial copy.
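
    The tail of the scrub ioctl handler would then look roughly like this
    (an illustrative fragment, not quoted from the patch; 'sa' is the kernel
    copy of the scrub args and 'arg' the user pointer, following the naming
    used by the surrounding ioctl code):

    /*
     * Always copy the args (and thus the progress) back to user space,
     * even when scrub returned an error such as -ECANCELED.  Only
     * overwrite the return value with -EFAULT when the copy itself fails,
     * since then the user's copy is stale or only partially written.
     */
    if (copy_to_user(arg, sa, sizeof(*sa)))
            ret = -EFAULT;

    kfree(sa);
    return ret;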

    Reported-by: Graham Cobb
    Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
    Fixes: 06fe39ab15a6a4 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
    CC: stable@vger.kernel.org # 5.1+
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Qu Wenruo
    Tested-by: Graham Cobb
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit b35cf1f0bf1f2b0b193093338414b9bd63b29015 upstream.

    The fstest btrfs/154 reports

    [ 8675.381709] BTRFS: Transaction aborted (error -28)
    [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
    [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
    [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
    [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
    [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
    [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
    [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
    [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
    [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
    [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
    [ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
    [ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
    [ 8675.419801] Call Trace:
    [ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
    [ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
    [ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
    [ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
    [ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
    [ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
    [ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
    [ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
    [ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
    [ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
    [ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
    [ 8675.438093] ? __handle_mm_fault+0x499/0x740
    [ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
    [ 8675.441034] do_vfs_ioctl+0x56e/0x770
    [ 8675.442411] ksys_ioctl+0x3a/0x70
    [ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
    [ 8675.445333] __x64_sys_ioctl+0x16/0x20
    [ 8675.446705] do_syscall_64+0x50/0x210
    [ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left

    We now use btrfs_can_overcommit() to see if we can flip a block group
    read only. Before this it would fail because we weren't taking into
    account the usable unallocated space for allocating chunks. With my
    patches we are allowed to do the balance, which is technically correct.

    The test is trying to start a balance on a degraded mount. So now we're
    trying to allocate a chunk and cannot, because we want to allocate a
    RAID1 chunk but there's only one device available for use. This results
    in an ENOSPC.

    But we shouldn't even be making it this far; we don't have enough
    devices to restripe. The problem is that we're using
    btrfs_num_devices(), which also includes missing devices. That's not
    what we want here; we need to use rw_devices.

    The chunk_mutex is not needed here: rw_devices changes only on device
    add, remove or replace, all of which are excluded by the EXCL_OP
    mechanism.
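
    In btrfs_balance() the device count used for the restripe checks would
    then come from the registered writeable devices instead of from
    btrfs_num_devices(), roughly as below (an illustrative fragment, not
    quoted from the patch; the profile check under it is a made-up example
    of how the count is consumed):

    /* Count only rw devices; missing devices must not enable a restripe. */
    num_devices = fs_info->fs_devices->rw_devices;

    /* e.g. converting to RAID1 needs at least two usable devices */
    if (num_devices < 2)
            allowed &= ~BTRFS_BLOCK_GROUP_RAID1;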

    Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add stacktrace, update changelog, drop chunk_mutex ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 26ef8493e1ab771cb01d27defca2fa1315dc3980 upstream.

    When running xfstests on the current btrfs I get the following splat from
    kmemleak:

    unreferenced object 0xffff88821b2404e0 (size 32):
    comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s)
    hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff ...........&....
    10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff ...&.... ..&....
    backtrace:
    [] ulist_alloc+0x25/0x60 [btrfs]
    [] btrfs_find_all_roots_safe+0x41/0x100 [btrfs]
    [] btrfs_find_all_roots+0x52/0x70 [btrfs]
    [] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs]
    [] btrfs_work_helper+0xac/0x1e0 [btrfs]
    [] process_one_work+0x1cf/0x350
    [] worker_thread+0x28/0x3c0
    [] kthread+0x109/0x120
    [] ret_from_fork+0x35/0x40

    This corresponds to:

    (gdb) l *(btrfs_find_all_roots_safe+0x41)
    0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413).
    1408
    1409         tmp = ulist_alloc(GFP_NOFS);
    1410         if (!tmp)
    1411                 return -ENOMEM;
    1412         *roots = ulist_alloc(GFP_NOFS);
    1413         if (!*roots) {
    1414                 ulist_free(tmp);
    1415                 return -ENOMEM;
    1416         }
    1417

    Following the lifetime of the allocated 'roots' ulist, it gets freed
    again in btrfs_qgroup_account_extent().

    But this does not happen if the function is called with the
    'BTRFS_FS_QUOTA_ENABLED' flag cleared; in that case
    btrfs_qgroup_account_extent() takes an early exit and returns directly.

    Instead of returning directly we should jump to the 'out_free' label in
    order to free all resources as expected.
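
    The fix boils down to routing the quota-disabled early exit through the
    common cleanup path, roughly as follows (an illustrative fragment, not
    quoted from the patch; the 'out_free' label comes from the description
    above, while the old_roots/new_roots parameter names are assumed):

    if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
            /*
             * Quotas were disabled meanwhile; the roots ulists still need
             * to be freed, so take the common cleanup path instead of
             * returning directly.
             */
            goto out_free;
    }
    ...
    out_free:
            ulist_free(old_roots);
            ulist_free(new_roots);
            return ret;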

    CC: stable@vger.kernel.org # 4.14+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Johannes Thumshirn
    [ add comment ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Johannes Thumshirn
     
  • commit 6282675e6708ec78518cc0e9ad1f1f73d7c5c53d upstream.

    [BUG]
    There are several different KASAN reports for balance + snapshot
    workloads. Involved call paths include:

    should_ignore_root+0x54/0xb0 [btrfs]
    build_backref_tree+0x11af/0x2280 [btrfs]
    relocate_tree_blocks+0x391/0xb80 [btrfs]
    relocate_block_group+0x3e5/0xa00 [btrfs]
    btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
    btrfs_relocate_chunk+0x53/0xf0 [btrfs]
    btrfs_balance+0xc91/0x1840 [btrfs]
    btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
    btrfs_ioctl+0x8af/0x3e60 [btrfs]
    do_vfs_ioctl+0x831/0xb10

    create_reloc_root+0x9f/0x460 [btrfs]
    btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
    create_pending_snapshot+0xa9b/0x15f0 [btrfs]
    create_pending_snapshots+0x111/0x140 [btrfs]
    btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
    btrfs_mksubvol+0x915/0x960 [btrfs]
    btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
    btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
    btrfs_ioctl+0x241b/0x3e60 [btrfs]
    do_vfs_ioctl+0x831/0xb10

    btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
    create_pending_snapshot+0x209/0x15f0 [btrfs]
    create_pending_snapshots+0x111/0x140 [btrfs]
    btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
    btrfs_mksubvol+0x915/0x960 [btrfs]
    btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
    btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
    btrfs_ioctl+0x241b/0x3e60 [btrfs]
    do_vfs_ioctl+0x831/0xb10

    [CAUSE]
    All these call sites rely only on root->reloc_root, which can undergo
    btrfs_drop_snapshot(), and since we don't have real refcount-based
    protection for reloc roots, we can reach an already dropped reloc root,
    triggering KASAN.

    [FIX]
    To avoid such access to unstable root->reloc_root, we should check
    BTRFS_ROOT_DEAD_RELOC_TREE bit first.

    This patch introduces wrappers that provide the correct way to check the
    bit with memory barriers protection.

    Most callers don't distinguish between a merged reloc tree and no reloc
    tree. The only exception is should_ignore_root(): a merged reloc tree
    can be ignored, while no reloc tree shouldn't be.
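
    A minimal sketch of such a wrapper, checking the bit before touching the
    pointer (illustrative only, not the exact upstream helper; it assumes
    the bit lives in root->state like other per-root state bits, and that
    the read side pairing with the write barriers happens before the
    test_bit(), as described in the analysis below):

    static bool have_reloc_root(struct btrfs_root *root)
    {
            /* A set DEAD bit means reloc_root is merged or being cleaned up. */
            if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
                    return false;
            if (!root->reloc_root)
                    return false;
            return true;
    }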

    [CRITICAL SECTION ANALYSIS]
    Although test_bit()/set_bit()/clear_bit() don't imply a barrier, the
    DEAD_RELOC_TREE bit has extra help from the transaction as a higher
    level barrier. The lifespan of root::reloc_root and the DEAD_RELOC_TREE
    bit is:

    NULL: reloc_root is NULL            PTR:  reloc_root is not NULL
    0:    DEAD_RELOC_ROOT bit not set   DEAD: DEAD_RELOC_ROOT bit set

    (NULL, 0)    Initial state                    __
      |                                           /\ Section A
    btrfs_init_reloc_root()                       \/
      |                                           __
    (PTR, 0)     reloc_root initialized           /\
      |                                           |
    btrfs_update_reloc_root()                     |  Section B
      |                                           |
    (PTR, DEAD)  reloc_root has been merged       \/
      |                                           __
    === btrfs_commit_transaction() ====================
      |                                           /\
    clean_dirty_subvols()                         |
      |                                           |  Section C
    (NULL, DEAD) reloc_root cleanup starts        \/
      |                                           __
    btrfs_drop_snapshot()                         /\
      |                                           |  Section D
    (NULL, 0)    Back to initial state            \/

    Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds a
    transaction handle, so no such caller can cross a transaction boundary.

    In Section A, every caller simply finds no DEAD bit and grabs reloc_root.

    In the cross section A-B, a caller may see no DEAD bit, but since
    reloc_root is still completely valid, accessing it is completely safe.

    No test_bit() caller can cross the boundary of Section B and Section C.

    In Section C, every caller finds the DEAD bit, so no one will access
    reloc_root.

    In the cross section C-D, a caller either sees the DEAD bit set and
    avoids accessing reloc_root no matter whether it would be safe, or sees
    the DEAD bit cleared and then accesses reloc_root, which is already
    NULL, so nothing will go wrong.

    The memory write barriers sit between the reloc_root updates and the
    bit set/clear; the pairing read side is before the test_bit.

    Reported-by: Zygo Blaxell
    Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    [ barriers ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • commit 423a716cd7be16fb08690760691befe3be97d3fc upstream.

    btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in
    any way, and then continue to delete the reference. This shouldn't
    happen; we have these values because there's more to the reference than
    the original root and the sub root. If any of these checks fail, return
    -ENOENT instead.
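
    Conceptually the change turns the mismatch warnings into a hard failure,
    along these lines (an illustrative fragment, not quoted from the patch;
    the root ref item accessors are existing helpers, the surrounding lookup
    code is omitted):

    /* The ref must fully match, otherwise we looked up the wrong item. */
    if (btrfs_root_ref_dirid(leaf, ref) != dirid ||
        btrfs_root_ref_name_len(leaf, ref) != name_len ||
        memcmp_extent_buffer(leaf, name, (unsigned long)(ref + 1), name_len)) {
            ret = -ENOENT;
            goto out;
    }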

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit d49d3287e74ffe55ae7430d1e795e5f9bf7359ea upstream.

    If we have the following sequence of events

    btrfs sub create A
    btrfs sub create A/B
    btrfs sub snap A C
    mkdir C/foo
    mv A/B C/foo
    rm -rf *

    We will end up with a transaction abort.

    The reason for this is that we create a root ref for B pointing to A.
    When we create the snapshot C we still have B in our tree, but because
    the root ref points to A and not C we make it appear to be empty.

    The problem happens when we move B into C. This removes the root ref
    for B pointing to A and adds a ref of B pointing to C. When we rmdir C
    we'll see that we have a ref to our root and remove the root ref,
    despite it not actually matching our reference name.

    Now btrfs_del_root_ref() allowing this to work is a bug as well;
    however, we know that this inode does not actually point to a root ref
    in the first place, so we shouldn't be calling btrfs_del_root_ref() at
    all. Instead, simply look up our dir index for this item and do the
    rest of the removal.

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • [ Upstream commit 045d3967b6920b663fc010ad414ade1b24143bd1 ]

    btrfs_unlink_subvol() takes the name of the dentry and the root objectid
    based on what kind of inode this is, either a real subvolume link or an
    empty one that we inherited as a snapshot. We need to fix how we unlink
    in the BTRFS_EMPTY_SUBVOL_DIR_OBJECTID case in the future, so rework
    btrfs_unlink_subvol() to just take the dentry and handle getting the
    right objectid given the type of inode this is. There is no functional
    change here, simply pushing the work into btrfs_unlink_subvol() proper.
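
    Inside the reworked helper, picking the objectid from the dentry's inode
    could look roughly like this (a sketch under the assumption that the two
    inode kinds are told apart by their inode number via the existing
    BTRFS_FIRST_FREE_OBJECTID and BTRFS_EMPTY_SUBVOL_DIR_OBJECTID constants;
    not quoted from the patch):

    if (btrfs_ino(BTRFS_I(inode)) == BTRFS_FIRST_FREE_OBJECTID) {
            /* A real subvolume link: use the linked subvolume's root id. */
            objectid = BTRFS_I(inode)->root->root_key.objectid;
    } else if (btrfs_ino(BTRFS_I(inode)) == BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) {
            /* An empty dir inherited from a snapshot: id is in the location key. */
            objectid = BTRFS_I(inode)->location.objectid;
    } else {
            WARN_ON(1);
            return -EINVAL;
    }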

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik