14 Nov, 2020

2 commits

  • [BUG]
    When running the following script, btrfs will trigger an ASSERT():

    #/bin/bash
    mkfs.btrfs -f $dev
    mount $dev $mnt
    xfs_io -f -c "pwrite 0 1G" $mnt/file
    sync
    btrfs quota enable $mnt
    btrfs quota rescan -w $mnt

    # Manually set the limit below current usage
    btrfs qgroup limit 512M $mnt $mnt

    # Crash happens
    touch $mnt/file

    The dmesg looks like this:

    assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/ctree.h:3230!
    invalid opcode: 0000 [#1] SMP PTI
    RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
    btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
    try_flush_qgroup+0x67/0x100 [btrfs]
    __btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
    btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
    btrfs_update_inode+0x9d/0x110 [btrfs]
    btrfs_dirty_inode+0x5d/0xd0 [btrfs]
    touch_atime+0xb5/0x100
    iterate_dir+0xf1/0x1b0
    __x64_sys_getdents64+0x78/0x110
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fb5afe588db

    [CAUSE]
    In try_flush_qgroup(), we assume we don't hold a transaction handle at
    all. This is true for data reservation and mostly true for metadata.
    Since data space reservation always happens before we start a
    transaction, and for most metadata operation we reserve space in
    start_transaction().

    But there is an exception, btrfs_delayed_inode_reserve_metadata().
    It holds a transaction handle, while still trying to reserve extra
    metadata space.

    When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we
    will join current transaction and commit, while we still have
    transaction handle from qgroup code.

    [FIX]
    Let's check current->journal before we join the transaction.

    If current->journal is unset or BTRFS_SEND_TRANS_STUB, it means
    we are not holding a transaction, thus are able to join and then commit
    transaction.

    If current->journal is a valid transaction handle, we avoid committing
    transaction and just end it

    This is less effective than committing current transaction, as it won't
    free metadata reserved space, but we may still free some data space
    before new data writes.

    Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
    Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • When doing a buffered write, through one of the write family syscalls, we
    look for ranges which currently don't have allocated extents and set the
    'delalloc new' bit on them, so that we can report a correct number of used
    blocks to the stat(2) syscall until delalloc is flushed and ordered extents
    complete.

    However there are a few other places where we can do a buffered write
    against a range that is mapped to a hole (no extent allocated) and where
    we do not set the 'new delalloc' bit. Those places are:

    - Doing a memory mapped write against a hole;

    - Cloning an inline extent into a hole starting at file offset 0;

    - Calling btrfs_cont_expand() when the i_size of the file is not aligned
    to the sector size and is located in a hole. For example when cloning
    to a destination offset beyond EOF.

    So after such cases, until the corresponding delalloc range is flushed and
    the respective ordered extents complete, we can report an incorrect number
    of blocks used through the stat(2) syscall.

    In some cases we can end up reporting 0 used blocks to stat(2), which is a
    particular bad value to report as it may mislead tools to think a file is
    completely sparse when its i_size is not zero, making them skip reading
    any data, an undesired consequence for tools such as archivers and other
    backup tools, as reported a long time ago in the following thread (and
    other past threads):

    https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

    Example reproducer:

    $ cat reproducer.sh
    #!/bin/bash

    MNT=/mnt/sdi
    DEV=/dev/sdi

    mkfs.btrfs -f $DEV > /dev/null
    # mkfs.xfs -f $DEV > /dev/null
    # mkfs.ext4 -F $DEV > /dev/null
    # mkfs.f2fs -f $DEV > /dev/null
    mount $DEV $MNT

    xfs_io -f -c "truncate 64K" \
    -c "mmap -w 0 64K" \
    -c "mwrite -S 0xab 0 64K" \
    -c "munmap" \
    $MNT/foo

    blocks_used=$(stat -c %b $MNT/foo)
    echo "blocks used: $blocks_used"

    if [ $blocks_used -eq 0 ]; then
    echo "ERROR: blocks used is 0"
    fi

    umount $DEV

    $ ./reproducer.sh
    blocks used: 0
    ERROR: blocks used is 0

    So move the logic that decides to set the 'delalloc bit' bit into the
    function btrfs_set_extent_delalloc(), since that is what we use for all
    those missing cases as well as for the cases that currently work well.

    This change is also preparatory work for an upcoming patch that fixes
    other problems related to tracking and reporting the number of bytes used
    by an inode.

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

05 Nov, 2020

7 commits

  • There is one error handling path that does not free ref, which may cause
    a minor memory leak.

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Dinghao Liu
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dinghao Liu
     
  • If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
    item, then it means the filesystem is inconsistent state. This is either
    corruption or a crafted image. Fail the mount as this needs a closer
    look what is actually wrong.

    As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
    item, in __btrfs_free_extra_devids() we determine that there is an
    extra device, and free those extra devices but continue to mount the
    device.
    However, we were wrong in keeping tack of the rw_devices so the syzbot
    testcase failed:

    WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x198/0x1fd lib/dump_stack.c:118
    panic+0x347/0x7c0 kernel/panic.c:231
    __warn.cold+0x20/0x46 kernel/panic.c:600
    report_bug+0x1bd/0x210 lib/bug.c:198
    handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
    exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
    asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
    RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
    RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
    RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
    RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
    RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
    R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
    close_fs_devices fs/btrfs/volumes.c:1193 [inline]
    btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
    open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
    btrfs_fill_super fs/btrfs/super.c:1316 [inline]
    btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672

    The fix here is, when we determine that there isn't a replace item
    then fail the mount if there is a replace target device (devid 0).

    CC: stable@vger.kernel.org # 4.19+
    Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • Based on user feedback update the message printed when scrub fails to
    start due to write requirements. To make a distinction add a device id
    to the messages.

    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    David Sterba
     
  • Smatch complains that this code dereferences "entry" before checking
    whether it's NULL on the next line. Fortunately, rb_entry() will never
    return NULL so it doesn't cause a problem. We can clean up the NULL
    checking a bit to silence the warning and make the code more clear.

    Reviewed-by: Qu Wenruo
    Signed-off-by: Dan Carpenter
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dan Carpenter
     
  • The minimum reserve size was adjusted to take into account the height of
    the tree we are merging, however we can have a root with a level == 0.
    What we want is root_level + 1 to get the number of nodes we may have to
    cow. This fixes the enospc_debug warning pops with btrfs/101.

    Nikolay: this fixes failures on btrfs/060 btrfs/062 btrfs/063 and
    btrfs/195 That I was seeing, the call trace was:

    [ 3680.515564] ------------[ cut here ]------------
    [ 3680.515566] BTRFS: block rsv returned -28
    [ 3680.515585] WARNING: CPU: 2 PID: 8339 at fs/btrfs/block-rsv.c:521 btrfs_use_block_rsv+0x162/0x180
    [ 3680.515587] Modules linked in:
    [ 3680.515591] CPU: 2 PID: 8339 Comm: btrfs Tainted: G W 5.9.0-rc8-default #95
    [ 3680.515593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
    [ 3680.515595] RIP: 0010:btrfs_use_block_rsv+0x162/0x180
    [ 3680.515600] RSP: 0018:ffffa01ac9753910 EFLAGS: 00010282
    [ 3680.515602] RAX: 0000000000000000 RBX: ffff984b34200000 RCX: 0000000000000027
    [ 3680.515604] RDX: 0000000000000027 RSI: 0000000000000000 RDI: ffff984b3bd19e28
    [ 3680.515606] RBP: 0000000000004000 R08: ffff984b3bd19e20 R09: 0000000000000001
    [ 3680.515608] R10: 0000000000000004 R11: 0000000000000046 R12: ffff984b264fdc00
    [ 3680.515609] R13: ffff984b13149000 R14: 00000000ffffffe4 R15: ffff984b34200000
    [ 3680.515613] FS: 00007f4e2912b8c0(0000) GS:ffff984b3bd00000(0000) knlGS:0000000000000000
    [ 3680.515615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 3680.515617] CR2: 00007fab87122150 CR3: 0000000118e42000 CR4: 00000000000006e0
    [ 3680.515620] Call Trace:
    [ 3680.515627] btrfs_alloc_tree_block+0x8b/0x340
    [ 3680.515633] ? __lock_acquire+0x51a/0xac0
    [ 3680.515646] alloc_tree_block_no_bg_flush+0x4f/0x60
    [ 3680.515651] __btrfs_cow_block+0x14e/0x7e0
    [ 3680.515662] btrfs_cow_block+0x144/0x2c0
    [ 3680.515670] merge_reloc_root+0x4d4/0x610
    [ 3680.515675] ? btrfs_lookup_fs_root+0x78/0x90
    [ 3680.515686] merge_reloc_roots+0xee/0x280
    [ 3680.515695] relocate_block_group+0x2ce/0x5e0
    [ 3680.515704] btrfs_relocate_block_group+0x16e/0x310
    [ 3680.515711] btrfs_relocate_chunk+0x38/0xf0
    [ 3680.515716] btrfs_shrink_device+0x200/0x560
    [ 3680.515728] btrfs_rm_device+0x1ae/0x6a6
    [ 3680.515744] ? _copy_from_user+0x6e/0xb0
    [ 3680.515750] btrfs_ioctl+0x1afe/0x28c0
    [ 3680.515755] ? find_held_lock+0x2b/0x80
    [ 3680.515760] ? do_user_addr_fault+0x1f8/0x418
    [ 3680.515773] ? __x64_sys_ioctl+0x77/0xb0
    [ 3680.515775] __x64_sys_ioctl+0x77/0xb0
    [ 3680.515781] do_syscall_64+0x31/0x70
    [ 3680.515785] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported-by: Nikolay Borisov
    Fixes: 44d354abf33e ("btrfs: relocation: review the call sites which can be interrupted by signal")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • To help with debugging, print the type of the block rsv when we fail to
    use our target block rsv in btrfs_use_block_rsv.

    This now produces:

    [ 544.672035] BTRFS: block rsv 1 returned -28

    which is still cryptic without consulting the enum in block-rsv.h but I
    guess it's better than nothing.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add note from Nikolay ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • On 32-bit systems, this shift will overflow for files larger than 4GB as
    start_index is unsigned long while the calls to btrfs_delalloc_*_space
    expect u64.

    CC: stable@vger.kernel.org # 4.4+
    Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
    Reviewed-by: Josef Bacik
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Sterba
    [ define the variable instead of repeating the shift ]
    Signed-off-by: David Sterba

    Matthew Wilcox (Oracle)
     

27 Oct, 2020

2 commits

  • By doing so we can associate the sequence counter to the chunk_mutex
    for lockdep purposes (compiled-out otherwise), the mutex is otherwise
    used on the write side.
    Also avoid explicitly disabling preemption around the write region as it
    will now be done automatically by the seqcount machinery based on the
    lock type.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Davidlohr Bueso
     
  • Since we switched to the iomap infrastructure in b5ff9f1a96e8f ("btrfs:
    switch to iomap for direct IO") we're calling generic_file_buffered_read()
    directly and not via generic_file_read_iter() anymore.

    If the read could read everything there is no need to bother calling
    generic_file_buffered_read(), like it is handled in
    generic_file_read_iter().

    If we call generic_file_buffered_read() in this case we can hit a
    situation where we do an invalid readahead and cause this UBSAN splat
    in fstest generic/091:

    run fstests generic/091 at 2020-10-21 10:52:32
    ================================================================================
    UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
    shift exponent 64 is too large for 64-bit type 'long unsigned int'
    CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    __dump_stack lib/dump_stack.c:77
    dump_stack+0x57/0x70 lib/dump_stack.c:118
    ubsan_epilogue+0x5/0x40 lib/ubsan.c:148
    __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395
    __roundup_pow_of_two ./include/linux/log2.h:57
    get_init_ra_size mm/readahead.c:318
    ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530
    generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199
    call_read_iter ./include/linux/fs.h:1876
    new_sync_read+0x102/0x180 fs/read_write.c:415
    vfs_read+0x11c/0x1a0 fs/read_write.c:481
    ksys_read+0x4f/0xc0 fs/read_write.c:615
    do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118
    RIP: 0033:0x7fe87fee992e
    RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e
    RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003
    RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000
    R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000
    R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000
    ================================================================================
    BTRFS info (device nullb0): has skinny extents
    BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B
    BTRFS info (device nullb0): enabling ssd optimizations

    Fixes: f85781fb505e ("btrfs: switch to iomap for direct IO")
    Reviewed-by: Goldwyn Rodrigues
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    Johannes Thumshirn
     

26 Oct, 2020

7 commits

  • I got the following lockdep splat with tree locks converted to rwsem
    patches on btrfs/104:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0+ #102 Not tainted
    ------------------------------------------------------
    btrfs-cleaner/903 is trying to acquire lock:
    ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

    but task is already holding lock:
    ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
    down_read+0x40/0x130
    caching_thread+0x53/0x5a0
    btrfs_work_helper+0xfa/0x520
    process_one_work+0x238/0x540
    worker_thread+0x55/0x3c0
    kthread+0x13a/0x150
    ret_from_fork+0x1f/0x30

    -> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
    __mutex_lock+0x7e/0x7b0
    btrfs_cache_block_group+0x1e0/0x510
    find_free_extent+0xb6e/0x12f0
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb1/0x330
    alloc_tree_block_no_bg_flush+0x4f/0x60
    __btrfs_cow_block+0x11d/0x580
    btrfs_cow_block+0x10c/0x220
    commit_cowonly_roots+0x47/0x2e0
    btrfs_commit_transaction+0x595/0xbd0
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x36/0xa0
    cleanup_mnt+0x12d/0x190
    task_work_run+0x5c/0xa0
    exit_to_user_mode_prepare+0x1df/0x200
    syscall_exit_to_user_mode+0x54/0x280
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&space_info->groups_sem){++++}-{3:3}:
    down_read+0x40/0x130
    find_free_extent+0x2ed/0x12f0
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb1/0x330
    alloc_tree_block_no_bg_flush+0x4f/0x60
    __btrfs_cow_block+0x11d/0x580
    btrfs_cow_block+0x10c/0x220
    commit_cowonly_roots+0x47/0x2e0
    btrfs_commit_transaction+0x595/0xbd0
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x36/0xa0
    cleanup_mnt+0x12d/0x190
    task_work_run+0x5c/0xa0
    exit_to_user_mode_prepare+0x1df/0x200
    syscall_exit_to_user_mode+0x54/0x280
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (btrfs-root-00){++++}-{3:3}:
    __lock_acquire+0x1167/0x2150
    lock_acquire+0xb9/0x3d0
    down_read_nested+0x43/0x130
    __btrfs_tree_read_lock+0x32/0x170
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x614/0x9d0
    btrfs_find_root+0x35/0x1b0
    btrfs_read_tree_root+0x61/0x120
    btrfs_get_root_ref+0x14b/0x600
    find_parent_nodes+0x3e6/0x1b30
    btrfs_find_all_roots_safe+0xb4/0x130
    btrfs_find_all_roots+0x60/0x80
    btrfs_qgroup_trace_extent_post+0x27/0x40
    btrfs_add_delayed_data_ref+0x3fd/0x460
    btrfs_free_extent+0x42/0x100
    __btrfs_mod_ref+0x1d7/0x2f0
    walk_up_proc+0x11c/0x400
    walk_up_tree+0xf0/0x180
    btrfs_drop_snapshot+0x1c7/0x780
    btrfs_clean_one_deleted_snapshot+0xfb/0x110
    cleaner_kthread+0xd4/0x140
    kthread+0x13a/0x150
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Chain exists of:
    btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&fs_info->commit_root_sem);
    lock(&caching_ctl->mutex);
    lock(&fs_info->commit_root_sem);
    lock(btrfs-root-00);

    *** DEADLOCK ***

    3 locks held by btrfs-cleaner/903:
    #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
    #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
    #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

    stack backtrace:
    CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
    dump_stack+0x8b/0xb0
    check_noncircular+0xcf/0xf0
    __lock_acquire+0x1167/0x2150
    ? __bfs+0x42/0x210
    lock_acquire+0xb9/0x3d0
    ? __btrfs_tree_read_lock+0x32/0x170
    down_read_nested+0x43/0x130
    ? __btrfs_tree_read_lock+0x32/0x170
    __btrfs_tree_read_lock+0x32/0x170
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x614/0x9d0
    ? find_held_lock+0x2b/0x80
    btrfs_find_root+0x35/0x1b0
    ? do_raw_spin_unlock+0x4b/0xa0
    btrfs_read_tree_root+0x61/0x120
    btrfs_get_root_ref+0x14b/0x600
    find_parent_nodes+0x3e6/0x1b30
    btrfs_find_all_roots_safe+0xb4/0x130
    btrfs_find_all_roots+0x60/0x80
    btrfs_qgroup_trace_extent_post+0x27/0x40
    btrfs_add_delayed_data_ref+0x3fd/0x460
    btrfs_free_extent+0x42/0x100
    __btrfs_mod_ref+0x1d7/0x2f0
    walk_up_proc+0x11c/0x400
    walk_up_tree+0xf0/0x180
    btrfs_drop_snapshot+0x1c7/0x780
    ? btrfs_clean_one_deleted_snapshot+0x73/0x110
    btrfs_clean_one_deleted_snapshot+0xfb/0x110
    cleaner_kthread+0xd4/0x140
    ? btrfs_alloc_root+0x50/0x50
    kthread+0x13a/0x150
    ? kthread_create_worker_on_cpu+0x40/0x40
    ret_from_fork+0x1f/0x30
    BTRFS info (device sdb): disk space caching is enabled
    BTRFS info (device sdb): has skinny extents

    This happens because qgroups does a backref lookup when we create a
    delayed ref. From here it may have to look up a root from an indirect
    ref, which does a normal lookup on the tree_root, which takes the read
    lock on the tree_root nodes.

    To fix this we need to add a variant for looking up roots that searches
    the commit root of the tree_root. Then when we do the backref search
    using the commit root we are sure to not take any locks on the tree_root
    nodes. This gets rid of the lockdep splat when running btrfs/104.

    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • When enabling qgroups we walk the tree_root and then add a qgroup item
    for every root that we have. This creates a lock dependency on the
    tree_root and qgroup_root, which results in the following lockdep splat
    (with tree locks using rwsem), eg. in tests btrfs/017 or btrfs/022:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0-default+ #1299 Not tainted
    ------------------------------------------------------
    btrfs/24552 is trying to acquire lock:
    ffff9142dfc5f630 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

    but task is already holding lock:
    ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (btrfs-root-00){++++}-{3:3}:
    __lock_acquire+0x3fb/0x730
    lock_acquire.part.0+0x6a/0x130
    down_read_nested+0x46/0x130
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
    btrfs_search_slot+0xc3/0x9f0 [btrfs]
    btrfs_insert_item+0x6e/0x140 [btrfs]
    btrfs_create_tree+0x1cb/0x240 [btrfs]
    btrfs_quota_enable+0xcd/0x790 [btrfs]
    btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
    __x64_sys_ioctl+0x83/0xa0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (btrfs-quota-00){++++}-{3:3}:
    check_prev_add+0x91/0xc30
    validate_chain+0x491/0x750
    __lock_acquire+0x3fb/0x730
    lock_acquire.part.0+0x6a/0x130
    down_read_nested+0x46/0x130
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
    btrfs_search_slot+0xc3/0x9f0 [btrfs]
    btrfs_insert_empty_items+0x58/0xa0 [btrfs]
    add_qgroup_item.part.0+0x72/0x210 [btrfs]
    btrfs_quota_enable+0x3bb/0x790 [btrfs]
    btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
    __x64_sys_ioctl+0x83/0xa0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(btrfs-root-00);
    lock(btrfs-quota-00);
    lock(btrfs-root-00);
    lock(btrfs-quota-00);

    *** DEADLOCK ***

    5 locks held by btrfs/24552:
    #0: ffff9142df431478 (sb_writers#10){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0xa0
    #1: ffff9142f9b10cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl_quota_ctl+0x7b/0xe0 [btrfs]
    #2: ffff9142f9b11a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0x790 [btrfs]
    #3: ffff9142df431698 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x406/0x510 [btrfs]
    #4: ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

    stack backtrace:
    CPU: 1 PID: 24552 Comm: btrfs Not tainted 5.9.0-default+ #1299
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0x77/0x97
    check_noncircular+0xf3/0x110
    check_prev_add+0x91/0xc30
    validate_chain+0x491/0x750
    __lock_acquire+0x3fb/0x730
    lock_acquire.part.0+0x6a/0x130
    ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    ? lock_acquire+0xc4/0x140
    ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    down_read_nested+0x46/0x130
    ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    ? btrfs_root_node+0xd9/0x200 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
    btrfs_search_slot+0xc3/0x9f0 [btrfs]
    btrfs_insert_empty_items+0x58/0xa0 [btrfs]
    add_qgroup_item.part.0+0x72/0x210 [btrfs]
    btrfs_quota_enable+0x3bb/0x790 [btrfs]
    btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
    __x64_sys_ioctl+0x83/0xa0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fix this by dropping the path whenever we find a root item, add the
    qgroup item, and then re-lookup the root item we found and continue
    processing roots.

    Reported-by: David Sterba
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Very sporadically I had test case btrfs/069 from fstests hanging (for
    years, it is not a recent regression), with the following traces in
    dmesg/syslog:

    [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
    [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
    [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
    [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
    [162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
    [162513.514751] Call Trace:
    [162513.514761] __schedule+0x5ce/0xd00
    [162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.514771] schedule+0x46/0xf0
    [162513.514844] wait_current_trans+0xde/0x140 [btrfs]
    [162513.514850] ? finish_wait+0x90/0x90
    [162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
    [162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
    [162513.514894] kthread+0x153/0x170
    [162513.514897] ? kthread_stop+0x2c0/0x2c0
    [162513.514902] ret_from_fork+0x22/0x30
    [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
    [162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
    [162513.515682] Call Trace:
    [162513.515688] __schedule+0x5ce/0xd00
    [162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.515697] schedule+0x46/0xf0
    [162513.515712] wait_current_trans+0xde/0x140 [btrfs]
    [162513.515716] ? finish_wait+0x90/0x90
    [162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
    [162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
    [162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
    [162513.515761] iterate_supers+0x87/0xf0
    [162513.515765] ksys_sync+0x60/0xb0
    [162513.515768] __do_sys_sync+0xa/0x10
    [162513.515771] do_syscall_64+0x33/0x80
    [162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.515781] RIP: 0033:0x7f5238f50bd7
    [162513.515782] Code: Bad RIP value.
    [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
    [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
    [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
    [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
    [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
    [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
    [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
    [162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
    [162513.516620] Call Trace:
    [162513.516625] __schedule+0x5ce/0xd00
    [162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.516634] schedule+0x46/0xf0
    [162513.516647] wait_current_trans+0xde/0x140 [btrfs]
    [162513.516650] ? finish_wait+0x90/0x90
    [162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
    [162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
    [162513.516686] __vfs_setxattr+0x66/0x80
    [162513.516691] __vfs_setxattr_noperm+0x70/0x200
    [162513.516697] vfs_setxattr+0x6b/0x120
    [162513.516703] setxattr+0x125/0x240
    [162513.516709] ? lock_acquire+0xb1/0x480
    [162513.516712] ? mnt_want_write+0x20/0x50
    [162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
    [162513.516723] ? preempt_count_add+0x49/0xa0
    [162513.516725] ? __sb_start_write+0x19b/0x290
    [162513.516727] ? preempt_count_add+0x49/0xa0
    [162513.516732] path_setxattr+0xba/0xd0
    [162513.516739] __x64_sys_setxattr+0x27/0x30
    [162513.516741] do_syscall_64+0x33/0x80
    [162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.516745] RIP: 0033:0x7f5238f56d5a
    [162513.516746] Code: Bad RIP value.
    [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
    [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
    [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
    [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
    [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
    [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
    [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
    [162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
    [162513.517780] Call Trace:
    [162513.517786] __schedule+0x5ce/0xd00
    [162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.517796] schedule+0x46/0xf0
    [162513.517810] wait_current_trans+0xde/0x140 [btrfs]
    [162513.517814] ? finish_wait+0x90/0x90
    [162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
    [162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
    [162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
    [162513.517865] iterate_supers+0x87/0xf0
    [162513.517869] ksys_sync+0x60/0xb0
    [162513.517872] __do_sys_sync+0xa/0x10
    [162513.517875] do_syscall_64+0x33/0x80
    [162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.517881] RIP: 0033:0x7f5238f50bd7
    [162513.517883] Code: Bad RIP value.
    [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
    [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
    [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
    [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
    [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
    [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
    [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
    [162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
    [162513.519160] Call Trace:
    [162513.519165] __schedule+0x5ce/0xd00
    [162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.519174] schedule+0x46/0xf0
    [162513.519190] wait_current_trans+0xde/0x140 [btrfs]
    [162513.519193] ? finish_wait+0x90/0x90
    [162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
    [162513.519222] btrfs_create+0x57/0x200 [btrfs]
    [162513.519230] lookup_open+0x522/0x650
    [162513.519246] path_openat+0x2b8/0xa50
    [162513.519270] do_filp_open+0x91/0x100
    [162513.519275] ? find_held_lock+0x32/0x90
    [162513.519280] ? lock_acquired+0x33b/0x470
    [162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
    [162513.519287] ? _raw_spin_unlock+0x29/0x40
    [162513.519295] do_sys_openat2+0x20d/0x2d0
    [162513.519300] do_sys_open+0x44/0x80
    [162513.519304] do_syscall_64+0x33/0x80
    [162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.519309] RIP: 0033:0x7f5238f4a903
    [162513.519310] Code: Bad RIP value.
    [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
    [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
    [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
    [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
    [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
    [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
    [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
    [162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
    [162513.520511] Call Trace:
    [162513.520516] __schedule+0x5ce/0xd00
    [162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.520525] schedule+0x46/0xf0
    [162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
    [162513.520548] ? finish_wait+0x90/0x90
    [162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
    [162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
    [162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
    [162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
    [162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
    [162513.520643] ? do_sigaction+0xf3/0x240
    [162513.520645] ? find_held_lock+0x32/0x90
    [162513.520648] ? do_sigaction+0xf3/0x240
    [162513.520651] ? lock_acquired+0x33b/0x470
    [162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
    [162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
    [162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
    [162513.520662] ? do_sigaction+0xf3/0x240
    [162513.520671] ? __x64_sys_ioctl+0x83/0xb0
    [162513.520672] __x64_sys_ioctl+0x83/0xb0
    [162513.520677] do_syscall_64+0x33/0x80
    [162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.520681] RIP: 0033:0x7fc3cd307d87
    [162513.520682] Code: Bad RIP value.
    [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
    [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
    [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
    [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
    [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
    [162513.520703]
    Showing all locks held in the system:
    [162513.520712] 1 lock held by khungtaskd/54:
    [162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
    [162513.520728] 1 lock held by in:imklog/596:
    [162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
    [162513.520782] 1 lock held by btrfs-transacti/1356167:
    [162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
    [162513.520798] 1 lock held by btrfs/1356190:
    [162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
    [162513.520805] 1 lock held by fsstress/1356184:
    [162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
    [162513.520811] 3 locks held by fsstress/1356185:
    [162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
    [162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
    [162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
    [162513.520833] 1 lock held by fsstress/1356196:
    [162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
    [162513.520838] 3 locks held by fsstress/1356197:
    [162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
    [162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
    [162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
    [162513.520858] 2 locks held by btrfs/1356211:
    [162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
    [162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

    This was weird because the stack traces show that a transaction commit,
    triggered by a device replace operation, is blocking trying to pause any
    running scrubs but there are no stack traces of blocked tasks doing a
    scrub.

    After poking around with drgn, I noticed there was a scrub task that was
    constantly running and blocking for shorts periods of time:

    >>> t = find_task(prog, 1356190)
    >>> prog.stack_trace(t)
    #0 __schedule+0x5ce/0xcfc
    #1 schedule+0x46/0xe4
    #2 schedule_timeout+0x1df/0x475
    #3 btrfs_reada_wait+0xda/0x132
    #4 scrub_stripe+0x2a8/0x112f
    #5 scrub_chunk+0xcd/0x134
    #6 scrub_enumerate_chunks+0x29e/0x5ee
    #7 btrfs_scrub_dev+0x2d5/0x91b
    #8 btrfs_ioctl+0x7f5/0x36e7
    #9 __x64_sys_ioctl+0x83/0xb0
    #10 do_syscall_64+0x33/0x77
    #11 entry_SYSCALL_64+0x7c/0x156

    Which corresponds to:

    int btrfs_reada_wait(void *handle)
    {
    struct reada_control *rc = handle;
    struct btrfs_fs_info *fs_info = rc->fs_info;

    while (atomic_read(&rc->elems)) {
    if (!atomic_read(&fs_info->reada_works_cnt))
    reada_start_machine(fs_info);
    wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
    (HZ + 9) / 10);
    }
    (...)

    So the counter "rc->elems" was set to 1 and never decreased to 0, causing
    the scrub task to loop forever in that function. Then I used the following
    script for drgn to check the readahead requests:

    $ cat dump_reada.py
    import sys
    import drgn
    from drgn import NULL, Object, cast, container_of, execscript, \
    reinterpret, sizeof
    from drgn.helpers.linux import *

    mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

    mnt = None
    for mnt in for_each_mount(prog, dst = mnt_path):
    pass

    if mnt is None:
    sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
    sys.exit(1)

    fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

    def dump_re(re):
    nzones = re.nzones.value_()
    print(f're at {hex(re.value_())}')
    print(f'\t logical {re.logical.value_()}')
    print(f'\t refcnt {re.refcnt.value_()}')
    print(f'\t nzones {nzones}')
    for i in range(nzones):
    dev = re.zones[i].device
    name = dev.name.str.string_()
    print(f'\t\t dev id {dev.devid.value_()} name {name}')
    print()

    for _, e in radix_tree_for_each(fs_info.reada_tree):
    re = cast('struct reada_extent *', e)
    dump_re(re)

    $ drgn dump_reada.py
    re at 0xffff8f3da9d25ad8
    logical 38928384
    refcnt 1
    nzones 1
    dev id 0 name b'/dev/sdd'
    $

    So there was one readahead extent with a single zone corresponding to the
    source device of that last device replace operation logged in dmesg/syslog.
    Also the ID of that zone's device was 0 which is a special value set in
    the source device of a device replace operation when the operation finishes
    (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
    confirming again that device /dev/sdd was the source of a device replace
    operation.

    Normally there should be as many zones in the readahead extent as there are
    devices, and I wasn't expecting the extent to be in a block group with a
    'single' profile, so I went and confirmed with the following drgn script
    that there weren't any single profile block groups:

    $ cat dump_block_groups.py
    import sys
    import drgn
    from drgn import NULL, Object, cast, container_of, execscript, \
    reinterpret, sizeof
    from drgn.helpers.linux import *

    mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

    mnt = None
    for mnt in for_each_mount(prog, dst = mnt_path):
    pass

    if mnt is None:
    sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
    sys.exit(1)

    fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

    BTRFS_BLOCK_GROUP_DATA = (1 << 0)
    BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
    BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
    BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
    BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
    BTRFS_BLOCK_GROUP_DUP = (1 << 5)
    BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
    BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
    BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
    BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
    BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

    def bg_flags_string(bg):
    flags = bg.flags.value_()
    ret = ''
    if flags & BTRFS_BLOCK_GROUP_DATA:
    ret = 'data'
    if flags & BTRFS_BLOCK_GROUP_METADATA:
    if len(ret) > 0:
    ret += '|'
    ret += 'meta'
    if flags & BTRFS_BLOCK_GROUP_SYSTEM:
    if len(ret) > 0:
    ret += '|'
    ret += 'system'
    if flags & BTRFS_BLOCK_GROUP_RAID0:
    ret += ' raid0'
    elif flags & BTRFS_BLOCK_GROUP_RAID1:
    ret += ' raid1'
    elif flags & BTRFS_BLOCK_GROUP_DUP:
    ret += ' dup'
    elif flags & BTRFS_BLOCK_GROUP_RAID10:
    ret += ' raid10'
    elif flags & BTRFS_BLOCK_GROUP_RAID5:
    ret += ' raid5'
    elif flags & BTRFS_BLOCK_GROUP_RAID6:
    ret += ' raid6'
    elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
    ret += ' raid1c3'
    elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
    ret += ' raid1c4'
    else:
    ret += ' single'

    return ret

    def dump_bg(bg):
    print()
    print(f'block group at {hex(bg.value_())}')
    print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
    print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

    bg_root = fs_info.block_group_cache_tree.address_of_()
    for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
    dump_bg(bg)

    $ drgn dump_block_groups.py

    block group at 0xffff8f3d673b0400
    start 22020096 length 16777216
    flags 258 - system raid6

    block group at 0xffff8f3d53ddb400
    start 38797312 length 536870912
    flags 260 - meta raid6

    block group at 0xffff8f3d5f4d9c00
    start 575668224 length 2147483648
    flags 257 - data raid6

    block group at 0xffff8f3d08189000
    start 2723151872 length 67108864
    flags 258 - system raid6

    block group at 0xffff8f3db70ff000
    start 2790260736 length 1073741824
    flags 260 - meta raid6

    block group at 0xffff8f3d5f4dd800
    start 3864002560 length 67108864
    flags 258 - system raid6

    block group at 0xffff8f3d67037000
    start 3931111424 length 2147483648
    flags 257 - data raid6
    $

    So there were only 2 reasons left for having a readahead extent with a
    single zone: reada_find_zone(), called when creating a readahead extent,
    returned NULL either because we failed to find the corresponding block
    group or because a memory allocation failed. With some additional and
    custom tracing I figured out that on every further ocurrence of the
    problem the block group had just been deleted when we were looping to
    create the zones for the readahead extent (at reada_find_extent()), so we
    ended up with only one zone in the readahead extent, corresponding to a
    device that ends up getting replaced.

    So after figuring that out it became obvious why the hang happens:

    1) Task A starts a scrub on any device of the filesystem, except for
    device /dev/sdd;

    2) Task B starts a device replace with /dev/sdd as the source device;

    3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
    starting to scrub a stripe from block group X. This call to
    btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
    calls reada_add_block(), it passes the logical address of the extent
    tree's root node as its 'logical' argument - a value of 38928384;

    4) Task A then enters reada_find_extent(), called from reada_add_block().
    It finds there isn't any existing readahead extent for the logical
    address 38928384, so it proceeds to the path of creating a new one.

    It calls btrfs_map_block() to find out which stripes exist for the block
    group X. On the first iteration of the for loop that iterates over the
    stripes, it finds the stripe for device /dev/sdd, so it creates one
    zone for that device and adds it to the readahead extent. Before getting
    into the second iteration of the loop, the cleanup kthread deletes block
    group X because it was empty. So in the iterations for the remaining
    stripes it does not add more zones to the readahead extent, because the
    calls to reada_find_zone() returned NULL because they couldn't find
    block group X anymore.

    As a result the new readahead extent has a single zone, corresponding to
    the device /dev/sdd;

    4) Before task A returns to btrfs_reada_add() and queues the readahead job
    for the readahead work queue, task B finishes the device replace and at
    btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
    device /dev/sdg;

    5) Task A returns to reada_add_block(), which increments the counter
    "->elems" of the reada_control structure allocated at btrfs_reada_add().

    Then it returns back to btrfs_reada_add() and calls
    reada_start_machine(). This queues a job in the readahead work queue to
    run the function reada_start_machine_worker(), which calls
    __reada_start_machine().

    At __reada_start_machine() we take the device list mutex and for each
    device found in the current device list, we call
    reada_start_machine_dev() to start the readahead work. However at this
    point the device /dev/sdd was already freed and is not in the device
    list anymore.

    This means the corresponding readahead for the extent at 38928384 is
    never started, and therefore the "->elems" counter of the reada_control
    structure allocated at btrfs_reada_add() never goes down to 0, causing
    the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

    Note that the readahead request can be made either after the device replace
    started or before it started, however in pratice it is very unlikely that a
    device replace is able to start after a readahead request is made and is
    able to complete before the readahead request completes - maybe only on a
    very small and nearly empty filesystem.

    This hang however is not the only problem we can have with readahead and
    device removals. When the readahead extent has other zones other than the
    one corresponding to the device that is being removed (either by a device
    replace or a device remove operation), we risk having a use-after-free on
    the device when dropping the last reference of the readahead extent.

    For example if we create a readahead extent with two zones, one for the
    device /dev/sdd and one for the device /dev/sde:

    1) Before the readahead worker starts, the device /dev/sdd is removed,
    and the corresponding btrfs_device structure is freed. However the
    readahead extent still has the zone pointing to the device structure;

    2) When the readahead worker starts, it only finds device /dev/sde in the
    current device list of the filesystem;

    3) It starts the readahead work, at reada_start_machine_dev(), using the
    device /dev/sde;

    4) Then when it finishes reading the extent from device /dev/sde, it calls
    __readahead_hook() which ends up dropping the last reference on the
    readahead extent through the last call to reada_extent_put();

    5) At reada_extent_put() it iterates over each zone of the readahead extent
    and attempts to delete an element from the device's 'reada_extents'
    radix tree, resulting in a use-after-free, as the device pointer of the
    zone for /dev/sdd is now stale. We can also access the device after
    dropping the last reference of a zone, through reada_zone_release(),
    also called by reada_extent_put().

    And a device remove suffers the same problem, however since it shrinks the
    device size down to zero before removing the device, it is very unlikely to
    still have readahead requests not completed by the time we free the device,
    the only possibility is if the device has a very little space allocated.

    While the hang problem is exclusive to scrub, since it is currently the
    only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
    problem affects any path that triggers readhead, which includes
    btree_readahead_hook() and __readahead_hook() (a readahead worker can
    trigger readahed for the children of a node) for example - any path that
    ends up calling reada_add_block() can trigger the use-after-free after a
    device is removed.

    So fix this by waiting for any readahead requests for a device to complete
    before removing a device, ensuring that while waiting for existing ones no
    new ones can be made.

    This problem has been around for a very long time - the readahead code was
    added in 2011, device remove exists since 2008 and device replace was
    introduced in 2013, hard to pick a specific commit for a git Fixes tag.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • If we fail to find suitable zones for a new readahead extent, we end up
    leaving a stale pointer in the global readahead extents radix tree
    (fs_info->reada_tree), which can trigger the following trace later on:

    [13367.696354] BUG: kernel NULL pointer dereference, address: 00000000000000b0
    [13367.696802] #PF: supervisor read access in kernel mode
    [13367.697249] #PF: error_code(0x0000) - not-present page
    [13367.697721] PGD 0 P4D 0
    [13367.698171] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    [13367.698632] CPU: 6 PID: 851214 Comm: btrfs Tainted: G W 5.9.0-rc6-btrfs-next-69 #1
    [13367.699100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    [13367.700069] RIP: 0010:__lock_acquire+0x20a/0x3970
    [13367.700562] Code: ff 1f 0f b7 c0 48 0f (...)
    [13367.701609] RSP: 0018:ffffb14448f57790 EFLAGS: 00010046
    [13367.702140] RAX: 0000000000000000 RBX: 29b935140c15e8cf RCX: 0000000000000000
    [13367.702698] RDX: 0000000000000002 RSI: ffffffffb3d66bd0 RDI: 0000000000000046
    [13367.703240] RBP: ffff8a52ba8ac040 R08: 00000c2866ad9288 R09: 0000000000000001
    [13367.703783] R10: 0000000000000001 R11: 00000000b66d9b53 R12: ffff8a52ba8ac9b0
    [13367.704330] R13: 0000000000000000 R14: ffff8a532b6333e8 R15: 0000000000000000
    [13367.704880] FS: 00007fe1df6b5700(0000) GS:ffff8a5376600000(0000) knlGS:0000000000000000
    [13367.705438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [13367.705995] CR2: 00000000000000b0 CR3: 000000022cca8004 CR4: 00000000003706e0
    [13367.706565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [13367.707127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [13367.707686] Call Trace:
    [13367.708246] ? ___slab_alloc+0x395/0x740
    [13367.708820] ? reada_add_block+0xae/0xee0 [btrfs]
    [13367.709383] lock_acquire+0xb1/0x480
    [13367.709955] ? reada_add_block+0xe0/0xee0 [btrfs]
    [13367.710537] ? reada_add_block+0xae/0xee0 [btrfs]
    [13367.711097] ? rcu_read_lock_sched_held+0x5d/0x90
    [13367.711659] ? kmem_cache_alloc_trace+0x8d2/0x990
    [13367.712221] ? lock_acquired+0x33b/0x470
    [13367.712784] _raw_spin_lock+0x34/0x80
    [13367.713356] ? reada_add_block+0xe0/0xee0 [btrfs]
    [13367.713966] reada_add_block+0xe0/0xee0 [btrfs]
    [13367.714529] ? btrfs_root_node+0x15/0x1f0 [btrfs]
    [13367.715077] btrfs_reada_add+0x117/0x170 [btrfs]
    [13367.715620] scrub_stripe+0x21e/0x10d0 [btrfs]
    [13367.716141] ? kvm_sched_clock_read+0x5/0x10
    [13367.716657] ? __lock_acquire+0x41e/0x3970
    [13367.717184] ? scrub_chunk+0x60/0x140 [btrfs]
    [13367.717697] ? find_held_lock+0x32/0x90
    [13367.718254] ? scrub_chunk+0x60/0x140 [btrfs]
    [13367.718773] ? lock_acquired+0x33b/0x470
    [13367.719278] ? scrub_chunk+0xcd/0x140 [btrfs]
    [13367.719786] scrub_chunk+0xcd/0x140 [btrfs]
    [13367.720291] scrub_enumerate_chunks+0x270/0x5c0 [btrfs]
    [13367.720787] ? finish_wait+0x90/0x90
    [13367.721281] btrfs_scrub_dev+0x1ee/0x620 [btrfs]
    [13367.721762] ? rcu_read_lock_any_held+0x8e/0xb0
    [13367.722235] ? preempt_count_add+0x49/0xa0
    [13367.722710] ? __sb_start_write+0x19b/0x290
    [13367.723192] btrfs_ioctl+0x7f5/0x36f0 [btrfs]
    [13367.723660] ? __fget_files+0x101/0x1d0
    [13367.724118] ? find_held_lock+0x32/0x90
    [13367.724559] ? __fget_files+0x101/0x1d0
    [13367.724982] ? __x64_sys_ioctl+0x83/0xb0
    [13367.725399] __x64_sys_ioctl+0x83/0xb0
    [13367.725802] do_syscall_64+0x33/0x80
    [13367.726188] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [13367.726574] RIP: 0033:0x7fe1df7add87
    [13367.726948] Code: 00 00 00 48 8b 05 09 91 (...)
    [13367.727763] RSP: 002b:00007fe1df6b4d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [13367.728179] RAX: ffffffffffffffda RBX: 000055ce1fb596a0 RCX: 00007fe1df7add87
    [13367.728604] RDX: 000055ce1fb596a0 RSI: 00000000c400941b RDI: 0000000000000003
    [13367.729021] RBP: 0000000000000000 R08: 00007fe1df6b5700 R09: 0000000000000000
    [13367.729431] R10: 00007fe1df6b5700 R11: 0000000000000246 R12: 00007ffd922b07de
    [13367.729842] R13: 00007ffd922b07df R14: 00007fe1df6b4e40 R15: 0000000000802000
    [13367.730275] Modules linked in: btrfs blake2b_generic xor (...)
    [13367.732638] CR2: 00000000000000b0
    [13367.733166] ---[ end trace d298b6805556acd9 ]---

    What happens is the following:

    1) At reada_find_extent() we don't find any existing readahead extent for
    the metadata extent starting at logical address X;

    2) So we proceed to create a new one. We then call btrfs_map_block() to get
    information about which stripes contain extent X;

    3) After that we iterate over the stripes and create only one zone for the
    readahead extent - only one because reada_find_zone() returned NULL for
    all iterations except for one, either because a memory allocation failed
    or it couldn't find the block group of the extent (it may have just been
    deleted);

    4) We then add the new readahead extent to the readahead extents radix
    tree at fs_info->reada_tree;

    5) Then we iterate over each zone of the new readahead extent, and find
    that the device used for that zone no longer exists, because it was
    removed or it was the source device of a device replace operation.
    Since this left 'have_zone' set to 0, after finishing the loop we jump
    to the 'error' label, call kfree() on the new readahead extent and
    return without removing it from the radix tree at fs_info->reada_tree;

    6) Any future call to reada_find_extent() for the logical address X will
    find the stale pointer in the readahead extents radix tree, increment
    its reference counter, which can trigger the use-after-free right
    away or return it to the caller reada_add_block() that results in the
    use-after-free of the example trace above.

    So fix this by making sure we delete the readahead extent from the radix
    tree if we fail to setup zones for it (when 'have_zone = 0').

    Fixes: 319450211842ba ("btrfs: reada: bypass adding extent when all zone failed")
    CC: stable@vger.kernel.org # 4.9+
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • If there's no parity and num_stripes < ncopies, a crafted image can
    trigger a division by zero in calc_stripe_length().

    The image was generated through fuzzing.

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Qu Wenruo
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209587
    Signed-off-by: Daniel Xu
    Signed-off-by: David Sterba

    Daniel Xu
     
  • This patch addresses a compile warning:

    fs/btrfs/extent-tree.c: In function '__btrfs_free_extent':
    fs/btrfs/extent-tree.c:3187:4: warning: format '%lu' expects argument of type 'long unsigned int', but argument 8 has type 'unsigned int' [-Wformat=]

    Fixes: 1c2a07f598d5 ("btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()")
    Reviewed-by: Filipe Manana
    Signed-off-by: Pujin Shi
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Pujin Shi
     
  • Dave reported a problem with my rwsem conversion patch where we got the
    following lockdep splat:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0-default+ #1297 Not tainted
    ------------------------------------------------------
    kswapd0/76 is trying to acquire lock:
    ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]

    but task is already holding lock:
    ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #4 (fs_reclaim){+.+.}-{0:0}:
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    fs_reclaim_acquire.part.0+0x25/0x30
    kmem_cache_alloc+0x30/0x9c0
    alloc_inode+0x81/0x90
    iget_locked+0xcd/0x1a0
    kernfs_get_inode+0x1b/0x130
    kernfs_get_tree+0x136/0x210
    sysfs_get_tree+0x1a/0x50
    vfs_get_tree+0x1d/0xb0
    path_mount+0x70f/0xa80
    do_mount+0x75/0x90
    __x64_sys_mount+0x8e/0xd0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #3 (kernfs_mutex){+.+.}-{3:3}:
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    __mutex_lock+0xa0/0xaf0
    kernfs_add_one+0x23/0x150
    kernfs_create_dir_ns+0x58/0x80
    sysfs_create_dir_ns+0x70/0xd0
    kobject_add_internal+0xbb/0x2d0
    kobject_add+0x7a/0xd0
    btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
    btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
    open_ctree+0x981/0x1108 [btrfs]
    btrfs_mount_root.cold+0xe/0xb0 [btrfs]
    legacy_get_tree+0x2d/0x60
    vfs_get_tree+0x1d/0xb0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    legacy_get_tree+0x2d/0x60
    vfs_get_tree+0x1d/0xb0
    path_mount+0x70f/0xa80
    do_mount+0x75/0x90
    __x64_sys_mount+0x8e/0xd0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #2 (btrfs-extent-00){++++}-{3:3}:
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    down_read_nested+0x45/0x220
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot+0x6d4/0xfd0 [btrfs]
    check_committed_ref+0x69/0x200 [btrfs]
    btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
    run_delalloc_nocow+0x446/0x9b0 [btrfs]
    btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
    writepage_delalloc+0xae/0x160 [btrfs]
    __extent_writepage+0x262/0x420 [btrfs]
    extent_write_cache_pages+0x2b6/0x510 [btrfs]
    extent_writepages+0x43/0x90 [btrfs]
    do_writepages+0x40/0xe0
    __writeback_single_inode+0x62/0x610
    writeback_sb_inodes+0x20f/0x500
    wb_writeback+0xef/0x4a0
    wb_do_writeback+0x49/0x2e0
    wb_workfn+0x81/0x340
    process_one_work+0x233/0x5d0
    worker_thread+0x50/0x3b0
    kthread+0x137/0x150
    ret_from_fork+0x1f/0x30

    -> #1 (btrfs-fs-00){++++}-{3:3}:
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    down_read_nested+0x45/0x220
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot+0x6d4/0xfd0 [btrfs]
    btrfs_lookup_inode+0x3a/0xc0 [btrfs]
    __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
    __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
    __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
    btrfs_commit_transaction+0x367/0xbc0 [btrfs]
    btrfs_mksubvol+0x2db/0x470 [btrfs]
    btrfs_mksnapshot+0x7b/0xb0 [btrfs]
    __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
    btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
    btrfs_ioctl+0xd0b/0x2690 [btrfs]
    __x64_sys_ioctl+0x6f/0xa0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
    check_prev_add+0x91/0xc60
    validate_chain+0xa6e/0x2a20
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    __mutex_lock+0xa0/0xaf0
    __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    btrfs_evict_inode+0x3cc/0x560 [btrfs]
    evict+0xd6/0x1c0
    dispose_list+0x48/0x70
    prune_icache_sb+0x54/0x80
    super_cache_scan+0x121/0x1a0
    do_shrink_slab+0x16d/0x3b0
    shrink_slab+0xb1/0x2e0
    shrink_node+0x230/0x6a0
    balance_pgdat+0x325/0x750
    kswapd+0x206/0x4d0
    kthread+0x137/0x150
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(fs_reclaim);
    lock(kernfs_mutex);
    lock(fs_reclaim);
    lock(&delayed_node->mutex);

    *** DEADLOCK ***

    3 locks held by kswapd0/76:
    #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
    #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
    #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50

    stack backtrace:
    CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0x77/0x97
    check_noncircular+0xff/0x110
    ? save_trace+0x50/0x470
    check_prev_add+0x91/0xc60
    validate_chain+0xa6e/0x2a20
    ? save_trace+0x50/0x470
    __lock_acquire+0x582/0xac0
    lock_acquire+0xca/0x430
    ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    __mutex_lock+0xa0/0xaf0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    ? __lock_acquire+0x582/0xac0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    ? btrfs_evict_inode+0x30b/0x560 [btrfs]
    ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
    btrfs_evict_inode+0x3cc/0x560 [btrfs]
    evict+0xd6/0x1c0
    dispose_list+0x48/0x70
    prune_icache_sb+0x54/0x80
    super_cache_scan+0x121/0x1a0
    do_shrink_slab+0x16d/0x3b0
    shrink_slab+0xb1/0x2e0
    shrink_node+0x230/0x6a0
    balance_pgdat+0x325/0x750
    kswapd+0x206/0x4d0
    ? finish_wait+0x90/0x90
    ? balance_pgdat+0x750/0x750
    kthread+0x137/0x150
    ? kthread_mod_delayed_work+0xc0/0xc0
    ret_from_fork+0x1f/0x30

    This happens because we are still holding the path open when we start
    adding the sysfs files for the block groups, which creates a dependency
    on fs_reclaim via the tree lock. Fix this by dropping the path before
    we start doing anything with sysfs.

    Reported-by: David Sterba
    CC: stable@vger.kernel.org # 5.8+
    Reviewed-by: Anand Jain
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

16 Oct, 2020

1 commit

  • When doing a fallocate() we have a short time window, after reserving an
    extent and before starting a transaction, where if relocation for the block
    group containing the reserved extent happens, we can end up missing the
    extent in the data relocation inode causing relocation to fail later.

    This only happens when we don't pass a transaction to the internal
    fallocate function __btrfs_prealloc_file_range(), which is for all the
    cases where fallocate() is called from user space (the internal use cases
    include space cache extent allocation and relocation).

    When the race triggers the relocation failure, it produces a trace like
    the following:

    [200611.995995] ------------[ cut here ]------------
    [200611.997084] BTRFS: Transaction aborted (error -2)
    [200611.998208] WARNING: CPU: 3 PID: 235845 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x3a0/0x5b0 [btrfs]
    [200611.999042] Modules linked in: dm_thin_pool dm_persistent_data (...)
    [200612.003287] CPU: 3 PID: 235845 Comm: btrfs Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [200612.004442] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    [200612.006186] RIP: 0010:__btrfs_cow_block+0x3a0/0x5b0 [btrfs]
    [200612.007110] Code: 1b 00 00 02 72 2a 83 f8 fb 0f 84 b8 01 (...)
    [200612.007341] BTRFS warning (device sdb): Skipping commit of aborted transaction.
    [200612.008959] RSP: 0018:ffffaee38550f918 EFLAGS: 00010286
    [200612.009672] BTRFS: error (device sdb) in cleanup_transaction:1901: errno=-30 Readonly filesystem
    [200612.010428] RAX: 0000000000000000 RBX: ffff9174d96f4000 RCX: 0000000000000000
    [200612.011078] BTRFS info (device sdb): forced readonly
    [200612.011862] RDX: 0000000000000001 RSI: ffffffffa8161978 RDI: 00000000ffffffff
    [200612.013215] RBP: ffff9172569a0f80 R08: 0000000000000000 R09: 0000000000000000
    [200612.014263] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9174b8403b88
    [200612.015203] R13: ffff9174b8400a88 R14: ffff9174c90f1000 R15: ffff9174a5a60e08
    [200612.016182] FS: 00007fa55cf878c0(0000) GS:ffff9174ece00000(0000) knlGS:0000000000000000
    [200612.017174] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [200612.018418] CR2: 00007f8fb8048148 CR3: 0000000428a46003 CR4: 00000000003706e0
    [200612.019510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [200612.020648] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [200612.021520] Call Trace:
    [200612.022434] btrfs_cow_block+0x10b/0x250 [btrfs]
    [200612.023407] do_relocation+0x54e/0x7b0 [btrfs]
    [200612.024343] ? do_raw_spin_unlock+0x4b/0xc0
    [200612.025280] ? _raw_spin_unlock+0x29/0x40
    [200612.026200] relocate_tree_blocks+0x3bc/0x6d0 [btrfs]
    [200612.027088] relocate_block_group+0x2f3/0x600 [btrfs]
    [200612.027961] btrfs_relocate_block_group+0x15e/0x340 [btrfs]
    [200612.028896] btrfs_relocate_chunk+0x38/0x110 [btrfs]
    [200612.029772] btrfs_balance+0xb22/0x1790 [btrfs]
    [200612.030601] ? btrfs_ioctl_balance+0x253/0x380 [btrfs]
    [200612.031414] btrfs_ioctl_balance+0x2cf/0x380 [btrfs]
    [200612.032279] btrfs_ioctl+0x620/0x36f0 [btrfs]
    [200612.033077] ? _raw_spin_unlock+0x29/0x40
    [200612.033948] ? handle_mm_fault+0x116d/0x1ca0
    [200612.034749] ? up_read+0x18/0x240
    [200612.035542] ? __x64_sys_ioctl+0x83/0xb0
    [200612.036244] __x64_sys_ioctl+0x83/0xb0
    [200612.037269] do_syscall_64+0x33/0x80
    [200612.038190] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [200612.038976] RIP: 0033:0x7fa55d07ed87
    [200612.040127] Code: 00 00 00 48 8b 05 09 91 0c 00 64 c7 00 26 (...)
    [200612.041669] RSP: 002b:00007ffd5ebf03e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
    [200612.042437] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa55d07ed87
    [200612.043511] RDX: 00007ffd5ebf0470 RSI: 00000000c4009420 RDI: 0000000000000003
    [200612.044250] RBP: 0000000000000003 R08: 000055d8362642a0 R09: 00007fa55d148be0
    [200612.044963] R10: fffffffffffff52e R11: 0000000000000206 R12: 00007ffd5ebf1614
    [200612.045683] R13: 00007ffd5ebf0470 R14: 0000000000000002 R15: 00007ffd5ebf0470
    [200612.046361] irq event stamp: 0
    [200612.047040] hardirqs last enabled at (0): [] 0x0
    [200612.047725] hardirqs last disabled at (0): [] copy_process+0x823/0x1bc0
    [200612.048387] softirqs last enabled at (0): [] copy_process+0x823/0x1bc0
    [200612.049024] softirqs last disabled at (0): [] 0x0
    [200612.049722] ---[ end trace 49006c6876e65227 ]---

    The race happens like this:

    1) Task A starts an fallocate() (plain or zero range) and it calls
    __btrfs_prealloc_file_range() with the 'trans' parameter set to NULL;

    2) Task A calls btrfs_reserve_extent() and gets an extent that belongs to
    block group X;

    3) Before task A gets into btrfs_replace_file_extents(), through the call
    to insert_prealloc_file_extent(), task B starts relocation of block
    group X;

    4) Task B enters btrfs_relocate_block_group() and it sets block group X to
    RO mode;

    5) Task B enters relocate_block_group(), it calls prepare_to_relocate()
    whichs joins/starts a transaction and then commits the transaction;

    6) Task B then starts scanning the extent tree looking for extents that
    belong to block group X - it does not find yet the extent reserved by
    task A, since that extent was not yet added to the extent tree, as its
    delayed reference was not even yet created at this point;

    7) The data relocation inode ends up not having the extent reserved by
    task A associated to it;

    8) Task A then starts a transaction through btrfs_replace_file_extents(),
    inserts a file extent item in the subvolume tree pointing to the
    reserved extent and creates a delayed reference for it;

    9) Task A finishes and returns success to user space;

    10) Later on, while relocation is still in progress, the leaf where task A
    inserted the new file extent item is COWed, so we end up at
    __btrfs_cow_block(), which calls btrfs_reloc_cow_block(), and that in
    turn calls relocation.c:replace_file_extents();

    11) At relocation.c:replace_file_extents() we iterate over all the items in
    the leaf and find the file extent item pointing to the extent that was
    allocated by task A, and then call relocation.c:get_new_location(), to
    find the new location for the extent;

    12) However relocation.c:get_new_location() fails, returning -ENOENT,
    because it couldn't find a corresponding file extent item associated
    with the data relocation inode. This is because the extent was not seen
    in the extent tree at step 6). The -ENOENT error is propagated to
    __btrfs_cow_block(), which aborts the transaction.

    So fix this simply by decrementing the block group's number of reservations
    after calling insert_prealloc_file_extent(), as relocation waits for that
    counter to go down to zero before calling prepare_to_relocate() and start
    looking for extents in the extent tree.

    This issue only started to happen recently as of commit 8fccebfa534c79
    ("btrfs: fix metadata reservation for fallocate that leads to transaction
    aborts"), because now we can reserve an extent before starting/joining a
    transaction, and previously we always did it after that, so relocation
    ended up waiting for a concurrent fallocate() to finish because before
    searching for the extents of the block group, it starts/joins a transaction
    and then commits it (at prepare_to_relocate()), which made it wait for the
    fallocate task to complete first.

    Fixes: 8fccebfa534c79 ("btrfs: fix metadata reservation for fallocate that leads to transaction aborts")
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

07 Oct, 2020

21 commits

  • Commit 8d875f95da43 ("btrfs: disable strict file flushes for
    renames and truncates") eliminated the notion of ordered operations and
    instead BTRFS_INODE_ORDERED_DATA_CLOSE only remained as a flag
    indicating that a file's content should be synced to disk in case a
    file is truncated and any writes happen to it concurrently. In fact
    this intendend behavior was broken until it was fixed in
    f6dc45c7a93a ("Btrfs: fix filemap_flush call in btrfs_file_release").

    All things considered let's give the flag a more descriptive name. Also
    slightly reword comments.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • This patch fixes the following sparse errors in
    fs/btrfs/super.c in function btrfs_show_devname()

    fs/btrfs/super.c: error: incompatible types in comparison expression (different address spaces):
    fs/btrfs/super.c: struct rcu_string [noderef] *
    fs/btrfs/super.c: struct rcu_string *

    The error was because of the following line in function btrfs_show_devname():

    if (first_dev)
    seq_escape(m, rcu_str_deref(first_dev->name), " \t\n\\");

    Annotating the btrfs_device::name member with __rcu fixes the sparse
    error.

    Acked-by: Joel Fernandes (Google)
    Signed-off-by: Madhuparna Bhowmik
    Signed-off-by: David Sterba

    Madhuparna Bhowmik
     
  • Many things can happen after the device is scanned and before the device
    is mounted. One such thing is losing the BTRFS_MAGIC on the device.
    If it happens we still won't free that device from the memory and cause
    the userland confusion.

    For example: As the BTRFS_IOC_DEV_INFO still carries the device path
    which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
    device which does not belong to the filesystem anymore:

    $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
    $ wipefs -a /dev/sdb
    # /dev/sdb does not contain magic signature
    $ mount -o degraded /dev/sda /btrfs
    $ btrfs fi show -m
    Label: none uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
    Total devices 2 FS bytes used 128.00KiB
    devid 1 size 3.00GiB used 571.19MiB path /dev/sda
    devid 2 size 3.00GiB used 571.19MiB path /dev/sdb

    We need to distinguish the missing signature and invalid superblock, so
    add a specific error code ENODATA for that. This also fixes failure of
    fstest btrfs/198.

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
    to a system lockup. It gets stuck trying to write back inodes, and the
    write back thread was trying to lock an extent buffer:

    $ cat /proc/2143497/stack
    [] __btrfs_tree_lock+0x108/0x250
    [] lock_extent_buffer_for_io+0x35e/0x3a0
    [] btree_write_cache_pages+0x15a/0x3b0
    [] do_writepages+0x28/0xb0
    [] __writeback_single_inode+0x54/0x5c0
    [] writeback_sb_inodes+0x1e8/0x510
    [] wb_writeback+0xcc/0x440
    [] wb_workfn+0xd7/0x650
    [] process_one_work+0x236/0x560
    [] worker_thread+0x55/0x3c0
    [] kthread+0x13a/0x150
    [] ret_from_fork+0x1f/0x30

    This is because we got an error while COWing a block, specifically here

    if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
    ret = btrfs_reloc_cow_block(trans, root, buf, cow);
    if (ret) {
    btrfs_abort_transaction(trans, ret);
    return ret;
    }
    }

    [16402.241552] BTRFS: Transaction aborted (error -2)
    [16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
    [16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
    [16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    [16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
    [16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
    [16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
    [16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
    [16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
    [16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
    [16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
    [16402.256321] FS: 00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
    [16402.256973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
    [16402.257867] Call Trace:
    [16402.258072] btrfs_cow_block+0x109/0x230
    [16402.258356] btrfs_search_slot+0x530/0x9d0
    [16402.258655] btrfs_lookup_file_extent+0x37/0x40
    [16402.259155] __btrfs_drop_extents+0x13c/0xd60
    [16402.259628] ? btrfs_block_rsv_migrate+0x4f/0xb0
    [16402.259949] btrfs_replace_file_extents+0x190/0x820
    [16402.260873] btrfs_clone+0x9ae/0xc00
    [16402.261139] btrfs_extent_same_range+0x66/0x90
    [16402.261771] btrfs_remap_file_range+0x353/0x3b1
    [16402.262333] vfs_dedupe_file_range_one.part.0+0xd5/0x140
    [16402.262821] vfs_dedupe_file_range+0x189/0x220
    [16402.263150] do_vfs_ioctl+0x552/0x700
    [16402.263662] __x64_sys_ioctl+0x62/0xb0
    [16402.264023] do_syscall_64+0x33/0x40
    [16402.264364] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [16402.264862] RIP: 0033:0x7f542c7d15cb
    [16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
    [16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
    [16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
    [16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
    [16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
    [16402.271498] irq event stamp: 0
    [16402.271846] hardirqs last enabled at (0): [] 0x0
    [16402.272497] hardirqs last disabled at (0): [] copy_process+0x6b9/0x1ba0
    [16402.273343] softirqs last enabled at (0): [] copy_process+0x6b9/0x1ba0
    [16402.273905] softirqs last disabled at (0): [] 0x0
    [16402.274338] ---[ end trace 737874a5a41a8236 ]---
    [16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
    [16402.276179] BTRFS info (device dm-9): forced readonly
    [16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
    [16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
    [16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
    [16402.280582] BTRFS info (device dm-9): balance: ended with status: -30

    The problem here is that as soon as we allocate the new block it is
    locked and marked dirty in the btree inode. This means that we could
    attempt to writeback this block and need to lock the extent buffer.
    However we're not unlocking it here and thus we deadlock.

    Fix this by unlocking the cow block if we have any errors inside of
    __btrfs_cow_block, and also free it so we do not leak it.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Since we now perform direct reads using i_rwsem, we can remove this
    inode flag used to co-ordinate unlocked reads.

    The truncate call takes i_rwsem. This means it is correctly synchronized
    with concurrent direct reads.

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Josef Bacik
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     
  • Since we removed the last user of dio_end_io() when btrfs got converted
    to iomap infrastructure ("btrfs: switch to iomap for direct IO"), remove
    the helper function dio_end_io().

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Josef Bacik
    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     
  • I noticed when fixing device stats for seed devices that we simply threw
    away the return value from btrfs_search_slot(). This is because we may
    not have stat items, but we could very well get an error, and thus miss
    reporting the error up the chain.

    Fix this by returning ret if it's an actual error, and then stop trying
    to init the rest of the devices stats and return the error up the chain.

    Reviewed-by: Anand Jain
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We recently started recording device stats across the fleet, and noticed
    a large increase in messages such as this

    BTRFS warning (device dm-0): get dev_stats failed, not yet valid

    on our tiers that use seed devices for their root devices. This is
    because we do not initialize the device stats for any seed devices if we
    have a sprout device and mount using that sprout device. The basic
    steps for reproducing are:

    $ mkfs seed device
    $ mount seed device
    # fill seed device
    $ umount seed device
    $ btrfstune -S 1 seed device
    $ mount seed device
    $ btrfs device add -f sprout device /mnt/wherever
    $ umount /mnt/wherever
    $ mount sprout device /mnt/wherever
    $ btrfs device stats /mnt/wherever

    This will fail with the above message in dmesg.

    Fix this by iterating over the fs_devices->seed if they exist in
    btrfs_init_dev_stats. This fixed the problem and properly reports the
    stats for both devices.

    Reviewed-by: Anand Jain
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ rename to btrfs_device_init_dev_stats ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • It's no longer used just remove the function and any related code which
    was initialising it for inodes. No functional changes.

    Removing 8 bytes from extent_io_tree in turn reduces size of other
    structures where it is embedded, notably btrfs_inode where it reduces
    size by 24 bytes.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • No need to go through a function pointer indirection simply call
    submit_bio_hook directly by exporting and renaming the helper to
    btrfs_submit_metadata_bio. This makes the code more readable and should
    result in somewhat faster code due to no longer paying the price for
    specualtive attack mitigations that come with indirect function calls.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Instead export and rename the function to btrfs_submit_data_bio and
    call it directly in submit_one_bio. This avoids paying the cost for
    speculative attacks mitigations and improves code readability.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Use the is_data_inode helper.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • BTRFS has 2 inode types (for the purposes of the code in submit_one_bio)
    - ordinary data inodes (including the freespace inode) and the btree
    inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can
    never be called from submit_one_bio so just remove it.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • It's no longer used so let's remove it.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Don't call readpage_end_io_hook for the btree inode. Instead of relying
    on indirect calls to implement metadata buffer validation simply check
    if the inode whose page we are processing equals the btree inode. If it
    does call the necessary function.

    This is an improvement in 2 directions:

    1. We aren't paying the penalty of indirect calls in a post-speculation
    attacks world.

    2. The function is now named more explicitly so it's obvious what's
    going on

    This is in preparation to removing struct extent_io_ops altogether.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • During an incremental send, when an inode has multiple new references we
    might end up emitting rename operations for orphanizations that have a
    source path that is no longer valid due to a previous orphanization of
    some directory inode. This causes the receiver to fail since it tries
    to rename a path that does not exists.

    Example reproducer:

    $ cat reproducer.sh
    #!/bin/bash

    mkfs.btrfs -f /dev/sdi >/dev/null
    mount /dev/sdi /mnt/sdi

    touch /mnt/sdi/f1
    touch /mnt/sdi/f2
    mkdir /mnt/sdi/d1
    mkdir /mnt/sdi/d1/d2

    # Filesystem looks like:
    #
    # . (ino 256)
    # |----- f1 (ino 257)
    # |----- f2 (ino 258)
    # |----- d1/ (ino 259)
    # |----- d2/ (ino 260)

    btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
    btrfs send -f /tmp/snap1.send /mnt/sdi/snap1

    # Now do a series of changes such that:
    #
    # *) inode 258 has one new hardlink and the previous name changed
    #
    # *) both names conflict with the old names of two other inodes:
    #
    # 1) the new name "d1" conflicts with the old name of inode 259,
    # under directory inode 256 (root)
    #
    # 2) the new name "d2" conflicts with the old name of inode 260
    # under directory inode 259
    #
    # *) inodes 259 and 260 now have the old names of inode 258
    #
    # *) inode 257 is now located under inode 260 - an inode with a number
    # smaller than the inode (258) for which we created a second hard
    # link and swapped its names with inodes 259 and 260
    #
    ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
    mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1

    # Swap d1 and f2.
    mv /mnt/sdi/d1 /mnt/sdi/tmp
    mv /mnt/sdi/f2 /mnt/sdi/d1
    mv /mnt/sdi/tmp /mnt/sdi/f2

    # Swap d2 and f2_link
    mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
    mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
    mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link

    # Filesystem now looks like:
    #
    # . (ino 256)
    # |----- d1 (ino 258)
    # |----- f2/ (ino 259)
    # |----- f2_link/ (ino 260)
    # | |----- f1 (ino 257)
    # |
    # |----- d2 (ino 258)

    btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
    btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2

    mkfs.btrfs -f /dev/sdj >/dev/null
    mount /dev/sdj /mnt/sdj

    btrfs receive -f /tmp/snap1.send /mnt/sdj
    btrfs receive -f /tmp/snap2.send /mnt/sdj

    umount /mnt/sdi
    umount /mnt/sdj

    When executed the receive of the incremental stream fails:

    $ ./reproducer.sh
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
    At subvol /mnt/sdi/snap1
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
    At subvol /mnt/sdi/snap2
    At subvol snap1
    At snapshot snap2
    ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory

    This happens because:

    1) When processing inode 257 we end up computing the name for inode 259
    because it is an ancestor in the send snapshot, and at that point it
    still has its old name, "d1", from the parent snapshot because inode
    259 was not yet processed. We then cache that name, which is valid
    until we start processing inode 259 (or set the progress to 260 after
    processing its references);

    2) Later we start processing inode 258 and collecting all its new
    references into the list sctx->new_refs. The first reference in the
    list happens to be the reference for name "d1" while the reference for
    name "d2" is next (the last element of the list).
    We compute the full path "d1/d2" for this second reference and store
    it in the reference (its ->full_path member). The path used for the
    new parent directory was "d1" and not "f2" because inode 259, the
    new parent, was not yet processed;

    3) When we start processing the new references at process_recorded_refs()
    we start with the first reference in the list, for the new name "d1".
    Because there is a conflicting inode that was not yet processed, which
    is directory inode 259, we orphanize it, renaming it from "d1" to
    "o259-6-0";

    4) Then we start processing the new reference for name "d2", and we
    realize it conflicts with the reference of inode 260 in the parent
    snapshot. So we issue an orphanization operation for inode 260 by
    emitting a rename operation with a destination path of "o260-6-0"
    and a source path of "d1/d2" - this source path is the value we
    stored in the reference earlier at step 2), corresponding to the
    ->full_path member of the reference, however that path is no longer
    valid due to the orphanization of the directory inode 259 in step 3).
    This makes the receiver fail since the path does not exists, it should
    have been "o259-6-0/d2".

    Fix this by recomputing the full path of a reference before emitting an
    orphanization if we previously orphanized any directory, since that
    directory could be a parent in the new path. This is a rare scenario so
    keeping it simple and not checking if that previously orphanized directory
    is in fact an ancestor of the inode we are trying to orphanize.

    A test case for fstests follows soon.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • When doing an incremental send it is possible that when processing the new
    references for an inode we end up issuing rename or link operations that
    have an invalid path, which contains the orphanized name of a directory
    before we actually orphanized it, causing the receiver to fail.

    The following reproducer triggers such scenario:

    $ cat reproducer.sh
    #!/bin/bash

    mkfs.btrfs -f /dev/sdi >/dev/null
    mount /dev/sdi /mnt/sdi

    touch /mnt/sdi/a
    touch /mnt/sdi/b
    mkdir /mnt/sdi/testdir
    # We want "a" to have a lower inode number then "testdir" (257 vs 259).
    mv /mnt/sdi/a /mnt/sdi/testdir/a

    # Filesystem looks like:
    #
    # . (ino 256)
    # |----- testdir/ (ino 259)
    # | |----- a (ino 257)
    # |
    # |----- b (ino 258)

    btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
    btrfs send -f /tmp/snap1.send /mnt/sdi/snap1

    # Now rename 259 to "testdir_2", then change the name of 257 to
    # "testdir" and make it a direct descendant of the root inode (256).
    # Also create a new link for inode 257 with the old name of inode 258.
    # By swapping the names and location of several inodes and create a
    # nasty dependency chain of rename and link operations.
    mv /mnt/sdi/testdir/a /mnt/sdi/a2
    touch /mnt/sdi/testdir/a
    mv /mnt/sdi/b /mnt/sdi/b2
    ln /mnt/sdi/a2 /mnt/sdi/b
    mv /mnt/sdi/testdir /mnt/sdi/testdir_2
    mv /mnt/sdi/a2 /mnt/sdi/testdir

    # Filesystem now looks like:
    #
    # . (ino 256)
    # |----- testdir_2/ (ino 259)
    # | |----- a (ino 260)
    # |
    # |----- testdir (ino 257)
    # |----- b (ino 257)
    # |----- b2 (ino 258)

    btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
    btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2

    mkfs.btrfs -f /dev/sdj >/dev/null
    mount /dev/sdj /mnt/sdj

    btrfs receive -f /tmp/snap1.send /mnt/sdj
    btrfs receive -f /tmp/snap2.send /mnt/sdj

    umount /mnt/sdi
    umount /mnt/sdj

    When running the reproducer, the receive of the incremental send stream
    fails:

    $ ./reproducer.sh
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
    At subvol /mnt/sdi/snap1
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
    At subvol /mnt/sdi/snap2
    At subvol snap1
    At snapshot snap2
    ERROR: link b -> o259-6-0/a failed: No such file or directory

    The problem happens because of the following:

    1) Before we start iterating the list of new references for inode 257,
    we generate its current path and store it at @valid_path, done at
    the very beginning of process_recorded_refs(). The generated path
    is "o259-6-0/a", containing the orphanized name for inode 259;

    2) Then we iterate over the list of new references, which has the
    references "b" and "testdir" in that specific order;

    3) We process reference "b" first, because it is in the list before
    reference "testdir". We then issue a link operation to create
    the new reference "b" using a target path corresponding to the
    content at @valid_path, which corresponds to "o259-6-0/a".
    However we haven't yet orphanized inode 259, its name is still
    "testdir", and not "o259-6-0". The orphanization of 259 did not
    happen yet because we will process the reference named "testdir"
    for inode 257 only in the next iteration of the loop that goes
    over the list of new references.

    Fix the issue by having a preliminar iteration over all the new references
    at process_recorded_refs(). This iteration is responsible only for doing
    the orphanization of other inodes that have and old reference that
    conflicts with one of the new references of the inode we are currently
    processing. The emission of rename and link operations happen now in the
    next iteration of the new references.

    A test case for fstests will follow soon.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • Commit 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
    introduced btrfs root item size check, however btrfs root item has two
    versions, the legacy one which just ends before generation_v2 member, is
    smaller than current btrfs root item size.

    This caused btrfs kernel to reject valid but old tree root leaves.

    Fix this problem by also allowing legacy root item, since kernel can
    already handle them pretty well and upgrade to newer root item format
    when needed.

    Reported-by: Martin Steigerwald
    Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
    CC: stable@vger.kernel.org # 5.4+
    Tested-By: Martin Steigerwald
    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
    pointer assignment but we should use the helpers for unaligned access
    for clarity. It hasn't been a problem so far because of the natural
    alignment.

    Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
    from stack that has an aligned start but some members may not be aligned
    due to packing. This as well hasn't caused problems so far.

    Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.

    Signed-off-by: David Sterba

    David Sterba
     
  • The free space inode stores the tracking data, checksums etc, using the
    io_ctl structure and moving the pointers. The data are generally aligned
    to at least 4 bytes (u32 for CRC) so it's not completely unaligned but
    for clarity we should use the proper helpers whenever a struct is
    initialized from io_ctl->cur pointer.

    Signed-off-by: David Sterba

    David Sterba
     
  • The header is mapped onto the send buffer and thus its members may be
    potentially unaligned so use the helpers instead of directly assigning
    the pointers. This has worked so far but let's use the helpers to make
    that clear.

    Signed-off-by: David Sterba

    David Sterba