29 Jul, 2020

5 commits

  • commit 5909ca110b29aa16b23b52b8de8d3bb1035fd738 upstream.

    When locking pages for delalloc, we check that each page is still dirty
    and that its mapping still matches. If either check fails, we need to
    return -EAGAIN and release all pages. However, only the current page was
    put; iterate over all the remaining pages and put them too.
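
    The release logic can be sketched in userspace C. This is a minimal
    model under stated assumptions, not the kernel code: struct demo_page,
    demo_lock_pages() and DEMO_EAGAIN are hypothetical names, and a plain
    integer stands in for the page reference count.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for struct page: a refcount and two state flags. */
    struct demo_page {
        int refs;        /* references held on the page */
        int dirty;       /* non-zero if the page is still dirty */
        int mapping_ok;  /* non-zero if the mapping still matches */
    };

    #define DEMO_EAGAIN 11  /* illustrative value, not taken from the source */

    /*
     * Walk pages[0..nr). Every page arrives holding one reference.
     * On a mismatch, drop the reference on the current page and on all
     * pages after it, then report -DEMO_EAGAIN.
     */
    int demo_lock_pages(struct demo_page *pages, size_t nr, size_t *locked)
    {
        assert(pages != NULL && locked != NULL);

        *locked = 0;
        for (size_t i = 0; i < nr; i++) {
            if (!pages[i].dirty || !pages[i].mapping_ok) {
                /* Put the current page and every remaining one. */
                for (size_t j = i; j < nr; j++)
                    pages[j].refs--;
                return -DEMO_EAGAIN;
            }
            pages[i].refs--;  /* normal path: drop our lookup reference */
            (*locked)++;
        }
        return 0;
    }
    ```

    In this model, putting only pages[i] on the mismatch path would leave
    every later page with a leaked reference.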

    CC: stable@vger.kernel.org # 4.14+
    Reviewed-by: Filipe Manana
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Robbie Ko
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Robbie Ko
     
  • commit 48cfa61b58a1fee0bc49eef04f8ccf31493b7cdd upstream.

    It is possible to cause a btrfs mount to fail by racing it with a slow
    umount. The crux of the sequence is generic_shutdown_super not yet
    calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
    If that occurs, btrfs_open_devices will decide the opened counter is
    non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
    0. From here, mount will call sget which will result in grab_super
    trying to take the super block umount semaphore. That semaphore will be
    held by the slow umount, so mount will block. Before up-ing the
    semaphore, umount will delete the super block, resulting in mount's sget
    reliably allocating a new one, which causes the mount path to dutifully
    fill it out, and increment total_rw_bytes a second time, which causes
    the mount to fail, as we see double the expected bytes.

    Here is the sequence laid out in greater detail:

    CPU0                                    CPU1
    down_write sb->s_umount
    btrfs_kill_super
    kill_anon_super(sb)
    generic_shutdown_super(sb);
    shrink_dcache_for_umount(sb);
    sync_filesystem(sb);
    evict_inodes(sb); // SLOW

                                            btrfs_mount_root
                                            btrfs_scan_one_device
                                            fs_devices = device->fs_devices
                                            fs_info->fs_devices = fs_devices
                                            // fs_devices->opened makes this a no-op
                                            btrfs_open_devices(fs_devices, mode, fs_type)
                                            s = sget(fs_type, test, set, flags, fs_info);
                                            find sb in s_instances
                                            grab_super(sb);
                                            down_write(&s->s_umount); // blocks

    sop->put_super(sb)
    // sb->fs_devices->opened == 2; no-op
    spin_lock(&sb_lock);
    hlist_del_init(&sb->s_instances);
    spin_unlock(&sb_lock);
    up_write(&sb->s_umount);
                                            return 0;
                                            retry lookup
                                            don't find sb in s_instances (deleted by CPU0)
                                            s = alloc_super
                                            return s;
                                            btrfs_fill_super(s, fs_devices, data)
                                            open_ctree // fs_devices total_rw_bytes improperly set!
                                            btrfs_read_chunk_tree
                                            read_one_dev // increment total_rw_bytes again!!
                                            super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!

    To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
    before the calls to read_one_dev, while holding the sb umount semaphore
    and the uuid mutex.
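
    The effect of clearing the counter can be shown with a small userspace
    model. This is only a sketch: struct demo_fs_devices and
    demo_read_chunk_tree() are hypothetical names, the per-device loop
    stands in for the calls to read_one_dev, and no locking is modelled.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the shared fs_devices object. */
    struct demo_fs_devices {
        uint64_t total_rw_bytes;
    };

    /*
     * Model of reading the chunk tree: reset the running total, then add
     * the size of every device. Returns 0 on success and -1 if the total
     * exceeds what the super block claims.
     */
    int demo_read_chunk_tree(struct demo_fs_devices *fs_devices,
                             const uint64_t *dev_bytes, int nr_devs,
                             uint64_t super_total_bytes)
    {
        assert(fs_devices != NULL);

        /* Clear any value left over from an earlier mount attempt. */
        fs_devices->total_rw_bytes = 0;

        for (int i = 0; i < nr_devs; i++)
            fs_devices->total_rw_bytes += dev_bytes[i];

        if (super_total_bytes < fs_devices->total_rw_bytes)
            return -1;
        return 0;
    }
    ```

    Without the reset, a second pass over the same fs_devices object would
    add the device sizes on top of the stale total and fail the comparison
    against super_total_bytes.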

    To reproduce, it is sufficient to dirty a decent number of inodes, then
    quickly umount and mount.

    for i in $(seq 0 500)
    do
    dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
    done
    umount /mnt/foo&
    mount /mnt/foo

    does the trick for me.

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Boris Burkov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Boris Burkov
     
  • commit 580c079b5766ac706f56eec5c79aee4bf929fef6 upstream.

    At btrfs_find_all_roots_safe() we allocate a ulist and set the **roots
    argument to point to it. However if later we fail due to an error returned
    by find_parent_nodes(), we free that ulist but leave a dangling pointer in
    the **roots argument. Upon receiving the error, a caller of this function
    can attempt to free the same ulist again, resulting in an invalid memory
    access.

    One such scenario is during qgroup accounting:

    btrfs_qgroup_account_extents()

    --> calls btrfs_find_all_roots() passes &new_roots (a stack allocated
    pointer) to btrfs_find_all_roots()

    --> btrfs_find_all_roots() just calls btrfs_find_all_roots_safe()
    passing &new_roots to it

    --> allocates ulist and assigns its address to **roots (which
    points to new_roots from btrfs_qgroup_account_extents())

    --> find_parent_nodes() returns an error, so we free the ulist
    and leave **roots pointing to it after returning

    --> btrfs_qgroup_account_extents() sees btrfs_find_all_roots() returned
    an error and jumps to the label 'cleanup', which just tries to
    free again the same ulist

    Stack trace example:

    ------------[ cut here ]------------
    BTRFS: tree first key check failed
    WARNING: CPU: 1 PID: 1763215 at fs/btrfs/disk-io.c:422 btrfs_verify_level_key+0xe0/0x180 [btrfs]
    Modules linked in: dm_snapshot dm_thin_pool (...)
    CPU: 1 PID: 1763215 Comm: fsstress Tainted: G W 5.8.0-rc3-btrfs-next-64 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_verify_level_key+0xe0/0x180 [btrfs]
    Code: 28 5b 5d (...)
    RSP: 0018:ffffb89b473779a0 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff90397759bf08 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
    RBP: ffff9039a419c000 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: ffffb89b43301000 R12: 000000000000005e
    R13: ffffb89b47377a2e R14: ffffb89b473779af R15: 0000000000000000
    FS: 00007fc47e1e1000(0000) GS:ffff9039ac200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc47e1df000 CR3: 00000003d9e4e001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    read_block_for_search+0xf6/0x350 [btrfs]
    btrfs_next_old_leaf+0x242/0x650 [btrfs]
    resolve_indirect_refs+0x7cf/0x9e0 [btrfs]
    find_parent_nodes+0x4ea/0x12c0 [btrfs]
    btrfs_find_all_roots_safe+0xbf/0x130 [btrfs]
    btrfs_qgroup_account_extents+0x9d/0x390 [btrfs]
    btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
    btrfs_sync_file+0x3d4/0x4d0 [btrfs]
    do_fsync+0x38/0x70
    __x64_sys_fdatasync+0x13/0x20
    do_syscall_64+0x5c/0xe0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fc47e2d72e3
    Code: Bad RIP value.
    RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
    RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
    RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
    R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x755/0x1eb0
    softirqs last enabled at (0): [] copy_process+0x755/0x1eb0
    softirqs last disabled at (0): [] 0x0
    ---[ end trace 8639237550317b48 ]---
    BTRFS error (device sdc): tree first key mismatch detected, bytenr=62324736 parent_transid=94 key expected=(262,108,1351680) has=(259,108,1921024)
    general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    CPU: 2 PID: 1763215 Comm: fsstress Tainted: G W 5.8.0-rc3-btrfs-next-64 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:ulist_release+0x14/0x60 [btrfs]
    Code: c7 07 00 (...)
    RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
    RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
    R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
    FS: 00007fc47e1e1000(0000) GS:ffff9039ac600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f8c1c0a51c8 CR3: 00000003d9e4e004 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    ulist_free+0x13/0x20 [btrfs]
    btrfs_qgroup_account_extents+0xf3/0x390 [btrfs]
    btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
    btrfs_sync_file+0x3d4/0x4d0 [btrfs]
    do_fsync+0x38/0x70
    __x64_sys_fdatasync+0x13/0x20
    do_syscall_64+0x5c/0xe0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fc47e2d72e3
    Code: Bad RIP value.
    RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
    RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
    RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
    R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
    Modules linked in: dm_snapshot dm_thin_pool (...)
    ---[ end trace 8639237550317b49 ]---
    RIP: 0010:ulist_release+0x14/0x60 [btrfs]
    Code: c7 07 00 (...)
    RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
    RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
    R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
    FS: 00007fc47e1e1000(0000) GS:ffff9039ad200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f6a776f7d40 CR3: 00000003d9e4e002 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fix this by making btrfs_find_all_roots_safe() set *roots to NULL after
    it frees the ulist.
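
    The pattern can be illustrated in plain userspace C. This is a sketch
    under stated assumptions rather than the kernel implementation:
    struct demo_ulist, demo_find_all_roots() and demo_account_extents() are
    hypothetical names, malloc/free stand in for the ulist allocator, and
    -1 is an arbitrary error value.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for struct ulist. */
    struct demo_ulist {
        int nnodes;
    };

    /*
     * Allocate a list and hand it back through *roots. If the lookup
     * fails (simulated by 'fail'), free the list and reset *roots so the
     * caller is not left holding a dangling pointer.
     */
    int demo_find_all_roots(int fail, struct demo_ulist **roots)
    {
        assert(roots != NULL);

        *roots = calloc(1, sizeof(**roots));
        if (!*roots)
            return -1;

        if (fail) {
            free(*roots);
            *roots = NULL;  /* without this, the caller frees it again */
            return -1;
        }
        return 0;
    }

    /* Caller that always runs its cleanup path, as in the scenario above. */
    int demo_account_extents(int fail)
    {
        struct demo_ulist *new_roots = NULL;
        int ret = demo_find_all_roots(fail, &new_roots);

        /* cleanup: free(NULL) is a no-op, so the error path is safe. */
        free(new_roots);
        return ret;
    }
    ```

    Resetting the pointer in the callee keeps every caller's cleanup path
    correct without requiring each of them to know whether the list was
    already freed.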

    Fixes: 8da6d5815c592b ("Btrfs: added btrfs_find_all_roots()")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 1dae7e0e58b484eaa43d530f211098fdeeb0f404 upstream.

    [BUG]
    There are several reports of runaway balance, where balance floods the
    log with "found X extents" messages in which X never changes.

    [CAUSE]
    Commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after
    merge_reloc_roots") introduced BTRFS_ROOT_DEAD_RELOC_TREE bit to
    indicate that one subvolume has finished its tree blocks swap with its
    reloc tree.

    However, if balance is canceled or hits ENOSPC halfway, we didn't clear
    the BTRFS_ROOT_DEAD_RELOC_TREE bit, leaving that bit set until
    unmount.

    Any subvolume root with that bit set causes the backref cache to skip
    its tree blocks, on the assumption that the root has already finished
    its tree block swap. As a result, all tree blocks of that root are
    ignored by balance, leading to runaway balance.

    [FIX]
    Fix the problem by also clearing the BTRFS_ROOT_DEAD_RELOC_TREE bit for
    the original subvolume of an orphan reloc root.

    Also add a check at umount time for the stale bit still being set.

    Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    [Manually solve the conflicts due to no btrfs root refs rework]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • commit 51415b6c1b117e223bc083e30af675cb5c5498f3 upstream.

    [BUG]
    When balance is canceled, there is a pretty high chance that unmounting
    the fs leads to a NULL pointer dereference:

    BTRFS warning (device dm-3): page private not zero on page 223158272
    ...
    BTRFS warning (device dm-3): page private not zero on page 223162368
    BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1
    BUG: kernel NULL pointer dereference, address: 0000000000000168
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 2 PID: 5793 Comm: umount Tainted: G O 5.7.0-rc5-custom+ #53
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:__lock_acquire+0x5dc/0x24c0
    Call Trace:
    lock_acquire+0xab/0x390
    _raw_spin_lock+0x39/0x80
    btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs]
    release_extent_buffer+0xb2/0x170 [btrfs]
    free_extent_buffer+0x66/0xb0 [btrfs]
    btrfs_put_root+0x8e/0x130 [btrfs]
    btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs]
    btrfs_free_fs_info+0xe5/0x120 [btrfs]
    btrfs_kill_super+0x1f/0x30 [btrfs]
    deactivate_locked_super+0x3b/0x80
    deactivate_super+0x3e/0x50
    cleanup_mnt+0x109/0x160
    __cleanup_mnt+0x12/0x20
    task_work_run+0x67/0xa0
    exit_to_usermode_loop+0xc5/0xd0
    syscall_return_slowpath+0x205/0x360
    do_syscall_64+0x6e/0xb0
    entry_SYSCALL_64_after_hwframe+0x49/0xb3
    RIP: 0033:0x7fd028ef740b

    [CAUSE]
    When balance is canceled, all reloc roots are marked as orphan, and
    orphan reloc roots are going to be cleaned up.

    However, for orphan reloc roots and merged reloc roots, their lifespans
    are quite different:

    Merged reloc roots               | Orphan reloc roots by cancel
    --------------------------------------------------------------------
    create_reloc_root()              | create_reloc_root()
    |- refs == 1                     | |- refs == 1
                                     |
    btrfs_grab_root(reloc_root);     | btrfs_grab_root(reloc_root);
    |- refs == 2                     | |- refs == 2
                                     |
    root->reloc_root = reloc_root;   | root->reloc_root = reloc_root;
                >>> No difference so far <<<
                                     |
    prepare_to_merge()               | prepare_to_merge()
    |- btrfs_set_root_refs(item, 1); | |- if (!err) (err == -EINTR)
                                     |
    merge_reloc_roots()              | merge_reloc_roots()
    |- merge_reloc_root()            | |- Doing nothing to put reloc root
    |- insert_dirty_subvol()         | |- refs == 2
    |- __del_reloc_root()            |
    |- btrfs_put_root()              |
    |- refs == 1                     |
                >>> Now orphan reloc roots still have refs 2 <<<
                                     |
    clean_dirty_subvols()            | clean_dirty_subvols()
    |- btrfs_drop_snapshot()         | |- btrfs_drop_snapshot()
    |- reloc_root get freed          | |- reloc_root still has refs 2
                                     |    related ebs get freed, but
                                     |    reloc_root still recorded in
                                     |    allocated_roots
    btrfs_check_leaked_roots()       | btrfs_check_leaked_roots()
    |- No leaked roots               | |- Leaked reloc_roots detected
                                     | |- btrfs_put_root()
                                     | |- free_extent_buffer(root->node);
                                     | |- eb already freed, caused NULL
                                     |    pointer dereference

    [FIX]
    The fix is to clear fs_root->reloc_root and put it at
    merge_reloc_roots() time, so that we won't leak reloc roots.

    Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
    CC: stable@vger.kernel.org # 5.1+
    Tested-by: Johannes Thumshirn
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    [Manually solve the conflicts due to no btrfs root refs rework]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

16 Jul, 2020

2 commits

  • commit 230ed397435e85b54f055c524fcb267ae2ce3bc4 upstream.

    While debugging a patch that I wrote I was hitting use-after-free panics
    when accessing block groups on unmount. This turned out to be because
    in the nocow case if we bail out of doing the nocow for whatever reason
    we need to call btrfs_dec_nocow_writers() if we called the inc. This
    puts our block group, but a few error cases do

    if (nocow) {
    btrfs_dec_nocow_writers();
    goto error;
    }

    unfortunately, error is

    error:
    if (nocow)
    btrfs_dec_nocow_writers();

    so we get a double put on our block group. Fix this by dropping the
    calls to btrfs_dec_nocow_writers() from the error cases, as the put is
    handled at the error label now.
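
    A userspace sketch of the corrected control flow follows. It is an
    illustration only: struct demo_block_group, demo_dec_nocow_writers()
    and demo_run_nocow() are hypothetical names, a plain counter replaces
    the block group reference, and -1 is an arbitrary error value.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for a block group with a nocow writer count. */
    struct demo_block_group {
        int nocow_writers;
    };

    void demo_dec_nocow_writers(struct demo_block_group *bg)
    {
        bg->nocow_writers--;
    }

    /*
     * Take a nocow writer reference, then either finish normally or bail
     * out. The error path does not drop the reference itself; the single
     * drop at the error label covers it.
     */
    int demo_run_nocow(struct demo_block_group *bg, int fail)
    {
        int nocow = 0;
        int ret = 0;

        assert(bg != NULL);

        bg->nocow_writers++;  /* the "inc" */
        nocow = 1;

        if (fail) {
            ret = -1;
            goto error;       /* no dec here: the label handles it */
        }

        demo_dec_nocow_writers(bg);
        nocow = 0;

    error:
        if (nocow)
            demo_dec_nocow_writers(bg);
        return ret;
    }
    ```

    Keeping the decrement in exactly one place on the error path means the
    counter returns to its starting value whether or not the operation
    fails.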

    Fixes: 762bf09893b4 ("btrfs: improve error handling in run_delalloc_nocow")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 6bf9cd2eed9aee6d742bb9296c994a91f5316949 upstream.

    Under somewhat convoluted conditions, it is possible to attempt to
    release an extent_buffer that is under io, which triggers a BUG_ON in
    btrfs_release_extent_buffer_pages.

    This relies on a few different factors. First, extent_buffer reads done
    as readahead for searching use WAIT_NONE, so they free the local extent
    buffer reference while the io is outstanding. They should still be
    protected by TREE_REF. However, if the system is doing significant
    reclaim, and simultaneously heavily accessing the extent_buffers, it is
    possible for releasepage to race with two concurrent readahead attempts
    in a way that leaves TREE_REF unset when the readahead extent buffer is
    released.

    Essentially, if two tasks race to allocate a new extent_buffer, but the
    winner who attempts the first io is rebuffed by a page being locked
    (likely by the reclaim itself) then the loser will still go ahead with
    issuing the readahead. The loser's call to find_extent_buffer must also
    race with the reclaim task reading the extent_buffer's refcount as 1 in
    a way that allows the reclaim to re-clear the TREE_REF checked by
    find_extent_buffer.

    The following represents an example execution demonstrating the race:

    CPU0                                    CPU1                                    CPU2
    reada_for_search                        reada_for_search
    readahead_tree_block                    readahead_tree_block
    find_create_tree_block                  find_create_tree_block
    alloc_extent_buffer                     alloc_extent_buffer
                                            find_extent_buffer // not found
                                            allocates eb
                                            lock pages
                                            associate pages to eb
                                            insert eb into radix tree
                                            set TREE_REF, refs == 2
                                            unlock pages
                                            read_extent_buffer_pages // WAIT_NONE
                                            not uptodate (brand new eb)
                                                                                    lock_page
                                            if !trylock_page
                                            goto unlock_exit // not an error
                                            free_extent_buffer
                                            release_extent_buffer
                                            atomic_dec_and_test refs to 1
    find_extent_buffer // found
                                                                                    try_release_extent_buffer
                                                                                    take refs_lock
                                                                                    reads refs == 1; no io
    atomic_inc_not_zero refs to 2
    mark_buffer_accessed
    check_buffer_tree_ref
    // not STALE, won't take refs_lock
    refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                    clear TREE_REF
                                                                                    release_extent_buffer
                                                                                    atomic_dec_and_test refs to 1
                                                                                    unlock_page
    still not uptodate (CPU1 read failed on trylock_page)
    locks pages
    set io_pages > 0
    submit io
    return
    free_extent_buffer
    release_extent_buffer
    dec refs to 0
    delete from radix tree
    btrfs_release_extent_buffer_pages
    BUG_ON(io_pages > 0)!!!

    We observe this at a very low rate in production and were also able to
    reproduce it in a test environment by introducing some spurious delays
    and by introducing probabilistic trylock_page failures.

    To fix it, we apply check_tree_ref at a point where it could not
    possibly be unset by a competing task: after io_pages has been
    incremented. All the codepaths that clear TREE_REF check for io, so they
    would not be able to clear it after this point until the io is done.

    Stack trace, for reference:
    [1417839.424739] ------------[ cut here ]------------
    [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
    [1417839.447024] invalid opcode: 0000 [#1] SMP
    [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
    [1417839.517008] Code: ed e9 ...
    [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
    [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
    [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
    [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
    [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
    [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
    [1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
    [1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
    [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [1417839.731320] Call Trace:
    [1417839.737103] release_extent_buffer+0x39/0x90
    [1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
    [1417839.758645] btrfs_search_slot+0x260/0x9b0
    [1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
    [1417839.778427] btrfs_get_extent+0x15f/0x830
    [1417839.787665] ? submit_extent_page+0xc4/0x1c0
    [1417839.797474] ? __do_readpage+0x299/0x7a0
    [1417839.806515] __do_readpage+0x33b/0x7a0
    [1417839.815171] ? btrfs_releasepage+0x70/0x70
    [1417839.824597] extent_readpages+0x28f/0x400
    [1417839.833836] read_pages+0x6a/0x1c0
    [1417839.841729] ? startup_64+0x2/0x30
    [1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
    [1417839.860590] filemap_fault+0x6c7/0x990
    [1417839.869252] ? xas_load+0x8/0x80
    [1417839.876756] ? xas_find+0x150/0x190
    [1417839.884839] ? filemap_map_pages+0x295/0x3b0
    [1417839.894652] __do_fault+0x32/0x110
    [1417839.902540] __handle_mm_fault+0xacd/0x1000
    [1417839.912156] handle_mm_fault+0xaa/0x1c0
    [1417839.921004] __do_page_fault+0x242/0x4b0
    [1417839.930044] ? page_fault+0x8/0x30
    [1417839.937933] page_fault+0x1e/0x30
    [1417839.945631] RIP: 0033:0x33c4bae
    [1417839.952927] Code: Bad RIP value.
    [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
    [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
    [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
    [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
    [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
    [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Boris Burkov
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Boris Burkov
     

01 Jul, 2020

5 commits

  • commit 4b1946284dd6641afdb9457101056d9e6ee6204c upstream.

    If we attempt to write to a prealloc extent located after EOF using a
    RWF_NOWAIT write, we always fail with -EAGAIN.

    We do actually check if we have an allocated extent for the write at
    the start of btrfs_file_write_iter() through a call to check_can_nocow(),
    but later when we go into the actual direct IO write path we simply
    return -EAGAIN if the write starts at or beyond EOF.

    Trivial to reproduce:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ touch /mnt/foo
    $ chattr +C /mnt/foo

    $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo
    wrote 65536/65536 bytes at offset 0
    64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec)

    $ xfs_io -c "falloc -k 64K 1M" /mnt/foo

    $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo
    pwrite: Resource temporarily unavailable

    On xfs and ext4 the write succeeds, as expected.

    Fix this by removing the wrong check at btrfs_direct_IO().

    Fixes: edf064e7c6fec3 ("btrfs: nowait aio support")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit e7a79811d0db136dc2d336b56d54cf1b774ce972 upstream.

    This brings back an optimization that commit e678934cbe5f02 ("btrfs:
    Remove unnecessary check from join_running_log_trans") removed, but in
    a different form. So it's almost equivalent to a revert.

    That commit removed an optimization where we avoid locking a root's
    log_mutex when there is no log tree created in the current transaction.
    The affected code path is triggered through unlink operations.

    That commit was based on the assumption that the optimization was not
    necessary because we used to have the following checks when the patch
    was authored:

    int btrfs_del_dir_entries_in_log(...)
    {
    (...)
    if (dir->logged_trans < trans->transid)
    return 0;

    ret = join_running_log_trans(root);
    (...)
    }

    int btrfs_del_inode_ref_in_log(...)
    {
    (...)
    if (inode->logged_trans < trans->transid)
    return 0;

    ret = join_running_log_trans(root);
    (...)
    }

    However before that patch was merged, another patch was merged first which
    replaced those checks because they were buggy.

    That other patch corresponds to commit 803f0f64d17769 ("Btrfs: fix fsync
    not persisting dentry deletions due to inode evictions"). The assumption
    that an inode's logged_trans field having a smaller value than the
    current transaction's generation (transid) meant that the inode was not
    logged in the current transaction was only correct if the inode was not
    evicted and reloaded in the current transaction. So the corresponding bug
    fix changed those checks and replaced them with the following helper
    function:

    static bool inode_logged(struct btrfs_trans_handle *trans,
    struct btrfs_inode *inode)
    {
    if (inode->logged_trans == trans->transid)
    return true;

    if (inode->last_trans == trans->transid &&
    test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
    !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
    return true;

    return false;
    }

    So if we have a subvolume without a log tree in the current transaction
    (because we had no fsyncs), every time we unlink an inode we can end up
    trying to lock the log_mutex of the root through join_running_log_trans()
    twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log())
    and once for the parent directory (with btrfs_del_dir_entries_in_log()).

    This means if we have several unlink operations happening in parallel for
    inodes in the same subvolume, and those inodes and/or their parent
    inode were changed in the current transaction, we end up having a lot of
    contention on the log_mutex.

    The test robots from Intel reported a -30.7% performance regression for
    a REAIM test after commit e678934cbe5f02 ("btrfs: Remove unnecessary check
    from join_running_log_trans").

    So just bring back the optimization to join_running_log_trans() where we
    check first if a log root exists before trying to lock the log_mutex. This
    is done by checking for a bit that is set on the root when a log tree is
    created and removed when a log tree is freed (at transaction commit time).
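
    The fast path can be modelled in userspace C. This is a sketch under
    stated assumptions, not the kernel function: struct demo_root and
    demo_join_running_log_trans() are hypothetical names, a boolean stands
    in for the root state bit, a counter stands in for taking log_mutex,
    and DEMO_NO_LOG is an arbitrary "no log tree" return value.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define DEMO_NO_LOG (-1)  /* illustrative value, not taken from the source */

    /* Hypothetical stand-in for a subvolume root. */
    struct demo_root {
        bool has_log_tree;      /* set when a log tree is created, cleared when freed */
        int lock_acquisitions;  /* how many times the simulated log_mutex was taken */
        int log_writers;        /* writers that joined the running log transaction */
    };

    /*
     * Join the running log transaction only if a log tree exists.
     * The flag is checked first so that roots without a log tree never
     * touch the (simulated) mutex at all.
     */
    int demo_join_running_log_trans(struct demo_root *root)
    {
        assert(root != NULL);

        if (!root->has_log_tree)
            return DEMO_NO_LOG;     /* fast path: no lock taken */

        root->lock_acquisitions++;  /* models mutex_lock(&root->log_mutex) */
        root->log_writers++;
        /* the matching unlock would go here */
        return 0;
    }
    ```

    With many parallel unlinks on a subvolume that has no log tree, every
    call in this model returns through the fast path and
    lock_acquisitions stays at zero.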

    Commit e678934cbe5f02 ("btrfs: Remove unnecessary check from
    join_running_log_trans") was merged in the 5.4 merge window while commit
    803f0f64d17769 ("Btrfs: fix fsync not persisting dentry deletions due to
    inode evictions") was merged in the 5.3 merge window. But the first
    commit was actually authored before the second commit (May 23 2019 vs
    June 19 2019).

    Reported-by: kernel test robot
    Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/
    Fixes: e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 432cd2a10f1c10cead91fe706ff5dc52f06d642a upstream.

    When running relocation of a data block group while scrub is running in
    parallel, it is possible that the relocation will fail and abort the
    current transaction with an -EINVAL error:

    [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents
    [134243.999871] ------------[ cut here ]------------
    [134244.000741] BTRFS: Transaction aborted (error -22)
    [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs]
    [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
    [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
    [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs]
    [134244.017151] Code: 48 c7 c7 (...)
    [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286
    [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000
    [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001
    [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001
    [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08
    [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000
    [134244.028024] FS: 00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000
    [134244.029491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0
    [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [134244.034484] Call Trace:
    [134244.034984] btrfs_cow_block+0x12b/0x2b0 [btrfs]
    [134244.035859] do_relocation+0x30b/0x790 [btrfs]
    [134244.036681] ? do_raw_spin_unlock+0x49/0xc0
    [134244.037460] ? _raw_spin_unlock+0x29/0x40
    [134244.038235] relocate_tree_blocks+0x37b/0x730 [btrfs]
    [134244.039245] relocate_block_group+0x388/0x770 [btrfs]
    [134244.040228] btrfs_relocate_block_group+0x161/0x2e0 [btrfs]
    [134244.041323] btrfs_relocate_chunk+0x36/0x110 [btrfs]
    [134244.041345] btrfs_balance+0xc06/0x1860 [btrfs]
    [134244.043382] ? btrfs_ioctl_balance+0x27c/0x310 [btrfs]
    [134244.045586] btrfs_ioctl_balance+0x1ed/0x310 [btrfs]
    [134244.045611] btrfs_ioctl+0x1880/0x3760 [btrfs]
    [134244.049043] ? do_raw_spin_unlock+0x49/0xc0
    [134244.049838] ? _raw_spin_unlock+0x29/0x40
    [134244.050587] ? __handle_mm_fault+0x11b3/0x14b0
    [134244.051417] ? ksys_ioctl+0x92/0xb0
    [134244.052070] ksys_ioctl+0x92/0xb0
    [134244.052701] ? trace_hardirqs_off_thunk+0x1a/0x1c
    [134244.053511] __x64_sys_ioctl+0x16/0x20
    [134244.054206] do_syscall_64+0x5c/0x280
    [134244.054891] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [134244.055819] RIP: 0033:0x7f29b51c9dd7
    [134244.056491] Code: 00 00 00 (...)
    [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
    [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7
    [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003
    [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000
    [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a
    [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0
    [134244.067626] irq event stamp: 0
    [134244.068202] hardirqs last enabled at (0): [] 0x0
    [134244.069351] hardirqs last disabled at (0): [] copy_process+0x74f/0x2020
    [134244.070909] softirqs last enabled at (0): [] copy_process+0x74f/0x2020
    [134244.072392] softirqs last disabled at (0): [] 0x0
    [134244.073432] ---[ end trace bd7c03622e0b0a99 ]---

    The -EINVAL error comes from the following chain of function calls:

    __btrfs_cow_block() sectorsize). Due to free space
    fragmentation, btrfs_reserve_extent() ends up allocating two extents
    of 32KiB each, each one on a different iteration of that while loop;

    6) Writeback of the data relocation inode completes;

    7) Relocation proceeds and ends up at relocation.c:replace_file_extents(),
    with a leaf which has a file extent item that points to the data extent
    from block group X, that has a logical address (bytenr) of X + 128KiB
    and a size of 64KiB. Then it calls get_new_location(), which does a
    lookup in the data relocation tree for a file extent item starting at
    offset 128KiB (X + 128KiB - X) and belonging to the data relocation
    inode. It finds a corresponding file extent item, however that item
    points to an extent that has a size of 32KiB, which doesn't match the
    expected size of 64KiB, resulting in -EINVAL being returned from this
    function and propagated up to __btrfs_cow_block(), which aborts the
    current transaction.

    To fix this, make sure that when cow_file_range() calls the allocator
    it passes a minimum allocation size corresponding to the desired extent
    size if the inode belongs to the data relocation tree, and the
    filesystem's sector size otherwise.
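
    The rule in the fix can be sketched as a tiny model. This is not the
    kernel code: the function name, its parameters and the 4KiB sector size
    are assumptions made for illustration only.

```python
SECTOR_SIZE = 4096  # assumed filesystem sector size, for illustration only


def min_alloc_size(extent_size: int, is_data_reloc_inode: bool) -> int:
    """Smallest extent the allocator may return for a single request."""
    if is_data_reloc_inode:
        # Relocation later looks the extent up by offset and expects the
        # original size, so the allocation must not be split into pieces.
        return extent_size
    # Any other inode tolerates fragmentation down to one sector.
    return SECTOR_SIZE
```

    With a 64KiB extent, the relocation inode asks for 64KiB or nothing,
    so the allocator cannot hand back two 32KiB extents as in the scenario
    above.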

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 6bd335b469f945f75474c11e3f577f85409f39c3 upstream.

    When balance and scrub are running in parallel it is possible to end up
    with an underflow of the bytes_may_use counter of the data space_info
    object, which triggers a warning like the following:

    [134243.793196] BTRFS info (device sdc): relocating block group 1104150528 flags data
    [134243.806891] ------------[ cut here ]------------
    [134243.807561] WARNING: CPU: 1 PID: 26884 at fs/btrfs/space-info.h:125 btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
    [134243.808819] Modules linked in: btrfs blake2b_generic xor (...)
    [134243.815779] CPU: 1 PID: 26884 Comm: kworker/u8:8 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
    [134243.816944] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    [134243.818389] Workqueue: writeback wb_workfn (flush-btrfs-108483)
    [134243.819186] RIP: 0010:btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
    [134243.819963] Code: 0b f2 85 (...)
    [134243.822271] RSP: 0018:ffffa4160aae7510 EFLAGS: 00010287
    [134243.822929] RAX: 000000000000c000 RBX: ffff96159a8c1000 RCX: 0000000000000000
    [134243.823816] RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff96158067a810
    [134243.824742] RBP: ffff96158067a800 R08: 0000000000000001 R09: 0000000000000000
    [134243.825636] R10: ffff961501432a40 R11: 0000000000000000 R12: 000000000000c000
    [134243.826532] R13: 0000000000000001 R14: ffffffffffff4000 R15: ffff96158067a810
    [134243.827432] FS: 0000000000000000(0000) GS:ffff9615baa00000(0000) knlGS:0000000000000000
    [134243.828451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [134243.829184] CR2: 000055bd7e414000 CR3: 00000001077be004 CR4: 00000000003606e0
    [134243.830083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [134243.830975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [134243.831867] Call Trace:
    [134243.832211] find_free_extent+0x4a0/0x16c0 [btrfs]
    [134243.832846] btrfs_reserve_extent+0x91/0x180 [btrfs]
    [134243.833487] cow_file_range+0x12d/0x490 [btrfs]
    [134243.834080] fallback_to_cow+0x82/0x1b0 [btrfs]
    [134243.834689] ? release_extent_buffer+0x121/0x170 [btrfs]
    [134243.835370] run_delalloc_nocow+0x33f/0xa30 [btrfs]
    [134243.836032] btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
    [134243.836725] ? find_lock_delalloc_range+0x221/0x250 [btrfs]
    [134243.837450] writepage_delalloc+0xe8/0x150 [btrfs]
    [134243.838059] __extent_writepage+0xe8/0x4c0 [btrfs]
    [134243.838674] extent_write_cache_pages+0x237/0x530 [btrfs]
    [134243.839364] extent_writepages+0x44/0xa0 [btrfs]
    [134243.839946] do_writepages+0x23/0x80
    [134243.840401] __writeback_single_inode+0x59/0x700
    [134243.841006] writeback_sb_inodes+0x267/0x5f0
    [134243.841548] __writeback_inodes_wb+0x87/0xe0
    [134243.842091] wb_writeback+0x382/0x590
    [134243.842574] ? wb_workfn+0x4a2/0x6c0
    [134243.843030] wb_workfn+0x4a2/0x6c0
    [134243.843468] process_one_work+0x26d/0x6a0
    [134243.843978] worker_thread+0x4f/0x3e0
    [134243.844452] ? process_one_work+0x6a0/0x6a0
    [134243.844981] kthread+0x103/0x140
    [134243.845400] ? kthread_create_worker_on_cpu+0x70/0x70
    [134243.846030] ret_from_fork+0x3a/0x50
    [134243.846494] irq event stamp: 0
    [134243.846892] hardirqs last enabled at (0): [] 0x0
    [134243.847682] hardirqs last disabled at (0): [] copy_process+0x74f/0x2020
    [134243.848687] softirqs last enabled at (0): [] copy_process+0x74f/0x2020
    [134243.849913] softirqs last disabled at (0): [] 0x0
    [134243.850698] ---[ end trace bd7c03622e0b0a96 ]---
    [134243.851335] ------------[ cut here ]------------

    When relocating a data block group, for each extent allocated in the
    block group we preallocate another extent with the same size for the
    data relocation inode (we do it at prealloc_file_extent_cluster()).
    We reserve space by calling btrfs_check_data_free_space(), which ends
    up incrementing the data space_info's bytes_may_use counter, and
    then call btrfs_prealloc_file_range() to allocate the extent, which
    always decrements the bytes_may_use counter by the same amount.

    The expectation is that writeback of the data relocation inode always
    follows a NOCOW path, by writing into the preallocated extents. However,
    when starting writeback we might end up falling back into the COW path,
    because the block group that contains the preallocated extent was turned
    into RO mode by a scrub running in parallel. The COW path then calls the
    extent allocator which ends up calling btrfs_add_reserved_bytes(), and
    this function decrements the bytes_may_use counter of the data space_info
    object by an amount corresponding to the size of the allocated extent,
    even though we haven't previously incremented it. When the counter's
    current value is smaller than the allocated extent we reset the counter
    to 0 and emit a warning, otherwise we just decrement it and slowly
    corrupt this counter, which is crucial for space reservation. The end
    result can be granting reserved space to tasks when there isn't really
    enough free space, and having the tasks fail later in critical places
    where error handling consists of a transaction abort or hitting a
    BUG_ON().

    Fix this by making sure that if we fall back to the COW path for a data
    relocation inode, we increment the bytes_may_use counter of the data
    space_info object. The COW path will then decrement it at
    btrfs_add_reserved_bytes() on success or through its error handling part
    by a call to extent_clear_unlock_delalloc() (which ends up calling
    btrfs_clear_delalloc_extent() that does the decrement operation) in case
    of an error.
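
    The accounting imbalance can be reproduced with a toy counter. The
    class and function names below are hypothetical and only mirror the
    sequence of increments and decrements described in this message.

```python
class SpaceInfo:
    """Toy stand-in for the data space_info counter."""

    def __init__(self) -> None:
        self.bytes_may_use = 0
        self.underflow_warnings = 0

    def add(self, nbytes: int) -> None:
        self.bytes_may_use += nbytes

    def sub(self, nbytes: int) -> None:
        if self.bytes_may_use < nbytes:
            # Described behaviour: warn and reset the counter to zero.
            self.underflow_warnings += 1
            self.bytes_may_use = 0
        else:
            self.bytes_may_use -= nbytes


def relocate_extent(info: SpaceInfo, size: int, cow_fallback: bool, fixed: bool) -> None:
    info.add(size)  # space reservation before preallocating
    info.sub(size)  # preallocation consumes the reservation
    if cow_fallback:
        if fixed:
            info.add(size)  # the fix: account again before the COW allocation
        info.sub(size)      # the extent allocator decrements on success
```

    Without the extra increment, the second decrement runs against a
    counter that is already zero, which is the underflow reported in the
    warning.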

    Test case btrfs/061 from fstests could sporadically trigger this.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit 9fecd13202f520f3f25d5b1c313adb740fe19773 ]

    When removing a block group, if we fail to delete the block group's item
    from the extent tree, we jump to the 'out' label and end up decrementing
    the block group's reference count once only (by 1), resulting in a counter
    leak because the block group at that point was already removed from the
    block group cache rbtree - so we have to decrement the reference count
    twice, once for the rbtree and once for our lookup at the start of the
    function.

    There is a second bug where if removing the free space tree entries (the
    call to remove_block_group_free_space()) fails we end up jumping to the
    'out_put_group' label but end up decrementing the reference count only
    once, when we should have done it twice, since we have already removed
    the block group from the block group cache rbtree. This happens because
    the reference count decrement for the rbtree reference happens after
    attempting to remove the free space tree entries, which is far away from
    the place where we remove the block group from the rbtree.

    To make things less error prone, decrement the reference count for the
    rbtree immediately after removing the block group from it. This also
    eliminates the need for two different exit labels on error, renaming
    'out_put_group' to just 'out' and removing the old 'out'.

    Fixes: f6033c5e333238 ("btrfs: fix block group leak when removing fails")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Anand Jain
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Sasha Levin
     

22 Jun, 2020

12 commits

  • commit 2166e5edce9ac1edf3b113d6091ef72fcac2d6c4 upstream.

    We always preallocate a data extent for writing a free space cache, which
    causes writeback to always try the nocow path first, since the free space
    inode has the prealloc bit set in its flags.

    However if the block group that contains the data extent for the space
    cache has been turned to RO mode due to a running scrub or balance for
    example, we have to fall back to the cow path. In that case once a new data
    extent is allocated we end up calling btrfs_add_reserved_bytes(), which
    decrements the counter named bytes_may_use from the data space_info object
    with the expectation that this counter was previously incremented with the
    same amount (the size of the data extent).

    However when we started writeout of the space cache at cache_save_setup(),
    we incremented the value of the bytes_may_use counter through a call to
    btrfs_check_data_free_space() and then decremented it through a call to
    btrfs_prealloc_file_range_trans() immediately after. So when starting the
    writeback if we fall back to cow mode we have to increment the counter
    bytes_may_use of the data space_info again to compensate for the extent
    allocation done by the cow path.

    When this issue happens we are incorrectly decrementing the bytes_may_use
    counter and when its current value is smaller than the amount we try to
    subtract we end up with the following warning:

    ------------[ cut here ]------------
    WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
    Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
    CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    Workqueue: writeback wb_workfn (flush-btrfs-1591)
    RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
    Code: ff ff 48 (...)
    RSP: 0000:ffffa41608f13660 EFLAGS: 00010287
    RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410
    RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000
    R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400
    R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400
    FS: 0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    find_free_extent+0x4a0/0x16c0 [btrfs]
    btrfs_reserve_extent+0x91/0x180 [btrfs]
    cow_file_range+0x12d/0x490 [btrfs]
    btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs]
    ? find_lock_delalloc_range+0x221/0x250 [btrfs]
    writepage_delalloc+0xe8/0x150 [btrfs]
    __extent_writepage+0xe8/0x4c0 [btrfs]
    extent_write_cache_pages+0x237/0x530 [btrfs]
    extent_writepages+0x44/0xa0 [btrfs]
    do_writepages+0x23/0x80
    __writeback_single_inode+0x59/0x700
    writeback_sb_inodes+0x267/0x5f0
    __writeback_inodes_wb+0x87/0xe0
    wb_writeback+0x382/0x590
    ? wb_workfn+0x4a2/0x6c0
    wb_workfn+0x4a2/0x6c0
    process_one_work+0x26d/0x6a0
    worker_thread+0x4f/0x3e0
    ? process_one_work+0x6a0/0x6a0
    kthread+0x103/0x140
    ? kthread_create_worker_on_cpu+0x70/0x70
    ret_from_fork+0x3a/0x50
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x74f/0x2020
    softirqs last enabled at (0): [] copy_process+0x74f/0x2020
    softirqs last disabled at (0): [] 0x0
    ---[ end trace bd7c03622e0b0a52 ]---
    ------------[ cut here ]------------

    So fix this by incrementing the bytes_may_use counter of the data
    space_info when we fall back to the cow path. If the cow path is
    successful the counter is decremented after extent allocation (by
    btrfs_add_reserved_bytes()); if it fails it ends up being decremented as
    well when clearing the delalloc range (extent_clear_unlock_delalloc()).

    This could be triggered sporadically by the test case btrfs/061 from
    fstests.

    Fixes: 82d5902d9c681b ("Btrfs: Support reading/writing on disk free ino cache")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 467dc47ea99c56e966e99d09dae54869850abeeb upstream.

    When doing a buffered write we always try to reserve data space for it,
    even when the file has the NOCOW bit set or the write falls into a file
    range covered by a prealloc extent. This is done both because it is
    expensive to check if we can do a nocow write (checking if an extent is
    shared through reflinks or if there's a hole in the range for example),
    and because when writeback starts we might actually need to fall back to
    COW mode (for example the block group containing the target extents was
    turned into RO mode due to a scrub or balance).

    When we are unable to reserve data space we check if we can do a nocow
    write, and if we can, we proceed with dirtying the pages and setting up
    the range for delalloc. In this case the bytes_may_use counter of the
    data space_info object is not incremented, unlike in the case where we
    are able to reserve data space (done through btrfs_check_data_free_space()
    which calls btrfs_alloc_data_chunk_ondemand()).

    Later when running delalloc we attempt to start writeback in nocow mode
    but we might fall back to cow mode, for example because in the meantime
    a block group was turned into RO mode by a scrub or relocation. The cow
    path after successfully allocating an extent ends up calling
    btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
    the data space_info object to have been incremented before - but we did
    not do it when the buffered write started, since there was not enough
    available data space. So btrfs_add_reserved_bytes() ends up decrementing
    the bytes_may_use counter anyway, and when the counter's current value
    is smaller than the size of the allocated extent we get a stack trace
    like the following:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
    Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
    CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    Workqueue: writeback wb_workfn (flush-btrfs-1754)
    RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
    Code: ff ff 48 (...)
    RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
    RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
    RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
    R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
    R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
    FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    find_free_extent+0x4a0/0x16c0 [btrfs]
    btrfs_reserve_extent+0x91/0x180 [btrfs]
    cow_file_range+0x12d/0x490 [btrfs]
    run_delalloc_nocow+0x341/0xa40 [btrfs]
    btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
    ? find_lock_delalloc_range+0x221/0x250 [btrfs]
    writepage_delalloc+0xe8/0x150 [btrfs]
    __extent_writepage+0xe8/0x4c0 [btrfs]
    extent_write_cache_pages+0x237/0x530 [btrfs]
    ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
    extent_writepages+0x44/0xa0 [btrfs]
    do_writepages+0x23/0x80
    __writeback_single_inode+0x59/0x700
    writeback_sb_inodes+0x267/0x5f0
    __writeback_inodes_wb+0x87/0xe0
    wb_writeback+0x382/0x590
    ? wb_workfn+0x4a2/0x6c0
    wb_workfn+0x4a2/0x6c0
    process_one_work+0x26d/0x6a0
    worker_thread+0x4f/0x3e0
    ? process_one_work+0x6a0/0x6a0
    kthread+0x103/0x140
    ? kthread_create_worker_on_cpu+0x70/0x70
    ret_from_fork+0x3a/0x50
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x74f/0x2020
    softirqs last enabled at (0): [] copy_process+0x74f/0x2020
    softirqs last disabled at (0): [] 0x0
    ---[ end trace f9f6ef8ec4cd8ec9 ]---

    So to fix this, when falling back into cow mode check if space was not
    reserved, by testing for the bit EXTENT_NORESERVE in the respective file
    range, and if not, increment the bytes_may_use counter for the data
    space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
    that if the cow path fails it decrements the bytes_may_use counter when
    clearing the delalloc range (through the btrfs_clear_delalloc_extent()
    callback).
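
    A minimal sketch of the check, assuming the range state is a plain
    bitmask. The bit value and the function signature are illustrative and
    are not the kernel's.

```python
EXTENT_NORESERVE = 1 << 0  # illustrative bit value, not the kernel's


def prepare_cow_fallback(range_bits: int, bytes_may_use: int, length: int):
    """Return (range_bits, bytes_may_use) as they should be before COW runs."""
    if range_bits & EXTENT_NORESERVE:
        # No data space was reserved when the buffered write started, so
        # account for it now; the COW path will subtract it again.
        bytes_may_use += length
        # Clear the bit so a failing COW path also decrements the counter.
        range_bits &= ~EXTENT_NORESERVE
    return range_bits, bytes_may_use
```

    A range that did reserve space (bit not set) passes through unchanged,
    so the counter is never incremented twice for the same range.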

    Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit e2c8e92d1140754073ad3799eb6620c76bab2078 upstream.

    If an error happens while running delalloc in COW mode for a range, we can
    end up calling extent_clear_unlock_delalloc() for a range that goes beyond
    our range's end offset by 1 byte, which affects 1 extra page. This results
    in clearing bits and doing page operations (such as a page unlock) outside
    our target range.

    Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
    offset, instead of an exclusive end offset, at cow_file_range().
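
    The off-by-one is easy to see with page arithmetic. The helper below
    is a hypothetical illustration that assumes a 4KiB page size.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def pages_touched(start: int, end_inclusive: int) -> range:
    """Page indexes covered by the byte range [start, end_inclusive]."""
    return range(start // PAGE_SIZE, end_inclusive // PAGE_SIZE + 1)


start, length = 0, 2 * PAGE_SIZE
# Inclusive end offset: the last byte of the range.
correct = pages_touched(start, start + length - 1)
# Exclusive end offset passed where an inclusive one is expected.
buggy = pages_touched(start, start + length)
```

    For a two page range, the exclusive end offset makes the cleanup touch
    a third page that does not belong to the range.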

    Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 6d3113a193e3385c72240096fe397618ecab6e43 upstream.

    In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
    stripe or chunk, we submit orig_bio without cloning it. In this case, we
    don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
    decrement pending_bios to -1, and we never complete orig_bio. Fix it by
    initializing pending_bios to 1 instead of incrementing later.

    Fixing this exposes another bug: we put orig_bio prematurely and then
    put it again from end_io. Fix it by not putting orig_bio.

    After this change, pending_bios is really more of a reference count, but
    I'll leave that cleanup separate to keep the fix small.
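
    The effect of the initial value can be shown with a one-line model of
    the error path. It assumes completion only happens when the decrement
    brings the counter to exactly zero; the function name is hypothetical.

```python
def submit_failure_outcome(initial_pending: int):
    """Model the error path for a single, uncloned bio.

    Returns (pending_after, completed). Completion is assumed to happen
    only when the decrement brings the counter to exactly zero.
    """
    pending = initial_pending - 1
    return pending, pending == 0
```

    Starting from 0 the counter drops to -1 and nothing completes, while
    starting from 1 the same decrement reaches 0 and completes the bio.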

    Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • commit 9c343784c4328781129bcf9e671645f69fe4b38a upstream.

    Nikolay noticed a bunch of test failures with my global rsv steal
    patches. At first he thought they were introduced by them, but they've
    been failing for a while with 64k nodes.

    The problem is with 64k nodes we have a global reserve that calculates
    out to 13MiB on a freshly made file system, which only has 8MiB of
    metadata space. Because of changes I previously made we no longer
    account for the global reserve in the overcommit logic, which means we
    correctly allow overcommit to happen even though we are already
    overcommitted.

    However in some corner cases, for example btrfs/170, we will allocate
    the entire file system up with data chunks before we have enough space
    pressure to allocate a metadata chunk. Then once the fs is full we
    ENOSPC out because we cannot overcommit and the global reserve is taking
    up all of the available space.

    The ideal way to deal with this is to change our space reservation
    stuff to take into account the height of the trees that we're
    modifying, so that our global reserve calculation does not end up so
    obscenely large.

    However that is a huge undertaking. Instead fix this by forcing a chunk
    allocation if the global reserve is larger than the total metadata
    space. This gives us essentially the same behavior that happened
    before, we get a chunk allocated and these tests can pass.

    This is meant to be a stop-gap measure until we can tackle the "tree
    height only" project.
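
    The stop-gap condition amounts to a single comparison. The sketch
    below uses hypothetical names and the sizes quoted above.

```python
MIB = 1024 * 1024


def should_force_metadata_chunk(global_rsv_size: int, total_metadata: int) -> bool:
    """Force a chunk allocation when the reserve cannot fit in metadata space."""
    return global_rsv_size > total_metadata
```

    With a 13MiB global reserve and 8MiB of metadata space the check is
    true, so a metadata chunk gets allocated before the filesystem fills
    up with data chunks.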

    Fixes: 0096420adb03 ("btrfs: do not account global reserve in can_overcommit")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 89efda52e6b6930f80f5adda9c3c9edfb1397191 upstream.

    Whenever a chown is executed, all capabilities of the file being touched
    are lost. When doing incremental send with a file with capabilities,
    there is a situation where the capability can be lost on the receiving
    side. The sequence of actions below shows the problem:

    $ mount /dev/sda fs1
    $ mount /dev/sdb fs2

    $ touch fs1/foo.bar
    $ setcap cap_sys_nice+ep fs1/foo.bar
    $ btrfs subvolume snapshot -r fs1 fs1/snap_init
    $ btrfs send fs1/snap_init | btrfs receive fs2

    $ chgrp adm fs1/foo.bar
    $ setcap cap_sys_nice+ep fs1/foo.bar

    $ btrfs subvolume snapshot -r fs1 fs1/snap_complete
    $ btrfs subvolume snapshot -r fs1 fs1/snap_incremental

    $ btrfs send fs1/snap_complete | btrfs receive fs2
    $ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2

    At this point, only a chown was emitted by "btrfs send" since only the
    group was changed. This causes the cap_sys_nice capability to be dropped
    from fs2/snap_incremental/foo.bar.

    To fix that, only emit capabilities after chown is emitted. The current
    code first checks for xattrs that are new/changed, emits them, and later
    emits the chown. Now, __process_new_xattr skips capabilities, letting
    only finish_inode_if_needed emit them, if they exist, for the inode
    being processed.

    This behavior was being worked around on the "btrfs receive" side by
    caching the capability and only applying it after chown. Now, capability
    xattrs are only emitted _after_ chown, making that workaround
    unnecessary.
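
    The ordering problem can be modelled by replaying a send stream against
    a toy file object. The class and the operation names are hypothetical;
    the only behaviour taken from this message is that a chown drops the
    file's capabilities.

```python
class ReceivedFile:
    """Toy model of foo.bar on the receiving side."""

    def __init__(self) -> None:
        self.group = "root"
        self.caps = set()

    def chown(self, group: str) -> None:
        self.group = group
        self.caps.clear()  # changing ownership drops file capabilities

    def set_caps(self, caps) -> None:
        self.caps = set(caps)


def replay(stream) -> ReceivedFile:
    f = ReceivedFile()
    f.set_caps(["cap_sys_nice"])  # state received with the initial snapshot
    for op, arg in stream:
        getattr(f, op)(arg)
    return f


# Before the fix: the capability did not change, so only chown is sent.
old_stream = [("chown", "adm")]
# After the fix: capabilities are emitted after the chown.
new_stream = [("chown", "adm"), ("set_caps", ["cap_sys_nice"])]
```

    Replaying old_stream leaves the file without cap_sys_nice, while
    new_stream restores it after the group change.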

    Link: https://github.com/kdave/btrfs-progs/issues/202
    CC: stable@vger.kernel.org # 4.4+
    Suggested-by: Filipe Manana
    Reviewed-by: Filipe Manana
    Signed-off-by: Marcos Paulo de Souza
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Marcos Paulo de Souza
     
  • commit 998a0671961f66e9fad4990ed75f80ba3088c2f1 upstream.

    btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
    the bdev with the greatest device::generation number. For a typical
    missing device the generation number is zero so fs_devices::latest_bdev
    will never point to it.

    But if the missing device is due to alienation [1], then
    device::generation is not zero and if it is greater than or equal to the
    rest of the device generations in the list, then fs_devices::latest_bdev
    ends up pointing to the missing device and the mount reports an error
    like [2].

    [1] We maintain devices of a fsid (as in fs_device::fsid) in the
    fs_devices::devices list, a device is considered as an alien device
    if its fsid does not match with the fs_device::fsid

    Consider a working filesystem with raid1:

    $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
    $ mount /dev/sda /mnt-raid1
    $ umount /mnt-raid1

    While mnt-raid1 was unmounted the user force-adds one of its devices to
    another btrfs filesystem:

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt-single
    $ btrfs dev add -f /dev/sda /mnt-single

    Now the original mnt-raid1 fails to mount in degraded mode, because
    fs_devices::latest_bdev is pointing to the alien device.

    $ mount -o degraded /dev/sdb /mnt-raid1

    [2]
    mount: wrong fs type, bad option, bad superblock on /dev/sdb,
    missing codepage or helper program, or other error

    In some cases useful info is found in syslog - try
    dmesg | tail or so.

    kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
    kernel: BTRFS error (device sdb): failed to read devices
    kernel: BTRFS error (device sdb): open_ctree failed

    Fix the root cause by checking if the device is not missing before it
    can be considered for the fs_devices::latest_bdev.
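
    A sketch of the selection with the added check, using a hypothetical
    Device record rather than the kernel structures:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Device:
    name: str
    generation: int
    missing: bool = False


def pick_latest(devices: Iterable[Device]) -> Optional[Device]:
    """Pick the present device with the greatest generation."""
    latest = None
    for dev in devices:
        if dev.missing:
            continue  # the fix: a missing device is never a candidate
        if latest is None or dev.generation > latest.generation:
            latest = dev
    return latest
```

    In the raid1 example, an alien sda with a higher generation is skipped
    and sdb is selected, so the degraded mount can proceed.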

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     
  • commit 7f551d969037cc128eca60688d9c5a300d84e665 upstream.

    When an old device gets a new fsid through 'btrfs device add -f ', one
    of our fs_devices lists ends up with an alien device.

    Having an alien device in fs_devices causes two issues so far:

    1. missing device does not show as missing in the userland

    2. degraded mount will fail

    Both issues are caused by the fact that there's an alien device in the
    fs_devices list. (Alien means that it does not belong to the filesystem,
    identified by fsid, or does not contain btrfs filesystem at all, eg. due
    to overwrite).

    A device can be scanned/added through the control device ioctls
    SCAN_DEV, DEVICES_READY or by ADD_DEV.

    A device coming through the control device is checked against all the
    other devices in the lists, but this was not the case for ADD_DEV.

    This patch fixes both issues above by removing the alien device.

    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     
  • [ Upstream commit cbab8ade585a18c4334b085564d9d046e01a3f70 ]

    [BUG]
    For the following operation, qgroup is guaranteed to be screwed up due
    to snapshot adding to a new qgroup:

    # mkfs.btrfs -f $dev
    # mount $dev $mnt
    # btrfs qgroup en $mnt
    # btrfs subv create $mnt/src
    # xfs_io -f -c "pwrite 0 1m" $mnt/src/file
    # sync
    # btrfs qgroup create 1/0 $mnt/src
    # btrfs subv snapshot -i 1/0 $mnt/src $mnt/snapshot
    # btrfs qgroup show -prce $mnt/src
    qgroupid rfer excl max_rfer max_excl parent child
    -------- ---- ---- -------- -------- ------ -----
    0/5 16.00KiB 16.00KiB none none --- ---
    0/257 1.02MiB 16.00KiB none none --- ---
    0/258 1.02MiB 16.00KiB none none 1/0 ---
    1/0 0.00B 0.00B none none --- 0/258
    ^^^^^^^^^^^^^^^^^^^^

    [CAUSE]
    The problem is in btrfs_qgroup_inherit(), we don't have good enough
    check to determine if the new relation would break the existing
    accounting.

    Unlike btrfs_add_qgroup_relation(), which has proper check to determine
    if we can do quick update without a rescan, in btrfs_qgroup_inherit() we
    can even assign a snapshot to multiple qgroups.

    [FIX]
    Fix it by manually marking qgroup inconsistent for snapshot inheritance.

    For subvolume creation, since all its extents are exclusively owned, we
    don't need to rescan.

    In theory, we should call relation check like quick_update_accounting()
    when doing qgroup inheritance and inform user about qgroup accounting
    inconsistency.

    But we don't have good mechanism to relay that back to the user in the
    snapshot creation context, thus we can only silently mark the qgroup
    inconsistent.

    Anyway, the user shouldn't use qgroup inheritance during snapshot
    creation, and should add the qgroup relationship after snapshot creation
    by 'btrfs qgroup assign', which has a much better UI to inform the user
    about qgroup inconsistency and kick in a rescan automatically.

    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Qu Wenruo
     
  • [ Upstream commit 7f9fe614407692f670601a634621138233ac00d7 ]

    For unlink transactions and block group removal
    btrfs_start_transaction_fallback_global_rsv will first try to start an
    ordinary transaction and if it fails it will fall back to reserving the
    required amount by stealing from the global reserve. This is problematic
    because of all the same reasons we had with previous iterations of the
    ENOSPC handling, thundering herd. We get a bunch of failures all at
    once, everybody tries to allocate from the global reserve, some win and
    some lose, we get an ENOSPC.

    Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
    used to mark unlink reservation. To fix this we need to integrate this
    logic into the normal ENOSPC infrastructure. We still go through all of
    the normal flushing work, and at the moment we begin to fail all the
    tickets we try to satisfy any tickets that are allowed to steal by
    stealing from the global reserve. If this works we start the flushing
    system over again just like we would with a normal ticket satisfaction.
    This serializes our global reserve stealing, so we don't have the
    thundering herd problem.

    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 7e4a3f7ed5d54926ec671bbb13e171cfe179cc50 ]

    We are currently treating any non-zero return value from btrfs_next_leaf()
    the same way, by going to the code that inserts a new checksum item in the
    tree. However if btrfs_next_leaf() returns an error (a value < 0), we
    should just stop and return the error, and not behave as if nothing has
    happened, since in that case we do not have a way to know if there is a
    next leaf or we are currently at the last leaf already.

    So fix that by returning the error from btrfs_next_leaf().
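
    A minimal sketch of the three-way return handling, in userspace Python.
    next_leaf() here is a hypothetical stand-in that only mimics the return
    convention described above (negative on error, zero when a next leaf
    exists, positive at the last leaf); it is not the kernel function.

```python
import errno


def next_leaf(leaf_count, index, io_error=None):
    """Toy model: <0 on error, 0 if a next leaf exists, >0 at the last leaf."""
    if io_error is not None:
        return -io_error
    return 0 if index + 1 < leaf_count else 1


def handle_next_leaf(leaf_count, index, io_error=None):
    ret = next_leaf(leaf_count, index, io_error)
    if ret < 0:
        return ret                    # propagate the error, do not insert
    if ret > 0:
        return "insert_csum_item"     # really at the last leaf
    return "check_next_leaf"
```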

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit bb4f58a747f0421b10645fbf75a6acc88da0de50 ]

    On ppc64le with 64k page size (respectively 64k block size) generic/320
    was failing and debug output showed we were getting a premature ENOSPC
    with a bunch of space in btrfs_fs_info::trans_block_rsv.

    This meant there were still open transaction handles holding space, yet
    the flusher didn't commit the transaction because it deemed the freed
    space won't be enough to satisfy the current reserve ticket. Fix this
    by accounting for space in trans_block_rsv when deciding whether the
    current transaction should be committed or not.
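
    The decision can be sketched as follows. This is a deliberately
    simplified model, assuming the flusher only compares one sum of
    reclaimable bytes against the ticket; the function and parameter names are
    hypothetical and the real check considers more reserves than shown here.

```python
def should_commit_transaction(ticket_bytes, pinned_bytes, trans_rsv_reserved,
                              count_trans_rsv=True):
    """Return True if committing is expected to free enough for the ticket."""
    reclaimable = pinned_bytes
    if count_trans_rsv:
        # The fix: space held by open transaction handles is released by a
        # commit too, so it counts towards what the commit can free.
        reclaimable += trans_rsv_reserved
    return reclaimable >= ticket_bytes
```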

    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     

06 May, 2020

4 commits

  • commit fcc99734d1d4ced30167eb02e17f656735cb9928 upstream.

    [BUG]
    One run of btrfs/063 triggered the following lockdep warning:
    ============================================
    WARNING: possible recursive locking detected
    5.6.0-rc7-custom+ #48 Not tainted
    --------------------------------------------
    kworker/u24:0/7 is trying to acquire lock:
    ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]

    but task is already holding lock:
    ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(sb_internal#2);
    lock(sb_internal#2);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    4 locks held by kworker/u24:0/7:
    #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
    #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
    #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
    #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]

    stack backtrace:
    CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
    Call Trace:
    dump_stack+0xc2/0x11a
    __lock_acquire.cold+0xce/0x214
    lock_acquire+0xe6/0x210
    __sb_start_write+0x14e/0x290
    start_transaction+0x66c/0x890 [btrfs]
    btrfs_join_transaction+0x1d/0x20 [btrfs]
    find_free_extent+0x1504/0x1a50 [btrfs]
    btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
    btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
    btrfs_copy_root+0x213/0x580 [btrfs]
    create_reloc_root+0x3bd/0x470 [btrfs]
    btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
    record_root_in_trans+0x191/0x1d0 [btrfs]
    btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
    start_transaction+0x16e/0x890 [btrfs]
    btrfs_join_transaction+0x1d/0x20 [btrfs]
    btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
    finish_ordered_fn+0x15/0x20 [btrfs]
    btrfs_work_helper+0x116/0x9a0 [btrfs]
    process_one_work+0x632/0xb80
    worker_thread+0x80/0x690
    kthread+0x1a3/0x1f0
    ret_from_fork+0x27/0x50

    It's pretty hard to reproduce, only one hit so far.

    [CAUSE]
    This is because we're calling btrfs_join_transaction() without re-using
    the current running one:

    btrfs_finish_ordered_io()
    |- btrfs_join_transaction()                 <<< Call #1
       |- btrfs_record_root_in_trans()
          |- btrfs_reserve_extent()
             |- btrfs_join_transaction()        <<< Call #2

    Normally such btrfs_join_transaction() call should re-use the existing
    one, without trying to re-start a transaction.

    But the problem is, in btrfs_join_transaction() call #1, we call
    btrfs_record_root_in_trans() before initializing current::journal_info.

    And in btrfs_join_transaction() call #2, we're relying on
    current::journal_info to avoid such deadlock.

    [FIX]
    Call btrfs_record_root_in_trans() after we have initialized
    current::journal_info.
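
    The ordering problem can be reproduced in a toy userspace model. Task, Fs
    and the record_first switch are hypothetical; the counter only stands in
    for how often the sb_internal lock would be taken.

```python
class Task:
    """Hypothetical stand-in for `current`; journal_info holds the open handle."""
    def __init__(self):
        self.journal_info = None


class Fs:
    def __init__(self):
        self.sb_internal_acquisitions = 0   # models taking sb_internal
        self.root_recorded = False


def record_root_in_trans(task, fs, record_first):
    if fs.root_recorded:
        return
    fs.root_recorded = True
    # Creating the reloc root allocates a tree block, which joins again.
    join_transaction(task, fs, record_first)


def join_transaction(task, fs, record_first):
    if task.journal_info is not None:
        task.journal_info["use_count"] += 1     # nested join reuses the handle
        return task.journal_info
    fs.sb_internal_acquisitions += 1
    handle = {"use_count": 1}
    if record_first:                            # old ordering
        record_root_in_trans(task, fs, record_first)
        task.journal_info = handle
    else:                                       # fixed ordering
        task.journal_info = handle
        record_root_in_trans(task, fs, record_first)
    return handle
```

    With record_first=True the nested join cannot see a handle and takes the
    lock a second time; with record_first=False it reuses the handle.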

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • commit f135cea30de5f74d5bfb5116682073841fb4af8f upstream.

    When we have an inode with a prealloc extent that starts at an offset
    lower than the i_size and there is another prealloc extent that starts at
    an offset beyond i_size, we can end up losing part of the first prealloc
    extent (the part that starts at i_size) and have an implicit hole if we
    fsync the file and then have a power failure.

    Consider the following example with comments explaining how and why it
    happens.

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    # Create our test file with 2 consecutive prealloc extents, each with a
    # size of 128Kb, and covering the range from 0 to 256Kb, with a file
    # size of 0.
    $ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
    $ xfs_io -c "falloc -k 128K 128K" /mnt/foo

    # Fsync the file to record both extents in the log tree.
    $ xfs_io -c "fsync" /mnt/foo

    # Now do a redundant extent allocation for the range from 0 to 64Kb.
    # This will merely increase the file size from 0 to 64Kb. Instead we
    # could also do a truncate to set the file size to 64Kb.
    $ xfs_io -c "falloc 0 64K" /mnt/foo

    # Fsync the file, so we update the inode item in the log tree with the
    # new file size (64Kb). This also ends up setting the number of bytes
    # for the first prealloc extent to 64Kb. This is done by the truncation
    # at btrfs_log_prealloc_extents().
    # This means that if a power failure happens after this, a write into
    # the file range 64Kb to 128Kb will not use the prealloc extent and
    # will result in allocation of a new extent.
    $ xfs_io -c "fsync" /mnt/foo

    # Now set the file size to 256K with a truncate and then fsync the file.
    # Since no changes happened to the extents, the fsync only updates the
    # i_size in the inode item at the log tree. This results in an implicit
    # hole for the file range from 64Kb to 128Kb, something which fsck will
    # complain about when not using the NO_HOLES feature if we replay the log
    # after a power failure.
    $ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo

    So instead of always truncating the log to the inode's current i_size at
    btrfs_log_prealloc_extents(), check first if there's a prealloc extent
    that starts at an offset lower than the i_size and with a length that
    crosses the i_size - if there is one, just make sure we truncate to a
    size that corresponds to the end offset of that prealloc extent, so
    that we don't lose the part of that extent that starts at i_size if a
    power failure happens.
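
    A sketch of how the truncation offset could be chosen, assuming extents
    are given as (start, length) pairs in bytes. log_truncate_offset() is a
    hypothetical helper for illustration, not the kernel code.

```python
def log_truncate_offset(i_size, prealloc_extents):
    """Pick the offset to truncate the log to.

    If a prealloc extent starts below i_size and crosses it, truncate to the
    end of that extent so its tail is not lost; otherwise use i_size.
    """
    for start, length in prealloc_extents:
        end = start + length
        if start < i_size < end:
            return end
    return i_size
```

    For the example above (i_size of 64Kb, extents at 0 and 128Kb of 128Kb
    each) this yields 128Kb rather than 64Kb.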

    A test case for fstests follows soon.

    Fixes: 31d11b83b96f ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit f6033c5e333238f299c3ae03fac8cc1365b23b77 upstream.

    btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
    returns a local reference of the block group that contains the given
    bytenr to "block_group" with increased refcount.

    When btrfs_remove_block_group() returns, "block_group" becomes invalid,
    so the refcount should be decreased to keep refcount balanced.

    The reference counting issue happens in several exception handling paths
    of btrfs_remove_block_group(). When those error scenarios occur such as
    btrfs_alloc_path() returns NULL, the function forgets to decrease its
    refcnt increased by btrfs_lookup_block_group() and will cause a refcnt
    leak.

    Fix this issue by jumping to "out_put_group" label and calling
    btrfs_put_block_group() when those error scenarios occur.
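
    The same single-exit idea, sketched in userspace Python with try/finally
    playing the role of the "out_put_group" label. All names are hypothetical
    stand-ins and the counter is a plain integer, not a kernel refcount.

```python
import errno


class BlockGroup:
    def __init__(self):
        self.refs = 1            # reference held by the (modelled) cache


def lookup_block_group(bg):
    bg.refs += 1                 # lookup hands out a new reference
    return bg


def put_block_group(bg):
    bg.refs -= 1


def remove_block_group(bg, alloc_path_fails=False):
    group = lookup_block_group(bg)
    try:
        if alloc_path_fails:
            return -errno.ENOMEM     # early error exit
        # ... the real removal work would happen here ...
        return 0
    finally:
        put_block_group(group)       # runs on every exit path
```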

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Xiyu Yang
    Signed-off-by: Xin Tan
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Xiyu Yang
     
  • commit 1402d17dfd9657be0da8458b2079d03c2d61c86a upstream.

    btrfs_recover_relocation() invokes btrfs_join_transaction(), which joins
    a btrfs_trans_handle object into transactions and returns a reference of
    it with increased refcount to "trans".

    When btrfs_recover_relocation() returns, "trans" becomes invalid, so the
    refcount should be decreased to keep refcount balanced.

    The reference counting issue happens in one exception handling path of
    btrfs_recover_relocation(). When read_fs_root() failed, the refcnt
    increased by btrfs_join_transaction() is not decreased, causing a refcnt
    leak.

    Fix this issue by calling btrfs_end_transaction() on this error path
    when read_fs_root() failed.

    Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Xiyu Yang
    Signed-off-by: Xin Tan
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Xiyu Yang
     

23 Apr, 2020

1 commit

  • [ Upstream commit 29566c9c773456467933ee22bbca1c2b72a3506c ]

    The space_info list is normally RCU protected and should be traversed
    with rcu_read_lock held. There's a warning

    [29.104756] WARNING: suspicious RCU usage
    [29.105046] 5.6.0-rc4-next-20200305 #1 Not tainted
    [29.105231] -----------------------------
    [29.105401] fs/btrfs/block-group.c:2011 RCU-list traversed in non-reader section!!

    pointing out that the locking is missing in btrfs_read_block_groups.
    However this is not necessary as the list traversal happens at mount
    time when there's no other thread potentially accessing the list.

    To fix the warning and for consistency let's add the RCU lock/unlock,
    the code won't be affected much as it's doing some lightweight
    operations.

    Reported-by: Guenter Roeck
    Signed-off-by: Madhuparna Bhowmik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Madhuparna Bhowmik
     

21 Apr, 2020

1 commit

  • commit 4d4225fc228e46948486d8b8207955f0c031b92e upstream.

    Previously we would set the reloc root's last snapshot to transid - 1.
    However there was a problem with doing this, and we changed it to
    setting the last snapshot to the generation of the commit node of the fs
    root.

    This however broke should_ignore_root(). The assumption is that if we
    are in a generation newer than when the reloc root was created, then we
    would find the reloc root through normal backref lookups, and thus can
    ignore any fs roots we find with an old enough reloc root.

    Now that the last snapshot could be considerably further in the past
    than before, we'd end up incorrectly ignoring an fs root. Thus we'd
    find no nodes for the bytenr we were searching for, and we'd fail to
    relocate anything. We'd loop through the relocate code again and see
    that there was still used space in that block group, attempt to
    relocate those bytenr's again, fail in the same way, and just loop like
    this forever. This is tricky in that we have to not modify the fs root
    at all during this time, so we need to have a block group that has data
    in this fs root that is not shared by any other root, which is why this
    has been difficult to reproduce.

    Fixes: 054570a1dc94 ("Btrfs: fix relocation incorrectly dropping data references")
    CC: stable@vger.kernel.org # 4.9+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     

17 Apr, 2020

10 commits

  • commit 351cbf6e4410e7ece05e35d0a07320538f2418b4 upstream.

    Zygo reported the following lockdep splat while testing the balance
    patches

    ======================================================
    WARNING: possible circular locking dependency detected
    5.6.0-c6f0579d496a+ #53 Not tainted
    ------------------------------------------------------
    kswapd0/1133 is trying to acquire lock:
    ffff888092f622c0 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node+0x7c/0x5b0

    but task is already holding lock:
    ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (fs_reclaim){+.+.}:
    fs_reclaim_acquire.part.91+0x29/0x30
    fs_reclaim_acquire+0x19/0x20
    kmem_cache_alloc_trace+0x32/0x740
    add_block_entry+0x45/0x260
    btrfs_ref_tree_mod+0x6e2/0x8b0
    btrfs_alloc_tree_block+0x789/0x880
    alloc_tree_block_no_bg_flush+0xc6/0xf0
    __btrfs_cow_block+0x270/0x940
    btrfs_cow_block+0x1ba/0x3a0
    btrfs_search_slot+0x999/0x1030
    btrfs_insert_empty_items+0x81/0xe0
    btrfs_insert_delayed_items+0x128/0x7d0
    __btrfs_run_delayed_items+0xf4/0x2a0
    btrfs_run_delayed_items+0x13/0x20
    btrfs_commit_transaction+0x5cc/0x1390
    insert_balance_item.isra.39+0x6b2/0x6e0
    btrfs_balance+0x72d/0x18d0
    btrfs_ioctl_balance+0x3de/0x4c0
    btrfs_ioctl+0x30ab/0x44a0
    ksys_ioctl+0xa1/0xe0
    __x64_sys_ioctl+0x43/0x50
    do_syscall_64+0x77/0x2c0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (&delayed_node->mutex){+.+.}:
    __lock_acquire+0x197e/0x2550
    lock_acquire+0x103/0x220
    __mutex_lock+0x13d/0xce0
    mutex_lock_nested+0x1b/0x20
    __btrfs_release_delayed_node+0x7c/0x5b0
    btrfs_remove_delayed_node+0x49/0x50
    btrfs_evict_inode+0x6fc/0x900
    evict+0x19a/0x2c0
    dispose_list+0xa0/0xe0
    prune_icache_sb+0xbd/0xf0
    super_cache_scan+0x1b5/0x250
    do_shrink_slab+0x1f6/0x530
    shrink_slab+0x32e/0x410
    shrink_node+0x2a5/0xba0
    balance_pgdat+0x4bd/0x8a0
    kswapd+0x35a/0x800
    kthread+0x1e9/0x210
    ret_from_fork+0x3a/0x50

    other info that might help us debug this:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(fs_reclaim);
                                   lock(&delayed_node->mutex);
                                   lock(fs_reclaim);
      lock(&delayed_node->mutex);

    *** DEADLOCK ***

    3 locks held by kswapd0/1133:
    #0: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
    #1: ffffffff8fc380d8 (shrinker_rwsem){++++}, at: shrink_slab+0x1e8/0x410
    #2: ffff8881e0e6c0e8 (&type->s_umount_key#42){++++}, at: trylock_super+0x1b/0x70

    stack backtrace:
    CPU: 2 PID: 1133 Comm: kswapd0 Not tainted 5.6.0-c6f0579d496a+ #53
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    Call Trace:
    dump_stack+0xc1/0x11a
    print_circular_bug.isra.38.cold.57+0x145/0x14a
    check_noncircular+0x2a9/0x2f0
    ? print_circular_bug.isra.38+0x130/0x130
    ? stack_trace_consume_entry+0x90/0x90
    ? save_trace+0x3cc/0x420
    __lock_acquire+0x197e/0x2550
    ? btrfs_inode_clear_file_extent_range+0x9b/0xb0
    ? register_lock_class+0x960/0x960
    lock_acquire+0x103/0x220
    ? __btrfs_release_delayed_node+0x7c/0x5b0
    __mutex_lock+0x13d/0xce0
    ? __btrfs_release_delayed_node+0x7c/0x5b0
    ? __asan_loadN+0xf/0x20
    ? pvclock_clocksource_read+0xeb/0x190
    ? __btrfs_release_delayed_node+0x7c/0x5b0
    ? mutex_lock_io_nested+0xc20/0xc20
    ? __kasan_check_read+0x11/0x20
    ? check_chain_key+0x1e6/0x2e0
    mutex_lock_nested+0x1b/0x20
    ? mutex_lock_nested+0x1b/0x20
    __btrfs_release_delayed_node+0x7c/0x5b0
    btrfs_remove_delayed_node+0x49/0x50
    btrfs_evict_inode+0x6fc/0x900
    ? btrfs_setattr+0x840/0x840
    ? do_raw_spin_unlock+0xa8/0x140
    evict+0x19a/0x2c0
    dispose_list+0xa0/0xe0
    prune_icache_sb+0xbd/0xf0
    ? invalidate_inodes+0x310/0x310
    super_cache_scan+0x1b5/0x250
    do_shrink_slab+0x1f6/0x530
    shrink_slab+0x32e/0x410
    ? do_shrink_slab+0x530/0x530
    ? do_shrink_slab+0x530/0x530
    ? __kasan_check_read+0x11/0x20
    ? mem_cgroup_protected+0x13d/0x260
    shrink_node+0x2a5/0xba0
    balance_pgdat+0x4bd/0x8a0
    ? mem_cgroup_shrink_node+0x490/0x490
    ? _raw_spin_unlock_irq+0x27/0x40
    ? finish_task_switch+0xce/0x390
    ? rcu_read_lock_bh_held+0xb0/0xb0
    kswapd+0x35a/0x800
    ? _raw_spin_unlock_irqrestore+0x4c/0x60
    ? balance_pgdat+0x8a0/0x8a0
    ? finish_wait+0x110/0x110
    ? __kasan_check_read+0x11/0x20
    ? __kthread_parkme+0xc6/0xe0
    ? balance_pgdat+0x8a0/0x8a0
    kthread+0x1e9/0x210
    ? kthread_create_worker_on_cpu+0xc0/0xc0
    ret_from_fork+0x3a/0x50

    This is because we hold that delayed node's mutex while doing tree
    operations. Fix this by just wrapping the searches in nofs.
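
    The scoped "nofs" idea can be illustrated with a save/restore pattern.
    This is a userspace analogy only: Task, nofs_scope() and
    may_enter_fs_reclaim() are hypothetical names and nothing here touches the
    kernel allocator.

```python
from contextlib import contextmanager


class Task:
    def __init__(self):
        self.nofs = False


@contextmanager
def nofs_scope(task):
    saved = task.nofs            # remember the previous state
    task.nofs = True
    try:
        yield
    finally:
        task.nofs = saved        # restore, so nested scopes unwind correctly


def may_enter_fs_reclaim(task):
    """An allocation may recurse into filesystem reclaim only outside a scope."""
    return not task.nofs
```

    While the delayed node's mutex is held, allocations made inside such a
    scope cannot recurse into filesystem reclaim, which breaks the cycle
    lockdep reported.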

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 6ff06729c22ec0b7498d900d79cc88cfb8aceaeb upstream.

    Ordered ops are started twice in sync file, once outside of inode mutex
    and once inside, taking the dio semaphore. There was one error path
    missing the semaphore unlock.

    Fixes: aab15e8ec2576 ("Btrfs: fix rare chances for data loss when doing a fast fsync")
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Robbie Ko
    Reviewed-by: Filipe Manana
    [ add changelog ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Robbie Ko
     
  • commit fb2d83eefef4e1c717205bac71cb1941edf8ae11 upstream.

    If we fail to load an fs root, or fail to start a transaction we can
    bail without unsetting the reloc control, which leads to problems later
    when we free the reloc control but still have it attached to the file
    system.

    In the normal path we'll end up calling unset_reloc_control() twice, but
    all it does is set fs_info->reloc_control = NULL, and we can only have
    one balance at a time so it's not racy.

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 95418ed1d10774cd9a49af6f39e216c1256f1eeb upstream.

    When doing a fast fsync for a range that starts at an offset greater than
    zero, we can end up with a log that when replayed causes the respective
    inode to miss a file extent item representing a hole if we are not using the
    NO_HOLES feature. This is because for fast fsyncs we don't log any extents
    that cover a range different from the one requested in the fsync.

    Example scenario to trigger it:

    $ mkfs.btrfs -O ^no-holes -f /dev/sdd
    $ mount /dev/sdd /mnt

    # Create a file with a single 256K extent and fsync it to clear the full sync
    # bit in the inode - we want the msync below to trigger a fast fsync.
    $ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo

    # Force a transaction commit and wipe out the log tree.
    $ sync

    # Dirty 768K of data, increasing the file size to 1Mb, and flush only
    # the range from 256K to 512K without updating the log tree
    # (sync_file_range() does not trigger fsync, it only starts writeback
    # and waits for it to finish).

    $ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo
    $ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo

    # Now dirty the range from 768K to 1M again and sync that range.
    $ xfs_io -c "mmap -w 768K 256K" \
    -c "mwrite -S 0xef 768K 256K" \
    -c "msync -s 768K 256K" \
    -c "munmap" \
    /mnt/foo

    # (power failure happens here)

    # Mount to replay the log.
    $ mount /dev/sdd /mnt
    $ umount /mnt

    $ btrfs check /dev/sdd
    Opening filesystem to check...
    Checking filesystem on /dev/sdd
    UUID: 482fb574-b288-478e-a190-a9c44a78fca6
    [1/7] checking root items
    [2/7] checking extents
    [3/7] checking free space cache
    [4/7] checking fs roots
    root 5 inode 257 errors 100, file extent discount
    Found file extent holes:
    start: 262144, len: 524288
    ERROR: errors found in fs roots
    found 720896 bytes used, error(s) found
    total csum bytes: 512
    total tree bytes: 131072
    total fs tree bytes: 32768
    total extent tree bytes: 16384
    btree space waste bytes: 123514
    file data blocks allocated: 589824
    referenced 589824

    Fix this issue by setting the range to full (0 to LLONG_MAX) when the
    NO_HOLES feature is not enabled. This results in extra work being done
    but it gives the guarantee we don't end up with missing holes after
    replaying the log.
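
    In sketch form, assuming the range is a plain (start, end) pair;
    fsync_log_range() is a hypothetical helper, not the kernel function.

```python
LLONG_MAX = 2**63 - 1


def fsync_log_range(start, end, no_holes_enabled):
    """Return the byte range whose extents the fsync should log."""
    if no_holes_enabled:
        return (start, end)          # the requested range is sufficient
    # Without NO_HOLES every hole needs a file extent item, so log everything.
    return (0, LLONG_MAX)
```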

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 8e19c9732ad1d127b5575a10f4fbcacf740500ff upstream.

    If we have an error while building the backref tree in relocation we'll
    process all the pending edges and then free the node. However if we
    integrated some edges into the cache we'll lose our link to those edges
    by simply freeing this node, which means we'll leak memory and
    references to any roots that we've found.

    Instead we need to use remove_backref_node(), which walks through all of
    the edges that are still linked to this node and frees them up and
    drops any root references we may be holding.

    CC: stable@vger.kernel.org # 4.9+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 75ec1db8717a8f0a9d9c8d033e542fdaa7b73898 upstream.

    In my EIO stress testing I noticed I was getting forced to rescan the
    uuid tree pretty often, which was weird. This is because my error
    injection stuff would sometimes inject an error after log replay but
    before we loaded the UUID tree. If log replay committed the transaction
    it wouldn't have updated the uuid tree generation, but the tree was
    valid and didn't change, so there's no reason to not update the
    generation here.

    Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
    after reading all the fs roots if the uuid tree generation matches the
    fs generation. Then any transaction commits that happen during mount
    won't screw up our uuid tree state, forcing us to do needless uuid
    rescans.
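
    A toy model of the effect, with the superblock as a dict and a mount that
    commits once (as log replay might) before an injected error aborts it.
    The helper names and the flag string are hypothetical simplifications.

```python
def commit(sb, flags):
    sb["generation"] += 1
    if "UPDATE_UUID_TREE_GEN" in flags:
        sb["uuid_tree_generation"] = sb["generation"]


def interrupted_mount(sb, set_bit_early):
    flags = set()
    if set_bit_early and sb["uuid_tree_generation"] == sb["generation"]:
        flags.add("UPDATE_UUID_TREE_GEN")    # the fix: set right after reading roots
    commit(sb, flags)                        # e.g. a commit done by log replay
    return sb                                # error injected here, mount aborted


def needs_uuid_rescan(sb):
    return sb["uuid_tree_generation"] != sb["generation"]
```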

    Fixes: 70f801754728 ("Btrfs: check UUID tree during mount if required")
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 6217b0fadd4473a16fabc6aecd7527a9f71af534 upstream.

    If we do merge_reloc_roots() we could insert a few roots onto the dirty
    subvol roots list, where we hold a ref on them. If we fail to start the
    transaction we need to run clean_dirty_subvols() in order to cleanup the
    refs.

    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit f0cc2cd70164efe8f75c5d99560f0f69969c72e4 upstream.

    During unmount we can have a job from the delayed inode items work queue
    still running, that can lead to at least two bad things:

    1) A crash, because the worker can try to create a transaction just
    after the fs roots were freed;

    2) A transaction leak, because the worker can create a transaction
    before the fs roots are freed and just after we committed the last
    transaction and after we stopped the transaction kthread.

    A stack trace example of the crash:

    [79011.691214] kernel BUG at lib/radix-tree.c:982!
    [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2
    (...)
    [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
    [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
    (...)
    [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
    [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
    [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
    [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
    [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
    [79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
    [79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
    [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [79011.715448] Call Trace:
    [79011.715925] record_root_in_trans+0x72/0xf0 [btrfs]
    [79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
    [79011.717925] start_transaction+0xdd/0x5c0 [btrfs]
    [79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
    [79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs]
    [79011.720773] process_one_work+0x26d/0x6a0
    [79011.721497] worker_thread+0x4f/0x3e0
    [79011.722153] ? process_one_work+0x6a0/0x6a0
    [79011.722901] kthread+0x103/0x140
    [79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70
    [79011.724379] ret_from_fork+0x3a/0x50
    (...)

    The following diagram shows a sequence of steps that lead to the crash
    during unmount of the filesystem:

    CPU 1                                       CPU 2                                       CPU 3

    btrfs_punch_hole()
      btrfs_btree_balance_dirty()
        btrfs_balance_delayed_items()
          --> sees
              fs_info->delayed_root->items
              with value 200, which is greater
              than
              BTRFS_DELAYED_BACKGROUND (128)
              and smaller than
              BTRFS_DELAYED_WRITEBACK (512)
          btrfs_wq_run_delayed_node()
            --> queues a job for
                fs_info->delayed_workers to run
                btrfs_async_run_delayed_root()

                                                                                            btrfs_async_run_delayed_root()
                                                                                              --> job queued by CPU 1

                                                                                              --> starts picking and running
                                                                                                  delayed nodes from the
                                                                                                  prepare_list list

                                                close_ctree()

                                                  btrfs_delete_unused_bgs()

                                                  btrfs_commit_super()

                                                    btrfs_join_transaction()
                                                      --> gets transaction N

                                                    btrfs_commit_transaction(N)
                                                      --> set transaction state
                                                          to TRANS_STATE_COMMIT_START

                                                                                            btrfs_first_prepared_delayed_node()
                                                                                              --> picks delayed node X through
                                                                                                  the prepare_list list

                                                      btrfs_run_delayed_items()

                                                        btrfs_first_delayed_node()
                                                          --> also picks delayed node X
                                                              but through the node_list
                                                              list

                                                        __btrfs_commit_inode_delayed_items()
                                                          --> runs all delayed items from
                                                              this node and drops the
                                                              node's item count to 0
                                                              through call to
                                                              btrfs_release_delayed_inode()

                                                        --> finishes running any remaining
                                                            delayed nodes

                                                      --> finishes transaction commit

                                                  --> stops cleaner and transaction threads

                                                  btrfs_free_fs_roots()
                                                    --> frees all roots and removes them
                                                        from the radix tree
                                                        fs_info->fs_roots_radix

                                                                                            btrfs_join_transaction()
                                                                                              start_transaction()
                                                                                                btrfs_record_root_in_trans()
                                                                                                  record_root_in_trans()
                                                                                                    radix_tree_tag_set()
                                                                                                      --> crashes because
                                                                                                          the root is not in
                                                                                                          the radix tree
                                                                                                          anymore
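
    The threshold check from the first step of the diagram can be sketched as
    below. Only the middle case (between the two thresholds, a job is queued)
    is stated in the diagram; what happens below 128 and at or above 512 is
    an assumption here, and balance_action() is a hypothetical name.

```python
BTRFS_DELAYED_BACKGROUND = 128   # values quoted in the diagram
BTRFS_DELAYED_WRITEBACK = 512


def balance_action(pending_items):
    if pending_items < BTRFS_DELAYED_BACKGROUND:
        return "nothing"             # assumption: too few items to bother
    if pending_items >= BTRFS_DELAYED_WRITEBACK:
        return "run_and_wait"        # assumption: caller waits for the work
    return "queue_async_job"         # the case shown in the diagram
```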

    If the worker is able to call btrfs_join_transaction() before the unmount
    task frees the fs roots, we end up leaking a transaction and all its
    resources, since after the call to btrfs_commit_super() and stopping the
    transaction kthread, we don't expect to have any transaction open anymore.

    When this situation happens the worker has a delayed node that has no
    more items to run, since the task calling btrfs_run_delayed_items(),
    which is doing a transaction commit, picks the same node and runs all
    its items first.

    We can not wait for the worker to complete when running delayed items
    through btrfs_run_delayed_items(), because we call that function in
    several phases of a transaction commit, and that could cause a deadlock
    because the worker calls btrfs_join_transaction() and the task doing the
    transaction commit may have already set the transaction state to
    TRANS_STATE_COMMIT_DOING.

    Also it's not possible to get into a situation where only some of the
    items of a delayed node are added to the fs/subvolume tree in the current
    transaction and the remaining ones in the next transaction, because when
    running the items of a delayed inode we lock its mutex, effectively
    waiting for the worker if the worker is running the items of the delayed
    node already.

    Since this can only cause issues when unmounting a filesystem, fix it in
    a simple way by waiting for any jobs on the delayed workers queue before
    calling btrfs_commit_super() at close_ctree(). This works because at this
    point no one can call btrfs_btree_balance_dirty() or
    btrfs_balance_delayed_items(), and if we end up waiting for any worker to
    complete, btrfs_commit_super() will commit the transaction created by the
    worker.
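
    The ordering of the fix can be modelled in userspace with a thread pool
    standing in for the delayed workers queue. This is a sketch under that
    assumption; the function names mirror the ones above for readability
    only, and the bodies are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def delayed_worker(fs):
    # The worker joins a transaction, which needs the fs roots to exist.
    if not fs["roots"]:
        raise RuntimeError("root is no longer in the radix tree")
    fs["events"].append("worker_join_transaction")


def close_ctree(fs, pending_jobs):
    wait(pending_jobs)                       # drain the delayed workers first
    fs["events"].append("commit_super")      # commits what the worker started
    fs["roots"].clear()
    fs["events"].append("free_fs_roots")


def unmount_demo():
    fs = {"roots": {5: "fs tree"}, "events": []}
    with ThreadPoolExecutor(max_workers=1) as pool:
        job = pool.submit(delayed_worker, fs)
        close_ctree(fs, [job])
        job.result()                         # re-raises if the worker failed
    return fs["events"]
```

    Because the wait happens before the commit, the worker can neither run
    after the roots are freed nor leave a transaction behind uncommitted.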

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit b3ff8f1d380e65dddd772542aa9bff6c86bf715a upstream.

    [BUG]
    There is a fuzzed image which could cause KASAN report at unmount time.

    BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
    Read of size 8 at addr ffff888067cf6848 by task umount/1922

    CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    Call Trace:
    dump_stack+0x5b/0x8b
    print_address_description+0x70/0x280
    kasan_report+0x13a/0x19b
    btrfs_queue_work+0x2c1/0x390
    btrfs_wq_submit_bio+0x1cd/0x240
    btree_submit_bio_hook+0x18c/0x2a0
    submit_one_bio+0x1be/0x320
    flush_write_bio.isra.41+0x2c/0x70
    btree_write_cache_pages+0x3bb/0x7f0
    do_writepages+0x5c/0x130
    __writeback_single_inode+0xa3/0x9a0
    writeback_single_inode+0x23d/0x390
    write_inode_now+0x1b5/0x280
    iput+0x2ef/0x600
    close_ctree+0x341/0x750
    generic_shutdown_super+0x126/0x370
    kill_anon_super+0x31/0x50
    btrfs_kill_super+0x36/0x2b0
    deactivate_locked_super+0x80/0xc0
    deactivate_super+0x13c/0x150
    cleanup_mnt+0x9a/0x130
    task_work_run+0x11a/0x1b0
    exit_to_usermode_loop+0x107/0x130
    do_syscall_64+0x1e5/0x280
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    [CAUSE]
    The fuzzed image has a completely screwed up extent tree:

    leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
    refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
    item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
    extent refs 1 gen 9 flags 1
    ref#0: extent data backref root 5 objectid 259 offset 0 count 1
    item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
    extent refs 1 gen 9 flags 1
    ref#0: extent data backref root 5 objectid 271 offset 0 count 1
    item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
    extent refs 1 gen 9 flags 1
    ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
    item 3 key (29360128 169 0) itemoff 3803 itemsize 33
    extent refs 1 gen 9 flags 2
    ref#0: tree block backref root 5
    item 4 key (29368320 169 1) itemoff 3770 itemsize 33
    extent refs 1 gen 9 flags 2
    ref#0: tree block backref root 5
    item 5 key (29372416 169 0) itemoff 3737 itemsize 33
    extent refs 1 gen 9 flags 2
    ref#0: tree block backref root 5

    Note that leaf 29421568 doesn't have its backref in the extent tree.
    Thus extent allocator can re-allocate leaf 29421568 for other trees.

    In short, the bug is caused by:

    - Existing tree block gets allocated to log tree
    This got its generation bumped.

    - Log tree balance cleaned dirty bit of offending tree block
    It will not be written back to disk, thus no WRITTEN flag.

    - Original owner of the tree block gets COWed
    Since the tree block has higher transid, no WRITTEN flag, it's reused,
    and not traced by transaction::dirty_pages.

    - Transaction aborted
    Tree blocks get cleaned according to transaction::dirty_pages. But the
    offending tree block is not recorded at all.

    - Filesystem unmount
    All pages are assumed to be clean, so all workqueues are destroyed and
    then iput(btree_inode) is called.
    But the offending tree block is still dirty, which triggers writeback
    and causes the use-after-free bug.

    The detailed sequence looks like this:

    - Initial status
    eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
    not traced by any dirty extent_io_tree.

    - New tree block is allocated
    Since there is no backref for 29421568, it's re-allocated as new tree
    block.
    Keep in mind that tree block 29421568 is still referenced by the
    extent tree.

    - Tree block 29421568 is filled for log tree
    eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
    traced by btrfs_root::dirty_log_pages

    - Some log tree operations
    Since the fs is using node size 4096, the log tree can easily go a
    level higher.

    - Log tree needs balance
    Tree block 29421568 gets all its content pushed to right, thus now
    it is empty, and we don't need it.
    btrfs_clean_tree_block() from __push_leaf_right() gets called.

    eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
    traced by btrfs_root::dirty_log_pages

    - Log tree write back
    btree_write_cache_pages() goes through the dirty page ranges, but since
    the page of tree block 29421568 has already been cleaned, it's not
    written back to disk. Thus it doesn't have the WRITTEN bit set.
    But the ranges in dirty_log_pages are cleared.

    eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
    not traced by any dirty extent_io_tree.

    - Extent tree update when committing transaction
    Since tree block 29421568 has transid equal to running trans, and has
    no WRITTEN bit, should_cow_block() will use it directly without adding
    it to btrfs_transaction::dirty_pages.

    eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
    not traced by any dirty extent_io_tree.

    At this stage, we're doomed. We have a dirty eb not tracked by any
    extent io tree.

    - Transaction gets aborted due to corrupted extent tree
    Btrfs cleans up dirty pages according to transaction::dirty_pages and
    btrfs_root::dirty_log_pages.
    But since tree block 29421568 is tracked by neither of them, it's
    still dirty.

    eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
    not traced by any dirty extent_io_tree.

    - Filesystem unmount
    Since all cleanup is assumed to be done, all workqueues are destroyed.
    Then iput(btree_inode) is called, expecting no dirty pages.
    But tree block 29421568 is still dirty, thus triggering writeback.
    Since all workqueues are already freed, we cause a use-after-free.

    This shows us that log tree blocks combined with a bad extent tree can
    leave stray dirty pages that nothing tracks.
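
    The reuse decision in the "Extent tree update" step can be sketched as
    a small userspace model. This is only an illustration: struct eb_model
    and model_should_cow() are hypothetical names, and the real
    should_cow_block() checks more conditions than the two described here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent buffer header; not the kernel type. */
struct eb_model {
	uint64_t gen;	/* generation stored in the block header */
	bool written;	/* true if the header carries the WRITTEN flag */
};

/*
 * Simplified model of the decision described above: a block whose
 * generation equals the running transaction and which was never written
 * is reused in place; anything else must be COWed.
 */
static bool model_should_cow(const struct eb_model *eb, uint64_t transid)
{
	if (eb->gen == transid && !eb->written)
		return false;	/* reused directly, no new dirty range recorded */
	return true;
}
```

    With gen=9 and no WRITTEN flag in transaction 9, the model returns
    false, which matches the state of eb 29421568 right before it becomes
    an untracked dirty block.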

    [FIX]
    To fix the problem, don't submit any btree write bio if the filesystem
    has any error. This is the last safety net, just in case other cleanup
    hasn't caught it.
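
    The idea of the fix can be sketched as a userspace model. Everything
    here is an assumption made for illustration: struct fs_model and
    flush_btree_write() are hypothetical, and returning -EROFS is one
    plausible choice, not a statement about the actual patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for filesystem state; not the kernel's fs_info. */
struct fs_model {
	bool error;	/* set once the filesystem has hit an error */
	int submitted;	/* write bios handed over to the worker threads */
};

/*
 * Model of the guard: when the filesystem is in an error state, the bio
 * is dropped before it can reach a workqueue that may already be gone.
 */
static int flush_btree_write(struct fs_model *fs)
{
	if (fs->error)
		return -EROFS;	/* illustrative error code (assumption) */
	fs->submitted++;	/* healthy filesystem: submit as usual */
	return 0;
}
```

    In this model, a stray dirty block found at unmount time on an errored
    filesystem never reaches the submission path, so destroyed workqueues
    are never touched.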

    Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • [ Upstream commit ea287ab157c2816bf12aad4cece41372f9d146b4 ]

    We always search the commit root of the extent tree for looking up back
    references, however we track the reloc roots based on their current
    bytenr.

    This is wrong: if we commit the transaction between relocating tree
    blocks, we could end up in this code in build_backref_tree():

    if (key.objectid == key.offset) {
        /*
         * Only root blocks of reloc trees use backref
         * pointing to itself.
         */
        root = find_reloc_root(rc, cur->bytenr);
        ASSERT(root);
        cur->root = root;
        break;
    }

    find_reloc_root() is looking based on the bytenr we had in the commit
    root, but if we've COWed this reloc root we will not find that bytenr,
    and we will trip over the ASSERT(root).

    Fix this by using the commit_root->start bytenr for indexing the commit
    root. Then we change the __update_reloc_root() caller to be used when
    we switch the commit root for the reloc root during commit.

    This fixes the panic I was seeing when we started throttling relocation
    for delayed refs.
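
    The indexing change can be sketched with a small userspace model. The
    names reloc_root_model, find_by_commit_root() and switch_commit_root()
    are hypothetical, and a linear array stands in for whatever lookup
    structure the kernel uses; only the choice of key is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a reloc root; not the kernel structure. */
struct reloc_root_model {
	uint64_t node_start;		/* bytenr of the current root node */
	uint64_t commit_root_start;	/* bytenr of the commit root node */
};

/* Look roots up by the commit root bytenr, matching the backref search. */
static struct reloc_root_model *
find_by_commit_root(struct reloc_root_model *roots, size_t n, uint64_t bytenr)
{
	for (size_t i = 0; i < n; i++)
		if (roots[i].commit_root_start == bytenr)
			return &roots[i];
	return NULL;
}

/* At commit time the commit root is switched, so the key moves with it. */
static void switch_commit_root(struct reloc_root_model *root)
{
	root->commit_root_start = root->node_start;
}
```

    In this model, a root that was COWed mid-transaction is still found
    under its old bytenr until switch_commit_root() runs, which is the
    property the ASSERT(root) in the snippet above relies on.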

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik