20 Jan, 2021

6 commits

  • [ Upstream commit cb13eea3b49055bd78e6ddf39defd6340f7379fc ]

    If we remount a filesystem in RO mode while the qgroup rescan worker is
    running, the worker can remain running after the remount is done, and at
    unmount time we may end up with an open transaction that never gets
    committed. If that happens we end up with several memory leaks and can
    crash when hardware acceleration is unavailable for crc32c. Possibly it
    can lead to other nasty surprises too, due to use-after-free issues.

    The following steps explain how the problem happens.

    1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
    running;

    2) We remount the filesystem in RO mode, and never stop/pause the rescan
    worker, so after the remount the rescan worker is still running. The
    important detail here is that the rescan task is still running after
    the remount operation committed any ongoing transaction through its
    call to btrfs_commit_super();

    3) The rescan worker is still running after the remount completed and,
    once it finished iterating all leaves of the extent tree, it started a
    transaction to update the qgroup status item in the quotas tree. It does
    not commit that transaction, it only releases its handle on it;

    4) A filesystem unmount operation starts shortly after;

    5) The unmount task, at close_ctree(), stops the transaction kthread,
    which had not had a chance to commit the open transaction, since it was
    sleeping and the commit interval (30 seconds by default) had not yet
    elapsed since the last time it committed a transaction;

    6) So after stopping the transaction kthread we still have the transaction
    used to update the qgroup status item open. At close_ctree(), when the
    filesystem is in RO mode and no transaction abort happened (or the
    filesystem is in error mode), we do not expect to have any transaction
    open, so we do not call btrfs_commit_super();

    7) We then proceed to destroy the work queues, free the roots and block
    groups, etc. After that we drop the last reference on the btree inode
    by calling iput() on it. Since there are dirty pages for the btree
    inode, corresponding to the COWed extent buffer for the quotas btree,
    btree_write_cache_pages() is invoked to flush those dirty pages. This
    results in creating a bio and submitting it, which makes us end up at
    btrfs_submit_metadata_bio();

    8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
    that calls btrfs_wq_submit_bio(), because check_async_write() returned
    a value of 1. It returned 1 because we did not have hardware
    acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
    set in fs_info->flags;

    9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
    workqueue at fs_info->workers, which was already freed before by the
    call to btrfs_stop_all_workers() at close_ctree(). This results in an
    invalid memory access due to a use-after-free, leading to a crash.

    When this happens, before the crash there are several warnings triggered,
    since we have reserved metadata space in a block group, the delayed refs
    reservation, etc:

    ------------[ cut here ]------------
    WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
    Code: f0 01 00 00 48 39 c2 75 (...)
    RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
    RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
    RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
    RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
    R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 01 48 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c6 ]---
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
    Code: 48 83 bb b0 03 00 00 00 (...)
    RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
    RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
    RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
    R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 01 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c7 ]---
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
    Code: ad de 49 be 22 01 00 (...)
    RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
    RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
    RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
    R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c8 ]---
    BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
    BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
    BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

    And the crash, which only happens when we do not have crc32c hardware
    acceleration, produces the following trace immediately after those
    warnings:

    stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
    Code: 54 55 53 48 89 f3 (...)
    RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
    RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
    RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
    R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
    FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
    btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
    submit_one_bio+0x61/0x70 [btrfs]
    btree_write_cache_pages+0x414/0x450 [btrfs]
    ? kobject_put+0x9a/0x1d0
    ? trace_hardirqs_on+0x1b/0xf0
    ? _raw_spin_unlock_irqrestore+0x3c/0x60
    ? free_debug_processing+0x1e1/0x2b0
    do_writepages+0x43/0xe0
    ? lock_acquired+0x199/0x490
    __writeback_single_inode+0x59/0x650
    writeback_single_inode+0xaf/0x120
    write_inode_now+0x94/0xd0
    iput+0x187/0x2b0
    close_ctree+0x2c6/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f3cfebabee7
    Code: ff 0b 00 f7 d8 64 89 01 (...)
    RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
    RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
    R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    ---[ end trace dd74718fef1ed5cc ]---

    Finally when we remove the btrfs module (rmmod btrfs), there are several
    warnings about objects that were allocated from our slabs but were never
    freed, consequence of the transaction that was never committed and got
    leaked:

    =============================================================================
    BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? lock_release+0x20e/0x4c0
    kmem_cache_destroy+0x55/0x120
    btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x0000000050cbdd61 @offset=12104
    INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
    btrfs_free_tree_block+0x128/0x360 [btrfs]
    __btrfs_cow_block+0x489/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    INFO: Object 0x0000000086e9b0ff @offset=12776
    INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
    btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
    commit_cowonly_roots+0x248/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    =============================================================================
    BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
    CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? lock_release+0x20e/0x4c0
    kmem_cache_destroy+0x55/0x120
    btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x000000001a340018 @offset=4408
    INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
    btrfs_free_tree_block+0x128/0x360 [btrfs]
    __btrfs_cow_block+0x489/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    btrfs_commit_transaction+0x60/0xc40 [btrfs]
    create_subvol+0x56a/0x990 [btrfs]
    btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
    __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
    btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
    btrfs_ioctl+0x1a92/0x36f0 [btrfs]
    __x64_sys_ioctl+0x83/0xb0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    INFO: Object 0x000000002b46292a @offset=13648
    INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
    btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    =============================================================================
    BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? __mutex_unlock_slowpath+0x45/0x2a0
    kmem_cache_destroy+0x55/0x120
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x000000004cf95ea8 @offset=6264
    INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
    CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

    Fix this issue by having the remount path stop the qgroup rescan worker
    when we are remounting RO, and by teaching the rescan worker to stop when
    a remount is in progress. If a remount to RW mode happens later, we
    already resume the qgroup rescan worker through the call to
    btrfs_qgroup_rescan_resume(), so we do not need to worry about that.
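
    What follows is a minimal sketch of the shape of that fix, using mocked-up
    types and a hypothetical "remount in progress" flag instead of the real
    fs_info and qgroup rescan structures; it is only meant to illustrate the
    coordination between the remount path and the worker, not the actual
    patch:

    #include <stdbool.h>
    #include <stdio.h>

    /* Mock stand-in for the rescan state kept in fs_info (hypothetical). */
    struct mock_fs_info {
            bool remount_ro_in_progress;    /* set by the remount path */
            bool rescan_running;
    };

    /* Remount-to-RO path: ask the rescan worker to stop and wait for it,
     * before the last transaction is committed via btrfs_commit_super(). */
    static void remount_ro(struct mock_fs_info *fs)
    {
            fs->remount_ro_in_progress = true;
            while (fs->rescan_running)
                    ;       /* the kernel uses a proper wait, not a spin loop */
    }

    /* Rescan worker: bail out if a RO remount is in progress, so it never
     * starts a transaction that nobody is left around to commit. */
    static void rescan_worker(struct mock_fs_info *fs)
    {
            fs->rescan_running = true;
            for (int leaf = 0; leaf < 1000; leaf++) {
                    if (fs->remount_ro_in_progress)
                            goto out;       /* stop early, no new transaction */
                    /* ... process one extent tree leaf ... */
            }
            /* only here would we start a transaction to update the qgroup
             * status item in the quotas tree */
    out:
            fs->rescan_running = false;
    }

    int main(void)
    {
            struct mock_fs_info fs = {0};

            rescan_worker(&fs);     /* normally runs to completion */
            remount_ro(&fs);        /* with the fix, no rescan outlives this */
            printf("remounted RO with no rescan worker left running\n");
            return 0;
    }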

    Tested-by: Fabian Vogt
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit 8fc058597a283e9a37720abb0e8d68e342b9387d ]

    btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
    a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
    ktime.
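
    A rough sketch of the locking pattern being changed, mocked with a pthread
    mutex and a plain timestamp helper rather than the real discard_ctl and
    ktime code; the point is simply to stop dropping the lock only to retake
    it immediately, and to hand the already-computed "now" to the scheduling
    helper instead of reading the clock twice:

    #include <pthread.h>
    #include <time.h>

    static pthread_mutex_t discard_lock = PTHREAD_MUTEX_INITIALIZER;

    static long long now_ns(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Before: the work function dropped the lock, and the scheduling helper
     * immediately took it again and read the clock a second time. */
    static void schedule_work_old(void)
    {
            pthread_mutex_lock(&discard_lock);
            long long now = now_ns();
            (void)now;      /* compute the delay from "now" and arm the timer */
            pthread_mutex_unlock(&discard_lock);
    }

    static void discard_workfn_old(void)
    {
            pthread_mutex_lock(&discard_lock);
            /* ... pick the next block group to discard ... */
            pthread_mutex_unlock(&discard_lock);
            schedule_work_old();            /* relocks right away */
    }

    /* After: a "locked" variant is called while still holding the lock and
     * reuses the timestamp the caller already computed. */
    static void schedule_work_locked(long long now)
    {
            (void)now;      /* compute the delay from the caller's "now" */
    }

    static void discard_workfn_new(void)
    {
            pthread_mutex_lock(&discard_lock);
            long long now = now_ns();
            /* ... pick the next block group to discard ... */
            schedule_work_locked(now);      /* no unlock/relock, one clock read */
            pthread_mutex_unlock(&discard_lock);
    }

    int main(void)
    {
            discard_workfn_old();
            discard_workfn_new();
            return 0;
    }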

    Reviewed-by: Josef Bacik
    Signed-off-by: Pavel Begunkov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit ea9ed87c73e87e044b2c58d658eb4ba5216bc488 ]

    It might happen that bg->discard_eligible_time is changed without
    rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
    time, peek_discard_list() returns NULL, and all work halts and goes to
    sleep without further rescheduling, even though there are block groups
    left to discard.

    It happens pretty often, but it is not very visible from userspace,
    because after some time the work will usually be kicked off anyway by
    someone else calling btrfs_discard_reschedule_work().

    Fix it by continuing to reschedule as long as the block group discard
    lists are not empty.
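
    A rough sketch of the fixed control flow, with a plain array standing in
    for the real per-filesystem discard lists; the names below are made up,
    the point is only that a worker that wakes up too early must reschedule
    itself instead of going idle while the lists still have entries:

    #include <stdbool.h>
    #include <stddef.h>

    struct mock_block_group {
            long long discard_eligible_time;        /* may move into the future */
    };

    static bool reschedule_requested;

    /* Return the first block group whose eligible time has passed, or NULL. */
    static struct mock_block_group *
    peek_discard_list(struct mock_block_group *bgs, size_t nr, long long now)
    {
            for (size_t i = 0; i < nr; i++)
                    if (bgs[i].discard_eligible_time <= now)
                            return &bgs[i];
            return NULL;
    }

    static void discard_workfn(struct mock_block_group *bgs, size_t nr,
                               long long now)
    {
            struct mock_block_group *bg = peek_discard_list(bgs, nr, now);

            if (!bg) {
                    /* The eligible time moved without a reschedule, so we woke
                     * up too early.  Before the fix we just returned here and
                     * nothing woke us again; with the fix we reschedule as
                     * long as the lists are not empty. */
                    if (nr > 0)
                            reschedule_requested = true;
                    return;
            }
            /* ... issue discards for bg ... */
    }

    int main(void)
    {
            struct mock_block_group bgs[1] = { { .discard_eligible_time = 100 } };

            discard_workfn(bgs, 1, 50);     /* woke up before the new time */
            return reschedule_requested ? 0 : 1;
    }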

    Reviewed-by: Josef Bacik
    Signed-off-by: Pavel Begunkov
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit 347fb0cfc9bab5195c6701e62eda488310d7938f ]

    While mounting a crafted image provided by a user, the kernel panics due
    to an invalid chunk item whose end is less than its start.

    [66.387422] loop: module loaded
    [66.389773] loop0: detected capacity change from 262144 to 0
    [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
    [66.431061] BTRFS info (device loop0): disk space caching is enabled
    [66.431078] BTRFS info (device loop0): has skinny extents
    [66.437101] BTRFS error: insert state: end < start 29360127 37748736
    [66.437136] ------------[ cut here ]------------
    [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
    [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G O 5.11.0-rc1-custom #45
    [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
    [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
    [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
    [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
    [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
    [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
    [66.437447] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [66.437451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [66.437460] PKRU: 55555554
    [66.437464] Call Trace:
    [66.437475] set_extent_bit+0x652/0x740 [btrfs]
    [66.437539] set_extent_bits_nowait+0x1d/0x20 [btrfs]
    [66.437576] add_extent_mapping+0x1e0/0x2f0 [btrfs]
    [66.437621] read_one_chunk+0x33c/0x420 [btrfs]
    [66.437674] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
    [66.437708] ? kvm_sched_clock_read+0x18/0x40
    [66.437739] open_ctree+0xb32/0x1734 [btrfs]
    [66.437781] ? bdi_register_va+0x1b/0x20
    [66.437788] ? super_setup_bdi_name+0x79/0xd0
    [66.437810] btrfs_mount_root.cold+0x12/0xeb [btrfs]
    [66.437854] ? __kmalloc_track_caller+0x217/0x3b0
    [66.437873] legacy_get_tree+0x34/0x60
    [66.437880] vfs_get_tree+0x2d/0xc0
    [66.437888] vfs_kern_mount.part.0+0x78/0xc0
    [66.437897] vfs_kern_mount+0x13/0x20
    [66.437902] btrfs_mount+0x11f/0x3c0 [btrfs]
    [66.437940] ? kfree+0x5ff/0x670
    [66.437944] ? __kmalloc_track_caller+0x217/0x3b0
    [66.437962] legacy_get_tree+0x34/0x60
    [66.437974] vfs_get_tree+0x2d/0xc0
    [66.437983] path_mount+0x48c/0xd30
    [66.437998] __x64_sys_mount+0x108/0x140
    [66.438011] do_syscall_64+0x38/0x50
    [66.438018] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [66.438023] RIP: 0033:0x7f0138827f6e
    [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
    [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
    [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
    [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
    [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
    [66.438078] irq event stamp: 18169
    [66.438082] hardirqs last enabled at (18175): [] console_unlock+0x4ff/0x5f0
    [66.438088] hardirqs last disabled at (18180): [] console_unlock+0x467/0x5f0
    [66.438092] softirqs last enabled at (16910): [] asm_call_irq_on_stack+0x12/0x20
    [66.438097] softirqs last disabled at (16905): [] asm_call_irq_on_stack+0x12/0x20
    [66.438103] ---[ end trace e114b111db64298b ]---
    [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
    [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
    [66.441069] ------------[ cut here ]------------
    [66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
    [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G W O 5.11.0-rc1-custom #45
    [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
    [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
    [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
    [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
    [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
    [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
    [66.458356] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [66.459841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [66.462196] PKRU: 55555554
    [66.462692] Call Trace:
    [66.463139] set_extent_bit.cold+0x30/0x98 [btrfs]
    [66.464049] set_extent_bits_nowait+0x1d/0x20 [btrfs]
    [66.490466] add_extent_mapping+0x1e0/0x2f0 [btrfs]
    [66.514097] read_one_chunk+0x33c/0x420 [btrfs]
    [66.534976] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
    [66.555718] ? kvm_sched_clock_read+0x18/0x40
    [66.575758] open_ctree+0xb32/0x1734 [btrfs]
    [66.595272] ? bdi_register_va+0x1b/0x20
    [66.614638] ? super_setup_bdi_name+0x79/0xd0
    [66.633809] btrfs_mount_root.cold+0x12/0xeb [btrfs]
    [66.652938] ? __kmalloc_track_caller+0x217/0x3b0
    [66.671925] legacy_get_tree+0x34/0x60
    [66.690300] vfs_get_tree+0x2d/0xc0
    [66.708221] vfs_kern_mount.part.0+0x78/0xc0
    [66.725808] vfs_kern_mount+0x13/0x20
    [66.742730] btrfs_mount+0x11f/0x3c0 [btrfs]
    [66.759350] ? kfree+0x5ff/0x670
    [66.775441] ? __kmalloc_track_caller+0x217/0x3b0
    [66.791750] legacy_get_tree+0x34/0x60
    [66.807494] vfs_get_tree+0x2d/0xc0
    [66.823349] path_mount+0x48c/0xd30
    [66.838753] __x64_sys_mount+0x108/0x140
    [66.854412] do_syscall_64+0x38/0x50
    [66.869673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [66.885093] RIP: 0033:0x7f0138827f6e
    [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
    [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
    [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
    [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
    [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
    [67.216138] ---[ end trace e114b111db64298c ]---
    [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
    [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
    [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
    [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
    [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
    [67.465436] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [67.511660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [67.558449] PKRU: 55555554
    [67.581146] note: mount[613] exited with preempt_count 2

    The image has a chunk item with a logical start of 37748736 and a length
    of 18446744073701163008 (-8M). The calculated end, 29360127, wraps around
    (overflows). insert_state() then returns EEXIST because of the duplicate
    end, and extent_io_tree_panic() is called.

    Add an overflow check of the chunk item end to the tree checker so the
    corruption can be detected early at mount time.
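
    A small sketch of that kind of sanity check, written as a standalone
    function over plain integers rather than the real chunk item accessors
    used by the tree checker:

    #include <stdint.h>
    #include <stdio.h>

    /* Return 0 if the chunk [start, start + length) is sane, nonzero if it is
     * empty or its end wraps around the u64 range, as with the crafted length
     * 18446744073701163008 (-8M). */
    static int check_chunk_range(uint64_t start, uint64_t length)
    {
            if (length == 0)
                    return -1;              /* zero-length chunk */
            if (start + length < start)
                    return -1;              /* u64 overflow: end < start */
            return 0;
    }

    int main(void)
    {
            /* values from the report above */
            uint64_t start = 37748736ULL;
            uint64_t length = 18446744073701163008ULL;

            printf("crafted chunk rejected: %s\n",
                   check_chunk_range(start, length) ? "yes" : "no");
            return 0;
    }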

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain
    Signed-off-by: Su Yue
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Su Yue
     
  • commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream.

    Some extent io trees are initialized with a NULL private member (e.g.
    btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
    Dereferencing a NULL tree->private as an inode pointer will cause a panic.

    Pass tree->fs_info instead, as it's known to be valid in all cases.
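
    A minimal sketch of the pointer problem, with mock structures standing in
    for the real extent_io_tree, inode and fs_info types; the idea is simply
    that the error-reporting helper must not reach fs_info through a private
    inode pointer that may be NULL, but through the fs_info pointer stored in
    the tree itself:

    #include <stddef.h>
    #include <stdio.h>

    struct mock_fs_info { const char *name; };
    struct mock_inode   { struct mock_fs_info *fs_info; };

    struct mock_extent_io_tree {
            struct mock_fs_info *fs_info;   /* always valid */
            struct mock_inode   *private;   /* NULL for e.g. alloc_state trees */
    };

    /* Before: crashes when tree->private is NULL. */
    static const char *report_fs_old(struct mock_extent_io_tree *tree)
    {
            return tree->private->fs_info->name;
    }

    /* After: use the tree's own fs_info, which is valid in all cases. */
    static const char *report_fs_new(struct mock_extent_io_tree *tree)
    {
            return tree->fs_info->name;
    }

    int main(void)
    {
            struct mock_fs_info fs = { .name = "loop0" };
            struct mock_extent_io_tree excluded = {
                    .fs_info = &fs,
                    .private = NULL,        /* as for excluded_extents */
            };

            printf("fs: %s\n", report_fs_new(&excluded));   /* fine */
            /* report_fs_old(&excluded) would dereference a NULL pointer */
            (void)report_fs_old;
            return 0;
    }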

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
    Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain
    Signed-off-by: Su Yue
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Su Yue
     
  • commit 50e31ef486afe60f128d42fb9620e2a63172c15c upstream.

    [BUG]
    There are several bug reports about recent kernel unable to relocate
    certain data block groups.

    Sometimes the error just goes away, but there is one reporter who can
    reproduce it reliably.

    The dmesg would look like:

    [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
    [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
    [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
    [463.501781] BTRFS info (device dm-10): balance: ended with status: -2

    [CAUSE]
    The ENOENT error is returned from the following call chain:

    add_data_references()
    |- delete_v1_space_cache();
    |- if (!found)
    return -ENOENT;

    The variable @found is set to true if we find a data extent whose
    disk bytenr matches parameter @data_bytes.

    With extra debugging, the offending tree block looks like this:

    leaf bytenr = 42676709441536, data_bytenr = 34626327621632

    ctime 1567904822.739884119 (2019-09-08 03:07:02)
    mtime 0.0 (1970-01-01 01:00:00)
    otime 0.0 (1970-01-01 01:00:00)
    item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
    generation 1517381 type 2 (prealloc)
    prealloc data disk byte 34626327621632 nr 262144 <<<
    prealloc data offset 0 nr 262144
    item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
    generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
    lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
    uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

    Although item 27 has disk bytenr 34626327621632, which matches the
    data_bytenr, its type is prealloc, not reg.
    This makes the existing code skip that item, and return ENOENT.

    [FIX]
    The code was changed in commit 19b546d7a1b2 ("btrfs: relocation: Use
    btrfs_find_all_leafs to locate data extent parent tree leaves"); before
    that commit, we used something like

    "if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

    But in that offending commit we check only for (type ==
    BTRFS_FILE_EXTENT_REG), ignoring BTRFS_FILE_EXTENT_PREALLOC.

    Fix it by also checking for BTRFS_FILE_EXTENT_PREALLOC.
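
    A sketch of the type check being widened, using plain enum values in place
    of the on-disk file extent item accessors; a preallocated extent points at
    allocated disk space just like a regular one, so both types must be
    accepted when matching the data bytenr:

    #include <stdbool.h>
    #include <stdint.h>

    enum mock_file_extent_type {
            MOCK_FILE_EXTENT_INLINE,
            MOCK_FILE_EXTENT_REG,
            MOCK_FILE_EXTENT_PREALLOC,
    };

    struct mock_file_extent {
            enum mock_file_extent_type type;
            uint64_t disk_bytenr;
    };

    /* Before the fix: only REG extents were considered, so a PREALLOC extent
     * with a matching disk bytenr was skipped and -ENOENT was returned. */
    static bool matches_old(const struct mock_file_extent *fi, uint64_t bytenr)
    {
            return fi->type == MOCK_FILE_EXTENT_REG && fi->disk_bytenr == bytenr;
    }

    /* After the fix: regular and preallocated extents both count. */
    static bool matches_new(const struct mock_file_extent *fi, uint64_t bytenr)
    {
            if (fi->type != MOCK_FILE_EXTENT_REG &&
                fi->type != MOCK_FILE_EXTENT_PREALLOC)
                    return false;
            return fi->disk_bytenr == bytenr;
    }

    int main(void)
    {
            /* item 27 from the dump above: a prealloc extent at the bytenr
             * that relocation is looking for */
            struct mock_file_extent item27 = {
                    .type = MOCK_FILE_EXTENT_PREALLOC,
                    .disk_bytenr = 34626327621632ULL,
            };

            return (matches_new(&item27, 34626327621632ULL) &&
                    !matches_old(&item27, 34626327621632ULL)) ? 0 : 1;
    }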

    Reported-by: Stéphane Lesimple
    Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
    Fixes: 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
    CC: stable@vger.kernel.org # 5.6+
    Tested-By: Stéphane Lesimple
    Reviewed-by: Su Yue
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

17 Jan, 2021

3 commits

  • [ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ]

    Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
    shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
    some infrastructure we have in place to flush inodes that we use for
    device replace and snapshot. However, this introduced a pretty serious
    performance regression. To reproduce it, the user untarred the source
    tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
    see it take anywhere from 5 to 20 times as long to untar in 5.10
    compared to 5.9. This was observed on fast devices (SSD and better) and
    not on HDD.

    The root cause is that before this change we would generally use the normal
    writeback path to reclaim delalloc space, and for this we would provide
    it with the number of pages we wanted to flush. The referenced commit
    changed this to flush that many inodes, which drastically increased the
    amount of space we were flushing in certain cases, which severely
    affected performance.

    We cannot revert this patch unfortunately because of 3d45f221ce62
    ("btrfs: fix deadlock when cloning inline extent and low on free
    metadata space") which requires the ability to skip flushing inodes that
    are being cloned in certain scenarios, which means we need to keep using
    our flushing infrastructure or risk re-introducing the deadlock.

    Instead to fix this problem we can go back to providing
    btrfs_start_delalloc_roots with a number of pages to flush, and then set
    up a writeback_control and utilize sync_inode() to handle the flushing
    for us. This gives us the same behavior we had prior to the fix, while
    still allowing us to avoid the deadlock that was fixed by Filipe. I
    redid the user's original test and got the following results on one of
    our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)

    5.9 0m54.258s
    5.10 1m26.212s
    5.10+patch 0m38.800s

    5.10+patch is significantly faster than plain 5.9 because of my patch
    series "Change data reservations to use the ticketing infra" which
    contained the patch that introduced the regression, but generally
    improved the overall ENOSPC flushing mechanisms.

    Additional testing on a consumer-grade SSD (8GiB ram, 8 CPUs) confirms
    the results:

    5.10.5 4m00s
    5.10.5+patch 1m08s
    5.11-rc2 5m14s
    5.11-rc2+patch 1m30s
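
    A rough sketch of the difference in units, with a mocked flushing helper
    instead of the real writeback_control/sync_inode machinery (the names
    below are made up): the reclaim path goes back to asking for a bounded
    number of pages rather than whole inodes, so one huge inode can no longer
    blow the amount of flushed data far past what the reservation needed:

    #include <stdio.h>

    struct mock_inode { long dirty_pages; };

    /* Flush at most nr_pages dirty pages of inode and return how many were
     * written; stands in for writeback with .nr_to_write set accordingly. */
    static long flush_inode_pages(struct mock_inode *inode, long nr_pages)
    {
            long written = inode->dirty_pages < nr_pages ?
                           inode->dirty_pages : nr_pages;

            inode->dirty_pages -= written;
            return written;
    }

    /* Reclaim delalloc until roughly to_reclaim pages have been flushed. */
    static void shrink_delalloc(struct mock_inode *inodes, int nr_inodes,
                                long to_reclaim)
    {
            long flushed = 0;

            for (int i = 0; i < nr_inodes && flushed < to_reclaim; i++)
                    flushed += flush_inode_pages(&inodes[i], to_reclaim - flushed);
            printf("flushed %ld pages (wanted %ld)\n", flushed, to_reclaim);
    }

    int main(void)
    {
            /* one small inode and one huge one, as in the untar workload */
            struct mock_inode inodes[2] = {
                    { .dirty_pages = 16 },
                    { .dirty_pages = 400000 },
            };

            /* flushing whole inodes would have written all 400016 pages */
            shrink_delalloc(inodes, 2, 1024);
            return 0;
    }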

    Reported-by: René Rebe
    Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
    CC: stable@vger.kernel.org # 5.10
    Signed-off-by: Josef Bacik
    Tested-by: David Sterba
    Reviewed-by: David Sterba
    [ add my test results ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]

    When cloning an inline extent there are cases where we can not just copy
    the inline extent from the source range to the target range (e.g. when the
    target range starts at an offset greater than zero). In such cases we copy
    the inline extent's data into a page of the destination inode and then
    dirty that page. However, after that we will need to start a transaction
    for each processed extent and, if we are ever low on available metadata
    space, we may need to flush existing delalloc for all dirty inodes in an
    attempt to release metadata space - if that happens we may deadlock:

    * the async reclaim task queued a delalloc work to flush delalloc for
    the destination inode of the clone operation;

    * the task executing that delalloc work gets blocked waiting for the
    range with the dirty page to be unlocked, which is currently locked
    by the task doing the clone operation;

    * the async reclaim task blocks waiting for the delalloc work to complete;

    * the cloning task is waiting on the waitqueue of its reservation ticket
    while holding the range with the dirty page locked in the inode's
    io_tree;

    * if metadata space is not released by some other task (like delalloc for
    some other inode completing for example), the clone task waits forever
    and as a consequence the delalloc work and async reclaim tasks will hang
    forever as well. Releasing more space on the other hand may require
    starting a transaction, which will hang as well when trying to reserve
    metadata space, resulting in a deadlock between all these tasks.

    When this happens, traces like the following show up in dmesg/syslog:

    [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
    [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
    [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
    [87452.326136] Call Trace:
    [87452.326737] __schedule+0x5d1/0xcf0
    [87452.327390] schedule+0x45/0xe0
    [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
    [87452.328894] ? finish_wait+0x90/0x90
    [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
    [87452.330133] ? __mod_memcg_state+0x8e/0x160
    [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
    [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
    [87452.332007] ? lock_release+0x20e/0x4c0
    [87452.332557] ? trace_hardirqs_on+0x1b/0xf0
    [87452.333127] extent_writepages+0x43/0x90 [btrfs]
    [87452.333653] ? lock_acquire+0x1a3/0x490
    [87452.334177] do_writepages+0x43/0xe0
    [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
    [87452.335720] __filemap_fdatawrite_range+0xc5/0x100
    [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
    [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
    [87452.337838] process_one_work+0x24e/0x5e0
    [87452.338437] worker_thread+0x50/0x3b0
    [87452.339137] ? process_one_work+0x5e0/0x5e0
    [87452.339884] kthread+0x153/0x170
    [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.341153] ret_from_fork+0x22/0x30
    [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
    [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
    [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
    [87452.345655] Call Trace:
    [87452.346305] __schedule+0x5d1/0xcf0
    [87452.346947] ? kvm_clock_read+0x14/0x30
    [87452.347676] ? wait_for_completion+0x81/0x110
    [87452.348389] schedule+0x45/0xe0
    [87452.349077] schedule_timeout+0x30c/0x580
    [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [87452.350340] ? lock_acquire+0x1a3/0x490
    [87452.351006] ? try_to_wake_up+0x7a/0xa20
    [87452.351541] ? lock_release+0x20e/0x4c0
    [87452.352040] ? lock_acquired+0x199/0x490
    [87452.352517] ? wait_for_completion+0x81/0x110
    [87452.353000] wait_for_completion+0xab/0x110
    [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
    [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
    [87452.354455] flush_space+0x24f/0x660 [btrfs]
    [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
    [87452.355565] process_one_work+0x24e/0x5e0
    [87452.356024] worker_thread+0x20f/0x3b0
    [87452.356487] ? process_one_work+0x5e0/0x5e0
    [87452.356973] kthread+0x153/0x170
    [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.357880] ret_from_fork+0x22/0x30
    (...)
    < stack traces of several tasks waiting for the locks of the inodes of the
    clone operation >
    (...)
    [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
    [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
    [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
    [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
    [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
    [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
    [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
    [92867.447920] Call Trace:
    [92867.448435] __schedule+0x5d1/0xcf0
    [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [92867.449423] schedule+0x45/0xe0
    [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
    [92867.450576] ? finish_wait+0x90/0x90
    [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
    [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
    [92867.452412] start_transaction+0x2d1/0x760 [btrfs]
    [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
    [92867.453848] ? lock_release+0x20e/0x4c0
    [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
    [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
    [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
    [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
    [92867.457213] do_clone_file_range+0xd4/0x1f0
    [92867.457828] vfs_clone_file_range+0x4d/0x230
    [92867.458355] ? lock_release+0x20e/0x4c0
    [92867.458890] ioctl_file_clone+0x8f/0xc0
    [92867.459377] do_vfs_ioctl+0x342/0x750
    [92867.459913] __x64_sys_ioctl+0x62/0xb0
    [92867.460377] do_syscall_64+0x33/0x80
    [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    (...)
    < stack traces of more tasks blocked on metadata reservation like the clone
    task above, because the async reclaim task has deadlocked >
    (...)

    Another thing to notice is that the worker task that is deadlocked when
    trying to flush the destination inode of the clone operation is at
    btrfs_invalidatepage(). This is simply because the clone operation has a
    destination offset greater than the i_size and we only update the i_size
    of the destination file after cloning an extent (just like we do in the
    buffered write path).

    Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
    the flushing of delalloc for all inodes that have delalloc, add a runtime
    flag to an inode to signal it should not be flushed, and for inodes with
    that flag set, start_delalloc_inodes() will simply skip them. When the
    cloning code needs to dirty a page to copy an inline extent, set that flag
    on the inode and then clear it when the clone operation finishes.

    This could be sporadically triggered with test case generic/269 from
    fstests, which exercises many fsstress processes running in parallel with
    several dd processes filling up the entire filesystem.
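
    A minimal sketch of the runtime flag mechanism described above, with a
    plain boolean standing in for the real per-inode runtime flag and a loop
    standing in for start_delalloc_inodes(); the clone path marks the
    destination inode as "do not flush" around the inline extent copy, and
    the delalloc flushing loop skips any inode so marked, which breaks the
    circular wait:

    #include <stdbool.h>
    #include <stdio.h>

    struct mock_inode {
            bool has_delalloc;
            bool no_delalloc_flush;         /* runtime flag set by the clone path */
    };

    /* Stand-in for start_delalloc_inodes(): flush every delalloc inode except
     * the ones explicitly marked as not flushable. */
    static void flush_delalloc_inodes(struct mock_inode *inodes, int nr)
    {
            for (int i = 0; i < nr; i++) {
                    if (!inodes[i].has_delalloc || inodes[i].no_delalloc_flush)
                            continue;
                    inodes[i].has_delalloc = false;         /* "flushed" */
            }
    }

    /* Cloning an inline extent into a dirty page of dst. */
    static void clone_inline_extent(struct mock_inode *dst)
    {
            dst->no_delalloc_flush = true;  /* set before dirtying the page */
            dst->has_delalloc = true;       /* copy the data, dirty the page */

            /* async reclaim may kick in here while we hold the page locked */
            flush_delalloc_inodes(dst, 1);
            printf("dst skipped by the flush: %s\n",
                   dst->has_delalloc ? "yes" : "no");

            dst->no_delalloc_flush = false; /* cleared when the clone finishes */
    }

    int main(void)
    {
            struct mock_inode dst = {0};

            clone_inline_extent(&dst);
            return 0;
    }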

    CC: stable@vger.kernel.org # 5.9+
    Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit f2f121ab500d0457cc9c6f54269d21ffdf5bd304 ]

    Every time we log an inode we look up in the fs/subvol tree for xattrs
    and, if we have any, log them into the log tree. However it is very
    common to have inodes without any xattrs, so doing the search wastes
    time, but more importantly it adds contention on the fs/subvol tree
    locks, either making the logging code block and wait for tree locks or
    making other concurrent operations block and wait on the logging code.

    The most typical use cases where xattrs are used are when capabilities or
    ACLs are defined for an inode, or when SELinux is enabled.

    This change makes the logging code detect when an inode does not have
    xattrs and skip the xattrs search the next time the inode is logged,
    unless the inode is evicted and loaded again or an xattr is added to the
    inode. This way we skip the search for xattrs on inodes that never have
    xattrs and are fsynced with some frequency.
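
    A small sketch of the caching idea, with a boolean standing in for the
    real per-inode runtime flag (the names are made up): remember "this inode
    has no xattrs" after one empty search, skip the tree search on later
    fsyncs, and drop the hint as soon as an xattr is added or the in-memory
    inode is reloaded:

    #include <stdbool.h>
    #include <stdio.h>

    struct mock_inode {
            bool no_xattrs;         /* runtime-only hint, lost on eviction */
            int  nr_xattrs;
    };

    static int tree_searches;

    /* Stand-in for the fs/subvol tree lookup done while logging an inode. */
    static int search_xattrs(struct mock_inode *inode)
    {
            tree_searches++;
            return inode->nr_xattrs;
    }

    static void log_inode_xattrs(struct mock_inode *inode)
    {
            if (inode->no_xattrs)
                    return;                         /* skip the search entirely */
            if (search_xattrs(inode) == 0)
                    inode->no_xattrs = true;        /* remember for the next fsync */
            /* else: copy the xattr items into the log tree */
    }

    static void add_xattr(struct mock_inode *inode)
    {
            inode->nr_xattrs++;
            inode->no_xattrs = false;               /* the hint is no longer valid */
    }

    int main(void)
    {
            struct mock_inode inode = {0};

            log_inode_xattrs(&inode);       /* one search, finds nothing */
            log_inode_xattrs(&inode);       /* no search at all */
            add_xattr(&inode);
            log_inode_xattrs(&inode);       /* must search again */
            printf("tree searches: %d\n", tree_searches);   /* 2 instead of 3 */
            return 0;
    }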

    The following script that calls dbench was used to measure the impact of
    this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
    directly (no intermediary filesystem on the host) and using a non-debug
    kernel (default configuration on Debian distributions):

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/sdk
    MNT=/mnt/sdk
    MOUNT_OPTIONS="-o ssd"

    mkfs.btrfs -f -m single -d single $DEV
    mount $MOUNT_OPTIONS $DEV $MNT

    dbench -D $MNT -t 200 40

    umount $MNT

    The results before this change:

    Operation Count AvgLat MaxLat
    ----------------------------------------
    NTCreateX 5761605 0.172 312.057
    Close 4232452 0.002 10.927
    Rename 243937 1.406 277.344
    Unlink 1163456 0.631 298.402
    Deltree 160 11.581 221.107
    Mkdir 80 0.003 0.005
    Qpathinfo 5221410 0.065 122.309
    Qfileinfo 915432 0.001 3.333
    Qfsinfo 957555 0.003 3.992
    Sfileinfo 469244 0.023 20.494
    Find 2018865 0.448 123.659
    WriteX 2874851 0.049 118.529
    ReadX 9030579 0.004 21.654
    LockX 18754 0.003 4.423
    UnlockX 18754 0.002 0.331
    Flush 403792 10.944 359.494

    Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms

    The results after this change:

    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    NTCreateX    6442521     0.159   230.693
    Close        4732357     0.002    10.972
    Rename        272809     1.293   227.398
    Unlink       1301059     0.563   218.500
    Deltree          160     7.796    54.887
    Mkdir             80     0.008     0.478
    Qpathinfo    5839452     0.047   124.330
    Qfileinfo    1023199     0.001     4.996
    Qfsinfo      1070760     0.003     5.709
    Sfileinfo     524790     0.033    21.765
    Find         2257658     0.314   125.611
    WriteX       3211520     0.040   232.135
    ReadX       10098969     0.004    25.340
    LockX          20974     0.003     1.569
    UnlockX        20974     0.002     3.475
    Flush         451553    10.287   331.037

    Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms

    +10.8% throughput, -8.2% max latency

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     

13 Jan, 2021

2 commits

  • commit 0b3f407e6728d990ae1630a02c7b952c21c288d3 upstream.

    When doing an incremental send, if we have a new inode that happens to
    have the same number that an old directory inode had in the base snapshot
    and that old directory has a pending rmdir operation, we end up computing
    a wrong path for the new inode, causing the receiver to fail.

    Example reproducer:

    $ cat test-send-rmdir.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV >/dev/null
    mount $DEV $MNT

    mkdir $MNT/dir
    touch $MNT/dir/file1
    touch $MNT/dir/file2
    touch $MNT/dir/file3

    # Filesystem looks like:
    #
    # . (ino 256)
    # |----- dir/ (ino 257)
    # |----- file1 (ino 258)
    # |----- file2 (ino 259)
    # |----- file3 (ino 260)
    #

    btrfs subvolume snapshot -r $MNT $MNT/snap1
    btrfs send -f /tmp/snap1.send $MNT/snap1

    # Now remove our directory and all its files.
    rm -fr $MNT/dir

    # Unmount the filesystem and mount it again. This is to ensure that
    # the next inode that is created ends up with the same inode number
    # that our directory "dir" had, 257, which is the first free "objectid"
    # available after mounting again the filesystem.
    umount $MNT
    mount $DEV $MNT

    # Now create a new file (it could be a directory as well).
    touch $MNT/newfile

    # Filesystem now looks like:
    #
    # . (ino 256)
    # |----- newfile (ino 257)
    #

    btrfs subvolume snapshot -r $MNT $MNT/snap2
    btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

    # Now unmount the filesystem, create a new one, mount it and try to apply
    # both send streams to recreate both snapshots.
    umount $DEV

    mkfs.btrfs -f $DEV >/dev/null

    mount $DEV $MNT

    btrfs receive -f /tmp/snap1.send $MNT
    btrfs receive -f /tmp/snap2.send $MNT

    umount $MNT

    When running the test, the receive operation for the incremental stream
    fails:

    $ ./test-send-rmdir.sh
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
    At subvol /mnt/sdi/snap1
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
    At subvol /mnt/sdi/snap2
    At subvol snap1
    At snapshot snap2
    ERROR: chown o257-9-0 failed: No such file or directory

    So fix this by tracking directories that have a pending rmdir by inode
    number and generation number, instead of only inode number.
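
    A simplified sketch of the tracking change (structure and helper names
    are illustrative and abbreviated, not the exact send.c code):

    struct orphan_dir_info {
            struct rb_node node;
            u64 ino;
            u64 gen;    /* generation, now tracked alongside the inode number */
    };

    /* A pending rmdir only matches when both number and generation agree: */
    static bool is_waiting_for_rm(struct send_ctx *sctx, u64 dir_ino, u64 dir_gen)
    {
            struct orphan_dir_info *odi = get_orphan_dir_info(sctx, dir_ino);

            return odi != NULL && odi->gen == dir_gen;
    }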

    A test case for fstests follows soon.

    Reported-by: Massimo B.
    Tested-by: Massimo B.
    Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit ae5e070eaca9dbebde3459dd8f4c2756f8c097d0 upstream.

    There is a chance of racing for qgroup flushing which may lead to
    deadlock:

    Thread A                         |  Thread B
    (not holding trans handle)       |  (holding a trans handle)
    ---------------------------------+--------------------------------
    __btrfs_qgroup_reserve_meta()    |  __btrfs_qgroup_reserve_meta()
    |- try_flush_qgroup()            |  |- try_flush_qgroup()
       |- QGROUP_FLUSHING bit set    |     |
       |                             |     |- test_and_set_bit()
       |                             |     |- wait_event()
       |- btrfs_join_transaction()   |
       |- btrfs_commit_transaction() |

    !!! DEAD LOCK !!!

    Thread A wants to commit the transaction, but thread B is holding a
    transaction handle, which blocks the commit. At the same time, thread B
    is waiting for thread A to finish its commit.

    This is just a hot fix, and would lead to more EDQUOT when we're near
    the qgroup limit.

    The proper fix would be to make all metadata/data reservations happen
    without holding a transaction handle.

    CC: stable@vger.kernel.org # 5.9+
    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

30 Dec, 2020

3 commits

  • commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 upstream.

    When defragmenting we skip ranges that have holes or inline extents, so
    that we don't do unnecessary IO and waste space. We do this check when
    calling should_defrag_range() at btrfs_defrag_file(). However we do it
    without holding the inode's lock. The reason for this is to avoid
    blocking, for too long, other tasks that may want to operate on other
    file ranges, since after the call to should_defrag_range() and before
    locking the inode, we trigger a synchronous page cache readahead.
    However, before we are able to lock the inode, some other task might
    have punched a hole in our range, or we may now have an inline extent
    there, in which case we should not set the range for defrag anymore,
    since that would cause unnecessary IO and make us waste space (i.e.
    allocating extents to contain zeros for a hole).

    So after we lock the inode and the range in the io tree, check again if
    we have holes or an inline extent, and if we do, just skip the range.
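
    A sketch of the recheck; the helper below is hypothetical and only
    stands in for the actual re-validation logic:

    /* With the inode and the extent range locked, re-validate the range,
     * since another task may have punched a hole or created an inline
     * extent after should_defrag_range() ran: */
    lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
    if (!defrag_range_still_wanted(inode, start, end)) {    /* hypothetical */
            unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
                                 &cached_state);
            goto next_range;    /* skip, nothing useful to defrag here */
    }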

    I hit this while testing my next patch that fixes races when updating an
    inode's number of bytes (subject "btrfs: update the number of bytes used
    by an inode atomically"), and it depends on this change in order to work
    correctly. Alternatively I could rework that other patch to detect holes
    and flag their range with the 'new delalloc' bit, but this itself fixes
    an efficiency problem due to a race that, from a functional point of view,
    is not harmful (it could be triggered with btrfs/062 from fstests).

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 27d56e62e4748c2135650c260024e9904b8c1a0a upstream.

    While writing an explanation for the need of the commit_root_sem for
    btrfs_prepare_extent_commit, I realized we have a slight hole that could
    result in leaked space if we have to do the old style caching. Consider
    the following scenario

    commit root
    +----+----+----+----+----+----+----+
    |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
    +----+----+----+----+----+----+----+
    0    1    2    3    4    5    6    7

    new commit root
    +----+----+----+----+----+----+----+
    |    |    |    |\\\\|    |    |\\\\|
    +----+----+----+----+----+----+----+
    0    1    2    3    4    5    6    7

    Prior to this patch, we run btrfs_prepare_extent_commit(), which updates
    last_byte_to_unpin, and then we subsequently run switch_commit_roots().
    In this example let's assume that caching_ctl->progress == 1 at
    btrfs_prepare_extent_commit() time, which means that
    cache->last_byte_to_unpin == 1. Then we go and do the
    switch_commit_roots(), but in the meantime the caching thread has made
    some more progress, because we dropped the commit_root_sem and
    re-acquired it. Now caching_ctl->progress == 3. We swap out the commit
    root and carry on to unpin.

    The race can happen like:

    1) The caching thread was running using the old commit root when it
    found the extent for [2, 3);

    2) Then it released the commit_root_sem because it was in the last
    item of a leaf and the semaphore was contended, and set ->progress
    to 3 (value of 'last'), as the last extent item in the current leaf
    was for the extent for range [2, 3);

    3) Next time it gets the commit_root_sem, will start using the new
    commit root and search for a key with offset 3, so it never finds
    the hole for [2, 3).

    So the caching thread never saw [2, 3) as free space in any of the
    commit roots, and by the time finish_extent_commit() was called for
    the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
    subrange [0, 1) to the free space cache, skipping [2, 3).

    In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
    but do not unpin [2,3). However because caching_ctl->progress == 3 we
    do not see the newly freed section of [2,3), and thus do not add it to
    our free space cache. This results in us missing a chunk of free space
    in memory (on disk too, unless we have a power failure before writing
    the free space cache to disk).

    Fix this by making sure ->last_byte_to_unpin is set at the same time that
    we swap the commit roots; this ensures that we will always be
    consistent.
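
    A simplified sketch of where the update moves to (member names may
    differ slightly from the actual patch):

    /* Inside switch_commit_roots(), with the commit_root_sem held for
     * writing, so the caching thread cannot advance ->progress in between: */
    down_write(&fs_info->commit_root_sem);
    list_for_each_entry(caching_ctl, &fs_info->caching_block_groups, list)
            caching_ctl->block_group->last_byte_to_unpin = caching_ctl->progress;
    /* ... swap the commit roots here, inside the same critical section ... */
    up_write(&fs_info->commit_root_sem);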

    CC: stable@vger.kernel.org # 5.8+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    [ update changelog with Filipe's review comments ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 9076dbd5ee837c3882fc42891c14cecd0354a849 upstream.

    While fixing up our ->last_byte_to_unpin locking I noticed that we will
    shorten len based on ->last_byte_to_unpin if we're caching when we're
    adding back the free space. This is correct for the free space, as we
    cannot unpin more than ->last_byte_to_unpin, however we use len to
    adjust the ->bytes_pinned counters and such, which need to track the
    actual pinned usage. This could result in
    WARN_ON(space_info->bytes_pinned) triggering at unmount time.

    Fix this by using a local variable for the amount to add to free space
    cache, and leave len untouched in this case.
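
    A sketch of the separation, simplified from the actual unpin code:

    u64 to_free = 0;

    /* Only the part below ->last_byte_to_unpin can go back to the free
     * space cache while caching is still in progress: */
    if (start + len <= cache->last_byte_to_unpin)
            to_free = len;
    else if (start < cache->last_byte_to_unpin)
            to_free = cache->last_byte_to_unpin - start;
    if (to_free)
            btrfs_add_free_space(cache, start, to_free);

    /* The pinned accounting keeps using the full 'len': */
    cache->pinned -= len;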

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     

28 Nov, 2020

1 commit

  • Pull btrfs fixes from David Sterba:
    "A few fixes for various warnings that accumulated over past two weeks:

    - tree-checker: add missing return values for some errors

    - lockdep fixes
    - when reading qgroup config and starting quota rescan
    - reverse order of quota ioctl lock and VFS freeze lock

    - avoid accessing potentially stale fs info during device scan,
    reported by syzbot

    - add scope NOFS protection around qgroup relation changes

    - check for running transaction before flushing qgroups

    - fix tracking of new delalloc ranges for some cases"

    * tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: fix lockdep splat when enabling and disabling qgroups
    btrfs: do nofs allocations when adding and removing qgroup relations
    btrfs: fix lockdep splat when reading qgroup config on mount
    btrfs: tree-checker: add missing returns after data_ref alignment checks
    btrfs: don't access possibly stale fs_info data for printing duplicate device
    btrfs: tree-checker: add missing return after error in root_item
    btrfs: qgroup: don't commit transaction when we already hold the handle
    btrfs: fix missing delalloc new bit for new delalloc ranges

    Linus Torvalds
     

24 Nov, 2020

5 commits

  • When running test case btrfs/017 from fstests, lockdep reported the
    following splat:

    [ 1297.067385] ======================================================
    [ 1297.067708] WARNING: possible circular locking dependency detected
    [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
    [ 1297.068322] ------------------------------------------------------
    [ 1297.068629] btrfs/189080 is trying to acquire lock:
    [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.069274]
    but task is already holding lock:
    [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
    [ 1297.070219]
    which lock already depends on the new lock.

    [ 1297.071131]
    the existing dependency chain (in reverse order) is:
    [ 1297.071721]
    -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
    [ 1297.072375] lock_acquire+0xd8/0x490
    [ 1297.072710] __mutex_lock+0xa3/0xb30
    [ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
    [ 1297.073421] create_subvol+0x194/0x990 [btrfs]
    [ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
    [ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
    [ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
    [ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs]
    [ 1297.075245] __x64_sys_ioctl+0x83/0xb0
    [ 1297.075617] do_syscall_64+0x33/0x80
    [ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 1297.076380]
    -> #0 (sb_internal#2){.+.+}-{0:0}:
    [ 1297.077166] check_prev_add+0x91/0xc60
    [ 1297.077572] __lock_acquire+0x1740/0x3110
    [ 1297.077984] lock_acquire+0xd8/0x490
    [ 1297.078411] start_transaction+0x3c5/0x760 [btrfs]
    [ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
    [ 1297.079789] __x64_sys_ioctl+0x83/0xb0
    [ 1297.080232] do_syscall_64+0x33/0x80
    [ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 1297.081139]
    other info that might help us debug this:

    [ 1297.082536] Possible unsafe locking scenario:

    [ 1297.083510]        CPU0                                CPU1
    [ 1297.084005]        ----                                ----
    [ 1297.084500]   lock(&fs_info->qgroup_ioctl_lock);
    [ 1297.084994]                                            lock(sb_internal#2);
    [ 1297.085485]                                            lock(&fs_info->qgroup_ioctl_lock);
    [ 1297.085974]   lock(sb_internal#2);
    [ 1297.086454]
    *** DEADLOCK ***
    [ 1297.087880] 3 locks held by btrfs/189080:
    [ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
    [ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
    [ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
    [ 1297.089771]
    stack backtrace:
    [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
    [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    [ 1297.092123] Call Trace:
    [ 1297.092629] dump_stack+0x8d/0xb5
    [ 1297.093115] check_noncircular+0xff/0x110
    [ 1297.093596] check_prev_add+0x91/0xc60
    [ 1297.094076] ? kvm_clock_read+0x14/0x30
    [ 1297.094553] ? kvm_sched_clock_read+0x5/0x10
    [ 1297.095029] __lock_acquire+0x1740/0x3110
    [ 1297.095510] lock_acquire+0xd8/0x490
    [ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.096476] start_transaction+0x3c5/0x760 [btrfs]
    [ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
    [ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
    [ 1297.098904] ? do_user_addr_fault+0x20c/0x430
    [ 1297.099382] ? kvm_clock_read+0x14/0x30
    [ 1297.099854] ? kvm_sched_clock_read+0x5/0x10
    [ 1297.100328] ? sched_clock+0x5/0x10
    [ 1297.100801] ? sched_clock_cpu+0x12/0x180
    [ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0
    [ 1297.101739] __x64_sys_ioctl+0x83/0xb0
    [ 1297.102207] do_syscall_64+0x33/0x80
    [ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 1297.103148] RIP: 0033:0x7f773ff65d87

    This is because during the quota enable ioctl we first lock the mutex
    qgroup_ioctl_lock and then start a transaction, and starting a transaction
    acquires a fs freeze semaphore (at the VFS level). However, in every other
    code path, except for the quota disable ioctl path, we do the opposite:
    we start a transaction and then lock the mutex.

    So fix this by making the quota enable and disable paths start the
    transaction without having the mutex locked, and then, after starting the
    transaction, lock the mutex and check if some other task already enabled
    or disabled the quotas, bailing with success if that was the case.
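
    A sketch of the reordered quota enable path (simplified, item count and
    surrounding error handling omitted):

    /* Start the transaction first, without qgroup_ioctl_lock held: */
    trans = btrfs_start_transaction(fs_info->tree_root, 2);
    if (IS_ERR(trans))
            return PTR_ERR(trans);

    mutex_lock(&fs_info->qgroup_ioctl_lock);
    if (fs_info->quota_root) {
            /* Another task already enabled quotas, nothing to do. */
            ret = 0;
            goto out;
    }
    /* ... create the quota root and qgroup items under the mutex ... */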

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • When adding or removing a qgroup relation we are doing a GFP_KERNEL
    allocation which is not safe because we are holding a transaction
    handle open and that can make us deadlock if the allocator needs to
    recurse into the filesystem. So just surround those calls with a
    nofs context.
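
    A sketch of the NOFS scope; the allocation shown is illustrative, not
    the exact one from the patch:

    unsigned int nofs_flag;

    /* A transaction handle is held here, so GFP_KERNEL could recurse into
     * the filesystem; make the allocation behave like GFP_NOFS instead. */
    nofs_flag = memalloc_nofs_save();
    prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL);
    memalloc_nofs_restore(nofs_flag);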

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • Lockdep reported the following splat when running test btrfs/190 from
    fstests:

    [ 9482.126098] ======================================================
    [ 9482.126184] WARNING: possible circular locking dependency detected
    [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
    [ 9482.126365] ------------------------------------------------------
    [ 9482.126456] mount/24187 is trying to acquire lock:
    [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.126647]
    but task is already holding lock:
    [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
    [ 9482.126886]
    which lock already depends on the new lock.

    [ 9482.127078]
    the existing dependency chain (in reverse order) is:
    [ 9482.127213]
    -> #1 (btrfs-quota-00){++++}-{3:3}:
    [ 9482.127366] lock_acquire+0xd8/0x490
    [ 9482.127436] down_read_nested+0x45/0x220
    [ 9482.127528] __btrfs_tree_read_lock+0x27/0x120 [btrfs]
    [ 9482.127613] btrfs_read_lock_root_node+0x41/0x130 [btrfs]
    [ 9482.127702] btrfs_search_slot+0x514/0xc30 [btrfs]
    [ 9482.127788] update_qgroup_status_item+0x72/0x140 [btrfs]
    [ 9482.127877] btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs]
    [ 9482.127964] btrfs_work_helper+0xf1/0x600 [btrfs]
    [ 9482.128039] process_one_work+0x24e/0x5e0
    [ 9482.128110] worker_thread+0x50/0x3b0
    [ 9482.128181] kthread+0x153/0x170
    [ 9482.128256] ret_from_fork+0x22/0x30
    [ 9482.128327]
    -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}:
    [ 9482.128464] check_prev_add+0x91/0xc60
    [ 9482.128551] __lock_acquire+0x1740/0x3110
    [ 9482.128623] lock_acquire+0xd8/0x490
    [ 9482.130029] __mutex_lock+0xa3/0xb30
    [ 9482.130590] qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.131577] btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
    [ 9482.132175] open_ctree+0x1228/0x18a0 [btrfs]
    [ 9482.132756] btrfs_mount_root.cold+0x13/0xed [btrfs]
    [ 9482.133325] legacy_get_tree+0x30/0x60
    [ 9482.133866] vfs_get_tree+0x28/0xe0
    [ 9482.134392] fc_mount+0xe/0x40
    [ 9482.134908] vfs_kern_mount.part.0+0x71/0x90
    [ 9482.135428] btrfs_mount+0x13b/0x3e0 [btrfs]
    [ 9482.135942] legacy_get_tree+0x30/0x60
    [ 9482.136444] vfs_get_tree+0x28/0xe0
    [ 9482.136949] path_mount+0x2d7/0xa70
    [ 9482.137438] do_mount+0x75/0x90
    [ 9482.137923] __x64_sys_mount+0x8e/0xd0
    [ 9482.138400] do_syscall_64+0x33/0x80
    [ 9482.138873] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 9482.139346]
    other info that might help us debug this:

    [ 9482.140735] Possible unsafe locking scenario:

    [ 9482.141594]        CPU0                                CPU1
    [ 9482.142011]        ----                                ----
    [ 9482.142411]   lock(btrfs-quota-00);
    [ 9482.142806]                                            lock(&fs_info->qgroup_rescan_lock);
    [ 9482.143216]                                            lock(btrfs-quota-00);
    [ 9482.143629]   lock(&fs_info->qgroup_rescan_lock);
    [ 9482.144056]
    *** DEADLOCK ***

    [ 9482.145242] 2 locks held by mount/24187:
    [ 9482.145637] #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400
    [ 9482.146061] #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
    [ 9482.146509]
    stack backtrace:
    [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1
    [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    [ 9482.148709] Call Trace:
    [ 9482.149169] dump_stack+0x8d/0xb5
    [ 9482.149628] check_noncircular+0xff/0x110
    [ 9482.150090] check_prev_add+0x91/0xc60
    [ 9482.150561] ? kvm_clock_read+0x14/0x30
    [ 9482.151017] ? kvm_sched_clock_read+0x5/0x10
    [ 9482.151470] __lock_acquire+0x1740/0x3110
    [ 9482.151941] ? __btrfs_tree_read_lock+0x27/0x120 [btrfs]
    [ 9482.152402] lock_acquire+0xd8/0x490
    [ 9482.152887] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.153354] __mutex_lock+0xa3/0xb30
    [ 9482.153826] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.154301] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.154768] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.155226] qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.155690] btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
    [ 9482.156160] open_ctree+0x1228/0x18a0 [btrfs]
    [ 9482.156643] btrfs_mount_root.cold+0x13/0xed [btrfs]
    [ 9482.157108] ? rcu_read_lock_sched_held+0x5d/0x90
    [ 9482.157567] ? kfree+0x31f/0x3e0
    [ 9482.158030] legacy_get_tree+0x30/0x60
    [ 9482.158489] vfs_get_tree+0x28/0xe0
    [ 9482.158947] fc_mount+0xe/0x40
    [ 9482.159403] vfs_kern_mount.part.0+0x71/0x90
    [ 9482.159875] btrfs_mount+0x13b/0x3e0 [btrfs]
    [ 9482.160335] ? rcu_read_lock_sched_held+0x5d/0x90
    [ 9482.160805] ? kfree+0x31f/0x3e0
    [ 9482.161260] ? legacy_get_tree+0x30/0x60
    [ 9482.161714] legacy_get_tree+0x30/0x60
    [ 9482.162166] vfs_get_tree+0x28/0xe0
    [ 9482.162616] path_mount+0x2d7/0xa70
    [ 9482.163070] do_mount+0x75/0x90
    [ 9482.163525] __x64_sys_mount+0x8e/0xd0
    [ 9482.163986] do_syscall_64+0x33/0x80
    [ 9482.164437] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 9482.164902] RIP: 0033:0x7f51e907caaa

    This happens because at btrfs_read_qgroup_config() we can call
    qgroup_rescan_init() while holding a read lock on a quota btree leaf,
    acquired by the previous call to btrfs_search_slot_for_read(), and
    qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.

    A qgroup rescan worker does the opposite: it acquires the mutex
    qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
    update the qgroup status item in the quota btree through the call to
    update_qgroup_status_item(). This inversion of locking order
    between the qgroup_rescan_lock mutex and quota btree locks causes the
    splat.

    Fix this simply by releasing and freeing the path before calling
    qgroup_rescan_init() at btrfs_read_qgroup_config().
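
    A sketch of the fix (local variable names are illustrative and the
    argument list of qgroup_rescan_init() is abbreviated):

    /* Drop the quota tree leaf lock held by the earlier search before
     * taking fs_info->qgroup_rescan_lock inside qgroup_rescan_init(): */
    btrfs_free_path(path);
    path = NULL;

    if (rescan_needed)
            ret = qgroup_rescan_init(fs_info, rescan_progress, 0);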

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
    There are sectorsize alignment checks whose failures are reported, but
    check_extent_data_ref() then continues. This was not intended; wrong
    alignment is not a minor problem and we should return with an error.
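
    A sketch of the corrected check (message text and error helper shown
    schematically):

    if (!IS_ALIGNED(offset, fs_info->sectorsize)) {
            extent_err(leaf, slot,
                       "invalid extent data ref offset, have %llu expect aligned to %u",
                       offset, fs_info->sectorsize);
            return -EUCLEAN;    /* previously missing, checks kept going */
    }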

    CC: stable@vger.kernel.org # 5.4+
    Fixes: 0785a9aacf9d ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     
    Syzbot reported a possible use-after-free when printing a duplicate device
    warning in device_list_add().

    At this point it can happen that a btrfs_device::fs_info is not correctly
    set up yet, so we're accessing stale data when printing the warning
    message using the btrfs_printk() wrappers.

    ==================================================================
    BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
    Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068

    CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1d6/0x29e lib/dump_stack.c:118
    print_address_description+0x66/0x620 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report+0x132/0x1d0 mm/kasan/report.c:530
    btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
    device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
    btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
    btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x44840a
    RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
    RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
    RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
    RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
    R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003

    Allocated by task 6945:
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track mm/kasan/common.c:56 [inline]
    __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
    kmalloc_node include/linux/slab.h:577 [inline]
    kvmalloc_node+0x81/0x110 mm/util.c:574
    kvmalloc include/linux/mm.h:757 [inline]
    kvzalloc include/linux/mm.h:765 [inline]
    btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 6945:
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
    kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
    __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
    __cache_free mm/slab.c:3418 [inline]
    kfree+0x113/0x200 mm/slab.c:3756
    deactivate_locked_super+0xa7/0xf0 fs/super.c:335
    btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The buggy address belongs to the object at ffff8880878e0000
    which belongs to the cache kmalloc-16k of size 16384
    The buggy address is located 1704 bytes inside of
    16384-byte region [ffff8880878e0000, ffff8880878e4000)
    The buggy address belongs to the page:
    page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
    head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
    flags: 0xfffe0000010200(slab|head)
    raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
    raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    The syzkaller reproducer for this use-after-free crafts a filesystem image
    and loop mounts it twice in a loop. The mount will fail as the crafted
    image has an invalid chunk tree. When this happens btrfs_mount_root() will
    call deactivate_locked_super(), which then cleans up fs_info and
    fs_info::sb. If a second thread now adds the same block-device to the
    filesystem, it will get detected as a duplicate device and
    device_list_add() will reject the duplicate and print a warning. But as
    the fs_info pointer passed in is non-NULL this will result in a
    use-after-free.

    Instead of printing possibly uninitialized or already freed memory in
    btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
    device name will be skipped altogether.
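
    A sketch of the change in the duplicate-device path (the message text is
    paraphrased):

    /* The fs_info behind this fs_devices may already have been freed, so
     * print without it; btrfs_printk() then skips the device name: */
    btrfs_err(NULL,
              "device %s belongs to fsid %pU, and the fs is already mounted",
              path, fs_devices->fsid);
    return ERR_PTR(-EEXIST);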

    There was a slightly different approach discussed in
    https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u

    Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
    Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     

14 Nov, 2020

3 commits

  • There's a missing return statement after an error is found in the
    root_item, this can cause further problems when a crafted image triggers
    the error.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
    Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Daniel Xu
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Daniel Xu
     
  • [BUG]
    When running the following script, btrfs will trigger an ASSERT():

    #!/bin/bash
    mkfs.btrfs -f $dev
    mount $dev $mnt
    xfs_io -f -c "pwrite 0 1G" $mnt/file
    sync
    btrfs quota enable $mnt
    btrfs quota rescan -w $mnt

    # Manually set the limit below current usage
    btrfs qgroup limit 512M $mnt $mnt

    # Crash happens
    touch $mnt/file

    The dmesg looks like this:

    assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/ctree.h:3230!
    invalid opcode: 0000 [#1] SMP PTI
    RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
    btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
    try_flush_qgroup+0x67/0x100 [btrfs]
    __btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
    btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
    btrfs_update_inode+0x9d/0x110 [btrfs]
    btrfs_dirty_inode+0x5d/0xd0 [btrfs]
    touch_atime+0xb5/0x100
    iterate_dir+0xf1/0x1b0
    __x64_sys_getdents64+0x78/0x110
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fb5afe588db

    [CAUSE]
    In try_flush_qgroup(), we assume we don't hold a transaction handle at
    all. This is true for data reservation and mostly true for metadata,
    since data space reservation always happens before we start a
    transaction, and for most metadata operations we reserve space in
    start_transaction().

    But there is an exception, btrfs_delayed_inode_reserve_metadata().
    It holds a transaction handle, while still trying to reserve extra
    metadata space.

    When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we
    will join current transaction and commit, while we still have
    transaction handle from qgroup code.

    [FIX]
    Let's check current->journal_info before we join the transaction.

    If current->journal_info is unset or BTRFS_SEND_TRANS_STUB, it means we
    are not holding a transaction, thus we are able to join and then commit
    the transaction.

    If current->journal_info is a valid transaction handle, we avoid
    committing the transaction and just end it.

    This is less effective than committing the current transaction, as it
    won't free reserved metadata space, but we may still free some data space
    before the new data writes.
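
    A sketch of the check in try_flush_qgroup() (simplified):

    bool can_commit = true;

    /* If this task already holds a transaction handle (e.g. when called via
     * btrfs_delayed_inode_reserve_metadata()), committing here would trip
     * the assertion, so only end the joined handle instead: */
    if (current->journal_info &&
        current->journal_info != BTRFS_SEND_TRANS_STUB)
            can_commit = false;

    trans = btrfs_join_transaction(root);
    if (IS_ERR(trans))
            return PTR_ERR(trans);

    ret = can_commit ? btrfs_commit_transaction(trans)
                     : btrfs_end_transaction(trans);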

    Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
    Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • When doing a buffered write, through one of the write family syscalls, we
    look for ranges which currently don't have allocated extents and set the
    'delalloc new' bit on them, so that we can report a correct number of used
    blocks to the stat(2) syscall until delalloc is flushed and ordered extents
    complete.

    However there are a few other places where we can do a buffered write
    against a range that is mapped to a hole (no extent allocated) and where
    we do not set the 'new delalloc' bit. Those places are:

    - Doing a memory mapped write against a hole;

    - Cloning an inline extent into a hole starting at file offset 0;

    - Calling btrfs_cont_expand() when the i_size of the file is not aligned
    to the sector size and is located in a hole. For example when cloning
    to a destination offset beyond EOF.

    So after such cases, until the corresponding delalloc range is flushed and
    the respective ordered extents complete, we can report an incorrect number
    of blocks used through the stat(2) syscall.

    In some cases we can end up reporting 0 used blocks to stat(2), which is a
    particular bad value to report as it may mislead tools to think a file is
    completely sparse when its i_size is not zero, making them skip reading
    any data, an undesired consequence for tools such as archivers and other
    backup tools, as reported a long time ago in the following thread (and
    other past threads):

    https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

    Example reproducer:

    $ cat reproducer.sh
    #!/bin/bash

    MNT=/mnt/sdi
    DEV=/dev/sdi

    mkfs.btrfs -f $DEV > /dev/null
    # mkfs.xfs -f $DEV > /dev/null
    # mkfs.ext4 -F $DEV > /dev/null
    # mkfs.f2fs -f $DEV > /dev/null
    mount $DEV $MNT

    xfs_io -f -c "truncate 64K" \
    -c "mmap -w 0 64K" \
    -c "mwrite -S 0xab 0 64K" \
    -c "munmap" \
    $MNT/foo

    blocks_used=$(stat -c %b $MNT/foo)
    echo "blocks used: $blocks_used"

    if [ $blocks_used -eq 0 ]; then
    echo "ERROR: blocks used is 0"
    fi

    umount $DEV

    $ ./reproducer.sh
    blocks used: 0
    ERROR: blocks used is 0

    So move the logic that decides to set the 'new delalloc' bit into the
    function btrfs_set_extent_delalloc(), since that is what we use for all
    those missing cases as well as for the cases that currently work well.
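
    Conceptually the change looks like the sketch below; the helper that
    detects whether the range sits over an unallocated region is hypothetical:

    /* Inside btrfs_set_extent_delalloc(), before setting the bits for the
     * range: */
    if (range_is_unallocated(inode, start, end))        /* hypothetical */
            extra_bits |= EXTENT_DELALLOC_NEW;
    /* ... then set EXTENT_DELALLOC | extra_bits on [start, end] as before ... */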

    This change is also preparatory work for an upcoming patch that fixes
    other problems related to tracking and reporting the number of bytes used
    by an inode.

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

11 Nov, 2020

1 commit

  • Pull btrfs fixes from David Sterba:
    "A handful of minor fixes and updates:

    - handle missing device replace item on mount (syzbot report)

    - fix space reservation calculation when finishing relocation

    - fix memory leak on error path in ref-verify (debugging feature)

    - fix potential overflow during defrag on 32bit arches

    - minor code update to silence smatch warning

    - minor error message updates"

    * tag 'for-5.10-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod
    btrfs: dev-replace: fail mount if we don't have replace item with target device
    btrfs: scrub: update message regarding read-only status
    btrfs: clean up NULL checks in qgroup_unreserve_range()
    btrfs: fix min reserved size calculation in merge_reloc_root
    btrfs: print the block rsv type when we fail our reservation
    btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch

    Linus Torvalds
     

05 Nov, 2020

7 commits

  • There is one error handling path that does not free ref, which may cause
    a minor memory leak.

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Dinghao Liu
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dinghao Liu
     
    If there is a device with BTRFS_DEV_REPLACE_DEVID but no device replace
    item, then the filesystem is in an inconsistent state. This is either
    corruption or a crafted image. Fail the mount, as this needs a closer
    look at what is actually wrong.

    As of now, if BTRFS_DEV_REPLACE_DEVID is present without the replace
    item, in __btrfs_free_extra_devids() we determine that there is an
    extra device, and free those extra devices but continue to mount the
    filesystem.
    However, we were wrong in keeping track of the rw_devices, so the syzbot
    testcase failed:

    WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x198/0x1fd lib/dump_stack.c:118
    panic+0x347/0x7c0 kernel/panic.c:231
    __warn.cold+0x20/0x46 kernel/panic.c:600
    report_bug+0x1bd/0x210 lib/bug.c:198
    handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
    exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
    asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
    RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
    RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
    RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
    RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
    RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
    R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
    close_fs_devices fs/btrfs/volumes.c:1193 [inline]
    btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
    open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
    btrfs_fill_super fs/btrfs/super.c:1316 [inline]
    btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672

    The fix here is: when we determine that there isn't a replace item,
    fail the mount if there is a replace target device (devid 0).
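
    A sketch of the new check (the condition names and the message text are
    illustrative):

    /* After scanning for the dev_replace item during mount: */
    if (!replace_item_found && target_dev_present) {
            btrfs_err(fs_info,
                      "replace target device (devid %d) present with no dev_replace item",
                      BTRFS_DEV_REPLACE_DEVID);
            return -EUCLEAN;    /* refuse to mount the inconsistent fs */
    }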

    CC: stable@vger.kernel.org # 4.19+
    Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
    Based on user feedback, update the message printed when scrub fails to
    start due to write requirements. To make the messages distinguishable,
    add a device id to them.

    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    David Sterba
     
  • Smatch complains that this code dereferences "entry" before checking
    whether it's NULL on the next line. Fortunately, rb_entry() will never
    return NULL so it doesn't cause a problem. We can clean up the NULL
    checking a bit to silence the warning and make the code more clear.

    Reviewed-by: Qu Wenruo
    Signed-off-by: Dan Carpenter
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dan Carpenter
     
    The minimum reserve size was adjusted to take into account the height of
    the tree we are merging; however, we can have a root with a level == 0.
    What we want is root_level + 1 to get the number of nodes we may have to
    cow. This fixes the enospc_debug warnings that pop up with btrfs/101.
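
    A sketch of the corrected calculation (the exact expression in the patch
    may differ slightly):

    /* Reserve for root_level + 1 nodes, so even a level 0 root reserves
     * space for the one node we may have to COW: */
    min_reserved = fs_info->nodesize * (btrfs_root_level(root_item) + 1) * 2;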

    Nikolay: this fixes the failures on btrfs/060, btrfs/062, btrfs/063 and
    btrfs/195 that I was seeing; the call trace was:

    [ 3680.515564] ------------[ cut here ]------------
    [ 3680.515566] BTRFS: block rsv returned -28
    [ 3680.515585] WARNING: CPU: 2 PID: 8339 at fs/btrfs/block-rsv.c:521 btrfs_use_block_rsv+0x162/0x180
    [ 3680.515587] Modules linked in:
    [ 3680.515591] CPU: 2 PID: 8339 Comm: btrfs Tainted: G W 5.9.0-rc8-default #95
    [ 3680.515593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
    [ 3680.515595] RIP: 0010:btrfs_use_block_rsv+0x162/0x180
    [ 3680.515600] RSP: 0018:ffffa01ac9753910 EFLAGS: 00010282
    [ 3680.515602] RAX: 0000000000000000 RBX: ffff984b34200000 RCX: 0000000000000027
    [ 3680.515604] RDX: 0000000000000027 RSI: 0000000000000000 RDI: ffff984b3bd19e28
    [ 3680.515606] RBP: 0000000000004000 R08: ffff984b3bd19e20 R09: 0000000000000001
    [ 3680.515608] R10: 0000000000000004 R11: 0000000000000046 R12: ffff984b264fdc00
    [ 3680.515609] R13: ffff984b13149000 R14: 00000000ffffffe4 R15: ffff984b34200000
    [ 3680.515613] FS: 00007f4e2912b8c0(0000) GS:ffff984b3bd00000(0000) knlGS:0000000000000000
    [ 3680.515615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 3680.515617] CR2: 00007fab87122150 CR3: 0000000118e42000 CR4: 00000000000006e0
    [ 3680.515620] Call Trace:
    [ 3680.515627] btrfs_alloc_tree_block+0x8b/0x340
    [ 3680.515633] ? __lock_acquire+0x51a/0xac0
    [ 3680.515646] alloc_tree_block_no_bg_flush+0x4f/0x60
    [ 3680.515651] __btrfs_cow_block+0x14e/0x7e0
    [ 3680.515662] btrfs_cow_block+0x144/0x2c0
    [ 3680.515670] merge_reloc_root+0x4d4/0x610
    [ 3680.515675] ? btrfs_lookup_fs_root+0x78/0x90
    [ 3680.515686] merge_reloc_roots+0xee/0x280
    [ 3680.515695] relocate_block_group+0x2ce/0x5e0
    [ 3680.515704] btrfs_relocate_block_group+0x16e/0x310
    [ 3680.515711] btrfs_relocate_chunk+0x38/0xf0
    [ 3680.515716] btrfs_shrink_device+0x200/0x560
    [ 3680.515728] btrfs_rm_device+0x1ae/0x6a6
    [ 3680.515744] ? _copy_from_user+0x6e/0xb0
    [ 3680.515750] btrfs_ioctl+0x1afe/0x28c0
    [ 3680.515755] ? find_held_lock+0x2b/0x80
    [ 3680.515760] ? do_user_addr_fault+0x1f8/0x418
    [ 3680.515773] ? __x64_sys_ioctl+0x77/0xb0
    [ 3680.515775] __x64_sys_ioctl+0x77/0xb0
    [ 3680.515781] do_syscall_64+0x31/0x70
    [ 3680.515785] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported-by: Nikolay Borisov
    Fixes: 44d354abf33e ("btrfs: relocation: review the call sites which can be interrupted by signal")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • To help with debugging, print the type of the block rsv when we fail to
    use our target block rsv in btrfs_use_block_rsv.

    This now produces:

    [ 544.672035] BTRFS: block rsv 1 returned -28

    which is still cryptic without consulting the enum in block-rsv.h but I
    guess it's better than nothing.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add note from Nikolay ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • On 32-bit systems, this shift will overflow for files larger than 4GB as
    start_index is unsigned long while the calls to btrfs_delalloc_*_space
    expect u64.
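
    A sketch of the fix: do the shift in 64 bits by introducing a u64
    variable for the byte offset:

    /* start_index is an unsigned long (32 bits on 32-bit arches), so widen
     * it before shifting by PAGE_SHIFT: */
    u64 start = (u64)start_index << PAGE_SHIFT;

    /* ... pass 'start' to the btrfs_delalloc_*_space() calls instead of
       repeating the (possibly overflowing) shift at each call site ... */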

    CC: stable@vger.kernel.org # 4.4+
    Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
    Reviewed-by: Josef Bacik
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Sterba
    [ define the variable instead of repeating the shift ]
    Signed-off-by: David Sterba

    Matthew Wilcox (Oracle)
     

31 Oct, 2020

1 commit

  • Pull btrfs fixes from David Sterba:

    - lockdep fixes:
    - drop path locks before manipulating sysfs objects or qgroups
    - preliminary fixes before tree locks get switched to rwsem
    - use annotated seqlock

    - build warning fixes (printk format)

    - fix relocation vs fallocate race

    - tree checker properly validates number of stripes and parity

    - readahead vs device replace fixes

    - iomap dio fix for unnecessary buffered io fallback

    * tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: convert data_seqcount to seqcount_mutex_t
    btrfs: don't fallback to buffered read if we don't need to
    btrfs: add a helper to read the tree_root commit root for backref lookup
    btrfs: drop the path before adding qgroup items when enabling qgroups
    btrfs: fix readahead hang and use-after-free after removing a device
    btrfs: fix use-after-free on readahead extent after failure to create it
    btrfs: tree-checker: validate number of chunk stripes and parity
    btrfs: tree-checker: fix incorrect printk format
    btrfs: drop the path before adding block group sysfs files
    btrfs: fix relocation failure due to race with fallocate

    Linus Torvalds
     

27 Oct, 2020

2 commits

    By doing so we can associate the sequence counter with the chunk_mutex
    for lockdep purposes (compiled out otherwise), as that mutex is already
    held on the write side.
    Also avoid explicitly disabling preemption around the write region, as it
    will now be done automatically by the seqcount machinery based on the
    lock type.
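
    A sketch of the conversion (field placement and the protected member are
    simplified):

    /* The counter is now declared together with its associated lock: */
    seqcount_mutex_t data_seqcount;

    seqcount_mutex_init(&device->data_seqcount, &fs_info->chunk_mutex);

    /* Write side: lockdep checks that chunk_mutex is held, and the seqcount
     * machinery handles preemption according to the lock type: */
    mutex_lock(&fs_info->chunk_mutex);
    write_seqcount_begin(&device->data_seqcount);
    device->total_bytes = new_size;
    write_seqcount_end(&device->data_seqcount);
    mutex_unlock(&fs_info->chunk_mutex);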

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Davidlohr Bueso
     
  • Since we switched to the iomap infrastructure in b5ff9f1a96e8f ("btrfs:
    switch to iomap for direct IO") we're calling generic_file_buffered_read()
    directly and not via generic_file_read_iter() anymore.

    If the direct IO read could read everything, there is no need to bother
    calling generic_file_buffered_read(), just as generic_file_read_iter()
    already handles it.
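
    A sketch of the resulting read path (the direct IO helper is shown
    schematically):

    ret = btrfs_direct_read(iocb, to);
    /* Only fall back to buffered IO if the direct read did not consume the
     * whole iter and we are still below i_size: */
    if (ret < 0 || !iov_iter_count(to) ||
        iocb->ki_pos >= i_size_read(file_inode(iocb->ki_filp)))
            return ret;

    return generic_file_buffered_read(iocb, to, ret);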

    If we call generic_file_buffered_read() in this case we can hit a
    situation where we do an invalid readahead and cause this UBSAN splat
    in fstest generic/091:

    run fstests generic/091 at 2020-10-21 10:52:32
    ================================================================================
    UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
    shift exponent 64 is too large for 64-bit type 'long unsigned int'
    CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    __dump_stack lib/dump_stack.c:77
    dump_stack+0x57/0x70 lib/dump_stack.c:118
    ubsan_epilogue+0x5/0x40 lib/ubsan.c:148
    __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395
    __roundup_pow_of_two ./include/linux/log2.h:57
    get_init_ra_size mm/readahead.c:318
    ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530
    generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199
    call_read_iter ./include/linux/fs.h:1876
    new_sync_read+0x102/0x180 fs/read_write.c:415
    vfs_read+0x11c/0x1a0 fs/read_write.c:481
    ksys_read+0x4f/0xc0 fs/read_write.c:615
    do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118
    RIP: 0033:0x7fe87fee992e
    RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e
    RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003
    RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000
    R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000
    R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000
    ================================================================================
    BTRFS info (device nullb0): has skinny extents
    BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B
    BTRFS info (device nullb0): enabling ssd optimizations

    Fixes: f85781fb505e ("btrfs: switch to iomap for direct IO")
    Reviewed-by: Goldwyn Rodrigues
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    Johannes Thumshirn
     

26 Oct, 2020

6 commits

  • I got the following lockdep splat with tree locks converted to rwsem
    patches on btrfs/104:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0+ #102 Not tainted
    ------------------------------------------------------
    btrfs-cleaner/903 is trying to acquire lock:
    ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

    but task is already holding lock:
    ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
    down_read+0x40/0x130
    caching_thread+0x53/0x5a0
    btrfs_work_helper+0xfa/0x520
    process_one_work+0x238/0x540
    worker_thread+0x55/0x3c0
    kthread+0x13a/0x150
    ret_from_fork+0x1f/0x30

    -> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
    __mutex_lock+0x7e/0x7b0
    btrfs_cache_block_group+0x1e0/0x510
    find_free_extent+0xb6e/0x12f0
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb1/0x330
    alloc_tree_block_no_bg_flush+0x4f/0x60
    __btrfs_cow_block+0x11d/0x580
    btrfs_cow_block+0x10c/0x220
    commit_cowonly_roots+0x47/0x2e0
    btrfs_commit_transaction+0x595/0xbd0
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x36/0xa0
    cleanup_mnt+0x12d/0x190
    task_work_run+0x5c/0xa0
    exit_to_user_mode_prepare+0x1df/0x200
    syscall_exit_to_user_mode+0x54/0x280
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&space_info->groups_sem){++++}-{3:3}:
    down_read+0x40/0x130
    find_free_extent+0x2ed/0x12f0
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb1/0x330
    alloc_tree_block_no_bg_flush+0x4f/0x60
    __btrfs_cow_block+0x11d/0x580
    btrfs_cow_block+0x10c/0x220
    commit_cowonly_roots+0x47/0x2e0
    btrfs_commit_transaction+0x595/0xbd0
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x36/0xa0
    cleanup_mnt+0x12d/0x190
    task_work_run+0x5c/0xa0
    exit_to_user_mode_prepare+0x1df/0x200
    syscall_exit_to_user_mode+0x54/0x280
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (btrfs-root-00){++++}-{3:3}:
    __lock_acquire+0x1167/0x2150
    lock_acquire+0xb9/0x3d0
    down_read_nested+0x43/0x130
    __btrfs_tree_read_lock+0x32/0x170
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x614/0x9d0
    btrfs_find_root+0x35/0x1b0
    btrfs_read_tree_root+0x61/0x120
    btrfs_get_root_ref+0x14b/0x600
    find_parent_nodes+0x3e6/0x1b30
    btrfs_find_all_roots_safe+0xb4/0x130
    btrfs_find_all_roots+0x60/0x80
    btrfs_qgroup_trace_extent_post+0x27/0x40
    btrfs_add_delayed_data_ref+0x3fd/0x460
    btrfs_free_extent+0x42/0x100
    __btrfs_mod_ref+0x1d7/0x2f0
    walk_up_proc+0x11c/0x400
    walk_up_tree+0xf0/0x180
    btrfs_drop_snapshot+0x1c7/0x780
    btrfs_clean_one_deleted_snapshot+0xfb/0x110
    cleaner_kthread+0xd4/0x140
    kthread+0x13a/0x150
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Chain exists of:
    btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem

    Possible unsafe locking scenario:

           CPU0                                CPU1
           ----                                ----
      lock(&fs_info->commit_root_sem);
                                               lock(&caching_ctl->mutex);
                                               lock(&fs_info->commit_root_sem);
      lock(btrfs-root-00);

    *** DEADLOCK ***

    3 locks held by btrfs-cleaner/903:
    #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
    #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
    #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

    stack backtrace:
    CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
    dump_stack+0x8b/0xb0
    check_noncircular+0xcf/0xf0
    __lock_acquire+0x1167/0x2150
    ? __bfs+0x42/0x210
    lock_acquire+0xb9/0x3d0
    ? __btrfs_tree_read_lock+0x32/0x170
    down_read_nested+0x43/0x130
    ? __btrfs_tree_read_lock+0x32/0x170
    __btrfs_tree_read_lock+0x32/0x170
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x614/0x9d0
    ? find_held_lock+0x2b/0x80
    btrfs_find_root+0x35/0x1b0
    ? do_raw_spin_unlock+0x4b/0xa0
    btrfs_read_tree_root+0x61/0x120
    btrfs_get_root_ref+0x14b/0x600
    find_parent_nodes+0x3e6/0x1b30
    btrfs_find_all_roots_safe+0xb4/0x130
    btrfs_find_all_roots+0x60/0x80
    btrfs_qgroup_trace_extent_post+0x27/0x40
    btrfs_add_delayed_data_ref+0x3fd/0x460
    btrfs_free_extent+0x42/0x100
    __btrfs_mod_ref+0x1d7/0x2f0
    walk_up_proc+0x11c/0x400
    walk_up_tree+0xf0/0x180
    btrfs_drop_snapshot+0x1c7/0x780
    ? btrfs_clean_one_deleted_snapshot+0x73/0x110
    btrfs_clean_one_deleted_snapshot+0xfb/0x110
    cleaner_kthread+0xd4/0x140
    ? btrfs_alloc_root+0x50/0x50
    kthread+0x13a/0x150
    ? kthread_create_worker_on_cpu+0x40/0x40
    ret_from_fork+0x1f/0x30
    BTRFS info (device sdb): disk space caching is enabled
    BTRFS info (device sdb): has skinny extents

    This happens because the qgroup code does a backref lookup when we create
    a delayed ref. From there it may have to look up a root from an indirect
    ref, which does a normal lookup on the tree_root and therefore takes read
    locks on the tree_root nodes.

    To fix this we need to add a variant for looking up roots that searches
    the commit root of the tree_root. Then when we do the backref search
    using the commit root we are sure to not take any locks on the tree_root
    nodes. This gets rid of the lockdep splat when running btrfs/104.
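
    A sketch of what the commit-root lookup variant does with the search
    path (surrounding setup omitted):

    /* Search the tree_root's commit root and skip locking, so no read locks
     * are taken on the live tree_root nodes: */
    path->search_commit_root = 1;
    path->skip_locking = 1;
    ret = btrfs_search_slot(NULL, fs_info->tree_root, &key, path, 0, 0);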

    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • When enabling qgroups we walk the tree_root and then add a qgroup item
    for every root that we have. This creates a lock dependency on the
    tree_root and qgroup_root, which results in the following lockdep splat
    (with tree locks using rwsem), eg. in tests btrfs/017 or btrfs/022:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0-default+ #1299 Not tainted
    ------------------------------------------------------
    btrfs/24552 is trying to acquire lock:
    ffff9142dfc5f630 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

    but task is already holding lock:
    ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (btrfs-root-00){++++}-{3:3}:
    __lock_acquire+0x3fb/0x730
    lock_acquire.part.0+0x6a/0x130
    down_read_nested+0x46/0x130
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
    btrfs_search_slot+0xc3/0x9f0 [btrfs]
    btrfs_insert_item+0x6e/0x140 [btrfs]
    btrfs_create_tree+0x1cb/0x240 [btrfs]
    btrfs_quota_enable+0xcd/0x790 [btrfs]
    btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
    __x64_sys_ioctl+0x83/0xa0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (btrfs-quota-00){++++}-{3:3}:
    check_prev_add+0x91/0xc30
    validate_chain+0x491/0x750
    __lock_acquire+0x3fb/0x730
    lock_acquire.part.0+0x6a/0x130
    down_read_nested+0x46/0x130
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
    btrfs_search_slot+0xc3/0x9f0 [btrfs]
    btrfs_insert_empty_items+0x58/0xa0 [btrfs]
    add_qgroup_item.part.0+0x72/0x210 [btrfs]
    btrfs_quota_enable+0x3bb/0x790 [btrfs]
    btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
    __x64_sys_ioctl+0x83/0xa0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    other info that might help us debug this:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(btrfs-root-00);
                                   lock(btrfs-quota-00);
                                   lock(btrfs-root-00);
      lock(btrfs-quota-00);

    *** DEADLOCK ***

    5 locks held by btrfs/24552:
    #0: ffff9142df431478 (sb_writers#10){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0xa0
    #1: ffff9142f9b10cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl_quota_ctl+0x7b/0xe0 [btrfs]
    #2: ffff9142f9b11a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0x790 [btrfs]
    #3: ffff9142df431698 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x406/0x510 [btrfs]
    #4: ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

    stack backtrace:
    CPU: 1 PID: 24552 Comm: btrfs Not tainted 5.9.0-default+ #1299
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0x77/0x97
    check_noncircular+0xf3/0x110
    check_prev_add+0x91/0xc30
    validate_chain+0x491/0x750
    __lock_acquire+0x3fb/0x730
    lock_acquire.part.0+0x6a/0x130
    ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    ? lock_acquire+0xc4/0x140
    ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    down_read_nested+0x46/0x130
    ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
    ? btrfs_root_node+0xd9/0x200 [btrfs]
    __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
    btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
    btrfs_search_slot+0xc3/0x9f0 [btrfs]
    btrfs_insert_empty_items+0x58/0xa0 [btrfs]
    add_qgroup_item.part.0+0x72/0x210 [btrfs]
    btrfs_quota_enable+0x3bb/0x790 [btrfs]
    btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
    __x64_sys_ioctl+0x83/0xa0
    do_syscall_64+0x2d/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fix this by dropping the path whenever we find a root item, adding the
    qgroup item, then re-looking up the root item we found and continuing
    to process roots.
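
    A rough sketch of that "release the path, insert, then re-search"
    pattern is shown below. It is simplified compared to the real
    btrfs_quota_enable() loop, with its error handling trimmed, and the
    helper name is invented for illustration; the btrfs calls themselves
    (btrfs_release_path(), btrfs_search_slot_for_read(), btrfs_next_item(),
    add_qgroup_item()) are the existing ones:

    static int add_qgroup_items_for_roots(struct btrfs_trans_handle *trans,
                                          struct btrfs_root *tree_root,
                                          struct btrfs_root *quota_root,
                                          struct btrfs_path *path)
    {
            struct btrfs_key found_key;
            int ret;

            while (1) {
                    btrfs_item_key_to_cpu(path->nodes[0], &found_key,
                                          path->slots[0]);

                    if (found_key.type == BTRFS_ROOT_REF_KEY) {
                            /* Drop tree_root locks before touching the quota root. */
                            btrfs_release_path(path);

                            ret = add_qgroup_item(trans, quota_root,
                                                  found_key.offset);
                            if (ret)
                                    return ret;

                            /* Re-lookup the item we found and continue from it. */
                            ret = btrfs_search_slot_for_read(tree_root,
                                                             &found_key,
                                                             path, 1, 0);
                            if (ret < 0)
                                    return ret;
                    }

                    ret = btrfs_next_item(tree_root, path);
                    if (ret < 0)
                            return ret;
                    if (ret > 0)
                            break;
            }
            return 0;
    }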

    Reported-by: David Sterba
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Very sporadically I had test case btrfs/069 from fstests hanging (for
    years, it is not a recent regression), with the following traces in
    dmesg/syslog:

    [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
    [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
    [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
    [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
    [162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
    [162513.514751] Call Trace:
    [162513.514761] __schedule+0x5ce/0xd00
    [162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.514771] schedule+0x46/0xf0
    [162513.514844] wait_current_trans+0xde/0x140 [btrfs]
    [162513.514850] ? finish_wait+0x90/0x90
    [162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
    [162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
    [162513.514894] kthread+0x153/0x170
    [162513.514897] ? kthread_stop+0x2c0/0x2c0
    [162513.514902] ret_from_fork+0x22/0x30
    [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
    [162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
    [162513.515682] Call Trace:
    [162513.515688] __schedule+0x5ce/0xd00
    [162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.515697] schedule+0x46/0xf0
    [162513.515712] wait_current_trans+0xde/0x140 [btrfs]
    [162513.515716] ? finish_wait+0x90/0x90
    [162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
    [162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
    [162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
    [162513.515761] iterate_supers+0x87/0xf0
    [162513.515765] ksys_sync+0x60/0xb0
    [162513.515768] __do_sys_sync+0xa/0x10
    [162513.515771] do_syscall_64+0x33/0x80
    [162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.515781] RIP: 0033:0x7f5238f50bd7
    [162513.515782] Code: Bad RIP value.
    [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
    [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
    [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
    [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
    [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
    [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
    [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
    [162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
    [162513.516620] Call Trace:
    [162513.516625] __schedule+0x5ce/0xd00
    [162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.516634] schedule+0x46/0xf0
    [162513.516647] wait_current_trans+0xde/0x140 [btrfs]
    [162513.516650] ? finish_wait+0x90/0x90
    [162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
    [162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
    [162513.516686] __vfs_setxattr+0x66/0x80
    [162513.516691] __vfs_setxattr_noperm+0x70/0x200
    [162513.516697] vfs_setxattr+0x6b/0x120
    [162513.516703] setxattr+0x125/0x240
    [162513.516709] ? lock_acquire+0xb1/0x480
    [162513.516712] ? mnt_want_write+0x20/0x50
    [162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
    [162513.516723] ? preempt_count_add+0x49/0xa0
    [162513.516725] ? __sb_start_write+0x19b/0x290
    [162513.516727] ? preempt_count_add+0x49/0xa0
    [162513.516732] path_setxattr+0xba/0xd0
    [162513.516739] __x64_sys_setxattr+0x27/0x30
    [162513.516741] do_syscall_64+0x33/0x80
    [162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.516745] RIP: 0033:0x7f5238f56d5a
    [162513.516746] Code: Bad RIP value.
    [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
    [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
    [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
    [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
    [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
    [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
    [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
    [162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
    [162513.517780] Call Trace:
    [162513.517786] __schedule+0x5ce/0xd00
    [162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.517796] schedule+0x46/0xf0
    [162513.517810] wait_current_trans+0xde/0x140 [btrfs]
    [162513.517814] ? finish_wait+0x90/0x90
    [162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
    [162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
    [162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
    [162513.517865] iterate_supers+0x87/0xf0
    [162513.517869] ksys_sync+0x60/0xb0
    [162513.517872] __do_sys_sync+0xa/0x10
    [162513.517875] do_syscall_64+0x33/0x80
    [162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.517881] RIP: 0033:0x7f5238f50bd7
    [162513.517883] Code: Bad RIP value.
    [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
    [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
    [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
    [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
    [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
    [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
    [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
    [162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
    [162513.519160] Call Trace:
    [162513.519165] __schedule+0x5ce/0xd00
    [162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.519174] schedule+0x46/0xf0
    [162513.519190] wait_current_trans+0xde/0x140 [btrfs]
    [162513.519193] ? finish_wait+0x90/0x90
    [162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
    [162513.519222] btrfs_create+0x57/0x200 [btrfs]
    [162513.519230] lookup_open+0x522/0x650
    [162513.519246] path_openat+0x2b8/0xa50
    [162513.519270] do_filp_open+0x91/0x100
    [162513.519275] ? find_held_lock+0x32/0x90
    [162513.519280] ? lock_acquired+0x33b/0x470
    [162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
    [162513.519287] ? _raw_spin_unlock+0x29/0x40
    [162513.519295] do_sys_openat2+0x20d/0x2d0
    [162513.519300] do_sys_open+0x44/0x80
    [162513.519304] do_syscall_64+0x33/0x80
    [162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.519309] RIP: 0033:0x7f5238f4a903
    [162513.519310] Code: Bad RIP value.
    [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
    [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
    [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
    [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
    [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
    [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
    [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
    [162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
    [162513.520511] Call Trace:
    [162513.520516] __schedule+0x5ce/0xd00
    [162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.520525] schedule+0x46/0xf0
    [162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
    [162513.520548] ? finish_wait+0x90/0x90
    [162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
    [162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
    [162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
    [162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
    [162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
    [162513.520643] ? do_sigaction+0xf3/0x240
    [162513.520645] ? find_held_lock+0x32/0x90
    [162513.520648] ? do_sigaction+0xf3/0x240
    [162513.520651] ? lock_acquired+0x33b/0x470
    [162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
    [162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
    [162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
    [162513.520662] ? do_sigaction+0xf3/0x240
    [162513.520671] ? __x64_sys_ioctl+0x83/0xb0
    [162513.520672] __x64_sys_ioctl+0x83/0xb0
    [162513.520677] do_syscall_64+0x33/0x80
    [162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.520681] RIP: 0033:0x7fc3cd307d87
    [162513.520682] Code: Bad RIP value.
    [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
    [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
    [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
    [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
    [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
    [162513.520703]
    Showing all locks held in the system:
    [162513.520712] 1 lock held by khungtaskd/54:
    [162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
    [162513.520728] 1 lock held by in:imklog/596:
    [162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
    [162513.520782] 1 lock held by btrfs-transacti/1356167:
    [162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
    [162513.520798] 1 lock held by btrfs/1356190:
    [162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
    [162513.520805] 1 lock held by fsstress/1356184:
    [162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
    [162513.520811] 3 locks held by fsstress/1356185:
    [162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
    [162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
    [162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
    [162513.520833] 1 lock held by fsstress/1356196:
    [162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
    [162513.520838] 3 locks held by fsstress/1356197:
    [162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
    [162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
    [162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
    [162513.520858] 2 locks held by btrfs/1356211:
    [162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
    [162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

    This was weird because the stack traces show that a transaction commit,
    triggered by a device replace operation, is blocked trying to pause any
    running scrubs, yet there are no stack traces of blocked tasks doing a
    scrub.

    After poking around with drgn, I noticed there was a scrub task that was
    constantly running and blocking for short periods of time:

    >>> t = find_task(prog, 1356190)
    >>> prog.stack_trace(t)
    #0 __schedule+0x5ce/0xcfc
    #1 schedule+0x46/0xe4
    #2 schedule_timeout+0x1df/0x475
    #3 btrfs_reada_wait+0xda/0x132
    #4 scrub_stripe+0x2a8/0x112f
    #5 scrub_chunk+0xcd/0x134
    #6 scrub_enumerate_chunks+0x29e/0x5ee
    #7 btrfs_scrub_dev+0x2d5/0x91b
    #8 btrfs_ioctl+0x7f5/0x36e7
    #9 __x64_sys_ioctl+0x83/0xb0
    #10 do_syscall_64+0x33/0x77
    #11 entry_SYSCALL_64+0x7c/0x156

    Which corresponds to:

    int btrfs_reada_wait(void *handle)
    {
            struct reada_control *rc = handle;
            struct btrfs_fs_info *fs_info = rc->fs_info;

            while (atomic_read(&rc->elems)) {
                    if (!atomic_read(&fs_info->reada_works_cnt))
                            reada_start_machine(fs_info);
                    wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                                       (HZ + 9) / 10);
            }
    (...)

    So the counter "rc->elems" was set to 1 and never decreased to 0, causing
    the scrub task to loop forever in that function. Then I used the following
    script for drgn to check the readahead requests:

    $ cat dump_reada.py
    import sys
    import drgn
    from drgn import NULL, Object, cast, container_of, execscript, \
        reinterpret, sizeof
    from drgn.helpers.linux import *

    mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

    mnt = None
    for mnt in for_each_mount(prog, dst = mnt_path):
        pass

    if mnt is None:
        sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
        sys.exit(1)

    fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

    def dump_re(re):
        nzones = re.nzones.value_()
        print(f're at {hex(re.value_())}')
        print(f'\t logical {re.logical.value_()}')
        print(f'\t refcnt {re.refcnt.value_()}')
        print(f'\t nzones {nzones}')
        for i in range(nzones):
            dev = re.zones[i].device
            name = dev.name.str.string_()
            print(f'\t\t dev id {dev.devid.value_()} name {name}')
        print()

    for _, e in radix_tree_for_each(fs_info.reada_tree):
        re = cast('struct reada_extent *', e)
        dump_re(re)

    $ drgn dump_reada.py
    re at 0xffff8f3da9d25ad8
    logical 38928384
    refcnt 1
    nzones 1
    dev id 0 name b'/dev/sdd'
    $

    So there was one readahead extent with a single zone corresponding to the
    source device of that last device replace operation logged in dmesg/syslog.
    Also the ID of that zone's device was 0, which is a special value set in
    the source device of a device replace operation when the operation finishes
    (the constant BTRFS_DEV_REPLACE_DEVID, set at btrfs_dev_replace_finishing()),
    confirming again that device /dev/sdd was the source of a device replace
    operation.

    Normally there should be as many zones in the readahead extent as there are
    devices, and I wasn't expecting the extent to be in a block group with a
    'single' profile, so I went and confirmed with the following drgn script
    that there weren't any single profile block groups:

    $ cat dump_block_groups.py
    import sys
    import drgn
    from drgn import NULL, Object, cast, container_of, execscript, \
        reinterpret, sizeof
    from drgn.helpers.linux import *

    mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

    mnt = None
    for mnt in for_each_mount(prog, dst = mnt_path):
        pass

    if mnt is None:
        sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
        sys.exit(1)

    fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

    BTRFS_BLOCK_GROUP_DATA = (1 << 0)
    BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
    BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
    BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
    BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
    BTRFS_BLOCK_GROUP_DUP = (1 << 5)
    BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
    BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
    BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
    BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
    BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

    def bg_flags_string(bg):
        flags = bg.flags.value_()
        ret = ''
        if flags & BTRFS_BLOCK_GROUP_DATA:
            ret = 'data'
        if flags & BTRFS_BLOCK_GROUP_METADATA:
            if len(ret) > 0:
                ret += '|'
            ret += 'meta'
        if flags & BTRFS_BLOCK_GROUP_SYSTEM:
            if len(ret) > 0:
                ret += '|'
            ret += 'system'
        if flags & BTRFS_BLOCK_GROUP_RAID0:
            ret += ' raid0'
        elif flags & BTRFS_BLOCK_GROUP_RAID1:
            ret += ' raid1'
        elif flags & BTRFS_BLOCK_GROUP_DUP:
            ret += ' dup'
        elif flags & BTRFS_BLOCK_GROUP_RAID10:
            ret += ' raid10'
        elif flags & BTRFS_BLOCK_GROUP_RAID5:
            ret += ' raid5'
        elif flags & BTRFS_BLOCK_GROUP_RAID6:
            ret += ' raid6'
        elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
            ret += ' raid1c3'
        elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
            ret += ' raid1c4'
        else:
            ret += ' single'

        return ret

    def dump_bg(bg):
        print()
        print(f'block group at {hex(bg.value_())}')
        print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
        print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

    bg_root = fs_info.block_group_cache_tree.address_of_()
    for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
        dump_bg(bg)

    $ drgn dump_block_groups.py

    block group at 0xffff8f3d673b0400
    start 22020096 length 16777216
    flags 258 - system raid6

    block group at 0xffff8f3d53ddb400
    start 38797312 length 536870912
    flags 260 - meta raid6

    block group at 0xffff8f3d5f4d9c00
    start 575668224 length 2147483648
    flags 257 - data raid6

    block group at 0xffff8f3d08189000
    start 2723151872 length 67108864
    flags 258 - system raid6

    block group at 0xffff8f3db70ff000
    start 2790260736 length 1073741824
    flags 260 - meta raid6

    block group at 0xffff8f3d5f4dd800
    start 3864002560 length 67108864
    flags 258 - system raid6

    block group at 0xffff8f3d67037000
    start 3931111424 length 2147483648
    flags 257 - data raid6
    $

    So there were only two reasons left for having a readahead extent with a
    single zone: reada_find_zone(), called when creating a readahead extent,
    returned NULL either because we failed to find the corresponding block
    group or because a memory allocation failed. With some additional and
    custom tracing I figured out that on every further occurrence of the
    problem the block group had just been deleted when we were looping to
    create the zones for the readahead extent (at reada_find_extent()), so we
    ended up with only one zone in the readahead extent, corresponding to a
    device that ends up getting replaced.

    So after figuring that out it became obvious why the hang happens:

    1) Task A starts a scrub on any device of the filesystem, except for
    device /dev/sdd;

    2) Task B starts a device replace with /dev/sdd as the source device;

    3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
    starting to scrub a stripe from block group X. This call to
    btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
    calls reada_add_block(), it passes the logical address of the extent
    tree's root node as its 'logical' argument - a value of 38928384;

    4) Task A then enters reada_find_extent(), called from reada_add_block().
    It finds there isn't any existing readahead extent for the logical
    address 38928384, so it proceeds to the path of creating a new one.

    It calls btrfs_map_block() to find out which stripes exist for the block
    group X. On the first iteration of the for loop that iterates over the
    stripes, it finds the stripe for device /dev/sdd, so it creates one
    zone for that device and adds it to the readahead extent. Before getting
    into the second iteration of the loop, the cleanup kthread deletes block
    group X because it was empty. So in the iterations for the remaining
    stripes it does not add more zones to the readahead extent, because the
    calls to reada_find_zone() returned NULL because they couldn't find
    block group X anymore.

    As a result the new readahead extent has a single zone, corresponding to
    the device /dev/sdd;

    5) Before task A returns to btrfs_reada_add() and queues the readahead job
    for the readahead work queue, task B finishes the device replace and at
    btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
    device /dev/sdg;

    6) Task A returns to reada_add_block(), which increments the counter
    "->elems" of the reada_control structure allocated at btrfs_reada_add().

    Then it returns back to btrfs_reada_add() and calls
    reada_start_machine(). This queues a job in the readahead work queue to
    run the function reada_start_machine_worker(), which calls
    __reada_start_machine().

    At __reada_start_machine() we take the device list mutex and for each
    device found in the current device list, we call
    reada_start_machine_dev() to start the readahead work. However at this
    point the device /dev/sdd was already freed and is not in the device
    list anymore.

    This means the corresponding readahead for the extent at 38928384 is
    never started, and therefore the "->elems" counter of the reada_control
    structure allocated at btrfs_reada_add() never goes down to 0, causing
    the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

    Note that the readahead request can be made either after the device replace
    started or before it started; however, in practice it is very unlikely that
    a device replace is able to start after a readahead request is made and is
    able to complete before the readahead request completes - maybe only on a
    very small and nearly empty filesystem.

    This hang, however, is not the only problem we can have with readahead and
    device removals. When the readahead extent has zones other than the one
    corresponding to the device that is being removed (either by a device
    replace or a device remove operation), we risk having a use-after-free on
    the device when dropping the last reference of the readahead extent.

    For example if we create a readahead extent with two zones, one for the
    device /dev/sdd and one for the device /dev/sde:

    1) Before the readahead worker starts, the device /dev/sdd is removed,
    and the corresponding btrfs_device structure is freed. However the
    readahead extent still has the zone pointing to the device structure;

    2) When the readahead worker starts, it only finds device /dev/sde in the
    current device list of the filesystem;

    3) It starts the readahead work, at reada_start_machine_dev(), using the
    device /dev/sde;

    4) Then when it finishes reading the extent from device /dev/sde, it calls
    __readahead_hook() which ends up dropping the last reference on the
    readahead extent through the last call to reada_extent_put();

    5) At reada_extent_put() it iterates over each zone of the readahead extent
    and attempts to delete an element from the device's 'reada_extents'
    radix tree, resulting in a use-after-free, as the device pointer of the
    zone for /dev/sdd is now stale. We can also access the device after
    dropping the last reference of a zone, through reada_zone_release(),
    also called by reada_extent_put().

    A device remove suffers from the same problem; however, since it shrinks
    the device size down to zero before removing the device, it is very
    unlikely to still have readahead requests that have not completed by the
    time we free the device - the only possibility is if the device has very
    little space allocated.

    While the hang problem is exclusive to scrub, since it is currently the
    only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
    problem affects any path that triggers readahead, which includes
    btree_readahead_hook() and __readahead_hook() (a readahead worker can
    trigger readahead for the children of a node) for example - any path that
    ends up calling reada_add_block() can trigger the use-after-free after a
    device is removed.

    So fix this by waiting for any readahead requests on a device to complete
    before removing the device, while also ensuring that no new readahead
    requests can be made for it in the meantime.
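
    As a minimal sketch of the idea only: the BTRFS_DEV_STATE_NO_READA flag
    and the reada_drained wait queue below are hypothetical names used for
    illustration, and the real patch is built on the existing reada.c
    infrastructure (including the existing reada_in_flight counter of
    struct btrfs_device):

    static void wait_for_device_readahead(struct btrfs_device *device)
    {
            /*
             * Stop new readahead zones from being attached to the device
             * (hypothetical state bit, for illustration only).
             */
            set_bit(BTRFS_DEV_STATE_NO_READA, &device->dev_state);

            /*
             * Wait until all readahead already queued for the device has
             * finished (hypothetical wait queue, for illustration only).
             */
            wait_event(device->reada_drained,
                       atomic_read(&device->reada_in_flight) == 0);
    }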

    This problem has been around for a very long time - the readahead code was
    added in 2011, device remove has existed since 2008 and device replace was
    introduced in 2013, so it is hard to pick a specific commit for a git
    Fixes tag.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • If we fail to find suitable zones for a new readahead extent, we end up
    leaving a stale pointer in the global readahead extents radix tree
    (fs_info->reada_tree), which can trigger the following trace later on:

    [13367.696354] BUG: kernel NULL pointer dereference, address: 00000000000000b0
    [13367.696802] #PF: supervisor read access in kernel mode
    [13367.697249] #PF: error_code(0x0000) - not-present page
    [13367.697721] PGD 0 P4D 0
    [13367.698171] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    [13367.698632] CPU: 6 PID: 851214 Comm: btrfs Tainted: G W 5.9.0-rc6-btrfs-next-69 #1
    [13367.699100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    [13367.700069] RIP: 0010:__lock_acquire+0x20a/0x3970
    [13367.700562] Code: ff 1f 0f b7 c0 48 0f (...)
    [13367.701609] RSP: 0018:ffffb14448f57790 EFLAGS: 00010046
    [13367.702140] RAX: 0000000000000000 RBX: 29b935140c15e8cf RCX: 0000000000000000
    [13367.702698] RDX: 0000000000000002 RSI: ffffffffb3d66bd0 RDI: 0000000000000046
    [13367.703240] RBP: ffff8a52ba8ac040 R08: 00000c2866ad9288 R09: 0000000000000001
    [13367.703783] R10: 0000000000000001 R11: 00000000b66d9b53 R12: ffff8a52ba8ac9b0
    [13367.704330] R13: 0000000000000000 R14: ffff8a532b6333e8 R15: 0000000000000000
    [13367.704880] FS: 00007fe1df6b5700(0000) GS:ffff8a5376600000(0000) knlGS:0000000000000000
    [13367.705438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [13367.705995] CR2: 00000000000000b0 CR3: 000000022cca8004 CR4: 00000000003706e0
    [13367.706565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [13367.707127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [13367.707686] Call Trace:
    [13367.708246] ? ___slab_alloc+0x395/0x740
    [13367.708820] ? reada_add_block+0xae/0xee0 [btrfs]
    [13367.709383] lock_acquire+0xb1/0x480
    [13367.709955] ? reada_add_block+0xe0/0xee0 [btrfs]
    [13367.710537] ? reada_add_block+0xae/0xee0 [btrfs]
    [13367.711097] ? rcu_read_lock_sched_held+0x5d/0x90
    [13367.711659] ? kmem_cache_alloc_trace+0x8d2/0x990
    [13367.712221] ? lock_acquired+0x33b/0x470
    [13367.712784] _raw_spin_lock+0x34/0x80
    [13367.713356] ? reada_add_block+0xe0/0xee0 [btrfs]
    [13367.713966] reada_add_block+0xe0/0xee0 [btrfs]
    [13367.714529] ? btrfs_root_node+0x15/0x1f0 [btrfs]
    [13367.715077] btrfs_reada_add+0x117/0x170 [btrfs]
    [13367.715620] scrub_stripe+0x21e/0x10d0 [btrfs]
    [13367.716141] ? kvm_sched_clock_read+0x5/0x10
    [13367.716657] ? __lock_acquire+0x41e/0x3970
    [13367.717184] ? scrub_chunk+0x60/0x140 [btrfs]
    [13367.717697] ? find_held_lock+0x32/0x90
    [13367.718254] ? scrub_chunk+0x60/0x140 [btrfs]
    [13367.718773] ? lock_acquired+0x33b/0x470
    [13367.719278] ? scrub_chunk+0xcd/0x140 [btrfs]
    [13367.719786] scrub_chunk+0xcd/0x140 [btrfs]
    [13367.720291] scrub_enumerate_chunks+0x270/0x5c0 [btrfs]
    [13367.720787] ? finish_wait+0x90/0x90
    [13367.721281] btrfs_scrub_dev+0x1ee/0x620 [btrfs]
    [13367.721762] ? rcu_read_lock_any_held+0x8e/0xb0
    [13367.722235] ? preempt_count_add+0x49/0xa0
    [13367.722710] ? __sb_start_write+0x19b/0x290
    [13367.723192] btrfs_ioctl+0x7f5/0x36f0 [btrfs]
    [13367.723660] ? __fget_files+0x101/0x1d0
    [13367.724118] ? find_held_lock+0x32/0x90
    [13367.724559] ? __fget_files+0x101/0x1d0
    [13367.724982] ? __x64_sys_ioctl+0x83/0xb0
    [13367.725399] __x64_sys_ioctl+0x83/0xb0
    [13367.725802] do_syscall_64+0x33/0x80
    [13367.726188] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [13367.726574] RIP: 0033:0x7fe1df7add87
    [13367.726948] Code: 00 00 00 48 8b 05 09 91 (...)
    [13367.727763] RSP: 002b:00007fe1df6b4d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [13367.728179] RAX: ffffffffffffffda RBX: 000055ce1fb596a0 RCX: 00007fe1df7add87
    [13367.728604] RDX: 000055ce1fb596a0 RSI: 00000000c400941b RDI: 0000000000000003
    [13367.729021] RBP: 0000000000000000 R08: 00007fe1df6b5700 R09: 0000000000000000
    [13367.729431] R10: 00007fe1df6b5700 R11: 0000000000000246 R12: 00007ffd922b07de
    [13367.729842] R13: 00007ffd922b07df R14: 00007fe1df6b4e40 R15: 0000000000802000
    [13367.730275] Modules linked in: btrfs blake2b_generic xor (...)
    [13367.732638] CR2: 00000000000000b0
    [13367.733166] ---[ end trace d298b6805556acd9 ]---

    What happens is the following:

    1) At reada_find_extent() we don't find any existing readahead extent for
    the metadata extent starting at logical address X;

    2) So we proceed to create a new one. We then call btrfs_map_block() to get
    information about which stripes contain extent X;

    3) After that we iterate over the stripes and create only one zone for the
    readahead extent - only one because reada_find_zone() returned NULL for
    all iterations except for one, either because a memory allocation failed
    or it couldn't find the block group of the extent (it may have just been
    deleted);

    4) We then add the new readahead extent to the readahead extents radix
    tree at fs_info->reada_tree;

    5) Then we iterate over each zone of the new readahead extent, and find
    that the device used for that zone no longer exists, because it was
    removed or it was the source device of a device replace operation.
    Since this left 'have_zone' set to 0, after finishing the loop we jump
    to the 'error' label, call kfree() on the new readahead extent and
    return without removing it from the radix tree at fs_info->reada_tree;

    6) Any future call to reada_find_extent() for the logical address X will
    find the stale pointer in the readahead extents radix tree and increment
    its reference counter, which can trigger the use-after-free right away,
    or return it to the caller reada_add_block(), resulting in the
    use-after-free of the example trace above.

    So fix this by making sure we delete the readahead extent from the radix
    tree if we fail to set up zones for it (when 'have_zone = 0').
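
    A minimal sketch of that error path in reada_find_extent() is shown
    below, assuming the existing 'have_zone' flag, the fs_info->reada_lock
    spinlock and the radix tree index used when the extent was inserted; the
    exact upstream hunk may differ slightly:

    if (!have_zone) {
            /*
             * We failed to set up any usable zone, so remove the entry we
             * inserted earlier before freeing the readahead extent,
             * otherwise a stale pointer is left in fs_info->reada_tree.
             */
            spin_lock(&fs_info->reada_lock);
            radix_tree_delete(&fs_info->reada_tree, index);
            spin_unlock(&fs_info->reada_lock);
            goto error;
    }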

    Fixes: 319450211842ba ("btrfs: reada: bypass adding extent when all zone failed")
    CC: stable@vger.kernel.org # 4.9+
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • If there's no parity and num_stripes < ncopies, a crafted image can
    trigger a division by zero in calc_stripe_length().

    The image was generated through fuzzing.
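
    As a hedged sketch, the kind of chunk validation that avoids the division
    looks like the following; chunk_err(), nstripes, ncopies, nparity, leaf,
    chunk and logical follow the tree-checker context, but the exact check
    added by the patch may be placed or worded differently:

    /*
     * With no parity stripes, the data stripe count used as a divisor is
     * num_stripes / ncopies, which is zero when num_stripes < ncopies.
     * Reject such crafted chunks up front.
     */
    if (!nparity && nstripes < ncopies) {
            chunk_err(leaf, chunk, logical,
                      "invalid chunk: num_stripes %u lower than ncopies %u",
                      nstripes, ncopies);
            return -EUCLEAN;
    }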

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Qu Wenruo
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209587
    Signed-off-by: Daniel Xu
    Signed-off-by: David Sterba

    Daniel Xu
     
  • This patch addresses a compile warning:

    fs/btrfs/extent-tree.c: In function '__btrfs_free_extent':
    fs/btrfs/extent-tree.c:3187:4: warning: format '%lu' expects argument of type 'long unsigned int', but argument 8 has type 'unsigned int' [-Wformat=]
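
    Illustrative only (not the exact hunk): the warning disappears once the
    conversion specifier matches the argument type, for example by printing
    an unsigned int value such as btrfs_header_nritems() with "%u" instead of
    "%lu", or by casting it to the type the specifier expects:

    /* Either match the specifier to the type of the argument ... */
    btrfs_err(info, "nritems %u", btrfs_header_nritems(path->nodes[0]));

    /* ... or cast the argument to the type the specifier expects. */
    btrfs_err(info, "nritems %lu",
              (unsigned long)btrfs_header_nritems(path->nodes[0]));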

    Fixes: 1c2a07f598d5 ("btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()")
    Reviewed-by: Filipe Manana
    Signed-off-by: Pujin Shi
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Pujin Shi