25 Jan, 2021

1 commit


20 Jan, 2021

6 commits

  • [ Upstream commit cb13eea3b49055bd78e6ddf39defd6340f7379fc ]

    If we remount a filesystem in RO mode while the qgroup rescan worker is
    running, we can end up having it still running after the remount is done,
    and at unmount time we may end up with an open transaction that ends up
    never getting committed. If that happens we end up with several memory
    leaks and can crash when hardware acceleration is unavailable for crc32c.
    Possibly it can lead to other nasty surprises too, due to use-after-free
    issues.

    The following steps explain how the problem happens.

    1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
    running;

    2) We remount the filesystem in RO mode, and never stop/pause the rescan
    worker, so after the remount the rescan worker is still running. The
    important detail here is that the rescan task is still running after
    the remount operation committed any ongoing transaction through its
    call to btrfs_commit_super();

    3) After the remount completes, the rescan worker, which is still
    running, finishes iterating all leaves of the extent tree and then
    starts a transaction to update the qgroup status item in the quotas
    tree. It does not commit the transaction, it only releases its handle
    on the transaction;

    4) A filesystem unmount operation starts shortly after;

    5) The unmount task, at close_ctree(), stops the transaction kthread,
    which had not had a chance to commit the open transaction since it was
    sleeping and the commit interval (default of 30 seconds) has not yet
    elapsed since the last time it committed a transaction;

    6) So after stopping the transaction kthread we still have the transaction
    used to update the qgroup status item open. At close_ctree(), when the
    filesystem is in RO mode and no transaction abort happened (and the
    filesystem is not in error mode), we do not expect to have any
    transaction open, so we do not call btrfs_commit_super();

    7) We then proceed to destroy the work queues, free the roots and block
    groups, etc. After that we drop the last reference on the btree inode
    by calling iput() on it. Since there are dirty pages for the btree
    inode, corresponding to the COWed extent buffer for the quotas btree,
    btree_write_cache_pages() is invoked to flush those dirty pages. This
    results in creating a bio and submitting it, which makes us end up at
    btrfs_submit_metadata_bio();

    8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
    that calls btrfs_wq_submit_bio(), because check_async_write() returned
    a value of 1. It returned 1 because we did not have hardware
    acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
    set in fs_info->flags;

    9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
    workqueue at fs_info->workers, which was already freed before by the
    call to btrfs_stop_all_workers() at close_ctree(). This results in an
    invalid memory access due to a use-after-free, leading to a crash.
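
The nine steps above can be condensed into a small userspace model. This is only an illustrative sketch, not kernel code: `ToyFs`, `UseAfterFree` and every method name are hypothetical, and the boolean fields merely stand in for the state the commit message describes (an open transaction, dirty btree pages, the workers at fs_info->workers, and the BTRFS_FS_CSUM_IMPL_FAST flag).

```python
class UseAfterFree(Exception):
    """Raised by the model when already-freed state is touched."""


class ToyFs:
    """Toy stand-in for the filesystem state discussed in steps 1-9."""

    def __init__(self, read_only, fast_csum):
        self.read_only = read_only
        self.fast_csum = fast_csum       # models BTRFS_FS_CSUM_IMPL_FAST
        self.open_transaction = False
        self.dirty_btree_pages = 0
        self.workers_alive = True        # models fs_info->workers

    def rescan_worker_finish(self):
        # Step 3: update the qgroup status item, release the handle,
        # but never commit the transaction.
        self.open_transaction = True
        self.dirty_btree_pages += 1

    def commit(self):
        # A commit writes the dirty pages while the workers still exist.
        self.dirty_btree_pages = 0
        self.open_transaction = False

    def submit_metadata_bio(self):
        # Steps 8-9: without fast checksums the write goes to a workqueue.
        if not self.fast_csum and not self.workers_alive:
            raise UseAfterFree("work queued on a freed workqueue")
        self.dirty_btree_pages = 0

    def unmount(self):
        # Step 6: only a RW filesystem commits at this point.
        if not self.read_only:
            self.commit()
        # Step 7: workers are destroyed before the btree inode is released.
        self.workers_alive = False
        leaked = self.open_transaction
        if self.dirty_btree_pages:
            self.submit_metadata_bio()
        return leaked  # True means a transaction was never committed


if __name__ == "__main__":
    fs = ToyFs(read_only=True, fast_csum=True)
    fs.rescan_worker_finish()
    print("leaked transaction:", fs.unmount())
```

In this model, a RO filesystem with fast checksums only leaks the transaction, while the same sequence with `fast_csum=False` raises `UseAfterFree`, mirroring the two outcomes (memory leaks versus crash) described above.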

    When this happens, several warnings are triggered before the crash,
    since we still have metadata space reserved in a block group, in the
    delayed refs reservation, etc:

    ------------[ cut here ]------------
    WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
    Code: f0 01 00 00 48 39 c2 75 (...)
    RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
    RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
    RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
    RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
    R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 01 48 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c6 ]---
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
    Code: 48 83 bb b0 03 00 00 00 (...)
    RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
    RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
    RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
    R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 01 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c7 ]---
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
    Code: ad de 49 be 22 01 00 (...)
    RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
    RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
    RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
    R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c8 ]---
    BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
    BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
    BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
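
The figures in the space_info dump above are self-consistent, which can be checked with a few lines of arithmetic. The sketch below assumes that the "free" value is the total minus every accounted counter. That formula is an assumption inferred from the numbers in this log, and the helper name `free_bytes` is hypothetical.

```python
# Values copied from the "space_info total=..." line in the log.
space_info = {
    "total": 268435456,
    "used": 114688,
    "pinned": 0,
    "reserved": 16384,
    "may_use": 0,
    "readonly": 65536,
}


def free_bytes(info):
    """Total minus all accounted counters (assumed formula)."""
    accounted = sum(
        info[key]
        for key in ("used", "pinned", "reserved", "may_use", "readonly")
    )
    return info["total"] - accounted


free = free_bytes(space_info)
print(free)
```

The result matches the "has 268238848 free" line. The 16384 bytes still counted as reserved at unmount time are the kind of leftover reservation that the warnings complain about. They would correspond to a single 16 KiB metadata block, though the log itself does not state the node size.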

    And the crash, which only happens when we do not have crc32c hardware
    acceleration, produces the following trace immediately after those
    warnings:

    stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
    Code: 54 55 53 48 89 f3 (...)
    RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
    RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
    RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
    R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
    FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
    btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
    submit_one_bio+0x61/0x70 [btrfs]
    btree_write_cache_pages+0x414/0x450 [btrfs]
    ? kobject_put+0x9a/0x1d0
    ? trace_hardirqs_on+0x1b/0xf0
    ? _raw_spin_unlock_irqrestore+0x3c/0x60
    ? free_debug_processing+0x1e1/0x2b0
    do_writepages+0x43/0xe0
    ? lock_acquired+0x199/0x490
    __writeback_single_inode+0x59/0x650
    writeback_single_inode+0xaf/0x120
    write_inode_now+0x94/0xd0
    iput+0x187/0x2b0
    close_ctree+0x2c6/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f3cfebabee7
    Code: ff 0b 00 f7 d8 64 89 01 (...)
    RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
    RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
    R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    ---[ end trace dd74718fef1ed5cc ]---

    Finally, when we remove the btrfs module (rmmod btrfs), there are several
    warnings about objects that were allocated from our slabs but were never
    freed, a consequence of the transaction that was never committed and got
    leaked:

    =============================================================================
    BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? lock_release+0x20e/0x4c0
    kmem_cache_destroy+0x55/0x120
    btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x0000000050cbdd61 @offset=12104
    INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
    btrfs_free_tree_block+0x128/0x360 [btrfs]
    __btrfs_cow_block+0x489/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    INFO: Object 0x0000000086e9b0ff @offset=12776
    INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
    btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
    commit_cowonly_roots+0x248/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    =============================================================================
    BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
    CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? lock_release+0x20e/0x4c0
    kmem_cache_destroy+0x55/0x120
    btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x000000001a340018 @offset=4408
    INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
    btrfs_free_tree_block+0x128/0x360 [btrfs]
    __btrfs_cow_block+0x489/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    btrfs_commit_transaction+0x60/0xc40 [btrfs]
    create_subvol+0x56a/0x990 [btrfs]
    btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
    __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
    btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
    btrfs_ioctl+0x1a92/0x36f0 [btrfs]
    __x64_sys_ioctl+0x83/0xb0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    INFO: Object 0x000000002b46292a @offset=13648
    INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
    btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    =============================================================================
    BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? __mutex_unlock_slowpath+0x45/0x2a0
    kmem_cache_destroy+0x55/0x120
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x000000004cf95ea8 @offset=6264
    INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
    CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

    Fix this issue by having the remount path stop the qgroup rescan worker
    when we are remounting RO and teach the rescan worker to stop when a
    remount is in progress. If later a remount in RW mode happens, we are
    already resuming the qgroup rescan worker through the call to
    btrfs_qgroup_rescan_resume(), so we do not need to worry about that.
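
The shape of the fix described above can be sketched in userspace with threads. This is a minimal illustration under stated assumptions, not the kernel implementation: `ToyFsInfo`, `rescan_should_stop`, `start_rescan` and `remount_ro` are hypothetical names, and the `remounting` event stands in for whatever "remount in progress" state the kernel checks.

```python
import threading


class ToyFsInfo:
    def __init__(self):
        self.remounting = threading.Event()  # models "remount in progress"
        self.lock = threading.Lock()
        self.open_transaction = False
        self.leaves_scanned = 0              # progress survives a stop


def rescan_should_stop(fs):
    return fs.remounting.is_set()


def rescan_worker(fs, total_leaves):
    while fs.leaves_scanned < total_leaves:
        if rescan_should_stop(fs):
            return  # stop early, a later resume continues from here
        fs.leaves_scanned += 1
    # Rescan finished: update the status item, leaving a transaction open.
    with fs.lock:
        fs.open_transaction = True


def start_rescan(fs, total_leaves):
    worker = threading.Thread(target=rescan_worker, args=(fs, total_leaves))
    worker.start()
    return worker


def remount_ro(fs, worker):
    fs.remounting.set()   # tell the worker to stop
    worker.join()         # wait until it has really stopped
    with fs.lock:
        fs.open_transaction = False  # models the commit done by the remount
    fs.remounting.clear()


if __name__ == "__main__":
    fs = ToyFsInfo()
    worker = start_rescan(fs, 100_000)
    remount_ro(fs, worker)
    print("worker alive:", worker.is_alive())
    print("open transaction:", fs.open_transaction)
```

Because the remount path waits for the worker before committing, no transaction can be opened by the rescan after the commit, regardless of whether the worker was stopped midway or had already finished.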

    Tested-by: Fabian Vogt
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit 8fc058597a283e9a37720abb0e8d68e342b9387d ]

    btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
    a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
    ktime.
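
The pattern behind this change can be illustrated outside the kernel. The sketch below is a hypothetical Python model (`ToyDiscardLock` and its methods are invented for illustration): a `_locked` helper does the scheduling for callers that already hold the lock, so the work function neither re-takes the lock nor reads the clock a second time.

```python
import threading
import time


class ToyDiscardLock:
    def __init__(self):
        self.lock = threading.Lock()
        self.acquisitions = 0     # counts how often the lock was taken
        self.next_run_ns = None

    def _acquire(self):
        self.lock.acquire()
        self.acquisitions += 1

    def _schedule_locked(self, now_ns, delay_ns):
        # Caller must already hold self.lock.
        self.next_run_ns = now_ns + delay_ns

    def schedule_work(self, delay_ns):
        # Entry point for callers that do not hold the lock.
        self._acquire()
        try:
            self._schedule_locked(time.monotonic_ns(), delay_ns)
        finally:
            self.lock.release()

    def workfn_before(self, delay_ns):
        # Old shape: drop the lock, then take it again to schedule.
        self._acquire()
        try:
            time.monotonic_ns()   # clock read for the worker's own bookkeeping
        finally:
            self.lock.release()
        self.schedule_work(delay_ns)  # second lock round trip, second clock read

    def workfn_after(self, delay_ns):
        # New shape: schedule while still holding the lock, reuse the time.
        self._acquire()
        try:
            now = time.monotonic_ns()
            self._schedule_locked(now, delay_ns)
        finally:
            self.lock.release()
```

Calling `workfn_before()` takes the lock twice per invocation, while `workfn_after()` takes it once and reads the clock once.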

    Reviewed-by: Josef Bacik
    Signed-off-by: Pavel Begunkov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit ea9ed87c73e87e044b2c58d658eb4ba5216bc488 ]

    It might happen that bg->discard_eligible_time was changed without
    rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
    time, peek_discard_list() returns NULL, and all work halts and goes to
    sleep without further rescheduling even though there are block groups
    to discard.

    This happens fairly often, but it is not very visible from userspace
    because after some time the work is usually kicked off anyway by someone
    else calling btrfs_discard_reschedule_work().

    Fix it by continuing to reschedule if the block group discard lists are
    not empty.
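
The stall and its fix can be reproduced with a toy scheduler. This is a simplified sketch with hypothetical names (`ToyBlockGroup`, `ToyDiscardQueue`, `toy_workfn`), and plain integers stand in for timestamps. The `reschedule_when_pending` flag switches between the old and the fixed behaviour.

```python
class ToyBlockGroup:
    def __init__(self, name, eligible_time):
        self.name = name
        self.discard_eligible_time = eligible_time


class ToyDiscardQueue:
    def __init__(self):
        self.discard_list = []
        self.scheduled_at = None  # None: nothing will wake the worker up


def toy_peek(queue, now):
    """Return the earliest block group if it is already eligible."""
    if not queue.discard_list:
        return None
    bg = min(queue.discard_list, key=lambda b: b.discard_eligible_time)
    return bg if bg.discard_eligible_time <= now else None


def next_wakeup(queue):
    if not queue.discard_list:
        return None
    return min(b.discard_eligible_time for b in queue.discard_list)


def toy_workfn(queue, now, reschedule_when_pending):
    queue.scheduled_at = None
    bg = toy_peek(queue, now)
    if bg is None:
        # Woke up too early. The old behaviour just goes back to sleep.
        if reschedule_when_pending:
            queue.scheduled_at = next_wakeup(queue)
        return None
    queue.discard_list.remove(bg)
    queue.scheduled_at = next_wakeup(queue)
    return bg


if __name__ == "__main__":
    queue = ToyDiscardQueue()
    # Work was scheduled for t=50, then the eligible time moved to t=100.
    queue.discard_list.append(ToyBlockGroup("bg0", eligible_time=100))
    toy_workfn(queue, now=50, reschedule_when_pending=False)
    print("old behaviour, next wakeup:", queue.scheduled_at)
    toy_workfn(queue, now=50, reschedule_when_pending=True)
    print("fixed behaviour, next wakeup:", queue.scheduled_at)
```

With the flag off, the queue still holds a block group but has no wakeup pending, which is the stall described above. With the flag on, the worker is rescheduled for the block group's eligible time.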

    Reviewed-by: Josef Bacik
    Signed-off-by: Pavel Begunkov
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit 347fb0cfc9bab5195c6701e62eda488310d7938f ]

    While mounting a crafted image provided by a user, the kernel panics due
    to an invalid chunk item whose end is less than its start.
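
An "end less than start" range is what unsigned 64-bit wraparound produces when a length is large enough. The sketch below works backwards from the two numbers in the "insert state: end < start 29360127 37748736" log line, under the assumption that the end was computed as start + length - 1 in u64 arithmetic. The helper names are hypothetical and the implied length is inferred, not taken from the image.

```python
U64_MASK = (1 << 64) - 1


def range_end(start, length):
    """Inclusive end of [start, start + length) with u64 wraparound."""
    return (start + length - 1) & U64_MASK


def end_overflows(start, length):
    """True when start + length does not fit in an unsigned 64-bit value."""
    return start + length > U64_MASK


# Values from the log line; the same numbers appear as hex in the register
# dump (RBX = 0x1bfffff, R15 = 0x2400000).
end = 29360127
start = 37748736

# Length that would yield this end under the stated assumption.
implied_length = (end - start + 1) & U64_MASK

print(hex(implied_length))
print(range_end(start, implied_length) == end)
print(end_overflows(start, implied_length))
```

A check along the lines of `end_overflows()` rejects such an item before the wrapped end value is ever used, while an ordinary length such as 8 MiB passes it.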

    [66.387422] loop: module loaded
    [66.389773] loop0: detected capacity change from 262144 to 0
    [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
    [66.431061] BTRFS info (device loop0): disk space caching is enabled
    [66.431078] BTRFS info (device loop0): has skinny extents
    [66.437101] BTRFS error: insert state: end < start 29360127 37748736
    [66.437136] ------------[ cut here ]------------
    [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
    [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G O 5.11.0-rc1-custom #45
    [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
    [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
    [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
    [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
    [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
    [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
    [66.437447] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [66.437451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [66.437460] PKRU: 55555554
    [66.437464] Call Trace:
    [66.437475] set_extent_bit+0x652/0x740 [btrfs]
    [66.437539] set_extent_bits_nowait+0x1d/0x20 [btrfs]
    [66.437576] add_extent_mapping+0x1e0/0x2f0 [btrfs]
    [66.437621] read_one_chunk+0x33c/0x420 [btrfs]
    [66.437674] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
    [66.437708] ? kvm_sched_clock_read+0x18/0x40
    [66.437739] open_ctree+0xb32/0x1734 [btrfs]
    [66.437781] ? bdi_register_va+0x1b/0x20
    [66.437788] ? super_setup_bdi_name+0x79/0xd0
    [66.437810] btrfs_mount_root.cold+0x12/0xeb [btrfs]
    [66.437854] ? __kmalloc_track_caller+0x217/0x3b0
    [66.437873] legacy_get_tree+0x34/0x60
    [66.437880] vfs_get_tree+0x2d/0xc0
    [66.437888] vfs_kern_mount.part.0+0x78/0xc0
    [66.437897] vfs_kern_mount+0x13/0x20
    [66.437902] btrfs_mount+0x11f/0x3c0 [btrfs]
    [66.437940] ? kfree+0x5ff/0x670
    [66.437944] ? __kmalloc_track_caller+0x217/0x3b0
    [66.437962] legacy_get_tree+0x34/0x60
    [66.437974] vfs_get_tree+0x2d/0xc0
    [66.437983] path_mount+0x48c/0xd30
    [66.437998] __x64_sys_mount+0x108/0x140
    [66.438011] do_syscall_64+0x38/0x50
    [66.438018] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [66.438023] RIP: 0033:0x7f0138827f6e
    [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
    [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
    [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
    [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
    [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
    [66.438078] irq event stamp: 18169
    [66.438082] hardirqs last enabled at (18175): [] console_unlock+0x4ff/0x5f0
    [66.438088] hardirqs last disabled at (18180): [] console_unlock+0x467/0x5f0
    [66.438092] softirqs last enabled at (16910): [] asm_call_irq_on_stack+0x12/0x20
    [66.438097] softirqs last disabled at (16905): [] asm_call_irq_on_stack+0x12/0x20
    [66.438103] ---[ end trace e114b111db64298b ]---
    [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
    [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
    [66.441069] ------------[ cut here ]------------
    [66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
    [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G W O 5.11.0-rc1-custom #45
    [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
    [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
    [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
    [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
    [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
    [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
    [66.458356] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [66.459841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [66.462196] PKRU: 55555554
    [66.462692] Call Trace:
    [66.463139] set_extent_bit.cold+0x30/0x98 [btrfs]
    [66.464049] set_extent_bits_nowait+0x1d/0x20 [btrfs]
    [66.490466] add_extent_mapping+0x1e0/0x2f0 [btrfs]
    [66.514097] read_one_chunk+0x33c/0x420 [btrfs]
    [66.534976] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
    [66.555718] ? kvm_sched_clock_read+0x18/0x40
    [66.575758] open_ctree+0xb32/0x1734 [btrfs]
    [66.595272] ? bdi_register_va+0x1b/0x20
    [66.614638] ? super_setup_bdi_name+0x79/0xd0
    [66.633809] btrfs_mount_root.cold+0x12/0xeb [btrfs]
    [66.652938] ? __kmalloc_track_caller+0x217/0x3b0
    [66.671925] legacy_get_tree+0x34/0x60
    [66.690300] vfs_get_tree+0x2d/0xc0
    [66.708221] vfs_kern_mount.part.0+0x78/0xc0
    [66.725808] vfs_kern_mount+0x13/0x20
    [66.742730] btrfs_mount+0x11f/0x3c0 [btrfs]
    [66.759350] ? kfree+0x5ff/0x670
    [66.775441] ? __kmalloc_track_caller+0x217/0x3b0
    [66.791750] legacy_get_tree+0x34/0x60
    [66.807494] vfs_get_tree+0x2d/0xc0
    [66.823349] path_mount+0x48c/0xd30
    [66.838753] __x64_sys_mount+0x108/0x140
    [66.854412] do_syscall_64+0x38/0x50
    [66.869673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [66.885093] RIP: 0033:0x7f0138827f6e
    [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
    [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
    [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
    [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
    [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
    [67.216138] ---[ end trace e114b111db64298c ]---
    [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
    [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
    [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
    [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
    [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
    [67.465436] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [67.511660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [67.558449] PKRU: 55555554
    [67.581146] note: mount[613] exited with preempt_count 2

    The image has a chunk item with logical start 37748736 and length
    18446744073701163008 (-8M). The calculated end, 29360127, is the result
    of a u64 overflow. insert_state() returned EEXIST because of the
    duplicate end, and extent_io_tree_panic() was called.

    Add an overflow check of the chunk item end to the tree checker so the
    problem can be detected early at mount time.
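
    The arithmetic behind the overflow can be reproduced with the values
    from this report. This is a small Python sketch of the 64-bit
    wraparound only; chunk_end() and chunk_end_overflows() are hypothetical
    helper names, not the tree checker's code.

```python
U64_MAX = (1 << 64) - 1


def chunk_end(logical: int, length: int) -> int:
    """Inclusive chunk end as computed with wrapping u64 arithmetic."""
    return (logical + length - 1) & U64_MAX


def chunk_end_overflows(logical: int, length: int) -> bool:
    """True if logical + length does not fit in an unsigned 64-bit value."""
    return logical + length > U64_MAX


LOGICAL = 37748736               # logical start from the crafted image
LENGTH = 18446744073701163008    # chunk length from the crafted image

# Interpreted as a signed 64-bit value, the length is -8 MiB.
assert LENGTH - (1 << 64) == -8 * 1024 * 1024

# The wrapped end is smaller than the start, which is what trips
# insert_state() in the log above ("end < start 29360127 37748736").
assert chunk_end(LOGICAL, LENGTH) == 29360127
assert chunk_end(LOGICAL, LENGTH) < LOGICAL
assert chunk_end_overflows(LOGICAL, LENGTH)
```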

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain
    Signed-off-by: Su Yue
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Su Yue
     
  • commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream.

    Some extent io trees are initialized with a NULL private member (e.g.
    btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
    Dereferencing a NULL tree->private as an inode pointer causes a panic.

    Pass tree->fs_info as it's known to be valid in all cases.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
    Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain
    Signed-off-by: Su Yue
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Su Yue
     
  • commit 50e31ef486afe60f128d42fb9620e2a63172c15c upstream.

    [BUG]
    There are several bug reports about recent kernels being unable to
    relocate certain data block groups.

    Sometimes the error just goes away, but there is one reporter who can
    reproduce it reliably.

    The dmesg would look like:

    [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
    [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
    [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
    [463.501781] BTRFS info (device dm-10): balance: ended with status: -2

    [CAUSE]
    The ENOENT error is returned from the following call chain:

    add_data_references()
    |- delete_v1_space_cache();
       |- if (!found)
              return -ENOENT;

    The variable @found is set to true only if we find a data extent whose
    disk bytenr matches the parameter @data_bytenr.

    With extra debugging, the offending tree block looks like this:

    leaf bytenr = 42676709441536, data_bytenr = 34626327621632

    ctime 1567904822.739884119 (2019-09-08 03:07:02)
    mtime 0.0 (1970-01-01 01:00:00)
    otime 0.0 (1970-01-01 01:00:00)
    item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
    generation 1517381 type 2 (prealloc)
    prealloc data disk byte 34626327621632 nr 262144 <<<
    prealloc data offset 0 nr 262144
    item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
    generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
    lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
    uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

    Although item 27 has disk bytenr 34626327621632, which matches the
    data_bytenr, its type is prealloc, not reg.
    This makes the existing code skip that item, and return ENOENT.

    [FIX]
    The code was modified in commit 19b546d7a1b2 ("btrfs: relocation: Use
    btrfs_find_all_leafs to locate data extent parent tree leaves"). Before
    that commit, we used something like

    "if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

    But the offending commit checks (type == BTRFS_FILE_EXTENT_REG)
    instead, which ignores BTRFS_FILE_EXTENT_PREALLOC.

    Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.
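
    The difference between the two checks can be shown on the offending
    item from the dump above. This is a simplified Python model, not the
    relocation code: leaf items are reduced to (type, disk_bytenr) pairs,
    and found_before_fix() and found_after_fix() are hypothetical names.
    The type values follow the dump ("type 2 (prealloc)").

```python
BTRFS_FILE_EXTENT_INLINE = 0
BTRFS_FILE_EXTENT_REG = 1
BTRFS_FILE_EXTENT_PREALLOC = 2


def found_before_fix(items, data_bytenr):
    """Only regular extents are considered, so prealloc extents are skipped."""
    return any(etype == BTRFS_FILE_EXTENT_REG and bytenr == data_bytenr
               for etype, bytenr in items)


def found_after_fix(items, data_bytenr):
    """Regular and prealloc extents are both considered."""
    accepted = (BTRFS_FILE_EXTENT_REG, BTRFS_FILE_EXTENT_PREALLOC)
    return any(etype in accepted and bytenr == data_bytenr
               for etype, bytenr in items)


# Item 27 from the dump: a prealloc extent at the wanted disk bytenr.
leaf_items = [(BTRFS_FILE_EXTENT_PREALLOC, 34626327621632)]
data_bytenr = 34626327621632

# Before the fix nothing is found, which is reported as -ENOENT.
assert not found_before_fix(leaf_items, data_bytenr)
assert found_after_fix(leaf_items, data_bytenr)
```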

    Reported-by: Stéphane Lesimple
    Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
    Fixes: 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
    CC: stable@vger.kernel.org # 5.6+
    Tested-By: Stéphane Lesimple
    Reviewed-by: Su Yue
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

17 Jan, 2021

3 commits

  • [ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ]

    Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
    shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
    some infrastructure we have in place to flush inodes that we use for
    device replace and snapshot. However, this introduced a pretty serious
    performance regression. To reproduce it, the user untarred the source
    tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
    see it take anywhere from 5 to 20 times as long to untar in 5.10
    compared to 5.9. This was observed on fast devices (SSD and better) and
    not on HDD.

    The root cause is that previously we would generally use the normal
    writeback path to reclaim delalloc space, providing it with the number
    of pages we wanted to flush. The referenced commit changed this to flush
    that many inodes, which drastically increased the amount of space we
    were flushing in certain cases and severely affected performance.

    Unfortunately, we cannot revert this patch because of 3d45f221ce62
    ("btrfs: fix deadlock when cloning inline extent and low on free
    metadata space") which requires the ability to skip flushing inodes that
    are being cloned in certain scenarios, which means we need to keep using
    our flushing infrastructure or risk re-introducing the deadlock.

    Instead, to fix this problem, we can go back to providing
    btrfs_start_delalloc_roots with a number of pages to flush, and then set
    up a writeback_control and utilize sync_inode() to handle the flushing
    for us. This gives us the same behavior we had prior to the fix, while
    still allowing us to avoid the deadlock that was fixed by Filipe. I
    redid the user's original test and got the following results on one of
    our test machines (256GiB of RAM, 56 cores, 2TiB Intel NVMe drive):

    5.9 0m54.258s
    5.10 1m26.212s
    5.10+patch 0m38.800s

    5.10+patch is significantly faster than plain 5.9 because of my patch
    series "Change data reservations to use the ticketing infra" which
    contained the patch that introduced the regression, but generally
    improved the overall ENOSPC flushing mechanisms.

    Additional testing on a consumer-grade SSD (8GiB RAM, 8 CPUs) confirms
    the results:

    5.10.5 4m00s
    5.10.5+patch 1m08s
    5.11-rc2 5m14s
    5.11-rc2+patch 1m30s
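
    The regression described in this commit comes from a budget meant as a
    number of pages being applied as a number of inodes. The following
    Python sketch illustrates the difference in the amount of data flushed.
    It is only a model: flush_n_inodes() and flush_n_pages() are
    hypothetical names, and the page budget is merely similar in spirit to
    the nr_to_write field of a writeback_control.

```python
def flush_n_inodes(dirty_pages_per_inode, nr):
    """Regressed behaviour: flush every dirty page of the first nr inodes."""
    return sum(dirty_pages_per_inode[:nr])


def flush_n_pages(dirty_pages_per_inode, nr):
    """Restored behaviour: stop once nr pages have been written in total."""
    flushed = 0
    for pages in dirty_pages_per_inode:
        if flushed >= nr:
            break
        flushed += min(pages, nr - flushed)
    return flushed


# 64 inodes with 1000 dirty pages each (made-up numbers for illustration).
inodes = [1000] * 64
pages_flushed_by_inode_budget = flush_n_inodes(inodes, 32)
pages_flushed_by_page_budget = flush_n_pages(inodes, 32)
```

    In this made-up case a request for 32 units flushes 32000 pages when the
    unit is an inode, and 32 pages when the unit is a page.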

    Reported-by: René Rebe
    Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
    CC: stable@vger.kernel.org # 5.10
    Signed-off-by: Josef Bacik
    Tested-by: David Sterba
    Reviewed-by: David Sterba
    [ add my test results ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]

    When cloning an inline extent there are cases where we cannot just copy
    the inline extent from the source range to the target range (e.g. when the
    target range starts at an offset greater than zero). In such cases we copy
    the inline extent's data into a page of the destination inode and then
    dirty that page. However, after that we will need to start a transaction
    for each processed extent and, if we are ever low on available metadata
    space, we may need to flush existing delalloc for all dirty inodes in an
    attempt to release metadata space - if that happens we may deadlock:

    * the async reclaim task queued a delalloc work to flush delalloc for
    the destination inode of the clone operation;

    * the task executing that delalloc work gets blocked waiting for the
    range with the dirty page to be unlocked, which is currently locked
    by the task doing the clone operation;

    * the async reclaim task blocks waiting for the delalloc work to complete;

    * the cloning task is waiting on the waitqueue of its reservation ticket
    while holding the range with the dirty page locked in the inode's
    io_tree;

    * if metadata space is not released by some other task (like delalloc for
    some other inode completing for example), the clone task waits forever
    and as a consequence the delalloc work and async reclaim tasks will hang
    forever as well. Releasing more space on the other hand may require
    starting a transaction, which will hang as well when trying to reserve
    metadata space, resulting in a deadlock between all these tasks.

    When this happens, traces like the following show up in dmesg/syslog:

    [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
    [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
    [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
    [87452.326136] Call Trace:
    [87452.326737] __schedule+0x5d1/0xcf0
    [87452.327390] schedule+0x45/0xe0
    [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
    [87452.328894] ? finish_wait+0x90/0x90
    [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
    [87452.330133] ? __mod_memcg_state+0x8e/0x160
    [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
    [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
    [87452.332007] ? lock_release+0x20e/0x4c0
    [87452.332557] ? trace_hardirqs_on+0x1b/0xf0
    [87452.333127] extent_writepages+0x43/0x90 [btrfs]
    [87452.333653] ? lock_acquire+0x1a3/0x490
    [87452.334177] do_writepages+0x43/0xe0
    [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
    [87452.335720] __filemap_fdatawrite_range+0xc5/0x100
    [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
    [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
    [87452.337838] process_one_work+0x24e/0x5e0
    [87452.338437] worker_thread+0x50/0x3b0
    [87452.339137] ? process_one_work+0x5e0/0x5e0
    [87452.339884] kthread+0x153/0x170
    [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.341153] ret_from_fork+0x22/0x30
    [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
    [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
    [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
    [87452.345655] Call Trace:
    [87452.346305] __schedule+0x5d1/0xcf0
    [87452.346947] ? kvm_clock_read+0x14/0x30
    [87452.347676] ? wait_for_completion+0x81/0x110
    [87452.348389] schedule+0x45/0xe0
    [87452.349077] schedule_timeout+0x30c/0x580
    [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [87452.350340] ? lock_acquire+0x1a3/0x490
    [87452.351006] ? try_to_wake_up+0x7a/0xa20
    [87452.351541] ? lock_release+0x20e/0x4c0
    [87452.352040] ? lock_acquired+0x199/0x490
    [87452.352517] ? wait_for_completion+0x81/0x110
    [87452.353000] wait_for_completion+0xab/0x110
    [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
    [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
    [87452.354455] flush_space+0x24f/0x660 [btrfs]
    [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
    [87452.355565] process_one_work+0x24e/0x5e0
    [87452.356024] worker_thread+0x20f/0x3b0
    [87452.356487] ? process_one_work+0x5e0/0x5e0
    [87452.356973] kthread+0x153/0x170
    [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.357880] ret_from_fork+0x22/0x30
    (...)
    < stack traces of several tasks waiting for the locks of the inodes of the
    clone operation >
    (...)
    [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
    [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
    [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
    [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
    [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
    [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
    [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
    [92867.447920] Call Trace:
    [92867.448435] __schedule+0x5d1/0xcf0
    [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [92867.449423] schedule+0x45/0xe0
    [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
    [92867.450576] ? finish_wait+0x90/0x90
    [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
    [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
    [92867.452412] start_transaction+0x2d1/0x760 [btrfs]
    [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
    [92867.453848] ? lock_release+0x20e/0x4c0
    [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
    [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
    [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
    [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
    [92867.457213] do_clone_file_range+0xd4/0x1f0
    [92867.457828] vfs_clone_file_range+0x4d/0x230
    [92867.458355] ? lock_release+0x20e/0x4c0
    [92867.458890] ioctl_file_clone+0x8f/0xc0
    [92867.459377] do_vfs_ioctl+0x342/0x750
    [92867.459913] __x64_sys_ioctl+0x62/0xb0
    [92867.460377] do_syscall_64+0x33/0x80
    [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    (...)
    < stack traces of more tasks blocked on metadata reservation like the clone
    task above, because the async reclaim task has deadlocked >
    (...)

    Another thing to notice is that the worker task that is deadlocked when
    trying to flush the destination inode of the clone operation is at
    btrfs_invalidatepage(). This is simply because the clone operation has a
    destination offset greater than the i_size and we only update the i_size
    of the destination file after cloning an extent (just like we do in the
    buffered write path).

    Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
    the flushing of delalloc for all inodes that have delalloc, add a runtime
    flag to an inode to signal it should not be flushed, and for inodes with
    that flag set, start_delalloc_inodes() will simply skip them. When the
    cloning code needs to dirty a page to copy an inline extent, set that flag
    on the inode and then clear it when the clone operation finishes.
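
    The skip logic described above can be sketched in a few lines of
    Python. This is a model under stated assumptions, not the kernel
    implementation: the Inode class, the no_delalloc_flush attribute and
    clone_with_inline_copy() are hypothetical names chosen for
    illustration.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    ino: int
    has_delalloc: bool = True
    no_delalloc_flush: bool = False  # hypothetical runtime flag


def start_delalloc_inodes(inodes):
    """Return the inode numbers that would be flushed, skipping flagged ones."""
    return [i.ino for i in inodes
            if i.has_delalloc and not i.no_delalloc_flush]


def clone_with_inline_copy(dest, all_inodes):
    """Model a clone that dirties a page of dest and then needs a flush.

    The flag is set while the clone holds the range locked and cleared when
    the clone operation finishes, even on error.
    """
    dest.no_delalloc_flush = True
    try:
        # A flush triggered by low metadata space during the clone.
        return start_delalloc_inodes(all_inodes)
    finally:
        dest.no_delalloc_flush = False
```

    While the clone is in progress the destination inode is left out of the
    flush, so the flushing task never waits on the range that the cloning
    task holds locked. Once the clone finishes, the inode is flushed
    normally again.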

    This could be sporadically triggered with test case generic/269 from
    fstests, which exercises many fsstress processes running in parallel with
    several dd processes filling up the entire filesystem.

    CC: stable@vger.kernel.org # 5.9+
    Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit f2f121ab500d0457cc9c6f54269d21ffdf5bd304 ]

    Every time we log an inode we look up xattrs in the fs/subvol tree and,
    if we have any, log them into the log tree. However, it is very common
    to have inodes without any xattrs, so doing the search wastes time, but
    more importantly it adds contention on the fs/subvol tree locks, either
    making the logging code block and wait for tree locks or making other
    concurrent operations block and wait.

    The most typical use cases where xattrs are used are when capabilities or
    ACLs are defined for an inode, or when SELinux is enabled.

    This change makes the logging code detect when an inode does not have
    xattrs and skip the xattrs search the next time the inode is logged,
    unless the inode is evicted and loaded again or an xattr is added to the
    inode. This avoids the search for xattrs on inodes that never have
    xattrs and are fsynced with some frequency.
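
    The caching idea can be sketched as follows. This Python model is an
    illustration only: the Inode class, its no_xattrs attribute and the
    searches counter are hypothetical, and eviction is not modelled (a
    freshly created Inode object plays the role of a reloaded inode).

```python
class Inode:
    def __init__(self, xattrs=None):
        self.xattrs = dict(xattrs or {})
        self.no_xattrs = False  # hypothetical in-memory flag, unset on load
        self.searches = 0       # counts simulated fs/subvol tree searches

    def set_xattr(self, name, value):
        self.xattrs[name] = value
        self.no_xattrs = False  # adding an xattr invalidates the shortcut


def log_inode(inode):
    """Return the xattr names to log, searching the tree only when needed."""
    if inode.no_xattrs:
        return []
    inode.searches += 1  # stands in for the fs/subvol tree search
    found = sorted(inode.xattrs)
    if not found:
        inode.no_xattrs = True  # remember the result for the next logging
    return found
```

    Logging an inode without xattrs three times performs a single search in
    this model. After set_xattr() the next call searches again and returns
    the new xattr.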

    The following script that calls dbench was used to measure the impact of
    this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
    directly (no intermediary filesystem on the host) and using a non-debug
    kernel (default configuration on Debian distributions):

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/sdk
    MNT=/mnt/sdk
    MOUNT_OPTIONS="-o ssd"

    mkfs.btrfs -f -m single -d single $DEV
    mount $MOUNT_OPTIONS $DEV $MNT

    dbench -D $MNT -t 200 40

    umount $MNT

    The results before this change:

    Operation       Count    AvgLat    MaxLat
    ----------------------------------------
    NTCreateX     5761605     0.172   312.057
    Close         4232452     0.002    10.927
    Rename         243937     1.406   277.344
    Unlink        1163456     0.631   298.402
    Deltree           160    11.581   221.107
    Mkdir              80     0.003     0.005
    Qpathinfo     5221410     0.065   122.309
    Qfileinfo      915432     0.001     3.333
    Qfsinfo        957555     0.003     3.992
    Sfileinfo      469244     0.023    20.494
    Find          2018865     0.448   123.659
    WriteX        2874851     0.049   118.529
    ReadX         9030579     0.004    21.654
    LockX           18754     0.003     4.423
    UnlockX         18754     0.002     0.331
    Flush          403792    10.944   359.494

    Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms

    The results after this change:

    Operation       Count    AvgLat    MaxLat
    ----------------------------------------
    NTCreateX     6442521     0.159   230.693
    Close         4732357     0.002    10.972
    Rename         272809     1.293   227.398
    Unlink        1301059     0.563   218.500
    Deltree           160     7.796    54.887
    Mkdir              80     0.008     0.478
    Qpathinfo     5839452     0.047   124.330
    Qfileinfo     1023199     0.001     4.996
    Qfsinfo       1070760     0.003     5.709
    Sfileinfo      524790     0.033    21.765
    Find          2257658     0.314   125.611
    WriteX        3211520     0.040   232.135
    ReadX        10098969     0.004    25.340
    LockX           20974     0.003     1.569
    UnlockX         20974     0.002     3.475
    Flush          451553    10.287   331.037

    Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms

    +10.8% throughput, -8.2% max latency

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     

13 Jan, 2021

2 commits

  • commit 0b3f407e6728d990ae1630a02c7b952c21c288d3 upstream.

    When doing an incremental send, if we have a new inode that happens to
    have the same number that an old directory inode had in the base snapshot
    and that old directory has a pending rmdir operation, we end up computing
    a wrong path for the new inode, causing the receiver to fail.

    Example reproducer:

    $ cat test-send-rmdir.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV >/dev/null
    mount $DEV $MNT

    mkdir $MNT/dir
    touch $MNT/dir/file1
    touch $MNT/dir/file2
    touch $MNT/dir/file3

    # Filesystem looks like:
    #
    # .                              (ino 256)
    # |----- dir/                    (ino 257)
    #          |----- file1          (ino 258)
    #          |----- file2          (ino 259)
    #          |----- file3          (ino 260)
    #

    btrfs subvolume snapshot -r $MNT $MNT/snap1
    btrfs send -f /tmp/snap1.send $MNT/snap1

    # Now remove our directory and all its files.
    rm -fr $MNT/dir

    # Unmount the filesystem and mount it again. This is to ensure that
    # the next inode that is created ends up with the same inode number
    # that our directory "dir" had, 257, which is the first free "objectid"
    # available after mounting again the filesystem.
    umount $MNT
    mount $DEV $MNT

    # Now create a new file (it could be a directory as well).
    touch $MNT/newfile

    # Filesystem now looks like:
    #
    # . (ino 256)
    # |----- newfile (ino 257)
    #

    btrfs subvolume snapshot -r $MNT $MNT/snap2
    btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

    # Now unmount the filesystem, create a new one, mount it and try to apply
    # both send streams to recreate both snapshots.
    umount $DEV

    mkfs.btrfs -f $DEV >/dev/null

    mount $DEV $MNT

    btrfs receive -f /tmp/snap1.send $MNT
    btrfs receive -f /tmp/snap2.send $MNT

    umount $MNT

    When running the test, the receive operation for the incremental stream
    fails:

    $ ./test-send-rmdir.sh
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
    At subvol /mnt/sdi/snap1
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
    At subvol /mnt/sdi/snap2
    At subvol snap1
    At snapshot snap2
    ERROR: chown o257-9-0 failed: No such file or directory

    So fix this by tracking directories that have a pending rmdir by inode
    number and generation number, instead of only inode number.
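
    Why the generation number matters can be seen in a small Python
    sketch. It only models the lookup key: pending_rmdir_key() and
    has_pending_rmdir() are hypothetical names, and the generation of the
    old directory (7) is made up. The generation of the new file (9) is
    taken from the orphan name o257-9-0 in the error above, assuming that
    name encodes the inode number and generation.

```python
def pending_rmdir_key(ino, gen, use_generation):
    """Key under which a directory with a pending rmdir is tracked."""
    return (ino, gen) if use_generation else ino


def has_pending_rmdir(pending, ino, gen, use_generation):
    """True if the inode is considered a directory with a pending rmdir."""
    return pending_rmdir_key(ino, gen, use_generation) in pending


OLD_DIR = (257, 7)   # "dir" in the base snapshot; generation is made up
NEW_FILE = (257, 9)  # "newfile" in the second snapshot, same inode number

# Tracking by inode number only: the new file is mistaken for the old
# directory, which leads to a wrong path being computed for it.
by_ino = {pending_rmdir_key(*OLD_DIR, use_generation=False)}
assert has_pending_rmdir(by_ino, *NEW_FILE, use_generation=False)

# Tracking by inode number and generation: the two are told apart.
by_ino_gen = {pending_rmdir_key(*OLD_DIR, use_generation=True)}
assert not has_pending_rmdir(by_ino_gen, *NEW_FILE, use_generation=True)
assert has_pending_rmdir(by_ino_gen, *OLD_DIR, use_generation=True)
```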

    A test case for fstests follows soon.

    Reported-by: Massimo B.
    Tested-by: Massimo B.
    Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit ae5e070eaca9dbebde3459dd8f4c2756f8c097d0 upstream.

    There is a chance of racing for qgroup flushing which may lead to
    deadlock:

    Thread A                        | Thread B
    (not holding trans handle)      | (holding a trans handle)
    --------------------------------+--------------------------------
    __btrfs_qgroup_reserve_meta()   | __btrfs_qgroup_reserve_meta()
    |- try_flush_qgroup()           | |- try_flush_qgroup()
       |- QGROUP_FLUSHING bit set   |    |
       |                            |    |- test_and_set_bit()
       |                            |    |- wait_event()
       |- btrfs_join_transaction()  |
       |- btrfs_commit_transaction()|

    !!! DEAD LOCK !!!

    Thread A wants to commit the transaction, but thread B is holding a
    transaction handle, which blocks the commit.
    At the same time, thread B is waiting for thread A to finish its commit.

    This is just a hot fix, and would lead to more EDQUOT when we're near
    the qgroup limit.

    The proper fix would be to make all metadata/data reservations happen
    without holding a transaction handle.
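
    The trade-off mentioned above (no deadlock, but more EDQUOT near the
    limit) can be illustrated with a Python sketch. It rests on an
    assumption that this message does not spell out: that the hot fix
    amounts to not attempting the qgroup flush when the caller already
    holds a transaction handle. reserve_meta() and its parameters are
    hypothetical.

```python
import errno


def reserve_meta(used, limit, nbytes, reclaimable, holding_trans_handle):
    """Model a qgroup metadata reservation.

    Returns (status, used): status is 0 on success or -EDQUOT, and used is
    the qgroup usage after the call.
    """
    if used + nbytes <= limit:
        return 0, used + nbytes
    if not holding_trans_handle:
        # Safe to flush and commit: nobody is waiting on our handle.
        used -= reclaimable
        if used + nbytes <= limit:
            return 0, used + nbytes
    # Assumed hot-fix behaviour: a task holding a transaction handle does
    # not flush or wait, so it fails early instead of deadlocking.
    return -errno.EDQUOT, used
```

    For the same request near the limit, a caller without a transaction
    handle succeeds after flushing, while a caller holding one gets
    -EDQUOT in this model.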

    CC: stable@vger.kernel.org # 5.9+
    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

30 Dec, 2020

4 commits

  • Changes in 5.10.4
    hwmon: (k10temp) Remove support for displaying voltage and current on Zen CPUs
    drm/gma500: fix double free of gma_connector
    iio: adc: at91_adc: add Kconfig dep on the OF symbol and remove of_match_ptr()
    drm/aspeed: Fix Kconfig warning & subsequent build errors
    drm/mcde: Fix handling of platform_get_irq() error
    drm/tve200: Fix handling of platform_get_irq() error
    arm64: dts: renesas: hihope-rzg2-ex: Drop rxc-skew-ps from ethernet-phy node
    arm64: dts: renesas: cat875: Remove rxc-skew-ps from ethernet-phy node
    soc: renesas: rmobile-sysc: Fix some leaks in rmobile_init_pm_domains()
    soc: mediatek: Check if power domains can be powered on at boot time
    arm64: dts: mediatek: mt8183: fix gce incorrect mbox-cells value
    arm64: dts: ipq6018: update the reserved-memory node
    arm64: dts: qcom: sc7180: Fix one forgotten interconnect reference
    soc: qcom: geni: More properly switch to DMA mode
    Revert "i2c: i2c-qcom-geni: Fix DMA transfer race"
    RDMA/bnxt_re: Set queue pair state when being queried
    rtc: pcf2127: fix pcf2127_nvmem_read/write() returns
    RDMA/bnxt_re: Fix entry size during SRQ create
    selinux: fix error initialization in inode_doinit_with_dentry()
    ARM: dts: aspeed-g6: Fix the GPIO memory size
    ARM: dts: aspeed: s2600wf: Fix VGA memory region location
    RDMA/core: Fix error return in _ib_modify_qp()
    RDMA/rxe: Compute PSN windows correctly
    x86/mm/ident_map: Check for errors from ident_pud_init()
    ARM: p2v: fix handling of LPAE translation in BE mode
    RDMA/rtrs-clt: Remove destroy_con_cq_qp in case route resolving failed
    RDMA/rtrs-clt: Missing error from rtrs_rdma_conn_established
    RDMA/rtrs-srv: Don't guard the whole __alloc_srv with srv_mutex
    x86/apic: Fix x2apic enablement without interrupt remapping
    ASoC: qcom: fix unsigned int bitwidth compared to less than zero
    sched/deadline: Fix sched_dl_global_validate()
    sched: Reenable interrupts in do_sched_yield()
    drm/amdgpu: fix incorrect enum type
    crypto: talitos - Endianess in current_desc_hdr()
    crypto: talitos - Fix return type of current_desc_hdr()
    crypto: inside-secure - Fix sizeof() mismatch
    ASoC: sun4i-i2s: Fix lrck_period computation for I2S justified mode
    drm/msm: Add missing stub definition
    ARM: dts: aspeed: tiogapass: Remove vuart
    drm/amdgpu: fix build_coefficients() argument
    powerpc/64: Set up a kernel stack for secondaries before cpu_restore()
    spi: img-spfi: fix reference leak in img_spfi_resume
    f2fs: call f2fs_get_meta_page_retry for nat page
    RDMA/mlx5: Fix corruption of reg_pages in mlx5_ib_rereg_user_mr()
    perf test: Use generic event for expand_libpfm_events()
    drm/msm/dp: DisplayPort PHY compliance tests fixup
    drm/msm/dsi_pll_7nm: restore VCO rate during restore_state
    drm/msm/dsi_pll_10nm: restore VCO rate during restore_state
    drm/msm/dpu: fix clock scaling on non-sc7180 board
    spi: spi-mem: fix reference leak in spi_mem_access_start
    scsi: aacraid: Improve compat_ioctl handlers
    pinctrl: core: Add missing #ifdef CONFIG_GPIOLIB
    ASoC: pcm: DRAIN support reactivation
    drm/bridge: tpd12s015: Fix irq registering in tpd12s015_probe
    crypto: arm64/poly1305-neon - reorder PAC authentication with SP update
    crypto: arm/aes-neonbs - fix usage of cbc(aes) fallback
    crypto: caam - fix printing on xts fallback allocation error path
    selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
    nl80211/cfg80211: fix potential infinite loop
    spi: stm32: fix reference leak in stm32_spi_resume
    bpf: Fix tests for local_storage
    x86/mce: Correct the detection of invalid notifier priorities
    drm/edid: Fix uninitialized variable in drm_cvt_modes()
    ath11k: Initialize complete alpha2 for regulatory change
    ath11k: Fix number of rules in filtered ETSI regdomain
    ath11k: fix wmi init configuration
    brcmfmac: Fix memory leak for unpaired brcmf_{alloc/free}
    arm64: dts: exynos: Include common syscon restart/poweroff for Exynos7
    arm64: dts: exynos: Correct psci compatible used on Exynos7
    drm/panel: simple: Add flags to boe_nv133fhm_n61
    Bluetooth: Fix null pointer dereference in hci_event_packet()
    Bluetooth: Fix: LL Privacy BLE device fails to connect
    Bluetooth: hci_h5: fix memory leak in h5_close
    spi: stm32-qspi: fix reference leak in stm32 qspi operations
    spi: spi-ti-qspi: fix reference leak in ti_qspi_setup
    spi: mt7621: fix missing clk_disable_unprepare() on error in mt7621_spi_probe
    spi: tegra20-slink: fix reference leak in slink ops of tegra20
    spi: tegra20-sflash: fix reference leak in tegra_sflash_resume
    spi: tegra114: fix reference leak in tegra spi ops
    spi: bcm63xx-hsspi: fix missing clk_disable_unprepare() on error in bcm63xx_hsspi_resume
    spi: imx: fix reference leak in two imx operations
    ASoC: qcom: common: Fix refcounting in qcom_snd_parse_of()
    ath11k: Handle errors if peer creation fails
    mwifiex: fix mwifiex_shutdown_sw() causing sw reset failure
    drm/msm/a6xx: Clear shadow on suspend
    drm/msm/a5xx: Clear shadow on suspend
    firmware: tegra: fix strncpy()/strncat() confusion
    drm/msm/dp: return correct connection status after suspend
    drm/msm/dp: skip checking LINK_STATUS_UPDATED bit
    drm/msm/dp: do not notify audio subsystem if sink doesn't support audio
    selftests/run_kselftest.sh: fix dry-run typo
    selftest/bpf: Add missed ip6ip6 test back
    ASoC: wm8994: Fix PM disable depth imbalance on error
    ASoC: wm8998: Fix PM disable depth imbalance on error
    spi: sprd: fix reference leak in sprd_spi_remove
    virtiofs: fix leak in setup
    ASoC: arizona: Fix a wrong free in wm8997_probe
    RDMA/mthca: Work around -Wenum-conversion warning
    ASoC: SOF: Intel: fix Kconfig dependency for SND_INTEL_DSP_CONFIG
    arm64: dts: ti: k3-am65*/j721e*: Fix unit address format error for dss node
    MIPS: BCM47XX: fix kconfig dependency bug for BCM47XX_BCMA
    drm/amdgpu: fix compute queue priority if num_kcq is less than 4
    soc: ti: omap-prm: Do not check rstst bit on deassert if already deasserted
    crypto: Kconfig - CRYPTO_MANAGER_EXTRA_TESTS requires the manager
    crypto: qat - fix status check in qat_hal_put_rel_rd_xfer()
    firmware: arm_scmi: Fix missing destroy_workqueue()
    drm/udl: Fix missing error code in udl_handle_damage()
    staging: greybus: codecs: Fix reference counter leak in error handling
    staging: gasket: interrupt: fix the missed eventfd_ctx_put() in gasket_interrupt.c
    scripts: kernel-doc: Restore anonymous enum parsing
    drm/amdkfd: Put ACPI table after using it
    ionic: use mc sync for multicast filters
    ionic: flatten calls to ionic_lif_rx_mode
    ionic: change set_rx_mode from_ndo to can_sleep
    media: tm6000: Fix sizeof() mismatches
    media: platform: add missing put_device() call in mtk_jpeg_clk_init()
    media: mtk-vcodec: add missing put_device() call in mtk_vcodec_init_dec_pm()
    media: mtk-vcodec: add missing put_device() call in mtk_vcodec_release_dec_pm()
    media: mtk-vcodec: add missing put_device() call in mtk_vcodec_init_enc_pm()
    media: v4l2-fwnode: Return -EINVAL for invalid bus-type
    media: v4l2-fwnode: v4l2_fwnode_endpoint_parse caller must init vep argument
    media: ov5640: fix support of BT656 bus mode
    media: staging: rkisp1: cap: fix runtime PM imbalance on error
    media: cedrus: fix reference leak in cedrus_start_streaming
    media: platform: add missing put_device() call in mtk_jpeg_probe() and mtk_jpeg_remove()
    media: venus: core: change clk enable and disable order in resume and suspend
    media: venus: core: vote for video-mem path
    media: venus: core: vote with average bandwidth and peak bandwidth as zero
    RDMA/cma: Add missing error handling of listen_id
    ASoC: meson: fix COMPILE_TEST error
    spi: dw: fix build error by selecting MULTIPLEXER
    scsi: core: Fix VPD LUN ID designator priorities
    media: venus: put dummy vote on video-mem path after last session release
    media: solo6x10: fix missing snd_card_free in error handling case
    video: fbdev: atmel_lcdfb: fix return error code in atmel_lcdfb_of_init()
    mmc: sdhci: tegra: fix wrong unit with busy_timeout
    drm/omap: dmm_tiler: fix return error code in omap_dmm_probe()
    drm/meson: Free RDMA resources after tearing down DRM
    drm/meson: Unbind all connectors on module removal
    drm/meson: dw-hdmi: Register a callback to disable the regulator
    drm/meson: dw-hdmi: Ensure that clocks are enabled before touching the TOP registers
    ASoC: intel: SND_SOC_INTEL_KEEMBAY should depend on ARCH_KEEMBAY
    iommu/vt-d: include conditionally on CONFIG_INTEL_IOMMU_SVM
    Input: ads7846 - fix race that causes missing releases
    Input: ads7846 - fix integer overflow on Rt calculation
    Input: ads7846 - fix unaligned access on 7845
    bus: mhi: core: Remove double locking from mhi_driver_remove()
    bus: mhi: core: Fix null pointer access when parsing MHI configuration
    usb/max3421: fix return error code in max3421_probe()
    spi: mxs: fix reference leak in mxs_spi_probe
    selftests/bpf: Fix broken riscv build
    powerpc: Avoid broken GCC __attribute__((optimize))
    powerpc/feature: Fix CPU_FTRS_ALWAYS by removing CPU_FTRS_GENERIC_32
    ARM: dts: tacoma: Fix node vs reg mismatch for flash memory
    Revert "powerpc/pseries/hotplug-cpu: Remove double free in error path"
    powerpc/powernv/sriov: fix unsigned int win compared to less than zero
    mfd: htc-i2cpld: Add the missed i2c_put_adapter() in htcpld_register_chip_i2c()
    mfd: MFD_SL28CPLD should depend on ARCH_LAYERSCAPE
    mfd: stmfx: Fix dev_err_probe() call in stmfx_chip_init()
    mfd: cpcap: Fix interrupt regression with regmap clear_ack
    EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId
    scsi: ufs: Avoid to call REQ_CLKS_OFF to CLKS_OFF
    scsi: ufs: Fix clkgating on/off
    rcu: Allow rcu_irq_enter_check_tick() from NMI
    rcu,ftrace: Fix ftrace recursion
    rcu/tree: Defer kvfree_rcu() allocation to a clean context
    crypto: crypto4xx - Replace bitwise OR with logical OR in crypto4xx_build_pd
    crypto: omap-aes - Fix PM disable depth imbalance in omap_aes_probe
    crypto: sun8i-ce - fix two error path's memory leak
    spi: fix resource leak for drivers without .remove callback
    drm/meson: dw-hdmi: Disable clocks on driver teardown
    drm/meson: dw-hdmi: Enable the iahb clock early enough
    PCI: Disable MSI for Pericom PCIe-USB adapter
    PCI: brcmstb: Initialize "tmp" before use
    soc: ti: knav_qmss: fix reference leak in knav_queue_probe
    soc: ti: Fix reference imbalance in knav_dma_probe
    drivers: soc: ti: knav_qmss_queue: Fix error return code in knav_queue_probe
    soc: qcom: initialize local variable
    arm64: dts: qcom: sm8250: correct compatible for sm8250-mtp
    arm64: dts: qcom: msm8916-samsung-a2015: Disable muic i2c pin bias
    Input: omap4-keypad - fix runtime PM error handling
    clk: meson: Kconfig: fix dependency for G12A
    staging: mfd: hi6421-spmi-pmic: fix error return code in hi6421_spmi_pmic_probe()
    ath11k: Fix the rx_filter flag setting for peer rssi stats
    RDMA/cxgb4: Validate the number of CQEs
    soundwire: Fix DEBUG_LOCKS_WARN_ON for uninitialized attribute
    pinctrl: sunxi: fix irq bank map for the Allwinner A100 pin controller
    memstick: fix a double-free bug in memstick_check
    ARM: dts: at91: sam9x60: add pincontrol for USB Host
    ARM: dts: at91: sama5d4_xplained: add pincontrol for USB Host
    ARM: dts: at91: sama5d3_xplained: add pincontrol for USB Host
    mmc: pxamci: Fix error return code in pxamci_probe
    brcmfmac: fix error return code in brcmf_cfg80211_connect()
    orinoco: Move context allocation after processing the skb
    qtnfmac: fix error return code in qtnf_pcie_probe()
    rsi: fix error return code in rsi_reset_card()
    cw1200: fix missing destroy_workqueue() on error in cw1200_init_common
    dmaengine: mv_xor_v2: Fix error return code in mv_xor_v2_probe()
    arm64: dts: qcom: sdm845: Limit ipa iommu streams
    leds: netxbig: add missing put_device() call in netxbig_leds_get_of_pdata()
    leds: lp50xx: Fix an error handling path in 'lp50xx_probe_dt()'
    leds: turris-omnia: check for LED_COLOR_ID_RGB instead LED_COLOR_ID_MULTI
    arm64: tegra: Fix DT binding for IO High Voltage entry
    RDMA/cma: Fix deadlock on &lock in rdma_cma_listen_on_all() error unwind
    soundwire: qcom: Fix build failure when slimbus is module
    drm/imx/dcss: fix rotations for Vivante tiled formats
    media: siano: fix memory leak of debugfs members in smsdvb_hotplug
    platform/x86: mlx-platform: Remove PSU EEPROM from default platform configuration
    platform/x86: mlx-platform: Remove PSU EEPROM from MSN274x platform configuration
    arm64: dts: qcom: sc7180: limit IPA iommu streams
    RDMA/hns: Only record vlan info for HIP08
    RDMA/hns: Fix missing fields in address vector
    RDMA/hns: Avoid setting loopback indicator when smac is same as dmac
    serial: 8250-mtk: Fix reference leak in mtk8250_probe
    samples: bpf: Fix lwt_len_hist reusing previous BPF map
    media: imx214: Fix stop streaming
    mips: cdmm: fix use-after-free in mips_cdmm_bus_discover
    media: max2175: fix max2175_set_csm_mode() error code
    slimbus: qcom-ngd-ctrl: Avoid sending power requests without QMI
    RDMA/core: Track device memory MRs
    drm/mediatek: Use correct aliases name for ovl
    HSI: omap_ssi: Don't jump to free ID in ssi_add_controller()
    ARM: dts: Remove non-existent i2c1 from 98dx3236
    arm64: dts: armada-3720-turris-mox: update ethernet-phy handle name
    power: supply: bq25890: Use the correct range for IILIM register
    arm64: dts: rockchip: Set dr_mode to "host" for OTG on rk3328-roc-cc
    power: supply: max17042_battery: Fix current_{avg,now} hiding with no current sense
    power: supply: axp288_charger: Fix HP Pavilion x2 10 DMI matching
    power: supply: bq24190_charger: fix reference leak
    genirq/irqdomain: Don't try to free an interrupt that has no mapping
    arm64: dts: ls1028a: fix ENETC PTP clock input
    arm64: dts: ls1028a: fix FlexSPI clock input
    arm64: dts: freescale: sl28: combine SPI MTD partitions
    phy: tegra: xusb: Fix usb_phy device driver field
    arm64: dts: qcom: c630: Polish i2c-hid devices
    arm64: dts: qcom: c630: Fix pinctrl pins properties
    PCI: Bounds-check command-line resource alignment requests
    PCI: Fix overflow in command-line resource alignment requests
    PCI: iproc: Fix out-of-bound array accesses
    PCI: iproc: Invalidate correct PAXB inbound windows
    arm64: dts: meson: fix spi-max-frequency on Khadas VIM2
    arm64: dts: meson-sm1: fix typo in opp table
    soc: amlogic: canvas: add missing put_device() call in meson_canvas_get()
    scsi: hisi_sas: Fix up probe error handling for v3 hw
    scsi: pm80xx: Do not sleep in atomic context
    spi: spi-fsl-dspi: Use max_native_cs instead of num_chipselect to set SPI_MCR
    ARM: dts: at91: at91sam9rl: fix ADC triggers
    RDMA/hns: Fix 0-length sge calculation error
    RDMA/hns: Bugfix for calculation of extended sge
    mailbox: arm_mhu_db: Fix mhu_db_shutdown by replacing kfree with devm_kfree
    soundwire: master: use pm_runtime_set_active() on add
    platform/x86: dell-smbios-base: Fix error return code in dell_smbios_init
    ASoC: Intel: Boards: tgl_max98373: update TDM slot_width
    media: max9271: Fix GPIO enable/disable
    media: rdacm20: Enable GPIO1 explicitly
    media: i2c: imx219: Selection compliance fixes
    ath11k: Don't cast ath11k_skb_cb to ieee80211_tx_info.control
    ath11k: Reset ath11k_skb_cb before setting new flags
    ath11k: Fix an error handling path
    ath10k: Fix the parsing error in service available event
    ath10k: Fix an error handling path
    ath10k: Release some resources in an error handling path
    SUNRPC: rpc_wake_up() should wake up tasks in the correct order
    NFSv4.2: condition READDIR's mask for security label based on LSM state
    SUNRPC: xprt_load_transport() needs to support the netid "rdma6"
    NFSv4: Fix the alignment of page data in the getdeviceinfo reply
    net: sunrpc: Fix 'snprintf' return value check in 'do_xprt_debugfs'
    lockd: don't use interval-based rebinding over TCP
    NFS: switch nfsiod to be an UNBOUND workqueue.
    selftests/seccomp: Update kernel config
    vfio-pci: Use io_remap_pfn_range() for PCI IO memory
    hwmon: (ina3221) Fix PM usage counter unbalance in ina3221_write_enable
    f2fs: fix double free of unicode map
    media: tvp5150: Fix wrong return value of tvp5150_parse_dt()
    media: saa7146: fix array overflow in vidioc_s_audio()
    powerpc/perf: Fix crash with is_sier_available when pmu is not set
    powerpc/64: Fix an EMIT_BUG_ENTRY in head_64.S
    powerpc/xmon: Fix build failure for 8xx
    powerpc/perf: Fix to update radix_scope_qual in power10
    powerpc/perf: Update the PMU group constraints for l2l3 events in power10
    powerpc/perf: Fix the PMU group constraints for threshold events in power10
    clocksource/drivers/orion: Add missing clk_disable_unprepare() on error path
    clocksource/drivers/cadence_ttc: Fix memory leak in ttc_setup_clockevent()
    clocksource/drivers/ingenic: Fix section mismatch
    clocksource/drivers/riscv: Make RISCV_TIMER depends on RISCV_SBI
    arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE
    iio: hrtimer-trigger: Mark hrtimer to expire in hard interrupt context
    libbpf: Sanitise map names before pinning
    ARM: dts: at91: sam9x60ek: remove bypass property
    ARM: dts: at91: sama5d2: map securam as device
    scripts: kernel-doc: fix parsing function-like typedefs
    bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
    selftests/bpf: Fix invalid use of strncat in test_sockmap
    pinctrl: falcon: add missing put_device() call in pinctrl_falcon_probe()
    soc: rockchip: io-domain: Fix error return code in rockchip_iodomain_probe()
    arm64: dts: rockchip: Fix UART pull-ups on rk3328
    memstick: r592: Fix error return in r592_probe()
    MIPS: Don't round up kernel sections size for memblock_add()
    mt76: mt7663s: fix a possible ple quota underflow
    mt76: mt7915: set fops_sta_stats.owner to THIS_MODULE
    mt76: set fops_tx_stats.owner to THIS_MODULE
    mt76: dma: fix possible deadlock running mt76_dma_cleanup
    net/mlx5: Properly convey driver version to firmware
    mt76: fix memory leak if device probing fails
    mt76: fix tkip configuration for mt7615/7663 devices
    ASoC: jz4740-i2s: add missed checks for clk_get()
    ASoC: q6afe-clocks: Add missing parent clock rate
    dm ioctl: fix error return code in target_message
    ASoC: cros_ec_codec: fix uninitialized memory read
    ASoC: atmel: mchp-spdifrx needs COMMON_CLK
    ASoC: qcom: fix QDSP6 dependencies, attempt #3
    phy: mediatek: allow compile-testing the hdmi phy
    phy: renesas: rcar-gen3-usb2: disable runtime pm in case of failure
    memory: ti-emif-sram: only build for ARMv7
    memory: jz4780_nemc: Fix potential NULL dereference in jz4780_nemc_probe()
    drm/msm: a5xx: Make preemption reset case reentrant
    drm/msm: add IOMMU_SUPPORT dependency
    clocksource/drivers/arm_arch_timer: Use stable count reader in erratum sne
    clocksource/drivers/arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI
    cpufreq: ap806: Add missing MODULE_DEVICE_TABLE
    cpufreq: highbank: Add missing MODULE_DEVICE_TABLE
    cpufreq: mediatek: Add missing MODULE_DEVICE_TABLE
    cpufreq: qcom: Add missing MODULE_DEVICE_TABLE
    cpufreq: st: Add missing MODULE_DEVICE_TABLE
    cpufreq: sun50i: Add missing MODULE_DEVICE_TABLE
    cpufreq: loongson1: Add missing MODULE_ALIAS
    cpufreq: scpi: Add missing MODULE_ALIAS
    cpufreq: vexpress-spc: Add missing MODULE_ALIAS
    cpufreq: imx: fix NVMEM_IMX_OCOTP dependency
    macintosh/adb-iop: Always wait for reply message from IOP
    macintosh/adb-iop: Send correct poll command
    staging: bcm2835: fix vchiq_mmal dependencies
    staging: greybus: audio: Fix possible leak free widgets in gbaudio_dapm_free_controls
    spi: dw: Fix error return code in dw_spi_bt1_probe()
    Bluetooth: btusb: Add the missed release_firmware() in btusb_mtk_setup_firmware()
    Bluetooth: btmtksdio: Add the missed release_firmware() in mtk_setup_firmware()
    Bluetooth: sco: Fix crash when using BT_SNDMTU/BT_RCVMTU option
    block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
    block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
    Bluetooth: btusb: Fix detection of some fake CSR controllers with a bcdDevice val of 0x0134
    platform/x86: intel-vbtn: Fix SW_TABLET_MODE always reporting 1 on some HP x360 models
    adm8211: fix error return code in adm8211_probe()
    mtd: spi-nor: sst: fix BPn bits for the SST25VF064C
    mtd: spi-nor: ignore errors in spi_nor_unlock_all()
    mtd: spi-nor: atmel: remove global protection flag
    mtd: spi-nor: atmel: fix unlock_all() for AT25FS010/040
    arm64: dts: meson: g12b: odroid-n2: fix PHY deassert timing requirements
    arm64: dts: meson: fix PHY deassert timing requirements
    ARM: dts: meson: fix PHY deassert timing requirements
    arm64: dts: meson: g12a: x96-max: fix PHY deassert timing requirements
    arm64: dts: meson: g12b: w400: fix PHY deassert timing requirements
    clk: fsl-sai: fix memory leak
    scsi: qedi: Fix missing destroy_workqueue() on error in __qedi_probe
    scsi: pm80xx: Fix error return in pm8001_pci_probe()
    scsi: iscsi: Fix inappropriate use of put_device()
    seq_buf: Avoid type mismatch for seq_buf_init
    scsi: fnic: Fix error return code in fnic_probe()
    platform/x86: mlx-platform: Fix item counter assignment for MSN2700, MSN24xx systems
    platform/x86: mlx-platform: Fix item counter assignment for MSN2700/ComEx system
    ARM: 9030/1: entry: omit FP emulation for UND exceptions taken in kernel mode
    powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops
    powerpc/pseries/hibernation: remove redundant cacheinfo update
    powerpc/powermac: Fix low_sleep_handler with CONFIG_VMAP_STACK
    drm/mediatek: avoid dereferencing a null hdmi_phy on an error message
    ASoC: amd: change clk_get() to devm_clk_get() and add missed checks
    coresight: remove broken __exit annotations
    ASoC: max98390: Fix error codes in max98390_dsm_init()
    powerpc/mm: sanity_check_fault() should work for all, not only BOOK3S
    usb: ehci-omap: Fix PM disable depth imbalance in ehci_hcd_omap_probe
    usb: oxu210hp-hcd: Fix memory leak in oxu_create
    speakup: fix uninitialized flush_lock
    nfsd: Fix message level for normal termination
    NFSD: Fix 5 seconds delay when doing inter server copy
    nfs_common: need lock during iterate through the list
    x86/kprobes: Restore BTF if the single-stepping is cancelled
    scsi: qla2xxx: Fix FW initialization error on big endian machines
    scsi: qla2xxx: Fix N2N and NVMe connect retry failure
    platform/chrome: cros_ec_spi: Don't overwrite spi::mode
    misc: pci_endpoint_test: fix return value of error branch
    bus: fsl-mc: add back accidentally dropped error check
    bus: fsl-mc: fix error return code in fsl_mc_object_allocate()
    fsi: Aspeed: Add mutex to protect HW access
    s390/cio: fix use-after-free in ccw_device_destroy_console
    iwlwifi: dbg-tlv: fix old length in is_trig_data_contained()
    iwlwifi: mvm: hook up missing RX handlers
    erofs: avoid using generic_block_bmap
    clk: renesas: r8a779a0: Fix R and OSC clocks
    can: m_can: m_can_config_endisable(): remove double clearing of clock stop request bit
    powerpc/sstep: Emulate prefixed instructions only when CPU_FTR_ARCH_31 is set
    powerpc/sstep: Cover new VSX instructions under CONFIG_VSX
    slimbus: qcom: fix potential NULL dereference in qcom_slim_prg_slew()
    ALSA: hda/hdmi: fix silent stream for first playback to DP
    RDMA/core: Do not indicate device ready when device enablement fails
    RDMA/uverbs: Fix incorrect variable type
    remoteproc/mediatek: change MT8192 CFG register base
    remoteproc/mtk_scp: surround DT device IDs with CONFIG_OF
    remoteproc: q6v5-mss: fix error handling in q6v5_pds_enable
    remoteproc: qcom: fix reference leak in adsp_start
    remoteproc: qcom: pas: fix error handling in adsp_pds_enable
    remoteproc: k3-dsp: Fix return value check in k3_dsp_rproc_of_get_memories()
    remoteproc: qcom: Fix potential NULL dereference in adsp_init_mmio()
    remoteproc/mediatek: unprepare clk if scp_before_load fails
    clk: qcom: gcc-sc7180: Use floor ops for sdcc clks
    clk: tegra: Fix duplicated SE clock entry
    mtd: rawnand: gpmi: fix reference count leak in gpmi ops
    mtd: rawnand: meson: Fix a resource leak in init
    mtd: rawnand: gpmi: Fix the random DMA timeout issue
    samples/bpf: Fix possible hang in xdpsock with multiple threads
    fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode()
    extcon: max77693: Fix modalias string
    crypto: atmel-i2c - select CONFIG_BITREVERSE
    mac80211: don't set TDLS STA bandwidth wider than possible
    mac80211: fix a mistake check for rx_stats update
    ASoC: wm_adsp: remove "ctl" from list on error in wm_adsp_create_control()
    irqchip/alpine-msi: Fix freeing of interrupts on allocation error path
    irqchip/ti-sci-inta: Fix printing of inta id on probe success
    irqchip/ti-sci-intr: Fix freeing of irqs
    dmaengine: ti: k3-udma: Correct normal channel offset when uchan_cnt is not 0
    RDMA/hns: Limit the length of data copied between kernel and userspace
    RDMA/hns: Normalization the judgment of some features
    RDMA/hns: Do shift on traffic class when using RoCEv2
    gpiolib: irq hooks: fix recursion in gpiochip_irq_unmask
    ath11k: Fix incorrect tlvs in scan start command
    irqchip/qcom-pdc: Fix phantom irq when changing between rising/falling
    watchdog: armada_37xx: Add missing dependency on HAS_IOMEM
    watchdog: sirfsoc: Add missing dependency on HAS_IOMEM
    watchdog: sprd: remove watchdog disable from resume fail path
    watchdog: sprd: check busy bit before new loading rather than after that
    watchdog: Fix potential dereferencing of null pointer
    ubifs: Fix error return code in ubifs_init_authentication()
    um: Monitor error events in IRQ controller
    um: tty: Fix handling of close in tty lines
    um: chan_xterm: Fix fd leak
    sunrpc: fix xs_read_xdr_buf for partial pages receive
    RDMA/mlx5: Fix MR cache memory leak
    RDMA/cma: Don't overwrite sgid_attr after device is released
    nfc: s3fwrn5: Release the nfc firmware
    drm: mxsfb: Silence -EPROBE_DEFER while waiting for bridge
    powerpc/perf: Fix Threshold Event Counter Multiplier width for P10
    powerpc/ps3: use dma_mapping_error()
    perf test: Fix metric parsing test
    drm/amdgpu: fix regression in vbios reservation handling on headless
    mm/gup: reorganize internal_get_user_pages_fast()
    mm/gup: prevent gup_fast from racing with COW during fork
    mm/gup: combine put_compound_head() and unpin_user_page()
    mm: memcg/slab: fix return of child memcg objcg for root memcg
    mm: memcg/slab: fix use after free in obj_cgroup_charge
    mm/rmap: always do TTU_IGNORE_ACCESS
    sparc: fix handling of page table constructor failure
    mm/vmalloc: Fix unlock order in s_stop()
    mm/vmalloc.c: fix kasan shadow poisoning size
    mm,memory_failure: always pin the page in madvise_inject_error
    hugetlb: fix an error code in hugetlb_reserve_pages()
    mm: don't wake kswapd prematurely when watermark boosting is disabled
    proc: fix lookup in /proc/net subdirectories after setns(2)
    checkpatch: fix unescaped left brace
    s390/test_unwind: fix CALL_ON_STACK tests
    lan743x: fix rx_napi_poll/interrupt ping-pong
    ice, xsk: clear the status bits for the next_to_use descriptor
    i40e, xsk: clear the status bits for the next_to_use descriptor
    net: dsa: qca: ar9331: fix sleeping function called from invalid context bug
    dpaa2-eth: fix the size of the mapped SGT buffer
    net: bcmgenet: Fix a resource leak in an error handling path in the probe function
    net: mscc: ocelot: Fix a resource leak in the error handling path of the probe function
    net: allwinner: Fix some resources leak in the error handling path of the probe and in the remove function
    block/rnbd-clt: Get rid of warning regarding size argument in strlcpy
    block/rnbd-clt: Fix possible memleak
    NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
    net: korina: fix return value
    devlink: use _BITUL() macro instead of BIT() in the UAPI header
    libnvdimm/label: Return -ENXIO for no slot in __blk_label_update
    powerpc/32s: Fix cleanup_cpu_mmu_context() compile bug
    watchdog: qcom: Avoid context switch in restart handler
    watchdog: coh901327: add COMMON_CLK dependency
    clk: ti: Fix memleak in ti_fapll_synth_setup
    pwm: zx: Add missing cleanup in error path
    pwm: lp3943: Dynamically allocate PWM chip base
    pwm: imx27: Fix overflow for bigger periods
    pwm: sun4i: Remove erroneous else branch
    io_uring: cancel only requests of current task
    tools build: Add missing libcap to test-all.bin target
    perf record: Fix memory leak when using '--user-regs=?' to list registers
    qlcnic: Fix error code in probe
    nfp: move indirect block cleanup to flower app stop callback
    vdpa/mlx5: Use write memory barrier after updating CQ index
    virtio_ring: Cut and paste bugs in vring_create_virtqueue_packed()
    virtio_net: Fix error code in probe()
    virtio_ring: Fix two use after free bugs
    vhost scsi: fix error return code in vhost_scsi_set_endpoint()
    epoll: check for events when removing a timed out thread from the wait queue
    clk: bcm: dvp: Add MODULE_DEVICE_TABLE()
    clk: at91: sama7g5: fix compilation error
    clk: at91: sam9x60: remove atmel,osc-bypass support
    clk: s2mps11: Fix a resource leak in error handling paths in the probe function
    clk: sunxi-ng: Make sure divider tables have sentinel
    clk: vc5: Use "idt,voltage-microvolt" instead of "idt,voltage-microvolts"
    kconfig: fix return value of do_error_if()
    powerpc/boot: Fix build of dts/fsl
    powerpc/smp: Add __init to init_big_cores()
    ARM: 9044/1: vfp: use undef hook for VFP support detection
    ARM: 9036/1: uncompress: Fix dbgadtb size parameter name
    perf probe: Fix memory leak when synthesizing SDT probes
    io_uring: fix racy IOPOLL flush overflow
    io_uring: cancel reqs shouldn't kill overflow list
    Smack: Handle io_uring kernel thread privileges
    proc mountinfo: make splice available again
    io_uring: fix io_cqring_events()'s noflush
    io_uring: fix racy IOPOLL completions
    io_uring: always let io_iopoll_complete() complete polled io
    vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
    vfio/pci/nvlink2: Do not attempt NPU2 setup on POWER8NVL NPU
    media: gspca: Fix memory leak in probe
    io_uring: fix io_wqe->work_list corruption
    io_uring: fix 0-iov read buffer select
    io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()
    io_uring: fix ignoring xa_store errors
    io_uring: fix double io_uring free
    io_uring: make ctx cancel on exit targeted to actual ctx
    media: sunxi-cir: ensure IR is handled when it is continuous
    media: netup_unidvb: Don't leak SPI master in probe error path
    media: ipu3-cio2: Remove traces of returned buffers
    media: ipu3-cio2: Return actual subdev format
    media: ipu3-cio2: Serialise access to pad format
    media: ipu3-cio2: Validate mbus format in setting subdev format
    media: ipu3-cio2: Make the field on subdev format V4L2_FIELD_NONE
    Input: cyapa_gen6 - fix out-of-bounds stack access
    ALSA: hda/ca0132 - Change Input Source enum strings.
    ACPI: NFIT: Fix input validation of bus-family
    PM: ACPI: PCI: Drop acpi_pm_set_bridge_wakeup()
    Revert "ACPI / resources: Use AE_CTRL_TERMINATE to terminate resources walks"
    ACPI: PNP: compare the string length in the matching_id()
    ALSA: hda: Fix regressions on clear and reconfig sysfs
    ALSA: hda/ca0132 - Fix AE-5 rear headphone pincfg.
    ALSA: hda/realtek: make bass spk volume adjustable on a yoga laptop
    ALSA: hda/realtek - Enable headset mic of ASUS X430UN with ALC256
    ALSA: hda/realtek - Enable headset mic of ASUS Q524UQK with ALC255
    ALSA: hda/realtek - Add supported for more Lenovo ALC285 Headset Button
    ALSA: pcm: oss: Fix a few more UBSAN fixes
    ALSA/hda: apply jack fixup for the Acer Veriton N4640G/N6640G/N2510G
    ALSA: hda/realtek: Add quirk for MSI-GP73
    ALSA: hda/realtek: Apply jack fixup for Quanta NL3
    ALSA: hda/realtek: Remove dummy lineout on Acer TravelMate P648/P658
    ALSA: hda/realtek - Supported Dell fixed type headset
    ALSA: usb-audio: Add VID to support native DSD reproduction on FiiO devices
    ALSA: usb-audio: Disable sample read check if firmware doesn't give back
    ALSA: usb-audio: Add alias entry for ASUS PRIME TRX40 PRO-S
    ALSA: core: memalloc: add page alignment for iram
    s390/smp: perform initial CPU reset also for SMT siblings
    s390/kexec_file: fix diag308 subcode when loading crash kernel
    s390/idle: add missing mt_cycles calculation
    s390/idle: fix accounting with machine checks
    s390/dasd: fix hanging device offline processing
    s390/dasd: prevent inconsistent LCU device data
    s390/dasd: fix list corruption of pavgroup group list
    s390/dasd: fix list corruption of lcu list
    binder: add flag to clear buffer on txn complete
    ASoC: cx2072x: Fix doubly definitions of Playback and Capture streams
    ASoC: AMD Renoir - add DMI table to avoid the ACP mic probe (broken BIOS)
    ASoC: AMD Raven/Renoir - fix the PCI probe (PCI revision)
    staging: comedi: mf6x4: Fix AI end-of-conversion detection
    z3fold: simplify freeing slots
    z3fold: stricter locking and more careful reclaim
    perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
    perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake
    perf/x86/intel/lbr: Fix the return type of get_lbr_cycles()
    powerpc/perf: Exclude kernel samples while counting events in user space.
    cpufreq: intel_pstate: Use most recent guaranteed performance values
    crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
    crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata
    m68k: Fix WARNING splat in pmac_zilog driver
    Documentation: seqlock: s/LOCKTYPE/LOCKNAME/g
    EDAC/i10nm: Use readl() to access MMIO registers
    EDAC/amd64: Fix PCI component registration
    cpuset: fix race between hotplug work and later CPU offline
    dyndbg: fix use before null check
    USB: serial: mos7720: fix parallel-port state restore
    USB: serial: digi_acceleport: fix write-wakeup deadlocks
    USB: serial: keyspan_pda: fix dropped unthrottle interrupts
    USB: serial: keyspan_pda: fix write deadlock
    USB: serial: keyspan_pda: fix stalled writes
    USB: serial: keyspan_pda: fix write-wakeup use-after-free
    USB: serial: keyspan_pda: fix tx-unthrottle use-after-free
    USB: serial: keyspan_pda: fix write unthrottling
    btrfs: do not shorten unpin len for caching block groups
    btrfs: update last_byte_to_unpin in switch_commit_roots
    btrfs: fix race when defragmenting leads to unnecessary IO
    ext4: fix an IS_ERR() vs NULL check
    ext4: fix a memory leak of ext4_free_data
    ext4: fix deadlock with fs freezing and EA inodes
    ext4: don't remount read-only with errors=continue on reboot
    RISC-V: Fix usage of memblock_enforce_memory_limit
    arm64: dts: ti: k3-am65: mark dss as dma-coherent
    arm64: dts: marvell: keep SMMU disabled by default for Armada 7040 and 8040
    KVM: arm64: Introduce handling of AArch32 TTBCR2 traps
    KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits
    KVM: SVM: Remove the call to sev_platform_status() during setup
    iommu/arm-smmu: Allow implementation specific write_s2cr
    iommu/arm-smmu-qcom: Read back stream mappings
    iommu/arm-smmu-qcom: Implement S2CR quirk
    ARM: dts: pandaboard: fix pinmux for gpio user button of Pandaboard ES
    ARM: dts: at91: sama5d2: fix CAN message ram offset and size
    ARM: tegra: Populate OPP table for Tegra20 Ventana
    xprtrdma: Fix XDRBUF_SPARSE_PAGES support
    powerpc/32: Fix vmap stack - Properly set r1 before activating MMU on syscall too
    powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at
    powerpc/rtas: Fix typo of ibm,open-errinjct in RTAS filter
    powerpc/bitops: Fix possible undefined behaviour with fls() and fls64()
    powerpc/feature: Add CPU_FTR_NOEXECUTE to G2_LE
    powerpc/xmon: Change printk() to pr_cont()
    powerpc/8xx: Fix early debug when SMC1 is relocated
    powerpc/mm: Fix verification of MMU_FTR_TYPE_44x
    powerpc/powernv/npu: Do not attempt NPU2 setup on POWER8NVL NPU
    powerpc/powernv/memtrace: Don't leak kernel memory to user space
    powerpc/powernv/memtrace: Fix crashing the kernel when enabling concurrently
    ovl: make ioctl() safe
    ima: Don't modify file descriptor mode on the fly
    um: Remove use of asprinf in umid.c
    um: Fix time-travel mode
    ceph: fix race in concurrent __ceph_remove_cap invocations
    SMB3: avoid confusing warning message on mount to Azure
    SMB3.1.1: remove confusing mount warning when no SPNEGO info on negprot rsp
    SMB3.1.1: do not log warning message if server doesn't populate salt
    ubifs: wbuf: Don't leak kernel memory to flash
    jffs2: Fix GC exit abnormally
    jffs2: Fix ignoring mounting options problem during remounting
    fsnotify: generalize handle_inode_event()
    inotify: convert to handle_inode_event() interface
    fsnotify: fix events reported to watching parent and child
    jfs: Fix array index bounds check in dbAdjTree
    drm/panfrost: Fix job timeout handling
    drm/panfrost: Move the GPU reset bits outside the timeout handler
    platform/x86: mlx-platform: remove an unused variable
    drm/amdgpu: only set DP subconnector type on DP and eDP connectors
    drm/amd/display: Fix memory leaks in S3 resume
    drm/dp_aux_dev: check aux_dev before use in drm_dp_aux_dev_get_by_minor()
    drm/i915: Fix mismatch between misplaced vma check and vma insert
    iio: ad_sigma_delta: Don't put SPI transfer buffer on the stack
    spi: pxa2xx: Fix use-after-free on unbind
    spi: spi-sh: Fix use-after-free on unbind
    spi: atmel-quadspi: Fix use-after-free on unbind
    spi: spi-mtk-nor: Don't leak SPI master in probe error path
    spi: ar934x: Don't leak SPI master in probe error path
    spi: davinci: Fix use-after-free on unbind
    spi: fsl: fix use of spisel_boot signal on MPC8309
    spi: gpio: Don't leak SPI master in probe error path
    spi: mxic: Don't leak SPI master in probe error path
    spi: npcm-fiu: Disable clock in probe error path
    spi: pic32: Don't leak DMA channels in probe error path
    spi: rb4xx: Don't leak SPI master in probe error path
    spi: rpc-if: Fix use-after-free on unbind
    spi: sc18is602: Don't leak SPI master in probe error path
    spi: spi-geni-qcom: Fix use-after-free on unbind
    spi: spi-qcom-qspi: Fix use-after-free on unbind
    spi: st-ssc4: Fix unbalanced pm_runtime_disable() in probe error path
    spi: synquacer: Disable clock in probe error path
    spi: mt7621: Disable clock in probe error path
    spi: mt7621: Don't leak SPI master in probe error path
    spi: atmel-quadspi: Disable clock in probe error path
    spi: atmel-quadspi: Fix AHB memory accesses
    soc: qcom: smp2p: Safely acquire spinlock without IRQs
    mtd: spinand: Fix OOB read
    mtd: parser: cmdline: Fix parsing of part-names with colons
    mtd: core: Fix refcounting for unpartitioned MTDs
    mtd: rawnand: qcom: Fix DMA sync on FLASH_STATUS register read
    mtd: rawnand: meson: fix meson_nfc_dma_buffer_release() arguments
    scsi: qla2xxx: Fix crash during driver load on big endian machines
    scsi: lpfc: Fix invalid sleeping context in lpfc_sli4_nvmet_alloc()
    scsi: lpfc: Fix scheduling call while in softirq context in lpfc_unreg_rpi
    scsi: lpfc: Re-fix use after free in lpfc_rq_buf_free()
    openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
    iio: buffer: Fix demux update
    iio: adc: rockchip_saradc: fix missing clk_disable_unprepare() on error in rockchip_saradc_resume
    iio: imu: st_lsm6dsx: fix edge-trigger interrupts
    iio:light:rpr0521: Fix timestamp alignment and prevent data leak.
    iio:light:st_uvis25: Fix timestamp alignment and prevent data leak.
    iio:magnetometer:mag3110: Fix alignment and data leak issues.
    iio:pressure:mpl3115: Force alignment of buffer
    iio:imu:bmi160: Fix too large a buffer.
    iio:imu:bmi160: Fix alignment and data leak issues
    iio:adc:ti-ads124s08: Fix buffer being too long.
    iio:adc:ti-ads124s08: Fix alignment and data leak issues.
    md/cluster: block reshape with remote resync job
    md/cluster: fix deadlock when node is doing resync job
    pinctrl: sunxi: Always call chained_irq_{enter, exit} in sunxi_pinctrl_irq_handler
    clk: ingenic: Fix divider calculation with div tables
    clk: mvebu: a3700: fix the XTAL MODE pin to MPP1_9
    clk: tegra: Do not return 0 on failure
    counter: microchip-tcb-capture: Fix CMR value check
    device-dax/core: Fix memory leak when rmmod dax.ko
    dma-buf/dma-resv: Respect num_fences when initializing the shared fence list.
    driver: core: Fix list corruption after device_del()
    xen-blkback: set ring->xenblkd to NULL after kthread_stop()
    xen/xenbus: Allow watches discard events before queueing
    xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()
    xen/xenbus/xen_bus_type: Support will_handle watch callback
    xen/xenbus: Count pending messages for each watch
    xenbus/xenbus_backend: Disallow pending watch messages
    memory: jz4780_nemc: Fix an error pointer vs NULL check in probe()
    memory: renesas-rpc-if: Fix a node reference leak in rpcif_probe()
    memory: renesas-rpc-if: Return correct value to the caller of rpcif_manual_xfer()
    memory: renesas-rpc-if: Fix unbalanced pm_runtime_enable in rpcif_{enable,disable}_rpm
    libnvdimm/namespace: Fix reaping of invalidated block-window-namespace labels
    platform/x86: intel-vbtn: Allow switch events on Acer Switch Alpha 12
    tracing: Disable ftrace selftests when any tracer is running
    mt76: add back the SUPPORTS_REORDERING_BUFFER flag
    of: fix linker-section match-table corruption
    PCI: Fix pci_slot_release() NULL pointer dereference
    regulator: axp20x: Fix DLDO2 voltage control register mask for AXP22x
    remoteproc: sysmon: Ensure remote notification ordering
    thermal/drivers/cpufreq_cooling: Update cpufreq_state only if state has changed
    rtc: ep93xx: Fix NULL pointer dereference in ep93xx_rtc_read_time
    Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
    null_blk: Fix zone size initialization
    null_blk: Fail zone append to conventional zones
    drm/edid: fix objtool warning in drm_cvt_modes()
    x86/CPU/AMD: Save AMD NodeId as cpu_die_id
    Linux 5.10.4

    Signed-off-by: Greg Kroah-Hartman
    Change-Id: I25209e79d8b9faf5382087955a29b7404bdefe38

    Greg Kroah-Hartman
     
  • commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 upstream.

    When defragmenting we skip ranges that have holes or inline extents, so that
    we don't do unnecessary IO and waste space. We do this check when calling
    should_defrag_range() at btrfs_defrag_file(). However we do it without
    holding the inode's lock. We do it this way to avoid blocking, for too
    long, other tasks that may want to operate on other file ranges, since
    after the call to should_defrag_range() and before locking the inode,
    we trigger a synchronous page cache readahead. However
    before we were able to lock the inode, some other task might have punched
    a hole in our range, or we may now have an inline extent there, in which
    case we should not set the range for defrag anymore since that would cause
    unnecessary IO and make us waste space (i.e. allocating extents to contain
    zeros for a hole).

    So after we locked the inode and the range in the iotree, check again if
    we have holes or an inline extent, and if we do, just skip the range.
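
    The re-check described above can be pictured with a minimal user-space
    sketch. It is not the btrfs code: struct range_state, punch_hole() and
    defrag_range() are hypothetical names, and a plain flag stands in for
    the inode lock and the io tree range lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a file range; not a btrfs structure. */
struct range_state {
    bool locked;    /* models the inode lock plus the io tree range lock */
    bool has_hole;  /* other tasks may change this while we are unlocked */
    int  marked;    /* how many times the range was set for defrag */
};

/* Stand-in for a concurrent task that punches a hole in the range. */
static void punch_hole(struct range_state *r)
{
    r->has_hole = true;
}

/*
 * Returns true if the range was marked for defrag. 'racer' may be NULL;
 * if set, it runs in the window between the unlocked check and the lock.
 */
static bool defrag_range(struct range_state *r,
                         void (*racer)(struct range_state *))
{
    bool ok;

    if (r->has_hole)        /* early unlocked check, result may go stale */
        return false;

    if (racer)              /* window where the readahead runs unlocked */
        racer(r);

    assert(!r->locked);
    r->locked = true;
    ok = !r->has_hole;      /* re-check now that the range is locked */
    if (ok)
        r->marked++;
    r->locked = false;
    return ok;
}
```

    Calling defrag_range(&r, punch_hole) on a range that initially has no
    hole models the race: the early check passes, the hole appears, and
    the locked re-check skips the range.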

    I hit this while testing my next patch that fixes races when updating an
    inode's number of bytes (subject "btrfs: update the number of bytes used
    by an inode atomically"), and it depends on this change in order to work
    correctly. Alternatively I could rework that other patch to detect holes
    and flag their range with the 'new delalloc' bit, but this itself fixes
    an efficiency problem due to a race that from a functional point of view is
    not harmful (it could be triggered with btrfs/062 from fstests).

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 27d56e62e4748c2135650c260024e9904b8c1a0a upstream.

    While writing an explanation for the need of the commit_root_sem for
    btrfs_prepare_extent_commit, I realized we have a slight hole that could
    result in leaked space if we have to do the old style caching. Consider
    the following scenario

    commit root
    +----+----+----+----+----+----+----+
    |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
    +----+----+----+----+----+----+----+
    0    1    2    3    4    5    6    7

    new commit root
    +----+----+----+----+----+----+----+
    |    |    |    |\\\\|    |    |\\\\|
    +----+----+----+----+----+----+----+
    0    1    2    3    4    5    6    7

    Prior to this patch, we run btrfs_prepare_extent_commit, which updates
    the last_byte_to_unpin, and then we subsequently run
    switch_commit_roots. In this example let's assume that
    caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
    means that cache->last_byte_to_unpin == 1. Then we go and do the
    switch_commit_roots(), but in the meantime the caching thread has made
    some more progress, because we dropped the commit_root_sem and
    re-acquired it. Now caching_ctl->progress == 3. We swap out the commit
    root and carry on to unpin.

    The race can happen like:

    1) The caching thread was running using the old commit root when it
    found the extent for [2, 3);

    2) Then it released the commit_root_sem because it was in the last
    item of a leaf and the semaphore was contended, and set ->progress
    to 3 (value of 'last'), as the last extent item in the current leaf
    was for the extent for range [2, 3);

    3) Next time it gets the commit_root_sem, it will start using the new
    commit root and search for a key with offset 3, so it never finds
    the hole for [2, 3).

    So the caching thread never saw [2, 3) as free space in any of the
    commit roots, and by the time finish_extent_commit() was called for
    the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
    subrange [0, 1) to the free space cache, skipping [2, 3).

    In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
    but do not unpin [2,3). However because caching_ctl->progress == 3 we
    do not see the newly freed section of [2,3), and thus do not add it to
    our free space cache. This results in us missing a chunk of free space
    in memory (on disk too, unless we have a power failure before writing
    the free space cache to disk).

    Fix this by making sure the ->last_byte_to_unpin is set at the same time
    that we swap the commit roots, this ensures that we will always be
    consistent.
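
    The accounting gap can be checked with a small model of the numbers
    used in this changelog. This is only a sketch under simplifying
    assumptions, not kernel code: lost_units() is a hypothetical helper,
    unpinning is assumed to return offsets below last_byte_to_unpin, and
    the caching thread is assumed to pick up offsets at or above its
    progress when it switches to the new commit root.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the units of [start, end) that neither path returns to the free
 * space cache. Unpinning covers offsets below last_byte_to_unpin; the
 * caching thread covers offsets at or above progress_at_switch.
 */
static uint64_t lost_units(uint64_t start, uint64_t end,
                           uint64_t last_byte_to_unpin,
                           uint64_t progress_at_switch)
{
    uint64_t lost = 0;

    assert(start <= end);
    for (uint64_t off = start; off < end; off++) {
        int by_unpin = off < last_byte_to_unpin;
        int by_caching = off >= progress_at_switch;

        if (!by_unpin && !by_caching)
            lost++;
    }
    return lost;
}
```

    With last_byte_to_unpin == 1 and progress == 3, the range [2, 3) is
    covered by neither side. Taking both values at the same moment, so
    that they are equal, leaves no such gap in this model.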

    CC: stable@vger.kernel.org # 5.8+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    [ update changelog with Filipe's review comments ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 9076dbd5ee837c3882fc42891c14cecd0354a849 upstream.

    While fixing up our ->last_byte_to_unpin locking I noticed that we will
    shorten len based on ->last_byte_to_unpin if we're caching when we're
    adding back the free space. This is correct for the free space, as we
    cannot unpin more than ->last_byte_to_unpin, however we use len to
    adjust the ->bytes_pinned counters and such, which need to track the
    actual pinned usage. This could result in
    WARN_ON(space_info->bytes_pinned) triggering at unmount time.

    Fix this by using a local variable for the amount to add to free space
    cache, and leave len untouched in this case.
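
    A minimal sketch of the idea, with hypothetical names (struct
    space_model, unpin_extent()) and plain counters instead of the real
    btrfs structures: the amount handed to the free space cache is clamped
    in a local variable, while the pinned counter is reduced by the full
    length.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters; not the btrfs space_info layout. */
struct space_model {
    uint64_t bytes_pinned;  /* must track the real pinned usage */
    uint64_t free_space;    /* bytes returned to the free space cache */
};

static void unpin_extent(struct space_model *s, uint64_t start,
                         uint64_t len, uint64_t last_byte_to_unpin)
{
    uint64_t to_add = 0;    /* local amount, 'len' stays untouched */

    assert(s->bytes_pinned >= len);
    if (start < last_byte_to_unpin) {
        to_add = last_byte_to_unpin - start;
        if (to_add > len)
            to_add = len;
    }
    s->free_space += to_add;
    s->bytes_pinned -= len; /* full length, not the clamped one */
}
```

    Unpinning 4 bytes at offset 0 with last_byte_to_unpin == 1 adds 1 byte
    of free space but still drops bytes_pinned by 4, so the counter can
    reach zero by unmount time.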

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     

28 Nov, 2020

2 commits

  • …g/pub/scm/linux/kernel/git/arnd/asm-generic") into android-mainline

    Steps on the way to 5.10-rc5

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: I644783003a83186a34cdbb753aa492f4350f49ee

    Greg Kroah-Hartman
     
  • Pull btrfs fixes from David Sterba:
    "A few fixes for various warnings that accumulated over the past two weeks:

    - tree-checker: add missing return values for some errors

    - lockdep fixes
      - when reading qgroup config and starting quota rescan
      - reverse order of quota ioctl lock and VFS freeze lock

    - avoid accessing potentially stale fs info during device scan,
      reported by syzbot

    - add scope NOFS protection around qgroup relation changes

    - check for running transaction before flushing qgroups

    - fix tracking of new delalloc ranges for some cases"

    * tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: fix lockdep splat when enabling and disabling qgroups
    btrfs: do nofs allocations when adding and removing qgroup relations
    btrfs: fix lockdep splat when reading qgroup config on mount
    btrfs: tree-checker: add missing returns after data_ref alignment checks
    btrfs: don't access possibly stale fs_info data for printing duplicate device
    btrfs: tree-checker: add missing return after error in root_item
    btrfs: qgroup: don't commit transaction when we already hold the handle
    btrfs: fix missing delalloc new bit for new delalloc ranges

    Linus Torvalds
     

24 Nov, 2020

5 commits

  • When running test case btrfs/017 from fstests, lockdep reported the
    following splat:

    [ 1297.067385] ======================================================
    [ 1297.067708] WARNING: possible circular locking dependency detected
    [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
    [ 1297.068322] ------------------------------------------------------
    [ 1297.068629] btrfs/189080 is trying to acquire lock:
    [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.069274]
    but task is already holding lock:
    [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
    [ 1297.070219]
    which lock already depends on the new lock.

    [ 1297.071131]
    the existing dependency chain (in reverse order) is:
    [ 1297.071721]
    -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
    [ 1297.072375] lock_acquire+0xd8/0x490
    [ 1297.072710] __mutex_lock+0xa3/0xb30
    [ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
    [ 1297.073421] create_subvol+0x194/0x990 [btrfs]
    [ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
    [ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
    [ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
    [ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs]
    [ 1297.075245] __x64_sys_ioctl+0x83/0xb0
    [ 1297.075617] do_syscall_64+0x33/0x80
    [ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 1297.076380]
    -> #0 (sb_internal#2){.+.+}-{0:0}:
    [ 1297.077166] check_prev_add+0x91/0xc60
    [ 1297.077572] __lock_acquire+0x1740/0x3110
    [ 1297.077984] lock_acquire+0xd8/0x490
    [ 1297.078411] start_transaction+0x3c5/0x760 [btrfs]
    [ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
    [ 1297.079789] __x64_sys_ioctl+0x83/0xb0
    [ 1297.080232] do_syscall_64+0x33/0x80
    [ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 1297.081139]
    other info that might help us debug this:

    [ 1297.082536] Possible unsafe locking scenario:

    [ 1297.083510]        CPU0                    CPU1
    [ 1297.084005]        ----                    ----
    [ 1297.084500]   lock(&fs_info->qgroup_ioctl_lock);
    [ 1297.084994]                                lock(sb_internal#2);
    [ 1297.085485]                                lock(&fs_info->qgroup_ioctl_lock);
    [ 1297.085974]   lock(sb_internal#2);
    [ 1297.086454]
    *** DEADLOCK ***
    [ 1297.087880] 3 locks held by btrfs/189080:
    [ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
    [ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
    [ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
    [ 1297.089771]
    stack backtrace:
    [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
    [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    [ 1297.092123] Call Trace:
    [ 1297.092629] dump_stack+0x8d/0xb5
    [ 1297.093115] check_noncircular+0xff/0x110
    [ 1297.093596] check_prev_add+0x91/0xc60
    [ 1297.094076] ? kvm_clock_read+0x14/0x30
    [ 1297.094553] ? kvm_sched_clock_read+0x5/0x10
    [ 1297.095029] __lock_acquire+0x1740/0x3110
    [ 1297.095510] lock_acquire+0xd8/0x490
    [ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.096476] start_transaction+0x3c5/0x760 [btrfs]
    [ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs]
    [ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
    [ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
    [ 1297.098904] ? do_user_addr_fault+0x20c/0x430
    [ 1297.099382] ? kvm_clock_read+0x14/0x30
    [ 1297.099854] ? kvm_sched_clock_read+0x5/0x10
    [ 1297.100328] ? sched_clock+0x5/0x10
    [ 1297.100801] ? sched_clock_cpu+0x12/0x180
    [ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0
    [ 1297.101739] __x64_sys_ioctl+0x83/0xb0
    [ 1297.102207] do_syscall_64+0x33/0x80
    [ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 1297.103148] RIP: 0033:0x7f773ff65d87

    This is because during the quota enable ioctl we lock first the mutex
    qgroup_ioctl_lock and then start a transaction, and starting a transaction
    acquires a fs freeze semaphore (at the VFS level). However, in every other
    code path, except for the quota disable ioctl path, we do the opposite:
    we start a transaction and then lock the mutex.

    So fix this by making the quota enable and disable paths start the
    transaction without having the mutex locked, and then, after starting the
    transaction, lock the mutex and check if some other task already enabled
    or disabled the quotas, bailing with success if that was the case.
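
    The ordering and the re-check can be sketched in user space as below.
    This is an illustrative model, not the btrfs implementation: struct
    fs_model, take() and quota_enable() are hypothetical, and lock
    acquisition is reduced to recording the order in which locks are
    taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Lower value means "must be taken first" in this model. */
enum lock_id { LOCK_NONE = 0, LOCK_FREEZE = 1, LOCK_QGROUP_IOCTL = 2 };

struct fs_model {
    bool quota_enabled;
    int enable_count;       /* times quotas were actually switched on */
    enum lock_id last_taken;
    int order_violations;   /* a lock taken after a higher ranked one */
};

static void take(struct fs_model *fs, enum lock_id id)
{
    assert(id != LOCK_NONE);
    if (id < fs->last_taken)
        fs->order_violations++;
    fs->last_taken = id;
}

static void release_all(struct fs_model *fs)
{
    fs->last_taken = LOCK_NONE;
}

static int quota_enable(struct fs_model *fs)
{
    take(fs, LOCK_FREEZE);        /* starting the transaction comes first */
    take(fs, LOCK_QGROUP_IOCTL);  /* only then lock the mutex */

    if (!fs->quota_enabled) {     /* another task may have enabled it */
        fs->quota_enabled = true;
        fs->enable_count++;
    }
    release_all(fs);
    return 0;                     /* success in both cases */
}
```

    Calling quota_enable() twice reports success both times, switches the
    quotas on only once, and never records an ordering violation.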

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • When adding or removing a qgroup relation we are doing a GFP_KERNEL
    allocation which is not safe because we are holding a transaction
    handle open and that can make us deadlock if the allocator needs to
    recurse into the filesystem. So just surround those calls with a
    nofs context.
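
    In the kernel, a scoped nofs context is entered and left with
    memalloc_nofs_save() and memalloc_nofs_restore(). The user-space
    sketch below only models the idea and is not kernel code: the
    MODEL_GFP_* flags and the model_* helpers are hypothetical. While the
    scope is active, the "may recurse into the filesystem" bit is masked
    from every allocation request, and saving the previous state lets
    scopes nest.

```c
#include <assert.h>

#define MODEL_GFP_IO     0x1u
#define MODEL_GFP_FS     0x2u   /* allocator may recurse into the fs */
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)

/* Models a per-task flag; a real kernel keeps this in the task struct. */
static unsigned int task_nofs;

/* Enter a nofs scope and return the previous state for nesting. */
static unsigned int model_nofs_save(void)
{
    unsigned int old = task_nofs;

    task_nofs = 1;
    return old;
}

/* Leave a nofs scope by restoring the state saved on entry. */
static void model_nofs_restore(unsigned int old)
{
    assert(old <= 1);
    task_nofs = old;
}

/* Flags the allocator would really use for a request. */
static unsigned int effective_gfp(unsigned int flags)
{
    return task_nofs ? (flags & ~MODEL_GFP_FS) : flags;
}
```

    The point of the scoped approach is that callees requesting
    MODEL_GFP_KERNEL do not need to change: the mask is applied for
    everything inside the save/restore pair.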

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • Lockdep reported the following splat when running test btrfs/190 from
    fstests:

    [ 9482.126098] ======================================================
    [ 9482.126184] WARNING: possible circular locking dependency detected
    [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
    [ 9482.126365] ------------------------------------------------------
    [ 9482.126456] mount/24187 is trying to acquire lock:
    [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.126647]
    but task is already holding lock:
    [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
    [ 9482.126886]
    which lock already depends on the new lock.

    [ 9482.127078]
    the existing dependency chain (in reverse order) is:
    [ 9482.127213]
    -> #1 (btrfs-quota-00){++++}-{3:3}:
    [ 9482.127366] lock_acquire+0xd8/0x490
    [ 9482.127436] down_read_nested+0x45/0x220
    [ 9482.127528] __btrfs_tree_read_lock+0x27/0x120 [btrfs]
    [ 9482.127613] btrfs_read_lock_root_node+0x41/0x130 [btrfs]
    [ 9482.127702] btrfs_search_slot+0x514/0xc30 [btrfs]
    [ 9482.127788] update_qgroup_status_item+0x72/0x140 [btrfs]
    [ 9482.127877] btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs]
    [ 9482.127964] btrfs_work_helper+0xf1/0x600 [btrfs]
    [ 9482.128039] process_one_work+0x24e/0x5e0
    [ 9482.128110] worker_thread+0x50/0x3b0
    [ 9482.128181] kthread+0x153/0x170
    [ 9482.128256] ret_from_fork+0x22/0x30
    [ 9482.128327]
    -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}:
    [ 9482.128464] check_prev_add+0x91/0xc60
    [ 9482.128551] __lock_acquire+0x1740/0x3110
    [ 9482.128623] lock_acquire+0xd8/0x490
    [ 9482.130029] __mutex_lock+0xa3/0xb30
    [ 9482.130590] qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.131577] btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
    [ 9482.132175] open_ctree+0x1228/0x18a0 [btrfs]
    [ 9482.132756] btrfs_mount_root.cold+0x13/0xed [btrfs]
    [ 9482.133325] legacy_get_tree+0x30/0x60
    [ 9482.133866] vfs_get_tree+0x28/0xe0
    [ 9482.134392] fc_mount+0xe/0x40
    [ 9482.134908] vfs_kern_mount.part.0+0x71/0x90
    [ 9482.135428] btrfs_mount+0x13b/0x3e0 [btrfs]
    [ 9482.135942] legacy_get_tree+0x30/0x60
    [ 9482.136444] vfs_get_tree+0x28/0xe0
    [ 9482.136949] path_mount+0x2d7/0xa70
    [ 9482.137438] do_mount+0x75/0x90
    [ 9482.137923] __x64_sys_mount+0x8e/0xd0
    [ 9482.138400] do_syscall_64+0x33/0x80
    [ 9482.138873] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 9482.139346]
    other info that might help us debug this:

    [ 9482.140735] Possible unsafe locking scenario:

    [ 9482.141594]        CPU0                    CPU1
    [ 9482.142011]        ----                    ----
    [ 9482.142411]   lock(btrfs-quota-00);
    [ 9482.142806]                                lock(&fs_info->qgroup_rescan_lock);
    [ 9482.143216]                                lock(btrfs-quota-00);
    [ 9482.143629]   lock(&fs_info->qgroup_rescan_lock);
    [ 9482.144056]
    *** DEADLOCK ***

    [ 9482.145242] 2 locks held by mount/24187:
    [ 9482.145637] #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400
    [ 9482.146061] #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
    [ 9482.146509]
    stack backtrace:
    [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1
    [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    [ 9482.148709] Call Trace:
    [ 9482.149169] dump_stack+0x8d/0xb5
    [ 9482.149628] check_noncircular+0xff/0x110
    [ 9482.150090] check_prev_add+0x91/0xc60
    [ 9482.150561] ? kvm_clock_read+0x14/0x30
    [ 9482.151017] ? kvm_sched_clock_read+0x5/0x10
    [ 9482.151470] __lock_acquire+0x1740/0x3110
    [ 9482.151941] ? __btrfs_tree_read_lock+0x27/0x120 [btrfs]
    [ 9482.152402] lock_acquire+0xd8/0x490
    [ 9482.152887] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.153354] __mutex_lock+0xa3/0xb30
    [ 9482.153826] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.154301] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.154768] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.155226] qgroup_rescan_init+0x43/0xf0 [btrfs]
    [ 9482.155690] btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
    [ 9482.156160] open_ctree+0x1228/0x18a0 [btrfs]
    [ 9482.156643] btrfs_mount_root.cold+0x13/0xed [btrfs]
    [ 9482.157108] ? rcu_read_lock_sched_held+0x5d/0x90
    [ 9482.157567] ? kfree+0x31f/0x3e0
    [ 9482.158030] legacy_get_tree+0x30/0x60
    [ 9482.158489] vfs_get_tree+0x28/0xe0
    [ 9482.158947] fc_mount+0xe/0x40
    [ 9482.159403] vfs_kern_mount.part.0+0x71/0x90
    [ 9482.159875] btrfs_mount+0x13b/0x3e0 [btrfs]
    [ 9482.160335] ? rcu_read_lock_sched_held+0x5d/0x90
    [ 9482.160805] ? kfree+0x31f/0x3e0
    [ 9482.161260] ? legacy_get_tree+0x30/0x60
    [ 9482.161714] legacy_get_tree+0x30/0x60
    [ 9482.162166] vfs_get_tree+0x28/0xe0
    [ 9482.162616] path_mount+0x2d7/0xa70
    [ 9482.163070] do_mount+0x75/0x90
    [ 9482.163525] __x64_sys_mount+0x8e/0xd0
    [ 9482.163986] do_syscall_64+0x33/0x80
    [ 9482.164437] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 9482.164902] RIP: 0033:0x7f51e907caaa

    This happens because at btrfs_read_qgroup_config() we can call
    qgroup_rescan_init() while holding a read lock on a quota btree leaf,
    acquired by the previous call to btrfs_search_slot_for_read(), and
    qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.

    A qgroup rescan worker does the opposite: it acquires the mutex
    qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
    update the qgroup status item in the quota btree through the call to
    update_qgroup_status_item(). This inversion of locking order
    between the qgroup_rescan_lock mutex and quota btree locks causes the
    splat.

    Fix this simply by releasing and freeing the path before calling
    qgroup_rescan_init() at btrfs_read_qgroup_config().
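
    The inversion can be illustrated with a toy order recorder, loosely in
    the spirit of what lockdep tracks. It is a sketch only: the lock ids,
    record_order() and the three path functions are hypothetical and do
    not correspond to kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>

enum { L_QUOTA_LEAF, L_RESCAN_MUTEX, L_COUNT };

/* seen[a][b] is true once some path took lock b while holding lock a. */
static bool seen[L_COUNT][L_COUNT];

/*
 * Record "b acquired while holding a". Returns true if the opposite
 * order was recorded earlier, i.e. the two paths can deadlock.
 */
static bool record_order(int a, int b)
{
    assert(a != b);
    seen[a][b] = true;
    return seen[b][a];
}

/* Rescan worker: takes the mutex, then locks a quota btree leaf. */
static bool rescan_worker(void)
{
    return record_order(L_RESCAN_MUTEX, L_QUOTA_LEAF);
}

/* Mount path before the fix: leaf still locked when taking the mutex. */
static bool mount_before_fix(void)
{
    return record_order(L_QUOTA_LEAF, L_RESCAN_MUTEX);
}

/* Mount path after the fix: the path was released, nothing is held. */
static bool mount_after_fix(void)
{
    return false;   /* no "held while acquiring" pair to record */
}
```

    After rescan_worker() has run, mount_before_fix() reports an
    inversion, while mount_after_fix() records no dependency at all.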

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • There are sectorsize alignment checks that are reported but then
    check_extent_data_ref continues. This was not intended: wrong alignment
    is not a minor problem and we should return with an error.
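
    The shape of the fix can be sketched as follows. This is a
    hypothetical stand-alone checker, not the tree-checker code: struct
    data_ref, check_data_refs() and MODEL_EUCLEAN (modelling the kernel's
    -EUCLEAN result) are names assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_EUCLEAN 117   /* stands in for the kernel's EUCLEAN */

/* Hypothetical item payload; only the offset matters here. */
struct data_ref {
    uint64_t offset;
};

/*
 * Validate every ref. On the first misaligned offset, report it and
 * return an error right away instead of carrying on with bad data.
 */
static int check_data_refs(const struct data_ref *refs, size_t n,
                           uint32_t sectorsize, int *reported)
{
    assert(sectorsize != 0);
    for (size_t i = 0; i < n; i++) {
        if (refs[i].offset % sectorsize != 0) {
            (*reported)++;          /* the error message would go here */
            return -MODEL_EUCLEAN;  /* the return that was missing */
        }
    }
    return 0;
}
```

    With two misaligned refs in one item, only the first is reported and
    the caller sees the error immediately.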

    CC: stable@vger.kernel.org # 5.4+
    Fixes: 0785a9aacf9d ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     
  • Syzbot reported a possible use-after-free when printing a duplicate device
    warning in device_list_add().

    At this point it can happen that a btrfs_device::fs_info is not correctly
    set up yet, so we're accessing stale data when printing the warning
    message using the btrfs_printk() wrappers.

    ==================================================================
    BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
    Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068

    CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1d6/0x29e lib/dump_stack.c:118
    print_address_description+0x66/0x620 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report+0x132/0x1d0 mm/kasan/report.c:530
    btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
    device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
    btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
    btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x44840a
    RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
    RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
    RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
    RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
    R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003

    Allocated by task 6945:
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track mm/kasan/common.c:56 [inline]
    __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
    kmalloc_node include/linux/slab.h:577 [inline]
    kvmalloc_node+0x81/0x110 mm/util.c:574
    kvmalloc include/linux/mm.h:757 [inline]
    kvzalloc include/linux/mm.h:765 [inline]
    btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 6945:
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
    kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
    __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
    __cache_free mm/slab.c:3418 [inline]
    kfree+0x113/0x200 mm/slab.c:3756
    deactivate_locked_super+0xa7/0xf0 fs/super.c:335
    btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The buggy address belongs to the object at ffff8880878e0000
    which belongs to the cache kmalloc-16k of size 16384
    The buggy address is located 1704 bytes inside of
    16384-byte region [ffff8880878e0000, ffff8880878e4000)
    The buggy address belongs to the page:
    page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
    head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
    flags: 0xfffe0000010200(slab|head)
    raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
    raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    The syzkaller reproducer for this use-after-free crafts a filesystem image
    and loop mounts it twice in a loop. The mount will fail as the crafted
    image has an invalid chunk tree. When this happens btrfs_mount_root() will
    call deactivate_locked_super(), which then cleans up fs_info and
    fs_info::sb. If a second thread now adds the same block-device to the
    filesystem, it will get detected as a duplicate device and
    device_list_add() will reject the duplicate and print a warning. But as
    the fs_info pointer passed in is non-NULL this will result in a
    use-after-free.

    Instead of printing possibly uninitialized or already freed memory in
    btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
    device name will be skipped altogether.
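
    The idea of passing a NULL context so the device name is never read can
    be sketched in user-space C. This is only an illustration: struct fs_ctx,
    format_log() and the message layout are hypothetical stand-ins, not the
    kernel's btrfs_printk() implementation.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the filesystem context; not the kernel struct. */
struct fs_ctx {
    const char *dev_name;
};

/*
 * Format a warning into buf. When ctx is NULL the device name is left out,
 * so a stale or already freed context is never dereferenced.
 * Returns the number of characters snprintf() would have written.
 */
int format_log(char *buf, size_t len, const struct fs_ctx *ctx, const char *msg)
{
    if (ctx == NULL)
        return snprintf(buf, len, "BTRFS warning: %s", msg);

    return snprintf(buf, len, "BTRFS warning (device %s): %s",
                    ctx->dev_name, msg);
}
```

    A caller that cannot guarantee the context is still alive passes NULL and
    gets a message without the device name.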

    There was a slightly different approach discussed in
    https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u

    Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
    Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     

14 Nov, 2020

3 commits

  • There's a missing return statement after an error is found in the
    root_item, this can cause further problems when a crafted image triggers
    the error.
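
    The pattern being fixed can be sketched as follows. The function name,
    parameters and error code are hypothetical and chosen only to show why
    each failed check has to return before later checks run.

```c
#include <stddef.h>

#define ERR_CORRUPT (-1) /* placeholder error code for this sketch */

/*
 * Validate an item in two steps. Each failed check returns immediately,
 * so the later checks never look at data already known to be bad.
 */
int check_item(size_t item_size, size_t expected_size,
               unsigned int level, unsigned int max_level)
{
    if (item_size != expected_size)
        return ERR_CORRUPT; /* without this return, the checks below would still run */

    if (level >= max_level)
        return ERR_CORRUPT;

    return 0;
}
```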

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
    Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Daniel Xu
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Daniel Xu
     
  • [BUG]
    When running the following script, btrfs will trigger an ASSERT():

    #!/bin/bash
    mkfs.btrfs -f $dev
    mount $dev $mnt
    xfs_io -f -c "pwrite 0 1G" $mnt/file
    sync
    btrfs quota enable $mnt
    btrfs quota rescan -w $mnt

    # Manually set the limit below current usage
    btrfs qgroup limit 512M $mnt $mnt

    # Crash happens
    touch $mnt/file

    The dmesg looks like this:

    assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/ctree.h:3230!
    invalid opcode: 0000 [#1] SMP PTI
    RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
    btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
    try_flush_qgroup+0x67/0x100 [btrfs]
    __btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
    btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
    btrfs_update_inode+0x9d/0x110 [btrfs]
    btrfs_dirty_inode+0x5d/0xd0 [btrfs]
    touch_atime+0xb5/0x100
    iterate_dir+0xf1/0x1b0
    __x64_sys_getdents64+0x78/0x110
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fb5afe588db

    [CAUSE]
    In try_flush_qgroup(), we assume we don't hold a transaction handle at
    all. This is true for data reservation and mostly true for metadata,
    since data space reservation always happens before we start a
    transaction, and for most metadata operations we reserve space in
    start_transaction().

    But there is an exception, btrfs_delayed_inode_reserve_metadata().
    It holds a transaction handle, while still trying to reserve extra
    metadata space.

    When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we
    will join the current transaction and commit it, while we still have a
    transaction handle from the qgroup code.

    [FIX]
    Let's check current->journal_info before we join the transaction.

    If current->journal_info is unset or BTRFS_SEND_TRANS_STUB, it means
    we are not holding a transaction, thus are able to join and then commit
    the transaction.

    If current->journal_info is a valid transaction handle, we avoid
    committing the transaction and just end it.

    This is less effective than committing the current transaction, as it
    won't free metadata reserved space, but we may still free some data
    space before new data writes.
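
    The decision described above can be sketched as a small helper. The
    names are hypothetical, and the sentinel value used for the stub is an
    assumption of this sketch: all that matters is that it can never be a
    real handle.

```c
#include <stddef.h>

/* Assumption: the stub is a sentinel pointer that is never a real handle. */
#define SEND_TRANS_STUB ((void *)1)

enum flush_action {
    FLUSH_COMMIT,   /* no handle held: safe to join and commit */
    FLUSH_END_ONLY  /* a handle is held: only end the joined transaction */
};

/* Decide what to do based on the per-task journal pointer. */
enum flush_action choose_flush_action(const void *journal)
{
    if (journal == NULL || journal == SEND_TRANS_STUB)
        return FLUSH_COMMIT;

    return FLUSH_END_ONLY;
}
```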

    Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
    Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • When doing a buffered write, through one of the write family syscalls, we
    look for ranges which currently don't have allocated extents and set the
    'delalloc new' bit on them, so that we can report a correct number of used
    blocks to the stat(2) syscall until delalloc is flushed and ordered extents
    complete.

    However there are a few other places where we can do a buffered write
    against a range that is mapped to a hole (no extent allocated) and where
    we do not set the 'delalloc new' bit. Those places are:

    - Doing a memory mapped write against a hole;

    - Cloning an inline extent into a hole starting at file offset 0;

    - Calling btrfs_cont_expand() when the i_size of the file is not aligned
    to the sector size and is located in a hole. For example when cloning
    to a destination offset beyond EOF.

    So after such cases, until the corresponding delalloc range is flushed and
    the respective ordered extents complete, we can report an incorrect number
    of blocks used through the stat(2) syscall.

    In some cases we can end up reporting 0 used blocks to stat(2), which is a
    particularly bad value to report as it may mislead tools to think a file is
    completely sparse when its i_size is not zero, making them skip reading
    any data, an undesired consequence for tools such as archivers and other
    backup tools, as reported a long time ago in the following thread (and
    other past threads):

    https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
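
    The kind of heuristic that gets misled can be sketched like this. The
    helper name is hypothetical and real tools may use a more elaborate
    test; the sketch only shows why a block count of 0 with a non-zero size
    is dangerous.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Returns true when the stat(2) result suggests the file holds no data at
 * all: a non-zero size but no allocated blocks. A tool trusting this would
 * skip reading the file contents.
 */
bool looks_fully_sparse(const struct stat *st)
{
    return st->st_size > 0 && st->st_blocks == 0;
}
```

    With the bug described here, a 64K file with dirty mmap'ed pages would
    satisfy this test even though it contains data.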

    Example reproducer:

    $ cat reproducer.sh
    #!/bin/bash

    MNT=/mnt/sdi
    DEV=/dev/sdi

    mkfs.btrfs -f $DEV > /dev/null
    # mkfs.xfs -f $DEV > /dev/null
    # mkfs.ext4 -F $DEV > /dev/null
    # mkfs.f2fs -f $DEV > /dev/null
    mount $DEV $MNT

    xfs_io -f -c "truncate 64K" \
    -c "mmap -w 0 64K" \
    -c "mwrite -S 0xab 0 64K" \
    -c "munmap" \
    $MNT/foo

    blocks_used=$(stat -c %b $MNT/foo)
    echo "blocks used: $blocks_used"

    if [ $blocks_used -eq 0 ]; then
    echo "ERROR: blocks used is 0"
    fi

    umount $DEV

    $ ./reproducer.sh
    blocks used: 0
    ERROR: blocks used is 0

    So move the logic that decides to set the 'delalloc new' bit into the
    function btrfs_set_extent_delalloc(), since that is what we use for all
    those missing cases as well as for the cases that currently work well.

    This change is also preparatory work for an upcoming patch that fixes
    other problems related to tracking and reporting the number of bytes used
    by an inode.

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

12 Nov, 2020

1 commit


11 Nov, 2020

1 commit

  • Pull btrfs fixes from David Sterba:
    "A handful of minor fixes and updates:

    - handle missing device replace item on mount (syzbot report)

    - fix space reservation calculation when finishing relocation

    - fix memory leak on error path in ref-verify (debugging feature)

    - fix potential overflow during defrag on 32bit arches

    - minor code update to silence smatch warning

    - minor error message updates"

    * tag 'for-5.10-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod
    btrfs: dev-replace: fail mount if we don't have replace item with target device
    btrfs: scrub: update message regarding read-only status
    btrfs: clean up NULL checks in qgroup_unreserve_range()
    btrfs: fix min reserved size calculation in merge_reloc_root
    btrfs: print the block rsv type when we fail our reservation
    btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch

    Linus Torvalds
     

05 Nov, 2020

7 commits

  • There is one error handling path that does not free ref, which may cause
    a minor memory leak.

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Dinghao Liu
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dinghao Liu
     
  • If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
    item, then it means the filesystem is in an inconsistent state. This is
    either corruption or a crafted image. Fail the mount as this needs a
    closer look at what is actually wrong.

    As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
    item, in __btrfs_free_extra_devids() we determine that there is an
    extra device, and free those extra devices but continue to mount the
    device.
    However, we were wrong in keeping track of the rw_devices so the syzbot
    testcase failed:

    WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x198/0x1fd lib/dump_stack.c:118
    panic+0x347/0x7c0 kernel/panic.c:231
    __warn.cold+0x20/0x46 kernel/panic.c:600
    report_bug+0x1bd/0x210 lib/bug.c:198
    handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
    exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
    asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
    RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
    RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
    RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
    RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
    RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
    R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
    close_fs_devices fs/btrfs/volumes.c:1193 [inline]
    btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
    open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
    btrfs_fill_super fs/btrfs/super.c:1316 [inline]
    btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672

    The fix here is, when we determine that there isn't a replace item
    then fail the mount if there is a replace target device (devid 0).
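
    A minimal sketch of that check, with a plain array of device ids
    standing in for the device list. The function and macro names are
    hypothetical; only the rule (no replace item plus a devid 0 device
    means the mount must fail) comes from the text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACE_TARGET_DEVID 0 /* devid 0, as stated in the text */

/*
 * Returns false when the mount should be refused: there is no replace
 * item, yet a device with the replace target id is present.
 */
bool mount_allowed(const uint64_t *devids, size_t count, bool have_replace_item)
{
    if (have_replace_item)
        return true;

    for (size_t i = 0; i < count; i++) {
        if (devids[i] == REPLACE_TARGET_DEVID)
            return false;
    }
    return true;
}
```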

    CC: stable@vger.kernel.org # 4.19+
    Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • Based on user feedback update the message printed when scrub fails to
    start due to write requirements. To make a distinction add a device id
    to the messages.

    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    David Sterba
     
  • Smatch complains that this code dereferences "entry" before checking
    whether it's NULL on the next line. Fortunately, rb_entry() will never
    return NULL so it doesn't cause a problem. We can clean up the NULL
    checking a bit to silence the warning and make the code more clear.
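
    Why the converted pointer is never NULL can be shown with a user-space
    sketch of the same container-of arithmetic. The types and the
    entry_of() macro are hypothetical; the point is that the NULL test has
    to be made on the node pointer before the conversion.

```c
#include <stddef.h>

struct node {
    struct node *left, *right;
};

struct range_entry {
    unsigned long start;
    struct node link; /* embedded at a non-zero offset */
};

/* Pointer arithmetic only: this cannot produce NULL from a valid node. */
#define entry_of(ptr) \
    ((struct range_entry *)((char *)(ptr) - offsetof(struct range_entry, link)))

/* Check the node itself, then convert. */
struct range_entry *entry_or_null(struct node *n)
{
    if (n == NULL)
        return NULL;

    return entry_of(n);
}
```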

    Reviewed-by: Qu Wenruo
    Signed-off-by: Dan Carpenter
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dan Carpenter
     
  • The minimum reserve size was adjusted to take into account the height of
    the tree we are merging, however we can have a root with a level == 0.
    What we want is root_level + 1 to get the number of nodes we may have to
    cow. This fixes the enospc_debug warning that pops up with btrfs/101.
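
    The arithmetic can be sketched as below, assuming the reservation is
    simply the node size multiplied by the number of nodes to cow. The
    helper names are hypothetical and the kernel may apply additional
    factors that are not shown here.

```c
#include <stdint.h>

/* Nodes on the path from the root to a leaf: levels 0..root_level. */
uint64_t nodes_to_cow(unsigned int root_level)
{
    return (uint64_t)root_level + 1;
}

/* Assumed reservation: one node size per node on that path. */
uint64_t min_reserved_bytes(uint32_t nodesize, unsigned int root_level)
{
    return (uint64_t)nodesize * nodes_to_cow(root_level);
}
```

    Multiplying by root_level alone would reserve nothing for a level 0
    root, which is the case the fix addresses.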

    Nikolay: this fixes failures on btrfs/060, btrfs/062, btrfs/063 and
    btrfs/195 that I was seeing, the call trace was:

    [ 3680.515564] ------------[ cut here ]------------
    [ 3680.515566] BTRFS: block rsv returned -28
    [ 3680.515585] WARNING: CPU: 2 PID: 8339 at fs/btrfs/block-rsv.c:521 btrfs_use_block_rsv+0x162/0x180
    [ 3680.515587] Modules linked in:
    [ 3680.515591] CPU: 2 PID: 8339 Comm: btrfs Tainted: G W 5.9.0-rc8-default #95
    [ 3680.515593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
    [ 3680.515595] RIP: 0010:btrfs_use_block_rsv+0x162/0x180
    [ 3680.515600] RSP: 0018:ffffa01ac9753910 EFLAGS: 00010282
    [ 3680.515602] RAX: 0000000000000000 RBX: ffff984b34200000 RCX: 0000000000000027
    [ 3680.515604] RDX: 0000000000000027 RSI: 0000000000000000 RDI: ffff984b3bd19e28
    [ 3680.515606] RBP: 0000000000004000 R08: ffff984b3bd19e20 R09: 0000000000000001
    [ 3680.515608] R10: 0000000000000004 R11: 0000000000000046 R12: ffff984b264fdc00
    [ 3680.515609] R13: ffff984b13149000 R14: 00000000ffffffe4 R15: ffff984b34200000
    [ 3680.515613] FS: 00007f4e2912b8c0(0000) GS:ffff984b3bd00000(0000) knlGS:0000000000000000
    [ 3680.515615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 3680.515617] CR2: 00007fab87122150 CR3: 0000000118e42000 CR4: 00000000000006e0
    [ 3680.515620] Call Trace:
    [ 3680.515627] btrfs_alloc_tree_block+0x8b/0x340
    [ 3680.515633] ? __lock_acquire+0x51a/0xac0
    [ 3680.515646] alloc_tree_block_no_bg_flush+0x4f/0x60
    [ 3680.515651] __btrfs_cow_block+0x14e/0x7e0
    [ 3680.515662] btrfs_cow_block+0x144/0x2c0
    [ 3680.515670] merge_reloc_root+0x4d4/0x610
    [ 3680.515675] ? btrfs_lookup_fs_root+0x78/0x90
    [ 3680.515686] merge_reloc_roots+0xee/0x280
    [ 3680.515695] relocate_block_group+0x2ce/0x5e0
    [ 3680.515704] btrfs_relocate_block_group+0x16e/0x310
    [ 3680.515711] btrfs_relocate_chunk+0x38/0xf0
    [ 3680.515716] btrfs_shrink_device+0x200/0x560
    [ 3680.515728] btrfs_rm_device+0x1ae/0x6a6
    [ 3680.515744] ? _copy_from_user+0x6e/0xb0
    [ 3680.515750] btrfs_ioctl+0x1afe/0x28c0
    [ 3680.515755] ? find_held_lock+0x2b/0x80
    [ 3680.515760] ? do_user_addr_fault+0x1f8/0x418
    [ 3680.515773] ? __x64_sys_ioctl+0x77/0xb0
    [ 3680.515775] __x64_sys_ioctl+0x77/0xb0
    [ 3680.515781] do_syscall_64+0x31/0x70
    [ 3680.515785] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported-by: Nikolay Borisov
    Fixes: 44d354abf33e ("btrfs: relocation: review the call sites which can be interrupted by signal")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • To help with debugging, print the type of the block rsv when we fail to
    use our target block rsv in btrfs_use_block_rsv.

    This now produces:

    [ 544.672035] BTRFS: block rsv 1 returned -28

    which is still cryptic without consulting the enum in block-rsv.h but I
    guess it's better than nothing.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add note from Nikolay ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • On 32-bit systems, this shift will overflow for files larger than 4GB as
    start_index is unsigned long while the calls to btrfs_delalloc_*_space
    expect u64.
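
    The overflow can be reproduced in user space by using uint32_t to model
    a 32-bit unsigned long. The function names and the 12-bit page shift
    (4K pages) are assumptions of this sketch.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12 /* assumes 4K pages */

/* Shift done in 32 bits, as with unsigned long on a 32-bit arch: wraps. */
uint64_t offset_narrow(uint32_t index)
{
    return (uint32_t)(index << PAGE_SHIFT_SKETCH);
}

/* Widen to 64 bits first, then shift: the full byte offset is kept. */
uint64_t offset_wide(uint32_t index)
{
    return (uint64_t)index << PAGE_SHIFT_SKETCH;
}
```

    For page index 0x100000 (the page at the 4GB boundary) the narrow
    version yields 0 while the wide version yields 0x100000000.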

    CC: stable@vger.kernel.org # 4.4+
    Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
    Reviewed-by: Josef Bacik
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Sterba
    [ define the variable instead of repeating the shift ]
    Signed-off-by: David Sterba

    Matthew Wilcox (Oracle)
     

02 Nov, 2020

1 commit


31 Oct, 2020

1 commit

  • Pull btrfs fixes from David Sterba:

    - lockdep fixes:
    - drop path locks before manipulating sysfs objects or qgroups
    - preliminary fixes before tree locks get switched to rwsem
    - use annotated seqlock

    - build warning fixes (printk format)

    - fix relocation vs fallocate race

    - tree checker properly validates number of stripes and parity

    - readahead vs device replace fixes

    - iomap dio fix for unnecessary buffered io fallback

    * tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: convert data_seqcount to seqcount_mutex_t
    btrfs: don't fallback to buffered read if we don't need to
    btrfs: add a helper to read the tree_root commit root for backref lookup
    btrfs: drop the path before adding qgroup items when enabling qgroups
    btrfs: fix readahead hang and use-after-free after removing a device
    btrfs: fix use-after-free on readahead extent after failure to create it
    btrfs: tree-checker: validate number of chunk stripes and parity
    btrfs: tree-checker: fix incorrect printk format
    btrfs: drop the path before adding block group sysfs files
    btrfs: fix relocation failure due to race with fallocate

    Linus Torvalds
     

27 Oct, 2020

2 commits

  • By doing so we can associate the sequence counter to the chunk_mutex
    for lockdep purposes (compiled-out otherwise), the mutex is otherwise
    used on the write side.
    Also avoid explicitly disabling preemption around the write region as it
    will now be done automatically by the seqcount machinery based on the
    lock type.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Davidlohr Bueso
     
  • Since we switched to the iomap infrastructure in b5ff9f1a96e8f ("btrfs:
    switch to iomap for direct IO") we're calling generic_file_buffered_read()
    directly and not via generic_file_read_iter() anymore.

    If the read could read everything there is no need to bother calling
    generic_file_buffered_read(), like it is handled in
    generic_file_read_iter().
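
    The condition can be sketched as a small predicate. The name and
    parameters are hypothetical, and the real check may take more state
    into account (for example the file position relative to i_size); this
    only shows the "nothing left to read" shortcut.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * dio_ret:    result of the direct read (negative on error)
 * bytes_left: bytes the caller still wants after the direct read
 *
 * Returns true only when a buffered read is still needed.
 */
bool need_buffered_fallback(long dio_ret, size_t bytes_left)
{
    if (dio_ret < 0)
        return false; /* report the error instead of retrying buffered */

    return bytes_left != 0; /* everything read: no fallback needed */
}
```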

    If we call generic_file_buffered_read() in this case we can hit a
    situation where we do an invalid readahead and cause this UBSAN splat
    in fstest generic/091:

    run fstests generic/091 at 2020-10-21 10:52:32
    ================================================================================
    UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
    shift exponent 64 is too large for 64-bit type 'long unsigned int'
    CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    __dump_stack lib/dump_stack.c:77
    dump_stack+0x57/0x70 lib/dump_stack.c:118
    ubsan_epilogue+0x5/0x40 lib/ubsan.c:148
    __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395
    __roundup_pow_of_two ./include/linux/log2.h:57
    get_init_ra_size mm/readahead.c:318
    ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530
    generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199
    call_read_iter ./include/linux/fs.h:1876
    new_sync_read+0x102/0x180 fs/read_write.c:415
    vfs_read+0x11c/0x1a0 fs/read_write.c:481
    ksys_read+0x4f/0xc0 fs/read_write.c:615
    do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118
    RIP: 0033:0x7fe87fee992e
    RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e
    RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003
    RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000
    R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000
    R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000
    ================================================================================
    BTRFS info (device nullb0): has skinny extents
    BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B
    BTRFS info (device nullb0): enabling ssd optimizations

    Fixes: f85781fb505e ("btrfs: switch to iomap for direct IO")
    Reviewed-by: Goldwyn Rodrigues
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    Johannes Thumshirn
     

26 Oct, 2020

1 commit

  • I got the following lockdep splat with tree locks converted to rwsem
    patches on btrfs/104:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0+ #102 Not tainted
    ------------------------------------------------------
    btrfs-cleaner/903 is trying to acquire lock:
    ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

    but task is already holding lock:
    ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
    down_read+0x40/0x130
    caching_thread+0x53/0x5a0
    btrfs_work_helper+0xfa/0x520
    process_one_work+0x238/0x540
    worker_thread+0x55/0x3c0
    kthread+0x13a/0x150
    ret_from_fork+0x1f/0x30

    -> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
    __mutex_lock+0x7e/0x7b0
    btrfs_cache_block_group+0x1e0/0x510
    find_free_extent+0xb6e/0x12f0
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb1/0x330
    alloc_tree_block_no_bg_flush+0x4f/0x60
    __btrfs_cow_block+0x11d/0x580
    btrfs_cow_block+0x10c/0x220
    commit_cowonly_roots+0x47/0x2e0
    btrfs_commit_transaction+0x595/0xbd0
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x36/0xa0
    cleanup_mnt+0x12d/0x190
    task_work_run+0x5c/0xa0
    exit_to_user_mode_prepare+0x1df/0x200
    syscall_exit_to_user_mode+0x54/0x280
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&space_info->groups_sem){++++}-{3:3}:
    down_read+0x40/0x130
    find_free_extent+0x2ed/0x12f0
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb1/0x330
    alloc_tree_block_no_bg_flush+0x4f/0x60
    __btrfs_cow_block+0x11d/0x580
    btrfs_cow_block+0x10c/0x220
    commit_cowonly_roots+0x47/0x2e0
    btrfs_commit_transaction+0x595/0xbd0
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x36/0xa0
    cleanup_mnt+0x12d/0x190
    task_work_run+0x5c/0xa0
    exit_to_user_mode_prepare+0x1df/0x200
    syscall_exit_to_user_mode+0x54/0x280
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (btrfs-root-00){++++}-{3:3}:
    __lock_acquire+0x1167/0x2150
    lock_acquire+0xb9/0x3d0
    down_read_nested+0x43/0x130
    __btrfs_tree_read_lock+0x32/0x170
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x614/0x9d0
    btrfs_find_root+0x35/0x1b0
    btrfs_read_tree_root+0x61/0x120
    btrfs_get_root_ref+0x14b/0x600
    find_parent_nodes+0x3e6/0x1b30
    btrfs_find_all_roots_safe+0xb4/0x130
    btrfs_find_all_roots+0x60/0x80
    btrfs_qgroup_trace_extent_post+0x27/0x40
    btrfs_add_delayed_data_ref+0x3fd/0x460
    btrfs_free_extent+0x42/0x100
    __btrfs_mod_ref+0x1d7/0x2f0
    walk_up_proc+0x11c/0x400
    walk_up_tree+0xf0/0x180
    btrfs_drop_snapshot+0x1c7/0x780
    btrfs_clean_one_deleted_snapshot+0xfb/0x110
    cleaner_kthread+0xd4/0x140
    kthread+0x13a/0x150
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Chain exists of:
    btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&fs_info->commit_root_sem);
    lock(&caching_ctl->mutex);
    lock(&fs_info->commit_root_sem);
    lock(btrfs-root-00);

    *** DEADLOCK ***

    3 locks held by btrfs-cleaner/903:
    #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
    #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
    #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

    stack backtrace:
    CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
    dump_stack+0x8b/0xb0
    check_noncircular+0xcf/0xf0
    __lock_acquire+0x1167/0x2150
    ? __bfs+0x42/0x210
    lock_acquire+0xb9/0x3d0
    ? __btrfs_tree_read_lock+0x32/0x170
    down_read_nested+0x43/0x130
    ? __btrfs_tree_read_lock+0x32/0x170
    __btrfs_tree_read_lock+0x32/0x170
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x614/0x9d0
    ? find_held_lock+0x2b/0x80
    btrfs_find_root+0x35/0x1b0
    ? do_raw_spin_unlock+0x4b/0xa0
    btrfs_read_tree_root+0x61/0x120
    btrfs_get_root_ref+0x14b/0x600
    find_parent_nodes+0x3e6/0x1b30
    btrfs_find_all_roots_safe+0xb4/0x130
    btrfs_find_all_roots+0x60/0x80
    btrfs_qgroup_trace_extent_post+0x27/0x40
    btrfs_add_delayed_data_ref+0x3fd/0x460
    btrfs_free_extent+0x42/0x100
    __btrfs_mod_ref+0x1d7/0x2f0
    walk_up_proc+0x11c/0x400
    walk_up_tree+0xf0/0x180
    btrfs_drop_snapshot+0x1c7/0x780
    ? btrfs_clean_one_deleted_snapshot+0x73/0x110
    btrfs_clean_one_deleted_snapshot+0xfb/0x110
    cleaner_kthread+0xd4/0x140
    ? btrfs_alloc_root+0x50/0x50
    kthread+0x13a/0x150
    ? kthread_create_worker_on_cpu+0x40/0x40
    ret_from_fork+0x1f/0x30
    BTRFS info (device sdb): disk space caching is enabled
    BTRFS info (device sdb): has skinny extents

    This happens because qgroups does a backref lookup when we create a
    delayed ref. From here it may have to look up a root from an indirect
    ref, which does a normal lookup on the tree_root, which takes the read
    lock on the tree_root nodes.

    To fix this we need to add a variant for looking up roots that searches
    the commit root of the tree_root. Then when we do the backref search
    using the commit root we are sure to not take any locks on the tree_root
    nodes. This gets rid of the lockdep splat when running btrfs/104.

    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik