Eric Lee / smarc-fsl-linux-kernel

25 Jan, 2021

1 commit

2259bc93f Merge remote-tracking branch 'lts/lf-5.10.y' into lf-5.10.y_android ... Browse Code »

Change-Id: I173bc0325c93daaf82bb894f77ba35fa827864de

zhang sanshan
2021-01-25 16:19:35 +0800

20 Jan, 2021

6 commits

29543864c btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan ... Browse Code »

[ Upstream commit cb13eea3b49055bd78e6ddf39defd6340f7379fc ]

If we remount a filesystem in RO mode while the qgroup rescan worker is
running, we can end up having it still running after the remount is done,
and at unmount time we may end up with an open transaction that ends up
never getting committed. If that happens we end up with several memory
leaks and can crash when hardware acceleration is unavailable for crc32c.
Possibly it can lead to other nasty surprises too, due to use-after-free
issues.

The following steps explain how the problem happens.

1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
running;

2) We remount the filesystem in RO mode, and never stop/pause the rescan
worker, so after the remount the rescan worker is still running. The
important detail here is that the rescan task is still running after
the remount operation committed any ongoing transaction through its
call to btrfs_commit_super();

3) The rescan is still running, and after the remount completed, the
rescan worker started a transaction, after it finished iterating all
leaves of the extent tree, to update the qgroup status item in the
quotas tree. It does not commit the transaction, it only releases its
handle on the transaction;

4) A filesystem unmount operation starts shortly after;

5) The unmount task, at close_ctree(), stops the transaction kthread,
which had not had a chance to commit the open transaction since it was
sleeping and the commit interval (default of 30 seconds) has not yet
elapsed since the last time it committed a transaction;

6) So after stopping the transaction kthread we still have the transaction
used to update the qgroup status item open. At close_ctree(), when the
filesystem is in RO mode and no transaction abort happened (or the
filesystem is in error mode), we do not expect to have any transaction
open, so we do not call btrfs_commit_super();

7) We then proceed to destroy the work queues, free the roots and block
groups, etc. After that we drop the last reference on the btree inode
by calling iput() on it. Since there are dirty pages for the btree
inode, corresponding to the COWed extent buffer for the quotas btree,
btree_write_cache_pages() is invoked to flush those dirty pages. This
results in creating a bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();

8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;

9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [] 0x0
hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [] 0x0
hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [] 0x0
hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:

=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

Fix this issue by having the remount path stop the qgroup rescan worker
when we are remounting RO and teach the rescan worker to stop when a
remount is in progress. If later a remount in RW mode happens, we are
already resuming the qgroup rescan worker through the call to
btrfs_qgroup_rescan_resume(), so we do not need to worry about that.

Tested-by: Fabian Vogt
Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Filipe Manana
2021-01-20 01:27:24 +0800
f89d84b35 btrfs: merge critical sections of discard lock in workfn ... Browse Code »

[ Upstream commit 8fc058597a283e9a37720abb0e8d68e342b9387d ]

btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
ktime.

Reviewed-by: Josef Bacik
Signed-off-by: Pavel Begunkov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Pavel Begunkov
2021-01-20 01:27:24 +0800
33061bd10 btrfs: fix async discard stall ... Browse Code »

[ Upstream commit ea9ed87c73e87e044b2c58d658eb4ba5216bc488 ]

Might happen that bg->discard_eligible_time was changed without
rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
time, peek_discard_list() returns NULL, and all work halts and goes to
sleep without further rescheduling even there are block groups to
discard.

It happens pretty often, but not so visible from the userspace because
after some time it usually will be kicked off anyway by someone else
calling btrfs_discard_reschedule_work().

Fix it by continue rescheduling if block group discard lists are not
empty.

Reviewed-by: Josef Bacik
Signed-off-by: Pavel Begunkov
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Pavel Begunkov
2021-01-20 01:27:24 +0800
41b5ec745 btrfs: tree-checker: check if chunk item end overflows ... Browse Code »

[ Upstream commit 347fb0cfc9bab5195c6701e62eda488310d7938f ]

While mounting a crafted image provided by user, kernel panics due to
the invalid chunk item whose end is less than start.

[66.387422] loop: module loaded
[66.389773] loop0: detected capacity change from 262144 to 0
[66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
[66.431061] BTRFS info (device loop0): disk space caching is enabled
[66.431078] BTRFS info (device loop0): has skinny extents
[66.437101] BTRFS error: insert state: end < start 29360127 37748736
[66.437136] ------------[ cut here ]------------
[66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
[66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G O 5.11.0-rc1-custom #45
[66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
[66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
[66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
[66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
[66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
[66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
[66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
[66.437447] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
[66.437451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
[66.437460] PKRU: 55555554
[66.437464] Call Trace:
[66.437475] set_extent_bit+0x652/0x740 [btrfs]
[66.437539] set_extent_bits_nowait+0x1d/0x20 [btrfs]
[66.437576] add_extent_mapping+0x1e0/0x2f0 [btrfs]
[66.437621] read_one_chunk+0x33c/0x420 [btrfs]
[66.437674] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
[66.437708] ? kvm_sched_clock_read+0x18/0x40
[66.437739] open_ctree+0xb32/0x1734 [btrfs]
[66.437781] ? bdi_register_va+0x1b/0x20
[66.437788] ? super_setup_bdi_name+0x79/0xd0
[66.437810] btrfs_mount_root.cold+0x12/0xeb [btrfs]
[66.437854] ? __kmalloc_track_caller+0x217/0x3b0
[66.437873] legacy_get_tree+0x34/0x60
[66.437880] vfs_get_tree+0x2d/0xc0
[66.437888] vfs_kern_mount.part.0+0x78/0xc0
[66.437897] vfs_kern_mount+0x13/0x20
[66.437902] btrfs_mount+0x11f/0x3c0 [btrfs]
[66.437940] ? kfree+0x5ff/0x670
[66.437944] ? __kmalloc_track_caller+0x217/0x3b0
[66.437962] legacy_get_tree+0x34/0x60
[66.437974] vfs_get_tree+0x2d/0xc0
[66.437983] path_mount+0x48c/0xd30
[66.437998] __x64_sys_mount+0x108/0x140
[66.438011] do_syscall_64+0x38/0x50
[66.438018] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[66.438023] RIP: 0033:0x7f0138827f6e
[66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
[66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
[66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
[66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
[66.438078] irq event stamp: 18169
[66.438082] hardirqs last enabled at (18175): [] console_unlock+0x4ff/0x5f0
[66.438088] hardirqs last disabled at (18180): [] console_unlock+0x467/0x5f0
[66.438092] softirqs last enabled at (16910): [] asm_call_irq_on_stack+0x12/0x20
[66.438097] softirqs last disabled at (16905): [] asm_call_irq_on_stack+0x12/0x20
[66.438103] ---[ end trace e114b111db64298b ]---
[66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
[66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
[66.441069] ------------[ cut here ]------------
[66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
[66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G W O 5.11.0-rc1-custom #45
[66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
[66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
[66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
[66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
[66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
[66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
[66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
[66.458356] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
[66.459841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
[66.462196] PKRU: 55555554
[66.462692] Call Trace:
[66.463139] set_extent_bit.cold+0x30/0x98 [btrfs]
[66.464049] set_extent_bits_nowait+0x1d/0x20 [btrfs]
[66.490466] add_extent_mapping+0x1e0/0x2f0 [btrfs]
[66.514097] read_one_chunk+0x33c/0x420 [btrfs]
[66.534976] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
[66.555718] ? kvm_sched_clock_read+0x18/0x40
[66.575758] open_ctree+0xb32/0x1734 [btrfs]
[66.595272] ? bdi_register_va+0x1b/0x20
[66.614638] ? super_setup_bdi_name+0x79/0xd0
[66.633809] btrfs_mount_root.cold+0x12/0xeb [btrfs]
[66.652938] ? __kmalloc_track_caller+0x217/0x3b0
[66.671925] legacy_get_tree+0x34/0x60
[66.690300] vfs_get_tree+0x2d/0xc0
[66.708221] vfs_kern_mount.part.0+0x78/0xc0
[66.725808] vfs_kern_mount+0x13/0x20
[66.742730] btrfs_mount+0x11f/0x3c0 [btrfs]
[66.759350] ? kfree+0x5ff/0x670
[66.775441] ? __kmalloc_track_caller+0x217/0x3b0
[66.791750] legacy_get_tree+0x34/0x60
[66.807494] vfs_get_tree+0x2d/0xc0
[66.823349] path_mount+0x48c/0xd30
[66.838753] __x64_sys_mount+0x108/0x140
[66.854412] do_syscall_64+0x38/0x50
[66.869673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[66.885093] RIP: 0033:0x7f0138827f6e
[66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
[66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
[67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
[67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
[67.216138] ---[ end trace e114b111db64298c ]---
[67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
[67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
[67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
[67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
[67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
[67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
[67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
[67.465436] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
[67.511660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
[67.558449] PKRU: 55555554
[67.581146] note: mount[613] exited with preempt_count 2

The image has a chunk item which has a logical start 37748736 and length
18446744073701163008 (-8M). The calculated end 29360127 overflows.
EEXIST was caught by insert_state() because of the duplicate end and
extent_io_tree_panic() was called.

Add overflow check of chunk item end to tree checker so it can be
detected early at mount time.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain
Signed-off-by: Su Yue
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Su Yue
2021-01-20 01:27:22 +0800
f37fba66a btrfs: prevent NULL pointer dereference in extent_io_tree_panic ... Browse Code »

commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream.

Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.

Pass tree->fs_info as it's known to be valid in all cases.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain
Signed-off-by: Su Yue
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Su Yue
2021-01-20 01:27:17 +0800
e883eb5d1 btrfs: reloc: fix wrong file extent type check to avoid false ENOENT ... Browse Code »

commit 50e31ef486afe60f128d42fb9620e2a63172c15c upstream.

[BUG]
There are several bug reports about recent kernel unable to relocate
certain data block groups.

Sometimes the error just goes away, but there is one reporter who can
reproduce it reliably.

The dmesg would look like:

[438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
[438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
[450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
[463.501781] BTRFS info (device dm-10): balance: ended with status: -2

[CAUSE]
The ENOENT error is returned from the following call chain:

add_data_references()
|- delete_v1_space_cache();
|- if (!found)
return -ENOENT;

The variable @found is set to true if we find a data extent whose
disk bytenr matches parameter @data_bytes.

With extra debugging, the offending tree block looks like this:

leaf bytenr = 42676709441536, data_bytenr = 34626327621632

ctime 1567904822.739884119 (2019-09-08 03:07:02)
mtime 0.0 (1970-01-01 01:00:00)
otime 0.0 (1970-01-01 01:00:00)
item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
generation 1517381 type 2 (prealloc)
prealloc data disk byte 34626327621632 nr 262144 <<<
prealloc data offset 0 nr 262144
item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

Although item 27 has disk bytenr 34626327621632, which matches the
data_bytenr, its type is prealloc, not reg.
This makes the existing code skip that item, and return ENOENT.

[FIX]
The code is modified in commit 19b546d7a1b2 ("btrfs: relocation: Use
btrfs_find_all_leafs to locate data extent parent tree leaves"), before
that commit, we use something like

"if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

But in that offending commit, we use (type == BTRFS_FILE_EXTENT_REG),
ignoring BTRFS_FILE_EXTENT_PREALLOC.

Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.

Reported-by: Stéphane Lesimple
Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
Fixes: 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
CC: stable@vger.kernel.org # 5.6+
Tested-By: Stéphane Lesimple
Reviewed-by: Su Yue
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Qu Wenruo
2021-01-20 01:27:17 +0800

17 Jan, 2021

3 commits

e3b5252b5 btrfs: shrink delalloc pages instead of full inodes ... Browse Code »

[ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ]

Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot. However this introduced a pretty serious
performance regression. To reproduce the user untarred the source
tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
see it take anywhere from 5 to 20 times as long to untar in 5.10
compared to 5.9. This was observed on fast devices (SSD and better) and
not on HDD.

The root cause is because before we would generally use the normal
writeback path to reclaim delalloc space, and for this we would provide
it with the number of pages we wanted to flush. The referenced commit
changed this to flush that many inodes, which drastically increased the
amount of space we were flushing in certain cases, which severely
affected performance.

We cannot revert this patch unfortunately because of 3d45f221ce62
("btrfs: fix deadlock when cloning inline extent and low on free
metadata space") which requires the ability to skip flushing inodes that
are being cloned in certain scenarios, which means we need to keep using
our flushing infrastructure or risk re-introducing the deadlock.

Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us. This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe. I
redid the users original test and got the following results on one of
our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)

5.9 0m54.258s
5.10 1m26.212s
5.10+patch 0m38.800s

5.10+patch is significantly faster than plain 5.9 because of my patch
series "Change data reservations to use the ticketing infra" which
contained the patch that introduced the regression, but generally
improved the overall ENOSPC flushing mechanisms.

Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm
the results:

5.10.5 4m00s
5.10.5+patch 1m08s
5.11-rc2 5m14s
5.11-rc2+patch 1m30s

Reported-by: René Rebe
Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
CC: stable@vger.kernel.org # 5.10
Signed-off-by: Josef Bacik
Tested-by: David Sterba
Reviewed-by: David Sterba
[ add my test results ]
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Josef Bacik
2021-01-17 21:16:54 +0800
17243f73a btrfs: fix deadlock when cloning inline extent and low on free metadata space ... Browse Code »

[ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]

When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:

* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;

* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;

* the async reclaim task blocks waiting for the delalloc work to complete;

* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;

* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.

When this happens, traces like the following show up in dmesg/syslog:

[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)

Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).

Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.

This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.

CC: stable@vger.kernel.org # 5.9+
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Filipe Manana
2021-01-17 21:16:54 +0800
877381645 btrfs: skip unnecessary searches for xattrs when logging an inode ... Browse Code »

[ Upstream commit f2f121ab500d0457cc9c6f54269d21ffdf5bd304 ]

Every time we log an inode we lookup in the fs/subvol tree for xattrs and
if we have any, log them into the log tree. However it is very common to
have inodes without any xattrs, so doing the search wastes times, but more
importantly it adds contention on the fs/subvol tree locks, either making
the logging code block and wait for tree locks or making the logging code
making other concurrent operations block and wait.

The most typical use cases where xattrs are used are when capabilities or
ACLs are defined for an inode, or when SELinux is enabled.

This change makes the logging code detect when an inode does not have
xattrs and skip the xattrs search the next time the inode is logged,
unless the inode is evicted and loaded again or a xattr is added to the
inode. Therefore skipping the search for xattrs on inodes that don't ever
have xattrs and are fsynced with some frequency.

The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian distributions):

$ cat test.sh
#!/bin/bash

DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"

mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 200 40

umount $MNT

The results before this change:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5761605 0.172 312.057
Close 4232452 0.002 10.927
Rename 243937 1.406 277.344
Unlink 1163456 0.631 298.402
Deltree 160 11.581 221.107
Mkdir 80 0.003 0.005
Qpathinfo 5221410 0.065 122.309
Qfileinfo 915432 0.001 3.333
Qfsinfo 957555 0.003 3.992
Sfileinfo 469244 0.023 20.494
Find 2018865 0.448 123.659
WriteX 2874851 0.049 118.529
ReadX 9030579 0.004 21.654
LockX 18754 0.003 4.423
UnlockX 18754 0.002 0.331
Flush 403792 10.944 359.494

Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms

The results after this change:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 6442521 0.159 230.693
Close 4732357 0.002 10.972
Rename 272809 1.293 227.398
Unlink 1301059 0.563 218.500
Deltree 160 7.796 54.887
Mkdir 80 0.008 0.478
Qpathinfo 5839452 0.047 124.330
Qfileinfo 1023199 0.001 4.996
Qfsinfo 1070760 0.003 5.709
Sfileinfo 524790 0.033 21.765
Find 2257658 0.314 125.611
WriteX 3211520 0.040 232.135
ReadX 10098969 0.004 25.340
LockX 20974 0.003 1.569
UnlockX 20974 0.002 3.475
Flush 451553 10.287 331.037

Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms

+10.8% throughput, -8.2% max latency

Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Filipe Manana
2021-01-17 21:16:53 +0800

13 Jan, 2021

2 commits

5e84c9905 btrfs: send: fix wrong file path when there is an inode with a pending rmdir ... Browse Code »

commit 0b3f407e6728d990ae1630a02c7b952c21c288d3 upstream.

When doing an incremental send, if we have a new inode that happens to
have the same number that an old directory inode had in the base snapshot
and that old directory has a pending rmdir operation, we end up computing
a wrong path for the new inode, causing the receiver to fail.

Example reproducer:

$ cat test-send-rmdir.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT

mkdir $MNT/dir
touch $MNT/dir/file1
touch $MNT/dir/file2
touch $MNT/dir/file3

# Filesystem looks like:
#
# . (ino 256)
# |----- dir/ (ino 257)
# |----- file1 (ino 258)
# |----- file2 (ino 259)
# |----- file3 (ino 260)
#

btrfs subvolume snapshot -r $MNT $MNT/snap1
btrfs send -f /tmp/snap1.send $MNT/snap1

# Now remove our directory and all its files.
rm -fr $MNT/dir

# Unmount the filesystem and mount it again. This is to ensure that
# the next inode that is created ends up with the same inode number
# that our directory "dir" had, 257, which is the first free "objectid"
# available after mounting again the filesystem.
umount $MNT
mount $DEV $MNT

# Now create a new file (it could be a directory as well).
touch $MNT/newfile

# Filesystem now looks like:
#
# . (ino 256)
# |----- newfile (ino 257)
#

btrfs subvolume snapshot -r $MNT $MNT/snap2
btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

# Now unmount the filesystem, create a new one, mount it and try to apply
# both send streams to recreate both snapshots.
umount $DEV

mkfs.btrfs -f $DEV >/dev/null

mount $DEV $MNT

btrfs receive -f /tmp/snap1.send $MNT
btrfs receive -f /tmp/snap2.send $MNT

umount $MNT

When running the test, the receive operation for the incremental stream
fails:

$ ./test-send-rmdir.sh
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
At subvol snap1
At snapshot snap2
ERROR: chown o257-9-0 failed: No such file or directory

So fix this by tracking directories that have a pending rmdir by inode
number and generation number, instead of only inode number.

A test case for fstests follows soon.

Reported-by: Massimo B.
Tested-by: Massimo B.
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Filipe Manana
2021-01-13 03:18:24 +0800
1888e5df8 btrfs: qgroup: don't try to wait flushing if we're already holding a transaction ... Browse Code »

commit ae5e070eaca9dbebde3459dd8f4c2756f8c097d0 upstream.

There is a chance of racing for qgroup flushing which may lead to
deadlock:

Thread A | Thread B
(not holding trans handle) | (holding a trans handle)
--------------------------------+--------------------------------
__btrfs_qgroup_reserve_meta() | __btrfs_qgroup_reserve_meta()
|- try_flush_qgroup() | |- try_flush_qgroup()
|- QGROUP_FLUSHING bit set | |
| | |- test_and_set_bit()
| | |- wait_event()
|- btrfs_join_transaction() |
|- btrfs_commit_transaction()|

!!! DEAD LOCK !!!

Since thread A wants to commit transaction, but thread B is holding a
transaction handle, blocking the commit.
At the same time, thread B is waiting for thread A to finish its commit.

This is just a hot fix, and would lead to more EDQUOT when we're near
the qgroup limit.

The proper fix would be to make all metadata/data reservations happen
without holding a transaction handle.

CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Filipe Manana
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Qu Wenruo
2021-01-13 03:18:24 +0800

30 Dec, 2020

4 commits

19057a6a6 Merge 5.10.4 into android12-5.10 ... Browse Code »

Changes in 5.10.4
hwmon: (k10temp) Remove support for displaying voltage and current on Zen CPUs
drm/gma500: fix double free of gma_connector
iio: adc: at91_adc: add Kconfig dep on the OF symbol and remove of_match_ptr()
drm/aspeed: Fix Kconfig warning & subsequent build errors
drm/mcde: Fix handling of platform_get_irq() error
drm/tve200: Fix handling of platform_get_irq() error
arm64: dts: renesas: hihope-rzg2-ex: Drop rxc-skew-ps from ethernet-phy node
arm64: dts: renesas: cat875: Remove rxc-skew-ps from ethernet-phy node
soc: renesas: rmobile-sysc: Fix some leaks in rmobile_init_pm_domains()
soc: mediatek: Check if power domains can be powered on at boot time
arm64: dts: mediatek: mt8183: fix gce incorrect mbox-cells value
arm64: dts: ipq6018: update the reserved-memory node
arm64: dts: qcom: sc7180: Fix one forgotten interconnect reference
soc: qcom: geni: More properly switch to DMA mode
Revert "i2c: i2c-qcom-geni: Fix DMA transfer race"
RDMA/bnxt_re: Set queue pair state when being queried
rtc: pcf2127: fix pcf2127_nvmem_read/write() returns
RDMA/bnxt_re: Fix entry size during SRQ create
selinux: fix error initialization in inode_doinit_with_dentry()
ARM: dts: aspeed-g6: Fix the GPIO memory size
ARM: dts: aspeed: s2600wf: Fix VGA memory region location
RDMA/core: Fix error return in _ib_modify_qp()
RDMA/rxe: Compute PSN windows correctly
x86/mm/ident_map: Check for errors from ident_pud_init()
ARM: p2v: fix handling of LPAE translation in BE mode
RDMA/rtrs-clt: Remove destroy_con_cq_qp in case route resolving failed
RDMA/rtrs-clt: Missing error from rtrs_rdma_conn_established
RDMA/rtrs-srv: Don't guard the whole __alloc_srv with srv_mutex
x86/apic: Fix x2apic enablement without interrupt remapping
ASoC: qcom: fix unsigned int bitwidth compared to less than zero
sched/deadline: Fix sched_dl_global_validate()
sched: Reenable interrupts in do_sched_yield()
drm/amdgpu: fix incorrect enum type
crypto: talitos - Endianess in current_desc_hdr()
crypto: talitos - Fix return type of current_desc_hdr()
crypto: inside-secure - Fix sizeof() mismatch
ASoC: sun4i-i2s: Fix lrck_period computation for I2S justified mode
drm/msm: Add missing stub definition
ARM: dts: aspeed: tiogapass: Remove vuart
drm/amdgpu: fix build_coefficients() argument
powerpc/64: Set up a kernel stack for secondaries before cpu_restore()
spi: img-spfi: fix reference leak in img_spfi_resume
f2fs: call f2fs_get_meta_page_retry for nat page
RDMA/mlx5: Fix corruption of reg_pages in mlx5_ib_rereg_user_mr()
perf test: Use generic event for expand_libpfm_events()
drm/msm/dp: DisplayPort PHY compliance tests fixup
drm/msm/dsi_pll_7nm: restore VCO rate during restore_state
drm/msm/dsi_pll_10nm: restore VCO rate during restore_state
drm/msm/dpu: fix clock scaling on non-sc7180 board
spi: spi-mem: fix reference leak in spi_mem_access_start
scsi: aacraid: Improve compat_ioctl handlers
pinctrl: core: Add missing #ifdef CONFIG_GPIOLIB
ASoC: pcm: DRAIN support reactivation
drm/bridge: tpd12s015: Fix irq registering in tpd12s015_probe
crypto: arm64/poly1305-neon - reorder PAC authentication with SP update
crypto: arm/aes-neonbs - fix usage of cbc(aes) fallback
crypto: caam - fix printing on xts fallback allocation error path
selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
nl80211/cfg80211: fix potential infinite loop
spi: stm32: fix reference leak in stm32_spi_resume
bpf: Fix tests for local_storage
x86/mce: Correct the detection of invalid notifier priorities
drm/edid: Fix uninitialized variable in drm_cvt_modes()
ath11k: Initialize complete alpha2 for regulatory change
ath11k: Fix number of rules in filtered ETSI regdomain
ath11k: fix wmi init configuration
brcmfmac: Fix memory leak for unpaired brcmf_{alloc/free}
arm64: dts: exynos: Include common syscon restart/poweroff for Exynos7
arm64: dts: exynos: Correct psci compatible used on Exynos7
drm/panel: simple: Add flags to boe_nv133fhm_n61
Bluetooth: Fix null pointer dereference in hci_event_packet()
Bluetooth: Fix: LL PRivacy BLE device fails to connect
Bluetooth: hci_h5: fix memory leak in h5_close
spi: stm32-qspi: fix reference leak in stm32 qspi operations
spi: spi-ti-qspi: fix reference leak in ti_qspi_setup
spi: mt7621: fix missing clk_disable_unprepare() on error in mt7621_spi_probe
spi: tegra20-slink: fix reference leak in slink ops of tegra20
spi: tegra20-sflash: fix reference leak in tegra_sflash_resume
spi: tegra114: fix reference leak in tegra spi ops
spi: bcm63xx-hsspi: fix missing clk_disable_unprepare() on error in bcm63xx_hsspi_resume
spi: imx: fix reference leak in two imx operations
ASoC: qcom: common: Fix refcounting in qcom_snd_parse_of()
ath11k: Handle errors if peer creation fails
mwifiex: fix mwifiex_shutdown_sw() causing sw reset failure
drm/msm/a6xx: Clear shadow on suspend
drm/msm/a5xx: Clear shadow on suspend
firmware: tegra: fix strncpy()/strncat() confusion
drm/msm/dp: return correct connection status after suspend
drm/msm/dp: skip checking LINK_STATUS_UPDATED bit
drm/msm/dp: do not notify audio subsystem if sink doesn't support audio
selftests/run_kselftest.sh: fix dry-run typo
selftest/bpf: Add missed ip6ip6 test back
ASoC: wm8994: Fix PM disable depth imbalance on error
ASoC: wm8998: Fix PM disable depth imbalance on error
spi: sprd: fix reference leak in sprd_spi_remove
virtiofs fix leak in setup
ASoC: arizona: Fix a wrong free in wm8997_probe
RDMa/mthca: Work around -Wenum-conversion warning
ASoC: SOF: Intel: fix Kconfig dependency for SND_INTEL_DSP_CONFIG
arm64: dts: ti: k3-am65*/j721e*: Fix unit address format error for dss node
MIPS: BCM47XX: fix kconfig dependency bug for BCM47XX_BCMA
drm/amdgpu: fix compute queue priority if num_kcq is less than 4
soc: ti: omap-prm: Do not check rstst bit on deassert if already deasserted
crypto: Kconfig - CRYPTO_MANAGER_EXTRA_TESTS requires the manager
crypto: qat - fix status check in qat_hal_put_rel_rd_xfer()
firmware: arm_scmi: Fix missing destroy_workqueue()
drm/udl: Fix missing error code in udl_handle_damage()
staging: greybus: codecs: Fix reference counter leak in error handling
staging: gasket: interrupt: fix the missed eventfd_ctx_put() in gasket_interrupt.c
scripts: kernel-doc: Restore anonymous enum parsing
drm/amdkfd: Put ACPI table after using it
ionic: use mc sync for multicast filters
ionic: flatten calls to ionic_lif_rx_mode
ionic: change set_rx_mode from_ndo to can_sleep
media: tm6000: Fix sizeof() mismatches
media: platform: add missing put_device() call in mtk_jpeg_clk_init()
media: mtk-vcodec: add missing put_device() call in mtk_vcodec_init_dec_pm()
media: mtk-vcodec: add missing put_device() call in mtk_vcodec_release_dec_pm()
media: mtk-vcodec: add missing put_device() call in mtk_vcodec_init_enc_pm()
media: v4l2-fwnode: Return -EINVAL for invalid bus-type
media: v4l2-fwnode: v4l2_fwnode_endpoint_parse caller must init vep argument
media: ov5640: fix support of BT656 bus mode
media: staging: rkisp1: cap: fix runtime PM imbalance on error
media: cedrus: fix reference leak in cedrus_start_streaming
media: platform: add missing put_device() call in mtk_jpeg_probe() and mtk_jpeg_remove()
media: venus: core: change clk enable and disable order in resume and suspend
media: venus: core: vote for video-mem path
media: venus: core: vote with average bandwidth and peak bandwidth as zero
RDMA/cma: Add missing error handling of listen_id
ASoC: meson: fix COMPILE_TEST error
spi: dw: fix build error by selecting MULTIPLEXER
scsi: core: Fix VPD LUN ID designator priorities
media: venus: put dummy vote on video-mem path after last session release
media: solo6x10: fix missing snd_card_free in error handling case
video: fbdev: atmel_lcdfb: fix return error code in atmel_lcdfb_of_init()
mmc: sdhci: tegra: fix wrong unit with busy_timeout
drm/omap: dmm_tiler: fix return error code in omap_dmm_probe()
drm/meson: Free RDMA resources after tearing down DRM
drm/meson: Unbind all connectors on module removal
drm/meson: dw-hdmi: Register a callback to disable the regulator
drm/meson: dw-hdmi: Ensure that clocks are enabled before touching the TOP registers
ASoC: intel: SND_SOC_INTEL_KEEMBAY should depend on ARCH_KEEMBAY
iommu/vt-d: include conditionally on CONFIG_INTEL_IOMMU_SVM
Input: ads7846 - fix race that causes missing releases
Input: ads7846 - fix integer overflow on Rt calculation
Input: ads7846 - fix unaligned access on 7845
bus: mhi: core: Remove double locking from mhi_driver_remove()
bus: mhi: core: Fix null pointer access when parsing MHI configuration
usb/max3421: fix return error code in max3421_probe()
spi: mxs: fix reference leak in mxs_spi_probe
selftests/bpf: Fix broken riscv build
powerpc: Avoid broken GCC __attribute__((optimize))
powerpc/feature: Fix CPU_FTRS_ALWAYS by removing CPU_FTRS_GENERIC_32
ARM: dts: tacoma: Fix node vs reg mismatch for flash memory
Revert "powerpc/pseries/hotplug-cpu: Remove double free in error path"
powerpc/powernv/sriov: fix unsigned int win compared to less than zero
mfd: htc-i2cpld: Add the missed i2c_put_adapter() in htcpld_register_chip_i2c()
mfd: MFD_SL28CPLD should depend on ARCH_LAYERSCAPE
mfd: stmfx: Fix dev_err_probe() call in stmfx_chip_init()
mfd: cpcap: Fix interrupt regression with regmap clear_ack
EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId
scsi: ufs: Avoid to call REQ_CLKS_OFF to CLKS_OFF
scsi: ufs: Fix clkgating on/off
rcu: Allow rcu_irq_enter_check_tick() from NMI
rcu,ftrace: Fix ftrace recursion
rcu/tree: Defer kvfree_rcu() allocation to a clean context
crypto: crypto4xx - Replace bitwise OR with logical OR in crypto4xx_build_pd
crypto: omap-aes - Fix PM disable depth imbalance in omap_aes_probe
crypto: sun8i-ce - fix two error path's memory leak
spi: fix resource leak for drivers without .remove callback
drm/meson: dw-hdmi: Disable clocks on driver teardown
drm/meson: dw-hdmi: Enable the iahb clock early enough
PCI: Disable MSI for Pericom PCIe-USB adapter
PCI: brcmstb: Initialize "tmp" before use
soc: ti: knav_qmss: fix reference leak in knav_queue_probe
soc: ti: Fix reference imbalance in knav_dma_probe
drivers: soc: ti: knav_qmss_queue: Fix error return code in knav_queue_probe
soc: qcom: initialize local variable
arm64: dts: qcom: sm8250: correct compatible for sm8250-mtp
arm64: dts: qcom: msm8916-samsung-a2015: Disable muic i2c pin bias
Input: omap4-keypad - fix runtime PM error handling
clk: meson: Kconfig: fix dependency for G12A
staging: mfd: hi6421-spmi-pmic: fix error return code in hi6421_spmi_pmic_probe()
ath11k: Fix the rx_filter flag setting for peer rssi stats
RDMA/cxgb4: Validate the number of CQEs
soundwire: Fix DEBUG_LOCKS_WARN_ON for uninitialized attribute
pinctrl: sunxi: fix irq bank map for the Allwinner A100 pin controller
memstick: fix a double-free bug in memstick_check
ARM: dts: at91: sam9x60: add pincontrol for USB Host
ARM: dts: at91: sama5d4_xplained: add pincontrol for USB Host
ARM: dts: at91: sama5d3_xplained: add pincontrol for USB Host
mmc: pxamci: Fix error return code in pxamci_probe
brcmfmac: fix error return code in brcmf_cfg80211_connect()
orinoco: Move context allocation after processing the skb
qtnfmac: fix error return code in qtnf_pcie_probe()
rsi: fix error return code in rsi_reset_card()
cw1200: fix missing destroy_workqueue() on error in cw1200_init_common
dmaengine: mv_xor_v2: Fix error return code in mv_xor_v2_probe()
arm64: dts: qcom: sdm845: Limit ipa iommu streams
leds: netxbig: add missing put_device() call in netxbig_leds_get_of_pdata()
leds: lp50xx: Fix an error handling path in 'lp50xx_probe_dt()'
leds: turris-omnia: check for LED_COLOR_ID_RGB instead LED_COLOR_ID_MULTI
arm64: tegra: Fix DT binding for IO High Voltage entry
RDMA/cma: Fix deadlock on &lock in rdma_cma_listen_on_all() error unwind
soundwire: qcom: Fix build failure when slimbus is module
drm/imx/dcss: fix rotations for Vivante tiled formats
media: siano: fix memory leak of debugfs members in smsdvb_hotplug
platform/x86: mlx-platform: Remove PSU EEPROM from default platform configuration
platform/x86: mlx-platform: Remove PSU EEPROM from MSN274x platform configuration
arm64: dts: qcom: sc7180: limit IPA iommu streams
RDMA/hns: Only record vlan info for HIP08
RDMA/hns: Fix missing fields in address vector
RDMA/hns: Avoid setting loopback indicator when smac is same as dmac
serial: 8250-mtk: Fix reference leak in mtk8250_probe
samples: bpf: Fix lwt_len_hist reusing previous BPF map
media: imx214: Fix stop streaming
mips: cdmm: fix use-after-free in mips_cdmm_bus_discover
media: max2175: fix max2175_set_csm_mode() error code
slimbus: qcom-ngd-ctrl: Avoid sending power requests without QMI
RDMA/core: Track device memory MRs
drm/mediatek: Use correct aliases name for ovl
HSI: omap_ssi: Don't jump to free ID in ssi_add_controller()
ARM: dts: Remove non-existent i2c1 from 98dx3236
arm64: dts: armada-3720-turris-mox: update ethernet-phy handle name
power: supply: bq25890: Use the correct range for IILIM register
arm64: dts: rockchip: Set dr_mode to "host" for OTG on rk3328-roc-cc
power: supply: max17042_battery: Fix current_{avg,now} hiding with no current sense
power: supply: axp288_charger: Fix HP Pavilion x2 10 DMI matching
power: supply: bq24190_charger: fix reference leak
genirq/irqdomain: Don't try to free an interrupt that has no mapping
arm64: dts: ls1028a: fix ENETC PTP clock input
arm64: dts: ls1028a: fix FlexSPI clock input
arm64: dts: freescale: sl28: combine SPI MTD partitions
phy: tegra: xusb: Fix usb_phy device driver field
arm64: dts: qcom: c630: Polish i2c-hid devices
arm64: dts: qcom: c630: Fix pinctrl pins properties
PCI: Bounds-check command-line resource alignment requests
PCI: Fix overflow in command-line resource alignment requests
PCI: iproc: Fix out-of-bound array accesses
PCI: iproc: Invalidate correct PAXB inbound windows
arm64: dts: meson: fix spi-max-frequency on Khadas VIM2
arm64: dts: meson-sm1: fix typo in opp table
soc: amlogic: canvas: add missing put_device() call in meson_canvas_get()
scsi: hisi_sas: Fix up probe error handling for v3 hw
scsi: pm80xx: Do not sleep in atomic context
spi: spi-fsl-dspi: Use max_native_cs instead of num_chipselect to set SPI_MCR
ARM: dts: at91: at91sam9rl: fix ADC triggers
RDMA/hns: Fix 0-length sge calculation error
RDMA/hns: Bugfix for calculation of extended sge
mailbox: arm_mhu_db: Fix mhu_db_shutdown by replacing kfree with devm_kfree
soundwire: master: use pm_runtime_set_active() on add
platform/x86: dell-smbios-base: Fix error return code in dell_smbios_init
ASoC: Intel: Boards: tgl_max98373: update TDM slot_width
media: max9271: Fix GPIO enable/disable
media: rdacm20: Enable GPIO1 explicitly
media: i2c: imx219: Selection compliance fixes
ath11k: Don't cast ath11k_skb_cb to ieee80211_tx_info.control
ath11k: Reset ath11k_skb_cb before setting new flags
ath11k: Fix an error handling path
ath10k: Fix the parsing error in service available event
ath10k: Fix an error handling path
ath10k: Release some resources in an error handling path
SUNRPC: rpc_wake_up() should wake up tasks in the correct order
NFSv4.2: condition READDIR's mask for security label based on LSM state
SUNRPC: xprt_load_transport() needs to support the netid "rdma6"
NFSv4: Fix the alignment of page data in the getdeviceinfo reply
net: sunrpc: Fix 'snprintf' return value check in 'do_xprt_debugfs'
lockd: don't use interval-based rebinding over TCP
NFS: switch nfsiod to be an UNBOUND workqueue.
selftests/seccomp: Update kernel config
vfio-pci: Use io_remap_pfn_range() for PCI IO memory
hwmon: (ina3221) Fix PM usage counter unbalance in ina3221_write_enable
f2fs: fix double free of unicode map
media: tvp5150: Fix wrong return value of tvp5150_parse_dt()
media: saa7146: fix array overflow in vidioc_s_audio()
powerpc/perf: Fix crash with is_sier_available when pmu is not set
powerpc/64: Fix an EMIT_BUG_ENTRY in head_64.S
powerpc/xmon: Fix build failure for 8xx
powerpc/perf: Fix to update radix_scope_qual in power10
powerpc/perf: Update the PMU group constraints for l2l3 events in power10
powerpc/perf: Fix the PMU group constraints for threshold events in power10
clocksource/drivers/orion: Add missing clk_disable_unprepare() on error path
clocksource/drivers/cadence_ttc: Fix memory leak in ttc_setup_clockevent()
clocksource/drivers/ingenic: Fix section mismatch
clocksource/drivers/riscv: Make RISCV_TIMER depends on RISCV_SBI
arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE
iio: hrtimer-trigger: Mark hrtimer to expire in hard interrupt context
libbpf: Sanitise map names before pinning
ARM: dts: at91: sam9x60ek: remove bypass property
ARM: dts: at91: sama5d2: map securam as device
scripts: kernel-doc: fix parsing function-like typedefs
bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
selftests/bpf: Fix invalid use of strncat in test_sockmap
pinctrl: falcon: add missing put_device() call in pinctrl_falcon_probe()
soc: rockchip: io-domain: Fix error return code in rockchip_iodomain_probe()
arm64: dts: rockchip: Fix UART pull-ups on rk3328
memstick: r592: Fix error return in r592_probe()
MIPS: Don't round up kernel sections size for memblock_add()
mt76: mt7663s: fix a possible ple quota underflow
mt76: mt7915: set fops_sta_stats.owner to THIS_MODULE
mt76: set fops_tx_stats.owner to THIS_MODULE
mt76: dma: fix possible deadlock running mt76_dma_cleanup
net/mlx5: Properly convey driver version to firmware
mt76: fix memory leak if device probing fails
mt76: fix tkip configuration for mt7615/7663 devices
ASoC: jz4740-i2s: add missed checks for clk_get()
ASoC: q6afe-clocks: Add missing parent clock rate
dm ioctl: fix error return code in target_message
ASoC: cros_ec_codec: fix uninitialized memory read
ASoC: atmel: mchp-spdifrx needs COMMON_CLK
ASoC: qcom: fix QDSP6 dependencies, attempt #3
phy: mediatek: allow compile-testing the hdmi phy
phy: renesas: rcar-gen3-usb2: disable runtime pm in case of failure
memory: ti-emif-sram: only build for ARMv7
memory: jz4780_nemc: Fix potential NULL dereference in jz4780_nemc_probe()
drm/msm: a5xx: Make preemption reset case reentrant
drm/msm: add IOMMU_SUPPORT dependency
clocksource/drivers/arm_arch_timer: Use stable count reader in erratum sne
clocksource/drivers/arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI
cpufreq: ap806: Add missing MODULE_DEVICE_TABLE
cpufreq: highbank: Add missing MODULE_DEVICE_TABLE
cpufreq: mediatek: Add missing MODULE_DEVICE_TABLE
cpufreq: qcom: Add missing MODULE_DEVICE_TABLE
cpufreq: st: Add missing MODULE_DEVICE_TABLE
cpufreq: sun50i: Add missing MODULE_DEVICE_TABLE
cpufreq: loongson1: Add missing MODULE_ALIAS
cpufreq: scpi: Add missing MODULE_ALIAS
cpufreq: vexpress-spc: Add missing MODULE_ALIAS
cpufreq: imx: fix NVMEM_IMX_OCOTP dependency
macintosh/adb-iop: Always wait for reply message from IOP
macintosh/adb-iop: Send correct poll command
staging: bcm2835: fix vchiq_mmal dependencies
staging: greybus: audio: Fix possible leak free widgets in gbaudio_dapm_free_controls
spi: dw: Fix error return code in dw_spi_bt1_probe()
Bluetooth: btusb: Add the missed release_firmware() in btusb_mtk_setup_firmware()
Bluetooth: btmtksdio: Add the missed release_firmware() in mtk_setup_firmware()
Bluetooth: sco: Fix crash when using BT_SNDMTU/BT_RCVMTU option
block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
Bluetooth: btusb: Fix detection of some fake CSR controllers with a bcdDevice val of 0x0134
platform/x86: intel-vbtn: Fix SW_TABLET_MODE always reporting 1 on some HP x360 models
adm8211: fix error return code in adm8211_probe()
mtd: spi-nor: sst: fix BPn bits for the SST25VF064C
mtd: spi-nor: ignore errors in spi_nor_unlock_all()
mtd: spi-nor: atmel: remove global protection flag
mtd: spi-nor: atmel: fix unlock_all() for AT25FS010/040
arm64: dts: meson: g12b: odroid-n2: fix PHY deassert timing requirements
arm64: dts: meson: fix PHY deassert timing requirements
ARM: dts: meson: fix PHY deassert timing requirements
arm64: dts: meson: g12a: x96-max: fix PHY deassert timing requirements
arm64: dts: meson: g12b: w400: fix PHY deassert timing requirements
clk: fsl-sai: fix memory leak
scsi: qedi: Fix missing destroy_workqueue() on error in __qedi_probe
scsi: pm80xx: Fix error return in pm8001_pci_probe()
scsi: iscsi: Fix inappropriate use of put_device()
seq_buf: Avoid type mismatch for seq_buf_init
scsi: fnic: Fix error return code in fnic_probe()
platform/x86: mlx-platform: Fix item counter assignment for MSN2700, MSN24xx systems
platform/x86: mlx-platform: Fix item counter assignment for MSN2700/ComEx system
ARM: 9030/1: entry: omit FP emulation for UND exceptions taken in kernel mode
powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops
powerpc/pseries/hibernation: remove redundant cacheinfo update
powerpc/powermac: Fix low_sleep_handler with CONFIG_VMAP_STACK
drm/mediatek: avoid dereferencing a null hdmi_phy on an error message
ASoC: amd: change clk_get() to devm_clk_get() and add missed checks
coresight: remove broken __exit annotations
ASoC: max98390: Fix error codes in max98390_dsm_init()
powerpc/mm: sanity_check_fault() should work for all, not only BOOK3S
usb: ehci-omap: Fix PM disable depth umbalance in ehci_hcd_omap_probe
usb: oxu210hp-hcd: Fix memory leak in oxu_create
speakup: fix uninitialized flush_lock
nfsd: Fix message level for normal termination
NFSD: Fix 5 seconds delay when doing inter server copy
nfs_common: need lock during iterate through the list
x86/kprobes: Restore BTF if the single-stepping is cancelled
scsi: qla2xxx: Fix FW initialization error on big endian machines
scsi: qla2xxx: Fix N2N and NVMe connect retry failure
platform/chrome: cros_ec_spi: Don't overwrite spi::mode
misc: pci_endpoint_test: fix return value of error branch
bus: fsl-mc: add back accidentally dropped error check
bus: fsl-mc: fix error return code in fsl_mc_object_allocate()
fsi: Aspeed: Add mutex to protect HW access
s390/cio: fix use-after-free in ccw_device_destroy_console
iwlwifi: dbg-tlv: fix old length in is_trig_data_contained()
iwlwifi: mvm: hook up missing RX handlers
erofs: avoid using generic_block_bmap
clk: renesas: r8a779a0: Fix R and OSC clocks
can: m_can: m_can_config_endisable(): remove double clearing of clock stop request bit
powerpc/sstep: Emulate prefixed instructions only when CPU_FTR_ARCH_31 is set
powerpc/sstep: Cover new VSX instructions under CONFIG_VSX
slimbus: qcom: fix potential NULL dereference in qcom_slim_prg_slew()
ALSA: hda/hdmi: fix silent stream for first playback to DP
RDMA/core: Do not indicate device ready when device enablement fails
RDMA/uverbs: Fix incorrect variable type
remoteproc/mediatek: change MT8192 CFG register base
remoteproc/mtk_scp: surround DT device IDs with CONFIG_OF
remoteproc: q6v5-mss: fix error handling in q6v5_pds_enable
remoteproc: qcom: fix reference leak in adsp_start
remoteproc: qcom: pas: fix error handling in adsp_pds_enable
remoteproc: k3-dsp: Fix return value check in k3_dsp_rproc_of_get_memories()
remoteproc: qcom: Fix potential NULL dereference in adsp_init_mmio()
remoteproc/mediatek: unprepare clk if scp_before_load fails
clk: qcom: gcc-sc7180: Use floor ops for sdcc clks
clk: tegra: Fix duplicated SE clock entry
mtd: rawnand: gpmi: fix reference count leak in gpmi ops
mtd: rawnand: meson: Fix a resource leak in init
mtd: rawnand: gpmi: Fix the random DMA timeout issue
samples/bpf: Fix possible hang in xdpsock with multiple threads
fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode()
extcon: max77693: Fix modalias string
crypto: atmel-i2c - select CONFIG_BITREVERSE
mac80211: don't set set TDLS STA bandwidth wider than possible
mac80211: fix a mistake check for rx_stats update
ASoC: wm_adsp: remove "ctl" from list on error in wm_adsp_create_control()
irqchip/alpine-msi: Fix freeing of interrupts on allocation error path
irqchip/ti-sci-inta: Fix printing of inta id on probe success
irqchip/ti-sci-intr: Fix freeing of irqs
dmaengine: ti: k3-udma: Correct normal channel offset when uchan_cnt is not 0
RDMA/hns: Limit the length of data copied between kernel and userspace
RDMA/hns: Normalization the judgment of some features
RDMA/hns: Do shift on traffic class when using RoCEv2
gpiolib: irq hooks: fix recursion in gpiochip_irq_unmask
ath11k: Fix incorrect tlvs in scan start command
irqchip/qcom-pdc: Fix phantom irq when changing between rising/falling
watchdog: armada_37xx: Add missing dependency on HAS_IOMEM
watchdog: sirfsoc: Add missing dependency on HAS_IOMEM
watchdog: sprd: remove watchdog disable from resume fail path
watchdog: sprd: check busy bit before new loading rather than after that
watchdog: Fix potential dereferencing of null pointer
ubifs: Fix error return code in ubifs_init_authentication()
um: Monitor error events in IRQ controller
um: tty: Fix handling of close in tty lines
um: chan_xterm: Fix fd leak
sunrpc: fix xs_read_xdr_buf for partial pages receive
RDMA/mlx5: Fix MR cache memory leak
RDMA/cma: Don't overwrite sgid_attr after device is released
nfc: s3fwrn5: Release the nfc firmware
drm: mxsfb: Silence -EPROBE_DEFER while waiting for bridge
powerpc/perf: Fix Threshold Event Counter Multiplier width for P10
powerpc/ps3: use dma_mapping_error()
perf test: Fix metric parsing test
drm/amdgpu: fix regression in vbios reservation handling on headless
mm/gup: reorganize internal_get_user_pages_fast()
mm/gup: prevent gup_fast from racing with COW during fork
mm/gup: combine put_compound_head() and unpin_user_page()
mm: memcg/slab: fix return of child memcg objcg for root memcg
mm: memcg/slab: fix use after free in obj_cgroup_charge
mm/rmap: always do TTU_IGNORE_ACCESS
sparc: fix handling of page table constructor failure
mm/vmalloc: Fix unlock order in s_stop()
mm/vmalloc.c: fix kasan shadow poisoning size
mm,memory_failure: always pin the page in madvise_inject_error
hugetlb: fix an error code in hugetlb_reserve_pages()
mm: don't wake kswapd prematurely when watermark boosting is disabled
proc: fix lookup in /proc/net subdirectories after setns(2)
checkpatch: fix unescaped left brace
s390/test_unwind: fix CALL_ON_STACK tests
lan743x: fix rx_napi_poll/interrupt ping-pong
ice, xsk: clear the status bits for the next_to_use descriptor
i40e, xsk: clear the status bits for the next_to_use descriptor
net: dsa: qca: ar9331: fix sleeping function called from invalid context bug
dpaa2-eth: fix the size of the mapped SGT buffer
net: bcmgenet: Fix a resource leak in an error handling path in the probe functin
net: mscc: ocelot: Fix a resource leak in the error handling path of the probe function
net: allwinner: Fix some resources leak in the error handling path of the probe and in the remove function
block/rnbd-clt: Get rid of warning regarding size argument in strlcpy
block/rnbd-clt: Fix possible memleak
NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
net: korina: fix return value
devlink: use _BITUL() macro instead of BIT() in the UAPI header
libnvdimm/label: Return -ENXIO for no slot in __blk_label_update
powerpc/32s: Fix cleanup_cpu_mmu_context() compile bug
watchdog: qcom: Avoid context switch in restart handler
watchdog: coh901327: add COMMON_CLK dependency
clk: ti: Fix memleak in ti_fapll_synth_setup
pwm: zx: Add missing cleanup in error path
pwm: lp3943: Dynamically allocate PWM chip base
pwm: imx27: Fix overflow for bigger periods
pwm: sun4i: Remove erroneous else branch
io_uring: cancel only requests of current task
tools build: Add missing libcap to test-all.bin target
perf record: Fix memory leak when using '--user-regs=?' to list registers
qlcnic: Fix error code in probe
nfp: move indirect block cleanup to flower app stop callback
vdpa/mlx5: Use write memory barrier after updating CQ index
virtio_ring: Cut and paste bugs in vring_create_virtqueue_packed()
virtio_net: Fix error code in probe()
virtio_ring: Fix two use after free bugs
vhost scsi: fix error return code in vhost_scsi_set_endpoint()
epoll: check for events when removing a timed out thread from the wait queue
clk: bcm: dvp: Add MODULE_DEVICE_TABLE()
clk: at91: sama7g5: fix compilation error
clk: at91: sam9x60: remove atmel,osc-bypass support
clk: s2mps11: Fix a resource leak in error handling paths in the probe function
clk: sunxi-ng: Make sure divider tables have sentinel
clk: vc5: Use "idt,voltage-microvolt" instead of "idt,voltage-microvolts"
kconfig: fix return value of do_error_if()
powerpc/boot: Fix build of dts/fsl
powerpc/smp: Add __init to init_big_cores()
ARM: 9044/1: vfp: use undef hook for VFP support detection
ARM: 9036/1: uncompress: Fix dbgadtb size parameter name
perf probe: Fix memory leak when synthesizing SDT probes
io_uring: fix racy IOPOLL flush overflow
io_uring: cancel reqs shouldn't kill overflow list
Smack: Handle io_uring kernel thread privileges
proc mountinfo: make splice available again
io_uring: fix io_cqring_events()'s noflush
io_uring: fix racy IOPOLL completions
io_uring: always let io_iopoll_complete() complete polled io
vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
vfio/pci/nvlink2: Do not attempt NPU2 setup on POWER8NVL NPU
media: gspca: Fix memory leak in probe
io_uring: fix io_wqe->work_list corruption
io_uring: fix 0-iov read buffer select
io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()
io_uring: fix ignoring xa_store errors
io_uring: fix double io_uring free
io_uring: make ctx cancel on exit targeted to actual ctx
media: sunxi-cir: ensure IR is handled when it is continuous
media: netup_unidvb: Don't leak SPI master in probe error path
media: ipu3-cio2: Remove traces of returned buffers
media: ipu3-cio2: Return actual subdev format
media: ipu3-cio2: Serialise access to pad format
media: ipu3-cio2: Validate mbus format in setting subdev format
media: ipu3-cio2: Make the field on subdev format V4L2_FIELD_NONE
Input: cyapa_gen6 - fix out-of-bounds stack access
ALSA: hda/ca0132 - Change Input Source enum strings.
ACPI: NFIT: Fix input validation of bus-family
PM: ACPI: PCI: Drop acpi_pm_set_bridge_wakeup()
Revert "ACPI / resources: Use AE_CTRL_TERMINATE to terminate resources walks"
ACPI: PNP: compare the string length in the matching_id()
ALSA: hda: Fix regressions on clear and reconfig sysfs
ALSA: hda/ca0132 - Fix AE-5 rear headphone pincfg.
ALSA: hda/realtek: make bass spk volume adjustable on a yoga laptop
ALSA: hda/realtek - Enable headset mic of ASUS X430UN with ALC256
ALSA: hda/realtek - Enable headset mic of ASUS Q524UQK with ALC255
ALSA: hda/realtek - Add supported for more Lenovo ALC285 Headset Button
ALSA: pcm: oss: Fix a few more UBSAN fixes
ALSA/hda: apply jack fixup for the Acer Veriton N4640G/N6640G/N2510G
ALSA: hda/realtek: Add quirk for MSI-GP73
ALSA: hda/realtek: Apply jack fixup for Quanta NL3
ALSA: hda/realtek: Remove dummy lineout on Acer TravelMate P648/P658
ALSA: hda/realtek - Supported Dell fixed type headset
ALSA: usb-audio: Add VID to support native DSD reproduction on FiiO devices
ALSA: usb-audio: Disable sample read check if firmware doesn't give back
ALSA: usb-audio: Add alias entry for ASUS PRIME TRX40 PRO-S
ALSA: core: memalloc: add page alignment for iram
s390/smp: perform initial CPU reset also for SMT siblings
s390/kexec_file: fix diag308 subcode when loading crash kernel
s390/idle: add missing mt_cycles calculation
s390/idle: fix accounting with machine checks
s390/dasd: fix hanging device offline processing
s390/dasd: prevent inconsistent LCU device data
s390/dasd: fix list corruption of pavgroup group list
s390/dasd: fix list corruption of lcu list
binder: add flag to clear buffer on txn complete
ASoC: cx2072x: Fix doubly definitions of Playback and Capture streams
ASoC: AMD Renoir - add DMI table to avoid the ACP mic probe (broken BIOS)
ASoC: AMD Raven/Renoir - fix the PCI probe (PCI revision)
staging: comedi: mf6x4: Fix AI end-of-conversion detection
z3fold: simplify freeing slots
z3fold: stricter locking and more careful reclaim
perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake
perf/x86/intel/lbr: Fix the return type of get_lbr_cycles()
powerpc/perf: Exclude kernel samples while counting events in user space.
cpufreq: intel_pstate: Use most recent guaranteed performance values
crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata
m68k: Fix WARNING splat in pmac_zilog driver
Documentation: seqlock: s/LOCKTYPE/LOCKNAME/g
EDAC/i10nm: Use readl() to access MMIO registers
EDAC/amd64: Fix PCI component registration
cpuset: fix race between hotplug work and later CPU offline
dyndbg: fix use before null check
USB: serial: mos7720: fix parallel-port state restore
USB: serial: digi_acceleport: fix write-wakeup deadlocks
USB: serial: keyspan_pda: fix dropped unthrottle interrupts
USB: serial: keyspan_pda: fix write deadlock
USB: serial: keyspan_pda: fix stalled writes
USB: serial: keyspan_pda: fix write-wakeup use-after-free
USB: serial: keyspan_pda: fix tx-unthrottle use-after-free
USB: serial: keyspan_pda: fix write unthrottling
btrfs: do not shorten unpin len for caching block groups
btrfs: update last_byte_to_unpin in switch_commit_roots
btrfs: fix race when defragmenting leads to unnecessary IO
ext4: fix an IS_ERR() vs NULL check
ext4: fix a memory leak of ext4_free_data
ext4: fix deadlock with fs freezing and EA inodes
ext4: don't remount read-only with errors=continue on reboot
RISC-V: Fix usage of memblock_enforce_memory_limit
arm64: dts: ti: k3-am65: mark dss as dma-coherent
arm64: dts: marvell: keep SMMU disabled by default for Armada 7040 and 8040
KVM: arm64: Introduce handling of AArch32 TTBCR2 traps
KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits
KVM: SVM: Remove the call to sev_platform_status() during setup
iommu/arm-smmu: Allow implementation specific write_s2cr
iommu/arm-smmu-qcom: Read back stream mappings
iommu/arm-smmu-qcom: Implement S2CR quirk
ARM: dts: pandaboard: fix pinmux for gpio user button of Pandaboard ES
ARM: dts: at91: sama5d2: fix CAN message ram offset and size
ARM: tegra: Populate OPP table for Tegra20 Ventana
xprtrdma: Fix XDRBUF_SPARSE_PAGES support
powerpc/32: Fix vmap stack - Properly set r1 before activating MMU on syscall too
powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at
powerpc/rtas: Fix typo of ibm,open-errinjct in RTAS filter
powerpc/bitops: Fix possible undefined behaviour with fls() and fls64()
powerpc/feature: Add CPU_FTR_NOEXECUTE to G2_LE
powerpc/xmon: Change printk() to pr_cont()
powerpc/8xx: Fix early debug when SMC1 is relocated
powerpc/mm: Fix verification of MMU_FTR_TYPE_44x
powerpc/powernv/npu: Do not attempt NPU2 setup on POWER8NVL NPU
powerpc/powernv/memtrace: Don't leak kernel memory to user space
powerpc/powernv/memtrace: Fix crashing the kernel when enabling concurrently
ovl: make ioctl() safe
ima: Don't modify file descriptor mode on the fly
um: Remove use of asprinf in umid.c
um: Fix time-travel mode
ceph: fix race in concurrent __ceph_remove_cap invocations
SMB3: avoid confusing warning message on mount to Azure
SMB3.1.1: remove confusing mount warning when no SPNEGO info on negprot rsp
SMB3.1.1: do not log warning message if server doesn't populate salt
ubifs: wbuf: Don't leak kernel memory to flash
jffs2: Fix GC exit abnormally
jffs2: Fix ignoring mounting options problem during remounting
fsnotify: generalize handle_inode_event()
inotify: convert to handle_inode_event() interface
fsnotify: fix events reported to watching parent and child
jfs: Fix array index bounds check in dbAdjTree
drm/panfrost: Fix job timeout handling
drm/panfrost: Move the GPU reset bits outside the timeout handler
platform/x86: mlx-platform: remove an unused variable
drm/amdgpu: only set DP subconnector type on DP and eDP connectors
drm/amd/display: Fix memory leaks in S3 resume
drm/dp_aux_dev: check aux_dev before use in drm_dp_aux_dev_get_by_minor()
drm/i915: Fix mismatch between misplaced vma check and vma insert
iio: ad_sigma_delta: Don't put SPI transfer buffer on the stack
spi: pxa2xx: Fix use-after-free on unbind
spi: spi-sh: Fix use-after-free on unbind
spi: atmel-quadspi: Fix use-after-free on unbind
spi: spi-mtk-nor: Don't leak SPI master in probe error path
spi: ar934x: Don't leak SPI master in probe error path
spi: davinci: Fix use-after-free on unbind
spi: fsl: fix use of spisel_boot signal on MPC8309
spi: gpio: Don't leak SPI master in probe error path
spi: mxic: Don't leak SPI master in probe error path
spi: npcm-fiu: Disable clock in probe error path
spi: pic32: Don't leak DMA channels in probe error path
spi: rb4xx: Don't leak SPI master in probe error path
spi: rpc-if: Fix use-after-free on unbind
spi: sc18is602: Don't leak SPI master in probe error path
spi: spi-geni-qcom: Fix use-after-free on unbind
spi: spi-qcom-qspi: Fix use-after-free on unbind
spi: st-ssc4: Fix unbalanced pm_runtime_disable() in probe error path
spi: synquacer: Disable clock in probe error path
spi: mt7621: Disable clock in probe error path
spi: mt7621: Don't leak SPI master in probe error path
spi: atmel-quadspi: Disable clock in probe error path
spi: atmel-quadspi: Fix AHB memory accesses
soc: qcom: smp2p: Safely acquire spinlock without IRQs
mtd: spinand: Fix OOB read
mtd: parser: cmdline: Fix parsing of part-names with colons
mtd: core: Fix refcounting for unpartitioned MTDs
mtd: rawnand: qcom: Fix DMA sync on FLASH_STATUS register read
mtd: rawnand: meson: fix meson_nfc_dma_buffer_release() arguments
scsi: qla2xxx: Fix crash during driver load on big endian machines
scsi: lpfc: Fix invalid sleeping context in lpfc_sli4_nvmet_alloc()
scsi: lpfc: Fix scheduling call while in softirq context in lpfc_unreg_rpi
scsi: lpfc: Re-fix use after free in lpfc_rq_buf_free()
openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
iio: buffer: Fix demux update
iio: adc: rockchip_saradc: fix missing clk_disable_unprepare() on error in rockchip_saradc_resume
iio: imu: st_lsm6dsx: fix edge-trigger interrupts
iio:light:rpr0521: Fix timestamp alignment and prevent data leak.
iio:light:st_uvis25: Fix timestamp alignment and prevent data leak.
iio:magnetometer:mag3110: Fix alignment and data leak issues.
iio:pressure:mpl3115: Force alignment of buffer
iio:imu:bmi160: Fix too large a buffer.
iio:imu:bmi160: Fix alignment and data leak issues
iio:adc:ti-ads124s08: Fix buffer being too long.
iio:adc:ti-ads124s08: Fix alignment and data leak issues.
md/cluster: block reshape with remote resync job
md/cluster: fix deadlock when node is doing resync job
pinctrl: sunxi: Always call chained_irq_{enter, exit} in sunxi_pinctrl_irq_handler
clk: ingenic: Fix divider calculation with div tables
clk: mvebu: a3700: fix the XTAL MODE pin to MPP1_9
clk: tegra: Do not return 0 on failure
counter: microchip-tcb-capture: Fix CMR value check
device-dax/core: Fix memory leak when rmmod dax.ko
dma-buf/dma-resv: Respect num_fences when initializing the shared fence list.
driver: core: Fix list corruption after device_del()
xen-blkback: set ring->xenblkd to NULL after kthread_stop()
xen/xenbus: Allow watches discard events before queueing
xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()
xen/xenbus/xen_bus_type: Support will_handle watch callback
xen/xenbus: Count pending messages for each watch
xenbus/xenbus_backend: Disallow pending watch messages
memory: jz4780_nemc: Fix an error pointer vs NULL check in probe()
memory: renesas-rpc-if: Fix a node reference leak in rpcif_probe()
memory: renesas-rpc-if: Return correct value to the caller of rpcif_manual_xfer()
memory: renesas-rpc-if: Fix unbalanced pm_runtime_enable in rpcif_{enable,disable}_rpm
libnvdimm/namespace: Fix reaping of invalidated block-window-namespace labels
platform/x86: intel-vbtn: Allow switch events on Acer Switch Alpha 12
tracing: Disable ftrace selftests when any tracer is running
mt76: add back the SUPPORTS_REORDERING_BUFFER flag
of: fix linker-section match-table corruption
PCI: Fix pci_slot_release() NULL pointer dereference
regulator: axp20x: Fix DLDO2 voltage control register mask for AXP22x
remoteproc: sysmon: Ensure remote notification ordering
thermal/drivers/cpufreq_cooling: Update cpufreq_state only if state has changed
rtc: ep93xx: Fix NULL pointer dereference in ep93xx_rtc_read_time
Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
null_blk: Fix zone size initialization
null_blk: Fail zone append to conventional zones
drm/edid: fix objtool warning in drm_cvt_modes()
x86/CPU/AMD: Save AMD NodeId as cpu_die_id
Linux 5.10.4

Signed-off-by: Greg Kroah-Hartman
Change-Id: I25209e79d8b9faf5382087955a29b7404bdefe38

Greg Kroah-Hartman
2020-12-30 19:47:03 +0800
8f4bf6eea btrfs: fix race when defragmenting leads to unnecessary IO ... Browse Code »

commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 upstream.

When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, that possibly want to operate on other
file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).

So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.

I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due a race that from a functional point of view is
not harmful (it could be triggered with btrfs/062 from fstests).

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Filipe Manana
2020-12-30 18:54:13 +0800
5c5bc5738 btrfs: update last_byte_to_unpin in switch_commit_roots ... Browse Code »

commit 27d56e62e4748c2135650c260024e9904b8c1a0a upstream.

While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old style caching. Consider
the following scenario

commit root
+----+----+----+----+----+----+----+
|\\\\| |\\\\|\\\\| |\\\\|\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7

new commit root
+----+----+----+----+----+----+----+
| | | |\\\\| | |\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7

Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots. In this example lets assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1. Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we drop the commit_root_sem and re-acquired
it. Now caching_ctl->progress == 3. We swap out the commit root and
carry on to unpin.

The race can happen like:

1) The caching thread was running using the old commit root when it
found the extent for [2, 3);

2) Then it released the commit_root_sem because it was in the last
item of a leaf and the semaphore was contended, and set ->progress
to 3 (value of 'last'), as the last extent item in the current leaf
was for the extent for range [2, 3);

3) Next time it gets the commit_root_sem, will start using the new
commit root and search for a key with offset 3, so it never finds
the hole for [2, 3).

So the caching thread never saw [2, 3) as free space in any of the
commit roots, and by the time finish_extent_commit() was called for
the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
subrange [0, 1) to the free space cache, skipping [2, 3).

In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3). However because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache. This results in us missing a chunk of free space
in memory (on disk too, unless we have a power failure before writing
the free space cache to disk).

Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots, this ensures that we will always be
consistent.

CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Filipe Manana
Signed-off-by: Josef Bacik
[ update changelog with Filipe's review comments ]
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Josef Bacik
2020-12-30 18:54:13 +0800
56d1654dc btrfs: do not shorten unpin len for caching block groups ... Browse Code »

commit 9076dbd5ee837c3882fc42891c14cecd0354a849 upstream.

While fixing up our ->last_byte_to_unpin locking I noticed that we will
shorten len based on ->last_byte_to_unpin if we're caching when we're
adding back the free space. This is correct for the free space, as we
cannot unpin more than ->last_byte_to_unpin, however we use len to
adjust the ->bytes_pinned counters and such, which need to track the
actual pinned usage. This could result in
WARN_ON(space_info->bytes_pinned) triggering at unmount time.

Fix this by using a local variable for the amount to add to free space
cache, and leave len untouched in this case.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Josef Bacik
2020-12-30 18:54:13 +0800

28 Nov, 2020

2 commits

b19651bfc Merge c84e1efae022 ("Merge tag 'asm-generic-fixes-5.10-2' of git://git.kernel.or… ... Browse Code »

…g/pub/scm/linux/kernel/git/arnd/asm-generic") into android-mainline

Steps on the way to 5.10-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I644783003a83186a34cdbb753aa492f4350f49ee

Greg Kroah-Hartman
2020-11-28 15:23:09 +0800
a17a3ca55 Merge tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux ... Browse Code »

Pull btrfs fixes from David Sterba:
"A few fixes for various warnings that accumulated over past two weeks:

- tree-checker: add missing return values for some errors

- lockdep fixes
- when reading qgroup config and starting quota rescan
- reverse order of quota ioctl lock and VFS freeze lock

- avoid accessing potentially stale fs info during device scan,
reported by syzbot

- add scope NOFS protection around qgroup relation changes

- check for running transaction before flushing qgroups

- fix tracking of new delalloc ranges for some cases"

* tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix lockdep splat when enabling and disabling qgroups
btrfs: do nofs allocations when adding and removing qgroup relations
btrfs: fix lockdep splat when reading qgroup config on mount
btrfs: tree-checker: add missing returns after data_ref alignment checks
btrfs: don't access possibly stale fs_info data for printing duplicate device
btrfs: tree-checker: add missing return after error in root_item
btrfs: qgroup: don't commit transaction when we already hold the handle
btrfs: fix missing delalloc new bit for new delalloc ranges

Linus Torvalds
2020-11-28 04:42:13 +0800

24 Nov, 2020

5 commits

a855fbe69 btrfs: fix lockdep splat when enabling and disabling qgroups ... Browse Code »

When running test case btrfs/017 from fstests, lockdep reported the
following splat:

[ 1297.067385] ======================================================
[ 1297.067708] WARNING: possible circular locking dependency detected
[ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
[ 1297.068322] ------------------------------------------------------
[ 1297.068629] btrfs/189080 is trying to acquire lock:
[ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.069274]
but task is already holding lock:
[ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.070219]
which lock already depends on the new lock.

[ 1297.071131]
the existing dependency chain (in reverse order) is:
[ 1297.071721]
-> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
[ 1297.072375] lock_acquire+0xd8/0x490
[ 1297.072710] __mutex_lock+0xa3/0xb30
[ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
[ 1297.073421] create_subvol+0x194/0x990 [btrfs]
[ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
[ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs]
[ 1297.075245] __x64_sys_ioctl+0x83/0xb0
[ 1297.075617] do_syscall_64+0x33/0x80
[ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.076380]
-> #0 (sb_internal#2){.+.+}-{0:0}:
[ 1297.077166] check_prev_add+0x91/0xc60
[ 1297.077572] __lock_acquire+0x1740/0x3110
[ 1297.077984] lock_acquire+0xd8/0x490
[ 1297.078411] start_transaction+0x3c5/0x760 [btrfs]
[ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.079789] __x64_sys_ioctl+0x83/0xb0
[ 1297.080232] do_syscall_64+0x33/0x80
[ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.081139]
other info that might help us debug this:

[ 1297.082536] Possible unsafe locking scenario:

[ 1297.083510] CPU0 CPU1
[ 1297.084005] ---- ----
[ 1297.084500] lock(&fs_info->qgroup_ioctl_lock);
[ 1297.084994] lock(sb_internal#2);
[ 1297.085485] lock(&fs_info->qgroup_ioctl_lock);
[ 1297.085974] lock(sb_internal#2);
[ 1297.086454]
*** DEADLOCK ***
[ 1297.087880] 3 locks held by btrfs/189080:
[ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
[ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.089771]
stack backtrace:
[ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
[ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1297.092123] Call Trace:
[ 1297.092629] dump_stack+0x8d/0xb5
[ 1297.093115] check_noncircular+0xff/0x110
[ 1297.093596] check_prev_add+0x91/0xc60
[ 1297.094076] ? kvm_clock_read+0x14/0x30
[ 1297.094553] ? kvm_sched_clock_read+0x5/0x10
[ 1297.095029] __lock_acquire+0x1740/0x3110
[ 1297.095510] lock_acquire+0xd8/0x490
[ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.096476] start_transaction+0x3c5/0x760 [btrfs]
[ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.098904] ? do_user_addr_fault+0x20c/0x430
[ 1297.099382] ? kvm_clock_read+0x14/0x30
[ 1297.099854] ? kvm_sched_clock_read+0x5/0x10
[ 1297.100328] ? sched_clock+0x5/0x10
[ 1297.100801] ? sched_clock_cpu+0x12/0x180
[ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0
[ 1297.101739] __x64_sys_ioctl+0x83/0xb0
[ 1297.102207] do_syscall_64+0x33/0x80
[ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.103148] RIP: 0033:0x7f773ff65d87

This is because during the quota enable ioctl we lock first the mutex
qgroup_ioctl_lock and then start a transaction, and starting a transaction
acquires a fs freeze semaphore (at the VFS level). However, every other
code path, except for the quota disable ioctl path, we do the opposite:
we start a transaction and then lock the mutex.

So fix this by making the quota enable and disable paths to start the
transaction without having the mutex locked, and then, after starting the
transaction, lock the mutex and check if some other task already enabled
or disabled the quotas, bailing with success if that was the case.

Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Filipe Manana
2020-11-24 04:16:43 +0800
7aa6d3598 btrfs: do nofs allocations when adding and removing qgroup relations ... Browse Code »

When adding or removing a qgroup relation we are doing a GFP_KERNEL
allocation which is not safe because we are holding a transaction
handle open and that can make us deadlock if the allocator needs to
recurse into the filesystem. So just surround those calls with a
nofs context.

Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Filipe Manana
2020-11-24 04:16:40 +0800
3d05cad3c btrfs: fix lockdep splat when reading qgroup config on mount ... Browse Code »

Lockdep reported the following splat when running test btrfs/190 from
fstests:

[ 9482.126098] ======================================================
[ 9482.126184] WARNING: possible circular locking dependency detected
[ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
[ 9482.126365] ------------------------------------------------------
[ 9482.126456] mount/24187 is trying to acquire lock:
[ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs]
[ 9482.126647]
but task is already holding lock:
[ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
[ 9482.126886]
which lock already depends on the new lock.

[ 9482.127078]
the existing dependency chain (in reverse order) is:
[ 9482.127213]
-> #1 (btrfs-quota-00){++++}-{3:3}:
[ 9482.127366] lock_acquire+0xd8/0x490
[ 9482.127436] down_read_nested+0x45/0x220
[ 9482.127528] __btrfs_tree_read_lock+0x27/0x120 [btrfs]
[ 9482.127613] btrfs_read_lock_root_node+0x41/0x130 [btrfs]
[ 9482.127702] btrfs_search_slot+0x514/0xc30 [btrfs]
[ 9482.127788] update_qgroup_status_item+0x72/0x140 [btrfs]
[ 9482.127877] btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs]
[ 9482.127964] btrfs_work_helper+0xf1/0x600 [btrfs]
[ 9482.128039] process_one_work+0x24e/0x5e0
[ 9482.128110] worker_thread+0x50/0x3b0
[ 9482.128181] kthread+0x153/0x170
[ 9482.128256] ret_from_fork+0x22/0x30
[ 9482.128327]
-> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}:
[ 9482.128464] check_prev_add+0x91/0xc60
[ 9482.128551] __lock_acquire+0x1740/0x3110
[ 9482.128623] lock_acquire+0xd8/0x490
[ 9482.130029] __mutex_lock+0xa3/0xb30
[ 9482.130590] qgroup_rescan_init+0x43/0xf0 [btrfs]
[ 9482.131577] btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
[ 9482.132175] open_ctree+0x1228/0x18a0 [btrfs]
[ 9482.132756] btrfs_mount_root.cold+0x13/0xed [btrfs]
[ 9482.133325] legacy_get_tree+0x30/0x60
[ 9482.133866] vfs_get_tree+0x28/0xe0
[ 9482.134392] fc_mount+0xe/0x40
[ 9482.134908] vfs_kern_mount.part.0+0x71/0x90
[ 9482.135428] btrfs_mount+0x13b/0x3e0 [btrfs]
[ 9482.135942] legacy_get_tree+0x30/0x60
[ 9482.136444] vfs_get_tree+0x28/0xe0
[ 9482.136949] path_mount+0x2d7/0xa70
[ 9482.137438] do_mount+0x75/0x90
[ 9482.137923] __x64_sys_mount+0x8e/0xd0
[ 9482.138400] do_syscall_64+0x33/0x80
[ 9482.138873] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9482.139346]
other info that might help us debug this:

[ 9482.140735] Possible unsafe locking scenario:

[ 9482.141594] CPU0 CPU1
[ 9482.142011] ---- ----
[ 9482.142411] lock(btrfs-quota-00);
[ 9482.142806] lock(&fs_info->qgroup_rescan_lock);
[ 9482.143216] lock(btrfs-quota-00);
[ 9482.143629] lock(&fs_info->qgroup_rescan_lock);
[ 9482.144056]
*** DEADLOCK ***

[ 9482.145242] 2 locks held by mount/24187:
[ 9482.145637] #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400
[ 9482.146061] #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
[ 9482.146509]
stack backtrace:
[ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1
[ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 9482.148709] Call Trace:
[ 9482.149169] dump_stack+0x8d/0xb5
[ 9482.149628] check_noncircular+0xff/0x110
[ 9482.150090] check_prev_add+0x91/0xc60
[ 9482.150561] ? kvm_clock_read+0x14/0x30
[ 9482.151017] ? kvm_sched_clock_read+0x5/0x10
[ 9482.151470] __lock_acquire+0x1740/0x3110
[ 9482.151941] ? __btrfs_tree_read_lock+0x27/0x120 [btrfs]
[ 9482.152402] lock_acquire+0xd8/0x490
[ 9482.152887] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
[ 9482.153354] __mutex_lock+0xa3/0xb30
[ 9482.153826] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
[ 9482.154301] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
[ 9482.154768] ? qgroup_rescan_init+0x43/0xf0 [btrfs]
[ 9482.155226] qgroup_rescan_init+0x43/0xf0 [btrfs]
[ 9482.155690] btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
[ 9482.156160] open_ctree+0x1228/0x18a0 [btrfs]
[ 9482.156643] btrfs_mount_root.cold+0x13/0xed [btrfs]
[ 9482.157108] ? rcu_read_lock_sched_held+0x5d/0x90
[ 9482.157567] ? kfree+0x31f/0x3e0
[ 9482.158030] legacy_get_tree+0x30/0x60
[ 9482.158489] vfs_get_tree+0x28/0xe0
[ 9482.158947] fc_mount+0xe/0x40
[ 9482.159403] vfs_kern_mount.part.0+0x71/0x90
[ 9482.159875] btrfs_mount+0x13b/0x3e0 [btrfs]
[ 9482.160335] ? rcu_read_lock_sched_held+0x5d/0x90
[ 9482.160805] ? kfree+0x31f/0x3e0
[ 9482.161260] ? legacy_get_tree+0x30/0x60
[ 9482.161714] legacy_get_tree+0x30/0x60
[ 9482.162166] vfs_get_tree+0x28/0xe0
[ 9482.162616] path_mount+0x2d7/0xa70
[ 9482.163070] do_mount+0x75/0x90
[ 9482.163525] __x64_sys_mount+0x8e/0xd0
[ 9482.163986] do_syscall_64+0x33/0x80
[ 9482.164437] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9482.164902] RIP: 0033:0x7f51e907caaa

This happens because at btrfs_read_qgroup_config() we can call
qgroup_rescan_init() while holding a read lock on a quota btree leaf,
acquired by the previous call to btrfs_search_slot_for_read(), and
qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.

A qgroup rescan worker does the opposite: it acquires the mutex
qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
update the qgroup status item in the quota btree through the call to
update_qgroup_status_item(). This inversion of locking order
between the qgroup_rescan_lock mutex and quota btree locks causes the
splat.

Fix this simply by releasing and freeing the path before calling
qgroup_rescan_init() at btrfs_read_qgroup_config().

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Filipe Manana
2020-11-24 04:16:30 +0800
6d06b0ad9 btrfs: tree-checker: add missing returns after data_ref alignment checks ... Browse Code »

There are sectorsize alignment checks that are reported but then
check_extent_data_ref continues. This was not intended, wrong alignment
is not a minor problem and we should return with error.

CC: stable@vger.kernel.org # 5.4+
Fixes: 0785a9aacf9d ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
Reviewed-by: Qu Wenruo
Signed-off-by: David Sterba

David Sterba
2020-11-24 04:16:21 +0800
0697d9a61 btrfs: don't access possibly stale fs_info data for printing duplicate device ... Browse Code »

Syzbot reported a possible use-after-free when printing a duplicate device
warning device_list_add().

At this point it can happen that a btrfs_device::fs_info is not correctly
setup yet, so we're accessing stale data, when printing the warning
message using the btrfs_printk() wrappers.

==================================================================
BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068

CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1d6/0x29e lib/dump_stack.c:118
print_address_description+0x66/0x620 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report+0x132/0x1d0 mm/kasan/report.c:530
btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x44840a
RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003

Allocated by task 6945:
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
kmalloc_node include/linux/slab.h:577 [inline]
kvmalloc_node+0x81/0x110 mm/util.c:574
kvmalloc include/linux/mm.h:757 [inline]
kvzalloc include/linux/mm.h:765 [inline]
btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 6945:
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
__cache_free mm/slab.c:3418 [inline]
kfree+0x113/0x200 mm/slab.c:3756
deactivate_locked_super+0xa7/0xf0 fs/super.c:335
btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
fc_mount fs/namespace.c:978 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x88/0x270 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x179d/0x29e0 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount+0x126/0x180 fs/namespace.c:3390
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The buggy address belongs to the object at ffff8880878e0000
which belongs to the cache kmalloc-16k of size 16384
The buggy address is located 1704 bytes inside of
16384-byte region [ffff8880878e0000, ffff8880878e4000)
The buggy address belongs to the page:
page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
flags: 0xfffe0000010200(slab|head)
raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

The syzkaller reproducer for this use-after-free crafts a filesystem image
and loop mounts it twice in a loop. The mount will fail as the crafted
image has an invalid chunk tree. When this happens btrfs_mount_root() will
call deactivate_locked_super(), which then cleans up fs_info and
fs_info::sb. If a second thread now adds the same block-device to the
filesystem, it will get detected as a duplicate device and
device_list_add() will reject the duplicate and print a warning. But as
the fs_info pointer passed in is non-NULL this will result in a
use-after-free.

Instead of printing possibly uninitialized or already freed memory in
btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
device name will be skipped altogether.

There was a slightly different approach discussed in
https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u

Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov
Reviewed-by: Anand Jain
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2020-11-24 04:16:12 +0800

14 Nov, 2020

3 commits

1a49a97df btrfs: tree-checker: add missing return after error in root_item ... Browse Code »

There's a missing return statement after an error is found in the
root_item, this can cause further problems when a crafted image triggers
the error.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo
Signed-off-by: Daniel Xu
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Daniel Xu
2020-11-14 05:18:10 +0800
6f23277a4 btrfs: qgroup: don't commit transaction when we already hold the handle ... Browse Code »

[BUG]
When running the following script, btrfs will trigger an ASSERT():

#/bin/bash
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite 0 1G" $mnt/file
sync
btrfs quota enable $mnt
btrfs quota rescan -w $mnt

# Manually set the limit below current usage
btrfs qgroup limit 512M $mnt $mnt

# Crash happens
touch $mnt/file

The dmesg looks like this:

assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3230!
invalid opcode: 0000 [#1] SMP PTI
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
try_flush_qgroup+0x67/0x100 [btrfs]
__btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
btrfs_update_inode+0x9d/0x110 [btrfs]
btrfs_dirty_inode+0x5d/0xd0 [btrfs]
touch_atime+0xb5/0x100
iterate_dir+0xf1/0x1b0
__x64_sys_getdents64+0x78/0x110
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb5afe588db

[CAUSE]
In try_flush_qgroup(), we assume we don't hold a transaction handle at
all. This is true for data reservation and mostly true for metadata.
Since data space reservation always happens before we start a
transaction, and for most metadata operation we reserve space in
start_transaction().

But there is an exception, btrfs_delayed_inode_reserve_metadata().
It holds a transaction handle, while still trying to reserve extra
metadata space.

When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we
will join current transaction and commit, while we still have
transaction handle from qgroup code.

[FIX]
Let's check current->journal before we join the transaction.

If current->journal is unset or BTRFS_SEND_TRANS_STUB, it means
we are not holding a transaction, thus are able to join and then commit
transaction.

If current->journal is a valid transaction handle, we avoid committing
transaction and just end it

This is less effective than committing current transaction, as it won't
free metadata reserved space, but we may still free some data space
before new data writes.

Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
Reviewed-by: Filipe Manana
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba

Qu Wenruo
2020-11-14 05:17:57 +0800
c33473098 btrfs: fix missing delalloc new bit for new delalloc ranges ... Browse Code »

When doing a buffered write, through one of the write family syscalls, we
look for ranges which currently don't have allocated extents and set the
'delalloc new' bit on them, so that we can report a correct number of used
blocks to the stat(2) syscall until delalloc is flushed and ordered extents
complete.

However there are a few other places where we can do a buffered write
against a range that is mapped to a hole (no extent allocated) and where
we do not set the 'new delalloc' bit. Those places are:

- Doing a memory mapped write against a hole;

- Cloning an inline extent into a hole starting at file offset 0;

- Calling btrfs_cont_expand() when the i_size of the file is not aligned
to the sector size and is located in a hole. For example when cloning
to a destination offset beyond EOF.

So after such cases, until the corresponding delalloc range is flushed and
the respective ordered extents complete, we can report an incorrect number
of blocks used through the stat(2) syscall.

In some cases we can end up reporting 0 used blocks to stat(2), which is a
particular bad value to report as it may mislead tools to think a file is
completely sparse when its i_size is not zero, making them skip reading
any data, an undesired consequence for tools such as archivers and other
backup tools, as reported a long time ago in the following thread (and
other past threads):

https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

Example reproducer:

$ cat reproducer.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
mount $DEV $MNT

xfs_io -f -c "truncate 64K" \
-c "mmap -w 0 64K" \
-c "mwrite -S 0xab 0 64K" \
-c "munmap" \
$MNT/foo

blocks_used=$(stat -c %b $MNT/foo)
echo "blocks used: $blocks_used"

if [ $blocks_used -eq 0 ]; then
echo "ERROR: blocks used is 0"
fi

umount $DEV

$ ./reproducer.sh
blocks used: 0
ERROR: blocks used is 0

So move the logic that decides to set the 'delalloc bit' bit into the
function btrfs_set_extent_delalloc(), since that is what we use for all
those missing cases as well as for the cases that currently work well.

This change is also preparatory work for an upcoming patch that fixes
other problems related to tracking and reporting the number of bytes used
by an inode.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba

Filipe Manana
2020-11-14 05:15:59 +0800

12 Nov, 2020

1 commit

7f6480e40 Merge eccc87672492 ("Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/… ... Browse Code »

…kernel/git/viro/vfs") into android-mainline

Steps on the way to 5.10-rc4

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9e0fa89c0f6f306fe802ae95c8d01d9ba558e111

Greg Kroah-Hartman
2020-11-12 15:02:21 +0800

11 Nov, 2020

1 commit

e2f0c565e Merge tag 'for-5.10-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux ... Browse Code »

Pull btrfs fixes from David Sterba:
"A handful of minor fixes and updates:

- handle missing device replace item on mount (syzbot report)

- fix space reservation calculation when finishing relocation

- fix memory leak on error path in ref-verify (debugging feature)

- fix potential overflow during defrag on 32bit arches

- minor code update to silence smatch warning

- minor error message updates"

* tag 'for-5.10-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod
btrfs: dev-replace: fail mount if we don't have replace item with target device
btrfs: scrub: update message regarding read-only status
btrfs: clean up NULL checks in qgroup_unreserve_range()
btrfs: fix min reserved size calculation in merge_reloc_root
btrfs: print the block rsv type when we fail our reservation
btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch

Linus Torvalds
2020-11-11 02:07:15 +0800

05 Nov, 2020

7 commits

468600c6e btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod ... Browse Code »

There is one error handling path that does not free ref, which may cause
a minor memory leak.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik
Signed-off-by: Dinghao Liu
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Dinghao Liu
2020-11-05 20:03:39 +0800
cf89af146 btrfs: dev-replace: fail mount if we don't have replace item with target device ... Browse Code »

If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
item, then it means the filesystem is inconsistent state. This is either
corruption or a crafted image. Fail the mount as this needs a closer
look what is actually wrong.

As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
item, in __btrfs_free_extra_devids() we determine that there is an
extra device, and free those extra devices but continue to mount the
device.
However, we were wrong in keeping tack of the rw_devices so the syzbot
testcase failed:

WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x198/0x1fd lib/dump_stack.c:118
panic+0x347/0x7c0 kernel/panic.c:231
__warn.cold+0x20/0x46 kernel/panic.c:600
report_bug+0x1bd/0x210 lib/bug.c:198
handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
close_fs_devices fs/btrfs/volumes.c:1193 [inline]
btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
btrfs_fill_super fs/btrfs/super.c:1316 [inline]
btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672

The fix here is, when we determine that there isn't a replace item
then fail the mount if there is a replace target device (devid 0).

CC: stable@vger.kernel.org # 4.19+
Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
Signed-off-by: Anand Jain
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Anand Jain
2020-11-05 20:03:31 +0800
a4852cf26 btrfs: scrub: update message regarding read-only status ... Browse Code »

Based on user feedback update the message printed when scrub fails to
start due to write requirements. To make a distinction add a device id
to the messages.

Reviewed-by: Josef Bacik
Signed-off-by: David Sterba

David Sterba
2020-11-05 20:02:58 +0800
f07728d54 btrfs: clean up NULL checks in qgroup_unreserve_range() ... Browse Code »

Smatch complains that this code dereferences "entry" before checking
whether it's NULL on the next line. Fortunately, rb_entry() will never
return NULL so it doesn't cause a problem. We can clean up the NULL
checking a bit to silence the warning and make the code more clear.

Reviewed-by: Qu Wenruo
Signed-off-by: Dan Carpenter
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Dan Carpenter
2020-11-05 20:02:20 +0800
fca3a45d0 btrfs: fix min reserved size calculation in merge_reloc_root ... Browse Code »

The minimum reserve size was adjusted to take into account the height of
the tree we are merging, however we can have a root with a level == 0.
What we want is root_level + 1 to get the number of nodes we may have to
cow. This fixes the enospc_debug warning pops with btrfs/101.

Nikolay: this fixes failures on btrfs/060 btrfs/062 btrfs/063 and
btrfs/195 That I was seeing, the call trace was:

[ 3680.515564] ------------[ cut here ]------------
[ 3680.515566] BTRFS: block rsv returned -28
[ 3680.515585] WARNING: CPU: 2 PID: 8339 at fs/btrfs/block-rsv.c:521 btrfs_use_block_rsv+0x162/0x180
[ 3680.515587] Modules linked in:
[ 3680.515591] CPU: 2 PID: 8339 Comm: btrfs Tainted: G W 5.9.0-rc8-default #95
[ 3680.515593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 3680.515595] RIP: 0010:btrfs_use_block_rsv+0x162/0x180
[ 3680.515600] RSP: 0018:ffffa01ac9753910 EFLAGS: 00010282
[ 3680.515602] RAX: 0000000000000000 RBX: ffff984b34200000 RCX: 0000000000000027
[ 3680.515604] RDX: 0000000000000027 RSI: 0000000000000000 RDI: ffff984b3bd19e28
[ 3680.515606] RBP: 0000000000004000 R08: ffff984b3bd19e20 R09: 0000000000000001
[ 3680.515608] R10: 0000000000000004 R11: 0000000000000046 R12: ffff984b264fdc00
[ 3680.515609] R13: ffff984b13149000 R14: 00000000ffffffe4 R15: ffff984b34200000
[ 3680.515613] FS: 00007f4e2912b8c0(0000) GS:ffff984b3bd00000(0000) knlGS:0000000000000000
[ 3680.515615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3680.515617] CR2: 00007fab87122150 CR3: 0000000118e42000 CR4: 00000000000006e0
[ 3680.515620] Call Trace:
[ 3680.515627] btrfs_alloc_tree_block+0x8b/0x340
[ 3680.515633] ? __lock_acquire+0x51a/0xac0
[ 3680.515646] alloc_tree_block_no_bg_flush+0x4f/0x60
[ 3680.515651] __btrfs_cow_block+0x14e/0x7e0
[ 3680.515662] btrfs_cow_block+0x144/0x2c0
[ 3680.515670] merge_reloc_root+0x4d4/0x610
[ 3680.515675] ? btrfs_lookup_fs_root+0x78/0x90
[ 3680.515686] merge_reloc_roots+0xee/0x280
[ 3680.515695] relocate_block_group+0x2ce/0x5e0
[ 3680.515704] btrfs_relocate_block_group+0x16e/0x310
[ 3680.515711] btrfs_relocate_chunk+0x38/0xf0
[ 3680.515716] btrfs_shrink_device+0x200/0x560
[ 3680.515728] btrfs_rm_device+0x1ae/0x6a6
[ 3680.515744] ? _copy_from_user+0x6e/0xb0
[ 3680.515750] btrfs_ioctl+0x1afe/0x28c0
[ 3680.515755] ? find_held_lock+0x2b/0x80
[ 3680.515760] ? do_user_addr_fault+0x1f8/0x418
[ 3680.515773] ? __x64_sys_ioctl+0x77/0xb0
[ 3680.515775] __x64_sys_ioctl+0x77/0xb0
[ 3680.515781] do_syscall_64+0x31/0x70
[ 3680.515785] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: Nikolay Borisov
Fixes: 44d354abf33e ("btrfs: relocation: review the call sites which can be interrupted by signal")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov
Tested-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2020-11-05 20:02:07 +0800
e38fdb716 btrfs: print the block rsv type when we fail our reservation ... Browse Code »

To help with debugging, print the type of the block rsv when we fail to
use our target block rsv in btrfs_use_block_rsv.

This now produces:

[ 544.672035] BTRFS: block rsv 1 returned -28

which is still cryptic without consulting the enum in block-rsv.h but I
guess it's better than nothing.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
[ add note from Nikolay ]
Signed-off-by: David Sterba

Josef Bacik
2020-11-05 20:02:05 +0800
a1fbc6750 btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch ... Browse Code »

On 32-bit systems, this shift will overflow for files larger than 4GB as
start_index is unsigned long while the calls to btrfs_delalloc_*_space
expect u64.

CC: stable@vger.kernel.org # 4.4+
Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
Reviewed-by: Josef Bacik
Signed-off-by: Matthew Wilcox (Oracle)
Reviewed-by: David Sterba
[ define the variable instead of repeating the shift ]
Signed-off-by: David Sterba

Matthew Wilcox (Oracle)
2020-11-05 20:01:42 +0800

02 Nov, 2020

1 commit

b0748cf3e Merge c2dc4c073fb7 ("Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux… ... Browse Code »

…/kernel/git/mst/vhost") into android-mainline

Steps on the way to 5.10-rc2

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I686e55205b113f69b9ea8a22c56d86a572e8c603

Greg Kroah-Hartman
2020-11-02 18:59:56 +0800

31 Oct, 2020

1 commit

f5d808567 Merge tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux ... Browse Code »

Pull btrfs fixes from David Sterba:

- lockdep fixes:
- drop path locks before manipulating sysfs objects or qgroups
- preliminary fixes before tree locks get switched to rwsem
- use annotated seqlock

- build warning fixes (printk format)

- fix relocation vs fallocate race

- tree checker properly validates number of stripes and parity

- readahead vs device replace fixes

- iomap dio fix for unnecessary buffered io fallback

* tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: convert data_seqcount to seqcount_mutex_t
btrfs: don't fallback to buffered read if we don't need to
btrfs: add a helper to read the tree_root commit root for backref lookup
btrfs: drop the path before adding qgroup items when enabling qgroups
btrfs: fix readahead hang and use-after-free after removing a device
btrfs: fix use-after-free on readahead extent after failure to create it
btrfs: tree-checker: validate number of chunk stripes and parity
btrfs: tree-checker: fix incorrect printk format
btrfs: drop the path before adding block group sysfs files
btrfs: fix relocation failure due to race with fallocate

Linus Torvalds
2020-10-31 04:29:49 +0800

27 Oct, 2020

2 commits

d5c823884 btrfs: convert data_seqcount to seqcount_mutex_t ... Browse Code »

By doing so we can associate the sequence counter to the chunk_mutex
for lockdep purposes (compiled-out otherwise), the mutex is otherwise
used on the write side.
Also avoid explicitly disabling preemption around the write region as it
will now be done automatically by the seqcount machinery based on the
lock type.

Signed-off-by: Davidlohr Bueso
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Davidlohr Bueso
2020-10-27 22:11:51 +0800
0425e7bad btrfs: don't fallback to buffered read if we don't need to ... Browse Code »

Since we switched to the iomap infrastructure in b5ff9f1a96e8f ("btrfs:
switch to iomap for direct IO") we're calling generic_file_buffered_read()
directly and not via generic_file_read_iter() anymore.

If the read could read everything there is no need to bother calling
generic_file_buffered_read(), like it is handled in
generic_file_read_iter().

If we call generic_file_buffered_read() in this case we can hit a
situation where we do an invalid readahead and cause this UBSAN splat
in fstest generic/091:

run fstests generic/091 at 2020-10-21 10:52:32
================================================================================
UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:77
dump_stack+0x57/0x70 lib/dump_stack.c:118
ubsan_epilogue+0x5/0x40 lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395
__roundup_pow_of_two ./include/linux/log2.h:57
get_init_ra_size mm/readahead.c:318
ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530
generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199
call_read_iter ./include/linux/fs.h:1876
new_sync_read+0x102/0x180 fs/read_write.c:415
vfs_read+0x11c/0x1a0 fs/read_write.c:481
ksys_read+0x4f/0xc0 fs/read_write.c:615
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118
RIP: 0033:0x7fe87fee992e
RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e
RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003
RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000
R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000
R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000
================================================================================
BTRFS info (device nullb0): has skinny extents
BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B
BTRFS info (device nullb0): enabling ssd optimizations

Fixes: f85781fb505e ("btrfs: switch to iomap for direct IO")
Reviewed-by: Goldwyn Rodrigues
Signed-off-by: Johannes Thumshirn
Signed-off-by: David Sterba

Johannes Thumshirn
2020-10-27 22:11:37 +0800

26 Oct, 2020

1 commit

49d11bead btrfs: add a helper to read the tree_root commit root for backref lookup ... Browse Code »

I got the following lockdep splat with tree locks converted to rwsem
patches on btrfs/104:

======================================================
WARNING: possible circular locking dependency detected
5.9.0+ #102 Not tainted
------------------------------------------------------
btrfs-cleaner/903 is trying to acquire lock:
ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

but task is already holding lock:
ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
down_read+0x40/0x130
caching_thread+0x53/0x5a0
btrfs_work_helper+0xfa/0x520
process_one_work+0x238/0x540
worker_thread+0x55/0x3c0
kthread+0x13a/0x150
ret_from_fork+0x1f/0x30

-> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7b0
btrfs_cache_block_group+0x1e0/0x510
find_free_extent+0xb6e/0x12f0
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb1/0x330
alloc_tree_block_no_bg_flush+0x4f/0x60
__btrfs_cow_block+0x11d/0x580
btrfs_cow_block+0x10c/0x220
commit_cowonly_roots+0x47/0x2e0
btrfs_commit_transaction+0x595/0xbd0
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x36/0xa0
cleanup_mnt+0x12d/0x190
task_work_run+0x5c/0xa0
exit_to_user_mode_prepare+0x1df/0x200
syscall_exit_to_user_mode+0x54/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&space_info->groups_sem){++++}-{3:3}:
down_read+0x40/0x130
find_free_extent+0x2ed/0x12f0
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb1/0x330
alloc_tree_block_no_bg_flush+0x4f/0x60
__btrfs_cow_block+0x11d/0x580
btrfs_cow_block+0x10c/0x220
commit_cowonly_roots+0x47/0x2e0
btrfs_commit_transaction+0x595/0xbd0
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x36/0xa0
cleanup_mnt+0x12d/0x190
task_work_run+0x5c/0xa0
exit_to_user_mode_prepare+0x1df/0x200
syscall_exit_to_user_mode+0x54/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (btrfs-root-00){++++}-{3:3}:
__lock_acquire+0x1167/0x2150
lock_acquire+0xb9/0x3d0
down_read_nested+0x43/0x130
__btrfs_tree_read_lock+0x32/0x170
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x614/0x9d0
btrfs_find_root+0x35/0x1b0
btrfs_read_tree_root+0x61/0x120
btrfs_get_root_ref+0x14b/0x600
find_parent_nodes+0x3e6/0x1b30
btrfs_find_all_roots_safe+0xb4/0x130
btrfs_find_all_roots+0x60/0x80
btrfs_qgroup_trace_extent_post+0x27/0x40
btrfs_add_delayed_data_ref+0x3fd/0x460
btrfs_free_extent+0x42/0x100
__btrfs_mod_ref+0x1d7/0x2f0
walk_up_proc+0x11c/0x400
walk_up_tree+0xf0/0x180
btrfs_drop_snapshot+0x1c7/0x780
btrfs_clean_one_deleted_snapshot+0xfb/0x110
cleaner_kthread+0xd4/0x140
kthread+0x13a/0x150
ret_from_fork+0x1f/0x30

other info that might help us debug this:

Chain exists of:
btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&fs_info->commit_root_sem);
lock(&caching_ctl->mutex);
lock(&fs_info->commit_root_sem);
lock(btrfs-root-00);

*** DEADLOCK ***

3 locks held by btrfs-cleaner/903:
#0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
#1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
#2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

stack backtrace:
CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb0
check_noncircular+0xcf/0xf0
__lock_acquire+0x1167/0x2150
? __bfs+0x42/0x210
lock_acquire+0xb9/0x3d0
? __btrfs_tree_read_lock+0x32/0x170
down_read_nested+0x43/0x130
? __btrfs_tree_read_lock+0x32/0x170
__btrfs_tree_read_lock+0x32/0x170
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x614/0x9d0
? find_held_lock+0x2b/0x80
btrfs_find_root+0x35/0x1b0
? do_raw_spin_unlock+0x4b/0xa0
btrfs_read_tree_root+0x61/0x120
btrfs_get_root_ref+0x14b/0x600
find_parent_nodes+0x3e6/0x1b30
btrfs_find_all_roots_safe+0xb4/0x130
btrfs_find_all_roots+0x60/0x80
btrfs_qgroup_trace_extent_post+0x27/0x40
btrfs_add_delayed_data_ref+0x3fd/0x460
btrfs_free_extent+0x42/0x100
__btrfs_mod_ref+0x1d7/0x2f0
walk_up_proc+0x11c/0x400
walk_up_tree+0xf0/0x180
btrfs_drop_snapshot+0x1c7/0x780
? btrfs_clean_one_deleted_snapshot+0x73/0x110
btrfs_clean_one_deleted_snapshot+0xfb/0x110
cleaner_kthread+0xd4/0x140
? btrfs_alloc_root+0x50/0x50
kthread+0x13a/0x150
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
BTRFS info (device sdb): disk space caching is enabled
BTRFS info (device sdb): has skinny extents

This happens because qgroups does a backref lookup when we create a
delayed ref. From here it may have to look up a root from an indirect
ref, which does a normal lookup on the tree_root, which takes the read
lock on the tree_root nodes.

To fix this we need to add a variant for looking up roots that searches
the commit root of the tree_root. Then when we do the backref search
using the commit root we are sure to not take any locks on the tree_root
nodes. This gets rid of the lockdep splat when running btrfs/104.

Reviewed-by: Filipe Manana
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2020-10-26 22:04:57 +0800