24 Nov, 2020

1 commit

  • Syzbot reported a possible use-after-free when printing a duplicate device
    warning in device_list_add().

    At this point it can happen that a btrfs_device::fs_info is not correctly
    set up yet, so we end up accessing stale data when printing the warning
    message using the btrfs_printk() wrappers.

    ==================================================================
    BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
    Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068

    CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1d6/0x29e lib/dump_stack.c:118
    print_address_description+0x66/0x620 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report+0x132/0x1d0 mm/kasan/report.c:530
    btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
    device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
    btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
    btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x44840a
    RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
    RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
    RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
    RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
    R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003

    Allocated by task 6945:
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track mm/kasan/common.c:56 [inline]
    __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
    kmalloc_node include/linux/slab.h:577 [inline]
    kvmalloc_node+0x81/0x110 mm/util.c:574
    kvmalloc include/linux/mm.h:757 [inline]
    kvzalloc include/linux/mm.h:765 [inline]
    btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 6945:
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
    kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
    __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
    __cache_free mm/slab.c:3418 [inline]
    kfree+0x113/0x200 mm/slab.c:3756
    deactivate_locked_super+0xa7/0xf0 fs/super.c:335
    btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    fc_mount fs/namespace.c:978 [inline]
    vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
    btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
    legacy_get_tree+0xea/0x180 fs/fs_context.c:592
    vfs_get_tree+0x88/0x270 fs/super.c:1547
    do_new_mount fs/namespace.c:2875 [inline]
    path_mount+0x179d/0x29e0 fs/namespace.c:3192
    do_mount fs/namespace.c:3205 [inline]
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount+0x126/0x180 fs/namespace.c:3390
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The buggy address belongs to the object at ffff8880878e0000
    which belongs to the cache kmalloc-16k of size 16384
    The buggy address is located 1704 bytes inside of
    16384-byte region [ffff8880878e0000, ffff8880878e4000)
    The buggy address belongs to the page:
    page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
    head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
    flags: 0xfffe0000010200(slab|head)
    raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
    raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================

    The syzkaller reproducer for this use-after-free crafts a filesystem image
    and loop mounts it twice, in a loop. The mount will fail as the crafted
    image has an invalid chunk tree. When this happens btrfs_mount_root() will
    call deactivate_locked_super(), which then cleans up fs_info and
    fs_info::sb. If a second thread now adds the same block device to the
    filesystem, it will get detected as a duplicate device and
    device_list_add() will reject the duplicate and print a warning. But as
    the fs_info pointer passed in is non-NULL this will result in a
    use-after-free.

    Instead of printing possibly uninitialized or already freed memory in
    btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
    device name will be skipped altogether.
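
    A minimal sketch of the caller side of that change (the message text and
    arguments below are placeholders for illustration, not the real format
    string used in device_list_add()):

        /*
         * Sketch only: pass NULL instead of device->fs_info so that the
         * btrfs_printk() machinery never dereferences a possibly freed
         * fs_info just to print the device name.
         */
        btrfs_info_in_rcu(NULL, "duplicate device %s devid %llu", path, devid);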

    There was a slightly different approach discussed in
    https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u

    Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
    Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     

05 Nov, 2020

1 commit

  • If there is a device with BTRFS_DEV_REPLACE_DEVID but no device replace
    item, then the filesystem is in an inconsistent state. This is either
    corruption or a crafted image. Fail the mount, as this needs a closer
    look at what is actually wrong.

    As of now, if BTRFS_DEV_REPLACE_DEVID is present without the replace
    item, in __btrfs_free_extra_devids() we determine that there is an
    extra device, free it, but continue to mount the filesystem.
    However, we were wrong in keeping track of the rw_devices, so the syzbot
    test case failed:

    WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x198/0x1fd lib/dump_stack.c:118
    panic+0x347/0x7c0 kernel/panic.c:231
    __warn.cold+0x20/0x46 kernel/panic.c:600
    report_bug+0x1bd/0x210 lib/bug.c:198
    handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
    exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
    asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
    RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
    RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
    RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
    RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
    RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
    R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
    close_fs_devices fs/btrfs/volumes.c:1193 [inline]
    btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
    open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
    btrfs_fill_super fs/btrfs/super.c:1316 [inline]
    btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672

    The fix here is: when we determine that there isn't a replace item,
    fail the mount if there is a replace target device (devid 0).
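
    A rough sketch of the described check (illustrative only; the real patch
    lives in the dev-replace item initialization and may differ in structure
    and error code):

        /*
         * Sketch: no replace item was found in the tree, so a device with
         * BTRFS_DEV_REPLACE_DEVID (devid 0) must not exist either.
         */
        list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) {
                if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
                        btrfs_err(fs_info,
                                  "replace target devid 0 present without a replace item, failing mount");
                        return -EUCLEAN;
                }
        }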

    CC: stable@vger.kernel.org # 4.19+
    Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     

27 Oct, 2020

1 commit

  • By doing so we can associate the sequence counter with the chunk_mutex
    for lockdep purposes (compiled out otherwise); the mutex is already the
    lock used on the write side.
    Also avoid explicitly disabling preemption around the write region, as it
    will now be done automatically by the seqcount machinery based on the
    lock type.
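
    A minimal, self-contained sketch of a mutex-associated seqcount (not the
    btrfs code itself; the demo_* names are made up for the example):

        #include <linux/mutex.h>
        #include <linux/seqlock.h>
        #include <linux/types.h>

        static DEFINE_MUTEX(demo_mutex);
        static seqcount_mutex_t demo_seqcount;
        static u64 demo_value;

        static void demo_init(void)
        {
                /* associate the seqcount with the mutex for lockdep */
                seqcount_mutex_init(&demo_seqcount, &demo_mutex);
        }

        static void demo_write(u64 val)
        {
                /* no explicit preempt_disable() needed anymore */
                mutex_lock(&demo_mutex);
                write_seqcount_begin(&demo_seqcount);
                demo_value = val;
                write_seqcount_end(&demo_seqcount);
                mutex_unlock(&demo_mutex);
        }

        static u64 demo_read(void)
        {
                unsigned int seq;
                u64 val;

                do {
                        seq = read_seqcount_begin(&demo_seqcount);
                        val = demo_value;
                } while (read_seqcount_retry(&demo_seqcount, seq));

                return val;
        }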

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Davidlohr Bueso
     

26 Oct, 2020

1 commit

  • Very sporadically I had test case btrfs/069 from fstests hanging (for
    years, it is not a recent regression), with the following traces in
    dmesg/syslog:

    [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
    [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
    [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
    [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
    [162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000
    [162513.514751] Call Trace:
    [162513.514761] __schedule+0x5ce/0xd00
    [162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.514771] schedule+0x46/0xf0
    [162513.514844] wait_current_trans+0xde/0x140 [btrfs]
    [162513.514850] ? finish_wait+0x90/0x90
    [162513.514864] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.514879] transaction_kthread+0xa4/0x170 [btrfs]
    [162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
    [162513.514894] kthread+0x153/0x170
    [162513.514897] ? kthread_stop+0x2c0/0x2c0
    [162513.514902] ret_from_fork+0x22/0x30
    [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
    [162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000
    [162513.515682] Call Trace:
    [162513.515688] __schedule+0x5ce/0xd00
    [162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.515697] schedule+0x46/0xf0
    [162513.515712] wait_current_trans+0xde/0x140 [btrfs]
    [162513.515716] ? finish_wait+0x90/0x90
    [162513.515729] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
    [162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs]
    [162513.515758] ? __ia32_sys_fdatasync+0x20/0x20
    [162513.515761] iterate_supers+0x87/0xf0
    [162513.515765] ksys_sync+0x60/0xb0
    [162513.515768] __do_sys_sync+0xa/0x10
    [162513.515771] do_syscall_64+0x33/0x80
    [162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.515781] RIP: 0033:0x7f5238f50bd7
    [162513.515782] Code: Bad RIP value.
    [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
    [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
    [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
    [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
    [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
    [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
    [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
    [162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000
    [162513.516620] Call Trace:
    [162513.516625] __schedule+0x5ce/0xd00
    [162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.516634] schedule+0x46/0xf0
    [162513.516647] wait_current_trans+0xde/0x140 [btrfs]
    [162513.516650] ? finish_wait+0x90/0x90
    [162513.516662] start_transaction+0x4d7/0x5f0 [btrfs]
    [162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs]
    [162513.516686] __vfs_setxattr+0x66/0x80
    [162513.516691] __vfs_setxattr_noperm+0x70/0x200
    [162513.516697] vfs_setxattr+0x6b/0x120
    [162513.516703] setxattr+0x125/0x240
    [162513.516709] ? lock_acquire+0xb1/0x480
    [162513.516712] ? mnt_want_write+0x20/0x50
    [162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0
    [162513.516723] ? preempt_count_add+0x49/0xa0
    [162513.516725] ? __sb_start_write+0x19b/0x290
    [162513.516727] ? preempt_count_add+0x49/0xa0
    [162513.516732] path_setxattr+0xba/0xd0
    [162513.516739] __x64_sys_setxattr+0x27/0x30
    [162513.516741] do_syscall_64+0x33/0x80
    [162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.516745] RIP: 0033:0x7f5238f56d5a
    [162513.516746] Code: Bad RIP value.
    [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
    [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
    [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
    [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
    [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
    [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
    [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
    [162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000
    [162513.517780] Call Trace:
    [162513.517786] __schedule+0x5ce/0xd00
    [162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.517796] schedule+0x46/0xf0
    [162513.517810] wait_current_trans+0xde/0x140 [btrfs]
    [162513.517814] ? finish_wait+0x90/0x90
    [162513.517829] start_transaction+0x37c/0x5f0 [btrfs]
    [162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
    [162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs]
    [162513.517862] ? __ia32_sys_fdatasync+0x20/0x20
    [162513.517865] iterate_supers+0x87/0xf0
    [162513.517869] ksys_sync+0x60/0xb0
    [162513.517872] __do_sys_sync+0xa/0x10
    [162513.517875] do_syscall_64+0x33/0x80
    [162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.517881] RIP: 0033:0x7f5238f50bd7
    [162513.517883] Code: Bad RIP value.
    [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
    [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
    [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
    [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
    [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
    [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
    [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
    [162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000
    [162513.519160] Call Trace:
    [162513.519165] __schedule+0x5ce/0xd00
    [162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.519174] schedule+0x46/0xf0
    [162513.519190] wait_current_trans+0xde/0x140 [btrfs]
    [162513.519193] ? finish_wait+0x90/0x90
    [162513.519206] start_transaction+0x4d7/0x5f0 [btrfs]
    [162513.519222] btrfs_create+0x57/0x200 [btrfs]
    [162513.519230] lookup_open+0x522/0x650
    [162513.519246] path_openat+0x2b8/0xa50
    [162513.519270] do_filp_open+0x91/0x100
    [162513.519275] ? find_held_lock+0x32/0x90
    [162513.519280] ? lock_acquired+0x33b/0x470
    [162513.519285] ? do_raw_spin_unlock+0x4b/0xc0
    [162513.519287] ? _raw_spin_unlock+0x29/0x40
    [162513.519295] do_sys_openat2+0x20d/0x2d0
    [162513.519300] do_sys_open+0x44/0x80
    [162513.519304] do_syscall_64+0x33/0x80
    [162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.519309] RIP: 0033:0x7f5238f4a903
    [162513.519310] Code: Bad RIP value.
    [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
    [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
    [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
    [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
    [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
    [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
    [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
    [162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1
    [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002
    [162513.520511] Call Trace:
    [162513.520516] __schedule+0x5ce/0xd00
    [162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [162513.520525] schedule+0x46/0xf0
    [162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs]
    [162513.520548] ? finish_wait+0x90/0x90
    [162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs]
    [162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs]
    [162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
    [162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
    [162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs]
    [162513.520643] ? do_sigaction+0xf3/0x240
    [162513.520645] ? find_held_lock+0x32/0x90
    [162513.520648] ? do_sigaction+0xf3/0x240
    [162513.520651] ? lock_acquired+0x33b/0x470
    [162513.520655] ? _raw_spin_unlock_irq+0x24/0x50
    [162513.520657] ? lockdep_hardirqs_on+0x7d/0x100
    [162513.520660] ? _raw_spin_unlock_irq+0x35/0x50
    [162513.520662] ? do_sigaction+0xf3/0x240
    [162513.520671] ? __x64_sys_ioctl+0x83/0xb0
    [162513.520672] __x64_sys_ioctl+0x83/0xb0
    [162513.520677] do_syscall_64+0x33/0x80
    [162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [162513.520681] RIP: 0033:0x7fc3cd307d87
    [162513.520682] Code: Bad RIP value.
    [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
    [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
    [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
    [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
    [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
    [162513.520703]
    Showing all locks held in the system:
    [162513.520712] 1 lock held by khungtaskd/54:
    [162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
    [162513.520728] 1 lock held by in:imklog/596:
    [162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
    [162513.520782] 1 lock held by btrfs-transacti/1356167:
    [162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
    [162513.520798] 1 lock held by btrfs/1356190:
    [162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
    [162513.520805] 1 lock held by fsstress/1356184:
    [162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
    [162513.520811] 3 locks held by fsstress/1356185:
    [162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
    [162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
    [162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
    [162513.520833] 1 lock held by fsstress/1356196:
    [162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
    [162513.520838] 3 locks held by fsstress/1356197:
    [162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
    [162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
    [162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
    [162513.520858] 2 locks held by btrfs/1356211:
    [162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
    [162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

    This was weird because the stack traces show that a transaction commit,
    triggered by a device replace operation, is blocking trying to pause any
    running scrubs but there are no stack traces of blocked tasks doing a
    scrub.

    After poking around with drgn, I noticed there was a scrub task that was
    constantly running and blocking for short periods of time:

    >>> t = find_task(prog, 1356190)
    >>> prog.stack_trace(t)
    #0 __schedule+0x5ce/0xcfc
    #1 schedule+0x46/0xe4
    #2 schedule_timeout+0x1df/0x475
    #3 btrfs_reada_wait+0xda/0x132
    #4 scrub_stripe+0x2a8/0x112f
    #5 scrub_chunk+0xcd/0x134
    #6 scrub_enumerate_chunks+0x29e/0x5ee
    #7 btrfs_scrub_dev+0x2d5/0x91b
    #8 btrfs_ioctl+0x7f5/0x36e7
    #9 __x64_sys_ioctl+0x83/0xb0
    #10 do_syscall_64+0x33/0x77
    #11 entry_SYSCALL_64+0x7c/0x156

    Which corresponds to:

    int btrfs_reada_wait(void *handle)
    {
            struct reada_control *rc = handle;
            struct btrfs_fs_info *fs_info = rc->fs_info;

            while (atomic_read(&rc->elems)) {
                    if (!atomic_read(&fs_info->reada_works_cnt))
                            reada_start_machine(fs_info);
                    wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                                       (HZ + 9) / 10);
            }
    (...)

    So the counter "rc->elems" was set to 1 and never decreased to 0, causing
    the scrub task to loop forever in that function. Then I used the following
    script for drgn to check the readahead requests:

    $ cat dump_reada.py
    import sys
    import drgn
    from drgn import NULL, Object, cast, container_of, execscript, \
        reinterpret, sizeof
    from drgn.helpers.linux import *

    mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

    mnt = None
    for mnt in for_each_mount(prog, dst = mnt_path):
        pass

    if mnt is None:
        sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
        sys.exit(1)

    fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

    def dump_re(re):
        nzones = re.nzones.value_()
        print(f're at {hex(re.value_())}')
        print(f'\t logical {re.logical.value_()}')
        print(f'\t refcnt {re.refcnt.value_()}')
        print(f'\t nzones {nzones}')
        for i in range(nzones):
            dev = re.zones[i].device
            name = dev.name.str.string_()
            print(f'\t\t dev id {dev.devid.value_()} name {name}')
        print()

    for _, e in radix_tree_for_each(fs_info.reada_tree):
        re = cast('struct reada_extent *', e)
        dump_re(re)

    $ drgn dump_reada.py
    re at 0xffff8f3da9d25ad8
    logical 38928384
    refcnt 1
    nzones 1
    dev id 0 name b'/dev/sdd'
    $

    So there was one readahead extent with a single zone corresponding to the
    source device of that last device replace operation logged in dmesg/syslog.
    Also the ID of that zone's device was 0 which is a special value set in
    the source device of a device replace operation when the operation finishes
    (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
    confirming again that device /dev/sdd was the source of a device replace
    operation.

    Normally there should be as many zones in the readahead extent as there are
    devices, and I wasn't expecting the extent to be in a block group with a
    'single' profile, so I went and confirmed with the following drgn script
    that there weren't any single profile block groups:

    $ cat dump_block_groups.py
    import sys
    import drgn
    from drgn import NULL, Object, cast, container_of, execscript, \
        reinterpret, sizeof
    from drgn.helpers.linux import *

    mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

    mnt = None
    for mnt in for_each_mount(prog, dst = mnt_path):
        pass

    if mnt is None:
        sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
        sys.exit(1)

    fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

    BTRFS_BLOCK_GROUP_DATA = (1 << 0)
    BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
    BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
    BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
    BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
    BTRFS_BLOCK_GROUP_DUP = (1 << 5)
    BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
    BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
    BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
    BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
    BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

    def bg_flags_string(bg):
        flags = bg.flags.value_()
        ret = ''
        if flags & BTRFS_BLOCK_GROUP_DATA:
            ret = 'data'
        if flags & BTRFS_BLOCK_GROUP_METADATA:
            if len(ret) > 0:
                ret += '|'
            ret += 'meta'
        if flags & BTRFS_BLOCK_GROUP_SYSTEM:
            if len(ret) > 0:
                ret += '|'
            ret += 'system'
        if flags & BTRFS_BLOCK_GROUP_RAID0:
            ret += ' raid0'
        elif flags & BTRFS_BLOCK_GROUP_RAID1:
            ret += ' raid1'
        elif flags & BTRFS_BLOCK_GROUP_DUP:
            ret += ' dup'
        elif flags & BTRFS_BLOCK_GROUP_RAID10:
            ret += ' raid10'
        elif flags & BTRFS_BLOCK_GROUP_RAID5:
            ret += ' raid5'
        elif flags & BTRFS_BLOCK_GROUP_RAID6:
            ret += ' raid6'
        elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
            ret += ' raid1c3'
        elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
            ret += ' raid1c4'
        else:
            ret += ' single'

        return ret

    def dump_bg(bg):
        print()
        print(f'block group at {hex(bg.value_())}')
        print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
        print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

    bg_root = fs_info.block_group_cache_tree.address_of_()
    for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
        dump_bg(bg)

    $ drgn dump_block_groups.py

    block group at 0xffff8f3d673b0400
    start 22020096 length 16777216
    flags 258 - system raid6

    block group at 0xffff8f3d53ddb400
    start 38797312 length 536870912
    flags 260 - meta raid6

    block group at 0xffff8f3d5f4d9c00
    start 575668224 length 2147483648
    flags 257 - data raid6

    block group at 0xffff8f3d08189000
    start 2723151872 length 67108864
    flags 258 - system raid6

    block group at 0xffff8f3db70ff000
    start 2790260736 length 1073741824
    flags 260 - meta raid6

    block group at 0xffff8f3d5f4dd800
    start 3864002560 length 67108864
    flags 258 - system raid6

    block group at 0xffff8f3d67037000
    start 3931111424 length 2147483648
    flags 257 - data raid6
    $

    So there were only 2 reasons left for having a readahead extent with a
    single zone: reada_find_zone(), called when creating a readahead extent,
    returned NULL either because we failed to find the corresponding block
    group or because a memory allocation failed. With some additional and
    custom tracing I figured out that on every further occurrence of the
    problem the block group had just been deleted when we were looping to
    create the zones for the readahead extent (at reada_find_extent()), so we
    ended up with only one zone in the readahead extent, corresponding to a
    device that ends up getting replaced.

    So after figuring that out it became obvious why the hang happens:

    1) Task A starts a scrub on any device of the filesystem, except for
    device /dev/sdd;

    2) Task B starts a device replace with /dev/sdd as the source device;

    3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
    starting to scrub a stripe from block group X. This call to
    btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
    calls reada_add_block(), it passes the logical address of the extent
    tree's root node as its 'logical' argument - a value of 38928384;

    4) Task A then enters reada_find_extent(), called from reada_add_block().
    It finds there isn't any existing readahead extent for the logical
    address 38928384, so it proceeds to the path of creating a new one.

    It calls btrfs_map_block() to find out which stripes exist for the block
    group X. On the first iteration of the for loop that iterates over the
    stripes, it finds the stripe for device /dev/sdd, so it creates one
    zone for that device and adds it to the readahead extent. Before getting
    into the second iteration of the loop, the cleanup kthread deletes block
    group X because it was empty. So in the iterations for the remaining
    stripes it does not add more zones to the readahead extent, because the
    calls to reada_find_zone() returned NULL because they couldn't find
    block group X anymore.

    As a result the new readahead extent has a single zone, corresponding to
    the device /dev/sdd;

    5) Before task A returns to btrfs_reada_add() and queues the readahead job
    for the readahead work queue, task B finishes the device replace and at
    btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
    device /dev/sdg;

    6) Task A returns to reada_add_block(), which increments the counter
    "->elems" of the reada_control structure allocated at btrfs_reada_add().

    Then it returns back to btrfs_reada_add() and calls
    reada_start_machine(). This queues a job in the readahead work queue to
    run the function reada_start_machine_worker(), which calls
    __reada_start_machine().

    At __reada_start_machine() we take the device list mutex and for each
    device found in the current device list, we call
    reada_start_machine_dev() to start the readahead work. However at this
    point the device /dev/sdd was already freed and is not in the device
    list anymore.

    This means the corresponding readahead for the extent at 38928384 is
    never started, and therefore the "->elems" counter of the reada_control
    structure allocated at btrfs_reada_add() never goes down to 0, causing
    the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

    Note that the readahead request can be made either after the device replace
    started or before it started, however in practice it is very unlikely that a
    device replace is able to start after a readahead request is made and is
    able to complete before the readahead request completes - maybe only on a
    very small and nearly empty filesystem.

    This hang however is not the only problem we can have with readahead and
    device removals. When the readahead extent has zones other than the
    one corresponding to the device that is being removed (either by a device
    replace or a device remove operation), we risk having a use-after-free on
    the device when dropping the last reference of the readahead extent.

    For example if we create a readahead extent with two zones, one for the
    device /dev/sdd and one for the device /dev/sde:

    1) Before the readahead worker starts, the device /dev/sdd is removed,
    and the corresponding btrfs_device structure is freed. However the
    readahead extent still has the zone pointing to the device structure;

    2) When the readahead worker starts, it only finds device /dev/sde in the
    current device list of the filesystem;

    3) It starts the readahead work, at reada_start_machine_dev(), using the
    device /dev/sde;

    4) Then when it finishes reading the extent from device /dev/sde, it calls
    __readahead_hook() which ends up dropping the last reference on the
    readahead extent through the last call to reada_extent_put();

    5) At reada_extent_put() it iterates over each zone of the readahead extent
    and attempts to delete an element from the device's 'reada_extents'
    radix tree, resulting in a use-after-free, as the device pointer of the
    zone for /dev/sdd is now stale. We can also access the device after
    dropping the last reference of a zone, through reada_zone_release(),
    also called by reada_extent_put().

    And a device remove suffers the same problem; however, since it shrinks the
    device size down to zero before removing the device, it is very unlikely to
    still have readahead requests not completed by the time we free the device.
    The only possibility is if the device had very little space allocated.

    While the hang problem is exclusive to scrub, since it is currently the
    only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
    problem affects any path that triggers readahead, which includes
    btree_readahead_hook() and __readahead_hook() (a readahead worker can
    trigger readahead for the children of a node) for example - any path that
    ends up calling reada_add_block() can trigger the use-after-free after a
    device is removed.

    So fix this by waiting for any readahead requests for a device to complete
    before removing a device, ensuring that while waiting for existing ones no
    new ones can be made.
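
    A generic sketch of that approach (the demo_* structure and helpers are
    made up for illustration and are not the actual btrfs fix):

        /* per-device accounting of outstanding readahead requests */
        struct demo_device {
                atomic_t reada_in_flight;       /* outstanding requests */
                bool no_reada;                  /* device is being removed */
                wait_queue_head_t reada_wait;   /* init_waitqueue_head() at setup */
        };

        static void demo_reada_end(struct demo_device *dev)
        {
                if (atomic_dec_and_test(&dev->reada_in_flight))
                        wake_up(&dev->reada_wait);
        }

        static bool demo_reada_start(struct demo_device *dev)
        {
                atomic_inc(&dev->reada_in_flight);
                if (READ_ONCE(dev->no_reada)) {
                        /* device is going away, don't start new readahead */
                        demo_reada_end(dev);
                        return false;
                }
                return true;
        }

        static void demo_reada_remove_dev(struct demo_device *dev)
        {
                /* forbid new readahead, then drain what is in flight */
                WRITE_ONCE(dev->no_reada, true);
                wait_event(dev->reada_wait,
                           atomic_read(&dev->reada_in_flight) == 0);
        }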

    This problem has been around for a very long time - the readahead code was
    added in 2011, device remove exists since 2008 and device replace was
    introduced in 2013, hard to pick a specific commit for a git Fixes tag.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     

07 Oct, 2020

25 commits

  • Many things can happen after the device is scanned and before the device
    is mounted. One such thing is losing the BTRFS_MAGIC on the device.
    If that happens we still don't free that device from memory, which causes
    confusion in userland.

    For example: as BTRFS_IOC_DEV_INFO still carries the path of the device
    that no longer has the BTRFS_MAGIC, 'btrfs fi show' still lists a
    device which does not belong to the filesystem anymore:

    $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
    $ wipefs -a /dev/sdb
    # /dev/sdb does not contain magic signature
    $ mount -o degraded /dev/sda /btrfs
    $ btrfs fi show -m
    Label: none uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
    Total devices 2 FS bytes used 128.00KiB
    devid 1 size 3.00GiB used 571.19MiB path /dev/sda
    devid 2 size 3.00GiB used 571.19MiB path /dev/sdb

    We need to distinguish between the missing signature and an invalid
    superblock, so add a specific error code ENODATA for that. This also fixes
    the failure of fstest btrfs/198.
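
    A sketch of where the distinction is made (simplified; the exact check and
    the variable names are assumptions, the real change sits in the superblock
    read and device scan paths):

        /* Sketch only: a wiped signature is not the same as a corrupted super */
        if (btrfs_super_magic(disk_super) != BTRFS_MAGIC)
                return -ENODATA;        /* no signature, caller frees the device */

        if (btrfs_super_bytenr(disk_super) != bytenr)
                return -EINVAL;         /* some other invalid superblock */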

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • I noticed when fixing device stats for seed devices that we simply threw
    away the return value from btrfs_search_slot(). This is because we may
    not have stat items, but we could very well get an error, and thus miss
    reporting the error up the chain.

    Fix this by returning ret if it's an actual error, and then stop trying
    to init the rest of the devices' stats and return the error up the chain.
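
    A sketch of the error-handling pattern (not the exact hunk): with
    btrfs_search_slot(), a negative return is a real error, while a positive
    return only means the stats item does not exist yet.

        ret = btrfs_search_slot(NULL, dev_root, &key, path, 0, 0);
        if (ret < 0) {
                /* a real error, propagate it instead of ignoring it */
                btrfs_warn(fs_info, "error %d while searching for dev_stats item", ret);
                goto out;
        }
        if (ret > 0) {
                /* no stats item yet, the counters simply start at zero */
                ret = 0;
        }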

    Reviewed-by: Anand Jain
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We recently started recording device stats across the fleet, and noticed
    a large increase in messages such as this

    BTRFS warning (device dm-0): get dev_stats failed, not yet valid

    on our tiers that use seed devices for their root devices. This is
    because we do not initialize the device stats for any seed devices if we
    have a sprout device and mount using that sprout device. The basic
    steps for reproducing are:

    $ mkfs seed device
    $ mount seed device
    # fill seed device
    $ umount seed device
    $ btrfstune -S 1 seed device
    $ mount seed device
    $ btrfs device add -f sprout device /mnt/wherever
    $ umount /mnt/wherever
    $ mount sprout device /mnt/wherever
    $ btrfs device stats /mnt/wherever

    This will fail with the above message in dmesg.

    Fix this by also iterating over fs_devices->seed, if it exists, in
    btrfs_init_dev_stats. This fixes the problem and properly reports the
    stats for both devices.
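
    A sketch of the shape of the fix (the per-device helper is hypothetical,
    and fs_devices->seed is the pointer chain used at the time, later
    converted to a proper list):

        struct btrfs_fs_devices *seed_devs;
        struct btrfs_device *device;

        /* stats for the sprout's own devices */
        list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list)
                init_one_device_stats(device);          /* hypothetical helper */

        /* ... and for every seed fs_devices chained behind it */
        for (seed_devs = fs_info->fs_devices->seed; seed_devs;
             seed_devs = seed_devs->seed)
                list_for_each_entry(device, &seed_devs->devices, dev_list)
                        init_one_device_stats(device);  /* hypothetical helper */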

    Reviewed-by: Anand Jain
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ rename to btrfs_device_init_dev_stats ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • The function does not have a common exit block and returns immediately,
    so there's no point in having the goto. Remove the two cases.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • We can check the argument value directly, no need for the temporary
    variable.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • On a mounted sprout filesystem, all threads now use the
    sprout::device_list_mutex, and this is the only code using the
    seed::device_list_mutex. This patch converts it to use the sprout's
    fs_info->fs_devices->device_list_mutex.

    The same reasoning holds true here: device delete is holding
    the sprout::device_list_mutex.

    Signed-off-by: Anand Jain
    Signed-off-by: David Sterba

    Anand Jain
     
  • Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
    btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
    argument to indicate whether to free all devices or just one device.

    Export btrfs_sysfs_remove_device(), as device operations outside of
    sysfs.c now call this instead of btrfs_sysfs_remove_devices_dir().

    btrfs_sysfs_remove_devices_dir() is renamed to
    btrfs_sysfs_remove_fs_devices() to suit its new role.

    Now no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices(),
    so it is redeclared as static, and the function had to be moved
    before its first caller.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • When we add a device we need to add it to sysfs, so instead of using the
    btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
    add a single device or all of fs_devices, call the helper function
    btrfs_sysfs_add_device() directly and make it non-static.

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Josef Bacik
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • Systems booting without an initramfs seem to scan an unusual kind
    of device path (/dev/root). At a later time, the device is updated
    to the correct path. We generally print the process name and PID of the
    process scanning the device, but we don't capture the same information if
    the device path is rescanned with a different pathname.

    The current message is too long, so drop the unnecessary UUID and add
    process name and PID.

    While at it, also update the duplicate device warning to include the
    process name and PID so the messages are consistent.

    CC: stable@vger.kernel.org # 4.19+
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • Instead of using a flag bit for exclusive operation, use a variable to
    store which exclusive operation is being performed. Introduce an API
    to start and finish an exclusive operation.

    This would enable another way for tools to check which operation is
    running, and thus why starting an exclusive operation failed. The followup
    patch adds a sysfs_notify() to alert userspace when the state changes, so
    userspace can perform select() on it to get notified of the change.

    This would enable us to enqueue a command which will wait for the current
    exclusive operation to complete before issuing the next exclusive
    operation. This has been done synchronously, as opposed to in a background
    process, or else error collection (if any) would become difficult.
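
    A sketch of such a start/finish API (the names, the enum values and the
    fs_info field are assumptions for illustration, not necessarily the ones
    used by the patch):

        enum btrfs_exclusive_operation {
                BTRFS_EXCLOP_NONE,
                BTRFS_EXCLOP_BALANCE,
                BTRFS_EXCLOP_DEV_ADD,
                BTRFS_EXCLOP_DEV_REMOVE,
                BTRFS_EXCLOP_DEV_REPLACE,
                BTRFS_EXCLOP_RESIZE,
        };

        /* returns true if we now own the exclusive operation slot */
        static bool demo_exclop_start(struct btrfs_fs_info *fs_info,
                                      enum btrfs_exclusive_operation type)
        {
                return cmpxchg(&fs_info->exclusive_operation,
                               BTRFS_EXCLOP_NONE, type) == BTRFS_EXCLOP_NONE;
        }

        static void demo_exclop_finish(struct btrfs_fs_info *fs_info)
        {
                WRITE_ONCE(fs_info->exclusive_operation, BTRFS_EXCLOP_NONE);
                /* the followup patch would add a sysfs_notify() here */
        }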

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: David Sterba
    [ update comments ]
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     
  • While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
    following lockdep splat:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0-rc3+ #4 Not tainted
    ------------------------------------------------------
    kswapd0/100 is trying to acquire lock:
    ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

    but task is already holding lock:
    ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (fs_reclaim){+.+.}-{0:0}:
    fs_reclaim_acquire+0x65/0x80
    slab_pre_alloc_hook.constprop.0+0x20/0x200
    kmem_cache_alloc+0x37/0x270
    alloc_inode+0x82/0xb0
    iget_locked+0x10d/0x2c0
    kernfs_get_inode+0x1b/0x130
    kernfs_get_tree+0x136/0x240
    sysfs_get_tree+0x16/0x40
    vfs_get_tree+0x28/0xc0
    path_mount+0x434/0xc00
    __x64_sys_mount+0xe3/0x120
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #2 (kernfs_mutex){+.+.}-{3:3}:
    __mutex_lock+0x7e/0x7e0
    kernfs_add_one+0x23/0x150
    kernfs_create_link+0x63/0xa0
    sysfs_do_create_link_sd+0x5e/0xd0
    btrfs_sysfs_add_devices_dir+0x81/0x130
    btrfs_init_new_device+0x67f/0x1250
    btrfs_ioctl+0x1ef/0x2e20
    __x64_sys_ioctl+0x83/0xb0
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
    __mutex_lock+0x7e/0x7e0
    btrfs_chunk_alloc+0x125/0x3a0
    find_free_extent+0xdf6/0x1210
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb0/0x310
    alloc_tree_block_no_bg_flush+0x4a/0x60
    __btrfs_cow_block+0x11a/0x530
    btrfs_cow_block+0x104/0x220
    btrfs_search_slot+0x52e/0x9d0
    btrfs_insert_empty_items+0x64/0xb0
    btrfs_insert_delayed_items+0x90/0x4f0
    btrfs_commit_inode_delayed_items+0x93/0x140
    btrfs_log_inode+0x5de/0x2020
    btrfs_log_inode_parent+0x429/0xc90
    btrfs_log_new_name+0x95/0x9b
    btrfs_rename2+0xbb9/0x1800
    vfs_rename+0x64f/0x9f0
    do_renameat2+0x320/0x4e0
    __x64_sys_rename+0x1f/0x30
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
    __lock_acquire+0x119c/0x1fc0
    lock_acquire+0xa7/0x3d0
    __mutex_lock+0x7e/0x7e0
    __btrfs_release_delayed_node.part.0+0x3f/0x330
    btrfs_evict_inode+0x24c/0x500
    evict+0xcf/0x1f0
    dispose_list+0x48/0x70
    prune_icache_sb+0x44/0x50
    super_cache_scan+0x161/0x1e0
    do_shrink_slab+0x178/0x3c0
    shrink_slab+0x17c/0x290
    shrink_node+0x2b2/0x6d0
    balance_pgdat+0x30a/0x670
    kswapd+0x213/0x4c0
    kthread+0x138/0x160
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(fs_reclaim);
    lock(kernfs_mutex);
    lock(fs_reclaim);
    lock(&delayed_node->mutex);

    *** DEADLOCK ***

    3 locks held by kswapd0/100:
    #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
    #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
    #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

    stack backtrace:
    CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
    dump_stack+0x8b/0xb8
    check_noncircular+0x12d/0x150
    __lock_acquire+0x119c/0x1fc0
    lock_acquire+0xa7/0x3d0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    __mutex_lock+0x7e/0x7e0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    ? lock_acquire+0xa7/0x3d0
    ? find_held_lock+0x2b/0x80
    __btrfs_release_delayed_node.part.0+0x3f/0x330
    btrfs_evict_inode+0x24c/0x500
    evict+0xcf/0x1f0
    dispose_list+0x48/0x70
    prune_icache_sb+0x44/0x50
    super_cache_scan+0x161/0x1e0
    do_shrink_slab+0x178/0x3c0
    shrink_slab+0x17c/0x290
    shrink_node+0x2b2/0x6d0
    balance_pgdat+0x30a/0x670
    kswapd+0x213/0x4c0
    ? _raw_spin_unlock_irqrestore+0x41/0x50
    ? add_wait_queue_exclusive+0x70/0x70
    ? balance_pgdat+0x670/0x670
    kthread+0x138/0x160
    ? kthread_create_worker_on_cpu+0x40/0x40
    ret_from_fork+0x1f/0x30

    This happens because we are holding the chunk_mutex at the time of
    adding a new device. However we only need to hold the
    device_list_mutex, as we're going to iterate over the fs_devices'
    devices. Move the sysfs init stuff outside of the chunk_mutex to get
    rid of this lockdep splat.
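
    A sketch of the reordering (not the actual diff from btrfs_init_new_device;
    the sysfs call name matches the one in the trace above):

        mutex_lock(&fs_info->chunk_mutex);
        /* ... link the new device into fs_devices, update counters ... */
        mutex_unlock(&fs_info->chunk_mutex);

        /*
         * sysfs registration allocates memory and takes kernfs locks, so do
         * it after dropping chunk_mutex; device_list_mutex is what protects
         * the device list iteration here.
         */
        btrfs_sysfs_add_devices_dir(fs_info->fs_devices, device);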

    CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions
    CC: stable@vger.kernel.org # 4.4.x
    Reported-by: David Sterba
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Instead of open-coding filemap_write_and_wait, simply call sync_blockdev,
    as it makes it abundantly clear what's going on and why this is used. No
    semantic changes.
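
    A minimal before/after illustration (not the exact call site):

        /* before: open-coded flush and wait on the block device's mapping */
        filemap_write_and_wait(bdev->bd_inode->i_mapping);

        /* after: same effect, intent is obvious */
        sync_blockdev(bdev);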

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Following the refactoring of btrfs_free_stale_devices in
    7bcb8164ad94 ("btrfs: use device_list_mutex when removing stale devices"),
    fs_devices are freed after they have been iterated by the inner
    list_for_each, so the break introduced in
    fd649f10c3d2 ("btrfs: Fix use-after-free when cleaning up fs_devs with
    a single stale device") to fix a use-after-free is no longer necessary.
    Just remove it altogether. No functional changes.

    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Invert unlocked to locked and exploit the fact it can only ever be
    modified if we are adding a new device to a seed filesystem. This allows
    simplifying the check in the error: label. No semantic changes.

    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • When adding a new device there's a mandatory check to see if the device is
    being duplicated on the filesystem it's added to. Since this is a
    read-only operation, it's not necessary to take device_list_mutex; we can
    simply make do with an RCU read lock.

    Using just RCU is safe because there won't be another device add/delete
    running in parallel, as btrfs_init_new_device is called only from
    btrfs_ioctl_add_dev.
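
    A sketch of the duplicate check under RCU (illustrative, details differ
    from the actual function):

        rcu_read_lock();
        list_for_each_entry_rcu(device, &fs_info->fs_devices->devices, dev_list) {
                if (device->bdev == bdev) {
                        /* the block device is already part of this filesystem */
                        ret = -EEXIST;
                        break;
                }
        }
        rcu_read_unlock();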

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The following lockdep splat

    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
    ------------------------------------------------------
    fsstress/8739 is trying to acquire lock:
    ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70

    but task is already holding lock:
    ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #10 (sb_pagefaults){.+.+}-{0:0}:
    __sb_start_write+0x129/0x210
    btrfs_page_mkwrite+0x6a/0x4a0
    do_page_mkwrite+0x4d/0xc0
    handle_mm_fault+0x103c/0x1730
    exc_page_fault+0x340/0x660
    asm_exc_page_fault+0x1e/0x30

    -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
    __might_fault+0x68/0x90
    _copy_to_user+0x1e/0x80
    perf_read+0x141/0x2c0
    vfs_read+0xad/0x1b0
    ksys_read+0x5f/0xe0
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    perf_event_init_cpu+0x88/0x150
    perf_event_init+0x1db/0x20b
    start_kernel+0x3ae/0x53c
    secondary_startup_64+0xa4/0xb0

    -> #7 (pmus_lock){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    perf_event_init_cpu+0x4f/0x150
    cpuhp_invoke_callback+0xb1/0x900
    _cpu_up.constprop.26+0x9f/0x130
    cpu_up+0x7b/0xc0
    bringup_nonboot_cpus+0x4f/0x60
    smp_init+0x26/0x71
    kernel_init_freeable+0x110/0x258
    kernel_init+0xa/0x103
    ret_from_fork+0x1f/0x30

    -> #6 (cpu_hotplug_lock){++++}-{0:0}:
    cpus_read_lock+0x39/0xb0
    kmem_cache_create_usercopy+0x28/0x230
    kmem_cache_create+0x12/0x20
    bioset_init+0x15e/0x2b0
    init_bio+0xa3/0xaa
    do_one_initcall+0x5a/0x2e0
    kernel_init_freeable+0x1f4/0x258
    kernel_init+0xa/0x103
    ret_from_fork+0x1f/0x30

    -> #5 (bio_slab_lock){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    bioset_init+0xbc/0x2b0
    __blk_alloc_queue+0x6f/0x2d0
    blk_mq_init_queue_data+0x1b/0x70
    loop_add+0x110/0x290 [loop]
    fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
    do_one_initcall+0x5a/0x2e0
    do_init_module+0x5a/0x220
    load_module+0x2459/0x26e0
    __do_sys_finit_module+0xba/0xe0
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    lo_open+0x18/0x50 [loop]
    __blkdev_get+0xec/0x570
    blkdev_get+0xe8/0x150
    do_dentry_open+0x167/0x410
    path_openat+0x7c9/0xa80
    do_filp_open+0x93/0x100
    do_sys_openat2+0x22a/0x2e0
    do_sys_open+0x4b/0x80
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    blkdev_put+0x1d/0x120
    close_fs_devices.part.31+0x84/0x130
    btrfs_close_devices+0x44/0xb0
    close_ctree+0x296/0x2b2
    generic_shutdown_super+0x69/0x100
    kill_anon_super+0xe/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x29/0x60
    cleanup_mnt+0xb8/0x140
    task_work_run+0x6d/0xb0
    __prepare_exit_to_usermode+0x1cc/0x1e0
    do_syscall_64+0x5c/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    btrfs_run_dev_stats+0x49/0x480
    commit_cowonly_roots+0xb5/0x2a0
    btrfs_commit_transaction+0x516/0xa60
    sync_filesystem+0x6b/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0xe/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x29/0x60
    cleanup_mnt+0xb8/0x140
    task_work_run+0x6d/0xb0
    __prepare_exit_to_usermode+0x1cc/0x1e0
    do_syscall_64+0x5c/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    btrfs_commit_transaction+0x4bb/0xa60
    sync_filesystem+0x6b/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0xe/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x29/0x60
    cleanup_mnt+0xb8/0x140
    task_work_run+0x6d/0xb0
    __prepare_exit_to_usermode+0x1cc/0x1e0
    do_syscall_64+0x5c/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
    __lock_acquire+0x1272/0x2310
    lock_acquire+0x9e/0x360
    __mutex_lock+0x9f/0x930
    btrfs_record_root_in_trans+0x43/0x70
    start_transaction+0xd1/0x5d0
    btrfs_dirty_inode+0x42/0xd0
    file_update_time+0xc8/0x110
    btrfs_page_mkwrite+0x10c/0x4a0
    do_page_mkwrite+0x4d/0xc0
    handle_mm_fault+0x103c/0x1730
    exc_page_fault+0x340/0x660
    asm_exc_page_fault+0x1e/0x30

    other info that might help us debug this:

    Chain exists of:
    &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(sb_pagefaults);
    lock(&mm->mmap_lock#2);
    lock(sb_pagefaults);
    lock(&fs_info->reloc_mutex);

    *** DEADLOCK ***

    3 locks held by fsstress/8739:
    #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
    #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
    #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0

    stack backtrace:
    CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
    Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
    Call Trace:
    dump_stack+0x78/0xa0
    check_noncircular+0x165/0x180
    __lock_acquire+0x1272/0x2310
    ? btrfs_get_alloc_profile+0x150/0x210
    lock_acquire+0x9e/0x360
    ? btrfs_record_root_in_trans+0x43/0x70
    __mutex_lock+0x9f/0x930
    ? btrfs_record_root_in_trans+0x43/0x70
    ? lock_acquire+0x9e/0x360
    ? join_transaction+0x5d/0x450
    ? find_held_lock+0x2d/0x90
    ? btrfs_record_root_in_trans+0x43/0x70
    ? join_transaction+0x3d5/0x450
    ? btrfs_record_root_in_trans+0x43/0x70
    btrfs_record_root_in_trans+0x43/0x70
    start_transaction+0xd1/0x5d0
    btrfs_dirty_inode+0x42/0xd0
    file_update_time+0xc8/0x110
    btrfs_page_mkwrite+0x10c/0x4a0
    ? handle_mm_fault+0x5e/0x1730
    do_page_mkwrite+0x4d/0xc0
    ? __do_fault+0x32/0x150
    handle_mm_fault+0x103c/0x1730
    exc_page_fault+0x340/0x660
    ? asm_exc_page_fault+0x8/0x30
    asm_exc_page_fault+0x1e/0x30
    RIP: 0033:0x7faa6c9969c4

    Was seen in testing. The fix is similar to that of

    btrfs: open device without device_list_mutex

    where we're holding the device_list_mutex and then grab the bd_mutex,
    which pulls in a bunch of dependencies under the bd_mutex. We only ever
    call btrfs_close_devices() on mount failure or unmount, so we're safe to
    not have the device_list_mutex here. We're already holding the
    uuid_mutex which keeps us safe from any external modification of the
    fs_devices.
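
    A minimal sketch of the resulting close path, assuming simplified helper
    names (btrfs_close_one_device() stands in for the per-device teardown
    that can end in blkdev_put()); this is illustrative, not the actual
    diff:

    /*
     * Sketch only: close all devices while holding just the uuid_mutex.
     * blkdev_put() takes bd_mutex, so device_list_mutex must no longer be
     * held anywhere on this path.
     */
    void btrfs_close_devices_sketch(struct btrfs_fs_devices *fs_devices)
    {
            struct btrfs_device *device, *tmp;

            mutex_lock(&uuid_mutex);  /* guards against concurrent scans */
            /* note: no mutex_lock(&fs_devices->device_list_mutex) here */
            list_for_each_entry_safe(device, tmp, &fs_devices->devices,
                                     dev_list)
                    btrfs_close_one_device(device);
            mutex_unlock(&uuid_mutex);
    }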

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • When closing and freeing the source device we could end up doing our
    final blkdev_put() on the bdev, which will grab the bd_mutex. As such
    we want to be holding as few locks as possible, so move this call
    outside of the dev_replace->lock_finishing_cancel_unmount lock. Since
    we're modifying the fs_devices we need to make sure we're holding the
    uuid_mutex here, so take that as well.
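
    A rough sketch of the intended nesting (field and helper names as
    referenced above; the real function does much more):

    static void dev_replace_free_srcdev_sketch(struct btrfs_fs_info *fs_info,
                                               struct btrfs_device *src_device)
    {
            /* only the finishing/cancel bookkeeping stays under this lock */
            mutex_lock(&fs_info->dev_replace.lock_finishing_cancel_unmount);
            /* ... */
            mutex_unlock(&fs_info->dev_replace.lock_finishing_cancel_unmount);

            /*
             * fs_devices is modified below and the final blkdev_put() (and
             * thus bd_mutex) may happen here, so hold only the uuid_mutex.
             */
            mutex_lock(&uuid_mutex);
            btrfs_rm_dev_replace_free_srcdev(src_device);
            mutex_unlock(&uuid_mutex);
    }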

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
    btrfs_prepare_sprout is called when the first rw device is added to a
    seed filesystem. This means the filesystem's alloc_list must be empty,
    since seed filesystems are read-only. Simply remove the code
    altogether.

    Reviewed-by: Josef Bacik
    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
    Without a good understanding of how seed devices work, it's hard to grok
    some of what the code in open_seed_devices or btrfs_prepare_sprout does.

    Add comments hopefully reducing some of the cognitive load.

    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
    While this patch touches a bunch of files, the conversion is
    straightforward. Instead of using the implicit linked list anchored at
    btrfs_fs_devices::seed, the code is switched to using
    list_for_each_entry.

    Previous patches in the series already factored out code that processed
    both main and seed devices, so in those cases the factored-out functions
    are called on the main fs_devices and then on every seed fs_devices
    inside list_for_each_entry.

    Using the list API also allows simplifying the deletion from the seed
    device list performed in btrfs_rm_device and
    btrfs_rm_dev_replace_free_srcdev by substituting a while() loop with a
    simple list_del_init.
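
    A minimal sketch of the new shape, assuming the seed_list member added
    by this series (the helper name below is illustrative, not a real btrfs
    function):

    /* iterate main devices, then every seed fs_devices on the new list */
    static void for_all_fs_devices(struct btrfs_fs_devices *fs_devices,
                                   void (*fn)(struct btrfs_fs_devices *))
    {
            struct btrfs_fs_devices *seed_devs;

            fn(fs_devices);
            list_for_each_entry(seed_devs, &fs_devices->seed_list, seed_list)
                    fn(seed_devs);
    }

    /* removal no longer needs a while() walk over the implicit ->seed chain */
    static void remove_seed_devs(struct btrfs_fs_devices *seed_devs)
    {
            list_del_init(&seed_devs->seed_list);
    }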

    Reviewed-by: Josef Bacik
    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
    It makes no sense to have sysfs-related routines be responsible for
    properly initialising the fs_info pointer of struct btrfs_fs_devices.
    Instead this can be streamlined by making it the responsibility of
    btrfs_init_devices_late to initialize it. That function already
    initializes fs_info of every individual device in btrfs_fs_devices.

    As far as clearing it is concerned it makes sense to move it to
    close_fs_devices. That function is only called when struct
    btrfs_fs_devices is no longer in use - either for holding seeds or
    main devices for a mounted filesystem.
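
    A simplified sketch of where the pointer is now set and cleared (loops
    and the rest of the two real functions are abbreviated):

    /* btrfs_init_devices_late() already walks every device, so wiring up
     * fs_info happens in one obvious place */
    static void init_devices_late_sketch(struct btrfs_fs_info *fs_info)
    {
            struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
            struct btrfs_device *device;

            fs_devices->fs_info = fs_info;
            list_for_each_entry(device, &fs_devices->devices, dev_list)
                    device->fs_info = fs_info;
    }

    /* close_fs_devices() only runs once the fs_devices is no longer used
     * (unmount or seed teardown), so clear the pointer there */
    static void close_fs_devices_sketch(struct btrfs_fs_devices *fs_devices)
    {
            /* ... close every device ... */
            fs_devices->fs_info = NULL;
    }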

    Reviewed-by: Josef Bacik
    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The return value of this function conveys absolutely no information.
    All callers already check the state of fs_devices->opened to decide how
    to proceed. So convert the function to returning void. While at it make
    btrfs_close_devices also return void.

    Reviewed-by: Josef Bacik
    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
    This prepares the code for switching seed devices to a proper list.

    Reviewed-by: Josef Bacik
    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
    Commit 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io
    tree") introduced the btrfs_device::alloc_state extent io tree, but it
    doesn't initialize the fs_info and owner members.

    This means the following features are not properly supported:

    - Fs owner report for insert_state() error
    Without fs_info initialized, although btrfs_err() won't panic, it
    won't output which fs is causing the error.

    - Wrong owner for trace events
    alloc_state events will report the same owner as the pinned extents
    tree.

    Fix this by assigning the proper fs_info and owner for
    btrfs_device::alloc_state.
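
    A sketch of the described fix inside btrfs_alloc_device(), assuming the
    extent_io_tree_init() signature of that era; the owner constant name
    IO_TREE_DEVICE_ALLOC_STATE is an assumption here, not quoted from the
    patch:

    static struct btrfs_device *alloc_device_sketch(struct btrfs_fs_info *fs_info)
    {
            struct btrfs_device *dev;

            dev = kzalloc(sizeof(*dev), GFP_KERNEL);
            if (!dev)
                    return ERR_PTR(-ENOMEM);

            /*
             * Give alloc_state a real fs_info and its own owner so that
             * btrfs_err() can name the filesystem and trace events stop
             * reporting it with the pinned-extents owner.
             */
            extent_io_tree_init(fs_info, &dev->alloc_state,
                                IO_TREE_DEVICE_ALLOC_STATE, NULL);
            return dev;
    }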

    Fixes: 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • It can be accessed from 'fs_devices' as it's identical to
    fs_info->fs_devices. Also add a comment about why we are calling the
    function. No semantic changes.

    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Anand Jain
    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     

01 Oct, 2020

1 commit

  • When closing and freeing the source device we could end up doing our
    final blkdev_put() on the bdev, which will grab the bd_mutex. As such
    we want to be holding as few locks as possible, so move this call
    outside of the dev_replace->lock_finishing_cancel_unmount lock. Since
    we're modifying the fs_devices we need to make sure we're holding the
    uuid_mutex here, so take that as well.

    There's a report from syzbot probably hitting one of the cases where
    the bd_mutex and device_list_mutex are taken in the wrong order; however,
    it's not via device replace, which is what this patch fixes. As there's
    no reproducer available so far, we can't verify the fix.

    https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/
    dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419

    WARNING: possible circular locking dependency detected
    5.9.0-rc5-syzkaller #0 Not tainted
    ------------------------------------------------------
    syz-executor.0/6878 is trying to acquire lock:
    ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804

    but task is already holding lock:
    ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
    __mutex_lock_common kernel/locking/mutex.c:956 [inline]
    __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
    btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255
    btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109
    __btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916
    find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline]
    find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127
    btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
    cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
    btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
    writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
    __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
    extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
    extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
    do_writepages+0xec/0x290 mm/page-writeback.c:2352
    __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
    writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
    wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
    wb_do_writeback fs/fs-writeback.c:2039 [inline]
    wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
    process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
    worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
    kthread+0x3b5/0x4a0 kernel/kthread.c:292
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

    -> #3 (sb_internal#2){.+.+}-{0:0}:
    percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
    __sb_start_write+0x234/0x470 fs/super.c:1672
    sb_start_intwrite include/linux/fs.h:1690 [inline]
    start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624
    find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline]
    find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127
    btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
    cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
    btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
    writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
    __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
    extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
    extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
    do_writepages+0xec/0x290 mm/page-writeback.c:2352
    __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
    writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
    wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
    wb_do_writeback fs/fs-writeback.c:2039 [inline]
    wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
    process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
    worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
    kthread+0x3b5/0x4a0 kernel/kthread.c:292
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

    -> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
    __flush_work+0x60e/0xac0 kernel/workqueue.c:3041
    wb_shutdown+0x180/0x220 mm/backing-dev.c:355
    bdi_unregister+0x174/0x590 mm/backing-dev.c:872
    del_gendisk+0x820/0xa10 block/genhd.c:933
    loop_remove drivers/block/loop.c:2192 [inline]
    loop_control_ioctl drivers/block/loop.c:2291 [inline]
    loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257
    vfs_ioctl fs/ioctl.c:48 [inline]
    __do_sys_ioctl fs/ioctl.c:753 [inline]
    __se_sys_ioctl fs/ioctl.c:739 [inline]
    __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (loop_ctl_mutex){+.+.}-{3:3}:
    __mutex_lock_common kernel/locking/mutex.c:956 [inline]
    __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
    lo_open+0x19/0xd0 drivers/block/loop.c:1893
    __blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507
    blkdev_get fs/block_dev.c:1639 [inline]
    blkdev_open+0x227/0x300 fs/block_dev.c:1753
    do_dentry_open+0x4b9/0x11b0 fs/open.c:817
    do_open fs/namei.c:3251 [inline]
    path_openat+0x1b9a/0x2730 fs/namei.c:3368
    do_filp_open+0x17e/0x3c0 fs/namei.c:3395
    do_sys_openat2+0x16d/0x420 fs/open.c:1168
    do_sys_open fs/open.c:1184 [inline]
    __do_sys_open fs/open.c:1192 [inline]
    __se_sys_open fs/open.c:1188 [inline]
    __x64_sys_open+0x119/0x1c0 fs/open.c:1188
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&bdev->bd_mutex){+.+.}-{3:3}:
    check_prev_add kernel/locking/lockdep.c:2496 [inline]
    check_prevs_add kernel/locking/lockdep.c:2601 [inline]
    validate_chain kernel/locking/lockdep.c:3218 [inline]
    __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
    lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
    __mutex_lock_common kernel/locking/mutex.c:956 [inline]
    __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
    blkdev_put+0x30/0x520 fs/block_dev.c:1804
    btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
    btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
    btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
    close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
    close_fs_devices fs/btrfs/volumes.c:1193 [inline]
    btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
    close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
    generic_shutdown_super+0x144/0x370 fs/super.c:464
    kill_anon_super+0x36/0x60 fs/super.c:1108
    btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
    deactivate_locked_super+0x94/0x160 fs/super.c:335
    deactivate_super+0xad/0xd0 fs/super.c:366
    cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
    task_work_run+0xdd/0x190 kernel/task_work.c:141
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
    exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
    syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    other info that might help us debug this:

    Chain exists of:
    &bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&fs_devs->device_list_mutex);
    lock(sb_internal#2);
    lock(&fs_devs->device_list_mutex);
    lock(&bdev->bd_mutex);

    *** DEADLOCK ***

    3 locks held by syz-executor.0/6878:
    #0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365
    #1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178
    #2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159

    stack backtrace:
    CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x198/0x1fd lib/dump_stack.c:118
    check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
    check_prev_add kernel/locking/lockdep.c:2496 [inline]
    check_prevs_add kernel/locking/lockdep.c:2601 [inline]
    validate_chain kernel/locking/lockdep.c:3218 [inline]
    __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
    lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
    __mutex_lock_common kernel/locking/mutex.c:956 [inline]
    __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
    blkdev_put+0x30/0x520 fs/block_dev.c:1804
    btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
    btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
    btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
    close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
    close_fs_devices fs/btrfs/volumes.c:1193 [inline]
    btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
    close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
    generic_shutdown_super+0x144/0x370 fs/super.c:464
    kill_anon_super+0x36/0x60 fs/super.c:1108
    btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
    deactivate_locked_super+0x94/0x160 fs/super.c:335
    deactivate_super+0xad/0xd0 fs/super.c:366
    cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
    task_work_run+0xdd/0x190 kernel/task_work.c:141
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
    exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
    syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x460027
    RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027
    RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0
    RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b
    R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460
    R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460

    Signed-off-by: Josef Bacik
    [ add syzbot reference ]
    Signed-off-by: David Sterba

    Josef Bacik
     

25 Sep, 2020

1 commit

    We need to move the closing of the src_device out of all the device
    replace locking, but we definitely want to zero out the superblock
    before we commit the last time to make sure the device is properly
    removed. Handle this by pushing btrfs_scratch_superblocks into
    btrfs_dev_replace_finishing, and then later on we'll move the src_device
    closing and freeing code to where we need it to be.
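
    A rough ordering sketch of what is described (not the actual diff;
    btrfs_scratch_superblocks() is the real helper, but its exact signature
    at this point in the series may differ):

    /* tail of btrfs_dev_replace_finishing(), illustrative only */
    static void scratch_then_commit_sketch(struct btrfs_trans_handle *trans,
                                           struct btrfs_device *src_device)
    {
            /* zero the btrfs super block magic on the source device ... */
            btrfs_scratch_superblocks(src_device->bdev,
                                      rcu_str_deref(src_device->name));

            /* ... before the last commit, so a later scan can never pick
             * the replaced device up again */
            btrfs_commit_transaction(trans);
    }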

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

07 Sep, 2020

1 commit

  • Nikolay reported a lockdep splat in generic/476 that I could reproduce
    with btrfs/187.

    ======================================================
    WARNING: possible circular locking dependency detected
    5.9.0-rc2+ #1 Tainted: G W
    ------------------------------------------------------
    kswapd0/100 is trying to acquire lock:
    ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

    but task is already holding lock:
    ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (fs_reclaim){+.+.}-{0:0}:
    fs_reclaim_acquire+0x65/0x80
    slab_pre_alloc_hook.constprop.0+0x20/0x200
    kmem_cache_alloc_trace+0x3a/0x1a0
    btrfs_alloc_device+0x43/0x210
    add_missing_dev+0x20/0x90
    read_one_chunk+0x301/0x430
    btrfs_read_sys_array+0x17b/0x1b0
    open_ctree+0xa62/0x1896
    btrfs_mount_root.cold+0x12/0xea
    legacy_get_tree+0x30/0x50
    vfs_get_tree+0x28/0xc0
    vfs_kern_mount.part.0+0x71/0xb0
    btrfs_mount+0x10d/0x379
    legacy_get_tree+0x30/0x50
    vfs_get_tree+0x28/0xc0
    path_mount+0x434/0xc00
    __x64_sys_mount+0xe3/0x120
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
    __mutex_lock+0x7e/0x7e0
    btrfs_chunk_alloc+0x125/0x3a0
    find_free_extent+0xdf6/0x1210
    btrfs_reserve_extent+0xb3/0x1b0
    btrfs_alloc_tree_block+0xb0/0x310
    alloc_tree_block_no_bg_flush+0x4a/0x60
    __btrfs_cow_block+0x11a/0x530
    btrfs_cow_block+0x104/0x220
    btrfs_search_slot+0x52e/0x9d0
    btrfs_lookup_inode+0x2a/0x8f
    __btrfs_update_delayed_inode+0x80/0x240
    btrfs_commit_inode_delayed_inode+0x119/0x120
    btrfs_evict_inode+0x357/0x500
    evict+0xcf/0x1f0
    vfs_rmdir.part.0+0x149/0x160
    do_rmdir+0x136/0x1a0
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
    __lock_acquire+0x1184/0x1fa0
    lock_acquire+0xa4/0x3d0
    __mutex_lock+0x7e/0x7e0
    __btrfs_release_delayed_node.part.0+0x3f/0x330
    btrfs_evict_inode+0x24c/0x500
    evict+0xcf/0x1f0
    dispose_list+0x48/0x70
    prune_icache_sb+0x44/0x50
    super_cache_scan+0x161/0x1e0
    do_shrink_slab+0x178/0x3c0
    shrink_slab+0x17c/0x290
    shrink_node+0x2b2/0x6d0
    balance_pgdat+0x30a/0x670
    kswapd+0x213/0x4c0
    kthread+0x138/0x160
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Chain exists of:
    &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(fs_reclaim);
    lock(&fs_info->chunk_mutex);
    lock(fs_reclaim);
    lock(&delayed_node->mutex);

    *** DEADLOCK ***

    3 locks held by kswapd0/100:
    #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
    #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
    #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

    stack backtrace:
    CPU: 1 PID: 100 Comm: kswapd0 Tainted: G W 5.9.0-rc2+ #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
    dump_stack+0x92/0xc8
    check_noncircular+0x12d/0x150
    __lock_acquire+0x1184/0x1fa0
    lock_acquire+0xa4/0x3d0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    __mutex_lock+0x7e/0x7e0
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    ? __btrfs_release_delayed_node.part.0+0x3f/0x330
    ? lock_acquire+0xa4/0x3d0
    ? btrfs_evict_inode+0x11e/0x500
    ? find_held_lock+0x2b/0x80
    __btrfs_release_delayed_node.part.0+0x3f/0x330
    btrfs_evict_inode+0x24c/0x500
    evict+0xcf/0x1f0
    dispose_list+0x48/0x70
    prune_icache_sb+0x44/0x50
    super_cache_scan+0x161/0x1e0
    do_shrink_slab+0x178/0x3c0
    shrink_slab+0x17c/0x290
    shrink_node+0x2b2/0x6d0
    balance_pgdat+0x30a/0x670
    kswapd+0x213/0x4c0
    ? _raw_spin_unlock_irqrestore+0x46/0x60
    ? add_wait_queue_exclusive+0x70/0x70
    ? balance_pgdat+0x670/0x670
    kthread+0x138/0x160
    ? kthread_create_worker_on_cpu+0x40/0x40
    ret_from_fork+0x1f/0x30

    This is because we are holding the chunk_mutex when we call
    btrfs_alloc_device, which does a GFP_KERNEL allocation. We don't want
    to switch that to a GFP_NOFS allocation because this is the only place
    where it matters. So instead use memalloc_nofs_save() around the
    allocation in order to avoid the lockdep splat.
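
    A minimal sketch of the memalloc_nofs_save() pattern around the
    allocation (modeled on add_missing_dev(); simplified, with error
    handling omitted):

    #include <linux/sched/mm.h>     /* memalloc_nofs_save/restore */

    static struct btrfs_device *add_missing_dev_sketch(u64 devid,
                                                       const u8 *dev_uuid)
    {
            struct btrfs_device *device;
            unsigned int nofs_flag;

            /*
             * The caller holds chunk_mutex, and btrfs_alloc_device() does a
             * GFP_KERNEL allocation, so mark this section NOFS instead of
             * changing the allocation flags for every caller.
             */
            nofs_flag = memalloc_nofs_save();
            device = btrfs_alloc_device(NULL, &devid, dev_uuid);
            memalloc_nofs_restore(nofs_flag);

            return device;
    }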

    Reported-by: Nikolay Borisov
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Anand Jain
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

27 Aug, 2020

1 commit

  • With the conversion of the tree locks to rwsem I got the following
    lockdep splat:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
    ------------------------------------------------------
    btrfs-uuid/7955 is trying to acquire lock:
    ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

    but task is already holding lock:
    ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (btrfs-uuid-00){++++}-{3:3}:
    down_read_nested+0x3e/0x140
    __btrfs_tree_read_lock+0x39/0x180
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x4bd/0x990
    btrfs_uuid_tree_add+0x89/0x2d0
    btrfs_uuid_scan_kthread+0x330/0x390
    kthread+0x133/0x150
    ret_from_fork+0x1f/0x30

    -> #0 (btrfs-root-00){++++}-{3:3}:
    __lock_acquire+0x1272/0x2310
    lock_acquire+0x9e/0x360
    down_read_nested+0x3e/0x140
    __btrfs_tree_read_lock+0x39/0x180
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x4bd/0x990
    btrfs_find_root+0x45/0x1b0
    btrfs_read_tree_root+0x61/0x100
    btrfs_get_root_ref.part.50+0x143/0x630
    btrfs_uuid_tree_iterate+0x207/0x314
    btrfs_uuid_rescan_kthread+0x12/0x50
    kthread+0x133/0x150
    ret_from_fork+0x1f/0x30

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(btrfs-uuid-00);
    lock(btrfs-root-00);
    lock(btrfs-uuid-00);
    lock(btrfs-root-00);

    *** DEADLOCK ***

    1 lock held by btrfs-uuid/7955:
    #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

    stack backtrace:
    CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
    Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
    Call Trace:
    dump_stack+0x78/0xa0
    check_noncircular+0x165/0x180
    __lock_acquire+0x1272/0x2310
    lock_acquire+0x9e/0x360
    ? __btrfs_tree_read_lock+0x39/0x180
    ? btrfs_root_node+0x1c/0x1d0
    down_read_nested+0x3e/0x140
    ? __btrfs_tree_read_lock+0x39/0x180
    __btrfs_tree_read_lock+0x39/0x180
    __btrfs_read_lock_root_node+0x3a/0x50
    btrfs_search_slot+0x4bd/0x990
    btrfs_find_root+0x45/0x1b0
    btrfs_read_tree_root+0x61/0x100
    btrfs_get_root_ref.part.50+0x143/0x630
    btrfs_uuid_tree_iterate+0x207/0x314
    ? btree_readpage+0x20/0x20
    btrfs_uuid_rescan_kthread+0x12/0x50
    kthread+0x133/0x150
    ? kthread_create_on_node+0x60/0x60
    ret_from_fork+0x1f/0x30

    This problem exists because we have two different rescan threads:
    btrfs_uuid_scan_kthread, which creates the uuid tree, and
    btrfs_uuid_tree_iterate, which goes through and updates or deletes any
    out-of-date roots. The problem is they both do things in a different
    order. btrfs_uuid_scan_kthread() reads the tree_root, and then inserts
    entries into the uuid_root. btrfs_uuid_tree_iterate() scans the
    uuid_root, but then does a btrfs_get_fs_root() which can read from the
    tree_root.

    It's actually easy enough to not be holding the path in
    btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
    it further down and re-start the search when we loop. So simply move
    the path release before we add our entry to the uuid tree.

    This also fixes a problem where we're holding a path open after we do
    btrfs_end_transaction(), which has its own problems.
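
    A sketch of the reordered tail of one scan iteration (the caller's
    search loop supplies trans, path, root_item and key; the helper name is
    illustrative):

    static int add_uuid_entry_sketch(struct btrfs_trans_handle *trans,
                                     struct btrfs_path *path,
                                     struct btrfs_root_item *root_item,
                                     struct btrfs_key *key)
    {
            /*
             * Drop the root-tree path before touching the uuid tree, so the
             * btrfs-root lock is never held while taking btrfs-uuid locks,
             * and no path is held across btrfs_end_transaction().
             */
            btrfs_release_path(path);
            return btrfs_uuid_tree_add(trans, root_item->uuid,
                                       BTRFS_UUID_KEY_SUBVOL, key->objectid);
    }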

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

12 Aug, 2020

1 commit

  • [BUG]
    The following script can lead to tons of beyond device boundary access:

    mkfs.btrfs -f $dev -b 10G
    mount $dev $mnt
    trimfs $mnt
    btrfs filesystem resize 1:-1G $mnt
    trimfs $mnt

    [CAUSE]
    Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
    find_first_clear_extent_bit"), we try to avoid trimming ranges that are
    already trimmed.

    So we check device->alloc_state by finding the first range which
    doesn't have the CHUNK_TRIMMED and CHUNK_ALLOCATED bits set.

    But if we shrunk the device, those bits are not cleared, thus we could
    easily get a range that starts beyond the shrunk device size.

    This results in the returned @start and @end both being beyond the
    device size; then we call "end = min(end, device->total_bytes - 1);",
    making @end smaller than the device size (and thus smaller than
    @start).

    Then "len = end - start + 1" underflows, leading to the
    beyond-device-boundary access.

    [FIX]
    This patch will fix the problem in two ways (both are sketched below):

    - Clear the CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking the device
    This is the root fix

    - Add an extra safety check when trimming free device extents
    We check and warn if the returned range is already beyond the current
    device.
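
    Sketches of both parts, assuming the extent-io-tree helpers named in
    the report (the bit mask is spelled out; the real patch may use a mask
    constant, and the exact warning differs):

    /* 1) in btrfs_shrink_device(): forget trim/alloc state beyond the new
     *    size, so find_first_clear_extent_bit() can't hand back a range
     *    past the end of the shrunk device */
    clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
                      CHUNK_TRIMMED | CHUNK_ALLOCATED);

    /* 2) in the btrfs_trim_free_extents() loop: belt-and-braces check */
    find_first_clear_extent_bit(&device->alloc_state, start, &start, &end,
                                CHUNK_TRIMMED | CHUNK_ALLOCATED);
    if (start > device->total_bytes) {
            WARN_ON_ONCE(1);        /* stale alloc_state bits */
            break;                  /* nothing sane left to trim */
    }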

    Link: https://github.com/kdave/btrfs-progs/issues/282
    Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Qu Wenruo
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba

    Qu Wenruo
     

27 Jul, 2020

6 commits

  • We are currently getting this lockdep splat in btrfs/161:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-rc5+ #20 Tainted: G E
    ------------------------------------------------------
    mount/678048 is trying to acquire lock:
    ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]

    but task is already holding lock:
    ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
    __mutex_lock+0x8b/0x8f0
    btrfs_init_new_device+0x2d2/0x1240 [btrfs]
    btrfs_ioctl+0x1de/0x2d20 [btrfs]
    ksys_ioctl+0x87/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
    __lock_acquire+0x1240/0x2460
    lock_acquire+0xab/0x360
    __mutex_lock+0x8b/0x8f0
    clone_fs_devices+0x4d/0x170 [btrfs]
    btrfs_read_chunk_tree+0x330/0x800 [btrfs]
    open_ctree+0xb7c/0x18ce [btrfs]
    btrfs_mount_root.cold+0x13/0xfa [btrfs]
    legacy_get_tree+0x30/0x50
    vfs_get_tree+0x28/0xc0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    legacy_get_tree+0x30/0x50
    vfs_get_tree+0x28/0xc0
    do_mount+0x7de/0xb30
    __x64_sys_mount+0x8e/0xd0
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&fs_info->chunk_mutex);
    lock(&fs_devs->device_list_mutex);
    lock(&fs_info->chunk_mutex);
    lock(&fs_devs->device_list_mutex);

    *** DEADLOCK ***

    3 locks held by mount/678048:
    #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
    #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
    #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]

    stack backtrace:
    CPU: 2 PID: 678048 Comm: mount Tainted: G E 5.8.0-rc5+ #20
    Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
    Call Trace:
    dump_stack+0x96/0xd0
    check_noncircular+0x162/0x180
    __lock_acquire+0x1240/0x2460
    ? asm_sysvec_apic_timer_interrupt+0x12/0x20
    lock_acquire+0xab/0x360
    ? clone_fs_devices+0x4d/0x170 [btrfs]
    __mutex_lock+0x8b/0x8f0
    ? clone_fs_devices+0x4d/0x170 [btrfs]
    ? rcu_read_lock_sched_held+0x52/0x60
    ? cpumask_next+0x16/0x20
    ? module_assert_mutex_or_preempt+0x14/0x40
    ? __module_address+0x28/0xf0
    ? clone_fs_devices+0x4d/0x170 [btrfs]
    ? static_obj+0x4f/0x60
    ? lockdep_init_map_waits+0x43/0x200
    ? clone_fs_devices+0x4d/0x170 [btrfs]
    clone_fs_devices+0x4d/0x170 [btrfs]
    btrfs_read_chunk_tree+0x330/0x800 [btrfs]
    open_ctree+0xb7c/0x18ce [btrfs]
    ? super_setup_bdi_name+0x79/0xd0
    btrfs_mount_root.cold+0x13/0xfa [btrfs]
    ? vfs_parse_fs_string+0x84/0xb0
    ? rcu_read_lock_sched_held+0x52/0x60
    ? kfree+0x2b5/0x310
    legacy_get_tree+0x30/0x50
    vfs_get_tree+0x28/0xc0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    ? cred_has_capability+0x7c/0x120
    ? rcu_read_lock_sched_held+0x52/0x60
    ? legacy_get_tree+0x30/0x50
    legacy_get_tree+0x30/0x50
    vfs_get_tree+0x28/0xc0
    do_mount+0x7de/0xb30
    ? memdup_user+0x4e/0x90
    __x64_sys_mount+0x8e/0xd0
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This is because btrfs_read_chunk_tree() can come upon DEV_EXTENTs and
    then read the device, which takes the device_list_mutex. The
    device_list_mutex needs to be taken before the chunk_mutex, so this is a
    problem. We only really need the chunk mutex around adding the chunk,
    so move the mutex around read_one_chunk.

    An argument could be made that we don't even need the chunk_mutex here
    as it's during mount, and we are protected by various other locks.
    However we already have special rules for ->device_list_mutex, and I'd
    rather not have another special case for ->chunk_mutex.
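
    A sketch of the narrowed locking inside the item loop of
    btrfs_read_chunk_tree() (read_one_chunk() is the existing static
    helper; the surrounding iteration and error handling are abbreviated):

    if (found_key.type == BTRFS_CHUNK_ITEM_KEY) {
            struct btrfs_chunk *chunk;

            chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
            /*
             * chunk_mutex now only wraps the chunk insertion itself, not
             * the whole tree walk, which may also open devices and take
             * the device_list_mutex.
             */
            mutex_lock(&fs_info->chunk_mutex);
            ret = read_one_chunk(&found_key, leaf, chunk);
            mutex_unlock(&fs_info->chunk_mutex);
            if (ret)
                    break;
    }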

    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • There's long existed a lockdep splat because we open our bdev's under
    the ->device_list_mutex at mount time, which acquires the bd_mutex.
    Usually this goes unnoticed, but if you do loopback devices at all
    suddenly the bd_mutex comes with a whole host of other dependencies,
    which results in the splat when you mount a btrfs file system.

    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
    ------------------------------------------------------
    systemd-journal/509 is trying to acquire lock:
    ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]

    but task is already holding lock:
    ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #6 (sb_pagefaults){.+.+}-{0:0}:
    __sb_start_write+0x13e/0x220
    btrfs_page_mkwrite+0x59/0x560 [btrfs]
    do_page_mkwrite+0x4f/0x130
    do_wp_page+0x3b0/0x4f0
    handle_mm_fault+0xf47/0x1850
    do_user_addr_fault+0x1fc/0x4b0
    exc_page_fault+0x88/0x300
    asm_exc_page_fault+0x1e/0x30

    -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
    __might_fault+0x60/0x80
    _copy_from_user+0x20/0xb0
    get_sg_io_hdr+0x9a/0xb0
    scsi_cmd_ioctl+0x1ea/0x2f0
    cdrom_ioctl+0x3c/0x12b4
    sr_block_ioctl+0xa4/0xd0
    block_ioctl+0x3f/0x50
    ksys_ioctl+0x82/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #4 (&cd->lock){+.+.}-{3:3}:
    __mutex_lock+0x7b/0x820
    sr_block_open+0xa2/0x180
    __blkdev_get+0xdd/0x550
    blkdev_get+0x38/0x150
    do_dentry_open+0x16b/0x3e0
    path_openat+0x3c9/0xa00
    do_filp_open+0x75/0x100
    do_sys_openat2+0x8a/0x140
    __x64_sys_openat+0x46/0x70
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
    __mutex_lock+0x7b/0x820
    __blkdev_get+0x6a/0x550
    blkdev_get+0x85/0x150
    blkdev_get_by_path+0x2c/0x70
    btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
    open_fs_devices+0x88/0x240 [btrfs]
    btrfs_open_devices+0x92/0xa0 [btrfs]
    btrfs_mount_root+0x250/0x490 [btrfs]
    legacy_get_tree+0x30/0x50
    vfs_get_tree+0x28/0xc0
    vfs_kern_mount.part.0+0x71/0xb0
    btrfs_mount+0x119/0x380 [btrfs]
    legacy_get_tree+0x30/0x50
    vfs_get_tree+0x28/0xc0
    do_mount+0x8c6/0xca0
    __x64_sys_mount+0x8e/0xd0
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
    __mutex_lock+0x7b/0x820
    btrfs_run_dev_stats+0x36/0x420 [btrfs]
    commit_cowonly_roots+0x91/0x2d0 [btrfs]
    btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
    btrfs_sync_file+0x38a/0x480 [btrfs]
    __x64_sys_fdatasync+0x47/0x80
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
    __mutex_lock+0x7b/0x820
    btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
    btrfs_sync_file+0x38a/0x480 [btrfs]
    __x64_sys_fdatasync+0x47/0x80
    do_syscall_64+0x52/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
    __lock_acquire+0x1241/0x20c0
    lock_acquire+0xb0/0x400
    __mutex_lock+0x7b/0x820
    btrfs_record_root_in_trans+0x44/0x70 [btrfs]
    start_transaction+0xd2/0x500 [btrfs]
    btrfs_dirty_inode+0x44/0xd0 [btrfs]
    file_update_time+0xc6/0x120
    btrfs_page_mkwrite+0xda/0x560 [btrfs]
    do_page_mkwrite+0x4f/0x130
    do_wp_page+0x3b0/0x4f0
    handle_mm_fault+0xf47/0x1850
    do_user_addr_fault+0x1fc/0x4b0
    exc_page_fault+0x88/0x300
    asm_exc_page_fault+0x1e/0x30

    other info that might help us debug this:

    Chain exists of:
    &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(sb_pagefaults);
    lock(&mm->mmap_lock#2);
    lock(sb_pagefaults);
    lock(&fs_info->reloc_mutex);

    *** DEADLOCK ***

    3 locks held by systemd-journal/509:
    #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
    #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
    #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]

    stack backtrace:
    CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    Call Trace:
    dump_stack+0x92/0xc8
    check_noncircular+0x134/0x150
    __lock_acquire+0x1241/0x20c0
    lock_acquire+0xb0/0x400
    ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
    ? lock_acquire+0xb0/0x400
    ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
    __mutex_lock+0x7b/0x820
    ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
    ? kvm_sched_clock_read+0x14/0x30
    ? sched_clock+0x5/0x10
    ? sched_clock_cpu+0xc/0xb0
    btrfs_record_root_in_trans+0x44/0x70 [btrfs]
    start_transaction+0xd2/0x500 [btrfs]
    btrfs_dirty_inode+0x44/0xd0 [btrfs]
    file_update_time+0xc6/0x120
    btrfs_page_mkwrite+0xda/0x560 [btrfs]
    ? sched_clock+0x5/0x10
    do_page_mkwrite+0x4f/0x130
    do_wp_page+0x3b0/0x4f0
    handle_mm_fault+0xf47/0x1850
    do_user_addr_fault+0x1fc/0x4b0
    exc_page_fault+0x88/0x300
    ? asm_exc_page_fault+0x8/0x30
    asm_exc_page_fault+0x1e/0x30
    RIP: 0033:0x7fa3972fdbfe
    Code: Bad RIP value.

    Fix this by not holding the ->device_list_mutex at this point. The
    device_list_mutex exists to protect us from modifying the device list
    while the file system is running.

    However it can also be modified by doing a scan on a device. But this
    action is specifically protected by the uuid_mutex, which we are holding
    here. We cannot race with opening at this point because we have the
    ->s_umount lock held during the mount. Not having the
    ->device_list_mutex here is perfectly safe as we're not going to change
    the devices at this point.
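
    A sketch of the open path once the device_list_mutex is gone from it
    (lockdep_assert_held() documents the reliance on the uuid_mutex; the
    real open_fs_devices() does more per-device bookkeeping):

    static int open_fs_devices_sketch(struct btrfs_fs_devices *fs_devices,
                                      fmode_t flags, void *holder)
    {
            struct btrfs_device *device;

            /* scans are the only concurrent modifier of this list at mount
             * time, and they take the uuid_mutex, which our caller holds */
            lockdep_assert_held(&uuid_mutex);

            list_for_each_entry(device, &fs_devices->devices, dev_list) {
                    /* blkdev_get_by_path() -> bd_mutex happens in here,
                     * with no device_list_mutex in the picture anymore */
                    btrfs_open_one_device(fs_devices, device, flags, holder);
            }
            return fs_devices->opened ? 0 : -EINVAL;
    }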

    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add some comments ]
    Signed-off-by: David Sterba

    Josef Bacik
     
    Since most metadata reservation calls can return -EINTR when they get
    interrupted by a fatal signal, we need to review all the metadata
    reservation call sites.

    In relocation code, the metadata reservation happens in the following
    sites:

    - btrfs_block_rsv_refill() in merge_reloc_root()
    merge_reloc_root() is a pretty critical section; we don't want to be
    interrupted by a signal, so change the flush status to
    BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by a signal.
    Since such a change can be ENOSPC-prone, also shrink the amount of
    metadata to reserve to the least amount, to avoid a deadly ENOSPC there.

    - btrfs_block_rsv_refill() in reserve_metadata_space()
    It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
    by signal.

    - btrfs_block_rsv_refill() in prepare_to_relocate()

    - btrfs_block_rsv_add() in prepare_to_relocate()

    - btrfs_block_rsv_refill() in relocate_block_group()

    - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()

    - btrfs_start_transaction() in relocate_block_group()

    - btrfs_start_transaction() in create_reloc_inode()
    Can be interrupted by a fatal signal and we can handle it easily.
    For these call sites, just catch the -EINTR value in btrfs_balance()
    and count them as canceled (see the sketch after this list).
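
    A minimal sketch of the mapping in btrfs_balance() (fragment only; the
    surrounding function is unchanged apart from this):

    /* after the balance/relocation work has returned into btrfs_balance():
     * a fatal signal that interrupted a reservation deep inside relocation
     * surfaces as -EINTR, so count it as a canceled balance rather than a
     * hard failure */
    if (ret == -EINTR)
            ret = -ECANCELED;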

    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • The whole chunk tree is read at mount time so we can utilize readahead
    to get the tree blocks to memory before we read the items. The idea is
    from Robbie, but instead of updating search slot readahead, this patch
    implements the chunk tree readahead manually from nodes on level 1.

    We've decided to do specific readahead optimizations and then unify them
    under a common API so we don't break everything by changing the search
    slot readahead logic.

    Chunk trees grow taller on large filesystems (many terabytes), and
    prefetching just level 1 seems to be sufficient. The provided example
    was from a 200TiB filesystem with a chunk tree of level 2.
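
    A sketch of the manual level-1 readahead described above, using the
    extent-buffer accessors; readahead_tree_block() is assumed to be the
    readahead helper available at that point:

    /* called from btrfs_read_chunk_tree() before iterating the items */
    static void readahead_chunk_tree_sketch(struct btrfs_fs_info *fs_info,
                                            struct extent_buffer *node)
    {
            int i;

            /* only prefetch children of a level 1 node, as described */
            if (btrfs_header_level(node) != 1)
                    return;

            for (i = 0; i < btrfs_header_nritems(node); i++)
                    readahead_tree_block(fs_info,
                                         btrfs_node_blockptr(node, i));
    }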

    CC: Robbie Ko
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    David Sterba
     
    Since btrfs_bio always contains the extra space for the tgtdev_map and
    raid_map, it's pointless to make the assignment only when specific
    conditions are met.

    Instead, always assign the pointers to their correct value at allocation
    time. To accommodate this change, also move code a bit in
    __btrfs_map_block so that the btrfs_bio::stripes array is always
    initialized before the raid_map, and subsequently move the call to
    sort_parity_stripes into the 'if' building the raid_map, retaining the
    old behavior.

    To better understand the change, there are 2 aspects to this:

    1. The original code is harder to grasp because the calculations for
    initializing the raid_map/tgtdev pointers are apart from the initial
    allocation of memory. Having them predicated on 2 separate checks
    doesn't help either. So moving the initialisation into alloc_btrfs_bio
    puts everything together.

    2. tgtdev_map/raid_map are now always initialized even though sometimes
    they might be equal (e.g. __btrfs_map_block_for_discard calls
    alloc_btrfs_bio with tgtdev = 0), but their usage should be predicated
    on external checks, i.e. just because those pointers are non-NULL
    doesn't mean they are valid per se. And actually, while taking another
    look at __btrfs_map_block, I saw a discrepancy:

    Original code initialised tgtdev_map if the following check is true:

    if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)

    However, further down tgtdev_map is only used if the following check
    is true:

    if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op))

    i.e. the additional need_full_stripe(op) predicate is there.
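
    A simplified model of the allocation-time initialisation (a generic
    trailing-storage layout in the spirit of alloc_btrfs_bio, not the exact
    btrfs definition):

    /*
     * One allocation carries the stripes array plus the tgtdev_map and
     * raid_map areas, and the pointers into that trailing storage are set
     * up unconditionally right here.  Whether they are *used* still
     * depends on the callers' checks (dev-replace running,
     * need_full_stripe(), ...).
     */
    static struct btrfs_bio *alloc_btrfs_bio_sketch(int total_stripes,
                                                    int real_stripes)
    {
            struct btrfs_bio *bbio;

            bbio = kzalloc(sizeof(*bbio) +
                           sizeof(struct btrfs_bio_stripe) * total_stripes +
                           sizeof(int) * real_stripes +    /* tgtdev_map */
                           sizeof(u64) * total_stripes,    /* raid_map   */
                           GFP_NOFS);
            if (!bbio)
                    return NULL;

            /* stripes[] is the flexible array at the end of the struct */
            bbio->tgtdev_map = (int *)(bbio->stripes + total_stripes);
            bbio->raid_map = (u64 *)(bbio->tgtdev_map + real_stripes);
            return bbio;
    }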

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    [ copy more details from mail discussion ]
    Signed-off-by: David Sterba

    Nikolay Borisov
     
    btrfs_map_bio ensures that all bios submitted to devices have a valid
    btrfs_device::bdev, so this check can be removed from btrfs_end_bio.
    This check was added in June 2012 by commit 597a60fadedf ("Btrfs: don't
    count I/O statistic read errors for missing devices"), but then in
    October of the same year another commit, de1ee92ac3bc ("Btrfs: recheck
    bio against block device when we map the bio"), started checking for
    the presence of btrfs_device::bdev before actually issuing the bio.

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov