30 May, 2018

13 commits

  • …created with quota enabled

    [ Upstream commit 4d31778aa2fa342f5f92ca4025b293a1729161d1 ]

    When multiple pending snapshots referring to the same source subvolume
    are executed, enabled quota will cause root item corruption, where root
    items are using old bytenr (no backref in extent tree).

    This can be triggered by fstests btrfs/152.

    The cause is that when the source subvolume is still dirty, the extra
    commit (simplified transaction commit) of qgroup_account_snapshot() can
    skip dirty roots not recorded in the current transaction, so the root
    item of the source subvolume is not updated.

    Fix it by forcing recording source subvolume in current transaction
    before qgroup sub-transaction commit.

    Reported-by: Justin Maggard <jmaggard@netgear.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Qu Wenruo
     
  • [ Upstream commit 8a5a916d9a35e13576d79cc16e24611821b13e34 ]

    While running btrfs/011, I hit the following lockdep splat.

    This is the important bit:
    pcpu_alloc+0x1ac/0x5e0
    __percpu_counter_init+0x4e/0xb0
    btrfs_init_fs_root+0x99/0x1c0 [btrfs]
    btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
    resolve_indirect_refs+0x130/0x830 [btrfs]
    find_parent_nodes+0x69e/0xff0 [btrfs]
    btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
    btrfs_find_all_roots+0x50/0x70 [btrfs]
    btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
    btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]

    The percpu_counter_init call in btrfs_alloc_subvolume_writers
    uses GFP_KERNEL, which we can't do during transaction commit.

    This switches it to GFP_NOFS.

    ========================================================
    WARNING: possible irq lock inversion dependency detected
    4.12.14-kvmsmall #8 Tainted: G W
    --------------------------------------------------------
    kswapd0/50 just changed the state of lock:
    (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    but this lock took another, RECLAIM_FS-unsafe lock in the past:
    (pcpu_alloc_mutex){+.+.+.}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Chain exists of:
    &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex

    Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(pcpu_alloc_mutex);
                                   local_irq_disable();
                                   lock(&delayed_node->mutex);
                                   lock(&found->groups_sem);
      <Interrupt>
        lock(&delayed_node->mutex);

    *** DEADLOCK ***

    2 locks held by kswapd0/50:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x7f/0x5b0
    #1: (&type->s_umount_key#30){+++++.}, at: [] trylock_super+0x16/0x50

    the shortest dependencies between 2nd lock and 1st lock:
    -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
    HARDIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    kmem_cache_init_late+0x42/0x75
    start_kernel+0x343/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    SOFTIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    kmem_cache_init_late+0x42/0x75
    start_kernel+0x343/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    RECLAIM_FS-ON-W at:
    __kmalloc+0x47/0x310
    pcpu_extend_area_map+0x2b/0xc0
    pcpu_alloc+0x3ec/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    __kmem_cache_create+0x1bf/0x390
    create_cache+0xba/0x1b0
    kmem_cache_create+0x1f8/0x2b0
    ksm_init+0x6f/0x19d
    do_one_initcall+0x50/0x1b0
    kernel_init_freeable+0x201/0x289
    kernel_init+0xa/0x100
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    setup_cpu_cache+0x2f/0x1f0
    __kmem_cache_create+0x1bf/0x390
    create_boot_cache+0x8b/0xb1
    kmem_cache_init+0xa1/0x19e
    start_kernel+0x270/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    }
    ... key at: [] pcpu_alloc_mutex+0x70/0xa0
    ... acquired at:
    pcpu_alloc+0x1ac/0x5e0
    __percpu_counter_init+0x4e/0xb0
    btrfs_init_fs_root+0x99/0x1c0 [btrfs]
    btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
    resolve_indirect_refs+0x130/0x830 [btrfs]
    find_parent_nodes+0x69e/0xff0 [btrfs]
    btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
    btrfs_find_all_roots+0x50/0x70 [btrfs]
    btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
    btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
    transaction_kthread+0x176/0x1b0 [btrfs]
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50

    -> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
    HARDIRQ-ON-W at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    HARDIRQ-ON-R at:
    down_read+0x35/0x90
    caching_thread+0x57/0x560 [btrfs]
    normal_work_helper+0x1c0/0x5e0 [btrfs]
    process_one_work+0x1e0/0x5c0
    worker_thread+0x44/0x390
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    SOFTIRQ-ON-W at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-R at:
    down_read+0x35/0x90
    caching_thread+0x57/0x560 [btrfs]
    normal_work_helper+0x1c0/0x5e0 [btrfs]
    process_one_work+0x1e0/0x5c0
    worker_thread+0x44/0x390
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
    ... acquired at:
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
    btrfs_create_tree+0xbb/0x2a0 [btrfs]
    btrfs_create_uuid_tree+0x37/0x140 [btrfs]
    open_ctree+0x23c0/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    -> (&found->groups_sem){++++..} ops: 2134587 {
    HARDIRQ-ON-W at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    HARDIRQ-ON-R at:
    down_read+0x35/0x90
    btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
    open_ctree+0x207b/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-W at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-R at:
    down_read+0x35/0x90
    btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
    open_ctree+0x207b/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    INITIAL USE at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
    ... acquired at:
    find_free_extent+0xcb4/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
    __btrfs_cow_block+0x110/0x5b0 [btrfs]
    btrfs_cow_block+0xd7/0x290 [btrfs]
    btrfs_search_slot+0x1f6/0x960 [btrfs]
    btrfs_lookup_inode+0x2a/0x90 [btrfs]
    __btrfs_update_delayed_inode+0x65/0x210 [btrfs]
    btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
    btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
    evict+0xc4/0x190
    __dentry_kill+0xbf/0x170
    dput+0x2ae/0x2f0
    SyS_rename+0x2a6/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    -> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
    HARDIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    IN-RECLAIM_FS-W at:
    __mutex_lock+0x4e/0x8c0
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
    ... acquired at:
    __lock_acquire+0x264/0x11c0
    lock_acquire+0xbd/0x1e0
    __mutex_lock+0x4e/0x8c0
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50

    stack backtrace:
    CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased)
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
    dump_stack+0x78/0xb7
    print_irq_inversion_bug.part.38+0x19f/0x1aa
    check_usage_forwards+0x102/0x120
    ? ret_from_fork+0x3a/0x50
    ? check_usage_backwards+0x110/0x110
    mark_lock+0x16c/0x270
    __lock_acquire+0x264/0x11c0
    ? pagevec_lookup_entries+0x1a/0x30
    ? truncate_inode_pages_range+0x2b3/0x7f0
    lock_acquire+0xbd/0x1e0
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    __mutex_lock+0x4e/0x8c0
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ? mem_cgroup_shrink_node+0x2c0/0x2c0
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x3a/0x50

    Signed-off-by: Jeff Mahoney
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • [ Upstream commit 8434ec46c6e3232cebc25a910363b29f5c617820 ]

    When logging an inode, at tree-log.c:copy_items(), if we call
    btrfs_next_leaf() at the loop which checks for the need to log holes, we
    need to make sure copy_items() returns the value 1 to its caller and
    not 0 (on success). This is because the path the caller passed was
    released and is now different from what it was before, and the caller
    expects a return value of 0 to mean both success and that the path
    has not changed, while a return value of 1 means both success and
    signals the caller that it can not reuse the path, it has to perform
    another tree search.

    Even though this case should not be triggered under normal
    circumstances, or at least only very rarely, its consequences can be
    very unpredictable (especially when replaying a log tree).
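
    The return convention described above can be illustrated with a small
    userspace model. This is only a sketch: struct path_model,
    copy_items_model() and log_one_batch() are hypothetical names and do
    not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a btree path; not the kernel's struct btrfs_path. */
struct path_model {
    int leaf;   /* leaf the path currently points at */
    int slot;   /* slot within that leaf */
};

/*
 * Model of the callee: returns 0 on success with the path untouched,
 * 1 on success after the path was released and moved (here: advanced to
 * the next leaf), and a negative value on error (not modelled).
 */
int copy_items_model(struct path_model *p, bool must_visit_next_leaf)
{
    assert(p != NULL);
    if (must_visit_next_leaf) {
        p->leaf += 1;   /* the path no longer matches what the caller passed */
        p->slot = 0;
        return 1;       /* success, but the caller must search again */
    }
    return 0;           /* success, the path can be reused as-is */
}

/*
 * Model of the caller: counts how many fresh tree searches it performs.
 * A return value of 1 forces a new search; 0 lets it keep using the path.
 */
int log_one_batch(struct path_model *p, bool must_visit_next_leaf,
                  int *searches)
{
    int ret = copy_items_model(p, must_visit_next_leaf);

    if (ret < 0)
        return ret;
    if (ret == 1)
        (*searches)++;  /* path changed: redo the search */
    return 0;
}
```

    If the callee returned 0 after moving the path, the caller in this
    model would keep using a leaf and slot that no longer describe its
    original position, which is the situation the fix avoids.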

    Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a ]

    The extent tree of the test fs is like the following:

    BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
    item 0 key (4096 168 4096) itemoff 3944 itemsize 51
    extent refs 1 gen 1 flags 2
    tree block key (68719476736 0 0) level 1
    ^^^^^^^
    ref#0: tree block backref root 5

    And it's using an empty tree for fs tree, so there is no way that its
    level can be 1.

    For REAL (created by mkfs) fs tree backref with no skinny metadata, the
    result should look like:

    item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
    refs 1 gen 4 flags TREE_BLOCK
    tree block key (256 INODE_ITEM 0) level 0
    ^^^^^^^
    tree block backref root 5

    Fix the level to 0, so it won't break later tree level checker.

    Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • [ Upstream commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 ]

    do_chunk_alloc implements a loop checking whether there is a pending
    chunk allocation and if so causes the caller to loop. Generally this
    loop is executed only once, however testing with btrfs/072 on a
    single-core VM uncovered an extreme case where the system could loop
    indefinitely. This is due to a missing cond_resched in the loop, which
    doesn't give the previous chunk allocator a chance to finish its job.

    The fix is to simply add the missing cond_resched.
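
    A rough userspace model of why the yield matters on a single CPU,
    assuming the other allocator only makes progress when the waiter
    gives up the CPU. All names here (chunk_alloc_model, yield_cpu,
    wait_for_pending_alloc) are hypothetical; yield_cpu() merely stands
    in for cond_resched().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-CPU model of an in-flight chunk allocation. */
struct chunk_alloc_model {
    int remaining_work;   /* steps the in-flight allocation still needs */
};

/* Stand-in for cond_resched(): lets the other task run for one step. */
void yield_cpu(struct chunk_alloc_model *m)
{
    if (m->remaining_work > 0)
        m->remaining_work--;
}

/*
 * Wait for the pending allocation to finish, looping at most max_spins
 * times. Returns the number of iterations used, or -1 if the budget ran
 * out (the model's equivalent of looping indefinitely).
 */
int wait_for_pending_alloc(struct chunk_alloc_model *m, bool yield,
                           int max_spins)
{
    assert(m != NULL);
    for (int spins = 0; spins < max_spins; spins++) {
        if (m->remaining_work == 0)
            return spins;
        if (yield)
            yield_cpu(m);   /* without this the other task never runs */
    }
    return -1;
}
```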

    Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • [ Upstream commit 80c0b4210a963e31529e15bf90519708ec947596 ]

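    0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
    returned, path->nodes[0] could be NULL; log_dir_items lacks such a
    check for <0 and we may run into a NULL pointer dereference.
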
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 ]

    If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
    to bail out, otherwise @ret would be forced to be 0 after 'break;' and
    the caller won't be aware of it.
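
    The pattern can be sketched in plain C. This is a simplified model,
    not the kernel code: walk_model, next_leaf_model(), walk_buggy() and
    walk_fixed() are hypothetical names, and next_leaf_model() only
    imitates the 0 / 1 / negative return convention.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct walk_model {
    int leaves_left;  /* leaves still to visit */
    int fail_when;    /* inject -EIO when leaves_left hits this; -1 = never */
};

/* Returns 0 when it advanced, 1 when there are no more leaves, <0 on error. */
int next_leaf_model(struct walk_model *w)
{
    assert(w != NULL);
    if (w->leaves_left == w->fail_when)
        return -EIO;
    if (w->leaves_left == 0)
        return 1;
    w->leaves_left--;
    return 0;
}

/* Buggy shape: any non-zero return breaks out, then ret is reset to 0. */
int walk_buggy(struct walk_model *w)
{
    int ret;

    for (;;) {
        ret = next_leaf_model(w);
        if (ret)
            break;      /* "error" and "done" are treated alike */
    }
    ret = 0;            /* the error is overwritten here */
    return ret;
}

/* Fixed shape: bail out on errors, break only when the walk is done. */
int walk_fixed(struct walk_model *w)
{
    int ret;

    for (;;) {
        ret = next_leaf_model(w);
        if (ret < 0)
            return ret; /* propagate the error to the caller */
        if (ret > 0)
            break;
    }
    return 0;
}
```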

    Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 471d557afed155b85da237ec46c549f443eeb5de ]

    Currently if we allocate extents beyond an inode's i_size (through the
    fallocate system call) and then fsync the file, we log the extents but
    after a power failure we replay them and then immediately drop them.
    This behaviour has existed since about 2009, commit c71bf099abdd ("Btrfs:
    Avoid orphan inodes cleanup while replaying log"), because it marks
    the inode as an orphan instead of dropping any extents beyond i_size
    before replaying logged extents, so after the log replay, and while
    the mount operation is still ongoing, we find the inode marked as an
    orphan and then perform a truncation (drop extents beyond the inode's
    i_size). Because the processing of orphan inodes is still done
    right after replaying the log and before the mount operation finishes,
    the intention of that commit does not make any sense (at least as
    of today). However reverting that behaviour is not enough, because
    we can not simply discard all extents beyond i_size and then replay
    logged extents, because we risk dropping extents beyond i_size created
    in past transactions, for example:

    add prealloc extent beyond i_size
    fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
    transaction commit
    add another prealloc extent beyond i_size
    fsync - triggers the fast fsync path
    power failure

    In that scenario, we would drop the first extent and then replay the
    second one. To fix this just make sure that all prealloc extents
    beyond i_size are logged, and if we find too many (which is far from
    a common case), fall back to a full transaction commit (like we do when
    logging regular extents in the fast fsync path).
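
    The scenario above can be played through with a toy model, assuming a
    naive replay that first drops every extent starting at or beyond
    i_size and then adds the logged extents. The types and the
    naive_replay() helper are hypothetical and only illustrate why every
    prealloc extent beyond i_size has to be in the log.

```c
#include <assert.h>
#include <stddef.h>

#define KIB(n) ((unsigned long)(n) * 1024UL)
#define MAX_EXTENTS 8

struct extent_model {
    unsigned long start;
    unsigned long len;
};

struct file_model {
    unsigned long i_size;
    struct extent_model extents[MAX_EXTENTS];
    size_t nr;
};

/*
 * Naive replay: discard all extents that start at or beyond i_size, then
 * append the extents found in the log.
 */
void naive_replay(struct file_model *f, const struct extent_model *logged,
                  size_t nr_logged)
{
    size_t kept = 0;

    assert(f != NULL);
    for (size_t i = 0; i < f->nr; i++) {
        if (f->extents[i].start < f->i_size)
            f->extents[kept++] = f->extents[i];
    }
    f->nr = kept;

    for (size_t i = 0; i < nr_logged; i++) {
        assert(f->nr < MAX_EXTENTS);
        f->extents[f->nr++] = logged[i];
    }
}
```

    With a 256K file, a first prealloc extent at 256K committed in an
    earlier transaction and only the second prealloc extent in the log,
    the model loses the first extent; logging both keeps both.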

    Trivial reproducer:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt
    $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
    $ sync
    $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
    $ xfs_io -c "fsync" /mnt/foo
    <power fail>

    # mount to replay log
    $ mount /dev/sdb /mnt
    # at this point the file only has one extent, at offset 0, size 256K

    A test case for fstests follows soon, covering multiple scenarios that
    involve adding prealloc extents with previous shrinking truncates and
    without such truncates.

    Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit af7227338135d2f1b1552bf9a6d43e02dcba10b9 ]

    Currently if some fatal errors occur, like all IO get -EIO, resources
    would be cleaned up when
    a) transaction is being committed or
    b) BTRFS_FS_STATE_ERROR is set

    However, in some rare cases, resources may be left alone after transaction
    gets aborted and umount may run into some ASSERT(), e.g.
    ASSERT(list_empty(&block_group->dirty_list));

    For case a), in btrfs_commit_transaction(), there are several places at the
    beginning where we just call btrfs_end_transaction() without cleaning up
    resources. For case b), it is possible that the trans handle doesn't have
    any dirty stuff, then only the trans handle is marked as aborted while
    BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.

    This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
    all resources won't stay in memory after umount.
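
    A minimal sketch of the widened check, assuming the filesystem state
    is kept as a bit mask. The enum values and function names are
    hypothetical; the kernel's actual bit numbers and helpers may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this illustration. */
enum fs_state_model {
    FS_STATE_ERROR_MODEL         = 1u << 0,
    FS_STATE_TRANS_ABORTED_MODEL = 1u << 1,
};

/* Old behaviour: clean up only when the error bit is set. */
bool needs_cleanup_old(unsigned int fs_state)
{
    return (fs_state & FS_STATE_ERROR_MODEL) != 0;
}

/* New behaviour: an aborted transaction alone also triggers the cleanup. */
bool needs_cleanup_new(unsigned int fs_state)
{
    return (fs_state & (FS_STATE_ERROR_MODEL |
                        FS_STATE_TRANS_ABORTED_MODEL)) != 0;
}
```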

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 9a6509c4daa91400b52a5fd541a5521c649a8fea ]

    If in the same transaction we rename a special file (fifo, character/block
    device or symbolic link), create a hard link for it having its old name
    then sync the log, we will end up with a log that can not be replayed and
    when attempting to replay it, an EEXIST error is returned and mounting
    the filesystem fails. Example scenario:

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt
    $ mkdir /mnt/testdir
    $ mkfifo /mnt/testdir/foo
    # Make sure everything done so far is durably persisted.
    $ sync

    # Create some unrelated file and fsync it, this is just to create a log
    # tree. The file must be in the same directory as our special file.
    $ touch /mnt/testdir/f1
    $ xfs_io -c "fsync" /mnt/testdir/f1

    # Rename our special file and then create a hard link with its old name.
    $ mv /mnt/testdir/foo /mnt/testdir/bar
    $ ln /mnt/testdir/bar /mnt/testdir/foo

    # Create some other unrelated file and fsync it, this is just to persist
    # the log tree which was modified by the previous rename and link
    # operations. Alternatively we could have modified file f1 and fsync it.
    $ touch /mnt/f2
    $ xfs_io -c "fsync" /mnt/f2

    <power fail>
    $ mount /dev/sdc /mnt
    mount: mount /dev/sdc on /mnt failed: File exists

    This happens because both the log tree and the subvolume's tree have
    an entry in the directory "testdir" with the same name, that is, there
    is one key (258 INODE_REF 257) in the subvolume tree and another one in
    the log tree (where 258 is the inode number of our special file and 257
    is the inode for directory "testdir"). Only the data of those two keys
    differs: in the subvolume tree the index field for the inode reference
    has a value of 3, while in the log tree it has a value of 5. Because the
    same key exists in both trees, but with different index values, the log
    replay fails with an -EEXIST error when attempting to replay the inode
    reference from the log tree.

    Fix this by setting the last_unlink_trans field of the inode (our special
    file) to the current transaction id when a hard link is created, as this
    forces logging the parent directory inode, solving the conflict at log
    replay time.

    A new generic test case for fstests was also submitted.

    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit d4dfc0f4d39475ccbbac947880b5464a74c30b99 ]

    When doing an incremental send of a filesystem with the no-holes feature
    enabled, we end up issuing a write operation when using the no data mode
    send flag, instead of issuing an update extent operation. Fix this by
    issuing the update extent operation instead.
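
    The decision can be sketched as follows. This is a hypothetical model
    of the behaviour described above, not the send code itself: the enum,
    the flag value and pick_hole_cmd() are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical command identifiers for the two operations named above. */
enum send_cmd_model {
    SEND_CMD_WRITE_MODEL,
    SEND_CMD_UPDATE_EXTENT_MODEL,
};

/* Hypothetical flag bit standing in for the "no data" send mode. */
#define SEND_FLAG_NO_DATA_MODEL (1u << 0)

/*
 * Pick the operation to issue for a hole: in no-data mode only an
 * update extent operation is sent, otherwise a regular write.
 */
enum send_cmd_model pick_hole_cmd(unsigned int send_flags)
{
    bool no_data = (send_flags & SEND_FLAG_NO_DATA_MODEL) != 0;

    return no_data ? SEND_CMD_UPDATE_EXTENT_MODEL : SEND_CMD_WRITE_MODEL;
}
```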

    Trivial reproducer:

    $ mkfs.btrfs -f -O no-holes /dev/sdc
    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdc /mnt/sdc
    $ mount /dev/sdd /mnt/sdd

    $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
    $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1

    $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
    $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2

    $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
    $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
    | btrfs receive -vv /mnt/sdd

    Before this change the output of the second receive command is:

    receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
    utimes
    write foobar, offset 8192, len 8192
    utimes foobar
    BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...

    After this change it is:

    receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
    utimes
    update_extent foobar: offset=8192, len=8192
    utimes foobar
    BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...

    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit a8fd1f71749387c9a1053a83ff1c16287499a4e7 ]

    The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
    kernels built with NR_CPUS=8192, this can result in kmalloc failures
    that prevent mounting.

    There is work in progress to try to resolve this for every user of
    srcu_struct but using kvzalloc will work around the failures until
    that is complete.

    As an example with NR_CPUS=512 on x86_64: the overall size of
    subvol_srcu is 3460 bytes, fs_info is 6496.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream.

    For anything NFS-exported we do _not_ want to unlock new inode
    before it has grown an alias; original set of fixes got the
    ordering right, but missed the nasty complication in case of
    lockdep being enabled - unlock_new_inode() does
    lockdep_annotate_inode_mutex_key(inode)
    which can only be done before anyone gets a chance to touch
    ->i_mutex. Unfortunately, flipping the order and doing
    unlock_new_inode() before d_instantiate() opens a window when
    mkdir can race with open-by-fhandle on a guessed fhandle, leading
    to multiple aliases for a directory inode and all the breakage
    that follows from that.

    Correct solution: a new primitive (d_instantiate_new())
    combining these two in the right order - lockdep annotate, then
    d_instantiate(), then the rest of unlock_new_inode(). All
    combinations of d_instantiate() with unlock_new_inode() should
    be converted to that.

    Cc: stable@kernel.org # 2.6.29 and later
    Tested-by: Mike Marshall
    Reviewed-by: Andreas Dilger
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

23 May, 2018

7 commits

  • commit 02a3307aa9c20b4f6626255b028f07f6cfa16feb upstream.

    If a btree block, aka. extent buffer, is not available in the extent
    buffer cache, it'll be read out from the disk instead, i.e.

    btrfs_search_slot()
    read_block_for_search() # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block() # read child

    Unfortunately, the parent lock got released before reading child, so
    commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
    used 0 as parent transid to read the child block. This forces
    read_tree_block() not to check whether the parent transid differs from
    the generation id of the child that it reads out from disk.

    A simple PoC is included in btrfs/124,

    0. A two-disk raid1 btrfs,

    1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

    2. Mount this filesystem and put it in use, after a while, device tree's
    root got COW but block A hasn't been allocated/overwritten yet.

    3. Umount it and reload the btrfs module to remove both disks from the
    global @fs_devices list.

    4. mount -odegraded dev1 and write some data, so now block A is allocated
    to be a leaf in checksum tree. Note that only dev1 has the latest
    metadata of this filesystem.

    5. Umount it and mount it again normally (with both disks), since raid1
    can pick up one disk by the writer task's pid, if btrfs_search_slot()
    needs to read block A, dev2 which does NOT have the latest metadata
    might be read for block A, then we got a stale block A.

    6. As parent transid is not checked, block A is marked as uptodate and
    put into the extent buffer cache, so the future search won't bother
    to read disk again, which means it'll make changes on this stale
    one and make it dirty and flush it onto disk.

    To avoid the problem, parent transid needs to be passed to
    read_tree_block().

    In order to get a valid parent transid, we need to hold the parent's
    lock until finishing reading child.
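
    The effect of passing 0 versus a real parent transid can be modelled
    in a few lines. This is a sketch under the assumption that a transid
    of 0 means "skip the check"; block_model and
    verify_parent_transid_model() are hypothetical names, not the
    kernel's helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a tree block read from one mirror. */
struct block_model {
    unsigned long long generation;  /* generation stored in the block */
};

/*
 * Returns 0 if the block is accepted, -EIO on a generation mismatch.
 * A parent_transid of 0 disables the comparison, so any block passes.
 */
int verify_parent_transid_model(const struct block_model *b,
                                unsigned long long parent_transid)
{
    assert(b != NULL);
    if (parent_transid == 0)
        return 0;           /* nothing to compare against */
    if (b->generation == parent_transid)
        return 0;
    return -EIO;            /* stale or unexpected block */
}
```

    In this model a stale copy of block A with an old generation is
    accepted when 0 is passed and rejected when the generation recorded in
    the parent is passed instead.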

    This patch needs to be slightly adapted for stable kernels, the
    &first_key parameter added to read_tree_block() is from 4.16+
    (581c1760415c4). The fix is to replace 0 by 'gen'.

    Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Liu Bo
    Reviewed-by: Filipe Manana
    Reviewed-by: Qu Wenruo
    [ update changelog ]
    Signed-off-by: David Sterba
    Signed-off-by: Nikolay Borisov
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit fe816d0f1d4c31c4c31d42ca78a87660565fc800 upstream.

    When a transaction is aborted btrfs_cleanup_transaction is called to
    cleanup all the various in-flight bits and pieces which might be
    active. One of those is delalloc inodes - inodes which have dirty
    pages which haven't been persisted yet. Currently the process of
    freeing such delalloc inodes in exceptional circumstances such as
    transaction abort boiled down to calling btrfs_invalidate_inodes whose
    sole job is to invalidate the dentries for all inodes related to a
    root. This is in fact wrong and insufficient since such delalloc inodes
    will likely have pending pages or ordered-extents and will be linked to
    the sb->s_inode_list. This means that unmounting a btrfs instance with
    an aborted transaction could potentially leave inodes/their pages
    visible to the system long after their superblock has been freed. This
    in turn leads to a "use-after-free" situation once page shrink is
    triggered. This situation could be simulated by running generic/019
    which would cause such inodes to be left hanging, followed by
    generic/176 which causes memory pressure and page eviction which lead
    to touching the freed super block instance. This situation is
    additionally detected by the unmount code of VFS with the following
    message:

    "VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..."

    Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
    in free_fs_root for the same reason.

    This patch aims to rectify the situation by doing the following:

    1. Change btrfs_destroy_delalloc_inodes so that it calls
    invalidate_inode_pages2 for every inode on the delalloc list, this
    ensures that all the pages of the inode are released. This function
    boils down to calling btrfs_releasepage. During test I observed cases
    where inodes on the delalloc list were having an i_count of 0, so this
    necessitates using igrab to be sure we are working on a non-freed inode.

    2. Since calling btrfs_releasepage might queue delayed iputs move the
    call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
    calling run_delayed_iputs for the last time. This is necessary to ensure
    that delayed iputs are run.

    Note: this patch is tagged for 4.14 stable but the fix applies to older
    versions too but needs to be backported manually due to conflicts.

    CC: stable@vger.kernel.org # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions
    CC: stable@vger.kernel.org # 4.14.x
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    [ add comment to igrab ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • commit 2b8773313494ede83a26fb372466e634564002ed upstream.

    This is in preparation of fixing delalloc inodes leakage on transaction
    abort. Also export the new function.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Reviewed-by: Anand Jain
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • commit 02ee654d3a04563c67bfe658a05384548b9bb105 upstream.

    We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance()
    only, which isn't called during the remount. So when resuming from
    the paused balance we hit the bug:

    kernel: kernel BUG at fs/btrfs/volumes.c:3890!
    ::
    kernel: balance_kthread+0x51/0x60 [btrfs]
    kernel: kthread+0x111/0x130
    ::
    kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8

    Reproducer:
    On a mounted filesystem:

    btrfs balance start --full-balance /btrfs
    btrfs balance pause /btrfs
    mount -o remount,ro /dev/sdb /btrfs
    mount -o remount,rw /dev/sdb /btrfs

    To fix this set the BTRFS_BALANCE_RESUME flag in
    btrfs_resume_balance_async().

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     
  • commit 1a63c198ddb810c790101d693c7071cca703b3c7 upstream.

    Incompat flag of LZO/ZSTD compression should be set at:

    1. mount time (-o compress/compress-force)
    2. when defrag is done
    3. when property is set

    Currently 3. is missing and this commit adds this.

    This could lead to a filesystem that uses ZSTD but is not marked as
    such. If a kernel without ZSTD support encounters a ZSTD compressed
    extent, it will handle that but this could be confusing to the user.

    Typically the filesystem is mounted with the ZSTD option, but the
    discrepancy can arise when a filesystem is never mounted with ZSTD and
    then the property on some file is set (and some new extents are
    written). A simple mount with -o compress=zstd will fix that up on an
    unpatched kernel.

    Same goes for LZO, but this has been around for a very long time
    (2.6.37) so it's unlikely that a pre-LZO kernel would be used.

    Fixes: 5c1aab1dd544 ("btrfs: Add zstd support")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: Tomohiro Misono
    Reviewed-by: Anand Jain
    Reviewed-by: David Sterba
    [ add user visible impact ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Misono Tomohiro
     
  • commit 6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 upstream.

    [BUG]
    btrfs incremental send BUG happens when creating a snapshot of snapshot
    that is being used by send.

    [REASON]
    The problem can happen if while we are doing a send one of the snapshots
    used (parent or send) is snapshotted, because snapshotting implies COWing
    the root of the source subvolume/snapshot.

    1. When doing an incremental send, the send process will get the commit
    roots from the parent and send snapshots, and add references to them
    through extent_buffer_get().

    2. When a snapshot/subvolume is snapshotted, its root node is COWed
    (transaction.c:create_pending_snapshot()).

    3. COWing releases the space used by the node immediately, through:

    __btrfs_cow_block()
    --btrfs_free_tree_block()
    ----btrfs_add_free_space(bytenr of node)

    4. Because send doesn't hold a transaction open, it's possible that
    the transaction used to create the snapshot commits, switches the
    commit root and the old space used by the previous root node gets
    assigned to some other node allocation. Allocation of a new node will
    use the existing extent buffer found in memory, which we previously
    got a reference through extent_buffer_get(), and allow the extent
    buffer's content (pages) to be modified:

    btrfs_alloc_tree_block
    --btrfs_reserve_extent
    ----find_free_extent (get bytenr of old node)
    --btrfs_init_new_buffer (use bytenr of old node)
    ----btrfs_find_create_tree_block
    ------alloc_extent_buffer
    --------find_extent_buffer (get old node)

    5. So send can access invalid memory content and have unpredictable
    behaviour.

    [FIX]
    So we fix the problem by copying the commit roots of the send and
    parent snapshots and use those copies.

    CallTrace looks like this:
    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/ctree.c:1861!
    invalid opcode: 0000 [#1] SMP
    CPU: 6 PID: 24235 Comm: btrfs Tainted: P O 3.10.105 #23721
    ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
    RIP: 0010:[] read_node_slot+0x108/0x110 [btrfs]
    RSP: 0018:ffff88041b723b68 EFLAGS: 00010246
    RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
    RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
    RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
    R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
    R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
    FS: 00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Stack:
    ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
    ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
    ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
    Call Trace:
    [] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
    [] ? tree_advance+0xb5/0xc0 [btrfs]
    [] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
    [] ? finish_inode_if_needed+0x870/0x870 [btrfs]
    [] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
    [] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
    [] ? handle_pte_fault+0x373/0x990
    [] ? atomic_notifier_call_chain+0x16/0x20
    [] ? set_task_cpu+0xb6/0x1d0
    [] ? handle_mm_fault+0x143/0x2a0
    [] ? __do_page_fault+0x1d0/0x500
    [] ? check_preempt_curr+0x57/0x90
    [] ? do_vfs_ioctl+0x4aa/0x990
    [] ? do_fork+0x113/0x3b0
    [] ? trace_hardirqs_off_thunk+0x3a/0x6c
    [] ? SyS_ioctl+0x88/0xa0
    [] ? system_call_fastpath+0x16/0x1b
    ---[ end trace 29576629ee80b2e1 ]---

    Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function")
    CC: stable@vger.kernel.org # 3.6+
    Signed-off-by: Robbie Ko
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Robbie Ko
     
  • commit 9a8fca62aacc1599fea8e813d01e1955513e4fad upstream.

    If a file has xattrs, we fsync it, to ensure we clear the flags
    BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
    inode, the current transaction commits and then we fsync it (without
    either of those bits being set in its inode), we end up not logging
    all its xattrs. This results in deleting all xattrs when replaying the
    log after a power failure.

    Trivial reproducer

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ touch /mnt/foobar
    $ setfattr -n user.xa -v qwerty /mnt/foobar
    $ xfs_io -c "fsync" /mnt/foobar

    $ sync

    $ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
    $ xfs_io -c "fsync" /mnt/foobar
    <power failure>

    $ mount /dev/sdb /mnt
    $ getfattr --absolute-names --dump /mnt/foobar
    <empty output>
    $

    So fix this by making sure all xattrs are logged if we log a file's inode
    item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
    BTRFS_INODE_COPY_EVERYTHING were set in the inode.

    Fixes: 36283bf777d9 ("Btrfs: fix fsync xattr loss in the fast fsync path")
    Cc: # 4.2+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

19 May, 2018

1 commit

  • commit 998ac6d21cfd6efd58f5edf420bae8839dda9f2a upstream.

    In previous patch:
    Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
    We avoid starting btrfs transaction and get this information from
    fs_info->running_transaction directly.

    When accessing running_transaction in check_delayed_ref, there's a
    chance that current transaction will be freed by commit transaction
    after the NULL pointer check of running_transaction is passed.

    After looking at all the other places using fs_info->running_transaction,
    they are either protected by trans_lock or holding the transactions.

    Fix this by using trans_lock and increasing the use_count.

    Fixes: e4c3b2dcd144 ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: ethanwu
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    ethanwu
     

26 Apr, 2018

7 commits

  • [ Upstream commit 7583d8d088ff2c323b1d4f15b191ca2c23d32558 ]

    Before rbio_orig_end_io() goes to free rbio, rbio may get merged with
    more bios from other rbios and rbio->bio_list becomes non-empty,
    in that case, these newly merged bios don't end properly.

    Once unlock_stripe() is done, rbio->bio_list will not be updated any
    more and we can call bio_endio() on all queued bios.

    It should only happen in error-out cases, the normal path of recover
    and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable
    merge before doing IO, so rbio_orig_end_io() called by them doesn't
    have the above issue.

    Reported-by: Jérôme Carretero
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 18e83ac75bfe67009c4ddcdd581bba8eb16f4030 ]

    This fixes a corner case that is caused by a race of dio write vs dio
    read/write.

    Here is how the race could happen.

    Suppose that no extent map has been loaded into memory yet.
    There is a file extent [0, 32K), two jobs are running concurrently
    against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
    read from [0, 4K) or [4K, 8K).

    t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K, 32K).

    ------------------------------------------------------
         t1                                t2
          btrfs_get_blocks_direct()         btrfs_get_blocks_direct()
           -> btrfs_get_extent()              -> btrfs_get_extent()
               -> lookup_extent_mapping()
               -> add_extent_mapping()            -> lookup_extent_mapping()
                  # load [0, 32K)
           -> btrfs_new_extent_direct()
               -> btrfs_drop_extent_cache()
                  # split [0, 32K) and
                  # drop [8K, 32K)
               -> add_extent_mapping()
                  # add [8K, 32K)
                                                  -> add_extent_mapping()
                                                     # handle -EEXIST when adding
                                                     # [0, 32K)
    ------------------------------------------------------
    About how t2(dio read/write) runs into -EEXIST:

    a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),

    b) search_extent_mapping() then returns [0, 8k) as the existing em,
    even though start == existing->start, em is [0, 32k) so that
    extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,

    c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
    (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,

    d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
    which is confusing applications.

    Here I summarize all the possible situations:
    1) start < existing->start

                +-----------+em+-----------+
    +--prev---+ |     +-------------+      |
    |         | |     |             |      |
    +---------+ +     +---+existing++      ++
                   +
                   |
                   +
                 start

    2) start == existing->start

          +------------em------------+
          |     +-------------+      |
          |     |             |      |
          +     +----existing-+      +
                |
                |
                +
              start

    3) start > existing->start && start < (existing->start + existing->len)

          +------------em------------+
          |     +-------------+      |
          |     |             |      |
          +     +----existing-+      +
                    |
                    |
                    +
                  start

    4) start >= (existing->start + existing->len)

          +-----------+em+-----------+
          |     +-------------+      | +--next---+
          |     |             |      | |         |
          +     +---+existing++      + +---------+
                                 +
                                 |
                                 +
                               start

    As we can see, it turns out that if start is within existing em (front
    inclusive), then the existing em should be returned as is, otherwise,
    we try our best to merge candidate em with sibling ems to form a
    larger em (in order to reduce the total number of em).
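
    The rule above ("front inclusive") can be sketched as a small
    user-space predicate. This is only an illustration under assumed
    names: struct em_sketch and start_in_existing() are hypothetical and
    are not the kernel's extent_map API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent map: a byte range [start, start + len). */
struct em_sketch {
    uint64_t start;
    uint64_t len;
};

/*
 * True when 'start' lies inside the existing range, front inclusive and
 * end exclusive (cases 2 and 3 above). In that situation the existing em
 * would be returned as is; otherwise (cases 1 and 4) a merge is attempted.
 */
static bool start_in_existing(const struct em_sketch *existing, uint64_t start)
{
    return start >= existing->start &&
           start < existing->start + existing->len;
}
```

    With existing = [8K, 16K), a start of 8K or 12K selects the existing
    em, while 4K or 16K falls through to the merge path.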

    Reported-by: David Vallender
    Signed-off-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 6f794e3c5c8f8fdd3b5bb20d9ded894e685b5bbe ]

    It appears from the original commit [1] that there isn't any design
    specific reason not to fail the mount instead of just warning. This
    patch will change it to fail.

    [1]
    commit 319e4d0661e5323c9f9945f0f8fb5905e5fe74c3
    btrfs: Enhance super validation check

    Fixes: 319e4d0661e5323 ("btrfs: Enhance super validation check")
    Signed-off-by: Anand Jain
    Reviewed-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     
  • [ Upstream commit 762221f095e3932669093466aaf4b85ed9ad2ac1 ]

    The raid6 corruption is that,
    suppose that all disks can be read without problems and if the content
    that was read out doesn't match its checksum, currently for raid6
    btrfs at most retries twice,

    - the 1st retry is to rebuild with all other stripes, it'll eventually
    be a raid5 xor rebuild,
    - if the 1st fails, the 2nd retry will deliberately fail parity p so
    that it will do raid6 style rebuild,

    however, the chances are that another non-parity stripe content also
    has something corrupted, so that the above retries are not able to
    return correct content.

    We've fixed normal reads to rebuild raid6 correctly with more retries
    in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix
    scrub to do the exactly same rebuild process.

    [1]: https://patchwork.kernel.org/patch/10091755/

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 9ea2c7c9da13c9073e371c046cbbc45481ecb459 ]

    When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
    the level variable is going to be 7 (this is the max height of the
    tree). On the other hand btrfs_cow_block is always called with
    "level + 1" as an index into the nodes and slots arrays. This leads to
    an out of bounds access. Admittedly this will be benign since an OOB
    access of the nodes array will likely read the 0th element from the
    slots array, which in this case is going to be 0 (since we start CoW at
    the top of the tree). The OOB access into the slots array in turn will
    read the 0th and 1st values of the locks array, which would both be 0
    at the time. However, this benign behavior relies on the fact that the
    path being passed hasn't been initialised, if it has already been used to
    query a btree then it could potentially have populated the nodes/slots arrays.

    Fix it by explicitly checking if we are at level 7 (the maximum allowed
    index in nodes/slots arrays) and explicitly call the CoW routine with
    NULL for parent's node/slot.
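
    A minimal sketch of the bounds check, assuming arrays of 8 entries
    (levels 0..7) as described above. The names path_sketch,
    MAX_LEVEL_SKETCH and parent_for_cow() are hypothetical; the real
    change lives in the btrfs search code and is not reproduced here.

```c
#include <stddef.h>

#define MAX_LEVEL_SKETCH 8 /* assumed array size: valid indexes are 0..7 */

/* Hypothetical, simplified path: one node pointer and one slot per level. */
struct path_sketch {
    void *nodes[MAX_LEVEL_SKETCH];
    int slots[MAX_LEVEL_SKETCH];
};

/*
 * Look up the parent node/slot for a block at 'level'. At the top level
 * there is no parent, so report NULL/0 instead of indexing level + 1,
 * which would read past the end of both arrays.
 */
static void parent_for_cow(const struct path_sketch *p, int level,
                           void **parent, int *parent_slot)
{
    if (level + 1 < MAX_LEVEL_SKETCH) {
        *parent = p->nodes[level + 1];
        *parent_slot = p->slots[level + 1];
    } else {
        *parent = NULL;
        *parent_slot = 0;
    }
}
```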

    Signed-off-by: Nikolay Borisov
    Fixes-coverity-id: 711515
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • [ Upstream commit 343e4fc1c60971b0734de26dbbd475d433950982 ]

    Setting plug can merge adjacent IOs before dispatching IOs to the disk
    driver.

    Without plug, it'd not be a problem for single disk usecases, but for
    multiple disks using raid profile, a large IO can be split to several
    IOs of stripe length, and plug can be helpful to bring them together
    for each disk so that we can save several disk accesses.

    Moreover, fsync issues synchronous writes, so plug can really take
    effect.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit 92d32170847bfff2dd08af2c016085779f2fd2a1 upstream.

    The last update to readdir introduced a temporary buffer to store the
    emitted readdir data, but as there are file names of variable length,
    there's a lot of unaligned access.

    This was observed on a sparc64 machine:

    Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs]

    Fixes: 23b5ec74943 ("btrfs: fix readdir deadlock with pagefault")
    CC: stable@vger.kernel.org # 4.14+
    Reported-and-tested-by: René Rebe
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     

08 Apr, 2018

1 commit

  • commit 5811375325420052fcadd944792a416a43072b7f upstream.

    Fstests generic/475 provides a way to fail metadata reads while
    checking if checksum exists for the inode inside run_delalloc_nocow(),
    and csum_exist_in_range() interprets error (-EIO) as inode having
    checksum and makes its caller enter the cow path.

    In case of free space inode, this ends up with a warning in
    cow_file_range().

    The same problem applies to btrfs_cross_ref_exist() since it may also
    read metadata in between.

    With this, run_delalloc_nocow() bails out when errors occur at the two
    places.
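
    The distinction between "lookup failed" and "checksum exists" can be
    sketched as follows. The return convention (negative errno on error,
    0 for no checksum, positive for checksum found) and all names here
    are assumptions made for illustration, not the kernel's signatures.

```c
#include <errno.h>

/* Hypothetical outcomes for the nocow decision. */
enum nocow_action {
    NOCOW_PROCEED,      /* no checksum found, nocow write may continue */
    NOCOW_FALLBACK_COW, /* checksum exists, fall back to the cow path */
    NOCOW_BAIL_OUT      /* metadata read failed, propagate the error */
};

/*
 * Decide what to do from the (assumed) lookup result. The point is that
 * an error such as -EIO must not be folded into "checksum exists".
 */
static enum nocow_action nocow_decide(int csum_lookup_ret)
{
    if (csum_lookup_ret < 0)
        return NOCOW_BAIL_OUT;
    if (csum_lookup_ret > 0)
        return NOCOW_FALLBACK_COW;
    return NOCOW_PROCEED;
}
```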

    cc: v2.6.28+
    Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code")
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     

21 Mar, 2018

6 commits

  • commit 9deae9689231964972a94bb56a79b669f9d47ac1 upstream.

    Commit addc3fa74e5b ("Btrfs: Fix the problem that the dirty flag of dev
    stats is cleared") reworked the way device stats changes are tracked. A
    new atomic dev_stats_ccnt counter was introduced which is incremented
    every time any of the device stats counters are changed. This serves as
    a flag whether there are any pending stats changes. However, this patch
    only partially implemented the correct memory barriers necessary:

    - It only ordered the stores to the counters but not the reads e.g.
    btrfs_run_dev_stats
    - It completely omitted any comments documenting the intended design and
    how the memory barriers pair with each-other

    This patch provides the necessary comments as well as adds a missing
    smp_rmb in btrfs_run_dev_stats. Furthermore since dev_stats_ccnt is only
    a snapshot at best there was no point in reading the counter twice -
    once in btrfs_dev_stats_dirty and then again when assigning stats_cnt.
    Just collapse both reads into 1.
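
    As a rough user-space analogue of the pairing described above, the
    sketch below uses C11 acquire/release atomics in place of the
    kernel's smp_* barriers. It is not the kernel code: dev_sketch,
    stat_inc() and run_stats() are hypothetical names, and a single
    stats counter stands in for the whole set.

```c
#include <stdatomic.h>

/* Hypothetical device: one stats counter plus a change counter. */
struct dev_sketch {
    atomic_int stat;       /* the statistic itself */
    atomic_int stats_ccnt; /* incremented after every stats change */
};

/* Writer side: update the stat, then publish the change. */
static void stat_inc(struct dev_sketch *d)
{
    atomic_fetch_add_explicit(&d->stat, 1, memory_order_relaxed);
    /* Release: the stat update is ordered before the counter bump. */
    atomic_fetch_add_explicit(&d->stats_ccnt, 1, memory_order_release);
}

/*
 * Reader side: read the change counter exactly once. Returns 0 when
 * nothing is pending, otherwise stores a snapshot of the stat and
 * returns the number of changes that were observed.
 */
static int run_stats(struct dev_sketch *d, int *stat_out)
{
    /* Acquire pairs with the release in stat_inc(). */
    int ccnt = atomic_load_explicit(&d->stats_ccnt, memory_order_acquire);

    if (ccnt == 0)
        return 0;
    *stat_out = atomic_load_explicit(&d->stat, memory_order_relaxed);
    /* Subtract only what was observed; later changes stay pending. */
    atomic_fetch_sub_explicit(&d->stats_ccnt, ccnt, memory_order_relaxed);
    return ccnt;
}
```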

    Fixes: addc3fa74e5b ("Btrfs: Fix the problem that the dirty flag of dev stats is cleared")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Mathieu Desnoyers
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • commit c8195a7b1ad5648857ce20ba24f384faed8512bc upstream.

    Until v4.14, this warning was very infrequent:

    WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0
    Modules linked in: [...]
    CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1
    Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014
    Call Trace:
    dump_stack+0x85/0xc2
    __warn+0xd1/0xf0
    warn_slowpath_null+0x1d/0x20
    find_parent_nodes+0xc41/0x14e0
    __btrfs_find_all_roots+0xad/0x120
    ? extent_same_check_offsets+0x70/0x70
    iterate_extent_inodes+0x168/0x300
    iterate_inodes_from_logical+0x87/0xb0
    ? iterate_inodes_from_logical+0x87/0xb0
    ? extent_same_check_offsets+0x70/0x70
    btrfs_ioctl+0x8ac/0x2820
    ? lock_acquire+0xc2/0x200
    do_vfs_ioctl+0x91/0x700
    ? __fget+0x112/0x200
    SyS_ioctl+0x79/0x90
    entry_SYSCALL_64_fastpath+0x23/0xc6
    ? trace_hardirqs_off_caller+0x1f/0x140

    Starting with v4.14 (specifically 86d5f9944252 ("btrfs: convert prelimary
    reference tracking to use rbtrees")) the WARN_ON occurs three orders of
    magnitude more frequently--almost once per second while running workloads
    like bees.

    Replace the WARN_ON() with a comment rationale for its removal.
    The rationale is paraphrased from an explanation by Edmund Nadolski
    on the linux-btrfs mailing list.

    Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()")
    Signed-off-by: Zygo Blaxell
    Reviewed-by: Lu Fengqi
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Zygo Blaxell
     
  • commit fd649f10c3d21ee9d7542c609f29978bdf73ab94 upstream.

    Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced
    btrfs_free_stale_device which iterates the device lists for all
    registered btrfs filesystems and deletes those devices which aren't
    mounted. If a btrfs_fs_devices structure has only 1 device attached to it
    and it is unused then btrfs_free_stale_devices will proceed to also free
    the btrfs_fs_devices struct itself. Currently this leads to a use after
    free since list_for_each_entry will try to perform a check on the
    already freed memory to see if it has to terminate the loop.

    The fix is to use 'break' when we know we are freeing the current
    fs_devs.
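
    The pattern can be sketched in user space with a plain singly linked
    list. This is an illustration only: fs_devs_sketch, push_front() and
    free_first_stale() are hypothetical, and the sketch assumes at most
    one entry needs freeing per call.

```c
#include <stdlib.h>

/* Hypothetical list entry standing in for a per-filesystem structure. */
struct fs_devs_sketch {
    int id;
    int stale;
    struct fs_devs_sketch *next;
};

/* Prepend a new entry; returns the new head (or the old one on OOM). */
static struct fs_devs_sketch *push_front(struct fs_devs_sketch *head,
                                         int id, int stale)
{
    struct fs_devs_sketch *n = malloc(sizeof(*n));

    if (!n)
        return head;
    n->id = id;
    n->stale = stale;
    n->next = head;
    return n;
}

/* Unlink and free the first stale entry; returns 1 if one was freed. */
static int free_first_stale(struct fs_devs_sketch **head)
{
    struct fs_devs_sketch **link = head;
    struct fs_devs_sketch *cur;

    for (cur = *head; cur; link = &cur->next, cur = cur->next) {
        if (!cur->stale)
            continue;
        *link = cur->next;
        free(cur);
        /*
         * Leave the loop now: the loop step would otherwise read
         * cur->next from memory that has just been freed.
         */
        return 1;
    }
    return 0;
}
```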

    Fixes: 4fde46f0cc71 ("Btrfs: free the stale device")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Anand Jain
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • commit 92e222df7b8f05c565009c7383321b593eca488b upstream.

    In case of using DUP, we search for enough unallocated disk space on a
    device to hold two stripes.

    The devices_info[ndevs-1].max_avail that holds the amount of unallocated
    space found is directly assigned to stripe_size, while it's actually
    twice the stripe size.

    Later on in the code, an unconditional division of stripe_size by
    dev_stripes corrects the value, but in the meantime there's a check to
    see if the stripe_size does not exceed max_chunk_size. Since during this
    check stripe_size is twice the amount as intended, the check will reduce
    the stripe_size to max_chunk_size if the actual correct to be used
    stripe_size is more than half the amount of max_chunk_size.

    The unconditional division later tries to correct stripe_size, but will
    actually make sure we can't allocate more than half the max_chunk_size.

    Fix this by moving the division by dev_stripes before the max chunk size
    check, so it always contains the right value, instead of putting a duct
    tape division in further on to get it fixed again.

    Since in all other cases than DUP, dev_stripes is 1, this change only
    affects DUP.
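
    The effect of the ordering can be checked with a small arithmetic
    sketch. The two functions below are hypothetical simplifications
    (sizes in MiB, a single clamp against max_chunk_size) and only model
    the order of the division and the check, not the full allocator.

```c
#include <stdint.h>

/* Old order: clamp while stripe_size still covers dev_stripes copies. */
static uint64_t clamp_then_divide(uint64_t max_avail, uint64_t max_chunk_size,
                                  unsigned dev_stripes, unsigned data_stripes)
{
    uint64_t stripe_size = max_avail;

    if (stripe_size * data_stripes > max_chunk_size)
        stripe_size = max_chunk_size / data_stripes;
    return stripe_size / dev_stripes;
}

/* New order: divide first, so the clamp sees the real stripe size. */
static uint64_t divide_then_clamp(uint64_t max_avail, uint64_t max_chunk_size,
                                  unsigned dev_stripes, unsigned data_stripes)
{
    uint64_t stripe_size = max_avail / dev_stripes;

    if (stripe_size * data_stripes > max_chunk_size)
        stripe_size = max_chunk_size / data_stripes;
    return stripe_size;
}
```

    For example, with 4096 MiB available, a 1024 MiB max chunk size,
    dev_stripes = 2 and one data stripe, the old order yields 512 MiB
    and the new order yields 1024 MiB. With dev_stripes = 1 both orders
    give the same result.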

    Other attempts in the past were made to fix this:
    * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
    to fix the same problem, but still resulted in part of the code acting
    on a wrongly doubled stripe_size value.
    * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
    broke this fix again.

    The real problem was already introduced with the rest of the code in
    73c5de0051.

    The user visible result however will be that the max chunk size for DUP
    will suddenly double, while it's actually acting according to the limits
    in the code again like it was 5 years ago.

    Reported-by: Naohiro Aota
    Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
    Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
    Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
    Signed-off-by: Hans van Kranenburg
    Reviewed-by: David Sterba
    [ update comment ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Hans van Kranenburg
     
  • commit 18bf591ba9753e3e5ba91f38f756a800693408f4 upstream.

    This patch addresses an issue that causes fiemap to falsely
    report a shared extent. The test case is as follows:

    xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
    sync
    xfs_io -c "fiemap -v" /media/scratch/file5

    which gives the resulting output:

    wrote 65536/65536 bytes at offset 0
    64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
    /media/scratch/file5:
    EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
    0: [0..127]: 24576..24703 128 0x2001
    /media/scratch/file5:
    EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
    0: [0..127]: 24576..24703 128 0x1

    This is because btrfs_check_shared calls find_parent_nodes
    repeatedly in a loop, passing a share_check struct to report
    the count of shared extent. But btrfs_check_shared does not
    re-initialize the count value to zero for subsequent calls
    from the loop, resulting in a false share count value. This
    is a regressive behavior from 4.13.
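
    A simplified model of the stale counter, assuming each walk adds the
    references it finds to a shared counter and that a count above 1
    means "shared". share_check_sketch, walk_once() and is_shared() are
    hypothetical names; walk_once() merely stands in for one
    find_parent_nodes call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the share_check state. */
struct share_check_sketch {
    int share_count;
};

/* Stand-in for one backref walk: accumulates the refs it found. */
static void walk_once(struct share_check_sketch *sc, int refs_found)
{
    sc->share_count += refs_found;
}

/*
 * Run one walk per entry of refs_per_walk. With reset_each_walk false
 * the counter carries over between walks, which is the bug modeled here.
 */
static bool is_shared(const int *refs_per_walk, size_t n, bool reset_each_walk)
{
    struct share_check_sketch sc = { .share_count = 0 };

    for (size_t i = 0; i < n; i++) {
        if (reset_each_walk)
            sc.share_count = 0;
        walk_once(&sc, refs_per_walk[i]);
        if (sc.share_count > 1)
            return true;
    }
    return false;
}
```

    Two walks that each find a single reference are reported as shared
    without the reset and as not shared with it.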

    With proper re-initialization the test result is as follows:

    wrote 65536/65536 bytes at offset 0
    64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
    /media/scratch/file5:
    EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
    0: [0..127]: 24576..24703 128 0x1
    /media/scratch/file5:
    EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
    0: [0..127]: 24576..24703 128 0x1

    which corrects the regression.

    Fixes: 3ec4d3238ab ("btrfs: allow backref search checks for shared extents")
    Signed-off-by: Edmund Nadolski
    [ add text from cover letter to changelog ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Edmund Nadolski
     
  • commit 047fdea6341966a0898e3b16c51f54d4f5ba030a upstream.

    On detaching of a disk which is a part of a RAID6 filesystem, the
    following kernel OOPS may happen:

    [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
    [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
    [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
    [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
    [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
    [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo
    [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
    [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs]
    [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0
    [63122.971202] Oops: 0000 [#1] SMP
    [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8
    [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs]
    [63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000
    [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs]
    [63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287
    [63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690
    [63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000
    [63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600
    [63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500
    [63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004
    [63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000
    [63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0
    [63123.009969] Call Trace:
    [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs]
    [63123.010251] bio_endio+0xa1/0x120
    [63123.010378] generic_make_request+0x218/0x270
    [63123.010921] submit_bio+0x66/0x130
    [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs]
    [63123.011245] full_stripe_write+0x96/0xc0 [btrfs]
    [63123.011428] raid56_parity_write+0x117/0x170 [btrfs]
    [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs]
    [63123.011759] ? ___cache_free+0x1c5/0x300
    [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs]
    [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs]
    [63123.012257] normal_work_helper+0x19e/0x300 [btrfs]
    [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs]
    [63123.012656] process_one_work+0x14d/0x350
    [63123.012888] worker_thread+0x4d/0x3a0
    [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20
    [63123.013192] kthread+0x109/0x140
    [63123.013315] ? process_scheduled_works+0x40/0x40
    [63123.013472] ? kthread_stop+0x110/0x110
    [63123.013610] ret_from_fork+0x25/0x30
    [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8
    [63123.014678] CR2: 0000000000000080
    [63123.016590] ---[ end trace a295ea7259c17880 ]---

    This is reproducible in a cycle, where a series of writes is followed by
    a SCSI device delete command. The test may take up to a few minutes.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    [ no signed-off-by provided ]
    Author: Dmitriy Gorokh
    Reviewed-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Dmitriy Gorokh
     

19 Mar, 2018

1 commit


09 Mar, 2018

1 commit

  • commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream.

    The fs_info::super_copy is a byte copy of the on-disk structure and all
    members must use the accessor macros/functions to obtain the right
    value. This was missing in update_super_roots and in sysfs readers.

    Moving a filesystem between hosts of opposite endianness will report
    bogus numbers in sysfs, and mount may fail as the root will not be
    restored correctly. If the filesystem is always used on hosts of the
    same endianness, this will not be a problem.

    Fix this by using the btrfs_set_super...() functions to set
    fs_info::super_copy values, and for the sysfs, use the cached
    fs_info::nodesize/sectorsize values.
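
    The accessor idea can be sketched as follows. This is a simplified
    model, not the btrfs code: struct disk_super and both helper names are
    hypothetical, and the field is assumed to be stored little-endian on
    disk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a byte copy of an on-disk structure.
 * The field is kept as raw little-endian bytes, whatever the host is. */
struct disk_super {
    uint8_t nodesize_le[4];
};

/* Decode byte by byte, so the result is identical on any host. */
static uint32_t disk_super_nodesize(const struct disk_super *s)
{
    return (uint32_t)s->nodesize_le[0] |
           ((uint32_t)s->nodesize_le[1] << 8) |
           ((uint32_t)s->nodesize_le[2] << 16) |
           ((uint32_t)s->nodesize_le[3] << 24);
}

/* Encode byte by byte, so the stored layout never depends on the host. */
static void disk_super_set_nodesize(struct disk_super *s, uint32_t v)
{
    s->nodesize_le[0] = (uint8_t)(v & 0xff);
    s->nodesize_le[1] = (uint8_t)((v >> 8) & 0xff);
    s->nodesize_le[2] = (uint8_t)((v >> 16) & 0xff);
    s->nodesize_le[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Usage: a value written through the setter reads back unchanged. */
void disk_super_selfcheck(void)
{
    struct disk_super s;

    disk_super_set_nodesize(&s, 16384);
    assert(disk_super_nodesize(&s) == 16384);
}
```

    Reading or assigning such a field directly as a native integer only
    happens to work when the host byte order matches the on-disk one.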

    CC: stable@vger.kernel.org
    Fixes: df93589a17378 ("btrfs: export more from FS_INFO to sysfs")
    Signed-off-by: Anand Jain
    Reviewed-by: Liu Bo
    Reviewed-by: David Sterba
    [ update changelog ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     

03 Mar, 2018

1 commit

  • [ Upstream commit beed9263f4000c48a5c48912f26576f6fa091181 ]

    Commit e0ae99941423 ("btrfs: preallocate device flush bio") reworked
    the way the flush bio is allocated and used. Concretely it allocates
    the bio in __alloc_device and then re-uses it multiple times with a
    very simple endio routine that just calls complete() without consuming
    a reference. Allocated bios by default come with a ref count of 1,
    which is then consumed by the endio routine (or not, in which case they
    should be bio_put by the caller). The way the implementation works now
    is that the flush bio has a refcount of 2 and we only ever bio_put it
    once, leaving it to hang indefinitely. Fix this by removing the extra
    bio_get in __alloc_device.
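
    The reference counting described above can be illustrated with a toy
    model. It is a sketch only: struct toy_bio and its helpers are
    hypothetical and do not use the kernel's bio API.

```c
#include <assert.h>

/* Toy reference-counted object; not the kernel's struct bio. */
struct toy_bio {
    int refs;   /* outstanding references */
    int freed;  /* set to 1 once the last reference is dropped */
};

/* A freshly allocated object starts with one reference. */
static void toy_bio_init(struct toy_bio *b)
{
    b->refs = 1;
    b->freed = 0;
}

static void toy_bio_get(struct toy_bio *b)
{
    b->refs++;
}

static void toy_bio_put(struct toy_bio *b)
{
    if (--b->refs == 0)
        b->freed = 1;
}

/* Allocate, optionally take an extra reference, then put exactly once.
 * Returns 1 if the object was freed, 0 if a reference is still held. */
int freed_after_one_put(int extra_get)
{
    struct toy_bio b;

    toy_bio_init(&b);
    if (extra_get)
        toy_bio_get(&b);
    toy_bio_put(&b);
    return b.freed;
}

/* Usage: the extra get is what keeps the object alive forever. */
void toy_bio_selfcheck(void)
{
    assert(freed_after_one_put(0) == 1);
    assert(freed_after_one_put(1) == 0);
}
```

    With the extra get, the count goes 1, 2, 1 and never reaches zero;
    without it, the single put releases the object.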

    Fixes: e0ae99941423 ("btrfs: preallocate device flush bio")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     

25 Feb, 2018

2 commits

  • [ Upstream commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 ]

    The name char array passed to btrfs_search_path_in_tree is of size
    BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
    are in the range of [0, 4079]. Currently the code uses the define but this
    represents an off-by-one.

    Implications:

    Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
    written to extra space, not some padding that could be provided by the
    allocator.

    btrfs-progs stores the arguments on the stack, but the kernel makes its
    own copy of the ioctl buffer, so the off-by-one overwrite does not
    affect userspace, although the terminating 0 might be lost.

    Kernel ioctl buffer is allocated dynamically so we're overwriting
    somebody else's memory, and the ioctl is privileged if args.objectid is
    not 256. Which is in most cases, but resolving a subvolume stored in
    another directory will trigger that path.

    Before this patch the buffer was one byte larger, but then the -1 was
    not added.
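
    A minimal sketch of the indexing rule, assuming a buffer of 4080 bytes
    as in the text. NAME_BUF_SIZE and terminate_at_end are hypothetical
    names used only for illustration, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the 4080-byte size quoted in the text. */
#define NAME_BUF_SIZE 4080

/* Write the terminating NUL at the last valid index, size - 1, and
 * return a pointer to it. Indexing with size itself would write one
 * byte past the end of the buffer. */
char *terminate_at_end(char *name, size_t size)
{
    char *ptr = &name[size - 1];

    *ptr = '\0';
    return ptr;
}

/* Usage: one guard byte after the buffer must stay untouched. */
void terminate_selfcheck(void)
{
    char buf[NAME_BUF_SIZE + 1];

    memset(buf, 'x', sizeof(buf));
    terminate_at_end(buf, NAME_BUF_SIZE);
    assert(buf[NAME_BUF_SIZE - 1] == '\0');
    assert(buf[NAME_BUF_SIZE] == 'x');
}
```

    The guard byte in the self-check stands in for whatever memory follows
    the real buffer.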

    Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    [ added implications ]
    Signed-off-by: David Sterba

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • [ Upstream commit 1b9e619c5bc8235cfba3dc4ced2fb0e3554a05d4 ]

    I was seeing disk flushes still happening when I mounted a Btrfs
    filesystem with nobarrier for testing. This is because we use FUA to
    write out the first super block, and on devices without FUA support, the
    block layer translates FUA to a flush. Even on devices supporting true
    FUA, using FUA when we asked for no barriers is surprising.
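
    One way to honor nobarrier is to pick the write flags per super block
    copy, as in the sketch below. The flag names and values are
    hypothetical placeholders, not the block layer's definitions, and the
    function is an illustration rather than the actual patch.

```c
#include <assert.h>

/* Hypothetical flag bits, for illustration only. */
#define SKETCH_REQ_SYNC 0x1u
#define SKETCH_REQ_FUA  0x2u

/* Choose flags for writing super block copy number copy_index.
 * Only the first copy is a FUA candidate, and only when barriers
 * have not been disabled by the nobarrier mount option. */
unsigned int super_write_flags(int copy_index, int nobarrier)
{
    unsigned int flags = SKETCH_REQ_SYNC;

    if (copy_index == 0 && !nobarrier)
        flags |= SKETCH_REQ_FUA;
    return flags;
}

/* Usage: with nobarrier set, no copy is ever written with FUA. */
void super_write_flags_selfcheck(void)
{
    assert(super_write_flags(0, 0) & SKETCH_REQ_FUA);
    assert(!(super_write_flags(0, 1) & SKETCH_REQ_FUA));
    assert(!(super_write_flags(1, 0) & SKETCH_REQ_FUA));
}
```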

    Fixes: 387125fc722a8ed ("Btrfs: fix barrier flushes")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval