13 Dec, 2014

1 commit

  • Pull btrfs update from Chris Mason:
    "From a feature point of view, most of the code here comes from Miao
    Xie and others at Fujitsu to implement scrubbing and replacing devices
    on raid56. This has been in development for a while, and it's a big
    improvement.

    Filipe and Josef have a great assortment of fixes, many of which solve
    corruption problems either after a crash or in error conditions. I
    still have a round two from Filipe for next week that solves
    corruptions with discard and block group removal"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
    Btrfs: make get_caching_control unconditionally return the ctl
    Btrfs: fix unprotected deletion from pending_chunks list
    Btrfs: fix fs mapping extent map leak
    Btrfs: fix memory leak after block remove + trimming
    Btrfs: make btrfs_abort_transaction consider existence of new block groups
    Btrfs: fix race between writing free space cache and trimming
    Btrfs: fix race between fs trimming and block group remove/allocation
    Btrfs, replace: enable dev-replace for raid56
    Btrfs: fix freeing used extents after removing empty block group
    Btrfs: fix crash caused by block group removal
    Btrfs: fix invalid block group rbtree access after bg is removed
    Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
    Btrfs, replace: write raid56 parity into the replace target device
    Btrfs, replace: write dirty pages into the replace target device
    Btrfs, raid56: support parity scrub on raid56
    Btrfs, raid56: use a variant to record the operation type
    Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
    Btrfs, raid56: don't change bbio and raid_map
    Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
    Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
    ...

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • Pull VFS changes from Al Viro:
    "First pile out of several (there _definitely_ will be more). Stuff in
    this one:

    - unification of d_splice_alias()/d_materialize_unique()

    - iov_iter rewrite

    - killing a bunch of ->f_path.dentry users (and f_dentry macro).

    Getting that completed will make life much simpler for
    unionmount/overlayfs, since then we'll be able to limit the places
    sensitive to file _dentry_ to reasonably few. Which allows to have
    file_inode(file) pointing to inode in a covered layer, with dentry
    pointing to (negative) dentry in union one.

    Still not complete, but much closer now.

    - crapectomy in lustre (dead code removal, mostly)

    - "let's make seq_printf return nothing" preparations

    - assorted cleanups and fixes

    There _definitely_ will be more piles"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    copy_from_iter_nocache()
    new helper: iov_iter_kvec()
    csum_and_copy_..._iter()
    iov_iter.c: handle ITER_KVEC directly
    iov_iter.c: convert copy_to_iter() to iterate_and_advance
    iov_iter.c: convert copy_from_iter() to iterate_and_advance
    iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
    iov_iter.c: convert iov_iter_zero() to iterate_and_advance
    iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
    iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
    iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
    iov_iter.c: iterate_and_advance
    iov_iter.c: macros for iterating over iov_iter
    kill f_dentry macro
    dcache: fix kmemcheck warning in switch_names
    new helper: audit_file()
    nfsd_vfs_write(): use file_inode()
    ncpfs: use file_inode()
    kill f_dentry uses
    lockd: get rid of ->f_path.dentry->d_sb
    ...

    Linus Torvalds
     

09 Dec, 2014

1 commit


03 Dec, 2014

21 commits

  • Chris Mason
     
  • This was written when we didn't do a caching control for the fast free space
    cache loading. However we started doing that a long time ago, and there is
    still a small window of time that we could be caching the block group the fast
    way, so if there is a caching_ctl at all on the block group just return it, the
    callers all wait properly for what they want. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • On block group remove if the corresponding extent map was on the
    transaction->pending_chunks list, we were deleting the extent map
    from that list, through remove_extent_mapping(), without any
    synchronization with chunk allocation (which iterates that list
    and adds new elements to it). Fix this by ensuring that this is done
    while the chunk mutex is held, since that's the mutex that protects
    the list in the chunk allocation code path.

    This applies on top of (depends on) my previous patch titled:
    "Btrfs: fix race between fs trimming and block group remove/allocation"

    But the issue in fact was already present before that change, it only
    became easier to hit after Josef's 3.18 patch that added automatic
    removal of empty block groups.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • On chunk allocation error (label "error_del_extent"), after adding the
    extent map to the tree and to the pending chunks list, we would leave
    after decrementing the extent map's refcount by 2 instead of 3 (our
    allocation + tree reference + list reference).

    Also, on chunk/block group removal, if the block group was on the list
    pending_chunks we weren't decrementing the respective list reference.
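
The reference counting described above can be illustrated with a minimal userspace sketch. It is not the btrfs code: `struct em_sketch` and all helper names are hypothetical stand-ins, and the only point is that an object holding three references (allocation, tree, list) still has one outstanding reference if the error path drops just two.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent map; only the refcount matters. */
struct em_sketch {
    int refs;
};

static struct em_sketch *em_alloc(void)
{
    struct em_sketch *em = malloc(sizeof(*em));

    assert(em != NULL);
    em->refs = 1; /* reference held by the allocating caller */
    return em;
}

static void em_get(struct em_sketch *em)
{
    em->refs++;
}

/* Drop one reference; returns 1 when the object was freed. */
static int em_put(struct em_sketch *em)
{
    assert(em->refs > 0);
    if (--em->refs == 0) {
        free(em);
        return 1;
    }
    return 0;
}

/*
 * Simulate the error path: take the tree and list references, then drop
 * 'puts' references. Returns how many references are still outstanding.
 */
static int leaked_refs_after_error(int puts)
{
    struct em_sketch *em = em_alloc();
    int freed = 0;

    em_get(em); /* tree reference */
    em_get(em); /* pending chunks list reference */
    for (int i = 0; i < puts && !freed; i++)
        freed = em_put(em);
    if (freed)
        return 0;

    int left = em->refs;
    free(em); /* clean up so the sketch itself does not leak */
    return left;
}
```

Calling `leaked_refs_after_error(2)` models the old behaviour and leaves one reference behind, while `leaked_refs_after_error(3)` releases the object.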

    Detected by 'rmmod btrfs':

    [20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
    [20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G W L 3.17.0-rc5-btrfs-next-1+ #1
    [20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [20770.106130] 0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
    [20770.106132] ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
    [20770.106134] ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
    [20770.106136] Call Trace:
    [20770.106142] [] dump_stack+0x45/0x56
    [20770.106145] [] kmem_cache_destroy+0x4b/0x90
    [20770.106164] [] extent_map_exit+0x1a/0x1c [btrfs]
    [20770.106176] [] exit_btrfs_fs+0x27/0x9d3 [btrfs]
    [20770.106179] [] SyS_delete_module+0x153/0x1c4
    [20770.106182] [] ? trace_hardirqs_on_thunk+0x3a/0x3c
    [20770.106184] [] system_call_fastpath+0x16/0x1b

    This applies on top of (depends on) my previous patch titled:
    "Btrfs: fix race between fs trimming and block group remove/allocation"

    But the issue in fact was already present before that change, it only
    became easier to hit after Josef's 3.18 patch that added automatic
    removal of empty block groups.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • There was a free space entry structure memory leak if a block
    group is removed while a free space entry is being trimmed, which
    the following diagram explains:

              CPU 1                                   CPU 2

    btrfs_trim_block_group()
      trim_no_bitmap()
        remove free space entry from
        block group cache's rbtree
        do_trimming()

                                            btrfs_remove_block_group()
                                              btrfs_remove_free_space_cache()

        add back free space entry to
        block group's cache rbtree
      btrfs_put_block_group()

                                            (...)
                                            btrfs_put_block_group()
                                              kfree(bg->free_space_ctl)
                                              kfree(bg)

    The free space entry added after doing the discard of its respective
    range ends up never being freed.
    Detected after doing an "rmmod btrfs" after running the stress test
    recently submitted for fstests:

    [ 8234.642212] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
    [ 8234.642657] CPU: 1 PID: 32276 Comm: rmmod Tainted: G W L 3.17.0-rc5-btrfs-next-2+ #1
    [ 8234.642660] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [ 8234.642664] 0000000000000000 ffff8801af1b3eb8 ffffffff8140c7b6 ffff8801dbedd0c0
    [ 8234.642670] ffff8801af1b3ed0 ffffffff811149ce 0000000000000000 ffff8801af1b3ee0
    [ 8234.642676] ffffffffa042dbe7 ffff8801af1b3ef0 ffffffffa0487422 ffff8801af1b3f78
    [ 8234.642682] Call Trace:
    [ 8234.642692] [] dump_stack+0x4d/0x66
    [ 8234.642699] [] kmem_cache_destroy+0x4d/0x92
    [ 8234.642731] [] btrfs_destroy_cachep+0x63/0x76 [btrfs]
    [ 8234.642757] [] exit_btrfs_fs+0x9/0xbe7 [btrfs]
    [ 8234.642762] [] SyS_delete_module+0x155/0x1c6
    [ 8234.642768] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [ 8234.642773] [] system_call_fastpath+0x16/0x1b

    This applies on top of (depends on) my previous patch titled:
    "Btrfs: fix race between fs trimming and block group remove/allocation"

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • If the transaction handle doesn't have used blocks but has created new block
    groups make sure we turn the fs into readonly mode too. This is because the
    new block groups didn't get all their metadata persisted into the chunk and
    device trees, and therefore if a subsequent transaction starts, allocates
    space from the new block groups, writes data or metadata into that space,
    commits successfully and then after we unmount and mount the filesystem
    again, the same space can be allocated again for a new block group,
    resulting in file data or metadata corruption.

    Example where we don't abort the transaction when we fail to finish the
    chunk allocation (add items to the chunk and device trees) and later a
    future transaction where the block group is removed fails because it can't
    find the chunk item in the chunk tree:

    [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
    [25230.404301] BTRFS: Transaction aborted (error -28)
    [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
    [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
    [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [25230.404328] 0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
    [25230.404330] ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
    [25230.404332] ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
    [25230.404334] Call Trace:
    [25230.404338] [] dump_stack+0x45/0x56
    [25230.404342] [] warn_slowpath_common+0x7f/0x98
    [25230.404351] [] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
    [25230.404353] [] warn_slowpath_fmt+0x48/0x50
    [25230.404362] [] __btrfs_abort_transaction+0x50/0xfc [btrfs]
    [25230.404374] [] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
    [25230.404387] [] __btrfs_end_transaction+0x7e/0x2de [btrfs]
    [25230.404398] [] btrfs_end_transaction+0x10/0x12 [btrfs]
    [25230.404408] [] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
    [25230.404421] [] __btrfs_buffered_write+0x160/0x48d [btrfs]
    [25230.404425] [] ? cap_inode_need_killpriv+0x2d/0x37
    [25230.404429] [] ? get_page+0x1a/0x2b
    [25230.404441] [] btrfs_file_write_iter+0x321/0x42f [btrfs]
    [25230.404443] [] ? handle_mm_fault+0x7f3/0x846
    [25230.404446] [] ? mutex_unlock+0x16/0x18
    [25230.404449] [] new_sync_write+0x7c/0xa0
    [25230.404450] [] vfs_write+0xb0/0x112
    [25230.404452] [] SyS_pwrite64+0x66/0x84
    [25230.404454] [] system_call_fastpath+0x16/0x1b
    [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
    [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
    [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Trimming is completely transactionless, and the way it operates consists
    of hiding free space entries from a block group, performing the trim/discard
    and then making the free space entries visible again.
    Therefore while a free space entry is being trimmed, we can have free space
    cache writing running in parallel (as part of a transaction commit) which
    will miss the free space entry. This means that if we unmount (or
    crash/reboot) after that transaction commit and mount again before another
    transaction starts/commits after the discard finishes, we will have some
    free space that won't be used again unless the free space cache is rebuilt.
    After the unmount, fsck (btrfsck, btrfs check) reports the issue like the
    following example:

    *** fsck.btrfs output ***
    checking extents
    checking free space cache
    There is no free space entry for 521764864-521781248
    There is no free space entry for 521764864-1103101952
    cache appears valid but isnt 29360128
    Checking filesystem on /dev/sdc
    UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
    found 1235681286 bytes used err is -22
    (...)

    Another issue caused by this race is a crash while writing bitmap entries
    to the cache, because while the cache writeout task accesses the bitmaps,
    the trim task can be concurrently modifying the bitmap or worse might
    be freeing the bitmap. The latter case results in the following crash:

    [55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
    [55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
    [55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
    [55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
    [55650.807166] RIP: 0010:[] [] write_bitmap_entries+0x65/0xbb [btrfs]
    [55650.807514] RSP: 0018:ffff88007153bc30 EFLAGS: 00010246
    [55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
    [55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
    [55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
    [55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
    [55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
    [55650.808017] FS: 0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
    [55650.808017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
    [55650.808017] Stack:
    [55650.808017] ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
    [55650.808017] ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
    [55650.808017] ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
    [55650.808017] Call Trace:
    [55650.808017] [] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
    [55650.808017] [] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
    [55650.808017] [] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
    [55650.808017] [] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
    [55650.808017] [] ? _raw_spin_lock+0xe/0x10
    [55650.808017] [] btrfs_commit_transaction+0x411/0x882 [btrfs]
    [55650.808017] [] transaction_kthread+0xf2/0x1a4 [btrfs]
    [55650.808017] [] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
    [55650.808017] [] kthread+0xb7/0xbf
    [55650.808017] [] ? __kthread_parkme+0x67/0x67
    [55650.808017] [] ret_from_fork+0x7c/0xb0
    [55650.808017] [] ? __kthread_parkme+0x67/0x67
    [55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
    [55650.808017] RIP [] write_bitmap_entries+0x65/0xbb [btrfs]
    [55650.808017] RSP
    [55650.815725] ---[ end trace 1c032e96b149ff86 ]---

    Fix this by serializing both tasks in such a way that cache writeout
    doesn't wait for the trim/discard of free space entries to finish and
    doesn't miss any free space entry.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Our fs trim operation, which is completely transactionless (it doesn't
    start or join a transaction), consists of visiting all block groups and,
    for each one, iterating its free space entries and performing a discard
    operation against the space range represented by each free space entry.
    However before performing a discard, the corresponding free space
    entry is removed from the free space rbtree, and when the discard completes
    it is added back to the free space rbtree.

    If a block group remove operation happens while the discard is ongoing (or
    before it starts and after a free space entry is hidden), we end up not
    waiting for the discard to complete and removing the extent map that maps
    logical addresses to physical addresses and the corresponding chunk metadata
    from the chunk and device trees. After that and before the discard
    completes, the current running transaction can finish and a new one start,
    allowing for new block groups that map to the same physical addresses to
    be allocated and written to.

    So fix this by keeping the extent map in memory until the discard completes
    so that the same physical addresses aren't reused before it completes.

    If the physical locations that are under a discard operation end up being
    used for a new metadata block group for example, and dirty metadata extents
    are written before the discard finishes (the VM might call writepages() of
    our btree inode's i_mapping for example, or an fsync log commit happens) we
    end up overwriting metadata with zeroes, which leads to errors from fsck
    like the following:

    checking extents
    Check tree block failed, want=833912832, have=0
    Check tree block failed, want=833912832, have=0
    Check tree block failed, want=833912832, have=0
    Check tree block failed, want=833912832, have=0
    Check tree block failed, want=833912832, have=0
    read block failed check_tree_block
    owner ref check failed [833912832 16384]
    Errors found in extent allocation tree or chunk allocation
    checking free space cache
    checking fs roots
    Check tree block failed, want=833912832, have=0
    Check tree block failed, want=833912832, have=0
    Check tree block failed, want=833912832, have=0
    Check tree block failed, want=833912832, have=0
    Check tree block failed, want=833912832, have=0
    read block failed check_tree_block
    root 5 root dir 256 error
    root 5 inode 260 errors 2001, no inode item, link count wrong
    unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
    root 5 inode 262 errors 2001, no inode item, link count wrong
    unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
    root 5 inode 263 errors 2001, no inode item, link count wrong
    (...)

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Signed-off-by: Zhao Lei
    Signed-off-by: Miao Xie

    Zhao Lei
     
  • There's a race between adding a block group to the list of the unused
    block groups and removing an unused block group (cleaner kthread) that
    leads to freeing extents that are in use or a crash during transaction
    commit. Basically the cleaner kthread, when executing
    btrfs_delete_unused_bgs(), might catch the newly added block group to
    the list fs_info->unused_bgs and clear the range representing the whole
    group from fs_info->freed_extents[] before the task that added the block
    group to the list (running update_block_group()) marked the last freed
    extent as dirty in fs_info->freed_extents (pinned_extents).

    That is:

              CPU 1                                   CPU 2

                                            btrfs_delete_unused_bgs()
    update_block_group()
      add block group to
      fs_info->unused_bgs
                                              got block group from the list
                                              clear_extent_bits for the whole
                                              block group range in freed_extents[]
      set_extent_dirty for the
      range covering the freed
      extent in freed_extents[]
      (fs_info->pinned_extents)

                                            block group deleted, and a new block
                                            group with the same logical address is
                                            created

                                            reserve space from the new block group
                                            for new data or metadata - the reserved
                                            space overlaps the range specified by
                                            CPU 1 for set_extent_dirty()

                                            commit transaction
                                              find all ranges marked as dirty in
                                              fs_info->pinned_extents, clear them
                                              and add them to the free space cache

    Alternatively, if CPU 2 doesn't create a new block group with the same
    logical address, we get a crash/BUG_ON at transaction commit when unpinning
    extent ranges because we can't find a block group for the range marked as
    dirty by CPU 1. Sample trace:

    [ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
    [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
    [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
    [ 2163.429157] RIP: 0010:[] [] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
    [ 2163.429562] RSP: 0018:ffff8801356bfda8 EFLAGS: 00010246
    [ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
    [ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
    [ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
    [ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
    [ 2163.430042] FS: 0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
    [ 2163.430042] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
    [ 2163.430042] Stack:
    [ 2163.430042] ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
    [ 2163.430042] ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
    [ 2163.430042] ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
    [ 2163.430042] Call Trace:
    [ 2163.430042] [] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
    [ 2163.430042] [] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
    [ 2163.430042] [] transaction_kthread+0xf2/0x1a4 [btrfs]
    [ 2163.430042] [] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
    [ 2163.430042] [] kthread+0xb7/0xbf
    [ 2163.430042] [] ? __kthread_parkme+0x67/0x67
    [ 2163.430042] [] ret_from_fork+0x7c/0xb0
    [ 2163.430042] [] ? __kthread_parkme+0x67/0x67

    So fix this by making update_block_group() first set the range as dirty
    in pinned_extents before adding the block group to the unused_bgs list.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • If we remove a block group (because it became empty), we might have left
    a caching_ctl structure in fs_info->caching_block_groups that points to
    the block group and is accessed at transaction commit time. This results
    in accessing an invalid or incorrect block group. This issue became visible
    after Josef's patch "Btrfs: remove empty block groups automatically".

    So if the block group is removed make sure we don't leave a dangling
    caching_ctl in caching_block_groups.

    Sample crash trace:

    [58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
    [58380.439707] IP: [] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
    [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
    [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
    [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
    [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
    [58380.443238] RIP: 0010:[] [] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
    [58380.443238] RSP: 0018:ffff88013896fdd8 EFLAGS: 00010283
    [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
    [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
    [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
    [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
    [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
    [58380.443238] FS: 0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
    [58380.443238] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
    [58380.443238] Stack:
    [58380.443238] ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
    [58380.443238] ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
    [58380.443238] ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
    [58380.443238] Call Trace:
    [58380.443238] [] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
    [58380.443238] [] btrfs_commit_transaction+0x45c/0x882 [btrfs]
    [58380.443238] [] transaction_kthread+0xf2/0x1a4 [btrfs]
    [58380.443238] [] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
    [58380.443238] [] kthread+0xb7/0xbf
    [58380.443238] [] ? __kthread_parkme+0x67/0x67
    [58380.443238] [] ret_from_fork+0x7c/0xb0
    [58380.443238] [] ? __kthread_parkme+0x67/0x67

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • If we grab a block group, for example in btrfs_trim_fs(), we will be holding
    a reference on it but the block group can be removed after we got it (via
    btrfs_remove_block_group), which means it will no longer be part of the
    rbtree.

    However, btrfs_remove_block_group() was only calling rb_erase() which leaves
    the block group's rb_node left and right child pointers with the same content
    they had before calling rb_erase. This was dangerous because a call to
    next_block_group() would access the node's left and right child pointers (via
    rb_next), which may no longer be valid.

    Fix this by clearing a block group's node after removing it from the tree,
    and have next_block_group() do a tree search to get the next block group
    instead of using rb_next() if our block group was removed.
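
The idea of the fix can be sketched in userspace C. This is only an illustration under loud assumptions: a sorted singly linked list stands in for the rbtree, an `in_tree` flag stands in for a cleared rb_node, and every name below is hypothetical. Removal clears the node's link, and the "next" helper falls back to a search by start offset when the node is no longer linked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block group stand-in, kept in a list sorted by start. */
struct bg_sketch {
    unsigned long long start;
    struct bg_sketch *next;
    int in_tree; /* plays the role of "node is not cleared" */
};

/* Unlink bg and clear its link so no stale pointer survives. */
static void bg_remove(struct bg_sketch **head, struct bg_sketch *bg)
{
    struct bg_sketch **pp = head;

    while (*pp && *pp != bg)
        pp = &(*pp)->next;
    assert(*pp == bg);
    *pp = bg->next;
    bg->next = NULL;
    bg->in_tree = 0;
}

/*
 * Return the block group following bg. If bg was removed, do not trust
 * its link; search the container for the first entry past bg->start.
 */
static struct bg_sketch *bg_next(struct bg_sketch *head,
                                 const struct bg_sketch *bg)
{
    struct bg_sketch *cur;

    if (bg->in_tree)
        return bg->next;
    for (cur = head; cur; cur = cur->next)
        if (cur->start > bg->start)
            return cur;
    return NULL;
}
```

A caller that still holds a reference to a removed block group can keep iterating with `bg_next()` and lands on the correct successor instead of following a stale pointer.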

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • The commit c404e0dc (Btrfs: fix use-after-free in the finishing
    procedure of the device replace) fixed a use-after-free problem
    which happened when removing the source device at the end of device
    replace, but at that time, btrfs didn't support device replace
    on raid56, so we didn't fix the problem on the raid56 profile.
    Now that we have implemented device replace for raid56, we need to
    fix that problem before we enable that function for raid56.

    The fix is very simple: we increase the bio per-cpu counter before
    we submit a raid56 io, and decrease the counter when the raid56 io
    ends.
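
A rough userspace sketch of the counting scheme follows. It assumes a single C11 atomic counter instead of the kernel's per-cpu counter, and all names are hypothetical; it only shows that the source device must not be released while the in-flight count is non-zero.

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplification: one atomic counter instead of a per-cpu counter. */
static atomic_int bios_in_flight;

/* Called before an io is submitted. */
static void io_submit_sketch(void)
{
    atomic_fetch_add(&bios_in_flight, 1);
}

/* Called from the io completion path. */
static void io_end_sketch(void)
{
    int prev = atomic_fetch_sub(&bios_in_flight, 1);

    assert(prev > 0); /* every end must pair with a submit */
}

/* The replace code may only free the source device when this is true. */
static int source_device_can_be_freed(void)
{
    return atomic_load(&bios_in_flight) == 0;
}
```

The ordering matters: the increment happens before submission, so a concurrent check can never observe zero while an io is about to touch the device.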

    Signed-off-by: Miao Xie

    Miao Xie
     
  • This function reuses the code of parity scrub: we just write
    the right parity or corrected parity into the target device before
    the parity scrub ends.

    Signed-off-by: Miao Xie

    Miao Xie
     
  • The implementation is simple:
    - In order to avoid changing the code logic of btrfs_map_bio and
    RAID56, we add the stripes of the replace target devices at the
    end of the stripe array in the btrfs bio, and we sort those target
    device stripes in the array. We also keep the number of target
    device stripes in the btrfs bio.
    - Except for write operations on RAID56, no other operation takes
    the target device stripes into account.
    - When we do a write operation, we read the data from the common devices
    and calculate the parity. Then we write the dirty data and new parity
    out; at this point, we find the corresponding replace target stripes
    and write the corresponding data into them.

    Note: The function that copies old data from the source device to
    the target device was implemented in the past; it is similar to
    that of the other RAID types.

    Signed-off-by: Miao Xie

    Miao Xie
     
  • The implementation is:
    - Read and check all the data with checksum in the same stripe.
    All the data which has a checksum is COW data, and we are sure
    that it is not changed even though we don't lock the stripe, because
    the space of that data can only be reclaimed after the current
    transaction is committed, and only then can the fs use it to store
    other data. When doing scrub, we hold the current transaction,
    so that space cannot be reclaimed, and it is safe to read and check
    the data outside of the stripe lock.
    - Lock the stripe
    - Read out all the data without checksum and the parity
    The data without checksum and the parity may be changed if we don't
    lock the stripe, so we need to read them in the stripe lock context.
    - Check the parity
    - Re-calculate the new parity and write it back if the old parity
    is not right
    - Unlock the stripe

    If we can not read out the data or the data we read is corrupted,
    we will try to repair it. If the repair fails. we will mark the
    horizontal sub-stripe(pages on the same horizontal) as corrupted
    sub-stripe, and we will skip the parity check and repair of that
    horizontal sub-stripe.

    And in order to skip the horizontal sub-stripe that has no data, we
    introduce a bitmap. If there is some data on the horizontal sub-stripe,
    we will the relative bit to 1, and when we check and repair the
    parity, we will skip those horizontal sub-stripes that the relative
    bits is 0.
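
    The check-and-repair loop with the bitmap can be pictured with a
    small Python sketch. It is an illustration only, assuming RAID5-style
    XOR parity and hypothetical names (scrub_parity, dbitmap); it is not
    the kernel code, and it ignores locking, checksums and the RAID6 Q
    stripe.

```python
from functools import reduce

def xor_pages(pages):
    """XOR equally sized byte strings together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*pages))

def scrub_parity(data_stripes, parity, dbitmap):
    """Check each horizontal sub-stripe whose bit is set in dbitmap.

    data_stripes: list of stripes, each a list of pages (bytes).
    parity:       list of parity pages, one per horizontal sub-stripe.
    dbitmap:      integer bitmap; bit i set means sub-stripe i holds data.
    Returns the indexes of the parity pages that were rewritten.
    """
    repaired = []
    for i in range(len(parity)):
        if not (dbitmap >> i) & 1:
            continue  # no data on this horizontal sub-stripe: skip it
        expected = xor_pages([stripe[i] for stripe in data_stripes])
        if parity[i] != expected:
            parity[i] = expected  # write back the recalculated parity
            repaired.append(i)
    return repaired
```

    A sub-stripe whose bit is 0 is never touched, even if its parity
    page holds stale bytes.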

    Signed-off-by: Miao Xie

    Miao Xie
     
  • We will introduce new operation types later. If we kept using integer
    variables as booleans to record the operation type, we would have to
    add new variables and increase the size of the raid bio structure,
    which is not good. With this patch, we define a different number for
    each operation, so a single variable can record the operation type.

    Signed-off-by: Miao Xie

    Miao Xie
     
  • This patch implements the RAID5/6 common data repair function. The
    implementation is similar to scrub on the other RAID types such as
    RAID1; the difference is that we don't read the data from the
    mirror, we use the data repair function of RAID5/6.

    Signed-off-by: Miao Xie

    Miao Xie
     
  • Because we will reuse bbio and raid_map during scrub later, it is
    better that we don't change any variable of bbio and don't free it at
    the end of the IO request. So we introduce similar variables in the
    raid bio, and don't access those bbio variables any more.

    Signed-off-by: Miao Xie

    Miao Xie
     
  • stripe_index's value is set again on a later line:
    stripe_index = 0;

    Signed-off-by: Zhao Lei
    Signed-off-by: Miao Xie
    Reviewed-by: David Sterba

    Zhao Lei
     
  • bbio_ret in this condition is always non-NULL because the previous
    code already has a check-and-skip:
    if (!bbio_ret)
        goto out;

    Signed-off-by: Zhao Lei
    Signed-off-by: Miao Xie
    Reviewed-by: David Sterba

    Zhao Lei
     

01 Dec, 2014

1 commit

  • Don Bailey noticed that our page zeroing for compression at end-io time
    isn't complete. This reworks a patch from Linus to push the zeroing
    into the zlib and lzo specific functions instead of trying to handle the
    corners inside btrfs_decompress_buf2page

    Signed-off-by: Chris Mason
    Reviewed-by: Josef Bacik
    Reported-by: Don A. Bailey
    cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Chris Mason
     

25 Nov, 2014

6 commits

  • If right after starting the snapshot creation ioctl we perform a write against a
    file followed by a truncate, with both operations increasing the file's size, we
    can get a snapshot tree that reflects a state of the source subvolume's tree where
    the file truncation happened but the write operation didn't. This leaves a gap
    between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

    For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
    -c "pwrite -S 0xaa -b 32K 0 32K" \
    -c "fsync" \
    -c "pwrite -S 0xbb -b 32770 16K 32770" \
    -c "truncate 90123" \
    /mnt/foobar

    and the snapshot creation ioctl was called just before the second write, we can
    often get the following inode items in the snapshot's btree:

    item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
    inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
    item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
    inode ref index 282 namelen 10 name: foobar
    item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
    extent data disk byte 1104855040 nr 32768
    extent data offset 0 nr 32768 ram 32768
    extent compression 0
    item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
    extent data disk byte 0 nr 0
    extent data offset 0 nr 40960 ram 40960
    extent compression 0

    There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
    for which there's no file extent item covering it. This is because the file write
    and file truncate operations happened both right after the snapshot creation ioctl
    called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
    ordered extent that matches the write and, in btrfs_setsize(), we were able to call
    btrfs_cont_expand() before being able to commit the current transaction in the
    snapshot creation ioctl. So this made it possible to insert the hole file extent
    item in the source subvolume (which represents the region added by the truncate)
    right before the transaction commit from the snapshot creation ioctl.
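
    The uncovered range can be recomputed from the extent items in the
    dump above. The Python below is a sketch, assuming a 4096-byte sector
    size and that every byte up to the sector-aligned inode size must be
    covered by a file extent item; find_gaps is a hypothetical helper and
    not how the fsck tool is implemented.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def find_gaps(extents, isize, sectorsize=4096):
    """Return [start, end) ranges not covered by any file extent item.

    extents: (file_offset, num_bytes) pairs sorted by file_offset.
    isize:   inode size in bytes; coverage is expected up to the
             sector-aligned size.
    """
    gaps = []
    pos = 0
    for offset, length in extents:
        if offset > pos:
            gaps.append((pos, offset))
        pos = max(pos, offset + length)
    end = align_up(isize, sectorsize)
    if pos < end:
        gaps.append((pos, end))
    return gaps
```

    With the two EXTENT_DATA items shown (offset 0, 32768 bytes, and
    offset 53248, 40960 bytes) and an inode size of 90123, the only gap
    is the range from 32768 to 53248, where 53248 is
    ALIGN(16K + 32770, 4096).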

    Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

    From a user perspective, the expectation when a snapshot is created while those
    file operations are being performed is that the snapshot will have a file that
    either:

    1) is empty
    2) only the first write was captured
    3) only the 2 writes were captured
    4) both writes and the truncation were captured

    But it should never capture a state where only the first write and the truncation
    were captured (since the second write was performed before the truncation).

    A test case for xfstests follows.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Move the logic from the snapshot creation ioctl into send. This avoids
    doing the transaction commit if send isn't used, and ensures that if
    a crash/reboot happens after the transaction commit that created the
    snapshot and before the transaction commit that switched the commit
    root, send will not get a commit root that differs from the main root
    (that has orphan items).

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Due to ignoring errors returned by clear_extent_bits (at the moment only
    -ENOMEM is possible), we can end up freeing an extent that is actually in
    use (i.e. return the extent to the free space cache).

    The sequence of steps that lead to this:

    1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
    the goal of freeing empty block groups;

    2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
    transaction (or starts a new one if none is running) and attempts to
    clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
    and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);

    3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
    -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
    to delete the block group and the respective chunk, while pinned_extents
    remains with that bit set for the whole (or a part of the) range covered
    by the block group;

    4) Later while the transaction is still running, the chunk ends up being reused
    for a new block group (maybe for different purpose, data or metadata), and
    extents belonging to the new block group are allocated for file data or btree
    nodes/leafs;

    5) The current transaction is committed, meaning that we unpinned one or more
    extents from the new block group (through btrfs_finish_extent_commit() and
    unpin_extent_range()) which are now being used for new file data or new
    metadata.
    And unpinning means we returned the extents to the free space cache of the
    new block group, which implies those extents can be used for future allocations
    while they're still in use.

    Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
    object in unpin_extent_range() if a new block group didn't end up being allocated for
    the same chunk (step 4 above).

    Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.
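
    The shape of the fix, stopping as soon as the clear fails, can be
    sketched in Python. FakeExtentIoTree and delete_unused_bg are
    hypothetical stand-ins written for this illustration; they only
    model the error path and are not the kernel implementation.

```python
import errno

class FakeExtentIoTree:
    """Stand-in for an extent io tree; fail_clear simulates -ENOMEM."""
    def __init__(self, dirty_ranges, fail_clear=False):
        self.dirty = set(dirty_ranges)
        self.fail_clear = fail_clear

    def clear_dirty(self, rng):
        if self.fail_clear:
            return -errno.ENOMEM
        self.dirty.discard(rng)
        return 0

def delete_unused_bg(bg_range, freed_extents, block_groups):
    """Remove an empty block group only if every tree was cleaned."""
    for tree in freed_extents:
        ret = tree.clear_dirty(bg_range)
        if ret:
            return ret  # keep the block group and chunk; do not proceed
    block_groups.remove(bg_range)
    return 0
```

    When either tree fails to clear the range, the block group stays in
    place, so its chunk cannot be reused while the stale dirty bit is
    still set.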

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Fengguang's build monster reported warnings on some arches because we
    don't have vmalloc.h included

    Signed-off-by: Chris Mason
    Reported-by: fengguang.wu@intel.com

    Chris Mason
     
  • The following lockdep warning is triggered during xfstests:

    [ 1702.980872] =========================================================
    [ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
    [ 1702.981482] 3.18.0-rc1 #27 Not tainted
    [ 1702.981781] ---------------------------------------------------------
    [ 1702.982095] kswapd0/77 just changed the state of lock:
    [ 1702.982415] (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
    [ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
    [ 1702.983160] (&fs_info->dev_replace.lock){+.+.+.}

    and interrupts could create inverse lock ordering between them.

    [ 1702.984675]
    other info that might help us debug this:
    [ 1702.985524] Chain exists of:
    &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

    [ 1702.986799] Possible interrupt unsafe locking scenario:

    [ 1702.987681]        CPU0                    CPU1
    [ 1702.988137]        ----                    ----
    [ 1702.988598]   lock(&fs_info->dev_replace.lock);
    [ 1702.989069]                                local_irq_disable();
    [ 1702.989534]                                lock(&delayed_node->mutex);
    [ 1702.990038]                                lock(&found->groups_sem);
    [ 1702.990494]   <Interrupt>
    [ 1702.990938]     lock(&delayed_node->mutex);
    [ 1702.991407]
    *** DEADLOCK ***

    This is because btrfs_kobj_{add/rm}_device() allocates memory with
    GFP_KERNEL, which may flush the fs page cache to free space and
    then wait for itself to do the commit, causing the deadlock.

    To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
    dev_replace lock range, which also involves splitting the
    btrfs_rm_dev_replace_srcdev() function into remove and free parts.

    Now only btrfs_rm_dev_replace_remove_srcdev() is called in the
    dev_replace lock range, and kobj_{add/rm} and
    btrfs_rm_dev_replace_free_srcdev() are called outside the lock range.

    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     
  • …git/kdave/linux into for-linus

    Chris Mason
     

24 Nov, 2014

1 commit


22 Nov, 2014

5 commits

  • When doing a fsync with a fast path we have a time window where we can miss
    the fact that writeback of some file data failed, and therefore we end up
    returning success (0) from fsync when we should return an error.
    The steps that lead to this are the following:

    1) We start all ordered extents by calling filemap_fdatawrite_range();

    2) We do some other work like locking the inode's i_mutex, start a transaction,
    start a log transaction, etc;

    3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
    ordered extents from inode's ordered tree into a list;

    4) But by the time we do ordered extent collection, some ordered extents we started
    at step 1) might have already completed with an error, and therefore we didn't
    find them in the ordered tree and had no idea they finished with an error. This
    makes our fsync return success (0) to userspace, but has no bad effects on the log
    like for example insertion of file extent items into the log that point to unwritten
    extents, because the invalid extent maps were removed before the ordered extent
    completed (in inode.c:btrfs_finish_ordered_io).

    So after collecting the ordered extents just check if the inode's i_mapping has any
    error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
    writeback fails for a page of an ordered extent, we call mapping_set_error (done in
    extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
    that sets one of those error flags in the inode's i_mapping flags.

    This change also has the side effect of fixing the issue where for fast fsyncs we
    never checked/cleared the error flags from the inode's i_mapping flags, which means
    that a full fsync performed after a fast fsync could get such errors that belonged
    to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
    calls filemap_fdatawait_range(), and the latter checks for and clears those flags,
    while for fast fsyncs we never call filemap_fdatawait_range() or anything else
    that checks for and clears the error flags from the inode's i_mapping.
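
    The added check can be sketched as follows. The flag values and the
    helper names (Mapping, check_and_clear_mapping_error, fast_fsync)
    are made up for this Python illustration and only model the order
    of the steps; they are not the kernel's definitions.

```python
import errno

AS_EIO = 1 << 0      # illustrative bit values, not the kernel's
AS_ENOSPC = 1 << 1

class Mapping:
    """Stand-in for an inode's i_mapping with its error flags."""
    def __init__(self):
        self.flags = 0

def check_and_clear_mapping_error(mapping):
    """Return a negative errno if writeback failed, clearing the flag."""
    if mapping.flags & AS_ENOSPC:
        mapping.flags &= ~AS_ENOSPC
        return -errno.ENOSPC
    if mapping.flags & AS_EIO:
        mapping.flags &= ~AS_EIO
        return -errno.EIO
    return 0

def fast_fsync(mapping, ordered_tree):
    """Collect ordered extents, then look for already-failed writeback."""
    collected = list(ordered_tree)  # extents still present in the tree
    ret = check_and_clear_mapping_error(mapping)
    if ret:
        return ret  # an extent failed and already left the tree
    # ... waiting on `collected` and logging the inode would go here ...
    return 0
```

    Because the flag is cleared when it is reported, a later full fsync
    no longer picks up an error that belonged to the fast fsync.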

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Instead of collecting all ordered extents from the inode's ordered tree
    and then waiting for all of them to complete, just collect the ones that
    overlap the fsync range.
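
    The overlap test itself is a plain interval check. A minimal Python
    sketch, assuming ordered extents are (file_offset, length) pairs and
    the fsync range is inclusive (overlapping_ordered_extents is a
    hypothetical name):

```python
def overlapping_ordered_extents(ordered, start, end):
    """Return the ordered extents that intersect [start, end].

    ordered: iterable of (file_offset, length) pairs.
    start, end: inclusive byte range being fsynced.
    """
    return [(off, length) for off, length in ordered
            if off <= end and off + length > start]
```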

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • If an error happens during writeback of log btree extents, make sure the
    error is returned to the caller (fsync), so that it takes proper action
    (commit current transaction) instead of writing a superblock that points
    to log btrees with all or some nodes that weren't durably persisted.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • We use the modified list to keep track of which extents have been modified so we
    know which ones are candidates for logging at fsync() time. Newly modified
    extents are added to the list at modification time, around the same time the
    ordered extent is created. We do this so that we don't have to wait for ordered
    extents to complete before we know what we need to log. The problem is when
    something like this happens

    log extent 0-4k on inode 1
    copy csum for 0-4k from ordered extent into log
    sync log
    commit transaction
    log some other extent on inode 1
    ordered extent for 0-4k completes and adds itself onto modified list again
    log changed extents
    see ordered extent for 0-4k has already been logged
    at this point we assume the csum has been copied
    sync log
    crash

    On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
    which is the same one that we are replaying which also drops the csum, and then
    we won't find the csum in the log for that bytenr. This of course causes us to
    have errors about not having csums for certain ranges of our inode. So remove
    the modified list manipulation in unpin_extent_cache, any modified extents
    should have been added well before now, and we don't want them re-logged. This
    fixes my test that I could reliably reproduce this problem with. Thanks,

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Liu Bo pointed out that my previous fix would lose the generation update in the
    scenario I described. It is actually much worse than that, we could lose the
    entire extent if we lose power right after the transaction commits. Consider
    the following

    write extent 0-4k
    log extent in log tree
    commit transaction
    < power fail happens here
    ordered extent completes

    We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
    the transaction commit will reset the log so it isn't replayed. If we lose
    power before the transaction commit we are safe, otherwise we are not.

    Fix this by keeping track of all extents we logged in this transaction. Then
    when we go to commit the transaction make sure we wait for all of those ordered
    extents to complete before proceeding. This will make sure that if we lose
    power after the transaction commit we still have our data. This also fixes the
    problem of the improperly updated extent generation. Thanks,
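
    The ordering this fix enforces can be modelled in a few lines of
    Python. OrderedExtent and Transaction are toy classes written for
    this sketch, and wait() merely marks completion instead of blocking
    on real I/O.

```python
class OrderedExtent:
    def __init__(self, name):
        self.name = name
        self.complete = False

    def wait(self):
        # Stand-in for blocking until the I/O and fs tree update finish.
        self.complete = True

class Transaction:
    def __init__(self):
        self.logged_extents = []  # extents logged during this transaction
        self.committed = False

    def log_extent(self, ordered):
        self.logged_extents.append(ordered)

    def commit(self):
        # Wait for every logged ordered extent before the commit resets
        # the log, so the fs tree already holds the extent afterwards.
        for ordered in self.logged_extents:
            ordered.wait()
        self.logged_extents.clear()
        self.committed = True
```

    Once commit() returns, no logged extent is still in flight, so a
    power failure after the commit cannot lose it.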

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

21 Nov, 2014

3 commits

  • If we have two fsync()'s race on different subvols one will do all of its work
    to get into the log_tree, wait on its outstanding IO, and then allow the
    log_tree to finish its commit. The problem is we were just freeing that
    subvol's logged extents instead of waiting on them, so whoever lost the race
    wouldn't really have their data on disk. Fix this by waiting properly instead
    of freeing the logged extents. Thanks,

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The sizes that are obtained from space infos are in raw units and have
    to be adjusted according to the raid factor. This was missing for
    f_bavail, and df reported a doubled size for raid1.
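
    The adjustment is a division by the profile's factor. The Python
    below is a sketch with an assumed factor table (2 for profiles that
    store every byte twice, 1 otherwise); RAID_FACTOR and usable_bytes
    are illustrative names.

```python
# Assumed factors for illustration: profiles that keep two copies of
# every byte consume twice the raw space.
RAID_FACTOR = {"single": 1, "raid0": 1, "dup": 2, "raid1": 2, "raid10": 2}

def usable_bytes(raw_bytes, profile):
    """Convert a raw size from a space info into a user-visible size."""
    return raw_bytes // RAID_FACTOR[profile]
```

    For example, 2 GiB of raw free space on raid1 corresponds to 1 GiB
    that a user can actually write.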

    Reported-by: Martin Steigerwald
    Fixes: ba7b6e62f420 ("btrfs: adjust statfs calculations according to raid profiles")
    CC: stable@vger.kernel.org
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • This can be reproduced by fstests: btrfs/070

    The scenario is like the following:

    replace worker thread               defrag thread
    ---------------------               -------------
    copy_nocow_pages_worker             btrfs_defrag_file
      copy_nocow_pages_for_inode          ...
                                          btrfs_writepages
      |A| lock_extent_bits                  extent_write_cache_pages
                                        |B|   lock_page
                                              __extent_writepage
          ...                                   writepage_delalloc
                                                  find_lock_delalloc_range
                                        |B|         lock_extent_bits
          find_or_create_page
            pagecache_get_page
      |A|     lock_page

    This leads to an ABBA pattern deadlock. To fix it,
    o we just change it to an AABB pattern, which means we @unlock_extent_bits()
    before we @lock_page(). This way @extent_read_full_page_nolock()
    is no longer in a locked context, so change it back to @extent_read_full_page()
    to regain protection.

    o Since we @unlock_extent_bits() earlier, by the time we reach @write_page_nocow()
    the extent may no longer point at the physical block we want, so we
    have to check it before writing.
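
    The locking pattern can be sketched with two Python locks. This is
    an illustration under stated assumptions: extent_lock and page_lock
    stand in for lock_extent_bits() and lock_page(), extent_map is a toy
    mapping, and the function names are invented for the sketch.

```python
import threading

extent_lock = threading.Lock()   # stands in for lock_extent_bits()
page_lock = threading.Lock()     # stands in for lock_page()
extent_map = {"physical": 1 << 20}

def extent_points_at(expected_physical):
    """Check the mapping under the extent lock and release it at once."""
    with extent_lock:
        return extent_map["physical"] == expected_physical

def copy_nocow_page(expected_physical, written):
    """Replace side: never hold the extent lock while taking the page lock."""
    if not extent_points_at(expected_physical):
        return False
    with page_lock:
        # The extent lock was dropped above, so the mapping may have
        # changed; check again before writing. Lock order here is
        # page -> extent, the same order the defrag side uses.
        if not extent_points_at(expected_physical):
            return False
        written.append(expected_physical)
        return True

def defrag_writepage():
    """Defrag side from the diagram: page lock first, then extent lock."""
    with page_lock:
        with extent_lock:
            pass
```

    Both paths now take the page lock before the extent lock whenever
    they hold the two together, so the ABBA ordering cannot occur; the
    price is the second check of the mapping before the write.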

    Signed-off-by: Gui Hecheng
    Tested-by: David Sterba
    Signed-off-by: Chris Mason

    Gui Hecheng