22 Mar, 2016

1 commit

  • Pull btrfs updates from Chris Mason:
    "We have a good sized cleanup of our internal read ahead code, and the
    first series of commits from Chandan to enable PAGE_SIZE > sectorsize.

    Otherwise, it's a normal series of cleanups and fixes, with many
    thanks to Dave Sterba for doing most of the patch wrangling this time"

    * 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (82 commits)
    btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums
    btrfs: Fix misspellings in comments.
    btrfs: Print Warning only if ENOSPC_DEBUG is enabled
    btrfs: scrub: silence an uninitialized variable warning
    btrfs: move btrfs_compression_type to compression.h
    btrfs: rename btrfs_print_info to btrfs_print_mod_info
    Btrfs: Show a warning message if one of objectid reaches its highest value
    Documentation: btrfs: remove usage specific information
    btrfs: use kbasename in btrfsic_mount
    Btrfs: do not collect ordered extents when logging that inode exists
    Btrfs: fix race when checking if we can skip fsync'ing an inode
    Btrfs: fix listxattrs not listing all xattrs packed in the same item
    Btrfs: fix deadlock between direct IO reads and buffered writes
    Btrfs: fix extent_same allowing destination offset beyond i_size
    Btrfs: fix file loss on log replay after renaming a file and fsync
    Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
    Btrfs: fix lockdep deadlock warning due to dev_replace
    btrfs: drop unused argument in btrfs_ioctl_get_supported_features
    btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls
    btrfs: change max_inline default to 2048
    ...

    Linus Torvalds
     

21 Mar, 2016

1 commit

  • Commit c40a3d38aff4e1c (Btrfs: Compute and look up csums based on
    sectorsized blocks) changes around how we walk the bios while looking up
    crcs. There's an inner loop that is jumping to the next bvec based on
    sectors and before it derefs the next bvec, it needs to make sure we're
    still in the bio.

    In this case, the outer loop would have decided to stop moving forward
    too, and the bvec deref is never actually used for anything. But
    CONFIG_DEBUG_PAGEALLOC catches it because we're outside our bio.
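
    The pattern described above can be sketched in user space. This is a
    minimal model, not the kernel code: the bio is represented as a plain
    list of bvec lengths, and the name walk_sectors is hypothetical. The
    point is only that the index is re-checked against the end of the list
    before the next element is ever read.

```python
def walk_sectors(bvec_lens, sectorsize):
    """Visit every sector of a bio modelled as a list of bvec lengths (bytes).

    Hypothetical model, not the kernel implementation. Returns a list of
    (bvec_index, byte_offset) pairs and never indexes past the end of
    bvec_lens.
    """
    visited = []
    idx = 0
    offset = 0
    while idx < len(bvec_lens):          # bounds check before any access
        visited.append((idx, offset))
        offset += sectorsize
        if offset >= bvec_lens[idx]:
            # Step to the next bvec. The loop condition verifies that the
            # new index is still inside the bio before it is dereferenced.
            idx += 1
            offset = 0
    return visited
```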

    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba

    Chris Mason
     

18 Mar, 2016

1 commit

  • Even though this is a 'can't happen' situation, use the new
    radix_tree_iter_retry() pattern to eliminate a goto.

    [akpm@linux-foundation.org: fix btrfs build]
    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: David Sterba
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

14 Mar, 2016

2 commits


12 Mar, 2016

4 commits


11 Mar, 2016

1 commit


07 Mar, 2016

1 commit


05 Mar, 2016

1 commit


04 Mar, 2016

1 commit

  • When looking for orphan roots during mount we can end up hitting a
    BUG_ON() (at root-tree.c:btrfs_find_orphan_roots()) if a log tree is
    replayed and qgroups are enabled. This is because after a log tree is
    replayed, a transaction commit is made, which triggers qgroup extent
    accounting which in turn does backref walking which ends up reading and
    inserting all roots in the radix tree fs_info->fs_root_radix, including
    orphan roots (deleted snapshots). So after the log tree is replayed, when
    finding orphan roots we hit the BUG_ON with the following trace:

    [118209.182438] ------------[ cut here ]------------
    [118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
    [118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse
    processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata
    virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
    [118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G W 4.5.0-rc5-btrfs-next-24+ #1
    [118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
    [118209.186318] RIP: 0010:[] [] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
    [118209.186318] RSP: 0018:ffff8800af34faa8 EFLAGS: 00010246
    [118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
    [118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
    [118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
    [118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
    [118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
    [118209.186318] FS: 00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
    [118209.186318] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
    [118209.186318] Stack:
    [118209.186318] fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
    [118209.186318] fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
    [118209.186318] 0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
    [118209.186318] Call Trace:
    [118209.186318] [] open_ctree+0x1e37/0x21b9 [btrfs]
    [118209.186318] [] btrfs_mount+0x97e/0xaed [btrfs]
    [118209.186318] [] ? trace_hardirqs_on+0xd/0xf
    [118209.186318] [] mount_fs+0x67/0x131
    [118209.186318] [] vfs_kern_mount+0x6c/0xde
    [118209.186318] [] btrfs_mount+0x1ac/0xaed [btrfs]
    [118209.186318] [] ? trace_hardirqs_on+0xd/0xf
    [118209.186318] [] ? lockdep_init_map+0xb9/0x1b3
    [118209.186318] [] mount_fs+0x67/0x131
    [118209.186318] [] vfs_kern_mount+0x6c/0xde
    [118209.186318] [] do_mount+0x8a6/0x9e8
    [118209.186318] [] SyS_mount+0x77/0x9f
    [118209.186318] [] entry_SYSCALL_64_fastpath+0x12/0x6b
    [118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 0b
    4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
    [118209.186318] RIP [] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
    [118209.186318] RSP
    [118209.230735] ---[ end trace 83938f987d85d477 ]---

    So fix this by not treating the error -EEXIST, returned when attempting
    to insert a root already inserted by the backref walking code, as an error.
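
    As an illustration of the idea only (the kernel change itself is not
    shown here), the sketch below uses a dict as a stand-in for
    fs_info->fs_root_radix. The function names insert_root and
    add_orphan_root are hypothetical.

```python
import errno


def insert_root(radix, root_id, root):
    """Insert root into a dict standing in for the fs_root_radix tree.

    Returns 0 on success, or -EEXIST if root_id is already present.
    """
    if root_id in radix:
        return -errno.EEXIST
    radix[root_id] = root
    return 0


def add_orphan_root(radix, root_id, root):
    """Insert an orphan root, tolerating a root that is already there."""
    ret = insert_root(radix, root_id, root)
    if ret == -errno.EEXIST:
        # Someone else (for example backref walking during qgroup
        # accounting) inserted the root first. That is not an error.
        return 0
    return ret
```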

    The following test case for xfstests reproduces the bug:

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1 # failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
    _cleanup_flakey
    cd /
    rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/dmflakey

    # real QA test starts here
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_dm_target flakey
    _require_metadata_journaling $SCRATCH_DEV

    rm -f $seqres.full

    _scratch_mkfs >>$seqres.full 2>&1
    _init_flakey
    _mount_flakey

    _run_btrfs_util_prog quota enable $SCRATCH_MNT

    # Create 2 directories with one file in one of them.
    # We use these just to trigger a transaction commit later, moving the file from
    # directory a to directory b and doing an fsync against directory a.
    mkdir $SCRATCH_MNT/a
    mkdir $SCRATCH_MNT/b
    touch $SCRATCH_MNT/a/f
    sync

    # Create our test file with 2 4K extents.
    $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

    # Create a snapshot and delete it. This doesn't really delete the snapshot
    # immediately, it just makes it inaccessible and invisible to user space. The
    # snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
    # which is woken up at the next transaction commit.
    # A root orphan item is inserted into the tree of tree roots, so that if a
    # power failure happens before the dedicated kernel thread does the snapshot
    # deletion, the next time the filesystem is mounted it resumes the snapshot
    # deletion.
    _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
    _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap

    # Now overwrite half of the extents we wrote before. Because we made a snapshot
    # before, which isn't really deleted yet (since no transaction commit happened
    # after we did the snapshot delete request), the non overwritten extents get
    # referenced twice, once by the default subvolume and once by the snapshot.
    $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

    # Now move file f from directory a to directory b and fsync directory a.
    # The fsync on the directory a triggers a transaction commit (because a file
    # was moved from it to another directory) and the file fsync leaves a log tree
    # with file extent items to replay.
    mv $SCRATCH_MNT/a/f $SCRATCH_MNT/b/f
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

    echo "File digest before power failure:"
    md5sum $SCRATCH_MNT/foobar | _filter_scratch

    # Now simulate a power failure and mount the filesystem to replay the log tree.
    # After the log tree was replayed, we used to hit a BUG_ON() when processing
    # the root orphan item for the deleted snapshot. This is because when processing
    # an orphan root the code expected to be the first code inserting the root into
    # the fs_info->fs_root_radix radix tree, while in reality it was the second
    # caller attempting to do it - the first caller was the transaction commit that
    # took place after replaying the log tree, when updating the qgroup counters.
    _flakey_drop_and_remount

    echo "File digest after power failure:"
    # Must match what we got before the power failure.
    md5sum $SCRATCH_MNT/foobar | _filter_scratch

    _unmount_flakey
    status=0
    exit

    Fixes: 2d9e97761087 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
    Cc: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Reviewed-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Filipe Manana
     

02 Mar, 2016

8 commits

  • When logging that an inode exists, for example as part of a directory
    fsync operation, we were collecting any ordered extents for the inode but
    we ended up doing nothing with them except tagging them as processed, by
    setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
    subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
    collecting and processing them. This created a time window where a second
    fsync against the inode, using the fast path, ended up not logging the
    checksums for the new extents but it logged the extents since they were
    part of the list of modified extents. This happened because the ordered
    extents were not collected and checksums were not yet added to the csum
    tree - the ordered extents have not gone through btrfs_finish_ordered_io()
    yet (which is where we add them to the csum tree by calling
    inode.c:add_pending_csums()).

    So fix this by not collecting an inode's ordered extents if we are logging
    it with the LOG_INODE_EXISTS mode.
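
    A small model of the behaviour after the fix, under stated
    assumptions: ordered extents are dicts with a "logged" flag standing in
    for BTRFS_ORDERED_LOGGED, the numeric values of the two modes are
    illustrative, and collect_ordered is a hypothetical name.

```python
LOG_INODE_ALL = 0      # illustrative values, not taken from the source
LOG_INODE_EXISTS = 1


def collect_ordered(ordered_extents, mode):
    """Return the ordered extents to process for this log operation.

    In LOG_INODE_EXISTS mode nothing is collected, so nothing is tagged
    as logged and a later full fsync can still pick the extents up.
    """
    if mode == LOG_INODE_EXISTS:
        return []
    picked = [oe for oe in ordered_extents if not oe["logged"]]
    for oe in picked:
        oe["logged"] = True
    return picked
```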

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • If we're about to do a fast fsync for an inode and btrfs_inode_in_log()
    returns false, it's possible that we had an ordered extent in progress
    (btrfs_finish_ordered_io() not run yet) when we noticed that the inode's
    last_trans field was not greater than the id of the last committed
    transaction, but shortly after, before we checked if there were any
    ongoing ordered extents, the ordered extent had just completed and
    removed itself from the inode's ordered tree, in which case we end up not
    logging the inode, losing some data if a power failure or crash happens
    after the fsync handler returns and before the transaction is committed.

    Fix this by checking first if there are any ongoing ordered extents
    before comparing the inode's last_trans with the id of the last committed
    transaction - when it completes, an ordered extent always updates the
    inode's last_trans before it removes itself from the inode's ordered
    tree (at btrfs_finish_ordered_io()).
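
    The effect of the ordering can be shown with a toy model. The class
    and function names below are hypothetical, and the "between" callback
    simply simulates an ordered extent completing in the window between
    the two checks.

```python
class Inode:
    def __init__(self, last_trans, ordered):
        self.last_trans = last_trans
        self.ordered = ordered  # number of in-flight ordered extents

    def finish_ordered(self, cur_trans):
        # Same order as described above: update last_trans first, then
        # leave the ordered tree.
        self.last_trans = cur_trans
        self.ordered -= 1


def can_skip_racy(inode, last_committed, between=lambda: None):
    """Old order: compare last_trans first, look at ordered extents second."""
    trans_ok = inode.last_trans <= last_committed
    between()
    return trans_ok and inode.ordered == 0


def can_skip_fixed(inode, last_committed, between=lambda: None):
    """New order: look at ordered extents first, compare last_trans second."""
    no_ordered = inode.ordered == 0
    between()
    return no_ordered and inode.last_trans <= last_committed
```

    With the old order, an ordered extent that completes inside the window
    makes both checks pass and the fsync is wrongly skipped. With the new
    order, either the ordered extent is still visible, or its update to
    last_trans is.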

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • In the listxattrs handler, we were not listing all the xattrs that are
    packed in the same btree item, which happens when multiple xattrs have
    names that, when hashed with crc32c, produce the same checksum value.

    Fix this by processing them all.
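
    A rough sketch of the difference, assuming a deliberately weak
    stand-in hash instead of crc32c so that collisions are easy to
    produce. All names here are hypothetical, and items are modelled as a
    dict from hash value to a list of (name, value) entries.

```python
def name_hash(name):
    # Weak stand-in for crc32c: anagrams always collide.
    return sum(name.encode()) % 8


def build_items(xattrs):
    """Pack xattrs whose names hash to the same value into one item."""
    items = {}
    for name, value in xattrs:
        items.setdefault(name_hash(name), []).append((name, value))
    return items


def listxattr_first_only(items):
    """Behaviour before the fix: only the first entry of each item."""
    return sorted(entries[0][0] for entries in items.values())


def listxattr_all(items):
    """Behaviour after the fix: every entry packed in every item."""
    return sorted(name for entries in items.values() for name, _ in entries)
```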

    The following test case for xfstests reproduces the issue:

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1 # failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
    cd /
    rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/attr

    # real QA test starts here
    _supported_fs generic
    _supported_os Linux
    _require_scratch
    _require_attrs

    rm -f $seqres.full

    _scratch_mkfs >>$seqres.full 2>&1
    _scratch_mount

    # Create our test file with a few xattrs. The first 3 xattrs have names
    # that, when given as input to a crc32c function, result in the same checksum.
    # This made btrfs list only one of the xattrs through listxattrs system call
    # (because it packs xattrs with the same name checksum into the same btree
    # item).
    touch $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
    $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile

    # Now call getfattr with --dump, which calls the listxattrs system call.
    # It should list all the xattrs we have set before.
    $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch

    status=0
    exit

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • While running a test with a mix of buffered IO and direct IO against
    the same files I hit a deadlock reported by the following trace:

    [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
    [11642.142452] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.146332] kworker/u32:3 D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
    [11642.149771] 0 15282 2 0x00000000
    [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
    [11642.154074] ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
    [11642.156722] ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
    [11642.159205] 0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
    [11642.161403] Call Trace:
    [11642.162129] [] ? bit_wait+0x2f/0x2f
    [11642.163396] [] schedule+0x82/0x9a
    [11642.164871] [] schedule_timeout+0x43/0x109
    [11642.167020] [] ? bit_wait+0x2f/0x2f
    [11642.167931] [] ? trace_hardirqs_on_caller+0x17b/0x197
    [11642.182320] [] ? trace_hardirqs_on+0xd/0xf
    [11642.183762] [] ? timekeeping_get_ns+0xe/0x33
    [11642.185308] [] ? ktime_get+0x41/0x52
    [11642.186782] [] io_schedule_timeout+0xa0/0x102
    [11642.188217] [] ? io_schedule_timeout+0xa0/0x102
    [11642.189626] [] bit_wait_io+0x1b/0x39
    [11642.190803] [] __wait_on_bit_lock+0x4c/0x90
    [11642.192158] [] __lock_page+0x66/0x68
    [11642.193379] [] ? autoremove_wake_function+0x3a/0x3a
    [11642.194831] [] lock_page+0x31/0x34 [btrfs]
    [11642.197068] [] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
    [11642.199188] [] extent_writepages+0x4b/0x5c [btrfs]
    [11642.200723] [] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
    [11642.202465] [] btrfs_writepages+0x28/0x2a [btrfs]
    [11642.203836] [] do_writepages+0x23/0x2c
    [11642.205624] [] __filemap_fdatawrite_range+0x5a/0x61
    [11642.207057] [] filemap_fdatawrite_range+0x13/0x15
    [11642.208529] [] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
    [11642.210375] [] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
    [11642.212132] [] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
    [11642.213837] [] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
    [11642.215457] [] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
    [11642.217095] [] process_one_work+0x256/0x48b
    [11642.218324] [] worker_thread+0x1f5/0x2a7
    [11642.219466] [] ? rescuer_thread+0x289/0x289
    [11642.220801] [] kthread+0xd4/0xdc
    [11642.222032] [] ? kthread_parkme+0x24/0x24
    [11642.223190] [] ret_from_fork+0x3f/0x70
    [11642.224394] [] ? kthread_parkme+0x24/0x24
    [11642.226295] 2 locks held by kworker/u32:3/15282:
    [11642.227273] #0: ("%s-%s""btrfs", name){++++.+}, at: [] process_one_work+0x165/0x48b
    [11642.229412] #1: ((&work->normal_work)){+.+.+.}, at: [] process_one_work+0x165/0x48b
    [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
    [11642.232872] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.235776] kworker/u32:8 D ffff88020de5f848 0 15289 2 0x00000000
    [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
    [11642.238670] ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
    [11642.240475] ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
    [11642.242154] 0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
    [11642.243715] Call Trace:
    [11642.244390] [] ? bit_wait+0x2f/0x2f
    [11642.245432] [] schedule+0x82/0x9a
    [11642.246392] [] schedule_timeout+0x43/0x109
    [11642.247479] [] ? bit_wait+0x2f/0x2f
    [11642.248551] [] ? trace_hardirqs_on_caller+0x17b/0x197
    [11642.249968] [] ? trace_hardirqs_on+0xd/0xf
    [11642.251043] [] ? timekeeping_get_ns+0xe/0x33
    [11642.252202] [] ? ktime_get+0x41/0x52
    [11642.253210] [] io_schedule_timeout+0xa0/0x102
    [11642.254307] [] ? io_schedule_timeout+0xa0/0x102
    [11642.256118] [] bit_wait_io+0x1b/0x39
    [11642.257131] [] __wait_on_bit_lock+0x4c/0x90
    [11642.258200] [] __lock_page+0x66/0x68
    [11642.259168] [] ? autoremove_wake_function+0x3a/0x3a
    [11642.260516] [] lock_page+0x31/0x34 [btrfs]
    [11642.261841] [] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
    [11642.263531] [] extent_writepages+0x4b/0x5c [btrfs]
    [11642.264747] [] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
    [11642.266148] [] btrfs_writepages+0x28/0x2a [btrfs]
    [11642.267264] [] do_writepages+0x23/0x2c
    [11642.268280] [] __writeback_single_inode+0xda/0x5ba
    [11642.269407] [] writeback_sb_inodes+0x27b/0x43d
    [11642.270476] [] __writeback_inodes_wb+0x76/0xae
    [11642.271547] [] wb_writeback+0x19e/0x41c
    [11642.272588] [] wb_workfn+0x201/0x341
    [11642.273523] [] ? wb_workfn+0x201/0x341
    [11642.274479] [] process_one_work+0x256/0x48b
    [11642.275497] [] worker_thread+0x1f5/0x2a7
    [11642.276518] [] ? rescuer_thread+0x289/0x289
    [11642.277520] [] ? rescuer_thread+0x289/0x289
    [11642.278517] [] kthread+0xd4/0xdc
    [11642.279371] [] ? kthread_parkme+0x24/0x24
    [11642.280468] [] ret_from_fork+0x3f/0x70
    [11642.281607] [] ? kthread_parkme+0x24/0x24
    [11642.282604] 3 locks held by kworker/u32:8/15289:
    [11642.283423] #0: ("writeback"){++++.+}, at: [] process_one_work+0x165/0x48b
    [11642.285629] #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [] process_one_work+0x165/0x48b
    [11642.287538] #2: (&type->s_umount_key#37){+++++.}, at: [] trylock_super+0x1b/0x4b
    [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
    [11642.290547] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.292864] fdm-stress D ffff88022c107c20 0 26848 26591 0x00000000
    [11642.294118] ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
    [11642.295602] ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
    [11642.297098] ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
    [11642.298433] Call Trace:
    [11642.298896] [] schedule+0x82/0x9a
    [11642.299738] [] lock_extent_bits+0xfe/0x1a3 [btrfs]
    [11642.300833] [] ? add_wait_queue_exclusive+0x44/0x44
    [11642.301943] [] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
    [11642.303270] [] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
    [11642.304552] [] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
    [11642.305782] [] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
    [11642.306878] [] __vfs_write+0x7c/0xa5
    [11642.307729] [] vfs_write+0x9d/0xe8
    [11642.308602] [] SyS_write+0x50/0x7e
    [11642.309410] [] entry_SYSCALL_64_fastpath+0x12/0x6b
    [11642.310403] 3 locks held by fdm-stress/26848:
    [11642.311108] #0: (&f->f_pos_lock){+.+.+.}, at: [] __fdget_pos+0x3a/0x40
    [11642.312578] #1: (sb_writers#11){.+.+.+}, at: [] __sb_start_write+0x5f/0xb0
    [11642.314170] #2: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [] btrfs_file_write_iter+0x73/0x408 [btrfs]
    [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
    [11642.317842] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.319959] fdm-stress D ffff8801964ffa68 0 26849 26591 0x00000000
    [11642.321312] ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
    [11642.322555] ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
    [11642.323715] ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
    [11642.325096] Call Trace:
    [11642.325532] [] schedule+0x82/0x9a
    [11642.326303] [] schedule_timeout+0x43/0x109
    [11642.327180] [] ? mark_held_locks+0x5e/0x74
    [11642.328114] [] ? _raw_spin_unlock_irq+0x2c/0x4a
    [11642.329051] [] ? trace_hardirqs_on_caller+0x17b/0x197
    [11642.330053] [] __wait_for_common+0x109/0x147
    [11642.330952] [] ? __wait_for_common+0x109/0x147
    [11642.331869] [] ? usleep_range+0x4a/0x4a
    [11642.332925] [] ? wake_up_q+0x47/0x47
    [11642.333736] [] wait_for_completion+0x24/0x26
    [11642.334672] [] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
    [11642.335858] [] btrfs_mksubvol+0x224/0x45d [btrfs]
    [11642.336854] [] ? add_wait_queue_exclusive+0x44/0x44
    [11642.337820] [] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
    [11642.339026] [] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
    [11642.340214] [] btrfs_ioctl+0x590/0x27bd [btrfs]
    [11642.341123] [] ? mutex_unlock+0xe/0x10
    [11642.341934] [] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
    [11642.342936] [] ? __lock_is_held+0x3c/0x57
    [11642.343772] [] ? rcu_read_unlock+0x3e/0x5d
    [11642.344673] [] do_vfs_ioctl+0x458/0x4dc
    [11642.346024] [] ? __fget_light+0x62/0x71
    [11642.346873] [] SyS_ioctl+0x57/0x79
    [11642.347720] [] entry_SYSCALL_64_fastpath+0x12/0x6b
    [11642.350222] 4 locks held by fdm-stress/26849:
    [11642.350898] #0: (sb_writers#11){.+.+.+}, at: [] __sb_start_write+0x5f/0xb0
    [11642.352375] #1: (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [] btrfs_mksubvol+0x4b/0x45d [btrfs]
    [11642.354072] #2: (&fs_info->subvol_sem){++++..}, at: [] btrfs_mksubvol+0xf4/0x45d [btrfs]
    [11642.355647] #3: (&root->ordered_extent_mutex){+.+...}, at: [] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
    [11642.358508] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.368625] fdm-stress D ffff88021f167688 0 26850 26591 0x00000000
    [11642.369716] ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
    [11642.370950] ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
    [11642.372210] 0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
    [11642.373430] Call Trace:
    [11642.373853] [] ? bit_wait+0x2f/0x2f
    [11642.374623] [] schedule+0x82/0x9a
    [11642.375948] [] schedule_timeout+0x43/0x109
    [11642.376862] [] ? bit_wait+0x2f/0x2f
    [11642.377637] [] ? trace_hardirqs_on_caller+0x17b/0x197
    [11642.378610] [] ? trace_hardirqs_on+0xd/0xf
    [11642.379457] [] ? timekeeping_get_ns+0xe/0x33
    [11642.380366] [] ? ktime_get+0x41/0x52
    [11642.381353] [] io_schedule_timeout+0xa0/0x102
    [11642.382255] [] ? io_schedule_timeout+0xa0/0x102
    [11642.383162] [] bit_wait_io+0x1b/0x39
    [11642.383945] [] __wait_on_bit_lock+0x4c/0x90
    [11642.384875] [] __lock_page+0x66/0x68
    [11642.385749] [] ? autoremove_wake_function+0x3a/0x3a
    [11642.386721] [] lock_page+0x31/0x34 [btrfs]
    [11642.387596] [] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
    [11642.389030] [] extent_writepages+0x4b/0x5c [btrfs]
    [11642.389973] [] ? rcu_read_lock_sched_held+0x61/0x69
    [11642.390939] [] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
    [11642.392271] [] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
    [11642.393305] [] btrfs_writepages+0x28/0x2a [btrfs]
    [11642.394239] [] do_writepages+0x23/0x2c
    [11642.395045] [] __filemap_fdatawrite_range+0x5a/0x61
    [11642.395991] [] filemap_fdatawrite_range+0x13/0x15
    [11642.397144] [] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
    [11642.398392] [] ? clear_extent_bit+0x17/0x19 [btrfs]
    [11642.399363] [] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
    [11642.400445] [] ? dio_bio_add_page+0x3d/0x54
    [11642.401309] [] ? submit_page_section+0x7b/0x111
    [11642.402213] [] do_blockdev_direct_IO+0x685/0xc24
    [11642.403139] [] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
    [11642.404360] [] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
    [11642.406187] [] __blockdev_direct_IO+0x31/0x33
    [11642.407070] [] ? __blockdev_direct_IO+0x31/0x33
    [11642.407990] [] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
    [11642.409192] [] btrfs_direct_IO+0x1c7/0x27e [btrfs]
    [11642.410146] [] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
    [11642.411291] [] generic_file_read_iter+0x89/0x4e1
    [11642.412263] [] ? mark_lock+0x24/0x201
    [11642.413057] [] __vfs_read+0x79/0x9d
    [11642.413897] [] vfs_read+0x8f/0xd2
    [11642.414708] [] SyS_read+0x50/0x7e
    [11642.415573] [] entry_SYSCALL_64_fastpath+0x12/0x6b
    [11642.416572] 1 lock held by fdm-stress/26850:
    [11642.417345] #0: (&f->f_pos_lock){+.+.+.}, at: [] __fdget_pos+0x3a/0x40
    [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
    [11642.419698] Not tainted 4.4.0-rc6-btrfs-next-21+ #1
    [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [11642.421807] fdm-stress D ffff880196483d28 0 26851 26591 0x00000000
    [11642.422878] ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
    [11642.424149] ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
    [11642.425374] ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
    [11642.426591] Call Trace:
    [11642.427013] [] schedule+0x82/0x9a
    [11642.427856] [] schedule_preempt_disabled+0x18/0x24
    [11642.428852] [] mutex_lock_nested+0x1d7/0x3b4
    [11642.429743] [] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    [11642.430911] [] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    [11642.432102] [] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
    [11642.433259] [] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
    [11642.434431] [] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
    [11642.436079] [] btrfs_sync_fs+0xe0/0x1ad [btrfs]
    [11642.437009] [] ? SyS_tee+0x23c/0x23c
    [11642.437860] [] sync_fs_one_sb+0x20/0x22
    [11642.438723] [] iterate_supers+0x75/0xc2
    [11642.439597] [] sys_sync+0x52/0x80
    [11642.440454] [] entry_SYSCALL_64_fastpath+0x12/0x6b
    [11642.441533] 3 locks held by fdm-stress/26851:
    [11642.442370] #0: (&type->s_umount_key#37){+++++.}, at: [] iterate_supers+0x5f/0xc2
    [11642.444043] #1: (&fs_info->ordered_operations_mutex){+.+...}, at: [] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
    [11642.446010] #2: (&root->ordered_extent_mutex){+.+...}, at: [] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]

    This happened because under specific timings the path for direct IO reads
    can deadlock with concurrent buffered writes. The diagram below shows how
    this happens for an example file that has the following layout:

    [ extent A ]  [ extent B ]  [ ....
    0K            4K            8K

    CPU 1                                               CPU 2                               CPU 3

    DIO read against range
    [0K, 8K[ starts

    btrfs_direct_IO()
      --> calls btrfs_get_blocks_direct()
          which finds the extent map for the
          extent A and leaves the range
          [0K, 4K[ locked in the inode's
          io tree

                                                        buffered write against
                                                        range [4K, 8K[ starts

                                                        __btrfs_buffered_write()
                                                          --> dirties page at 4K

                                                                                            a user space
                                                                                            task calls sync
                                                                                            for e.g or
                                                                                            writepages() is
                                                                                            invoked by mm

                                                                                            writepages()
                                                                                              run_delalloc_range()
                                                                                                cow_file_range()
                                                                                                  --> ordered extent X
                                                                                                      for the buffered
                                                                                                      write is created
                                                                                                      and
                                                                                                      writeback starts

      --> calls btrfs_get_blocks_direct()
          again, without submitting first
          a bio for reading extent A, and
          finds the extent map for extent B

      --> calls lock_extent_direct()

          --> locks range [4K, 8K[
          --> finds ordered extent X
              covering range [4K, 8K[
          --> unlocks range [4K, 8K[

                                                        buffered write against
                                                        range [0K, 8K[ starts

                                                        __btrfs_buffered_write()
                                                          prepare_pages()
                                                            --> locks pages with
                                                                offsets 0 and 4K
                                                          lock_and_cleanup_extent_if_need()
                                                            --> blocks attempting to
                                                                lock range [0K, 8K[ in
                                                                the inode's io tree,
                                                                because the range [0, 4K[
                                                                is already locked by the
                                                                direct IO task at CPU 1

          --> calls
              btrfs_start_ordered_extent(oe X)

              btrfs_start_ordered_extent(oe X)

                --> At this point writeback for ordered
                    extent X has not finished yet

                filemap_fdatawrite_range()
                  btrfs_writepages()
                    extent_writepages()
                      extent_write_cache_pages()
                        --> finds page with offset 0
                            with the writeback tag
                            (and not dirty)
                        --> tries to lock it
                            --> deadlock, task at CPU 2
                                has the page locked and
                                is blocked on the io range
                                [0, 4K[ that was locked
                                earlier by this task

    So fix this by falling back to a buffered read in the direct IO read path
    when an ordered extent for a buffered write is found.
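
    The circular wait can be checked with a tiny wait-for graph. This is
    an illustrative model only: the task names are labels chosen here, and
    find_cycle is a hypothetical helper, not kernel code. Each blocked
    task maps to the single task holding the resource it needs.

```python
def find_cycle(waits_for):
    """Return the tasks forming a wait-for cycle, or None if there is none.

    waits_for maps each blocked task to the one task it is waiting on.
    """
    for start in waits_for:
        seen = [start]
        cur = waits_for.get(start)
        while cur is not None:
            if cur == start:
                return seen          # walked back to where we began
            if cur in seen:
                break                # a cycle that does not include start
            seen.append(cur)
            cur = waits_for.get(cur)
    return None


# Before the fix: the DIO read holds io range [0, 4K[ and waits for the
# page at offset 0, while the buffered write holds that page and waits
# for the io range.
before = {"dio_read": "buffered_write", "buffered_write": "dio_read"}

# After the fix: the DIO read no longer waits (it falls back to a
# buffered read), so only the buffered write is waiting.
after = {"buffered_write": "dio_read"}
```

    find_cycle(before) reports both tasks, and find_cycle(after) returns
    None because the read side no longer blocks while holding the range.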

    Signed-off-by: Filipe Manana
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • When using the same file as the source and destination for a dedup
    (extent_same ioctl) operation we were allowing it to dedup to a
    destination offset beyond the file's size, which doesn't make sense and
    it's not allowed for the case where the source and destination files are
    not the same file. This made the deduplication operation successful only
    when the source range corresponded to a hole, a prealloc extent or an
    extent with all bytes having a value of 0x00. This was also leaving a
    file hole (between i_size and destination offset) without the
    corresponding file extent items, which can be reproduced with the
    following steps for example:

    $ mkfs.btrfs -f /dev/sdi
    $ mount /dev/sdi /mnt/sdi

    $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
    wrote 404990/404990 bytes at offset 304457
    395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)

    $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
    Deduping 2 total files
    (28672, 24576): /mnt/sdi/foobar
    (929792, 24576): /mnt/sdi/foobar
    1 files asked to be deduped
    i: 0, status: 0, bytes_deduped: 24576
    24576 total bytes deduped in this operation

    $ umount /mnt/sdi
    $ btrfsck /dev/sdi
    Checking filesystem on /dev/sdi
    UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
    checking extents
    checking free space cache
    checking fs roots
    root 5 inode 257 errors 100, file extent discount
    Found file extent holes:
    start: 712704, len: 217088
    found 540673 bytes used err is 1
    total csum bytes: 400
    total tree bytes: 131072
    total fs tree bytes: 32768
    total extent tree bytes: 16384
    btree space waste bytes: 123675
    file data blocks allocated: 671744
    referenced 671744
    btrfs-progs v4.2.3

    So fix this by not allowing the destination to go beyond the file's size,
    just as we do for the case where the source and destination files are not
    the same.
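
    The hole reported by btrfsck above can be rederived from the numbers in
    the reproducer. The following Python sketch does that arithmetic and
    shows the kind of bounds check the fix describes. It assumes a 4096
    byte sector size, and dedup_dest_allowed() is a hypothetical helper,
    not the kernel function.

```python
SECTOR_SIZE = 4096  # assumption: sector size used by the test filesystem


def round_up(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


def dedup_dest_allowed(dst_offset, length, i_size):
    """Hypothetical check: the destination range must end within the file."""
    return dst_offset + length <= i_size


# xfs_io wrote 404990 bytes at offset 304457, so that is where the file ends.
i_size = 304457 + 404990

# The dedup destination started at 929792, leaving a hole between the
# sector-aligned end of the file and the destination offset.
hole_start = round_up(i_size, SECTOR_SIZE)
hole_len = 929792 - hole_start

print(hole_start, hole_len)
```

    The two printed values correspond to the "start" and "len" fields of
    the "Found file extent holes" line in the btrfsck output.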

    A test for xfstests follows.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • We have two cases where we end up deleting a file at log replay time
    when we should not. For this to happen the file must have been renamed
    and a directory inode must have been fsynced/logged.

    Two examples that exercise these two cases are listed below.

    Case 1)

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt
    $ mkdir -p /mnt/a/b
    $ mkdir /mnt/c
    $ touch /mnt/a/b/foo
    $ sync
    $ mv /mnt/a/b/foo /mnt/c/
    # Create file bar just to make sure the fsync on directory a/ does
    # something and it's not a no-op.
    $ touch /mnt/a/bar
    $ xfs_io -c "fsync" /mnt/a
    < power fail / crash >

    The next time the filesystem is mounted, the log replay procedure
    deletes file foo.

    Case 2)

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt
    $ mkdir /mnt/a
    $ mkdir /mnt/b
    $ mkdir /mnt/c
    $ touch /mnt/a/foo
    $ ln /mnt/a/foo /mnt/b/foo_link
    $ touch /mnt/b/bar
    $ sync
    $ unlink /mnt/b/foo_link
    $ mv /mnt/b/bar /mnt/c/
    $ xfs_io -c "fsync" /mnt/a/foo
    < power fail / crash >

    The next time the filesystem is mounted, the log replay procedure
    deletes file bar.

    The files are deleted because when we log inodes
    other than the fsync target inode, we ignore their last_unlink_trans
    value and leave the log without enough information to later replay the
    rename operations. So we need to look at the last_unlink_trans values
    and fall back to a transaction commit if they are greater than the
    id of the last committed transaction.

    So fix this by looking at the last_unlink_trans values and falling back to
    transaction commits when needed. Also, when logging other inodes (for
    case 1 we logged descendants of the fsync target inode while for case 2
    we logged ascendants) we need to care about concurrent tasks updating
    the last_unlink_trans of inodes we are logging (which was already an
    existing problem in check_parent_dirs_for_sync()). Since we can not
    acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes
    deadlocks with other concurrent operations that acquire the i_mutex of
    2 inodes (other fsyncs or renames for example), we need to serialize on
    the log_mutex of the inode we are logging. A task setting a new value for
    an inode's last_unlink_trans must acquire the inode's log_mutex and it
    must do this update before doing the actual unlink operation (which is
    already the case except when deleting a snapshot). Conversely the task
    logging the inode must first log the inode and then check the inode's
    last_unlink_trans value while holding its log_mutex, as if its value is
    not greater than the id of the last committed transaction it means it
    logged a safe state of the inode's items, while if its value is
    greater than the id of the last committed transaction it means the inode
    state it has logged might not be safe (the concurrent task might have
    just updated last_unlink_trans but hasn't yet done the unlink operation)
    and therefore a transaction commit must be done.
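
    The ordering rule described above can be modelled in a few lines. This
    is a Python sketch of the protocol only, under the assumption that a
    plain mutex stands in for the inode's log_mutex. The Inode class and
    both helper functions are illustrative names, not the kernel's.

```python
import threading


class Inode:
    """Toy inode carrying only the two fields the rule talks about."""

    def __init__(self):
        self.log_mutex = threading.Lock()
        self.last_unlink_trans = 0


def record_unlink(inode, current_transid, do_unlink):
    # The updater publishes the new value under log_mutex *before* doing
    # the actual unlink operation.
    with inode.log_mutex:
        inode.last_unlink_trans = current_transid
    do_unlink()


def log_inode_needs_commit(inode, last_committed, log_items):
    # The logger first logs the inode, then checks last_unlink_trans while
    # holding log_mutex. A value greater than the last committed
    # transaction id means the logged state might not be safe.
    log_items(inode)
    with inode.log_mutex:
        return inode.last_unlink_trans > last_committed


inode = Inode()
safe = log_inode_needs_commit(inode, 10, lambda i: None)
record_unlink(inode, 11, lambda: None)
unsafe = log_inode_needs_commit(inode, 10, lambda i: None)
```

    Here the first call reports that no commit is needed, and the second,
    made after an unlink in transaction 11, reports that a commit is
    required.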

    Test cases for xfstests follow in separate patches.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • If we delete a snapshot, fsync its parent directory and crash/power fail
    before the next transaction commit, on the next mount when we attempt to
    replay the log tree of the root containing the parent directory we will
    fail and prevent the filesystem from mounting, which is solvable by wiping
    out the log trees with the btrfs-zero-log tool but very inconvenient as
    we will lose any data and metadata fsynced before the parent directory
    was fsynced.

    For example:

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt
    $ mkdir /mnt/testdir
    $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
    $ btrfs subvolume delete /mnt/testdir/snap
    $ xfs_io -c "fsync" /mnt/testdir
    < crash / power failure and reboot >
    $ mount /dev/sdc /mnt
    mount: mount(2) failed: No such file or directory

    And in dmesg/syslog we get the following message and trace:

    [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
    [192066.363010] ------------[ cut here ]------------
    [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
    [192066.367250] BTRFS: Transaction aborted (error -2)
    [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
    [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G W 4.4.0-rc6-btrfs-next-20+ #1
    [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [192066.380889] 0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
    [192066.382561] ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
    [192066.384191] ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
    [192066.385827] Call Trace:
    [192066.386373] [] dump_stack+0x4e/0x79
    [192066.387387] [] warn_slowpath_common+0x99/0xb2
    [192066.388429] [] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
    [192066.389236] [] warn_slowpath_fmt+0x48/0x50
    [192066.389884] [] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
    [192066.390621] [] ? iput+0xb0/0x266
    [192066.391200] [] btrfs_unlink_inode+0x1c/0x3d [btrfs]
    [192066.391930] [] check_item_in_log+0x1fe/0x29b [btrfs]
    [192066.392715] [] replay_dir_deletes+0x167/0x1cf [btrfs]
    [192066.393510] [] replay_one_buffer+0x417/0x570 [btrfs]
    [192066.394241] [] walk_up_log_tree+0x10e/0x1dc [btrfs]
    [192066.394958] [] walk_log_tree+0xa5/0x190 [btrfs]
    [192066.395628] [] btrfs_recover_log_trees+0x239/0x32c [btrfs]
    [192066.396790] [] ? replay_one_extent+0x50a/0x50a [btrfs]
    [192066.397891] [] open_ctree+0x1d8b/0x2167 [btrfs]
    [192066.398897] [] btrfs_mount+0x5ef/0x729 [btrfs]
    [192066.399823] [] ? trace_hardirqs_on+0xd/0xf
    [192066.400739] [] ? lockdep_init_map+0xb9/0x1b3
    [192066.401700] [] mount_fs+0x67/0x131
    [192066.402482] [] vfs_kern_mount+0x6c/0xde
    [192066.403930] [] btrfs_mount+0x1cb/0x729 [btrfs]
    [192066.404831] [] ? trace_hardirqs_on+0xd/0xf
    [192066.405726] [] ? lockdep_init_map+0xb9/0x1b3
    [192066.406621] [] mount_fs+0x67/0x131
    [192066.407401] [] vfs_kern_mount+0x6c/0xde
    [192066.408247] [] do_mount+0x893/0x9d2
    [192066.409047] [] ? strndup_user+0x3f/0x8c
    [192066.409842] [] SyS_mount+0x75/0xa1
    [192066.410621] [] entry_SYSCALL_64_fastpath+0x12/0x6b
    [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
    [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
    [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
    [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
    [192066.444613] BTRFS: open_ctree failed

    This happens because when we are replaying the log and processing the
    directory entry pointing to the snapshot in the subvolume tree, we treat
    its btrfs_dir_item item as having a location with a key type matching
    BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
    BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
    object id refers to a root number and not to an inode in the root
    containing the parent directory.

    So fix this by triggering a transaction commit if an fsync against the
    parent directory is requested after deleting a snapshot. This is the
    simplest approach for a rare use case. Some alternative that avoids the
    transaction commit would require more code to explicitly delete the
    snapshot at log replay time (factoring out common code from ioctl.c:
    btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
    log tree of the snapshot's root from the log root of the root of tree
    roots, amongst other steps.
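
    To illustrate the distinction that log replay was missing, the Python
    sketch below classifies a directory entry by the key type of its
    location. It only shows why the two types need different handling. It
    is not the fix, which forces a transaction commit instead. The
    classify_dir_entry() helper and the example keys are made up here; the
    two constants use the values from the btrfs on-disk format.

```python
BTRFS_INODE_ITEM_KEY = 1
BTRFS_ROOT_ITEM_KEY = 132


def classify_dir_entry(location):
    """Classify a dir item by its location key (objectid, type, offset).

    Returns a (kind, objectid) pair: for "inode" the objectid is an inode
    number in the same root, for "root" it is the number of another root
    (a subvolume or snapshot).
    """
    objectid, key_type, _offset = location
    if key_type == BTRFS_INODE_ITEM_KEY:
        return ("inode", objectid)
    if key_type == BTRFS_ROOT_ITEM_KEY:
        return ("root", objectid)
    raise ValueError("unexpected key type %d" % key_type)


# Illustrative keys only: the same objectid means different things
# depending on the key type.
regular_entry = classify_dir_entry((257, BTRFS_INODE_ITEM_KEY, 0))
snapshot_entry = classify_dir_entry((257, BTRFS_ROOT_ITEM_KEY, 0))
```

    Treating the second entry like the first is what made replay look for
    an inode that does not exist in the parent directory's root.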

    A test case for xfstests that triggers the issue follows.

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1 # failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
    _cleanup_flakey
    cd /
    rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/dmflakey

    # real QA test starts here
    _need_to_be_root
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_dm_target flakey
    _require_metadata_journaling $SCRATCH_DEV

    rm -f $seqres.full

    _scratch_mkfs >>$seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create a snapshot at the root of our filesystem (mount point path), delete it,
    # fsync the mount point path, crash and mount to replay the log. This should
    # succeed and after the filesystem is mounted the snapshot should not be visible
    # anymore.
    _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
    _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT
    _flakey_drop_and_remount
    [ -e $SCRATCH_MNT/snap1 ] && \
    echo "Snapshot snap1 still exists after log replay"

    # Similar scenario as above, but this time the snapshot is created inside a
    # directory and not directly under the root (mount point path).
    mkdir $SCRATCH_MNT/testdir
    _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
    _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
    _flakey_drop_and_remount
    [ -e $SCRATCH_MNT/testdir/snap2 ] && \
    echo "Snapshot snap2 still exists after log replay"

    _unmount_flakey

    echo "Silence is golden"
    status=0
    exit

    Signed-off-by: Filipe Manana
    Tested-by: Liu Bo
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • …ux into for-linus-4.6

    Btrfs patchsets for 4.6

    Chris Mason
     

26 Feb, 2016

10 commits


23 Feb, 2016

6 commits

  • Xfstests btrfs/011 reports a deadlock warning:

    [ 1226.649039] =========================================================
    [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
    [ 1226.649039] 4.1.0+ #270 Not tainted
    [ 1226.649039] ---------------------------------------------------------
    [ 1226.652955] kswapd0/46 just changed the state of lock:
    [ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x45/0x1d0
    [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
    [ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}

    and interrupts could create inverse lock ordering between them.

    [ 1226.652955]
    other info that might help us debug this:
    [ 1226.652955] Chain exists of:
    &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

    [ 1226.652955] Possible interrupt unsafe locking scenario:

    [ 1226.652955] CPU0 CPU1
    [ 1226.652955] ---- ----
    [ 1226.652955] lock(&fs_info->dev_replace.lock);
    [ 1226.652955] local_irq_disable();
    [ 1226.652955] lock(&delayed_node->mutex);
    [ 1226.652955] lock(&found->groups_sem);
    [ 1226.652955]
    [ 1226.652955] lock(&delayed_node->mutex);
    [ 1226.652955]
    *** DEADLOCK ***

    Commit 084b6e7c7607 ("btrfs: Fix a lockdep warning when running xfstest.") tried
    to fix a similar issue that produces exactly the same warning, but even with that
    fix we still run into this one.

    The above lock chain comes from
    btrfs_commit_transaction
    ->btrfs_run_delayed_items
    ...
    ->__btrfs_update_delayed_inode
    ...
    ->__btrfs_cow_block
    ...
    ->find_free_extent
    ->cache_block_group
    ->load_free_space_cache
    ->btrfs_readpages
    ->submit_one_bio
    ...
    ->__btrfs_map_block
    ->btrfs_dev_replace_lock

    However, under high memory pressure, tasks which hold dev_replace.lock can
    be interrupted by kswapd, which then tries to release memory occupied
    by superblocks, inodes and dentries. There it may call evict_inode, which
    leads to

    [ 1226.652955] [] __btrfs_release_delayed_node+0x45/0x1d0
    [ 1226.652955] [] btrfs_remove_delayed_node+0x24/0x30
    [ 1226.652955] [] btrfs_evict_inode+0x34e/0x700

    delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), which leads
    to an ABBA deadlock.

    To fix this, we can use a "blocking rwlock" like the one used for extent_buffer, but
    things are simpler here since we only need to turn the read side's spinlock into a
    blocking lock.

    With this, btrfs/011 no longer produces warnings in dmesg.

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
     
  • Reviewed-by: Anand Jain
    Signed-off-by: David Sterba

    David Sterba
     
  • The control device is accessible when no filesystem is mounted and we
    may want to query features supported by the module. This is already
    possible using the sysfs files; this ioctl is for parity and
    convenience.

    Reviewed-by: Anand Jain
    Signed-off-by: David Sterba

    David Sterba
     
  • The current practical default is ~4k on x86_64 (the logic is more complex,
    simplified for brevity). The inlined files land in the metadata group and
    thus consume space that could be needed for the real metadata.

    The inlining brings some usability surprises:

    1) total space consumption measured on various filesystems and btrfs
    with DUP metadata was quite visible because of the duplicated data
    within metadata

    2) inlined data may exhaust the metadata space, which is more precious in
    case the entire device space is allocated to chunks (i.e. balance cannot
    make the space more compact)

    3) performance suffers a bit as the inlined blocks are duplicated and
    stored far away on the device.

    Proposed fix: set the default to 2048

    This mainly fixes 1): the total filesystem space consumption will be on
    par with other filesystems.

    Partially fixes 2): more data is pushed to the data block groups.

    The characteristics of 3) are based on actual small file size
    distribution.

    The change is independent of the metadata blockgroup type (though it's
    most visible with DUP) or system page size as these parameters are not
    trivial to find out, compared to file size.
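
    As a rough illustration of what the new default changes, the Python
    sketch below models the inlining decision as a plain size comparison.
    This is a deliberate simplification (as noted above, the real logic is
    more complex), and would_inline() and metadata_data_bytes() are
    hypothetical helpers that ignore item overhead.

```python
def would_inline(file_size, max_inline=2048):
    """Simplified model: file data is inlined if it fits in max_inline bytes."""
    return 0 < file_size <= max_inline


def metadata_data_bytes(file_size, max_inline=2048, dup=True):
    """Bytes of file data that end up in metadata under this model.

    With DUP metadata every inlined byte is stored twice.
    """
    if not would_inline(file_size, max_inline):
        return 0
    return file_size * (2 if dup else 1)


# A 3000 byte file under an assumed old limit of 4096 versus the new
# default of 2048.
old_cost = metadata_data_bytes(3000, max_inline=4096)
new_cost = metadata_data_bytes(3000)
```

    Under this model a 3000 byte file no longer lands in metadata with the
    new default, while files of 2048 bytes or less are still inlined. The
    limit remains adjustable through the max_inline mount option.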

    Signed-off-by: David Sterba

    David Sterba
     
  • Let's remove the error message that appears when the tree_id is not
    present. This can happen with the quota tree and has been observed in
    practice. The applications are supposed to handle -ENOENT and we don't
    need to report that in the system log as it's not a fatal error.

    Reported-by: Vlastimil Babka
    Signed-off-by: David Sterba

    David Sterba
     
  • With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
    to partially inline the get_state_failrec() function but cannot
    figure out that the failrec pointer is always valid
    if the function returns success, which causes a harmless
    warning:

    fs/btrfs/extent_io.c: In function 'clean_io_failure':
    fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    This marks get_state_failrec() and set_state_failrec() both
    as 'noinline', which avoids the warning in all cases for me,
    and seems less ugly than adding a fake initialization.

    Signed-off-by: Arnd Bergmann
    Fixes: 47dc196ae719 ("btrfs: use proper type for failrec in extent_state")
    Signed-off-by: David Sterba

    Arnd Bergmann
     

20 Feb, 2016

1 commit


18 Feb, 2016

2 commits

  • When starting up Linux with a btrfs filesystem, I got many memory leak
    messages from kmemleak, such as:

    unreferenced object 0xffff880066882000 (size 4096):
    comm "modprobe", pid 730, jiffies 4294690024 (age 196.599s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] kmem_cache_alloc_trace+0xea/0x1e0
    [] btrfs_alloc_dummy_fs_info+0x6b/0x2a0 [btrfs]
    [] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
    [] btrfs_test_free_space_cache+0x39/0xed0 [btrfs]
    [] trace_raw_output_xfs_attr_class+0x54/0xe0 [xfs]
    [] do_one_initcall+0xb2/0x1f0
    [] do_init_module+0x5e/0x1e9
    [] load_module+0x20a9/0x2690
    [] SyS_finit_module+0xb9/0xf0
    [] entry_SYSCALL_64_fastpath+0x12/0x76
    [] 0xffffffffffffffff
    unreferenced object 0xffff8800573f8000 (size 10256):
    comm "modprobe", pid 730, jiffies 4294690185 (age 196.460s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] kmalloc_order+0x5e/0x70
    [] kmalloc_order_trace+0x24/0x90
    [] btrfs_alloc_dummy_fs_info+0x23/0x2a0 [btrfs]
    [] btrfs_alloc_dummy_block_group+0x5c/0x120 [btrfs]
    [] run_test+0xfd/0x320 [btrfs]
    [] btrfs_test_free_space_tree+0x94/0xee [btrfs]
    [] trace_raw_output_xfs_attr_class+0x8b/0xe0 [xfs]
    [] do_one_initcall+0xb2/0x1f0
    [] do_init_module+0x5e/0x1e9
    [] load_module+0x20a9/0x2690
    [] SyS_finit_module+0xb9/0xf0
    [] entry_SYSCALL_64_fastpath+0x12/0x76
    [] 0xffffffffffffffff

    This patch lets btrfs use the fs_info stored in btrfs_root for the
    block group cache directly, without allocating a new one.

    Fixes: d0bd456074 ("Btrfs: add fragment=* debug mount option")
    Signed-off-by: Kinglong Mee
    Signed-off-by: David Sterba

    Kinglong Mee
     
  • btrfs failed in xfstests btrfs/080 with -o nodatacow.

    This can be reproduced with the following script:
    DEV=/dev/vdg
    MNT=/mnt/tmp

    umount $DEV &>/dev/null
    mkfs.btrfs -f $DEV
    mount -o nodatacow $DEV $MNT

    dd if=/dev/zero of=$MNT/test bs=1 count=2048 &
    btrfs subvolume snapshot -r $MNT $MNT/test_snap &
    wait
    --
    We can see that dd fails with NO_SPACE.

    Reason:
    __btrfs_buffered_write should fall back to a cow write when nocow is
    impossible, and the current code is designed with that logic.
    But check_can_nocow() has 2 types of return value (0 and
    Signed-off-by: Zhao Lei

    Zhao Lei