18 Apr, 2017

2 commits


06 Dec, 2016

3 commits

  • Now we only use the root parameter to print the root objectid in
    a tracepoint. We can use the root parameter from the transaction
    handle for that. It's also used to join the transaction with
    async commits, so we remove the comment that it's just for checking.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • btrfs_write_and_wait_marked_extents and btrfs_sync_log both call
    btrfs_wait_marked_extents, which provides a core loop and then handles
    errors differently based on whether it's a log root or not.

    This means that btrfs_write_and_wait_marked_extents needs to take a root
    because btrfs_wait_marked_extents requires one, even though it's only
    used to determine whether the root is a log root. The log root code
    won't ever call into the transaction commit code using a log root, so we
    can factor out the core loop and provide the error handling appropriate
    to each waiter in new routines. This allows us to eventually remove
    the root argument from btrfs_commit_transaction, and as a result,
    btrfs_end_transaction.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
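
    A minimal userspace sketch of that factoring (illustrative names and error
    handling, not the actual btrfs routines): a shared core loop reports the
    first error, and each waiter wraps it with its own policy.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; these are NOT the real btrfs types or names. */
struct extent_range { int error; };

/* Shared core loop: visit every range, remember the first error seen. */
static int wait_extents_core(const struct extent_range *ranges, size_t n)
{
    int err = 0;
    for (size_t i = 0; i < n; i++) {
        if (ranges[i].error && !err)
            err = ranges[i].error;
    }
    return err;
}

/* Transaction-commit waiter: errors are fatal, so return them directly. */
static int wait_extents_for_commit(const struct extent_range *ranges, size_t n)
{
    return wait_extents_core(ranges, n);
}

/* Log waiter: the error is recorded elsewhere (here just discarded) and
 * masked from the caller, matching a different error-handling policy. */
static int wait_extents_for_log(const struct extent_range *ranges, size_t n)
{
    int err = wait_extents_core(ranges, n);
    (void)err; /* a real implementation would stash this in fs state */
    return 0;
}
```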
     
  • There are loads of functions in btrfs that accept a root parameter
    but only use it to obtain an fs_info pointer. Let's convert those to
    just accept an fs_info pointer directly.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
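
    The shape of that conversion, sketched with stand-in types (hypothetical
    function names, not the real btrfs signatures):

```c
#include <assert.h>

/* Minimal stand-ins for the kernel structures (illustrative only). */
struct btrfs_fs_info { int sectorsize; };
struct btrfs_root { struct btrfs_fs_info *fs_info; };

/* Before: accepts a root but only dereferences it to reach fs_info. */
static int sector_size_old(const struct btrfs_root *root)
{
    return root->fs_info->sectorsize;
}

/* After: the caller passes the fs_info pointer it already has. */
static int sector_size_new(const struct btrfs_fs_info *fs_info)
{
    return fs_info->sectorsize;
}
```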
     

27 Sep, 2016

1 commit

  • For many printks, we want to know which file system issued the message.

    This patch converts most pr_* calls to use the btrfs_* versions instead.
    In some cases, this means adding plumbing to allow call sites access to
    an fs_info pointer.

    fs/btrfs/check-integrity.c is left alone for another day.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
     

26 Jul, 2016

1 commit

  • btrfs_trans_handle->root is documented as for use for confirming
    that the root passed in to start the transaction is the same as the
    one ending it. It's used in several places when an fs_info pointer
    is needed, so let's just add an fs_info pointer directly. Eventually,
    the root pointer can be removed.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     

18 Jun, 2016

1 commit

  • The test for !trans->blocks_used in btrfs_abort_transaction is
    insufficient to determine whether it's safe to drop the transaction
    handle on the floor. btrfs_cow_block, informed by should_cow_block,
    can return blocks that have already been CoW'd in the current
    transaction. trans->blocks_used is only incremented for new block
    allocations. If an operation overlaps the blocks in the current
    transaction entirely and must abort the transaction, we'll happily
    let it clean up the trans handle even though it may have modified
    the blocks and will commit an incomplete operation.

    In the long-term, I'd like to do closer tracking of when the fs
    is actually modified so we can still recover as gracefully as possible,
    but that approach will need some discussion. In the short term,
    since this is the only code using trans->blocks_used, let's just
    switch it to a bool indicating whether any blocks were used and set
    it when should_cow_block returns false.

    Cc: stable@vger.kernel.org # 3.4+
    Signed-off-by: Jeff Mahoney
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba

    Jeff Mahoney
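
    A toy model of the change (field and function names are ours): the old
    counter misses blocks that were already CoW'd this transaction, while a
    dirty bool catches both cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative transaction handle; the real one lives in the kernel. */
struct trans_handle {
    unsigned long blocks_used; /* old: counts only fresh allocations */
    bool dirty;                /* new: any block touched this transaction */
};

/* A block already CoW'd in this transaction needs no new allocation... */
static void touch_block(struct trans_handle *t, bool needs_new_cow)
{
    if (needs_new_cow)
        t->blocks_used++;
    /* ...but it is still modified, so the handle is dirty either way. */
    t->dirty = true;
}

/* Is it safe to drop the handle without aborting the transaction? */
static bool safe_to_drop(const struct trans_handle *t)
{
    return !t->dirty; /* the old !blocks_used test missed reused blocks */
}
```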
     

26 May, 2016

1 commit


07 Jan, 2016

2 commits


10 Dec, 2015

1 commit

  • As of my previous change titled "Btrfs: fix scrub preventing unused block
    groups from being deleted", the following warning at
    extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
    filesystem with "-o discard":

    10263 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
    10264 {
    (...)
    10405         if (trimming) {
    10406                 WARN_ON(!list_empty(&block_group->bg_list));
    10407                 spin_lock(&trans->transaction->deleted_bgs_lock);
    10408                 list_move(&block_group->bg_list,
    10409                           &trans->transaction->deleted_bgs);
    10410                 spin_unlock(&trans->transaction->deleted_bgs_lock);
    10411                 btrfs_get_block_group(block_group);
    10412         }
    (...)

    This happens because scrub can now add back the block group to the list of
    unused block groups (fs_info->unused_bgs). This is dangerous because we
    are moving the block group from the unused block groups list to the list
    of deleted block groups without holding the lock that protects the source
    list (fs_info->unused_bgs_lock).

    The following diagram illustrates how this happens:

    CPU 1                                    CPU 2

    cleaner_kthread()
      btrfs_delete_unused_bgs()

        sees bg X in list
        fs_info->unused_bgs

        deletes bg X from list
        fs_info->unused_bgs

                                             scrub_enumerate_chunks()

                                               searches device tree using
                                               its commit root

                                               finds device extent for
                                               block group X

                                               gets block group X from the tree
                                               fs_info->block_group_cache_tree
                                               (via btrfs_lookup_block_group())

                                               sets bg X to RO (again)

                                               scrub_chunk(bg X)

                                               sets bg X back to RW mode

                                               adds bg X to the list
                                               fs_info->unused_bgs again,
                                               since it's still unused and
                                               currently not in that list

        sets bg X to RO mode

        btrfs_remove_chunk(bg X)

          --> discard is enabled and bg X
              is in the fs_info->unused_bgs
              list again so the warning is
              triggered
          --> we move it from that list into
              the transaction's delete_bgs
              list, but we can have another
              task currently manipulating
              the first list (fs_info->unused_bgs)
    Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
    the list of unused block groups and the list of deleted block groups. This
    makes it safe, and there's not much worry about extra lock contention, as
    this lock is seldom used and only the cleaner kthread adds elements to the
    list of deleted block groups. The warning goes away too, as this was
    previously an impossible case (and would have been better as a
    BUG_ON/ASSERT) but it's not impossible anymore.

    Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").

    Signed-off-by: Filipe Manana

    Filipe Manana
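
    The locking pattern can be sketched in userspace (a toy list, with a flag
    standing in for the spinlock; all names are illustrative): because one
    lock guards both lists, the move is atomic with respect to any other task
    walking fs_info->unused_bgs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy intrusive list mirroring the fix: ONE lock (modeled as a flag,
 * standing in for fs_info->unused_bgs_lock) guards both the unused and
 * the deleted list, so moving an element between them is atomic with
 * respect to every other list user. */
struct node { struct node *next; };

static int bgs_lock;                 /* stand-in for the spinlock */
static struct node *unused_bgs;      /* like fs_info->unused_bgs */
static struct node *deleted_bgs;     /* like transaction->deleted_bgs */

static void bgs_lock_acquire(void) { assert(!bgs_lock); bgs_lock = 1; }
static void bgs_lock_release(void) { bgs_lock = 0; }

static void push(struct node **list, struct node *n)
{
    n->next = *list;
    *list = n;
}

static struct node *pop(struct node **list)
{
    struct node *n = *list;
    if (n)
        *list = n->next;
    return n;
}

/* Add a block group to the unused list under the shared lock. */
static void add_unused(struct node *n)
{
    bgs_lock_acquire();
    push(&unused_bgs, n);
    bgs_lock_release();
}

/* Move one block group from unused to deleted under the SAME lock. */
static struct node *move_unused_to_deleted(void)
{
    bgs_lock_acquire();
    struct node *n = pop(&unused_bgs);
    if (n)
        push(&deleted_bgs, n);
    bgs_lock_release();
    return n;
}
```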
     

25 Nov, 2015

1 commit

  • It's possible to reach a state where the cleaner kthread isn't able to
    start a transaction to delete an unused block group due to lack of enough
    free metadata space and due to lack of unallocated device space to allocate
    a new metadata block group as well. If this happens try to use space from
    the global block group reserve just like we do for unlink operations, so
    that we don't reach a permanent state where starting a transaction for
    filesystem operations (file creation, renames, etc) keeps failing with
    -ENOSPC. Such an unfortunate state was observed on a machine where over
    a dozen unused data block groups existed and the cleaner kthread was
    failing to delete them due to ENOSPC error when attempting to start a
    transaction, and even running balance with a -dusage=0 filter failed with
    ENOSPC as well. Unmounting and mounting the filesystem again didn't help
    either. Allowing the cleaner kthread to use the global block reserve to
    delete the unused data block groups fixed the problem.

    Signed-off-by: Filipe Manana
    Signed-off-by: Jeff Mahoney
    Signed-off-by: Chris Mason

    Filipe Manana
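
    A schematic of the fallback (hypothetical names, not the kernel API): try
    the normal metadata reservation first and dip into the global reserve
    only on ENOSPC.

```c
#include <assert.h>

/* Schematic fallback with made-up names: the cleaner tries the normal
 * metadata space first and falls back to the global reserve only when
 * that fails with -ENOSPC. */
enum { ENOSPC_RET = -28 };

struct reserve { long free_bytes; };

static int reserve_bytes(struct reserve *r, long bytes)
{
    if (r->free_bytes < bytes)
        return ENOSPC_RET;
    r->free_bytes -= bytes;
    return 0;
}

/* Start a cleaner transaction: normal reservation, then the fallback. */
static int start_cleaner_transaction(struct reserve *normal,
                                     struct reserve *global_rsv, long bytes)
{
    int ret = reserve_bytes(normal, bytes);
    if (ret == ENOSPC_RET)
        ret = reserve_bytes(global_rsv, bytes);
    return ret;
}
```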
     

22 Oct, 2015

6 commits

  • Signed-off-by: Chris Mason

    Chris Mason
     
    If we hit ENOSPC when setting up a space cache, don't bother setting up any
    of the other space caches in this transaction; it'll just induce unnecessary
    latency. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
    I want to set some per-transaction flags, so instead of adding yet another
    int, let's just convert the current two int indicators to flags and add a
    flags field for future use. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
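
    The int-to-flags conversion looks roughly like this (the flag names here
    are made up for the example):

```c
#include <assert.h>

/* Illustrative conversion: two separate int indicators become bits in a
 * single flags word, leaving room for future per-transaction flags. */
#define TRANS_INDICATOR_A (1UL << 0)
#define TRANS_INDICATOR_B (1UL << 1)

struct transaction {
    unsigned long flags; /* replaces two int fields */
};

static void trans_set(struct transaction *t, unsigned long f)
{
    t->flags |= f;
}

static void trans_clear(struct transaction *t, unsigned long f)
{
    t->flags &= ~f;
}

static int trans_test(const struct transaction *t, unsigned long f)
{
    return (t->flags & f) != 0;
}
```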
     
  • We have a mechanism to make sure we don't lose updates for ordered extents that
    were logged in the transaction that is currently running. We add the ordered
    extent to a transaction list and then the transaction waits on all the ordered
    extents in that list. However, on substantially large file systems this list
    can be extremely large, and can give us soft lockups, since the ordered extents
    don't remove themselves from the list when they do complete.

    To fix this we simply add a counter to the transaction that is incremented any
    time we have a logged extent that needs to be completed in the current
    transaction. Then when the ordered extent finally completes it decrements the
    per transaction counter and wakes up the transaction if we are the last ones.
    This will eliminate the softlockup. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
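
    The counter scheme can be sketched as follows (names are ours; the real
    code wakes a wait queue rather than setting a flag):

```c
#include <assert.h>

/* Sketch of the counter scheme: each ordered extent logged in the
 * running transaction bumps a counter; on completion it drops the
 * counter, and the last completion wakes the committer. */
struct txn {
    int pending_ordered; /* outstanding logged ordered extents */
    int committer_woken; /* stand-in for wake_up() on a wait queue */
};

static void ordered_extent_logged(struct txn *t)
{
    t->pending_ordered++;
}

static void ordered_extent_completed(struct txn *t)
{
    if (--t->pending_ordered == 0)
        t->committer_woken = 1; /* real code wakes the waiting commit */
}
```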
     
  • As we have the new metadata reservation functions, use them to replace
    the old btrfs_qgroup_reserve() call for metadata.

    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     
    The value of num_items that start_transaction() ultimately
    takes is always small, so a 64-bit integer is overkill.

    Also change num_items for btrfs_start_transaction() and
    btrfs_start_transaction_lflush() as well.

    Reviewed-by: David Sterba
    Signed-off-by: Alexandru Moise
    Signed-off-by: David Sterba

    Alexandru Moise
     

06 Oct, 2015

1 commit

  • Josef ran into a deadlock while a transaction handle was finalizing the
    creation of its block groups, which produced the following trace:

    [260445.593112] fio D ffff88022a9df468 0 8924 4518 0x00000084
    [260445.593119] ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
    [260445.593126] ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
    [260445.593132] ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
    [260445.593137] Call Trace:
    [260445.593145] [] schedule+0x37/0x80
    [260445.593189] [] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
    [260445.593197] [] ? prepare_to_wait_event+0xf0/0xf0
    [260445.593225] [] btrfs_lock_root_node+0x34/0x50 [btrfs]
    [260445.593253] [] btrfs_search_slot+0x88b/0xa00 [btrfs]
    [260445.593295] [] ? free_extent_buffer+0x4f/0x90 [btrfs]
    [260445.593324] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
    [260445.593351] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
    [260445.593394] [] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
    [260445.593427] [] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
    [260445.593459] [] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
    [260445.593491] [] find_free_extent+0xa55/0xd90 [btrfs]
    [260445.593524] [] btrfs_reserve_extent+0xd2/0x220 [btrfs]
    [260445.593532] [] ? account_page_dirtied+0xdd/0x170
    [260445.593564] [] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
    [260445.593597] [] ? btree_set_page_dirty+0xe/0x10 [btrfs]
    [260445.593626] [] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
    [260445.593654] [] btrfs_cow_block+0x11f/0x1c0 [btrfs]
    [260445.593682] [] btrfs_search_slot+0x1e7/0xa00 [btrfs]
    [260445.593724] [] ? free_extent_buffer+0x4f/0x90 [btrfs]
    [260445.593752] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
    [260445.593830] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
    [260445.593905] [] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
    [260445.593946] [] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
    [260445.593990] [] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
    [260445.594042] [] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
    [260445.594089] [] btrfs_sync_file+0x294/0x350 [btrfs]
    [260445.594115] [] vfs_fsync_range+0x3b/0xa0
    [260445.594133] [] ? syscall_trace_enter_phase1+0x131/0x180
    [260445.594149] [] do_fsync+0x3d/0x70
    [260445.594169] [] ? syscall_trace_leave+0xb8/0x110
    [260445.594187] [] SyS_fsync+0x10/0x20
    [260445.594204] [] entry_SYSCALL_64_fastpath+0x12/0x71

    This happened because the same transaction handle created a large number
    of block groups and while finalizing their creation (inserting new items
    and updating existing items in the chunk and device trees) a new metadata
    extent had to be allocated and no free space was found in the current
    metadata block groups, which made find_free_extent() attempt to allocate
    a new block group via do_chunk_alloc(). However, at do_chunk_alloc() we
    ended up allocating a new system chunk too and exceeded the threshold
    of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
    final part of block group creation again (at
    btrfs_create_pending_block_groups()) and attempt to lock the root
    of the chunk tree again when it's already write locked by the same task.

    Similarly we can deadlock on extent tree nodes/leafs if while we are
    running delayed references we end up creating a new metadata block group
    in order to allocate a new node/leaf for the extent tree (as part of
    a CoW operation or growing the tree), as btrfs_create_pending_block_groups
    inserts items into the extent tree as well. In this case we get the
    following trace:

    [14242.773581] fio D ffff880428ca3418 0 3615 3100 0x00000084
    [14242.773588] ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
    [14242.773594] ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
    [14242.773600] ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
    [14242.773606] Call Trace:
    [14242.773613] [] schedule+0x37/0x80
    [14242.773656] [] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
    [14242.773664] [] ? prepare_to_wait_event+0xf0/0xf0
    [14242.773692] [] btrfs_lock_root_node+0x34/0x50 [btrfs]
    [14242.773720] [] btrfs_search_slot+0x88b/0xa00 [btrfs]
    [14242.773750] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
    [14242.773758] [] ? kmem_cache_alloc+0x1d2/0x200
    [14242.773786] [] btrfs_insert_item+0x71/0xf0 [btrfs]
    [14242.773818] [] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
    [14242.773850] [] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
    [14242.773934] [] find_free_extent+0xa55/0xd90 [btrfs]
    [14242.773998] [] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
    [14242.774041] [] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
    [14242.774078] [] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
    [14242.774118] [] btrfs_cow_block+0x11f/0x1c0 [btrfs]
    [14242.774155] [] btrfs_search_slot+0x1e7/0xa00 [btrfs]
    [14242.774194] [] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
    [14242.774235] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
    [14242.774274] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
    [14242.774318] [] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
    [14242.774358] [] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
    [14242.774391] [] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
    [14242.774432] [] commit_cowonly_roots+0x8d/0x2bd [btrfs]
    [14242.774474] [] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
    [14242.774516] [] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
    [14242.774558] [] btrfs_commit_transaction+0x590/0xb40 [btrfs]
    [14242.774599] [] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
    [14242.774642] [] btrfs_sync_file+0x294/0x350 [btrfs]
    [14242.774650] [] vfs_fsync_range+0x3b/0xa0
    [14242.774657] [] ? syscall_trace_enter_phase1+0x131/0x180
    [14242.774663] [] do_fsync+0x3d/0x70
    [14242.774669] [] ? syscall_trace_leave+0xb8/0x110
    [14242.774675] [] SyS_fsync+0x10/0x20
    [14242.774681] [] entry_SYSCALL_64_fastpath+0x12/0x71

    Fix this by never recursing into the finalization phase of block group
    creation and making sure we never trigger the finalization of block group
    creation while running delayed references.

    Reported-by: Josef Bacik
    Fixes: 00d80e342c0f ("Btrfs: fix quick exhaustion of the system array in the superblock")
    Signed-off-by: Filipe Manana

    Filipe Manana
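
    The non-recursion guard can be modeled in miniature (names are ours): a
    flag on the handle makes a nested chunk allocation skip the finalization
    it was called from.

```c
#include <assert.h>

/* Miniature model of the fix: a flag on the handle prevents a chunk
 * allocation made while finalizing block groups from re-entering the
 * finalization and deadlocking on the chunk tree root. */
struct handle {
    int creating_pending_bgs;
    int finalize_runs; /* for the example: times finalization ran */
};

static void create_pending_block_groups(struct handle *h);

static void chunk_alloc(struct handle *h)
{
    /* ...allocate a chunk, then flush pending block groups, but never
     * while we are already inside the finalization itself. */
    if (!h->creating_pending_bgs)
        create_pending_block_groups(h);
}

static void create_pending_block_groups(struct handle *h)
{
    h->creating_pending_bgs = 1;
    h->finalize_runs++;
    chunk_alloc(h); /* nested allocation triggered by the finalization */
    h->creating_pending_bgs = 0;
}
```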
     

23 Sep, 2015

1 commit

    When dropping a snapshot we need to account for the qgroup changes. If we
    drop the snapshot all in one go then the backref code will fail to find
    blocks from the snapshot we dropped since it won't be able to find the
    root in the fs root
    cache. This can lead to us failing to find refs from other roots that pointed
    at blocks in the now deleted root. To handle this we need to not remove the fs
    roots from the cache until after we process the qgroup operations. Do this by
    adding dropped roots to a list on the transaction, and letting the transaction
    remove the roots at the same time it drops the commit roots. This will keep all
    of the backref searching code in sync properly, and fixes a problem Mark was
    seeing with snapshot delete and qgroups. Thanks,

    Signed-off-by: Josef Bacik
    Tested-by: Holger Hoffstätte
    Signed-off-by: Chris Mason

    Josef Bacik
     

29 Jul, 2015

1 commit

  • When we clear the dirty bits in btrfs_delete_unused_bgs for extents
    in the empty block group, it results in btrfs_finish_extent_commit being
    unable to discard the freed extents.

    The block group removal patch added an alternate path to forget extents
    other than btrfs_finish_extent_commit. As a result, any extents that
    would be freed when the block group is removed aren't discarded. In my
    test run, with a large copy of mixed sized files followed by removal, it
    left nearly 2/3 of extents undiscarded.

    To clean up the block groups, we add the removed block group onto a list
    that will be discarded after transaction commit.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: Filipe Manana
    Tested-by: Filipe Manana
    Signed-off-by: Chris Mason

    Jeff Mahoney
     

11 Jun, 2015

1 commit

  • This is used by later qgroup fix patches for snapshot.

    Current snapshot accounting is done by btrfs_qgroup_inherit(), but the
    new extent-oriented quota mechanism will account for extents from
    btrfs_copy_root() and other snapshot operations, causing wrong results.

    So add the ability to handle snapshot accounting.

    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     

03 Jun, 2015

1 commit

    While creating a block group, we often end up getting ENOSPC while updating
    the chunk tree, which leads to a transaction abort that produces a trace
    like the following:

    [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
    [30670.117777] BTRFS: Transaction aborted (error -28)
    (...)
    [30670.163567] Call Trace:
    [30670.163906] [] dump_stack+0x4f/0x7b
    [30670.164522] [] ? console_unlock+0x361/0x3ad
    [30670.165171] [] warn_slowpath_common+0xa1/0xbb
    [30670.166323] [] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
    [30670.167213] [] warn_slowpath_fmt+0x46/0x48
    [30670.167862] [] __btrfs_abort_transaction+0x52/0x106 [btrfs]
    [30670.169116] [] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
    [30670.170593] [] __btrfs_end_transaction+0x84/0x366 [btrfs]
    [30670.171960] [] btrfs_end_transaction+0x10/0x12 [btrfs]
    [30670.174649] [] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
    [30670.176092] [] btrfs_fallocate+0x7c8/0xb96 [btrfs]
    [30670.177218] [] ? __this_cpu_preempt_check+0x13/0x15
    [30670.178622] [] vfs_fallocate+0x14c/0x1de
    [30670.179642] [] ? __fget_light+0x2d/0x4f
    [30670.180692] [] SyS_fallocate+0x47/0x62
    [30670.186737] [] system_call_fastpath+0x12/0x17
    [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---

    This is because we don't do proper space reservation for the chunk block
    reserve when we have multiple tasks allocating chunks in parallel.

    So block group creation has 2 phases, and the first phase essentially
    checks if there is enough space in the system space_info, allocating a
    new system chunk if there isn't, while the second phase updates the
    device, extent and chunk trees. However, because the updates to the
    chunk tree happen in the second phase, if we have N tasks, each with
    its own transaction handle, allocating new chunks in parallel and if
    there is only enough space in the system space_info to allocate M chunks,
    where M < N, none of the tasks ends up allocating a new system chunk in
    the first phase and N - M tasks will get -ENOSPC when attempting to
    update the chunk tree in phase 2 if they need to COW any nodes/leafs
    from the chunk tree.

    Fix this by doing proper reservation in the chunk block reserve.

    The issue could be reproduced by running fstests generic/038 in a loop,
    which eventually triggered the problem.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

11 Apr, 2015

2 commits

  • We loop through all of the dirty block groups during commit and write
    the free space cache. In order to make sure the cache is correct, we do
    this while no other writers are allowed in the commit.

    If a large number of block groups are dirty, this can introduce long
    stalls during the final stages of the commit, which can block new procs
    trying to change the filesystem.

    This commit changes the block group cache writeout to take appropriate
    locks and allow it to run earlier in the commit. We'll still have to
    redo some of the block groups, but it means we can get most of the work
    out of the way without blocking the entire FS.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This changes our delayed refs calculations to include the space needed
    to write back dirty block groups.

    Signed-off-by: Chris Mason

    Josef Bacik
     

27 Mar, 2015

1 commit

  • We can get into inconsistency between inodes and directory entries
    after fsyncing a directory. The issue is that while a directory gets
    the new dentries persisted in the fsync log and replayed at mount time,
    the link count of the inode that directory entries point to doesn't
    get updated, staying with an incorrect link count (smaller than the
    correct value). This later leads to stale file handle errors when
    accessing (including attempting to delete) some of the links if all the
    other ones are removed, which also makes it impossible to delete the
    parent directories, since the dentries can not be removed.

    Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
    when fsyncing a directory, new files aren't logged (their metadata and
    dentries) nor any child directories. So this patch fixes this issue too,
    since it has the same resolution as the incorrect inode link count issue
    mentioned before.

    This is very easy to reproduce, and the following excerpt from my test
    case for xfstests shows how:

    _scratch_mkfs >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create our main test file and directory.
    $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
    mkdir $SCRATCH_MNT/mydir

    # Make sure all metadata and data are durably persisted.
    sync

    # Add a hard link to 'foo' inside our test directory and fsync only the
    # directory. The btrfs fsync implementation had a bug that caused the new
    # directory entry to be visible after the fsync log replay, but the inode
    # of our file remained with a link count of 1.
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2

    # Add a few more links and new files.
    # This is just to verify nothing breaks or gives incorrect results after the
    # fsync log is replayed.
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
    $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
    ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2

    # Add some subdirectories and new files and links to them. This is to verify
    # that after fsyncing our top level directory 'mydir', all the subdirectories
    # and their files/links are registered in the fsync log and exist after the
    # fsync log is replayed.
    mkdir -p $SCRATCH_MNT/mydir/x/y/z
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
    touch $SCRATCH_MNT/mydir/x/y/z/qwerty

    # Now fsync only our top directory.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir

    # And fsync now our new file named 'hello', just to verify later that it has
    # the expected content and that the previous fsync on the directory 'mydir' had
    # no bad influence on this fsync.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello

    # Simulate a crash/power loss.
    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
    # all with the value 0xaa.
    echo "File 'foo' content after log replay:"
    od -t x1 $SCRATCH_MNT/foo

    # Remove the first name of our inode. Because of the directory fsync bug, the
    # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
    # deleting the inode and the other names became stale directory entries (still
    # visible to applications). Attempting to remove or access the remaining
    # dentries pointing to that inode resulted in stale file handle errors and
    # made it impossible to remove the parent directories since it was impossible
    # for them to become empty.
    echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
    rm -f $SCRATCH_MNT/foo

    # Now verify that all files, links and directories created before fsyncing our
    # directory exist after the fsync log was replayed.
    [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
    [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
    [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
    [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
    [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
    echo "Link mydir/x/y/foo_y_link is missing"
    [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
    echo "Link mydir/x/y/z/foo_z_link is missing"
    [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
    echo "File mydir/x/y/z/qwerty is missing"

    # We expect our file here to have a size of 64Kb and all the bytes having the
    # value 0xff.
    echo "file 'hello' content after log replay:"
    od -t x1 $SCRATCH_MNT/hello

    # Now remove all files/links, under our test directory 'mydir', and verify we
    # can remove all the directories.
    rm -f $SCRATCH_MNT/mydir/x/y/z/*
    rmdir $SCRATCH_MNT/mydir/x/y/z
    rm -f $SCRATCH_MNT/mydir/x/y/*
    rmdir $SCRATCH_MNT/mydir/x/y
    rmdir $SCRATCH_MNT/mydir/x
    rm -f $SCRATCH_MNT/mydir/*
    rmdir $SCRATCH_MNT/mydir

    # An fsck, run by the fstests framework every time a test finishes, also detected
    # the inconsistency and printed the following error message:
    #
    # root 5 inode 257 errors 2001, no inode item, link count wrong
    # unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
    # unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

    status=0
    exit

    The expected golden output for the test is:

    wrote 8192/8192 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    wrote 65536/65536 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    File 'foo' content after log replay:
    0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    *
    0020000
    file 'foo' link count after log replay: 5
    file 'hello' content after log replay:
    0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *
    0200000

    Which is the output after this patch and when running the test against
    ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
    output is:

    wrote 8192/8192 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    wrote 65536/65536 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    File 'foo' content after log replay:
    0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    *
    0020000
    file 'foo' link count after log replay: 1
    Link mydir/foo_2 is missing
    Link mydir/foo_3 is missing
    Link mydir/x/y/foo_y_link is missing
    Link mydir/x/y/z/foo_z_link is missing
    File mydir/x/y/z/qwerty is missing
    file 'hello' content after log replay:
    0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *
    0200000
    rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
    rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
    rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
    rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
    rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
    rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty

    Fsck, without this fix, also complains about the wrong link count:

    root 5 inode 257 errors 2001, no inode item, link count wrong
    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

    So fix this by logging the inodes that the dentries point to when
    fsyncing a directory.

    A test case for xfstests follows.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

15 Feb, 2015

1 commit

    Btrfs will report NO_SPACE when we create and remove files several times,
    and we can't write to the filesystem until we mount it again.

    Steps to reproduce:
    1: Create a single-dev btrfs fs with default option
    2: Write a file into it to take up most fs space
    3: Delete above file
    4: Wait about 100s to let the chunks be removed
    5: goto 2

    Script is like following:
    #!/bin/bash

    # Recommend 1.2G space, too large disk will make test slow
    DEV="/dev/sda16"
    MNT="/mnt/tmp"

    dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
    file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))

    echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"

    for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
    echo "mkfs $DEV"
    mkfs.btrfs -f "$DEV" >/dev/null || exit 2
    echo "mount $DEV $MNT"
    mount "$DEV" "$MNT" || exit 2

    for ((loop_i = 0; loop_i < 20; loop_i++)); do
        echo
        echo "loop $loop_i"

        echo "dd file..."
        cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
        "${cmd[@]}" 2>/dev/null || {
            # NO_SPACE error triggered
            echo "dd failed: ${cmd[*]}"
            exit 1
        }

        echo "rm file..."
        rm -f "$MNT"/file0 || exit 2

        for ((i = 0; i < 10; i++)); do
            df "$MNT" | tail -1
            sleep 10
        done
    done

    Reason:
    It is triggered by commit 47ab2a6c689913db23ccae38349714edf8365e0a,
    which is used to remove empty block groups automatically, but the
    root cause is not in that patch. The code worked well before because
    btrfs didn't need to create and delete chunks so many times with such
    high complexity.
    The above bug is caused by several problems; any of them can trigger it.

    Reason 1:
    When we remove some contiguous chunks but leave other chunks after them,
    the freed disk space should be usable for chunk re-creation, but in the
    current code only the first creation succeeds.
    Fixed by Forrest Liu in:
    Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

    Reason2:
    contains_pending_extent() returns a wrong value in its calculation.
    Fixed by Forrest Liu in:
    Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

    Reason3:
    btrfs_check_data_free_space() tries to commit the transaction and
    retry the chunk allocation when the first allocation fails, but
    space_info->full is set by the first allocation and prevents the
    second allocation in the retry.
    Fixed in this patch by clearing space_info->full in the transaction commit.

    Tested several times with the above script.
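    The Reason3 fix can be illustrated with a toy model: a sticky "full"
    flag set by a failed allocation blocks the retry unless the commit
    clears it. All names and structures below are invented for
    illustration; this is not the actual btrfs code.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the real space_info. */
    struct space_info {
        bool full;          /* set when chunk allocation fails */
        int free_chunks;    /* chunks a commit could reclaim */
    };

    static bool alloc_chunk(struct space_info *s)
    {
        if (s->full || s->free_chunks == 0) {
            s->full = true;     /* remember the failure */
            return false;
        }
        s->free_chunks--;
        return true;
    }

    /* A commit that removed block groups frees space; the fix is to
     * also clear ->full here so the retry can allocate again. */
    static void commit_transaction(struct space_info *s, int reclaimed)
    {
        s->free_chunks += reclaimed;
        s->full = false;        /* without this, the retry below fails */
    }

    static bool alloc_with_retry(struct space_info *s)
    {
        if (alloc_chunk(s))
            return true;
        commit_transaction(s, 1);   /* reclaim removed block groups */
        return alloc_chunk(s);
    }

    int main(void)
    {
        struct space_info s = { .full = false, .free_chunks = 0 };
        /* First attempt fails and sets ->full; the commit clears it
         * and reclaims a chunk, so the retry succeeds. */
        assert(alloc_with_retry(&s));
        printf("retry after commit succeeded\n");
        return 0;
    }
    ```
    
    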

    Changelog v3->v4:
    Use a lightweight int instead of atomic_t to record have_remove_bgs in
    the transaction, suggested by:
    Josef Bacik

    Changelog v2->v3:
    v2 fixed the bug by adding more transaction commits, but we only
    need to reclaim space when we really have no space for a new
    chunk, noticed by:
    Filipe David Manana

    Actually, our code already has this type of commit-and-retry; we
    only need to make it work with removed block groups.
    v3 fixed the bug this way.

    Changelog v1->v2:
    v1 would introduce a new bug when deleting and creating a chunk in the same
    disk space in the same transaction, noticed by:
    Filipe David Manana
    v2 fixes this bug by committing the transaction after removing block groups.

    Reported-by: Tsutomu Itoh
    Suggested-by: Filipe David Manana
    Suggested-by: Josef Bacik
    Signed-off-by: Zhao Lei
    Signed-off-by: Chris Mason

    Zhao Lei
     

22 Jan, 2015

1 commit

  • Currently, any time we try to update the block groups on disk we walk _all_
    block groups and check the ->dirty flag to see if it is set. This function
    can get called several times during a commit, so if you have several terabytes
    of data you will be a very sad panda as we loop through _all_ of the block
    groups several times, which makes the commit take a while and slows down the
    rest of the file system operations.

    This patch introduces a dirty list for the block groups that we get added to
    when we dirty the block group for the first time. Then we simply update any
    block groups that have been dirtied since the last time we called
    btrfs_write_dirty_block_groups. This allows us to clean up how we write the
    free space cache out so it is much cleaner. Thanks,
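    The dirty-list idea can be sketched in miniature: block groups are
    linked onto a per-transaction list the first time they are dirtied,
    and the writer walks only that list instead of scanning everything.
    All names below are invented for illustration, not the real btrfs
    structures.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct block_group {
        bool dirty;
        bool on_dirty_list;
        struct block_group *next_dirty;  /* NULL when not on the list */
    };

    struct transaction {
        struct block_group *dirty_list;  /* only dirtied groups live here */
    };

    static void dirty_block_group(struct transaction *t, struct block_group *bg)
    {
        bg->dirty = true;
        if (!bg->on_dirty_list) {        /* added only on first dirtying */
            bg->on_dirty_list = true;
            bg->next_dirty = t->dirty_list;
            t->dirty_list = bg;
        }
    }

    /* Walks only the dirtied groups; returns how many were written. */
    static int write_dirty_block_groups(struct transaction *t)
    {
        int written = 0;
        while (t->dirty_list) {
            struct block_group *bg = t->dirty_list;
            t->dirty_list = bg->next_dirty;
            bg->next_dirty = NULL;
            bg->on_dirty_list = false;
            bg->dirty = false;           /* "write" it out */
            written++;
        }
        return written;
    }

    int main(void)
    {
        struct transaction t = { NULL };
        struct block_group bgs[1000] = { 0 };
        /* Dirty two of a thousand groups; only those two are visited. */
        dirty_block_group(&t, &bgs[3]);
        dirty_block_group(&t, &bgs[700]);
        dirty_block_group(&t, &bgs[3]);  /* re-dirtying does not re-add */
        assert(write_dirty_block_groups(&t) == 2);
        printf("wrote 2 of 1000 block groups\n");
        return 0;
    }
    ```

    The win is the same as in the commit message: the cost of a write-out
    becomes proportional to what was dirtied, not to the total number of
    block groups.
    
    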

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

25 Nov, 2014

1 commit


22 Nov, 2014

1 commit

  • Liu Bo pointed out that my previous fix would lose the generation update in the
    scenario I described. It is actually much worse than that, we could lose the
    entire extent if we lose power right after the transaction commits. Consider
    the following

    write extent 0-4k
    log extent in log tree
    commit transaction
    < power fail happens here
    ordered extent completes

    We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
    the transaction commit will reset the log so it isn't replayed. If we lose
    power before the transaction commit we are safe, otherwise we are not.

    Fix this by keeping track of all extents we logged in this transaction. Then
    when we go to commit the transaction make sure we wait for all of those ordered
    extents to complete before proceeding. This will make sure that if we lose
    power after the transaction commit we still have our data. This also fixes the
    problem of the improperly updated extent generation. Thanks,
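    The fix described above can be modeled as: the transaction remembers
    every extent written to the log tree, and the commit waits on each
    pending ordered extent before it resets the log. This is a toy
    sketch with invented names, not the actual btrfs implementation.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_LOGGED 16

    struct ordered_extent {
        bool complete;   /* set once the fs tree has been updated */
    };

    struct transaction {
        struct ordered_extent *logged[MAX_LOGGED];
        int nr_logged;
    };

    static void log_extent(struct transaction *t, struct ordered_extent *oe)
    {
        t->logged[t->nr_logged++] = oe;  /* remember it for the commit */
    }

    /* Stand-in for blocking on I/O completion: here we just finish it. */
    static void wait_ordered_extent(struct ordered_extent *oe)
    {
        oe->complete = true;
    }

    /* Returns how many logged extents were still pending. Only after
     * this loop is it safe to reset the log: a power failure after the
     * commit can no longer lose an extent that existed only in the log. */
    static int commit_transaction(struct transaction *t)
    {
        int waited = 0;
        for (int i = 0; i < t->nr_logged; i++) {
            if (!t->logged[i]->complete) {
                wait_ordered_extent(t->logged[i]);
                waited++;
            }
        }
        t->nr_logged = 0;
        return waited;
    }

    int main(void)
    {
        struct transaction t = { .nr_logged = 0 };
        struct ordered_extent oe = { .complete = false };
        log_extent(&t, &oe);                  /* extent written to the log */
        assert(commit_transaction(&t) == 1);  /* commit waited for it */
        assert(oe.complete);
        printf("commit waited for 1 logged ordered extent\n");
        return 0;
    }
    ```
    
    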

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

21 Nov, 2014

1 commit

  • When committing a transaction or a log, we look for btree extents that
    need to be durably persisted by searching for ranges in a io tree that
    have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
    those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
    convert_extent_bit, and then start writeback for the extents.

    That function however can return an error (at the moment only -ENOMEM
    is possible, especially when it does GFP_ATOMIC allocation requests
    through alloc_extent_state_atomic) - that means the ranges didn't get
    the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
    which in turn means a call to btrfs_wait_marked_extents() won't find
    those ranges for which we started writeback, causing a transaction
    commit or a log commit to persist a new superblock without waiting
    for the writeback of extents in that range to finish first.

    Therefore if a crash happens after persisting the new superblock and
    before writeback finishes, we have a superblock pointing to roots that
    weren't fully persisted or roots that point to nodes or leafs that weren't
    fully persisted, causing all sorts of unexpected/bad behaviour as we end up
    reading garbage from disk or the content of some node/leaf from a past
    generation that got cowed or deleted and is no longer valid (for this latter
    case we end up getting error messages like "parent transid verify failed on
    X wanted Y found Z" when reading btree nodes/leafs from disk).

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

12 Nov, 2014

1 commit

  • There are some actions that modify global filesystem state but cannot be
    performed at the time of the request, only later at transaction commit
    time when the filesystem is in a known state.

    For example, enabling new incompat features on the fly, or issuing a
    transaction commit from unsafe contexts (sysfs handlers).
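    A miniature sketch of that defer-to-commit pattern might look like
    the following. The enum values, struct, and helpers are hypothetical,
    chosen only to show the shape of the mechanism.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Actions queued from contexts where they can't be applied directly. */
    enum pending_action {
        PENDING_SET_INCOMPAT_FEATURE = 1 << 0,
        PENDING_COMMIT_REQUEST       = 1 << 1,
    };

    struct fs_info {
        unsigned long pending_changes;  /* bits queued for the next commit */
        unsigned long incompat_flags;   /* only modified at commit time */
    };

    /* Safe to call from any context, e.g. a sysfs handler: it only
     * records the request. */
    static void queue_pending(struct fs_info *fs, enum pending_action a)
    {
        fs->pending_changes |= a;
    }

    /* Called from the commit path, when the filesystem is in a known
     * state; applies everything queued and returns the applied bits. */
    static unsigned long apply_pending(struct fs_info *fs)
    {
        unsigned long done = fs->pending_changes;

        if (done & PENDING_SET_INCOMPAT_FEATURE)
            fs->incompat_flags |= 1;    /* enable the feature on the fly */

        fs->pending_changes = 0;
        return done;
    }

    int main(void)
    {
        struct fs_info fs = { 0, 0 };
        queue_pending(&fs, PENDING_SET_INCOMPAT_FEATURE);
        assert(apply_pending(&fs) == PENDING_SET_INCOMPAT_FEATURE);
        assert(fs.incompat_flags == 1);
        printf("pending action applied at commit time\n");
        return 0;
    }
    ```
    
    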

    Signed-off-by: David Sterba

    David Sterba
     

02 Oct, 2014

1 commit


15 Aug, 2014

1 commit

  • Truncates and renames are often used to replace old versions of a file
    with new versions. Applications often expect this to be an atomic
    replacement, even if they haven't done anything to make sure the new
    version is fully on disk.

    Btrfs has strict flushing in place to make sure that renaming over an
    old file with a new file will fully flush out the new file before
    allowing the transaction commit with the rename to complete.

    This ordering means the commit code needs to be able to lock file pages,
    and there are a few paths in the filesystem where we will try to end a
    transaction with the page lock held. It's rare, but these things can
    deadlock.

    This patch removes the ordered flushes and switches to a best effort
    filemap_flush like ext4 uses. It's not perfect, but it should fix the
    deadlocks.

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Jun, 2014

1 commit

  • This exercises the various parts of the new qgroup accounting code. We do some
    basic stuff and do some things with the shared refs to make sure all that code
    works. I had to add a bunch of infrastructure because I needed to be able to
    insert items into a fake tree without having to do all the hard work myself;
    hopefully this will be useful in the future. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

07 Apr, 2014

2 commits

  • Let's try this again. We can deadlock the box if we send on a box and try to
    write onto the same fs with the app that is trying to listen to the send pipe.
    This is because the writer could get stuck waiting for a transaction commit
    which is being blocked by the send. So fix this by making sure looking at the
    commit roots is always going to be consistent. We do this by keeping track of
    which roots need to have their commit roots swapped during commit, and then
    taking the commit_root_sem and swapping them all at once. Then make sure we
    take a read lock on the commit_root_sem in cases where we search the commit root
    to make sure we're always looking at a consistent view of the commit roots.
    Previously we had problems with this because we would swap a fs tree commit root
    and then swap the extent tree commit root independently which would cause the
    backref walking code to screw up sometimes. With this patch we no longer
    deadlock and pass all the weird send/receive corner cases. Thanks,
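    The all-at-once swap under commit_root_sem can be sketched with a
    pthread rwlock: the commit swaps every dirtied commit root while
    holding the write lock, and searchers hold the read lock so they can
    never observe a half-swapped mix. The structs and helpers here are
    invented for illustration, not the kernel code.

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NR_ROOTS 2   /* e.g. an fs root and the extent root */

    struct root {
        int node;         /* current in-memory root node (a stand-in) */
        int commit_node;  /* node that commit-root readers see */
        int dirty;        /* needs its commit root swapped at commit */
    };

    struct fs {
        pthread_rwlock_t commit_root_sem;
        struct root roots[NR_ROOTS];
    };

    /* Swap every dirtied commit root at once, under the write lock. */
    static void commit_roots(struct fs *fs)
    {
        pthread_rwlock_wrlock(&fs->commit_root_sem);
        for (int i = 0; i < NR_ROOTS; i++) {
            if (fs->roots[i].dirty) {
                fs->roots[i].commit_node = fs->roots[i].node;
                fs->roots[i].dirty = 0;
            }
        }
        pthread_rwlock_unlock(&fs->commit_root_sem);
    }

    /* A searcher (e.g. send or backref walking) takes the read lock, so
     * no swap can happen between its reads of the different roots. */
    static void read_commit_roots(struct fs *fs, int out[NR_ROOTS])
    {
        pthread_rwlock_rdlock(&fs->commit_root_sem);
        for (int i = 0; i < NR_ROOTS; i++)
            out[i] = fs->roots[i].commit_node;
        pthread_rwlock_unlock(&fs->commit_root_sem);
    }

    int main(void)
    {
        struct fs fs = { .roots = { { 2, 1, 1 }, { 20, 10, 1 } } };
        int seen[NR_ROOTS];

        pthread_rwlock_init(&fs.commit_root_sem, NULL);
        commit_roots(&fs);
        read_commit_roots(&fs, seen);
        /* Both roots swapped together: a reader never sees the new fs
         * root paired with the old extent root. */
        assert(seen[0] == 2 && seen[1] == 20);
        printf("commit roots swapped atomically\n");
        return 0;
    }
    ```
    
    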

    Reported-by: Hugo Mills
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • So I have an awful exercise script that will run snapshot, balance and
    send/receive in parallel. This sometimes would crash spectacularly and when it
    came back up the fs would be completely hosed. Turns out this is because of a
    bad interaction of balance and send/receive. Send will hold onto its entire
    path for the whole send, but its blocks could get relocated out from underneath
    it, and because it doesn't hold tree locks there's nothing to keep this from
    happening. So it will go to read in a slot with an old transid, and we could
    have re-allocated this block for something else and it could have a completely
    different transid. But because we think it is invalid we clear uptodate and
    re-read in the block. If we do this before we actually write out the new block
    we could write back stale data to the fs, and boom we're screwed.

    Now we definitely need to fix this disconnect between send and balance, but we
    really, really need to not allow ourselves to accidentally read in stale data over
    new data. So make sure we check that the extent buffer is not under io before
    clearing uptodate, this will kick back EIO to the caller instead of reading in
    stale data and keep us from corrupting the fs. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

29 Jan, 2014

2 commits

  • Looking into some performance related issues with large amounts of metadata
    revealed that we can have some pretty huge swings in fsync() performance. If we
    have a lot of delayed refs backed up (as you will tend to do with lots of
    metadata) fsync() will wander off and try to run some of those delayed refs
    which can result in reading from disk and such. Since the actual act of fsync()
    doesn't create any delayed refs, there is no need to make it throttle on delayed
    ref stuff; that will be handled by other people. With this patch we get much
    smoother fsync performance with large amounts of metadata. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Two reasons:
    - btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle(),
    so it is unnecessary.
    - All the delayed items should be dealt with in the current transaction, so
    the workers should not commit the transaction; instead, they should deal
    with as many of the delayed items as possible.

    So we can remove btrfs_end_transaction_dmeta().

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie