18 Apr, 2017

2 commits


06 Dec, 2016

3 commits

  • Now we only use the root parameter to print the root objectid in
    a tracepoint. We can use the root parameter from the transaction
    handle for that. It's also used to join the transaction with
    async commits, so we remove the comment that it's just for checking.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • btrfs_write_and_wait_marked_extents and btrfs_sync_log both call
    btrfs_wait_marked_extents, which provides a core loop and then handles
    errors differently based on whether it's a log root or not.

    This means that btrfs_write_and_wait_marked_extents needs to take a root
    because btrfs_wait_marked_extents requires one, even though it's only
    used to determine whether the root is a log root. The log root code
    won't ever call into the transaction commit code using a log root, so we
    can factor out the core loop and provide the error handling appropriate
    to each waiter in new routines. This allows us to eventually remove
    the root argument from btrfs_commit_transaction, and as a result,
    btrfs_end_transaction.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
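
    A minimal userspace sketch of that factoring (illustrative names and error
    handling, not the actual btrfs routines): a shared core loop reports the
    first error, and each waiter wraps it with its own policy.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; these are NOT the real btrfs types or names. */
struct extent_range { int error; };

/* Shared core loop: visit every range, remember the first error seen. */
static int wait_extents_core(const struct extent_range *ranges, size_t n)
{
    int err = 0;
    for (size_t i = 0; i < n; i++) {
        if (ranges[i].error && !err)
            err = ranges[i].error;
    }
    return err;
}

/* Transaction-commit waiter: errors are fatal, so return them directly. */
static int wait_extents_for_commit(const struct extent_range *ranges, size_t n)
{
    return wait_extents_core(ranges, n);
}

/* Log waiter: the error is recorded elsewhere (here just discarded) and
 * masked from the caller, matching a different error-handling policy. */
static int wait_extents_for_log(const struct extent_range *ranges, size_t n)
{
    int err = wait_extents_core(ranges, n);
    (void)err; /* a real implementation would stash this in fs state */
    return 0;
}
```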
     
  • There are loads of functions in btrfs that accept a root parameter
    but only use it to obtain an fs_info pointer. Let's convert those to
    just accept an fs_info pointer directly.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
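
    The shape of that conversion, sketched with stand-in types (hypothetical
    function names, not the real btrfs signatures):

```c
#include <assert.h>

/* Minimal stand-ins for the kernel structures (illustrative only). */
struct btrfs_fs_info { int sectorsize; };
struct btrfs_root { struct btrfs_fs_info *fs_info; };

/* Before: accepts a root but only dereferences it to reach fs_info. */
static int sector_size_old(const struct btrfs_root *root)
{
    return root->fs_info->sectorsize;
}

/* After: the caller passes the fs_info pointer it already has. */
static int sector_size_new(const struct btrfs_fs_info *fs_info)
{
    return fs_info->sectorsize;
}
```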
     

27 Sep, 2016

1 commit

  • For many printks, we want to know which file system issued the message.

    This patch converts most pr_* calls to use the btrfs_* versions instead.
    In some cases, this means adding plumbing to allow call sites access to
    an fs_info pointer.

    fs/btrfs/check-integrity.c is left alone for another day.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
     

26 Jul, 2016

1 commit

  • btrfs_trans_handle->root is documented as for use for confirming
    that the root passed in to start the transaction is the same as the
    one ending it. It's used in several places when an fs_info pointer
    is needed, so let's just add an fs_info pointer directly. Eventually,
    the root pointer can be removed.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     

18 Jun, 2016

1 commit

  • The test for !trans->blocks_used in btrfs_abort_transaction is
    insufficient to determine whether it's safe to drop the transaction
    handle on the floor. btrfs_cow_block, informed by should_cow_block,
    can return blocks that have already been CoW'd in the current
    transaction. trans->blocks_used is only incremented for new block
    allocations. If an operation overlaps the blocks in the current
    transaction entirely and must abort the transaction, we'll happily
    let it clean up the trans handle even though it may have modified
    the blocks and will commit an incomplete operation.

    In the long-term, I'd like to do closer tracking of when the fs
    is actually modified so we can still recover as gracefully as possible,
    but that approach will need some discussion. In the short term,
    since this is the only code using trans->blocks_used, let's just
    switch it to a bool indicating whether any blocks were used and set
    it when should_cow_block returns false.

    Cc: stable@vger.kernel.org # 3.4+
    Signed-off-by: Jeff Mahoney
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba

    Jeff Mahoney
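
    A toy model of the change (field and function names are ours): the old
    counter misses blocks that were already CoW'd this transaction, while a
    dirty bool catches both cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative transaction handle; the real one lives in the kernel. */
struct trans_handle {
    unsigned long blocks_used; /* old: counts only fresh allocations */
    bool dirty;                /* new: any block touched this transaction */
};

/* A block already CoW'd in this transaction needs no new allocation... */
static void touch_block(struct trans_handle *t, bool needs_new_cow)
{
    if (needs_new_cow)
        t->blocks_used++;
    /* ...but it is still modified, so the handle is dirty either way. */
    t->dirty = true;
}

/* Is it safe to drop the handle without aborting the transaction? */
static bool safe_to_drop(const struct trans_handle *t)
{
    return !t->dirty; /* the old !blocks_used test missed reused blocks */
}
```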
     

26 May, 2016

1 commit


07 Jan, 2016

2 commits


10 Dec, 2015

1 commit

  • As of my previous change titled "Btrfs: fix scrub preventing unused block
    groups from being deleted", the following warning at
    extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
    filesystem with "-o discard":

    10263 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
    10264 {
    (...)
    10405         if (trimming) {
    10406                 WARN_ON(!list_empty(&block_group->bg_list));
    10407                 spin_lock(&trans->transaction->deleted_bgs_lock);
    10408                 list_move(&block_group->bg_list,
    10409                           &trans->transaction->deleted_bgs);
    10410                 spin_unlock(&trans->transaction->deleted_bgs_lock);
    10411                 btrfs_get_block_group(block_group);
    10412         }
    (...)

    This happens because scrub can now add back the block group to the list of
    unused block groups (fs_info->unused_bgs). This is dangerous because we
    are moving the block group from the unused block groups list to the list
    of deleted block groups without holding the lock that protects the source
    list (fs_info->unused_bgs_lock).

    The following diagram illustrates how this happens:

    CPU 1                                    CPU 2

    cleaner_kthread()
      btrfs_delete_unused_bgs()

        sees bg X in list
        fs_info->unused_bgs

        deletes bg X from list
        fs_info->unused_bgs

                                             scrub_enumerate_chunks()

                                               searches device tree using
                                               its commit root

                                               finds device extent for
                                               block group X

                                               gets block group X from the tree
                                               fs_info->block_group_cache_tree
                                               (via btrfs_lookup_block_group())

                                               sets bg X to RO (again)

                                               scrub_chunk(bg X)

                                               sets bg X back to RW mode

                                               adds bg X to the list
                                               fs_info->unused_bgs again,
                                               since it's still unused and
                                               currently not in that list

        sets bg X to RO mode

        btrfs_remove_chunk(bg X)

          --> discard is enabled and bg X
              is in the fs_info->unused_bgs
              list again so the warning is
              triggered
          --> we move it from that list into
              the transaction's delete_bgs
              list, but we can have another
              task currently manipulating
              the first list (fs_info->unused_bgs)
    Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
    the list of unused block groups and the list of deleted block groups. This
    makes it safe, and there's not much worry about extra lock contention, as
    this lock is seldom used and only the cleaner kthread adds elements to the
    list of deleted block groups. The warning goes away too, as this was
    previously an impossible case (and would have been better as a
    BUG_ON/ASSERT) but it's not impossible anymore.

    Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").

    Signed-off-by: Filipe Manana

    Filipe Manana
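
    The locking pattern can be sketched in userspace (a toy list, with a flag
    standing in for the spinlock; all names are illustrative): because one
    lock guards both lists, the move is atomic with respect to any other task
    walking fs_info->unused_bgs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy intrusive list mirroring the fix: ONE lock (modeled as a flag,
 * standing in for fs_info->unused_bgs_lock) guards both the unused and
 * the deleted list, so moving an element between them is atomic with
 * respect to every other list user. */
struct node { struct node *next; };

static int bgs_lock;                 /* stand-in for the spinlock */
static struct node *unused_bgs;      /* like fs_info->unused_bgs */
static struct node *deleted_bgs;     /* like transaction->deleted_bgs */

static void bgs_lock_acquire(void) { assert(!bgs_lock); bgs_lock = 1; }
static void bgs_lock_release(void) { bgs_lock = 0; }

static void push(struct node **list, struct node *n)
{
    n->next = *list;
    *list = n;
}

static struct node *pop(struct node **list)
{
    struct node *n = *list;
    if (n)
        *list = n->next;
    return n;
}

/* Add a block group to the unused list under the shared lock. */
static void add_unused(struct node *n)
{
    bgs_lock_acquire();
    push(&unused_bgs, n);
    bgs_lock_release();
}

/* Move one block group from unused to deleted under the SAME lock. */
static struct node *move_unused_to_deleted(void)
{
    bgs_lock_acquire();
    struct node *n = pop(&unused_bgs);
    if (n)
        push(&deleted_bgs, n);
    bgs_lock_release();
    return n;
}
```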
     

25 Nov, 2015

1 commit

  • It's possible to reach a state where the cleaner kthread isn't able to
    start a transaction to delete an unused block group due to lack of enough
    free metadata space and due to lack of unallocated device space to allocate
    a new metadata block group as well. If this happens try to use space from
    the global block group reserve just like we do for unlink operations, so
    that we don't reach a permanent state where starting a transaction for
    filesystem operations (file creation, renames, etc) keeps failing with
    -ENOSPC. Such an unfortunate state was observed on a machine where over
    a dozen unused data block groups existed and the cleaner kthread was
    failing to delete them due to ENOSPC error when attempting to start a
    transaction, and even running balance with a -dusage=0 filter failed with
    ENOSPC as well. Unmounting and mounting the filesystem again didn't help
    either. Allowing the cleaner kthread to use the global block reserve to
    delete the unused data block groups fixed the problem.

    Signed-off-by: Filipe Manana
    Signed-off-by: Jeff Mahoney
    Signed-off-by: Chris Mason

    Filipe Manana
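
    A schematic of the fallback (hypothetical names, not the kernel API): try
    the normal metadata reservation first and dip into the global reserve
    only on ENOSPC.

```c
#include <assert.h>

/* Schematic fallback with made-up names: the cleaner tries the normal
 * metadata space first and falls back to the global reserve only when
 * that fails with -ENOSPC. */
enum { ENOSPC_RET = -28 };

struct reserve { long free_bytes; };

static int reserve_bytes(struct reserve *r, long bytes)
{
    if (r->free_bytes < bytes)
        return ENOSPC_RET;
    r->free_bytes -= bytes;
    return 0;
}

/* Start a cleaner transaction: normal reservation, then the fallback. */
static int start_cleaner_transaction(struct reserve *normal,
                                     struct reserve *global_rsv, long bytes)
{
    int ret = reserve_bytes(normal, bytes);
    if (ret == ENOSPC_RET)
        ret = reserve_bytes(global_rsv, bytes);
    return ret;
}
```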
     

22 Oct, 2015

6 commits

  • Signed-off-by: Chris Mason

    Chris Mason
     
    If we hit ENOSPC when setting up a space cache, don't bother setting up any
    of the other space caches in this transaction; it'll just induce unnecessary
    latency. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
    I want to set some per-transaction flags, so instead of adding yet another
    int, let's just convert the current two int indicators to flags and add a
    flags field for future use. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
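
    The int-to-flags conversion looks roughly like this (the flag names here
    are made up for the example):

```c
#include <assert.h>

/* Illustrative conversion: two separate int indicators become bits in a
 * single flags word, leaving room for future per-transaction flags. */
#define TRANS_INDICATOR_A (1UL << 0)
#define TRANS_INDICATOR_B (1UL << 1)

struct transaction {
    unsigned long flags; /* replaces two int fields */
};

static void trans_set(struct transaction *t, unsigned long f)
{
    t->flags |= f;
}

static void trans_clear(struct transaction *t, unsigned long f)
{
    t->flags &= ~f;
}

static int trans_test(const struct transaction *t, unsigned long f)
{
    return (t->flags & f) != 0;
}
```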
     
  • We have a mechanism to make sure we don't lose updates for ordered extents that
    were logged in the transaction that is currently running. We add the ordered
    extent to a transaction list and then the transaction waits on all the ordered
    extents in that list. However, on substantially large file systems this list
    can be extremely large, and can give us soft lockups, since the ordered extents
    don't remove themselves from the list when they do complete.

    To fix this we simply add a counter to the transaction that is incremented any
    time we have a logged extent that needs to be completed in the current
    transaction. Then when the ordered extent finally completes it decrements the
    per transaction counter and wakes up the transaction if we are the last ones.
    This will eliminate the softlockup. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
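
    The counter scheme can be sketched as follows (names are ours; the real
    code wakes a wait queue rather than setting a flag):

```c
#include <assert.h>

/* Sketch of the counter scheme: each ordered extent logged in the
 * running transaction bumps a counter; on completion it drops the
 * counter, and the last completion wakes the committer. */
struct txn {
    int pending_ordered; /* outstanding logged ordered extents */
    int committer_woken; /* stand-in for wake_up() on a wait queue */
};

static void ordered_extent_logged(struct txn *t)
{
    t->pending_ordered++;
}

static void ordered_extent_completed(struct txn *t)
{
    if (--t->pending_ordered == 0)
        t->committer_woken = 1; /* real code wakes the waiting commit */
}
```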
     
  • As we have the new metadata reservation functions, use them to replace
    the old btrfs_qgroup_reserve() call for metadata.

    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     
    The value of num_items that start_transaction() ultimately
    takes is always small, so a 64-bit integer is overkill.

    Also change num_items for btrfs_start_transaction() and
    btrfs_start_transaction_lflush() as well.

    Reviewed-by: David Sterba
    Signed-off-by: Alexandru Moise
    Signed-off-by: David Sterba

    Alexandru Moise
     

06 Oct, 2015

1 commit

  • Josef ran into a deadlock while a transaction handle was finalizing the
    creation of its block groups, which produced the following trace:

    [260445.593112] fio D ffff88022a9df468 0 8924 4518 0x00000084
    [260445.593119] ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
    [260445.593126] ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
    [260445.593132] ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
    [260445.593137] Call Trace:
    [260445.593145] [] schedule+0x37/0x80
    [260445.593189] [] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
    [260445.593197] [] ? prepare_to_wait_event+0xf0/0xf0
    [260445.593225] [] btrfs_lock_root_node+0x34/0x50 [btrfs]
    [260445.593253] [] btrfs_search_slot+0x88b/0xa00 [btrfs]
    [260445.593295] [] ? free_extent_buffer+0x4f/0x90 [btrfs]
    [260445.593324] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
    [260445.593351] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
    [260445.593394] [] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
    [260445.593427] [] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
    [260445.593459] [] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
    [260445.593491] [] find_free_extent+0xa55/0xd90 [btrfs]
    [260445.593524] [] btrfs_reserve_extent+0xd2/0x220 [btrfs]
    [260445.593532] [] ? account_page_dirtied+0xdd/0x170
    [260445.593564] [] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
    [260445.593597] [] ? btree_set_page_dirty+0xe/0x10 [btrfs]
    [260445.593626] [] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
    [260445.593654] [] btrfs_cow_block+0x11f/0x1c0 [btrfs]
    [260445.593682] [] btrfs_search_slot+0x1e7/0xa00 [btrfs]
    [260445.593724] [] ? free_extent_buffer+0x4f/0x90 [btrfs]
    [260445.593752] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
    [260445.593830] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
    [260445.593905] [] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
    [260445.593946] [] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
    [260445.593990] [] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
    [260445.594042] [] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
    [260445.594089] [] btrfs_sync_file+0x294/0x350 [btrfs]
    [260445.594115] [] vfs_fsync_range+0x3b/0xa0
    [260445.594133] [] ? syscall_trace_enter_phase1+0x131/0x180
    [260445.594149] [] do_fsync+0x3d/0x70
    [260445.594169] [] ? syscall_trace_leave+0xb8/0x110
    [260445.594187] [] SyS_fsync+0x10/0x20
    [260445.594204] [] entry_SYSCALL_64_fastpath+0x12/0x71

    This happened because the same transaction handle created a large number
    of block groups and while finalizing their creation (inserting new items
    and updating existing items in the chunk and device trees) a new metadata
    extent had to be allocated and no free space was found in the current
    metadata block groups, which made find_free_extent() attempt to allocate
    a new block group via do_chunk_alloc(). However, at do_chunk_alloc() we
    ended up allocating a new system chunk too and exceeded the threshold
    of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
    final part of block group creation again (at
    btrfs_create_pending_block_groups()) and attempt to lock the root
    of the chunk tree again when it's already write locked by the same task.

    Similarly we can deadlock on extent tree nodes/leafs if while we are
    running delayed references we end up creating a new metadata block group
    in order to allocate a new node/leaf for the extent tree (as part of
    a CoW operation or growing the tree), as btrfs_create_pending_block_groups
    inserts items into the extent tree as well. In this case we get the
    following trace:

    [14242.773581] fio D ffff880428ca3418 0 3615 3100 0x00000084
    [14242.773588] ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
    [14242.773594] ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
    [14242.773600] ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
    [14242.773606] Call Trace:
    [14242.773613] [] schedule+0x37/0x80
    [14242.773656] [] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
    [14242.773664] [] ? prepare_to_wait_event+0xf0/0xf0
    [14242.773692] [] btrfs_lock_root_node+0x34/0x50 [btrfs]
    [14242.773720] [] btrfs_search_slot+0x88b/0xa00 [btrfs]
    [14242.773750] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
    [14242.773758] [] ? kmem_cache_alloc+0x1d2/0x200
    [14242.773786] [] btrfs_insert_item+0x71/0xf0 [btrfs]
    [14242.773818] [] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
    [14242.773850] [] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
    [14242.773934] [] find_free_extent+0xa55/0xd90 [btrfs]
    [14242.773998] [] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
    [14242.774041] [] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
    [14242.774078] [] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
    [14242.774118] [] btrfs_cow_block+0x11f/0x1c0 [btrfs]
    [14242.774155] [] btrfs_search_slot+0x1e7/0xa00 [btrfs]
    [14242.774194] [] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
    [14242.774235] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
    [14242.774274] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
    [14242.774318] [] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
    [14242.774358] [] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
    [14242.774391] [] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
    [14242.774432] [] commit_cowonly_roots+0x8d/0x2bd [btrfs]
    [14242.774474] [] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
    [14242.774516] [] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
    [14242.774558] [] btrfs_commit_transaction+0x590/0xb40 [btrfs]
    [14242.774599] [] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
    [14242.774642] [] btrfs_sync_file+0x294/0x350 [btrfs]
    [14242.774650] [] vfs_fsync_range+0x3b/0xa0
    [14242.774657] [] ? syscall_trace_enter_phase1+0x131/0x180
    [14242.774663] [] do_fsync+0x3d/0x70
    [14242.774669] [] ? syscall_trace_leave+0xb8/0x110
    [14242.774675] [] SyS_fsync+0x10/0x20
    [14242.774681] [] entry_SYSCALL_64_fastpath+0x12/0x71

    Fix this by never recursing into the finalization phase of block group
    creation and making sure we never trigger the finalization of block group
    creation while running delayed references.

    Reported-by: Josef Bacik
    Fixes: 00d80e342c0f ("Btrfs: fix quick exhaustion of the system array in the superblock")
    Signed-off-by: Filipe Manana

    Filipe Manana
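
    The non-recursion guard can be modeled in miniature (names are ours): a
    flag on the handle makes a nested chunk allocation skip the finalization
    it was called from.

```c
#include <assert.h>

/* Miniature model of the fix: a flag on the handle prevents a chunk
 * allocation made while finalizing block groups from re-entering the
 * finalization and deadlocking on the chunk tree root. */
struct handle {
    int creating_pending_bgs;
    int finalize_runs; /* for the example: times finalization ran */
};

static void create_pending_block_groups(struct handle *h);

static void chunk_alloc(struct handle *h)
{
    /* ...allocate a chunk, then flush pending block groups, but never
     * while we are already inside the finalization itself. */
    if (!h->creating_pending_bgs)
        create_pending_block_groups(h);
}

static void create_pending_block_groups(struct handle *h)
{
    h->creating_pending_bgs = 1;
    h->finalize_runs++;
    chunk_alloc(h); /* nested allocation triggered by the finalization */
    h->creating_pending_bgs = 0;
}
```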
     

23 Sep, 2015

1 commit

    When dropping a snapshot we need to account for the qgroup changes. If we
    drop the snapshot all in one go then the backref code will fail to find
    blocks from the snapshot we dropped since it won't be able to find the
    root in the fs root
    cache. This can lead to us failing to find refs from other roots that pointed
    at blocks in the now deleted root. To handle this we need to not remove the fs
    roots from the cache until after we process the qgroup operations. Do this by
    adding dropped roots to a list on the transaction, and letting the transaction
    remove the roots at the same time it drops the commit roots. This will keep all
    of the backref searching code in sync properly, and fixes a problem Mark was
    seeing with snapshot delete and qgroups. Thanks,

    Signed-off-by: Josef Bacik
    Tested-by: Holger Hoffstätte
    Signed-off-by: Chris Mason

    Josef Bacik
     

29 Jul, 2015

1 commit

  • When we clear the dirty bits in btrfs_delete_unused_bgs for extents
    in the empty block group, it results in btrfs_finish_extent_commit being
    unable to discard the freed extents.

    The block group removal patch added an alternate path to forget extents
    other than btrfs_finish_extent_commit. As a result, any extents that
    would be freed when the block group is removed aren't discarded. In my
    test run, with a large copy of mixed sized files followed by removal, it
    left nearly 2/3 of extents undiscarded.

    To clean up the block groups, we add the removed block group onto a list
    that will be discarded after transaction commit.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: Filipe Manana
    Tested-by: Filipe Manana
    Signed-off-by: Chris Mason

    Jeff Mahoney
     

11 Jun, 2015

1 commit

  • This is used by later qgroup fix patches for snapshot.

    Current snapshot accounting is done by btrfs_qgroup_inherit(), but the
    new extent-oriented quota mechanism will account for extents from
    btrfs_copy_root() and other snapshot operations, causing wrong results.

    So add the ability to handle snapshot accounting.

    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     

03 Jun, 2015

1 commit

    While creating a block group, we often end up getting ENOSPC while updating
    the chunk tree, which leads to a transaction abort that produces a trace
    like the following:

    [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
    [30670.117777] BTRFS: Transaction aborted (error -28)
    (...)
    [30670.163567] Call Trace:
    [30670.163906] [] dump_stack+0x4f/0x7b
    [30670.164522] [] ? console_unlock+0x361/0x3ad
    [30670.165171] [] warn_slowpath_common+0xa1/0xbb
    [30670.166323] [] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
    [30670.167213] [] warn_slowpath_fmt+0x46/0x48
    [30670.167862] [] __btrfs_abort_transaction+0x52/0x106 [btrfs]
    [30670.169116] [] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
    [30670.170593] [] __btrfs_end_transaction+0x84/0x366 [btrfs]
    [30670.171960] [] btrfs_end_transaction+0x10/0x12 [btrfs]
    [30670.174649] [] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
    [30670.176092] [] btrfs_fallocate+0x7c8/0xb96 [btrfs]
    [30670.177218] [] ? __this_cpu_preempt_check+0x13/0x15
    [30670.178622] [] vfs_fallocate+0x14c/0x1de
    [30670.179642] [] ? __fget_light+0x2d/0x4f
    [30670.180692] [] SyS_fallocate+0x47/0x62
    [30670.186737] [] system_call_fastpath+0x12/0x17
    [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---

    This is because we don't do proper space reservation for the chunk block
    reserve when we have multiple tasks allocating chunks in parallel.

    So block group creation has 2 phases, and the first phase essentially
    checks if there is enough space in the system space_info, allocating a
    new system chunk if there isn't, while the second phase updates the
    device, extent and chunk trees. However, because the updates to the
    chunk tree happen in the second phase, if we have N tasks, each with
    its own transaction handle, allocating new chunks in parallel and if
    there is only enough space in the system space_info to allocate M chunks,
    where M < N, none of the tasks ends up allocating a new system chunk in
    the first phase and N - M tasks will get -ENOSPC when attempting to
    update the chunk tree in phase 2 if they need to COW any nodes/leafs
    from the chunk tree.

    Fix this by doing proper reservation in the chunk block reserve.

    The issue could be reproduced by running fstests generic/038 in a loop,
    which eventually triggered the problem.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

11 Apr, 2015

2 commits

  • We loop through all of the dirty block groups during commit and write
    the free space cache. In order to make sure the cache is correct, we do
    this while no other writers are allowed in the commit.

    If a large number of block groups are dirty, this can introduce long
    stalls during the final stages of the commit, which can block new procs
    trying to change the filesystem.

    This commit changes the block group cache writeout to take appropriate
    locks and allow it to run earlier in the commit. We'll still have to
    redo some of the block groups, but it means we can get most of the work
    out of the way without blocking the entire FS.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This changes our delayed refs calculations to include the space needed
    to write back dirty block groups.

    Signed-off-by: Chris Mason

    Josef Bacik
     

27 Mar, 2015

1 commit

  • We can get into inconsistency between inodes and directory entries
    after fsyncing a directory. The issue is that while a directory gets
    the new dentries persisted in the fsync log and replayed at mount time,
    the link count of the inode that directory entries point to doesn't
    get updated, staying with an incorrect link count (smaller than the
    correct value). This later leads to stale file handle errors when
    accessing (including attempting to delete) some of the links if all the
    other ones are removed, which also makes it impossible to delete the
    parent directories, since the dentries can not be removed.

    Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
    when fsyncing a directory, new files aren't logged (their metadata and
    dentries) nor any child directories. So this patch fixes this issue too,
    since it has the same resolution as the incorrect inode link count issue
    mentioned before.

    This is very easy to reproduce, and the following excerpt from my test
    case for xfstests shows how:

    _scratch_mkfs >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create our main test file and directory.
    $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
    mkdir $SCRATCH_MNT/mydir

    # Make sure all metadata and data are durably persisted.
    sync

    # Add a hard link to 'foo' inside our test directory and fsync only the
    # directory. The btrfs fsync implementation had a bug that caused the new
    # directory entry to be visible after the fsync log replay, but the inode
    # of our file remained with a link count of 1.
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2

    # Add a few more links and new files.
    # This is just to verify nothing breaks or gives incorrect results after the
    # fsync log is replayed.
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
    $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
    ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2

    # Add some subdirectories and new files and links to them. This is to verify
    # that after fsyncing our top level directory 'mydir', all the subdirectories
    # and their files/links are registered in the fsync log and exist after the
    # fsync log is replayed.
    mkdir -p $SCRATCH_MNT/mydir/x/y/z
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
    touch $SCRATCH_MNT/mydir/x/y/z/qwerty

    # Now fsync only our top directory.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir

    # And fsync now our new file named 'hello', just to verify later that it has
    # the expected content and that the previous fsync on the directory 'mydir' had
    # no bad influence on this fsync.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello

    # Simulate a crash/power loss.
    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
    # all with the value 0xaa.
    echo "File 'foo' content after log replay:"
    od -t x1 $SCRATCH_MNT/foo

    # Remove the first name of our inode. Because of the directory fsync bug, the
    # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
    # deleting the inode and the other names became stale directory entries (still
    # visible to applications). Attempting to remove or access the remaining
    # dentries pointing to that inode resulted in stale file handle errors and
    # made it impossible to remove the parent directories since it was impossible
    # for them to become empty.
    echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
    rm -f $SCRATCH_MNT/foo

    # Now verify that all files, links and directories created before fsyncing our
    # directory exist after the fsync log was replayed.
    [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
    [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
    [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
    [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
    [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
    echo "Link mydir/x/y/foo_y_link is missing"
    [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
    echo "Link mydir/x/y/z/foo_z_link is missing"
    [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
    echo "File mydir/x/y/z/qwerty is missing"

    # We expect our file here to have a size of 64Kb and all the bytes having the
    # value 0xff.
    echo "file 'hello' content after log replay:"
    od -t x1 $SCRATCH_MNT/hello

    # Now remove all files/links, under our test directory 'mydir', and verify we
    # can remove all the directories.
    rm -f $SCRATCH_MNT/mydir/x/y/z/*
    rmdir $SCRATCH_MNT/mydir/x/y/z
    rm -f $SCRATCH_MNT/mydir/x/y/*
    rmdir $SCRATCH_MNT/mydir/x/y
    rmdir $SCRATCH_MNT/mydir/x
    rm -f $SCRATCH_MNT/mydir/*
    rmdir $SCRATCH_MNT/mydir

    # An fsck, run by the fstests framework every time a test finishes, also detected
    # the inconsistency and printed the following error message:
    #
    # root 5 inode 257 errors 2001, no inode item, link count wrong
    # unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
    # unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

    status=0
    exit

    The expected golden output for the test is:

    wrote 8192/8192 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    wrote 65536/65536 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    File 'foo' content after log replay:
    0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    *
    0020000
    file 'foo' link count after log replay: 5
    file 'hello' content after log replay:
    0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *
    0200000

    Which is the output after this patch and when running the test against
    ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
    output is:

    wrote 8192/8192 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    wrote 65536/65536 bytes at offset 0
    XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    File 'foo' content after log replay:
    0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    *
    0020000
    file 'foo' link count after log replay: 1
    Link mydir/foo_2 is missing
    Link mydir/foo_3 is missing
    Link mydir/x/y/foo_y_link is missing
    Link mydir/x/y/z/foo_z_link is missing
    File mydir/x/y/z/qwerty is missing
    file 'hello' content after log replay:
    0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *
    0200000
    rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
    rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
    rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
    rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
    rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
    rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty

    Fsck, without this fix, also complains about the wrong link count:

    root 5 inode 257 errors 2001, no inode item, link count wrong
    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

    So fix this by logging the inodes that the dentries point to when
    fsyncing a directory.

    A test case for xfstests follows.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

15 Feb, 2015

1 commit

    Btrfs will report NO_SPACE when we create and remove files several times,
    and we can't write to the filesystem until we mount it again.

    Steps to reproduce:
    1: Create a single-dev btrfs fs with default option
    2: Write a file into it to take up most fs space
    3: Delete above file
    4: Wait about 100s to let the chunks be removed
    5: goto 2

    Script is like following:
    #!/bin/bash

    # Recommend 1.2G space, too large disk will make test slow
    DEV="/dev/sda16"
    MNT="/mnt/tmp"

    dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
    file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))

    echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"

    for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
    echo "mkfs $DEV"
    mkfs.btrfs -f "$DEV" >/dev/null || exit 2
    echo "mount $DEV $MNT"
    mount "$DEV" "$MNT" || exit 2

    for ((loop_i = 0; loop_i < 20; loop_i++)); do
        echo
        echo "loop $loop_i"

        echo "dd file..."
        cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
        "${cmd[@]}" 2>/dev/null || {
            # NO_SPACE error triggered
            echo "dd failed: ${cmd[*]}"
            exit 1
        }

        echo "rm file..."
        rm -f "$MNT"/file0 || exit 2

        for ((i = 0; i < 10; i++)); do
            df "$MNT" | tail -1
            sleep 10
        done
    done

    Reason:
    It is triggered by commit 47ab2a6c689913db23ccae38349714edf8365e0a,
    which is used to remove empty block groups automatically, but the
    root cause is not in that patch. The code worked well before because
    btrfs didn't need to create and delete chunks so many times with such
    high complexity.
    The above bug is caused by several problems; any of them can trigger it.

    Reason 1:
    When we remove some contiguous chunks but leave other chunks after them,
    the freed disk space should be usable for chunk re-creation, but in the
    current code only the first creation succeeds.
    Fixed by Forrest Liu in:
    Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

    Reason2:
    contains_pending_extent() returns a wrong value in its calculation.
    Fixed by Forrest Liu in:
    Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

    Reason3:
    btrfs_check_data_free_space() tries to commit the transaction and
    retry the chunk allocation when the first allocation fails, but
    space_info->full is set by the first allocation and prevents the
    second allocation in the retry.
    Fixed in this patch by clearing space_info->full in the transaction commit.

    Tested several times with the above script.
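    The Reason3 fix can be illustrated with a toy model: a sticky "full"
    flag set by a failed allocation blocks the retry unless the commit
    clears it. All names and structures below are invented for
    illustration; this is not the actual btrfs code.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the real space_info. */
    struct space_info {
        bool full;          /* set when chunk allocation fails */
        int free_chunks;    /* chunks a commit could reclaim */
    };

    static bool alloc_chunk(struct space_info *s)
    {
        if (s->full || s->free_chunks == 0) {
            s->full = true;     /* remember the failure */
            return false;
        }
        s->free_chunks--;
        return true;
    }

    /* A commit that removed block groups frees space; the fix is to
     * also clear ->full here so the retry can allocate again. */
    static void commit_transaction(struct space_info *s, int reclaimed)
    {
        s->free_chunks += reclaimed;
        s->full = false;        /* without this, the retry below fails */
    }

    static bool alloc_with_retry(struct space_info *s)
    {
        if (alloc_chunk(s))
            return true;
        commit_transaction(s, 1);   /* reclaim removed block groups */
        return alloc_chunk(s);
    }

    int main(void)
    {
        struct space_info s = { .full = false, .free_chunks = 0 };
        /* First attempt fails and sets ->full; the commit clears it
         * and reclaims a chunk, so the retry succeeds. */
        assert(alloc_with_retry(&s));
        printf("retry after commit succeeded\n");
        return 0;
    }
    ```
    
    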

    Changelog v3->v4:
    Use a lightweight int instead of atomic_t to record have_remove_bgs in
    the transaction, suggested by:
    Josef Bacik

    Changelog v2->v3:
    v2 fixed the bug by adding more transaction commits, but we only
    need to reclaim space when we really have no space for a new
    chunk, noticed by:
    Filipe David Manana

    Actually, our code already has this type of commit-and-retry; we
    only need to make it work with removed block groups.
    v3 fixed the bug this way.

    Changelog v1->v2:
    v1 would introduce a new bug when deleting and creating a chunk in the same
    disk space in the same transaction, noticed by:
    Filipe David Manana
    v2 fixes this bug by committing the transaction after removing block groups.

    Reported-by: Tsutomu Itoh
    Suggested-by: Filipe David Manana
    Suggested-by: Josef Bacik
    Signed-off-by: Zhao Lei
    Signed-off-by: Chris Mason

    Zhao Lei
     

22 Jan, 2015

1 commit

  • Currently, any time we try to update the block groups on disk we walk _all_
    block groups and check the ->dirty flag to see if it is set. This function
    can get called several times during a commit, so if you have several terabytes
    of data you will be a very sad panda as we loop through _all_ of the block
    groups several times, which makes the commit take a while and slows down the
    rest of the file system operations.

    This patch introduces a dirty list for the block groups that we get added to
    when we dirty the block group for the first time. Then we simply update any
    block groups that have been dirtied since the last time we called
    btrfs_write_dirty_block_groups. This allows us to clean up how we write the
    free space cache out so it is much cleaner. Thanks,
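    The dirty-list idea can be sketched in miniature: block groups are
    linked onto a per-transaction list the first time they are dirtied,
    and the writer walks only that list instead of scanning everything.
    All names below are invented for illustration, not the real btrfs
    structures.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct block_group {
        bool dirty;
        bool on_dirty_list;
        struct block_group *next_dirty;  /* NULL when not on the list */
    };

    struct transaction {
        struct block_group *dirty_list;  /* only dirtied groups live here */
    };

    static void dirty_block_group(struct transaction *t, struct block_group *bg)
    {
        bg->dirty = true;
        if (!bg->on_dirty_list) {        /* added only on first dirtying */
            bg->on_dirty_list = true;
            bg->next_dirty = t->dirty_list;
            t->dirty_list = bg;
        }
    }

    /* Walks only the dirtied groups; returns how many were written. */
    static int write_dirty_block_groups(struct transaction *t)
    {
        int written = 0;
        while (t->dirty_list) {
            struct block_group *bg = t->dirty_list;
            t->dirty_list = bg->next_dirty;
            bg->next_dirty = NULL;
            bg->on_dirty_list = false;
            bg->dirty = false;           /* "write" it out */
            written++;
        }
        return written;
    }

    int main(void)
    {
        struct transaction t = { NULL };
        struct block_group bgs[1000] = { 0 };
        /* Dirty two of a thousand groups; only those two are visited. */
        dirty_block_group(&t, &bgs[3]);
        dirty_block_group(&t, &bgs[700]);
        dirty_block_group(&t, &bgs[3]);  /* re-dirtying does not re-add */
        assert(write_dirty_block_groups(&t) == 2);
        printf("wrote 2 of 1000 block groups\n");
        return 0;
    }
    ```

    The win is the same as in the commit message: the cost of a write-out
    becomes proportional to what was dirtied, not to the total number of
    block groups.
    
    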

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

25 Nov, 2014

1 commit


22 Nov, 2014

1 commit

  • Liu Bo pointed out that my previous fix would lose the generation update in the
    scenario I described. It is actually much worse than that, we could lose the
    entire extent if we lose power right after the transaction commits. Consider
    the following

    write extent 0-4k
    log extent in log tree
    commit transaction
    < power fail happens here
    ordered extent completes

    We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
    the transaction commit will reset the log so it isn't replayed. If we lose
    power before the transaction commit we are safe, otherwise we are not.

    Fix this by keeping track of all extents we logged in this transaction. Then
    when we go to commit the transaction make sure we wait for all of those ordered
    extents to complete before proceeding. This will make sure that if we lose
    power after the transaction commit we still have our data. This also fixes the
    problem of the improperly updated extent generation. Thanks,
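    The fix described above can be modeled as: the transaction remembers
    every extent written to the log tree, and the commit waits on each
    pending ordered extent before it resets the log. This is a toy
    sketch with invented names, not the actual btrfs implementation.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_LOGGED 16

    struct ordered_extent {
        bool complete;   /* set once the fs tree has been updated */
    };

    struct transaction {
        struct ordered_extent *logged[MAX_LOGGED];
        int nr_logged;
    };

    static void log_extent(struct transaction *t, struct ordered_extent *oe)
    {
        t->logged[t->nr_logged++] = oe;  /* remember it for the commit */
    }

    /* Stand-in for blocking on I/O completion: here we just finish it. */
    static void wait_ordered_extent(struct ordered_extent *oe)
    {
        oe->complete = true;
    }

    /* Returns how many logged extents were still pending. Only after
     * this loop is it safe to reset the log: a power failure after the
     * commit can no longer lose an extent that existed only in the log. */
    static int commit_transaction(struct transaction *t)
    {
        int waited = 0;
        for (int i = 0; i < t->nr_logged; i++) {
            if (!t->logged[i]->complete) {
                wait_ordered_extent(t->logged[i]);
                waited++;
            }
        }
        t->nr_logged = 0;
        return waited;
    }

    int main(void)
    {
        struct transaction t = { .nr_logged = 0 };
        struct ordered_extent oe = { .complete = false };
        log_extent(&t, &oe);                  /* extent written to the log */
        assert(commit_transaction(&t) == 1);  /* commit waited for it */
        assert(oe.complete);
        printf("commit waited for 1 logged ordered extent\n");
        return 0;
    }
    ```
    
    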

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

21 Nov, 2014

1 commit

  • When committing a transaction or a log, we look for btree extents that
    need to be durably persisted by searching for ranges in a io tree that
    have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
    those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
    convert_extent_bit, and then start writeback for the extents.

    That function however can return an error (at the moment only -ENOMEM
    is possible, especially when it does GFP_ATOMIC allocation requests
    through alloc_extent_state_atomic) - that means the ranges didn't get
    the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
    which in turn means a call to btrfs_wait_marked_extents() won't find
    those ranges for which we started writeback, causing a transaction
    commit or a log commit to persist a new superblock without waiting
    for the writeback of extents in that range to finish first.

    Therefore if a crash happens after persisting the new superblock and
    before writeback finishes, we have a superblock pointing to roots that
    weren't fully persisted or roots that point to nodes or leafs that weren't
    fully persisted, causing all sorts of unexpected/bad behaviour as we end up
    reading garbage from disk or the content of some node/leaf from a past
    generation that got cowed or deleted and is no longer valid (for this latter
    case we end up getting error messages like "parent transid verify failed on
    X wanted Y found Z" when reading btree nodes/leafs from disk).

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

12 Nov, 2014

1 commit

  • There are some actions that modify global filesystem state but cannot be
    performed at the time of the request, only later at transaction commit
    time when the filesystem is in a known state.

    For example, enabling new incompat features on the fly, or issuing a
    transaction commit from unsafe contexts (sysfs handlers).
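    A miniature sketch of that defer-to-commit pattern might look like
    the following. The enum values, struct, and helpers are hypothetical,
    chosen only to show the shape of the mechanism.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Actions queued from contexts where they can't be applied directly. */
    enum pending_action {
        PENDING_SET_INCOMPAT_FEATURE = 1 << 0,
        PENDING_COMMIT_REQUEST       = 1 << 1,
    };

    struct fs_info {
        unsigned long pending_changes;  /* bits queued for the next commit */
        unsigned long incompat_flags;   /* only modified at commit time */
    };

    /* Safe to call from any context, e.g. a sysfs handler: it only
     * records the request. */
    static void queue_pending(struct fs_info *fs, enum pending_action a)
    {
        fs->pending_changes |= a;
    }

    /* Called from the commit path, when the filesystem is in a known
     * state; applies everything queued and returns the applied bits. */
    static unsigned long apply_pending(struct fs_info *fs)
    {
        unsigned long done = fs->pending_changes;

        if (done & PENDING_SET_INCOMPAT_FEATURE)
            fs->incompat_flags |= 1;    /* enable the feature on the fly */

        fs->pending_changes = 0;
        return done;
    }

    int main(void)
    {
        struct fs_info fs = { 0, 0 };
        queue_pending(&fs, PENDING_SET_INCOMPAT_FEATURE);
        assert(apply_pending(&fs) == PENDING_SET_INCOMPAT_FEATURE);
        assert(fs.incompat_flags == 1);
        printf("pending action applied at commit time\n");
        return 0;
    }
    ```
    
    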

    Signed-off-by: David Sterba

    David Sterba
     

02 Oct, 2014

1 commit


15 Aug, 2014

1 commit

  • Truncates and renames are often used to replace old versions of a file
    with new versions. Applications often expect this to be an atomic
    replacement, even if they haven't done anything to make sure the new
    version is fully on disk.

    Btrfs has strict flushing in place to make sure that renaming over an
    old file with a new file will fully flush out the new file before
    allowing the transaction commit with the rename to complete.

    This ordering means the commit code needs to be able to lock file pages,
    and there are a few paths in the filesystem where we will try to end a
    transaction with the page lock held. It's rare, but these things can
    deadlock.

    This patch removes the ordered flushes and switches to a best effort
    filemap_flush like ext4 uses. It's not perfect, but it should fix the
    deadlocks.

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Jun, 2014

1 commit

  • This exercises the various parts of the new qgroup accounting code. We do some
    basic stuff and do some things with the shared refs to make sure all that code
    works. I had to add a bunch of infrastructure because I needed to be able to
    insert items into a fake tree without having to do all the hard work myself;
    hopefully this will be useful in the future. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

07 Apr, 2014

2 commits

  • Let's try this again. We can deadlock the box if we send on a box and try to
    write onto the same fs with the app that is trying to listen to the send pipe.
    This is because the writer could get stuck waiting for a transaction commit
    which is being blocked by the send. So fix this by making sure looking at the
    commit roots is always going to be consistent. We do this by keeping track of
    which roots need to have their commit roots swapped during commit, and then
    taking the commit_root_sem and swapping them all at once. Then make sure we
    take a read lock on the commit_root_sem in cases where we search the commit root
    to make sure we're always looking at a consistent view of the commit roots.
    Previously we had problems with this because we would swap a fs tree commit root
    and then swap the extent tree commit root independently which would cause the
    backref walking code to screw up sometimes. With this patch we no longer
    deadlock and pass all the weird send/receive corner cases. Thanks,
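    The all-at-once swap under commit_root_sem can be sketched with a
    pthread rwlock: the commit swaps every dirtied commit root while
    holding the write lock, and searchers hold the read lock so they can
    never observe a half-swapped mix. The structs and helpers here are
    invented for illustration, not the kernel code.

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NR_ROOTS 2   /* e.g. an fs root and the extent root */

    struct root {
        int node;         /* current in-memory root node (a stand-in) */
        int commit_node;  /* node that commit-root readers see */
        int dirty;        /* needs its commit root swapped at commit */
    };

    struct fs {
        pthread_rwlock_t commit_root_sem;
        struct root roots[NR_ROOTS];
    };

    /* Swap every dirtied commit root at once, under the write lock. */
    static void commit_roots(struct fs *fs)
    {
        pthread_rwlock_wrlock(&fs->commit_root_sem);
        for (int i = 0; i < NR_ROOTS; i++) {
            if (fs->roots[i].dirty) {
                fs->roots[i].commit_node = fs->roots[i].node;
                fs->roots[i].dirty = 0;
            }
        }
        pthread_rwlock_unlock(&fs->commit_root_sem);
    }

    /* A searcher (e.g. send or backref walking) takes the read lock, so
     * no swap can happen between its reads of the different roots. */
    static void read_commit_roots(struct fs *fs, int out[NR_ROOTS])
    {
        pthread_rwlock_rdlock(&fs->commit_root_sem);
        for (int i = 0; i < NR_ROOTS; i++)
            out[i] = fs->roots[i].commit_node;
        pthread_rwlock_unlock(&fs->commit_root_sem);
    }

    int main(void)
    {
        struct fs fs = { .roots = { { 2, 1, 1 }, { 20, 10, 1 } } };
        int seen[NR_ROOTS];

        pthread_rwlock_init(&fs.commit_root_sem, NULL);
        commit_roots(&fs);
        read_commit_roots(&fs, seen);
        /* Both roots swapped together: a reader never sees the new fs
         * root paired with the old extent root. */
        assert(seen[0] == 2 && seen[1] == 20);
        printf("commit roots swapped atomically\n");
        return 0;
    }
    ```
    
    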

    Reported-by: Hugo Mills
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • So I have an awful exercise script that will run snapshot, balance and
    send/receive in parallel. This sometimes would crash spectacularly and when it
    came back up the fs would be completely hosed. Turns out this is because of a
    bad interaction of balance and send/receive. Send will hold onto its entire
    path for the whole send, but its blocks could get relocated out from underneath
    it, and because it doesn't hold tree locks there's nothing to keep this from
    happening. So it will go to read in a slot with an old transid, and we could
    have re-allocated this block for something else and it could have a completely
    different transid. But because we think it is invalid we clear uptodate and
    re-read in the block. If we do this before we actually write out the new block
    we could write back stale data to the fs, and boom we're screwed.

    Now we definitely need to fix this disconnect between send and balance, but we
    really, really need to not allow ourselves to accidentally read in stale data over
    new data. So make sure we check that the extent buffer is not under io before
    clearing uptodate, this will kick back EIO to the caller instead of reading in
    stale data and keep us from corrupting the fs. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

29 Jan, 2014

2 commits

  • Looking into some performance related issues with large amounts of metadata
    revealed that we can have some pretty huge swings in fsync() performance. If we
    have a lot of delayed refs backed up (as you will tend to do with lots of
    metadata) fsync() will wander off and try to run some of those delayed refs
    which can result in reading from disk and such. Since the actual act of fsync()
    doesn't create any delayed refs, there is no need to make it throttle on delayed
    ref stuff; that will be handled by other people. With this patch we get much
    smoother fsync performance with large amounts of metadata. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Two reasons:
    - btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle(),
    so it is unnecessary.
    - All the delayed items should be dealt with in the current transaction, so
    the workers should not commit the transaction; instead, they should deal
    with as many of the delayed items as possible.

    So we can remove btrfs_end_transaction_dmeta().

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie