29 Feb, 2020

1 commit

  • commit e75fd33b3f744f644061a4f9662bd63f5434f806 upstream.

    In btrfs_wait_ordered_range() once we find an ordered extent that has
    finished with an error we exit the loop and don't wait for any other
    ordered extents that might still be in progress.

    All the users of btrfs_wait_ordered_range() expect that there are no more
    ordered extents in progress after that function returns. So past fixes
    such as the ones from the following two commits:

    ff612ba7849964 ("btrfs: fix panic during relocation after ENOSPC before
    writeback happens")

    28aeeac1dd3080 ("Btrfs: fix panic when starting bg cache writeout after
    IO error")

    don't work when there are multiple ordered extents in the range.

    Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
    even after it finds one that had an error.
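
    As a user-space sketch of the fixed loop shape (hypothetical names, not
    the actual kernel diff), the idea is to keep waiting on every ordered
    extent in the range and only remember the first error, instead of
    returning as soon as one extent fails:

    struct ordered_extent { int error; struct ordered_extent *next; };

    static void wait_for_completion_stub(struct ordered_extent *oe) { (void)oe; }

    /* Wait on ALL extents in the range; report the first error seen. */
    static int wait_ordered_range_sketch(struct ordered_extent *head)
    {
            int ret = 0;

            for (struct ordered_extent *oe = head; oe; oe = oe->next) {
                    wait_for_completion_stub(oe);
                    if (oe->error && ret == 0)
                            ret = oe->error;   /* remember, keep waiting */
            }
            return ret;
    }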

    Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Qu Wenruo
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

09 Jan, 2020

1 commit

  • [ Upstream commit a0cac0ec961f0d42828eeef196ac2246a2f07659 ]

    Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed
    write") worked around the issue that a recycled work item could get a
    false dependency on the original work item due to how the workqueue code
    guarantees non-reentrancy. It did so by giving different work functions
    to different types of work.

    However, the fixes in the previous few patches are more complete, as
    they prevent a work item from being recycled at all (except for a tiny
    window that the kernel workqueue code handles for us). This obsoletes
    the previous fix, so we don't need the unique helpers for correctness.
    The only other reason to keep them would be so they show up in stack
    traces, but they always seem to be optimized to a tail call, so they
    don't show up anyway. So, let's just get rid of the extra indirection.

    While we're here, rename normal_work_helper() to the more informative
    btrfs_work_helper().
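
    As an illustration only (a simplified user-space model, not the btrfs
    code), the indirection being removed looks roughly like this: every work
    type used to get its own thin helper whose only job was to call a common
    handler, and now the work item points at the one shared helper directly.

    typedef void (*work_fn)(void *data);

    struct work_item {
            work_fn fn;
            void   *data;
    };

    /* The one shared handler, btrfs_work_helper() in spirit. */
    static void common_work_helper(void *data)
    {
            (void)data;     /* ... do the per-item processing ... */
    }

    /* Before: one wrapper per work type, existing only to provide a
     * distinct function address; it tail-called the common helper and
     * never showed up in stack traces anyway. */
    static void endio_work_helper(void *data) { common_work_helper(data); }

    /* After: queue the common helper directly, no extra indirection. */
    static void queue_work_sketch(struct work_item *w, void *data)
    {
            w->fn = common_work_helper;
            w->data = data;
    }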

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Filipe Manana
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Omar Sandoval
     

26 Jul, 2019

1 commit

  • btrfs_lock_and_flush_ordered_range() loads the given "*cached_state" into
    cachedp, which, in general, is NULL. Then, lock_extent_bits() updates
    "cachedp", but the update is never propagated back to the caller. Thus
    the caller still sees its "cached_state" as NULL and never frees the
    state allocated under btrfs_lock_and_flush_ordered_range(). As a result,
    we see a massive extent state leak with e.g. fstests btrfs/005. Fix this
    bug by properly handling the pointers.
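
    A minimal sketch of the pointer bug and its fix (hypothetical names): if
    the function copies the caller's pointer into a local and only ever
    updates the local, the caller never sees the allocated state; the fix is
    to write the result back through the caller's pointer.

    #include <stdlib.h>

    struct extent_state { int dummy; };

    /* Buggy shape: cachedp is filled in locally but never written back,
     * so the state allocated here is leaked. */
    static void lock_range_buggy(struct extent_state **cached_state)
    {
            struct extent_state *cachedp = cached_state ? *cached_state : NULL;

            if (!cachedp)
                    cachedp = calloc(1, sizeof(*cachedp));
            /* ... lock the range using cachedp ... */
    }

    /* Fixed shape: propagate the (possibly newly allocated) state back. */
    static void lock_range_fixed(struct extent_state **cached_state)
    {
            struct extent_state *cachedp = cached_state ? *cached_state : NULL;

            if (!cachedp)
                    cachedp = calloc(1, sizeof(*cachedp));
            /* ... lock the range using cachedp ... */
            if (cached_state)
                    *cached_state = cachedp;   /* caller can free it later */
    }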

    Fixes: bd80d94efb83 ("btrfs: Always use a cached extent_state in btrfs_lock_and_flush_ordered_range")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Naohiro Aota
    Signed-off-by: David Sterba

    Naohiro Aota
     

04 Jul, 2019

1 commit

  • We have code for data and metadata reservations for delalloc. There's
    quite a bit of code here, and it's used in a lot of places, so I've
    separated it out into its own file. inode.c and file.c are already
    pretty large, and this code is complicated enough to live in its own
    space.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

01 Jul, 2019

3 commits

  • BTRFS has the implicit assumption that a checksum in btrfs_ordered_sums
    is 4 bytes. While this is true for CRC32C, it is not for any other
    checksum.

    Change the data type to be a byte array and adjust loop index
    calculation accordingly.

    This includes moving the adjustment of 'index' by 'ins_size' in
    btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
    size, because before this patch the 'sums' member of 'struct
    btrfs_ordered_sum' was 4 bytes in size and afterwards it is only one
    byte.
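
    A small arithmetic illustration of why the order matters (illustrative
    values, not the kernel code): with a byte array, 'index' counts bytes, so
    it has to be advanced by the raw byte count before 'ins_size' is turned
    into a number of checksums.

    #include <stdio.h>

    int main(void)
    {
            const unsigned int csum_size = 4;   /* e.g. CRC32C */
            unsigned int ins_size = 32;         /* checksum bytes inserted */
            unsigned int index = 0;             /* byte offset into sums[] */

            index += ins_size;                  /* advance by bytes first */
            ins_size /= csum_size;              /* then count checksums */

            printf("index advanced by %u bytes (%u checksums)\n",
                   index, ins_size);
            return 0;
    }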

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • In case no cached_state argument is passed to
    btrfs_lock_and_flush_ordered_range(), use one local to the function. This
    optimises the case when an ordered extent is found, since the unlock
    function will be able to unlock that state directly without searching
    for it again.

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • There is a certain idiom used in multiple places in btrfs' codebase for
    dealing with flushing an ordered range. Factor it out into a separate
    function that can be reused. Future patches will replace the existing
    code with that function.
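
    Roughly, the idiom being factored out has the following shape (a
    user-space sketch with hypothetical helper names): lock the range, look
    for an ordered extent overlapping it, and if one is found drop the lock,
    wait for the extent to complete and retry until the range is clean.

    #include <stddef.h>

    struct ordered_extent { int dummy; };

    typedef unsigned long long u64s;    /* stand-in for the kernel's u64 */

    /* Stand-ins for the kernel helpers used by the idiom. */
    static struct ordered_extent *lookup_ordered_range(u64s start, u64s len)
    { (void)start; (void)len; return NULL; }
    static void lock_range(u64s s, u64s e)   { (void)s; (void)e; }
    static void unlock_range(u64s s, u64s e) { (void)s; (void)e; }
    static void wait_and_put_ordered(struct ordered_extent *oe) { (void)oe; }

    /* Return with [start, end] locked and no ordered extent overlapping it. */
    static void lock_and_flush_ordered_range_sketch(u64s start, u64s end)
    {
            for (;;) {
                    struct ordered_extent *oe;

                    lock_range(start, end);
                    oe = lookup_ordered_range(start, end - start + 1);
                    if (!oe)
                            break;      /* range is clean, stay locked */
                    unlock_range(start, end);
                    wait_and_put_ordered(oe);
            }
    }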

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

30 Apr, 2019

2 commits

  • When diagnosing a slowdown of generic/224 I noticed we were not doing
    anything when calling into shrink_delalloc(). This is because all
    writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
    counter is 0, which short circuits most of the work inside of
    shrink_delalloc(). However O_DIRECT writes still consume metadata
    resources and generate ordered extents, which we can still wait on.

    Fix this by tracking outstanding DIO write bytes, and use this as well as
    the delalloc bytes counter to decide if we need to look up and wait on any
    ordered extents. If we have more DIO writes than delalloc bytes we'll go
    ahead and wait on any ordered extents regardless of our flush state, as
    flushing delalloc is likely to not gain us anything.
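
    Sketched as two tiny predicates (hypothetical field names; the real
    counters live in the btrfs fs_info/space_info structures), the decision
    described above looks like this:

    #include <stdbool.h>

    struct space_counters {
            unsigned long long delalloc_bytes;  /* buffered delalloc writes */
            unsigned long long dio_bytes;       /* outstanding O_DIRECT writes */
    };

    /* The old short circuit: nothing to flush and nothing to wait on. */
    static bool nothing_to_do(const struct space_counters *s)
    {
            return s->delalloc_bytes == 0 && s->dio_bytes == 0;
    }

    /* Mostly direct IO: flushing delalloc gains little, but the ordered
     * extents generated by the DIO writes can still be waited on. */
    static bool wait_ordered_regardless_of_flush(const struct space_counters *s)
    {
            return s->dio_bytes > s->delalloc_bytes;
    }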

    Signed-off-by: Josef Bacik
    [ use dio instead of odirect in identifiers ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Ordered csums are keyed off of a btrfs_ordered_extent, which already has
    a reference to the inode. This implies that an explicit inode argument
    is redundant. So remove it.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

25 Apr, 2019

1 commit

  • The recent multi-page biovec rework allowed creation of bios that can span
    large regions - up to 128 megabytes in the case of btrfs. OTOH btrfs'
    submission path currently allocates a contiguous array to store the
    checksums for every bio submitted. This means we can request up to
    (128mb / BTRFS_SECTOR_SIZE) * 4 bytes + 32 bytes of memory from kmalloc.
    On busy systems with possibly fragmented memory said kmalloc can fail,
    which will trigger a BUG_ON due to improper error handling in the IO
    submission context in btrfs.

    Until error handling is improved or bios in btrfs are limited to a more
    manageable size (e.g. 1m), let's use kvmalloc to fall back to vmalloc for
    such large allocations. There is no hard requirement that the memory
    allocated for checksums during IO submission has to be contiguous, but
    this is a simple fix that does not require several non-contiguous
    allocations.

    For small writes this is unlikely to have any visible effect since
    kmalloc will still satisfy allocation requests as usual. For larger
    requests the code will just fall back to vmalloc.

    We've performed evaluation on several workload types and there was no
    significant difference between kmalloc and kvmalloc.
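
    As a rough kernel-style sketch (simplified, not the actual btrfs hunk;
    the structure below is illustrative), the change boils down to allocating
    with kvmalloc()/kvfree(), which try kmalloc first and transparently fall
    back to vmalloc for large requests:

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    /* Illustrative container: one checksum per sector of the bio. */
    struct ordered_sum_sketch {
            u64 bytenr;
            int len;
            u8 *sums;
    };

    static struct ordered_sum_sketch *alloc_sums(int nr_sectors, int csum_size)
    {
            struct ordered_sum_sketch *s;

            /* Can be several megabytes for a 128M bio; fall back to
             * vmalloc instead of hitting a BUG_ON on kmalloc failure. */
            s = kvmalloc(sizeof(*s) + (size_t)nr_sectors * csum_size, GFP_NOFS);
            if (!s)
                    return NULL;
            s->sums = (u8 *)(s + 1);
            return s;
    }

    static void free_sums(struct ordered_sum_sketch *s)
    {
            kvfree(s);      /* handles both kmalloc'ed and vmalloc'ed memory */
    }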

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

17 Dec, 2018

1 commit

  • Tracking pending ordered extents per transaction was introduced in commit
    50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current
    transaction V3") and later updated in commit 161c3549b45a ("Btrfs: change
    how we wait for pending ordered extents").

    However, now that on fsync we always wait for ordered extents to complete
    before logging, done in commit 5636cf7d6dc8 ("btrfs: remove the logged
    extents infrastructure"), we no longer need the code to track pending
    ordered extents, which was not completely removed in the mentioned commit.
    So remove the remainder of the pending ordered extents infrastructure.

    Reviewed-by: Liu Bo
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

31 Mar, 2018

1 commit

  • Before this patch, btrfs qgroup is mixing per-transaction meta rsv with
    preallocated meta rsv, making it quite easy to underflow the qgroup meta
    reservation.

    Since we have the new qgroup meta rsv types, apply them to delalloc
    reservation.

    Now for delalloc, most of its reserved space will use the META_PREALLOC
    qgroup rsv type.

    And callers reducing outstanding extents, like btrfs_finish_ordered_io(),
    will convert the corresponding META_PREALLOC reservation to
    META_PERTRANS.

    This is mainly due to the fact that current qgroup numbers will only be
    updated in btrfs_commit_transaction(), that is to say if we don't keep
    such a placeholder reservation, we can exceed the qgroup limitation.

    And callers freeing outstanding extents in an error handler will
    just free the META_PREALLOC bytes.

    This behavior makes callers of btrfs_qgroup_release_meta() or
    btrfs_qgroup_convert_meta() aware of which type they are dealing with.
    So in this patch, btrfs_delalloc_release_metadata() and its callers get
    an extra parameter to tell qgroup whether to convert or release the meta
    reservation.

    The good news is, even if we use the wrong type (convert or free), it
    won't cause an obvious bug, as the prealloc type is always in good shape,
    and the type only affects whether the per-trans meta is increased or not.

    So in the worst case the metadata limit can sometimes be exceeded
    (no convert at all) or be reached too soon (no free at all).
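
    A simplified sketch of the two release paths (hypothetical names; the
    real code threads a flag through btrfs_delalloc_release_metadata()):
    successful completion converts the prealloc reservation to per-trans,
    while the error path simply frees it.

    enum qgroup_rsv_type_sketch {
            RSV_META_PREALLOC,
            RSV_META_PERTRANS,
    };

    struct qgroup_sketch {
            unsigned long long rsv[2];          /* bytes reserved per type */
    };

    /* Error path: just give the preallocated bytes back. */
    static void qgroup_free_meta_prealloc(struct qgroup_sketch *qg,
                                          unsigned long long bytes)
    {
            qg->rsv[RSV_META_PREALLOC] -= bytes;
    }

    /* Success path (e.g. btrfs_finish_ordered_io()): keep the bytes as a
     * per-trans placeholder until qgroup numbers are updated at commit. */
    static void qgroup_convert_meta(struct qgroup_sketch *qg,
                                    unsigned long long bytes)
    {
            qg->rsv[RSV_META_PREALLOC] -= bytes;
            qg->rsv[RSV_META_PERTRANS] += bytes;
    }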

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     

26 Mar, 2018

1 commit

  • The __cold functions are placed in a special section, as they're
    expected to be called rarely. This could help i-cache prefetches and help
    the compiler decide which branches are more/less likely to be taken
    without any other annotations needed.

    Though we can't add more __exit annotations, it's still possible to add
    __cold (that's also added with __exit). That way the following function
    categories are tagged:

    - printf wrappers, error messages
    - exit helpers
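
    For illustration, in plain C (using the underlying GCC attribute rather
    than the kernel's __cold macro), a rarely called error/print wrapper
    would be tagged like this:

    #include <stdarg.h>
    #include <stdio.h>

    /* Rarely executed: the compiler may move it out of the hot text and
     * treat branches leading to it as unlikely. */
    static void __attribute__((cold)) report_error(const char *fmt, ...)
    {
            va_list args;

            va_start(args, fmt);
            vfprintf(stderr, fmt, args);
            va_end(args);
    }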

    Signed-off-by: David Sterba

    David Sterba
     

02 Nov, 2017

1 commit

  • Right now we jump through a lot of weird hoops around outstanding_extents
    in order to keep the extent count consistent. This is because we logically
    transfer the outstanding_extent count from the initial reservation
    through the set_delalloc_bits call. This makes it pretty difficult to get
    a handle on how and when we need to mess with outstanding_extents.

    Fix this by revamping the rules of how we deal with outstanding_extents.
    Now, instead, everybody that is holding on to a delalloc extent is
    required to increase the outstanding extents count for itself. This
    means we'll have something like this

    btrfs_delalloc_reserve_metadata - outstanding_extents = 1
    btrfs_set_extent_delalloc - outstanding_extents = 2
    btrfs_delalloc_release_extents - outstanding_extents = 1

    for an initial file write. Now take the append write where we extend an
    existing delalloc range but still under the maximum extent size

    btrfs_delalloc_reserve_metadata - outstanding_extents = 2
    btrfs_set_extent_delalloc
    btrfs_set_bit_hook - outstanding_extents = 3
    btrfs_merge_extent_hook - outstanding_extents = 2
    btrfs_delalloc_release_extents - outstanding_extents = 1

    In order to make the ordered extent transition we of course must now
    make ordered extents carry their own outstanding_extent reservation, so
    for cow_file_range we end up with

    btrfs_add_ordered_extent - outstanding_extents = 2
    clear_extent_bit - outstanding_extents = 1
    btrfs_remove_ordered_extent - outstanding_extents = 0

    This makes all manipulations of outstanding_extents much more explicit.
    Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
    combined with btrfs_delalloc_release_extents, even in the error case, as
    that is the only function that actually modifies the
    outstanding_extents counter.

    The drawback to this is now we are much more likely to have transient
    cases where outstanding_extents is much larger than it actually should
    be. This could happen before as we manipulated the delalloc bits, but
    now it happens basically at every write. This may put more pressure on
    the ENOSPC flushing code, but I think making this code simpler is worth
    the cost. I have another change coming to mitigate this side-effect
    somewhat.

    I also added trace points for the counter manipulation. These were used
    by a bpf script I wrote to help track down leak issues.
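
    Stated as a tiny sketch (hypothetical names, not the kernel structures):
    whoever takes a reservation bumps the counter for itself and must drop it
    again when done, even on error, so the pairing is always local and
    explicit.

    struct inode_sketch {
            unsigned int outstanding_extents;
    };

    static void delalloc_reserve_sketch(struct inode_sketch *inode)
    {
            inode->outstanding_extents++;       /* the reservation's count */
    }

    static void delalloc_release_extents_sketch(struct inode_sketch *inode)
    {
            inode->outstanding_extents--;
    }

    static int write_path_sketch(struct inode_sketch *inode, int setup_failed)
    {
            delalloc_reserve_sketch(inode);

            if (setup_failed) {
                    delalloc_release_extents_sketch(inode); /* error path too */
                    return -1;
            }
            /* ... set delalloc bits, create the ordered extent, ... */
            delalloc_release_extents_sketch(inode);
            return 0;
    }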

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

30 Jun, 2017

1 commit

  • Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
    v4.12-rc6. This was because commit 70e7af244 made it possible for
    calc_reclaim_items_nr() to return a negative number. It's not really a
    bug in that commit, it just didn't go far enough down the stack to find
    all the possible 64->32 bit overflows.

    This switches calc_reclaim_items_nr() to return a u64 and changes everyone
    that uses the results of that math to u64 as well.
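
    A concrete example of the overflow class being fixed (illustrative
    numbers only): squeezing a 64-bit byte count into an int can flip it
    negative, which is what tripped the WARN_ON.

    #include <stdio.h>

    int main(void)
    {
            /* A perfectly reasonable u64 byte count ... */
            unsigned long long to_reclaim = 6ULL * 1024 * 1024 * 1024;

            /* ... silently truncated when squeezed into a 32-bit int. */
            int nr = (int)to_reclaim;

            printf("u64 value: %llu\n", to_reclaim);
            printf("after 64->32 truncation: %d\n", nr);  /* typically < 0 */
            return 0;
    }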

    Reported-by: Dave Jones
    Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Chris Mason
     

14 Feb, 2017

3 commits

  • Since we have a good helper entry_end, use it for ordered extent.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    [ whitespace reformatting ]
    Signed-off-by: David Sterba

    Liu Bo
     
  • btrfs_ordered_update_i_size() can be called by truncate and endio, but
    only endio takes an ordered_extent, which contains the completed IO.

    While truncating down a file, if there are some in-flight IOs,
    btrfs_ordered_update_i_size() in endio will set disk_i_size to
    @orig_offset, which is zero. If truncating down fails somehow, we try to
    recover the in-memory isize with this zeroed disk_i_size.

    Fix it by only updating disk_i_size with @orig_offset when
    btrfs_ordered_update_i_size() is not called from endio while truncating
    down, and by waiting for in-flight IOs to complete before recovering the
    in-memory size.

    Besides fixing the above issue, add an assertion for last_size to double
    check we truncate down to the desired size.

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
     
  • Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     

27 Sep, 2016

1 commit

  • CodingStyle chapter 2:
    "[...] never break user-visible strings such as printk messages,
    because that breaks the ability to grep for them."

    This patch unsplits user-visible strings.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     

26 Jul, 2016

1 commit

  • BTRFS is using a variety of slab caches to satisfy internal needs.
    Those slab caches are always allocated with SLAB_RECLAIM_ACCOUNT,
    meaning allocations from the caches are going to be accounted as
    SReclaimable. At the same time btrfs is not registering any shrinkers
    whatsoever, thus preventing memory from the slabs from being shrunk.
    This means those caches are not in fact reclaimable.

    To fix this remove the SLAB_RECLAIM_ACCOUNT flag from all caches apart
    from the inode cache, since that one is freed by the generic VFS
    super_block shrinker. Also mark the transaction related caches as
    SLAB_TEMPORARY, to better document the lifetime of the objects (it just
    translates to SLAB_RECLAIM_ACCOUNT).

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

24 Jun, 2016

1 commit

  • When doing a truncate operation, btrfs_setsize() will first call
    truncate_setsize() to set the new inode->i_size, but if the later
    btrfs_truncate() fails, btrfs_setsize() will call
    "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
    in-memory inode size, and now the bug occurs. This is because, for the
    truncate case, btrfs_ordered_update_i_size() directly uses inode->i_size
    to update BTRFS_I(inode)->disk_i_size, while we should instead use the
    "offset" argument to update disk_i_size. Here is the call graph:
    ==>btrfs_truncate()
    ====>btrfs_truncate_inode_items()
    ======>btrfs_ordered_update_i_size(inode, last_size, NULL);
    Here btrfs_ordered_update_i_size()'s offset argument is last_size.

    And below test case can reveal this bug:

    dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
    dev=$(losetup --show -f fs.img)
    mkdir -p /mnt/mntpoint
    mkfs.btrfs -f $dev
    mount $dev /mnt/mntpoint
    cd /mnt/mntpoint

    echo "workdir is: /mnt/mntpoint"
    blocksize=$((128 * 1024))
    dd if=/dev/zero of=testfile bs=$blocksize count=1
    sync
    count=$((17*1024*1024*1024/blocksize))
    echo "file size is:" $((count*blocksize))
    for ((i = 1; i <= $count; i++)); do
        dd if=/dev/zero of=testfile bs=$blocksize count=1 seek=$i conv=notrunc 2>/dev/null
    done
    sync

    truncate --size 0 testfile
    ls -l testfile
    du -sh testfile
    exit

    In this case, the truncate operation will fail with ENOSPC and
    "du -sh testfile" returns a value greater than 0, but testfile's
    size is 0, so we need to reflect the correct inode->i_size.

    Signed-off-by: Wang Xiaoguang
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Wang Xiaoguang
     

30 May, 2016

1 commit

  • When we do a device replace, for each device extent we find from the
    source device, we set the corresponding block group to readonly mode to
    prevent writes into it from happening while we are copying the device
    extent from the source to the target device. However just before we set
    the block group to readonly mode some concurrent task might have already
    allocated an extent from it or decided it could perform a nocow write
    into one of its extents, which can make the device replace process miss
    copying an extent, since it uses the extent tree's commit root to
    search for extents and only once it finishes searching for all extents
    belonging to the block group does it set the left cursor to the logical
    end address of the block group. This is a problem if the respective
    ordered extents finish while we are searching for extents using the
    extent tree's commit root and no transaction commit happens while we
    are iterating the tree, since it's the delayed references created by the
    ordered extents (when they complete) that insert the extent items into
    the extent tree (using the non-commit root of course).
    Example:
    Example:

    CPU 1                                     CPU 2

    btrfs_dev_replace_start()
      btrfs_scrub_dev()
        scrub_enumerate_chunks()
          --> finds device extent belonging
              to block group X

                                              starts buffered write
                                              against some inode

                                              writepages is run against
                                              that inode forcing delalloc
                                              to run

                                              btrfs_writepages()
                                                extent_writepages()
                                                  extent_write_cache_pages()
                                                    __extent_writepage()
                                                      writepage_delalloc()
                                                        run_delalloc_range()
                                                          cow_file_range()
                                                            btrfs_reserve_extent()
                                                              --> allocates an extent
                                                                  from block group X
                                                                  (which is not yet
                                                                  in RO mode)
                                                            btrfs_add_ordered_extent()
                                                              --> creates ordered extent Y
                                              flush_epd_write_bio()
                                                --> bio against the extent from
                                                    block group X is submitted

          btrfs_inc_block_group_ro(bg X)
            --> sets block group X to readonly

          scrub_chunk(bg X)
            scrub_stripe(device extent from srcdev)
              --> keeps searching for extent items
                  belonging to the block group using
                  the extent tree's commit root
              --> it never blocks due to
                  fs_info->scrub_pause_req as no
                  one tries to commit transaction N
              --> copies all extents found from the
                  source device into the target device
              --> finishes search loop

                                              bio completes

                                              ordered extent Y completes
                                              and creates delayed data
                                              reference which will add an
                                              extent item to the extent
                                              tree when run (typically
                                              at transaction commit time)

                                              --> so the task doing the
                                                  scrub/device replace
                                                  at CPU 1 misses this
                                                  and does not copy this
                                                  extent into the new/target
                                                  device

          btrfs_dec_block_group_ro(bg X)
            --> turns block group X back to RW mode

          dev_replace->cursor_left is set to the
          logical end offset of block group X

    So fix this by waiting for all cow and nocow writes after setting a block
    group to readonly mode.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana
     

13 May, 2016

1 commit

  • Before the relocation process of a block group starts, it sets the block
    group to readonly mode, then flushes all delalloc writes and then finally
    it waits for all ordered extents to complete. This last step includes
    waiting for ordered extents destined for extents allocated in other block
    groups, making us waste unnecessary time.

    So improve this by waiting only for ordered extents that fall into the
    block group's range.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Reviewed-by: Liu Bo

    Filipe Manana
     

22 Oct, 2015

1 commit

  • We have a mechanism to make sure we don't lose updates for ordered extents
    that were logged in the transaction that is currently running. We add the
    ordered extent to a transaction list and then the transaction waits on all
    the ordered extents in that list. However on substantially large file
    systems this list can be extremely large, and can give us soft lockups,
    since the ordered extents don't remove themselves from the list when they
    do complete.

    To fix this we simply add a counter to the transaction that is incremented
    any time we have a logged extent that needs to be completed in the current
    transaction. Then when the ordered extent finally completes it decrements
    the per-transaction counter and wakes up the transaction if we are the
    last ones. This will eliminate the softlockup. Thanks,
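
    A small user-space model of the counter scheme (hypothetical names, with
    C11 atomics standing in for the kernel's atomic_t and wait queues): each
    logged extent bumps the counter, each completion drops it, and the last
    completion is the one that wakes the waiting transaction.

    #include <stdatomic.h>
    #include <stdbool.h>

    struct transaction_sketch {
            atomic_int pending_ordered;   /* logged extents not yet complete */
    };

    static void log_extent(struct transaction_sketch *trans)
    {
            atomic_fetch_add(&trans->pending_ordered, 1);
    }

    /* True when this completion was the last one, i.e. the caller should
     * wake the transaction instead of the transaction walking a huge list. */
    static bool complete_extent(struct transaction_sketch *trans)
    {
            return atomic_fetch_sub(&trans->pending_ordered, 1) == 1;
    }

    /* The transaction side only waits for the counter to drain to zero. */
    static bool transaction_can_proceed(struct transaction_sketch *trans)
    {
            return atomic_load(&trans->pending_ordered) == 0;
    }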

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

02 Jul, 2015

1 commit

  • If we fail to submit a bio for a direct IO request, we were grabbing the
    corresponding ordered extent and decrementing its reference count twice,
    once for our lookup reference and once for the ordered tree reference.
    This was a problem because it caused the ordered extent to be freed
    without removing it from the ordered tree and any lists it might be
    attached to, leaving dangling pointers to the ordered extent around.
    Example trace with CONFIG_DEBUG_PAGEALLOC=y:

    [161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
    [161779.859983] IP: [] rb_prev+0x22/0x3b
    [161779.860636] PGD 34d818067 PUD 0
    [161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    (...)
    [161779.860636] Call Trace:
    [161779.860636] [] __tree_search+0xd9/0xf9 [btrfs]
    [161779.860636] [] tree_search+0x42/0x63 [btrfs]
    [161779.860636] [] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
    [161779.860636] [] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
    [161779.860636] [] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
    [161779.860636] [] do_blockdev_direct_IO+0x5ff/0xb43
    [161779.860636] [] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
    [161779.860636] [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
    [161779.860636] [] __blockdev_direct_IO+0x32/0x34
    [161779.860636] [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
    [161779.860636] [] btrfs_direct_IO+0x198/0x21f [btrfs]
    [161779.860636] [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
    [161779.860636] [] generic_file_direct_write+0xb3/0x128
    [161779.860636] [] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
    [161779.860636] [] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    (...)

    We were also not freeing the btrfs_dio_private we allocated previously,
    which kmemleak reported with the following trace in its sysfs file:

    unreferenced object 0xffff8803f553bf80 (size 96):
    comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
    hex dump (first 32 bytes):
    88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00 .l..............
    00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00 ................
    backtrace:
    [] create_object+0x172/0x29a
    [] kmemleak_alloc+0x25/0x41
    [] kmemleak_alloc_recursive.constprop.40+0x16/0x18
    [] kmem_cache_alloc_trace+0xfb/0x148
    [] btrfs_submit_direct+0x65/0x16a [btrfs]
    [] dio_bio_submit+0x62/0x8f
    [] do_blockdev_direct_IO+0x97e/0xb43
    [] __blockdev_direct_IO+0x32/0x34
    [] btrfs_direct_IO+0x198/0x21f [btrfs]
    [] generic_file_direct_write+0xb3/0x128
    [] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    [] __vfs_write+0x7c/0xa5
    [] vfs_write+0xa0/0xe4
    [] SyS_pwrite64+0x64/0x82
    [] system_call_fastpath+0x12/0x6f
    [] 0xffffffffffffffff

    For read requests we weren't doing any cleanup either (none of the work
    done by btrfs_endio_direct_read()), so a failure submitting a bio for a
    read request would leave a range in the inode's io_tree locked forever,
    blocking any future operations (both reads and writes) against that range.

    So fix this by making sure we do the same cleanup that we do for the case
    where the bio submission succeeds.
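
    Reduced to a small sketch (hypothetical names): the lookup takes one
    reference, so the error path must unlink the extent first (which drops
    the tree's reference) and then drop only its own lookup reference,
    instead of calling put twice while the extent is still linked.

    #include <stdlib.h>

    struct ordered_sketch {
            int refs;
            int in_tree;        /* still linked into the tree and lists */
    };

    static void put_ordered(struct ordered_sketch *oe)
    {
            if (--oe->refs == 0)
                    free(oe);   /* must already be unlinked at this point */
    }

    /* Error path mirrors the normal completion path. */
    static void submit_failed_cleanup(struct ordered_sketch *oe)
    {
            if (oe->in_tree) {
                    oe->in_tree = 0;    /* remove from the tree and lists */
                    put_ordered(oe);    /* drop the tree's reference */
            }
            put_ordered(oe);            /* drop our lookup reference */
    }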

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana