Eric Lee / smarc-fsl-linux-kernel

24 Mar, 2020

40 commits

48a2e88f5 uuid: Provide a GUID generator for raw buffer

In some cases we would like to generate a GUID and export it, but that
would require either casting to internal kernel types or an intermediate
buffer. Instead, we can achieve this by supplying a pointer to a raw buffer
and adding an API complementary to the existing one for UUIDs.
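
As a rough user-space sketch of the idea (not the kernel code itself), the function below fills a caller-supplied 16-byte buffer with a random version 4 GUID. The name sketch_generate_random_guid() is hypothetical, and rand() is only a placeholder for a proper random source.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_GUID_SIZE 16

/*
 * Hypothetical user-space stand-in: fill a caller-supplied raw buffer with
 * a random (version 4) GUID, so the caller never needs a GUID type.
 * rand() is only a placeholder and is not suitable for real identifiers.
 */
void sketch_generate_random_guid(unsigned char guid[SKETCH_GUID_SIZE])
{
	int i;

	assert(guid != NULL);

	for (i = 0; i < SKETCH_GUID_SIZE; i++)
		guid[i] = (unsigned char)(rand() & 0xff);

	/* GUIDs store the version field little-endian: it ends up in byte 7. */
	guid[7] = (guid[7] & 0x0F) | 0x40;
	/* RFC 4122 variant bits (10xxxxxx) live in byte 8. */
	guid[8] = (guid[8] & 0x3F) | 0x80;
}
```

The caller only ever handles an unsigned char array, so no cast and no intermediate typed variable is needed.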

Reviewed-by: Christoph Hellwig
Signed-off-by: Andy Shevchenko
Signed-off-by: David Sterba

Andy Shevchenko
2020-03-24 00:01:47 +0800
d01cd6240 uuid: Add inline helpers to import / export UUIDs

Sometimes we may need to import a UUID from, or export it to, a raw buffer
that is provided from outside the kernel and can't be declared as a UUID type.
With the current API this operation requires an explicit cast to one of the
UUID types plus a length, which is always a constant equal to sizeof of
that UUID type.

Provide a set of inline helpers to minimize developer effort in the cases
where raw buffers are involved.
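
A minimal sketch of what such helpers can look like, written for user space. The type sketch_uuid_t and both function names are hypothetical stand-ins; the point is only that the copy length comes from the type, so callers pass neither a cast nor a size.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_UUID_SIZE 16

/* Hypothetical stand-in for a kernel UUID type: a fixed 16-byte wrapper. */
typedef struct {
	unsigned char b[SKETCH_UUID_SIZE];
} sketch_uuid_t;

/* Copy a UUID out of a raw buffer; the length is implied by the type. */
static inline void sketch_import_uuid(sketch_uuid_t *dst,
				      const unsigned char *src)
{
	assert(dst && src);
	memcpy(dst->b, src, sizeof(dst->b));
}

/* Copy a UUID into a raw buffer of at least SKETCH_UUID_SIZE bytes. */
static inline void sketch_export_uuid(unsigned char *dst,
				      const sketch_uuid_t *src)
{
	assert(dst && src);
	memcpy(dst, src->b, sizeof(src->b));
}
```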

Suggested-by: David Sterba
Acked-by: Christoph Hellwig
Signed-off-by: Andy Shevchenko
Signed-off-by: David Sterba

Andy Shevchenko
2020-03-24 00:01:46 +0800
b3ff8f1d3 btrfs: Don't submit any btree write bio if the fs has errors

[BUG]
There is a fuzzed image which can cause a KASAN report at unmount time.

BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
Read of size 8 at addr ffff888067cf6848 by task umount/1922

CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5b/0x8b
print_address_description+0x70/0x280
kasan_report+0x13a/0x19b
btrfs_queue_work+0x2c1/0x390
btrfs_wq_submit_bio+0x1cd/0x240
btree_submit_bio_hook+0x18c/0x2a0
submit_one_bio+0x1be/0x320
flush_write_bio.isra.41+0x2c/0x70
btree_write_cache_pages+0x3bb/0x7f0
do_writepages+0x5c/0x130
__writeback_single_inode+0xa3/0x9a0
writeback_single_inode+0x23d/0x390
write_inode_now+0x1b5/0x280
iput+0x2ef/0x600
close_ctree+0x341/0x750
generic_shutdown_super+0x126/0x370
kill_anon_super+0x31/0x50
btrfs_kill_super+0x36/0x2b0
deactivate_locked_super+0x80/0xc0
deactivate_super+0x13c/0x150
cleanup_mnt+0x9a/0x130
task_work_run+0x11a/0x1b0
exit_to_usermode_loop+0x107/0x130
do_syscall_64+0x1e5/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9

[CAUSE]
The fuzzed image has a completely screwed up extent tree:

leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 0 count 1
item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 271 offset 0 count 1
item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
item 3 key (29360128 169 0) itemoff 3803 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 4 key (29368320 169 1) itemoff 3770 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 5 key (29372416 169 0) itemoff 3737 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5

Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.

In short, the bug is caused by:

- Existing tree block gets allocated to log tree
This got its generation bumped.

- Log tree balance cleaned dirty bit of offending tree block
It will not be written back to disk, thus no WRITTEN flag.

- Original owner of the tree block gets COWed
Since the tree block has higher transid, no WRITTEN flag, it's reused,
and not traced by transaction::dirty_pages.

- Transaction aborted
Tree blocks get cleaned according to transaction::dirty_pages. But the
offending tree block is not recorded at all.

- Filesystem unmount
All pages are assumed to be clean, so all workqueues are destroyed and
then iput(btree_inode) is called.
But the offending tree block is still dirty, which triggers writeback and
causes the use-after-free bug.

The detailed sequence looks like this:

- Initial status
eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
not traced by any dirty extent_io_tree.

- New tree block is allocated
Since there is no backref for 29421568, it's re-allocated as new tree
block.
Keep in mind that tree block 29421568 is still referred to by the extent
tree.

- Tree block 29421568 is filled for log tree
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
traced by btrfs_root::dirty_log_pages

- Some log tree operations
Since the fs is using node size 4096, the log tree can easily go a
level higher.

- Log tree needs balance
Tree block 29421568 gets all its content pushed to right, thus now
it is empty, and we don't need it.
btrfs_clean_tree_block() from __push_leaf_right() gets called.

eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
traced by btrfs_root::dirty_log_pages

- Log tree write back
btree_write_cache_pages() goes through dirty pages ranges, but since
page of tree block 29421568 gets cleaned already, it's not written
back to disk. Thus it doesn't have WRITTEN bit set.
But ranges in dirty_log_pages are cleared.

eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
not traced by any dirty extent_io_tree.

- Extent tree update when committing transaction
Since tree block 29421568 has transid equal to running trans, and has
no WRITTEN bit, should_cow_block() will use it directly without adding
it to btrfs_transaction::dirty_pages.

eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_io_tree.

At this stage, we're doomed. We have a dirty eb not tracked by any
extent io tree.

- Transaction gets aborted due to corrupted extent tree
Btrfs cleans up dirty pages according to transaction::dirty_pages and
btrfs_root::dirty_log_pages.
But since tree block 29421568 is tracked by neither of them, it's
still dirty.

eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_io_tree.

- Filesystem unmount
Since all cleanup is assumed to be done, all workqueues are destroyed.
Then iput(btree_inode) is called, expecting no dirty pages.
But tree 29421568 is still dirty, thus triggering writeback.
Since all workqueues are already freed, we cause use-after-free.

This shows us that, log tree blocks + bad extent tree can cause wild
dirty pages.

[FIX]
To fix the problem, don't submit any btree write bio if the filesystem
has any error. This is the last safety net, just in case other cleanup
hasn't caught it.

Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Qu Wenruo
2020-03-24 00:01:46 +0800
faf8f7b95 btrfs: ioctl: resize: only show message if size is changed

There is no point in informing the user about a size change if there's none.
Update the message to conform to a commonly used format where the path
and devid are printed and also print old and new sizes.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Marcos Paulo de Souza
Reviewed-by: David Sterba
[ enhance message ]
Signed-off-by: David Sterba

Marcos Paulo de Souza
2020-03-24 00:01:46 +0800
b82582d66 btrfs: slightly simplify global block reserve calculations

In btrfs_update_global_block_rsv the lines:

num_bytes = block_rsv->size - block_rsv->reserved;
block_rsv->reserved += num_bytes;

imply:

block_rsv->reserved = block_rsv->size;

Assign block_rsv->size to block_rsv->reserved directly and reorder lines
so they match the other branch.

Signed-off-by: Anand Jain
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Anand Jain
2020-03-24 00:01:46 +0800
56e9f6ea3 btrfs: merge unlocking to common exit block in btrfs_commit_transaction

The tree_log_mutex and reloc_mutex locks are properly nested so we can
simplify error handling and add labels for them. This reduces line count
of the function.
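
A generic sketch of the pattern, not the actual function: when two locks are properly nested, each failure point can jump to a label that releases exactly what is held at that point. The toy_mutex type and all names below are hypothetical, and the locks are plain counters so the control flow can be followed without threads.

```c
#include <assert.h>

/* Hypothetical toy lock: a flag is enough to show the control flow. */
struct toy_mutex {
	int held;
};

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->held);
	m->held = 1;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

/*
 * Take outer, then inner. fail_step selects where the simulated error
 * happens (0 = no failure). All paths leave through the same exit block.
 */
int sketch_commit(struct toy_mutex *outer, struct toy_mutex *inner,
		  int fail_step)
{
	int ret = 0;

	toy_lock(outer);
	if (fail_step == 1) {
		ret = -1;
		goto unlock_outer;
	}

	toy_lock(inner);
	if (fail_step == 2) {
		ret = -2;
		goto unlock_inner;
	}

	/* ... the real work would happen here ... */

unlock_inner:
	toy_unlock(inner);
unlock_outer:
	toy_unlock(outer);
	return ret;
}
```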

Reviewed-by: Anand Jain
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:46 +0800
15b6e8a83 btrfs: reduce pointer indirections in btree_readpage_end_io_hook

All we need to read is checksum size from fs_info superblock, and
fs_info is provided by extent buffer so we can get rid of the wild
pointer indirections from page/inode/root.

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:45 +0800
b79ce3ddd btrfs: adjust delayed refs message level

The message seems to be for debugging and has little value for users.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Anand Jain
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:45 +0800
1db45a35f btrfs: replace u_long type cast with unsigned long

We don't use the u_XX types anywhere, though they're defined.

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:45 +0800
eeb6f1720 btrfs: raid56: simplify sort_parity_stripes

Remove the trivial comparator and the open-coded swap of two values.

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:45 +0800
7e8f19e50 btrfs: adjust message level for unrecognized mount option

An unrecognized option is a failure that should get user/administrator
attention. The info level is often below what gets logged, so make it
an error.

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:45 +0800
42c9d0b52 btrfs: simplify parameters of btrfs_set_disk_extent_flags

All callers pass extent buffer start and length so the extent buffer
itself should work fine.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Anand Jain
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:45 +0800
c4ac75419 btrfs: open code trivial helper btrfs_header_chunk_tree_uuid

The helper btrfs_header_chunk_tree_uuid follows naming convention of
other struct accessors but does something completely different. As the
offsetof calculation is clear in the context of extent buffer operations
we can remove it.

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:44 +0800
9a8658e33 btrfs: open code trivial helper btrfs_header_fsid

The helper btrfs_header_fsid follows naming convention of other struct
accessors but does something completely different. As the offsetof
calculation is clear in the context of extent buffer operations we can
remove it.

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:44 +0800
75fb2e9e4 btrfs: move mapping of block for discard to its caller

There's a simple forwarded call based on the operation that would better
fit the caller btrfs_map_block that's until now a trivial wrapper.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Anand Jain
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:44 +0800
ee787f955 btrfs: use struct_size to calculate size of raid hash table

The struct_size macro does the same calculation and is safe regarding
overflows. Though we're not expecting them to happen, use the helper for
clarity.
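
For illustration, here is a user-space approximation of what an overflow-safe size calculation for a struct with a trailing array does. It is not the kernel macro: struct sketch_table and sketch_table_size() are hypothetical, and the overflow checks rely on the GCC/Clang __builtin_*_overflow builtins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table with a flexible array member, like a hash table header. */
struct sketch_table {
	unsigned int nbuckets;
	uint64_t buckets[];
};

/* The header size used below covers everything before the flexible array. */
static_assert(offsetof(struct sketch_table, buckets) <=
	      sizeof(struct sketch_table),
	      "header size must cover the fixed part");

/*
 * Overflow-checked equivalent of
 *     sizeof(struct sketch_table) + n * sizeof(uint64_t)
 * Returns SIZE_MAX on overflow so that a following allocation fails
 * cleanly instead of being silently too small.
 */
size_t sketch_table_size(size_t n)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, sizeof(uint64_t), &bytes))
		return SIZE_MAX;
	if (__builtin_add_overflow(bytes, sizeof(struct sketch_table), &bytes))
		return SIZE_MAX;
	return bytes;
}
```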

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:44 +0800
dcc3eb963 btrfs: convert snapshot/nocow exclusion to drew lock

This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.

'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.

So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also to favor readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu writers counter to avoid cacheline contention.

The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2020-03-24 00:01:44 +0800
2992df732 btrfs: Implement DREW lock

A (D)ouble (R)eader (W)riter (E)xclusion lock is a locking primitive
that allows multiple readers or multiple writers, but not
multiple readers and writers, to hold it concurrently.

The code is factored out from the existing open-coded locking scheme
used to exclude pending snapshots from nocow writers and vice-versa.
The current implementation actually favors readers (that is, snapshot
creators) over writers (nocow writers of the filesystem).

The API provides lock/unlock/trylock for reads and writes.

Formal specification for TLA+ provided by Valentin Schneider is at
https://lore.kernel.org/linux-btrfs/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2020-03-24 00:01:43 +0800
fd8efa818 btrfs: simplify error handling in __btrfs_write_out_cache()

The error cleanup gotos in __btrfs_write_out_cache() needlessly jump
back, making the code less readable than needed. Flatten them out so no
back-jump is necessary and the read flow is uninterrupted.

Reviewed-by: Josef Bacik
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2020-03-24 00:01:43 +0800
1afb648e9 btrfs: use standard debug config option to enable free-space-cache debug prints

free-space-cache.c has its own set of DEBUG ifdefs which need to be
turned on instead of the global CONFIG_BTRFS_DEBUG to print debug
messages about failed block-group writes.

Switch this over to CONFIG_BTRFS_DEBUG so we always see these messages
when running a debug kernel.

Reviewed-by: Josef Bacik
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2020-03-24 00:01:43 +0800
7a195f6db btrfs: make the uptodate argument of io_ctl_add_pages() boolean

Make the uptodate argument of io_ctl_add_pages() boolean.

Reviewed-by: Josef Bacik
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2020-03-24 00:01:43 +0800
831fa14f1 btrfs: use inode from io_ctl in io_ctl_prepare_pages

io_ctl_prepare_pages() gets a 'struct btrfs_io_ctl' as well as a 'struct
inode', but btrfs_io_ctl::inode points to the same struct inode, as this is
assigned in io_ctl_init().

Use the inode from io_ctl to reduce the arguments of io_ctl_prepare_pages.

Reviewed-by: Josef Bacik
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2020-03-24 00:01:43 +0800
949964c92 btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl

This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:

/
@subvol1/
@subvol2/
@subvol_default/

If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. Current subvolume delete ioctl
takes a file handle point as argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.

This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
the extended structure with flags, which allows deleting a subvolume by
its subvolid.

Now, we can use this new ioctl specifying the subvolume id and refer to
the same mount point. It doesn't matter which subvolume was mounted,
since we can reach the desired one using the subvolume id, and then
delete it.

The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.

The volume args v2 structure is extended to use the existing union for
subvolume id specification, that's valid in case the
BTRFS_SUBVOL_SPEC_BY_ID is set.

Signed-off-by: Marcos Paulo de Souza
Reviewed-by: David Sterba
[ update changelog ]
Signed-off-by: David Sterba

Marcos Paulo de Souza
2020-03-24 00:01:42 +0800
c0c907a47 btrfs: export helpers for subvolume name/id resolution

The functions will be used outside of export.c and super.c to allow
resolving subvolume name from a given id, eg. for subvolume deletion by
id ioctl.

Signed-off-by: Marcos Paulo de Souza
Reviewed-by: David Sterba
[ split from the next patch ]
Signed-off-by: David Sterba

Marcos Paulo de Souza
2020-03-24 00:01:42 +0800
748449cdb btrfs: use ioctl args support mask for device delete

When the device remove v2 ioctl was added, the full support mask was
added to sanity check the flags. However, this would allow the
subvolume-related flags to be accepted. This is not supposed to happen.

Use the correct support mask, which means that now any of
BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or
BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is
a user-visible change, specifying subvolume flags for device deletion
does not make sense and there are hopefully no applications doing that.
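
A small sketch of per-ioctl flag validation with support masks. The SK_* names, their bit values and the mask names are illustrative assumptions, not copied from any header, and the sketch returns the user-space EOPNOTSUPP code since this is not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag bit values are illustrative assumptions, not taken from a header. */
#define SK_SUBVOL_CREATE_ASYNC   (1ULL << 0)
#define SK_SUBVOL_RDONLY         (1ULL << 1)
#define SK_SUBVOL_QGROUP_INHERIT (1ULL << 2)
#define SK_DEVICE_SPEC_BY_ID     (1ULL << 3)

/* One support mask per ioctl class. */
#define SK_SUBVOL_CREATE_ARGS_MASK \
	(SK_SUBVOL_CREATE_ASYNC | SK_SUBVOL_RDONLY | SK_SUBVOL_QGROUP_INHERIT)
#define SK_DEVICE_REMOVE_ARGS_MASK (SK_DEVICE_SPEC_BY_ID)

/* In this sketch the two ioctl classes share no flags at all. */
static_assert((SK_SUBVOL_CREATE_ARGS_MASK & SK_DEVICE_REMOVE_ARGS_MASK) == 0,
	      "subvolume and device flags must not overlap");

/* Reject any flag that the given ioctl class does not support. */
int sketch_check_flags(uint64_t flags, uint64_t supported_mask)
{
	if (flags & ~supported_mask)
		return -EOPNOTSUPP;
	return 0;
}
```

With a single full mask, a subvolume flag passed to the device ioctl would pass this check; with a per-class mask it is rejected.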

Reviewed-by: Marcos Paulo de Souza
Reviewed-by: Nikolay Borisov
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:42 +0800
673990dba btrfs: use ioctl args support mask for subvolume create/delete

Using the defined mask instead of flag enumeration in the ioctl handler
is preferred. No functional changes.

Reviewed-by: Marcos Paulo de Souza
Reviewed-by: Nikolay Borisov
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:42 +0800
eed026905 btrfs: define support masks for ioctl volume args v2

The ioctl data for devices or subvolumes can be passed via
btrfs_ioctl_vol_args or btrfs_ioctl_vol_args_v2. The latter is more
versatile and needs some caution as some of the flags make sense only
for some ioctls.

As we're going to extend the flags, define support masks for each ioctl
class separately.

Reviewed-by: Marcos Paulo de Souza
Reviewed-by: Nikolay Borisov
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:42 +0800
5ce48d0f0 btrfs: Add missing lock annotation for release_extent_buffer()

Sparse reports a warning at release_extent_buffer():
warning: context imbalance in release_extent_buffer() - unexpected unlock

The root cause is the missing annotation at release_extent_buffer().
Add the missing __releases(&eb->refs_lock) annotation.

Reviewed-by: Nikolay Borisov
Signed-off-by: Jules Irenge
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Jules Irenge
2020-03-24 00:01:42 +0800
75ec1db87 btrfs: set update the uuid generation as soon as possible

In my EIO stress testing I noticed I was getting forced to rescan the
uuid tree pretty often, which was weird. This is because my error
injection stuff would sometimes inject an error after log replay but
before we loaded the UUID tree. If log replay committed the transaction
it wouldn't have updated the uuid tree generation, but the tree was
valid and didn't change, so there's no reason to not update the
generation here.

Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
after reading all the fs roots if the uuid tree generation matches the
fs generation. Then any transaction commits that happen during mount
won't screw up our uuid tree state, forcing us to do needless uuid
rescans.

Fixes: 70f801754728 ("Btrfs: check UUID tree during mount if required")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Josef Bacik
2020-03-24 00:01:41 +0800
c94bec2c6 btrfs: bail out of uuid tree scanning if we're closing

In doing my fsstress+EIO stress testing I started running into issues
where umount would get stuck forever because the uuid checker was
chewing through the thousands of subvolumes I had created.

We shouldn't block umount on this, simply bail if we're unmounting the
fs. We need to make sure we don't mark the UUID tree as ok, so we only
set that bit if we made it through the whole rescan operation, but
otherwise this is completely safe.
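
The shape of that loop can be sketched as follows. This is a simplified model with hypothetical names (struct sketch_scan, its fields, sketch_uuid_scan()); the close_at parameter exists only to simulate an unmount starting partway through the scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scan state; field names are made up for this sketch. */
struct sketch_scan {
	bool closing;      /* set when unmount has started */
	bool uuid_tree_ok; /* only set after a complete scan */
	size_t visited;    /* entries checked so far */
};

/*
 * Walk nr_entries entries, checking for unmount before each one.
 * close_at simulates unmount starting once that many entries were
 * visited (pass a value >= nr_entries for "never").
 * Returns true if the whole scan completed.
 */
bool sketch_uuid_scan(struct sketch_scan *s, size_t nr_entries,
		      size_t close_at)
{
	size_t i;

	assert(s != NULL);
	for (i = 0; i < nr_entries; i++) {
		if (i == close_at)
			s->closing = true;
		if (s->closing)
			return false; /* bail out; do not mark the tree as ok */
		s->visited++;
	}
	s->uuid_tree_ok = true;
	return true;
}
```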

Reviewed-by: Johannes Thumshirn
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Josef Bacik
2020-03-24 00:01:41 +0800
97f4dd09d btrfs: make btrfs_check_uuid_tree private to disk-io.c

It's used only during filesystem mount; as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.

Reviewed-by: Josef Bacik
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2020-03-24 00:01:41 +0800
560b7a4aa btrfs: call btrfs_check_uuid_tree_entry directly in btrfs_uuid_tree_iterate

btrfs_uuid_tree_iterate is called from only one place and its 2nd
argument is always btrfs_check_uuid_tree_entry. Simplify
btrfs_uuid_tree_iterate's signature by removing its 2nd argument and
directly calling btrfs_check_uuid_tree_entry. Also move the latter into
uuid-tree.h. No functional changes.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Josef Bacik
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2020-03-24 00:01:41 +0800
c17af9655 btrfs: raid56: simplify tracking of Q stripe presence

There are temporary variables tracking the index of P and Q stripes, but
none of them is really used as such, merely for determining if the Q
stripe is present. This leads to compiler warnings with
-Wunused-but-set-variable and has been reported several times.

fs/btrfs/raid56.c: In function ‘finish_rmw’:
fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
1199 | int p_stripe = -1;
| ^~~~~~~~
fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
2356 | int p_stripe = -1;
| ^~~~~~~~

Replace the two variables with one that has a clear meaning and also get
rid of the warnings. The logic that verifies that there are only 2
valid cases is unchanged.
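
One way to express "a single variable with a clear meaning" is sketched below. The helper name and its parameters are hypothetical; the only assumption is that the number of parity stripes is the total stripe count minus the data stripe count, and that exactly one or two parity stripes are valid.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: with nr_data data stripes out of real_stripes total,
 * exactly one extra stripe means P only and exactly two means P and Q.
 * Anything else is treated as a programming error.
 */
bool sketch_has_qstripe(int real_stripes, int nr_data)
{
	int nr_parity = real_stripes - nr_data;

	assert(nr_parity == 1 || nr_parity == 2);
	return nr_parity == 2;
}
```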

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2020-03-24 00:01:41 +0800
b25b0b871 btrfs: backref, use correct count to resolve normal data refs

With the following patches:

- btrfs: backref, only collect file extent items matching backref offset
- btrfs: backref, not adding refs from shared block when resolving normal backref
- btrfs: backref, only search backref entries from leaves of the same root

we only collect the normal data refs we want, so the imprecise upper
bound total_refs of that EXTENT_ITEM could now be changed to the count
of the normal backref entry we want to search.

Background and how the patches fit together:

Btrfs has two types of data backref.
For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
exact block number. Therefore, we need to call resolve_indirect_refs.
It uses btrfs_search_slot to locate the leaf block. Then
we need to walk through the leaves to search for the EXTENT_DATA items
that have disk bytenr matching the extent item (add_all_parents).

When resolving indirect refs, we could take entries that don't
belong to the backref entry we are searching for right now.
For that reason when searching backref entry, we always use total
refs of that EXTENT_ITEM rather than individual count.

For example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
extent refs 24 gen 7302 flags DATA
shared data backref parent 394985472 count 10 #1
extent data backref root 257 objectid 260 offset 1048576 count 3 #2
extent data backref root 256 objectid 260 offset 65536 count 6 #3
extent data backref root 257 objectid 260 offset 65536 count 5 #4

When searching backref entry #4, we'll use total_refs = 24, a very
loose loop ending condition, instead of total_refs = 5.

But using total_refs = 24 is not accurate. Sometimes, we'll never find
all the refs from a specific root. As a result, the loop keeps on going
until we reach the end of that inode.

The first 3 patches handle 3 different types of refs we might encounter.
These refs do not belong to the normal backref we are searching, and
hence need to be skipped.

This patch changes total_refs to the correct number so that we can
end the loop as soon as we find all the refs we want.
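
The effect of a precise loop bound can be shown with a toy model. Everything here is hypothetical: items[] simply marks, for each file extent item visited in order, whether it belongs to the backref being resolved, and the function reports how many items had to be visited before the walk stopped.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: items[i] is 1 if the i-th visited item belongs to
 * the backref being resolved, 0 otherwise. The walk stops as soon as
 * `wanted` matches were found. Returns how many items were visited.
 */
size_t sketch_items_visited(const int *items, size_t nr_items,
			    unsigned int wanted)
{
	unsigned int found = 0;
	size_t i;

	assert(items != NULL || nr_items == 0);
	for (i = 0; i < nr_items && found < wanted; i++) {
		if (items[i])
			found++;
	}
	return i;
}
```

With an exact count the walk ends right after the last match; with a loose upper bound that is never reached, it only ends when the items run out.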

btrfs send uses backref to find possible clone sources, the following
is a simple test to compare the results with and without this patch:

$ btrfs subvolume create /sub1
$ for i in `seq 1 163840`; do
dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct
done
$ btrfs subvolume snapshot /sub1 /sub2
$ for i in `seq 1 163840`; do
dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct
done
$ btrfs subvolume snapshot -r /sub1 /snap1
$ time btrfs send /snap1 | btrfs receive /volume2

Without this patch:

real 69m48.124s
user 0m50.199s
sys 70m15.600s

With this patch:

real 1m59.683s
user 0m35.421s
sys 2m42.684s

Reviewed-by: Josef Bacik
Reviewed-by: Johannes Thumshirn
Signed-off-by: ethanwu
[ add patchset cover letter with background and numbers ]
Signed-off-by: David Sterba

ethanwu
2020-03-24 00:01:40 +0800
cfc0eed0e btrfs: backref, only search backref entries from leaves of the same root

We could have some nodes/leaves in a subvolume whose owner is not
that subvolume. Therefore, when we resolve normal backrefs of that
subvolume, we should avoid collecting references from these blocks.

Reviewed-by: Josef Bacik
Reviewed-by: Johannes Thumshirn
Signed-off-by: ethanwu
Signed-off-by: David Sterba

ethanwu
2020-03-24 00:01:40 +0800
ed58f2e66 btrfs: backref, don't add refs from shared block when resolving normal backref

All references from the block of SHARED_DATA_REF belong to that shared
block backref.

For example:

item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
extent refs 24 gen 7302 flags DATA
extent data backref root 257 objectid 260 offset 65536 count 5
extent data backref root 258 objectid 265 offset 0 count 9
shared data backref parent 394985472 count 10

Block 394985472 might be a leaf from root 257, and the item objectid and
(file_pos - file_extent_item::offset) in that leaf just happen to be
260 and 65536, which are equal to the first extent data backref entry.

Before this patch, when we resolve backref:

root 257 objectid 260 offset 65536

we will add those refs in block 394985472 and wrongly treat those as the
refs we want.

Fix this by checking if the leaf we are processing is a shared data
backref; if so, just skip this leaf.

Shared data refs added into preftrees.direct have all entry value = 0
(root_id = 0, key = NULL, level = 0) except parent entry.

Other refs from indirect tree will have key value and root id != 0, and
these values won't be changed when their parent is resolved and added to
preftrees.direct. Therefore, we could reuse the preftrees.direct and
search ref with all values = 0 except parent is set to avoid getting
those resolved refs block.

Reviewed-by: Josef Bacik
Reviewed-by: Johannes Thumshirn
Signed-off-by: ethanwu
Signed-off-by: David Sterba

ethanwu
2020-03-24 00:01:40 +0800
7ac8b88ee btrfs: backref, only collect file extent items matching backref offset

When resolving one backref of type EXTENT_DATA_REF, we collect all
references that simply reference the EXTENT_ITEM even though their
(file_pos - file_extent_item::offset) are not the same as the
btrfs_extent_data_ref::offset we are searching for.

This patch adds an additional check so that we only collect references whose
(file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.

Reviewed-by: Josef Bacik
Reviewed-by: Johannes Thumshirn
Signed-off-by: ethanwu
Signed-off-by: David Sterba

ethanwu
2020-03-24 00:01:40 +0800
9da2b242e btrfs: remove buffer_heads from super block mirror integrity checking

The integrity checking code for the super block mirrors is the last
remaining user of buffer_heads; change it to use plain bios as well.

Reviewed-by: Josef Bacik
Reviewed-by: Nikolay Borisov
Signed-off-by: Johannes Thumshirn
Signed-off-by: David Sterba

Johannes Thumshirn
2020-03-24 00:01:40 +0800
59aaad503 btrfs: remove buffer_heads from btrfsic_process_written_block()

Now that the last caller of btrfsic_process_written_block() with
buffer_heads is gone, remove the buffer_head processing path from it as
well.

Reviewed-by: Josef Bacik
Reviewed-by: Nikolay Borisov
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2020-03-24 00:01:40 +0800
61ecc5fc1 btrfs: remove btrfsic_submit_bh()

Now that the last use of btrfsic_submit_bh() is gone as the super block
is now written using bios, remove the function as well.

Reviewed-by: Nikolay Borisov
Reviewed-by: Josef Bacik
Reviewed-by: Anand Jain
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2020-03-24 00:01:39 +0800