24 Mar, 2020

40 commits

  • In some cases we would like to generate a GUID and export it. That
    would require either casting to internal kernel types or an intermediate
    buffer. Instead we can achieve this by supplying a pointer to a raw buffer
    and adding a complementary API to the existing one for UUIDs.
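
    As a rough userspace illustration of generating a GUID directly into a
    raw buffer: the function names and the rand()-based byte source below are
    assumptions made for this sketch, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define EX_GUID_SIZE 16

/* Stand-in for a kernel random source; rand() is NOT cryptographically safe. */
static void ex_fill_random_bytes(unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (unsigned char)(rand() & 0xff);
}

/*
 * Write a random version 4 GUID straight into a caller-provided raw buffer.
 * GUIDs store the time_hi_and_version field little-endian, so the version
 * nibble sits in byte 7; the variant bits sit in byte 8.
 */
static void ex_generate_random_guid(unsigned char guid[EX_GUID_SIZE])
{
	assert(guid != NULL);
	ex_fill_random_bytes(guid, EX_GUID_SIZE);
	guid[7] = (unsigned char)((guid[7] & 0x0f) | 0x40);	/* version 4 */
	guid[8] = (unsigned char)((guid[8] & 0x3f) | 0x80);	/* RFC 4122 variant */
}
```

    The caller only needs a 16-byte array, so no cast to an internal type
    and no intermediate copy are involved.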

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Andy Shevchenko
    Signed-off-by: David Sterba

    Andy Shevchenko
     
  • Sometimes we may need to import a UUID from, or export it to, a raw
    buffer that is provided from outside of the kernel and can't be declared
    as a UUID type. With the current API this operation requires an explicit
    cast to one of the UUID types and a length that is always a constant
    derived as sizeof of that UUID type.

    Provide a set of inline helpers to minimize the developer's effort
    in the cases when raw buffers are involved.
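
    A minimal userspace sketch of what such helpers could look like. The
    ex_ prefixed type and function names are hypothetical stand-ins; the
    point is only that the cast and the sizeof constant are hidden inside
    the helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for a 16-byte kernel GUID type. */
typedef struct {
	uint8_t b[16];
} ex_guid_t;

/* Copy a raw 16-byte buffer into the typed GUID. */
static inline void ex_import_guid(ex_guid_t *dst, const uint8_t *src)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, sizeof(*dst));
}

/* Copy the typed GUID out into a raw 16-byte buffer. */
static inline void ex_export_guid(uint8_t *dst, const ex_guid_t *src)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, sizeof(*src));
}
```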

    Suggested-by: David Sterba
    Acked-by: Christoph Hellwig
    Signed-off-by: Andy Shevchenko
    Signed-off-by: David Sterba

    Andy Shevchenko
     
  • [BUG]
    There is a fuzzed image which could cause KASAN report at unmount time.

    BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
    Read of size 8 at addr ffff888067cf6848 by task umount/1922

    CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    Call Trace:
    dump_stack+0x5b/0x8b
    print_address_description+0x70/0x280
    kasan_report+0x13a/0x19b
    btrfs_queue_work+0x2c1/0x390
    btrfs_wq_submit_bio+0x1cd/0x240
    btree_submit_bio_hook+0x18c/0x2a0
    submit_one_bio+0x1be/0x320
    flush_write_bio.isra.41+0x2c/0x70
    btree_write_cache_pages+0x3bb/0x7f0
    do_writepages+0x5c/0x130
    __writeback_single_inode+0xa3/0x9a0
    writeback_single_inode+0x23d/0x390
    write_inode_now+0x1b5/0x280
    iput+0x2ef/0x600
    close_ctree+0x341/0x750
    generic_shutdown_super+0x126/0x370
    kill_anon_super+0x31/0x50
    btrfs_kill_super+0x36/0x2b0
    deactivate_locked_super+0x80/0xc0
    deactivate_super+0x13c/0x150
    cleanup_mnt+0x9a/0x130
    task_work_run+0x11a/0x1b0
    exit_to_usermode_loop+0x107/0x130
    do_syscall_64+0x1e5/0x280
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    [CAUSE]
    The fuzzed image has a completely screwed up extent tree:

    leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
    refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
    item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
    extent refs 1 gen 9 flags 1
    ref#0: extent data backref root 5 objectid 259 offset 0 count 1
    item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
    extent refs 1 gen 9 flags 1
    ref#0: extent data backref root 5 objectid 271 offset 0 count 1
    item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
    extent refs 1 gen 9 flags 1
    ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
    item 3 key (29360128 169 0) itemoff 3803 itemsize 33
    extent refs 1 gen 9 flags 2
    ref#0: tree block backref root 5
    item 4 key (29368320 169 1) itemoff 3770 itemsize 33
    extent refs 1 gen 9 flags 2
    ref#0: tree block backref root 5
    item 5 key (29372416 169 0) itemoff 3737 itemsize 33
    extent refs 1 gen 9 flags 2
    ref#0: tree block backref root 5

    Note that leaf 29421568 doesn't have its backref in the extent tree.
    Thus extent allocator can re-allocate leaf 29421568 for other trees.

    In short, the bug is caused by:

    - Existing tree block gets allocated to log tree
    This got its generation bumped.

    - Log tree balance cleaned dirty bit of offending tree block
    It will not be written back to disk, thus no WRITTEN flag.

    - Original owner of the tree block gets COWed
    Since the tree block has higher transid, no WRITTEN flag, it's reused,
    and not traced by transaction::dirty_pages.

    - Transaction aborted
    Tree blocks get cleaned according to transaction::dirty_pages. But the
    offending tree block is not recorded at all.

    - Filesystem unmount
    All pages are assumed to be clean, so all workqueues are destroyed and
    then iput(btree_inode) is called.
    But offending tree block is still dirty, which triggers writeback, and
    causes use-after-free bug.

    The detailed sequence looks like this:

    - Initial status
    eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
    not traced by any dirty extent_io_tree.

    - New tree block is allocated
    Since there is no backref for 29421568, it's re-allocated as new tree
    block.
    Keep in mind that tree block 29421568 is still referred by extent
    tree.

    - Tree block 29421568 is filled for log tree
    eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
    traced by btrfs_root::dirty_log_pages

    - Some log tree operations
    Since the fs is using node size 4096, the log tree can easily go a
    level higher.

    - Log tree needs balance
    Tree block 29421568 gets all its content pushed to right, thus now
    it is empty, and we don't need it.
    btrfs_clean_tree_block() from __push_leaf_right() gets called.

    eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
    traced by btrfs_root::dirty_log_pages

    - Log tree write back
    btree_write_cache_pages() goes through dirty pages ranges, but since
    page of tree block 29421568 gets cleaned already, it's not written
    back to disk. Thus it doesn't have WRITTEN bit set.
    But ranges in dirty_log_pages are cleared.

    eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
    not traced by any dirty extent_io_tree.

    - Extent tree update when committing transaction
    Since tree block 29421568 has transid equal to running trans, and has
    no WRITTEN bit, should_cow_block() will use it directly without adding
    it to btrfs_transaction::dirty_pages.

    eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
    not traced by any dirty extent_io_tree.

    At this stage, we're doomed. We have a dirty eb not tracked by any
    extent io tree.

    - Transaction gets aborted due to corrupted extent tree
    Btrfs cleans up dirty pages according to transaction::dirty_pages and
    btrfs_root::dirty_log_pages.
    But since tree block 29421568 is tracked by neither of them, it's
    still dirty.

    eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
    not traced by any dirty extent_io_tree.

    - Filesystem unmount
    Since all cleanup is assumed to be done, all workqueues are destroyed.
    Then iput(btree_inode) is called, expecting no dirty pages.
    But tree 29421568 is still dirty, thus triggering writeback.
    Since all workqueues are already freed, we cause use-after-free.

    This shows us that, log tree blocks + bad extent tree can cause wild
    dirty pages.

    [FIX]
    To fix the problem, don't submit any btree write bio if the filesystem
    has any error. This is the last safety net, in case other cleanup
    hasn't caught it.
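
    The idea of the fix can be modelled in a few lines of userspace C. The
    struct, the function name and the -EROFS return value are assumptions
    made for this sketch; the real change lives in the btree writeback path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced model of the filesystem state. */
struct ex_fs_info {
	bool has_error;		/* models the "filesystem has any error" state */
	int bios_submitted;	/* counts write bios handed to the workqueue */
};

/*
 * Last safety net: refuse to submit a btree write bio once the filesystem
 * is in an error state, so nothing is queued on already destroyed
 * workqueues. The error code used here is an assumption.
 */
static int ex_flush_btree_write_bio(struct ex_fs_info *fs)
{
	assert(fs != NULL);
	if (fs->has_error)
		return -EROFS;
	fs->bios_submitted++;
	return 0;
}
```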

    Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • There is no point to inform the user about size change if there's none.
    Update the message to conform to a commonly used format where the path
    and devid are printed and also print old and new sizes.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Marcos Paulo de Souza
    Reviewed-by: David Sterba
    [ enhance message ]
    Signed-off-by: David Sterba

    Marcos Paulo de Souza
     
  • In btrfs_update_global_block_rsv the lines:

    num_bytes = block_rsv->size - block_rsv->reserved;
    block_rsv->reserved += num_bytes;

    imply:

    block_rsv->reserved = block_rsv->size;

    Assign block_rsv->size to block_rsv->reserved directly and reorder lines
    so they match the other branch.
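
    The equivalence is easy to check in isolation. The reduced struct and
    the two function names below are hypothetical; only the two fields from
    the lines quoted above are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model of a block reservation: only the two fields involved. */
struct ex_block_rsv {
	uint64_t size;
	uint64_t reserved;
};

/* Old form: compute the difference, then add it back. */
static void ex_fill_old(struct ex_block_rsv *rsv)
{
	assert(rsv != NULL);
	uint64_t num_bytes = rsv->size - rsv->reserved;

	rsv->reserved += num_bytes;
}

/* New form: assign directly, same result. */
static void ex_fill_new(struct ex_block_rsv *rsv)
{
	assert(rsv != NULL);
	rsv->reserved = rsv->size;
}
```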

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Anand Jain
     
  • The tree_log_mutex and reloc_mutex locks are properly nested so we can
    simplify error handling and add labels for them. This reduces line count
    of the function.

    Reviewed-by: Anand Jain
    Signed-off-by: David Sterba

    David Sterba
     
  • All we need to read is checksum size from fs_info superblock, and
    fs_info is provided by extent buffer so we can get rid of the wild
    pointer indirections from page/inode/root.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • The message seems to be for debugging and has little value for users.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Anand Jain
    Signed-off-by: David Sterba

    David Sterba
     
  • We don't use the u_XX types anywhere, though they're defined.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • Remove trivial comparator and open coded swap of two values.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • An unrecognized option is a failure that should get user/administrator
    attention, the info level is often below what gets logged, so make it
    error.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • All callers pass extent buffer start and length so the extent buffer
    itself should work fine.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Anand Jain
    Signed-off-by: David Sterba

    David Sterba
     
  • The helper btrfs_header_chunk_tree_uuid follows naming convention of
    other struct accessors but does something completely different. As the
    offsetof calculation is clear in the context of extent buffer operations
    we can remove it.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • The helper btrfs_header_fsid follows naming convention of other struct
    accessors but does something completely different. As the offsetof
    calculation is clear in the context of extent buffer operations we can
    remove it.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • There's a simple forwarded call based on the operation that would better
    fit the caller btrfs_map_block that's until now a trivial wrapper.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Anand Jain
    Signed-off-by: David Sterba

    David Sterba
     
  • The struct_size macro does the same calculation and is safe regarding
    overflows. Though we're not expecting them to happen, use the helper for
    clarity.
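
    For readers unfamiliar with the calculation, here is a userspace sketch
    of an overflow-safe size computation for a struct with a trailing
    flexible array. The function is a hypothetical stand-in written for this
    note, not the kernel macro; it saturates to SIZE_MAX on overflow so that
    a following allocation fails instead of being undersized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Example struct with a flexible array member. */
struct ex_table {
	uint32_t nr;
	uint64_t entries[];
};

/*
 * Return header + count * elem, or SIZE_MAX if that would overflow size_t.
 */
static size_t ex_struct_size(size_t header, size_t elem, size_t count)
{
	assert(elem != 0);
	if (count > (SIZE_MAX - header) / elem)
		return SIZE_MAX;
	return header + count * elem;
}
```

    A caller would pass sizeof(struct ex_table), sizeof(uint64_t) and the
    number of entries, then hand the result to its allocator.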

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • This patch removes all haphazard code implementing nocow writers
    exclusion from pending snapshot creation and switches to using the drew
    lock to ensure this invariant still holds.

    'Readers' are snapshot creators from create_snapshot and 'writers' are
    nocow writers from buffered write path or btrfs_setsize. This locking
    scheme allows for multiple snapshots to happen while any nocow writers
    are blocked, since writes to page cache in the nocow path will make
    snapshots inconsistent.

    For performance reasons we'd like to be able to run multiple concurrent
    snapshots, so the lock favors readers in this case. When there aren't
    any pending snapshots (which will be the majority of cases) we rely on
    the percpu writers counter to avoid cacheline contention.

    The main gain from using the drew lock is it's now a lot easier to
    reason about the guarantees of the locking scheme and whether there is
    some silent breakage lurking.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • A (D)ouble (R)eader (W)riter (E)xclusion lock is a locking primitive
    that allows multiple readers or multiple writers, but not readers and
    writers, to hold it concurrently.

    The code is factored out from the existing open-coded locking scheme
    used to exclude pending snapshots from nocow writers and vice-versa.
    The current implementation favors readers (that is, snapshot
    creators) over writers (nocow writers of the filesystem).

    The API provides lock/unlock/trylock for reads and writes.
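
    To make the semantics concrete, here is a simplified userspace sketch of
    the trylock and unlock halves using C11 atomics. All names are
    hypothetical, the blocking lock variants and wait queues are left out,
    and a plain atomic counter replaces the per-CPU writers counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ex_drew_lock {
	atomic_int readers;
	atomic_int writers;
};

static void ex_drew_init(struct ex_drew_lock *l)
{
	atomic_init(&l->readers, 0);
	atomic_init(&l->writers, 0);
}

/* Writers may share the lock with other writers, never with readers. */
static bool ex_drew_try_write_lock(struct ex_drew_lock *l)
{
	if (atomic_load(&l->readers) > 0)
		return false;
	atomic_fetch_add(&l->writers, 1);
	/* Re-check: a reader may have announced itself in the meantime. */
	if (atomic_load(&l->readers) > 0) {
		atomic_fetch_sub(&l->writers, 1);
		return false;
	}
	return true;
}

static void ex_drew_write_unlock(struct ex_drew_lock *l)
{
	int prev = atomic_fetch_sub(&l->writers, 1);

	assert(prev > 0);
	(void)prev;
}

/* Readers announce themselves first, which keeps new writers out. */
static bool ex_drew_try_read_lock(struct ex_drew_lock *l)
{
	atomic_fetch_add(&l->readers, 1);
	if (atomic_load(&l->writers) > 0) {
		atomic_fetch_sub(&l->readers, 1);
		return false;
	}
	return true;
}

static void ex_drew_read_unlock(struct ex_drew_lock *l)
{
	int prev = atomic_fetch_sub(&l->readers, 1);

	assert(prev > 0);
	(void)prev;
}
```

    Because each side increments its own counter before checking the other
    one, a reader and a writer can never both succeed at the same time; in
    a race both attempts may fail, which is acceptable for a trylock.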

    Formal specification for TLA+ provided by Valentin Schneider is at
    https://lore.kernel.org/linux-btrfs/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The error cleanup gotos in __btrfs_write_out_cache() needlessly jump
    back making the code less readable than needed. Flatten them out so no
    back-jump is necessary and the read flow is uninterrupted.

    Reviewed-by: Josef Bacik
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • free-space-cache.c has its own set of DEBUG ifdefs which need to be
    turned on instead of the global CONFIG_BTRFS_DEBUG to print debug
    messages about failed block-group writes.

    Switch this over to CONFIG_BTRFS_DEBUG so we always see these messages
    when running a debug kernel.

    Reviewed-by: Josef Bacik
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • Make the uptodate argument of io_ctl_add_pages() boolean.

    Reviewed-by: Josef Bacik
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • io_ctl_prepare_pages() gets a 'struct btrfs_io_ctl' as well as a 'struct
    inode', but btrfs_io_ctl::inode points to the same struct inode as this is
    assigned in io_ctl_init().

    Use the inode from io_ctl to reduce the arguments of io_ctl_prepare_pages().

    Reviewed-by: Josef Bacik
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • This ioctl will be responsible for deleting a subvolume using its id.
    This can be used when a system has a file system mounted from a
    subvolume, rather than the root file system, like below:

    /
    @subvol1/
    @subvol2/
    @subvol_default/

    If only @subvol_default is mounted, we have no path to reach @subvol1
    and @subvol2, thus no way to delete them. Current subvolume delete ioctl
    takes a file handle point as argument, and if @subvol_default is
    mounted, we can't reach @subvol1 and @subvol2 from the same mount point.

    This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
    the extended structure with flags that allow deleting a subvolume by
    its subvolid.

    Now we can use this new ioctl, specifying the subvolume id and referring
    to the same mount point. It doesn't matter which subvolume was mounted,
    since we can reach the desired one using the subvolume id and then
    delete it.

    The full path to the subvolume id is resolved internally and access is
    verified as if the subvolume was accessed by path.

    The volume args v2 structure is extended to use the existing union for
    subvolume id specification, that's valid in case the
    BTRFS_SUBVOL_SPEC_BY_ID is set.
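
    The "flag selects which union member is valid" pattern can be sketched
    as follows. This is a reduced, hypothetical model: the struct layout,
    the name length and the flag value are illustrative and are not the
    uapi definition.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define EX_SUBVOL_SPEC_BY_ID	(1ULL << 4)	/* illustrative value */
#define EX_NAME_MAX		255		/* illustrative length */

/* Reduced model: name and subvolid share storage, flags say which is valid. */
struct ex_vol_args_v2 {
	int64_t fd;
	uint64_t flags;
	union {
		char name[EX_NAME_MAX + 1];
		uint64_t subvolid;
	};
};

/* Classic request: the subvolume is named relative to the directory fd. */
static void ex_prepare_destroy_by_name(struct ex_vol_args_v2 *args, int fd,
				       const char *name)
{
	assert(args != NULL && name != NULL);
	memset(args, 0, sizeof(*args));
	args->fd = fd;
	snprintf(args->name, sizeof(args->name), "%s", name);
}

/* New request: the flag tells the handler to read subvolid instead. */
static void ex_prepare_destroy_by_id(struct ex_vol_args_v2 *args, int fd,
				     uint64_t subvolid)
{
	assert(args != NULL);
	memset(args, 0, sizeof(*args));
	args->fd = fd;
	args->flags |= EX_SUBVOL_SPEC_BY_ID;
	args->subvolid = subvolid;
}
```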

    Signed-off-by: Marcos Paulo de Souza
    Reviewed-by: David Sterba
    [ update changelog ]
    Signed-off-by: David Sterba

    Marcos Paulo de Souza
     
  • The functions will be used outside of export.c and super.c to allow
    resolving subvolume name from a given id, eg. for subvolume deletion by
    id ioctl.

    Signed-off-by: Marcos Paulo de Souza
    Reviewed-by: David Sterba
    [ split from the next patch ]
    Signed-off-by: David Sterba

    Marcos Paulo de Souza
     
  • When the device remove v2 ioctl was added, the full support mask was
    added to sanity check the flags. However this would allow the
    subvolume related flags to be accepted. This is not supposed to happen.

    Use the correct support mask, which means that now any of
    BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or
    BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is
    a user-visible change, specifying subvolume flags for device deletion
    does not make sense and there are hopefully no applications doing that.
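
    A small sketch of the per-ioctl mask check described above. The EX_
    names and bit values are illustrative stand-ins, and the use of
    EOPNOTSUPP (the code available in userspace errno.h) is an assumption of
    this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bits sharing one flags field. */
#define EX_SUBVOL_CREATE_ASYNC		(1ULL << 0)
#define EX_SUBVOL_RDONLY		(1ULL << 1)
#define EX_SUBVOL_QGROUP_INHERIT	(1ULL << 2)
#define EX_DEVICE_SPEC_BY_ID		(1ULL << 3)

/* Only the flags that make sense for device removal. */
#define EX_DEVICE_REMOVE_ARGS_MASK	(EX_DEVICE_SPEC_BY_ID)

/* Reject any flag outside the support mask of this ioctl class. */
static int ex_check_device_remove_flags(uint64_t flags)
{
	if (flags & ~EX_DEVICE_REMOVE_ARGS_MASK)
		return -EOPNOTSUPP;
	return 0;
}
```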

    Reviewed-by: Marcos Paulo de Souza
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
     
  • Using the defined mask instead of flag enumeration in the ioctl handler
    is preferred. No functional changes.

    Reviewed-by: Marcos Paulo de Souza
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
     
  • The ioctl data for devices or subvolumes can be passed via
    btrfs_ioctl_vol_args or btrfs_ioctl_vol_args_v2. The latter is more
    versatile and needs some caution as some of the flags make sense only
    for some ioctls.

    As we're going to extend the flags, define support masks for each ioctl
    class separately.

    Reviewed-by: Marcos Paulo de Souza
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
     
  • Sparse reports a warning at release_extent_buffer():
    warning: context imbalance in release_extent_buffer() - unexpected unlock

    The root cause is the missing annotation at release_extent_buffer().
    Add the missing __releases(&eb->refs_lock) annotation.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Jules Irenge
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jules Irenge
     
  • In my EIO stress testing I noticed I was getting forced to rescan the
    uuid tree pretty often, which was weird. This is because my error
    injection stuff would sometimes inject an error after log replay but
    before we loaded the UUID tree. If log replay committed the transaction
    it wouldn't have updated the uuid tree generation, but the tree was
    valid and didn't change, so there's no reason to not update the
    generation here.

    Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
    after reading all the fs roots if the uuid tree generation matches the
    fs generation. Then any transaction commits that happen during mount
    won't screw up our uuid tree state, forcing us to do needless uuid
    rescans.

    Fixes: 70f801754728 ("Btrfs: check UUID tree during mount if required")
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • In doing my fsstress+EIO stress testing I started running into issues
    where umount would get stuck forever because the uuid checker was
    chewing through the thousands of subvolumes I had created.

    We shouldn't block umount on this, simply bail if we're unmounting the
    fs. We need to make sure we don't mark the UUID tree as ok, so we only
    set that bit if we made it through the whole rescan operation, but
    otherwise this is completely safe.
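
    The control flow can be sketched like this. Everything here is a
    hypothetical model: the struct, the close_after test hook and the
    -EINTR return value are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced filesystem state for the rescan loop. */
struct ex_fs {
	bool closing;		/* set once unmount has started */
	bool uuid_tree_ok;	/* only set after a complete rescan */
	int checked;		/* entries processed so far */
	int close_after;	/* test hook: start "unmount" after N entries */
};

static int ex_uuid_rescan(struct ex_fs *fs, int nr_subvols)
{
	assert(fs != NULL);
	for (int i = 0; i < nr_subvols; i++) {
		/* Don't block umount: bail out without marking the tree ok. */
		if (fs->closing)
			return -EINTR;
		fs->checked++;
		if (fs->checked == fs->close_after)
			fs->closing = true;
	}
	/* Reached only if every entry was visited. */
	fs->uuid_tree_ok = true;
	return 0;
}
```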

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • It's used only during filesystem mount, so it can be made private to
    the disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
    as btrfs_check_uuid_tree is its sole caller.

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • btrfs_uuid_tree_iterate is called from only one place and its 2nd
    argument is always btrfs_check_uuid_tree_entry. Simplify
    btrfs_uuid_tree_iterate's signature by removing its 2nd argument and
    directly calling btrfs_check_uuid_tree_entry. Also move the latter into
    uuid-tree.h. No functional changes.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • There are temporary variables tracking the index of P and Q stripes, but
    none of them is really used as such, merely for determining if the Q
    stripe is present. This leads to compiler warnings with
    -Wunused-but-set-variable and has been reported several times.

    fs/btrfs/raid56.c: In function ‘finish_rmw’:
    fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
    1199 | int p_stripe = -1;
    | ^~~~~~~~
    fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
    fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
    2356 | int p_stripe = -1;
    | ^~~~~~~~

    Replace the two variables with one that has a clear meaning and also get
    rid of the warnings. The logic that verifies that there are only 2
    valid cases is unchanged.
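
    A sketch of the "one variable with a clear meaning" idea, assuming the
    number of parity stripes is the difference between all stripes and the
    data stripes. The function name and the -EINVAL return for the invalid
    case are choices made for this example.

```c
#include <assert.h>
#include <errno.h>

/*
 * Return 0 when only the P stripe is present, 1 when both P and Q are
 * present, and -EINVAL for any other layout.
 */
static int ex_has_qstripe(int real_stripes, int nr_data)
{
	assert(nr_data > 0);
	switch (real_stripes - nr_data) {
	case 1:
		return 0;	/* one parity stripe: P only */
	case 2:
		return 1;	/* two parity stripes: P and Q */
	default:
		return -EINVAL;	/* only the two cases above are valid */
	}
}
```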

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • With the following patches:

    - btrfs: backref, only collect file extent items matching backref offset
    - btrfs: backref, not adding refs from shared block when resolving normal backref
    - btrfs: backref, only search backref entries from leaves of the same root

    we only collect the normal data refs we want, so the imprecise upper
    bound total_refs of that EXTENT_ITEM could now be changed to the count
    of the normal backref entry we want to search.

    Background and how the patches fit together:

    Btrfs has two types of data backref.
    For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
    exact block number. Therefore, we need to call resolve_indirect_refs.
    It uses btrfs_search_slot to locate the leaf block. Then
    we need to walk through the leaves to search for the EXTENT_DATA items
    that have disk bytenr matching the extent item (add_all_parents).

    When resolving indirect refs, we could take entries that don't
    belong to the backref entry we are searching for right now.
    For that reason when searching backref entry, we always use total
    refs of that EXTENT_ITEM rather than individual count.

    For example:
    item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
    extent refs 24 gen 7302 flags DATA
    shared data backref parent 394985472 count 10 #1
    extent data backref root 257 objectid 260 offset 1048576 count 3 #2
    extent data backref root 256 objectid 260 offset 65536 count 6 #3
    extent data backref root 257 objectid 260 offset 65536 count 5 #4

    For example, when searching backref entry #4, we'll use total_refs
    24, a very loose loop ending condition, instead of total_refs = 5.

    But using total_refs = 24 is not accurate. Sometimes, we'll never find
    all the refs from specific root. As a result, the loop keeps on going
    until we reach the end of that inode.

    The first 3 patches handle 3 different types of refs we might encounter.
    These refs do not belong to the normal backref we are searching, and
    hence need to be skipped.

    This patch changes the total_refs to correct number so that we could
    end loop as soon as we find all the refs we want.
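
    The effect of a tight versus a loose loop bound can be shown with a toy
    scan. The struct and function below are hypothetical and only model the
    ending condition, not the actual backref walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a candidate reference found while walking leaves. */
struct ex_ref {
	uint64_t root;
	uint64_t objectid;
	uint64_t offset;
};

/*
 * Walk items until either the end is reached or `limit` matching refs
 * have been found. Returns how many items were visited.
 */
static size_t ex_scan(const struct ex_ref *items, size_t n,
		      const struct ex_ref *want, unsigned int limit,
		      unsigned int *found)
{
	size_t visited = 0;

	assert(items != NULL && want != NULL && found != NULL);
	*found = 0;
	while (visited < n && *found < limit) {
		const struct ex_ref *it = &items[visited++];

		if (it->root == want->root &&
		    it->objectid == want->objectid &&
		    it->offset == want->offset)
			(*found)++;
	}
	return visited;
}
```

    With the count of the entry itself as the limit the walk stops right
    after the last match; with the extent's total refs as the limit it can
    never be satisfied and the walk runs to the end of the items.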

    btrfs send uses backref to find possible clone sources, the following
    is a simple test to compare the results with and without this patch:

    $ btrfs subvolume create /sub1
    $ for i in `seq 1 163840`; do
    dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct
    done
    $ btrfs subvolume snapshot /sub1 /sub2
    $ for i in `seq 1 163840`; do
    dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct
    done
    $ btrfs subvolume snapshot -r /sub1 /snap1
    $ time btrfs send /snap1 | btrfs receive /volume2

    Without this patch:

    real 69m48.124s
    user 0m50.199s
    sys 70m15.600s

    With this patch:

    real 1m59.683s
    user 0m35.421s
    sys 2m42.684s

    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: ethanwu
    [ add patchset cover letter with background and numbers ]
    Signed-off-by: David Sterba

    ethanwu
     
  • We could have some nodes/leaves in a subvolume whose owner is not
    that subvolume. So when we resolve normal backrefs of that
    subvolume, we should avoid collecting references from these blocks.

    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: ethanwu
    Signed-off-by: David Sterba

    ethanwu
     
  • All references from the block of SHARED_DATA_REF belong to that shared
    block backref.

    For example:

    item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
    extent refs 24 gen 7302 flags DATA
    extent data backref root 257 objectid 260 offset 65536 count 5
    extent data backref root 258 objectid 265 offset 0 count 9
    shared data backref parent 394985472 count 10

    Block 394985472 might be a leaf from root 257, and the item objectid and
    (file_pos - file_extent_item::offset) in that leaf just happens to be
    260 and 65536 which is equal to the first extent data backref entry.

    Before this patch, when we resolve backref:

    root 257 objectid 260 offset 65536

    we will add those refs in block 394985472 and wrongly treat those as the
    refs we want.

    Fix this by checking if the leaf we are processing is shared data
    backref, if so, just skip this leaf.

    Shared data refs added into preftrees.direct have all entry value = 0
    (root_id = 0, key = NULL, level = 0) except parent entry.

    Other refs from indirect tree will have key value and root id != 0, and
    these values won't be changed when their parent is resolved and added to
    preftrees.direct. Therefore, we could reuse the preftrees.direct and
    search ref with all values = 0 except parent is set to avoid getting
    those resolved refs block.

    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: ethanwu
    Signed-off-by: David Sterba

    ethanwu
     
  • When resolving one backref of type EXTENT_DATA_REF, we collect all
    references that simply reference the EXTENT_ITEM even though their
    (file_pos - file_extent_item::offset) are not the same as the
    btrfs_extent_data_ref::offset we are searching for.

    This patch adds an additional check so that we only collect references whose
    (file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.
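
    The check itself is a single comparison. The struct and function names
    below are hypothetical and only mirror the two quantities named in the
    text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one file extent item found in a leaf. */
struct ex_file_extent {
	uint64_t file_pos;	/* file offset the item is keyed at */
	uint64_t extent_offset;	/* offset into the referenced extent */
};

/* Collect the reference only if it matches the backref offset we want. */
static bool ex_ref_offset_matches(const struct ex_file_extent *fe,
				  uint64_t wanted_ref_offset)
{
	assert(fe != NULL);
	return fe->file_pos - fe->extent_offset == wanted_ref_offset;
}
```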

    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: ethanwu
    Signed-off-by: David Sterba

    ethanwu
     
  • The integrity checking code for the super block mirrors is the last
    remaining user of buffer_heads. Change it to use plain bios as well.

    Reviewed-by: Josef Bacik
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • Now that the last caller of btrfsic_process_written_block() with
    buffer_heads is gone, remove the buffer_head processing path from it as
    well.

    Reviewed-by: Josef Bacik
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • Now that the last use of btrfsic_submit_bh() is gone as the super block
    is now written using bios, remove the function as well.

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Josef Bacik
    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn