29 Feb, 2020

1 commit

  • commit e75fd33b3f744f644061a4f9662bd63f5434f806 upstream.

    In btrfs_wait_ordered_range() once we find an ordered extent that has
    finished with an error we exit the loop and don't wait for any other
    ordered extents that might still be in progress.

    All the users of btrfs_wait_ordered_range() expect that there are no more
    ordered extents in progress after that function returns. So past fixes
    such as the ones from the following two commits:

    ff612ba7849964 ("btrfs: fix panic during relocation after ENOSPC before
    writeback happens")

    28aeeac1dd3080 ("Btrfs: fix panic when starting bg cache writeout after
    IO error")

    don't work when there are multiple ordered extents in the range.

    Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
    even after it finds one that had an error.
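
    As a user-space sketch of the fixed loop shape (hypothetical names, not
    the actual kernel diff), the idea is to keep waiting on every ordered
    extent in the range and only remember the first error, instead of
    returning as soon as one extent fails:

    struct ordered_extent { int error; struct ordered_extent *next; };

    static void wait_for_completion_stub(struct ordered_extent *oe) { (void)oe; }

    /* Wait on ALL extents in the range; report the first error seen. */
    static int wait_ordered_range_sketch(struct ordered_extent *head)
    {
            int ret = 0;

            for (struct ordered_extent *oe = head; oe; oe = oe->next) {
                    wait_for_completion_stub(oe);
                    if (oe->error && ret == 0)
                            ret = oe->error;   /* remember, keep waiting */
            }
            return ret;
    }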

    Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Qu Wenruo
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

09 Jan, 2020

1 commit

  • [ Upstream commit a0cac0ec961f0d42828eeef196ac2246a2f07659 ]

    Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed
    write") worked around the issue that a recycled work item could get a
    false dependency on the original work item due to how the workqueue code
    guarantees non-reentrancy. It did so by giving different work functions
    to different types of work.

    However, the fixes in the previous few patches are more complete, as
    they prevent a work item from being recycled at all (except for a tiny
    window that the kernel workqueue code handles for us). This obsoletes
    the previous fix, so we don't need the unique helpers for correctness.
    The only other reason to keep them would be so they show up in stack
    traces, but they always seem to be optimized to a tail call, so they
    don't show up anyway. So, let's just get rid of the extra indirection.

    While we're here, rename normal_work_helper() to the more informative
    btrfs_work_helper().
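
    As an illustration only (a simplified user-space model, not the btrfs
    code), the indirection being removed looks roughly like this: every work
    type used to get its own thin helper whose only job was to call a common
    handler, and now the work item points at the one shared helper directly.

    typedef void (*work_fn)(void *data);

    struct work_item {
            work_fn fn;
            void   *data;
    };

    /* The one shared handler, btrfs_work_helper() in spirit. */
    static void common_work_helper(void *data)
    {
            (void)data;     /* ... do the per-item processing ... */
    }

    /* Before: one wrapper per work type, existing only to provide a
     * distinct function address; it tail-called the common helper and
     * never showed up in stack traces anyway. */
    static void endio_work_helper(void *data) { common_work_helper(data); }

    /* After: queue the common helper directly, no extra indirection. */
    static void queue_work_sketch(struct work_item *w, void *data)
    {
            w->fn = common_work_helper;
            w->data = data;
    }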

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Filipe Manana
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Omar Sandoval
     

26 Jul, 2019

1 commit

  • btrfs_lock_and_flush_ordered_range() loads the given "*cached_state" into
    cachedp, which, in general, is NULL. Then, lock_extent_bits() updates
    "cachedp", but the update is never propagated back to the caller. Thus
    the caller still sees its "cached_state" as NULL and never frees the
    state allocated under btrfs_lock_and_flush_ordered_range(). As a result,
    we see a massive extent state leak with e.g. fstests btrfs/005. Fix this
    bug by properly handling the pointers.
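
    A minimal sketch of the pointer bug and its fix (hypothetical names): if
    the function copies the caller's pointer into a local and only ever
    updates the local, the caller never sees the allocated state; the fix is
    to write the result back through the caller's pointer.

    #include <stdlib.h>

    struct extent_state { int dummy; };

    /* Buggy shape: cachedp is filled in locally but never written back,
     * so the state allocated here is leaked. */
    static void lock_range_buggy(struct extent_state **cached_state)
    {
            struct extent_state *cachedp = cached_state ? *cached_state : NULL;

            if (!cachedp)
                    cachedp = calloc(1, sizeof(*cachedp));
            /* ... lock the range using cachedp ... */
    }

    /* Fixed shape: propagate the (possibly newly allocated) state back. */
    static void lock_range_fixed(struct extent_state **cached_state)
    {
            struct extent_state *cachedp = cached_state ? *cached_state : NULL;

            if (!cachedp)
                    cachedp = calloc(1, sizeof(*cachedp));
            /* ... lock the range using cachedp ... */
            if (cached_state)
                    *cached_state = cachedp;   /* caller can free it later */
    }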

    Fixes: bd80d94efb83 ("btrfs: Always use a cached extent_state in btrfs_lock_and_flush_ordered_range")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Naohiro Aota
    Signed-off-by: David Sterba

    Naohiro Aota
     

04 Jul, 2019

1 commit

  • We have code for data and metadata reservations for delalloc. There's
    quite a bit of code here, and it's used in a lot of places, so I've
    separated it out into its own file. inode.c and file.c are already
    pretty large, and this code is complicated enough to live in its own
    space.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

01 Jul, 2019

3 commits

  • BTRFS has the implicit assumption that a checksum in btrfs_ordered_sums
    is 4 bytes. While this is true for CRC32C, it is not for any other
    checksum.

    Change the data type to be a byte array and adjust loop index
    calculation accordingly.

    This includes moving the adjustment of 'index' by 'ins_size' in
    btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
    size, because before this patch the 'sums' member of 'struct
    btrfs_ordered_sum' was 4 bytes in size and afterwards it is only one
    byte.
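
    A small arithmetic illustration of why the order matters (illustrative
    values, not the kernel code): with a byte array, 'index' counts bytes, so
    it has to be advanced by the raw byte count before 'ins_size' is turned
    into a number of checksums.

    #include <stdio.h>

    int main(void)
    {
            const unsigned int csum_size = 4;   /* e.g. CRC32C */
            unsigned int ins_size = 32;         /* checksum bytes inserted */
            unsigned int index = 0;             /* byte offset into sums[] */

            index += ins_size;                  /* advance by bytes first */
            ins_size /= csum_size;              /* then count checksums */

            printf("index advanced by %u bytes (%u checksums)\n",
                   index, ins_size);
            return 0;
    }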

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • In case no cached_state argument is passed to
    btrfs_lock_and_flush_ordered_range(), use one local to the function. This
    optimises the case when an ordered extent is found, since the unlock
    function will be able to unlock that state directly without searching
    for it again.

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • There is a certain idiom used in multiple places in btrfs' codebase for
    dealing with flushing an ordered range. Factor it out into a separate
    function that can be reused. Future patches will replace the existing
    code with that function.
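
    Roughly, the idiom being factored out has the following shape (a
    user-space sketch with hypothetical helper names): lock the range, look
    for an ordered extent overlapping it, and if one is found drop the lock,
    wait for the extent to complete and retry until the range is clean.

    #include <stddef.h>

    struct ordered_extent { int dummy; };

    typedef unsigned long long u64s;    /* stand-in for the kernel's u64 */

    /* Stand-ins for the kernel helpers used by the idiom. */
    static struct ordered_extent *lookup_ordered_range(u64s start, u64s len)
    { (void)start; (void)len; return NULL; }
    static void lock_range(u64s s, u64s e)   { (void)s; (void)e; }
    static void unlock_range(u64s s, u64s e) { (void)s; (void)e; }
    static void wait_and_put_ordered(struct ordered_extent *oe) { (void)oe; }

    /* Return with [start, end] locked and no ordered extent overlapping it. */
    static void lock_and_flush_ordered_range_sketch(u64s start, u64s end)
    {
            for (;;) {
                    struct ordered_extent *oe;

                    lock_range(start, end);
                    oe = lookup_ordered_range(start, end - start + 1);
                    if (!oe)
                            break;      /* range is clean, stay locked */
                    unlock_range(start, end);
                    wait_and_put_ordered(oe);
            }
    }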

    Reviewed-by: Josef Bacik
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

30 Apr, 2019

2 commits

  • When diagnosing a slowdown of generic/224 I noticed we were not doing
    anything when calling into shrink_delalloc(). This is because all
    writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
    counter is 0, which short circuits most of the work inside of
    shrink_delalloc(). However O_DIRECT writes still consume metadata
    resources and generate ordered extents, which we can still wait on.

    Fix this by tracking outstanding DIO write bytes, and use this as well as
    the delalloc bytes counter to decide if we need to look up and wait on any
    ordered extents. If we have more DIO writes than delalloc bytes we'll go
    ahead and wait on any ordered extents regardless of our flush state, as
    flushing delalloc is likely to not gain us anything.
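
    Sketched as two tiny predicates (hypothetical field names; the real
    counters live in the btrfs fs_info/space_info structures), the decision
    described above looks like this:

    #include <stdbool.h>

    struct space_counters {
            unsigned long long delalloc_bytes;  /* buffered delalloc writes */
            unsigned long long dio_bytes;       /* outstanding O_DIRECT writes */
    };

    /* The old short circuit: nothing to flush and nothing to wait on. */
    static bool nothing_to_do(const struct space_counters *s)
    {
            return s->delalloc_bytes == 0 && s->dio_bytes == 0;
    }

    /* Mostly direct IO: flushing delalloc gains little, but the ordered
     * extents generated by the DIO writes can still be waited on. */
    static bool wait_ordered_regardless_of_flush(const struct space_counters *s)
    {
            return s->dio_bytes > s->delalloc_bytes;
    }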

    Signed-off-by: Josef Bacik
    [ use dio instead of odirect in identifiers ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Ordered csums are keyed off of a btrfs_ordered_extent, which already has
    a reference to the inode. This implies that an explicit inode argument
    is redundant. So remove it.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

25 Apr, 2019

1 commit

  • The recent multi-page biovec rework allowed creation of bios that can span
    large regions - up to 128 megabytes in the case of btrfs. OTOH btrfs'
    submission path currently allocates a contiguous array to store the
    checksums for every bio submitted. This means we can request up to
    (128mb / BTRFS_SECTOR_SIZE) * 4 bytes + 32 bytes of memory from kmalloc.
    On busy systems with possibly fragmented memory said kmalloc can fail,
    which will trigger a BUG_ON due to improper error handling in the IO
    submission context in btrfs.

    Until error handling is improved or bios in btrfs are limited to a more
    manageable size (e.g. 1m), let's use kvmalloc to fall back to vmalloc for
    such large allocations. There is no hard requirement that the memory
    allocated for checksums during IO submission has to be contiguous, but
    this is a simple fix that does not require several non-contiguous
    allocations.

    For small writes this is unlikely to have any visible effect since
    kmalloc will still satisfy allocation requests as usual. For larger
    requests the code will just fall back to vmalloc.

    We've performed evaluation on several workload types and there was no
    significant difference between kmalloc and kvmalloc.
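
    As a rough kernel-style sketch (simplified, not the actual btrfs hunk;
    the structure below is illustrative), the change boils down to allocating
    with kvmalloc()/kvfree(), which try kmalloc first and transparently fall
    back to vmalloc for large requests:

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    /* Illustrative container: one checksum per sector of the bio. */
    struct ordered_sum_sketch {
            u64 bytenr;
            int len;
            u8 *sums;
    };

    static struct ordered_sum_sketch *alloc_sums(int nr_sectors, int csum_size)
    {
            struct ordered_sum_sketch *s;

            /* Can be several megabytes for a 128M bio; fall back to
             * vmalloc instead of hitting a BUG_ON on kmalloc failure. */
            s = kvmalloc(sizeof(*s) + (size_t)nr_sectors * csum_size, GFP_NOFS);
            if (!s)
                    return NULL;
            s->sums = (u8 *)(s + 1);
            return s;
    }

    static void free_sums(struct ordered_sum_sketch *s)
    {
            kvfree(s);      /* handles both kmalloc'ed and vmalloc'ed memory */
    }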

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

17 Dec, 2018

1 commit

  • Tracking pending ordered extents per transaction was introduced in commit
    50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current
    transaction V3") and later updated in commit 161c3549b45a ("Btrfs: change
    how we wait for pending ordered extents").

    However, now that on fsync we always wait for ordered extents to complete
    before logging, done in commit 5636cf7d6dc8 ("btrfs: remove the logged
    extents infrastructure"), we no longer need the code to track pending
    ordered extents, which was not completely removed in the mentioned commit.
    So remove the remainder of the pending ordered extents infrastructure.

    Reviewed-by: Liu Bo
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

31 Mar, 2018

1 commit

  • Before this patch, btrfs qgroup is mixing per-transaction meta rsv with
    preallocated meta rsv, making it quite easy to underflow the qgroup meta
    reservation.

    Since we have the new qgroup meta rsv types, apply them to delalloc
    reservation.

    Now for delalloc, most of its reserved space will use the META_PREALLOC
    qgroup rsv type.

    And callers reducing outstanding extents, like btrfs_finish_ordered_io(),
    will convert the corresponding META_PREALLOC reservation to
    META_PERTRANS.

    This is mainly due to the fact that current qgroup numbers will only be
    updated in btrfs_commit_transaction(), that is to say if we don't keep
    such a placeholder reservation, we can exceed the qgroup limitation.

    And callers freeing outstanding extents in an error handler will
    just free the META_PREALLOC bytes.

    This behavior makes callers of btrfs_qgroup_release_meta() or
    btrfs_qgroup_convert_meta() aware of which type they are dealing with.
    So in this patch, btrfs_delalloc_release_metadata() and its callers get
    an extra parameter to tell qgroup whether to convert or release the meta
    reservation.

    The good news is, even if we use the wrong type (convert or free), it
    won't cause an obvious bug, as the prealloc type is always in good shape,
    and the type only affects whether the per-trans meta is increased or not.

    So in the worst case the metadata limit can sometimes be exceeded
    (no convert at all) or be reached too soon (no free at all).
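
    A simplified sketch of the two release paths (hypothetical names; the
    real code threads a flag through btrfs_delalloc_release_metadata()):
    successful completion converts the prealloc reservation to per-trans,
    while the error path simply frees it.

    enum qgroup_rsv_type_sketch {
            RSV_META_PREALLOC,
            RSV_META_PERTRANS,
    };

    struct qgroup_sketch {
            unsigned long long rsv[2];          /* bytes reserved per type */
    };

    /* Error path: just give the preallocated bytes back. */
    static void qgroup_free_meta_prealloc(struct qgroup_sketch *qg,
                                          unsigned long long bytes)
    {
            qg->rsv[RSV_META_PREALLOC] -= bytes;
    }

    /* Success path (e.g. btrfs_finish_ordered_io()): keep the bytes as a
     * per-trans placeholder until qgroup numbers are updated at commit. */
    static void qgroup_convert_meta(struct qgroup_sketch *qg,
                                    unsigned long long bytes)
    {
            qg->rsv[RSV_META_PREALLOC] -= bytes;
            qg->rsv[RSV_META_PERTRANS] += bytes;
    }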

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     

26 Mar, 2018

1 commit

  • The __cold functions are placed in a special section, as they're
    expected to be called rarely. This could help i-cache prefetches and help
    the compiler decide which branches are more/less likely to be taken
    without any other annotations needed.

    Though we can't add more __exit annotations, it's still possible to add
    __cold (that's also added with __exit). That way the following function
    categories are tagged:

    - printf wrappers, error messages
    - exit helpers
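
    For illustration, in plain C (using the underlying GCC attribute rather
    than the kernel's __cold macro), a rarely called error/print wrapper
    would be tagged like this:

    #include <stdarg.h>
    #include <stdio.h>

    /* Rarely executed: the compiler may move it out of the hot text and
     * treat branches leading to it as unlikely. */
    static void __attribute__((cold)) report_error(const char *fmt, ...)
    {
            va_list args;

            va_start(args, fmt);
            vfprintf(stderr, fmt, args);
            va_end(args);
    }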

    Signed-off-by: David Sterba

    David Sterba
     

02 Nov, 2017

1 commit

  • Right now we jump through a lot of weird hoops around outstanding_extents
    in order to keep the extent count consistent. This is because we logically
    transfer the outstanding_extent count from the initial reservation
    through the set_delalloc_bits call. This makes it pretty difficult to get
    a handle on how and when we need to mess with outstanding_extents.

    Fix this by revamping the rules of how we deal with outstanding_extents.
    Now, instead, everybody that is holding on to a delalloc extent is
    required to increase the outstanding extents count for itself. This
    means we'll have something like this

    btrfs_delalloc_reserve_metadata - outstanding_extents = 1
    btrfs_set_extent_delalloc - outstanding_extents = 2
    btrfs_delalloc_release_extents - outstanding_extents = 1

    for an initial file write. Now take the append write where we extend an
    existing delalloc range but still under the maximum extent size

    btrfs_delalloc_reserve_metadata - outstanding_extents = 2
    btrfs_set_extent_delalloc
    btrfs_set_bit_hook - outstanding_extents = 3
    btrfs_merge_extent_hook - outstanding_extents = 2
    btrfs_delalloc_release_extents - outstanding_extents = 1

    In order to make the ordered extent transition we of course must now
    make ordered extents carry their own outstanding_extent reservation, so
    for cow_file_range we end up with

    btrfs_add_ordered_extent - outstanding_extents = 2
    clear_extent_bit - outstanding_extents = 1
    btrfs_remove_ordered_extent - outstanding_extents = 0

    This makes all manipulations of outstanding_extents much more explicit.
    Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
    combined with btrfs_delalloc_release_extents, even in the error case, as
    that is the only function that actually modifies the
    outstanding_extents counter.

    The drawback to this is now we are much more likely to have transient
    cases where outstanding_extents is much larger than it actually should
    be. This could happen before as we manipulated the delalloc bits, but
    now it happens basically at every write. This may put more pressure on
    the ENOSPC flushing code, but I think making this code simpler is worth
    the cost. I have another change coming to mitigate this side-effect
    somewhat.

    I also added trace points for the counter manipulation. These were used
    by a bpf script I wrote to help track down leak issues.
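
    Stated as a tiny sketch (hypothetical names, not the kernel structures):
    whoever takes a reservation bumps the counter for itself and must drop it
    again when done, even on error, so the pairing is always local and
    explicit.

    struct inode_sketch {
            unsigned int outstanding_extents;
    };

    static void delalloc_reserve_sketch(struct inode_sketch *inode)
    {
            inode->outstanding_extents++;       /* the reservation's count */
    }

    static void delalloc_release_extents_sketch(struct inode_sketch *inode)
    {
            inode->outstanding_extents--;
    }

    static int write_path_sketch(struct inode_sketch *inode, int setup_failed)
    {
            delalloc_reserve_sketch(inode);

            if (setup_failed) {
                    delalloc_release_extents_sketch(inode); /* error path too */
                    return -1;
            }
            /* ... set delalloc bits, create the ordered extent, ... */
            delalloc_release_extents_sketch(inode);
            return 0;
    }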

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

30 Jun, 2017

1 commit

  • Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
    v4.12-rc6. This was because commit 70e7af244 made it possible for
    calc_reclaim_items_nr() to return a negative number. It's not really a
    bug in that commit, it just didn't go far enough down the stack to find
    all the possible 64->32 bit overflows.

    This switches calc_reclaim_items_nr() to return a u64 and changes everyone
    that uses the results of that math to u64 as well.
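
    A concrete example of the overflow class being fixed (illustrative
    numbers only): squeezing a 64-bit byte count into an int can flip it
    negative, which is what tripped the WARN_ON.

    #include <stdio.h>

    int main(void)
    {
            /* A perfectly reasonable u64 byte count ... */
            unsigned long long to_reclaim = 6ULL * 1024 * 1024 * 1024;

            /* ... silently truncated when squeezed into a 32-bit int. */
            int nr = (int)to_reclaim;

            printf("u64 value: %llu\n", to_reclaim);
            printf("after 64->32 truncation: %d\n", nr);  /* typically < 0 */
            return 0;
    }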

    Reported-by: Dave Jones
    Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Chris Mason
     

14 Feb, 2017

3 commits

  • Since we have a good helper entry_end, use it for ordered extent.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    [ whitespace reformatting ]
    Signed-off-by: David Sterba

    Liu Bo
     
  • btrfs_ordered_update_i_size() can be called by truncate and endio, but
    only endio takes an ordered_extent, which contains the completed IO.

    While truncating down a file, if there are some in-flight IOs,
    btrfs_ordered_update_i_size() in endio will set disk_i_size to
    @orig_offset, which is zero. If truncating down fails somehow, we try to
    recover the in-memory isize with this zeroed disk_i_size.

    Fix it by only updating disk_i_size with @orig_offset when
    btrfs_ordered_update_i_size() is not called from endio while truncating
    down, and by waiting for in-flight IOs to complete before recovering the
    in-memory size.

    Besides fixing the above issue, add an assertion for last_size to double
    check we truncate down to the desired size.

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
     
  • Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     

27 Sep, 2016

1 commit

  • CodingStyle chapter 2:
    "[...] never break user-visible strings such as printk messages,
    because that breaks the ability to grep for them."

    This patch unsplits user-visible strings.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     

26 Jul, 2016

1 commit

  • BTRFS is using a variety of slab caches to satisfy internal needs.
    Those slab caches are always allocated with SLAB_RECLAIM_ACCOUNT,
    meaning allocations from the caches are going to be accounted as
    SReclaimable. At the same time btrfs is not registering any shrinkers
    whatsoever, thus preventing memory from the slabs from being shrunk.
    This means those caches are not in fact reclaimable.

    To fix this remove the SLAB_RECLAIM_ACCOUNT flag from all caches apart
    from the inode cache, since that one is freed by the generic VFS
    super_block shrinker. Also mark the transaction related caches as
    SLAB_TEMPORARY, to better document the lifetime of the objects (it just
    translates to SLAB_RECLAIM_ACCOUNT).

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

24 Jun, 2016

1 commit

  • When doing a truncate operation, btrfs_setsize() will first call
    truncate_setsize() to set the new inode->i_size, but if the later
    btrfs_truncate() fails, btrfs_setsize() will call
    "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
    in-memory inode size, and now the bug occurs. This is because, for the
    truncate case, btrfs_ordered_update_i_size() directly uses inode->i_size
    to update BTRFS_I(inode)->disk_i_size, while we should instead use the
    "offset" argument to update disk_i_size. Here is the call graph:
    ==>btrfs_truncate()
    ====>btrfs_truncate_inode_items()
    ======>btrfs_ordered_update_i_size(inode, last_size, NULL);
    Here btrfs_ordered_update_i_size()'s offset argument is last_size.

    And below test case can reveal this bug:

    dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
    dev=$(losetup --show -f fs.img)
    mkdir -p /mnt/mntpoint
    mkfs.btrfs -f $dev
    mount $dev /mnt/mntpoint
    cd /mnt/mntpoint

    echo "workdir is: /mnt/mntpoint"
    blocksize=$((128 * 1024))
    dd if=/dev/zero of=testfile bs=$blocksize count=1
    sync
    count=$((17*1024*1024*1024/blocksize))
    echo "file size is:" $((count*blocksize))
    for ((i = 1; i <= $count; i++)); do
        dd if=/dev/zero of=testfile bs=$blocksize count=1 seek=$i conv=notrunc 2>/dev/null
    done
    sync

    truncate --size 0 testfile
    ls -l testfile
    du -sh testfile
    exit

    In this case, the truncate operation will fail with ENOSPC and
    "du -sh testfile" returns a value greater than 0, but testfile's
    size is 0, so we need to reflect the correct inode->i_size.

    Signed-off-by: Wang Xiaoguang
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Wang Xiaoguang
     

30 May, 2016

1 commit

  • When we do a device replace, for each device extent we find from the
    source device, we set the corresponding block group to readonly mode to
    prevent writes into it from happening while we are copying the device
    extent from the source to the target device. However just before we set
    the block group to readonly mode some concurrent task might have already
    allocated an extent from it or decided it could perform a nocow write
    into one of its extents, which can make the device replace process miss
    copying an extent, since it uses the extent tree's commit root to
    search for extents and only once it finishes searching for all extents
    belonging to the block group does it set the left cursor to the logical
    end address of the block group. This is a problem if the respective
    ordered extents finish while we are searching for extents using the
    extent tree's commit root and no transaction commit happens while we
    are iterating the tree, since it's the delayed references created by the
    ordered extents (when they complete) that insert the extent items into
    the extent tree (using the non-commit root of course).
    Example:
    Example:

    CPU 1                                     CPU 2

    btrfs_dev_replace_start()
      btrfs_scrub_dev()
        scrub_enumerate_chunks()
          --> finds device extent belonging
              to block group X

                                              starts buffered write
                                              against some inode

                                              writepages is run against
                                              that inode forcing delalloc
                                              to run

                                              btrfs_writepages()
                                                extent_writepages()
                                                  extent_write_cache_pages()
                                                    __extent_writepage()
                                                      writepage_delalloc()
                                                        run_delalloc_range()
                                                          cow_file_range()
                                                            btrfs_reserve_extent()
                                                              --> allocates an extent
                                                                  from block group X
                                                                  (which is not yet
                                                                  in RO mode)
                                                            btrfs_add_ordered_extent()
                                                              --> creates ordered extent Y
                                              flush_epd_write_bio()
                                                --> bio against the extent from
                                                    block group X is submitted

          btrfs_inc_block_group_ro(bg X)
            --> sets block group X to readonly

          scrub_chunk(bg X)
            scrub_stripe(device extent from srcdev)
              --> keeps searching for extent items
                  belonging to the block group using
                  the extent tree's commit root
              --> it never blocks due to
                  fs_info->scrub_pause_req as no
                  one tries to commit transaction N
              --> copies all extents found from the
                  source device into the target device
              --> finishes search loop

                                              bio completes

                                              ordered extent Y completes
                                              and creates delayed data
                                              reference which will add an
                                              extent item to the extent
                                              tree when run (typically
                                              at transaction commit time)

                                              --> so the task doing the
                                                  scrub/device replace
                                                  at CPU 1 misses this
                                                  and does not copy this
                                                  extent into the new/target
                                                  device

          btrfs_dec_block_group_ro(bg X)
            --> turns block group X back to RW mode

          dev_replace->cursor_left is set to the
          logical end offset of block group X

    So fix this by waiting for all cow and nocow writes after setting a block
    group to readonly mode.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana
     

13 May, 2016

1 commit

  • Before the relocation process of a block group starts, it sets the block
    group to readonly mode, then flushes all delalloc writes and then finally
    it waits for all ordered extents to complete. This last step includes
    waiting for ordered extents destined for extents allocated in other block
    groups, making us waste unnecessary time.

    So improve this by waiting only for ordered extents that fall into the
    block group's range.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Reviewed-by: Liu Bo

    Filipe Manana
     

22 Oct, 2015

1 commit

  • We have a mechanism to make sure we don't lose updates for ordered extents
    that were logged in the transaction that is currently running. We add the
    ordered extent to a transaction list and then the transaction waits on all
    the ordered extents in that list. However on substantially large file
    systems this list can be extremely large, and can give us soft lockups,
    since the ordered extents don't remove themselves from the list when they
    do complete.

    To fix this we simply add a counter to the transaction that is incremented
    any time we have a logged extent that needs to be completed in the current
    transaction. Then when the ordered extent finally completes it decrements
    the per-transaction counter and wakes up the transaction if we are the
    last ones. This will eliminate the softlockup. Thanks,
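
    A small user-space model of the counter scheme (hypothetical names, with
    C11 atomics standing in for the kernel's atomic_t and wait queues): each
    logged extent bumps the counter, each completion drops it, and the last
    completion is the one that wakes the waiting transaction.

    #include <stdatomic.h>
    #include <stdbool.h>

    struct transaction_sketch {
            atomic_int pending_ordered;   /* logged extents not yet complete */
    };

    static void log_extent(struct transaction_sketch *trans)
    {
            atomic_fetch_add(&trans->pending_ordered, 1);
    }

    /* True when this completion was the last one, i.e. the caller should
     * wake the transaction instead of the transaction walking a huge list. */
    static bool complete_extent(struct transaction_sketch *trans)
    {
            return atomic_fetch_sub(&trans->pending_ordered, 1) == 1;
    }

    /* The transaction side only waits for the counter to drain to zero. */
    static bool transaction_can_proceed(struct transaction_sketch *trans)
    {
            return atomic_load(&trans->pending_ordered) == 0;
    }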

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

02 Jul, 2015

1 commit

  • If we fail to submit a bio for a direct IO request, we were grabbing the
    corresponding ordered extent and decrementing its reference count twice,
    once for our lookup reference and once for the ordered tree reference.
    This was a problem because it caused the ordered extent to be freed
    without removing it from the ordered tree and any lists it might be
    attached to, leaving dangling pointers to the ordered extent around.
    Example trace with CONFIG_DEBUG_PAGEALLOC=y:

    [161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
    [161779.859983] IP: [] rb_prev+0x22/0x3b
    [161779.860636] PGD 34d818067 PUD 0
    [161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    (...)
    [161779.860636] Call Trace:
    [161779.860636] [] __tree_search+0xd9/0xf9 [btrfs]
    [161779.860636] [] tree_search+0x42/0x63 [btrfs]
    [161779.860636] [] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
    [161779.860636] [] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
    [161779.860636] [] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
    [161779.860636] [] do_blockdev_direct_IO+0x5ff/0xb43
    [161779.860636] [] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
    [161779.860636] [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
    [161779.860636] [] __blockdev_direct_IO+0x32/0x34
    [161779.860636] [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
    [161779.860636] [] btrfs_direct_IO+0x198/0x21f [btrfs]
    [161779.860636] [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
    [161779.860636] [] generic_file_direct_write+0xb3/0x128
    [161779.860636] [] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
    [161779.860636] [] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    (...)

    We were also not freeing the btrfs_dio_private we allocated previously,
    which kmemleak reported with the following trace in its sysfs file:

    unreferenced object 0xffff8803f553bf80 (size 96):
    comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
    hex dump (first 32 bytes):
    88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00 .l..............
    00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00 ................
    backtrace:
    [] create_object+0x172/0x29a
    [] kmemleak_alloc+0x25/0x41
    [] kmemleak_alloc_recursive.constprop.40+0x16/0x18
    [] kmem_cache_alloc_trace+0xfb/0x148
    [] btrfs_submit_direct+0x65/0x16a [btrfs]
    [] dio_bio_submit+0x62/0x8f
    [] do_blockdev_direct_IO+0x97e/0xb43
    [] __blockdev_direct_IO+0x32/0x34
    [] btrfs_direct_IO+0x198/0x21f [btrfs]
    [] generic_file_direct_write+0xb3/0x128
    [] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    [] __vfs_write+0x7c/0xa5
    [] vfs_write+0xa0/0xe4
    [] SyS_pwrite64+0x64/0x82
    [] system_call_fastpath+0x12/0x6f
    [] 0xffffffffffffffff

    For read requests we weren't doing any cleanup either (none of the work
    done by btrfs_endio_direct_read()), so a failure submitting a bio for a
    read request would leave a range in the inode's io_tree locked forever,
    blocking any future operations (both reads and writes) against that range.

    So fix this by making sure we do the same cleanup that we do for the case
    where the bio submission succeeds.
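
    Reduced to a small sketch (hypothetical names): the lookup takes one
    reference, so the error path must unlink the extent first (which drops
    the tree's reference) and then drop only its own lookup reference,
    instead of calling put twice while the extent is still linked.

    #include <stdlib.h>

    struct ordered_sketch {
            int refs;
            int in_tree;        /* still linked into the tree and lists */
    };

    static void put_ordered(struct ordered_sketch *oe)
    {
            if (--oe->refs == 0)
                    free(oe);   /* must already be unlinked at this point */
    }

    /* Error path mirrors the normal completion path. */
    static void submit_failed_cleanup(struct ordered_sketch *oe)
    {
            if (oe->in_tree) {
                    oe->in_tree = 0;    /* remove from the tree and lists */
                    put_ordered(oe);    /* drop the tree's reference */
            }
            put_ordered(oe);            /* drop our lookup reference */
    }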

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana