
30 Nov, 2016

38 commits

  • David Sterba
     
  • This issue was found when I tried to delete a heavily reflinked file.
    When deleting such files, other transaction operations do not get a
    chance to make progress; for example, start_transaction() blocks in
    wait_current_trans(root) for a long time and sometimes even triggers
    soft lockups. The time taken to delete such a heavily reflinked file is
    also very large, often hundreds of seconds. Using perf top, it reports:

    PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
    ---------------------------------------------------------------------------------------
    84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80
    11.02% [kernel] [k] delay_tsc
    0.79% [kernel] [k] _raw_spin_unlock_irq
    0.78% [kernel] [k] _raw_spin_unlock_irqrestore
    0.45% [kernel] [k] do_raw_spin_lock
    0.18% [kernel] [k] __slab_alloc
    It seems __btrfs_run_delayed_refs() takes most of the CPU time. After
    some debugging I found that select_delayed_ref() is causing the issue:
    in our case the delayed head is full of BTRFS_DROP_DELAYED_REF nodes,
    but select_delayed_ref() first iterates the whole node list trying to
    find BTRFS_ADD_DELAYED_REF nodes, which is a disaster in this case and
    wastes a lot of time.

    To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head;
    in select_delayed_ref(), if this list is not empty, we can use its nodes
    directly. With this patch, deleting the same file takes only about 10~15
    seconds. Now perf top reports:

    PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
    ----------------------------------------------------------------------------------------
    20.74% [kernel] [k] _raw_spin_unlock_irqrestore
    16.33% [kernel] [k] __slab_alloc
    5.41% [kernel] [k] lock_acquired
    4.42% [kernel] [k] lock_acquire
    4.05% [kernel] [k] lock_release
    3.37% [kernel] [k] _raw_spin_unlock_irq

    This patch also helps for normal files; at least we no longer need to
    iterate the whole list to find BTRFS_ADD_DELAYED_REF nodes.

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Liu Bo
    Tested-by: Holger Hoffstätte
    Signed-off-by: David Sterba

    Wang Xiaoguang
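    To make the effect concrete, below is a minimal userspace C sketch of the
    idea behind the ref_add_list change above. It is not the kernel code:
    ref_head, ref_node, queue_ref and the select_ref_* helpers are simplified
    stand-ins for the real btrfs structures. With a head full of DROP refs,
    the old selection walks the whole list before giving up, while the
    dedicated add list answers in constant time.

    /* Simplified model of keeping ADD refs on their own list. */
    #include <stdio.h>
    #include <stdlib.h>

    enum ref_action { REF_ADD, REF_DROP };

    struct ref_node {
            enum ref_action action;
            struct ref_node *next;          /* link on the list of all refs */
            struct ref_node *add_next;      /* link on the ADD-only list    */
    };

    struct ref_head {
            struct ref_node *all_refs;      /* every queued ref             */
            struct ref_node *add_refs;      /* REF_ADD refs only            */
    };

    static void queue_ref(struct ref_head *head, enum ref_action action)
    {
            struct ref_node *node = calloc(1, sizeof(*node));

            if (!node)
                    exit(1);
            node->action = action;
            node->next = head->all_refs;
            head->all_refs = node;
            if (action == REF_ADD) {        /* maintain the extra list */
                    node->add_next = head->add_refs;
                    head->add_refs = node;
            }
    }

    /* Old behaviour: scan every ref looking for an ADD ref, O(n) per pick. */
    static struct ref_node *select_ref_slow(struct ref_head *head)
    {
            struct ref_node *node;

            for (node = head->all_refs; node; node = node->next)
                    if (node->action == REF_ADD)
                            return node;
            return head->all_refs;          /* fall back to a DROP ref */
    }

    /* New behaviour: ADD refs are immediately at hand, O(1) per pick. */
    static struct ref_node *select_ref_fast(struct ref_head *head)
    {
            return head->add_refs ? head->add_refs : head->all_refs;
    }

    int main(void)
    {
            struct ref_head head = { 0 };
            long i;

            for (i = 0; i < 1000000; i++)       /* heavily reflinked file:   */
                    queue_ref(&head, REF_DROP); /* the head is all DROP refs */

            printf("slow pick: %p\n", (void *)select_ref_slow(&head));
            printf("fast pick: %p\n", (void *)select_ref_fast(&head));
            return 0;
    }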
     
  • Commit 62b99540a1d91e464 (btrfs: relocation: Fix leaking qgroups numbers
    on data extents) only partly fixes the problem.

    The previous fix traces all new data extents at transaction commit time
    when balance finishes.

    However, balance is not done in one large transaction; every path
    replacement can happen in its own transaction, which makes the fix
    useless if a transaction commits during relocation.

    For example:
    relocate_block_group()
    |-merge_reloc_roots()
    | |- merge_reloc_root()
    | |- btrfs_start_transaction()
    Signed-off-by: Qu Wenruo
    Reviewed-and-Tested-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Move account_shared_subtree() to qgroup.c and rename it to
    btrfs_qgroup_trace_subtree().

    Do the same thing for account_leaf_items() and rename it to
    btrfs_qgroup_trace_leaf_items().

    Since all these functions are only used by qgroups, it is more
    appropriate to move them to qgroup.c and export them from there.

    Signed-off-by: Qu Wenruo
    Reviewed-and-Tested-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
    btrfs_qgroup_trace_extent(_nolock), according to the new
    reserve/trace/account naming schema.

    Signed-off-by: Qu Wenruo
    Reviewed-and-Tested-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Add an explanation of how btrfs qgroups work.

    Qgroup operation is split into 3 main phases:
    1) Reserve
    Ensure the qgroup doesn't exceed its limit

    2) Trace
    Inform qgroup which extents to trace

    3) Account
    Calculate the qgroup number change for each traced extent.

    This should save quite some time for new developers.

    Signed-off-by: Qu Wenruo
    Reviewed-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba

    Qu Wenruo
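    As a rough mental model of the three phases, here is a toy userspace C
    sketch. It is not btrfs code: the qgroup struct and the qgroup_reserve,
    qgroup_trace and qgroup_account helpers below are invented purely to
    illustrate the ordering of reserve, trace and account.

    #include <stdbool.h>
    #include <stdio.h>

    struct qgroup {
            long limit;     /* configured quota limit            */
            long reserved;  /* phase 1: space promised to writes */
            long used;      /* phase 3: space actually accounted */
    };

    /* Phase 1: refuse the write up front if it would exceed the limit. */
    static bool qgroup_reserve(struct qgroup *qg, long bytes)
    {
            if (qg->used + qg->reserved + bytes > qg->limit)
                    return false;
            qg->reserved += bytes;
            return true;
    }

    /* Phase 2: record which extent was touched; numbers are not final yet. */
    static void qgroup_trace(long bytenr, long bytes)
    {
            printf("traced extent at %ld, %ld bytes\n", bytenr, bytes);
    }

    /* Phase 3: at commit time, turn traced extents into final usage. */
    static void qgroup_account(struct qgroup *qg, long bytes)
    {
            qg->reserved -= bytes;
            qg->used += bytes;
    }

    int main(void)
    {
            struct qgroup qg = { .limit = 8192 };

            if (qgroup_reserve(&qg, 4096)) {
                    qgroup_trace(1048576, 4096);
                    qgroup_account(&qg, 4096);
            }
            printf("used %ld of %ld, reserved %ld\n",
                   qg.used, qg.limit, qg.reserved);
            return 0;
    }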
     
  • And remove the bogus check for a NULL return value from kmap, which
    can't happen. While we're at it: I don't think that kmapping up to 256
    pages will work without deadlocks on highmem machines; a better idea
    would be to use vm_map_ram to map all of them into a single virtual
    address range. Incidentally, that would also simplify the code a lot.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: David Sterba

    Christoph Hellwig
     
  • Rework the loop a little bit to use the generic bio_for_each_segment_all
    helper for iterating over the bio.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: David Sterba

    Christoph Hellwig
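    For readers unfamiliar with the helper, here is a small sketch of the
    iteration pattern, assuming a 4.x-era kernel where
    bio_for_each_segment_all() takes (bvec, bio, index); count_bio_bytes()
    is a made-up example, not a btrfs function.

    #include <linux/bio.h>

    static unsigned int count_bio_bytes(struct bio *bio)
    {
            struct bio_vec *bvec;
            unsigned int bytes = 0;
            int i;

            /* Visit every segment the bio carries, one bio_vec at a time. */
            bio_for_each_segment_all(bvec, bio, i)
                    bytes += bvec->bv_len;

            return bytes;
    }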
     
  • Use the bvec offset and len members to prepare for multipage bvecs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Sterba

    Christoph Hellwig
     
  • Instead of using bi_vcnt to calculate it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: David Sterba

    Christoph Hellwig
     
  • Use bio_for_each_segment_all to iterate over the segments instead.
    This requires a bit of reshuffling so that we only look up the ordered
    item once inside the bio_for_each_segment_all loop.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: David Sterba

    Christoph Hellwig
     
  • Just use bio_for_each_segment_all to iterate over all segments.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: David Sterba

    Christoph Hellwig
     
  • Just use bio_for_each_segment_all to iterate over all segments.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Omar Sandoval
    Signed-off-by: David Sterba

    Christoph Hellwig
     
  • Pass the full bio to the decompression routines and use bio iterators
    to iterate over the data in the bio.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David Sterba

    Christoph Hellwig
     
  • This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in
    btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount
    with qgroups enabled.

    I was able to reproduce this by setting up a small (~500kb) quota limit
    and writing a file one byte at a time until I hit the limit. The warnings
    would all hit on umount.

    The root cause is that we would reserve a block-sized range in both
    the reservation and the quota in btrfs_check_data_free_space, but if we
    encountered a problem (like e.g. EDQUOT), we would only release the single
    byte in the qgroup reservation. That caused an iotree state split, which
    increased the number of outstanding extents, in turn disallowing releasing
    the metadata reservation.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • At this point we will have dropped extent entries from the file, so if we fail
    to insert the new hole entries then we are leaving the fs in a corrupt state
    (albeit an easily fixed one). Abort the transaction if this happens so we can
    avoid corrupting the fs. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • In order to do hole punching we have a block reserve to hold the reservation we
    need to drop the extents in our range. Since we could end up dropping a lot of
    extents we set rsv->failfast so we can just loop around again and drop the
    remainder of the range. Unfortunately we unconditionally fill the hole extents
    in and start from the last extent we encountered, which we may or may not have
    dropped. So this can result in overlapping file extent entries, which can be
    tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
    fill_holes() after the search, or in btrfs_set_item_key_safe() in
    btrfs_drop_extent() at a later time by an unrelated task. Fix this by only
    setting drop_end to the last extent we did actually drop. This way our holes
    are filled in properly for the range that we did drop, and the rest of the range
    that remains to be dropped is actually dropped. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • If we process the last item in the leaf and hit an I/O error while
    reading the next leaf, we return -EIO without having adjusted the
    position. Since we have emitted dirents, getdents() will return
    the byte count to the user instead of the error. Subsequent callers
    will emit the last successful dirent again, and return -EIO again,
    with the same result. Callers loop forever.

    Instead, if we always increment ctx->pos after emitting or skipping
    the dirent, we'll be sure that we won't hit the same one again. When
    we go to process the next leaf, we won't have emitted any dirents
    and the -EIO will be returned to the user properly. We also don't
    need to track if we've emitted a dirent already or if we've changed
    the position yet.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
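    A toy userspace C sketch of the position-advance rule described above.
    It is not the btrfs code: dir_ctx, emit_dirent, read_next_leaf and
    iterate_dir are invented for illustration, and the error is simply
    returned rather than translated into a byte count as getdents() does.

    #include <stdbool.h>
    #include <stdio.h>

    struct dir_ctx {
            long pos;       /* resume point handed back to the caller */
    };

    static bool emit_dirent(struct dir_ctx *ctx, long index)
    {
            printf("entry %ld\n", index);
            return true;
    }

    /* Pretend the leaf holding entries past index 2 is unreadable. */
    static int read_next_leaf(long index)
    {
            return index > 2 ? -5 /* -EIO */ : 0;
    }

    static int iterate_dir(struct dir_ctx *ctx, long nr_entries)
    {
            long index;

            for (index = ctx->pos; index < nr_entries; index++) {
                    int ret = read_next_leaf(index);

                    if (ret)        /* nothing from this leaf was emitted */
                            return ret;
                    if (!emit_dirent(ctx, index))
                            return 0;
                    ctx->pos = index + 1;   /* never revisit this entry */
            }
            return 0;
    }

    int main(void)
    {
            struct dir_ctx ctx = { .pos = 0 };
            int ret = iterate_dir(&ctx, 10);

            /* Entries 0..2 are emitted once; a retry resumes at pos 3 and
             * sees the error again instead of replaying old entries. */
            printf("ret=%d, resume at pos=%ld\n", ret, ctx.pos);
            return 0;
    }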
     
  • Commit 3de4586c527 (Btrfs: Allow subvolumes and snapshots anywhere
    in the directory tree) introduced the current system of placing
    snapshots in the directory tree. It also introduced the behavior of
    creating the snapshot and then creating the directory entries for it.

    We've kept this code around for compatibility reasons, but it turns
    out that no file systems with the old tree_root based snapshots can
    be mounted on newer (>= 2009) kernels anyway. About a month after the
    above commit, commit 2a7108ad89e (Btrfs: rev the disk format for the
    inode compat and csum selection changes) landed, changing the superblock
    magic number.

    As a result, we know that we'll never encounter tree_root-based dirents
    or have to deal with skipping our own snapshot dirents. Since that
    also means that we're now only iterating over DIR_INDEX items, which only
    contain one directory entry per leaf item, we don't need to loop over
    the leaf item contents anymore either.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • If zlib_inflateInit2 fails, the input page is never unmapped.
    Add a call to kunmap when it fails.

    Signed-off-by: Nick Terrell
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nick Terrell
     
  • The balance status item contains the currently known filter values, but the
    stripes filter was unintentionally not among them. This means that an
    interrupted and automatically restarted balance does not apply the
    stripes filter.

    Fixes: dee32d0ac3719ef8d640efaf0884111df444730f
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: David Sterba

    David Sterba
     
  • 'btrfs_iget()' can not return NULL, so this test can be removed.

    Signed-off-by: Christophe JAILLET
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Christophe JAILLET
     
  • The csum member of struct btrfs_super_block is an array of u8, so it
    makes sense that btrfs_csum_final() is also declared to accept u8 *.
    I changed the declaration of the function from
    void btrfs_csum_final(u32 crc, char *result); to
    void btrfs_csum_final(u32 crc, u8 *result);

    Signed-off-by: Domagoj Tršan
    [ changed cast to u8 at several call sites ]
    Signed-off-by: David Sterba

    Domagoj Tršan
     
  • If we have

    |0--hole--4095||4096--preallocate--12287|

    instead of using the preallocated space, an 8K direct write will just
    create a new 8K extent and it'll end up with

    |0--new extent--8191||8192--preallocate--12287|

    This is because we find a hole em (extent map) and then go on to create
    a new 8K extent directly without adjusting @len.

    Signed-off-by: Liu Bo
    Reviewed-by: Chris Mason
    Signed-off-by: David Sterba

    Liu Bo
     
  • There is no need to call kfree() if memdup_user() fails, as no memory
    was allocated and the error-valued pointer should be returned directly.

    Signed-off-by: Shailendra Verma
    [ edit subject ]
    Signed-off-by: David Sterba

    Shailendra Verma
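    A sketch of the intended pattern, assuming kernel context; copy_args()
    is a made-up example rather than the function touched by this patch.

    #include <linux/err.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/uaccess.h>

    static void *copy_args(const void __user *uptr, size_t len)
    {
            void *args = memdup_user(uptr, len);

            /*
             * On failure memdup_user() leaves nothing allocated and returns
             * an error-valued pointer: pass it up as-is, no kfree() needed.
             */
            if (IS_ERR(args))
                    return args;

            /* ... use args ...; the caller kfree()s it when done. */
            return args;
    }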
     
  • copy_extent_buffer is suitable for copying between buffers at an
    arbitrary offset and deals with page boundaries. This is not necessary
    when doing a full extent_buffer-to-extent_buffer copy, where we can
    utilize the copy_page helper instead.

    Signed-off-by: David Sterba

    David Sterba
     
  • The only memset we do is to 0, so sink the parameter to the function and
    simplify all calls. Rename the function to reflect the behaviour.

    Signed-off-by: David Sterba

    David Sterba
     
  • copy_page is usually optimized and can be faster than memcpy.

    Signed-off-by: David Sterba

    David Sterba
     
  • Signed-off-by: David Sterba

    David Sterba
     
  • The fsid and chunk tree uuid are always located in the first page, so
    we don't need to use write_extent_buffer.

    Signed-off-by: David Sterba

    David Sterba
     
  • '__bdev' has never been used since
    0b86a832a1f38abec695864ec2eaedc9d2383f1b (2008).

    Signed-off-by: David Sterba

    David Sterba
     
  • Over time, the function has shrunk to the point that it just calls
    find_extent_buffer, merely passing the parameters through.

    Signed-off-by: David Sterba

    David Sterba
     
  • We dereference fs_info several times; besides, post-mount functions
    should never see a NULL fs_info.

    Signed-off-by: David Sterba

    David Sterba
     
  • The lock is held, we make the same lookup that previously failed with
    EEXIST, and we don't insert NULL pointers.

    Signed-off-by: David Sterba

    David Sterba
     
  • Originally, the eb and start were passed separately in case eb is NULL.
    Since readahead was refactored in 4.6, this is no longer the case and we
    can get rid of the parameter.

    Signed-off-by: David Sterba

    David Sterba
     
  • 'start' is not used since "btrfs: reada: Pass reada_extent into
    __readahead_hook directly" (6e39dbe8b9e55280c).

    Signed-off-by: David Sterba

    David Sterba
     
  • We can't touch the eb directly in case the function is called with a
    non-zero error, so we can read the eb level when needed.

    Signed-off-by: David Sterba

    David Sterba
     
  • The helpers are not meant to be generic and the name is misleading.
    Convert them to static inlines for type checking.

    Signed-off-by: David Sterba

    David Sterba
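    As an illustration of the idiom (the actual helpers are not named in this
    log; example_info and free_example_info below are invented), replacing a
    #define with a static inline gains compiler type checking of the argument:

    #include <linux/slab.h>

    struct example_info {
            int refs;
    };

    /* A macro version, e.g. "#define free_info(p) kfree(p)", would accept
     * any pointer without complaint. */
    static inline void free_example_info(struct example_info *info)
    {
            kfree(info);    /* only a struct example_info * is accepted now */
    }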