12 Jan, 2017

7 commits

  • commit 0260d8ff5f76617e3a55a1c471383ecb4404c3ad upstream.

    COW fork reservation is implemented via delayed allocation. The code is
    modeled after the traditional delalloc allocation code, but is slightly
    different in terms of how preallocation occurs. Rather than post-eof
    speculative preallocation, COW fork preallocation is implemented via a
    COW extent size hint that is designed to minimize fragmentation as a
    reflinked file is split over time.

    xfs_reflink_reserve_cow() still uses logic that is oriented towards
    dealing with post-eof speculative preallocation, however, and is stale
    or not necessarily correct. First, the EOF alignment to the COW extent
    size hint is implemented in xfs_bmapi_reserve_delalloc() (which does so
    correctly by aligning the start and end offsets) and so is not necessary
    in xfs_reflink_reserve_cow(). The backoff and retry logic on ENOSPC is
    also ineffective for the same reason, as xfs_bmapi_reserve_delalloc()
    will simply perform the same allocation request on the retry. Finally,
    since the COW extent size hint aligns the start and end offset of the
    range to allocate, the end_fsb != orig_end_fsb logic is not sufficient.
    Indeed, if a write request happens to end on an aligned offset, it is
    possible that we do not tag the inode for COW preallocation even though
    xfs_bmapi_reserve_delalloc() may have preallocated at the start offset.
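
    The missed-tag case described above can be illustrated with a minimal,
    self-contained sketch. It is not the kernel code: struct fsb_range and
    the three functions are hypothetical names, and the alignment is a
    plain round-down/round-up to the hint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical file-offset range in filesystem blocks: [start, end). */
struct fsb_range {
	uint64_t start;
	uint64_t end;
};

/* Round both ends of the range outward to a multiple of the hint. */
struct fsb_range align_to_hint(struct fsb_range r, uint64_t hint)
{
	struct fsb_range out;

	assert(hint > 0 && r.start <= r.end);
	out.start = r.start - (r.start % hint);
	out.end = ((r.end + hint - 1) / hint) * hint;
	return out;
}

/* The stale check: it only notices blocks added past the end. */
bool prealloc_seen_by_end_check(struct fsb_range orig, struct fsb_range got)
{
	return got.end != orig.end;
}

/* A check that also notices blocks added in front of the start. */
bool prealloc_actual(struct fsb_range orig, struct fsb_range got)
{
	return got.start != orig.start || got.end != orig.end;
}
```

    With a hint of 16 blocks, a write covering blocks [5, 32) is widened
    to [0, 32): the end is unchanged, so the end-only check reports no
    preallocation even though five extra blocks were reserved at the
    start.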

    Kill the unnecessary, duplicate code in xfs_reflink_reserve_cow().
    Remove the inode tag logic as well since xfs_bmapi_reserve_delalloc()
    has been updated to tag the inode correctly.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 2755fc4438501c8c28e7783df890e889f6772bee upstream.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 974ae922efd93b07b6cdf989ae959883f6f05fd8 upstream.

    Speculative preallocation is currently processed entirely by the callers
    of xfs_bmapi_reserve_delalloc(). The caller determines how much
    preallocation to include, adjusts the extent length and passes down the
    resulting request.

    While this works fine for post-eof speculative preallocation, it is not
    as reliable for COW fork preallocation. COW fork preallocation is
    implemented via the cowextszhint, which aligns the start offset as well
    as the length of the extent. Further, it is difficult for the caller to
    accurately identify when preallocation occurs because the returned
    extent could have been merged with neighboring extents in the fork.

    To simplify this situation and facilitate further COW fork preallocation
    enhancements, update xfs_bmapi_reserve_delalloc() to take a separate
    preallocation parameter to incorporate into the allocation request. The
    preallocation blocks value is tacked onto the end of the request and
    adjusted to accommodate neighboring extents and extent size limits.
    Since xfs_bmapi_reserve_delalloc() now knows precisely how much
    preallocation was included in the allocation, it can also tag the inodes
    appropriately to support preallocation reclaim.

    Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to
    use the preallocation mechanism. This patch should not change behavior
    outside of correctly tagging reflink inodes when start offset
    preallocation occurs (which the caller does not handle correctly).

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 65c5f419788d623a0410eca1866134f5e4628594 upstream.

    We can easily look up the previous extent for the cases where we need it,
    which saves the callers from looking it up for us later in the series.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit fba3e594ef0ad911fa8f559732d588172f212d71 upstream.

    It turns out that btrfs and xfs had differing interpretations of what
    to do when the dedupe length is zero. Change xfs to follow btrfs'
    semantics so that the userland interface is consistent.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     
  • commit 5d829300bee000980a09ac2ccb761cb25867b67c upstream.

    The open-coded pattern:

    ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

    is all over the xfs code; provide a new helper
    xfs_iext_count(ifp) to count the number of inline extents
    in an inode fork.
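
    As a rough illustration of what such a helper replaces, here is a
    userspace sketch. struct bmbt_rec and struct ifork are simplified
    stand-ins for the kernel types and iext_count() is a hypothetical
    name; only the division mirrors the open-coded pattern quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for xfs_bmbt_rec_t: two 64-bit words per record. */
struct bmbt_rec {
	uint64_t l0;
	uint64_t l1;
};

/* Simplified stand-in for the inode fork; only the byte count matters. */
struct ifork {
	size_t if_bytes;	/* bytes of extent records held in the fork */
};

/* Number of extent records in the fork, computed in one place. */
size_t iext_count(const struct ifork *ifp)
{
	/* The byte count is expected to be a whole number of records. */
	assert(ifp->if_bytes % sizeof(struct bmbt_rec) == 0);
	return ifp->if_bytes / sizeof(struct bmbt_rec);
}
```

    Callers then write iext_count(ifp) instead of repeating the division
    and the cast at every site.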

    [dchinner: pick up several missed conversions]

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     
  • commit 399372349a7f9b2d7e56e4fa4467c69822d07024 upstream.

    The cowblocks background scanner currently clears the cowblocks tag
    for inodes without any real allocations in the cow fork. This
    excludes inodes with only delalloc blocks in the cow fork. While we
    might never expect to clear delalloc blocks from the cow fork in the
    background scanner, it is not necessarily correct to clear the
    cowblocks tag from such inodes.

    For example, if the background scanner happens to process an inode
    between a buffered write and writeback, the scanner catches the
    inode in a state after delalloc blocks have been allocated to the
    cow fork but before the delalloc blocks have been converted to real
    blocks by writeback. The background scanner then incorrectly clears
    the cowblocks tag, even if part of the aforementioned delalloc
    reservation will not be remapped to the data fork (i.e., extra
    blocks due to the cowextsize hint). This means that any such
    additional blocks in the cow fork might never be reclaimed by the
    background scanner and could persist until the inode itself is
    reclaimed.

    To address this problem, only skip and clear inodes without any cow
    fork allocations whatsoever from the background scanner. While we
    generally do not want to cancel delalloc reservations from the
    background scanner, the pagecache dirty check following the
    cowblocks check should prevent that situation. If we do end up with
    delalloc cow fork blocks without a dirty address space mapping, this
    is probably an indication that something has gone wrong and the
    blocks should be reclaimed, as they may never be converted to a real
    allocation.
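
    The decision order described above might be sketched as follows. The
    struct, enum and function names are hypothetical, and the real
    scanner works on inode and fork structures rather than on simple
    counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one inode's cow fork, for illustration only. */
struct cow_fork_state {
	size_t real_extents;		/* extents backed by real blocks */
	size_t delalloc_extents;	/* delayed-allocation reservations */
	bool pagecache_dirty;		/* dirty pages awaiting writeback */
};

enum scan_action { SCAN_CLEAR_TAG, SCAN_SKIP, SCAN_RECLAIM };

/* Old rule: the tag is cleared whenever there are no real allocations. */
bool old_rule_clears_tag(const struct cow_fork_state *s)
{
	return s->real_extents == 0;
}

/* New rule: clear the tag only if the cow fork holds nothing at all. */
enum scan_action scan_inode(const struct cow_fork_state *s)
{
	if (s->real_extents + s->delalloc_extents == 0)
		return SCAN_CLEAR_TAG;
	if (s->pagecache_dirty)
		return SCAN_SKIP;	/* writeback will convert the blocks */
	return SCAN_RECLAIM;
}
```

    For an inode caught between a buffered write and writeback (no real
    extents, some delalloc extents, dirty pagecache), the old rule clears
    the tag while the new rule skips the inode and leaves it tagged.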

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     

24 Oct, 2016

1 commit

  • The background cowblocks scan job takes care of scanning for inodes with
    potentially lingering blocks in the cow fork and clearing them out. If
    the background scanner reclaims the cow fork blocks, however, it doesn't
    immediately clear the cowblocks tag from the inode. Instead, the inode
    remains tagged until the background scanner comes around again,
    discovers the inode cow fork has no blocks, clears the tag and fires the
    trace_xfs_inode_free_cowblocks_invalid() tracepoint to indicate that the
    inode may have been incorrectly tagged.

    This is not a major functional problem as the tag is ultimately cleared.
    Nonetheless, clear the tag when an inode cow fork is explicitly emptied
    to avoid the extra round trip through the background scanner and
    spurious "invalid" tracepoint.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Brian Foster
     

20 Oct, 2016

8 commits

  • Instead of doing a full extent list search for each extent that is
    to be deleted using xfs_bmapi_read and then doing another one inside
    of xfs_bunmapi_cow, use the same scheme that xfs_bunmapi uses: look
    up the last extent to be deleted and then use the extent index to
    walk downward until we are outside the range to be deleted.
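
    The "one lookup, then walk by index" scheme can be sketched over a
    sorted array of extents. This is an illustration under simplifying
    assumptions: struct extent, find_last_before() and walk_down() are
    hypothetical names, and the extents are assumed to be sorted and
    non-overlapping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical extent: covers file blocks [off, off + len). */
struct extent {
	uint64_t off;
	uint64_t len;
};

/* Single lookup: index of the last extent starting before `end`, or -1. */
long find_last_before(const struct extent *ext, long n, uint64_t end)
{
	long lo = 0, hi = n - 1, found = -1;

	while (lo <= hi) {
		long mid = lo + (hi - lo) / 2;

		if (ext[mid].off < end) {
			found = mid;
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return found;
}

/*
 * Record the indices of all extents overlapping [start, end), highest
 * index first, and return how many were recorded.
 */
long walk_down(const struct extent *ext, long n, uint64_t start,
	       uint64_t end, long *out, long max)
{
	long count = 0;
	long idx = find_last_before(ext, n, end);

	assert(start < end);
	while (idx >= 0 && count < max) {
		if (ext[idx].off + ext[idx].len <= start)
			break;		/* walked out of the range */
		out[count++] = idx;
		idx--;
	}
	return count;
}
```

    Only find_last_before() searches the list; every further extent is
    reached by decrementing the index.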

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Rewrite xfs_reflink_cancel_cow_blocks so that we only do a search for
    the first extent in the extent list and then iterate over the remaining
    extents using the extent index, passing the extent we operate on
    directly to xfs_bmap_del_extent_delay or xfs_bmap_del_extent_cow instead
    of going through xfs_bunmapi and doing yet another extent list lookup.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Split out two helpers for deleting delayed or real extents from the COW fork.
    This allows calling them directly from xfs_reflink_cow_end_io once that
    function is refactored to iterate the extent tree. It will also allow
    reusing the delalloc deletion from xfs_bunmapi in the future.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Instead of reserving space as the first thing in write_begin, move it past
    reading the extent in the data fork. That way we only have to read from
    the data fork once and can reuse that information for trimming the extent
    to the shared/unshared boundary. Additionally, this makes it easy to
    limit the actual write size to said boundary and avoids a roundtrip on
    the ilock.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Delalloc extents in the extent list contain the number of reserved
    indirect blocks in their startblock value and don't use the magic
    DELAYSTARTBLOCK constant. Ensure that xfs_reflink_trim_around_shared
    handles them properly by checking for isnullstartblock().
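
    Why a plain comparison against a single constant misses these
    extents can be shown with a small model of the encoding. The bit
    widths below and all names are assumptions made for illustration;
    the authoritative definitions live in the XFS headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t fsblock_t;

#define VALBITS		17	/* low bits: indirect block count (assumed) */
#define MASKBITS	35	/* marker bits above them (assumed) */
#define NULLMASK	((((fsblock_t)1 << MASKBITS) - 1) << VALBITS)
#define DELAY_SENTINEL	((fsblock_t)-1)	/* stand-in for DELAYSTARTBLOCK */

/* Encode a delalloc "startblock": marker bits plus the reserved count. */
fsblock_t make_delalloc_startblock(unsigned int indlen)
{
	assert(indlen < (1u << VALBITS));
	return NULLMASK | indlen;
}

/* Mask-based test in the spirit of isnullstartblock(). */
bool is_null_startblock(fsblock_t b)
{
	return (b & NULLMASK) == NULLMASK;
}
```

    A value built by make_delalloc_startblock(4) differs from
    DELAY_SENTINEL, so an equality test against the sentinel treats it as
    a real block, whereas the mask-based test recognizes it.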

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • There is no clear division of responsibility between those functions, so
    just merge them into one to keep the code simple. Also move
    xfs_file_wait_for_io to xfs_reflink.c together with its only caller.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • We need the iolock protection to stabilize the IS_SWAPFILE and
    IS_IMMUTABLE values, and to prevent new buffered writers from
    re-dirtying the file data that we just wrote out.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • with gcc 4.1.2:

    fs/xfs/xfs_reflink.c: In function xfs_reflink_reserve_cow_range:
    fs/xfs/xfs_reflink.c:327: warning: error may be used uninitialized in this function

    Indeed, if "count" is zero, the function will return an uninitialized
    error value.

    While "count" is unlikely to be zero, this function is called through
    the public iomap API. Hence fix this by preinitializing error to zero.
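
    The pattern behind the warning, and the fix, look roughly like the
    sketch below. reserve_one(), reserve_range() and POOL_BLOCKS are
    hypothetical; the point is only that a loop which may run zero times
    must not be the sole place where the return value is assigned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define POOL_BLOCKS 8	/* hypothetical amount of space available */

/* Hypothetical per-block step: 0 on success, negative errno on failure. */
int reserve_one(size_t index)
{
	return index < POOL_BLOCKS ? 0 : -ENOSPC;
}

int reserve_range(size_t count)
{
	int error = 0;	/* preinitialized: a zero-length request succeeds */
	size_t i;

	for (i = 0; i < count; i++) {
		error = reserve_one(i);
		if (error)
			break;
	}
	return error;
}
```

    Without the initializer, reserve_range(0) would return whatever
    happened to be in the variable.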

    Fixes: 2a06705cd5954030 ("xfs: create delalloc extents in CoW fork")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Geert Uytterhoeven
     

10 Oct, 2016

4 commits


06 Oct, 2016

9 commits

  • Trim CoW reservations made on behalf of a cowextsz hint if they get too
    old or we run low on quota, so long as we don't have dirty data awaiting
    writeback or directio operations in progress.

    Garbage collection of the cowextsize extents is kept separate from
    prealloc extent reaping because setting the CoW prealloc lifetime to a
    (much) higher value than the regular prealloc extent lifetime has been
    useful for combatting CoW fragmentation on VM hosts where the VMs
    experience bursty write behaviors and we can keep the utilization ratios
    low enough that we don't start to run out of space. IOWs, it benefits
    us to keep the CoW fork reservations around for as long as we can unless
    we run out of blocks or hit inode reclaim.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • If the AG free space is down to the reserves, refuse to reflink our
    way out of space. Hopefully userspace will make a real copy and/or go
    elsewhere.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Create a per-inode extent size allocator hint for copy-on-write. This
    hint is separate from the existing extent size hint so that CoW can
    take advantage of the fragmentation-reducing properties of extent size
    hints without disabling delalloc for regular writes.

    The extent size hint that's fed to the allocator during a copy on
    write operation is the greater of the cowextsize and regular extsize
    hint.
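
    Expressed as code, the selection rule is a simple maximum. The
    function name and the 32-bit block-count types are assumptions made
    for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hint used for a CoW allocation: the larger of the two per-inode hints. */
uint32_t cow_alloc_hint(uint32_t extsize, uint32_t cowextsize)
{
	return cowextsize > extsize ? cowextsize : extsize;
}
```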

    During reflink, if we're sharing the entire source file to the entire
    destination file and the destination file doesn't already have a
    cowextsize hint, propagate the source file's cowextsize hint to the
    destination file.

    Furthermore, zero the bulkstat buffer prior to setting the fields
    so that we don't copy kernel memory contents into userspace.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Unshare all shared extents if the user calls fallocate with the new
    unshare mode flag set, so that we can guarantee that a subsequent
    write will not ENOSPC.

    Signed-off-by: Darrick J. Wong
    [hch: pass inode instead of file to xfs_reflink_dirty_range,
    use iomap infrastructure for copy up]
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Define a VFS function which allows userspace to request that the
    kernel reflink a range of blocks between two files if the ranges'
    contents match. The function fits the new VFS ioctl that standardizes
    the checking for the btrfs EXTENT SAME ioctl.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Reflink extents from one file to another; that is to say, iteratively
    remove the mappings from the destination file, copy the mappings from
    the source file to the destination file, and increment the reference
    count of all the blocks that got remapped.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Due to the way the CoW algorithm in XFS works, there's an interval
    during which blocks allocated to handle a CoW can be lost -- if the FS
    goes down after the blocks are allocated but before the block
    remapping takes place. This is exacerbated by the cowextsz hint --
    allocated reservations can sit around for a while, waiting to get
    used.

    Since the refcount btree doesn't normally store records with refcount
    of 1, we can use it to record these in-progress extents. In-progress
    blocks cannot be shared because they're not user-visible, so there
    shouldn't be any conflicts with other programs. This is a better
    solution than holding EFIs during writeback because (a) EFIs can't be
    relogged currently, (b) even if they could, EFIs are bound by
    available log space, which puts an unnecessary upper bound on how much
    CoW we can have in flight, and (c) we already have a mechanism to
    track blocks.

    At mount time, read the refcount records and free anything we find
    with a refcount of 1 because those were in-progress when the FS went
    down.
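
    A minimal model of that mount-time pass, assuming the records have
    already been read into an array: struct refc_rec and
    recover_cow_leftovers() are hypothetical, and the real code walks a
    btree and frees the extents through the allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of one refcount record. */
struct refc_rec {
	uint64_t startblock;
	uint64_t blockcount;
	uint32_t refcount;
};

/*
 * Drop every record with refcount == 1, compacting the array in place.
 * Returns the number of records kept; *freed_blocks receives the total
 * length of the dropped extents (the leftover CoW reservations).
 */
size_t recover_cow_leftovers(struct refc_rec *recs, size_t n,
			     uint64_t *freed_blocks)
{
	size_t kept = 0;
	size_t i;

	*freed_blocks = 0;
	for (i = 0; i < n; i++) {
		assert(recs[i].refcount >= 1);
		if (recs[i].refcount == 1) {
			*freed_blocks += recs[i].blockcount;
			continue;
		}
		recs[kept++] = recs[i];
	}
	return kept;
}
```

    Records with a refcount of 2 or more describe genuinely shared
    extents and are left untouched.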

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • For O_DIRECT writes to shared blocks, we have to CoW them just like
    we would with buffered writes. For writes that are not block-aligned,
    just bounce them to the page cache.

    For block-aligned writes, however, we can do better than that. Use
    the same mechanisms that we employ for buffered CoW to set up a
    delalloc reservation, allocate all the blocks at once, issue the
    writes against the new blocks and use the same ioend functions to
    remap the blocks after the write. This should be fairly performant.

    Christoph discovered that xfs_reflink_allocate_cow_range may stumble
    over invalid entries in the extent array given that it drops the ilock
    but still expects the index to be stable. Simply fixing it to a new
    lookup for every iteration still isn't correct given that
    xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
    there is nothing preventing a xfs_bunmapi_cow call removing extents
    once we dropped the ilock either.

    This patch duplicates the inner loop of xfs_bmapi_allocate into a
    helper for xfs_reflink_allocate_cow_range so that it can be done under
    the same ilock critical section as our CoW fork delayed allocation.
    The directio CoW warts will be revisited in a later patch.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • After the write component of a copy-write operation finishes, clean up
    the bookkeeping left behind. On error, we simply free the new blocks
    and pass the error up. If we succeed, however, then we must remove
    the old data fork mapping and move the cow fork mapping to the data
    fork.

    Signed-off-by: Darrick J. Wong
    [hch: Call the CoW failure function during xfs_cancel_ioend]
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     

05 Oct, 2016

3 commits

  • Modify the writepage handler to find and convert pending delalloc
    extents to real allocations. Furthermore, when we're doing non-cow
    writes to a part of a file that already has a CoW reservation (the
    cowextsz hint that we set up in a subsequent patch facilitates this),
    promote the write to copy-on-write so that the entire extent can get
    written out as a single extent on disk, thereby reducing post-CoW
    fragmentation.

    Christoph moved the CoW support code in _map_blocks to a separate helper
    function, refactored other functions, and reduced the number of CoW fork
    lookups, so I merged those changes here to reduce churn.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Wire up iomap_begin to detect shared extents and create delayed allocation
    extents in the CoW fork:

    1) Check if we already have an extent in the COW fork for the area.
    If so nothing to do, we can move along.
    2) Look up the block number for the current extent; if there is none,
    it's not shared, so move along.
    3) Unshare the current extent as far as we are going to write into it.
    For this we avoid an additional COW fork lookup and use the
    information we set aside in step 1) above.
    4) Goto 1) unless we've covered the whole range.
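
    The four steps can be sketched at block granularity as follows. This
    is a deliberately simplified model: struct file_block and
    reserve_cow_range() are hypothetical, the real code works on extents
    rather than single blocks, and the sharing state would come from the
    refcount btree rather than from a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_BLOCK 0	/* hypothetical marker for a hole in the data fork */

/* Hypothetical per-block view of a file. */
struct file_block {
	uint64_t data_block;	/* NO_BLOCK if the data fork has a hole */
	bool shared;		/* block is referenced by another file */
	bool in_cow_fork;	/* a CoW reservation already covers it */
};

/* Reserve CoW space for [first, first + count); returns blocks reserved. */
size_t reserve_cow_range(struct file_block *blocks, size_t nblocks,
			 size_t first, size_t count)
{
	size_t reserved = 0;
	size_t i;

	assert(first <= nblocks && count <= nblocks - first);
	for (i = first; i < first + count; i++) {	/* step 4: loop */
		struct file_block *b = &blocks[i];

		if (b->in_cow_fork)			/* step 1 */
			continue;
		if (b->data_block == NO_BLOCK)		/* step 2: a hole */
			continue;
		if (!b->shared)				/* step 2: not shared */
			continue;
		b->in_cow_fork = true;			/* step 3 */
		reserved++;
	}
	return reserved;
}
```

    Calling it a second time on the same range reserves nothing, because
    step 1 now finds the existing reservations.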

    Last but not least, this updates the xfs_reflink_reserve_cow_range calling
    convention to pass a byte offset and length, as that is what both callers
    expect anyway. This patch has been refactored considerably as part of the
    iomap transition.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Introduce a new in-core fork for storing copy-on-write delalloc
    reservations and allocated extents that are in the process of being
    written out.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong