07 Jul, 2020

2 commits


21 Jan, 2020

1 commit

  • Fixes coccicheck warning:

    fs/xfs/xfs_reflink.c:236:9-10: WARNING: return of 0/1 in function 'xfs_inode_need_cow' with return type bool

    Reported-by: Hulk Robot
    Signed-off-by: zhengbin
    [darrick: rename the function so it doesn't sound like a predicate]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    zhengbin
     

22 Oct, 2019

1 commit

  • xfs_reflink_allocate_cow consumes the source data fork imap, and
    potentially returns the COW fork imap. Split the arguments in two
    to clear up the calling conventions and to prepare for returning
    a source iomap from ->iomap_begin.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

26 Feb, 2019

1 commit


21 Feb, 2019

3 commits

  • Add a mode where XFS never overwrites existing blocks in place. This
    is to aid debugging our COW code, and also put infrastructure in place
    for things like possible future support for zoned block devices, which
    can't support overwrites.

    This mode is enabled globally by doing a:

    echo 1 > /sys/fs/xfs/debug/always_cow

    Note that the parameter is global to allow running all tests in xfstests
    easily in this mode, which would not easily be possible with a per-fs
    sysfs file.

    In always_cow mode persistent preallocations are disabled, and fallocate
    will fail when called with a 0 mode (with or without
    FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for zeroed space
    when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.

    There are a few interesting xfstests failures when run in always_cow
    mode:

    - generic/392 fails because the bytes used in the file used to test
      hole punch recovery are less after the log replay. This is
      because the blocks written and then punched out are only freed
      with a delay due to the logging mechanism.
    - xfs/170 will fail as the already fragile file streams mechanism
      doesn't seem to interact well with the COW allocator.
    - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
      the file system is badly fragmented, but there is not much we
      can do to avoid that when always writing out of place.
    - xfs/205 fails because overwriting a file in always_cow mode
      will require new space allocation, and the assumptions in the
      test thus no longer hold.
    - xfs/326 fails to modify the file at all in always_cow mode after
      injecting the refcount error, leading to an unexpected md5sum
      after the remount, but that again is expected.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
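
The fallocate restrictions described in the always_cow commit above can be summarized with a small toy classifier. This is only a sketch: the FALLOC_FL_* values are the ones from the Linux UAPI header, but the function name and the returned strings are hypothetical and are not kernel code. Modes that the commit message does not mention are assumed to be unaffected.

```python
# Flag values as defined in include/uapi/linux/falloc.h.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10
FALLOC_FL_UNSHARE_RANGE = 0x40


def always_cow_fallocate_policy(mode: int) -> str:
    """Hypothetical helper: how always_cow mode treats a fallocate mode."""
    # Plain preallocation (mode 0, with or without KEEP_SIZE) fails,
    # because persistent preallocations are disabled.
    if (mode & ~FALLOC_FL_KEEP_SIZE) == 0:
        return "fail"
    # Zeroing and unsharing still work, but no unwritten extents are
    # created for the zeroed space.
    if mode & (FALLOC_FL_ZERO_RANGE | FALLOC_FL_UNSHARE_RANGE):
        return "no-unwritten-extents"
    # Assumption: anything else is not covered by the commit message.
    return "unchanged"


print(always_cow_fallocate_policy(0))
print(always_cow_fallocate_policy(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE))
```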
     
  • Besides simplifying the code a bit, this makes it possible to actually implement
    the behavior of using COW preallocation for non-COW data mentioned
    in the current comments.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • While using delalloc for extsize hints is generally a good idea, the
    current code that does so only for COW doesn't help us much and creates
    a lot of special cases. Switch it to use real allocations like we
    do for direct I/O.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

03 Nov, 2018

1 commit

  • Pull vfs dedup fixes from Dave Chinner:
    "This reworks the vfs data cloning infrastructure.

    We discovered many issues with these interfaces late in the 4.19 cycle
    - the worst of them (data corruption, setuid stripping) were fixed for
    XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all
    the problems was needed. That rework is the contents of this pull
    request.

    Rework the vfs_clone_file_range and vfs_dedupe_file_range
    infrastructure to use a common .remap_file_range method and supply
    generic bounds and sanity checking functions that are shared with the
    data write path. The current VFS infrastructure has problems with
    rlimit, LFS file sizes, file time stamps, maximum filesystem file
    sizes, stripping setuid bits, etc and so they are addressed in these
    commits.

    We also introduce the ability for the ->remap_file_range methods to
    return short clones so that clones for vfs_copy_file_range() don't get
    rejected if the entire range can't be cloned. It also allows
    filesystems to silently skip deduplication of partial EOF blocks if
    they are not capable of doing so without requiring errors to be thrown
    to userspace.

    Existing filesystems are converted to use the new remap_file_range
    method, and both XFS and ocfs2 are modified to make use of the new
    generic checking infrastructure"

    * tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
    xfs: remove [cm]time update from reflink calls
    xfs: remove xfs_reflink_remap_range
    xfs: remove redundant remap partial EOF block checks
    xfs: support returning partial reflink results
    xfs: clean up xfs_reflink_remap_blocks call site
    xfs: fix pagecache truncation prior to reflink
    ocfs2: remove ocfs2_reflink_remap_range
    ocfs2: support partial clone range and dedupe range
    ocfs2: fix pagecache truncation prior to reflink
    ocfs2: truncate page cache for clone destination file before remapping
    vfs: clean up generic_remap_file_range_prep return value
    vfs: hide file range comparison function
    vfs: enable remap callers that can handle short operations
    vfs: plumb remap flags through the vfs dedupe functions
    vfs: plumb remap flags through the vfs clone functions
    vfs: make remap_file_range functions take and return bytes completed
    vfs: remap helper should update destination inode metadata
    vfs: pass remap flags to generic_remap_checks
    vfs: pass remap flags to generic_remap_file_range_prep
    vfs: combine the clone and dedupe into a single remap_file_range
    ...

    Linus Torvalds
     

30 Oct, 2018

4 commits

  • Since xfs_file_remap_range is a thin wrapper, move the contents of
    xfs_reflink_remap_range into the shell. This cuts down on the vfs
    calls being made from internal xfs code.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Back when the XFS reflink code only supported clone_file_range, we were
    only able to return zero or negative error codes to userspace. However,
    now that copy_file_range (which returns bytes copied) can use XFS'
    clone_file_range, we have the opportunity to return partial results.
    For example, if userspace sends a 1GB clone request and we run out of
    space halfway through, we at least can tell userspace that we completed
    512M of that request like a regular write.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
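
The 1GB example in the commit above can be modeled with a toy remap loop that reports how many bytes it completed instead of failing outright. This is a sketch under stated assumptions: `remap_blocks`, the per-chunk space budget, and the 1 MiB chunk size are all hypothetical and only illustrate the short-return convention.

```python
import errno
import os


def remap_blocks(request_bytes: int, space_budget: int, chunk: int = 1 << 20) -> int:
    """Toy remap loop: return bytes completed, or raise if nothing was done."""
    done = 0
    while done < request_bytes:
        step = min(chunk, request_bytes - done)
        if step > space_budget:
            break  # ran out of space part-way through the request
        space_budget -= step
        done += step
    if done == 0 and request_bytes > 0:
        # No progress at all: report the error, like a failed write.
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
    return done  # a short count, like a regular short write


# A 1 GiB request with room for only 512 MiB completes half the range.
print(remap_blocks(1 << 30, 512 << 20) >> 20)  # 512
```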
     
  • Change the remap_file_range functions to take a number of bytes to
    operate upon and return the number of bytes they operated on. This is a
    requirement for allowing fs implementations to return short clone/dedupe
    results to the user, which will enable us to obey resource limits in a
    graceful manner.

    A subsequent patch will enable copy_file_range to signal to the
    ->clone_file_range implementation that it can handle a short length,
    which will be returned in the function's return value. For now the
    short return is not implemented anywhere so the behavior won't change --
    either copy_file_range manages to clone the entire range or it tries an
    alternative.

    Neither clone ioctl can take advantage of this, alas.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Amir Goldstein
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Plumb the remap flags through the filesystem from the vfs function
    dispatcher all the way to the prep function to prepare for behavior
    changes in subsequent patches.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Amir Goldstein
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     

18 Oct, 2018

2 commits


12 Jul, 2018

2 commits

  • We only have one caller left, and open coding the simple extent list
    lookup in it allows us to make the code both more understandable and
    reuse calculations and variables already present.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • We already have to check for overlapping COW extents every time we
    come back to a page in xfs_writepage_map / xfs_map_cow, so this
    additional trim is not required.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

07 Jun, 2018

1 commit

  • Remove the verbose license text from XFS files and replace them
    with SPDX tags. This does not change the license of any of the code,
    merely refers to the common, up-to-date license files in LICENSES/

    This change was mostly scripted. fs/xfs/Makefile and
    fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
    and modified by the following command:

    for f in `git grep -l "GNU General" fs/xfs/` ; do
        echo $f
        cat $f | awk -f hdr.awk > $f.new
        mv -f $f.new $f
    done

    And the hdr.awk script that did the modification (including
    detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
    is as follows:

    $ cat hdr.awk
    BEGIN {
        hdr = 1.0
        tag = "GPL-2.0"
        str = ""
    }

    /^ \* This program is free software/ {
        hdr = 2.0;
        next
    }

    /any later version./ {
        tag = "GPL-2.0+"
        next
    }

    /^ \*\// {
        if (hdr > 0.0) {
            print "// SPDX-License-Identifier: " tag
            print str
            print $0
            str=""
            hdr = 0.0
            next
        }
        print $0
        next
    }

    /^ \* / {
        if (hdr > 1.0)
            next
        if (hdr > 0.0) {
            if (str != "")
                str = str "\n"
            str = str $0
            next
        }
        print $0
        next
    }

    /^ \*/ {
        if (hdr > 0.0)
            next
        print $0
        next
    }

    // {
        if (hdr > 0.0) {
            if (str != "")
                str = str "\n"
            str = str $0
            next
        }
        print $0
    }

    END { }
    $

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     

20 Jun, 2017

2 commits


08 Mar, 2017

1 commit

  • We only want to reclaim preallocations from our periodic work item.
    Currently this is achieved by looking for a dirty inode, but that check
    is rather fragile. Instead add a flag to xfs_reflink_cancel_cow_* so
    that the caller can ask for just cancelling unwritten extents in the COW
    fork.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    [darrick: fix typos in commit message]
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

07 Feb, 2017

1 commit


03 Feb, 2017

1 commit

  • Christoph Hellwig pointed out that there's a potentially nasty race when
    performing simultaneous nearby directio cow writes:

    "Thread 1 writes a range from B to C

    "                    B --------- C
                               p

    "a little later thread 2 writes from A to B

    "        A --------- B
                   p

    [editor's note: the 'p' marks denote cowextsize boundaries, which I added to
    make this more clear]

    "but the code preallocates beyond B into the range where thread
    "1 has just written, but ->end_io hasn't been called yet.
    "But once ->end_io is called thread 2 has already allocated
    "up to the extent size hint into the write range of thread 1,
    "so the end_io handler will splice the uninitialized blocks from
    "that preallocation back into the file right after B."

    We can avoid this race by ensuring that thread 1 cannot accidentally
    remap the blocks that thread 2 allocated (as part of speculative
    preallocation) as part of t2's write preparation in t1's end_io handler.
    The way we make this happen is by taking advantage of the unwritten
    extent flag as an intermediate step.

    Recall that when we begin the process of writing data to shared blocks,
    we create a delayed allocation extent in the CoW fork:

    D: --RRRRRRSSSRRRRRRRR---
    C: ------DDDDDDD---------

    When a thread prepares to CoW some dirty data out to disk, it will now
    convert the delalloc reservation into an /unwritten/ allocated extent in
    the cow fork. The da conversion code tries to opportunistically
    allocate as much of a (speculatively prealloc'd) extent as possible, so
    we may end up allocating a larger extent than we're actually writing
    out:

    D: --RRRRRRSSSRRRRRRRR---
    U: ------UUUUUUU---------

    Next, we convert only the part of the extent that we're actively
    planning to write to normal (i.e. not unwritten) status:

    D: --RRRRRRSSSRRRRRRRR---
    U: ------UURRUUU---------

    If the write succeeds, the end_cow function will now scan the relevant
    range of the CoW fork for real extents and remap only the real extents
    into the data fork:

    D: --RRRRRRRRSRRRRRRRR---
    U: ------UU--UUU---------

    This ensures that we never obliterate valid data fork extents with
    unwritten blocks from the CoW fork.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
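
The D/U diagrams in the commit above can be replayed with a toy per-block model, where each fork is a string with one character per block ('R' real, 'S' shared, 'D' delalloc, 'U' unwritten, '-' hole). This is only a sketch of the idea, not the kernel implementation; the function names are hypothetical.

```python
def allocate_unwritten(cow: str, start: int, end: int) -> str:
    """Convert the whole delalloc extent around [start, end) to unwritten."""
    lo, hi = start, end
    while lo > 0 and cow[lo - 1] == "D":
        lo -= 1
    while hi < len(cow) and cow[hi] == "D":
        hi += 1
    return cow[:lo] + cow[lo:hi].replace("D", "U") + cow[hi:]


def mark_real(cow: str, start: int, end: int) -> str:
    """Convert only the blocks being written, [start, end), to real."""
    return cow[:start] + cow[start:end].replace("U", "R") + cow[end:]


def end_cow(data: str, cow: str, start: int, end: int):
    """Remap real CoW blocks in [start, end) into the data fork; skip unwritten."""
    data_l, cow_l = list(data), list(cow)
    for i in range(start, end):
        if cow_l[i] == "R":
            data_l[i] = "R"  # the data fork now points at the new block
            cow_l[i] = "-"   # the mapping leaves the CoW fork
    return "".join(data_l), "".join(cow_l)


data = "--RRRRRRSSSRRRRRRRR---"
cow = "------DDDDDDD---------"
cow = allocate_unwritten(cow, 8, 10)  # ------UUUUUUU---------
cow = mark_real(cow, 8, 10)           # ------UURRUUU---------
# Even if completion scans the whole CoW extent, only real blocks move.
data, cow = end_cow(data, cow, 6, 13)
print(data)  # --RRRRRRRRSRRRRRRRR---
print(cow)   # ------UU--UUU---------
```

The unwritten blocks that another thread preallocated stay in the CoW fork, so they can never replace valid data fork extents.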
     

24 Nov, 2016

2 commits


08 Nov, 2016

1 commit

  • The cowblocks background scanner currently clears the cowblocks tag
    for inodes without any real allocations in the cow fork. This
    excludes inodes with only delalloc blocks in the cow fork. While we
    might never expect to clear delalloc blocks from the cow fork in the
    background scanner, it is not necessarily correct to clear the
    cowblocks tag from such inodes.

    For example, if the background scanner happens to process an inode
    between a buffered write and writeback, the scanner catches the
    inode in a state after delalloc blocks have been allocated to the
    cow fork but before the delalloc blocks have been converted to real
    blocks by writeback. The background scanner then incorrectly clears
    the cowblocks tag, even if part of the aforementioned delalloc
    reservation will not be remapped to the data fork (i.e., extra
    blocks due to the cowextsize hint). This means that any such
    additional blocks in the cow fork might never be reclaimed by the
    background scanner and could persist until the inode itself is
    reclaimed.

    To address this problem, only skip and clear inodes without any cow
    fork allocations whatsoever from the background scanner. While we
    generally do not want to cancel delalloc reservations from the
    background scanner, the pagecache dirty check following the
    cowblocks check should prevent that situation. If we do end up with
    delalloc cow fork blocks without a dirty address space mapping, this
    is probably an indication that something has gone wrong and the
    blocks should be reclaimed, as they may never be converted to a real
    allocation.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Brian Foster
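
The change to the scanner's check in the commit above can be sketched as two predicates over a toy CoW fork. The `CowFork` class and both function names are hypothetical; they only contrast the old rule (no real allocations) with the new rule (no allocations of any kind).

```python
from dataclasses import dataclass


@dataclass
class CowFork:
    real_blocks: int = 0      # blocks with real allocations
    delalloc_blocks: int = 0  # delayed-allocation reservations


def may_clear_tag_old(fork: CowFork) -> bool:
    """Old behavior: clear the tag when there are no real allocations."""
    return fork.real_blocks == 0


def may_clear_tag_new(fork: CowFork) -> bool:
    """New behavior: clear the tag only when the CoW fork is empty."""
    return fork.real_blocks == 0 and fork.delalloc_blocks == 0


# An inode caught between a buffered write and writeback.
in_flight = CowFork(real_blocks=0, delalloc_blocks=8)
print(may_clear_tag_old(in_flight))  # True
print(may_clear_tag_new(in_flight))  # False
```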
     

20 Oct, 2016

2 commits

  • Instead of reserving space as the first thing in write_begin move it past
    reading the extent in the data fork. That way we only have to read from
    the data fork once and can reuse that information for trimming the extent
    to the shared/unshared boundary. Additionally, this makes it easy to
    limit the actual write size to said boundary and avoids a roundtrip on the
    ilock.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • There is no clear division of responsibility between those functions, so
    just merge them into one to keep the code simple. Also move
    xfs_file_wait_for_io to xfs_reflink.c together with its only caller.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

06 Oct, 2016

7 commits

  • Trim CoW reservations made on behalf of a cowextsz hint if they get too
    old or we run low on quota, so long as we don't have dirty data awaiting
    writeback or directio operations in progress.

    Garbage collection of the cowextsize extents is kept separate from
    prealloc extent reaping because setting the CoW prealloc lifetime to a
    (much) higher value than the regular prealloc extent lifetime has been
    useful for combatting CoW fragmentation on VM hosts where the VMs
    experience bursty write behaviors and we can keep the utilization ratios
    low enough that we don't start to run out of space. IOWs, it benefits
    us to keep the CoW fork reservations around for as long as we can unless
    we run out of blocks or hit inode reclaim.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
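
The trimming conditions in the commit above reduce to a short predicate. This is a sketch only: the function and parameter names are hypothetical, and the real code works on inode state rather than plain values.

```python
def should_trim_cow_prealloc(age_s: float, lifetime_s: float,
                             low_on_quota: bool,
                             has_dirty_data: bool,
                             directio_in_progress: bool) -> bool:
    """Decide whether a CoW reservation may be garbage collected."""
    # Never trim while data is awaiting writeback or directio is running.
    if has_dirty_data or directio_in_progress:
        return False
    # Otherwise trim when the reservation is too old or quota runs low.
    return age_s >= lifetime_s or low_on_quota


print(should_trim_cow_prealloc(600, 300, False, False, False))
print(should_trim_cow_prealloc(600, 300, False, True, False))
```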
     
  • Unshare all shared extents if the user calls fallocate with the new
    unshare mode flag set, so that we can guarantee that a subsequent
    write will not ENOSPC.

    Signed-off-by: Darrick J. Wong
    [hch: pass inode instead of file to xfs_reflink_dirty_range,
    use iomap infrastructure for copy up]
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Define a VFS function which allows userspace to request that the
    kernel reflink a range of blocks between two files if the ranges'
    contents match. The function fits the new VFS ioctl that standardizes
    the checking for the btrfs EXTENT SAME ioctl.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Reflink extents from one file to another; that is to say, iteratively
    remove the mappings from the destination file, copy the mappings from
    the source file to the destination file, and increment the reference
    count of all the blocks that got remapped.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
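
The three steps named in the commit above (remove the destination mappings, copy the source mappings, increment the reference counts) can be sketched with dictionaries. This is a toy model with hypothetical names: files map a block offset to a physical block number, every allocated block carries a reference count, and the source and destination ranges use the same offsets for simplicity.

```python
def reflink_range(src: dict, dst: dict, refcount: dict,
                  start: int, length: int) -> None:
    """Share src's blocks in [start, start + length) with dst."""
    for off in range(start, start + length):
        # 1) Remove the existing mapping from the destination file.
        old = dst.pop(off, None)
        if old is not None:
            refcount[old] -= 1
            if refcount[old] == 0:
                del refcount[old]  # nobody references the block any more
        # 2) Copy the source mapping, 3) bump the block's reference count.
        blk = src.get(off)
        if blk is not None:
            dst[off] = blk
            refcount[blk] += 1


src = {0: 100, 1: 101}
dst = {0: 200, 1: 201, 2: 202}
refcount = {100: 1, 101: 1, 200: 1, 201: 1, 202: 1}
reflink_range(src, dst, refcount, 0, 2)
print(dst)
print(refcount)
```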
     
  • Due to the way the CoW algorithm in XFS works, there's an interval
    during which blocks allocated to handle a CoW can be lost -- if the FS
    goes down after the blocks are allocated but before the block
    remapping takes place. This is exacerbated by the cowextsz hint --
    allocated reservations can sit around for a while, waiting to get
    used.

    Since the refcount btree doesn't normally store records with refcount
    of 1, we can use it to record these in-progress extents. In-progress
    blocks cannot be shared because they're not user-visible, so there
    shouldn't be any conflicts with other programs. This is a better
    solution than holding EFIs during writeback because (a) EFIs can't be
    relogged currently, (b) even if they could, EFIs are bound by
    available log space, which puts an unnecessary upper bound on how much
    CoW we can have in flight, and (c) we already have a mechanism to
    track blocks.

    At mount time, read the refcount records and free anything we find
    with a refcount of 1 because those were in-progress when the FS went
    down.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
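
The mount-time cleanup described in the commit above can be sketched as a pass over refcount records that frees every extent whose count is 1. The record layout `(start, length, refcount)` and the `free_extent` callback are hypothetical stand-ins for the real btree records and allocator.

```python
def recover_cow_leftovers(records, free_extent):
    """Free in-progress CoW extents (refcount 1); keep real shared records."""
    survivors = []
    for start, length, count in records:
        if count == 1:
            # Staging extent that was in flight when the FS went down.
            free_extent(start, length)
        else:
            survivors.append((start, length, count))
    return survivors


freed = []
records = [(10, 4, 2), (20, 8, 1), (40, 2, 3), (50, 1, 1)]
remaining = recover_cow_leftovers(records, lambda s, l: freed.append((s, l)))
print(remaining)
print(freed)
```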
     
  • For O_DIRECT writes to shared blocks, we have to CoW them just like
    we would with buffered writes. For writes that are not block-aligned,
    just bounce them to the page cache.

    For block-aligned writes, however, we can do better than that. Use
    the same mechanisms that we employ for buffered CoW to set up a
    delalloc reservation, allocate all the blocks at once, issue the
    writes against the new blocks and use the same ioend functions to
    remap the blocks after the write. This should be fairly performant.

    Christoph discovered that xfs_reflink_allocate_cow_range may stumble
    over invalid entries in the extent array given that it drops the ilock
    but still expects the index to be stable. Simple fixing it to a new
    lookup for every iteration still isn't correct given that
    xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
    there is nothing preventing a xfs_bunmapi_cow call removing extents
    once we dropped the ilock either.

    This patch duplicates the inner loop of xfs_bmapi_allocate into a
    helper for xfs_reflink_allocate_cow_range so that it can be done under
    the same ilock critical section as our CoW fork delayed allocation.
    The directio CoW warts will be revisited in a later patch.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
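
The alignment rule from the commit above (unaligned direct writes bounce to the page cache, aligned ones take the CoW path) can be sketched as a small check. The function name and the returned labels are hypothetical.

```python
def directio_cow_path(offset: int, length: int, block_size: int) -> str:
    """Pick the write path for an O_DIRECT write to shared blocks."""
    if offset % block_size or length % block_size:
        return "buffered"    # not block-aligned: bounce to the page cache
    return "direct-cow"      # aligned: allocate new blocks, write, remap


print(directio_cow_path(4096, 8192, 4096))
print(directio_cow_path(512, 4096, 4096))
```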
     
  • After the write component of a copy-on-write operation finishes, clean up
    the bookkeeping left behind. On error, we simply free the new blocks
    and pass the error up. If we succeed, however, then we must remove
    the old data fork mapping and move the cow fork mapping to the data
    fork.

    Signed-off-by: Darrick J. Wong
    [hch: Call the CoW failure function during xfs_cancel_ioend]
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     

05 Oct, 2016

3 commits

  • Modify the writepage handler to find and convert pending delalloc
    extents to real allocations. Furthermore, when we're doing non-cow
    writes to a part of a file that already has a CoW reservation (the
    cowextsz hint that we set up in a subsequent patch facilitates this),
    promote the write to copy-on-write so that the entire extent can get
    written out as a single extent on disk, thereby reducing post-CoW
    fragmentation.

    Christoph moved the CoW support code in _map_blocks to a separate helper
    function, refactored other functions, and reduced the number of CoW fork
    lookups, so I merged those changes here to reduce churn.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Wire up iomap_begin to detect shared extents and create delayed allocation
    extents in the CoW fork:

    1) Check if we already have an extent in the COW fork for the area.
    If so nothing to do, we can move along.
    2) Look up the block number for the current extent; if there is none,
    it is not shared, so move along.
    3) Unshare the current extent as far as we are going to write into it.
    For this we avoid an additional COW fork lookup and use the
    information we set aside in step 1) above.
    4) Goto 1) unless we've covered the whole range.

    Last but not least, this updates the xfs_reflink_reserve_cow_range calling
    convention to pass a byte offset and length, as that is what both callers
    expect anyway. This patch has been refactored considerably as part of the
    iomap transition.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
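
The four steps listed in the commit above can be sketched as a block-granular loop. This is a toy model under stated assumptions: the data fork is a string with one character per block ('S' shared, 'R' unshared, '-' hole), the CoW fork is a set of block numbers that already have a reservation, and `reserve_cow_range` is a hypothetical name rather than the kernel function.

```python
def reserve_cow_range(data: str, cow: set, offset: int, length: int) -> set:
    """Reserve CoW blocks for the shared parts of [offset, offset + length)."""
    end = offset + length
    pos = offset
    while pos < end:                 # 4) repeat until the range is covered
        if pos in cow:               # 1) already in the CoW fork: move along
            pos += 1
            continue
        if data[pos] != "S":         # 2) hole or not shared: move along
            pos += 1
            continue
        # 3) Reserve the shared run, but only as far as we will write.
        while pos < end and data[pos] == "S" and pos not in cow:
            cow.add(pos)
            pos += 1
    return cow


data_fork = "RRSSSRR-SS"
cow_fork = {3}
print(sorted(reserve_cow_range(data_fork, cow_fork, 1, 8)))  # [2, 3, 4, 8]
```

Block 9 is shared but lies outside the write range, so it gets no reservation.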
     
  • Introduce a new in-core fork for storing copy-on-write delalloc
    reservations and allocated extents that are in the process of being
    written out.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong