05 Nov, 2020

1 commit


05 Aug, 2020

1 commit

  • Delete repeated words in fs/xfs/.
    {we, that, the, a, to, fork}
    Change "it it" to "it is" in one location.

    Signed-off-by: Randy Dunlap
    To: linux-fsdevel@vger.kernel.org
    Cc: Darrick J. Wong
    Cc: linux-xfs@vger.kernel.org
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Randy Dunlap
     

07 Jul, 2020

9 commits

  • Move the double-inode locking helpers to xfs_inode.c since they're not
    specific to reflink.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • Refactor the two functions that we use to lock and unlock two inodes to
    block userspace from initiating IO against a file, whether via system
    calls or mmap activity.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • Fix the return value of xfs_reflink_remap_prep so that its return value
    conventions match the rest of xfs.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • If the source and destination map are identical, we can skip the remap
    step to save some time.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • When logging quota block count updates during a reflink operation, we
    only log the /delta/ of the block count changes to the dquot. Since we
    now know ahead of time the extent type of both dmap and smap (and that
    they have the same length), we know that we only need to reserve quota
    blocks for dmap's blockcount if we're mapping it into a hole.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
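
The reservation rule in this commit, together with the bmbt-split rule from the commit below it, can be sketched in plain userspace C. This is only an illustration under stated assumptions: the function name, the parameter names, and the idea of passing "touches the data fork" and "destination is a hole" as precomputed booleans are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: how many quota blocks to reserve for one
 * single-extent remap step.
 *
 * touches_data_fork  - assumption: caller already knows whether the
 *                      step will update the data fork
 * dest_is_hole       - assumption: caller already knows whether the
 *                      destination mapping (dmap) is a hole
 * dmap_blockcount    - length of the destination mapping in blocks
 * bmbt_split_blocks  - worst-case blocks for one bmap btree split
 */
unsigned long quota_blocks_to_reserve(bool touches_data_fork,
                                      bool dest_is_hole,
                                      unsigned long dmap_blockcount,
                                      unsigned long bmbt_split_blocks)
{
    unsigned long qres = 0;

    /* A bmbt split can only happen if the data fork is updated. */
    if (touches_data_fork)
        qres += bmbt_split_blocks;

    /* Only filling a hole increases the block count charged to the dquot. */
    if (dest_is_hole)
        qres += dmap_blockcount;

    return qres;
}
```

For example, replacing an already-allocated 10-block mapping reserves only the split blocks, while filling a 10-block hole reserves the split blocks plus 10.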
     
  • Now that we've reworked xfs_reflink_remap_extent to remap only one
    extent per transaction, we actually know if the extent being removed is
    an allocated mapping. This means that we now know ahead of time if
    we're going to be touching the data fork.

    Since we only need blocks for a bmbt split if we're going to update the
    data fork, we only need to get quota reservation if we know we're going
    to touch the data fork.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • The existing reflink remapping loop has some structural problems that
    need addressing:

    The biggest problem is that we create one transaction for each extent in
    the source file without accounting for the number of mappings there are
    for the same range in the destination file. In other words, we don't
    know the number of remap operations that will be necessary and we
    therefore cannot guess the block reservation required. On highly
    fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
    run out of block reservation, and fail.

    The second problem is that we don't actually use the bmap intents to
    their full potential -- instead of calling bunmapi directly and having
    to deal with its backwards operation, we could call the deferred ops
    xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This
    makes the frontend loop much simpler.

    Solve all of these problems by refactoring the remapping loops so that
    we only perform one remapping operation per transaction, and each
    operation only tries to remap a single extent from source to dest.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reported-by: Edwin Török
    Tested-by: Edwin Török

    Darrick J. Wong
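
The loop shape described above can be sketched in userspace C. This is a simplified model, not the kernel implementation: `struct mapping`, `lookup` and `count_remap_steps` are hypothetical names, mappings are assumed to be sorted and to cover the whole range, and source and destination are assumed to use the same file offsets. Each loop iteration stands in for one transaction that remaps exactly one extent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a file mapping covering blocks [off, off + len). */
struct mapping {
    unsigned long off;
    unsigned long len;
};

/* Return the mapping that contains block 'off', or NULL if none does. */
static const struct mapping *lookup(const struct mapping *maps, size_t n,
                                    unsigned long off)
{
    for (size_t i = 0; i < n; i++)
        if (off >= maps[i].off && off < maps[i].off + maps[i].len)
            return &maps[i];
    return NULL;
}

/*
 * Count the single-extent remap steps needed to cover [start, start + len).
 * Every step is trimmed so that it lies within one source mapping AND one
 * destination mapping, so the work per step is bounded and known up front.
 */
unsigned long count_remap_steps(const struct mapping *src, size_t nsrc,
                                const struct mapping *dst, size_t ndst,
                                unsigned long start, unsigned long len)
{
    unsigned long steps = 0;
    unsigned long off = start;
    unsigned long end = start + len;

    while (off < end) {
        const struct mapping *s = lookup(src, nsrc, off);
        const struct mapping *d = lookup(dst, ndst, off);

        assert(s != NULL && d != NULL);

        unsigned long s_end = s->off + s->len;
        unsigned long d_end = d->off + d->len;
        unsigned long step_end = s_end < d_end ? s_end : d_end;

        if (step_end > end)
            step_end = end;

        /* One transaction with a fixed reservation would run here. */
        steps++;
        off = step_end;
    }
    return steps;
}
```

With one 8-block source extent and a destination split into three mappings, the loop takes three steps rather than one, which is the case the old per-source-extent transaction under-reserved for.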
     
  • The name of this predicate is a little misleading -- it decides if the
    extent mapping is allocated and written. Change the name to be more
    direct, as we're going to add a new predicate in the next patch.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • Quota reservations are supposed to account for the blocks that might be
    allocated due to a bmap btree split. Reflink doesn't do this, so fix
    this to make the quota accounting more accurate before we start
    rearranging things.

    Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

13 Apr, 2020

1 commit

  • In the reflink extent remap function, it turns out that uirec (the block
    mapping corresponding only to the part of the passed-in mapping that got
    unmapped) was not fully initialized. Specifically, br_state was not
    being copied from the passed-in struct to the uirec. This could lead to
    unpredictable results such as the reflinked mapping being marked
    unwritten in the destination file.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

27 Jan, 2020

1 commit


21 Jan, 2020

1 commit

  • Fixes coccicheck warning:

    fs/xfs/xfs_reflink.c:236:9-10: WARNING: return of 0/1 in function 'xfs_inode_need_cow' with return type bool

    Reported-by: Hulk Robot
    Signed-off-by: zhengbin
    [darrick: rename the function so it doesn't sound like a predicate]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    zhengbin
     

15 Jan, 2020

1 commit

  • Introduce a new #define for the maximum supported file block offset.
    We'll use this in the next patch to make it more obvious that we're
    doing some operation for all possible inode fork mappings after a given
    offset. We can't use ULLONG_MAX here because bunmapi uses that to
    detect when it's done.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

24 Oct, 2019

1 commit

  • xfs_bmapi_write() takes a total block requirement parameter that is
    passed down to the block allocation code and is used to specify the
    total block requirement of the associated transaction. This is used
    to try and select an AG that can not only satisfy the requested
    extent allocation, but can also accommodate subsequent allocations
    that might be required to complete the transaction. For example,
    additional bmbt block allocations may be required on insertion of
    the resulting extent to an inode data fork.

    While it's important for callers to calculate and reserve such extra
    blocks in the transaction, it is not necessary to pass the total
    value to xfs_bmapi_write() in all cases. The latter automatically
    sets minleft to ensure that sufficient free blocks remain after the
    allocation attempt to expand the format of the associated inode
    (i.e., such as extent to btree conversion, btree splits, etc).
    Therefore, any callers that pass a total block requirement of the
    bmap mapping length plus worst case bmbt expansion essentially
    specify the additional reservation requirement twice. These callers
    can pass a total of zero to rely on the bmapi minleft policy.

    Beyond being superfluous, the primary motivation for this change is
    that the total reservation logic in the bmbt code is dubious in
    scenarios where minlen < maxlen and a maxlen extent cannot be
    allocated (which is more common for data extent allocations where
    contiguity is not required). The total value is based on maxlen in
    the xfs_bmapi_write() caller. If the bmbt code falls back to an
    allocation between minlen and maxlen, that allocation will not
    succeed until total is reset to minlen, which essentially throws
    away any additional reservation included in total by the caller. In
    addition, the total value is not reset until after alignment is
    dropped, which means that such callers drop alignment far more
    aggressively than necessary.

    Update all callers of xfs_bmapi_write() that pass a total block
    value of the mapping length plus bmbt reservation to instead pass
    zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation
    requirement. This trades off slightly less conservative AG selection
    for the ability to preserve alignment in more scenarios.
    xfs_bmapi_write() callers that incorporate unrelated or additional
    reservations in total beyond what is already included in minleft
    must continue to use the former.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

22 Oct, 2019

3 commits


21 Oct, 2019

1 commit


28 Aug, 2019

2 commits


19 Aug, 2019

1 commit

  • While trawling through the dedupe file comparison code trying to fix
    page deadlocking problems, Dave Chinner noticed that the reflink code
    only takes shared IOLOCK/MMAPLOCKs on the source file. Because
    page_mkwrite and directio writes do not take the EXCL versions of those
    locks, this means that reflink can race with writer processes.

    For pure remapping this can lead to undefined behavior and file
    corruption; for dedupe this means that we cannot be sure that the
    contents are identical when we decide to go ahead with the remapping.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

01 Jul, 2019

1 commit

  • Instead of a magic flag for xfs_trans_alloc, just ensure all callers
    that can't reclaim through the file system use memalloc_nofs_save to
    set the per-task nofs flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

29 Jun, 2019

1 commit

  • There are many, many xfs header files which are included but
    unneeded (or included twice) in the xfs code, so remove them.

    nb: xfs_linux.h includes about 9 headers for everyone, so those
    explicit includes get removed by this. I'm not sure what the
    preference is, but if we wanted explicit includes everywhere,
    a followup patch could remove those xfs_*.h includes from
    xfs_linux.h and move them into the files that need them.
    Or it could be left as-is.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     

26 Feb, 2019

2 commits


21 Feb, 2019

4 commits

  • Add a mode where XFS never overwrites existing blocks in place. This
    is to aid debugging our COW code, and also to put infrastructure in
    place for things like possible future support for zoned block devices,
    which can't support overwrites.

    This mode is enabled globally by doing a:

    echo 1 > /sys/fs/xfs/debug/always_cow

    Note that the parameter is global to allow running all tests in xfstests
    easily in this mode, which would not easily be possible with a per-fs
    sysfs file.

    In always_cow mode persistent preallocations are disabled, and fallocate
    will fail when called with a 0 mode (with or without
    FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for zeroed
    space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.

    There are a few interesting xfstests failures when run in always_cow
    mode:

    - generic/392 fails because the bytes used in the file used to test
      hole punch recovery are fewer after the log replay. This is
      because the blocks written and then punched out are only freed
      with a delay due to the logging mechanism.
    - xfs/170 will fail as the already fragile file streams mechanism
      doesn't seem to interact well with the COW allocator.
    - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
      the file system is badly fragmented, but there is not much we
      can do to avoid that when always writing out of place.
    - xfs/205 fails because overwriting a file in always_cow mode
      will require new space allocation, so the assumptions in the
      test no longer hold.
    - xfs/326 fails to modify the file at all in always_cow mode after
      injecting the refcount error, leading to an unexpected md5sum
      after the remount, but that again is expected.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • If we have racing buffered and direct I/O, COW fork extents under
    writeback can have been moved to the data fork by the time we call
    xfs_reflink_convert_cow from xfs_submit_ioend. This would be mostly
    harmless as the block numbers don't change by this move, except for
    the fact that xfs_bmapi_write will crash or trigger asserts when
    not finding existing extents, even despite trying to paper over this
    with the XFS_BMAPI_CONVERT_ONLY flag.

    Instead of special casing non-transaction conversions in the already
    way too complicated xfs_bmapi_write, just add a new helper for the much
    simpler non-transactional COW fork case, which simply ignores not
    found extents.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Besides simplifying the code a bit, this allows us to actually implement
    the behavior of using COW preallocation for non-COW data mentioned
    in the current comments.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • While using delalloc for extsize hints is generally a good idea, the
    current code that does so only for COW doesn't help us much and creates
    a lot of special cases. Switch it to use real allocations like we
    do for direct I/O.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

18 Feb, 2019

1 commit


13 Dec, 2018

1 commit

  • In xfs_reflink_end_cow, we allocate a single transaction for the entire
    end_cow operation and then loop the CoW fork mappings to move them to
    the data fork. This design fails on a heavily fragmented filesystem
    where an inode's data fork has exactly one more extent than would fit in
    an extents-format fork, because the unmap can collapse the data fork
    into extents format (freeing the bmbt block) but the remap can expand
    the data fork back into a (newly allocated) bmbt block. If the number
    of extents we end up remapping is large, we can overflow the block
    reservation because we reserved blocks assuming that we were adding
    mappings into an already-cleared area of the data fork.

    Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
    and the data fork can hold at most 7 extents before needing to convert
    to btree format; and that blocks A-P are discontiguous single-block
    extents:

       0......7
    D: ABCDEFGH
    C: IJKLMNOP

    When a write to file blocks 0-7 completes, we must remap I-P into the
    data fork. We start by removing H from the btree-format data fork. Now
    we have 7 extents, so we convert the fork to extents format, freeing the
    bmbt block. We then move P into the data fork and it now has 8 extents
    again. We must convert the data fork back to btree format, requiring a
    block allocation. If we repeat this sequence for blocks 6-5-4-3-2-1-0,
    we'll need a total of 8 block allocations to remap all 8 blocks. We
    reserved only enough blocks to handle one btree split (5 blocks on a 4k
    block filesystem), which means we overflow the block reservation.

    To fix this issue, create a separate helper function to remap a single
    extent, and change _reflink_end_cow to call it in a tight loop over the
    entire range we're completing. As a side effect this also removes the
    size restrictions on how many extents we can end_cow at a time, though
    nobody ever hit that. It is not reasonable to reserve N blocks to remap
    N blocks.

    Note that this can be reproduced after ~320 million fsx ops while
    running generic/938 (long soak directio fsx exerciser):

    XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116

    Call Trace:
    xfs_trans_dup+0x211/0x250 [xfs]
    xfs_trans_roll+0x6d/0x180 [xfs]
    xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
    xfs_defer_finish_noroll+0xdf/0x740 [xfs]
    xfs_defer_finish+0x13/0x70 [xfs]
    xfs_reflink_end_cow+0x2c6/0x680 [xfs]
    xfs_dio_write_end_io+0x115/0x220 [xfs]
    iomap_dio_complete+0x3f/0x130
    iomap_dio_rw+0x3c3/0x420
    xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
    xfs_file_write_iter+0x8b/0xc0 [xfs]
    __vfs_write+0x193/0x1f0
    vfs_write+0xba/0x1c0
    ksys_write+0x52/0xc0
    do_syscall_64+0x50/0x160
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
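
The worked example above can be replayed with a small userspace C simulation. It is a sketch under the same assumptions as the text (every extent is a discontiguous single-block extent, so nothing ever merges); the function and parameter names are hypothetical and the model tracks only the extent count and the fork format, not real bmbt blocks.

```c
#include <assert.h>

/*
 * Count how many bmbt block allocations occur when each CoW extent
 * replaces one data fork extent, one at a time.
 *
 * data_extents - extents currently in the data fork
 * cow_extents  - extents to move from the CoW fork into the data fork
 * max_inline   - most extents the fork can hold in extents format
 */
unsigned int count_bmbt_allocs(unsigned int data_extents,
                               unsigned int cow_extents,
                               unsigned int max_inline)
{
    unsigned int allocs = 0;
    int btree = data_extents > max_inline;  /* current fork format */

    for (unsigned int i = 0; i < cow_extents; i++) {
        /* Unmap the old extent; the fork may collapse to extents format. */
        assert(data_extents > 0);
        data_extents--;
        if (btree && data_extents <= max_inline)
            btree = 0;  /* bmbt block is freed */

        /* Map the replacement; the fork may need a new bmbt block. */
        data_extents++;
        if (!btree && data_extents > max_inline) {
            btree = 1;
            allocs++;
        }
    }
    return allocs;
}
```

Calling `count_bmbt_allocs(8, 8, 7)` models the scenario in the text and yields 8 allocations, more than the 5 blocks reserved for a single btree split. A fork that is well above the threshold (say 20 extents) never collapses and needs no new allocations in this model.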
     

22 Nov, 2018

1 commit

  • On a sub-page block size filesystem, fsx is failing with a data
    corruption after a series of operations involving copying a file
    with the destination offset beyond EOF of the destination file:

    8093(157 mod 256): TRUNCATE DOWN from 0x7a120 to 0x50000 ******WWWW
    8094(158 mod 256): INSERT 0x25000 thru 0x25fff (0x1000 bytes)
    8095(159 mod 256): COPY 0x18000 thru 0x1afff (0x3000 bytes) to 0x2f400
    8096(160 mod 256): WRITE 0x5da00 thru 0x651ff (0x7800 bytes) HOLE
    8097(161 mod 256): COPY 0x2000 thru 0x5fff (0x4000 bytes) to 0x6fc00

    The second copy here is beyond EOF, and it is to a sub-page (4k) but
    block-aligned (1k) offset. The clone runs the EOF zeroing, landing
    in a pre-existing post-eof delalloc extent. This zeroes the post-eof
    extents in the page cache just fine, dirtying the pages correctly.

    The problem is that xfs_reflink_remap_prep() now truncates the page
    cache over the range that it is copying it to, and rounds that down
    to cover the entire start page. This removes the dirty page over the
    delalloc extent from the page cache without having written it back.
    Hence later, when the page cache is flushed, the page at offset
    0x6f000 has not been written back and hence exposes stale data,
    which fsx trips over less than 10 operations later.

    Fix this by changing xfs_reflink_remap_prep() to use
    xfs_flush_unmap_range().

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
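
The rounding at the heart of this bug can be checked with a few lines of userspace C. This is a sketch only: the 4k page size and 1k block size come from the report above, while the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4k pages, as in the report */

/* Round a byte offset down to the start of its page. */
uint64_t page_round_down(uint64_t off)
{
    return off & ~((uint64_t)PAGE_BYTES - 1);
}

/*
 * Number of filesystem blocks that sit before dest_off in the same page.
 * A page-granular truncation starting at dest_off also discards these.
 */
uint64_t collateral_blocks(uint64_t dest_off, uint64_t block_size)
{
    assert(block_size != 0 && PAGE_BYTES % block_size == 0);
    return (dest_off - page_round_down(dest_off)) / block_size;
}
```

For the second COPY in the trace, the destination 0x6fc00 rounds down to the page at 0x6f000, so three 1k blocks ahead of the copy target share the page that was being dropped from the page cache.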
     

20 Nov, 2018

1 commit

  • Page writeback indirectly handles shared extents via the existence
    of overlapping COW fork blocks. If COW fork blocks exist, writeback
    always performs the associated copy-on-write regardless of whether
    the underlying blocks are actually shared. If the blocks are shared,
    then overlapping COW fork blocks must always exist.

    fstests shared/010 reproduces a case where a buffered write occurs
    over a shared block without performing the requisite COW fork
    reservation. This ultimately causes writeback to the shared extent
    and data corruption that is detected across md5 checks of the
    filesystem across a mount cycle.

    The problem occurs when a buffered write lands over a shared extent
    that crosses an extent size hint boundary and that also happens to
    have a partial COW reservation that doesn't cover the start and end
    blocks of the data fork extent.

    For example, a buffered write occurs across the file offset (in FSB
    units) range of [29, 57]. A shared extent exists at blocks [29, 35]
    and COW reservation already exists at blocks [32, 34]. After
    accommodating a COW extent size hint of 32 blocks and the existing
    reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
    blocks of reservation at offset 0 and returns with COW reservation
    across the range of [0, 34]. The associated data fork extent is
    still [29, 35], however, which isn't fully covered by the COW
    reservation.

    This leads to a buffered write at file offset 35 over a shared
    extent without associated COW reservation. Writeback eventually
    kicks in, performs an overwrite of the underlying shared block and
    causes the associated data corruption.

    Update xfs_reflink_reserve_cow() to accommodate the fact that a
    delalloc allocation request may not fully cover the extent in the
    data fork. Trim the data fork extent appropriately, just as is done
    for shared extent boundaries and/or existing COW reservations that
    happen to overlap the start of the data fork extent. This prevents
    shared/010 failures due to data corruption on reflink enabled
    filesystems.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
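
The trimming step can be illustrated with the numbers from the example above. The following userspace C sketch uses a hypothetical inclusive-range type and function name; it shows only the interval arithmetic, not the kernel's extent structures.

```c
#include <assert.h>

/* Hypothetical inclusive range of file blocks, in FSB units. */
struct fsb_range {
    unsigned long start;
    unsigned long end;
};

/*
 * Trim the data fork extent so that it covers only blocks that also
 * have COW reservation. The two ranges must overlap.
 */
struct fsb_range trim_to_reservation(struct fsb_range data,
                                     struct fsb_range cow)
{
    struct fsb_range out = data;

    assert(data.start <= data.end && cow.start <= cow.end);
    assert(data.start <= cow.end && cow.start <= data.end);

    if (out.start < cow.start)
        out.start = cow.start;
    if (out.end > cow.end)
        out.end = cow.end;
    return out;
}
```

With the data fork extent [29, 35] and COW reservation [0, 34], the trimmed extent is [29, 34]. Block 35 is left out, so the caller has to come back for it and reserve COW space before writing there.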
     

03 Nov, 2018

1 commit

  • Pull vfs dedup fixes from Dave Chinner:
    "This reworks the vfs data cloning infrastructure.

    We discovered many issues with these interfaces late in the 4.19 cycle
    - the worst of them (data corruption, setuid stripping) were fixed for
    XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all
    the problems was needed. That rework is the contents of this pull
    request.

    Rework the vfs_clone_file_range and vfs_dedupe_file_range
    infrastructure to use a common .remap_file_range method and supply
    generic bounds and sanity checking functions that are shared with the
    data write path. The current VFS infrastructure has problems with
    rlimit, LFS file sizes, file time stamps, maximum filesystem file
    sizes, stripping setuid bits, etc and so they are addressed in these
    commits.

    We also introduce the ability for the ->remap_file_range methods to
    return short clones so that clones for vfs_copy_file_range() don't get
    rejected if the entire range can't be cloned. It also allows
    filesystems to silently skip deduplication of partial EOF blocks if
    they are not capable of doing so without requiring errors to be thrown
    to userspace.

    Existing filesystems are converted to use the new remap_file_range
    method, and both XFS and ocfs2 are modified to make use of the new
    generic checking infrastructure"

    * tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
    xfs: remove [cm]time update from reflink calls
    xfs: remove xfs_reflink_remap_range
    xfs: remove redundant remap partial EOF block checks
    xfs: support returning partial reflink results
    xfs: clean up xfs_reflink_remap_blocks call site
    xfs: fix pagecache truncation prior to reflink
    ocfs2: remove ocfs2_reflink_remap_range
    ocfs2: support partial clone range and dedupe range
    ocfs2: fix pagecache truncation prior to reflink
    ocfs2: truncate page cache for clone destination file before remapping
    vfs: clean up generic_remap_file_range_prep return value
    vfs: hide file range comparison function
    vfs: enable remap callers that can handle short operations
    vfs: plumb remap flags through the vfs dedupe functions
    vfs: plumb remap flags through the vfs clone functions
    vfs: make remap_file_range functions take and return bytes completed
    vfs: remap helper should update destination inode metadata
    vfs: pass remap flags to generic_remap_checks
    vfs: pass remap flags to generic_remap_file_range_prep
    vfs: combine the clone and dedupe into a single remap_file_range
    ...

    Linus Torvalds
     

30 Oct, 2018

4 commits

  • Now that the vfs remap helper dirties the inode [cm]time for us, xfs no
    longer needs to do that on its own.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Since xfs_file_remap_range is a thin wrapper, move the contents of
    xfs_reflink_remap_range into the shell. This cuts down on the vfs
    calls being made from internal xfs code.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Now that we've moved the partial EOF block checks to the VFS helpers, we
    can remove the redundant functionality from XFS.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Back when the XFS reflink code only supported clone_file_range, we were
    only able to return zero or negative error codes to userspace. However,
    now that copy_file_range (which returns bytes copied) can use XFS'
    clone_file_range, we have the opportunity to return partial results.
    For example, if userspace sends a 1GB clone request and we run out of
    space halfway through, we at least can tell userspace that we completed
    512M of that request like a regular write.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
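
The return convention described above, bytes completed if any progress was made and an error code otherwise, can be sketched in userspace C. The function name and signature are hypothetical and are not the kernel's; the sketch only shows the short-write style of reporting.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: pick the value to report for a remap request.
 *
 * bytes_done - bytes successfully remapped before stopping (>= 0)
 * error      - 0 on success, or a negative errno value
 */
long long remap_result(long long bytes_done, int error)
{
    assert(bytes_done >= 0 && error <= 0);

    /* Partial progress wins over the error, like a short write. */
    if (bytes_done > 0)
        return bytes_done;
    return error;
}
```

In the 1GB example from the text, running out of space halfway gives `remap_result(512LL << 20, -ENOSPC)`, which reports the 512M that was completed. The caller would see the error itself only on a retry that makes no progress.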