20 Nov, 2020

2 commits

  • This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f.

    Your maintainer committed a major braino in the rmap code by adding the
    attr fork, bmbt, and unwritten extent usage bits into rmap record key
    comparisons. While XFS uses the usage bits *in the rmap records* for
    cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
    the owner and offset information to distinguish between reverse mappings
    of the same physical extent into the data fork of a file at multiple
    offsets. The other bits are not important for key comparisons for index
    lookups, and never have been.
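
    As a rough illustration of the intended behaviour (a sketch only, with
    stand-in struct and mask names, not the kernel's exact code), a key
    comparison in this spirit masks the flag bits out of the on-disk
    offset before comparing:

    #include <stdint.h>

    /* low bits hold the file offset; the flag bits live above them */
    #define RMAP_OFF_MASK    ((1ULL << 54) - 1)

    struct rmap_key {
        uint32_t    rm_startblock;  /* physical block */
        uint64_t    rm_owner;       /* inode number or metadata owner */
        uint64_t    rm_offset;      /* offset plus encoded flag bits */
    };

    /* compare by physical block, then owner, then masked offset only */
    static int64_t
    rmap_key_diff(const struct rmap_key *k1, const struct rmap_key *k2)
    {
        if (k1->rm_startblock != k2->rm_startblock)
            return (int64_t)k1->rm_startblock - (int64_t)k2->rm_startblock;
        if (k1->rm_owner != k2->rm_owner)
            return k1->rm_owner > k2->rm_owner ? 1 : -1;
        if ((k1->rm_offset & RMAP_OFF_MASK) != (k2->rm_offset & RMAP_OFF_MASK))
            return (k1->rm_offset & RMAP_OFF_MASK) >
                   (k2->rm_offset & RMAP_OFF_MASK) ? 1 : -1;
        return 0;
    }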

    Eric Sandeen reports that this causes regressions in generic/299, so
    undo this patch before it does more damage.

    Reported-by: Eric Sandeen
    Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen

    Darrick J. Wong
     
  • Jens has reported a situation where partial direct IOs can be issued
    and completed yet still return -EAGAIN. We don't want this to report
    a short IO as we want XFS to complete user DIO entirely or not at
    all.

    This partial IO situation can occur on a write IO that is split
    across an allocated extent and a hole, and the second mapping is
    returning EAGAIN because allocation would be required.

    The trivial reproducer:

    $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
    wrote 4096/4096 bytes at offset 0
    4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
    pwrite: Resource temporarily unavailable
    $

    The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
    the first 4kB write:

    xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
    iomap_apply: dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
    xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
    xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
    xfs_iomap_found: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
    iomap_apply_dstmap: dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY

    Here the first iomap loop has mapped the first 4kB of the file and
    issued the IO, and we enter the second iomap_apply loop:

    iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
    xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
    xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin

    And we exit with -EAGAIN because we hit the allocation case trying
    to map the second 4kB block.

    Then IO completes on the first 4kB and the original IO context
    completes and unlocks the inode, returning -EAGAIN to userspace:

    xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
    xfs_iunlock: dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write

    There are other vectors to the same problem when we re-enter the
    mapping code if we have to make multiple mappings under NOWAIT
    conditions, e.g. failing trylocks, COW extents being found,
    allocation being required, and so on.

    Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
    to go ahead if the mapping we retrieve for the IO spans an entire
    allocated extent. This removes the possibility that subsequent
    mappings needed to complete the IO will trigger NOWAIT semantics,
    since NOWAIT IO will now enter the mapping code only once per NOWAIT
    IO.
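
    The shape of the fix, sketched with the relevant XFS types (treat
    this as an outline of the approach rather than the exact patch): fail
    the NOWAIT IO with -EAGAIN in the direct write iomap_begin path
    unless the extent we found spans the entire requested range.

    static bool
    imap_spans_range(
        struct xfs_bmbt_irec    *imap,
        xfs_fileoff_t           offset_fsb,
        xfs_fileoff_t           end_fsb)
    {
        if (imap->br_startoff > offset_fsb)
            return false;
        if (imap->br_startoff + imap->br_blockcount < end_fsb)
            return false;
        return true;
    }

        /* in xfs_direct_write_iomap_begin(), after the extent lookup */
        if (flags & IOMAP_NOWAIT) {
            error = -EAGAIN;
            if (!imap_spans_range(&imap, offset_fsb, end_fsb))
                goto out_unlock;
        }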

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     

19 Nov, 2020

6 commits

  • In xfs_initialize_perag(), if kmem_zalloc(), xfs_buf_hash_init(), or
    radix_tree_preload() fails, the returned value 'error' is not set
    accordingly.
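
    A minimal sketch of the fix (labels and allocation flags assumed from
    the era's code), making sure 'error' is set on every failure path:

        pag = kmem_zalloc(sizeof(*pag), KM_MAYFAIL);
        if (!pag) {
            error = -ENOMEM;    /* previously left stale */
            goto out_unwind;
        }
        error = xfs_buf_hash_init(pag);    /* previously unassigned */
        if (error)
            goto out_free_pag;
        error = radix_tree_preload(GFP_NOFS);
        if (error)
            goto out_hash_destroy;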

    Reported-as-fixing: 8b26c5825e02 ("xfs: handle ENOMEM correctly during initialisation of perag structures")
    Fixes: 9b2471797942 ("xfs: cache unlinked pointers in an rhashtable")
    Reported-by: Hulk Robot
    Signed-off-by: Yu Kuai
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Yu Kuai
     
  • The aim of the inode btree record iterator function is to call a
    callback on every record in the btree. To avoid having to tear down and
    recreate the inode btree cursor around every callback, it caches a
    certain number of records in a memory buffer. After each batch of
    callback invocations, we have to perform a btree lookup to find the
    next record after where we left off.

    However, if the keys of the inode btree are corrupt, the lookup might
    put us in the wrong part of the inode btree, causing the walk function
    to loop forever. Therefore, we add extra cursor tracking to make sure
    that we never go backwards, either when performing the lookup or when
    jumping to the next inobt record. This also fixes an off-by-one error
    where, upon resume, the lookup should have been for the inode /after/
    the point at which we stopped.
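
    A sketch of the extra tracking (the lastino field and surrounding
    context are assumptions, not necessarily the exact patch): remember
    the highest inode visited and treat any non-forward movement as
    corruption instead of looping.

        /* require strict forward progress from the last inode we saw */
        if (iwag->lastino != NULLFSINO &&
            XFS_AGINO_TO_INO(mp, agno, irec->ir_startino) <= iwag->lastino)
            return -EFSCORRUPTED;

        /* remember where this inode chunk ends */
        iwag->lastino = XFS_AGINO_TO_INO(mp, agno,
                irec->ir_startino + XFS_INODES_PER_CHUNK - 1);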

    Found by fuzzing xfs/460 with keys[2].startino = ones causing bulkstat
    and quotacheck to hang.

    Fixes: a211432c27ff ("xfs: create simplified inode walk function")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
    Commit e9e2eae89ddb dropped an (int) cast from XFS_LITINO(mp), and
    since a sizeof() expression is also involved, the result of
    XFS_LITINO(mp) now simply has the size_t type (commonly unsigned
    long).

    Consider the expression in xfs_attr_shortform_bytesfit():
    offset = (XFS_LITINO(mp) - bytes) >> 3;
    let "bytes" be (int)340, and
    "XFS_LITINO(mp)" be (unsigned long)336.

    On a 64-bit platform, the expression is
    offset = ((unsigned long)336 - (int)340) >> 3 =
    (int)(0xfffffffffffffffcUL >> 3) = -1

    but on a 32-bit platform, the expression is
    offset = ((unsigned long)336 - (int)340) >> 3 =
    (int)(0xfffffffcUL >> 3) = 0x1fffffff
    instead.

    So offset becomes a large positive number on 32-bit platforms, which
    causes xfs_attr_shortform_bytesfit() to return maxforkoff rather
    than 0.
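
    The promotion is easy to demonstrate in userspace with a standalone
    program that mimics the widths involved (an illustration, not XFS
    code):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int bytes = 340;

        /* 64-bit size_t: wraps to 0xfffffffffffffffc, shifts to
         * 0x1fffffffffffffff, truncates to (int)-1 */
        int off64 = (int)(((uint64_t)336 - bytes) >> 3);

        /* 32-bit size_t: wraps to 0xfffffffc, shifts to 0x1fffffff,
         * which stays a large positive int */
        int off32 = (int)(((uint32_t)336 - bytes) >> 3);

        printf("64-bit-like: %d\n", off64);    /* -1 */
        printf("32-bit-like: %d\n", off32);    /* 536870911 */
        return 0;
    }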

    Therefore, add an "ASSERT(new_size < bytes)" up front, as Eric
    suggested, and fix a misleading comment along with this bugfix, as
    Darrick suggested. The other users of XFS_LITINO(mp) do not seem to
    be impacted.

    Fixes: e9e2eae89ddb ("xfs: only check the superblock version for dinode size calculation")
    Cc: # 5.7+
    Reported-and-tested-by: Dennis Gilmore
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Gao Xiang
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Gao Xiang
     
  • Teach the directory scrubber to check all the bestfree entries,
    including the null ones. We want to be able to detect the case where
    the entry is null but there actually /is/ a directory data block.

    Found by fuzzing lbests[0] = ones in xfs/391.

    Fixes: df481968f33b ("xfs: scrub directory freespace")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • We always know the correct state of the rmap record flags (attr, bmbt,
    unwritten) so check them by direct comparison.

    Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • The comment and logic in xchk_btree_check_minrecs for dealing with
    inode-rooted btrees isn't quite correct. While the direct children of
    the inode root are allowed to have fewer records than what would
    normally be allowed for a regular ondisk btree block, this is only true
    if there is only one child block and the number of records don't fit in
    the inode root.

    Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

11 Nov, 2020

4 commits

  • Fix some serious WTF in the reference count scrubber's rmap fragment
    processing. The code comment says that this loop is supposed to move
    all fragment records starting at or before bno onto the worklist, but
    there's no obvious reason why nr (the number of items added) should
    increment starting from 1, and breaking the loop when we've added the
    target number seems dubious since we could have more rmap fragments that
    should have been added to the worklist.
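
    A sketch of the corrected loop shape (list and variable names are
    assumed): since the fragment list is sorted by rm_startblock, move
    every fragment that starts at or before bno, and stop only at the
    first one beyond it.

        list_for_each_entry_safe(frag, n, &refchk->fragments, list) {
            if (frag->rm.rm_startblock > bno)
                break;
            list_move_tail(&frag->list, &worklist);
            nr++;
        }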

    This seems to manifest in xfs/411 when adding one to the refcount field.

    Fixes: dbde19da9637 ("xfs: cross-reference the rmapbt data with the refcountbt")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Keys for extent interval records in the reverse mapping btree are
    supposed to be computed as follows:

    (physical block, owner, fork, is_btree, is_unwritten, offset)

    This provides users the ability to look up a reverse mapping from a bmbt
    record -- start with the physical block; then if there are multiple
    records for the same block, move on to the owner; then the inode fork
    type; and so on to the file offset.

    However, the key comparison functions incorrectly remove the
    fork/btree/unwritten information that's encoded in the on-disk offset.
    This means that lookup comparisons are only done with:

    (physical block, owner, offset)

    As a result, queries can return incorrect results. On consistent
    filesystems this hasn't been an issue because blocks are never shared
    between forks or with bmbt blocks, and are never unwritten. However,
    this bug means that online repair cannot always detect corruption in
    the key information in internal rmapbt nodes.

    Found by fuzzing keys[1].attrfork = ones on xfs/371.

    Fixes: 4b8ed67794fe ("xfs: add rmap btree operations")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • When the bmbt scrubber is looking up rmap extents, we need to set the
    extent flags from the bmbt record fully. This will matter once we fix
    the rmap btree comparison functions to check those flags correctly.

    Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Pass the same oldext argument (which contains the existing rmapping's
    unwritten state) to xfs_rmap_lookup_le_range at the start of
    xfs_rmap_convert_shared. At this point in the code, flags is zero,
    which means that we perform lookups using the wrong key.

    Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

05 Nov, 2020

5 commits

  • There's no reason to flush an entire file when we're unsharing part of
    a file. Therefore, only initiate writeback on the selected range.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • The kernel has always allowed directories to have the rtinherit flag
    set, even if there is no rt device, so this check is wrong.

    Fixes: 80e4e1268802 ("xfs: scrub inodes")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • In commit 7588cbeec6df, we tried to fix a race stemming from the lack of
    coordination between higher level code that wants to allocate and remap
    CoW fork extents into the data fork. Christoph cites as examples the
    always_cow mode and a directio write completion racing with
    writeback.

    According to the comments before the goto retry, we want to restart the
    lookup to catch the extent in the data fork, but we don't actually reset
    whichfork or cow_fsb, which means the second try executes using stale
    information. Up until now I think we've gotten lucky that either
    there's something left in the CoW fork to cause cow_fsb to be reset,
    or the data/cow fork sequence numbers have advanced enough to force a
    fresh lookup from the data fork. However, if we reach the retry with
    an empty, stable CoW fork and a stable data fork, neither of those
    things happens. The retry foolishly re-calls xfs_convert_blocks on
    the CoW fork, which fails again. This time, we toss the write.
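
    A minimal sketch of the fix, using the variable names from the
    discussion above: reset the lookup state every time we pass the retry
    label, so the second pass really does start from a clean data fork
    lookup.

    retry:
        cow_fsb = NULLFILEOFF;
        whichfork = XFS_DATA_FORK;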

    I've recently been working on extending reflink to the realtime device.
    When the realtime extent size is larger than a single block, we have to
    force the page cache to CoW the entire rt extent if a write (or
    fallocate) is not aligned with the rt extent size. The strategy I've
    chosen to deal with this is derived from Dave's blocksize > pagesize
    series: dirtying around the write range, and ensuring that writeback
    always starts mapping on an rt extent boundary. This has brought this
    race front and center, since generic/522 blows up immediately.

    However, I'm pretty sure this is a bug outright, independent of that.

    Fixes: 7588cbeec6df ("xfs: retry COW fork delalloc conversion when no extent was found")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • iomap writeback mapping failure only calls into ->discard_page() if
    the current page has not been added to the ioend. Accordingly, the
    XFS callback assumes a full page discard and invalidation. This is
    problematic for sub-page block size filesystems where some portion
    of a page might have been mapped successfully before a failure to
    map a delalloc block occurs. ->discard_page() is not called in that
    error scenario and the bio is explicitly failed by iomap via the
    error return from ->prepare_ioend(). As a result, the filesystem
    leaks delalloc blocks and corrupts the filesystem block counters.

    Since XFS is the only user of ->discard_page(), tweak the semantics
    to invoke the callback unconditionally on mapping errors and provide
    the file offset that failed to map. Update xfs_discard_page() to
    discard the corresponding portion of the file and pass the range
    along to iomap_invalidatepage(). The latter already properly handles
    both full and sub-page scenarios by not changing any iomap or page
    state on sub-page invalidations.
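
    Sketched below (the shape follows the description above; the exact
    upstream signature may differ), the callback now receives the file
    offset that failed to map, and XFS invalidates only from that point
    to the end of the page:

        /* iomap writeback ops: report where the mapping failed */
        void (*discard_page)(struct page *page, loff_t fileoff);

        /* xfs_discard_page(): punch out delalloc blocks from fileoff to
         * the end of the page, then let iomap decide how much page
         * state to tear down */
        iomap_invalidatepage(page, offset_in_page(fileoff),
                PAGE_SIZE - offset_in_page(fileoff));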

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • It is possible to expose non-zeroed post-EOF data in XFS if the new
    EOF page is dirty, backed by an unwritten block and the truncate
    happens to race with writeback. iomap_truncate_page() will not zero
    the post-EOF portion of the page if the underlying block is
    unwritten. The subsequent call to truncate_setsize() will, but
    doesn't dirty the page. Therefore, if writeback happens to complete
    after iomap_truncate_page() (so it still sees the unwritten block)
    but before truncate_setsize(), the cached page becomes inconsistent
    with the on-disk block. A mapped read after the associated page is
    reclaimed or invalidated exposes non-zero post-EOF data.

    For example, consider the following sequence when run on a kernel
    modified to explicitly flush the new EOF page within the race
    window:

    $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
    $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
    ...
    $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
    00000400: 00 00 00 00 00 00 00 00 ........
    $ umount /mnt/; mount /mnt/
    $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
    00000400: cd cd cd cd cd cd cd cd ........

    Update xfs_setattr_size() to explicitly flush the new EOF page prior
    to the page truncate to ensure iomap has the latest state of the
    underlying block.
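
    A sketch of the change in xfs_setattr_size (roughly the shape of the
    fix): write back the page that will span the new EOF before zeroing
    and truncating it, so iomap sees the final state of the underlying
    block.

        if (newsize < oldsize) {
            error = filemap_write_and_wait_range(inode->i_mapping,
                    newsize, newsize);
            if (error)
                return error;
        }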

    Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

29 Oct, 2020

1 commit

  • Make sure that we actually initialize xefi_discard when we're scheduling
    a deferred free of an AGFL block. This was (eventually) found by the
    UBSAN while I was banging on realtime rmap problems, but it exists in
    the upstream codebase. While we're at it, rearrange the structure to
    reduce the struct size from 64 to 56 bytes.

    Fixes: fcb762f5de2e ("xfs: add bmapi nodiscard flag")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

26 Oct, 2020

1 commit

  • Use a more generic form for __section that requires quotes to avoid
    complications with clang and gcc differences.

    Remove the quote operator # from compiler_attributes.h __section macro.

    Convert all unquoted __section(foo) uses to quoted __section("foo").
    Also convert __attribute__((section("foo"))) uses to __section("foo")
    even if the __attribute__ has multiple list entry forms.
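
    For reference, the shape of the change (the variable below is
    illustrative):

    /* before: the macro stringified its argument itself */
    #define __section(S)    __attribute__((__section__(#S)))
    static int boot_flag __section(.init.data);     /* unquoted */

    /* after: callers pass a real string literal */
    #define __section(section)  __attribute__((__section__(section)))
    static int boot_flag __section(".init.data");   /* quoted */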

    Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

    Signed-off-by: Joe Perches
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Miguel Ojeda
    Signed-off-by: Linus Torvalds

    Joe Perches
     

25 Oct, 2020

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff all over the place (the largest group here is
    Christoph's stat cleanups)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: remove KSTAT_QUERY_FLAGS
    fs: remove vfs_stat_set_lookup_flags
    fs: move vfs_fstatat out of line
    fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
    fs: remove vfs_statx_fd
    fs: omfs: use kmemdup() rather than kmalloc+memcpy
    [PATCH] reduce boilerplate in fsid handling
    fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
    selftests: mount: add nosymfollow tests
    Add a "nosymfollow" mount option.

    Linus Torvalds
     

24 Oct, 2020

1 commit

  • Pull xfs fixes from Darrick Wong:
    "Two bug fixes that trickled in during the merge window:

    - Make fallocate check the alignment of its arguments against the
    fundamental allocation unit of the volume the file lives on, so
    that we don't trigger the fs' alignment checks.

    - Cancel unprocessed log intents immediately when log recovery fails,
    to avoid a log deadlock"

    * tag 'xfs-5.10-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: cancel intents immediately if process_intents fails
    xfs: fix fallocate functions when rtextsize is larger than 1

    Linus Torvalds
     

22 Oct, 2020

2 commits

  • If processing recovered log intent items fails, we need to cancel all
    the unprocessed recovered items immediately so that a subsequent AIL
    push in the bail out path won't get wedged on the pinned intent items
    that didn't get processed.

    This can happen if the log contains (1) an intent that gets and releases
    an inode, (2) an intent that cannot be recovered successfully, and (3)
    some third intent item. When recovery of (2) fails, we leave (3) pinned
    in memory. Inode reclamation is called in the error-out path of
    xfs_mountfs before xfs_log_cancel_mount. Reclamation calls
    xfs_ail_push_all_sync, which gets stuck waiting for (3).

    Therefore, call xlog_recover_cancel_intents if _process_intents fails.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • In commit fe341eb151ec, I forgot that xfs_free_file_space isn't strictly
    a "remove mapped blocks" function. It is actually a function to zero
    file space by punching out the middle and writing zeroes to the
    unaligned ends of the specified range. Therefore, putting a rtextsize
    alignment check in that function is wrong because that breaks unaligned
    ZERO_RANGE on the realtime volume.

    Furthermore, xfs_file_fallocate already has alignment checks for the
    functions that require the file range to be aligned to the size of a
    fundamental allocation unit (which is 1 FSB on the data volume and 1
    rt extent on the realtime volume). Create a new helper to check
    fallocate arguments against the realtime allocation unit size, fix
    the fallocate
    frontend to use it, fix free_file_space to delete the correct range, and
    remove a now redundant check from insert_file_space.

    NOTE: The realtime extent size is not required to be a power of two!
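
    A sketch of such a helper (the name and plumbing are assumed):
    because the rt extent size can be any multiple of the block size,
    fall back to division when the allocation unit is not a power of
    two.

    static bool
    falloc_range_aligned(uint64_t pos, uint64_t len, uint64_t alloc_unit)
    {
        if (alloc_unit & (alloc_unit - 1))  /* not a power of two */
            return pos % alloc_unit == 0 && len % alloc_unit == 0;
        return ((pos | len) & (alloc_unit - 1)) == 0;
    }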

    Fixes: fe341eb151ec ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

20 Oct, 2020

1 commit

  • Pull more xfs updates from Darrick Wong:
    "The second large pile of new stuff for 5.10, with changes even more
    monumental than last week!

    We are formally announcing the deprecation of the V4 filesystem format
    in 2030. All users must upgrade to the V5 format, which contains
    design improvements that greatly strengthen metadata validation,
    supports reflink and online fsck, and is the intended vehicle for
    handling timestamps past 2038. We're also deprecating the old Irix
    behavioral tweaks in September 2025.

    Coming along for the ride are two design changes to the deferred
    metadata ops subsystem. One of the improvements is to retain correct
    logical ordering of tasks and subtasks, which is a more logical design
    for upper layers of XFS and will become necessary when we add atomic
    file range swaps and commits. The second improvement to deferred ops
    improves the scalability of the log by helping the log tail to move
    forward during long-running operations. This reduces log contention
    when there are a large number of threads trying to run transactions.

    In addition to that, this fixes numerous small bugs in log recovery;
    refactors logical intent log item recovery to remove the last
    remaining place in XFS where we could have nested transactions; fixes
    a couple of ways that intent log item recovery could fail in ways that
    wouldn't have happened in the regular commit paths; fixes a deadlock
    vector in the GETFSMAP implementation (which improves its performance
    by 20%); and fixes serious bugs in the realtime growfs, fallocate, and
    bitmap handling code.

    Summary:

    - Deprecate the V4 filesystem format, some disused mount options, and
    some legacy sysctl knobs now that we can support dates into the
    25th century. Note that removal of V4 support will not happen until
    the early 2030s.

    - Fix some problems with inode realtime flag propagation.

    - Fix some buffer handling issues when growing a rt filesystem.

    - Fix a problem where a BMAP_REMAP unmap call would free rt extents
    even though the purpose of BMAP_REMAP is to avoid freeing the
    blocks.

    - Strengthen the dabtree online scrubber to check hash values on
    child dabtree blocks.

    - Actually log new intent items created as part of recovering log
    intent items.

    - Fix a bug where quotas weren't attached to an inode undergoing bmap
    intent item recovery.

    - Fix a buffer overrun problem with specially crafted log buffer
    headers.

    - Various cleanups to type usage and slightly inaccurate comments.

    - More cleanups to the xattr, log, and quota code.

    - Don't run the (slower) shared-rmap operations on attr fork
    mappings.

    - Fix a bug where we failed to check the LSN of finobt blocks during
    replay and could therefore overwrite newer data with older data.

    - Clean up the ugly nested transaction mess that log recovery uses to
    stage intent item recovery in the correct order by creating a
    proper data structure to capture recovered chains.

    - Use the capture structure to resume intent item chains with the
    same log space and block reservations as when they were captured.

    - Fix a UAF bug in bmap intent item recovery where we failed to
    maintain our reference to the incore inode if the bmap operation
    needed to relog itself to continue.

    - Rearrange the defer ops mechanism to finish newly created subtasks
    of a parent task before moving on to the next parent task.

    - Automatically relog intent items in deferred ops chains if doing so
    would help us avoid pinning the log tail. This will help fix some
    log scaling problems now and will facilitate atomic file updates
    later.

    - Fix a deadlock in the GETFSMAP implementation by using an internal
    memory buffer to reduce indirect calls and copies to userspace,
    thereby improving its performance by ~20%.

    - Fix various problems when calling growfs on a realtime volume would
    not fully update the filesystem metadata.

    - Fix broken Kconfig asking about deprecated XFS when XFS is
    disabled"

    * tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
    xfs: fix high key handling in the rt allocator's query_range function
    xfs: annotate grabbing the realtime bitmap/summary locks in growfs
    xfs: make xfs_growfs_rt update secondary superblocks
    xfs: fix realtime bitmap/summary file truncation when growing rt volume
    xfs: fix the indent in xfs_trans_mod_dquot
    xfs: do the ASSERT for the arguments O_{u,g,p}dqpp
    xfs: fix deadlock and streamline xfs_getfsmap performance
    xfs: limit entries returned when counting fsmap records
    xfs: only relog deferred intent items if free space in the log gets low
    xfs: expose the log push threshold
    xfs: periodically relog deferred intent items
    xfs: change the order in which child and parent defer ops are finished
    xfs: fix an incore inode UAF in xfs_bui_recover
    xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
    xfs: clean up bmap intent item recovery checking
    xfs: xfs_defer_capture should absorb remaining transaction reservation
    xfs: xfs_defer_capture should absorb remaining block reservations
    xfs: proper replay of deferred ops queued during log recovery
    xfs: remove XFS_LI_RECOVERED
    ...

    Linus Torvalds
     

17 Oct, 2020

2 commits

  • Pavel Machek complained that the question about supporting deprecated
    XFS v4 comes up even when XFS is disabled. This clearly makes no sense,
    so fix Kconfig.

    Reported-by: Pavel Machek
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen

    Darrick J. Wong
     
  • Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key
    in the realtime bitmap is always one less than the number of rt extents,
    which means that the key clamp at the start of the function is wrong.
    The 4th argument to xfs_rtfind_forw is the highest rt extent that we
    want to probe, which means that passing 1 less than the high key is
    wrong. Finally, drop the rem variable that controls the loop because we
    can compare the iteration point (rtstart) against the high key directly.

    The sordid history of this function is that the original commit (fb3c3)
    incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
    to xfs_rtfind_forw. This was wrong because the "high key" is supposed
    to be the largest key for which the caller wants result rows, not the
    key for the first row that could possibly be outside the range that the
    caller wants to see.

    A subsequent attempt (8ad56) to strengthen the parameter checking
    added incorrect clamping of the parameters to the number of rt blocks
    in the system (despite the bitmap functions all taking units of rt
    extents) to avoid querying ranges past the end of the rt bitmap file,
    but failed to fix the incorrect _rtfind_forw parameter. The original
    _rtfind_forw
    parameter error then survived the conversion of the startblock and
    blockcount fields to rt extents (a0e5c), and the most recent off-by-one
    fix (a3a37) thought it was patching a problem when the end of the rt
    volume is not in use, but none of these fixes actually solved the
    original problem that the author was confused about the "limit" argument
    to xfs_rtfind_forw.

    Sadly, all four of these patches were written by this author and even
    his own usage of this function and rt testing were inadequate to get
    this fixed quickly.

    Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap")
    Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks")
    Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units")
    Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

15 Oct, 2020

1 commit

  • Pull xfs updates from Darrick Wong:
    "The biggest changes are two new features for the ondisk metadata: one
    to record the sizes of the inode btrees in the AG to increase
    redundancy checks and to improve mount times; and a second new feature
    to support timestamps until the year 2486.

    We also fixed a problem where reflinking into a file that requires
    synchronous writes wouldn't actually flush the updates to disk; clean
    up a fair amount of cruft; and started fixing some bugs in the
    realtime volume code.

    Summary:

    - Clean up the buffer ioend calling path so that the retry strategy
    isn't quite so scattered everywhere.

    - Clean up m_sb_bp handling.

    - New feature: storing inode btree counts in the AGI to speed up
    certain mount time per-AG block reservation operations and add a
    little more metadata redundancy.

    - New feature: Widen inode timestamps and quota grace expiration
    timestamps to support dates through the year 2486.

    - Get rid of more of our custom buffer allocation API wrappers.

    - Use a proper VLA for shortform xattr structure namevals.

    - Force the log after reflinking or deduping into a file that is
    opened with O_SYNC or O_DSYNC.

    - Fix some math errors in the realtime allocator"

    * tag 'xfs-5.10-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (42 commits)
    xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size
    xfs: make sure the rt allocator doesn't run off the end
    xfs: Remove unneeded semicolon
    xfs: force the log after remapping a synchronous-writes file
    xfs: Convert xfs_attr_sf macros to inline functions
    xfs: Use variable-size array for nameval in xfs_attr_sf_entry
    xfs: Remove typedef xfs_attr_shortform_t
    xfs: remove typedef xfs_attr_sf_entry_t
    xfs: Remove kmem_zalloc_large()
    xfs: enable big timestamps
    xfs: trace timestamp limits
    xfs: widen ondisk quota expiration timestamps to handle y2038+
    xfs: widen ondisk inode timestamps to deal with y2038+
    xfs: redefine xfs_ictimestamp_t
    xfs: redefine xfs_timestamp_t
    xfs: move xfs_log_dinode_to_disk to the log recovery code
    xfs: refactor quota timestamp coding
    xfs: refactor default quota grace period setting code
    xfs: refactor quota expiration timer modification
    xfs: explicitly define inode timestamp range
    ...

    Linus Torvalds
     

13 Oct, 2020

3 commits

  • Use XFS_ILOCK_RT{BITMAP,SUM} to annotate grabbing the rt bitmap and
    summary locks when we grow the realtime volume, just like we do most
    everywhere else. This shuts up lockdep warnings about grabbing the
    ILOCK class of locks recursively:

    ============================================
    WARNING: possible recursive locking detected
    5.9.0-rc4-djw #rc4 Tainted: G O
    --------------------------------------------
    xfs_growfs/4841 is trying to acquire lock:
    ffff888035acc230 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]

    but task is already holding lock:
    ffff888035acedb0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&xfs_nondir_ilock_class);
    lock(&xfs_nondir_ilock_class);

    *** DEADLOCK ***

    May be due to missing lock nesting notation
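
    The annotation itself is a one-flag change at each realtime lock
    site, roughly (a sketch, using the existing lockdep subclass flags):

        xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP);
        xfs_ilock(mp->m_rsumip, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);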

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • When we call growfs on the data device, we update the secondary
    superblocks to reflect the updated filesystem geometry. We need to do
    this for growfs on the realtime volume too, because a future xfs_repair
    run could try to fix the filesystem using a backup superblock.

    This was observed by the online superblock scrubbers while running
    xfs/233. One can also trigger this by growing an rt volume, cycling the
    mount, and creating new rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • The realtime bitmap and summary files are regular files that are hidden
    away from the directory tree. Since they're regular files, inode
    inactivation will try to purge what it thinks are speculative
    preallocations beyond the incore size of the file. Unfortunately,
    xfs_growfs_rt forgets to update the incore size when it resizes the
    inodes, with the result that inactivating the rt inodes at unmount time
    will cause their contents to be truncated.

    Fix this by updating the incore size when we change the ondisk size as
    part of updating the superblock. Note that we don't do this when we're
    allocating blocks to the rt inodes because we actually want those blocks
    to get purged if the growfs fails.
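
    A sketch of the change in xfs_growfs_rt (field names as in the era's
    codebase; treat as an outline): propagate the new ondisk size to the
    incore inode when the superblock fields are updated.

        mp->m_rbmip->i_d.di_size =
            nsbp->sb_rbmblocks * nsbp->sb_blocksize;
        i_size_write(VFS_I(mp->m_rbmip), mp->m_rbmip->i_d.di_size);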

    This fixes corruption complaints from the online rtsummary checker when
    running xfs/233. Since that test requires rmap, one can also trigger
    this by growing an rt volume, cycling the mount, and creating rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

07 Oct, 2020

9 commits

  • The formatting is strange in xfs_trans_mod_dquot, so do a reindent.

    Signed-off-by: Kaixu Xia
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Kaixu Xia
     
    If we pass in XFS_QMOPT_{U,G,P}QUOTA flags and a different
    uid/gid/prid than the ones currently associated with the inode, the
    arguments O_{u,g,p}dqpp shouldn't be NULL, so add ASSERTs for them.

    Signed-off-by: Kaixu Xia
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Kaixu Xia
     
  • Refactor xfs_getfsmap to improve its performance: instead of indirectly
    calling a function that copies one record to userspace at a time, create
    a shadow buffer in the kernel and copy the whole array once at the end.
    On the author's computer, this reduces the runtime on his /home by ~20%.

    This also eliminates a deadlock when running GETFSMAP against the
    realtime device. The current code locks the rtbitmap to create
    fsmappings and copies them into userspace, having not released the
    rtbitmap lock. If the userspace buffer is an mmap of a sparse file that
    itself resides on the realtime device, the write page fault will recurse
    into the fs for allocation, which will deadlock on the rtbitmap lock.

    Fixes: 4c934c7dd60c ("xfs: report realtime space information via the rtbitmap")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • If userspace asked fsmap to count the number of entries, we cannot
    return more than UINT_MAX entries because fmh_entries is u32.
    Therefore, stop counting if we hit this limit, or else we will waste
    time only to return truncated results.
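
    The guard is tiny; a sketch using the fsmap head fields (the
    surrounding context is assumed):

        /* just counting? stop once the u32 counter would saturate */
        if (info->head->fmh_count == 0) {
            if (info->head->fmh_entries == UINT_MAX)
                return -ECANCELED;
            info->head->fmh_entries++;
            return 0;
        }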

    Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • Now that we have the ability to ask the log how far the tail needs to be
    pushed to maintain its free space targets, augment the decision to relog
    an intent item so that we only do it if the log has hit the 75% full
    threshold. There's no point in relogging an intent into the same
    checkpoint, and there's no need to relog if there's plenty of free space
    in the log.
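
    Sketched below (local names assumed; dfp_intent is a log item pointer
    here), the per-item decision combines both tests: skip intents
    already in the current checkpoint, and only relog when the intent's
    LSN falls behind the push threshold.

        if (dfp->dfp_intent == NULL ||
            xfs_log_item_in_current_chkpt(dfp->dfp_intent))
            continue;
        if (threshold_lsn == NULLCOMMITLSN)
            threshold_lsn = xlog_grant_push_threshold(log, 0);
        if (threshold_lsn == NULLCOMMITLSN ||
            XFS_LSN_CMP(dfp->dfp_intent->li_lsn, threshold_lsn) >= 0)
            continue;
        /* ...relog the intent into the current transaction... */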

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • Separate the computation of the log push threshold and the push logic in
    xlog_grant_push_ail. This enables higher level code to determine (for
    example) that it is holding on to a logged intent item and the log is so
    busy that it is more than 75% full. In that case, it would be desirable
    to move the log item towards the head to release the tail, which we will
    cover in the next patch.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • There's a subtle design flaw in the deferred log item code that can lead
    to pinning the log tail. Taking up the defer ops chain examples from
    the previous commit, we can get trapped in sequences like this:

    Caller hands us a transaction t0 with D0-D3 attached. The defer ops
    chain will look like the following if the transaction rolls succeed:

    t1: D0(t0), D1(t0), D2(t0), D3(t0)
    t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
    t3: d5(t1), D1(t0), D2(t0), D3(t0)
    ...
    t9: d9(t7), D3(t0)
    t10: D3(t0)
    t11: d10(t10), d11(t10)
    t12: d11(t10)

    In transaction 9, we finish d9 and try to roll to t10 while holding onto
    an intent item for D3 that we logged in t0.

    The previous commit changed the order in which we place new defer ops in
    the defer ops processing chain to reduce the maximum chain length. Now
    make xfs_defer_finish_noroll capable of relogging the entire chain
    periodically so that we can always move the log tail forward. Most
    chains will never get relogged, except for operations that generate very
    long chains (large extents containing many blocks with different sharing
    levels) or are on filesystems with small logs and a lot of ongoing
    metadata updates.

    Callers are now required to ensure that the transaction reservation is
    large enough to handle logging done items and new intent items for the
    maximum possible chain length. Most callers are careful to keep the
    chain lengths low, so the overhead should be minimal.

    The decision to relog an intent item is made based on whether the intent
    was logged in a previous checkpoint, since there's no point in relogging
    an intent into the same checkpoint.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • The defer ops code has been finishing items in the wrong order -- if a
    top level defer op creates items A and B, and finishing item A creates
    more defer ops A1 and A2, we'll put the new items on the end of the
    chain and process them in the order A B A1 A2. This is kind of weird,
    since it's convenient for programmers to be able to think of A and B as
    an ordered sequence where all the sub-tasks for A must finish before we
    move on to B, e.g. A A1 A2 B.

    Right now, our log intent items are not so complex that this matters,
    but this will become important for the atomic extent swapping patchset.
    In order to maintain correct reference counting of extents, we have to
    unmap and remap extents in that order, and we want to complete that work
    before moving on to the next range that the user wants to swap. This
    patch fixes defer ops to satisfy that requirement.

    The primary symptom of the incorrect order was noticed in an early
    performance analysis of the atomic extent swap code. An astonishingly
    large number of deferred work items accumulated when userspace requested
    an atomic update of two very fragmented files. The cause of this was
    traced to the same ordering bug in the inner loop of
    xfs_defer_finish_noroll.

    If the ->finish_item method of a deferred operation queues new deferred
    operations, those new deferred ops are appended to the tail of the
    pending work list. To illustrate, say that a caller creates a
    transaction t0 with four deferred operations D0-D3. The first thing
    defer ops does is roll the transaction to t1, leaving us with:

    t1: D0(t0), D1(t0), D2(t0), D3(t0)

    Let's say that finishing each of D0-D3 will create two new deferred ops.
    After finishing D0 and rolling, we'll have the following chain:

    t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)

    d4 and d5 were logged to t1. Notice that while we're about to start
    work on D1, we haven't actually completed all the work implied by D0
    being finished. So far we've been careful (or lucky) to structure the
    dfops callers such that D1 doesn't depend on d4 or d5 being finished,
    but this is a potential logic bomb.

    There's a second problem lurking. Let's see what happens as we finish
    D1-D3:

    t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
    t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
    t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)

    Let's say that d4-d11 are simple work items that don't queue any
    other operations, which means that we can complete d4 and roll to t6:

    t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
    t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
    ...
    t11: d10(t4), d11(t4)
    t12: d11(t4)

    When we try to roll to transaction #12, we're holding defer op d11,
    which we logged way back in t4. This means that the tail of the log is
    pinned at t4. If the log is very small or there are a lot of other
    threads updating metadata, this means that we might have wrapped the
    log and cannot roll to t12 because there isn't enough space left
    before we'd run into t4.

    Let's shift back to the original failure. I mentioned before that I
    discovered this flaw while developing the atomic file update code. In
    that scenario, we have a defer op (D0) that finds a range of file blocks
    to remap, creates a handful of new defer ops to do that, and then asks
    to be continued with however much work remains.

    So, D0 is the original swapext deferred op. The first thing defer ops
    does is rolls to t1:

    t1: D0(t0)

    We try to finish D0, logging d1 and d2 in the process, but can't get all
    the work done. We log a done item and a new intent item for the work
    that D0 still has to do, and roll to t2:

    t2: D0'(t1), d1(t1), d2(t1)

    We roll and try to finish D0', but still can't get all the work done, so
    we log a done item and a new intent item for it, requeue D0 a second
    time, and roll to t3:

    t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)

    If it takes 48 more rolls to complete D0, then we'll finally dispense
    with D0 in t50:

    t50: D(t49), d1(t1), ..., d102(t50)

    We then try to roll again to get a chain like this:

    t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
    ...
    t152: d102(t50)

    Notice that in rolling to transaction #51, we're holding on to a log
    intent item for d1 that was logged in transaction #1. This means that
    the tail of the log is pinned at t1. If the log is very small or there
    are a lot of other threads updating metadata, this means that we might
    have wrapped the log and cannot roll to t51 because there isn't enough
    space left before we'd run into t1. This is of course problem #2 again.

    But notice the third problem with this scenario: we have 102 defer ops
    tied to this transaction! Each of these items is backed by pinned
    kernel memory, which means that we risk OOM if the chains get too long.

    Yikes. Problem #1 is a subtle logic bomb that could hit someone in the
    future; problem #2 applies (rarely) to the current upstream, and problem
    #3 applies to work under development.

    This is not how incremental deferred operations were supposed to work.
    The dfops design of logging in the same transaction an intent-done item
    and a new intent item for the work remaining was to make it so that we
    only have to juggle enough deferred work items to finish that one small
    piece of work. Deferred log item recovery will find that first
    unfinished work item and restart it, no matter how many other intent
    items might follow it in the log. Therefore, it's ok to put the new
    intents at the start of the dfops chain.

    For the first example, the chains look like this:

    t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
    t3: d5(t1), D1(t0), D2(t0), D3(t0)
    ...
    t9: d9(t7), D3(t0)
    t10: D3(t0)
    t11: d10(t10), d11(t10)
    t12: d11(t10)

    For the second example, the chains look like this:

    t1: D0(t0)
    t2: d1(t1), d2(t1), D0'(t1)
    t3: d2(t1), D0'(t1)
    t4: D0'(t1)
    t5: d3(t4), d4(t4), D0''(t4)
    ...
    t148: D0(t147)
    t149: d101(t148), d102(t148)
    t150: d102(t148)

    This actually sucks more for pinning the log tail (we try to roll to
    t10 while holding an intent item that was logged in t0) but we've
    solved
    problem #1. We've also reduced the maximum chain length from:

    sum(all the new items) + nr_original_items

    to:

    max(new items that each original item creates) + nr_original_items

    This solves problem #3 by sharply reducing the number of defer ops that
    can be attached to a transaction at any given time. The change makes
    the problem of log tail pinning worse, but it is an improvement we
    need in order to solve problem #2. Actually solving #2, however, is
    left to the next patch.

    Note that a subsequent analysis of some hard-to-trigger reflink and COW
    livelocks on extremely fragmented filesystems (or systems running a lot
    of IO threads) showed the same symptoms -- uncomfortably large numbers
    of incore deferred work items and occasional stalls in the transaction
    grant code while waiting for log reservations. I think this patch and
    the next one will also solve these problems.

    As originally written, the code used list_splice_tail_init instead of
    list_splice_init, so change that, and leave a short comment explaining
    our actions.
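
    In code form, the essence of the change in xfs_defer_finish_noroll is
    a one-line swap (sketch):

        /* splice new deferred work onto the *head* of the pending list,
         * so subtasks finish before we move on to the next parent task.
         * Previously: list_splice_tail_init(&(*tp)->t_dfops, &dop_pending);
         */
        list_splice_init(&(*tp)->t_dfops, &dop_pending);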

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • In xfs_bui_item_recover, there exists a use-after-free bug with regards
    to the inode that is involved in the bmap replay operation. If the
    mapping operation does not complete, we call xfs_bmap_unmap_extent to
    create a deferred op to finish the unmapping work, and we retain a
    pointer to the incore inode.

    Unfortunately, the very next thing we do is commit the transaction and
    drop the inode. If reclaim tears down the inode before we try to finish
    the defer ops, we dereference garbage and blow up. Therefore, create a
    way to join inodes to the defer ops freezer so that we can maintain the
    xfs_inode reference until we're done with the inode.

    Note: This imposes the requirement that there be enough memory to keep
    every incore inode in memory throughout recovery.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong