23 Nov, 2020

1 commit


20 Nov, 2020

2 commits

  • This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f.

    Your maintainer committed a major braino in the rmap code by adding the
    attr fork, bmbt, and unwritten extent usage bits into rmap record key
    comparisons. While XFS uses the usage bits *in the rmap records* for
    cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
    the owner and offset information to distinguish between reverse mappings
    of the same physical extent into the data fork of a file at multiple
    offsets. The other bits are not important for key comparisons for index
    lookups, and never have been.

    Eric Sandeen reports that this causes regressions in generic/299, so
    undo this patch before it does more damage.

    Reported-by: Eric Sandeen
    Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen

    Darrick J. Wong
     
  • Jens has reported a situation where partial direct IOs can be issued
    and completed yet still return -EAGAIN. We don't want this to report
    a short IO as we want XFS to complete user DIO entirely or not at
    all.

    This partial IO situation can occur on a write IO that is split
    across an allocated extent and a hole, and the second mapping is
    returning EAGAIN because allocation would be required.

    The trivial reproducer:

    $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
    wrote 4096/4096 bytes at offset 0
    4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
    pwrite: Resource temporarily unavailable
    $

    The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
    the first 4kB write:

    xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
    iomap_apply: dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
    xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
    xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
    xfs_iomap_found: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
    iomap_apply_dstmap: dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY

    Here the first iomap loop has mapped the first 4kB of the file and
    issued the IO, and we enter the second iomap_apply loop:

    iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
    xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
    xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin

    And we exit with -EAGAIN out because we hit the allocate case trying
    to make the second 4kB block.

    Then IO completes on the first 4kB and the original IO context
    completes and unlocks the inode, returning -EAGAIN to userspace:

    xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
    xfs_iunlock: dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write

    There are other vectors to the same problem when we re-enter the
    mapping code if we have to make multiple mappinfs under NOWAIT
    conditions. e.g. failing trylocks, COW extents being found,
    allocation being required, and so on.

    Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
    to go ahead if the mapping we retrieve for the IO spans an entire
    allocated extent. This avoids the possibility of subsequent mappings
    to complete the IO from triggering NOWAIT semantics by any means as
    NOWAIT IO will now only enter the mapping code once per NOWAIT IO.

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     

19 Nov, 2020

6 commits

  • In xfs_initialize_perag(), if kmem_zalloc(), xfs_buf_hash_init(), or
    radix_tree_preload() failed, the returned value 'error' is not set
    accordingly.

    Reported-as-fixing: 8b26c5825e02 ("xfs: handle ENOMEM correctly during initialisation of perag structures")
    Fixes: 9b2471797942 ("xfs: cache unlinked pointers in an rhashtable")
    Reported-by: Hulk Robot
    Signed-off-by: Yu Kuai
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Yu Kuai
     
  • The aim of the inode btree record iterator function is to call a
    callback on every record in the btree. To avoid having to tear down and
    recreate the inode btree cursor around every callback, it caches a
    certain number of records in a memory buffer. After each batch of
    callback invocations, we have to perform a btree lookup to find the
    next record after where we left off.

    However, if the keys of the inode btree are corrupt, the lookup might
    put us in the wrong part of the inode btree, causing the walk function
    to loop forever. Therefore, we add extra cursor tracking to make sure
    that we never go backwards neither when performing the lookup nor when
    jumping to the next inobt record. This also fixes an off by one error
    where upon resume the lookup should have been for the inode /after/ the
    point at which we stopped.

    Found by fuzzing xfs/460 with keys[2].startino = ones causing bulkstat
    and quotacheck to hang.

    Fixes: a211432c27ff ("xfs: create simplified inode walk function")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • Currently, commit e9e2eae89ddb dropped a (int) decoration from
    XFS_LITINO(mp), and since sizeof() expression is also involved,
    the result of XFS_LITINO(mp) is simply as the size_t type
    (commonly unsigned long).

    Considering the expression in xfs_attr_shortform_bytesfit():
    offset = (XFS_LITINO(mp) - bytes) >> 3;
    let "bytes" be (int)340, and
    "XFS_LITINO(mp)" be (unsigned long)336.

    on 64-bit platform, the expression is
    offset = ((unsigned long)336 - (int)340) >> 3 =
    (int)(0xfffffffffffffffcUL >> 3) = -1

    but on 32-bit platform, the expression is
    offset = ((unsigned long)336 - (int)340) >> 3 =
    (int)(0xfffffffcUL >> 3) = 0x1fffffff
    instead.

    so offset becomes a large positive number on 32-bit platform, and
    cause xfs_attr_shortform_bytesfit() returns maxforkoff rather than 0.

    Therefore, one result is
    "ASSERT(new_size < bytes" in advance suggested by Eric and a misleading
    comment together with this bugfix suggested by Darrick. It seems the
    other users of XFS_LITINO(mp) are not impacted.

    Fixes: e9e2eae89ddb ("xfs: only check the superblock version for dinode size calculation")
    Cc: # 5.7+
    Reported-and-tested-by: Dennis Gilmore
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Gao Xiang
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Gao Xiang
     
  • Teach the directory scrubber to check all the bestfree entries,
    including the null ones. We want to be able to detect the case where
    the entry is null but there actually /is/ a directory data block.

    Found by fuzzing lbests[0] = ones in xfs/391.

    Fixes: df481968f33b ("xfs: scrub directory freespace")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • We always know the correct state of the rmap record flags (attr, bmbt,
    unwritten) so check them by direct comparison.

    Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • The comment and logic in xchk_btree_check_minrecs for dealing with
    inode-rooted btrees isn't quite correct. While the direct children of
    the inode root are allowed to have fewer records than what would
    normally be allowed for a regular ondisk btree block, this is only true
    if there is only one child block and the number of records don't fit in
    the inode root.

    Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

14 Nov, 2020

1 commit


12 Nov, 2020

1 commit


11 Nov, 2020

4 commits

  • Fix some serious WTF in the reference count scrubber's rmap fragment
    processing. The code comment says that this loop is supposed to move
    all fragment records starting at or before bno onto the worklist, but
    there's no obvious reason why nr (the number of items added) should
    increment starting from 1, and breaking the loop when we've added the
    target number seems dubious since we could have more rmap fragments that
    should have been added to the worklist.

    This seems to manifest in xfs/411 when adding one to the refcount field.

    Fixes: dbde19da9637 ("xfs: cross-reference the rmapbt data with the refcountbt")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Keys for extent interval records in the reverse mapping btree are
    supposed to be computed as follows:

    (physical block, owner, fork, is_btree, is_unwritten, offset)

    This provides users the ability to look up a reverse mapping from a bmbt
    record -- start with the physical block; then if there are multiple
    records for the same block, move on to the owner; then the inode fork
    type; and so on to the file offset.

    However, the key comparison functions incorrectly remove the
    fork/btree/unwritten information that's encoded in the on-disk offset.
    This means that lookup comparisons are only done with:

    (physical block, owner, offset)

    This means that queries can return incorrect results. On consistent
    filesystems this hasn't been an issue because blocks are never shared
    between forks or with bmbt blocks; and are never unwritten. However,
    this bug means that online repair cannot always detect corruption in the
    key information in internal rmapbt nodes.

    Found by fuzzing keys[1].attrfork = ones on xfs/371.

    Fixes: 4b8ed67794fe ("xfs: add rmap btree operations")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • When the bmbt scrubber is looking up rmap extents, we need to set the
    extent flags from the bmbt record fully. This will matter once we fix
    the rmap btree comparison functions to check those flags correctly.

    Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Pass the same oldext argument (which contains the existing rmapping's
    unwritten state) to xfs_rmap_lookup_le_range at the start of
    xfs_rmap_convert_shared. At this point in the code, flags is zero,
    which means that we perform lookups using the wrong key.

    Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

09 Nov, 2020

1 commit


05 Nov, 2020

5 commits

  • There's no reason to flush an entire file when we're unsharing part of
    a file. Therefore, only initiate writeback on the selected range.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • The kernel has always allowed directories to have the rtinherit flag
    set, even if there is no rt device, so this check is wrong.

    Fixes: 80e4e1268802 ("xfs: scrub inodes")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • In commit 7588cbeec6df, we tried to fix a race stemming from the lack of
    coordination between higher level code that wants to allocate and remap
    CoW fork extents into the data fork. Christoph cites as examples the
    always_cow mode, and a directio write completion racing with writeback.

    According to the comments before the goto retry, we want to restart the
    lookup to catch the extent in the data fork, but we don't actually reset
    whichfork or cow_fsb, which means the second try executes using stale
    information. Up until now I think we've gotten lucky that either
    there's something left in the CoW fork to cause cow_fsb to be reset, or
    either data/cow fork sequence numbers have advanced enough to force a
    fresh lookup from the data fork. However, if we reach the retry with an
    empty stable CoW fork and a stable data fork, neither of those things
    happens. The retry foolishly re-calls xfs_convert_blocks on the CoW
    fork which fails again. This time, we toss the write.

    I've recently been working on extending reflink to the realtime device.
    When the realtime extent size is larger than a single block, we have to
    force the page cache to CoW the entire rt extent if a write (or
    fallocate) are not aligned with the rt extent size. The strategy I've
    chosen to deal with this is derived from Dave's blocksize > pagesize
    series: dirtying around the write range, and ensuring that writeback
    always starts mapping on an rt extent boundary. This has brought this
    race front and center, since generic/522 blows up immediately.

    However, I'm pretty sure this is a bug outright, independent of that.

    Fixes: 7588cbeec6df ("xfs: retry COW fork delalloc conversion when no extent was found")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • iomap writeback mapping failure only calls into ->discard_page() if
    the current page has not been added to the ioend. Accordingly, the
    XFS callback assumes a full page discard and invalidation. This is
    problematic for sub-page block size filesystems where some portion
    of a page might have been mapped successfully before a failure to
    map a delalloc block occurs. ->discard_page() is not called in that
    error scenario and the bio is explicitly failed by iomap via the
    error return from ->prepare_ioend(). As a result, the filesystem
    leaks delalloc blocks and corrupts the filesystem block counters.

    Since XFS is the only user of ->discard_page(), tweak the semantics
    to invoke the callback unconditionally on mapping errors and provide
    the file offset that failed to map. Update xfs_discard_page() to
    discard the corresponding portion of the file and pass the range
    along to iomap_invalidatepage(). The latter already properly handles
    both full and sub-page scenarios by not changing any iomap or page
    state on sub-page invalidations.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • It is possible to expose non-zeroed post-EOF data in XFS if the new
    EOF page is dirty, backed by an unwritten block and the truncate
    happens to race with writeback. iomap_truncate_page() will not zero
    the post-EOF portion of the page if the underlying block is
    unwritten. The subsequent call to truncate_setsize() will, but
    doesn't dirty the page. Therefore, if writeback happens to complete
    after iomap_truncate_page() (so it still sees the unwritten block)
    but before truncate_setsize(), the cached page becomes inconsistent
    with the on-disk block. A mapped read after the associated page is
    reclaimed or invalidated exposes non-zero post-EOF data.

    For example, consider the following sequence when run on a kernel
    modified to explicitly flush the new EOF page within the race
    window:

    $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
    $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
    ...
    $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
    00000400: 00 00 00 00 00 00 00 00 ........
    $ umount /mnt/; mount /mnt/
    $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
    00000400: cd cd cd cd cd cd cd cd ........

    Update xfs_setattr_size() to explicitly flush the new EOF page prior
    to the page truncate to ensure iomap has the latest state of the
    underlying block.

    Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

29 Oct, 2020

3 commits


27 Oct, 2020

1 commit


26 Oct, 2020

2 commits

  • …/kernel/git/sre/linux-power-supply") into android-mainline

    Steps on the way to 5.10-rc1

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: Ie67a4cdc4b30c466c302c46bd1348b3e94ea8f91

    Greg Kroah-Hartman
     
  • Use a more generic form for __section that requires quotes to avoid
    complications with clang and gcc differences.

    Remove the quote operator # from compiler_attributes.h __section macro.

    Convert all unquoted __section(foo) uses to quoted __section("foo").
    Also convert __attribute__((section("foo"))) uses to __section("foo")
    even if the __attribute__ has multiple list entry forms.

    Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

    Signed-off-by: Joe Perches
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Miguel Ojeda
    Signed-off-by: Linus Torvalds

    Joe Perches
     

25 Oct, 2020

2 commits


24 Oct, 2020

1 commit

  • Pull xfs fixes from Darrick Wong:
    "Two bug fixes that trickled in during the merge window:

    - Make fallocate check the alignment of its arguments against the
    fundamental allocation unit of the volume the file lives on, so
    that we don't trigger the fs' alignment checks.

    - Cancel unprocessed log intents immediately when log recovery fails,
    to avoid a log deadlock"

    * tag 'xfs-5.10-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: cancel intents immediately if process_intents fails
    xfs: fix fallocate functions when rtextsize is larger than 1

    Linus Torvalds
     

22 Oct, 2020

2 commits

  • If processing recovered log intent items fails, we need to cancel all
    the unprocessed recovered items immediately so that a subsequent AIL
    push in the bail out path won't get wedged on the pinned intent items
    that didn't get processed.

    This can happen if the log contains (1) an intent that gets and releases
    an inode, (2) an intent that cannot be recovered successfully, and (3)
    some third intent item. When recovery of (2) fails, we leave (3) pinned
    in memory. Inode reclamation is called in the error-out path of
    xfs_mountfs before xfs_log_cancel_mount. Reclamation calls
    xfs_ail_push_all_sync, which gets stuck waiting for (3).

    Therefore, call xlog_recover_cancel_intents if _process_intents fails.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • In commit fe341eb151ec, I forgot that xfs_free_file_space isn't strictly
    a "remove mapped blocks" function. It is actually a function to zero
    file space by punching out the middle and writing zeroes to the
    unaligned ends of the specified range. Therefore, putting a rtextsize
    alignment check in that function is wrong because that breaks unaligned
    ZERO_RANGE on the realtime volume.

    Furthermore, xfs_file_fallocate already has alignment checks for the
    functions require the file range to be aligned to the size of a
    fundamental allocation unit (which is 1 FSB on the data volume and 1 rt
    extent on the realtime volume). Create a new helper to check fallocate
    arguments against the realtiem allocation unit size, fix the fallocate
    frontend to use it, fix free_file_space to delete the correct range, and
    remove a now redundant check from insert_file_space.

    NOTE: The realtime extent size is not required to be a power of two!

    Fixes: fe341eb151ec ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

20 Oct, 2020

1 commit

  • Pull more xfs updates from Darrick Wong:
    "The second large pile of new stuff for 5.10, with changes even more
    monumental than last week!

    We are formally announcing the deprecation of the V4 filesystem format
    in 2030. All users must upgrade to the V5 format, which contains
    design improvements that greatly strengthen metadata validation,
    supports reflink and online fsck, and is the intended vehicle for
    handling timestamps past 2038. We're also deprecating the old Irix
    behavioral tweaks in September 2025.

    Coming along for the ride are two design changes to the deferred
    metadata ops subsystem. One of the improvements is to retain correct
    logical ordering of tasks and subtasks, which is a more logical design
    for upper layers of XFS and will become necessary when we add atomic
    file range swaps and commits. The second improvement to deferred ops
    improves the scalability of the log by helping the log tail to move
    forward during long-running operations. This reduces log contention
    when there are a large number of threads trying to run transactions.

    In addition to that, this fixes numerous small bugs in log recovery;
    refactors logical intent log item recovery to remove the last
    remaining place in XFS where we could have nested transactions; fixes
    a couple of ways that intent log item recovery could fail in ways that
    wouldn't have happened in the regular commit paths; fixes a deadlock
    vector in the GETFSMAP implementation (which improves its performance
    by 20%); and fixes serious bugs in the realtime growfs, fallocate, and
    bitmap handling code.

    Summary:

    - Deprecate the V4 filesystem format, some disused mount options, and
    some legacy sysctl knobs now that we can support dates into the
    25th century. Note that removal of V4 support will not happen until
    the early 2030s.

    - Fix some probles with inode realtime flag propagation.

    - Fix some buffer handling issues when growing a rt filesystem.

    - Fix a problem where a BMAP_REMAP unmap call would free rt extents
    even though the purpose of BMAP_REMAP is to avoid freeing the
    blocks.

    - Strengthen the dabtree online scrubber to check hash values on
    child dabtree blocks.

    - Actually log new intent items created as part of recovering log
    intent items.

    - Fix a bug where quotas weren't attached to an inode undergoing bmap
    intent item recovery.

    - Fix a buffer overrun problem with specially crafted log buffer
    headers.

    - Various cleanups to type usage and slightly inaccurate comments.

    - More cleanups to the xattr, log, and quota code.

    - Don't run the (slower) shared-rmap operations on attr fork
    mappings.

    - Fix a bug where we failed to check the LSN of finobt blocks during
    replay and could therefore overwrite newer data with older data.

    - Clean up the ugly nested transaction mess that log recovery uses to
    stage intent item recovery in the correct order by creating a
    proper data structure to capture recovered chains.

    - Use the capture structure to resume intent item chains with the
    same log space and block reservations as when they were captured.

    - Fix a UAF bug in bmap intent item recovery where we failed to
    maintain our reference to the incore inode if the bmap operation
    needed to relog itself to continue.

    - Rearrange the defer ops mechanism to finish newly created subtasks
    of a parent task before moving on to the next parent task.

    - Automatically relog intent items in deferred ops chains if doing so
    would help us avoid pinning the log tail. This will help fix some
    log scaling problems now and will facilitate atomic file updates
    later.

    - Fix a deadlock in the GETFSMAP implementation by using an internal
    memory buffer to reduce indirect calls and copies to userspace,
    thereby improving its performance by ~20%.

    - Fix various problems when calling growfs on a realtime volume would
    not fully update the filesystem metadata.

    - Fix broken Kconfig asking about deprecated XFS when XFS is
    disabled"

    * tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
    xfs: fix high key handling in the rt allocator's query_range function
    xfs: annotate grabbing the realtime bitmap/summary locks in growfs
    xfs: make xfs_growfs_rt update secondary superblocks
    xfs: fix realtime bitmap/summary file truncation when growing rt volume
    xfs: fix the indent in xfs_trans_mod_dquot
    xfs: do the ASSERT for the arguments O_{u,g,p}dqpp
    xfs: fix deadlock and streamline xfs_getfsmap performance
    xfs: limit entries returned when counting fsmap records
    xfs: only relog deferred intent items if free space in the log gets low
    xfs: expose the log push threshold
    xfs: periodically relog deferred intent items
    xfs: change the order in which child and parent defer ops are finished
    xfs: fix an incore inode UAF in xfs_bui_recover
    xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
    xfs: clean up bmap intent item recovery checking
    xfs: xfs_defer_capture should absorb remaining transaction reservation
    xfs: xfs_defer_capture should absorb remaining block reservations
    xfs: proper replay of deferred ops queued during log recovery
    xfs: remove XFS_LI_RECOVERED
    ...

    Linus Torvalds
     

17 Oct, 2020

2 commits

  • Pavel Machek complained that the question about supporting deprecated
    XFS v4 comes up even when XFS is disabled. This clearly makes no sense,
    so fix Kconfig.

    Reported-by: Pavel Machek
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen

    Darrick J. Wong
     
  • Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key
    in the realtime bitmap is always one less than the number of rt extents,
    which means that the key clamp at the start of the function is wrong.
    The 4th argument to xfs_rtfind_forw is the highest rt extent that we
    want to probe, which means that passing 1 less than the high key is
    wrong. Finally, drop the rem variable that controls the loop because we
    can compare the iteration point (rtstart) against the high key directly.

    The sordid history of this function is that the original commit (fb3c3)
    incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
    to xfs_rtfind_forw. This was wrong because the "high key" is supposed
    to be the largest key for which the caller wants result rows, not the
    key for the first row that could possibly be outside the range that the
    caller wants to see.

    A subsequent attempt (8ad56) to strengthen the parameter checking added
    incorrect clamping of the parameters to the number of rt blocks in the
    system (despite the bitmap functions all taking units of rt extents) to
    avoid querying ranges past the end of rt bitmap file but failed to fix
    the incorrect _rtfind_forw parameter. The original _rtfind_forw
    parameter error then survived the conversion of the startblock and
    blockcount fields to rt extents (a0e5c), and the most recent off-by-one
    fix (a3a37) thought it was patching a problem when the end of the rt
    volume is not in use, but none of these fixes actually solved the
    original problem that the author was confused about the "limit" argument
    to xfs_rtfind_forw.

    Sadly, all four of these patches were written by this author and even
    his own usage of this function and rt testing were inadequate to get
    this fixed quickly.

    Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap")
    Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks")
    Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units")
    Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

15 Oct, 2020

1 commit

  • Pull xfs updates from Darrick Wong:
    "The biggest changes are two new features for the ondisk metadata: one
    to record the sizes of the inode btrees in the AG to increase
    redundancy checks and to improve mount times; and a second new feature
    to support timestamps until the year 2486.

    We also fixed a problem where reflinking into a file that requires
    synchronous writes wouldn't actually flush the updates to disk; clean
    up a fair amount of cruft; and started fixing some bugs in the
    realtime volume code.

    Summary:

    - Clean up the buffer ioend calling path so that the retry strategy
    isn't quite so scattered everywhere.

    - Clean up m_sb_bp handling.

    - New feature: storing inode btree counts in the AGI to speed up
    certain mount time per-AG block reservation operatoins and add a
    little more metadata redundancy.

    - New feature: Widen inode timestamps and quota grace expiration
    timestamps to support dates through the year 2486.

    - Get rid of more of our custom buffer allocation API wrappers.

    - Use a proper VLA for shortform xattr structure namevals.

    - Force the log after reflinking or deduping into a file that is
    opened with O_SYNC or O_DSYNC.

    - Fix some math errors in the realtime allocator"

    * tag 'xfs-5.10-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (42 commits)
    xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size
    xfs: make sure the rt allocator doesn't run off the end
    xfs: Remove unneeded semicolon
    xfs: force the log after remapping a synchronous-writes file
    xfs: Convert xfs_attr_sf macros to inline functions
    xfs: Use variable-size array for nameval in xfs_attr_sf_entry
    xfs: Remove typedef xfs_attr_shortform_t
    xfs: remove typedef xfs_attr_sf_entry_t
    xfs: Remove kmem_zalloc_large()
    xfs: enable big timestamps
    xfs: trace timestamp limits
    xfs: widen ondisk quota expiration timestamps to handle y2038+
    xfs: widen ondisk inode timestamps to deal with y2038+
    xfs: redefine xfs_ictimestamp_t
    xfs: redefine xfs_timestamp_t
    xfs: move xfs_log_dinode_to_disk to the log recovery code
    xfs: refactor quota timestamp coding
    xfs: refactor default quota grace period setting code
    xfs: refactor quota expiration timer modification
    xfs: explicitly define inode timestamp range
    ...

    Linus Torvalds
     

13 Oct, 2020

3 commits

  • Use XFS_ILOCK_RT{BITMAP,SUM} to annotate grabbing the rt bitmap and
    summary locks when we grow the realtime volume, just like we do most
    everywhere else. This shuts up lockdep warnings about grabbing the
    ILOCK class of locks recursively:

    ============================================
    WARNING: possible recursive locking detected
    5.9.0-rc4-djw #rc4 Tainted: G O
    --------------------------------------------
    xfs_growfs/4841 is trying to acquire lock:
    ffff888035acc230 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]

    but task is already holding lock:
    ffff888035acedb0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&xfs_nondir_ilock_class);
    lock(&xfs_nondir_ilock_class);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • When we call growfs on the data device, we update the secondary
    superblocks to reflect the updated filesystem geometry. We need to do
    this for growfs on the realtime volume too, because a future xfs_repair
    run could try to fix the filesystem using a backup superblock.

    This was observed by the online superblock scrubbers while running
    xfs/233. One can also trigger this by growing an rt volume, cycling the
    mount, and creating new rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • The realtime bitmap and summary files are regular files that are hidden
    away from the directory tree. Since they're regular files, inode
    inactivation will try to purge what it thinks are speculative
    preallocations beyond the incore size of the file. Unfortunately,
    xfs_growfs_rt forgets to update the incore size when it resizes the
    inodes, with the result that inactivating the rt inodes at unmount time
    will cause their contents to be truncated.

    Fix this by updating the incore size when we change the ondisk size as
    part of updating the superblock. Note that we don't do this when we're
    allocating blocks to the rt inodes because we actually want those blocks
    to get purged if the growfs fails.

    This fixes corruption complaints from the online rtsummary checker when
    running xfs/233. Since that test requires rmap, one can also trigger
    this by growing an rt volume, cycling the mount, and creating rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

07 Oct, 2020

1 commit