20 Nov, 2020

2 commits

  • This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f.

    Your maintainer committed a major braino in the rmap code by adding the
    attr fork, bmbt, and unwritten extent usage bits into rmap record key
    comparisons. While XFS uses the usage bits *in the rmap records* for
    cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
    the owner and offset information to distinguish between reverse mappings
    of the same physical extent into the data fork of a file at multiple
    offsets. The other bits are not important for key comparisons for index
    lookups, and never have been.
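
    As a rough illustration of the intended behaviour (a sketch only, with
    stand-in struct and mask names, not the kernel's exact code), a key
    comparison in this spirit masks the flag bits out of the on-disk
    offset before comparing:

    #include <stdint.h>

    /* low bits hold the file offset; the flag bits live above them */
    #define RMAP_OFF_MASK    ((1ULL << 54) - 1)

    struct rmap_key {
        uint32_t    rm_startblock;  /* physical block */
        uint64_t    rm_owner;       /* inode number or metadata owner */
        uint64_t    rm_offset;      /* offset plus encoded flag bits */
    };

    /* compare by physical block, then owner, then masked offset only */
    static int64_t
    rmap_key_diff(const struct rmap_key *k1, const struct rmap_key *k2)
    {
        if (k1->rm_startblock != k2->rm_startblock)
            return (int64_t)k1->rm_startblock - (int64_t)k2->rm_startblock;
        if (k1->rm_owner != k2->rm_owner)
            return k1->rm_owner > k2->rm_owner ? 1 : -1;
        if ((k1->rm_offset & RMAP_OFF_MASK) != (k2->rm_offset & RMAP_OFF_MASK))
            return (k1->rm_offset & RMAP_OFF_MASK) >
                   (k2->rm_offset & RMAP_OFF_MASK) ? 1 : -1;
        return 0;
    }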

    Eric Sandeen reports that this causes regressions in generic/299, so
    undo this patch before it does more damage.

    Reported-by: Eric Sandeen
    Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen

    Darrick J. Wong
     
  • Jens has reported a situation where partial direct IOs can be issued
    and completed yet still return -EAGAIN. We don't want this to report
    a short IO as we want XFS to complete user DIO entirely or not at
    all.

    This partial IO situation can occur on a write IO that is split
    across an allocated extent and a hole, and the second mapping is
    returning EAGAIN because allocation would be required.

    The trivial reproducer:

    $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
    wrote 4096/4096 bytes at offset 0
    4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
    pwrite: Resource temporarily unavailable
    $

    The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
    the first 4kB write:

    xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
    iomap_apply: dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
    xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
    xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
    xfs_iomap_found: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
    iomap_apply_dstmap: dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY

    Here the first iomap loop has mapped the first 4kB of the file and
    issued the IO, and we enter the second iomap_apply loop:

    iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
    xfs_ilock_nowait: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
    xfs_iunlock: dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin

    And we exit with -EAGAIN because we hit the allocation case trying
    to map the second 4kB block.

    Then IO completes on the first 4kB and the original IO context
    completes and unlocks the inode, returning -EAGAIN to userspace:

    xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
    xfs_iunlock: dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write

    There are other vectors to the same problem when we re-enter the
    mapping code if we have to make multiple mappings under NOWAIT
    conditions, e.g. failing trylocks, COW extents being found,
    allocation being required, and so on.

    Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
    to go ahead if the mapping we retrieve for the IO spans an entire
    allocated extent. This removes the possibility that subsequent
    mappings needed to complete the IO will trigger NOWAIT semantics,
    since NOWAIT IO will now enter the mapping code only once per NOWAIT
    IO.
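
    The shape of the fix, sketched with the relevant XFS types (treat
    this as an outline of the approach rather than the exact patch): fail
    the NOWAIT IO with -EAGAIN in the direct write iomap_begin path
    unless the extent we found spans the entire requested range.

    static bool
    imap_spans_range(
        struct xfs_bmbt_irec    *imap,
        xfs_fileoff_t           offset_fsb,
        xfs_fileoff_t           end_fsb)
    {
        if (imap->br_startoff > offset_fsb)
            return false;
        if (imap->br_startoff + imap->br_blockcount < end_fsb)
            return false;
        return true;
    }

        /* in xfs_direct_write_iomap_begin(), after the extent lookup */
        if (flags & IOMAP_NOWAIT) {
            error = -EAGAIN;
            if (!imap_spans_range(&imap, offset_fsb, end_fsb))
                goto out_unlock;
        }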

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     

19 Nov, 2020

6 commits

  • In xfs_initialize_perag(), if kmem_zalloc(), xfs_buf_hash_init(), or
    radix_tree_preload() fails, the returned value 'error' is not set
    accordingly.
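
    A minimal sketch of the fix (labels and allocation flags assumed from
    the era's code), making sure 'error' is set on every failure path:

        pag = kmem_zalloc(sizeof(*pag), KM_MAYFAIL);
        if (!pag) {
            error = -ENOMEM;    /* previously left stale */
            goto out_unwind;
        }
        error = xfs_buf_hash_init(pag);    /* previously unassigned */
        if (error)
            goto out_free_pag;
        error = radix_tree_preload(GFP_NOFS);
        if (error)
            goto out_hash_destroy;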

    Reported-as-fixing: 8b26c5825e02 ("xfs: handle ENOMEM correctly during initialisation of perag structures")
    Fixes: 9b2471797942 ("xfs: cache unlinked pointers in an rhashtable")
    Reported-by: Hulk Robot
    Signed-off-by: Yu Kuai
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Yu Kuai
     
  • The aim of the inode btree record iterator function is to call a
    callback on every record in the btree. To avoid having to tear down and
    recreate the inode btree cursor around every callback, it caches a
    certain number of records in a memory buffer. After each batch of
    callback invocations, we have to perform a btree lookup to find the
    next record after where we left off.

    However, if the keys of the inode btree are corrupt, the lookup might
    put us in the wrong part of the inode btree, causing the walk function
    to loop forever. Therefore, we add extra cursor tracking to make sure
    that we never go backwards, either when performing the lookup or when
    jumping to the next inobt record. This also fixes an off-by-one error
    where, upon resume, the lookup should have been for the inode /after/
    the point at which we stopped.
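
    A sketch of the extra tracking (the lastino field and surrounding
    context are assumptions, not necessarily the exact patch): remember
    the highest inode visited and treat any non-forward movement as
    corruption instead of looping.

        /* require strict forward progress from the last inode we saw */
        if (iwag->lastino != NULLFSINO &&
            XFS_AGINO_TO_INO(mp, agno, irec->ir_startino) <= iwag->lastino)
            return -EFSCORRUPTED;

        /* remember where this inode chunk ends */
        iwag->lastino = XFS_AGINO_TO_INO(mp, agno,
                irec->ir_startino + XFS_INODES_PER_CHUNK - 1);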

    Found by fuzzing xfs/460 with keys[2].startino = ones causing bulkstat
    and quotacheck to hang.

    Fixes: a211432c27ff ("xfs: create simplified inode walk function")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
    Commit e9e2eae89ddb dropped an (int) cast from XFS_LITINO(mp), and
    since a sizeof() expression is also involved, the result of
    XFS_LITINO(mp) now simply has the size_t type (commonly unsigned
    long).

    Consider the expression in xfs_attr_shortform_bytesfit():
    offset = (XFS_LITINO(mp) - bytes) >> 3;
    let "bytes" be (int)340, and
    "XFS_LITINO(mp)" be (unsigned long)336.

    On a 64-bit platform, the expression is
    offset = ((unsigned long)336 - (int)340) >> 3 =
    (int)(0xfffffffffffffffcUL >> 3) = -1

    but on a 32-bit platform, the expression is
    offset = ((unsigned long)336 - (int)340) >> 3 =
    (int)(0xfffffffcUL >> 3) = 0x1fffffff
    instead.

    So offset becomes a large positive number on 32-bit platforms, which
    causes xfs_attr_shortform_bytesfit() to return maxforkoff rather
    than 0.
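
    The promotion is easy to demonstrate in userspace with a standalone
    program that mimics the widths involved (an illustration, not XFS
    code):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int bytes = 340;

        /* 64-bit size_t: wraps to 0xfffffffffffffffc, shifts to
         * 0x1fffffffffffffff, truncates to (int)-1 */
        int off64 = (int)(((uint64_t)336 - bytes) >> 3);

        /* 32-bit size_t: wraps to 0xfffffffc, shifts to 0x1fffffff,
         * which stays a large positive int */
        int off32 = (int)(((uint32_t)336 - bytes) >> 3);

        printf("64-bit-like: %d\n", off64);    /* -1 */
        printf("32-bit-like: %d\n", off32);    /* 536870911 */
        return 0;
    }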

    Therefore, add an "ASSERT(new_size < bytes)" up front, as Eric
    suggested, and fix a misleading comment along with this bugfix, as
    Darrick suggested. The other users of XFS_LITINO(mp) do not seem to
    be impacted.

    Fixes: e9e2eae89ddb ("xfs: only check the superblock version for dinode size calculation")
    Cc: # 5.7+
    Reported-and-tested-by: Dennis Gilmore
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Gao Xiang
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Gao Xiang
     
  • Teach the directory scrubber to check all the bestfree entries,
    including the null ones. We want to be able to detect the case where
    the entry is null but there actually /is/ a directory data block.

    Found by fuzzing lbests[0] = ones in xfs/391.

    Fixes: df481968f33b ("xfs: scrub directory freespace")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • We always know the correct state of the rmap record flags (attr, bmbt,
    unwritten) so check them by direct comparison.

    Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • The comment and logic in xchk_btree_check_minrecs for dealing with
    inode-rooted btrees isn't quite correct. While the direct children of
    the inode root are allowed to have fewer records than what would
    normally be allowed for a regular ondisk btree block, this is only true
    if there is only one child block and the number of records don't fit in
    the inode root.

    Fixes: 08a3a692ef58 ("xfs: btree scrub should check minrecs")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

11 Nov, 2020

4 commits

  • Fix some serious WTF in the reference count scrubber's rmap fragment
    processing. The code comment says that this loop is supposed to move
    all fragment records starting at or before bno onto the worklist, but
    there's no obvious reason why nr (the number of items added) should
    increment starting from 1, and breaking the loop when we've added the
    target number seems dubious since we could have more rmap fragments that
    should have been added to the worklist.
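
    A sketch of the corrected loop shape (list and variable names are
    assumed): since the fragment list is sorted by rm_startblock, move
    every fragment that starts at or before bno, and stop only at the
    first one beyond it.

        list_for_each_entry_safe(frag, n, &refchk->fragments, list) {
            if (frag->rm.rm_startblock > bno)
                break;
            list_move_tail(&frag->list, &worklist);
            nr++;
        }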

    This seems to manifest in xfs/411 when adding one to the refcount field.

    Fixes: dbde19da9637 ("xfs: cross-reference the rmapbt data with the refcountbt")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Keys for extent interval records in the reverse mapping btree are
    supposed to be computed as follows:

    (physical block, owner, fork, is_btree, is_unwritten, offset)

    This provides users the ability to look up a reverse mapping from a bmbt
    record -- start with the physical block; then if there are multiple
    records for the same block, move on to the owner; then the inode fork
    type; and so on to the file offset.

    However, the key comparison functions incorrectly remove the
    fork/btree/unwritten information that's encoded in the on-disk offset.
    This means that lookup comparisons are only done with:

    (physical block, owner, offset)

    As a result, queries can return incorrect results. On consistent
    filesystems this hasn't been an issue because blocks are never shared
    between forks or with bmbt blocks, and are never unwritten. However,
    this bug means that online repair cannot always detect corruption in
    the key information in internal rmapbt nodes.

    Found by fuzzing keys[1].attrfork = ones on xfs/371.

    Fixes: 4b8ed67794fe ("xfs: add rmap btree operations")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • When the bmbt scrubber is looking up rmap extents, we need to set the
    extent flags from the bmbt record fully. This will matter once we fix
    the rmap btree comparison functions to check those flags correctly.

    Fixes: d852657ccfc0 ("xfs: cross-reference reverse-mapping btree")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Pass the same oldext argument (which contains the existing rmapping's
    unwritten state) to xfs_rmap_lookup_le_range at the start of
    xfs_rmap_convert_shared. At this point in the code, flags is zero,
    which means that we perform lookups using the wrong key.

    Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

05 Nov, 2020

5 commits

  • There's no reason to flush an entire file when we're unsharing part of
    a file. Therefore, only initiate writeback on the selected range.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • The kernel has always allowed directories to have the rtinherit flag
    set, even if there is no rt device, so this check is wrong.

    Fixes: 80e4e1268802 ("xfs: scrub inodes")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • In commit 7588cbeec6df, we tried to fix a race stemming from the lack of
    coordination between higher level code that wants to allocate and remap
    CoW fork extents into the data fork. Christoph cites as examples the
    always_cow mode and a directio write completion racing with
    writeback.

    According to the comments before the goto retry, we want to restart the
    lookup to catch the extent in the data fork, but we don't actually reset
    whichfork or cow_fsb, which means the second try executes using stale
    information. Up until now I think we've gotten lucky that either
    there's something left in the CoW fork to cause cow_fsb to be reset,
    or the data/cow fork sequence numbers have advanced enough to force a
    fresh lookup from the data fork. However, if we reach the retry with
    an empty, stable CoW fork and a stable data fork, neither of those
    things happens. The retry foolishly re-calls xfs_convert_blocks on
    the CoW fork, which fails again. This time, we toss the write.
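
    A minimal sketch of the fix, using the variable names from the
    discussion above: reset the lookup state every time we pass the retry
    label, so the second pass really does start from a clean data fork
    lookup.

    retry:
        cow_fsb = NULLFILEOFF;
        whichfork = XFS_DATA_FORK;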

    I've recently been working on extending reflink to the realtime device.
    When the realtime extent size is larger than a single block, we have to
    force the page cache to CoW the entire rt extent if a write (or
    fallocate) is not aligned with the rt extent size. The strategy I've
    chosen to deal with this is derived from Dave's blocksize > pagesize
    series: dirtying around the write range, and ensuring that writeback
    always starts mapping on an rt extent boundary. This has brought this
    race front and center, since generic/522 blows up immediately.

    However, I'm pretty sure this is a bug outright, independent of that.

    Fixes: 7588cbeec6df ("xfs: retry COW fork delalloc conversion when no extent was found")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • iomap writeback mapping failure only calls into ->discard_page() if
    the current page has not been added to the ioend. Accordingly, the
    XFS callback assumes a full page discard and invalidation. This is
    problematic for sub-page block size filesystems where some portion
    of a page might have been mapped successfully before a failure to
    map a delalloc block occurs. ->discard_page() is not called in that
    error scenario and the bio is explicitly failed by iomap via the
    error return from ->prepare_ioend(). As a result, the filesystem
    leaks delalloc blocks and corrupts the filesystem block counters.

    Since XFS is the only user of ->discard_page(), tweak the semantics
    to invoke the callback unconditionally on mapping errors and provide
    the file offset that failed to map. Update xfs_discard_page() to
    discard the corresponding portion of the file and pass the range
    along to iomap_invalidatepage(). The latter already properly handles
    both full and sub-page scenarios by not changing any iomap or page
    state on sub-page invalidations.
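
    Sketched below (the shape follows the description above; the exact
    upstream signature may differ), the callback now receives the file
    offset that failed to map, and XFS invalidates only from that point
    to the end of the page:

        /* iomap writeback ops: report where the mapping failed */
        void (*discard_page)(struct page *page, loff_t fileoff);

        /* xfs_discard_page(): punch out delalloc blocks from fileoff to
         * the end of the page, then let iomap decide how much page
         * state to tear down */
        iomap_invalidatepage(page, offset_in_page(fileoff),
                PAGE_SIZE - offset_in_page(fileoff));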

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • It is possible to expose non-zeroed post-EOF data in XFS if the new
    EOF page is dirty, backed by an unwritten block and the truncate
    happens to race with writeback. iomap_truncate_page() will not zero
    the post-EOF portion of the page if the underlying block is
    unwritten. The subsequent call to truncate_setsize() will, but
    doesn't dirty the page. Therefore, if writeback happens to complete
    after iomap_truncate_page() (so it still sees the unwritten block)
    but before truncate_setsize(), the cached page becomes inconsistent
    with the on-disk block. A mapped read after the associated page is
    reclaimed or invalidated exposes non-zero post-EOF data.

    For example, consider the following sequence when run on a kernel
    modified to explicitly flush the new EOF page within the race
    window:

    $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file
    $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file
    ...
    $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
    00000400: 00 00 00 00 00 00 00 00 ........
    $ umount /mnt/; mount /mnt/
    $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file
    00000400: cd cd cd cd cd cd cd cd ........

    Update xfs_setattr_size() to explicitly flush the new EOF page prior
    to the page truncate to ensure iomap has the latest state of the
    underlying block.
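
    A sketch of the change in xfs_setattr_size (roughly the shape of the
    fix): write back the page that will span the new EOF before zeroing
    and truncating it, so iomap sees the final state of the underlying
    block.

        if (newsize < oldsize) {
            error = filemap_write_and_wait_range(inode->i_mapping,
                    newsize, newsize);
            if (error)
                return error;
        }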

    Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

29 Oct, 2020

1 commit

  • Make sure that we actually initialize xefi_discard when we're scheduling
    a deferred free of an AGFL block. This was (eventually) found by the
    UBSAN while I was banging on realtime rmap problems, but it exists in
    the upstream codebase. While we're at it, rearrange the structure to
    reduce the struct size from 64 to 56 bytes.

    Fixes: fcb762f5de2e ("xfs: add bmapi nodiscard flag")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

26 Oct, 2020

1 commit

  • Use a more generic form for __section that requires quotes to avoid
    complications with clang and gcc differences.

    Remove the quote operator # from compiler_attributes.h __section macro.

    Convert all unquoted __section(foo) uses to quoted __section("foo").
    Also convert __attribute__((section("foo"))) uses to __section("foo")
    even if the __attribute__ has multiple list entry forms.
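
    For reference, the shape of the change (the variable below is
    illustrative):

    /* before: the macro stringified its argument itself */
    #define __section(S)    __attribute__((__section__(#S)))
    static int boot_flag __section(.init.data);     /* unquoted */

    /* after: callers pass a real string literal */
    #define __section(section)  __attribute__((__section__(section)))
    static int boot_flag __section(".init.data");   /* quoted */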

    Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

    Signed-off-by: Joe Perches
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Miguel Ojeda
    Signed-off-by: Linus Torvalds

    Joe Perches
     

25 Oct, 2020

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff all over the place (the largest group here is
    Christoph's stat cleanups)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: remove KSTAT_QUERY_FLAGS
    fs: remove vfs_stat_set_lookup_flags
    fs: move vfs_fstatat out of line
    fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
    fs: remove vfs_statx_fd
    fs: omfs: use kmemdup() rather than kmalloc+memcpy
    [PATCH] reduce boilerplate in fsid handling
    fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
    selftests: mount: add nosymfollow tests
    Add a "nosymfollow" mount option.

    Linus Torvalds
     

24 Oct, 2020

1 commit

  • Pull xfs fixes from Darrick Wong:
    "Two bug fixes that trickled in during the merge window:

    - Make fallocate check the alignment of its arguments against the
    fundamental allocation unit of the volume the file lives on, so
    that we don't trigger the fs' alignment checks.

    - Cancel unprocessed log intents immediately when log recovery fails,
    to avoid a log deadlock"

    * tag 'xfs-5.10-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: cancel intents immediately if process_intents fails
    xfs: fix fallocate functions when rtextsize is larger than 1

    Linus Torvalds
     

22 Oct, 2020

2 commits

  • If processing recovered log intent items fails, we need to cancel all
    the unprocessed recovered items immediately so that a subsequent AIL
    push in the bail out path won't get wedged on the pinned intent items
    that didn't get processed.

    This can happen if the log contains (1) an intent that gets and releases
    an inode, (2) an intent that cannot be recovered successfully, and (3)
    some third intent item. When recovery of (2) fails, we leave (3) pinned
    in memory. Inode reclamation is called in the error-out path of
    xfs_mountfs before xfs_log_cancel_mount. Reclamation calls
    xfs_ail_push_all_sync, which gets stuck waiting for (3).

    Therefore, call xlog_recover_cancel_intents if _process_intents fails.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • In commit fe341eb151ec, I forgot that xfs_free_file_space isn't strictly
    a "remove mapped blocks" function. It is actually a function to zero
    file space by punching out the middle and writing zeroes to the
    unaligned ends of the specified range. Therefore, putting a rtextsize
    alignment check in that function is wrong because that breaks unaligned
    ZERO_RANGE on the realtime volume.

    Furthermore, xfs_file_fallocate already has alignment checks for the
    functions that require the file range to be aligned to the size of a
    fundamental allocation unit (which is 1 FSB on the data volume and 1
    rt extent on the realtime volume). Create a new helper to check
    fallocate arguments against the realtime allocation unit size, fix
    the fallocate
    frontend to use it, fix free_file_space to delete the correct range, and
    remove a now redundant check from insert_file_space.

    NOTE: The realtime extent size is not required to be a power of two!
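
    A sketch of such a helper (the name and plumbing are assumed):
    because the rt extent size can be any multiple of the block size,
    fall back to division when the allocation unit is not a power of
    two.

    static bool
    falloc_range_aligned(uint64_t pos, uint64_t len, uint64_t alloc_unit)
    {
        if (alloc_unit & (alloc_unit - 1))  /* not a power of two */
            return pos % alloc_unit == 0 && len % alloc_unit == 0;
        return ((pos | len) & (alloc_unit - 1)) == 0;
    }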

    Fixes: fe341eb151ec ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

20 Oct, 2020

1 commit

  • Pull more xfs updates from Darrick Wong:
    "The second large pile of new stuff for 5.10, with changes even more
    monumental than last week!

    We are formally announcing the deprecation of the V4 filesystem format
    in 2030. All users must upgrade to the V5 format, which contains
    design improvements that greatly strengthen metadata validation,
    supports reflink and online fsck, and is the intended vehicle for
    handling timestamps past 2038. We're also deprecating the old Irix
    behavioral tweaks in September 2025.

    Coming along for the ride are two design changes to the deferred
    metadata ops subsystem. One of the improvements is to retain correct
    logical ordering of tasks and subtasks, which is a more logical design
    for upper layers of XFS and will become necessary when we add atomic
    file range swaps and commits. The second improvement to deferred ops
    improves the scalability of the log by helping the log tail to move
    forward during long-running operations. This reduces log contention
    when there are a large number of threads trying to run transactions.

    In addition to that, this fixes numerous small bugs in log recovery;
    refactors logical intent log item recovery to remove the last
    remaining place in XFS where we could have nested transactions; fixes
    a couple of ways that intent log item recovery could fail in ways that
    wouldn't have happened in the regular commit paths; fixes a deadlock
    vector in the GETFSMAP implementation (which improves its performance
    by 20%); and fixes serious bugs in the realtime growfs, fallocate, and
    bitmap handling code.

    Summary:

    - Deprecate the V4 filesystem format, some disused mount options, and
    some legacy sysctl knobs now that we can support dates into the
    25th century. Note that removal of V4 support will not happen until
    the early 2030s.

    - Fix some problems with inode realtime flag propagation.

    - Fix some buffer handling issues when growing a rt filesystem.

    - Fix a problem where a BMAP_REMAP unmap call would free rt extents
    even though the purpose of BMAP_REMAP is to avoid freeing the
    blocks.

    - Strengthen the dabtree online scrubber to check hash values on
    child dabtree blocks.

    - Actually log new intent items created as part of recovering log
    intent items.

    - Fix a bug where quotas weren't attached to an inode undergoing bmap
    intent item recovery.

    - Fix a buffer overrun problem with specially crafted log buffer
    headers.

    - Various cleanups to type usage and slightly inaccurate comments.

    - More cleanups to the xattr, log, and quota code.

    - Don't run the (slower) shared-rmap operations on attr fork
    mappings.

    - Fix a bug where we failed to check the LSN of finobt blocks during
    replay and could therefore overwrite newer data with older data.

    - Clean up the ugly nested transaction mess that log recovery uses to
    stage intent item recovery in the correct order by creating a
    proper data structure to capture recovered chains.

    - Use the capture structure to resume intent item chains with the
    same log space and block reservations as when they were captured.

    - Fix a UAF bug in bmap intent item recovery where we failed to
    maintain our reference to the incore inode if the bmap operation
    needed to relog itself to continue.

    - Rearrange the defer ops mechanism to finish newly created subtasks
    of a parent task before moving on to the next parent task.

    - Automatically relog intent items in deferred ops chains if doing so
    would help us avoid pinning the log tail. This will help fix some
    log scaling problems now and will facilitate atomic file updates
    later.

    - Fix a deadlock in the GETFSMAP implementation by using an internal
    memory buffer to reduce indirect calls and copies to userspace,
    thereby improving its performance by ~20%.

    - Fix various problems when calling growfs on a realtime volume would
    not fully update the filesystem metadata.

    - Fix broken Kconfig asking about deprecated XFS when XFS is
    disabled"

    * tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
    xfs: fix high key handling in the rt allocator's query_range function
    xfs: annotate grabbing the realtime bitmap/summary locks in growfs
    xfs: make xfs_growfs_rt update secondary superblocks
    xfs: fix realtime bitmap/summary file truncation when growing rt volume
    xfs: fix the indent in xfs_trans_mod_dquot
    xfs: do the ASSERT for the arguments O_{u,g,p}dqpp
    xfs: fix deadlock and streamline xfs_getfsmap performance
    xfs: limit entries returned when counting fsmap records
    xfs: only relog deferred intent items if free space in the log gets low
    xfs: expose the log push threshold
    xfs: periodically relog deferred intent items
    xfs: change the order in which child and parent defer ops are finished
    xfs: fix an incore inode UAF in xfs_bui_recover
    xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
    xfs: clean up bmap intent item recovery checking
    xfs: xfs_defer_capture should absorb remaining transaction reservation
    xfs: xfs_defer_capture should absorb remaining block reservations
    xfs: proper replay of deferred ops queued during log recovery
    xfs: remove XFS_LI_RECOVERED
    ...

    Linus Torvalds
     

17 Oct, 2020

2 commits

  • Pavel Machek complained that the question about supporting deprecated
    XFS v4 comes up even when XFS is disabled. This clearly makes no sense,
    so fix Kconfig.

    Reported-by: Pavel Machek
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen

    Darrick J. Wong
     
  • Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key
    in the realtime bitmap is always one less than the number of rt extents,
    which means that the key clamp at the start of the function is wrong.
    The 4th argument to xfs_rtfind_forw is the highest rt extent that we
    want to probe, which means that passing 1 less than the high key is
    wrong. Finally, drop the rem variable that controls the loop because we
    can compare the iteration point (rtstart) against the high key directly.

    The sordid history of this function is that the original commit (fb3c3)
    incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
    to xfs_rtfind_forw. This was wrong because the "high key" is supposed
    to be the largest key for which the caller wants result rows, not the
    key for the first row that could possibly be outside the range that the
    caller wants to see.

    A subsequent attempt (8ad56) to strengthen the parameter checking
    added incorrect clamping of the parameters to the number of rt blocks
    in the system (despite the bitmap functions all taking units of rt
    extents) to avoid querying ranges past the end of the rt bitmap file,
    but failed to fix the incorrect _rtfind_forw parameter. The original
    _rtfind_forw
    parameter error then survived the conversion of the startblock and
    blockcount fields to rt extents (a0e5c), and the most recent off-by-one
    fix (a3a37) thought it was patching a problem when the end of the rt
    volume is not in use, but none of these fixes actually solved the
    original problem that the author was confused about the "limit" argument
    to xfs_rtfind_forw.

    Sadly, all four of these patches were written by this author and even
    his own usage of this function and rt testing were inadequate to get
    this fixed quickly.

    Original-problem: fb3c3de2f65c ("xfs: add a couple of queries to iterate free extents in the rtbitmap")
    Not-fixed-by: 8ad560d2565e ("xfs: strengthen rtalloc query range checks")
    Not-fixed-by: a0e5c435babd ("xfs: fix xfs_rtalloc_rec units")
    Fixes: a3a374bf1889 ("xfs: fix off-by-one error in xfs_rtalloc_query_range")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

15 Oct, 2020

1 commit

  • Pull xfs updates from Darrick Wong:
    "The biggest changes are two new features for the ondisk metadata: one
    to record the sizes of the inode btrees in the AG to increase
    redundancy checks and to improve mount times; and a second new feature
    to support timestamps until the year 2486.

    We also fixed a problem where reflinking into a file that requires
    synchronous writes wouldn't actually flush the updates to disk; clean
    up a fair amount of cruft; and started fixing some bugs in the
    realtime volume code.

    Summary:

    - Clean up the buffer ioend calling path so that the retry strategy
    isn't quite so scattered everywhere.

    - Clean up m_sb_bp handling.

    - New feature: storing inode btree counts in the AGI to speed up
    certain mount time per-AG block reservation operations and add a
    little more metadata redundancy.

    - New feature: Widen inode timestamps and quota grace expiration
    timestamps to support dates through the year 2486.

    - Get rid of more of our custom buffer allocation API wrappers.

    - Use a proper VLA for shortform xattr structure namevals.

    - Force the log after reflinking or deduping into a file that is
    opened with O_SYNC or O_DSYNC.

    - Fix some math errors in the realtime allocator"

    * tag 'xfs-5.10-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (42 commits)
    xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size
    xfs: make sure the rt allocator doesn't run off the end
    xfs: Remove unneeded semicolon
    xfs: force the log after remapping a synchronous-writes file
    xfs: Convert xfs_attr_sf macros to inline functions
    xfs: Use variable-size array for nameval in xfs_attr_sf_entry
    xfs: Remove typedef xfs_attr_shortform_t
    xfs: remove typedef xfs_attr_sf_entry_t
    xfs: Remove kmem_zalloc_large()
    xfs: enable big timestamps
    xfs: trace timestamp limits
    xfs: widen ondisk quota expiration timestamps to handle y2038+
    xfs: widen ondisk inode timestamps to deal with y2038+
    xfs: redefine xfs_ictimestamp_t
    xfs: redefine xfs_timestamp_t
    xfs: move xfs_log_dinode_to_disk to the log recovery code
    xfs: refactor quota timestamp coding
    xfs: refactor default quota grace period setting code
    xfs: refactor quota expiration timer modification
    xfs: explicitly define inode timestamp range
    ...

    Linus Torvalds
     

13 Oct, 2020

3 commits

  • Use XFS_ILOCK_RT{BITMAP,SUM} to annotate grabbing the rt bitmap and
    summary locks when we grow the realtime volume, just like we do most
    everywhere else. This shuts up lockdep warnings about grabbing the
    ILOCK class of locks recursively:

    ============================================
    WARNING: possible recursive locking detected
    5.9.0-rc4-djw #rc4 Tainted: G O
    --------------------------------------------
    xfs_growfs/4841 is trying to acquire lock:
    ffff888035acc230 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]

    but task is already holding lock:
    ffff888035acedb0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&xfs_nondir_ilock_class);
    lock(&xfs_nondir_ilock_class);

    *** DEADLOCK ***

    May be due to missing lock nesting notation
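
    The annotation itself is a one-flag change at each realtime lock
    site, roughly (a sketch, using the existing lockdep subclass flags):

        xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP);
        xfs_ilock(mp->m_rsumip, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);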

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • When we call growfs on the data device, we update the secondary
    superblocks to reflect the updated filesystem geometry. We need to do
    this for growfs on the realtime volume too, because a future xfs_repair
    run could try to fix the filesystem using a backup superblock.

    This was observed by the online superblock scrubbers while running
    xfs/233. One can also trigger this by growing an rt volume, cycling the
    mount, and creating new rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • The realtime bitmap and summary files are regular files that are hidden
    away from the directory tree. Since they're regular files, inode
    inactivation will try to purge what it thinks are speculative
    preallocations beyond the incore size of the file. Unfortunately,
    xfs_growfs_rt forgets to update the incore size when it resizes the
    inodes, with the result that inactivating the rt inodes at unmount time
    will cause their contents to be truncated.

    Fix this by updating the incore size when we change the ondisk size as
    part of updating the superblock. Note that we don't do this when we're
    allocating blocks to the rt inodes because we actually want those blocks
    to get purged if the growfs fails.
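
    A sketch of the change in xfs_growfs_rt (field names as in the era's
    codebase; treat as an outline): propagate the new ondisk size to the
    incore inode when the superblock fields are updated.

        mp->m_rbmip->i_d.di_size =
            nsbp->sb_rbmblocks * nsbp->sb_blocksize;
        i_size_write(VFS_I(mp->m_rbmip), mp->m_rbmip->i_d.di_size);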

    This fixes corruption complaints from the online rtsummary checker when
    running xfs/233. Since that test requires rmap, one can also trigger
    this by growing an rt volume, cycling the mount, and creating rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     

07 Oct, 2020

9 commits

  • The formatting is strange in xfs_trans_mod_dquot, so do a reindent.

    Signed-off-by: Kaixu Xia
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Kaixu Xia
     
    If we pass in XFS_QMOPT_{U,G,P}QUOTA flags and a different
    uid/gid/prid than the ones currently associated with the inode, the
    arguments O_{u,g,p}dqpp shouldn't be NULL, so add ASSERTs for them.

    Signed-off-by: Kaixu Xia
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Kaixu Xia
     
  • Refactor xfs_getfsmap to improve its performance: instead of indirectly
    calling a function that copies one record to userspace at a time, create
    a shadow buffer in the kernel and copy the whole array once at the end.
    On the author's computer, this reduces the runtime on his /home by ~20%.

    This also eliminates a deadlock when running GETFSMAP against the
    realtime device. The current code locks the rtbitmap to create
    fsmappings and copies them into userspace, having not released the
    rtbitmap lock. If the userspace buffer is an mmap of a sparse file that
    itself resides on the realtime device, the write page fault will recurse
    into the fs for allocation, which will deadlock on the rtbitmap lock.

    Fixes: 4c934c7dd60c ("xfs: report realtime space information via the rtbitmap")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • If userspace asked fsmap to count the number of entries, we cannot
    return more than UINT_MAX entries because fmh_entries is u32.
    Therefore, stop counting if we hit this limit, or else we will waste
    time only to return truncated results.
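
    The guard is tiny; a sketch using the fsmap head fields (the
    surrounding context is assumed):

        /* just counting? stop once the u32 counter would saturate */
        if (info->head->fmh_count == 0) {
            if (info->head->fmh_entries == UINT_MAX)
                return -ECANCELED;
            info->head->fmh_entries++;
            return 0;
        }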

    Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Chandan Babu R

    Darrick J. Wong
     
  • Now that we have the ability to ask the log how far the tail needs to be
    pushed to maintain its free space targets, augment the decision to relog
    an intent item so that we only do it if the log has hit the 75% full
    threshold. There's no point in relogging an intent into the same
    checkpoint, and there's no need to relog if there's plenty of free space
    in the log.
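
    Sketched below (local names assumed; dfp_intent is a log item pointer
    here), the per-item decision combines both tests: skip intents
    already in the current checkpoint, and only relog when the intent's
    LSN falls behind the push threshold.

        if (dfp->dfp_intent == NULL ||
            xfs_log_item_in_current_chkpt(dfp->dfp_intent))
            continue;
        if (threshold_lsn == NULLCOMMITLSN)
            threshold_lsn = xlog_grant_push_threshold(log, 0);
        if (threshold_lsn == NULLCOMMITLSN ||
            XFS_LSN_CMP(dfp->dfp_intent->li_lsn, threshold_lsn) >= 0)
            continue;
        /* ...relog the intent into the current transaction... */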

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • Separate the computation of the log push threshold and the push logic in
    xlog_grant_push_ail. This enables higher level code to determine (for
    example) that it is holding on to a logged intent item and the log is so
    busy that it is more than 75% full. In that case, it would be desirable
    to move the log item towards the head to release the tail, which we will
    cover in the next patch.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • There's a subtle design flaw in the deferred log item code that can lead
    to pinning the log tail. Taking up the defer ops chain examples from
    the previous commit, we can get trapped in sequences like this:

    Caller hands us a transaction t0 with D0-D3 attached. The defer ops
    chain will look like the following if the transaction rolls succeed:

    t1: D0(t0), D1(t0), D2(t0), D3(t0)
    t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
    t3: d5(t1), D1(t0), D2(t0), D3(t0)
    ...
    t9: d9(t7), D3(t0)
    t10: D3(t0)
    t11: d10(t10), d11(t10)
    t12: d11(t10)

    In transaction 9, we finish d9 and try to roll to t10 while holding onto
    an intent item for D3 that we logged in t0.

    The previous commit changed the order in which we place new defer ops in
    the defer ops processing chain to reduce the maximum chain length. Now
    make xfs_defer_finish_noroll capable of relogging the entire chain
    periodically so that we can always move the log tail forward. Most
    chains will never get relogged, except for operations that generate very
    long chains (large extents containing many blocks with different sharing
    levels) or are on filesystems with small logs and a lot of ongoing
    metadata updates.

    Callers are now required to ensure that the transaction reservation is
    large enough to handle logging done items and new intent items for the
    maximum possible chain length. Most callers are careful to keep the
    chain lengths low, so the overhead should be minimal.

    The decision to relog an intent item is made based on whether the intent
    was logged in a previous checkpoint, since there's no point in relogging
    an intent into the same checkpoint.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • The defer ops code has been finishing items in the wrong order -- if a
    top level defer op creates items A and B, and finishing item A creates
    more defer ops A1 and A2, we'll put the new items on the end of the
    chain and process them in the order A B A1 A2. This is kind of weird,
    since it's convenient for programmers to be able to think of A and B as
    an ordered sequence where all the sub-tasks for A must finish before we
    move on to B, e.g. A A1 A2 B.

    Right now, our log intent items are not so complex that this matters,
    but this will become important for the atomic extent swapping patchset.
    In order to maintain correct reference counting of extents, we have to
    unmap and remap extents in that order, and we want to complete that work
    before moving on to the next range that the user wants to swap. This
    patch fixes defer ops to satisfy that requirement.

    The primary symptom of the incorrect order was noticed in an early
    performance analysis of the atomic extent swap code. An astonishingly
    large number of deferred work items accumulated when userspace requested
    an atomic update of two very fragmented files. The cause of this was
    traced to the same ordering bug in the inner loop of
    xfs_defer_finish_noroll.

    If the ->finish_item method of a deferred operation queues new deferred
    operations, those new deferred ops are appended to the tail of the
    pending work list. To illustrate, say that a caller creates a
    transaction t0 with four deferred operations D0-D3. The first thing
    defer ops does is roll the transaction to t1, leaving us with:

    t1: D0(t0), D1(t0), D2(t0), D3(t0)

    Let's say that finishing each of D0-D3 will create two new deferred ops.
    After finishing D0 and rolling, we'll have the following chain:

    t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)

    d4 and d5 were logged to t1. Notice that while we're about to start
    work on D1, we haven't actually completed all the work implied by D0
    being finished. So far we've been careful (or lucky) to structure the
    dfops callers such that D1 doesn't depend on d4 or d5 being finished,
    but this is a potential logic bomb.

    There's a second problem lurking. Let's see what happens as we finish
    D1-D3:

    t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
    t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
    t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)

    Let's say that d4-d11 are simple work items that don't queue any
    other operations, which means that we can complete d4 and roll to t6:

    t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
    t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
    ...
    t11: d10(t4), d11(t4)
    t12: d11(t4)

    When we try to roll to transaction #12, we're holding defer op d11,
    which we logged way back in t4. This means that the tail of the log is
    pinned at t4. If the log is very small or there are a lot of other
    threads updating metadata, this means that we might have wrapped the
    log and cannot roll to t12 because there isn't enough space left
    before we'd run into t4.

    Let's shift back to the original failure. I mentioned before that I
    discovered this flaw while developing the atomic file update code. In
    that scenario, we have a defer op (D0) that finds a range of file blocks
    to remap, creates a handful of new defer ops to do that, and then asks
    to be continued with however much work remains.

    So, D0 is the original swapext deferred op. The first thing defer ops
    does is rolls to t1:

    t1: D0(t0)

    We try to finish D0, logging d1 and d2 in the process, but can't get all
    the work done. We log a done item and a new intent item for the work
    that D0 still has to do, and roll to t2:

    t2: D0'(t1), d1(t1), d2(t1)

    We roll and try to finish D0', but still can't get all the work done, so
    we log a done item and a new intent item for it, requeue D0 a second
    time, and roll to t3:

    t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)

    If it takes 48 more rolls to complete D0, then we'll finally dispense
    with D0 in t50:

    t50: D(t49), d1(t1), ..., d102(t50)

    We then try to roll again to get a chain like this:

    t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
    ...
    t152: d102(t50)

    Notice that in rolling to transaction #51, we're holding on to a log
    intent item for d1 that was logged in transaction #1. This means that
    the tail of the log is pinned at t1. If the log is very small or there
    are a lot of other threads updating metadata, this means that we might
    have wrapped the log and cannot roll to t51 because there isn't enough
    space left before we'd run into t1. This is of course problem #2 again.

    But notice the third problem with this scenario: we have 102 defer ops
    tied to this transaction! Each of these items is backed by pinned
    kernel memory, which means that we risk OOM if the chains get too long.

    Yikes. Problem #1 is a subtle logic bomb that could hit someone in the
    future; problem #2 applies (rarely) to the current upstream, and problem
    #3 applies to work under development.

    This is not how incremental deferred operations were supposed to work.
    The dfops design of logging in the same transaction an intent-done item
    and a new intent item for the work remaining was to make it so that we
    only have to juggle enough deferred work items to finish that one small
    piece of work. Deferred log item recovery will find that first
    unfinished work item and restart it, no matter how many other intent
    items might follow it in the log. Therefore, it's ok to put the new
    intents at the start of the dfops chain.

    For the first example, the chains look like this:

    t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
    t3: d5(t1), D1(t0), D2(t0), D3(t0)
    ...
    t9: d9(t7), D3(t0)
    t10: D3(t0)
    t11: d10(t10), d11(t10)
    t12: d11(t10)

    For the second example, the chains look like this:

    t1: D0(t0)
    t2: d1(t1), d2(t1), D0'(t1)
    t3: d2(t1), D0'(t1)
    t4: D0'(t1)
    t5: d3(t4), d4(t4), D0''(t4)
    ...
    t148: D0(t147)
    t149: d101(t148), d102(t148)
    t150: d102(t148)

    This actually sucks more for pinning the log tail (we try to roll to
    t10 while holding an intent item that was logged in t0) but we've
    solved
    problem #1. We've also reduced the maximum chain length from:

    sum(all the new items) + nr_original_items

    to:

    max(new items that each original item creates) + nr_original_items

    This solves problem #3 by sharply reducing the number of defer ops that
    can be attached to a transaction at any given time. The change makes
    the problem of log tail pinning worse, but it is an improvement we
    need in order to solve problem #2. Actually solving #2, however, is
    left to the next patch.

    Note that a subsequent analysis of some hard-to-trigger reflink and COW
    livelocks on extremely fragmented filesystems (or systems running a lot
    of IO threads) showed the same symptoms -- uncomfortably large numbers
    of incore deferred work items and occasional stalls in the transaction
    grant code while waiting for log reservations. I think this patch and
    the next one will also solve these problems.

    As originally written, the code used list_splice_tail_init instead of
    list_splice_init, so change that, and leave a short comment explaining
    our actions.
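
    In code form, the essence of the change in xfs_defer_finish_noroll is
    a one-line swap (sketch):

        /* splice new deferred work onto the *head* of the pending list,
         * so subtasks finish before we move on to the next parent task.
         * Previously: list_splice_tail_init(&(*tp)->t_dfops, &dop_pending);
         */
        list_splice_init(&(*tp)->t_dfops, &dop_pending);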

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • In xfs_bui_item_recover, there exists a use-after-free bug with regards
    to the inode that is involved in the bmap replay operation. If the
    mapping operation does not complete, we call xfs_bmap_unmap_extent to
    create a deferred op to finish the unmapping work, and we retain a
    pointer to the incore inode.

    Unfortunately, the very next thing we do is commit the transaction and
    drop the inode. If reclaim tears down the inode before we try to finish
    the defer ops, we dereference garbage and blow up. Therefore, create a
    way to join inodes to the defer ops freezer so that we can maintain the
    xfs_inode reference until we're done with the inode.

    Note: This imposes the requirement that there be enough memory to keep
    every incore inode in memory throughout recovery.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong