Eric Lee / smarc-fsl-linux-kernel

17 Apr, 2020

1 commit

f0f7a674d xfs: move inode flush to the sync workqueue ... Browse Code »

Move the inode dirty data flushing to a workqueue so that multiple
threads can take advantage of a single thread's flushing work. The
ratelimiting technique used in bdd4ee4 was not successful, because
threads that skipped the inode flush scan due to ratelimiting would
ENOSPC early, which caused occasional (but noticeable) changes in
behavior and sporadic fstest regressions.

Therefore, make all the writer threads wait on a single inode flush,
which eliminates both the stampeding hordes of flushers and the small
window in which a write could fail with ENOSPC because it lost the
ratelimit race after even another thread freed space.

Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC")
Signed-off-by: Darrick J. Wong
Reviewed-by: Brian Foster

Darrick J. Wong
2020-04-17 00:07:42 +0800

13 Apr, 2020

2 commits

c142932c2 xfs: fix partially uninitialized structure in xfs_reflink_remap_extent ... Browse Code »

In the reflink extent remap function, it turns out that uirec (the block
mapping corresponding only to the part of the passed-in mapping that got
unmapped) was not fully initialized. Specifically, br_state was not
being copied from the passed-in struct to the uirec. This could lead to
unpredictable results such as the reflinked mapping being marked
unwritten in the destination file.

Signed-off-by: Darrick J. Wong
Reviewed-by: Brian Foster

Darrick J. Wong
2020-04-13 23:00:23 +0800
4b674b9ac xfs: acquire superblock freeze protection on eofblocks scans ... Browse Code »

The filesystem freeze sequence in XFS waits on any background
eofblocks or cowblocks scans to complete before the filesystem is
quiesced. At this point, the freezer has already stopped the
transaction subsystem, however, which means a truncate or cowblock
cancellation in progress is likely blocked in transaction
allocation. This results in a deadlock between freeze and the
associated scanner.

Fix this problem by holding superblock write protection across calls
into the block reapers. Since protection for background scans is
acquired from the workqueue task context, trylock to avoid a similar
deadlock between freeze and blocking on the write lock.

Fixes: d6b636ebb1c9f ("xfs: halt auto-reclamation activities while rebuilding rmap")
Reported-by: Paul Furtado
Signed-off-by: Brian Foster
Reviewed-by: Chandan Rajendra
Reviewed-by: Christoph Hellwig
Reviewed-by: Allison Collins
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Brian Foster
2020-04-13 23:00:19 +0800

06 Apr, 2020

2 commits

5833112df xfs: reflink should force the log out if mounted with wsync ... Browse Code »

Reflink should force the log out to disk if the filesystem was mounted
with wsync, the same as most other operations in xfs.

[Note: XFS_MOUNT_WSYNC is set when the admin mounts the filesystem
with either the 'wsync' or 'sync' mount options, which effectively means
that we're classifying reflink/dedupe as IO operations and making them
synchronous when required.]

Fixes: 3fc9f5e409319 ("xfs: remove xfs_reflink_remap_range")
Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
[darrick: add more to the changelog]
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-04-06 23:44:39 +0800
54fbdd103 xfs: factor out a new xfs_log_force_inode helper ... Browse Code »

Create a new helper to force the log up to the last LSN touching an
inode.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-04-06 23:44:35 +0800

02 Apr, 2020

1 commit

d9fdd0adf xfs: fix inode number overflow in ifree cluster helper ... Browse Code »

Qian Cai reports seemingly random buffer read verifier errors during
filesystem writeback. This was isolated to a recent patch that
factored out some inode cluster freeing code and happened to cast an
unsigned inode number type to a signed value. If the inode number
value overflows, we can skip marking in-core inodes associated with
the underlying buffer stale at the time the physical inodes are
freed. If such an inode happens to be dirty, xfsaild will eventually
attempt to write it back over non-inode blocks. The invalidation of
the underlying inode buffer causes writeback to read the buffer from
disk. This fails the read verifier (preventing eventual corruption)
if the buffer no longer looks like an inode cluster. Analysis by
Dave Chinner.

Fix up the helper to use the proper type for inode number values.

Fixes: 5806165a6663 ("xfs: factor inode lookup from xfs_ifree_cluster")
Reported-by: Qian Cai
Signed-off-by: Brian Foster
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Brian Foster
2020-04-02 23:19:25 +0800

31 Mar, 2020

2 commits

d8fcb6f13 xfs: remove redundant variable assignment in xfs_symlink() ... Browse Code »

The variables 'udqp' and 'gdqp' have been initialized, so remove
redundant variable assignment in xfs_symlink().

Signed-off-by: Kaixu Xia
Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Dave Chinner
Signed-off-by: Darrick J. Wong

Kaixu Xia
2020-03-31 23:42:22 +0800
c6425702f xfs: ratelimit inode flush on buffered write ENOSPC ... Browse Code »

A customer reported rcu stalls and softlockup warnings on a computer
with many CPU cores and many many more IO threads trying to write to a
filesystem that is totally out of space. Subsequent analysis pointed to
the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
which causes a lot of wb_writeback_work to be queued. The writeback
worker spends so much time trying to wake the many many threads waiting
for writeback completion that it trips the softlockup detector, and (in
this case) the system automatically reboots.

In addition, they complain that the lengthy xfs_flush_inodes scan traps
all of those threads in uninterruptible sleep, which hampers their
ability to kill the program or do anything else to escape the situation.

If there's thousands of threads trying to write to files on a full
filesystem, each of those threads will start separate copies of the
inode flush scan. This is kind of pointless since we only need one
scan, so rate limit the inode flush.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner

Darrick J. Wong
2020-03-31 23:41:45 +0800

29 Mar, 2020

3 commits

d4bc4c5fd xfs: return locked status of inode buffer on xfsaild push ... Browse Code »

If the inode buffer backing a particular inode is locked,
xfs_iflush() returns -EAGAIN and xfs_inode_item_push() skips the
inode. It still returns success to xfsaild, however, which bypasses
the xfsaild backoff heuristic. Update xfs_inode_item_push() to
return locked status if the inode buffer couldn't be locked.

Signed-off-by: Brian Foster
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Brian Foster
2020-03-29 00:40:12 +0800
8d3d7e2b3 xfs: trylock underlying buffer on dquot flush ... Browse Code »

A dquot flush currently blocks on the buffer lock for the underlying
dquot buffer. In turn, this causes xfsaild to block rather than
continue processing other items in the meantime. Update
xfs_qm_dqflush() to trylock the buffer, similar to how inode buffers
are handled, and return -EAGAIN if the lock fails. Fix up any
callers that don't currently handle the error properly.

Signed-off-by: Brian Foster
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Brian Foster
2020-03-29 00:40:11 +0800
63337b63e xfs: remove unnecessary ternary from xfs_create ... Browse Code »

Since the "no-allocation" reservations for file creations has
been removed, the resblks value should be larger than zero, so
remove unnecessary ternary conditional.

Signed-off-by: Kaixu Xia
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
[darrick: s/judgment/ternary/]
Signed-off-by: Darrick J. Wong

Kaixu Xia
2020-03-29 00:40:11 +0800

27 Mar, 2020

18 commits

5cc3c006e xfs: don't write a corrupt unmount record to force summary counter recalc ... Browse Code »

In commit f467cad95f5e3, I added the ability to force a recalculation of
the filesystem summary counters if they seemed incorrect. This was done
(not entirely correctly) by tweaking the log code to write an unmount
record without the UMOUNT_TRANS flag set. At next mount, the log
recovery code will fail to find the unmount record and go into recovery,
which triggers the recalculation.

What actually gets written to the log is what ought to be an unmount
record, but without any flags set to indicate what kind of record it
actually is. This worked to trigger the recalculation, but we shouldn't
write bogus log records when we could simply write nothing.

Fixes: f467cad95f5e3 ("xfs: force summary counter recalc at next mount")
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
Reviewed-by: Brian Foster

Darrick J. Wong
2020-03-27 23:32:55 +0800
5806165a6 xfs: factor inode lookup from xfs_ifree_cluster ... Browse Code »

There's lots of indent in this code which makes it a bit hard to
follow. We are also going to completely rework the inode lookup code
as part of the inode reclaim rework, so factor out the inode lookup
code from the inode cluster freeing code.

Based on prototype code from Christoph Hellwig.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:55 +0800
8eb807bd8 xfs: tail updates only need to occur when LSN changes ... Browse Code »

We currently wake anything waiting on the log tail to move whenever
the log item at the tail of the log is removed. Historically this
was fine behaviour because there were very few items at any given
LSN. But with delayed logging, there may be thousands of items at
any given LSN, and we can't move the tail until they are all gone.

Hence if we are removing them in near tail-first order, we might be
waking up processes waiting on the tail LSN to change (e.g. log
space waiters) repeatedly without them being able to make progress.
This also occurs with the new sync push waiters, and can result in
thousands of spurious wakeups every second when under heavy direct
reclaim pressure.

To fix this, check that the tail LSN has actually changed on the
AIL before triggering wakeups. This will reduce the number of
spurious wakeups when doing bulk AIL removal and make this code much
more efficient.

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Allison Collins
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:55 +0800
4165994ac xfs: factor common AIL item deletion code ... Browse Code »

Factor the common AIL deletion code that does all the wakeups into a
helper so we only have one copy of this somewhat tricky code to
interface with all the wakeups necessary when the LSN of the log
tail changes.

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Allison Collins
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:55 +0800
d59eadaea xfs: correctly acount for reclaimable slabs ... Browse Code »

The XFS inode item slab actually reclaimed by inode shrinker
callbacks from the memory reclaim subsystem. These should be marked
as reclaimable so the mm subsystem has the full picture of how much
memory it can actually reclaim from the XFS slab caches.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Allison Collins
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:54 +0800
12eba65b2 xfs: Improve metadata buffer reclaim accountability ... Browse Code »

The buffer cache shrinker frees more than just the xfs_buf slab
objects - it also frees the pages attached to the buffers. Make sure
the memory reclaim code accounts for this memory being freed
correctly, similar to how the inode shrinker accounts for pages
freed from the page cache due to mapping invalidation.

We also need to make sure that the mm subsystem knows these are
reclaimable objects. We provide the memory reclaim subsystem with a
a shrinker to reclaim xfs_bufs, so we should really mark the slab
that way.

We also have a lot of xfs_bufs in a busy system, spread them around
like we do inodes.

Signed-off-by: Dave Chinner
Reviewed-by: Allison Collins
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:54 +0800
2def2845c xfs: don't allow log IO to be throttled ... Browse Code »

Running metadata intensive workloads, I've been seeing the AIL
pushing getting stuck on pinned buffers and triggering log forces.
The log force is taking a long time to run because the log IO is
getting throttled by wbt_wait() - the block layer writeback
throttle. It's being throttled because there is a huge amount of
metadata writeback going on which is filling the request queue.

IOWs, we have a priority inversion problem here.

Mark the log IO bios with REQ_IDLE so they don't get throttled
by the block layer writeback throttle. When we are forcing the CIL,
we are likely to need to to tens of log IOs, and they are issued as
fast as they can be build and IO completed. Hence REQ_IDLE is
appropriate - it's an indication that more IO will follow shortly.

And because we also set REQ_SYNC, the writeback throttle will now
treat log IO the same way it treats direct IO writes - it will not
throttle them at all. Hence we solve the priority inversion problem
caused by the writeback throttle being unable to distinguish between
high priority log IO and background metadata writeback.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Christoph Hellwig
Reviewed-by: Allison Collins
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:54 +0800
0e7ab7efe xfs: Throttle commits on delayed background CIL push ... Browse Code »

In certain situations the background CIL push can be indefinitely
delayed. While we have workarounds from the obvious cases now, it
doesn't solve the underlying issue. This issue is that there is no
upper limit on the CIL where we will either force or wait for
a background push to start, hence allowing the CIL to grow without
bound until it consumes all log space.

To fix this, add a new wait queue to the CIL which allows background
pushes to wait for the CIL context to be switched out. This happens
when the push starts, so it will allow us to block incoming
transaction commit completion until the push has started. This will
only affect processes that are running modifications, and only when
the CIL threshold has been significantly overrun.

This has no apparent impact on performance, and doesn't even trigger
until over 45 million inodes had been created in a 16-way fsmark
test on a 2GB log. That was limiting at 64MB of log space used, so
the active CIL size is only about 3% of the total log in that case.
The concurrent removal of those files did not trigger the background
sleep at all.

Signed-off-by: Dave Chinner
Reviewed-by: Allison Collins
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:54 +0800
108a42358 xfs: Lower CIL flush limit for large logs ... Browse Code »

The current CIL size aggregation limit is 1/8th the log size. This
means for large logs we might be aggregating at least 250MB of dirty objects
in memory before the CIL is flushed to the journal. With CIL shadow
buffers sitting around, this means the CIL is often consuming >500MB
of temporary memory that is all allocated under GFP_NOFS conditions.

Flushing the CIL can take some time to do if there is other IO
ongoing, and can introduce substantial log force latency by itself.
It also pins the memory until the objects are in the AIL and can be
written back and reclaimed by shrinkers. Hence this threshold also
tends to determine the minimum amount of memory XFS can operate in
under heavy modification without triggering the OOM killer.

Modify the CIL space limit to prevent such huge amounts of pinned
metadata from aggregating. We can have 2MB of log IO in flight at
once, so limit aggregation to 16x this size. This threshold was
chosen as it little impact on performance (on 16-way fsmark) or log
traffic but pins a lot less memory on large logs especially under
heavy memory pressure. An aggregation limit of 8x had 5-10%
performance degradation and a 50% increase in log throughput for
the same workload, so clearly that was too small for highly
concurrent workloads on large logs.

This was found via trace analysis of AIL behaviour. e.g. insertion
from a single CIL flush:

xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL

$ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
1721823
$

So there were 1.7 million objects inserted into the AIL from this
CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which
was the end of the trace (i.e. it hadn't finished). Clearly a major
problem.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Allison Collins
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:54 +0800
b843299ba xfs: remove some stale comments from the log code ... Browse Code »

Signed-off-by: Dave Chinner
Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:54 +0800
3c702f959 xfs: refactor unmount record writing ... Browse Code »

Separate out the unmount record writing from the rest of the
ticket and log state futzing necessary to make it work. This is
a no-op, just makes the code cleaner and places the unmount record
formatting and writing alongside the commit record formatting and
writing code.

We can also get rid of the ticket flag clearing before the
xlog_write() call because it no longer cares about the state of
XLOG_TIC_INITED.

Signed-off-by: Dave Chinner
Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:54 +0800
f10e925de xfs: merge xlog_commit_record with xlog_write_done ... Browse Code »

xlog_write_done() is just a thin wrapper around xlog_commit_record(), so
they can be merged together easily.

Signed-off-by: Dave Chinner
Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:54 +0800
8b41e3f98 xfs: split xlog_ticket_done ... Browse Code »

Remove xlog_ticket_done and just call the renamed low-level helpers for
ungranting or regranting log space directly. To make that a little
the reference put on the ticket and all tracing is moved into the actual
helpers.

Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-03-27 23:32:53 +0800
70e42f2d4 xfs: kill XLOG_TIC_INITED ... Browse Code »

It is not longer used or checked by anything, so remove the last
traces from the log ticket code.

Signed-off-by: Dave Chinner
Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:53 +0800
dd401770b xfs: refactor and split xfs_log_done() ... Browse Code »

xfs_log_done() does two separate things. Firstly, it triggers commit
records to be written for permanent transactions, and secondly it
releases or regrants transaction reservation space.

Since delayed logging was introduced, transactions no longer write
directly to the log, hence they never have the XLOG_TIC_INITED flag
cleared on them. Hence transactions never write commit records to
the log and only need to modify reservation space.

Split up xfs_log_done into two parts, and only call the parts of the
operation needed for the context xfs_log_done() is currently being
called from.

Signed-off-by: Dave Chinner
Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:53 +0800
9590e9c68 xfs: re-order initial space accounting checks in xlog_write ... Browse Code »

Commit and unmount records records do not need start records to be
written, so rearrange the logic in xlog_write() to remove the need
to check for XLOG_TIC_INITED to determine if we should account for
the space used by a start record.

Signed-off-by: Dave Chinner
Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:53 +0800
7ec949212 xfs: don't try to write a start record into every iclog ... Browse Code »

The xlog_write() function iterates over iclogs until it completes
writing all the log vectors passed in. The ticket tracks whether
a start record has been written or not, so only the first iclog gets
a start record. We only ever pass single use tickets to
xlog_write() so we only ever need to write a start record once per
xlog_write() call.

Hence we don't need to store whether we should write a start record
in the ticket as the callers provide all the information we need to
determine if a start record should be written. For the moment, we
have to ensure that we clear the XLOG_TIC_INITED appropriately so
the code in xfs_log_done() still works correctly for committing
transactions.

(darrick: Note the slight behavior change that we always deduct the
size of the op header from the ticket, even for unmount records)

Signed-off-by: Dave Chinner
[hch: pass an explicit need_start_rec argument]
Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Dave Chinner
2020-03-27 23:32:53 +0800
f8e566c0f xfs: validate the realtime geometry in xfs_validate_sb_common ... Browse Code »

Validate the geometry of the realtime geometry when we mount the
filesystem, so that we don't abruptly shut down the filesystem later on.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner

Darrick J. Wong
2020-03-27 23:32:53 +0800

26 Mar, 2020

5 commits

27fb5a72f xfs: prohibit fs freezing when using empty transactions ... Browse Code »

I noticed that fsfreeze can take a very long time to freeze an XFS if
there happens to be a GETFSMAP caller running in the background. I also
happened to notice the following in dmesg:

------------[ cut here ]------------
WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs]
Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs]
CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs]
Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54
RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff
RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000
R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8
FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0
Call Trace:
xfs_fs_freeze+0x25/0x40 [xfs]
freeze_super+0xc8/0x180
do_vfs_ioctl+0x70b/0x750
? __fget_files+0x135/0x210
ksys_ioctl+0x3a/0xb0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x1a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe

These two things appear to be related. The assertion trips when another
thread initiates a fsmap request (which uses an empty transaction) after
the freezer waited for m_active_trans to hit zero but before the the
freezer executes the WARN_ON just prior to calling xfs_log_quiesce.

The lengthy delays in freezing happen because the freezer calls
xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the
GETFSMAP caller is continuing to grab and release buffers, which means
that it can take a very long time for the buffer lru list to empty out.

We fix both of these races by calling sb_start_write to obtain freeze
protection while using empty transactions for GETFSMAP and for metadata
scrubbing. The other two users occur during mount, during which time we
cannot fs freeze.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner

Darrick J. Wong
2020-03-26 23:19:24 +0800
842a42d12 xfs: shutdown on failure to add page to log bio ... Browse Code »

If the bio_add_page() call fails, we proceed to write out a
partially constructed log buffer. This corrupts the physical log
such that log recovery is not possible. Worse, persistent
occurrences of this error eventually lead to a BUG_ON() failure in
bio_split() as iclogs wrap the end of the physical log, which
triggers log recovery on subsequent mount.

Rather than warn about writing out a corrupted log buffer, shutdown
the fs as is done for any log I/O related error. This preserves the
consistency of the physical log such that log recovery succeeds on a
subsequent mount. Note that this was observed on a 64k page debug
kernel without upstream commit 59bb47985c1d ("mm, sl[aou]b:
guarantee natural alignment for kmalloc(power-of-two)"), which
demonstrated frequent iclog bio overflows due to unaligned (slab
allocated) iclog data buffers.

Signed-off-by: Brian Foster
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Brian Foster
2020-03-26 23:19:24 +0800
d59f44d3e xfs: directory bestfree check should release buffers ... Browse Code »

When we're checking bestfree information in directory blocks, always
drop the block buffer at the end of the function. We should always
release resources when we're done using them.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner

Darrick J. Wong
2020-03-26 23:19:24 +0800
afbabf563 xfs: drop all altpath buffers at the end of the sibling check ... Browse Code »

The dirattr btree checking code uses the altpath substructure of the
dirattr state structure to check the sibling pointers of dir/attr tree
blocks. At the end of sibling checks, xfs_da3_path_shift could have
changed multiple levels of buffer pointers in the altpath structure.
Although we release the leaf level buffer, this isn't enough -- we also
need to release the node buffers that are unique to the altpath.

Not releasing all of the altpath buffers leaves them locked to the
transaction. This is suboptimal because we should release resources
when we don't need them anymore. Fix the function to loop all levels of
the altpath, and fix the return logic so that we always run the loop.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner

Darrick J. Wong
2020-03-26 23:19:24 +0800
5885539f0 xfs: preserve default grace interval during quotacheck ... Browse Code »

When quotacheck runs, it zeroes all the timer fields in every dquot.
Unfortunately, it also does this to the root dquot, which erases any
preconfigured grace intervals and warning limits that the administrator
may have set. Worse yet, the incore copies of those variables remain
set. This cache coherence problem manifests itself as the grace
interval mysteriously being reset back to the defaults at the /next/
mount.

Fix it by not resetting the root disk dquot's timer and warning fields.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Reviewed-by: Christoph Hellwig

Darrick J. Wong
2020-03-26 23:19:24 +0800

23 Mar, 2020

6 commits

693639994 xfs: remove xlog_state_want_sync ... Browse Code »

Open code the xlog_state_want_sync logic in its two callers given that
this function is a trivial wrapper around xlog_state_switch_iclogs.

Move the lockdep assert into xlog_state_switch_iclogs to not lose this
debugging aid, and improve the comment that documents
xlog_state_switch_iclogs as well.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-03-23 23:27:59 +0800
5781464bd xfs: move the ioerror check out of xlog_state_clean_iclog ... Browse Code »

Use the shutdown flag in the log to bypass xlog_state_clean_iclog
entirely in case of a shut down log.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-03-23 23:27:59 +0800
c814b4f24 xfs: refactor xlog_state_clean_iclog ... Browse Code »

Factor out a few self-contained helpers from xlog_state_clean_iclog, and
update the documentation so it primarily documents why things happens
instead of how.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-03-23 23:27:59 +0800
12e6a0f44 xfs: remove the aborted parameter to xlog_state_done_syncing ... Browse Code »

We can just check for a shut down log all the way down in
xlog_cil_committed instead of passing the parameter. This means a
slight behavior change in that we now also abort log items if the
shutdown came in halfway into the I/O completion processing, which
actually is the right thing to do.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-03-23 23:27:59 +0800
a582f32fa xfs: simplify log shutdown checking in xfs_log_release_iclog ... Browse Code »

There is no need to check for the ioerror state before the lock, as
the shutdown case is not a fast path. Also remove the call to force
shutdown the file system, as it must have been shut down already
for an iclog to be in the ioerror state. Also clean up the flow of
the function a bit.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-03-23 23:27:59 +0800
f97a43e43 xfs: simplify the xfs_log_release_iclog calling convention ... Browse Code »

The only caller of xfs_log_release_iclog doesn't care about the return
value, so remove it. Also don't bother passing the mount pointer,
given that we can trivially derive it from the iclog.

Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Christoph Hellwig
2020-03-23 23:27:58 +0800