05 Oct, 2017

1 commit

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type, we can encode them the same way for both the bio and
    request fields. In addition, this allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    It also allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can
    do that later), which lets us clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
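
    As a rough sketch of the single-value encoding described above
    (illustrative constants only, not the kernel's actual REQ_OP_*
    definitions), the operation can live in the low bits of one 32-bit
    word with the flag bits packed above it:

        #include <stdint.h>
        #include <stdio.h>

        /* Illustrative layout only: low 8 bits hold the operation,      */
        /* the remaining bits hold flags (all names are hypothetical).   */
        #define OP_BITS   8
        #define OP_MASK   ((1u << OP_BITS) - 1)

        enum { OP_READ = 0, OP_WRITE = 1, OP_FLUSH = 2 };  /* sample ops   */
        #define FLG_SYNC  (1u << (OP_BITS + 0))            /* sample flags */
        #define FLG_FUA   (1u << (OP_BITS + 1))

        static inline uint32_t make_cmd(uint32_t op, uint32_t flags)
        {
            return (flags & ~OP_MASK) | (op & OP_MASK);
        }

        int main(void)
        {
            uint32_t cmd = make_cmd(OP_WRITE, FLG_SYNC | FLG_FUA);

            /* Both pieces come back out of the same 32-bit word. */
            printf("op=%u sync=%d fua=%d\n",
                   (unsigned)(cmd & OP_MASK),
                   !!(cmd & FLG_SYNC), !!(cmd & FLG_FUA));
            return 0;
        }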

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef295ecf090d3e86e5b742fc6ab34f1122a43773)

    Conflicts:
    block/blk-mq.c
    include/linux/blk_types.h
    include/linux/blkdev.h

    Christoph Hellwig
     

20 Sep, 2017

39 commits

  • commit 7bf7a193a90cadccaad21c5970435c665c40fe27 upstream.

    Fix up all the compiler warnings that have crept in.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Cc: Arnd Bergmann
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     
  • commit 6c370590cfe0c36bcd62d548148aa65c984540b7 upstream.

    In function xfs_test_remount_options(), kfree() is used to free memory
    allocated by kmem_zalloc(). But it is better to use kmem_free().
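
    As a general illustration of the rule behind this change (my_zalloc
    and my_free are made-up stand-ins, not the XFS kmem wrappers):
    memory taken from a wrapping allocator should be returned through
    the matching wrapper, since the wrapper is free to add accounting or
    use a different backend than a bare free/kfree.

        #include <stdlib.h>
        #include <string.h>

        /* Stand-ins for kmem_zalloc()/kmem_free(): the wrapper may add  */
        /* bookkeeping, so callers must not bypass it with a bare free.  */
        static void *my_zalloc(size_t size)
        {
            void *p = malloc(size);
            if (p)
                memset(p, 0, size);
            return p;
        }

        static void my_free(void *p)
        {
            free(p);
        }

        int main(void)
        {
            char *opts = my_zalloc(64);
            /* ... use opts ... */
            my_free(opts);        /* matches the allocator that produced it */
            return 0;
        }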

    Signed-off-by: Pan Bian
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Pan Bian
     
  • commit 8353a814f2518dcfa79a5bb77afd0e7dfa391bb1 upstream.

    Our loop in xfs_finish_page_writeback iterates over all the buffer
    heads in a page and then calls end_buffer_async_write, which itself
    iterates over all buffers in the page to check if any I/O is in
    flight. This is not only inefficient, but also potentially dangerous,
    as end_buffer_async_write can cause the page and all its buffers to
    be freed.

    Replace it with a single loop that does the work of end_buffer_async_write
    on a per-page basis.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit dd60687ee541ca3f6df8758f38e6f22f57c42a37 upstream.

    Reject attempts to set XFLAGS that correspond to di_flags2 inode flags
    if the inode isn't a v3 inode, because di_flags2 only exists on v3.
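
    A hedged sketch of this kind of check (the flag names and helper
    below are invented for the example, not the actual xfs_ioctl code):
    flags whose on-disk home is a field that only exists on newer inode
    versions get rejected up front on older inodes.

        #include <errno.h>
        #include <stdint.h>
        #include <stdio.h>

        #define XFLAG_COWEXTSIZE  (1u << 0)   /* lives in di_flags2 (v3 only) */
        #define XFLAG_DAX         (1u << 1)   /* lives in di_flags2 (v3 only) */
        #define V3_ONLY_XFLAGS    (XFLAG_COWEXTSIZE | XFLAG_DAX)

        static int set_xflags(int inode_version, uint32_t xflags)
        {
            /* di_flags2 only exists on v3 inodes, so refuse these early. */
            if ((xflags & V3_ONLY_XFLAGS) && inode_version < 3)
                return -EINVAL;
            /* ... apply the flags ... */
            return 0;
        }

        int main(void)
        {
            printf("v2 inode, DAX flag -> %d\n", set_xflags(2, XFLAG_DAX));
            printf("v3 inode, DAX flag -> %d\n", set_xflags(3, XFLAG_DAX));
            return 0;
        }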

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 47c7d0b19502583120c3f396c7559e7a77288a68 upstream.

    When calling into _xfs_log_force{,_lsn}() with a pointer
    to the log_flushed variable, log_flushed will be set to 1 if:
    1. xlog_sync() is called to flush the active log buffer,
    AND/OR
    2. xlog_wait() is called to wait on a syncing log buffer.

    xfs_file_fsync() checks the value of log_flushed after
    _xfs_log_force_lsn() call to optimize away an explicit
    PREFLUSH request to the data block device after writing
    out all the file's pages to disk.

    This optimization is incorrect in the following sequence of events:

    Task A                        Task B
    -------------------------------------------------------
    xfs_file_fsync()
      _xfs_log_force_lsn()
        xlog_sync()
           [submit PREFLUSH]
                                  xfs_file_fsync()
                                    file_write_and_wait_range()
                                      [submit WRITE X]
                                      [endio WRITE X]
                                    _xfs_log_force_lsn()
                                      xlog_wait()
           [endio PREFLUSH]

    Write X is not guaranteed to be on persistent storage when the
    PREFLUSH request is completed, because write X was submitted after
    the PREFLUSH request, but xfs_file_fsync() of task A will be
    notified of log_flushed=1 and will skip the explicit flush.

    If the system crashes after fsync of task A, write X may not be
    present on disk after reboot.

    This bug was discovered and demonstrated using Josef Bacik's
    dm-log-writes target, which can be used to record block io operations
    and then replay a subset of these operations onto the target device.
    The test goes something like this:
    - Use fsx to execute ops of a file and record ops on log device
    - Every now and then fsync the file, store md5 of file and mark
      the location in the log
    - Then replay log onto device for each mark, mount fs and compare
      md5 of file to stored value

    Cc: Christoph Hellwig
    Cc: Josef Bacik
    Cc:
    Signed-off-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 742d84290739ae908f1b61b7d17ea382c8c0073a upstream.

    Currently flag switching can be used to easily crash the kernel. Disable
    the per-inode DAX flag until that is sorted out.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 2dd3d709fc4338681a3aa61658122fa8faa5a437 upstream.

    The owner change bmbt scan that occurs during extent swap operations
    does not handle ordered buffer failures. Buffers that cannot be
    marked ordered must be physically logged so previously dirty ranges
    of the buffer can be relogged in the transaction.

    Since the bmbt scan may need to process and potentially log a large
    number of blocks, we can't expect to complete this operation in a
    single transaction. Update extent swap to use a permanent
    transaction with enough log reservation to physically log a buffer.
    Update the bmbt scan to physically log any buffers that cannot be
    ordered and to terminate the scan with -EAGAIN. On -EAGAIN, the
    caller rolls the transaction and restarts the scan. Finally, update
    the bmbt scan helper function to skip bmbt blocks that already match
    the expected owner so they are not reprocessed after scan restarts.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    [darrick: fix the xfs_trans_roll call]
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit a5814bceea48ee1c57c4db2bd54b0c0246daf54a upstream.

    Ordered buffers are used in situations where the buffer is not
    physically logged but must pass through the transaction/logging
    pipeline for a particular transaction. As a result, ordered buffers
    are not unpinned and written back until the transaction commits to
    the log. Ordered buffers have a strict requirement that the target
    buffer must not be currently dirty and resident in the log pipeline
    at the time it is marked ordered. If a dirty+ordered buffer is
    committed, the buffer is reinserted to the AIL but not physically
    relogged at the LSN of the associated checkpoint. The buffer log
    item is assigned the LSN of the latest checkpoint and the AIL
    effectively releases the previously logged buffer content from the
    active log before the buffer has been written back. If the tail
    pushes forward and a filesystem crash occurs while in this state, an
    inconsistent filesystem could result.

    It is currently the caller's responsibility to ensure an ordered
    buffer is not already dirty from a previous modification. This is
    unclear and error prone outside of situations where it is guaranteed
    that a buffer has not been previously modified (such as new metadata
    allocations).

    To facilitate general purpose use of ordered buffers, update
    xfs_trans_ordered_buf() to conditionally order the buffer based on
    state of the log item and return the status of the result. If the
    bli is dirty, do not order the buffer and return false. The caller
    must either physically log the buffer (having acquired the
    appropriate log reservation) or push it from the AIL to clean it
    before it can be marked ordered in the current transaction.

    Note that ordered buffers are currently only used in two situations:
    1.) inode chunk allocation where previously logged buffers are not
    possible and 2.) extent swap which will be updated to handle ordered
    buffer failures in a separate patch.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 6fb10d6d22094bc4062f92b9ccbcee2f54033d04 upstream.

    The extent swap operation currently resets bmbt block owners before
    the inode forks are swapped. The bmbt buffers are marked as ordered
    so they do not have to be physically logged in the transaction.

    This use of ordered buffers is not safe as bmbt buffers may have
    been previously physically logged. The bmbt owner change algorithm
    needs to be updated to physically log buffers that are already dirty
    when/if they are encountered. This means that an extent swap will
    eventually require multiple rolling transactions to handle large
    btrees. In addition, all inode related changes must be logged before
    the bmbt owner change scan begins and can roll the transaction for
    the first time to preserve fs consistency via log recovery.

    In preparation for such fixes to the bmbt owner change algorithm,
    refactor the bmbt scan out of the extent fork swap code to the last
    operation before the transaction is committed. Update
    xfs_swap_extent_forks() to only set the inode log flags when an
    owner change scan is necessary. Update xfs_swap_extents() to trigger
    the owner change based on the inode log flags. Note that since the
    owner change now occurs after the extent fork swap, the inode btrees
    must be fixed up with the inode number of the current inode (similar
    to log recovery).

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 99c794c639a65cc7b74f30a674048fd100fe9ac8 upstream.

    Extent swap uses xfs_btree_visit_blocks() to fix up bmbt block
    owners on v5 (!rmapbt) filesystems. The bmbt scan uses
    xfs_btree_lookup_get_block() to read bmbt blocks which verifies the
    current owner of the block against the parent inode of the bmbt.
    This works during extent swap because the bmbt owners are updated to
    the opposite inode number before the inode extent forks are swapped.

    The modified bmbt blocks are marked as ordered buffers which allows
    everything to commit in a single transaction. If the transaction
    commits to the log and the system crashes such that recovery of the
    extent swap is required, log recovery restarts the bmbt scan to fix
    up any bmbt blocks that may have not been written back before the
    crash. The log recovery bmbt scan occurs after the inode forks have
    been swapped, however. This causes the bmbt block owner verification
    to fail, which leads to log recovery failure and requires xfs_repair
    to zap the log to recover.

    Define a new invalid inode owner flag to inform the btree block
    lookup mechanism that the current inode may be invalid with respect
    to the current owner of the bmbt block. Set this flag on the cursor
    used for change owner scans to allow this operation to work at
    runtime and during log recovery.

    Signed-off-by: Brian Foster
    Fixes: bb3be7e7c ("xfs: check for bogus values in btree block headers")
    Cc: stable@vger.kernel.org
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 8dc518dfa7dbd079581269e51074b3c55a65a880 upstream.

    Ordered buffers are attached to transactions and pushed through the
    logging infrastructure just like normal buffers with the exception
    that they are not actually written to the log. Therefore, we don't
    need to log dirty ranges of ordered buffers. xfs_trans_log_buf() is
    called on ordered buffers to set up all of the dirty state on the
    transaction, buffer and log item and prepare the buffer for I/O.

    Now that xfs_trans_dirty_buf() is available, call it from
    xfs_trans_ordered_buf() so the latter is now mutually exclusive with
    xfs_trans_log_buf(). This reflects the implementation of ordered
    buffers and helps eliminate confusion over the need to log ranges of
    ordered buffers just to set up internal log state.

    Signed-off-by: Brian Foster
    Reviewed-by: Allison Henderson
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 9684010d38eccda733b61106765e9357cf436f65 upstream.

    xfs_trans_log_buf() is responsible for logging the dirty segments of
    a buffer along with setting all of the necessary state on the
    transaction, buffer, bli, etc., to ensure that the associated items
    are marked as dirty and prepared for I/O. We have a couple of use
    cases that need to dirty a buffer in a transaction without actually
    logging dirty ranges of the buffer. One existing use case is
    ordered buffers, which are currently logged with arbitrary ranges to
    accomplish this even though the content of ordered buffers is never
    written to the log. Another pending use case is to relog an already
    dirty buffer across rolled transactions within the deferred
    operations infrastructure. This is required to prevent a held
    (XFS_BLI_HOLD) buffer from pinning the tail of the log.

    Refactor xfs_trans_log_buf() into a new function that contains all
    of the logic responsible for dirtying the transaction, lidp, buffer
    and bli. This new function can be used in the future for the use cases
    outlined above. This patch does not introduce functional changes.

    Signed-off-by: Brian Foster
    Reviewed-by: Allison Henderson
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit e9385cc6fb7edf23702de33a2dc82965d92d9392 upstream.

    Ordered buffers pass through the logging infrastructure without ever
    being written to the log. The way this works is that the ordered
    buffer status is transferred to the log vector at commit time via
    the ->iop_size() callback. In xlog_cil_insert_format_items(),
    ordered log vectors bypass ->iop_format() processing altogether.

    Therefore it is unnecessary for xfs_buf_item_format() to handle
    ordered buffers. Remove the unnecessary logic and assert that an
    ordered buffer never reaches this point.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 6453c65d3576bc3e602abb5add15f112755c08ca upstream.

    xfs_buf_item_unlock() historically checked the dirty state of the
    buffer by manually checking the buffer log formats for dirty
    segments. The introduction of ordered buffers invalidated this check
    because ordered buffers have dirty bli's but no dirty (logged)
    segments. The check was updated to accommodate ordered buffers by
    looking at the bli state first and considering the blf only if the
    bli is clean.

    This logic is safe but unnecessary. There is no valid case where the
    bli is clean yet the blf has dirty segments. The bli is set dirty
    whenever the blf is logged (via xfs_trans_log_buf()) and the blf is
    cleared in the only place BLI_DIRTY is cleared (xfs_trans_binval()).

    Remove the conditional blf dirty checks and replace them with an
    assert that should catch any discrepancies between bli and blf dirty
    states. Refactor the old blf dirty check into a helper function to
    be used by the assert.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit a4f6cf6b2b6b60ec2a05a33a32e65caa4149aa2b upstream.

    It checks a single flag and has one caller. It probably isn't worth
    its own function.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit f2e9ad212def50bcf4c098c6288779dd97fff0f0 upstream.

    After xfs_ifree_cluster() finds an inode in the radix tree and verifies
    that the inode number is what it expected, xfs_reclaim_inode() can swoop
    in and free it. xfs_ifree_cluster() will then happily continue working
    on the freed inode. Most importantly, it will mark the inode stale,
    which will probably be overwritten when the inode slab object is
    reallocated, but if it has already been reallocated then we can end up
    with an inode spuriously marked stale.

    In 8a17d7ddedb4 ("xfs: mark reclaimed inodes invalid earlier") we added
    a second check to xfs_iflush_cluster() to detect this race, but the
    similar RCU lookup in xfs_ifree_cluster() needs the same treatment.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • commit 799ea9e9c59949008770aab4e1da87f10e99dbe4 upstream.

    When we introduced the bmap redo log items, we set MS_ACTIVE on the
    mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
    from being truncated prematurely during log recovery. This also had the
    effect of putting linked inodes on the lru instead of evicting them.

    Unfortunately, we neglected to find all those unreferenced lru inodes
    and evict them after finishing log recovery, which means that we leak
    them if anything goes wrong in the rest of xfs_mountfs, because the lru
    is only cleaned out on unmount.

    Therefore, evict unreferenced inodes in the lru list immediately
    after clearing MS_ACTIVE.

    Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
    Signed-off-by: Darrick J. Wong
    Cc: viro@ZenIV.linux.org.uk
    Reviewed-by: Brian Foster
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     
  • commit 2d32311cf19bfb8c1d2b4601974ddd951f9cfd0b upstream.

    In a filesystem without finobt, the space manager selects an AG to
    allocate a new inode, where xfs_dialloc_ag_inobt() will search the
    AG for a chunk with a free slot.

    When the new inode is in the same AG as its parent, the btree will
    be searched starting from the parent's record, and then retried from
    the top if no slot is available beyond the parent's record.

    To exit this loop, though, xfs_dialloc_ag_inobt() relies on the fact
    that the btree must have a free slot available, since its callers
    relied on agi->freecount when deciding how/where to allocate the new
    inode.

    If agi->freecount is corrupted, showing available inodes in an AG
    when in fact there are none, this becomes an infinite loop.

    Add a way to stop the loop when a free slot is not found in the
    btree, making the function fall back to the whole-AG scan, which
    will then be able to detect the corruption and shut the filesystem
    down.

    As pointed out by Brian, this might impact performance, given that
    we no longer reset the search distance when we reach the end of the
    tree, giving it fewer tries before falling back to the whole-AG
    search, but it will only affect searches that start within 10
    records of the end of the tree.
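
    Sketched in miniature (with invented names, not the actual inobt
    search code), the hang is a retry loop whose only exit is finding a
    free slot while trusting a counter that claims one exists; the fix
    adds an exit for a full scan that finds nothing.

        #include <stdio.h>

        #define NCHUNKS 8

        /* The counter claims a free inode exists, but every chunk is full. */
        static int corrupted_freecount = 1;
        static int chunk_has_free[NCHUNKS];   /* all zero: nothing is free */

        static int find_free_chunk(void)
        {
            while (corrupted_freecount > 0) {
                for (int i = 0; i < NCHUNKS; i++)
                    if (chunk_has_free[i])
                        return i;

                /* The fix: a full scan found nothing, so stop trusting   */
                /* the counter instead of retrying forever on bad metadata. */
                fprintf(stderr, "freecount is corrupt, shutting down\n");
                return -1;
            }
            return -1;
        }

        int main(void)
        {
            printf("result: %d\n", find_free_chunk());
            return 0;
        }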

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Carlos Maiolino
     
  • commit e67d3d4246e5fbb0c7c700426d11241ca9c6f473 upstream.

    Torn write detection and tail overwrite detection can shift the log
    head and tail respectively in the event of CRC mismatch or
    corruption errors. Add a high-level log recovery tracepoint to dump
    the final log head/tail and make those values easily attainable in
    debug/diagnostic situations.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit a4c9b34d6a17081005ec459b57b8effc08f4c731 upstream.

    Torn write and tail overwrite detection both trigger only on
    -EFSBADCRC errors. While this is the most likely failure scenario
    for each condition, -EFSCORRUPTED is still possible in certain cases
    depending on what ends up on disk when a torn write or partial tail
    overwrite occurs. For example, an invalid log record h_len can lead
    to an -EFSCORRUPTED error when running the log recovery CRC pass.

    Therefore, update log head and tail verification to trigger the
    associated head/tail fixups in the event of -EFSCORRUPTED errors
    along with -EFSBADCRC. Also, -EFSCORRUPTED can currently be returned
    from xlog_do_recovery_pass() before rhead_blk is initialized if the
    first record encountered happens to be corrupted. This leads to an
    incorrect 'first_bad' return value. Initialize rhead_blk earlier in
    the function to address that problem as well.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 4a4f66eac4681378996a1837ad1ffec3a2e2981f upstream.

    If we consider the case where the tail (T) of the log is pinned long
    enough for the head (H) to push and block behind the tail, we can
    end up blocked in the following state without enough free space (f)
    in the log to satisfy a transaction reservation:

    0                     phys. log                     N
    [-------HffT---H'--T'---]

    The last good record in the log (before H) refers to T. The tail
    eventually pushes forward (T') leaving more free space in the log
    for writes to H. At this point, suppose space frees up in the log
    for the maximum of 8 in-core log buffers to start flushing out to
    the log. If this pushes the head from H to H', these next writes
    overwrite the previous tail T. This is safe because the items logged
    from T to T' have been written back and removed from the AIL.

    If the next log writes (H -> H') happen to fail and result in
    partial records in the log, the filesystem shuts down having
    overwritten T with invalid data. Log recovery correctly locates H on
    the subsequent mount, but H still refers to the now corrupted tail
    T. This results in log corruption errors and recovery failure.

    Since the tail overwrite results from otherwise correct runtime
    behavior, it is up to log recovery to try and deal with this
    situation. Update log recovery tail verification to run a CRC pass
    from the first record past the tail to the head. This facilitates
    error detection at T and moves the recovery tail to the first good
    record past H' (similar to truncating the head on torn write
    detection). If corruption is detected beyond the range possibly
    affected by the max number of iclogs, the log is legitimately
    corrupted and log recovery failure is expected.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 5297ac1f6d7cbf45464a49b9558831f271dfc559 upstream.

    Log tail verification currently only occurs when torn writes are
    detected at the head of the log. This was introduced because a
    change in the head block due to torn writes can lead to a change in
    the tail block (each log record header references the current tail)
    and the tail block should be verified before log recovery proceeds.

    Tail corruption is possible outside of torn write scenarios,
    however. For example, partial log writes can be detected and cleared
    during the initial head/tail block discovery process. If the partial
    write coincides with a tail overwrite, the log tail is corrupted and
    recovery fails.

    To facilitate correct handling of log tail overwrites, update log
    recovery to always perform tail verification. This is necessary to
    detect potential tail overwrite conditions when torn writes may not
    have occurred. This changes normal (i.e., no torn writes) recovery
    behavior slightly to detect and return CRC related errors near the
    tail before actual recovery starts.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 284f1c2c9bebf871861184b0e2c40fa921dd380b upstream.

    The high-level log recovery algorithm consists of two loops that
    walk the physical log and process log records from the tail to the
    head. The first loop handles the case where the tail is beyond the
    head and processes records up to the end of the physical log. The
    subsequent loop processes records from the beginning of the physical
    log to the head.

    Because log records can wrap around the end of the physical log, the
    first loop mentioned above must handle this case appropriately.
    Records are processed from in-core buffers, which means that this
    algorithm must split the reads of such records into two partial
    I/Os: 1.) from the beginning of the record to the end of the log and
    2.) from the beginning of the log to the end of the record. This is
    further complicated by the fact that the log record header and log
    record data are read into independent buffers.

    The current handling of each buffer correctly splits the reads when
    either the header or data starts before the end of the log and wraps
    around the end. The data read does not correctly handle the case
    where the prior header read wrapped or ends on the physical log end
    boundary. blk_no is incremented to or beyond the log end after the
    header read to point to the record data, but the split data read
    logic triggers, attempts to read from an invalid log block and
    ultimately causes log recovery to fail. This can be reproduced
    fairly reliably via xfstests tests generic/047 and generic/388 with
    large iclog sizes (256k) and small (10M) logs.

    If the record header read has pushed beyond the end of the physical
    log, the subsequent data read is actually contiguous. Update the
    data read logic to detect the case where blk_no has wrapped, mod it
    against the log size to read from the correct address and issue one
    contiguous read for the log data buffer. The log record is processed
    as normal from the buffer(s), the loop exits after the current
    iteration and the subsequent loop picks up with the first new record
    after the start of the log.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit d3a304b6292168b83b45d624784f973fdc1ca674 upstream.

    When a buffer fails during writeback, the inode items attached to it
    are kept flush locked and are never resubmitted because of the flush
    lock, so if any buffer fails to be written, the items in the AIL are
    never written to disk and never unlocked.

    This causes unmount operations to hang because of these flush-locked
    items in the AIL, but it also means the items in the AIL are never
    written back, even when the I/O device comes back to normal.

    I've been testing this patch with a DM-thin device, creating a
    filesystem larger than the real device.

    When writing enough data to fill the DM-thin device, XFS receives
    ENOSPC errors from the device and keeps spinning in xfsaild (when
    the 'retry forever' configuration is set).

    At this point, the filesystem cannot be unmounted because of the
    flush-locked items in the AIL, but worse, the items in the AIL are
    never retried at all (since xfs_inode_item_push() will skip items
    that are flush locked), even if the underlying DM-thin device is
    expanded to the proper size.

    This patch fixes both cases, retrying any item that has previously
    failed, using the infrastructure provided by the previous patch.

    Reviewed-by: Brian Foster
    Signed-off-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Carlos Maiolino
     
  • commit 0b80ae6ed13169bd3a244e71169f2cc020b0c57a upstream.

    With the current code, XFS never resubmits a failed buffer for I/O,
    because the failed item in the buffer is kept in the flush-locked
    state forever.

    To be able to resubmit a log item for I/O, we need a way to mark an
    item as failed if, for any reason, the buffer to which the item
    belongs fails during writeback.

    Add a new log item callback to be used after an I/O completion
    failure and make the needed cleanups.

    Reviewed-by: Brian Foster
    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Carlos Maiolino
     
  • commit 27af1bbf524459962d1477a38ac6e0b7f79aaecc upstream.

    xfs_iflush_done uses an on-stack variable length array to pass the log
    items to be deleted to xfs_trans_ail_delete_bulk. On-stack VLAs are a
    nasty gcc extension that can lead to unbounded stack allocations, but
    fortunately we can easily avoid them by simply open coding
    xfs_trans_ail_delete_bulk in xfs_iflush_done, which is the only caller
    of it except for the single-item xfs_trans_ail_delete.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 6f4a1eefdd0ad4561543270a7fceadabcca075dd upstream.

    When we do log recovery on a readonly mount, unlinked inode
    processing does not happen due to the readonly checks in
    xfs_inactive(), which are trying to prevent any I/O on a
    readonly mount.

    This is misguided - we do I/O on readonly mounts all the time,
    for consistency; for example, log recovery. So do the same
    RDONLY flag twiddling around xfs_log_mount_finish() as we
    do around xfs_log_mount(), for the same reason.

    This all cries out for a big rework but for now this is a
    simple fix to an obvious problem.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     
  • commit 757a69ef6cf2bf839bd4088e5609ddddd663b0c4 upstream.

    There are dueling comments in the xfs code about intent
    for log writes when unmounting a readonly filesystem.

    In xfs_mountfs, we see the intent:

        /*
         * Now the log is fully replayed, we can transition to full read-only
         * mode for read-only mounts. This will sync all the metadata and clean
         * the log so that the recovery we just performed does not have to be
         * replayed again on the next mount.
         */

    and it calls xfs_quiesce_attr(), but by the time we get to
    xfs_log_unmount_write(), it returns early for a RDONLY mount:

        * Don't write out unmount record on read-only mounts.

    Because of this, sequential ro mounts of a filesystem with
    a dirty log will replay the log each time, which seems odd.

    Fix this by writing an unmount record even for RO mounts, as long
    as norecovery wasn't specified (don't write a clean log record
    if a dirty log may still be there!) and the log device is
    writable.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     
  • commit e28ae8e428fefe2facd72cea9f29906ecb9c861d upstream.

    Fix the min_t calls in the zeroing and dirtying helpers to perform
    the comparisons on 64-bit types, which prevents them from being
    incorrectly truncated and larger zeroing operations from getting
    stuck in a never-ending loop.

    Special thanks to Markus Stockhausen for spotting the bug.
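
    A userspace sketch of this truncation bug class (MIN_T below is a
    simplified stand-in for the kernel's min_t, not the real macro):
    doing the comparison in a 32-bit type can turn a large 64-bit
    remaining count into a zero-sized chunk, which is what stalls the
    zeroing loop.

        #include <stdint.h>
        #include <stdio.h>

        /* Simplified stand-in for the kernel's min_t(type, a, b). */
        #define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

        int main(void)
        {
            uint64_t remaining = 1ULL << 32;   /* 4 GiB left to zero  */
            uint64_t chunk_cap = 1ULL << 20;   /* 1 MiB per iteration */

            /* 32-bit comparison: (uint32_t)remaining is 0, so the chunk is 0  */
            /* and a loop driven by "remaining -= chunk" never makes progress. */
            uint32_t bad  = MIN_T(uint32_t, remaining, chunk_cap);

            /* 64-bit comparison gives the intended 1 MiB chunk. */
            uint64_t good = MIN_T(uint64_t, remaining, chunk_cap);

            printf("32-bit min: %u, 64-bit min: %llu\n",
                   bad, (unsigned long long)good);
            return 0;
        }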

    Reported-by: Paul Menzel
    Tested-by: Paul Menzel
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 77aff8c76425c8f49b50d0b9009915066739e7d2 upstream.

    If we fail a mount on account of cow recovery errors, it's possible that
    a previous quotacheck left some dquots in memory. The bailout clause of
    xfs_mountfs forgets to purge these, and so we leak them. Fix that.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     
  • commit 8204f8ddaafafcae074746fcf2a05a45e6827603 upstream.

    Way back when we established inode block-map redo log items, it was
    discovered that we needed to prevent the VFS from evicting inodes
    during log recovery because any given inode might have bmap redo
    items to replay even if the inode has no link count and is
    ultimately deleted, and any eviction of an unlinked inode causes the
    inode to be truncated and freed too early.

    To make this possible, we set MS_ACTIVE so that inodes would not be torn
    down immediately upon release. Unfortunately, this also results in the
    quota inodes not being released at all if a later part of the mount
    process should fail, because we never reclaim the inodes. So, set
    MS_ACTIVE right before we do the last part of log recovery and clear it
    immediately after we finish the log recovery so that everything
    will be torn down properly if we abort the mount.

    Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     
  • commit c44245b3d5435f533ca8346ece65918f84c057f9 upstream.

    When we try to allocate a free inode by searching the inobt, we try to
    find the inode nearest the parent inode by searching chunks both left
    and right of the chunk containing the parent. As an optimization, we
    cache the leftmost and rightmost records that we previously searched; if
    we do another allocation with the same parent inode, we'll pick up the
    search where it last left off.

    There's a bug in the case where we found a free inode to the left of the
    parent's chunk: we need to update the cached left and right records, but
    because we already reassigned the right record to point to the left, we
    end up assigning the left record to both the cached left and right
    records.

    This isn't strictly a correctness problem, but it can result in the
    next allocation rechecking chunks unnecessarily or allocating inodes
    further away from the parent than it needs to. Fix it by swapping
    the record pointer after we update the cached left and right
    records.
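
    The aliasing mistake in miniature (generic record/cache names, not
    the actual inobt code): once the right-hand pointer has been
    redirected at the left record, caching through both pointers stores
    the same record twice; updating the cache before the swap avoids it.

        #include <stdio.h>

        struct rec { int startino; };

        static struct rec cache_left, cache_right;

        static void update_cache(const struct rec *l, const struct rec *r)
        {
            cache_left  = *l;
            cache_right = *r;
        }

        int main(void)
        {
            struct rec lrec = { 100 }, rrec = { 200 };
            struct rec *l = &lrec, *r = &rrec;

            /* Buggy order: alias first, cache second -> both slots get lrec. */
            r = l;                       /* "use the left record from now on" */
            update_cache(l, r);
            printf("buggy cache: left=%d right=%d\n",
                   cache_left.startino, cache_right.startino);

            /* Fixed order: cache the real left/right records, then swap. */
            l = &lrec; r = &rrec;
            update_cache(l, r);
            r = l;
            printf("fixed cache: left=%d right=%d\n",
                   cache_left.startino, cache_right.startino);
            return 0;
        }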

    Fixes: bd169565993b ("xfs: speed up free inode search")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • commit 56bdf855e676f1f2ed7033f288f57dfd315725ba upstream.

    According to the commit that implemented the per-inode DAX flag,
    commit 58f88ca2df72 ("xfs: introduce per-inode DAX enablement"),
    the flag is supposed to act as an "inherit flag".

    Currently this only works in situations where the parent directory
    already has a flag set in di_flags; otherwise inheritance does not
    work. This is because the XFS_DIFLAG2_DAX flag is set in the wrong
    branch, the one designated for di_flags, not di_flags2.

    Fix this by moving the code to the branch designated for setting
    di_flags2, which does test for flags in di_flags2.

    Fixes: 58f88ca2df72 ("xfs: introduce per-inode DAX enablement")
    Signed-off-by: Lukas Czerner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Lukas Czerner
     
  • commit 5b094d6dac0451ad89b1dc088395c7b399b7e9e8 upstream.

    Just like in the allocator we must avoid touching multiple AGs out of
    order when freeing blocks, as freeing still locks the AGF and can cause
    the same AB-BA deadlocks as in the allocation path.

    Signed-off-by: Christoph Hellwig
    Reported-by: Nikolay Borisov
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit cfaf2d034360166e569a4929dd83ae9698bed856 upstream.

    If a dquot has an id of U32_MAX, the next lookup index increment
    overflows the uint32_t back to 0. This starts the lookup sequence
    over from the beginning, repeats indefinitely and results in a
    livelock.

    Update xfs_qm_dquot_walk() to explicitly check for the lookup
    overflow and exit the loop.
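
    A minimal userspace sketch of the livelock (invented helper, not the
    XFS radix-tree walk): if the next lookup index is computed as id + 1
    in a 32-bit variable, an entry with id U32_MAX wraps the cursor back
    to 0 and the walk restarts forever unless the overflow is checked.

        #include <stdint.h>
        #include <stdio.h>

        /* Pretend "lookup" returns the smallest id >= start, or -1 if none. */
        static int lookup_next(const uint32_t *ids, int n, uint32_t start,
                               uint32_t *out)
        {
            int found = -1;
            for (int i = 0; i < n; i++)
                if (ids[i] >= start && (found < 0 || ids[i] < ids[found]))
                    found = i;
            if (found < 0)
                return -1;
            *out = ids[found];
            return 0;
        }

        int main(void)
        {
            uint32_t ids[] = { 7, 42, UINT32_MAX };  /* one entry at U32_MAX */
            uint32_t cursor = 0, id;

            while (lookup_next(ids, 3, cursor, &id) == 0) {
                printf("visit id %u\n", (unsigned)id);
                if (id == UINT32_MAX)  /* without this check, cursor wraps   */
                    break;             /* to 0 and the loop restarts forever */
                cursor = id + 1;
            }
            return 0;
        }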

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • commit 10479e2dea83d4c421ad05dfc55d918aa8dfc0cd upstream.

    In some circumstances, _alloc_read_agf can return an error code of zero
    but also a null AGF buffer pointer. Check for this and jump out.

    Fixes-coverity-id: 1415250
    Fixes-coverity-id: 1415320
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     
  • commit 4c1a67bd3606540b9b42caff34a1d5cd94b1cf65 upstream.

    We must initialize the firstfsb parameter to _bmapi_write so that it
    doesn't incorrectly treat stack garbage as a restriction on which AGs
    it can search for free space.
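
    The bug class in miniature (invented names, not xfs_bmapi_write
    itself): an uninitialized stack variable passed by pointer as an
    in/out constraint is read as a real restriction; initializing it to
    an explicit "no restriction" sentinel, analogous to NULLFSBLOCK,
    avoids that.

        #include <stdint.h>
        #include <stdio.h>

        #define NO_RESTRICTION UINT64_MAX   /* sketch equivalent of NULLFSBLOCK */

        /* Allocator that refuses to search below *first_block. */
        static int alloc_block(uint64_t *first_block, uint64_t *out)
        {
            uint64_t candidate = 1000;

            if (*first_block != NO_RESTRICTION && candidate < *first_block)
                return -1;              /* "restricted" by a garbage value */
            *out = candidate;
            return 0;
        }

        int main(void)
        {
            uint64_t first_block = NO_RESTRICTION;  /* the fix: initialize it */
            uint64_t blk;

            if (alloc_block(&first_block, &blk) == 0)
                printf("allocated block %llu\n", (unsigned long long)blk);
            return 0;
        }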

    Fixes-coverity-id: 1402025
    Fixes-coverity-id: 1415167
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     
  • commit 1e86eabe73b73c82e1110c746ed3ec6d5e1c0a0d upstream.

    Check the _btree_check_block return value for the firstrec and lastrec
    functions, since we have the ability to signal that the repositioning
    did not succeed.

    Fixes-coverity-id: 114067
    Fixes-coverity-id: 114068
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     
  • commit cd87d867920155911d0d2e6485b769d853547750 upstream.

    In quite a few places we call xfs_da_read_buf with a mappedbno that we
    don't control, then assume that the function passes back either an error
    code or a buffer pointer. Unfortunately, if mappedbno == -2 and bno
    maps to a hole, we get a return code of zero and a NULL buffer, which
    means that we crash if we actually try to use that buffer pointer. This
    happens immediately when we set the buffer type for transaction context.

    Therefore, check that we have no error code and a non-NULL bp before
    trying to use bp. This patch is a follow-up to an incomplete fix in
    96a3aefb8ffde231 ("xfs: don't crash if reading a directory results in an
    unexpected hole").

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong