03 Aug, 2016

14 commits

  • Restructure everything that used xfs_bmap_free to use xfs_defer_ops
    instead. For now we'll just remove the old symbols and play some
    cpp magic to make it work; in the next patch we'll actually rename
    everything.
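
    For illustration, compatibility shims of the kind described might
    look roughly like this (the macro names are illustrative, not copied
    from the patch):

    /* old xfs_bmap_free API aliased onto the new deferred-ops API */
    #define xfs_bmap_free                 xfs_defer_ops
    #define xfs_bmap_init(dfops, fbp)     xfs_defer_init((dfops), (fbp))
    #define xfs_bmap_cancel(dfops)        xfs_defer_cancel(dfops)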

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Connect the xfs_defer mechanism with the pieces that we'll need to
    handle deferred extent freeing. We'll wire up the existing code to
    our new deferred mechanism later.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Replace structure typedefs with struct xfs_foo_* in the EFI/EFD
    handling code in preparation to move it over to deferred ops.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Add tracepoints for the internals of the deferred ops mechanism
    and tracepoint classes for clients of the dops, to make debugging
    easier.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • All the code around struct xfs_bmap_free basically implements a
    deferred operation framework through which we can roll transactions
    (to unlock buffers and avoid violating lock order rules) while
    managing all the necessary log redo items. Previously we only used
    this code to free extents after some sort of mapping operation, but
    with the advent of rmap and reflink, we suddenly need to do more than
    that.

    With that in mind, xfs_bmap_free really becomes a deferred ops control
    structure. Rename the structure and move the deferred ops into their
    own file to avoid further bloating of the bmap code.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Refactor the btree_change_owner function into a more generic apparatus
    which visits all blocks in a btree. We'll use this in a subsequent
    patch for counting btree blocks for AG reservations.
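
    A hedged sketch of what such a visitor interface could look like
    (names and signature are illustrative, not taken from the patch):

    /* illustrative visitor: called once per btree block at a level */
    typedef int (*visit_blockfn)(void *cursor, int level, void *priv);

    /* walk every block in the tree, invoking fn on each; fn can change
     * the block owner, or simply count blocks for AG reservations */
    int btree_visit_blocks(void *cursor, visit_blockfn fn, void *priv);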

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Create a function to enable querying of btree records mapping to a
    range of keys. This will be used in subsequent patches to allow
    querying the reverse mapping btree to find the extents mapped to a
    range of physical blocks, though the generic code can be used for
    any range query.

    The overlapped query range function needs to use the btree get_block
    helper because the root block could be an inode, in which case
    bc_bufs[nlevels-1] will be NULL.
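
    Sketching the reason (the helper names here are hypothetical): the
    root of an inode-fork btree lives in the inode itself rather than in
    a buffer, so looking at bc_bufs[] alone is not enough:

    /* hypothetical sketch of why the get_block helper is needed */
    static struct xfs_btree_block *
    query_get_block(struct xfs_btree_cur *cur, int level)
    {
            if (level == cur->bc_nlevels - 1 && btree_root_in_inode(cur))
                    return root_block_from_inode_fork(cur); /* no buffer */
            return block_from_buf(cur->bc_bufs[level]);     /* usual case */
    }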

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • On a filesystem with both reflink and reverse mapping enabled, it's
    possible to have multiple rmap records referring to the same blocks on
    disk. When overlapping intervals are possible, querying a classic
    btree to find all records intersecting a given interval is inefficient
    because we cannot use the left side of the search interval to filter
    out non-matching records the same way that we can use the existing
    btree key to filter out records coming after the right side of the
    search interval. This will become important once we want to use the
    rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.

    (For the non-overlapping case, we can perform such queries trivially
    by starting at the left side of the interval and walking the tree
    until we pass the right side.)

    Therefore, extend the btree code to come closer to supporting
    intervals as a first-class record attribute. This involves widening
    the btree node's key space to store both the lowest key reachable via
    the node pointer (as the btree does now) and the highest key reachable
    via the same pointer, and teaching the btree modifying functions to
    keep the highest-key records up to date.
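
    As a rough sketch of the idea (types and names here are illustrative
    only): with both ends of each child's key range stored in the node,
    a range query can prune children that cannot intersect the search
    interval from either side:

    #include <stdbool.h>
    #include <stdint.h>

    /* illustrative node key carrying both ends of a child's key range */
    struct range_key {
            uint64_t lo;    /* lowest key reachable via the child ptr */
            uint64_t hi;    /* highest key reachable via the child ptr */
    };

    /* descend into a child only if [lo, hi] can intersect the query */
    static bool child_may_overlap(const struct range_key *k,
                                  uint64_t q_lo, uint64_t q_hi)
    {
            return k->hi >= q_lo && k->lo <= q_hi;
    }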

    This behavior can be turned on via a new btree ops flag so that btrees
    that cannot store overlapping intervals don't pay the overhead costs
    in terms of extra code and disk format changes.

    When we're deleting a record in a btree that supports overlapped
    interval records and the deletion results in two btree blocks being
    joined, we defer updating the high/low keys until after all possible
    joining (at higher levels in the tree) has finished. At this point,
    the btree pointers at all levels have been updated to remove the empty
    blocks and we can update the low and high keys.

    When we're doing this, we must be careful to update the keys of all
    node pointers up to the root instead of stopping at the first set of
    keys that don't need updating. This is because it's possible for a
    single deletion to cause joining of multiple levels of tree, and so
    we need to update everything going back to the root.

    The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than,
    equal to, or greater than key2, respectively. This is consistent
    with the rest of the kernel and the C library.

    In btree_updkeys(), we need to evaluate the force_all parameter before
    running the key diff to avoid reading uninitialized memory when we're
    forcing a key update. This happens when we've allocated an empty slot
    at level N + 1 to point to a new block at level N and we're in the
    process of filling out the new keys.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Add some function pointers to bc_ops to get the btree keys for
    leaf and node blocks, and to update parent keys of a block.
    Convert the _btree_updkey calls to use our new pointer, and
    modify the tree shape changing code to call the appropriate
    get_*_keys pointer instead of _btree_copy_keys because the
    overlapping btree has to calculate high key values.
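
    A hedged sketch of the kind of per-btree hooks being added (the
    field names are illustrative, not the exact xfs_btree_ops members):

    /* illustrative ops table: overlapping btrees compute a high key too */
    struct toy_btree_ops {
            /* derive the key(s) covering a leaf block from its records */
            void (*get_leaf_keys)(const void *block, void *key);
            /* derive the key(s) covering a node block from its child keys */
            void (*get_node_keys)(const void *block, void *key);
            /* propagate a block's key(s) into its parent(s) */
            int  (*update_keys)(void *cursor, int level);
    };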

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When a btree block has to be split, we pass the new block's ptr from
    xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
    however, we pass the block's key through the cursor's record. It is a
    little weird to "initialize" a record from a key since the non-key
    attributes will have garbage values.

    When we go to add support for interval queries, we have to be able to
    pass the lowest and highest keys accessible via a pointer. There's no
    clean way to pass this back through the cursor's record field.
    Therefore, pass the key directly back to xfs_btree_insert() the same
    way that we pass the btree_ptr.

    As a bonus, we no longer need init_rec_from_key and can drop it from the
    codebase.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • If we make the inode root block of a btree unfull by expanding the
    root, we must set *stat to 1 to signal success, rather than leaving
    it uninitialized.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When we're deleting realtime extents, we need to lock the summary
    inode in case we need to update the summary info to prevent an assert
    on the rsumip inode lock on a debug kernel. While we're at it, fix
    the locking annotations so that we avoid triggering lockdep warnings.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Apparently cris doesn't require structure stride to align with the
    largest type in the struct, so list[0] isn't at offset 4 like it is
    everywhere else. Fix this... insofar as existing XFSes on cris are
    screwed.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When we're iterating inode xattrs by handle, we have to copy the
    cursor back to userspace so that a subsequent invocation actually
    retrieves subsequent contents.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     

22 Jul, 2016

7 commits

  • Dave Chinner
     
  • It has been around for long enough now and hasn't caused any
    regression test failures in the past 3 months, so it's time to make
    it a fully supported feature.

    Signed-off-by: Dave Chinner
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • In xfs_finish_page_writeback(), we have a loop that looks like this:

    do {
            if (off < bvec->bv_offset)
                    goto next_bh;
            if (off > end)
                    break;
            bh->b_end_io(bh, !error);
    next_bh:
            off += bh->b_size;
    } while ((bh = bh->b_this_page) != head);

    The b_end_io function is end_buffer_async_write(), which will call
    end_page_writeback() once all the buffers have been marked as no
    longer under IO. The issue here is that the only thing currently
    protecting both the bufferhead chain and the page from being
    reclaimed is the PageWriteback state held on the page.

    While we attempt to limit the loop to just the buffers covered by
    the IO, we still read from the buffer size and follow the next
    pointer in the bufferhead chain. There is no guarantee that either
    of these are valid after the PageWriteback flag has been cleared.
    Hence, loops like this are completely unsafe, and result in
    use-after-free issues. One such problem was caught by Calvin Owens
    with KASAN:

    .....
    INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
    free_buffer_head+0x41/0x90
    __slab_free+0x1ed/0x340
    kmem_cache_free+0x270/0x300
    free_buffer_head+0x41/0x90
    try_to_free_buffers+0x171/0x240
    xfs_vm_releasepage+0xcb/0x3b0
    try_to_release_page+0x106/0x190
    shrink_page_list+0x118e/0x1a10
    shrink_inactive_list+0x42c/0xdf0
    shrink_zone_memcg+0xa09/0xfa0
    shrink_zone+0x2c3/0xbc0
    .....
    Call Trace:
    [] dump_stack+0x68/0x94
    [] print_trailer+0x115/0x1a0
    [] object_err+0x34/0x40
    [] kasan_report_error+0x217/0x530
    [] __asan_report_load8_noabort+0x43/0x50
    [] xfs_destroy_ioend+0x3bf/0x4c0
    [] xfs_end_bio+0x154/0x220
    [] bio_endio+0x158/0x1b0
    [] blk_update_request+0x18b/0xb80
    [] scsi_end_request+0x97/0x5a0
    [] scsi_io_completion+0x438/0x1690
    [] scsi_finish_command+0x375/0x4e0
    [] scsi_softirq_done+0x280/0x340

    Here the access occurs during IO completion, after the buffer had
    been freed by direct memory reclaim.

    Prevent use-after-free accidents in this end_io processing loop by
    pre-calculating the loop conditionals before calling bh->b_end_io().
    The loop is already limited to just the bufferheads covered by the
    IO in progress, so the offset checks are sufficient to prevent
    accessing buffers in the chain after end_page_writeback() has been
    called by the bh->b_end_io() callout.
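
    A rough sketch of that approach (not necessarily the exact patch):
    everything the loop needs from the buffer head is sampled before
    bh->b_end_io() can trigger end_page_writeback():

    do {
            /* sample the chain pointer and size before bh can go away */
            struct buffer_head *next = bh->b_this_page;
            unsigned int size = bh->b_size;

            if (off > end)
                    break;
            if (off >= bvec->bv_offset)
                    bh->b_end_io(bh, !error);  /* bh not touched after this */

            off += size;
    } while ((bh = next) != head);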

    Yet another example of why Bufferheads Must Die.

    cc: # 4.7
    Signed-off-by: Dave Chinner
    Reported-and-Tested-by: Calvin Owens
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • One of the problems we currently have with delayed logging is that
    under serious memory pressure we can deadlock memory reclaim. This
    occurs when memory reclaim (such as that run by kswapd) is reclaiming XFS
    inodes and issues a log force to unpin inodes that are dirty in the
    CIL.

    The CIL is pushed, but this will only occur once it gets the CIL
    context lock to ensure that all committing transactions are complete
    and no new transactions start being committed to the CIL while the
    push switches to a new context.

    The deadlock occurs when the CIL context lock is held by a
    committing process that is doing memory allocation for log vector
    buffers, and that allocation is then blocked on memory reclaim
    making progress. Memory reclaim, however, is blocked waiting for
    a log force to make progress, and so we effectively deadlock at this
    point.

    To solve this problem, we have to move the CIL log vector buffer
    allocation outside of the context lock so that memory reclaim can
    always make progress when it needs to force the log. The problem
    with doing this is that a CIL push can take place while we are
    determining if we need to allocate a new log vector buffer for
    an item and hence the current log vector may go away without
    warning. That means we cannot rely on the existing log vector being
    present when we finally grab the context lock and so we must have a
    replacement buffer ready to go at all times.

    To ensure this, introduce a "shadow log vector" buffer that is
    always guaranteed to be present when we gain the CIL context lock
    and format the item. This shadow buffer may or may not be used
    during the formatting, but if the log item does not have an existing
    log vector buffer or that buffer is too small for the new
    modifications, we swap it for the new shadow buffer and format
    the modifications into that new log vector buffer.
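
    In pseudo-code, the ordering described is roughly as follows (the
    helper names are invented for illustration):

    /* allocate the shadow buffer with no locks held, so a reclaim-driven
     * log force is never blocked on an allocation made under the CIL
     * context lock */
    shadow = alloc_shadow_lv(item, estimated_size);  /* may block in reclaim */

    down_read(&cil->xc_ctx_lock);                    /* commit-side lock */
    if (!item->li_lv || lv_too_small(item->li_lv, estimated_size))
            use_shadow_lv(item, shadow);             /* swap in the shadow */
    format_item_into_lv(item);                       /* format under the lock */
    up_read(&cil->xc_ctx_lock);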

    The result of this is that for any object we modify more than once
    in a given CIL checkpoint, we double the memory required to track
    dirty regions in the log. For a single modification, the shadow log
    vector we allocate on commit gets consumed by the checkpoint.
    However, if we make multiple modifications, then the second
    transaction commit will allocate a shadow log vector and hence we
    will end up with double the memory usage, as only one of the log
    vectors is consumed by the CIL checkpoint. The remaining shadow
    vector will be freed when the log item is freed.

    This can probably be optimised in the future - access to the shadow
    log vector is serialised by the object lock (as opposed to the
    active log vector, which is controlled by the CIL context lock), and
    so we can probably free the shadow log vector for some objects when
    the log item is marked clean on removal from the AIL.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • xfsprogs source commit 4280e59dcbc4cd8e01585efe788a68eb378048e8

    xfs_da3_split() has to handle all three versions of the
    directory/attribute btree structure. The attr tree is v1, the dir
    tree is v2 or v3. The main difference between the v1 and v2/3 trees
    is the way tree nodes are split - in the v1 tree we can require a
    double split to occur because the object to be inserted may be
    larger than the space made by splitting a leaf. In this case we need
    to do a double split - one to split the full leaf, then another to
    allocate an empty leaf block in the correct location for the new
    entry. This does not happen with dir (v2/v3) formats as the objects
    being inserted are always guaranteed to fit into the new space in
    the split blocks.

    Indeed, for directories there *may* be an extra block on this buffer
    pointer. However, it's guaranteed not to be a leaf block (i.e. a
    directory data block) - the directory code only ever places hash
    index or free space blocks in this pointer (as a cursor of
    sorts), and so to use it as a directory data block will immediately
    corrupt the directory.

    The problem is that the code assumes that there may be extra blocks
    that we need to link into the tree once we've split the root, but
    this is not true for either dir or attr trees, because the extra
    attr block is always consumed by the last node split before we split
    the root. Hence the linking in an extra block is always wrong at the
    root split level, and this manifests itself in repair as a directory
    corruption in a repaired directory, leaving the directory rebuild
    incomplete.

    This is a dir v2 zero-day bug - it was in the initial dir v2 commit
    that was made back in February 1998.

    Fix this by ensuring the linking of the blocks after the root split
    never tries to make use of the extra blocks that may be held in the
    cursor. They are held there for other purposes and should never be
    touched by the root splitting code.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • We check IS_DAX(inode) before calling either xfs_file_dax_read or
    xfs_file_dax_write, and this will lead to the call being optimized out at
    compile time when CONFIG_FS_DAX is disabled.

    However, the two functions are marked STATIC, so they become global
    symbols when CONFIG_XFS_DEBUG is set, leaving us with two unused global
    functions that call into an undefined function and a broken "allmodconfig"
    build:

    fs/built-in.o: In function `xfs_file_dax_read':
    fs/xfs/xfs_file.c:348: undefined reference to `dax_do_io'
    fs/built-in.o: In function `xfs_file_dax_write':
    fs/xfs/xfs_file.c:758: undefined reference to `dax_do_io'

    Marking the two functions 'static noinline' instead of 'STATIC' will let
    the compiler drop the symbols when there are no callers but avoid the
    implicit inlining.
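
    Roughly, the change looks like this (a simplified rendering, not the
    exact xfs_linux.h definitions):

    #ifdef DEBUG
    # define STATIC                 /* debug builds: global symbol, never
                                       discarded by the compiler */
    #else
    # define STATIC static noinline
    #endif

    /* the fix: make these 'static noinline' unconditionally, so the
     * compiler may drop them when CONFIG_FS_DAX leaves them unused */
    static noinline ssize_t
    xfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);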

    Signed-off-by: Arnd Bergmann
    Fixes: 16d4d43595b4 ("xfs: split direct I/O and DAX path")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Arnd Bergmann
     
  • XFS has had scattered reports of delalloc blocks present at
    ->releasepage() time. This results in a warning with a stack trace
    similar to the following:

    ...
    Call Trace:
    [] dump_stack+0x63/0x84
    [] warn_slowpath_common+0x97/0xe0
    [] warn_slowpath_null+0x1a/0x20
    [] xfs_vm_releasepage+0x10f/0x140
    [] ? page_mkclean_one+0xd0/0xd0
    [] ? anon_vma_prepare+0x150/0x150
    [] try_to_release_page+0x32/0x50
    [] shrink_active_list+0x3ce/0x3e0
    [] shrink_lruvec+0x687/0x7d0
    [] shrink_zone+0xdc/0x2c0
    [] kswapd+0x4f9/0x970
    [] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
    [] kthread+0xc9/0xe0
    [] ? kthread_stop+0x100/0x100
    [] ret_from_fork+0x3f/0x70
    [] ? kthread_stop+0x100/0x100

    This occurs because it is possible for shrink_active_list() to send
    pages marked dirty to ->releasepage() when certain buffer_head threshold
    conditions are met. shrink_active_list() doesn't check the page dirty
    state apparently to handle an old ext3 corner case where in some cases
    clean pages would not have the dirty bit cleared, thus it is up to the
    filesystem to determine how to handle the page.

    XFS currently handles the delalloc case properly, but this behavior
    makes the warning spurious. Update the XFS ->releasepage() handler to
    explicitly skip dirty pages. Retain the existing delalloc/unwritten
    checks so we continue to warn if such buffers exist on clean pages when
    they shouldn't.
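
    A minimal sketch of the described change (simplified; the real
    handler does more than this):

    static int
    toy_vm_releasepage(struct page *page, gfp_t gfp_mask)
    {
            /* dirty pages can legitimately carry delalloc buffers here,
             * so bail out before the delalloc/unwritten warnings fire */
            if (PageDirty(page))
                    return 0;

            /* existing delalloc/unwritten warnings and
             * try_to_free_buffers() follow for clean pages */
            return try_to_free_buffers(page);
    }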

    Diagnosed-by: Dave Chinner
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

20 Jul, 2016

19 commits

  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Instead we always declare struct xfs_dir2_sf_hdr as packed. That's
    the expected layout, and while most major architectures do the packing
    by default the new structure size and offset checker showed that not
    only the ARM old ABI got this wrong, but various minor embedded
    architectures did as well.

    [Verified that no code change on x86-64 results from this change]
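
    For reference, the shape of the change is along these lines (a
    sketch, not the literal header):

    /* on-disk short-form directory header: the layout must not depend
     * on the architecture's default struct packing */
    struct xfs_dir2_sf_hdr {
            __u8    count;          /* count of entries */
            __u8    i8count;        /* count of 8-byte inode numbers */
            __u8    parent[8];      /* parent directory inode number */
    } __attribute__((packed));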

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • And use an array of unsigned char values directly to avoid problems
    with architectures that pad the size of structures. This also gets
    rid of the xfs_dir2_ino4_t and xfs_dir2_ino8_t types, and introduces
    new constants for the size of 4 and 8 bytes as well as the size
    difference between the two.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Just use an array of two unsigned chars directly to avoid problems
    with architectures that pad the size of structures.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • So far the DAX code overloaded the direct I/O code path. There is very little
    in common between the two, and untangling them allows us to clean up both variants.

    As a side effect we also get separate trace points for both I/O types.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • We control both the callers and callees of ->direct_IO, so remove the
    indirect calls.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • XFS already implements its own flushing of the pagecache because it
    implements proper synchronization for direct I/O reads. This means
    calling generic_file_read_iter for direct I/O is rather useless,
    as it doesn't do much beyond updating the atime and iocb position for
    us. This also gets rid of the buffered I/O fallback that isn't used
    for XFS.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Similar to what we did on the write side a while ago.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • All three low-level read implementations that we might call already
    take care of not overflowing the maximum supported bytes, no need to
    duplicate it here.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Now that we have the direct I/O kiocb flag there is no real need to sample
    the value inside of XFS, and the invis flag was always just partially used
    and isn't worth keeping this infrastructure around for. This also splits
    the read tracepoint into buffered vs direct, as we did for writes a
    long time ago.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Instead check the file pointer for the invisible I/O flag directly, and
    use the chance to drop redundant arguments from the xfs_ioc_space
    prototype.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Newly allocated XFS metadata buffers are added to the LRU once the hold
    count is released, which typically occurs after I/O completion. There is
    no other mechanism at present that tracks the existence or I/O state of
    a new buffer. Further, readahead I/O tends to be submitted
    asynchronously by nature, which means the I/O can remain in flight and
    actually complete long after the calling context is gone. This means
    that file descriptors or any other holds on the filesystem can be
    released, allowing the filesystem to be unmounted while I/O is still in
    flight. When I/O completion occurs, core data structures may have been
    freed, causing completion to run into invalid memory accesses and likely
    to panic.

    This problem is reproduced on XFS via directory readahead. A filesystem
    is mounted, a directory is opened/closed and the filesystem immediately
    unmounted. The open/close cycle triggers a directory readahead that if
    delayed long enough, runs buffer I/O completion after the unmount has
    completed.

    To address this problem, add a mechanism to track all in-flight,
    asynchronous buffers using per-cpu counters in the buftarg. The buffer
    is accounted on the first I/O submission after the current reference is
    acquired and unaccounted once the buffer is returned to the LRU or
    freed. Update xfs_wait_buftarg() to wait on all in-flight I/O before
    walking the LRU list. Once in-flight I/O has completed and the workqueue
    has drained, all new buffers should have been released onto the LRU.
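
    In outline, the accounting described works like this (a sketch; the
    field and helper names are illustrative):

    /* first async I/O submission after the current hold: account it */
    percpu_counter_inc(&btp->bt_io_count);

    /* ... on release to the LRU, or when the buffer is freed ... */
    percpu_counter_dec(&btp->bt_io_count);

    /* xfs_wait_buftarg(): drain all in-flight I/O before the LRU walk */
    while (percpu_counter_sum(&btp->bt_io_count))
            delay(100);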

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • The upcoming buftarg I/O accounting mechanism maintains a count of
    all buffers that have undergone I/O in the current hold-release
    cycle. Certain buffers associated with core infrastructure (e.g.,
    the xfs_mount superblock buffer, log buffers) are never released,
    however. This means that accounting I/O submission on such buffers
    elevates the buftarg count indefinitely and could lead to lockup on
    unmount.

    Define a new buffer flag to explicitly exclude buffers from buftarg
    I/O accounting. Set the flag on the superblock and associated log
    buffers.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • With the code as it stands today, b_retries never increments because
    it gets reset to 0 in the error callback.

    Remove that, and fix a similar problem where the first retry time
    was constantly being overwritten, which defeated the timeout tunable
    as well. We now only set first retry time if a non-zero timeout is
    set, to match the behavior of only incrementing retries if a retry
    value is set.

    This way max retries & timeouts consistently take effect after a
    tunable is set, rather than acting retroactively on a buffer which
    has failed at some point in the past and has accumulated state from
    those prior failures.

    Thanks to dchinner for talking through this with me.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Carlos Maiolino
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • Fix up a couple places where extra flag manipulation occurs.

    In the first case we clear XBF_ASYNC and then immediately reset it -
    so don't bother clearing in the first place.

    In the 2nd case we are at a point in the function where the buffer
    must already be async, so there is no need to reset it.

    Add consistent spacing around the " | " while we're at it.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Carlos Maiolino
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • xfs_error_get_cfg() is called with bp->b_error as an arg, which is
    negative, so the switch statement won't ever find any matches.

    This results in only the default error handler having any effect, as
    EIO/ENOSPC/ENODEV get ignored due to the wrong sign.

    It seems simplest to always flip the error sign to positive, so that
    we can handle either negative errors in bp->b_error, or possibly a
    positive errno via something like xfs_error_get_cfg(EIO) - this
    future-proofs the function.
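
    A small illustration of the sign normalisation (simplified; not the
    full xfs_error_get_cfg()):

    #include <errno.h>

    /* accept either a negative bp->b_error or a positive errno */
    static int classify_error(int error)
    {
            if (error < 0)
                    error = -error;

            switch (error) {
            case EIO:
            case ENOSPC:
            case ENODEV:
                    return 1;       /* a specific error config exists */
            default:
                    return 0;       /* falls back to the default handler */
            }
    }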

    Signed-off-by: Eric Sandeen
    Reviewed-by: Carlos Maiolino
    Signed-off-by: Dave Chinner

    Eric Sandeen