Eric Lee / smarc-fsl-linux-kernel

19 Jan, 2016

1 commit

4c931f770 Merge branch 'xfs-setxattr-promotion' into for-next Browse Code »

Dave Chinner
2016-01-19 05:16:08 +0800

12 Jan, 2016

3 commits

dde7f55bd Merge branch 'xfs-misc-fixes-for-4.5-2' into for-next Browse Code »

Dave Chinner
2016-01-12 04:04:30 +0800
7d6a13f02 xfs: handle dquot buffer readahead in log recovery correctly ... Browse Code »

When we do dquot readahead in log recovery, we do not use a verifier
as the underlying buffer may not have dquots in it. e.g. the
allocation operation hasn't yet been replayed. Hence we do not want
to fail recovery because we detect an operation to be replayed has
not been run yet. This problem was addressed for inodes in commit
d891400 ("xfs: inode buffers may not be valid during recovery
readahead") but the problem was not recognised to exist for dquots
and their buffers as the dquot readahead did not have a verifier.

The result of not using a verifier is that when the buffer is then
next read to replay a dquot modification, the dquot buffer verifier
will only be attached to the buffer if *readahead is not complete*.
Hence we can read the buffer, replay the dquot changes and then add
it to the delwri submission list without it having a verifier
attached to it. This then generates warnings in xfs_buf_ioapply(),
which catches and warns about this case.

Fix this and make it handle the same readahead verifier error cases
as for inode buffers by adding a new readahead verifier that has a
write operation as well as a read operation that marks the buffer as
not done if any corruption is detected. Also make sure we don't run
readahead if the dquot buffer has been marked as cancelled by
recovery.

This will result in readahead either succeeding and the buffer
having a valid write verifier, or readahead failing and the buffer
state requiring the subsequent read to resubmit the IO with the new
verifier. In either case, this will result in the buffer always
ending up with a valid write verifier on it.

Note: we also need to fix the inode buffer readahead error handling
to mark the buffer with EIO. Brian noticed the code I copied from
there wrong during review, so fix it at the same time. Add comments
linking the two functions that handle readahead verifier errors
together so we don't forget this behavioural link in future.

cc: # 3.12 - current
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-12 04:04:01 +0800
b79f4a1c6 xfs: inode recovery readahead can race with inode buffer creation ... Browse Code »

When we do inode readahead in log recovery, we do can do the
readahead before we've replayed the icreate transaction that stamps
the buffer with inode cores. The inode readahead verifier catches
this and marks the buffer as !done to indicate that it doesn't yet
contain valid inodes.

In adding buffer error notification (i.e. setting b_error = -EIO at
the same time as as we clear the done flag) to such a readahead
verifier failure, we can then get subsequent inode recovery failing
with this error:

XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32

This occurs when readahead completion races with icreate item replay
such as:

inode readahead
find buffer
lock buffer
submit RA io
....
icreate recovery
xfs_trans_get_buffer
find buffer
lock buffer

.....

fails verifier
clear XBF_DONE
set bp->b_error = -EIO
release and unlock buffer

icreate initialises buffer
marks buffer as done
adds buffer to delayed write queue
releases buffer

At this point, we have an initialised inode buffer that is up to
date but has an -EIO state registered against it. When we finally
get to recovering an inode in that buffer:

inode item recovery
xfs_trans_read_buffer
find buffer
lock buffer
sees XBF_DONE is set, returns buffer
sees bp->b_error is set
fail log recovery!

Essentially, we need xfs_trans_get_buf_map() to clear the error status of
the buffer when doing a lookup. This function returns uninitialised
buffers, so the buffer returned can not be in an error state and
none of the code that uses this function expects b_error to be set
on return. Indeed, there is an ASSERT(!bp->b_error); in the
transaction case in xfs_trans_get_buf_map() that would have caught
this if log recovery used transactions....

This patch firstly changes the inode readahead failure to set -EIO
on the buffer, and secondly changes xfs_buf_get_map() to never
return a buffer with an error state set so this first change doesn't
cause unexpected log recovery failures.

cc: # 3.12 - current
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-12 04:03:44 +0800

11 Jan, 2016

1 commit

f6106efae xfs: eliminate committed arg from xfs_bmap_finish ... Browse Code »

Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
associated comments were replicated several times across
the attribute code, all dealing with what to do if the
transaction was or wasn't committed.

And in that replicated code, an ASSERT() test of an
uninitialized variable occurs in several locations:

error = xfs_attr_thing(&args);
if (!error) {
error = xfs_bmap_finish(&args.trans, args.flist,
&committed);
}
if (error) {
ASSERT(committed);

If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
never set "committed", and then test it in the ASSERT.

Fix this up by moving the committed state internal to xfs_bmap_finish,
and add a new inode argument. If an inode is passed in, it is passed
through to __xfs_trans_roll() and joined to the transaction there if
the transaction was committed.

xfs_qm_dqalloc() was a little unique in that it called bjoin rather
than ijoin, but as Dave points out we can detect the committed state
but checking whether (*tpp != tp).

Addresses-Coverity-Id: 102360
Addresses-Coverity-Id: 102361
Addresses-Coverity-Id: 102363
Addresses-Coverity-Id: 102364
Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Eric Sandeen
2016-01-11 08:34:01 +0800

08 Jan, 2016

1 commit

e35438196 xfs: bmapbt checking on debug kernels too expensive ... Browse Code »

For large sparse or fragmented files, checking every single entry in
the bmapbt on every operation is prohibitively expensive. Especially
as such checks rarely discover problems during normal operations on
high extent coutn files. Our regression tests don't tend to exercise
files with hundreds of thousands to millions of extents, so mostly
this isn't noticed.

However, trying to run things like xfs_mdrestore of large filesystem
dumps on a debug kernel quickly becomes impossible as the CPU is
completely burnt up repeatedly walking the sparse file bmapbt that
is generated for every allocation that is made.

Hence, if the file has more than 10,000 extents, just don't bother
with walking the tree to check it exhaustively. The btree code has
checks that ensure that the newly inserted/removed/modified record
is correctly ordered, so the entrie tree walk in thses cases has
limited additional value.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-08 08:28:49 +0800

05 Jan, 2016

1 commit

7eeabbd4b Merge branch 'xfs-misc-fixes-for-4.5' into for-next Browse Code »

Dave Chinner
2016-01-05 05:08:35 +0800

04 Jan, 2016

11 commits

58f88ca2d xfs: introduce per-inode DAX enablement ... Browse Code »

Rather than just being able to turn DAX on and off via a mount
option, some applications may only want to enable DAX for certain
performance critical files in a filesystem.

This patch introduces a new inode flag to enable DAX in the v3 inode
di_flags2 field. It adds support for setting and clearing flags in
the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the
S_DAX inode flag appropriately when it is seen.

When this flag is set on a directory, it acts as an "inherit flag".
That is, inodes created in the directory will automatically inherit
the on-disk inode DAX flag, enabling administrators to set up
directory heirarchies that automatically use DAX. Setting this flag
on an empty root directory will make the entire filesystem use DAX
by default.

Signed-off-by: Dave Chinner

Dave Chinner
2016-01-04 13:44:15 +0800
e7b894810 xfs: use FS_XFLAG definitions directly ... Browse Code »

Now that the ioctls have been hoisted up to the VFS level, use
the VFs definitions directly and remove the XFS specific definitions
completely. Userspace is going to have to handle the change of this
interface separately, so removing the definitions from xfs_fs.h is
not an issue here at all.

Signed-off-by: Dave Chinner

Dave Chinner
2016-01-04 13:44:15 +0800
334e580a6 fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion ... Browse Code »

Hoist the ioctl definitions for the XFS_IOC_FS[SG]SETXATTR API from
fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls
can be used by all filesystems, not just XFS. This enables
(initially) ext4 to use the ioctl to set project IDs on inodes.

Based-on-patch-from: Li Xi
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-04 13:44:15 +0800
c5ab131ba libxfs: refactor short btree block verification ... Browse Code »

Create xfs_btree_sblock_verify() to verify short-format btree blocks
(i.e. the per-AG btrees with 32-bit block pointers) instead of
open-coding them.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:13:21 +0800
96f859d52 libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct ... Browse Code »

Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
inside it, gcc will quietly round the structure size up to the nearest
64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE
macro returning incorrect results for v5 filesystems on 64-bit
machines (118 items instead of 119). As a result, a 32-bit xfs_repair
will see garbage in AGFL item 119 and complain.

Therefore, tell gcc not to pad the structure so that the AGFL size
calculation is correct.

cc: # 3.10 - 4.4
Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:13:21 +0800
6d3eb1eca libxfs: use a convenience variable instead of open-coding the fork ... Browse Code »

Use a convenience variable instead of open-coding the inode fork.
This isn't really needed for now, but will become important when we
add the copy-on-write fork later.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:12:42 +0800
2e9101da6 libxfs: make xfs_alloc_fix_freelist non-static ... Browse Code »

Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the
static designation. xfsprogs already has this; this simply brings
the kernel up to date.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:10:42 +0800
f1f96c494 xfs: get mp from bma->ip in xfs_bmap code ... Browse Code »

In my earlier commit

c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO

I added some local mp variables with code which indicates that
mp might be NULL. Coverity doesn't like this now, because the
updated per-fs XFS_STATS macros dereference mp.

I don't think this is actually a problem; from what I can tell,
we cannot get to these functions with a null bma->tp, so my NULL
check was probably pointless. Still, it's not super obvious.

So switch this code to get mp from the inode on the xfs_bmalloca
structure, with no conditional, because the functions are already
using bmap->ip directly.

Addresses-Coverity-Id: 1339552
Addresses-Coverity-Id: 1339553
Signed-off-by: Eric Sandeen
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Eric Sandeen
2016-01-04 13:10:42 +0800
233135b76 xfs: print name of verifier if it fails ... Browse Code »

This adds a name to each buf_ops structure, so that if
a verifier fails we can print the type of verifier that
failed it. Should be a slight debugging aid, I hope.

Signed-off-by: Eric Sandeen
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Eric Sandeen
2016-01-04 13:10:19 +0800
1d4292bfd libxfs: Optimize the loop for xfs_bitmap_empty ... Browse Code »

If there is any non zero bit in a long bitmap, it can jump out of the
loop and finish the function as soon as possible.

Signed-off-by: Jia He
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Jia He
2016-01-04 13:10:19 +0800
6528250b7 xfs: support a crc verification only log record pass ... Browse Code »

Log recovery torn write detection uses CRC verification over a range of
the active log to identify torn writes. Since the generic log recovery
pass code implements a superset of the functionality required for CRC
verification, it can be easily modified to support a CRC verification
only pass.

Create a new CRC pass type and update the log record processing helper
to skip everything beyond CRC verification when in this mode. This pass
will be invoked in subsequent patches to implement torn write detection.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Brian Foster
2016-01-04 12:55:10 +0800

03 Nov, 2015

4 commits

264e89ad3 Merge branch 'xfs-dax-updates' into for-next Browse Code »

Dave Chinner
2015-11-03 10:28:41 +0800
2da5c4b05 Merge branch 'xfs-misc-fixes-for-4.4-2' into for-next Browse Code »

Dave Chinner
2015-11-03 10:27:58 +0800
86a21c797 xfs: Validate the length of on-disk ACLs ... Browse Code »

In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct,
make sure that the length of the attributes is correct as well. Also, turn
the aclp parameter into a const pointer.

Signed-off-by: Andreas Gruenbacher
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Andreas Gruenbacher
2015-11-03 09:41:59 +0800
3fbbbea34 xfs: introduce BMAPI_ZERO for allocating zeroed extents ... Browse Code »

To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can per-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2015-11-03 09:27:22 +0800

12 Oct, 2015

8 commits

1e2103cbf Merge branch 'xfs-misc-fixes-for-4.4-1' into for-next Browse Code »

Dave Chinner
2015-10-12 15:38:25 +0800
316433bed Merge branch 'xfs-logging-fixes' into for-next Browse Code »

Dave Chinner
2015-10-12 15:37:58 +0800
ff6d6af23 xfs: per-filesystem stats counter implementation ... Browse Code »

This patch modifies the stats counting macros and the callers
to those macros to properly increment, decrement, and add-to
the xfs stats counts. The counts for global and per-fs stats
are correctly advanced, and cleared by writing a "1" to the
corresponding clear file.

global counts: /sys/fs/xfs/stats/stats
per-fs counts: /sys/fs/xfs/sda*/stats/stats

global clear: /sys/fs/xfs/stats/stats_clear
per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear

[dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
stats structures and macros. ]

Signed-off-by: Bill O'Donnell
Reviewed-by: Eric Sandeen
Signed-off-by: Dave Chinner

Bill O'Donnell
2015-10-12 15:21:22 +0800
51fcbfe70 xfs: avoid dependency on Linux XATTR_SIZE_MAX ... Browse Code »

Currently, we depends on Linux XATTR value for on disk
definition. Which causes trouble on other platforms and
maybe also if this value was to change.

Fix it by creating a custom definition independent from
those in Linux (although with the same values), so it is OK
with the be16 fields used for holding these attributes.

This patch reflects a change in xfsprogs.

Signed-off-by: Jan Tulak
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Jan Tulak
2015-10-12 13:03:59 +0800
4e247614a xfs: prefix XATTR_LIST_MAX with XFS_ ... Browse Code »

Remove a hard dependency of Linux XATTR_LIST_MAX value by using
a prefixed version. This patch reflects the same change in xfsprogs.

Signed-off-by: Jan Tulak
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Jan Tulak
2015-10-12 13:02:56 +0800
fef4ded8c libxfs: fix two comment typos ... Browse Code »

Just fix two typos in code comments.

Signed-off-by: Geliang Tang
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Geliang Tang
2015-10-12 13:02:32 +0800
a45086e27 xfs: validate metadata LSNs against log on v5 superblocks ... Browse Code »

Since the onset of v5 superblocks, the LSN of the last modification has
been included in a variety of on-disk data structures. This LSN is used
to provide log recovery ordering guarantees (e.g., to ensure an older
log recovery item is not replayed over a newer target data structure).

While this works correctly from the point a filesystem is formatted and
mounted, userspace tools have some problematic behaviors that defeat
this mechanism. For example, xfs_repair historically zeroes out the log
unconditionally (regardless of whether corruption is detected). If this
occurs, the LSN of the filesystem is reset and the log is now in a
problematic state with respect to on-disk metadata structures that might
have a larger LSN. Until either the log catches up to the highest
previously used metadata LSN or each affected data structure is modified
and written out without incident (which resets the metadata LSN), log
recovery is susceptible to filesystem corruption.

This problem is ultimately addressed and repaired in the associated
userspace tools. The kernel is still responsible to detect the problem
and notify the user that something is wrong. Check the superblock LSN at
mount time and fail the mount if it is invalid. From that point on,
trigger verifier failure on any metadata I/O where an invalid LSN is
detected. This results in a filesystem shutdown and guarantees that we
do not log metadata changes with invalid LSNs on disk. Since this is a
known issue with a known recovery path, present a warning to instruct
the user how to recover.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Brian Foster
2015-10-12 12:59:25 +0800
b7cdc66be xfs: log local to remote symlink conversions correctly on v5 supers ... Browse Code »

A local format symlink inode is converted to extent format when an
extended attribute is set on an inode as part of the attribute fork
creation. This means a block is allocated, the local symlink target name
is copied to the block and the block is logged. Currently,
xfs_bmap_local_to_extents() handles logging the remote block data based
on the size of the data fork prior to the conversion. This is not
correct on v5 superblock filesystems, which add an additional header to
remote symlink blocks that is nonexistent in local format inodes.

As a result, the full length of the remote symlink block content is not
logged. This can lead to corruption should a crash occur and log
recovery replay this transaction.

Since a callout is already used to initialize the new remote symlink
block, update the local-to-extents conversion mechanism to make the
callout also responsible for logging the block. It is already required
to set the log buffer type and format the block appropriately based on
the superblock version. This ensures the remote symlink is always logged
correctly. Note that xfs_bmap_local_to_extents() is only called for
symlinks so there are no other callouts that require modification.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Brian Foster
2015-10-12 12:40:24 +0800

01 Sep, 2015

1 commit

5d54b8cde Merge branch 'xfs-misc-fixes-for-4.3-4' into for-next Browse Code »

Dave Chinner
2015-09-01 08:30:11 +0800

28 Aug, 2015

1 commit

dfdd4ac66 libxfs: bad magic number should set da block buffer error ... Browse Code »

If xfs_da3_node_read_verify() doesn't recognize the magic number of a
buffer it's just read, set the buffer error to -EFSCORRUPTED so that
the error can be sent up to userspace. Without this patch we'll
notice the bad magic eventually while trying to traverse or change
the block, but we really ought to fail early in the verifier.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2015-08-28 12:50:03 +0800

25 Aug, 2015

3 commits

70b33a746 Merge branch 'xfs-misc-fixes-for-4.3-3' into for-next Browse Code »

Dave Chinner
2015-08-25 08:13:35 +0800
037542345 xfs: Fix file type directory corruption for btree directories ... Browse Code »

Users have occasionally reported that file type for some directory
entries is wrong. This mostly happened after updating libraries some
libraries. After some debugging the problem was traced down to
xfs_dir2_node_replace(). The function uses args->filetype as a file type
to store in the replaced directory entry however it also calls
xfs_da3_node_lookup_int() which will store file type of the current
directory entry in args->filetype. Thus we fail to change file type of a
directory entry to a proper type.

Fix the problem by storing new file type in a local variable before
calling xfs_da3_node_lookup_int().

cc: # 3.16 - 4.x
Reported-by: Giacomo Comes
Signed-off-by: Jan Kara
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Jan Kara
2015-08-25 08:05:13 +0800
c184f855c xfs: Fix uninitialized return value in xfs_alloc_fix_freelist() ... Browse Code »

xfs_alloc_fix_freelist() can sometimes jump to out_agbp_relse
without ever setting value of 'error' variable which is then
returned. This can happen e.g. when pag->pagf_init is set but AG is
for metadata and we want to allocate user data.

Fix the problem by initializing 'error' to 0, which is the desired
return value when we decide to skip this group.

CC: xfs@oss.sgi.com
Coverity-id: 1309714
Signed-off-by: Jan Kara
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Jan Kara
2015-08-25 08:05:13 +0800

20 Aug, 2015

1 commit

aa493382c Merge branch 'xfs-misc-fixes-for-4.3-2' into for-next Browse Code »

Dave Chinner
2015-08-20 07:28:45 +0800

19 Aug, 2015

4 commits

ffeecc521 xfs: Fix xfs_attr_leafblock definition ... Browse Code »

struct xfs_attr_leafblock contains 'entries' array which is declared
with size 1 altough it can in fact contain much more entries. Since this
array is followed by further struct members, gcc (at least in version
4.8.3) thinks that the array has the fixed size of 1 element and thus
may optimize away all accesses beyond the end of array resulting in
non-working code. This problem was only observed with userspace code in
xfsprogs, however it's better to be safe in kernel as well and have
matching kernel and xfsprogs definitions.

cc:
Signed-off-by: Jan Kara
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Jan Kara
2015-08-19 08:34:32 +0800
2f123bce1 libxfs: readahead of dir3 data blocks should use the read verifier ... Browse Code »

In the dir3 data block readahead function, use the regular read
verifier to check the block's CRC and spot-check the block contents
instead of directly calling only the spot-checking routine. This
prevents corrupted directory data blocks from being read into the
kernel, which can lead to garbage ls output and directory loops (if
say one of the entries contains slashes and other junk).

cc: # 3.12 - 4.2
Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2015-08-19 08:33:58 +0800
dbad7c993 xfs: stop holding ILOCK over filldir callbacks ... Browse Code »

The recent change to the readdir locking made in 40194ec ("xfs:
reinstate the ilock in xfs_readdir") for CXFS directory sanity was
probably the wrong thing to do. Deep in the readdir code we
can take page faults in the filldir callback, and so taking a page
fault while holding an inode ilock creates a new set of locking
issues that lockdep warns all over the place about.

The locking order for regular inodes w.r.t. page faults is io_lock
-> pagefault -> mmap_sem -> ilock. The directory readdir code now
triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
at this point, it inverts all the locking patterns that lockdep
normally sees on XFS inodes, and so triggers lockdep. We worked
around this with commit 93a8614 ("xfs: fix directory inode iolock
lockdep false positive"), but that then just moved the lockdep
warning to deeper in the page fault path and triggered on security
inode locks. Fixing the shmem issue there just moved the lockdep
reports somewhere else, and now we are getting false positives from
filesystem freezing annotations getting confused.

Further, if we enter memory reclaim in a readdir path, we now get
lockdep warning about potential deadlocks because the ilock is held
when we enter reclaim. This, again, is different to a regular file
in that we never allow memory reclaim to run while holding the ilock
for regular files. Hence lockdep now throws
ilock->kmalloc->reclaim->ilock warnings.

Basically, the problem is that the ilock is being used to protect
the directory data and the inode metadata, whereas for a regular
file the iolock protects the data and the ilock protects the
metadata. From the VFS perspective, the i_mutex serialises all
accesses to the directory data, and so not holding the ilock for
readdir doesn't matter. The issue is that CXFS doesn't access
directory data via the VFS, so it has no "data serialisaton"
mechanism. Hence we need to hold the IOLOCK in the correct places to
provide this low level directory data access serialisation.

The ilock can then be used just when the extent list needs to be
read, just like we do for regular files. The directory modification
code can take the iolock exclusive when the ilock is also taken,
and this then ensures that readdir is correct excluded while
modifications are in progress.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2015-08-19 08:33:00 +0800
7df1c170b xfs: swap leaf buffer into path struct atomically during path shift ... Browse Code »

The node directory lookup code uses a state structure that tracks the
path of buffers used to search for the hash of a filename through the
leaf blocks. When the lookup encounters a block that ends with the
requested hash, but the entry has not yet been found, it must shift over
to the next block and continue looking for the entry (i.e., duplicate
hashes could continue over into the next block). This shift mechanism
involves walking back up and down the state structure, replacing buffers
at the appropriate btree levels as necessary.

When a buffer is replaced, the old buffer is released and the new buffer
read into the active slot in the path structure. Because the buffer is
read directly into the path slot, a buffer read failure can result in
setting a NULL buffer pointer in an active slot. This throws off the
state cleanup code in xfs_dir2_node_lookup(), which expects to release a
buffer from each active slot. Instead, a BUG occurs due to a NULL
pointer dereference:

BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
IP: [] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
...
RIP: 0010:[] [] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
...
Call Trace:
[] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
[] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
[] xfs_lookup+0x91/0x290 [xfs]
[] xfs_vn_lookup+0x73/0xb0 [xfs]
[] lookup_real+0x1d/0x50
[] path_openat+0x91e/0x1490
[] do_filp_open+0x89/0x100
...

This has been reproduced via a parallel fsstress and filesystem shutdown
workload in a loop. The shutdown triggers the read error in the
aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().

Update xfs_da3_path_shift() to update the active path slot atomically
with respect to the caller when a buffer is replaced. This ensures that
the caller always sees the old or new buffer in the slot and prevents
the NULL pointer dereference.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Brian Foster
2015-08-19 08:32:33 +0800