Eric Lee / smarc-fsl-linux-kernel

24 Jan, 2016

1 commit

cc673757e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull final vfs updates from Al Viro:

- The ->i_mutex wrappers (with small prereq in lustre)

- a fix for too early freeing of symlink bodies on shmem (they need to
be RCU-delayed) (-stable fodder)

- followup to dedupe stuff merged this cycle

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: abort dedupe loop if fatal signals are pending
make sure that freeing shmem fast symlinks is RCU-delayed
wrappers for ->i_mutex access
lustre: remove unused declaration

Linus Torvalds
2016-01-24 04:24:56 +0800

23 Jan, 2016

3 commits

5eb88dca9 xfs: call dax_pfn_mkwrite() for DAX fsync/msync ... Browse Code »

To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler
Cc: "H. Peter Anvin"
Cc: "J. Bruce Fields"
Cc: "Theodore Ts'o"
Cc: Alexander Viro
Cc: Andreas Dilger
Cc: Dave Chinner
Cc: Ingo Molnar
Cc: Jan Kara
Cc: Jeff Layton
Cc: Matthew Wilcox
Cc: Thomas Gleixner
Cc: Dan Williams
Cc: Matthew Wilcox
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ross Zwisler
2016-01-23 09:02:18 +0800
5955102c9 wrappers for ->i_mutex access ... Browse Code »

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro

Al Viro
2016-01-23 07:04:28 +0800
d5ffdf8b4 Merge tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs ... Browse Code »

Pull more xfs updates from Dave Chinner:
"This is the second update for XFS that I mentioned in the original
pull request last week.

It contains a revert for a suspend regression in 4.4 and a fix for a
long standing log recovery issue that has been further exposed by all
the log recovery changes made in the original 4.5 merge.

There is one more thing in this pull request - one that I forgot to
merge into the origin. That is, pulling the XFS_IOC_FS[GS]ETXATTR
ioctl up to the VFS level so that other filesystems can also use it
for modifying project quota IDs

Summary:

- promotion of XFS_IOC_FS[GS]ETXATTR ioctl to the vfs level so that
it can be shared with other filesystems. The ext4 project quota
functionality is the first target for this. The commits in this
series have not been updated with review or final SOB tags because
the branch they were originally published in was needed by ext4.
Those tags are:

Reviewed-by: Theodore Ts'o
Signed-off-by: Dave Chinner

- Revert a change that is causing suspend failures.

- Fix a use-after-free that can occur on log mount failures. Been
around forever, but now exposed by other changes to log recovery
made in the first 4.5 merge"

* tag 'xfs-for-linus-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: log mount failures don't wait for buffers to be released
Revert "xfs: clear PF_NOFREEZE for xfsaild kthread"
xfs: introduce per-inode DAX enablement
xfs: use FS_XFLAG definitions directly
fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion

Linus Torvalds
2016-01-23 02:54:13 +0800

19 Jan, 2016

4 commits

ee3804d9f Merge branch 'xfs-misc-fixes-for-4.5-3' into for-next Browse Code »

Dave Chinner
2016-01-19 05:28:36 +0800
85bec5460 xfs: log mount failures don't wait for buffers to be released ... Browse Code »

Recently I've been seeing xfs/051 fail on 1k block size filesystems.
Trying to trace the events during the test lead to the problem going
away, indicating that it was a race condition that lead to this
ASSERT failure:

XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
.....
[] xfs_free_perag+0x87/0xb0
[] xfs_mountfs+0x4d9/0x900
[] xfs_fs_fill_super+0x3bf/0x4d0
[] mount_bdev+0x180/0x1b0
[] xfs_fs_mount+0x15/0x20
[] mount_fs+0x38/0x170
[] vfs_kern_mount+0x67/0x120
[] do_mount+0x218/0xd60
[] SyS_mount+0x8b/0xd0

When I finally caught it with tracing enabled, I saw that AG 2 had
an elevated reference count and a buffer was responsible for it. I
tracked down the specific buffer, and found that it was missing the
final reference count release that would put it back on the LRU and
hence be found by xfs_wait_buftarg() calls in the log mount failure
handling.

The last four traces for the buffer before the assert were (trimmed
for relevance)

kworker/0:1-5259 xfs_buf_iodone: hold 2 lock 0 flags ASYNC
kworker/0:1-5259 xfs_buf_ioerror: hold 2 lock 0 error -5
mount-7163 xfs_buf_lock_done: hold 2 lock 0 flags ASYNC
mount-7163 xfs_buf_unlock: hold 2 lock 1 flags ASYNC

This is an async write that is completing, so there's nobody waiting
for it directly. Hence we call xfs_buf_relse() once all the
processing is complete. That does:

static inline void xfs_buf_relse(xfs_buf_t *bp)
{
xfs_buf_unlock(bp);
xfs_buf_rele(bp);
}

Now, it's clear that mount is waiting on the buffer lock, and that
it has been released by xfs_buf_relse() and gained by mount. This is
expected, because at this point the mount process is in
xfs_buf_delwri_submit() waiting for all the IO it submitted to
complete.

The mount process, however, is waiting on the lock for the buffer
because it is in xfs_buf_delwri_submit(). This waits for IO
completion, but it doesn't wait for the buffer reference owned by
the IO to go away. The mount process collects all the completions,
fails the log recovery, and the higher level code then calls
xfs_wait_buftarg() to free all the remaining buffers in the
filesystem.

The issue is that on unlocking the buffer, the scheduler has decided
that the mount process has higher priority than the the kworker
thread that is running the IO completion, and so immediately
switched contexts to the mount process from the semaphore unlock
code, hence preventing the kworker thread from finishing the IO
completion and releasing the IO reference to the buffer.

Hence by the time that xfs_wait_buftarg() is run, the buffer still
has an active reference and so isn't on the LRU list that the
function walks to free the remaining buffers. Hence we miss that
buffer and continue onwards to tear down the mount structures,
at which time we get find a stray reference count on the perag
structure. On a non-debug kernel, this will be ignored and the
structure torn down and freed. Hence when the kworker thread is then
rescheduled and the buffer released and freed, it will access a
freed perag structure.

The problem here is that when the log mount fails, we still need to
quiesce the log to ensure that the IO workqueues have returned to
idle before we run xfs_wait_buftarg(). By synchronising the
workqueues, we ensure that all IO completions are fully processed,
not just to the point where buffers have been unlocked. This ensures
we don't end up in the situation above.

cc: # 3.18
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-19 05:28:10 +0800
3e85286e7 Revert "xfs: clear PF_NOFREEZE for xfsaild kthread" ... Browse Code »

This reverts commit 24ba16bb3d499c49974669cd8429c3e4138ab102 as it
prevents machines from suspending. This regression occurs when the
xfsaild is idle on entry to suspend, and so there s no activity to
wake it from it's idle sleep and hence see that it is supposed to
freeze. Hence the freezer times out waiting for it and suspend is
cancelled.

There is no obvious fix for this short of freezing the filesystem
properly, so revert this change for now.

cc: # 4.4
Signed-off-by: Dave Chinner
Acked-by: Jiri Kosina
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-19 05:21:46 +0800
4c931f770 Merge branch 'xfs-setxattr-promotion' into for-next Browse Code »

Dave Chinner
2016-01-19 05:16:08 +0800

15 Jan, 2016

1 commit

5d097056c kmemcg: account certain kmem allocations to memcg ... Browse Code »

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:

- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Tejun Heo
Cc: Greg Thelen
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2016-01-15 08:00:49 +0800

14 Jan, 2016

1 commit

7fdec82af Merge tag 'xfs-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs ... Browse Code »

Pull xfs updates from Dave Chinner:
"There's not a lot in this - the main addition is the CRC validation of
the entire region of the log that the will be recovered, along with
several log recovery fixes. Most of the rest is small bug fixes and
cleanups.

I have three bug fixes still pending, all that address recently fixed
regressions that I will send to next week after they've had some time
in for-next.

Summary:
- extensive CRC validation during log recovery
- several log recovery bug fixes
- Various DAX support fixes
- AGFL size calculation fix
- various cleanups in preparation for new functionality
- project quota ENOSPC notification via netlink
- tracing and debug improvements"

* tag 'xfs-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (26 commits)
xfs: handle dquot buffer readahead in log recovery correctly
xfs: inode recovery readahead can race with inode buffer creation
xfs: eliminate committed arg from xfs_bmap_finish
xfs: bmapbt checking on debug kernels too expensive
xfs: add tracepoints to readpage calls
xfs: debug mode log record crc error injection
xfs: detect and trim torn writes during log recovery
xfs: fix recursive splice read locking with DAX
xfs: Don't use reserved blocks for data blocks with DAX
XFS: Use a signed return type for suffix_kstrtoint()
libxfs: refactor short btree block verification
libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct
libxfs: use a convenience variable instead of open-coding the fork
xfs: fix log ticket type printing
libxfs: make xfs_alloc_fix_freelist non-static
xfs: make xfs_buf_ioend_async() static
xfs: send warning of project quota to userspace via netlink
xfs: get mp from bma->ip in xfs_bmap code
xfs: print name of verifier if it fails
libxfs: Optimize the loop for xfs_bitmap_empty
...

Linus Torvalds
2016-01-14 13:15:18 +0800

13 Jan, 2016

1 commit

33caf82ac Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.

Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.

One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.

There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:

-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----

I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...

Linus Torvalds
2016-01-13 09:11:47 +0800

12 Jan, 2016

4 commits

ddf1d6238 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs xattr updates from Al Viro:
"Andreas' xattr cleanup series.

It's a followup to his xattr work that went in last cycle; -0.5KLoC"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr handlers: Simplify list operation
ocfs2: Replace list xattr handler operations
nfs: Move call to security_inode_listsecurity into nfs_listxattr
xfs: Change how listxattr generates synthetic attributes
tmpfs: listxattr should include POSIX ACL xattrs
tmpfs: Use xattr handler infrastructure
btrfs: Use xattr handler infrastructure
vfs: Distinguish between full xattr names and proper prefixes
posix acls: Remove duplicate xattr name definitions
gfs2: Remove gfs2_xattr_acl_chmod
vfs: Remove vfs_xattr_cmp

Linus Torvalds
2016-01-12 05:32:10 +0800
dde7f55bd Merge branch 'xfs-misc-fixes-for-4.5-2' into for-next Browse Code »

Dave Chinner
2016-01-12 04:04:30 +0800
7d6a13f02 xfs: handle dquot buffer readahead in log recovery correctly ... Browse Code »

When we do dquot readahead in log recovery, we do not use a verifier
as the underlying buffer may not have dquots in it. e.g. the
allocation operation hasn't yet been replayed. Hence we do not want
to fail recovery because we detect an operation to be replayed has
not been run yet. This problem was addressed for inodes in commit
d891400 ("xfs: inode buffers may not be valid during recovery
readahead") but the problem was not recognised to exist for dquots
and their buffers as the dquot readahead did not have a verifier.

The result of not using a verifier is that when the buffer is then
next read to replay a dquot modification, the dquot buffer verifier
will only be attached to the buffer if *readahead is not complete*.
Hence we can read the buffer, replay the dquot changes and then add
it to the delwri submission list without it having a verifier
attached to it. This then generates warnings in xfs_buf_ioapply(),
which catches and warns about this case.

Fix this and make it handle the same readahead verifier error cases
as for inode buffers by adding a new readahead verifier that has a
write operation as well as a read operation that marks the buffer as
not done if any corruption is detected. Also make sure we don't run
readahead if the dquot buffer has been marked as cancelled by
recovery.

This will result in readahead either succeeding and the buffer
having a valid write verifier, or readahead failing and the buffer
state requiring the subsequent read to resubmit the IO with the new
verifier. In either case, this will result in the buffer always
ending up with a valid write verifier on it.

Note: we also need to fix the inode buffer readahead error handling
to mark the buffer with EIO. Brian noticed the code I copied from
there wrong during review, so fix it at the same time. Add comments
linking the two functions that handle readahead verifier errors
together so we don't forget this behavioural link in future.

cc: # 3.12 - current
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-12 04:04:01 +0800
b79f4a1c6 xfs: inode recovery readahead can race with inode buffer creation ... Browse Code »

When we do inode readahead in log recovery, we do can do the
readahead before we've replayed the icreate transaction that stamps
the buffer with inode cores. The inode readahead verifier catches
this and marks the buffer as !done to indicate that it doesn't yet
contain valid inodes.

In adding buffer error notification (i.e. setting b_error = -EIO at
the same time as as we clear the done flag) to such a readahead
verifier failure, we can then get subsequent inode recovery failing
with this error:

XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32

This occurs when readahead completion races with icreate item replay
such as:

inode readahead
find buffer
lock buffer
submit RA io
....
icreate recovery
xfs_trans_get_buffer
find buffer
lock buffer

.....

fails verifier
clear XBF_DONE
set bp->b_error = -EIO
release and unlock buffer

icreate initialises buffer
marks buffer as done
adds buffer to delayed write queue
releases buffer

At this point, we have an initialised inode buffer that is up to
date but has an -EIO state registered against it. When we finally
get to recovering an inode in that buffer:

inode item recovery
xfs_trans_read_buffer
find buffer
lock buffer
sees XBF_DONE is set, returns buffer
sees bp->b_error is set
fail log recovery!

Essentially, we need xfs_trans_get_buf_map() to clear the error status of
the buffer when doing a lookup. This function returns uninitialised
buffers, so the buffer returned can not be in an error state and
none of the code that uses this function expects b_error to be set
on return. Indeed, there is an ASSERT(!bp->b_error); in the
transaction case in xfs_trans_get_buf_map() that would have caught
this if log recovery used transactions....

This patch firstly changes the inode readahead failure to set -EIO
on the buffer, and secondly changes xfs_buf_get_map() to never
return a buffer with an error state set so this first change doesn't
cause unexpected log recovery failures.

cc: # 3.12 - current
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-12 04:03:44 +0800

11 Jan, 2016

1 commit

f6106efae xfs: eliminate committed arg from xfs_bmap_finish ... Browse Code »

Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
associated comments were replicated several times across
the attribute code, all dealing with what to do if the
transaction was or wasn't committed.

And in that replicated code, an ASSERT() test of an
uninitialized variable occurs in several locations:

error = xfs_attr_thing(&args);
if (!error) {
error = xfs_bmap_finish(&args.trans, args.flist,
&committed);
}
if (error) {
ASSERT(committed);

If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
never set "committed", and then test it in the ASSERT.

Fix this up by moving the committed state internal to xfs_bmap_finish,
and add a new inode argument. If an inode is passed in, it is passed
through to __xfs_trans_roll() and joined to the transaction there if
the transaction was committed.

xfs_qm_dqalloc() was a little unique in that it called bjoin rather
than ijoin, but as Dave points out we can detect the committed state
but checking whether (*tpp != tp).

Addresses-Coverity-Id: 102360
Addresses-Coverity-Id: 102361
Addresses-Coverity-Id: 102363
Addresses-Coverity-Id: 102364
Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Eric Sandeen
2016-01-11 08:34:01 +0800

08 Jan, 2016

2 commits

e35438196 xfs: bmapbt checking on debug kernels too expensive ... Browse Code »

For large sparse or fragmented files, checking every single entry in
the bmapbt on every operation is prohibitively expensive. Especially
as such checks rarely discover problems during normal operations on
high extent coutn files. Our regression tests don't tend to exercise
files with hundreds of thousands to millions of extents, so mostly
this isn't noticed.

However, trying to run things like xfs_mdrestore of large filesystem
dumps on a debug kernel quickly becomes impossible as the CPU is
completely burnt up repeatedly walking the sparse file bmapbt that
is generated for every allocation that is made.

Hence, if the file has more than 10,000 extents, just don't bother
with walking the tree to check it exhaustively. The btree code has
checks that ensure that the newly inserted/removed/modified record
is correctly ordered, so the entrie tree walk in thses cases has
limited additional value.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-08 08:28:49 +0800
121e213ea xfs: add tracepoints to readpage calls ... Browse Code »

This allows us to see page cache driven readahead in action as it
passes through XFS. This helps to understand buffered read
throughput problems such as readahead IO IO sizes being too small
for the underlying device to reach max throughput.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-08 08:28:35 +0800

07 Jan, 2016

1 commit

a1c6f0573 fs: use block_device name vsprintf helper ... Browse Code »

Signed-off-by: Dmitry Monakhov
Signed-off-by: Al Viro

Dmitry Monakhov
2016-01-07 02:03:18 +0800

05 Jan, 2016

4 commits

4922be51e Merge branch 'xfs-dax-fixes-for-4.5' into for-next Browse Code »

Dave Chinner
2016-01-05 05:08:47 +0800
7eeabbd4b Merge branch 'xfs-misc-fixes-for-4.5' into for-next Browse Code »

Dave Chinner
2016-01-05 05:08:35 +0800
609adfc2e xfs: debug mode log record crc error injection ... Browse Code »

XFS now uses CRC verification over a limited section of the log to
detect torn writes prior to a crash. This is difficult to test directly
due to the timing and hardware requirements to cause a short write.

Add a mechanism to inject CRC errors into log records to facilitate
testing torn write detection during log recovery. This mechanism is
dangerous and can result in filesystem corruption. Thus, it is only
available in DEBUG mode for testing/development purposes. Set a non-zero
value to the following sysfs entry to enable error injection:

/sys/fs/xfs//log/log_badcrc_factor

Once enabled, XFS intentionally writes an invalid CRC to a log record at
some random point in the future based on the provided frequency. The
filesystem immediately shuts down once the record has been written to
the physical log to prevent metadata writeback (e.g., AIL insertion)
once the log write completes. This helps reasonably simulate a torn
write to the log as the affected record must be safe to discard. The
next mount after the intentional shutdown requires log recovery and
should detect and recover from the torn write.

Note again that this _will_ result in data loss or worse. For testing
and development purposes only!

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Brian Foster
2016-01-05 04:41:16 +0800
7088c4136 xfs: detect and trim torn writes during log recovery ... Browse Code »

Certain types of storage, such as persistent memory, do not provide
sector atomicity for writes. This means that if a crash occurs while XFS
is writing log records, only part of those records might make it to the
storage. This is problematic because log recovery uses the cycle value
packed at the top of each log block to locate the head/tail of the log.
This can lead to CRC verification failures during log recovery and an
unmountable fs for a filesystem that is otherwise consistent.

Update log recovery to incorporate log record CRC verification as part
of the head/tail discovery process. Once the head is located via the
traditional algorithm, run a CRC-only pass over the records up to the
head of the log. If CRC verification fails, assume that the records are
torn as a matter of policy and trim the head block back to the start of
the first bad record.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Brian Foster
2016-01-05 04:40:16 +0800

04 Jan, 2016

17 commits

58f88ca2d xfs: introduce per-inode DAX enablement ... Browse Code »

Rather than just being able to turn DAX on and off via a mount
option, some applications may only want to enable DAX for certain
performance critical files in a filesystem.

This patch introduces a new inode flag to enable DAX in the v3 inode
di_flags2 field. It adds support for setting and clearing flags in
the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the
S_DAX inode flag appropriately when it is seen.

When this flag is set on a directory, it acts as an "inherit flag".
That is, inodes created in the directory will automatically inherit
the on-disk inode DAX flag, enabling administrators to set up
directory heirarchies that automatically use DAX. Setting this flag
on an empty root directory will make the entire filesystem use DAX
by default.

Signed-off-by: Dave Chinner

Dave Chinner
2016-01-04 13:44:15 +0800
e7b894810 xfs: use FS_XFLAG definitions directly ... Browse Code »

Now that the ioctls have been hoisted up to the VFS level, use
the VFs definitions directly and remove the XFS specific definitions
completely. Userspace is going to have to handle the change of this
interface separately, so removing the definitions from xfs_fs.h is
not an issue here at all.

Signed-off-by: Dave Chinner

Dave Chinner
2016-01-04 13:44:15 +0800
334e580a6 fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion ... Browse Code »

Hoist the ioctl definitions for the XFS_IOC_FS[SG]SETXATTR API from
fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls
can be used by all filesystems, not just XFS. This enables
(initially) ext4 to use the ioctl to set project IDs on inodes.

Based-on-patch-from: Li Xi
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-04 13:44:15 +0800
a6d7636e8 xfs: fix recursive splice read locking with DAX ... Browse Code »

Doing a splice read (generic/249) generates a lockdep splat because
we recursively lock the inode iolock in this path:

SyS_sendfile64
do_sendfile
do_splice_direct
splice_direct_to_actor
do_splice_to
xfs_file_splice_read <<<<<< lock here
default_file_splice_read
vfs_readv
do_readv_writev
do_iter_readv_writev
xfs_file_read_iter <<<<<< then here

The issue here is that for DAX inodes we need to avoid the page
cache path and hence simply push it into the normal read path.
Unfortunately, we can't tell down at xfs_file_read_iter() whether we
are being called from the splice path and hence we cannot avoid the
locking at this layer. Hence we simply have to drop the inode
locking at the higher splice layer for DAX.

Signed-off-by: Dave Chinner
Tested-by: Ross Zwisler
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-04 13:28:25 +0800
3b0fe4780 xfs: Don't use reserved blocks for data blocks with DAX ... Browse Code »

Commit 1ca1915 ("xfs: Don't use unwritten extents for DAX") enabled
the DAX allocation call to dip into the reserve pool in case it was
converting unwritten extents rather than allocating blocks. This was
a direct copy of the unwritten extent conversion code, but had an
unintended side effect of allowing normal data block allocation to
use the reserve pool. Hence normal block allocation could deplete
the reserve pool and prevent unwritten extent conversion at ENOSPC,
hence violating fallocate guarantees on preallocated space.

Fix it by checking whether the incoming map from __xfs_get_blocks()
spans an unwritten extent and only use the reserve pool if the
allocation covers an unwritten extent.

Signed-off-by: Dave Chinner
Tested-by: Ross Zwisler
Signed-off-by: Dave Chinner

Dave Chinner
2016-01-04 13:22:45 +0800
a841b64df XFS: Use a signed return type for suffix_kstrtoint() ... Browse Code »

The return type "unsigned long" was used by the suffix_kstrtoint()
function even though it will eventually return a negative error code.
Improve this implementation detail by using the type "int" instead.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring
Reviewed-by: Eric Sandeen
Signed-off-by: Dave Chinner

Markus Elfring
2016-01-04 13:13:21 +0800
c5ab131ba libxfs: refactor short btree block verification ... Browse Code »

Create xfs_btree_sblock_verify() to verify short-format btree blocks
(i.e. the per-AG btrees with 32-bit block pointers) instead of
open-coding them.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:13:21 +0800
96f859d52 libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct ... Browse Code »

Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
inside it, gcc will quietly round the structure size up to the nearest
64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE
macro returning incorrect results for v5 filesystems on 64-bit
machines (118 items instead of 119). As a result, a 32-bit xfs_repair
will see garbage in AGFL item 119 and complain.

Therefore, tell gcc not to pad the structure so that the AGFL size
calculation is correct.

cc: # 3.10 - 4.4
Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:13:21 +0800
6d3eb1eca libxfs: use a convenience variable instead of open-coding the fork ... Browse Code »

Use a convenience variable instead of open-coding the inode fork.
This isn't really needed for now, but will become important when we
add the copy-on-write fork later.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:12:42 +0800
9b434a347 xfs: fix log ticket type printing ... Browse Code »

Update the log ticket reservation type printing code to reflect
all the types of log tickets, to avoid incorrect debug output and
avoid running off the end of the array.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:11:42 +0800
2e9101da6 libxfs: make xfs_alloc_fix_freelist non-static ... Browse Code »

Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the
static designation. xfsprogs already has this; this simply brings
the kernel up to date.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Darrick J. Wong
2016-01-04 13:10:42 +0800
211fe1a4d xfs: make xfs_buf_ioend_async() static ... Browse Code »

There are no callers of the xfs_buf_ioend_async() function outside
of the fs/xfs/xfs_buf.c. So, let's make it static.

Signed-off-by: Alexander Kuleshov
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Alexander Kuleshov
2016-01-04 13:10:42 +0800
ffc671f1e xfs: send warning of project quota to userspace via netlink ... Browse Code »

Linux's quota subsystem has an ability to handle project quota. This
commit just utilizes the ability from xfs side. dbus-monitor and
quota_nld shipped as part of quota-tools can be used for testing.
See the patch posting on the XFS list for details on testing.

Signed-off-by: Masatake YAMATO
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Masatake YAMATO
2016-01-04 13:10:42 +0800
f1f96c494 xfs: get mp from bma->ip in xfs_bmap code ... Browse Code »

In my earlier commit

c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO

I added some local mp variables with code which indicates that
mp might be NULL. Coverity doesn't like this now, because the
updated per-fs XFS_STATS macros dereference mp.

I don't think this is actually a problem; from what I can tell,
we cannot get to these functions with a null bma->tp, so my NULL
check was probably pointless. Still, it's not super obvious.

So switch this code to get mp from the inode on the xfs_bmalloca
structure, with no conditional, because the functions are already
using bmap->ip directly.

Addresses-Coverity-Id: 1339552
Addresses-Coverity-Id: 1339553
Signed-off-by: Eric Sandeen
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Eric Sandeen
2016-01-04 13:10:42 +0800
233135b76 xfs: print name of verifier if it fails ... Browse Code »

This adds a name to each buf_ops structure, so that if
a verifier fails we can print the type of verifier that
failed it. Should be a slight debugging aid, I hope.

Signed-off-by: Eric Sandeen
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Eric Sandeen
2016-01-04 13:10:19 +0800
1d4292bfd libxfs: Optimize the loop for xfs_bitmap_empty ... Browse Code »

If there is any non zero bit in a long bitmap, it can jump out of the
loop and finish the function as soon as possible.

Signed-off-by: Jia He
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Jia He
2016-01-04 13:10:19 +0800
eed6b462f xfs: refactor log record start detection into a new helper ... Browse Code »

As part of the head/tail discovery process, log recovery locates the
head block and then reverse seeks to find the start of the last active
record in the log. This is non-trivial as the record itself could have
wrapped around the end of the physical log. Log recovery torn write
detection potentially needs to walk further behind the last record in
the log, as multiple log I/Os can be in-flight at one time during a
crash event.

Therefore, refactor the reverse log record header search mechanism into
a new helper that supports the ability to seek past an arbitrary number
of log records (or until the tail is hit). Update the head/tail search
mechanism to call the new helper, but otherwise there is no change in
log recovery behavior.

Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Brian Foster
2016-01-04 12:55:10 +0800