Eric Lee / smarc-fsl-linux-kernel

07 Mar, 2015

3 commits

84399bb07 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"Outside of misc fixes, Filipe has a few fsync corners and we're
pulling in one more of Josef's fixes from production use here"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
Btrfs: fix data loss in the fast fsync path
Btrfs: remove extra run_delayed_refs in update_cowonly_root
Btrfs: incremental send, don't rename a directory too soon
btrfs: fix lost return value due to variable shadowing
Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
Btrfs: fix off-by-one logic error in btrfs_realloc_node
Btrfs: add missing inode update when punching hole
Btrfs: abort the transaction if we fail to update the free space cache inode
Btrfs: fix fsync race leading to ordered extent memory leaks

Linus Torvalds
2015-03-07 05:52:54 +0800
7c5bde7ad Merge tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux ... Browse Code »

Pull file locking fix from Jeff Layton:
"Just a single patch to fix a memory leak that Daniel Wagner discovered
while doing some testing with leases"

* tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux:
locks: fix fasync_struct memory leak in lease upgrade/downgrade handling

Linus Torvalds
2015-03-07 02:31:38 +0800
1b1bd5619 Merge tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs ... Browse Code »

Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:

- Fix a regression in the NFSv4 open state recovery code
- Fix a regression in the NFSv4 close code
- Fix regressions and side-effects of the loop-back mounted NFS fixes
in 3.18, that cause the NFS read() syscall to return EBUSY.
- Fix regressions around the readdirplus code and how it interacts
with the VFS lazy unmount changes that went into v3.18.
- Fix issues with out-of-order RPC call replies replacing updated
attributes with stale ones (particularly after a truncate()).
- Fix an underflow checking issue with RPC/RDMA credits
- Fix a number of issues with the NFSv4 delegation return/free code.
- Fix issues around stale NFSv4.1 leases when doing a mount"

* tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
NFSv4.1: Clear the old state by our client id before establishing a new lease
NFSv4: Fix a race in NFSv4.1 server trunking discovery
NFS: Don't write enable new pages while an invalidation is proceeding
NFS: Fix a regression in the read() syscall
NFSv4: Ensure we skip delegations that are already being returned
NFSv4: Pin the superblock while we're returning the delegation
NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()
NFSv4: Ensure that we don't reap a delegation that is being returned
NFS: Fix stateid used for NFS v4 closes
NFSv4: Don't call put_rpccred() under the rcu_read_lock()
NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()
NFSv3: Use the readdir fileid as the mounted-on-fileid
NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()
NFSv4: Set a barrier in the update_changeattr() helper
NFS: Fix nfs_post_op_update_inode() to set an attribute barrier
NFS: Remove size hack in nfs_inode_attrs_need_update()
NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit
NFS: Add attribute update barriers to NFS writebacks
NFS: Set an attribute barrier on all updates
NFS: Add attribute update barriers to nfs_setattr_update_inode()
...

Linus Torvalds
2015-03-07 02:09:57 +0800

06 Mar, 2015

3 commits

dd9ef135e Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. ... Browse Code »

Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.

Signed-off-by: Quentin Casasnovas
Reviewed-by: David Sterba
cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Chris Mason

Quentin Casasnovas
2015-03-06 09:28:33 +0800
3a8b36f37 Btrfs: fix data loss in the fast fsync path ... Browse Code »

When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
an ordered extent and waits for it to complete - when it completes
at btrfs_finish_ordered_io(), the inode's last_trans is set to the
value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
sets our inode's last_trans to the value N + 1 (that is
fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
we only start the ordered extent, we don't wait for it to complete
(only in a later phase) therefore its last_trans field has the
value N + 1 set previously by btrfs_file_write_iter(), and so we
have:

inode->last_trans last_trans_committed
(N + 1) (N + 1)

Which made us not log the last buffered write and exit the fsync
handler immediately, returning success (0) to user space and resulting
in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey

# Create our main test file 'foo', the one we check for data loss.
# By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
# bit from its flags (btrfs inode specific flags).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
-c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

# Now create one other file and 2 directories. We will move this second file
# from one directory to the other later because it forces btrfs to commit its
# currently open transaction if we fsync the old parent directory. This is
# necessary to trigger the data loss bug that affected btrfs.
mkdir $SCRATCH_MNT/testdir_1
touch $SCRATCH_MNT/testdir_1/bar
mkdir $SCRATCH_MNT/testdir_2

# Make sure everything is durably persisted.
sync

# Write more 8Kb of data to our file.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

# Move our 'bar' file into a new directory.
mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

# Fsync our first directory. Because it had a file moved into some other
# directory, this made btrfs commit the currently open transaction. This is
# a condition necessary to trigger the data loss bug.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

# Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
# data we wrote previously to be persisted and available if a crash happens.
# This did not happen with btrfs, because of the transaction commit that
# happened when we fsynced the parent directory.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey

_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey

# Now check that all data we wrote before are available.
echo "File content after log replay:"
od -t x1 $SCRATCH_MNT/foo

status=0
exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

1. write to file

2. fsync file

3. fsync file
|--> btrfs_inode_in_log() returns true and it set last_trans to 0

4. write to file
|--> btrfs_file_write_iter() no longers sets last_trans, so it
remained with a value of 0
5. fsync
|--> inode->last_trans == 0, so it bails out without logging the
second write

A test case for xfstests will be sent soon.

CC:
Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-03-06 09:28:32 +0800
f5c0a1228 Btrfs: remove extra run_delayed_refs in update_cowonly_root ... Browse Code »

This got added with my dirty_bgs patch, it's not needed. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2015-03-06 09:28:30 +0800

05 Mar, 2015

2 commits

0164bf023 locks: fix fasync_struct memory leak in lease upgrade/downgrade handling ... Browse Code »

Commit 8634b51f6ca2 (locks: convert lease handling to file_lock_context)
introduced a regression in the handling of lease upgrade/downgrades.

In the event that we already have a lease on a file and are going to
either upgrade or downgrade it, we skip doing any list insertion or
deletion and simply re-call lm_setup on the existing lease.

As of commit 8634b51f6ca2 however, we end up calling lm_setup on the
lease that was passed in, instead of on the existing lease. This causes
us to leak the fasync_struct that was allocated in the event that there
was not already an existing one (as it always appeared that there
wasn't one).

Fixes: 8634b51f6ca2 (locks: convert lease handling to file_lock_context)
Reported-and-Tested-by: Daniel Wagner
Signed-off-by: Jeff Layton

Jeff Layton
2015-03-05 06:34:32 +0800
8a001af4b Merge tag 'ecryptfs-4.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs ... Browse Code »

Pull eCryptfs fixes from Tyler Hicks:
"Fixes for proper ioctl handling and an untriggerable buffer overflow

- The eCryptfs ioctl handling functions should only pass known-good
ioctl commands to the lower filesystem

- A static checker found a potential buffer overflow. Upon
inspection, it is not triggerable due to input validation performed
on the mount parameters"

* tag 'ecryptfs-4.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
eCryptfs: don't pass fs-specific ioctl commands through
eCryptfs: ensure copy to crypt_stat->cipher does not overrun

Linus Torvalds
2015-03-05 06:19:48 +0800

04 Mar, 2015

5 commits

e11259f92 NFSv4.1: Clear the old state by our client id before establishing a new lease ... Browse Code »

If the call to exchange-id returns with the EXCHGID4_FLAG_CONFIRMED_R flag
set, then that means our lease was established by a previous mount instance.
Ensure that we detect this situation, and that we clear the state held by
that mount.

Reported-by: Jorge Mora
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-04 10:52:30 +0800
48d66b974 NFSv4: Fix a race in NFSv4.1 server trunking discovery ... Browse Code »

We do not want to allow a race with another NFS mount to cause
nfs41_walk_client_list() to establish a lease on our nfs_client before
we're done checking for trunking.

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-04 09:42:23 +0800
a6c5170d1 Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linux ... Browse Code »

Pull nfsd fixes from Bruce Fields:
"Three miscellaneous bugfixes, most importantly the clp->cl_revoked
bug, which we've seen several reports of people hitting"

* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
sunrpc: integer underflow in rsc_parse()
nfsd: fix clp->cl_revoked list deletion causing softlock in nfsd
svcrpc: fix memory leak in gssp_accept_sec_context_upcall

Linus Torvalds
2015-03-04 07:52:50 +0800
ef070dcb3 NFS: Don't write enable new pages while an invalidation is proceeding ... Browse Code »

nfs_vm_page_mkwrite() should wait until the page cache invalidation
is finished. This is the second patch in a 2 patch series to deprecate
the NFS client's reliance on nfs_release_page() in the context of
nfs_invalidate_mapping().

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-04 02:58:08 +0800
874f94637 NFS: Fix a regression in the read() syscall ... Browse Code »

When invalidating the page cache for a regular file, we want to first
sync all dirty data to disk and then call invalidate_inode_pages2().
The latter relies on nfs_launder_page() and nfs_release_page() to deal
respectively with dirty pages, and unstable written pages.

When commit 9590544694bec ("NFS: avoid deadlocks with loop-back mounted
NFS filesystems.") changed the behaviour of nfs_release_page(), then it
made it possible for invalidate_inode_pages2() to fail with an EBUSY.
Unfortunately, that error is then propagated back to read().

Let's therefore work around the problem for now by protecting the call
to sync the data and invalidate_inode_pages2() so that they are atomic
w.r.t. the addition of new writes.
Later on, we can revisit whether or not we still need nfs_launder_page()
and nfs_release_page().

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-04 02:02:29 +0800

03 Mar, 2015

13 commits

6d65261a0 eCryptfs: don't pass fs-specific ioctl commands through ... Browse Code »

eCryptfs can't be aware of what to expect when after passing an
arbitrary ioctl command through to the lower filesystem. The ioctl
command may trigger an action in the lower filesystem that is
incompatible with eCryptfs.

One specific example is when one attempts to use the Btrfs clone
ioctl command when the source file is in the Btrfs filesystem that
eCryptfs is mounted on top of and the destination fd is from a new file
created in the eCryptfs mount. The ioctl syscall incorrectly returns
success because the command is passed down to Btrfs which thinks that it
was able to do the clone operation. However, the result is an empty
eCryptfs file.

This patch allows the trim, {g,s}etflags, and {g,s}etversion ioctl
commands through and then copies up the inode metadata from the lower
inode to the eCryptfs inode to catch any changes made to the lower
inode's metadata. Those five ioctl commands are mostly common across all
filesystems but the whitelist may need to be further pruned in the
future.

https://bugzilla.kernel.org/show_bug.cgi?id=93691
https://launchpad.net/bugs/1305335

Signed-off-by: Tyler Hicks
Cc: Rocko
Cc: Colin Ian King
Cc: stable@vger.kernel.org # v2.6.36+: c43f7b8 eCryptfs: Handle ioctl calls with unlocked and compat functions

Tyler Hicks
2015-03-03 16:03:56 +0800
ec3ca4e57 NFSv4: Ensure we skip delegations that are already being returned ... Browse Code »

In nfs_client_return_marked_delegations() and nfs_delegation_reap_unclaimed()
we want to optimise the loop traversal by skipping delegations that are
already in the process of being returned.

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-03 07:09:15 +0800
9f0f8e12c NFSv4: Pin the superblock while we're returning the delegation ... Browse Code »

This patch ensures that the superblock doesn't go ahead and disappear
underneath us while the state manager thread is returning delegations.

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-03 07:09:14 +0800
ade04647d NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation() ... Browse Code »

Ensure that nfs_inode_set_delegation() doesn't inadvertently detach a
delegation that is already in the process of being returned.

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-03 07:09:14 +0800
b04b22f4c NFSv4: Ensure that we don't reap a delegation that is being returned ... Browse Code »

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-03 07:09:13 +0800
369d6b7f0 NFS: Fix stateid used for NFS v4 closes ... Browse Code »

After 566fcec60 the client uses the "current stateid" from the
nfs4_state structure to close a file. This could potentially contain a
delegation stateid, which is disallowed by the protocol and causes
servers to return NFS4ERR_BAD_STATEID. This patch restores the
(correct) behavior of sending the open stateid to close a file.

Reported-by: Olga Kornievskaia
Fixes: 566fcec60 (NFSv4: Fix an atomicity problem in CLOSE)
Signed-off-by: Anna Schumaker
Signed-off-by: Trond Myklebust

Anna Schumaker
2015-03-03 07:06:42 +0800
84471e242 Btrfs: incremental send, don't rename a directory too soon ... Browse Code »

There's one more case where we can't issue a rename operation for a
directory as soon as we process it. We used to delay directory renames
only if they have some ancestor directory with a higher inode number
that got renamed too, but there's another case where we need to delay
the rename too - when a directory A is renamed to the old name of a
directory B but that directory B has its rename delayed because it
has now (in the send root) an ancestor with a higher inode number that
was renamed. If we don't delay the directory rename in this case, the
receiving end of the send stream will attempt to rename A to the old
name of B before B got renamed to its new name, which results in a
"directory not empty" error. So fix this by delaying directory renames
for this case too.

Steps to reproduce:

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt

$ mkdir /mnt/a
$ mkdir /mnt/b
$ mkdir /mnt/c
$ touch /mnt/a/file

$ btrfs subvolume snapshot -r /mnt /mnt/snap1

$ mv /mnt/c /mnt/x
$ mv /mnt/a /mnt/x/y
$ mv /mnt/b /mnt/a

$ btrfs subvolume snapshot -r /mnt /mnt/snap2

$ btrfs send /mnt/snap1 -f /tmp/1.send
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
$ btrfs receive /mnt2 -f /tmp/1.send
$ btrfs receive /mnt2 -f /tmp/2.send
ERROR: rename b -> a failed. Directory not empty

A test case for xfstests follows soon.

Reported-by: Ames Cornish
Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-03-03 06:04:45 +0800
1932b7be9 btrfs: fix lost return value due to variable shadowing ... Browse Code »

A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.

CC: # 3.6+
Fixes: d187663ef24c ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

David Sterba
2015-03-03 06:04:45 +0800
5cdf83edb Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr ... Browse Code »

The return value from btrfs_lookup_xattr() can be a pointer encoding an
error, therefore deal with it. This fixes commit 5f5bc6b1e2d5
("Btrfs: make xattr replace operations atomic").

Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-03-03 06:04:45 +0800
5dfe2be7e Btrfs: fix off-by-one logic error in btrfs_realloc_node ... Browse Code »

The end_slot variable actually matches the number of pointers in the
node and not the last slot (which is 'nritems - 1'). Therefore in order
to check that the current slot in the for loop doesn't match the last
one, the correct logic is to check if 'i' is less than 'end_slot - 1'
and not 'end_slot - 2'.

Fix this and set end_slot to be 'nritems - 1', as it's less confusing
since the variable name implies it's inclusive rather then exclusive.

Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-03-03 06:04:45 +0800
e8c1c76e8 Btrfs: add missing inode update when punching hole ... Browse Code »

When punching a file hole if we endup only zeroing parts of a page,
because the start offset isn't a multiple of the sector size or the
start offset and length fall within the same page, we were not updating
the inode item. This prevented an fsync from doing anything, if no other
file changes happened in the current transaction, because the fields
in btrfs_inode used to check if the inode needs to be fsync'ed weren't
updated.

This issue is easy to reproduce and the following excerpt from the
xfstest case I made shows how to trigger it:

_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey

# Create our test file.
$XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \
$SCRATCH_MNT/foo | _filter_xfs_io

# Fsync the file, this makes btrfs update some btrfs inode specific fields
# that are used to track if the inode needs to be written/updated to the fsync
# log or not. After this fsync, the new values for those fields indicate that
# a subsequent fsync does not need to touch the fsync log.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

# Force a commit of the current transaction. After this point, any operation
# that modifies the data or metadata of our file, should update those fields in
# the btrfs inode with values that make the next fsync operation write to the
# fsync log.
sync

# Punch a hole in our file. This small range affects only 1 page.
# This made the btrfs hole punching implementation write only some zeroes in
# one page, but it did not update the btrfs inode fields used to determine if
# the next fsync needs to write to the fsync log.
$XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo

# Another variation of the previously mentioned case.
$XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo

# Now fsync the file. This was a no-operation because the previous hole punch
# operation didn't update the inode's fields mentioned before, so they remained
# with the values they had after the first fsync - that is, they indicate that
# it is not needed to write to fsync log.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

echo "File content before:"
od -t x1 $SCRATCH_MNT/foo

# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey

# Enable writes and mount the fs. This makes the fsync log replay code run.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey

# Because the last fsync didn't do anything, here the file content matched what
# it was after the first fsync, before the holes were punched, and not what it
# was after the holes were punched.
echo "File content after:"
od -t x1 $SCRATCH_MNT/foo

This issue has been around since 2012, when the punch hole implementation
was added, commit 2aaa66558172 ("Btrfs: add hole punching").

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana
Reviewed-by: Liu Bo
Signed-off-by: Chris Mason

Filipe Manana
2015-03-03 06:04:44 +0800
0c0ef4bc8 Btrfs: abort the transaction if we fail to update the free space cache inode ... Browse Code »

Our gluster boxes were hitting a problem where they'd run out of space when
updating the block group cache and therefore wouldn't be able to update the free
space inode. This is a problem because this is how we invalidate the cache and
protect ourselves from errors further down the stack, so if this fails we have
to abort the transaction so we make sure we don't end up with stale free space
cache. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2015-03-03 06:04:44 +0800
4d884fcea Btrfs: fix fsync race leading to ordered extent memory leaks ... Browse Code »

We can have multiple fsync operations against the same file during the
same transaction and they can collect the same ordered extents while they
don't complete (still accessible from the inode's ordered tree). If this
happens, those ordered extents will never get their reference counts
decremented to 0, leading to memory leaks and inode leaks (an iput for an
ordered extent's inode is scheduled only when the ordered extent's refcount
drops to 0). The following sequence diagram explains this race:

CPU 1 CPU 2

btrfs_sync_file()

btrfs_sync_file()

mutex_lock(inode->i_mutex)
btrfs_log_inode()
btrfs_get_logged_extents()
--> collects ordered extent X
--> increments ordered
extent X's refcount
btrfs_submit_logged_extents()
mutex_unlock(inode->i_mutex)

mutex_lock(inode->i_mutex)
btrfs_sync_log()
btrfs_wait_logged_extents()
--> list_del_init(&ordered->log_list)
btrfs_log_inode()
btrfs_get_logged_extents()
--> Adds ordered extent X
to logged_list because
at this point:
list_empty(&ordered->log_list)
&& test_bit(BTRFS_ORDERED_LOGGED,
&ordered->flags) == 0
--> Increments ordered extent
X's refcount
--> check if ordered extent's io is
finished or not, start it if
necessary and wait for it to finish
--> sets bit BTRFS_ORDERED_LOGGED
on ordered extent X's flags
and adds it to trans->ordered
btrfs_sync_log() finishes

btrfs_submit_logged_extents()
btrfs_log_inode() finishes
mutex_unlock(inode->i_mutex)

btrfs_sync_file() finishes

btrfs_sync_log()
btrfs_wait_logged_extents()
--> Sees ordered extent X has the
bit BTRFS_ORDERED_LOGGED set in
its flags
--> X's refcount is untouched
btrfs_sync_log() finishes

btrfs_sync_file() finishes

btrfs_commit_transaction()
--> called by transaction kthread for e.g.
btrfs_wait_pending_ordered()
--> waits for ordered extent X to
complete
--> decrements ordered extent X's
refcount by 1 only, corresponding
to the increment done by the fsync
task ran by CPU 1

In the scenario of the above diagram, after the transaction commit,
the ordered extent will remain with a refcount of 1 forever, leaking
the ordered extent structure and preventing the i_count of its inode
from ever decreasing to 0, since the delayed iput is scheduled only
when the ordered extent's refcount drops to 0, preventing the inode
from ever being evicted by the VFS.

Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
mean that an ordered extent is already being processed by an fsync call,
which will attach it to the current transaction, preventing it from being
collected by subsequent fsync operations against the same inode.

This race was introduced with the following change (added in 3.19 and
backported to stable 3.18 and 3.17):

Btrfs: make sure logged extents complete in the current transaction V3
commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f

I ran into this issue while running xfstests/generic/113 in a loop, which
failed about 1 out of 10 runs with the following warning in dmesg:

[ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
[ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
ppy e1000 scsi_mod [last unloaded: btrfs]
[ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1
[ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2612.457709] 0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
[ 2612.459829] 0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
[ 2612.461564] ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
[ 2612.463163] Call Trace:
[ 2612.463719] [] dump_stack+0x4c/0x65
[ 2612.464789] [] warn_slowpath_common+0xa1/0xbb
[ 2612.466026] [] ? free_fs_root+0x36/0x133 [btrfs]
[ 2612.467247] [] warn_slowpath_null+0x1a/0x1c
[ 2612.468416] [] free_fs_root+0x36/0x133 [btrfs]
[ 2612.469625] [] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
[ 2612.471251] [] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
[ 2612.472536] [] ? wait_for_completion+0x24/0x26
[ 2612.473742] [] close_ctree+0x1f3/0x33c [btrfs]
[ 2612.475477] [] ? destroy_workqueue+0x148/0x1ba
[ 2612.476695] [] btrfs_put_super+0x19/0x1b [btrfs]
[ 2612.477911] [] generic_shutdown_super+0x73/0xef
[ 2612.479106] [] kill_anon_super+0x13/0x1e
[ 2612.480226] [] btrfs_kill_super+0x17/0x23 [btrfs]
[ 2612.481471] [] deactivate_locked_super+0x3b/0x50
[ 2612.482686] [] deactivate_super+0x3f/0x43
[ 2612.483791] [] cleanup_mnt+0x59/0x78
[ 2612.484842] [] __cleanup_mnt+0x12/0x14
[ 2612.485900] [] task_work_run+0x8f/0xbc
[ 2612.486960] [] do_notify_resume+0x5a/0x6b
[ 2612.488083] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 2612.489333] [] int_signal+0x12/0x17
[ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
[ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds. Have a nice day...

Kmemleak confirmed the ordered extent leak (and btrfs inode specific
structures such as delayed nodes):

$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff880154290db0 (size 576):
comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
hex dump (first 32 bytes):
01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff .@.........N....
00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff ..........)T....
backtrace:
[] kmemleak_update_trace+0x4c/0x6a
[] radix_tree_node_alloc+0x6d/0x83
[] __radix_tree_create+0x109/0x190
[] radix_tree_insert+0x30/0xac
[] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
[] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
[] __btrfs_unlink_inode+0xee/0x288 [btrfs]
[] btrfs_unlink_inode+0x1e/0x40 [btrfs]
[] btrfs_unlink+0x60/0x9b [btrfs]
[] vfs_unlink+0x9c/0xed
[] do_unlinkat+0x12c/0x1fa
[] SyS_unlinkat+0x29/0x2b
[] system_call_fastpath+0x12/0x17
[] 0xffffffffffffffff
unreferenced object 0xffff88014ef11db0 (size 576):
comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
hex dump (first 32 bytes):
02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff ...........N....
backtrace:
[] kmemleak_update_trace+0x4c/0x6a
[] radix_tree_node_alloc+0x6d/0x83
[] __radix_tree_create+0x109/0x190
[] radix_tree_insert+0x30/0xac
[] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
[] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
[] __btrfs_unlink_inode+0xee/0x288 [btrfs]
[] btrfs_unlink_inode+0x1e/0x40 [btrfs]
[] btrfs_unlink+0x60/0x9b [btrfs]
[] vfs_unlink+0x9c/0xed
[] do_unlinkat+0x12c/0x1fa
[] SyS_unlinkat+0x29/0x2b
[] system_call_fastpath+0x12/0x17
[] 0xffffffffffffffff
unreferenced object 0xffff8800336feda8 (size 584):
comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
hex dump (first 32 bytes):
00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00 .@>........B....
00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 ................
backtrace:
[] create_object+0x172/0x29a
[] kmemleak_alloc+0x25/0x41
[] kmemleak_alloc_recursive.constprop.52+0x16/0x18
[] kmem_cache_alloc+0xf7/0x198
[] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
[] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
[] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
[] do_blockdev_direct_IO+0x62a/0xb47
[] __blockdev_direct_IO+0x34/0x36
[] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
[] generic_file_direct_write+0xb8/0x12d
[] btrfs_file_write_iter+0x24b/0x42f [btrfs]
[] aio_run_iocb+0x2b7/0x32e
[] do_io_submit+0x26e/0x2ff
[] SyS_io_submit+0x10/0x12
[] system_call_fastpath+0x12/0x17

CC: # 3.19, 3.18 and 3.17
Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason

Filipe Manana
2015-03-03 06:04:44 +0800

02 Mar, 2015

13 commits

7c0af9ffb NFSv4: Don't call put_rpccred() under the rcu_read_lock() ... Browse Code »

put_rpccred() can sleep.

Fixes: 8f649c3762547 ("NFSv4: Fix the locking in nfs_inode_reclaim_delegation()")
Cc: stable@vger.kernel.org # 2.6.35+
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-02 12:23:07 +0800
fa9233699 NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache() ... Browse Code »

If the server does not return a valid set of attributes that we can
use to either create a file or refresh the inode, then there is no
value in calling nfs_prime_dcache().

However if we're just refreshing the inode using the attributes that
the server returned, then it shouldn't matter whether or not we have
a filehandle, as long as we check the fsid+fileid combination.

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-02 12:23:07 +0800
1ae04b252 NFSv3: Use the readdir fileid as the mounted-on-fileid ... Browse Code »

When we call readdirplus, set the fileid normally returned by readdir
as the mounted-on-fileid, since that is commonly the case if there is
a mountpoint. To ensure that we get it right, we only set the flag if
the readdir fileid differs from the one returned in the readdirplus
attributes.

This again means that we can avoid the issues described in commit
2ef47eb1aee17 ("NFS: Fix use of nfs_attr_use_mounted_on_fileid()"),
which only fixed NFSv4.

Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-02 12:23:07 +0800
6c441c254 NFS: Don't invalidate a submounted dentry in nfs_prime_dcache() ... Browse Code »

If we're traversing a directory which contains a submounted filesystem,
or one that has a referral, the NFS server that is processing the READDIR
request will often return information for the underlying (mounted-on)
directory. It may, or may not, also return filehandle information.

If this happens, and the lookup in nfs_prime_dcache() returns the
dentry for the submounted directory, the filehandle comparison will
fail, and we call d_invalidate(). Post-commit 8ed936b5671bf
("vfs: Lazily remove mounts on unlinked files and directories."), this
means the entire subtree is unmounted.

The following minimal patch addresses this problem by punting on
the invalidation if there is a submount.

Kudos to Neil Brown for having tracked down this
issue (see link).

Reported-by: Nix
Link: http://lkml.kernel.org/r/87iofju9ht.fsf@spindle.srvr.nix
Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: Trond Myklebust

Trond Myklebust
2015-03-02 12:23:06 +0800
3235b4030 NFSv4: Set a barrier in the update_changeattr() helper ... Browse Code »

Ensure that we don't regress the changes that were made to the
directory.

Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:23:06 +0800
92d64e47b NFS: Fix nfs_post_op_update_inode() to set an attribute barrier ... Browse Code »

nfs_post_op_update_inode() is called after a self-induced attribute
update. Ensure that it also sets the barrier.

Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:23:06 +0800
00fb4c9f8 NFS: Remove size hack in nfs_inode_attrs_need_update() ... Browse Code »

Prior to this patch, we used to always OK attribute updates that extended
the file size on the assumption that we might be performing writeback.
Now that we have attribute barriers to protect the writeback related updates,
we should remove this hack, as it can cause truncate() operations to
apparently be reverted if/when a readahead or getattr RPC call races
with our on-the-wire SETATTR.

Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:23:06 +0800
8f8ba1d73 NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit ... Browse Code »

Ensure that other operations that race with delegreturn and layoutcommit
cannot revert the attribute updates that were made on the server.

Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:23:06 +0800
a08a8cd37 NFS: Add attribute update barriers to NFS writebacks ... Browse Code »

Ensure that other operations that race with our write RPC calls
cannot revert the file size updates that were made on the server.

Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:23:06 +0800
f50620034 NFS: Set an attribute barrier on all updates ... Browse Code »

Ensure that we update the attribute barrier even if there were no
invalidations, provided that this value is newer than the old one.

Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:23:06 +0800
f044636d9 NFS: Add attribute update barriers to nfs_setattr_update_inode() ... Browse Code »

Ensure that other operations which raced with our setattr RPC call
cannot revert the file attribute changes that were made on the server.
To do so, we artificially bump the attribute generation counter on
the inode so that all calls to nfs_fattr_init() that precede ours
will be dropped.

The motivation for the patch came from Chuck Lever's reports of readaheads
racing with truncate operations and causing the file size to be reverted.

Reported-by: Chuck Lever
Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:23:05 +0800
140e049c6 NFS: Add a helper to set attribute barriers ... Browse Code »

Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:23:05 +0800
aa5accea4 NFS: Ensure that buffered writes wait for O_DIRECT writes to complete ... Browse Code »

The O_DIRECT code will grab the inode->i_mutex and flush out buffered
writes, before scheduling a read or a write. However there is no
equivalent in the buffered write code to wait for O_DIRECT to complete.

Fixes a reported issue in xfstests generic/133, when first performing an
O_DIRECT write followed by a buffered write.

Signed-off-by: Trond Myklebust
Tested-by: Chuck Lever

Trond Myklebust
2015-03-02 12:22:40 +0800

01 Mar, 2015

1 commit

2aaeb784b Merge tag 'xfs-for-linus-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs ... Browse Code »

Pull xfs fixes from Dave Chinner:
"These are fixes for regressions/bugs introduced in the 4.0 merge cycle
and problems discovered during the merge window that need to be pushed
back to stable kernels ASAP.

This contains:
- ensure quota type is reset in on-disk dquots
- fix missing partial EOF block data flush on truncate extension
- fix transaction leak in error handling for new pnfs block layout
support
- add missing target_ip check to RENAME_EXCHANGE"

* tag 'xfs-for-linus-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: cancel failed transaction in xfs_fs_commit_blocks()
xfs: Ensure we have target_ip for RENAME_EXCHANGE
xfs: ensure truncate forces zeroed blocks to disk
xfs: Fix quota type in quota structures when reusing quota file

Linus Torvalds
2015-03-01 02:06:33 +0800