Doug / smarc-fsl-linux-kernel | Embedian Git Server

15 Mar, 2011

2 commits

5f40d4209 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6

* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: NFSROOT should default to "proto=udp"
nfs4: remove duplicated #include
NFSv4: nfs4_state_mark_reclaim_nograce() should be static
NFSv4: Fix the setlk error handler
NFSv4.1: Fix the handling of the SEQUENCE status bits
NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses
NFSv4.1 reclaim complete must wait for completion
NFSv4: remove duplicate clientid in struct nfs_client
NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY
sunrpc: Propagate errors from xs_bind() through xs_create_sock()
(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid
nfs: fix compilation warning
nfs: add kmalloc return value check in decode_and_add_ds
SUNRPC: Remove resource leak in svc_rdma_send_error()
nfs: close NFSv4 COMMIT vs. CLOSE race
SUNRPC: Close a race in __rpc_wait_for_completion_task()

Linus Torvalds
2011-03-15 02:19:50 +0800
1eafbfeb7 Fix corrupted OSF partition table parsing

The kernel automatically evaluates partition tables of storage devices.
The code for evaluating OSF partitions contains a bug that leaks data
from kernel heap memory to userspace for certain corrupted OSF
partitions.

In more detail:

for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) {

iterates from 0 to d_npartitions - 1, where d_npartitions is read from
the partition table without validation and partition is a pointer to an
array of at most 8 d_partitions.

Add the proper and obvious validation.

Signed-off-by: Timo Warns
Cc: stable@kernel.org
[ Changed the patch trivially to not repeat the whole le16_to_cpu()
thing, and to use an explicit constant for the magic value '8' ]
Signed-off-by: Linus Torvalds

Timo Warns
2011-03-15 01:14:28 +0800
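
The validation described in this commit can be sketched in user-space C as follows. This is a minimal illustration, not the kernel patch: `struct osf_label`, `MAX_OSF_PARTITIONS` and `count_osf_partitions()` are hypothetical names, and the `le16_to_cpu()` endianness conversion is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the label has room for at most 8 entries. */
#define MAX_OSF_PARTITIONS 8

/* Simplified stand-in for the on-disk label; not the kernel's struct. */
struct osf_label {
    uint16_t d_npartitions;               /* untrusted, read from disk */
    uint32_t p_size[MAX_OSF_PARTITIONS];  /* one size per partition slot */
};

/*
 * Count the non-empty partitions, or return -1 if the label claims
 * more entries than the array can hold.
 */
int count_osf_partitions(const struct osf_label *label)
{
    assert(label != NULL);

    unsigned int npartitions = label->d_npartitions;
    int found = 0;

    /* Validate the count once, before it is used as a loop bound. */
    if (npartitions > MAX_OSF_PARTITIONS)
        return -1;

    for (unsigned int i = 0; i < npartitions; i++)
        if (label->p_size[i] != 0)
            found++;
    return found;
}
```

Reading the count into a local variable once and comparing it against a named constant mirrors the two adjustments noted in the bracketed remark of the commit message.
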

14 Mar, 2011

2 commits

c44ed965b compat breakage in preadv() and pwritev()

Fix for a dumb preadv()/pwritev() compat bug - unlike the native
variants, the compat_... ones forget to check FMODE_P{READ,WRITE}, so
e.g. on pipe the native preadv() will fail with -ESPIPE and compat one
will act as readv() and succeed.

Not critical, but it's a clear bug with trivial fix, so IMO it's OK for
-final.

Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Al Viro
2011-03-14 07:29:07 +0800
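
The native behaviour described in this commit can be probed from user space with a small helper. This is a sketch assuming a Linux system with glibc; `preadv_on_pipe_errno()` is a hypothetical name. On a kernel that performs the FMODE_PREAD check, the helper is expected to report ESPIPE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Call preadv() on the read end of a pipe and return the resulting errno,
 * 0 if the call unexpectedly succeeds, or -1 if the setup itself fails.
 */
int preadv_on_pipe_errno(void)
{
    int fds[2];
    char buf[8];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    int err = 0;

    if (pipe(fds) != 0)
        return -1;

    /* Put data in the pipe so that a plain readv() would have succeeded. */
    if (write(fds[1], "abcdefg", 7) != 7) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* A positioned read makes no sense on a pipe, which is not seekable. */
    if (preadv(fds[0], &iov, 1, 0) < 0)
        err = errno;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```
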
0e5b88cd9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable

* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: break out of shrink_delalloc earlier
btrfs: fix not enough reserved space
btrfs: fix dip leak
Btrfs: make sure not to return overlapping extents to fiemap
Btrfs: deal with short returns from copy_from_user
Btrfs: fix regressions in copy_from_user handling

Linus Torvalds
2011-03-14 07:00:49 +0800

12 Mar, 2011

7 commits

36e39c40b Btrfs: break out of shrink_delalloc earlier

Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.

But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations. The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made. This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.

The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.

This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down. This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.

If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.

Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full. The
other writers are able to continue until we get 100%.

This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.

Signed-off-by: Chris Mason

Chris Mason
2011-03-12 20:08:42 +0800
53d473758 NFS: NFSROOT should default to "proto=udp"

There have been a number of recent reports that NFSROOT is no longer
working with default mount options, but fails only with certain NICs.

Brian Downing bisected to commit 56463e50 "NFS:
Use super.c for NFSROOT mount option parsing". Among other things,
this commit changes the default mount options for NFSROOT to use TCP
instead of UDP as the underlying transport.

TCP seems less able to deal with NICs that are slow to initialize.
The system logs that have accompanied reports of problems all show
that NFSROOT attempts to establish a TCP connection before the NIC is
fully initialized, and thus the TCP connection attempt fails.

When a TCP connection attempt fails during a mount operation, the
NFS stack needs to fail the operation. Usually user space knows how
and when to retry it. The network layer does not report a distinct
error code for this particular failure mode. Thus, there isn't a
clean way for the RPC client to see that it needs to retry in this
case, but not in others.

Because NFSROOT is used in some environments where it is not possible
to update the kernel command line to specify "udp", the proper thing
to do is change NFSROOT to use UDP by default, as it did before commit
56463e50.

To make it easier to see how to change default mount options for
NFSROOT and to distinguish default settings from mandatory settings,
I've adjusted a couple of areas to document the specifics.

root_nfs_cat() is also modified to deal with commas properly when
concatenating strings containing mount option lists. This keeps
root_nfs_cat() call sites simpler, now that we may be concatenating
multiple mount option strings.

Tested-by: Brian Downing
Tested-by: Mark Brown
Signed-off-by: Chuck Lever
Cc: # 2.6.37
Signed-off-by: Trond Myklebust

Chuck Lever
2011-03-12 04:38:07 +0800
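
The comma handling described for root_nfs_cat() can be illustrated with a user-space helper. This is a sketch of the idea only, not the kernel function: `mount_opts_cat()` and the option strings in the usage note are hypothetical.

```c
#include <assert.h>
#include <string.h>

/*
 * Append src to the comma-separated option list in dest, whose buffer
 * holds destlen bytes including the terminating NUL. A comma is added
 * only when dest is non-empty, does not already end in one, and src is
 * non-empty. Returns 0 on success, -1 if the result would not fit.
 */
int mount_opts_cat(char *dest, const char *src, size_t destlen)
{
    assert(dest != NULL && src != NULL);

    size_t len = strlen(dest);
    size_t srclen = strlen(src);
    int need_comma = (len > 0 && dest[len - 1] != ',' && srclen > 0);

    if (len + need_comma + srclen + 1 > destlen)
        return -1;              /* leave dest untouched on overflow */

    if (need_comma)
        dest[len++] = ',';
    memcpy(dest + len, src, srclen + 1);  /* copies the NUL as well */
    return 0;
}
```

Starting from an empty buffer, appending "udp" and then "rsize=4096" yields "udp,rsize=4096", so call sites can pass option strings without managing separators themselves.
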
57df216bd nfs4: remove duplicated #include

Remove duplicated #include('s) in
fs/nfs/nfs4proc.c

Signed-off-by: Huang Weiyi
Signed-off-by: Trond Myklebust

Huang Weiyi
2011-03-12 04:18:37 +0800
f9feab1e1 NFSv4: nfs4_state_mark_reclaim_nograce() should be static

There are no more external users of nfs4_state_mark_reclaim_nograce() or
nfs4_state_mark_reclaim_reboot(), so mark them as static.

Signed-off-by: Trond Myklebust

Trond Myklebust
2011-03-12 04:18:36 +0800
ecac799a5 NFSv4: Fix the setlk error handler

Signed-off-by: Trond Myklebust

Trond Myklebust
2011-03-12 04:18:36 +0800
b4410c2f7 NFSv4.1: Fix the handling of the SEQUENCE status bits

We want SEQUENCE status bits to be handled by the state manager in order
to avoid threading issues.

Signed-off-by: Trond Myklebust

Trond Myklebust
2011-03-12 04:18:35 +0800
0400a6b0c NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses

nfs4_schedule_state_recovery() should only be used when we need to force
the state manager to check the lease. If we just want to start the
state manager in order to handle a state recovery situation, we should be
using nfs4_schedule_state_manager().

This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing
its use with a set of helper functions that do the right thing.

Signed-off-by: Trond Myklebust

Trond Myklebust
2011-03-12 04:18:22 +0800

11 Mar, 2011

10 commits

c34c32ea9 NFSv4.1 reclaim complete must wait for completion

Signed-off-by: Andy Adamson
[Trond: fix whitespace errors]
Signed-off-by: Trond Myklebust

Andy Adamson
2011-03-11 04:05:01 +0800
114f64b5f NFSv4: remove duplicate clientid in struct nfs_client

Signed-off-by: Andy Adamson
Signed-off-by: Trond Myklebust

Andy Adamson
2011-03-11 04:05:00 +0800
7d6d63d64 NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY

Fix a bug where we currently retry the EXCHANGEID call again, even though
we already have a valid clientid. Instead, delay and retry the CREATE_SESSION
call.

Signed-off-by: Ricardo Labiaga
Signed-off-by: Trond Myklebust

Ricardo Labiaga
2011-03-11 04:04:59 +0800
3fa0b4e20 (try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid

The problem was use of an int32, which when converted to a uint64
is sign extended resulting in a fileid that doesn't fit in 32 bits
even though the intent of the function is to fit the fileid into
32 bits.

Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[Trond: Added an include for compat.h]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

Frank Filz
2011-03-11 04:04:58 +0800
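
The sign-extension problem described in this commit can be reproduced in a few lines of user-space C. This is a sketch under assumptions: the function names are hypothetical, the XOR fold of the two 32-bit halves is an illustrative way to squeeze the fileid into 32 bits, and the narrowing conversion to `int32_t` relies on the usual two's-complement behaviour of common compilers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Problematic variant: the folded value is held in a signed 32-bit
 * temporary, so widening it back to 64 bits sign-extends bit 31.
 */
uint64_t fold_fileid_signed(uint64_t fileid)
{
    int32_t ino = (int32_t)fileid;       /* low 32 bits, reinterpreted as signed */
    ino ^= (int32_t)(fileid >> 32);      /* fold in the high 32 bits */
    return (uint64_t)(int64_t)ino;       /* negative values fill the upper bits */
}

/*
 * Corrected variant: an unsigned 32-bit temporary zero-extends, so the
 * result always fits in 32 bits.
 */
uint64_t fold_fileid_unsigned(uint64_t fileid)
{
    uint32_t ino = (uint32_t)fileid;
    ino ^= (uint32_t)(fileid >> 32);
    return ino;
}
```

With a fileid of 0x80000000 the signed variant returns a value whose upper 32 bits are all set, while the unsigned variant returns 0x80000000 unchanged.
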
43b7c3f05 nfs: fix compilation warning

This commit fixes the following compilation warning:
linux-2.6/fs/nfs/nfs4proc.c:3265: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Jovi Zhang
Signed-off-by: Trond Myklebust

Jovi Zhang
2011-03-11 04:04:56 +0800
b9f810570 nfs: add kmalloc return value check in decode_and_add_ds

add kmalloc return value check in decode_and_add_ds

Signed-off-by: Stanislav Fomichev
Signed-off-by: Trond Myklebust

Stanislav Fomichev
2011-03-11 04:04:55 +0800
d2224e7af nfs: close NFSv4 COMMIT vs. CLOSE race

I've been adding in more artificial delays in the NFSv4 commit and close
codepaths to uncover races. The kernel I'm testing has the patch to
close the race in __rpc_wait_for_completion_task that's in Trond's
cthon2011 branch. The reproducer I've been using does this in a loop:

mkdir("DIR");
fd = open("DIR/FILE", O_WRONLY|O_CREAT|O_EXCL, 0644);
write(fd, "abcdefg", 7);
close(fd);
unlink("DIR/FILE");
rmdir("DIR");

The above reproducer shouldn't result in any silly-renaming. However,
when I add a "msleep(100)" just after the nfs_commit_clear_lock call in
nfs_commit_release, I can almost always force one to occur. If I can
force it to occur with that, then it can happen without that delay
given the right timing.

nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called
with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait
for the task to complete before putting its reference to it, so the last
reference gets put in rpc_release task and gets queued to a workqueue.

In this situation, the last open context reference may be put by the
COMMIT release instead of the close() syscall. The close() syscall
returns too quickly and the unlink runs while the d_count is still
high since the COMMIT release hasn't put its dentry reference yet.

Fix this by having nfs_commit_rpcsetup wait for the RPC call to complete
before putting the task reference when FLUSH_SYNC is set. With this, the
last reference is put by the process that's initiating the FLUSH_SYNC
commit and the race is closed.

Signed-off-by: Jeff Layton
Signed-off-by: Trond Myklebust

Jeff Layton
2011-03-11 04:04:53 +0800
bf294b41c SUNRPC: Close a race in __rpc_wait_for_completion_task()

Although they run as rpciod background tasks, under normal operation
(i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
and nfs4_do_close() want to be fully synchronous. This means that when we
exit, we want all references to the rpc_task to be gone, and we want
any dentry references etc. held by that task to be released.

For this reason these functions call __rpc_wait_for_completion_task(),
followed by rpc_put_task() in the expectation that the latter will be
releasing the last reference to the rpc_task, and thus ensuring that the
callback_ops->rpc_release() has been called synchronously.

This patch fixes a race which exists due to the fact that
rpciod calls rpc_complete_task() (in order to wake up the callers of
__rpc_wait_for_completion_task()) and then subsequently calls
rpc_put_task() without ensuring that these two steps are done atomically.

In order to avoid adding new spin locks, the patch uses the existing
waitqueue spin lock to order the rpc_task reference count releases between
the waiting process and rpciod.
The common case where nobody is waiting for completion is optimised for by
checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
reference count is 1: in those cases we drop trying to grab the spin lock,
and immediately free up the rpc_task.

Those few processes that need to put the rpc_task from inside an
asynchronous context and that do not care about ordering are given a new
helper: rpc_put_task_async().

Signed-off-by: Trond Myklebust

Trond Myklebust
2011-03-11 04:04:52 +0800
7e6b6465e btrfs: fix not enough reserved space

btrfs_link() will insert 3 items (inode ref, dir name item and dir index item)
into the b+ tree and update 2 items (its inode and the parent's inode) in the b+
tree. So we should reserve space for these 5 items, not 3 items.

Reported-by: Tsutomu Itoh
Signed-off-by: Miao Xie
Signed-off-by: Chris Mason

Miao Xie
2011-03-11 00:21:49 +0800
b4966b777 btrfs: fix dip leak

The btrfs DIO code leaks dip structs when dip->csums allocation
fails; bio->bi_end_io isn't set at the point where the free_ordered
branch is consequently taken, thus bio_endio doesn't call the function
which would free it in the normal case. Fix.

Signed-off-by: Daniel J Blueman
Acked-by: Miao Xie
Signed-off-by: Chris Mason

Daniel J Blueman
2011-03-11 00:21:49 +0800

10 Mar, 2011

12 commits

d891eedbc fs/dcache: allow d_obtain_alias() to return unhashed dentries

Without this patch, inodes are not promptly freed on last close of an
unlinked file by an nfs client:

client$ mount -tnfs4 server:/export/ /mnt/
client$ tail -f /mnt/FOO
...
server$ df -i /export
server$ rm /export/FOO
(^C the tail -f)
server$ df -i /export
server$ echo 2 >/proc/sys/vm/drop_caches
server$ df -i /export

the df's will show that the inode is not freed on the filesystem until
the last step, when it could have been freed after killing the client's
tail -f. On-disk data won't be deallocated either, leading to possible
spurious ENOSPC.

This occurs because when the client does the close, it arrives in a
compound with a putfh and a close, processed like:

- putfh: look up the filehandle. The only alias found for the
  inode will be the DCACHE_UNHASHED alias referenced by the filp,
  which the lookup does not return, so it creates a new
  DCACHE_DISCONNECTED dentry and returns that instead.
- close: closes the existing filp, which is destroyed
  immediately by dput() since it's DCACHE_UNHASHED.
- end of the compound: release the reference
  to the current filehandle, and dput() the new
  DCACHE_DISCONNECTED dentry, which gets put on the
  unused list instead of being destroyed immediately.

Nick Piggin suggested fixing this by allowing d_obtain_alias to return
the unhashed dentry that is referenced by the filp, instead of making it
create a new dentry.

Leave __d_find_alias() alone to avoid changing behavior of other
callers.

Also nfsd doesn't need all the checks of __d_find_alias(); any dentry,
hashed or unhashed, disconnected or not, should work.

Signed-off-by: J. Bruce Fields
Signed-off-by: Al Viro

J. Bruce Fields
2011-03-10 18:18:54 +0800
1ca551c6c Check for immutable/append flag in fallocate path

In the fallocate path the kernel doesn't check for the immutable/append
flag. This allows a race: an application opens a file read/write and
starts working on it; meanwhile root sets the immutable flag on the
file, yet the application can still call fallocate successfully. In
addition, on an append-only file we allow only the reserve operation,
not any unreserve operation.

Signed-off-by: Marco Stornelli
Signed-off-by: Al Viro

Marco Stornelli
2011-03-10 17:22:15 +0800
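
The policy described in this commit can be written down as a small permission check. This is a sketch under assumptions: the flag names and values are hypothetical stand-ins rather than the kernel's definitions, and returning -EPERM is assumed here because the commit message does not name the error code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define SKETCH_INODE_IMMUTABLE   0x1u
#define SKETCH_INODE_APPEND      0x2u
#define SKETCH_FALLOC_UNRESERVE  0x1u  /* stands in for a space-releasing mode */

/*
 * Decide whether an fallocate-style request is allowed.
 * Returns 0 if permitted, -EPERM (assumed) otherwise.
 */
int sketch_fallocate_permitted(unsigned int inode_flags, unsigned int mode)
{
    /* An immutable file accepts no allocation change at all. */
    if (inode_flags & SKETCH_INODE_IMMUTABLE)
        return -EPERM;

    /* An append-only file may reserve space but never give it back. */
    if ((inode_flags & SKETCH_INODE_APPEND) && (mode & SKETCH_FALLOC_UNRESERVE))
        return -EPERM;

    return 0;
}
```

Performing the check at call time, rather than relying on the checks made when the file was opened, is what closes the window in which the flag is set after open.
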
9177ada99 fat: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro

Al Viro
2011-03-10 16:45:49 +0800
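
The pattern behind the "fix d_revalidate oopsen on NFS exports" commits can be sketched as a guarded dereference. This is an illustration under assumptions, not the patch itself: it assumes the oops comes from `nd` being NULL when the dentry is revalidated without a path walk in progress, and the struct, the flag name and value, and the -ECHILD return are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag value for this sketch. */
#define SKETCH_LOOKUP_RCU 0x40u

/* Minimal stand-in for the lookup state handed to ->d_revalidate(). */
struct sketch_nameidata {
    unsigned int flags;
};

/*
 * Returns -ECHILD to ask for a retry outside RCU-walk mode, or 1 to
 * report the dentry as still valid.
 */
int sketch_d_revalidate(const struct sketch_nameidata *nd)
{
    /* nd may be NULL, so test the pointer before looking at its flags. */
    if (nd && (nd->flags & SKETCH_LOOKUP_RCU))
        return -ECHILD;

    return 1;
}
```
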
8ce84eeb5 jfs: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro

Al Viro
2011-03-10 16:45:28 +0800
4714e6373 ocfs2: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro

Al Viro
2011-03-10 16:45:07 +0800
53fe92416 gfs2: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro

Al Viro
2011-03-10 16:44:48 +0800
529c5f958 fuse: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro

Al Viro
2011-03-10 16:44:31 +0800
0eb980e31 ceph: fix d_revalidate oopsen on NFS exports

can't blindly check nd->flags in ->d_revalidate()

Signed-off-by: Al Viro

Al Viro
2011-03-10 16:44:05 +0800
c78f4cc5e reiserfs xattr ->d_revalidate() shouldn't care about RCU

... it returns an error unconditionally

Signed-off-by: Al Viro

Al Viro
2011-03-10 16:42:01 +0800
ae50adcb0 /proc/self is never going to be invalidated...

Signed-off-by: Al Viro

Al Viro
2011-03-10 16:41:53 +0800
397949170 Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linux

* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux:
nfsd: wrong index used in inner loop
nfsd4: fix bad pointer on failure to find delegation
NFSD: fix decode_cb_sequence4resok

Linus Torvalds
2011-03-10 06:52:09 +0800
78833dd70 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
nd->inode is not set on the second attempt in path_walk()
unfuck proc_sysctl ->d_compare()
minimal fix for do_filp_open() race

Linus Torvalds
2011-03-10 05:55:51 +0800

09 Mar, 2011

3 commits

b306419ae nd->inode is not set on the second attempt in path_walk()

We leave it at whatever it had been pointing to after the
first link_path_walk() had failed with -ESTALE. Things
do not work well after that...

Signed-off-by: Al Viro

Al Viro
2011-03-09 10:16:28 +0800
3ec07aa95 nfsd: wrong index used in inner loop

Index i was already used in the outer loop

Cc: stable@kernel.org
Signed-off-by: Roel Kluin
Signed-off-by: J. Bruce Fields

roel
2011-03-09 08:46:10 +0800
ea8efc74b Btrfs: make sure not to return overlapping extents to fiemap

The btrfs fiemap code was incorrectly returning duplicate or overlapping
extents in some cases. cp was blindly trusting this result and we would
end up with a destination file that was bigger than the original because
some bytes were copied twice.

The fix here adjusts our offsets to make sure we're always moving
forward in the fiemap results.

Signed-off-by: Chris Mason

Chris Mason
2011-03-09 00:58:09 +0800
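
The "always moving forward" rule from this commit can be illustrated on a plain array of extents. This is a sketch of the invariant only, not the btrfs change: `struct sketch_extent` and `make_extents_monotonic()` are hypothetical, and the input is assumed to be sorted by start offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: a byte range within a file. */
struct sketch_extent {
    uint64_t start;
    uint64_t len;
};

/*
 * Rewrite ext[] in place so that every extent begins at or after the end
 * of the previous one. Exact duplicates are dropped and partial overlaps
 * are trimmed at the front. Returns the new number of extents.
 */
size_t make_extents_monotonic(struct sketch_extent *ext, size_t n)
{
    assert(ext != NULL || n == 0);

    uint64_t next = 0;   /* first byte offset not yet reported */
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t start = ext[i].start;
        uint64_t end = start + ext[i].len;

        if (end <= next)
            continue;            /* already fully reported */
        if (start < next)
            start = next;        /* skip the part already reported */

        ext[out].start = start;
        ext[out].len = end - start;
        out++;
        next = end;
    }
    return out;
}
```

A consumer that copies each reported range, as cp did, then copies every byte at most once, so the destination cannot grow larger than the source.
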

08 Mar, 2011

3 commits

dfef6dcd3 unfuck proc_sysctl ->d_compare()

a) struct inode is not going to be freed under ->d_compare();
however, the thing PROC_I(inode)->sysctl points to just might.
Fortunately, it's enough to make freeing that sucker delayed,
provided that we don't step on its ->unregistering, clear
the pointer to it in PROC_I(inode) before dropping the reference
and check if it's NULL in ->d_compare().

b) I'm not sure that we *can* walk into NULL inode here (we recheck
dentry->seq between verifying that it's still hashed / fetching
dentry->d_inode and passing it to ->d_compare() and there's no
negative hashed dentries in /proc/sys/*), but if we can walk into
that, we really should not have ->d_compare() return 0 on it!
Said that, I really suspect that this check can be simply killed.
Nick?

Signed-off-by: Al Viro

Al Viro
2011-03-08 15:22:27 +0800
32b007b4e nfsd4: fix bad pointer on failure to find delegation

In case of a nonempty list, the return on error here is obviously bogus;
it ends up being a pointer to the list head instead of to any valid
delegation on the list.

In particular, if nfsd4_delegreturn() hits this case, and you're quite unlucky,
then renew_client may oops, and it may take an embarrassingly long time to
figure out why. Facepalm.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
IP: [] nfsd4_delegreturn+0x125/0x200
...

Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields

J. Bruce Fields
2011-03-08 00:44:53 +0800
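
The class of bug described in this commit, a list search whose cursor is returned after the loop has run off the end, can be shown with a minimal circular list. This is a sketch with hypothetical types and names; it is not the nfsd code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular singly linked list with a sentinel head. */
struct list_node {
    struct list_node *next;
};

/* Hypothetical record embedding a list link. */
struct sketch_delegation {
    int id;
    struct list_node link;
};

/* Recover the containing record from a pointer to its embedded link. */
#define node_to_delegation(n) \
    ((struct sketch_delegation *)((char *)(n) - offsetof(struct sketch_delegation, link)))

/* Return the delegation with the given id, or NULL if none matches. */
struct sketch_delegation *find_delegation(struct list_node *head, int id)
{
    assert(head != NULL);

    for (struct list_node *n = head->next; n != head; n = n->next) {
        struct sketch_delegation *dp = node_to_delegation(n);
        if (dp->id == id)
            return dp;
    }

    /*
     * No match. Returning a record computed from the cursor here would
     * yield a pointer derived from the list head, which is not a real
     * delegation, so report the failure explicitly instead.
     */
    return NULL;
}
```

Returning from inside the loop on a match and returning NULL after it keeps the "not found" case distinguishable, whether the list is empty or not.
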
31339acd0 Btrfs: deal with short returns from copy_from_user

When copy_from_user is only able to copy some of the bytes we requested,
we may end up creating a partially up to date page. To avoid garbage in
the page, we need to treat a partial copy as a zero length copy.

This makes the rest of the file_write code drop the page and
retry the whole copy instead of marking the partially up to
date page as dirty.

Signed-off-by: Chris Mason
cc: stable@kernel.org

Chris Mason
2011-03-08 00:10:24 +0800

07 Mar, 2011

1 commit

b1bf862e9 Btrfs: fix regressions in copy_from_user handling

Commit 914ee295af418e936ec20a08c1663eaabe4cd07a fixed deadlocks in
btrfs_file_write where we would catch page faults on pages we had
locked.

But, there were a few problems:

1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
data when the amount to copy is more than 4K and the offset to start
copying from is not page aligned. The result was btrfs_file_write
looping forever retrying the iov_iter_copy_from_user_atomic call.

We deal with this by changing btrfs_file_write to drop down to single
page copies when iov_iter_copy_from_user_atomic starts returning failure.

2) The btrfs_file_write code was leaking delalloc reservations when
iov_iter_copy_from_user_atomic returned zero. The looping above would
result in the entire filesystem running out of delalloc reservations and
constantly trying to flush things to disk.

3) btrfs_file_write will lock down page cache pages, make sure
any writeback is finished, do the copy_from_user and then release them.
Before the loop runs we check the first and last pages in the write to
see if they are only being partially modified. If the start or end of
the write isn't aligned, we make sure the corresponding pages are
up to date so that we don't introduce garbage into the file.

With the copy_from_user changes, we're allowing the VM to reclaim the
pages after a partial update from copy_from_user, but we're not
making sure the page cache page is up to date when we loop around to
resume the write.

We deal with this by pushing the up to date checks down into the page
prep code. This fits better with how the rest of file_write works.

Signed-off-by: Chris Mason
Reported-by: Mitch Harder
cc: stable@kernel.org

Chris Mason
2011-03-07 23:42:27 +0800