Eric Lee / smarc-fsl-linux-kernel

29 Jan, 2011

3 commits

d1205f87b NFS: NFSv4 readdir loses entries

On recent 2.6.38-rc kernels, connectathon basic test 6 fails on
NFSv4 mounts of OpenSolaris with something like:

> ./test6: readdir
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.12' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.82' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.164' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) Test failed with 3 errors
> basic tests failed
> Tests failed, leaving /mnt/klimt mounted
> [cel@matisse cthon04]$

I narrowed the problem down to nfs4_decode_dirent() reporting that the
decode buffer had overflowed while decoding the entries for those
missing files.

verify_attr_len() assumes both its pointer arguments reside on the
same page. When these arguments point to locations on two different
pages, verify_attr_len() can report false errors. This can happen now
that a large NFSv4 readdir result can span pages.
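
The false error described above can be illustrated with a small userspace
sketch. It is not the kernel's verify_attr_len(); sketch_verify_len() is a
hypothetical stand-in that models the two pointers as plain integer addresses
and compares their distance with the expected attribute length:

```c
#include <stdint.h>

/*
 * Hypothetical model of a pointer-difference length check.
 * "saved" and "cur" are addresses before and after decoding an
 * attribute of attrlen bytes. The check only makes sense when both
 * addresses lie in one contiguous buffer.
 */
static int sketch_verify_len(uintptr_t saved, uintptr_t cur, uint32_t attrlen)
{
    uintptr_t nwords = (cur - saved) / 4;               /* words implied by address math */
    uintptr_t attrwords = ((uintptr_t)attrlen + 3) / 4; /* XDR pads lengths to 4 bytes */

    return nwords == attrwords ? 0 : -1;
}
```

With both addresses on one page (0x1000 and 0x1010, 16 bytes decoded) the
check passes. If the same 16 bytes straddle two unrelated pages (say 0x1ff8
and 0x5008), the address difference no longer matches and the check reports
an error even though the data decoded correctly.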

We have reasonably good checking in nfs4_decode_dirent() anyway, so
it should be safe to simply remove the extra checking.

At a guess, this was introduced by commit 6650239a, "NFS: Don't use
vm_map_ram() in readdir".

Cc: stable@kernel.org [2.6.37]
Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-29 02:41:35 +0800
c08e76d0c NFS: Micro-optimize nfs4_decode_dirent()

Make the decoding of NFSv4 directory entries slightly more efficient
by:

1. Avoiding unnecessary byte swapping when checking XDR booleans,
and

2. Not bumping "p" when its value will be immediately replaced by
xdr_inline_decode()

This commit makes nfs4_decode_dirent() consistent with similar logic
in the other two decode_dirent() functions.

Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-29 02:37:35 +0800
e00b8a240 NFS: Fix an NFS client lockdep issue

There is no reason to be freeing the delegation cred in the rcu callback,
and doing so is resulting in a lockdep complaint that rpc_credcache_lock
is being called from both softirq and non-softirq contexts.

Reported-by: Chuck Lever
Signed-off-by: Trond Myklebust
Cc: stable@kernel.org

Trond Myklebust
2011-01-29 02:37:09 +0800

26 Jan, 2011

9 commits

c7a360b05 NFS construct consistent co_ownerid for v4.1

As stated in section 2.4 of RFC 5661, subsequent instances of the client need
to present the same co_ownerid. Concatenate the client's dotted IP address,
host name, and the rpc_auth pseudoflavor to form the co_ownerid.
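
A minimal sketch of the idea, assuming a "/"-separated layout that is chosen
here only for illustration (the commit message does not give the exact
format, and sketch_build_co_ownerid() is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build an owner ID from inputs that stay the same across client
 * restarts: IP address, host name, and auth pseudoflavor.
 * Returns the string length, or -1 if the buffer is too small.
 */
static int sketch_build_co_ownerid(char *buf, size_t len, const char *ip,
                                   const char *host, unsigned int flavor)
{
    int n;

    assert(buf != NULL && ip != NULL && host != NULL);
    n = snprintf(buf, len, "%s/%s/%u", ip, host, flavor);
    if (n < 0 || (size_t)n >= len)
        return -1; /* output error or truncated: do not use a partial ID */
    return n;
}
```

Because every input is stable, a later instance of the same client produces
the same string, which is the property the RFC asks for.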

Signed-off-by: Andy Adamson
Signed-off-by: Trond Myklebust

Andy Adamson
2011-01-26 11:49:14 +0800
27dc1cd3a NFS: nfs_wcc_update_inode() should set nfsi->attr_gencount

If the call to nfs_wcc_update_inode() results in an attribute update, we
need to ensure that the inode's attr_gencount gets bumped too, otherwise
we are not protected against races with other GETATTR calls.

Signed-off-by: Trond Myklebust

Trond Myklebust
2011-01-26 04:28:21 +0800
b2a2897dc NFS improve pnfs_put_deviceid_cache debug print

What we really want to know is the ref count.

Signed-off-by: Andy Adamson
Signed-off-by: Trond Myklebust

Andy Adamson
2011-01-26 04:26:51 +0800
2c4cdf8f6 NFS fix cb_sequence error processing

Always assign the cb_process_state nfs_client pointer so a processing error
in cb_sequence after the nfs_client is found and referenced returns
a non-NULL cb_process_state nfs_client and the matching nfs_put_client in
nfs4_callback_compound dereferences the client.

Signed-off-by: Andy Adamson
Signed-off-by: Trond Myklebust

Andy Adamson
2011-01-26 04:26:51 +0800
778be232a NFS do not find client in NFSv4 pg_authenticate

The information required to find the nfs_client corresponding to the incoming
back channel request is contained in the NFS layer. Perform minimal checking
in the RPC layer pg_authenticate method, and push more detailed checking into
the NFS layer where the nfs_client can be found.

Signed-off-by: Andy Adamson
Signed-off-by: Trond Myklebust

Andy Adamson
2011-01-26 04:26:51 +0800
f61f6da0d NFS: Prevent memory allocation failure in nfsacl_encode()

nfsacl_encode() allocates memory in certain cases. This of course
is not guaranteed to work.

Since commit 9f06c719 "SUNRPC: New xdr_streams XDR encoder API", the
kernel's XDR encoders can't return a result indicating a possible
failure, so a memory allocation failure in nfsacl_encode() has become
fatal (i.e., the XDR code Oopses) in some cases.

However, the allocated memory is a tiny fixed amount, on the order
of 40-50 bytes. We can easily use a stack-allocated buffer for
this, with only a wee bit of nose-holding.

Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-26 04:24:47 +0800
ee5dc7732 NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!"

Milan Broz reports:

> on today Linus' tree I get OOps if using nfs.
>
> server (2.6.36) exports dir:
> /dir 172.16.1.0/24(rw,async,all_squash,no_subtree_check,anonuid=500,anongid=500)
>
> on client it is mounted in fstab
> server:/dir /mnt/tst nfs rw,soft 0 0
>
> and these commands OOpses it (simplified from a configure script):
>
> cd /dir
> touch x
> install x y
>
> [ 105.327701] ------------[ cut here ]------------
> [ 105.327979] kernel BUG at fs/nfs/nfs3xdr.c:1338!
> [ 105.328075] invalid opcode: 0000 [#1] PREEMPT SMP
> [ 105.328223] last sysfs file: /sys/devices/virtual/bdi/0:16/uevent
> [ 105.328349] Modules linked in: usbcore dm_mod
> [ 105.328553]
> [ 105.328678] Pid: 3710, comm: install Not tainted 2.6.37+ #423 440BX Desktop Reference Platform/VMware Virtual Platform
> [ 105.328853] EIP: 0060:[] EFLAGS: 00010282 CPU: 0
> [ 105.329152] EIP is at nfs3_xdr_enc_setacl3args+0x61/0x98
> [ 105.329249] EAX: ffffffea EBX: ce941d98 ECX: 00000000 EDX: 00000004
> [ 105.329340] ESI: ce941cd0 EDI: 000000a4 EBP: ce941cc0 ESP: ce941cb4
> [ 105.329431] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [ 105.329525] Process install (pid: 3710, ti=ce940000 task=ced36f20 task.ti=ce940000)
> [ 105.336600] Stack:
> [ 105.336693] ce941cd0 ce9dc000 00000000 ce941cf8 c12ecd02 c12f43e0 c116c00b cf754158
> [ 105.336982] ce9dc004 cf754284 ce9dc004 cf7ffee8 ceff9978 ce9dc000 cf7ffee8 ce9dc000
> [ 105.337182] ce9dc000 ce941d14 c12e698d cf75412c ce941d98 cf7ffee8 cf7fff20 00000000
> [ 105.337405] Call Trace:
> [ 105.337695] [] rpcauth_wrap_req+0x75/0x7f
> [ 105.337806] [] ? xdr_encode_opaque+0x12/0x15
> [ 105.337898] [] ? nfs3_xdr_enc_setacl3args+0x0/0x98
> [ 105.337988] [] call_transmit+0x17e/0x1e8
> [ 105.338072] [] __rpc_execute+0x6d/0x1a6
> [ 105.338155] [] rpc_execute+0x34/0x37
> [ 105.338235] [] rpc_run_task+0xb5/0xbd
> [ 105.338316] [] rpc_call_sync+0x3d/0x58
> [ 105.338402] [] nfs3_proc_setacls+0x18e/0x24f
> [ 105.338493] [] ? __kmalloc+0x148/0x1c4
> [ 105.338579] [] ? posix_acl_alloc+0x12/0x22
> [ 105.338665] [] nfs3_proc_setacl+0xa0/0xca
> [ 105.338748] [] nfs3_setxattr+0x62/0x88
> [ 105.338834] [] ? sub_preempt_count+0x7c/0x89
> [ 105.338926] [] ? nfs3_setxattr+0x0/0x88
> [ 105.339026] [] __vfs_setxattr_noperm+0x26/0x95
> [ 105.339114] [] vfs_setxattr+0x5b/0x76
> [ 105.339211] [] setxattr+0x9d/0xc3
> [ 105.339298] [] ? handle_pte_fault+0x258/0x5cb
> [ 105.339428] [] ? __free_pages+0x1a/0x23
> [ 105.339517] [] ? up_read+0x16/0x2c
> [ 105.339599] [] ? fget+0x0/0xa3
> [ 105.339677] [] ? fget+0x0/0xa3
> [ 105.339760] [] ? get_parent_ip+0xb/0x31
> [ 105.339843] [] ? sub_preempt_count+0x7c/0x89
> [ 105.339931] [] sys_fsetxattr+0x51/0x79
> [ 105.340014] [] sysenter_do_call+0x12/0x32
> [ 105.340133] Code: 2e 76 18 00 58 31 d2 8b 7f 28 f6 43 04 01 74 03 8b 53 08 6a 00 8b 46 04 6a 01 8b 0b 52 89 fa e8 85 10 f8 ff 83 c4 0c 85 c0 79 04 0b eb fe 31 c9 f6 43 04 04 74 03 8b 4b 0c 68 00 10 00 00 8d
> [ 105.350321] EIP: [] nfs3_xdr_enc_setacl3args+0x61/0x98 SS:ESP 0068:ce941cb4
> [ 105.364385] ---[ end trace 01fcfe7f0f7f6e4a ]---

nfs3_xdr_enc_setacl3args() is not properly setting up the target
buffer before nfsacl_encode() attempts to encode the ACL.

Introduced by commit d9c407b1 "NFS: Introduce new-style XDR encoding
functions for NFSv3."

Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-26 04:24:47 +0800
839f7ad69 NFS: Fix "kernel BUG at fs/aio.c:554!"

Nick Piggin reports:

> I'm getting use after frees in aio code in NFS
>
> [ 2703.396766] Call Trace:
> [ 2703.396858] [] ? native_sched_clock+0x27/0x80
> [ 2703.396959] [] ? put_lock_stats+0xe/0x40
> [ 2703.397058] [] ? lock_release_holdtime+0xa8/0x140
> [ 2703.397159] [] lock_acquire+0x95/0x1b0
> [ 2703.397260] [] ? aio_put_req+0x2b/0x60
> [ 2703.397361] [] ? get_parent_ip+0x11/0x50
> [ 2703.397464] [] _raw_spin_lock_irq+0x41/0x80
> [ 2703.397564] [] ? aio_put_req+0x2b/0x60
> [ 2703.397662] [] aio_put_req+0x2b/0x60
> [ 2703.397761] [] do_io_submit+0x2be/0x7c0
> [ 2703.397895] [] sys_io_submit+0xb/0x10
> [ 2703.397995] [] system_call_fastpath+0x16/0x1b
>
> Adding some tracing, it is due to nfs completing the request then
> returning something other than -EIOCBQUEUED, so aio.c
> also completes the request.

To address this, prevent the NFS direct I/O engine from completing
async iocbs when the forward path returns an error without starting
any I/O.
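
The completion rule can be modelled in a few lines of userspace C. All names
below are hypothetical, and SKETCH_EIOCBQUEUED is only a stand-in for the
kernel-internal -EIOCBQUEUED return code; the point is that a request must be
completed exactly once on every path:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EIOCBQUEUED 529 /* stand-in for the kernel-internal code */

struct sketch_iocb {
    int completions; /* how many times the request has been completed */
};

static void sketch_complete(struct sketch_iocb *iocb)
{
    iocb->completions++;
}

/* Filesystem side: complete the iocb only if I/O was actually started. */
static long sketch_fs_direct_io(struct sketch_iocb *iocb, long setup_error)
{
    if (setup_error)
        return setup_error;  /* nothing started: leave completion to the caller */
    sketch_complete(iocb);   /* models the async completion after the I/O */
    return -SKETCH_EIOCBQUEUED;
}

/* Caller side: completes the request unless the filesystem queued it. */
static long sketch_submit(struct sketch_iocb *iocb, long setup_error)
{
    long ret;

    assert(iocb != NULL);
    ret = sketch_fs_direct_io(iocb, setup_error);
    if (ret != -SKETCH_EIOCBQUEUED)
        sketch_complete(iocb);
    return ret;
}
```

If the filesystem side also completed the iocb on the error path, the caller
would complete it a second time, which matches the use-after-free pattern in
the report.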

This fix appears to survive ^C during both "xfstest no. 208" and "fsx
-Z."

It's likely this bug has existed for a very long while, as we are seeing
very similar symptoms in OEL 5. Copying stable.

Cc: Stable
Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-26 04:24:47 +0800
ad3d2eedf NFS4: Avoid potential NULL pointer dereference in decode_and_add_ds().

On Mon, 17 Jan 2011, Mi Jinlong wrote:

>
>
> Jesper Juhl:
> > strrchr() can return NULL if nothing is found. If this happens we'll
> > dereference a NULL pointer in
> > fs/nfs/nfs4filelayoutdev.c::decode_and_add_ds().
> >
> > I tried to find some other code that guarantees that this can never
> > happen but I was unsuccessful. So, unless someone else can point to some
> > code that ensures this can never be a problem, I believe this patch is
> > needed.
> >
> > While I was changing this code I also noticed that all the dprintk()
> > statements, except one, start with "%s:". The one missing the ":" I added
> > it to.
>
> Maybe another one also should be changed at decode_and_add_ds() at line 243:
>
> 243 printk("%s Decoded address and port %s\n", __func__, buf);
>
Missed that one. Thanks.
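
The kind of check being added can be sketched as follows. This is a
simplified, hypothetical parser (sketch_split_addr_port() and the "addr.port"
input layout are assumptions for illustration), not the code in
decode_and_add_ds():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "addr.port" at the last '.' in place.
 * Returns 0 on success, or -1 if there is no '.' at all, instead of
 * dereferencing the NULL that strrchr() returns in that case.
 */
static int sketch_split_addr_port(char *buf, long *port)
{
    char *dot;

    assert(buf != NULL && port != NULL);
    dot = strrchr(buf, '.');
    if (dot == NULL)
        return -1;
    *dot = '\0';                       /* terminate the address part */
    *port = strtol(dot + 1, NULL, 10); /* parse what follows the dot */
    return 0;
}
```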

Signed-off-by: Jesper Juhl
Signed-off-by: Trond Myklebust

Jesper Juhl
2011-01-26 04:24:46 +0800

20 Jan, 2011

1 commit

0da2a4ac3 NFS: fix handling of malloc failure during nfs_flush_multi()

Cleanup of the allocated list entries should not call
put_nfs_open_context() on each entry, as the context will
always be NULL, causing an oops.

Signed-off-by: Fred Isaman
Signed-off-by: Trond Myklebust

Fred Isaman
2011-01-20 04:37:49 +0800

16 Jan, 2011

3 commits

ea5b778a8 Unexport do_add_mount() and add in follow_automount(), not ->d_automount()

Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself. follow_automount() will then
do the addition.

This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer. The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.

To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2. One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.

d_automount() should also add the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used. The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.

This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.

Signed-off-by: David Howells
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:48 +0800
36d43a437 NFS: Use d_automount() rather than abusing follow_link()

Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells
Acked-by: Trond Myklebust
Acked-by: Ian Kent
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:34 +0800
cc53ce53c Add a dentry op to allow processes to be held during pathwalk transit

Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).

The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.

The ->d_manage() dentry operation:

int (*d_manage)(struct path *path, bool mounting_here);

takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.

It should return 0 on success, to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.

->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
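
The return-value contract described above might look like this in a
userspace model. The policy shown (let a daemon work on the directory itself,
never wait when called from mount code) is an assumption made for
illustration, and struct sketch_path is a hypothetical stand-in for
struct path:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct path plus caller identity. */
struct sketch_path {
    bool caller_is_daemon;
};

/*
 * Returns 0 to let the caller continue its transit, or -EISDIR to make
 * the caller stop at this directory and use it as-is.
 */
static int sketch_d_manage(struct sketch_path *path, bool mounting_here)
{
    assert(path != NULL);
    if (path->caller_is_daemon)
        return -EISDIR; /* daemon maintains this directory itself */
    if (mounting_here)
        return 0;       /* called from mount code: must not wait for mounts */
    /* A real filesystem could sleep here until the directory is ready. */
    return 0;
}
```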

Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.

follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).

A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.

__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.

Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every transit from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.

==========================
WHAT THIS MEANS FOR AUTOFS
==========================

Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.

autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.

The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:

mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[] :autofs4:autofs4_wait+0x674/0x897
[] avc_has_perm+0x46/0x58
[] autoremove_wake_function+0x0/0x2e
[] :autofs4:autofs4_expire_wait+0x41/0x6b
[] :autofs4:autofs4_revalidate+0x91/0x149
[] __lookup_hash+0xa0/0x12f
[] lookup_create+0x46/0x80
[] sys_mkdirat+0x56/0xe4

versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:

automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[] __mutex_lock_slowpath+0x60/0x9b
[] do_path_lookup+0x2ca/0x2f1
[] .text.lock.mutex+0xf/0x14
[] do_rmdir+0x77/0xde
[] tracesys+0x71/0xe0
[] tracesys+0xd5/0xe0

which means that the system is deadlocked.

This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automounter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().

Signed-off-by: David Howells
Was-Acked-by: Ian Kent
Signed-off-by: Al Viro

David Howells
2011-01-16 09:07:31 +0800

14 Jan, 2011

3 commits

db9effe99 Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin

* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
fs: fix do_last error case when need_reval_dot
nfs: add missing rcu-walk check
fs: hlist UP debug fixup
fs: fix dropping of rcu-walk from force_reval_path
fs: force_reval_path drop rcu-walk before d_invalidate
fs: small rcu-walk documentation fixes

Fixed up trivial conflicts in Documentation/filesystems/porting

Linus Torvalds
2011-01-14 12:14:13 +0800
657e94b67 nfs: add missing rcu-walk check

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-14 10:48:39 +0800
8a0eebf66 NFS: Fix NFSv3 exclusive open semantics

Commit c0204fd2b8fe047b18b67e07e1bf2a03691240cd (NFS: Clean up
nfs4_proc_create()) broke NFSv3 exclusive open by removing the code
that passes the O_EXCL flag down to nfs3_proc_create(). This patch
reverts that offending hunk from the original commit.
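
Why the flag has to reach the protocol layer can be shown with a small
sketch. NFSv3 CREATE takes a create mode (UNCHECKED, GUARDED or EXCLUSIVE);
the mapping function below is hypothetical and simplified, and is not the
code that the patch restores:

```c
#include <fcntl.h>

/* Simplified subset of the NFSv3 create modes. */
enum sketch_createmode {
    SKETCH_UNCHECKED,
    SKETCH_EXCLUSIVE
};

/*
 * Pick the create mode from the open flags. If the caller drops O_EXCL
 * before this point, an exclusive create silently becomes UNCHECKED.
 */
static enum sketch_createmode sketch_pick_createmode(int open_flags)
{
    return (open_flags & O_EXCL) ? SKETCH_EXCLUSIVE : SKETCH_UNCHECKED;
}
```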

Reported-by: Nick Bowler
Signed-off-by: Trond Myklebust
Cc: stable@kernel.org [2.6.37]
Tested-by: Nick Bowler
Signed-off-by: Linus Torvalds

Trond Myklebust
2011-01-14 04:06:29 +0800

13 Jan, 2011

1 commit

8b244ff2f switch nfs to ->s_d_op

Signed-off-by: Al Viro

Al Viro
2011-01-13 09:02:45 +0800

12 Jan, 2011

2 commits

b9d919a4a Merge branch 'nfs-for-2.6.38' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6

* 'nfs-for-2.6.38' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (89 commits)
NFS fix the setting of exchange id flag
NFS: Don't use vm_map_ram() in readdir
NFSv4: Ensure continued open and lockowner name uniqueness
NFS: Move cl_delegations to the nfs_server struct
NFS: Introduce nfs_detach_delegations()
NFS: Move cl_state_owners and related fields to the nfs_server struct
NFS: Allow walking nfs_client.cl_superblocks list outside client.c
pnfs: layout roc code
pnfs: update nfs4_callback_recallany to handle layouts
pnfs: add CB_LAYOUTRECALL handling
pnfs: CB_LAYOUTRECALL xdr code
pnfs: change lo refcounting to atomic_t
pnfs: check that partial LAYOUTGET return is ignored
pnfs: add layout to client list before sending rpc
pnfs: serialize LAYOUTGET(openstateid)
pnfs: layoutget rpc code cleanup
pnfs: change how lsegs are removed from layout list
pnfs: change layout state seqlock to a spinlock
pnfs: add prefix to struct pnfs_layout_hdr fields
pnfs: add prefix to struct pnfs_layout_segment fields
...

Linus Torvalds
2011-01-12 07:11:56 +0800
357f54d6b NFS fix the setting of exchange id flag

Indicate support for referrals. Do not set any PNFS roles. Check the flags
returned by the server for validity. Do not use exchange flags from an old
client ID instance when recovering a client ID.

Update the EXCHID4_FLAG_XXX set to RFC 5661.

Signed-off-by: Andy Adamson
Signed-off-by: Trond Myklebust

Andy Adamson
2011-01-12 03:17:09 +0800

11 Jan, 2011

2 commits

68c404b18 Merge branch 'bugfixes' into nfs-for-2.6.38

Conflicts:
fs/nfs/nfs2xdr.c
fs/nfs/nfs3xdr.c
fs/nfs/nfs4xdr.c

Trond Myklebust
2011-01-11 03:48:02 +0800
6650239a4 NFS: Don't use vm_map_ram() in readdir

vm_map_ram() is not available on NOMMU platforms, and causes trouble
on incoherent architectures such as ARM when we access the page data
through both the direct and the virtual mapping.

The alternative is to use the direct mapping to access page data
for the case when we are not crossing a page boundary, but to copy
the data into a linear scratch buffer when we are accessing data
that spans page boundaries.

Signed-off-by: Trond Myklebust
Tested-by: Marc Kleine-Budde
Cc: stable@kernel.org [2.6.37]

Trond Myklebust
2011-01-11 03:45:01 +0800

07 Jan, 2011

16 commits

873feea09 fs: dcache per-inode inode alias locking

dcache_inode_lock can be replaced with per-inode locking. Use existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:31 +0800
b74c79e99 fs: provide rcu-walk aware permission i_ops

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:29 +0800
34286d666 fs: rcu-walk aware d_revalidate method

Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.
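
The "simple push down" amounts to an early return at the top of each
implementation. A userspace sketch, where SKETCH_LOOKUP_RCU is a hypothetical
flag value standing in for LOOKUP_RCU and the function takes the flags
directly rather than a nameidata pointer:

```c
#include <errno.h>

#define SKETCH_LOOKUP_RCU 0x40u /* hypothetical value, for illustration only */

/*
 * Returns -ECHILD when called in rcu-walk mode, so the caller can drop
 * out of rcu-walk and retry; otherwise reports the dentry as valid (1).
 */
static int sketch_d_revalidate(unsigned int nd_flags)
{
    if (nd_flags & SKETCH_LOOKUP_RCU)
        return -ECHILD;
    return 1;
}
```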

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:29 +0800
fb045adb9 fs: dcache reduce branches in lookup path

Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
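
A minimal sketch of the technique, assuming hypothetical flag names and a
cut-down ops table (the real dentry and dentry_operations structures have
many more members):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_OP_HASH    0x1u /* hypothetical flag: d_hash is set */
#define SKETCH_OP_COMPARE 0x2u /* hypothetical flag: d_compare is set */

struct sketch_dentry_ops {
    int (*d_hash)(const char *name);
    int (*d_compare)(const char *a, const char *b);
};

struct sketch_dentry {
    unsigned int d_flags;
    const struct sketch_dentry_ops *d_op;
};

/* Example operation used to populate an ops table. */
static int sketch_hash(const char *name)
{
    return name[0];
}

/* Record in d_flags which operations exist, once, when the ops are set. */
static void sketch_set_d_op(struct sketch_dentry *d,
                            const struct sketch_dentry_ops *op)
{
    assert(d != NULL);
    d->d_op = op;
    d->d_flags &= ~(SKETCH_OP_HASH | SKETCH_OP_COMPARE);
    if (op == NULL)
        return;
    if (op->d_hash)
        d->d_flags |= SKETCH_OP_HASH;
    if (op->d_compare)
        d->d_flags |= SKETCH_OP_COMPARE;
}

/* Hot path: one flag test, no load of d_op and no second pointer check. */
static int sketch_needs_hash(const struct sketch_dentry *d)
{
    return (d->d_flags & SKETCH_OP_HASH) != 0;
}
```

Setting the ops through one helper is what keeps the flags and the pointer in
sync, which is presumably why the patch converts direct d_op assignments.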

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:28 +0800
fa0d7e3de fs: icache RCU free inodes

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.

The downside of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (i.e. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:26 +0800
b5c84bf6f fs: dcache remove dcache_lock

dcache_lock no longer protects anything. Remove it.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:23 +0800
949854d02 fs: Use rename lock and RCU for multi-step operations

The remaining uses of dcache_lock are to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree,
and to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.

This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations, retrying in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against. Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
when dropping locks, so we also need to check and retry for that.
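
The read-side retry pattern can be sketched with C11 atomics. This is a
simplified userspace model with hypothetical names, not the kernel's seqlock
implementation, and it ignores the finer memory-ordering details a real one
needs:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_seqlock {
    atomic_uint seq; /* odd while a writer (e.g. a rename) is in progress */
};

static unsigned int sketch_read_begin(struct sketch_seqlock *sl)
{
    unsigned int s;

    do {
        s = atomic_load(&sl->seq);
    } while (s & 1); /* wait for any active writer to finish */
    return s;
}

/* Nonzero if a writer ran since sketch_read_begin() returned "start". */
static int sketch_read_retry(struct sketch_seqlock *sl, unsigned int start)
{
    return atomic_load(&sl->seq) != start;
}

static void sketch_write_begin(struct sketch_seqlock *sl)
{
    atomic_fetch_add(&sl->seq, 1);
}

static void sketch_write_end(struct sketch_seqlock *sl)
{
    atomic_fetch_add(&sl->seq, 1);
}

/* Multi-step read: repeat the whole read if a writer interfered. */
static int sketch_read_parent(struct sketch_seqlock *sl, const int *parent)
{
    unsigned int s;
    int value;

    assert(sl != NULL && parent != NULL);
    do {
        s = sketch_read_begin(sl);
        value = *parent;
    } while (sketch_read_retry(sl, s));
    return value;
}
```

Readers take no lock and never block writers; they only pay for a retry when
a rename actually happened during the walk.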

We can also use the rename lock in cases where livelock is a worry (and it
is introduced in a subsequent patch).

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:22 +0800
b23fb0a60 fs: scale inode alias list

Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:22 +0800
b7ab39f63 fs: dcache scale dentry refcount

Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:21 +0800
fe15ce446 fs: change d_delete semantics

Change d_delete from a dentry deletion notification to a dentry caching
advice, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.

Signed-off-by: Nick Piggin

Nick Piggin
2011-01-07 14:50:18 +0800
d035c36c5 NFSv4: Ensure continued open and lockowner name uniqueness

In order to enable migration support, we will want to move some of the
structures that are subject to migration into the struct nfs_server.
In particular, if we are to move the state_owner and state_owner_id to
being a per-filesystem structure, then we should label the resulting
open/lock owners with a per-filesystem label to ensure global uniqueness.

This patch does so by adding the super block s_dev to the open/lock owner
name.

Signed-off-by: Trond Myklebust

Trond Myklebust
2011-01-07 05:03:13 +0800
d3978bb32 NFS: Move cl_delegations to the nfs_server struct

Delegations are per-inode, not per-nfs_client. When a server file
system is migrated, delegations on the client must be moved from the
source to the destination nfs_server. Make it easier to manage a
mount point's delegation list across a migration event by moving the
list to the nfs_server struct.

Clean up: I added documenting comments to public functions I changed
in this patch. For consistency I added comments to all the other
public functions in fs/nfs/delegation.c.

Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-07 03:57:46 +0800
dda4b2256 NFS: Introduce nfs_detach_delegations()

Clean up: Refactor code that takes clp->cl_lock and calls
nfs_detach_delegations_locked() into its own function.

While we're changing the call sites, get rid of the second parameter
and the logic in nfs_detach_delegations_locked() that uses it, since
callers always set that parameter of nfs_detach_delegations_locked()
to NULL.

Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-07 03:47:57 +0800
24d292b89 NFS: Move cl_state_owners and related fields to the nfs_server struct

NFSv4 migration needs to reassociate state owners from the source to
the destination nfs_server data structures. To make that easier, move
the cl_state_owners field to the nfs_server struct. cl_openowner_id
and cl_lockowner_id accompany this move, as they are used in
conjunction with cl_state_owners.

The cl_lock field in the parent nfs_client continues to protect all
three of these fields.

Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-07 03:47:57 +0800
fca5238ef NFS: Allow walking nfs_client.cl_superblocks list outside client.c

We're about to move some fields from struct nfs_client to struct
nfs_server. There is a many-to-one relationship between nfs_servers
and nfs_clients. After these fields are moved to the nfs_server
struct, to visit all of the data in these fields that is owned by one
nfs_client, code will need to visit each nfs_server on the
cl_superblocks list for that nfs_client.

To serialize changes to the cl_superblocks list during these little
expeditions, protect the list with RCU.

Signed-off-by: Chuck Lever
Signed-off-by: Trond Myklebust

Chuck Lever
2011-01-07 03:47:56 +0800
f7e8917a6 pnfs: layout roc code

A layout can request return-on-close. How this interacts with the
forgetful model of never sending LAYOUTRETURNs is a bit ambiguous.
We forget any layouts marked roc, and wait for them to be completely
forgotten before continuing with the close. In addition, to compensate
for races with any inflight LAYOUTGETs, and the fact that we do not get
any layout stateid back from the server, we set the barrier to the worst
case scenario of current_seqid + number of outstanding LAYOUTGETS.
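
The worst-case barrier is plain arithmetic on the sequence id. A sketch with
a hypothetical helper name, assuming a 32-bit seqid with wrap-around:

```c
#include <stdint.h>

/*
 * Each outstanding LAYOUTGET may bump the layout seqid by one, so the
 * highest seqid those replies could carry is the current seqid plus
 * the number of requests still in flight.
 */
static uint32_t sketch_roc_barrier(uint32_t current_seqid,
                                   uint32_t outstanding_layoutgets)
{
    return current_seqid + outstanding_layoutgets; /* wraps modulo 2^32 */
}
```

For example, with current_seqid 5 and three LAYOUTGETs in flight the barrier
is 8.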

Signed-off-by: Fred Isaman
Signed-off-by: Trond Myklebust

Fred Isaman
2011-01-07 03:46:32 +0800