29 Jan, 2011

3 commits

  • On recent 2.6.38-rc kernels, connectathon basic test 6 fails on
    NFSv4 mounts of OpenSolaris with something like:

    > ./test6: readdir
    > ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.12' dir entry, pass 0
    > ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.82' dir entry, pass 0
    > ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.164' dir entry, pass 0
    > ./test6: (/mnt/klimt/matisse.test) Test failed with 3 errors
    > basic tests failed
    > Tests failed, leaving /mnt/klimt mounted
    > [cel@matisse cthon04]$

    I narrowed the problem down to nfs4_decode_dirent() reporting that the
    decode buffer had overflowed while decoding the entries for those
    missing files.

    verify_attr_len() assumes both its pointer arguments reside on the
    same page. When these arguments point to locations on two different
    pages, verify_attr_len() can report false errors. This can happen now
    that a large NFSv4 readdir result can span pages.

    We have reasonably good checking in nfs4_decode_dirent() anyway, so
    it should be safe to simply remove the extra checking.

    At a guess, this was introduced by commit 6650239a, "NFS: Don't use
    vm_map_ram() in readdir".

    Cc: stable@kernel.org [2.6.37]
    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Make the decoding of NFSv4 directory entries slightly more efficient
    by:

    1. Avoiding unnecessary byte swapping when checking XDR booleans,
    and

    2. Not bumping "p" when its value will be immediately replaced by
    xdr_inline_decode()

    This commit makes nfs4_decode_dirent() consistent with similar logic
    in the other two decode_dirent() functions.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
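
The first point in the commit above can be illustrated outside the kernel. This is a minimal user-space sketch, assuming only that an XDR boolean travels as a 32-bit big-endian word holding 0 or 1. The function names are hypothetical and are not the kernel's helpers.

```c
#include <arpa/inet.h>  /* htonl(), ntohl() */
#include <stdint.h>

/* Swaps the wire word to host order before testing it. */
static int xdr_bool_swapped(const uint32_t *p)
{
    return ntohl(*p) != 0;
}

/*
 * Compares the raw wire word against a zero that is already in wire
 * order. htonl(0) is a compile-time constant, so no swap of the
 * received data is needed at run time.
 */
static int xdr_bool_raw(const uint32_t *p)
{
    return *p != htonl(0);
}
```

Both functions give the same answer for any input; the second simply avoids swapping the received word.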
     
  • There is no reason to be freeing the delegation cred in the rcu callback,
    and doing so is resulting in a lockdep complaint that rpc_credcache_lock
    is being called from both softirq and non-softirq contexts.

    Reported-by: Chuck Lever
    Signed-off-by: Trond Myklebust
    Cc: stable@kernel.org

    Trond Myklebust
     

26 Jan, 2011

9 commits

  • As stated in section 2.4 of RFC 5661, subsequent instances of the client need
    to present the same co_ownerid. Concatenate the client's IP dot address,
    host name, and the rpc_auth pseudoflavor to form the co_ownerid.

    Signed-off-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Andy Adamson
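
The idea of building a stable owner string from fixed client properties might look like the sketch below. The separator characters, the field order, and the helper name `build_co_ownerid` are assumptions made for illustration; the commit message does not give the exact format.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: joins IP address, host name and auth flavor
 * number into one string. Returns the string length, or -1 if the
 * buffer is too small (the result would have been truncated).
 */
static int build_co_ownerid(char *buf, size_t len, const char *ip,
                            const char *host, unsigned int flavor)
{
    int n = snprintf(buf, len, "%s/%s %u", ip, host, flavor);

    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

Because every input is a property of the client rather than of a particular boot, two calls with the same inputs produce the same string, which is what lets a restarted client present the same co_ownerid.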
     
  • If the call to nfs_wcc_update_inode() results in an attribute update, we
    need to ensure that the inode's attr_gencount gets bumped too, otherwise
    we are not protected against races with other GETATTR calls.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • What we really want to know is the ref count.

    Signed-off-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Andy Adamson
     
  • Always assign the cb_process_state nfs_client pointer so a processing error
    in cb_sequence after the nfs_client is found and referenced returns
    a non-NULL cb_process_state nfs_client and the matching nfs_put_client in
    nfs4_callback_compound drops the reference on the client.

    Signed-off-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Andy Adamson
     
  • The information required to find the nfs_client corresponding to the incoming
    back channel request is contained in the NFS layer. Perform minimal checking
    in the RPC layer pg_authenticate method, and push more detailed checking into
    the NFS layer where the nfs_client can be found.

    Signed-off-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Andy Adamson
     
  • nfsacl_encode() allocates memory in certain cases. This of course
    is not guaranteed to work.

    Since commit 9f06c719 "SUNRPC: New xdr_streams XDR encoder API", the
    kernel's XDR encoders can't return a result indicating a possible
    failure, so a memory allocation failure in nfsacl_encode() has become
    fatal (i.e., the XDR code Oopses) in some cases.

    However, the allocated memory is a tiny fixed amount, on the order
    of 40-50 bytes. We can easily use a stack-allocated buffer for
    this, with only a wee bit of nose-holding.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Milan Broz reports:

    > on today Linus' tree I get OOps if using nfs.
    >
    > server (2.6.36) exports dir:
    > /dir 172.16.1.0/24(rw,async,all_squash,no_subtree_check,anonuid=500,anongid=500)
    >
    > on client it is mounted in fstab
    > server:/dir /mnt/tst nfs rw,soft 0 0
    >
    > and these commands OOpses it (simplified from a configure script):
    >
    > cd /dir
    > touch x
    > install x y
    >
    > [ 105.327701] ------------[ cut here ]------------
    > [ 105.327979] kernel BUG at fs/nfs/nfs3xdr.c:1338!
    > [ 105.328075] invalid opcode: 0000 [#1] PREEMPT SMP
    > [ 105.328223] last sysfs file: /sys/devices/virtual/bdi/0:16/uevent
    > [ 105.328349] Modules linked in: usbcore dm_mod
    > [ 105.328553]
    > [ 105.328678] Pid: 3710, comm: install Not tainted 2.6.37+ #423 440BX Desktop Reference Platform/VMware Virtual Platform
    > [ 105.328853] EIP: 0060:[] EFLAGS: 00010282 CPU: 0
    > [ 105.329152] EIP is at nfs3_xdr_enc_setacl3args+0x61/0x98
    > [ 105.329249] EAX: ffffffea EBX: ce941d98 ECX: 00000000 EDX: 00000004
    > [ 105.329340] ESI: ce941cd0 EDI: 000000a4 EBP: ce941cc0 ESP: ce941cb4
    > [ 105.329431] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    > [ 105.329525] Process install (pid: 3710, ti=ce940000 task=ced36f20 task.ti=ce940000)
    > [ 105.336600] Stack:
    > [ 105.336693] ce941cd0 ce9dc000 00000000 ce941cf8 c12ecd02 c12f43e0 c116c00b cf754158
    > [ 105.336982] ce9dc004 cf754284 ce9dc004 cf7ffee8 ceff9978 ce9dc000 cf7ffee8 ce9dc000
    > [ 105.337182] ce9dc000 ce941d14 c12e698d cf75412c ce941d98 cf7ffee8 cf7fff20 00000000
    > [ 105.337405] Call Trace:
    > [ 105.337695] [] rpcauth_wrap_req+0x75/0x7f
    > [ 105.337806] [] ? xdr_encode_opaque+0x12/0x15
    > [ 105.337898] [] ? nfs3_xdr_enc_setacl3args+0x0/0x98
    > [ 105.337988] [] call_transmit+0x17e/0x1e8
    > [ 105.338072] [] __rpc_execute+0x6d/0x1a6
    > [ 105.338155] [] rpc_execute+0x34/0x37
    > [ 105.338235] [] rpc_run_task+0xb5/0xbd
    > [ 105.338316] [] rpc_call_sync+0x3d/0x58
    > [ 105.338402] [] nfs3_proc_setacls+0x18e/0x24f
    > [ 105.338493] [] ? __kmalloc+0x148/0x1c4
    > [ 105.338579] [] ? posix_acl_alloc+0x12/0x22
    > [ 105.338665] [] nfs3_proc_setacl+0xa0/0xca
    > [ 105.338748] [] nfs3_setxattr+0x62/0x88
    > [ 105.338834] [] ? sub_preempt_count+0x7c/0x89
    > [ 105.338926] [] ? nfs3_setxattr+0x0/0x88
    > [ 105.339026] [] __vfs_setxattr_noperm+0x26/0x95
    > [ 105.339114] [] vfs_setxattr+0x5b/0x76
    > [ 105.339211] [] setxattr+0x9d/0xc3
    > [ 105.339298] [] ? handle_pte_fault+0x258/0x5cb
    > [ 105.339428] [] ? __free_pages+0x1a/0x23
    > [ 105.339517] [] ? up_read+0x16/0x2c
    > [ 105.339599] [] ? fget+0x0/0xa3
    > [ 105.339677] [] ? fget+0x0/0xa3
    > [ 105.339760] [] ? get_parent_ip+0xb/0x31
    > [ 105.339843] [] ? sub_preempt_count+0x7c/0x89
    > [ 105.339931] [] sys_fsetxattr+0x51/0x79
    > [ 105.340014] [] sysenter_do_call+0x12/0x32
    > [ 105.340133] Code: 2e 76 18 00 58 31 d2 8b 7f 28 f6 43 04 01 74 03 8b 53 08 6a 00 8b 46 04 6a 01 8b 0b 52 89 fa e8 85 10 f8 ff 83 c4 0c 85 c0 79 04 0b eb fe 31 c9 f6 43 04 04 74 03 8b 4b 0c 68 00 10 00 00 8d
    > [ 105.350321] EIP: [] nfs3_xdr_enc_setacl3args+0x61/0x98 SS:ESP 0068:ce941cb4
    > [ 105.364385] ---[ end trace 01fcfe7f0f7f6e4a ]---

    nfs3_xdr_enc_setacl3args() is not properly setting up the target
    buffer before nfsacl_encode() attempts to encode the ACL.

    Introduced by commit d9c407b1 "NFS: Introduce new-style XDR encoding
    functions for NFSv3."

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Nick Piggin reports:

    > I'm getting use after frees in aio code in NFS
    >
    > [ 2703.396766] Call Trace:
    > [ 2703.396858] [] ? native_sched_clock+0x27/0x80
    > [ 2703.396959] [] ? put_lock_stats+0xe/0x40
    > [ 2703.397058] [] ? lock_release_holdtime+0xa8/0x140
    > [ 2703.397159] [] lock_acquire+0x95/0x1b0
    > [ 2703.397260] [] ? aio_put_req+0x2b/0x60
    > [ 2703.397361] [] ? get_parent_ip+0x11/0x50
    > [ 2703.397464] [] _raw_spin_lock_irq+0x41/0x80
    > [ 2703.397564] [] ? aio_put_req+0x2b/0x60
    > [ 2703.397662] [] aio_put_req+0x2b/0x60
    > [ 2703.397761] [] do_io_submit+0x2be/0x7c0
    > [ 2703.397895] [] sys_io_submit+0xb/0x10
    > [ 2703.397995] [] system_call_fastpath+0x16/0x1b
    >
    > Adding some tracing, it is due to nfs completing the request then
    > returning something other than -EIOCBQUEUED, so aio.c
    > also completes the request.

    To address this, prevent the NFS direct I/O engine from completing
    async iocbs when the forward path returns an error without starting
    any I/O.

    This fix appears to survive ^C during both "xfstest no. 208" and "fsx
    -Z."

    It's likely this bug has existed for a very long while, as we are seeing
    very similar symptoms in OEL 5. Copying stable.

    Cc: Stable
    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • On Mon, 17 Jan 2011, Mi Jinlong wrote:

    >
    >
    > Jesper Juhl:
    > > strrchr() can return NULL if nothing is found. If this happens we'll
    > > dereference a NULL pointer in
    > > fs/nfs/nfs4filelayoutdev.c::decode_and_add_ds().
    > >
    > > I tried to find some other code that guarantees that this can never
    > > happen but I was unsuccessful. So, unless someone else can point to some
    > > code that ensures this can never be a problem, I believe this patch is
    > > needed.
    > >
    > > While I was changing this code I also noticed that all the dprintk()
    > > statements, except one, start with "%s:". The one missing the ":" I added
    > > it to.
    >
    > Maybe another one also should be changed at decode_and_add_ds() at line 243:
    >
    > 243 printk("%s Decoded address and port %s\n", __func__, buf);
    >
    Missed that one. Thanks.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Trond Myklebust

    Jesper Juhl
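
The class of bug fixed above is easy to reproduce in user space. The sketch below is not the code from fs/nfs/nfs4filelayoutdev.c; `split_at_last_dot` is a hypothetical helper that shows the NULL check strrchr() needs before its result is dereferenced.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: cuts buf at its last '.' and points *tail at
 * the text after it. Returns 0 on success, or -1 if buf contains no
 * '.' at all, in which case buf and *tail are left untouched.
 */
static int split_at_last_dot(char *buf, char **tail)
{
    char *dot = strrchr(buf, '.');

    if (dot == NULL)        /* without this check, *dot would crash */
        return -1;
    *dot = '\0';
    *tail = dot + 1;
    return 0;
}
```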
     

20 Jan, 2011

1 commit


16 Jan, 2011

3 commits

  • Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
    added rather than calling do_add_mount() itself. follow_automount() will then
    do the addition.

    This slightly complicates things as ->d_automount() normally wants to add the
    new vfsmount to an expiration list and start an expiration timer. The problem
    with that is that the vfsmount will be deleted if it has a refcount of 1 and
    the timer will not repeat if the expiration list is empty.

    To this end, we require the vfsmount to be returned from d_automount() with a
    refcount of (at least) 2. One of these refs will be dropped unconditionally.
    In addition, follow_automount() must get a 3rd ref around the call to
    do_add_mount() lest it eat a ref and return an error, leaving the mount we
    have open to being expired as we would otherwise have only 1 ref on it.

    d_automount() should also add the vfsmount to the expiration list (by
    calling mnt_set_expiry()) and start the expiration timer before returning, if
    this mechanism is to be used. The vfsmount will be unlinked from the
    expiration list by follow_automount() if do_add_mount() fails.

    This patch also fixes the call to do_add_mount() for AFS to propagate the mount
    flags from the parent vfsmount.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Make NFS use the new d_automount() dentry operation rather than abusing
    follow_link() on directories.

    Signed-off-by: David Howells
    Acked-by: Trond Myklebust
    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    David Howells
     
  • Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
    sleep when it tries to transit away from one of that filesystem's directories
    during a pathwalk. The operation is keyed off a new dentry flag
    (DCACHE_MANAGE_TRANSIT).

    The filesystem is allowed to be selective about which processes it holds and
    which it permits to continue on or prohibits from transiting from each flagged
    directory. This will allow autofs to hold up client processes whilst letting
    its userspace daemon through to maintain the directory or the stuff behind it
    or mounted upon it.

    The ->d_manage() dentry operation:

    int (*d_manage)(struct path *path, bool mounting_here);

    takes a pointer to the directory about to be transited away from and a flag
    indicating whether the transit is undertaken by do_add_mount() or
    do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.

    It should return 0 if successful, to let the process continue on its way;
    -EISDIR to prohibit the caller from skipping to overmounted filesystems or
    automounting, and to use this directory; or some other error code to return to
    the user.

    ->d_manage() is called with namespace_sem writelocked if mounting_here is true
    and no other locks held, so it may sleep. However, if mounting_here is true,
    it may not initiate or wait for a mount or unmount upon the parameter
    directory, even if the act is actually performed by userspace.

    Within fs/namei.c, follow_managed() is extended to check with d_manage() first
    on each managed directory, before transiting away from it or attempting to
    automount upon it.

    follow_down() is renamed follow_down_one() and should only be used where the
    filesystem deliberately intends to avoid management steps (e.g. autofs).

    A new follow_down() is added that incorporates the loop done by all other
    callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
    and CIFS do use it, their use is removed by converting them to use
    d_automount()). The new follow_down() calls d_manage() as appropriate. It
    also takes an extra parameter to indicate if it is being called from mount code
    (with namespace_sem writelocked) which it passes to d_manage(). follow_down()
    ignores automount points so that it can be used to mount on them.

    __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
    DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
    sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
    that determine whether to abort or not itself. That would allow the autofs
    daemon to continue on in rcu-walk mode.

    Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
    required as every transit from that directory will cause d_manage() to be
    invoked. It can always be set again when necessary.

    ==========================
    WHAT THIS MEANS FOR AUTOFS
    ==========================

    Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
    trigger the automounting of indirect mounts, and both of these can be called
    with i_mutex held.

    autofs knows that the i_mutex will be held by the caller in lookup(), and so
    can drop it before invoking the daemon - but this isn't so for d_revalidate(),
    since the lock is only held on _some_ of the code paths that call it. This
    means that autofs can't risk dropping i_mutex from its d_revalidate() function
    before it calls the daemon.

    The bug could manifest itself as, for example, a process that's trying to
    validate an automount dentry that gets made to wait because that dentry is
    expired and needs cleaning up:

    mkdir S ffffffff8014e05a 0 32580 24956
    Call Trace:
    [] :autofs4:autofs4_wait+0x674/0x897
    [] avc_has_perm+0x46/0x58
    [] autoremove_wake_function+0x0/0x2e
    [] :autofs4:autofs4_expire_wait+0x41/0x6b
    [] :autofs4:autofs4_revalidate+0x91/0x149
    [] __lookup_hash+0xa0/0x12f
    [] lookup_create+0x46/0x80
    [] sys_mkdirat+0x56/0xe4

    versus the automount daemon which wants to remove that dentry, but can't
    because the normal process is holding the i_mutex lock:

    automount D ffffffff8014e05a 0 32581 1 32561
    Call Trace:
    [] __mutex_lock_slowpath+0x60/0x9b
    [] do_path_lookup+0x2ca/0x2f1
    [] .text.lock.mutex+0xf/0x14
    [] do_rmdir+0x77/0xde
    [] tracesys+0x71/0xe0
    [] tracesys+0xd5/0xe0

    which means that the system is deadlocked.

    This patch allows autofs to hold up normal processes whilst the daemon goes
    ahead and does things to the dentry tree behind the automounter point without
    risking a deadlock as almost no locks are held in d_manage() and none in
    d_automount().

    Signed-off-by: David Howells
    Was-Acked-by: Ian Kent
    Signed-off-by: Al Viro

    David Howells
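
The return-value convention that the commit above describes for ->d_manage() can be summarised in a small user-space sketch. The enum and the function `interpret_d_manage` are hypothetical names introduced here; they are not part of fs/namei.c and only restate the three cases from the commit message.

```c
#include <errno.h>

/* Hypothetical names for the three outcomes listed in the commit. */
enum sketch_walk_action {
    SKETCH_WALK_CONTINUE,   /* 0: let the process continue on its way */
    SKETCH_WALK_STAY_HERE,  /* -EISDIR: use this directory, no transit */
    SKETCH_WALK_FAIL        /* anything else: error goes to the user */
};

/* Maps a d_manage()-style return code to an action; *err gets the
 * error to report, or 0 when there is none. */
static enum sketch_walk_action interpret_d_manage(int ret, int *err)
{
    *err = 0;
    if (ret == 0)
        return SKETCH_WALK_CONTINUE;
    if (ret == -EISDIR)
        return SKETCH_WALK_STAY_HERE;
    *err = ret;
    return SKETCH_WALK_FAIL;
}
```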
     

14 Jan, 2011

3 commits


13 Jan, 2011

1 commit


12 Jan, 2011

2 commits

  • * 'nfs-for-2.6.38' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (89 commits)
    NFS fix the setting of exchange id flag
    NFS: Don't use vm_map_ram() in readdir
    NFSv4: Ensure continued open and lockowner name uniqueness
    NFS: Move cl_delegations to the nfs_server struct
    NFS: Introduce nfs_detach_delegations()
    NFS: Move cl_state_owners and related fields to the nfs_server struct
    NFS: Allow walking nfs_client.cl_superblocks list outside client.c
    pnfs: layout roc code
    pnfs: update nfs4_callback_recallany to handle layouts
    pnfs: add CB_LAYOUTRECALL handling
    pnfs: CB_LAYOUTRECALL xdr code
    pnfs: change lo refcounting to atomic_t
    pnfs: check that partial LAYOUTGET return is ignored
    pnfs: add layout to client list before sending rpc
    pnfs: serialize LAYOUTGET(openstateid)
    pnfs: layoutget rpc code cleanup
    pnfs: change how lsegs are removed from layout list
    pnfs: change layout state seqlock to a spinlock
    pnfs: add prefix to struct pnfs_layout_hdr fields
    pnfs: add prefix to struct pnfs_layout_segment fields
    ...

    Linus Torvalds
     
  • Indicate support for referrals. Do not set any PNFS roles. Check the flags
    returned by the server for validity. Do not use exchange flags from an old
    client ID instance when recovering a client ID.

    Update the EXCHID4_FLAG_XXX set to RFC 5661.

    Signed-off-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Andy Adamson
     

11 Jan, 2011

2 commits

  • Conflicts:
    fs/nfs/nfs2xdr.c
    fs/nfs/nfs3xdr.c
    fs/nfs/nfs4xdr.c

    Trond Myklebust
     
  • vm_map_ram() is not available on NOMMU platforms, and causes trouble
    on incoherent architectures such as ARM when we access the page data
    through both the direct and the virtual mapping.

    The alternative is to use the direct mapping to access page data
    for the case when we are not crossing a page boundary, but to copy
    the data into a linear scratch buffer when we are accessing data
    that spans page boundaries.

    Signed-off-by: Trond Myklebust
    Tested-by: Marc Kleine-Budde
    Cc: stable@kernel.org [2.6.37]

    Trond Myklebust
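
The alternative described in the commit above (use the page directly when the data fits, copy into a linear scratch buffer when it spans pages) can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel's XDR code: the 16-byte "page" size, the function name `inline_or_copy`, and its parameters are all invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16u   /* tiny "page" to keep the example small */

/*
 * Returns a pointer to len contiguous bytes that start at logical
 * offset off within an array of separately allocated pages.
 * Returns NULL if the range runs past the last page or if a copy is
 * needed and the scratch buffer is too small.
 */
static const unsigned char *inline_or_copy(unsigned char *const *pages,
                                           size_t npages, size_t off,
                                           size_t len,
                                           unsigned char *scratch,
                                           size_t scratch_len)
{
    size_t idx = off / SKETCH_PAGE_SIZE;
    size_t in_page = off % SKETCH_PAGE_SIZE;
    size_t done = 0;

    if (off + len > npages * SKETCH_PAGE_SIZE)
        return NULL;

    /* Fast path: the range lies within one page, so no copy is made. */
    if (in_page + len <= SKETCH_PAGE_SIZE)
        return pages[idx] + in_page;

    if (len > scratch_len)
        return NULL;

    /* Slow path: gather the pieces from successive pages. */
    while (done < len) {
        size_t chunk = SKETCH_PAGE_SIZE - in_page;

        if (chunk > len - done)
            chunk = len - done;
        memcpy(scratch + done, pages[idx] + in_page, chunk);
        done += chunk;
        idx++;
        in_page = 0;
    }
    return scratch;
}
```

The caller cannot assume the returned pointer lies inside any particular page, which is also why pointer-difference checks across such results are unreliable.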
     

07 Jan, 2011

16 commits

  • dcache_inode_lock can be replaced with per-inode locking. Use existing
    inode->i_lock for this. This is slightly non-trivial because we sometimes
    need to find the inode from the dentry, which requires d_inode to be
    stabilised (either with refcount or d_lock).

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Require filesystems be aware of .d_revalidate being called in rcu-walk
    mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
    -ECHILD from all implementations.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Reduce some branches and memory accesses in dcache lookup by adding dentry
    flags to indicate common d_ops are set, rather than having to check them.
    This saves a pointer memory access (dentry->d_op) in common path lookup
    situations, and saves another pointer load and branch in cases where we
    have d_op but not the particular operation.

    Patched with:

    git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

    Signed-off-by: Nick Piggin

    Nick Piggin
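
The flag-caching idea in the commit above can be sketched in user space like this. The struct layouts, flag names, and the `sketch_` functions are simplified stand-ins invented for the example; the real dentry and d_set_d_op() carry more state than shown here.

```c
#include <stddef.h>

#define SKETCH_DOP_HASH    0x1u
#define SKETCH_DOP_COMPARE 0x2u

struct sketch_dentry_ops {
    int (*d_hash)(void);
    int (*d_compare)(void);
};

struct sketch_dentry {
    unsigned int flags;
    const struct sketch_dentry_ops *d_op;
};

/* Example operation, used only to populate an ops table. */
static int sketch_noop_hash(void)
{
    return 0;
}

/* Records once, at assignment time, which operations are present. */
static void sketch_set_d_op(struct sketch_dentry *d,
                            const struct sketch_dentry_ops *op)
{
    d->d_op = op;
    d->flags &= ~(SKETCH_DOP_HASH | SKETCH_DOP_COMPARE);
    if (op == NULL)
        return;
    if (op->d_hash)
        d->flags |= SKETCH_DOP_HASH;
    if (op->d_compare)
        d->flags |= SKETCH_DOP_COMPARE;
}

/* Lookup-path test: reads only d->flags, never d->d_op. */
static int sketch_needs_hash(const struct sketch_dentry *d)
{
    return (d->flags & SKETCH_DOP_HASH) != 0;
}
```

Routing every assignment through one setter is what keeps the flags in step with the ops table, which is the reason the commit converts direct d_op assignments mechanically.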
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • dcache_lock no longer protects anything. remove it.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • The remaining usages of dcache_lock are to allow atomic, multi-step read-side
    operations over the directory tree by excluding modifications to the tree.
    Also, to walk in the leaf->root direction in the tree where we don't have
    a natural d_lock ordering.

    This could be accomplished by taking every d_lock, but this would mean a
    huge number of locks and actually gets very tricky.

    Solve this instead by using the rename seqlock for multi-step read-side
    operations, retry in case of a rename so we don't walk up the wrong parent.
    Concurrent dentry insertions are not serialised against. Concurrent deletes
    are tricky when walking up the directory: our parent might have been deleted
    when dropping locks so also need to check and retry for that.

    We can also use the rename lock in cases where livelock is a worry (and it
    is introduced in a subsequent patch).

    Signed-off-by: Nick Piggin

    Nick Piggin
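
The read-and-retry pattern that the commit above relies on can be sketched with a plain sequence counter. This is a simplified user-space illustration using C11 atomics with their default ordering, not the kernel's seqlock implementation. It assumes writers are already serialised by some outer lock, and it leaves out the protected data entirely. All `sketch_` names are hypothetical.

```c
#include <stdatomic.h>

struct sketch_seqcount {
    atomic_uint seq;    /* odd while a writer is in progress */
};

/* Waits until no writer is active, then returns the counter value. */
static unsigned int sketch_read_begin(struct sketch_seqcount *s)
{
    unsigned int v;

    do {
        v = atomic_load(&s->seq);
    } while (v & 1u);
    return v;
}

/* Returns nonzero if a writer ran since sketch_read_begin(). */
static int sketch_read_retry(struct sketch_seqcount *s, unsigned int start)
{
    return atomic_load(&s->seq) != start;
}

/* Writer side: the counter is odd between begin and end. */
static void sketch_write_begin(struct sketch_seqcount *s)
{
    atomic_fetch_add(&s->seq, 1u);
}

static void sketch_write_end(struct sketch_seqcount *s)
{
    atomic_fetch_add(&s->seq, 1u);
}
```

A reader wraps its multi-step walk in `do { seq = sketch_read_begin(&s); /* walk */ } while (sketch_read_retry(&s, seq));` so that a concurrent rename forces the walk to start over instead of continuing up the wrong parent.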
     
  • Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
    from concurrent modification. d_alias is also protected by d_lock.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
    0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
    we start protecting many other dentry members with d_lock.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Change d_delete from a dentry deletion notification to a dentry caching
    advice, more like ->drop_inode. Require it to be constant and idempotent,
    and not take d_lock. This is how all existing filesystems use the callback
    anyway.

    This makes fine grained dentry locking of dput and dentry lru scanning
    much simpler.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • In order to enable migration support, we will want to move some of the
    structures that are subject to migration into the struct nfs_server.
    In particular, if we are to move the state_owner and state_owner_id to
    being a per-filesystem structure, then we should label the resulting
    open/lock owners with a per-filesystem label to ensure global uniqueness.

    This patch does so by adding the super block s_dev to the open/lock owner
    name.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Delegations are per-inode, not per-nfs_client. When a server file
    system is migrated, delegations on the client must be moved from the
    source to the destination nfs_server. Make it easier to manage a
    mount point's delegation list across a migration event by moving the
    list to the nfs_server struct.

    Clean up: I added documenting comments to public functions I changed
    in this patch. For consistency I added comments to all the other
    public functions in fs/nfs/delegation.c.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Clean up: Refactor code that takes clp->cl_lock and calls
    nfs_detach_delegations_locked() into its own function.

    While we're changing the call sites, get rid of the second parameter
    and the logic in nfs_detach_delegations_locked() that uses it, since
    callers always set that parameter of nfs_detach_delegations_locked()
    to NULL.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • NFSv4 migration needs to reassociate state owners from the source to
    the destination nfs_server data structures. To make that easier, move
    the cl_state_owners field to the nfs_server struct. cl_openowner_id
    and cl_lockowner_id accompany this move, as they are used in
    conjunction with cl_state_owners.

    The cl_lock field in the parent nfs_client continues to protect all
    three of these fields.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • We're about to move some fields from struct nfs_client to struct
    nfs_server. There is a many-to-one relationship between nfs_servers
    and nfs_clients. After these fields are moved to the nfs_server
    struct, to visit all of the data in these fields that is owned by one
    nfs_client, code will need to visit each nfs_server on the
    cl_superblocks list for that nfs_client.

    To serialize changes to the cl_superblocks list during these little
    expeditions, protect the list with RCU.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • A layout can request return-on-close. How this interacts with the
    forgetful model of never sending LAYOUTRETURNS is a bit ambiguous.
    We forget any layouts marked roc, and wait for them to be completely
    forgotten before continuing with the close. In addition, to compensate
    for races with any inflight LAYOUTGETs, and the fact that we do not get
    any layout stateid back from the server, we set the barrier to the worst
    case scenario of current_seqid + number of outstanding LAYOUTGETS.

    Signed-off-by: Fred Isaman
    Signed-off-by: Trond Myklebust

    Fred Isaman
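
The worst-case barrier described in the commit above is a one-line calculation. The sketch below only restates that arithmetic; the function name and the use of a 32-bit unsigned type are assumptions, and any seqid wrap-around handling the real code may need is not shown.

```c
#include <stdint.h>

/*
 * Hypothetical helper: every LAYOUTGET still in flight could advance
 * the layout seqid by one, so the barrier is placed past all of them.
 */
static uint32_t sketch_roc_barrier(uint32_t current_seqid,
                                   uint32_t outstanding_layoutgets)
{
    return current_seqid + outstanding_layoutgets;
}
```

For example, with a current seqid of 5 and three LAYOUTGETs outstanding, the barrier is 8.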