12 Jul, 2005

1 commit


08 Jul, 2005

38 commits

  • After discussion at the recent NFSv4 bake-a-thon, I realized that my
    assumption that NFS4_FH_PERSISTENT required filehandles to persist was a
    misreading of the spec. This also fixes an interoperability problem with the
    Solaris client.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We shouldn't be allowing, e.g., write locks on files not open for write. To
    enforce this, we add a pointer from the lock stateid back to the open stateid
    it came from, so that the check will continue to be correct even after the
    open is upgraded or downgraded.

    Signed-off-by: Andy Adamson
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
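
The back pointer described in the entry above can be sketched in plain user-space C. This is only an illustration under assumptions: the type and function names (`open_state`, `lock_state`, `lock_allowed`) and the access constants are hypothetical, not the nfsd definitions. The point is that the check reads the open's current access bits through the pointer, so it stays correct after an upgrade or downgrade.

```c
#include <stdbool.h>

/* Hypothetical access bits and lock types, for illustration only. */
enum { ACCESS_READ = 1, ACCESS_WRITE = 2 };
enum lock_type { LOCK_READ, LOCK_WRITE };

/* Stand-in for an open stateid: records the current share access. */
struct open_state {
    unsigned access;
};

/* Stand-in for a lock stateid: points back at the open it came from. */
struct lock_state {
    struct open_state *open;
};

/* A lock is allowed only if the open currently grants matching access. */
static bool lock_allowed(const struct lock_state *ls, enum lock_type t)
{
    unsigned need = (t == LOCK_WRITE) ? ACCESS_WRITE : ACCESS_READ;

    return (ls->open->access & need) != 0;
}
```

Because the lock state holds a pointer rather than a copy of the access bits, changing `open->access` is immediately visible to later checks.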
     
  • As long as we're here, do some miscellaneous cleanup.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The handling of close_lru in preprocess_stateid_op was a source of some
    confusion here recently. Try to make the logic a little clearer, by renaming
    find_openstateowner_id to make its purpose clearer and untangling some
    unnecessarily complicated goto's.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • nfs4_preprocess_seqid_op is called by NFSv4 operations that imply an implicit
    renewal of the client lease.

    Signed-off-by: Andy Adamson
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • From RFC 3530:
    "Share reservations are established by OPEN operations and by their
    nature are mandatory in that when the OPEN denies READ or WRITE
    operations, that denial results in such operations being rejected
    with error NFS4ERR_LOCKED."

    (Note that share_denied is really only a legal error for OPEN.)

    Signed-off-by: Andy Adamson
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • An OPEN from the same client/open stateowner requires a stateid update because
    of the share/deny access update.

    Signed-off-by: Andy Adamson
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We're insisting that the lock sequence id field passed in the
    open_to_lockowner struct always be zero. This is probably thanks to the
    sentence in RFC 3530: "The first request issued for any given lock_owner is
    issued with a sequence number of zero."

    But there doesn't seem to be any problem with allowing initial sequence
    numbers other than zero. And currently this is causing lock reclaims from the
    Linux client to fail.

    In the spirit of "be liberal in what you accept, conservative in what you
    send", we'll relax the check (and patch the Linux client as well).

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Add some comments on the use of so_seqid, in an attempt to avoid some of the
    confusion outlined in the previous patch....

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The sequence number we store in the sequence id is the last one we received
    from the client. So on the next operation we'll check that the client gives
    us the next higher number.

    We increment sequence id's at the last moment, in encode, so that we're sure
    of knowing the right error return. (The decision to increment the sequence id
    depends on the exact error returned.)

    However on the *first* use of a sequence number, if we set the sequence number
    to the one received from the client and then let the increment happen on
    encode, we'll be left with a sequence number one too high.

    For that reason, ENCODE_SEQID_OP_TAIL only increments the sequence id on
    *confirmed* stateowners.

    This creates a problem for open reclaims, which are confirmed on first use.
    Therefore the open reclaim code, as a special exception, *decrements* the
    sequence id, cancelling out the undesired increment on encode. But this
    prevents the sequence id from ever being incremented in the case where
    multiple reclaims are sent with the same openowner. Yuch!

    We could add another exception to the open reclaim code, decrementing the
    sequence id only if this is the first use of the open owner.

    But it's simpler by far to modify the meaning of the op_seqid field: instead
    of representing the previous value sent by the client, we take op_seqid, after
    encoding, to represent the *next* sequence id that we expect from the client.
    This eliminates the need for special-case handling of the first use of a
    stateowner.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
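
A minimal sketch of the "next expected sequence id" convention from the entry above, written as stand-alone C rather than nfsd code. The names (`owner`, `check_seqid`, `bump_seqid`) are hypothetical. It assumes the first request from an owner may carry any sequence id, and that the increment happens once, at encode time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a stateowner; not the nfsd structure. */
struct owner {
    uint32_t next_seqid;  /* the seqid we expect on the next request */
    bool     seen;        /* false until the first request arrives */
};

/* Accept any initial seqid; afterwards require exactly the expected one. */
static bool check_seqid(struct owner *o, uint32_t seqid)
{
    if (!o->seen) {
        o->seen = true;
        o->next_seqid = seqid;
    }
    return seqid == o->next_seqid;
}

/* Called at encode time for results that advance the sequence. */
static void bump_seqid(struct owner *o)
{
    o->next_seqid++;
}
```

With this convention the first use needs no special case: after the increment the stored value is simply the one the client should send next.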
     
  • Yeah, it's trivial, but this drives me up the wall....

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • A misreading of the spec led us to convert all errors on open and lock
    reclaims to RECLAIM_BAD. This causes problems--for example, a reboot within
    the grace period could lead to reclaims with stale stateid's, and we'd like to
    return STALE errors in those cases.

    What RFC 3530 actually says about RECLAIM_BAD: "The reclaim provided by the
    client does not match any of the server's state consistency checks and is
    bad." I'm assuming that "state consistency checks" refers to checks for
    consistency with the state recorded to stable storage, and that the error
    should be reserved for that case.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • A GRACE or NOGRACE response to a lock request should also bump the sequence
    id. So we delay the handling of grace period errors till after we've found
    the relevant owner.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The GRACE and NOGRACE errors should bump the sequence id on open. So we delay
    the handling of these errors until nfsd4_process_open2, at which point we've
    set the open owner, so the encode routine will be able to bump the sequence
    id.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We oops in list_for_each_entry(), because release_stateowner frees something
    on the list we're traversing.

    Signed-off-by: Andy Adamson
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
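
The class of bug in the entry above (freeing an element of the list being traversed) can be shown with a small user-space list. This is a sketch under assumptions: the `node` type and helper names are hypothetical, and it does not reproduce the actual nfsd fix. The safe pattern is to read the successor before the current element is released.

```c
#include <stdlib.h>

/* Hypothetical singly linked list node, for illustration only. */
struct node {
    int value;
    struct node *next;
};

/* Push a new node on the front of the list; returns the new head. */
static struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return head;
    n->value = value;
    n->next = head;
    return n;
}

/* Free every node, returning how many were released. */
static int release_all(struct node *head)
{
    int freed = 0;
    struct node *cur = head;

    while (cur) {
        struct node *next = cur->next;  /* read before cur is freed */

        free(cur);
        freed++;
        cur = next;
    }
    return freed;
}
```

Advancing with `cur = cur->next` after `free(cur)` would read freed memory, which is the kind of access that shows up as an oops in the kernel.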
     
  • Make sure we don't try to delete client recovery directories multiple times;
    fixes some spurious error messages.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Oops, this lookup_one_len needs the i_sem.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We need to fsync the recovery directory after writing to it, but we weren't
    doing this correctly. (For example, we weren't taking the i_sem when calling
    ->fsync().)

    Just reuse the existing nfsd fsync code instead.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We need to remove the recovery directory here too. (This chunk just got lost
    somehow in the process of commuting the reboot recovery patches past the other
    patches.)

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This patch renames _mntput() to something a little more descriptive:
    mntput_no_expire().

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch renames vfsmount->mnt_fslink to something a little more
    descriptive: vfsmount->mnt_expire.

    Signed-off-by: Mike Waychison
    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Dcookies shouldn't play with the internals of dentry and vfsmnt
    refcounting. It defeats grepping, and is prone to break if implementation
    details change.

    In addition the function doesn't even seem to be performance critical: it
    calls kmem_cache_alloc().

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch sets ->mnt_namespace where it's actually added to the
    namespace.

    Previously mnt_namespace was set in do_kern_mount() even if the filesystem
    was never added to any process's namespace (most kernel-internal
    filesystems).

    This discrepancy doesn't actually cause any problems, but it's cleaner if
    mnt_namespace is NULL for these non-exported filesystems.

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch clears mnt_namespace in an expired mount.

    If mnt_namespace is not cleared, it's possible to attach a new mount to the
    already detached mount, because check_mnt() can return true.

    The effect is a resource leak, since the resulting tree will never be
    freed.

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch fixes a bug noticed by Al Viro:

    However, we still have a problem here - just what would
    happen if vfsmount is detached while we were grabbing namespace
    semaphore? Refcount alone is not useful here - we might be held by
    whoever had detached the vfsmount. IOW, we should check that it's
    still attached (i.e. that mnt->mnt_parent != mnt). If it's not -
    just leave it alone, do mntput() and let whoever holds it deal with
    the sucker. No need to put it back on lists.

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch splits the mark_mounts_for_expiry() function. It's too complex and
    too deeply nested, even without the bugfix in the following patch.

    Otherwise code is completely the same.

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch simplifies mark_mounts_for_expiry() by using detach_mnt() instead
    of duplicating everything it does.

    It should be an equivalent transformation except for righting the dput/mntput
    order.

    Al Viro said: "Looks sane".

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch fixes a race found by Ram in mark_mounts_for_expiry() in
    fs/namespace.c.

    The bug can only be triggered with simultaneous exiting of a process having
    a private namespace, and expiry of a mount from within that namespace.
    It's practically impossible to trigger, and I haven't even tried. But
    still, a bug is a bug.

    The race happens when put_namespace() is called by another task, while
    mark_mounts_for_expiry() is between atomic_read() and get_namespace(). In
    that case get_namespace() will be called on an already dead namespace with
    unforeseeable results.

    The solution was suggested by Al Viro, in his own words:

    Instead of screwing with atomic_read() in there, why don't we
    simply do the following:
    a) atomic_dec_and_lock() in put_namespace()
    b) __put_namespace() called without dropping lock
    c) the first thing done by __put_namespace would be
    struct vfsmount *root = namespace->root;
    namespace->root = NULL;
    spin_unlock(...);
    ....
    umount_tree(root);
    ...
    d) check in mark_... would be simply namespace && namespace->root.

    And we are all set; no screwing around with atomic_read(), no magic
    at all. Dying namespace gets NULL ->root.
    All changes of ->root happen under spinlock.
    If under a spinlock we see non-NULL ->mnt_namespace, it won't be
    freed until we drop the lock (we will set ->mnt_namespace to NULL
    under that lock before we get to freeing namespace).
    If under a spinlock we see non-NULL ->mnt_namespace and
    ->mnt_namespace->root, we can grab a reference to namespace and be
    sure that it won't go away.

    Signed-off-by: Miklos Szeredi
    Acked-by: Al Viro
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
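
The scheme quoted in the entry above can be approximated in user space. This is a simplified sketch, not the kernel code: it assumes a pthread mutex in place of the spinlock, keeps the reference count under that same mutex instead of using atomic_dec_and_lock(), and uses hypothetical names (`ns`, `put_ns`, `get_ns_if_alive`). It shows the two rules that matter: a dying namespace gets a NULL root under the lock, and a reference is only taken after checking the root under that lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a namespace object. */
struct ns {
    int   count;  /* reference count, protected by ns_lock here */
    void *root;   /* NULL once the namespace is dying */
};

static pthread_mutex_t ns_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Drop a reference. On the last drop, clear root while still holding
 * the lock and hand the old root back so the caller can tear it down
 * after the lock is released. Returns NULL otherwise.
 */
static void *put_ns(struct ns *ns)
{
    void *root = NULL;

    pthread_mutex_lock(&ns_lock);
    if (--ns->count == 0) {
        root = ns->root;
        ns->root = NULL;
    }
    pthread_mutex_unlock(&ns_lock);
    return root;
}

/* Take a reference only if the namespace is still alive. */
static bool get_ns_if_alive(struct ns *ns)
{
    bool ok = false;

    pthread_mutex_lock(&ns_lock);
    if (ns && ns->root) {
        ns->count++;
        ok = true;
    }
    pthread_mutex_unlock(&ns_lock);
    return ok;
}
```

Since both the final decrement and the liveness check happen under one lock, there is no window in which a reference can be taken on an already dead object.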
     
  • This patch clears mnt_namespace on unmount.

    Not clearing mnt_namespace has two effects:

    1) It is possible to attach a new mount to a detached mount,
    because check_mnt() returns true.

    This means that when no other references to the detached mount
    remain, it still can't be freed. This causes a resource leak,
    and possibly un-removable modules.

    2) If mnt_namespace is dereferenced (only in mark_mounts_for_expiry())
    after the namespace has been freed, it can cause an Oops, memory
    corruption, etc.

    1) has been tested before and after the patch, 2) is only speculation.

    Signed-off-by: Miklos Szeredi
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • We're dereferencing `flp' and then we're testing it for NULLness.

    Either the compiler accidentally saved us or the existing null-pointer check
    is redundant.

    This defect was found automatically by Coverity Prevent, a static analysis tool.

    Signed-off-by: Zaur Kambarov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMBAROV, ZAUR
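
For readers unfamiliar with this class of defect, here is a minimal sketch of the corrected ordering. The `lease` type and `lease_type_or` function are hypothetical and unrelated to the code touched by the commit. The pointer is tested first and dereferenced only afterwards.

```c
#include <stddef.h>

/* Hypothetical structure, for illustration only. */
struct lease {
    int type;
};

/* Test the pointer first, then dereference it. */
static int lease_type_or(const struct lease *l, int fallback)
{
    if (l == NULL)
        return fallback;
    return l->type;
}
```

Reading `l->type` before the test would make the NULL check either redundant or too late, which is what the static analysis tool reported.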
     
  • Fix debugging printk.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • We are not using the in-inode space for xattrs in reserved inodes because
    mkfs.ext3 doesn't initialize it properly. For those inodes, we set
    i_extra_isize to 0. Make sure that we also don't overwrite the
    i_extra_isize field when writing out the inode in that case. This is for
    cleanliness only, and doesn't fix an actual bug.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Gruenbacher
     
  • Add a new section called ".data.read_mostly" for data items that are read
    frequently and rarely written to, like cpumaps etc.

    If these maps are placed in the .data section then these frequently read
    items may end up in cachelines with data that is frequently updated. In that
    case all processors in an SMP system must needlessly reload, again and
    again, the cachelines containing those frequently read variables.

    The ability to share these cachelines will allow each cpu in an SMP system
    to keep local copies of those shared cachelines thereby optimizing
    performance.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Shobhit Dayal
    Signed-off-by: Christoph Lameter
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
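
A sketch of how a variable might be tagged for such a section, assuming GCC or Clang on an ELF target. The `READ_MOSTLY` macro, `nr_workers`, `hits` and `pick_worker` are hypothetical names for this example. In user space the default linker script typically folds the section back into .data, so this only demonstrates the annotation, not the cacheline separation the kernel linker script provides.

```c
/* Hypothetical macro modelled on the section name in the commit message. */
#define READ_MOSTLY __attribute__((__section__(".data.read_mostly")))

static int nr_workers READ_MOSTLY = 4;  /* set once, read on every call */
static unsigned long hits;              /* written on every call */

/* Map a key to a worker; reads nr_workers, writes only hits. */
static int pick_worker(unsigned long key)
{
    hits++;
    return (int)(key % (unsigned long)nr_workers);
}
```

Keeping `nr_workers` away from `hits` is the goal: every update of `hits` invalidates its cacheline on other CPUs, and a read-mostly variable sharing that line would be reloaded for no reason.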
     
  • Original patch from Matt Mackall

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Gruenbacher
     
  • Use a bit spin lock in the first buffer of the page to synchronise async
    IO buffer completions, instead of the global page_uptodate_lock, which is
    showing some scalability problems.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
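
The idea of a bit spin lock (one bit of an existing flags word serves as the lock, so no separate lock object is needed) can be sketched with C11 atomics. This is an illustration under assumptions, not the kernel's implementation: `LOCK_BIT` and the function names are hypothetical, and the busy-wait loop has no backoff.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical choice: bit 0 of the flags word is the lock bit. */
#define LOCK_BIT 0x1UL

/* Try once; fetch_or returns the old value, so a clear bit means we won. */
static bool bit_trylock(atomic_ulong *flags)
{
    unsigned long old =
        atomic_fetch_or_explicit(flags, LOCK_BIT, memory_order_acquire);

    return (old & LOCK_BIT) == 0;
}

/* Spin until the bit is acquired. */
static void bit_lock(atomic_ulong *flags)
{
    while (!bit_trylock(flags))
        ;  /* busy-wait */
}

/* Clear the lock bit, leaving the other flag bits untouched. */
static void bit_unlock(atomic_ulong *flags)
{
    atomic_fetch_and_explicit(flags, ~LOCK_BIT, memory_order_release);
}
```

Because each object carries its own lock bit, contention is limited to users of that object instead of everyone sharing one global lock.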
     
  • Some time ago a trivial patch broke HPPFS (one var became a pointer, not
    all uses were updated). It wasn't fixed at the time because HPPFS is not
    widely used. A fix has now been requested, so I've fixed it, and the fix
    has been tested positively (at least partially).

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     
  • - Make ioprio syscalls return long, like set/getpriority syscalls.
    - Move function prototypes into syscalls.h so we can pick them up in the
    32/64bit compat code.

    Signed-off-by: Anton Blanchard
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • OCFS2 wants to mark an inode which has been orphaned by another node so
    that during final iput it takes the correct path through the VFS and can
    pass through the OCFS2 delete_inode callback. Since i_nlink can get out of
    date with other nodes, the best way I see to accomplish this is by clearing
    i_nlink on those inodes at drop_inode time. Other than this small amount
    of work, nothing different needs to happen, so I think it would be cleanest
    to be able to just call generic_drop_inode at the end of the OCFS2
    drop_inode callback.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

07 Jul, 2005

1 commit