31 Mar, 2014

16 commits

  • Allow locks_mandatory_area() to handle file-private locks correctly.
    If there is a file-private lock set on an open file and we're doing I/O
    via that same file, then that should not cause anything to block.

    Handle this by first doing a non-blocking FL_ACCESS check for a
    file-private lock, and then fall back to checking for a classic POSIX
    lock (and possibly blocking).

    Note that this approach is subject to the same races that have always
    plagued mandatory locking on Linux.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • As Trond pointed out, you can currently deadlock yourself by setting a
    file-private lock on a file that requires mandatory locking and then
    trying to do I/O on it.

    Avoid this problem by plumbing some knowledge of file-private locks into
    the mandatory locking code. In order to do this, we must pass down
    information about the struct file that's being used to
    locks_verify_locked.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields

    Jeff Layton
     
  • Neil Brown suggested potentially overloading the l_pid value as a "lock
    context" field for file-private locks. While we probably don't want to
    do that here, it's a good idea to ensure that we could extend this API
    in the future without breaking existing callers.

    Typically the l_pid value is ignored for incoming struct flock
    arguments, serving mainly as a place to return the pid of the owner if
    there is a conflicting lock. For file-private locks, require that it
    currently be set to 0 and return EINVAL if it isn't. If we eventually
    want to make a non-zero l_pid mean something, then this will help ensure
    that we don't break legacy programs that are using file-private locks.

    Cc: Neil Brown
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Due to some unfortunate history, POSIX locks have very strange and
    unhelpful semantics. The thing that usually catches people by surprise
    is that they are dropped whenever the process closes any file descriptor
    associated with the inode.

    This is extremely problematic for people developing file servers that
    need to implement byte-range locks. Developers often need a "lock
    management" facility to ensure that file descriptors are not closed
    until all of the locks associated with the inode are finished.

    Additionally, "classic" POSIX locks are owned by the process. Locks
    taken between threads within the same process won't conflict with one
    another, which renders them useless for synchronization between threads.

    This patchset adds a new type of lock that attempts to address these
    issues. These locks conflict with classic POSIX read/write locks, but
    have semantics that are more like BSD locks with respect to inheritance
    and behavior on close.

    This is implemented primarily by changing how the fl_owner field is set for
    these locks. Instead of having them owned by the files_struct of the
    process, they are instead owned by the filp on which they were acquired.
    Thus, they are inherited across fork() and are only released when the
    last reference to a filp is put.

    These new semantics prevent them from being merged with classic POSIX
    locks, even if they are acquired by the same process. These locks will
    also conflict with classic POSIX locks even if they are acquired by
    the same process or on the same file descriptor.

    The new locks are managed using a new set of cmd values to the fcntl()
    syscall. The initial implementation of this converts these values to
    "classic" cmd values at a fairly high level, and the details are not
    exposed to the underlying filesystem. We may eventually want to push
    this handling down to the lower filesystem code, but for now I don't
    see any need for it.

    Also, note that with this implementation the new cmd values are only
    available via fcntl64() on 32-bit arches. There's little need to
    add support for legacy apps on a new interface like this.

    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • It's not really feasible to do deadlock detection with FL_FILE_PVT
    locks since they aren't owned by a single task, per se. Deadlock
    detection also tends to be rather expensive, so just skip it for
    these sorts of locks.

    Also, add a FIXME comment about adding more limited deadlock detection
    that just applies to ro -> rw upgrades, per Andy's request.

    Cc: Andy Lutomirski
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Once we introduce file private locks, we'll need to know what cmd value
    was used, as that affects the ownership and whether a conflict would
    arise.

    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • FL_FILE_PVT locks are no longer tied to a particular pid, and are
    instead inheritable by child processes. Report an l_pid of '-1' for
    these sorts of locks since the pid is somewhat meaningless for them.

    This precedent comes from FreeBSD. There, POSIX and flock() locks can
    conflict with one another. If fcntl(F_GETLK, ...) returns a lock set
    with flock() then the l_pid member cannot be a process ID because the
    lock is not held by a process as such.

    Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • In a later patch, we'll be adding a new type of lock that's owned by
    the struct file instead of the files_struct. Those sorts of locks
    will be flagged with a new FL_FILE_PVT flag.

    Report these types of locks as "FLPVT" in /proc/locks to distinguish
    them from "classic" POSIX locks.

    Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • This function currently removes leases in addition to flock locks and in
    a later patch we'll have it deal with file-private locks too. Rename it
    to locks_remove_file to indicate that it removes locks that are
    associated with a particular struct file, and not just flock locks.

    Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Move this check into flock64_to_posix_lock instead of duplicating it in
    two places. This also fixes a minor wart in the code where we continue
    referring to the struct flock after converting it to struct file_lock.

    Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • In the 32-bit case fcntl assigns the 64-bit f_pos and i_size to a 32-bit
    off_t.

    The existing range checks also seem to depend on signed arithmetic
    wrapping when it overflows. In practice maybe that works, but we can be
    more careful. That also allows us to make a more reliable distinction
    between -EINVAL and -EOVERFLOW.

    Note that in the 32-bit case SEEK_CUR or SEEK_END might allow the caller
    to set a lock with starting point no longer representable as a 32-bit
    value. We could return -EOVERFLOW in such cases, but the locks code is
    capable of handling such ranges, so we choose to be lenient here. The
    only problem is that subsequent GETLK calls on such a lock will fail
    with EOVERFLOW.

    While we're here, do some cleanup including consolidating code for the
    flock and flock64 cases.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    J. Bruce Fields
     
  • A leftover lock on the list is surely a sign of a problem of some sort,
    but it's not necessarily a reason to panic the box. Instead, just log a
    warning with some info about the lock, and then delete it like we would
    any other lock.

    In the event that the filesystem declares a ->lock f_op, we may end up
    leaking something, but that's generally preferable to an immediate
    panic.

    Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • ...to make sparse happy.

    Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • It's best to let the compiler decide that.

    Acked-by: J. Bruce Fields
    Reported-by: Stephen Rothwell
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Acked-by: J. Bruce Fields
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • As Al Viro points out, there is an unlikely, but possible race between
    opening a file and setting a lease on it. generic_add_lease is done with
    the i_lock held, but the inode->i_flock check in break_lease is
    lockless. It's possible for another task doing an open to do the entire
    pathwalk and call break_lease between the point where generic_add_lease
    checks for a conflicting open and the point where it adds the lease to
    the list. If this occurs, we can end up with a lease set on the file
    with a conflicting open.

    To guard against that, check again for a conflicting open after adding
    the lease to the i_flock list. If the above race occurs, then we can
    simply unwind the lease setting and return -EAGAIN.

    Because we take dentry references and acquire write access on the file
    before calling break_lease, we know that if the i_flock list is empty
    when the open caller goes to check it then the necessary refcounts have
    already been incremented. Thus the additional check for a conflicting
    open will see that there is one and the setlease call will fail.

    Cc: Bruce Fields
    Cc: David Howells
    Cc: "Paul E. McKenney"
    Reported-by: Al Viro
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     

13 Nov, 2013

1 commit


09 Nov, 2013

2 commits

  • Implement NFSv4 delegations at the vfs level using the new FL_DELEG lock
    type.

    Note nfsd is the only delegation user and is only using read
    delegations. Warn on any attempt to set a write delegation for now.
    We'll come back to that case later.

    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • For now FL_DELEG is just a synonym for FL_LEASE. So this patch doesn't
    change behavior.

    Next we'll modify break_lease to treat FL_DELEG leases differently, to
    account for the fact that NFSv4 delegations should be broken in more
    situations than Windows oplocks.

    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     


08 Jul, 2013

1 commit

  • The file_lock_list is only used for /proc/locks. By far the most common
    case is for locks to be put onto the list and come off again without
    ever being traversed.

    Help optimize for this use-case by moving to percpu hlist_head-s. At the
    same time, we can make the locking less contentious by moving to an
    lglock. When iterating over the lists for /proc/locks, we must take the
    global lock and then iterate over each CPU's list in turn.

    This change necessitates a new fl_link_cpu field to keep track of which
    CPU the entry is on. On x86_64 at least, this field is placed within an
    existing hole in the struct to avoid growing the size.

    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
     


29 Jun, 2013

12 commits

  • There's no reason we have to protect the blocked_hash and file_lock_list
    with the same spinlock. With the tests I have, breaking it in two gives
    a barely measurable performance benefit, but it seems reasonable to make
    this locking as granular as possible.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Currently, the hashing that the locking code uses to add these values
    to the blocked_hash is calculated from the fl_owner field alone. That's
    valid in most cases except for server-side lockd, which validates the
    owner of a lock based on fl_owner and fl_pid.

    In the case where you have a small number of NFS clients doing a lot
    of locking between different processes, you could end up with all
    the blocked requests sitting in a very small number of hash buckets.

    Add a new lm_owner_key operation to the lock_manager_operations that
    will generate an unsigned long to use as the key in the hashtable.
    That function is only implemented for server-side lockd, and simply
    XORs the fl_owner and fl_pid.

    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Break up the blocked_list into a hashtable, using the fl_owner as a key.
    This speeds up searching the hash chains, which is especially significant
    for deadlock detection.

    Note that the initial implementation assumes that hashing on fl_owner is
    sufficient. In most cases it should be, with the notable exception being
    server-side lockd, which compares ownership using a tuple of the
    nlm_host and the pid sent in the lock request. So, this may degrade to a
    single hash bucket when you only have a single NFS client. That will be
    addressed in a later patch.

    The careful observer may note that this patch leaves the file_lock_list
    alone. There's much less of a case for turning the file_lock_list into a
    hashtable. The only user of that list is the code that generates
    /proc/locks, and it always walks the entire list.

    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Testing has shown that iterating over the blocked_list for deadlock
    detection turns out to be a bottleneck. In order to alleviate that,
    begin the process of turning it into a hashtable. We start by turning
    the fl_link into a hlist_node and the global lists into hlists. A later
    patch will do the conversion of the blocked_list to a hashtable.

    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Since we always hold the i_lock when inserting a new waiter onto the
    fl_block list, we can avoid taking the global lock at all if we find
    that it's empty when we go to wake up blocked waiters.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Having a global lock that protects all of this code is a clear
    scalability problem. Instead of doing that, move most of the code to be
    protected by the i_lock instead. The exceptions are the global lists
    that the ->fl_link sits on, and the ->fl_block list.

    ->fl_link is what connects these structures to the
    global lists, so we must ensure that we hold those locks when iterating
    over or updating these lists.

    Furthermore, sound deadlock detection requires that we hold the
    blocked_list state steady while checking for loops. We also must ensure
    that the search and update to the list are atomic.

    For the checking and insertion side of the blocked_list, push the
    acquisition of the global lock into __posix_lock_file and ensure that
    checking and update of the blocked_list is done without dropping the
    lock in between.

    On the removal side, when waking up blocked lock waiters, take the
    global lock before walking the blocked list and dequeue the waiters from
    the global list prior to removal from the fl_block list.

    With this, deadlock detection should be race free while we minimize
    excessive file_lock_lock thrashing.

    Finally, in order to avoid a lock inversion problem when handling
    /proc/locks output we must ensure that manipulations of the fl_block
    list are also protected by the file_lock_lock.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Move the fl_link list handling routines into a separate set of helpers.
    Also ensure that locks and requests are always put on the global lists
    last (after they are fully initialized) and are taken off before they
    are uninitialized.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
     
  • commit 66189be74 (CIFS: Fix VFS lock usage for oplocked files) exported
    the locks_delete_block symbol. There's already an exported helper
    function that provides this capability however, so make cifs use that
    instead and turn locks_delete_block back into a static function.

    Note that if fl->fl_next == NULL then this lock has already been through
    locks_delete_block(), so we should be OK to ignore an ENOENT error here
    and simply not retry the lock.

    Cc: Pavel Shilovsky
    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     


10 Oct, 2012

1 commit

  • This is to complete part of the Userspace API (UAPI) disintegration for which
    the preparatory patches were pulled recently. After these patches, userspace
    headers will be segregated into:

    include/uapi/linux/.../foo.h

    for the userspace interface stuff, and:

    include/linux/.../foo.h

    for the strictly kernel internal stuff.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     


21 Aug, 2012

1 commit

  • The rules for fl_type are rather convoluted. Typically it's treated as
    holding specific values, except in the case of LOCK_MAND, in which case
    it can be or'ed with LOCK_READ|LOCK_WRITE.

    On some arches F_WRLCK == 2 and F_UNLCK == 3, so and'ing with F_WRLCK
    will also catch the F_UNLCK case. It's unlikely in either case here
    that we'd ever see F_UNLCK since those shouldn't end up on any lists,
    but it's still best to be consistent.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     

02 Aug, 2012

1 commit

  • In commit 3b6e2723f32d ("locks: prevent side-effects of
    locks_release_private before file_lock is initialized") we removed the
    last user of lm_release_private without removing the field itself.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Linus Torvalds

    J. Bruce Fields
     

01 Aug, 2012

1 commit

  • Pull nfsd changes from J. Bruce Fields:
    "This has been an unusually quiet cycle--mostly bugfixes and cleanup.
    The one large piece is Stanislav's work to containerize the server's
    grace period--but that in itself is just one more step in a
    not-yet-complete project to allow fully containerized nfs service.

    There are a number of outstanding delegation, container, v4 state, and
    gss patches that aren't quite ready yet; 3.7 may be wilder."

    * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits)
    NFSd: make boot_time variable per network namespace
    NFSd: make grace end flag per network namespace
    Lockd: move grace period management from lockd() to per-net functions
    LockD: pass actual network namespace to grace period management functions
    LockD: manage grace list per network namespace
    SUNRPC: service request network namespace helper introduced
    NFSd: make nfsd4_manager allocated per network namespace context.
    LockD: make lockd manager allocated per network namespace
    LockD: manage grace period per network namespace
    Lockd: add more debug to host shutdown functions
    Lockd: host complaining function introduced
    LockD: manage used host count per networks namespace
    LockD: manage garbage collection timeout per networks namespace
    LockD: make garbage collector network namespace aware.
    LockD: mark host per network namespace on garbage collect
    nfsd4: fix missing fault_inject.h include
    locks: move lease-specific code out of locks_delete_lock
    locks: prevent side-effects of locks_release_private before file_lock is initialized
    NFSd: set nfsd_serv to NULL after service destruction
    NFSd: introduce nfsd_destroy() helper
    ...

    Linus Torvalds