09 Oct, 2014

40 commits

  • make it return dentry instead of inode

    Acked-by: Felipe Balbi
    Signed-off-by: Al Viro

    Al Viro
     
  • hlist_add_fake(inode->i_hash), same as for the rest of special ones...

    Signed-off-by: Al Viro

    Al Viro
     
  • The only way we can get to that function is from misc_open(), after
    the latter has set file->f_op to exactly the same value we are
    (re)assigning there.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • The gcc 4.9.1 compiler complains about the following, even though these
    variables are always initialized before they are used:

    fs/namespace.c: In function ‘SyS_mount’:
    fs/namespace.c:2720:8: warning: ‘kernel_dev’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
    ^
    fs/namespace.c:2699:8: note: ‘kernel_dev’ was declared here
    char *kernel_dev;
    ^
    fs/namespace.c:2720:8: warning: ‘kernel_type’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
    ^
    fs/namespace.c:2697:8: note: ‘kernel_type’ was declared here
    char *kernel_type;
    ^

    Fix the warnings by simplifying copy_mount_string() as suggested by Al Viro.
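
    For illustration only, here is a userspace sketch of the general pattern,
    assuming the simplification means returning the copied string (or an
    error pointer) directly instead of filling an out-parameter, so every
    path visibly assigns the result. copy_string(), err_ptr(), is_err() and
    ptr_err() are hypothetical stand-ins, not the kernel's
    copy_mount_string(), ERR_PTR(), IS_ERR() and PTR_ERR().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins for ERR_PTR()/IS_ERR()/PTR_ERR():
 * the top MAX_ERRNO addresses encode negative errno values. */
#define MAX_ERRNO 4095
static inline void *err_ptr(long err) { return (void *)err; }
static inline int is_err(const void *p) { return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO; }
static inline long ptr_err(const void *p) { return (long)p; }

/* Hypothetical helper: returns a heap copy of src, NULL if src is NULL,
 * or an error pointer. The caller's variable is assigned on every path,
 * so there is no "maybe uninitialized" out-parameter to warn about. */
static char *copy_string(const char *src, size_t max)
{
	const char *nul;
	char *s;

	if (!src)
		return NULL;			/* nothing to copy: not an error */
	nul = memchr(src, '\0', max);
	if (!nul)
		return err_ptr(-EINVAL);	/* no terminator within max bytes */
	s = malloc((size_t)(nul - src) + 1);
	if (!s)
		return err_ptr(-ENOMEM);
	memcpy(s, src, (size_t)(nul - src) + 1);
	return s;
}
```

    A caller then checks is_err() on the returned pointer rather than on a
    separate return code.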

    Cc: Alexander Viro
    Signed-off-by: Tim Gardner
    Signed-off-by: Al Viro

    Tim Gardner
     
  • That loop in there is both anti-idiomatic *and* completely pointless.
    strtoll() is there for a reason; use it and compare what's left with
    the acceptable suffixes.
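
    A minimal userspace sketch of the idea, assuming a base-10 number
    followed by an optional k/m/g binary suffix. The name parse_size() and
    the suffix table are hypothetical; the commit message does not say which
    suffixes the patched code accepts.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical suffix table: "" means a plain number. */
static const struct {
	const char *suffix;
	long long mult;
} size_suffixes[] = {
	{ "",  1LL },
	{ "k", 1LL << 10 },
	{ "m", 1LL << 20 },
	{ "g", 1LL << 30 },
};

/* Parse "<number><suffix>" into *out. Returns 0 on success, -EINVAL
 * otherwise. Overflow of the final multiplication is not checked here. */
static int parse_size(const char *s, long long *out)
{
	char *end;
	long long val;
	size_t i;

	errno = 0;
	val = strtoll(s, &end, 10);	/* strtoll() does the digit loop */
	if (end == s || errno == ERANGE)
		return -EINVAL;		/* no digits at all, or out of range */

	/* Compare whatever strtoll() left behind with the accepted suffixes. */
	for (i = 0; i < sizeof(size_suffixes) / sizeof(size_suffixes[0]); i++) {
		if (strcmp(end, size_suffixes[i].suffix) == 0) {
			*out = val * size_suffixes[i].mult;
			return 0;
		}
	}
	return -EINVAL;
}
```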

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • ... not to mention that even atomic_long_read() is too low-level here -
    there's file_count().

    Signed-off-by: Al Viro

    Al Viro
     
  • check with the author of that horror...

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • This patch makes it possible to kill a process looping in
    cont_expand_zero. A process may spend a lot of time in this function, so
    it is desirable to be able to kill it.

    It happened to me that I wanted to copy a piece of data from the disk to a
    file. By mistake, I used the "seek" parameter to dd instead of "skip". Due
    to the "seek" parameter, dd attempted to extend the file and became stuck
    doing so; the only options were to reset the machine or to wait many
    hours until the filesystem ran out of space and cont_expand_zero failed.
    We need this patch to be able to terminate the process.
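
    A userspace sketch of the same loop shape, assuming a hypothetical stop
    callback in place of a check such as the kernel's fatal_signal_pending():
    the loop tests for a pending kill once per page-sized step and bails out
    with -EINTR. expand_zero() and stop_after() are illustrative names, and
    the actual writing of zeroes is elided.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "has this task been killed?". */
typedef bool (*stop_check_fn)(void *ctx);

/* Extend from *pos to target in page-sized steps, returning -EINTR as
 * soon as the stop check fires, or 0 once target is reached. */
static int expand_zero(size_t *pos, size_t target, size_t page_size,
		       stop_check_fn stop, void *ctx)
{
	while (*pos < target) {
		if (stop(ctx))
			return -EINTR;	/* let the killed task exit promptly */
		size_t step = target - *pos < page_size ? target - *pos : page_size;
		*pos += step;		/* a real implementation writes zeroes here */
	}
	return 0;
}

/* Example stop check: lets *(int *)ctx steps through, then fires. */
static bool stop_after(void *ctx)
{
	int *remaining = ctx;

	return (*remaining)-- <= 0;
}
```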

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Mikulas Patocka
     
  • For DAX, we want to be able to copy between iovecs and kernel addresses
    that don't necessarily have a struct page. This is a fairly simple
    rearrangement for bvec iters to kmap the pages outside and pass them in,
    but for user iovecs it gets more complicated because we might try various
    different ways to kmap the memory. Duplicating the existing logic works
    out best in this case.

    We need to be able to write zeroes to an iovec for reads from unwritten
    ranges in a file. This is performed by the new iov_iter_zero() function,
    again patterned after the existing code that handles iovec iterators.
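
    To make the zero-filling concrete, here is a userspace analogue over a
    plain array of POSIX struct iovec rather than the kernel's struct
    iov_iter; iovec_zero() is a hypothetical name and this is not the
    iov_iter_zero() implementation. It walks the segments in order and
    zeroes at most "bytes" bytes in total.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Zero up to 'bytes' bytes across the segments, in order. Returns the
 * number of bytes zeroed, which is smaller than 'bytes' if the segments
 * run out first. */
static size_t iovec_zero(const struct iovec *iov, size_t nr_segs, size_t bytes)
{
	size_t done = 0;

	for (size_t seg = 0; seg < nr_segs && done < bytes; seg++) {
		size_t chunk = iov[seg].iov_len;

		if (chunk > bytes - done)
			chunk = bytes - done;	/* only part of this segment */
		memset(iov[seg].iov_base, 0, chunk);
		done += chunk;
	}
	return done;
}
```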

    [AV: and export the buggers...]

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Al Viro

    Matthew Wilcox
     
  • total_objects could be 0 and is used as a denominator.

    While total_objects is a "long", total_objects == 0 is unlikely to happen
    on 3.12 and later kernels, because 32-bit architectures (where "long" is
    32 bits wide) would not be able to hold (1 << 32) objects. However,
    total_objects == 0 may happen on kernels between 3.1 and 3.11, because
    total_objects in prune_super() was an "int" and (e.g.) the x86_64
    architecture might be able to hold (1 << 32) objects, a count that wraps
    around to 0 in 32 bits.
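
    A small sketch of both halves of the problem, under illustrative
    assumptions: truncate_count() shows a count of exactly 1 << 32 collapsing
    to 0 when narrowed to 32 bits (uint32_t is used so the wrap is
    well-defined C; the old code used a plain int), and scan_share() is a
    hypothetical proportional split that guards its denominator. Neither
    function is taken from the patch.

```c
#include <stdint.h>

/* Narrowing a 64-bit object count to 32 bits: 2^32 becomes 0. */
static uint32_t truncate_count(uint64_t total)
{
	return (uint32_t)total;
}

/* Hypothetical proportional split: nr_to_scan * part / total, with the
 * zero-denominator case handled before dividing. */
static long scan_share(long nr_to_scan, long part, long total)
{
	if (total == 0)
		return 0;	/* nothing to divide by: scan nothing */
	return nr_to_scan * part / total;
}
```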

    Signed-off-by: Tetsuo Handa
    Reviewed-by: Christoph Hellwig
    Cc: stable # 3.1+
    Signed-off-by: Al Viro

    Tetsuo Handa
     
  • Fixed coding style in dcache.c

    Signed-off-by: Daeseok Youn
    Signed-off-by: Al Viro

    Daeseok Youn
     
  • schedule_delayed_work() happening when the work is already pending is
    a cheap no-op. Don't bother with the ->wbuf_queued logic; it's both
    broken (cancelling ->wbuf_dwork leaves it set, as spotted by Jeff Harris)
    and pointless. It's cheaper to let schedule_delayed_work() handle that
    case.
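
    A toy, single-threaded model of why the extra flag is redundant.
    toy_schedule() mimics the documented behaviour of schedule_delayed_work()
    (it returns false and does nothing when the work is already pending),
    and toy_cancel() clears the pending state. The struct and function names
    are hypothetical; this is not the workqueue API.

```c
#include <stdbool.h>

/* 'pending' plays the role of the workqueue's own pending state. */
struct toy_dwork {
	bool pending;
	int queued_count;	/* how many times the work was really queued */
};

/* Queue the work unless it is already pending (then it's a no-op). */
static bool toy_schedule(struct toy_dwork *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->queued_count++;
	return true;
}

/* Cancelling clears the pending state, so the next schedule queues again. */
static void toy_cancel(struct toy_dwork *w)
{
	w->pending = false;
}
```

    A separate driver-side flag would duplicate "pending"; if cancelling
    does not also clear it, later schedules get skipped, which is the bug
    described above.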

    Reported-by: Jeff Harris
    Tested-by: Jeff Harris
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • The function which calls s_op->alloc_inode() is not inode_alloc(), but
    alloc_inode(), which lives in fs/inode.c.

    The typo has been there from the beginning, since 5ea626aa (VFS: update
    documentation, 2005); there has never been a standalone inode_alloc()
    in the whole kernel history.

    Cc: Pekka Enberg
    Signed-off-by: Kirill Smelkov
    Signed-off-by: Al Viro

    Kirill Smelkov
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • ... rather than doing that in the guts of ->load_binary().
    [updated to fix the bug spotted by Shentino - for SIGSEGV we really need
    something stronger than send_sig_info(); again, better do that in one place]

    Signed-off-by: Al Viro

    Al Viro
     
  • The only in-tree instance checks d_unhashed() anyway;
    out-of-tree code can preserve the current behaviour by
    adding such a check if it wants it. In exchange we get the
    ability to use it in cases where we *want* to be notified of
    killing being inevitable before ->d_lock is dropped,
    whether the dentry is unhashed or not. In particular, autofs
    would benefit from that.

    Signed-off-by: Al Viro

    Al Viro
     
  • The only reason for games with ->d_prune() was __d_drop(), which
    was needed only to force dput() into killing the sucker off.

    Note that lock_parent() can be called under ->i_lock and won't
    drop it, so dentry is safe from somebody managing to kill it
    under us - it won't happen while we are holding ->i_lock.

    __dentry_kill() is called only with ->d_lockref.count being 0
    (here and when picked from shrink list) or 1 (dput() and dropping
    the ancestors in shrink_dentry_list()), so it will never be called
    twice - the first thing it's doing is making ->d_lockref.count
    negative and once that happens, nothing will increment it.
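
    A toy, single-threaded model of that invariant, loosely following the
    kernel's lockref convention (lockref_mark_dead() and
    lockref_get_not_dead()): killing is allowed only from a count of 0 or 1,
    it makes the count negative, and nothing takes a reference afterwards.
    Locking is omitted; struct toy_lockref and the toy_* helpers are
    hypothetical.

```c
#include <stdbool.h>

struct toy_lockref {
	int count;
};

/* Any negative count means "being killed". */
static void toy_mark_dead(struct toy_lockref *l)
{
	l->count = -128;
}

/* Take a reference only if the object is not dead. */
static bool toy_get_not_dead(struct toy_lockref *l)
{
	if (l->count < 0)
		return false;
	l->count++;
	return true;
}

/* Kill only from count 0 or 1; a second kill sees a negative count. */
static bool toy_kill(struct toy_lockref *l)
{
	if (l->count < 0 || l->count > 1)
		return false;	/* already dead, or still referenced */
	toy_mark_dead(l);
	return true;
}
```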

    Signed-off-by: Al Viro

    Al Viro
     
  • Now that d_invalidate always succeeds and flushes mount points, use
    it instead of a combination of shrink_dcache_parent and d_drop
    in proc_flush_task_mnt. This removes the danger of a mount point
    under /proc//... becoming unreachable after the d_drop.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • Now that d_invalidate always succeeds, it is no longer necessary or
    desirable to hard code d_drop calls into filesystem specific
    d_revalidate implementations.

    Remove the unnecessary d_drop calls and rely on d_invalidate
    to drop the dentries. Using d_invalidate ensures that paths
    to mount points will not be dropped.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • Now that d_invalidate can no longer fail, stop returning a useless
    return code. For the few callers that checked the return code,
    remove the handling of d_invalidate failure.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • Now that d_invalidate is the only caller of check_submounts_and_drop,
    expand check_submounts_and_drop inline in d_invalidate.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • Now that check_submounts_and_drop cannot fail and is called from
    d_invalidate, there is no longer a need to call check_submounts_and_drop
    from filesystem d_revalidate methods, so remove it.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • With the introduction of mount namespaces and bind mounts it became
    possible to access files and directories that on some paths are mount
    points but are not mount points on other paths. It is very confusing
    when rm -rf somedir returns -EBUSY simply because somedir is mounted
    somewhere else. With the addition of user namespaces allowing
    unprivileged mounts this condition has gone from annoying to allowing
    a DoS attack on other users of the system.

    The possibility for mischief is removed by updating the vfs to support
    rename, unlink and rmdir on a dentry that is a mountpoint and by
    lazily unmounting mountpoints on deleted dentries.

    In particular this change allows rename, unlink and rmdir system calls
    on a dentry without a mountpoint in the current mount namespace to
    succeed, and it allows rename, unlink, and rmdir performed on a
    distributed filesystem to update the vfs cache even when there is a
    mount in some namespace on the original dentry.

    There are two common patterns of maintaining mounts: Mounts on trusted
    paths with the parent directory of the mount point and all ancestor
    directories up to / owned by root and modifiable only by root
    (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
    cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
    maintained by fusermount.

    In the case of mounts in trusted directories owned by root and
    modifiable only by root the current parent directory permissions are
    sufficient to ensure a mount point on a trusted path is not removed
    or renamed by anyone other than root, even if there is a context
    where there are no mount points to prevent this.

    In the case of mounts in directories owned by less privileged users
    races with users modifying the path of a mount point are already a
    danger. fusermount already uses a combination of chdir,
    /proc//fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
    removal of global rename, unlink, and rmdir protection really adds
    nothing new to consider, only a widening of the attack window, and
    fusermount is already safe against unprivileged users modifying the
    directory simultaneously.

    In principle, for perfect userspace programs, returning -EBUSY for
    unlink, rmdir, and rename of dentries that have mounts in the local
    namespace is actually unnecessary. Unfortunately not all userspace
    programs are perfect so retaining -EBUSY for unlink, rmdir and rename
    of dentries that have mounts in the current mount namespace plays an
    important role of maintaining consistency with historical behavior and
    making imperfect userspace applications hard to exploit.

    v2: Remove spurious old_dentry.
    v3: Optimized shrink_submounts_and_drop
        Removed unused afs label
    v4: Simplified the changes to check_submounts_and_drop
        Do not rename check_submounts_and_drop to shrink_submounts_and_drop
        Document why we need atomicity in check_submounts_and_drop
        Rely on the parent inode mutex to make d_revalidate and d_invalidate
        an atomic unit.
    v5: Refcount the mountpoint to detach in case of simultaneous
        renames.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • The new function detach_mounts comes in two pieces. The first piece
    is a static inline test of d_mountpoint that returns immediately
    without taking any locks if d_mountpoint is not set. In the common
    case, when mountpoints are absent, this allows the vfs to continue
    running with its same cacheline footprint.

    The second piece, __detach_mounts, actually does the work. It assumes
    that a mountpoint is present, so it is slow: it takes namespace_sem
    for write, and then locks the mount hash (aka mount_lock) after a
    struct mountpoint has been found.

    With those two locks held, each entry on the list of mounts on a
    mountpoint is selected and lazily unmounted until all of the mounts
    have been lazily unmounted.
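
    A toy model of the two-piece structure, assuming hypothetical struct
    toy_dentry fields in place of d_mountpoint() and the mountpoint's list
    of mounts. The locking done by the real __detach_mounts() is only
    described in a comment, and slow_path_calls exists just to show when the
    slow path runs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a dentry and its mounts. */
struct toy_dentry {
	bool is_mountpoint;	/* plays the role of d_mountpoint() */
	int nr_mounts;		/* mounts sitting on this dentry */
};

static int slow_path_calls;

/* Slow path: the real code takes namespace_sem for write and mount_lock
 * here, then lazily unmounts each mount on the mountpoint. */
static void toy_detach_mounts_slow(struct toy_dentry *d)
{
	slow_path_calls++;
	while (d->nr_mounts > 0)
		d->nr_mounts--;		/* one lazy unmount per iteration */
	d->is_mountpoint = false;
}

/* Fast path: a cheap inline test, no locks in the common case. */
static inline void toy_detach_mounts(struct toy_dentry *d)
{
	if (!d->is_mountpoint)
		return;
	toy_detach_mounts_slow(d);
}
```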

    v7: Wrote a proper change description and removed the changelog
    documenting deleted wrong turns.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • I am shortly going to add a new user of struct mountpoint that
    needs to look up existing entries but does not want to create
    a struct mountpoint if one does not exist. Therefore to keep
    the code simple and easy to read split out lookup_mountpoint
    from new_mountpoint.
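
    A toy sketch of the split, assuming a simple linked list keyed by an
    integer in place of the real mountpoint hash keyed by dentry: the lookup
    helper never allocates, and the create helper is built on top of it. All
    names are hypothetical. Taking a reference on a successful lookup is
    modelled on our understanding of lookup_mountpoint(); treat it as an
    assumption.

```c
#include <stdlib.h>

/* Hypothetical entry keyed by an integer standing in for a dentry. */
struct toy_mp {
	int key;
	int refcount;
	struct toy_mp *next;
};

static struct toy_mp *mp_list;

/* Lookup only: return an existing entry with a new reference, or NULL. */
static struct toy_mp *toy_lookup_mountpoint(int key)
{
	for (struct toy_mp *mp = mp_list; mp; mp = mp->next) {
		if (mp->key == key) {
			mp->refcount++;
			return mp;
		}
	}
	return NULL;
}

/* Lookup-or-create, reusing the lookup helper. */
static struct toy_mp *toy_new_mountpoint(int key)
{
	struct toy_mp *mp = toy_lookup_mountpoint(key);

	if (mp)
		return mp;
	mp = malloc(sizeof(*mp));
	if (!mp)
		return NULL;
	mp->key = key;
	mp->refcount = 1;
	mp->next = mp_list;
	mp_list = mp;
	return mp;
}
```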

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman