09 Dec, 2014

1 commit


20 Nov, 2014

4 commits


04 Nov, 2014

2 commits


24 Oct, 2014

1 commit

  • d_splice_alias() callers expect it to either stash the inode reference
    into a new alias, or drop the inode reference. That makes it possible
    to just return d_splice_alias() result from ->lookup() instance, without
    any extra housekeeping required.

    Unfortunately, that should include the failure exits. If d_splice_alias()
    returns an error, it leaves the dentry it has been given negative and
    thus it *must* drop the inode reference. Easily fixed, but it goes way
    back and will need backporting.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

13 Oct, 2014

1 commit


09 Oct, 2014

9 commits

  • Fixed coding style in dcache.c

    Signed-off-by: Daeseok Youn
    Signed-off-by: Al Viro

    Daeseok Youn
     
  • the only in-tree instance checks d_unhashed() anyway,
    out-of-tree code can preserve the current behaviour by
    adding such check if they want it and we get an ability
    to use it in cases where we *want* to be notified of
    killing being inevitable before ->d_lock is dropped,
    whether it's unhashed or not. In particular, autofs
    would benefit from that.

    Signed-off-by: Al Viro

    Al Viro
     
  • The only reason for games with ->d_prune() was __d_drop(), which
    was needed only to force dput() into killing the sucker off.

    Note that lock_parent() can be called under ->i_lock and won't
    drop it, so dentry is safe from somebody managing to kill it
    under us - it won't happen while we are holding ->i_lock.

    __dentry_kill() is called only with ->d_lockref.count being 0
    (here and when picked from shrink list) or 1 (dput() and dropping
    the ancestors in shrink_dentry_list()), so it will never be called
    twice - the first thing it's doing is making ->d_lockref.count
    negative and once that happens, nothing will increment it.

    Signed-off-by: Al Viro

    Al Viro
     
  • Now that d_invalidate can no longer fail, stop returning a useless
    return code. For the few callers that checked the return code update
    remove the handling of d_invalidate failure.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • Now that d_invalidate is the only caller of check_submounts_and_drop,
    expand check_submounts_and_drop inline in d_invalidate.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • With the introduction of mount namespaces and bind mounts it became
    possible to access files and directories that on some paths are mount
    points but are not mount points on other paths. It is very confusing
    when rm -rf somedir returns -EBUSY simply because somedir is mounted
    somewhere else. With the addition of user namespaces allowing
    unprivileged mounts this condition has gone from annoying to allowing
    a DOS attack on other users in the system.

    The possibility for mischief is removed by updating the vfs to support
    rename, unlink and rmdir on a dentry that is a mountpoint and by
    lazily unmounting mountpoints on deleted dentries.

    In particular this change allows rename, unlink and rmdir system calls
    on a dentry without a mountpoint in the current mount namespace to
    succeed, and it allows rename, unlink, and rmdir performed on a
    distributed filesystem to update the vfs cache even if when there is a
    mount in some namespace on the original dentry.

    There are two common patterns of maintaining mounts: Mounts on trusted
    paths with the parent directory of the mount point and all ancestory
    directories up to / owned by root and modifiable only by root
    (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
    cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
    maintained by fusermount.

    In the case of mounts in trusted directories owned by root and
    modifiable only by root the current parent directory permissions are
    sufficient to ensure a mount point on a trusted path is not removed
    or renamed by anyone other than root, even if there is a context
    where the there are no mount points to prevent this.

    In the case of mounts in directories owned by less privileged users
    races with users modifying the path of a mount point are already a
    danger. fusermount already uses a combination of chdir,
    /proc//fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
    removable of global rename, unlink, and rmdir protection really adds
    nothing new to consider only a widening of the attack window, and
    fusermount is already safe against unprivileged users modifying the
    directory simultaneously.

    In principle for perfect userspace programs returning -EBUSY for
    unlink, rmdir, and rename of dentires that have mounts in the local
    namespace is actually unnecessary. Unfortunately not all userspace
    programs are perfect so retaining -EBUSY for unlink, rmdir and rename
    of dentries that have mounts in the current mount namespace plays an
    important role of maintaining consistency with historical behavior and
    making imperfect userspace applications hard to exploit.

    v2: Remove spurious old_dentry.
    v3: Optimized shrink_submounts_and_drop
    Removed unsued afs label
    v4: Simplified the changes to check_submounts_and_drop
    Do not rename check_submounts_and_drop shrink_submounts_and_drop
    Document what why we need atomicity in check_submounts_and_drop
    Rely on the parent inode mutex to make d_revalidate and d_invalidate
    an atomic unit.
    v5: Refcount the mountpoint to detach in case of simultaneous
    renames.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • The current comments in d_invalidate about what and why it is doing
    what it is doing are wildly off-base. Which is not surprising as
    the comments date back to last minute bug fix of the 2.2 kernel.

    The big fat lie of a comment said: If it's a directory, we can't drop
    it for fear of somebody re-populating it with children (even though
    dropping it would make it unreachable from that root, we still might
    repopulate it if it was a working directory or similar).

    [AV] What we really need to avoid is multiple dentry aliases of the
    same directory inode; on all filesystems that have ->d_revalidate()
    we either declare all positive dentries always valid (and thus never
    fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(),
    which take care of alias prevention.

    The current rules are:
    - To prevent mount point leaks dentries that are mount points or that
    have childrent that are mount points may not be be unhashed.
    - All dentries may be unhashed.
    - Directories may be rehashed with d_materialise_unique

    check_submounts_and_drop implements this already for well maintained
    remote filesystems so implement the current rules in d_invalidate
    by just calling check_submounts_and_drop.

    The one difference between d_invalidate and check_submounts_and_drop
    is that d_invalidate must respect it when a d_revalidate method has
    earlier called d_drop so preserve the d_unhashed check in
    d_invalidate.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • d_drop or check_submounts_and_drop called from d_revalidate can result
    in renamed directories with child dentries being unhashed. These
    renamed and drop directory dentries can be rehashed after
    d_materialise_unique uses d_find_alias to find them.

    Reviewed-by: Miklos Szeredi
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     
  • * external dentry names get a small structure prepended to them
    (struct external_name).
    * it contains an atomic refcount, matching the number of struct dentry
    instances that have ->d_name.name pointing to that external name. The
    first thing free_dentry() does is decrementing refcount of external name,
    so the instances that are between the call of free_dentry() and
    RCU-delayed actual freeing do not contribute.
    * __d_move(x, y, false) makes the name of x equal to the name of y,
    external or not. If y has an external name, extra reference is grabbed
    and put into x->d_name.name. If x used to have an external name, the
    reference to the old name is dropped and, should it reach zero, freeing
    is scheduled via kfree_rcu().
    * free_dentry() in dentry with external name decrements the refcount of
    that name and, should it reach zero, does RCU-delayed call that will
    free both the dentry and external name. Otherwise it does what it
    used to do, except that __d_free() doesn't even look at ->d_name.name;
    it simply frees the dentry.

    All non-RCU accesses to dentry external name are safe wrt freeing since they
    all should happen before free_dentry() is called. RCU accesses might run
    into a dentry seen by free_dentry() or into an old name that got already
    dropped by __d_move(); however, in both cases dentry must have been
    alive and refer to that name at some point after we'd done rcu_read_lock(),
    which means that any freeing must be still pending.

    Signed-off-by: Al Viro

    Al Viro
     

30 Sep, 2014

1 commit

  • AFAICS, prepend_name() is broken on SMP alpha. Disclaimer: I don't have
    SMP alpha boxen to reproduce it on. However, it really looks like the race
    is real.

    CPU1: d_path() on /mnt/ramfs//foo
    CPU2: mv /mnt/ramfs/ /mnt/ramfs/

    CPU2 does d_alloc(), which allocates an external name, stores the name there
    including terminating NUL, does smp_wmb() and stores its address in
    dentry->d_name.name. It proceeds to d_add(dentry, NULL) and d_move()
    old dentry over to that. ->d_name.name value ends up in that dentry.

    In the meanwhile, CPU1 gets to prepend_name() for that dentry. It fetches
    ->d_name.name and ->d_name.len; the former ends up pointing to new name
    (64-byte kmalloc'ed array), the latter - 255 (length of the old name).
    Nothing to force the ordering there, and normally that would be OK, since we'd
    run into the terminating NUL and stop. Except that it's alpha, and we'd need
    a data dependency barrier to guarantee that we see that store of NUL
    __d_alloc() has done. In a similar situation dentry_cmp() would survive; it
    does explicit smp_read_barrier_depends() after fetching ->d_name.name.
    prepend_name() doesn't and it risks walking past the end of kmalloc'ed object
    and possibly oops due to taking a page fault in kernel mode.

    Cc: stable@vger.kernel.org # 3.12+
    Signed-off-by: Al Viro

    Al Viro
     

28 Sep, 2014

2 commits

  • Only exchange source and destination filenames
    if flags contain RENAME_EXCHANGE.
    In case if executable file was running and replaced by
    other file /proc/PID/exe should still show correct file name,
    not the old name of the file by which it was replaced.

    The scenario when this bug manifests itself was like this:
    * ALT Linux uses rpm and start-stop-daemon;
    * during a package upgrade rpm creates a temporary file
    for an executable to rename it upon successful unpacking;
    * start-stop-daemon is run subsequently and it obtains
    the (nonexistant) temporary filename via /proc/PID/exe
    thus failing to identify the running process.

    Note that "long" filenames (> DNAiME_INLINE_LEN) are still
    exchanged without RENAME_EXCHANGE and this behaviour exists
    long enough (should be fixed too apparently).
    So this patch is just an interim workaround that restores
    behavior for "short" names as it was before changes
    introduced by commit da1ce0670c14 ("vfs: add cross-rename").

    See https://lkml.org/lkml/2014/9/7/6 for details.

    AV: the comments about being more careful with ->d_name.hash
    than with ->d_name.name are from back in 2.3.40s; they
    became obsolete by 2.3.60s, when we started to unhash the
    target instead of swapping hash chain positions followed
    by d_delete() as we used to do when dcache was first
    introduced.

    Acked-by: Miklos Szeredi
    Cc: Linus Torvalds
    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: da1ce0670c14 "vfs: add cross-rename"
    Signed-off-by: Mikhail Efremov
    Signed-off-by: Al Viro

    Mikhail Efremov
     
  • and do it along with ->d_name.len there

    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Linus Torvalds
     

27 Sep, 2014

6 commits


15 Sep, 2014

1 commit

  • Pull vfs fixes from Al Viro:
    "double iput() on failure exit in lustre, racy removal of spliced
    dentries from ->s_anon in __d_materialise_dentry() plus a bunch of
    assorted RCU pathwalk fixes"

    The RCU pathwalk fixes end up fixing a couple of cases where we
    incorrectly dropped out of RCU walking, due to incorrect initialization
    and testing of the sequence locks in some corner cases. Since dropping
    out of RCU walk mode forces the slow locked accesses, those corner cases
    slowed down quite dramatically.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    be careful with nd->inode in path_init() and follow_dotdot_rcu()
    don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()
    fix bogus read_seqretry() checks introduced in b37199e
    move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)
    [fix] lustre: d_make_root() does iput() on dentry allocation failure

    Linus Torvalds
     

14 Sep, 2014

2 commits

  • and lock the right list there

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • Josef Bacik found a performance regression between 3.2 and 3.10 and
    narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long'
    accesses for dcache name comparison and hashing"). He reports:

    "The test case is essentially

    for (i = 0; i < 1000000; i++)
    mkdir("a$i");

    On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
    dir/sec with 3.10. This is because we spend waaaaay more time in
    __d_lookup on 3.10 than in 3.2.

    The new hashing function for strings is suboptimal for <
    sizeof(unsigned long) string names (and hell even > sizeof(unsigned
    long) string names that I've tested). I broke out the old hashing
    function and the new one into a userspace helper to get real numbers
    and this is what I'm getting:

    Old hash table had 1000000 entries, 0 dupes, 0 max dupes
    New hash table had 12628 entries, 987372 dupes, 900 max dupes
    We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash

    My test does the hash, and then does the d_hash into a integer pointer
    array the same size as the dentry hash table on my system, and then
    just increments the value at the address we got to see how many
    entries we overlap with.

    As you can see the old hash function ended up with all 1 million
    entries in their own bucket, whereas the new one they are only
    distributed among ~12.5k buckets, which is why we're using so much
    more CPU in __d_lookup".

    The reason for this hash regression is two-fold:

    - On 64-bit architectures the down-mixing of the original 64-bit
    word-at-a-time hash into the final 32-bit hash value is very
    simplistic and suboptimal, and just adds the two 32-bit parts
    together.

    In particular, because there is no bit shuffling and the mixing
    boundary is also a byte boundary, similar character patterns in the
    low and high word easily end up just canceling each other out.

    - the old byte-at-a-time hash mixed each byte into the final hash as it
    hashed the path component name, resulting in the low bits of the hash
    generally being a good source of hash data. That is not true for the
    word-at-a-time case, and the hash data is distributed among all the
    bits.

    The fix is the same in both cases: do a better job of mixing the bits up
    and using as much of the hash data as possible. We already have the
    "hash_32|64()" functions to do that.

    Reported-by: Josef Bacik
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Aug, 2014

9 commits

  • Signed-off-by: Fengguang Wu
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Fengguang Wu
     
  • I believe this can only happen in the case of a corrupted filesystem.
    So -EIO looks like the appropriate error.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • If we get to this point and discover the dentry is not a root dentry, or
    not DCACHE_DISCONNECTED--great, we always prefer that anyway.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • There are a few d_obtain_alias callers that are using it to get the
    root of a filesystem which may already have an alias somewhere else.

    This is not the same as the filehandle-lookup case, and none of them
    actually need DCACHE_DISCONNECTED set.

    It isn't really a serious problem, but it would really be clearer if we
    reserved DCACHE_DISCONNECTED for those cases where it's actually needed.

    In the btrfs case this was causing a spurious printk from
    nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
    dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED
    manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting
    default subvol", and this replaces that workaround.

    Cc: Josef Bacik
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Any IS_ROOT() alias should be safe to use; there's nothing special about
    DCACHE_DISCONNECTED dentries.

    Note that this is in fact useful for filesystems such as btrfs which can
    legimately encounter a directory with a preexisting IS_ROOT alias on a
    lookup that crosses into a subvolume. (Those aliases are currently
    marked DCACHE_DISCONNECTED--but not really for any good reason, and
    we'll change that soon.)

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Currently if d_splice_alias finds a directory with an alias that is not
    IS_ROOT or not DCACHE_DISCONNECTED, it creates a duplicate directory.

    Duplicate directory dentries are unacceptable; it is better just to
    error out.

    (In the case of a local filesystem the most likely case is filesystem
    corruption: for example, perhaps two directories point to the same child
    directory, and the other parent has already been found and cached.)

    Note that distributed filesystems may encounter this case in normal
    operation if a remote host moves a directory to a location different
    from the one we last cached in the dcache. For that reason, such
    filesystems should instead use d_materialise_unique, which tries to move
    the old directory alias to the right place instead of erroring out.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • d_splice_alias will d_move an IS_ROOT() directory dentry into place if
    one exists. This should be safe as long as the dentry remains IS_ROOT,
    but I can't see what guarantees that: once we drop the i_lock all we
    hold here is the i_mutex on an unrelated parent directory.

    Instead copy the logic of d_materialise_unique.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Just a trivial move to locate it near (similar) d_materialise_unique
    code and save some forward references in a following patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds