20 Jan, 2017

1 commit

  • commit 3895dbf8985f656675b5bde610723a29cbce3fa7 upstream.

    Protecting the mountpoint hashtable with namespace_sem was sufficient
    until a call to umount_mnt was added to mntput_no_expire. At that
    point it became possible for multiple calls of put_mountpoint on
    the same hash chain to happen at the same time.

    Krister Johansen reported:
    > This can cause a panic when simultaneous callers of put_mountpoint
    > attempt to free the same mountpoint. This occurs because some callers
    > hold the mount_hash_lock, while others hold the namespace lock. Some
    > even hold both.
    >
    > In this submitter's case, the panic manifested itself as a GP fault in
    > put_mountpoint() when it called hlist_del() and attempted to dereference
    > a m_hash.pprev that had been poisoned by another thread.

    Al Viro observed that the simple fix is to switch from using the namespace_sem
    to the mount_lock to protect the mountpoint hash table.

    I have taken Al's suggested patch, moved put_mountpoint in pivot_root
    (instead of taking mount_lock an additional time), and have replaced
    new_mountpoint with get_mountpoint, a function that does the hash table
    lookup and addition under the mount_lock. The introduction of
    get_mountpoint ensures that only the mount_lock is needed to manipulate
    the mountpoint hashtable.

    d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
    already set. This allows get_mountpoint to use the setting of
    DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
    happens exactly once.

    Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
    Reported-by: Krister Johansen
    Suggested-by: Al Viro
    Acked-by: Al Viro
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

07 Aug, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    In the "trivial API change" department - ->d_compare() losing 'parent'
    argument"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    cachefiles: Fix race between inactivating and culling a cache object
    9p: use clone_fid()
    9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
    vfs: make dentry_needs_remove_privs() internal
    vfs: remove file_needs_remove_privs()
    vfs: fix deadlock in file_remove_privs() on overlayfs
    get rid of 'parent' argument of ->d_compare()
    cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
    affs ->d_compare(): don't bother with ->d_inode
    fold _d_rehash() and __d_rehash() together
    fold dentry_rcuwalk_invalidate() into its only remaining caller

    Linus Torvalds
     

06 Aug, 2016

1 commit

  • Pull qstr constification updates from Al Viro:
    "Fairly self-contained bunch - surprising lot of places passes struct
    qstr * as an argument when const struct qstr * would suffice; it
    complicates analysis for no good reason.

    I'd prefer to feed that separately from the assorted fixes (those are
    in #for-linus and with somewhat trickier topology)"

    * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    qstr: constify instances in adfs
    qstr: constify instances in lustre
    qstr: constify instances in f2fs
    qstr: constify instances in ext2
    qstr: constify instances in vfat
    qstr: constify instances in procfs
    qstr: constify instances in fuse
    qstr: constify instances in fs/dcache.c
    qstr: constify instances in nfs
    qstr: constify instances in ocfs2
    qstr: constify instances in autofs4
    qstr: constify instances in hfs
    qstr: constify instances in hfsplus
    qstr: constify instances in logfs
    qstr: constify dentry_init_security

    Linus Torvalds
     

01 Aug, 2016

1 commit


30 Jul, 2016

2 commits


29 Jul, 2016

2 commits

  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds
     
  • This changes the vfs dentry hashing to mix in the parent pointer at the
    _beginning_ of the hash, rather than at the end.

    That actually improves both the hash and the code generation, because we
    can move more of the computation to the "static" part of the dcache
    setup, and do less at lookup runtime.

    It turns out that a lot of other hash users also really wanted to mix in
    a base pointer as a 'salt' for the hash, and so the slightly extended
    interface ends up working well for other cases too.

    Users that want a string hash that is purely about the string pass in a
    'salt' pointer of NULL.

    * merge branch 'salted-string-hash':
    fs/dcache.c: Save one 32-bit multiply in dcache lookup
    vfs: make the string hashes salt the hash

    Linus Torvalds
     

25 Jul, 2016

3 commits

  • We triggered soft-lockup under stress test which
    open/access/write/close one file concurrently on more than
    five different CPUs:

    WARN: soft lockup - CPU#0 stuck for 11s! [who:30631]
    ...
    [] dput+0x100/0x298
    [] terminate_walk+0x4c/0x60
    [] path_lookupat+0x5cc/0x7a8
    [] filename_lookup+0x38/0xf0
    [] user_path_at_empty+0x78/0xd0
    [] user_path_at+0x1c/0x28
    [] SyS_faccessat+0xb4/0x230

    The ->d_lock trylock may fail many times because of concurrent
    operations, so dput() may execute for a long time.

    Fix this by replacing cpu_relax() with cond_resched().
    dput() used to be sleepable, so making it sleepable again
    should be safe.

    Cc:
    Signed-off-by: Wei Fang
    Signed-off-by: Al Viro

    Wei Fang
     
  • Allow filesystem to initialize dentry at allocation time.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Al Viro
     

21 Jul, 2016

1 commit


01 Jul, 2016

4 commits


30 Jun, 2016

1 commit

  • The two methods essentially do the same: find the real dentry/inode
    belonging to an overlay dentry. The difference is in the usage:

    vfs_open() uses ->d_select_inode() and expects the function to perform
    copy-up if necessary based on the open flags argument.

    file_dentry() uses ->d_real() passing in the overlay dentry as well as the
    underlying inode.

    vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real()
    with a zero inode would have worked just as well here.

    This patch merges the functionality of ->d_select_inode() into ->d_real()
    by adding an 'open_flags' argument to the latter.

    [Al Viro] Make the signature of d_real() match that of ->d_real() again.
    And constify the inode argument, while we are at it.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

20 Jun, 2016

1 commit

    Checking for d_unhashed() while searching the in-lookup hash was
    absolutely wrong. Worse, it masked a deadlock on dget() done under a
    bitlock that nests inside ->d_lock. Thanks to J. R. Okajima for
    spotting it.

    Spotted-by: "J. R. Okajima"
    Wearing-brown-paperbag: Al Viro
    Signed-off-by: Al Viro

    Al Viro
     

12 Jun, 2016

1 commit

    Now that we're mixing in the parent pointer earlier, we
    don't need to use hash_32() to mix its bits. Instead, we can
    just take the msbits of the hash value directly.

    For those applications which use the partial_name_hash(),
    move the multiply to end_name_hash.

    Signed-off-by: George Spelvin
    Signed-off-by: Linus Torvalds

    George Spelvin
     

11 Jun, 2016

1 commit

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Jun, 2016

1 commit

  • d_walk() relies upon the tree not getting rearranged under it without
    rename_lock being touched. And we do grab rename_lock around the
    places that change the tree topology. Unfortunately, branch reordering
    is just as bad from d_walk() POV and we have two places that do it
    without touching rename_lock - one in handling of cursors (for ramfs-style
    directories) and another in autofs. autofs one is a separate story; this
    commit deals with the cursors.
    * mark cursor dentries explicitly at allocation time
    * make __dentry_kill() leave ->d_child.next pointing to the next
    non-cursor sibling, making sure that it won't be moved around unnoticed
    before the parent is relocked on ascend-to-parent path in d_walk().
    * make d_walk() skip cursors explicitly; strictly speaking it's
    not necessary (all callbacks we pass to d_walk() are no-ops on cursors),
    but it makes analysis easier.

    Signed-off-by: Al Viro

    Al Viro
     

08 Jun, 2016

1 commit

  • Ascend-to-parent logics in d_walk() depends on all encountered child
    dentries not getting freed without an RCU delay. Unfortunately, in
    quite a few cases it is not true, with hard-to-hit oopsable race as
    the result.

    Fortunately, the fix is simple; right now the rule is "if it has
    ever been hashed, freeing must be delayed" and changing it to "if it
    has ever had a parent, freeing must be delayed" closes that hole and
    covers all cases the old rule used to cover. Moreover, pipes and
    sockets remain _not_ covered, so we do not introduce RCU delay in
    the cases which are the reason for having that delay conditional
    in the first place.

    Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry())
    Signed-off-by: Al Viro

    Al Viro
     

30 May, 2016

2 commits

  • There is a lot of duplication between dentry_unlink_inode() and dentry_iput().
    The only real difference is that dentry_unlink_inode() bumps ->d_seq and
    dentry_iput() doesn't. The argument of the latter is known to have been
    unhashed, so anybody who might've found it in RCU lookup would already be
    doomed to a ->d_seq mismatch. And we want to avoid pointless smp_rmb() there.

    This patch makes dentry_unlink_inode() bump ->d_seq only for hashed dentries.
    It's safe (d_delete() calls that sucker only if we are holding the only
    reference to dentry, so rehash is not going to happen) and it allows
    to use dentry_unlink_inode() in __dentry_kill() and get rid of dentry_iput().

    The interesting question here is profiling; it *is* a hot path, and extra
    conditional jumps in there might or might not be painful.

    Signed-off-by: Al Viro

    Al Viro
     
  • fsnotify_d_move()/__fsnotify_d_instantiate()/__fsnotify_update_dcache_flags()
    are identical to each other, regardless of the config.

    Signed-off-by: Al Viro

    Al Viro
     

29 May, 2016

2 commits

  • Pull string hash improvements from George Spelvin:
    "This series does several related things:

    - Makes the dcache hash (fs/namei.c) useful for general kernel use.

    (Thanks to Bruce for noticing the zero-length corner case)

    - Converts the string hashes in <linux/sunrpc/svcauth.h> to use the
    above.

    - Avoids 64-bit multiplies in hash_64() on 32-bit platforms. Two
    32-bit multiplies will do well enough.

    - Rids the world of the bad hash multipliers in hash_32.

    This finishes the job started in commit 689de1d6ca95 ("Minimal
    fix-up of bad hashing behavior of hash_64()")

    The vast majority of Linux architectures have hardware support for
    32x32-bit multiply and so derive no benefit from "simplified"
    multipliers.

    The few processors that do not (68000, h8/300 and some models of
    Microblaze) have arch-specific implementations added. Those
    patches are last in the series.

    - Overhauls the dcache hash mixing.

    The patch in commit 0fed3ac866ea ("namei: Improve hash mixing if
    CONFIG_DCACHE_WORD_ACCESS") was an off-the-cuff suggestion.
    Replaced with a much more careful design that's simultaneously
    faster and better. (My own invention, as there was nothing suitable
    in the literature I could find. Comments welcome!)

    - Modify the hash_name() loop to skip the initial HASH_MIX(). This
    would let us salt the hash if we ever wanted to.

    - Sort out partial_name_hash().

    The hash function is declared as using a long state, even though
    it's truncated to 32 bits at the end and the extra internal state
    contributes nothing to the result. And some callers do odd things:

    - fs/hfs/string.c only allocates 32 bits of state
    - fs/hfsplus/unicode.c uses it to hash 16-bit unicode symbols not bytes

    - Modify bytemask_from_count to handle inputs of 1..sizeof(long)
    rather than 0..sizeof(long)-1. This would simplify users other
    than full_name_hash"

    Special thanks to Bruce Fields for testing and finding bugs in v1. (I
    learned some humbling lessons about "obviously correct" code.)

    On the arch-specific front, the m68k assembly has been tested in a
    standalone test harness, I've been in contact with the Microblaze
    maintainers who mostly don't care, as the hardware multiplier is never
    omitted in real-world applications, and I haven't heard anything from
    the H8/300 world"

    * 'hash' of git://ftp.sciencehorizons.net/linux:
    h8300: Add <asm/hash.h>
    microblaze: Add <asm/hash.h>
    m68k: Add <asm/hash.h>
    <linux/hash.h>: Add support for architecture-specific functions
    fs/namei.c: Improve dcache hash function
    Eliminate bad hash multipliers from hash_32() and hash_64()
    Change hash_64() return value to 32 bits
    <linux/sunrpc/svcauth.h>: Define hash_str() in terms of hashlen_string()
    fs/namei.c: Add hashlen_string() function
    Pull out string hash to <linux/stringhash.h>

    Linus Torvalds
     
  • We'd like to make more use of the highly-optimized dcache hash functions
    throughout the kernel, rather than have every subsystem create its own,
    and a function that hashes basic null-terminated strings is required
    for that.

    (The name is to emphasize that it returns both hash and length.)

    It's actually useful in the dcache itself, specifically d_alloc_name().
    Other uses in the next patch.

    full_name_hash() is also tweaked to make it more generally useful:
    1) Take a "char *" rather than "unsigned char *" argument, to
    be consistent with hash_name().
    2) Handle zero-length inputs. If we want more callers, we don't want
    to make them worry about corner cases.

    Signed-off-by: George Spelvin

    George Spelvin
     

19 May, 2016

1 commit

  • Pull misc vfs cleanups from Al Viro:
    "Assorted cleanups and fixes all over the place"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coredump: only charge written data against RLIMIT_CORE
    coredump: get rid of coredump_params->written
    ecryptfs_lookup(): try either only encrypted or plaintext name
    ecryptfs: avoid multiple aliases for directories
    bpf: reject invalid names right in ->lookup()
    __d_alloc(): treat NULL name as QSTR("/", 1)
    mtd: switch ubi_open_volume_path() to vfs_stat()
    mtd: switch open_mtd_by_chdev() to use of vfs_stat()

    Linus Torvalds
     

03 May, 2016

7 commits

  • ta-da!

    The main issue is the lack of down_write_killable(), so the places
    like readdir.c switched to plain inode_lock(); once killable
    variants of rwsem primitives appear, that'll be dealt with.

    lockdep side also might need more work

    Signed-off-by: Al Viro

    Al Viro
     
  • If we *do* run into an in-lookup match, we need to wait for it to
    cease being in-lookup. Fortunately, we do have unused space in
    in-lookup dentries - d_lru is never looked at until it stops being
    in-lookup.

    So we can stash a pointer to wait_queue_head from stack frame of
    the caller of ->lookup(). Some precautions are needed while
    waiting, but it's not that hard - we do hold a reference to dentry
    we are waiting for, so it can't go away. If it's found to be
    in-lookup the wait_queue_head is still alive and will remain so
    at least while ->d_lock is held. Moreover, the condition we
    are waiting for becomes true at the same point where everything
    on that wq gets woken up, so we can just add ourselves to the
    queue once.

    d_alloc_parallel() gets a pointer to wait_queue_head_t from its
    caller; lookup_slow() adjusted, d_add_ci() taught to use
    d_alloc_parallel() if the dentry passed to it happens to be
    in-lookup one (i.e. if it's been called from the parallel lookup).

    That's pretty much it - all that remains is to switch ->i_mutex
    to rwsem and have lookup_slow() take it shared.

    Signed-off-by: Al Viro

    Al Viro
     
    We will need to be able to check if there is an in-lookup
    dentry with a matching parent/name. Right now that's impossible,
    but as soon as we start locking directories shared, such beasts
    will appear.

    Add a secondary hash for locating those. Hash chains go through
    the same space where d_alias will be once it's not in-lookup anymore.
    Search is done under the same bitlock we use for modifications -
    with the primary hash we can rely on d_rehash() into the wrong
    chain being the worst that could happen, but here the pointers are
    buggered once it's removed from the chain. On the other hand,
    the chains are not going to be long and normally we'll end up
    adding to the chain anyway. That allows us to avoid bothering with
    ->d_lock when doing the comparisons - everything is stable until
    removed from chain.

    New helper: d_alloc_parallel(). Right now it allocates, verifies
    that no hashed and in-lookup matches exist and adds to in-lookup
    hash.

    Returns ERR_PTR() for error, hashed match (in the unlikely case it's
    been found) or new dentry. In-lookup matches trigger BUG() for
    now; that will change in the next commit when we introduce waiting
    for ongoing lookup to finish. Note that in-lookup matches won't be
    possible until we actually go for shared locking.

    lookup_slow() switched to use of d_alloc_parallel().

    Again, these commits are separated only for making it easier to
    review. All this machinery will start doing something useful only
    when we go for shared locking; it's just that the combination is
    too large for my taste.

    Signed-off-by: Al Viro

    Al Viro
     
  • We'll need to verify that there's neither a hashed nor in-lookup
    dentry with desired parent/name before adding to in-lookup set.

    One possible solution would be to hold the parent's ->d_lock through
    both checks, but while the in-lookup set is relatively small at any
    time, dcache is not. And holding the parent's ->d_lock through
    something like __d_lookup_rcu() would suck too badly.

    So we leave the parent's ->d_lock alone, which means that we watch
    out for the following scenario:
    * we verify that there's no hashed match
    * existing in-lookup match gets hashed by another process
    * we verify that there's no in-lookup matches and decide
    that everything's fine.

    Solution: per-directory kinda-sorta seqlock, bumped around the times
    we hash something that used to be in-lookup or move (and hash)
    something in place of in-lookup. Then the above would turn into
    * read the counter
    * do dcache lookup
    * if no matches found, check for in-lookup matches
    * if there had been none of those either, check if the
    counter has changed; repeat if it has.

    The "kinda-sorta" part is due to the fact that we don't have much spare
    space in inode. There is a spare word (shared with i_bdev/i_cdev/i_pipe),
    so the counter part is not a problem, but spinlock is a different story.

    We could use the parent's ->d_lock, and it would be less painful in
    terms of contention, for __d_add() it would be rather inconvenient to
    grab; we could do that (using lock_parent()), but...

    Fortunately, we can get serialization on the counter itself, and it
    might be a good idea in general; we can use cmpxchg() in a loop to
    get from even to odd and smp_store_release() from odd to even.

    This commit adds the counter and updating logics; the readers will be
    added in the next commit.

    Signed-off-by: Al Viro

    Al Viro
     
  • marked as such when (would be) parallel lookup is about to pass them
    to actual ->lookup(); unmarked when
    * __d_add() is about to make it hashed, positive or not.
    * __d_move() (from d_splice_alias(), directly or via
    __d_unalias()) puts a preexisting dentry in its place
    * in caller of ->lookup() if it has escaped all of the
    above. Bug (WARN_ON, actually) if it reaches the final dput()
    or d_instantiate() while still marked such.

    As the result, we are guaranteed that for as long as the flag is
    set, dentry will
    * remain negative unhashed with positive refcount
    * never have its ->d_alias looked at
    * never have its ->d_lru looked at
    * never have its ->d_parent and ->d_name changed

    Right now we have at most one such for any given parent directory.
    With parallel lookups that restriction will weaken to
    * only exist when parent is locked shared
    * at most one with given (parent,name) pair (comparison of
    names is according to ->d_compare())
    * only exist when there's no hashed dentry with the same
    (parent,name)

    Transition will take the next several commits; unfortunately, we'll
    only be able to switch to rwsem at the end of this series. The
    reason for not making it a single patch is to simplify review.

    New primitives: d_in_lookup() (a predicate checking if dentry is in
    the in-lookup state) and d_lookup_done() (tells the system that
    we are done with lookup and if it's still marked as in-lookup, it
    should cease to be such).

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

28 Mar, 2016

1 commit


27 Mar, 2016

1 commit

  • This series fixes bugs in nfs and ext4 due to 4bacc9c9234c ("overlayfs:
    Make f_path always point to the overlay and f_inode to the underlay").

    Regular files opened on overlayfs will result in the file being opened on
    the underlying filesystem, while f_path points to the overlayfs
    mount/dentry.

    This confuses filesystems which get the dentry from struct file and assume
    it's theirs.

    Add a new helper, file_dentry() [*], to get the filesystem's own dentry
    from the file. This checks file->f_path.dentry->d_flags against
    DCACHE_OP_REAL, and returns file->f_path.dentry if DCACHE_OP_REAL is not
    set (this is the common, non-overlayfs case).

    In the uncommon case it will call into overlayfs's ->d_real() to get the
    underlying dentry, matching file_inode(file).

    The reason we need to check against the inode is that if the file is copied
    up while being open, d_real() would return the upper dentry, while the open
    file comes from the lower dentry.

    [*] If possible, it's better simply to use file_inode() instead.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Theodore Ts'o
    Tested-by: Goldwyn Rodrigues
    Reviewed-by: Trond Myklebust
    Cc: # v4.2
    Cc: David Howells
    Cc: Al Viro
    Cc: Daniel Axtens

    Miklos Szeredi
     

14 Mar, 2016

4 commits