08 Aug, 2014

9 commits

  • Signed-off-by: Fengguang Wu
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Fengguang Wu
     
  • I believe this can only happen in the case of a corrupted filesystem.
    So -EIO looks like the appropriate error.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • If we get to this point and discover the dentry is not a root dentry, or
    not DCACHE_DISCONNECTED--great, we always prefer that anyway.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • There are a few d_obtain_alias callers that are using it to get the
    root of a filesystem which may already have an alias somewhere else.

    This is not the same as the filehandle-lookup case, and none of them
    actually need DCACHE_DISCONNECTED set.

    It isn't really a serious problem, but it would really be clearer if we
    reserved DCACHE_DISCONNECTED for those cases where it's actually needed.

    In the btrfs case this was causing a spurious printk from
    nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
    dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED
    manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting
    default subvol", and this replaces that workaround.

    Cc: Josef Bacik
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Any IS_ROOT() alias should be safe to use; there's nothing special about
    DCACHE_DISCONNECTED dentries.

    Note that this is in fact useful for filesystems such as btrfs which can
    legitimately encounter a directory with a preexisting IS_ROOT alias on a
    lookup that crosses into a subvolume. (Those aliases are currently
    marked DCACHE_DISCONNECTED--but not really for any good reason, and
    we'll change that soon.)

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Currently if d_splice_alias finds a directory with an alias that is not
    IS_ROOT or not DCACHE_DISCONNECTED, it creates a duplicate directory.

    Duplicate directory dentries are unacceptable; it is better just to
    error out.

    (In the case of a local filesystem the most likely case is filesystem
    corruption: for example, perhaps two directories point to the same child
    directory, and the other parent has already been found and cached.)

    Note that distributed filesystems may encounter this case in normal
    operation if a remote host moves a directory to a location different
    from the one we last cached in the dcache. For that reason, such
    filesystems should instead use d_materialise_unique, which tries to move
    the old directory alias to the right place instead of erroring out.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • d_splice_alias will d_move an IS_ROOT() directory dentry into place if
    one exists. This should be safe as long as the dentry remains IS_ROOT,
    but I can't see what guarantees that: once we drop the i_lock all we
    hold here is the i_mutex on an unrelated parent directory.

    Instead copy the logic of d_materialise_unique.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Just a trivial move to locate it near (similar) d_materialise_unique
    code and save some forward references in a following patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This is the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of the prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

12 Jun, 2014

1 commit

  • A dentry that has been through (or into) __dentry_kill() might be seen
    by shrink_dentry_list(); that's normal, it'll be taken off the shrink
    list and freed if __dentry_kill() has already finished. The problem
    is, its ->d_parent might be pointing to an already freed dentry, so
    lock_parent() needs to be careful.

    We need to check that dentry hasn't already gone into __dentry_kill()
    *and* grab rcu_read_lock() before dropping ->d_lock - the latter makes
    sure that whatever we see in ->d_parent after dropping ->d_lock it
    won't be freed until we drop rcu_read_lock().

    Signed-off-by: Al Viro

    Al Viro
     

07 Jun, 2014

1 commit


01 Jun, 2014

1 commit

  • lock_parent() very much on purpose does nested locking of dentries, and
    is careful to maintain the right order (lock parent first). But because
    it didn't annotate the nested locking order, lockdep thought it might be
    a deadlock on d_lock, and complained.

    Add the proper annotation for the inner locking of the child dentry to
    make lockdep happy.

    Introduced by commit 046b961b45f9 ("shrink_dentry_list(): take parent's
    ->d_lock earlier").

    Reported-and-tested-by: Josh Boyer
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 May, 2014

3 commits

  • it's 1 in the only remaining caller.

    Signed-off-by: Al Viro

    Al Viro
     
  • We have the same problem with ->d_lock order in the inner loop, where
    we are dropping references to ancestors. Same solution, basically -
    instead of using dentry_kill() we use lock_parent() (introduced in the
    previous commit) to get that lock in a safe way, recheck ->d_count
    (in case lock_parent() has ended up dropping and retaking ->d_lock
    and somebody managed to grab a reference during that window), trylock
    the inode->i_lock and use __dentry_kill() to do the rest.

    Signed-off-by: Al Viro

    Al Viro
     
  • The cause of livelocks there is that we are taking ->d_lock on
    dentry and its parent in the wrong order, forcing us to use
    trylock on the parent's one. d_walk() takes them in the right
    order, and unfortunately it's not hard to create a situation
    when shrink_dentry_list() can't make progress since trylock
    keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
    keeps calling d_walk() disrupting the very shrink_dentry_list() it's
    waiting for.

    Solution is straightforward - if that trylock fails, let's unlock
    the dentry itself and take locks in the right order. We need to
    stabilize ->d_parent without holding ->d_lock, but that's doable
    using RCU. And we'd better do that in the very beginning of the
    loop in shrink_dentry_list(), since the checks on refcount, etc.
    would need to be redone anyway.

    That deals with half of the problem - killing dentries on the
    shrink list itself. Another one (dropping their parents) is
    in the next commit.

    locking parent is interesting - it would be easy to do rcu_read_lock(),
    lock whatever we think is a parent, lock dentry itself and check
    if the parent is still the right one. Except that we need to check
    that *before* locking the dentry, or we are risking taking ->d_lock
    out of order. Fortunately, once the D1 is locked, we can check if
    D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent
    can start or stop pointing to D1 only under D1->d_lock, so taking
    D1->d_lock is enough. In other words, the right solution is
    rcu_read_lock/lock what looks like parent right now/check if it's
    still our parent/rcu_read_unlock/lock the child.
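
The ordering described above (lock what looks like the parent, recheck, then lock the child) can be sketched in userspace C. This is only an illustration, not the kernel's lock_parent(): `struct node`, `node_lock()`, `node_unlock()` and `lock_parent_then_child()` are hypothetical names, the spinlocks are built from C11 `atomic_flag`, and the sketch assumes nodes are never freed, which is the guarantee rcu_read_lock() provides in the real code.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: a lock plus a parent pointer. */
struct node {
	atomic_flag lock;               /* plays the role of ->d_lock */
	_Atomic(struct node *) parent;  /* plays the role of ->d_parent */
};

static void node_lock(struct node *n)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&n->lock, memory_order_acquire))
		;
}

static void node_unlock(struct node *n)
{
	atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

/*
 * Take the parent's lock first, then the child's.
 * Returns the locked parent, or NULL if the node is its own parent
 * (a root), in which case only the node itself is locked.
 * Assumption: nodes are never freed, so no RCU-like protection is needed.
 */
static struct node *lock_parent_then_child(struct node *child)
{
	for (;;) {
		struct node *parent = atomic_load(&child->parent);

		if (parent == child) {
			node_lock(child);
			return NULL;
		}
		node_lock(parent);
		/* The parent pointer may only change under the parent's lock,
		 * so this check is stable without locking the child. */
		if (atomic_load(&child->parent) == parent) {
			node_lock(child);
			return parent;
		}
		/* The child was moved while we were waiting; retry. */
		node_unlock(parent);
	}
}
```

The recheck happens before the child is locked, so the child's lock is never taken ahead of its parent's.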

    Signed-off-by: Al Viro

    Al Viro
     

29 May, 2014

2 commits


28 May, 2014

1 commit

  • It can happen only when dentry_kill() is called with unlock_on_failure
    equal to 0 - other callers had dentry pinned until the moment they've
    got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead().

    IOW, only one of three call sites of dentry_kill() might end up reaching
    that code. Just move it there.

    Signed-off-by: Al Viro

    Al Viro
     

04 May, 2014

3 commits

  • Since now the shrink list is private and nobody can free the dentry while
    it is on the shrink list, we can remove RCU protection from this.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Start with shrink_dcache_parent(), then scan what remains.

    First of all, BUG() is very much an overkill here; we are holding
    ->s_umount, and hitting BUG() means that a lot of interesting stuff
    will be hanging after that point (sync(2), for example). Moreover,
    in cases when there had been more than one leak, we'll be better
    off reporting all of them. And more than just the last component
    of pathname - %pd is there for just such uses...

    That was the last user of dentry_lru_del(), so kill it off...

    Signed-off-by: Al Viro

    Al Viro
     
  • If we find something already on a shrink list, just increment
    data->found and do nothing else. Loops in shrink_dcache_parent() and
    check_submounts_and_drop() will do the right thing - everything we
    did put into our list will be evicted and if there had been nothing,
    but data->found got non-zero, well, we have somebody else shrinking
    those guys; just try again.

    Signed-off-by: Al Viro

    Al Viro
     

01 May, 2014

5 commits


20 Apr, 2014

1 commit

  • in non-lazy walk we need to be careful about dentry switching from
    negative to positive - both ->d_flags and ->d_inode are updated,
    and in some places we might see only one store. The cases where
    dentry has been obtained by dcache lookup with ->i_mutex held on
    parent are safe - ->d_lock and ->i_mutex provide all the barriers
    we need. However, there are several places where we run into
    trouble:
    * do_last() fetches ->d_inode, then checks ->d_flags and
    assumes that inode won't be NULL unless d_is_negative() is true.
    Race with e.g. creat() - we might have fetched the old value of
    ->d_inode (still NULL) and new value of ->d_flags (already not
    DCACHE_MISS_TYPE). Lin Ming has observed and reported the resulting
    oops.
    * a bunch of places check ->d_inode for being non-NULL,
    then check ->d_flags for "is it a symlink". Race with symlink(2)
    in case our CPU sees ->d_inode update first - we see non-NULL
    there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
    DCACHE_SYMLINK_TYPE. Result: false negative on "should we follow
    link here?", with subsequent unpleasantness.
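
The two races above are the generic problem of publishing two related fields without ordering. A minimal userspace sketch of one way to order such a pair, using C11 release/acquire, follows. It is not the patch itself: `struct entry`, `entry_instantiate()` and `entry_inode_if_positive()` are hypothetical names, and the type constants are placeholders rather than the kernel's DCACHE_* values.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Placeholder type values; TYPE_MISS marks a negative entry. */
enum { TYPE_MISS = 0, TYPE_REGULAR = 1, TYPE_SYMLINK = 2 };

struct inode_stub { int ino; };

struct entry {
	struct inode_stub *inode;  /* plain field, published through flags */
	_Atomic unsigned flags;    /* TYPE_MISS while the entry is negative */
};

/* Writer: store the inode first, then release-store the type. */
static void entry_instantiate(struct entry *e, struct inode_stub *inode,
			      unsigned type)
{
	e->inode = inode;
	atomic_store_explicit(&e->flags, type, memory_order_release);
}

/*
 * Reader: decide from the flags alone, and only then look at the inode.
 * The acquire load pairs with the release store above, so a reader that
 * sees a non-MISS type also sees the matching inode pointer.
 */
static struct inode_stub *entry_inode_if_positive(struct entry *e,
						  unsigned *type)
{
	unsigned t = atomic_load_explicit(&e->flags, memory_order_acquire);

	if (t == TYPE_MISS)
		return NULL;
	*type = t;
	return e->inode;
}
```

A reader following this pattern never tests the inode pointer first, which is the ordering that both problem cases above got wrong.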

    Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one
    Reported-and-tested-by: Lin Ming
    Signed-off-by: Al Viro

    Al Viro
     

09 Apr, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - drm:

    Generic display port aux features, primary plane support, drm
    master management fixes, logging cleanups, enforced locking checks
    (instead of docs), documentation improvements, minor number
    handling cleanup, pseudofs for shared inodes.

    - ttm:

    add ability to allocate from both ends

    - i915:

    broadwell features, power domain and runtime pm, per-process
    address space infrastructure (not enabled)

    - msm:

    power management, hdmi audio support

    - nouveau:

    ongoing GPU fault recovery, initial maxwell support, random fixes

    - exynos:

    refactored driver to clean up a lot of abstraction, DP support
    moved into drm, LVDS bridge support added, parallel panel support

    - gma500:

    SGX MMU support, SGX irq handling, asle irq work fixes

    - radeon:

    video engine bringup, ring handling fixes, use dp aux helpers

    - vmwgfx:

    add rendernode support"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (849 commits)
    DRM: armada: fix corruption while loading cursors
    drm/dp_helper: don't return EPROTO for defers (v2)
    drm/bridge: export ptn3460_init function
    drm/exynos: remove MODULE_DEVICE_TABLE definitions
    ARM: dts: exynos4412-trats2: enable exynos/fimd node
    ARM: dts: exynos4210-trats: enable exynos/fimd node
    ARM: dts: exynos4412-trats2: add panel node
    ARM: dts: exynos4210-trats: add panel node
    ARM: dts: exynos4: add MIPI DSI Master node
    drm/panel: add S6E8AA0 driver
    ARM: dts: exynos4210-universal_c210: add proper panel node
    drm/panel: add ld9040 driver
    panel/ld9040: add DT bindings
    panel/s6e8aa0: add DT bindings
    drm/exynos: add DSIM driver
    exynos/dsim: add DT bindings
    drm/exynos: disallow fbdev initialization if no device is connected
    drm/mipi_dsi: create dsi devices only for nodes with reg property
    drm/mipi_dsi: add flags to DSI messages
    Skip intel_crt_init for Dell XPS 8700
    ...

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • If flags contain RENAME_EXCHANGE then exchange source and destination files.
    There's no restriction on the type of the files; e.g. a directory can be
    exchanged with a symlink.
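
From userspace, the flag is reachable through the renameat2() system call. The sketch below is a hedged illustration, not part of this commit: `exchange_paths()` is a hypothetical helper, the raw syscall is used because older C libraries have no wrapper, and the fallback definition of RENAME_EXCHANGE is only needed when the installed headers predate it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD, open() flags */
#include <sys/syscall.h>  /* SYS_renameat2 */
#include <unistd.h>       /* syscall() */

#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE (1 << 1)
#endif

/*
 * Atomically swap two existing paths (hypothetical helper).
 * Returns 0 on success, -1 with errno set on failure; kernels or
 * filesystems without support typically fail with ENOSYS or EINVAL.
 */
static int exchange_paths(const char *a, const char *b)
{
	return (int)syscall(SYS_renameat2, AT_FDCWD, a, AT_FDCWD, b,
			    RENAME_EXCHANGE);
}
```

Both paths must already exist; unlike a plain rename, neither name is ever missing while the swap takes place.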

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Jan Kara
    Reviewed-by: J. Bruce Fields

    Miklos Szeredi
     

31 Mar, 2014

1 commit

  • Linux 3.14

    The vt-d w/a merged late in 3.14-rc needs a bit of fine-tuning, hence
    backmerge.

    Conflicts:
    drivers/gpu/drm/i915/i915_gem_gtt.c
    drivers/gpu/drm/i915/intel_ddi.c
    drivers/gpu/drm/i915/intel_dp.c

    All are trivial adjacent-lines-changed conflicts, so trivial that git
    doesn't even show them in the merge commit.

    Signed-off-by: Daniel Vetter

    Daniel Vetter
     

23 Mar, 2014

1 commit

  • In all callchains leading to prepend_name(), the value left in *buflen
    is eventually discarded unused if prepend_name() has returned a negative.
    So we are free to do what prepend() does, and subtract from *buflen
    *before* checking for underflow (which turns into checking the sign
    of subtraction result, of course).
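
The subtract-then-check pattern can be sketched in plain C as follows. This is a userspace illustration of the idea only: the signature is simplified, and the buffer convention (the pointer marks the start of the text written so far, and names are prepended right to left) is an assumption of the sketch.

```c
#include <errno.h>
#include <string.h>

/*
 * Prepend namelen bytes of str in front of *buffer.
 * *buflen is decremented before the underflow check, so on failure it
 * is left negative; callers are expected to discard it in that case.
 */
static int prepend(char **buffer, int *buflen, const char *str, int namelen)
{
	*buflen -= namelen;
	if (*buflen < 0)
		return -ENAMETOOLONG;
	*buffer -= namelen;
	memcpy(*buffer, str, (size_t)namelen);
	return 0;
}
```

Checking the sign of the result replaces a separate comparison of namelen against the remaining space.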

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

16 Mar, 2014

1 commit

  • Our current DRM design uses a single address_space for all users of the
    same DRM device. However, there is no way to create an anonymous
    address_space without an underlying inode. Therefore, we wait for the
    first ->open() callback on a registered char-dev and take over the inode
    of the char-dev. This worked well so far, but has several drawbacks:
    - We screw with FS internals and rely on some non-obvious invariants like
    inode->i_mapping being the same as inode->i_data for char-devs.
    - We don't have any address_space prior to the first ->open() from
    user-space. This leads to ugly fallback code and we cannot allocate
    global objects early.

    As pointed out by Al Viro, fs/anon_inode.c is *not* supposed to be used by
    drivers for anonymous inode-allocation. Therefore, this patch follows the
    proposed alternative solution and adds a pseudo filesystem mount-point to
    DRM. We can then allocate private inodes including a private address_space
    for each DRM device at initialization time.

    Note that we could use:
    sysfs_get_inode(sysfs_mnt->mnt_sb, drm_device->dev->kobj.sd);
    to get access to the underlying sysfs-inode of a "struct device" object.
    However, most of this information is currently hidden and it's not clear
    whether this address_space is suitable for driver access. Thus, unless
    Linux allows anonymous address_space objects or driver-core provides a
    public inode per device, we're left with our own private internal mount
    point.

    Cc: Al Viro
    Signed-off-by: David Herrmann

    David Herrmann
     

27 Jan, 2014

1 commit

  • * we need to save the starting point for restarts
    * reject pathologically short buffers outright

    Spotted-by: Denys Vlasenko
    Spotted-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Al Viro
     

26 Jan, 2014

1 commit

  • In commit 232d2d60aa5469bb097f55728f65146bd49c1d25
    Author: Waiman Long
    Date: Mon Sep 9 12:18:13 2013 -0400

    dcache: Translating dentry into pathname without taking rename_lock

    The __dentry_path locking was changed and the variable error was
    intended to be moved outside of the loop. Unfortunately the inner
    declaration of error was not removed, resulting in a version of
    __dentry_path that will never return an error.

    Remove the problematic inner declaration of error and allow
    __dentry_path to return errors once again.
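
The class of bug described here, an inner declaration shadowing the variable that is eventually returned, is easy to reproduce in isolation. The sketch below is not the __dentry_path code; `step()`, `walk_buggy()` and `walk_fixed()` are hypothetical names chosen for the illustration.

```c
/* Hypothetical step that fails on its third call (i == 2). */
static int step(int i)
{
	return i == 2 ? -1 : 0;
}

/* Buggy: the inner declaration shadows the outer variable. */
static int walk_buggy(int n)
{
	int error = 0;

	for (int i = 0; i < n; i++) {
		int error = step(i);  /* new variable, hides the outer one */

		if (error)
			break;
	}
	return error;  /* outer variable, never assigned: always 0 */
}

/* Fixed: assign to the outer variable instead of redeclaring it. */
static int walk_fixed(int n)
{
	int error = 0;

	for (int i = 0; i < n; i++) {
		error = step(i);
		if (error)
			break;
	}
	return error;
}
```

With n = 5, walk_buggy() returns 0 even though a step failed, while walk_fixed() returns -1. Compiling with -Wshadow flags the inner declaration.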

    Cc: stable@vger.kernel.org
    Cc: Waiman Long
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Eric W. Biederman
     

18 Jan, 2014

1 commit

  • Pull namespace fixes from Eric Biederman:
    "This is a set of 3 regression fixes.

    This fixes /proc/mounts when using "ip netns add " to display
    the actual mount point.

    This fixes a regression in clone that broke lxc-attach.

    This fixes a regression in the permission checks for mounting /proc
    that made proc unmountable if binfmt_misc was in use. Oops.

    My apologies for sending this pull request so late. Al Viro gave
    interesting review comments about the d_path fix that I wanted to
    address in detail before I sent this pull request. Unfortunately a
    bad round of colds kept me from addressing that in detail until today.
    The executive summary of the review was:

    Al: Is patching d_path really sufficient?
    The prepend_path, d_path, d_absolute_path, and __d_path family of
    functions is a real mess.

    Me: Yes, patching d_path is really sufficient. Yes, the code is a mess.
    No it is not appropriate to rewrite all of d_path for a regression
    that has existed for entirely too long already, when a two line
    change will do"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Fix a regression in mounting proc
    fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
    vfs: In d_path don't call d_dname on a mount point

    Linus Torvalds
     

13 Dec, 2013

1 commit

  • When explicitly hashing the end of a string with the word-at-a-time
    interface, we have to be careful which end of the word we pick up.

    On big-endian CPUs, the upper-bits will contain the data we're after, so
    ensure we generate our masks accordingly (and avoid hashing whatever
    random junk may have been sitting after the string).

    This patch adds a new dcache helper, bytemask_from_count, which creates
    a mask appropriate for the CPU endianness.
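
The endianness-dependent mask can be sketched in userspace C. This is an illustration under stated assumptions rather than the kernel helper: it uses fixed 64-bit words, `bytemask_le()` and `bytemask_be()` are hypothetical names, endianness is probed at run time, and cnt must stay in the range 0..7 so that the shift count is valid.

```c
#include <stdint.h>

/* Mask for the first cnt bytes in memory on a little-endian CPU:
 * those bytes occupy the low-order bits of the loaded word. */
static uint64_t bytemask_le(unsigned cnt)
{
	return ~(~UINT64_C(0) << (cnt * 8));
}

/* Same on a big-endian CPU: the first bytes in memory occupy the
 * high-order bits of the loaded word. */
static uint64_t bytemask_be(unsigned cnt)
{
	return ~(~UINT64_C(0) >> (cnt * 8));
}

/* Pick the variant matching the CPU this code runs on. */
static uint64_t bytemask_from_count(unsigned cnt)
{
	const uint16_t probe = 1;

	return *(const unsigned char *)&probe ? bytemask_le(cnt)
					      : bytemask_be(cnt);
}
```

For cnt = 3, the little-endian mask is 0x0000000000ffffff and the big-endian mask is 0xffffff0000000000, so the bytes after the end of the string are dropped in either layout.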

    Cc: Al Viro
    Signed-off-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Will Deacon
     

27 Nov, 2013

1 commit

  • Aditya Kali (adityakali@google.com) wrote:
    > Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5:
    > "proc: Fix the namespace inode permission checks." converted
    > the namespace files into symlinks. The same commit changed
    > the way namespace bind mounts appear in /proc/mounts:
    > $ mount --bind /proc/self/ns/ipc /mnt/ipc
    > Originally:
    > $ cat /proc/mounts | grep ipc
    > proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0
    >
    > After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5:
    > $ cat /proc/mounts | grep ipc
    > proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0
    >
    > This breaks userspace which expects the 2nd field in
    > /proc/mounts to be a valid path.

    The symlink /proc//ns/{ipc,mnt,net,pid,user,uts} point to
    dentries allocated with d_alloc_pseudo that we can mount, and
    that have interesting names printed out with d_dname.

    When these files are bind mounted /proc/mounts is not currently
    displaying the mount point correctly because d_dname is called instead
    of just displaying the path where the file is mounted.

    Solve this by adding an explicit check to distinguish mounted pseudo
    inodes and unmounted pseudo inodes. Unmounted pseudo inodes always
    use the mount of their filesystem as the mnt_root in their path, making
    these two cases easy to distinguish.

    CC: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Reported-by: Aditya Kali
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

16 Nov, 2013

2 commits

  • There used to be a bunch of tree-walkers in dcache.c, all alike.
    try_to_ascend() had been introduced to abstract a piece of logic
    duplicated in all of them. These days all these tree-walkers are
    implemented via the same iterator (d_walk()), which is the only
    remaining caller of try_to_ascend(), so let's fold it back...

    Signed-off-by: Al Viro

    Al Viro
     
  • D_HASH{MASK,BITS} are used once each, both in the same function (d_hash()).
    At this point they are actively misleading - they imply that the values
    are compile-time constants, which is no longer true.

    Signed-off-by: Al Viro

    Al Viro