02 Apr, 2014

4 commits

  • Make delayed_free() call free_vfsmnt() so that we don't have two functions
    doing the same job. This requires the calls to mnt_free_id() in free_vfsmnt()
    to be moved into the callers of that function.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • new flag in ->f_mode - FMODE_WRITER. Set by do_dentry_open() when
    it has grabbed write access, checked by __fput() to decide whether
    it wants to drop the sucker. Allows us to stop bothering with mnt_clone_write()
    in alloc_file(), along with fewer special_file() checks.

    Signed-off-by: Al Viro

    Al Viro
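
The flag handling described above can be sketched in user space roughly as follows. This is a minimal model, not the kernel code: the `model_*` names, the flag values, and the global counter standing in for the mount's writer count are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FMODE_WRITE  0x2u      /* caller asked for write access (assumed value) */
#define MODEL_FMODE_WRITER 0x10000u  /* open really took write access (assumed value) */

struct model_file {
    unsigned int f_mode;
};

static int write_access_count;  /* hypothetical stand-in for the writer count */

/*
 * Open path: take write access only for writable opens of non-special
 * files, and remember that fact in the file itself.
 */
static void model_open(struct model_file *f, unsigned int mode, bool special)
{
    f->f_mode = mode;
    if ((mode & MODEL_FMODE_WRITE) && !special) {
        write_access_count++;
        f->f_mode |= MODEL_FMODE_WRITER;
    }
}

/* Release path: no need to re-derive the decision, just test the flag. */
static void model_fput(struct model_file *f)
{
    if (f->f_mode & MODEL_FMODE_WRITER)
        write_access_count--;
    f->f_mode = 0;
}
```

The point of the design is that the release path no longer repeats the "is this a special file?" reasoning; it trusts the bit recorded at open time.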
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • The current mainline has copies propagated to *all* nodes, then
    tears down the copies we made for nodes that do not contain
    counterparts of the desired mountpoint. That sets the right
    propagation graph for the copies (at teardown time we move
    the slaves of removed node to a surviving peer or directly
    to master), but we end up paying a fairly steep price in
    useless allocations. It's fairly easy to create a situation
    where N calls of mount(2) create exactly N bindings, with
    O(N^2) vfsmounts allocated and freed in the process.

    Fortunately, it is possible to avoid those allocations/freeings.
    The trick is to create copies in the right order and find which
    one would've eventually become a master with the current algorithm.
    It turns out to be possible in O(nodes getting propagation) time
    and with no extra allocations at all.

    One part is that we need to make sure that eventual master will be
    created before its slaves, so we need to walk the propagation
    tree in a different order - by peer groups. And iterate through
    the peers before dealing with the next group.

    Another thing is finding the (earlier) copy that will be a master
    of one we are about to create; to do that we are (temporarily) marking
    the masters of mountpoints we are attaching the copies to.

    Either we are in a peer of the last mountpoint we'd dealt with,
    or we have the following situation: we are attaching to mountpoint M,
    the last copy S_0 had been attached to M_0 and there are sequences
    S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
    S_{i} mounted on M_{i} and we need to create a slave of the first S_{k}
    such that M is getting propagation from M_{k}. It means that the master
    of M_{k} will be among the sequence of masters of M. On the
    other hand, the nearest marked node in that sequence will either
    be the master of M_{k} or the master of M_{k-1} (the latter -
    in the case where M_{k-1} is a slave of something M gets propagation
    from, but in a wrong peer group).

    So we go through the sequence of masters of M until we find
    a marked one (P). Let N be the one before it. Then we go through
    the sequence of masters of S_0 until we find one (say, S) mounted
    on a node D that has P as master and check if D is a peer of N.
    If it is, S will be the master of new copy, if not - the master of S
    will be.

    That's it for the hard part; the rest is fairly simple. Iterator
    is in next_group(), handling of one prospective mountpoint is
    propagate_one().

    It seems to survive all tests and gives a noticeably better performance
    than the current mainline for setups that are seriously using shared
    subtrees.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
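
The master-selection step described in the message (walk the masters of M to the first marked node P, then walk the masters of S_0) might look like this in a simplified user-space model. Everything here is an assumption for illustration: `struct node`, `struct copy`, the integer peer-group id, and `pick_master()` are hypothetical and are not the kernel's `struct mount` or `propagate_one()`. Only the non-peer case is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A mount on the destination side of the propagation graph (hypothetical). */
struct node {
    struct node *master;   /* NULL if this mount is nobody's slave */
    int group;             /* peers share a group id (assumed representation) */
    bool marked;           /* temporarily set on masters of mountpoints already used */
};

/* A copy of the source, attached to some destination-side node. */
struct copy {
    struct copy *master;   /* S_{i+1} in the text */
    struct node *parent;   /* M_i: the mountpoint this copy sits on */
};

static bool peers(const struct node *a, const struct node *b)
{
    return a->group == b->group;
}

/*
 * Pick the master for a new copy on m, given the last copy s0.
 * Returns NULL if the model's assumption (a marked master exists
 * above m, and a matching copy exists above s0) does not hold.
 */
static struct copy *pick_master(struct node *m, struct copy *s0)
{
    struct node *n = m, *p = m->master;
    struct copy *s = s0;

    /* Walk the masters of m until a marked one (P); n is the node before it. */
    while (p && !p->marked) {
        n = p;
        p = p->master;
    }
    if (!p)
        return NULL;

    /* Walk the masters of s0 until one sits on a node D whose master is P. */
    while (s && s->parent->master != p)
        s = s->master;
    if (!s)
        return NULL;

    /* D is a peer of N: S itself is the master; otherwise the master of S is. */
    return peers(n, s->parent) ? s : s->master;
}
```

Both walks only follow master pointers, which is consistent with the claim that no extra allocations are needed; the marks are the only temporary state.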
     

31 Mar, 2014

4 commits

  • fixes RCU bug - walking through hlist is safe in face of element moves,
    since it's self-terminating. Cyclic lists are not - if we end up jumping
    to another hash chain, we'll loop infinitely without ever hitting the
    original list head.

    [fix for dumb braino folded]

    Spotted by: Max Kellermann
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
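
The difference between the two list shapes can be illustrated with a small user-space sketch. It is not RCU code and uses no kernel list types; `struct link` and both walk functions are hypothetical, and the `max_steps` bound exists only so that the non-terminating case can be observed.

```c
#include <assert.h>
#include <stddef.h>

struct link {
    struct link *next;
};

/* hlist-style walk: stops at NULL, whichever chain we ended up on. */
static int walk_null_terminated(const struct link *pos, int max_steps)
{
    int steps = 0;

    while (pos && steps < max_steps) {
        pos = pos->next;
        steps++;
    }
    return steps;
}

/* Cyclic walk: stops only on reaching the head we started from. */
static int walk_cyclic(const struct link *head, const struct link *pos,
                       int max_steps)
{
    int steps = 0;

    while (pos != head && steps < max_steps) {
        pos = pos->next;
        steps++;
    }
    return steps;
}
```

If a walker that started from one cyclic list's head lands on an element that now lives on a different cyclic list, `walk_cyclic()` never sees its own head again and only the step bound stops it. The NULL-terminated walk ends on whichever chain it reached.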
     
  • If the dest_mnt is not shared, propagate_mnt() does nothing -
    there are no mounts to propagate to and thus no copies to create.
    Might as well not bother calling it in that case.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • preparation for switching mnt_hash to hlist

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • * switch allocation to alloc_large_system_hash()
    * make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
    * switch mountpoint_hashtable from list_head to hlist_head

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

21 Jan, 2014

1 commit

  • Pull driver core / sysfs patches from Greg KH:
    "Here's the big driver core and sysfs patch set for 3.14-rc1.

    There's a lot of work here moving sysfs logic out into a "kernfs" to
    allow other subsystems to also have a virtual filesystem with the same
    attributes of sysfs (handle device disconnect, dynamic creation /
    removal as needed / unneeded, etc)

    This is primarily being done for the cgroups filesystem, but the goal
    is to also move debugfs to it when it is ready, solving all of the
    known issues in that filesystem as well. The code isn't completed
    yet, but all should be stable now (there is a big section that was
    reverted due to problems found when testing)

    There's also some other smaller fixes, and a driver core addition that
    allows for a "collection" of objects, that the DRM people will be
    using soon (it's in this tree to make merges after -rc1 easier)

    All of this has been in linux-next with no reported issues"

    * tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
    kernfs: associate a new kernfs_node with its parent on creation
    kernfs: add struct dentry declaration in kernfs.h
    kernfs: fix get_active failure handling in kernfs_seq_*()
    Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
    Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
    Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
    Revert "kernfs: remove KERNFS_REMOVED"
    Revert "kernfs: restructure removal path to fix possible premature return"
    Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
    Revert "kernfs: remove kernfs_addrm_cxt"
    Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
    Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
    Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
    Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
    Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
    kernfs: remove unnecessary NULL check in __kernfs_remove()
    drivers/base: provide an infrastructure for componentised subsystems
    ...

    Linus Torvalds
     

30 Nov, 2013

1 commit

  • We're in the process of separating out core sysfs functionality into
    kernfs which will deal with sysfs_dirents directly. This patch
    rearranges mount path so that the kernfs and sysfs parts are separate.

    * As sysfs_super_info won't be visible outside kernfs proper,
    kernfs_super_ns() is added to allow kernfs users to access a
    super_block's namespace tag.

    * Generic mount operation is separated out into kernfs_mount_ns().
    sysfs_mount() now just performs sysfs-specific permission check,
    acquires namespace tag, and invokes kernfs_mount_ns().

    * Generic superblock release is separated out into kernfs_kill_sb()
    which can be used directly as file_system_type->kill_sb(). As sysfs
    needs to put the namespace tag, sysfs_kill_sb() wraps
    kernfs_kill_sb() with ns tag put.

    * sysfs_dir_cachep init and sysfs_inode_init() are separated out into
    kernfs_init(). kernfs_init() uses only a small amount of memory and
    trying to handle and propagate kernfs_init() failure doesn't make
    much sense. Use SLAB_PANIC for sysfs_dir_cachep and make
    sysfs_inode_init() panic on failure.

    After this change, kernfs_init() should be called before
    sysfs_init(), fs/namespace.c::mnt_init() modified accordingly.

    Signed-off-by: Tejun Heo
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

27 Nov, 2013

1 commit

  • Gao feng reported that commit
    e51db73532955dc5eaba4235e62b74b460709d5b
    userns: Better restrictions on when proc and sysfs can be mounted
    caused a regression on mounting a new instance of proc in a mount
    namespace created with user namespace privileges, when binfmt_misc
    is mounted on /proc/sys/fs/binfmt_misc.

    This is an unintended regression caused by the absolutely bogus empty
    directory check in fs_fully_visible. The check fs_fully_visible replaced
    didn't even bother to attempt to verify proc was fully visible and
    hiding proc files with any kind of mount is rare. So for now fix
    the userspace regression by allowing a directory with nlink == 1,
    as /proc/sys/fs/binfmt_misc has.

    I will have a better patch but it is not stable material, or
    last minute kernel material. So it will have to wait.

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Acked-by: Gao feng
    Tested-by: Gao feng
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Nov, 2013

1 commit

  • * RCU-delayed freeing of vfsmounts
    * vfsmount_lock replaced with a seqlock (mount_lock)
    * sequence number from mount_lock is stored in nameidata->m_seq and
    used when we exit RCU mode
    * new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its
    caller knows that vfsmount will have no surviving references.
    * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
    and doing pending mntput().
    * new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence
    number against seq, then grabs reference to mnt. Then it rechecks mount_lock
    again to close the race and either returns success or drops the reference it
    has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can
    simply decrement the refcount and sod off - aforementioned synchronize_rcu()
    makes sure that final mntput() won't come until we leave RCU mode. We need
    that, since we don't want to end up with some lazy pathwalk racing with
    umount() and stealing the final mntput() from it - caller of umount() may
    expect it to return only once the fs is shut down and we don't want to break
    that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
    full-blown mntput() in case of mount_lock sequence number mismatch happening
    just as we'd grabbed the reference, but in those cases we won't be stealing
    the final mntput() from anything that would care.
    * mntput_no_expire() doesn't lock anything on the fast path now. Incidentally,
    SMP and UP cases are handled the same way - no ifdefs there.
    * normal pathname resolution does *not* do any writes to mount_lock. It does,
    of course, bump the refcounts of vfsmount and dentry in the very end, but that's
    it.

    Signed-off-by: Al Viro

    Al Viro
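
The check, grab, recheck pattern of legitimize_mnt() described above can be sketched in a single-threaded user-space model. This is only an outline under stated assumptions: `model_mount`, `model_mount_seq`, the flag value, and the three-way result are hypothetical, and there are no atomics or memory barriers, so the recheck can only fail here if something changes the sequence between the two reads.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MNT_SYNC_UMOUNT 0x1u   /* hypothetical flag value */

struct model_mount {
    int count;            /* reference count */
    unsigned int flags;
};

static unsigned int model_mount_seq;  /* stand-in for the mount_lock sequence */

enum legit_result {
    LEGIT_OK,             /* reference grabbed, caller may use the mount */
    LEGIT_RETRY,          /* sequence changed; no reference is held */
    LEGIT_NEED_MNTPUT     /* sequence changed; caller must do a full mntput() */
};

static bool seq_changed(unsigned int seq)
{
    return model_mount_seq != seq;
}

static enum legit_result model_legitimize(struct model_mount *m,
                                          unsigned int seq)
{
    if (seq_changed(seq))
        return LEGIT_RETRY;

    m->count++;                   /* grab the reference */
    if (!seq_changed(seq))        /* recheck to close the race */
        return LEGIT_OK;

    if (m->flags & MODEL_MNT_SYNC_UMOUNT) {
        /* umount waits for RCU readers: a plain decrement is enough */
        m->count--;
        return LEGIT_RETRY;
    }
    /* otherwise the reference must be dropped the expensive way */
    return LEGIT_NEED_MNTPUT;
}
```

The two failure results mirror the distinction in the message: with MNT_SYNC_UMOUNT set the reference can simply be decremented, while in the other case a full mntput() is required.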
     

25 Oct, 2013

13 commits


12 Sep, 2013

1 commit

  • When the rootfs code was a wrapper around ramfs, having them in the same
    file made sense. Now that it can wrap another filesystem type, move it in
    with the init code instead.

    This also allows a subsequent patch to access rootfstype= command line
    arg.

    Signed-off-by: Rob Landley
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Stephen Warren
    Cc: Rusty Russell
    Cc: Jim Cromie
    Cc: Sam Ravnborg
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     

09 Sep, 2013

1 commit


08 Sep, 2013

2 commits

  • Pull vfs pile 2 (of many) from Al Viro:
    "Mostly Miklos' series this time"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    constify dcache.c inlined helpers where possible
    fuse: drop dentry on failed revalidate
    fuse: clean up return in fuse_dentry_revalidate()
    fuse: use d_materialise_unique()
    sysfs: use check_submounts_and_drop()
    nfs: use check_submounts_and_drop()
    gfs2: use check_submounts_and_drop()
    afs: use check_submounts_and_drop()
    vfs: check unlinked ancestors before mount
    vfs: check submounts and drop atomically
    vfs: add d_walk()
    vfs: restructure d_genocide()

    Linus Torvalds
     
  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

06 Sep, 2013

1 commit

  • We check submounts before doing d_drop() on a non-empty directory dentry in
    NFS (have_submounts()), but we do not exclude a racing mount. Nor do we
    prevent mounts from being added to the disconnected subtree using relative
    paths after the d_drop().

    This patch fixes these issues by checking for unlinked (unhashed, non-root)
    ancestors before proceeding with the mount. This is done with rename
    seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
    including our dentry itself. This ensures that only one of
    check_submounts_and_drop() or has_unlinked_ancestor() can succeed.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

04 Sep, 2013

1 commit

  • Christopher reported a regression where he was unable to unmount an NFS
    filesystem where the root had gone stale. The problem is that
    d_revalidate handles the root of the filesystem differently from other
    dentries, but d_weak_revalidate does not. We could simply fix this by
    making d_weak_revalidate return success on IS_ROOT dentries, but there
    are cases where we do want to revalidate the root of the fs.

    A umount is really a special case. We generally aren't interested in
    anything but the dentry and vfsmount that's attached at that point. If
    the inode turns out to be stale we just don't care since the intent is
    to stop using it anyway.

    Try to handle this situation better by treating umount as a special
    case in the lookup code. Have it resolve the parent using normal
    means, and then do a lookup of the final dentry without revalidating
    it. In most cases, the final lookup will come out of the dcache, but
    the case where there's a trailing symlink or !LAST_NORM entry on the
    end complicates things a bit.

    Cc: Neil Brown
    Reported-by: Christopher T Vogan
    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

31 Aug, 2013

1 commit


27 Aug, 2013

2 commits

  • Rely on the fact that another flavor of the filesystem is already
    mounted and do not rely on state in the user namespace.

    Verify that the mounted filesystem is not covered in any significant
    way. I would love to verify that the previously mounted filesystem
    has no mounts on top but there are at least the directories
    /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
    for other filesystems to mount on top of.

    Refactor the test into a function named fs_fully_visible and call that
    function from the mount routines of proc and sysfs. This makes this
    test local to the filesystems involved and the results current as of when
    the mounts take place, removing a weird threading of the user
    namespace, the mount namespace and the filesystems themselves.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Don't copy bind mounts of /proc/<pid>/ns/mnt between namespaces.
    These files hold references to a mount namespace and copying them
    between namespaces could result in a reference counting loop.

    The current mnt_ns_loop test prevents loops on the assumption that
    mounts don't cross between namespaces. Unfortunately unsharing a
    mount namespace and shared subtrees can both cause mounts to
    propagate between mount namespaces.

    Two flags, CL_COPY_UNBINDABLE and CL_COPY_MNT_NS_FILE, are added to
    control this behavior, and CL_COPY_ALL is redefined as both of them.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

25 Aug, 2013

1 commit

  • This should actually be returning an ERR_PTR on error instead of NULL.
    That was how it was designed and all the callers expect it.

    [AV: actually, that's what "VFS: Make clone_mnt()/copy_tree()/collect_mounts()
    return errors" missed - originally collect_mounts() was expected to return
    NULL on failure]

    Cc: stable@vger.kernel.org # 3.10+
    Signed-off-by: Dan Carpenter
    Signed-off-by: Al Viro

    Dan Carpenter
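
Why the NULL return matters can be shown with a user-space imitation of the error-pointer convention. The `model_*` helpers below are assumptions written for illustration; they mimic the idea behind the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() but are not those macros, and `model_collect()` is a hypothetical stand-in for a function that returns a tree or an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *model_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Decode the errno value back out of an error pointer. */
static inline long model_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True only for pointers in the top MODEL_MAX_ERRNO addresses. */
static inline bool model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Hypothetical collector: fails with -ENOMEM when asked to. */
static void *model_collect(bool fail)
{
    static int tree;   /* stand-in for a successfully collected tree */

    return fail ? model_err_ptr(-ENOMEM) : (void *)&tree;
}
```

A caller that checks only `model_is_err()` treats NULL as a valid result, so a callee that returns NULL on failure leads straight to a NULL dereference in such a caller.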
     

25 Jul, 2013

1 commit

  • When creating a less privileged mount namespace or propagating mounts
    from a more privileged to a less privileged mount namespace lock the
    submounts so they may not be unmounted individually in the child mount
    namespace revealing what is under them.

    This enforces the reasonable expectation that it is not possible to
    see under a mount point. Most of the time mounts are on empty
    directories and revealing that does not matter; however, I have seen an
    occasional sloppy configuration where there were interesting things
    concealed under a mount point that probably should not be revealed.

    Expirable submounts are not locked because they will eventually
    unmount automatically so whatever is under them already needs
    to be safe for unprivileged users to access.

    From a practical standpoint these restrictions do not appear to be
    significant for unprivileged users of the mount namespace. Recursive
    bind mounts and pivot_root continue to work, and mounts that are
    created in a mount namespace may be unmounted there. All of which
    means that the common idiom of keeping a directory of interesting
    files and using pivot_root to throw everything else away continues to
    work just fine.

    Acked-by: Serge Hallyn
    Acked-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 May, 2013

2 commits


02 May, 2013

2 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells