12 Aug, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "Stuff in here:

    - acct.c fixes and general rework of mnt_pin mechanism. That allows
    to go for delayed-mntput stuff, which will permit mntput() on deep
    stack without worrying about stack overflows - fs shutdown will
    happen on shallow stack. IOW, we can do Eric's umount-on-rmdir
    series without introducing tons of stack overflows on new mntput()
    call chains it introduces.
    - Bruce's d_splice_alias() patches
    - more Miklos' rename() stuff.
    - a couple of regression fixes (stable fodder, in the end of branch)
    and a fix for API idiocy in iov_iter.c.

    There definitely will be another pile, maybe even two. I'd like to
    get Eric's series in this time, but even if we miss it, it'll go right
    in the beginning of for-next in the next cycle - the tricky part of
    prereqs is in this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    fix copy_tree() regression
    __generic_file_write_iter(): fix handling of sync error after DIO
    switch iov_iter_get_pages() to passing maximal number of pages
    fs: mark __d_obtain_alias static
    dcache: d_splice_alias should detect loops
    exportfs: update Exporting documentation
    dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
    dcache: remove unused d_find_alias parameter
    dcache: d_obtain_alias callers don't all want DISCONNECTED
    dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
    dcache: d_splice_alias mustn't create directory aliases
    dcache: close d_move race in d_splice_alias
    dcache: move d_splice_alias
    namei: trivial fix to vfs_rename_dir comment
    VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
    cifs: support RENAME_NOREPLACE
    hostfs: support rename flags
    shmem: support RENAME_EXCHANGE
    shmem: support RENAME_NOREPLACE
    btrfs: add RENAME_NOREPLACE
    ...

    Linus Torvalds
     
  • Since 3.14 we had copy_tree() get the shadowing wrong - if we had one
    vfsmount shadowing another (i.e. if A is a slave of B, C is mounted
    on A/foo, then D got mounted on B/foo creating D' on A/foo shadowed
    by C), copy_tree() of A would make a copy of D' shadow the copy of
    C, not the other way around.

    It's easy to fix, fortunately - just make sure that mount follows
    the one that shadows it in mnt_child as well as in mnt_hash, and when
    copy_tree() decides to attach a new mount, check if the last child
    it has added to the same parent should be shadowing the new one.
    And if it should, just use the same logic commit_tree() has - put the
    new mount into the hash and children lists right after the one that
    should shadow it.

    Cc: stable@vger.kernel.org [3.14 and later]
    Signed-off-by: Al Viro

    Al Viro
     

10 Aug, 2014

1 commit

  • Pull namespace updates from Eric Biederman:
    "This is a bunch of small changes built against 3.16-rc6. The most
    significant change for users is the first patch which makes setns
    dramatically faster by removing unneeded rcu handling.

    The next chunk of changes are so that "mount -o remount,.." will not
    allow the user namespace root to drop flags on a mount set by the
    system wide root. Aka this forces read-only mounts to stay read-only,
    no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
    mounts to stay no-exec and it prevents unprivileged users from messing
    with a mount's atime settings. I have included my test case as the
    last patch in this series so people performing backports can verify
    this change works correctly.

    The next change fixes a bug in NFS that was discovered while auditing
    nsproxy users for the first optimization. Today you can oops the
    kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
    with pid namespaces. I rebased and fixed the build of the
    !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given
    that no one to my knowledge bases anything on my tree fixing the typo
    in place seems more responsible than requiring a typo-fix to be
    backported as well.

    The last change is a small semantic cleanup introducing
    /proc/thread-self and pointing /proc/mounts and /proc/net at it. This
    prevents several kinds of problematic corner cases. It is a
    user-visible change so it has a minute chance of causing regressions
    so the changes to /proc/mounts and /proc/net are individual one line
    commits that can be trivially reverted. Unfortunately I lost and
    could not find the email of the original reporter so he is not
    credited. From at least one perspective this change to /proc/net is a
    regression fix to allow pthread /proc/net uses that were broken by
    the introduction of the network namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
    proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
    proc: Implement /proc/thread-self to point at the directory of the current thread
    proc: Have net show up under /proc//task/
    NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
    mnt: Add tests for unprivileged remount cases that have found to be faulty
    mnt: Change the default remount atime from relatime to the existing value
    mnt: Correct permission checks in do_remount
    mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
    mnt: Only change user settable mount flags in remount
    namespaces: Use task_lock and not rcu to protect nsproxy

    Linus Torvalds
     

08 Aug, 2014

3 commits

  • Rather than playing silly buggers with vfsmount refcounts, just have
    acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
    and replace it with said clone. Then attach the pin to original
    vfsmount. Voila - the clone will be alive until the file gets closed,
    making sure that underlying superblock remains active, etc., and
    we can drop the original vfsmount, so that it's not kept busy.
    If the file lives until the final mntput of the original vfsmount,
    we'll notice that there's an fs_pin (one in bsd_acct_struct that
    holds that file) and mnt_pin_kill() will take it out. Since
    ->kill() is synchronous, we won't proceed past that point until
    these files are closed (and private clones of our vfsmount are
    gone), so we get the same ordering warranties we used to get.

    mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
    it never became usable outside of kernel/acct.c (and racy wrt
    umount even there).

    Signed-off-by: Al Viro

    Al Viro
     
  • These externs belong in fs/internal.h. Rename (they are not acct-specific
    anymore) and move them over there.

    Signed-off-by: Al Viro

    Al Viro
     
  • Put these suckers on per-vfsmount and per-superblock lists instead.
    Note: right now it's still acct_lock for everything, but that's
    going to change.

    Signed-off-by: Al Viro

    Al Viro
     

07 Aug, 2014

1 commit

  • All other add functions for lists have the new item as first argument
    and the position where it is added as second argument. This was changed
    for no good reason in this function and makes using it unnecessarily
    confusing.

    The name was changed to hlist_add_behind() to cause unconverted code to
    generate a compile error instead of using the wrong parameter order.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Acked-by: Jeff Kirsher [intel driver bits]
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
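
    The argument order described above can be illustrated with a small
    userspace sketch. The types and helpers below are stand-ins written for
    this note and only mirror the shape of the kernel's hlist helpers; the
    struct item payload and demo_order() are hypothetical. The point is that
    hlist_add_behind(n, prev) takes the new node first and the position
    second, like the other list add functions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's hlist types (assumption: simplified). */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Insert n at the front of the chain rooted at h. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* New item first, position second: insert n right after prev. */
static void hlist_add_behind(struct hlist_node *n, struct hlist_node *prev)
{
    assert(n != NULL && prev != NULL);
    n->next = prev->next;
    prev->next = n;
    n->pprev = &prev->next;
    if (n->next)
        n->next->pprev = &n->next;
}

/* Hypothetical payload type used only for this illustration. */
struct item { int value; struct hlist_node node; };

/* Builds 1 -> 3, inserts 2 behind 1, writes the values in list order
 * to out[] and returns how many were written. */
static int demo_order(int out[3])
{
    struct hlist_head head = { NULL };
    struct item a = { 1, { NULL, NULL } };
    struct item b = { 2, { NULL, NULL } };
    struct item c = { 3, { NULL, NULL } };
    struct hlist_node *p;
    int n = 0;

    hlist_add_head(&c.node, &head);
    hlist_add_head(&a.node, &head);
    hlist_add_behind(&b.node, &a.node);   /* b goes behind a */

    for (p = head.first; p; p = p->next) {
        struct item *it =
            (struct item *)((char *)p - offsetof(struct item, node));
        out[n++] = it->value;
    }
    return n;
}
```

    Calling demo_order() fills the array in the order 1, 2, 3. Swapping the
    two arguments of hlist_add_behind() would instead splice the existing
    node behind the new one, which is the mistake the rename is meant to
    turn into a compile error.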
     

01 Aug, 2014

4 commits

  • Since March 2009 the kernel has defaulted to relatime whenever no
    MS_...ATIME flags are passed.

    Defaulting to relatime instead of the existing atime state during a
    remount is silly, and causes problems in practice for people who don't
    specify any MS_...ATIME flags and expect to get the default filesystem
    atime setting. Those users may encounter a permission error because the
    default atime setting does not work.

    A default that does not work and causes permission problems is
    ridiculous, so preserve the existing value to have a default
    atime setting that is always guaranteed to work.

    Using the default atime setting in this way is particularly
    interesting for applications built to run in restricted userspace
    environments without /proc mounted, as the existing atime mount
    options of a filesystem can not be read from /proc/mounts.

    In practice this fixes user space that uses the default atime
    setting on remount and is broken by the permission checks
    keeping less privileged users from changing more privileged users'
    atime settings.

    Cc: stable@vger.kernel.org
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • While investigating the issue wherein "mount --bind -oremount,ro ..."
    would result in later "mount --bind -oremount,rw" succeeding even if
    the mount started off locked I realized that there are several
    additional mount flags that should be locked and are not.

    In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
    flags in addition to MNT_READONLY should all be locked. These
    flags are all per superblock, can all be changed with MS_BIND,
    and should not be changeable if set by a more privileged user.

    The following additions to the current logic are added in this patch.
    - nosuid may not be clearable by a less privileged user.
    - nodev may not be clearable by a less privileged user.
    - noexec may not be clearable by a less privileged user.
    - atime flags may not be changeable by a less privileged user.

    The logic with atime is that always setting atime on access is a
    global policy and backup software and auditing software could break if
    atime bits are not updated (when they are configured to be updated),
    and serious performance degradation could result (DOS attack) if atime
    updates happen when they have been explicitly disabled. Therefore an
    unprivileged user should not be able to mess with the atime bits set
    by a more privileged user.

    The additional restrictions are implemented with the addition of
    MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
    mnt flags.

    Taken together these changes and the fixes for MNT_LOCK_READONLY
    should make it safe for an unprivileged user to create a user
    namespace and to call "mount --bind -o remount,... ..." without
    the danger of mount flags being changed maliciously.

    Cc: stable@vger.kernel.org
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
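
    The locking rules listed above can be sketched as a pure function over
    flag words. Everything here is an assumption made for illustration: the
    DEMO_* names and bit values are invented and do not match the kernel's
    constants, and demo_check_remount() is a hypothetical stand-in for the
    checks a remount path would perform.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative bit values only; the kernel's real constants differ. */
#define DEMO_MNT_NOSUID     0x001u
#define DEMO_MNT_NODEV      0x002u
#define DEMO_MNT_NOEXEC     0x004u
#define DEMO_MNT_NOATIME    0x008u
#define DEMO_MNT_RELATIME   0x010u
#define DEMO_MNT_READONLY   0x020u
#define DEMO_MNT_ATIME_MASK (DEMO_MNT_NOATIME | DEMO_MNT_RELATIME)

#define DEMO_LOCK_NOSUID    0x100u
#define DEMO_LOCK_NODEV     0x200u
#define DEMO_LOCK_NOEXEC    0x400u
#define DEMO_LOCK_ATIME     0x800u
#define DEMO_LOCK_READONLY  0x1000u

/* Returns 0 if a less privileged user may move from cur to wanted,
 * or -EPERM if that would clear or change a locked flag. */
static int demo_check_remount(unsigned cur, unsigned wanted)
{
    if ((cur & DEMO_LOCK_READONLY) && !(wanted & DEMO_MNT_READONLY))
        return -EPERM;
    if ((cur & DEMO_LOCK_NOSUID) && !(wanted & DEMO_MNT_NOSUID))
        return -EPERM;
    if ((cur & DEMO_LOCK_NODEV) && !(wanted & DEMO_MNT_NODEV))
        return -EPERM;
    if ((cur & DEMO_LOCK_NOEXEC) && !(wanted & DEMO_MNT_NOEXEC))
        return -EPERM;
    /* atime is locked as a whole: any change to the atime bits is refused. */
    if ((cur & DEMO_LOCK_ATIME) &&
        (cur & DEMO_MNT_ATIME_MASK) != (wanted & DEMO_MNT_ATIME_MASK))
        return -EPERM;
    return 0;
}

static void demo_locked_flags(void)
{
    /* A more privileged mounter set nosuid + read-only and locked both. */
    unsigned cur = DEMO_MNT_NOSUID | DEMO_MNT_READONLY |
                   DEMO_LOCK_NOSUID | DEMO_LOCK_READONLY;

    /* Keeping both flags is fine. */
    assert(demo_check_remount(cur, DEMO_MNT_NOSUID | DEMO_MNT_READONLY) == 0);
    /* Dropping nosuid is refused. */
    assert(demo_check_remount(cur, DEMO_MNT_READONLY) == -EPERM);
    /* Dropping read-only is refused. */
    assert(demo_check_remount(cur, DEMO_MNT_NOSUID) == -EPERM);
}
```

    Note the asymmetry in this sketch: nosuid, nodev, noexec and read-only
    may still be added by the less privileged user, they just cannot be
    cleared once locked, while the atime bits cannot be changed in either
    direction.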
     
  • There are no races as locked mount flags are guaranteed to never change.

    Moving the test into do_remount makes it more visible, and ensures all
    filesystem remounts pass the MNT_LOCK_READONLY permission check. This
    second case is not an issue today as filesystem remounts are guarded
    by capable(CAP_SYS_ADMIN) and thus will always fail in less privileged
    mount namespaces, but it could become an issue in the future.

    Cc: stable@vger.kernel.org
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Kenton Varda discovered that by remounting a
    read-only bind mount read-only in a user namespace the
    MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
    to remount a read-only mount read-write.

    Correct this by replacing the mask of mount flags to preserve
    with a mask of mount flags that may be changed, and preserve
    all others. This ensures that any future bugs with this mask and
    remount will fail in an easy to detect way where new mount flags
    simply won't change.

    Cc: stable@vger.kernel.org
    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
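
    The difference between a mask of flags to preserve and a mask of flags
    that may be changed can be sketched as follows. The DEMO_* names, their
    values, and both helper functions are assumptions made for this note;
    they model the idea only and are not the kernel's code.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's real constants differ. */
#define DEMO_MNT_NOSUID    0x001u
#define DEMO_MNT_NODEV     0x002u
#define DEMO_MNT_NOEXEC    0x004u
#define DEMO_MNT_READONLY  0x020u
#define DEMO_LOCK_READONLY 0x1000u

#define DEMO_USER_SETTABLE \
    (DEMO_MNT_NOSUID | DEMO_MNT_NODEV | DEMO_MNT_NOEXEC | DEMO_MNT_READONLY)

/* Preserve-list style: any flag missing from preserve_mask is silently
 * dropped when the requested flags replace the current ones. */
static unsigned demo_remount_preserve(unsigned cur, unsigned req,
                                      unsigned preserve_mask)
{
    return req | (cur & preserve_mask);
}

/* Settable-list style: only bits in DEMO_USER_SETTABLE come from the
 * request; every other bit is carried over from the current flags. */
static unsigned demo_remount_settable(unsigned cur, unsigned req)
{
    return (req & DEMO_USER_SETTABLE) | (cur & ~DEMO_USER_SETTABLE);
}

static void demo_masks(void)
{
    unsigned cur = DEMO_MNT_READONLY | DEMO_LOCK_READONLY;

    /* A preserve mask that forgot the lock bit loses it on remount. */
    assert(demo_remount_preserve(cur, DEMO_MNT_READONLY, 0u)
           == DEMO_MNT_READONLY);

    /* With a settable mask the lock bit survives without being listed. */
    assert(demo_remount_settable(cur, DEMO_MNT_READONLY) == cur);
}
```

    With the second form, forgetting to list a newly added flag means that
    flag simply cannot be changed by remount, which fails in the easy to
    detect way the message describes.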
     

30 Jul, 2014

1 commit

  • The synchronous synchronize_rcu in switch_task_namespaces makes setns
    a sufficiently expensive system call that people have complained.

    Upon inspection nsproxy no longer needs rcu protection for remote reads.
    Remote reads are rare. So optimize for same process reads and writes
    by switching to using task_lock instead.

    This yields a simpler to understand lock, and a faster setns system call.

    In particular this fixes a performance regression observed
    by Rafael David Tinoco.

    This is effectively a revert of Pavel Emelyanov's commit
    cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
    from 2007. The race this originally fixed no longer exists as
    do_notify_parent uses task_active_pid_ns(parent) instead of
    parent->nsproxy.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

02 Apr, 2014

4 commits

  • Make delayed_free() call free_vfsmnt() so that we don't have two functions
    doing the same job. This requires the calls to mnt_free_id() in free_vfsmnt()
    to be moved into the callers of that function.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • new flag in ->f_mode - FMODE_WRITER. Set by do_dentry_open() in case
    when it has grabbed write access, checked by __fput() to decide whether
    it wants to drop the sucker. Allows to stop bothering with mnt_clone_write()
    in alloc_file(), along with fewer special_file() checks.

    Signed-off-by: Al Viro

    Al Viro
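
    The FMODE_WRITER idea above can be modelled in a few lines: record at
    open time whether write access was actually grabbed, and let the final
    put consult only that record. This is a userspace sketch under stated
    assumptions; the demo_* types and functions and the DEMO_* values are
    hypothetical and stand in for the open and final-put paths the message
    names.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; not taken from the kernel headers. */
#define DEMO_FMODE_READ   0x1u
#define DEMO_FMODE_WRITE  0x2u
#define DEMO_FMODE_WRITER 0x10000u  /* "this file holds write access" */

struct demo_mount { int writers; };
struct demo_file  { unsigned f_mode; struct demo_mount *mnt; };

/* Open: grab write access only for writable, non-special files, and
 * remember that we did so in f_mode. */
static void demo_open(struct demo_file *f, struct demo_mount *m,
                      unsigned mode, bool special)
{
    f->f_mode = mode;
    f->mnt = m;
    if ((mode & DEMO_FMODE_WRITE) && !special) {
        m->writers++;
        f->f_mode |= DEMO_FMODE_WRITER;
    }
}

/* Final put: no need to re-derive whether access was taken; the flag
 * says so directly. */
static void demo_fput(struct demo_file *f)
{
    if (f->f_mode & DEMO_FMODE_WRITER) {
        f->mnt->writers--;
        f->f_mode &= ~DEMO_FMODE_WRITER;
    }
}

static void demo_writer_flag(void)
{
    struct demo_mount m = { 0 };
    struct demo_file rw, ro;

    demo_open(&rw, &m, DEMO_FMODE_READ | DEMO_FMODE_WRITE, false);
    demo_open(&ro, &m, DEMO_FMODE_READ, false);
    assert(m.writers == 1);

    demo_fput(&ro);            /* read-only file never took write access */
    assert(m.writers == 1);
    demo_fput(&rw);            /* writer drops what it grabbed */
    assert(m.writers == 0);
}
```

    Because the decision is taken once and stored, the release side does not
    have to repeat the special-file test that the open side already made.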
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • The current mainline has copies propagated to *all* nodes, then
    tears down the copies we made for nodes that do not contain
    counterparts of the desired mountpoint. That sets the right
    propagation graph for the copies (at teardown time we move
    the slaves of removed node to a surviving peer or directly
    to master), but we end up paying a fairly steep price in
    useless allocations. It's fairly easy to create a situation
    where N calls of mount(2) create exactly N bindings, with
    O(N^2) vfsmounts allocated and freed in process.

    Fortunately, it is possible to avoid those allocations/freeings.
    The trick is to create copies in the right order and find which
    one would've eventually become a master with the current algorithm.
    It turns out to be possible in O(nodes getting propagation) time
    and with no extra allocations at all.

    One part is that we need to make sure that eventual master will be
    created before its slaves, so we need to walk the propagation
    tree in a different order - by peer groups. And iterate through
    the peers before dealing with the next group.

    Another thing is finding the (earlier) copy that will be a master
    of one we are about to create; to do that we are (temporarily) marking
    the masters of mountpoints we are attaching the copies to.

    Either we are in a peer of the last mountpoint we'd dealt with,
    or we have the following situation: we are attaching to mountpoint M,
    the last copy S_0 had been attached to M_0 and there are sequences
    S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
    S_{i} mounted on M_{i} and we need to create a slave of the first S_{k}
    such that M is getting propagation from M_{k}. It means that the master
    of M_{k} will be among the sequence of masters of M. On the
    other hand, the nearest marked node in that sequence will either
    be the master of M_{k} or the master of M_{k-1} (the latter -
    in the case if M_{k-1} is a slave of something M gets propagation
    from, but in a wrong peer group).

    So we go through the sequence of masters of M until we find
    a marked one (P). Let N be the one before it. Then we go through
    the sequence of masters of S_0 until we find one (say, S) mounted
    on a node D that has P as master and check if D is a peer of N.
    If it is, S will be the master of new copy, if not - the master of S
    will be.

    That's it for the hard part; the rest is fairly simple. Iterator
    is in next_group(), handling of one prospective mountpoint is
    propagate_one().

    It seems to survive all tests and gives a noticeably better performance
    than the current mainline for setups that are seriously using shared
    subtrees.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

31 Mar, 2014

4 commits

  • fixes RCU bug - walking through hlist is safe in face of element moves,
    since it's self-terminating. Cyclic lists are not - if we end up jumping
    to another hash chain, we'll loop infinitely without ever hitting the
    original list head.

    [fix for dumb braino folded]

    Spotted by: Max Kellermann
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
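
    The termination argument above can be shown with a toy model. This is a
    single-threaded sketch, not RCU code: the cnode and hnode types, the
    step limit, and the demo function are all assumptions introduced here.
    A walker is left pointing at a node that has since been moved to another
    chain, and we compare what happens for a ring that ends at its head
    against a chain that ends at NULL.

```c
#include <assert.h>
#include <stddef.h>

struct cnode { struct cnode *next; };   /* ring: last node points at head */
struct hnode { struct hnode *next; };   /* chain: last node points at NULL */

/* Walk a ring from pos until head is reached. Returns the number of
 * steps taken, or -1 if head was not reached within limit steps. */
static int walk_circular(const struct cnode *pos, const struct cnode *head,
                         int limit)
{
    int steps = 0;

    while (pos != head) {
        if (steps++ >= limit)
            return -1;
        pos = pos->next;
    }
    return steps;
}

/* Walk a NULL-terminated chain from pos. Returns the number of steps
 * taken, or -1 if the end was not reached within limit steps. */
static int walk_hlist(const struct hnode *pos, int limit)
{
    int steps = 0;

    while (pos) {
        if (steps++ >= limit)
            return -1;
        pos = pos->next;
    }
    return steps;
}

static void demo_moved_walker(void)
{
    struct cnode h1, h2, x, y;
    struct hnode hx, hy;

    /* x used to be on h1's ring but has been moved to h2's ring:
     * h1 -> h1 and h2 -> x -> y -> h2. A walker that started on h1's
     * ring and is now at x will never see h1 again. */
    h1.next = &h1;
    h2.next = &x;
    x.next = &y;
    y.next = &h2;
    assert(walk_circular(&x, &h1, 1000) == -1);

    /* Same move with NULL-terminated chains: the walker ends up on the
     * other chain but still stops at its end. */
    hx.next = &hy;
    hy.next = NULL;
    assert(walk_hlist(&hx, 1000) == 2);
}
```

    The walker on the NULL-terminated chain may visit nodes from the wrong
    chain, but it always stops; the ring walker needs the original head to
    stop and never finds it.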
     
  • If the dest_mnt is not shared, propagate_mnt() does nothing -
    there's no mounts to propagate to and thus no copies to create.
    Might as well not bother calling it in that case.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • preparation to switching mnt_hash to hlist

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     
  • * switch allocation to alloc_large_system_hash()
    * make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
    * switch mountpoint_hashtable from list_head to hlist_head

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

21 Jan, 2014

1 commit

  • Pull driver core / sysfs patches from Greg KH:
    "Here's the big driver core and sysfs patch set for 3.14-rc1.

    There's a lot of work here moving sysfs logic out into a "kernfs" to
    allow other subsystems to also have a virtual filesystem with the same
    attributes of sysfs (handle device disconnect, dynamic creation /
    removal as needed / unneeded, etc)

    This is primarily being done for the cgroups filesystem, but the goal
    is to also move debugfs to it when it is ready, solving all of the
    known issues in that filesystem as well. The code isn't completed
    yet, but all should be stable now (there is a big section that was
    reverted due to problems found when testing)

    There's also some other smaller fixes, and a driver core addition that
    allows for a "collection" of objects, that the DRM people will be
    using soon (it's in this tree to make merges after -rc1 easier)

    All of this has been in linux-next with no reported issues"

    * tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
    kernfs: associate a new kernfs_node with its parent on creation
    kernfs: add struct dentry declaration in kernfs.h
    kernfs: fix get_active failure handling in kernfs_seq_*()
    Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
    Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
    Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
    Revert "kernfs: remove KERNFS_REMOVED"
    Revert "kernfs: restructure removal path to fix possible premature return"
    Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
    Revert "kernfs: remove kernfs_addrm_cxt"
    Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
    Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
    Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
    Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
    Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
    kernfs: remove unnecessary NULL check in __kernfs_remove()
    drivers/base: provide an infrastructure for componentised subsystems
    ...

    Linus Torvalds
     

30 Nov, 2013

1 commit

  • We're in the process of separating out core sysfs functionality into
    kernfs which will deal with sysfs_dirents directly. This patch
    rearranges mount path so that the kernfs and sysfs parts are separate.

    * As sysfs_super_info won't be visible outside kernfs proper,
    kernfs_super_ns() is added to allow kernfs users to access a
    super_block's namespace tag.

    * Generic mount operation is separated out into kernfs_mount_ns().
    sysfs_mount() now just performs sysfs-specific permission check,
    acquires namespace tag, and invokes kernfs_mount_ns().

    * Generic superblock release is separated out into kernfs_kill_sb()
    which can be used directly as file_system_type->kill_sb(). As sysfs
    needs to put the namespace tag, sysfs_kill_sb() wraps
    kernfs_kill_sb() with ns tag put.

    * sysfs_dir_cachep init and sysfs_inode_init() are separated out into
    kernfs_init(). kernfs_init() uses only small amount of memory and
    trying to handle and propagate kernfs_init() failure doesn't make
    much sense. Use SLAB_PANIC for sysfs_dir_cachep and make
    sysfs_inode_init() panic on failure.

    After this change, kernfs_init() should be called before
    sysfs_init(), fs/namespace.c::mnt_init() modified accordingly.

    Signed-off-by: Tejun Heo
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

27 Nov, 2013

1 commit

  • Gao feng reported that commit
    e51db73532955dc5eaba4235e62b74b460709d5b
    userns: Better restrictions on when proc and sysfs can be mounted
    caused a regression on mounting a new instance of proc in a mount
    namespace created with user namespace privileges, when binfmt_misc
    is mounted on /proc/sys/fs/binfmt_misc.

    This is an unintended regression caused by the absolutely bogus empty
    directory check in fs_fully_visible. The check fs_fully_visible replaced
    didn't even bother to attempt to verify proc was fully visible and
    hiding proc files with any kind of mount is rare. So for now fix
    the userspace regression by allowing directory with nlink == 1
    as /proc/sys/fs/binfmt_misc has.

    I will have a better patch but it is not stable material, or
    last minute kernel material. So it will have to wait.

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Acked-by: Gao feng
    Tested-by: Gao feng
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Nov, 2013

1 commit

  • * RCU-delayed freeing of vfsmounts
    * vfsmount_lock replaced with a seqlock (mount_lock)
    * sequence number from mount_lock is stored in nameidata->m_seq and
    used when we exit RCU mode
    * new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its
    caller knows that vfsmount will have no surviving references.
    * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
    and doing pending mntput().
    * new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence
    number against seq, then grabs reference to mnt. Then it rechecks mount_lock
    again to close the race and either returns success or drops the reference it
    has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can
    simply decrement the refcount and sod off - aforementioned synchronize_rcu()
    makes sure that final mntput() won't come until we leave RCU mode. We need
    that, since we don't want to end up with some lazy pathwalk racing with
    umount() and stealing the final mntput() from it - caller of umount() may
    expect it to return only once the fs is shut down and we don't want to break
    that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
    full-blown mntput() in case of mount_lock sequence number mismatch happening
    just as we'd grabbed the reference, but in those cases we won't be stealing
    the final mntput() from anything that would care.
    * mntput_no_expire() doesn't lock anything on the fast path now. Incidentally,
    SMP and UP cases are handled the same way - no ifdefs there.
    * normal pathname resolution does *not* do any writes to mount_lock. It does,
    of course, bump the refcounts of vfsmount and dentry in the very end, but that's
    it.

    Signed-off-by: Al Viro

    Al Viro
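
    The control flow of the legitimize_mnt(mnt, seq) helper described above
    can be sketched in a single-threaded model. Everything below is an
    assumption made for illustration: the demo_* names are hypothetical, a
    plain counter stands in for the mount_lock sequence number, and the
    race callback exists only so that a writer can be simulated between
    taking the reference and rechecking the sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_mount {
    int  count;        /* reference count */
    bool sync_umount;  /* models MNT_SYNC_UMOUNT */
    int  full_puts;    /* how often the full put path ran */
};

static unsigned demo_mount_seq;   /* stand-in for the mount_lock sequence */

static bool demo_seq_changed(unsigned seq) { return demo_mount_seq != seq; }
static void demo_writer(void)              { demo_mount_seq += 2; }

/* Stand-in for a full-blown mntput(). */
static void demo_mntput(struct demo_mount *m)
{
    m->count--;
    m->full_puts++;
}

/* Returns true if the caller now holds a valid reference to m. */
static bool demo_legitimize(struct demo_mount *m, unsigned seq,
                            void (*race)(void))
{
    if (demo_seq_changed(seq))     /* first check against seq */
        return false;
    if (m == NULL)
        return true;
    m->count++;                    /* grab the reference */
    if (race)
        race();                    /* test hook: a writer sneaks in here */
    if (!demo_seq_changed(seq))    /* recheck to close the race */
        return true;
    if (m->sync_umount) {          /* umount waits for us: plain decrement */
        m->count--;
        return false;
    }
    demo_mntput(m);                /* otherwise do the full put */
    return false;
}

static void demo_legitimize_cases(void)
{
    struct demo_mount m = { 1, false, 0 };
    unsigned seq;

    seq = demo_mount_seq;
    assert(demo_legitimize(&m, seq, NULL) && m.count == 2);

    seq = demo_mount_seq;          /* writer races us, no SYNC_UMOUNT */
    assert(!demo_legitimize(&m, seq, demo_writer));
    assert(m.count == 2 && m.full_puts == 1);

    m.sync_umount = true;          /* writer races us, SYNC_UMOUNT set */
    seq = demo_mount_seq;
    assert(!demo_legitimize(&m, seq, demo_writer));
    assert(m.count == 2 && m.full_puts == 1);
}
```

    In both failure cases the reference taken in the middle is given back;
    what differs is only whether that is done with a bare decrement or with
    the full put path, matching the distinction the message draws for
    MNT_SYNC_UMOUNT.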
     

25 Oct, 2013

13 commits


12 Sep, 2013

1 commit

  • When the rootfs code was a wrapper around ramfs, having them in the same
    file made sense. Now that it can wrap another filesystem type, move it in
    with the init code instead.

    This also allows a subsequent patch to access rootfstype= command line
    arg.

    Signed-off-by: Rob Landley
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Stephen Warren
    Cc: Rusty Russell
    Cc: Jim Cromie
    Cc: Sam Ravnborg
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     

09 Sep, 2013

1 commit


08 Sep, 2013

1 commit

  • Pull vfs pile 2 (of many) from Al Viro:
    "Mostly Miklos' series this time"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    constify dcache.c inlined helpers where possible
    fuse: drop dentry on failed revalidate
    fuse: clean up return in fuse_dentry_revalidate()
    fuse: use d_materialise_unique()
    sysfs: use check_submounts_and_drop()
    nfs: use check_submounts_and_drop()
    gfs2: use check_submounts_and_drop()
    afs: use check_submounts_and_drop()
    vfs: check unlinked ancestors before mount
    vfs: check submounts and drop atomically
    vfs: add d_walk()
    vfs: restructure d_genocide()

    Linus Torvalds