07 Dec, 2011

1 commit

  • __d_path() API is asking for trouble and in case of apparmor d_namespace_path()
    getting just that. The root cause is that when __d_path() misses the root
    it had been told to look for, it stores the location of the most remote ancestor
    in *root. Without grabbing references. Sure, at the moment of call it had
    been pinned down by what we have in *path. And if we raced with umount -l, we
    could have very well stopped at vfsmount/dentry that got freed as soon as
    prepend_path() dropped vfsmount_lock.

    It is safe to compare these pointers with pre-existing (and known to be still
    alive) vfsmount and dentry, as long as all we are asking is "is it the same
    address?". Dereferencing is not safe and apparmor ended up stepping into
    that. d_namespace_path() really wants to examine the place where we stopped,
    even if it's not connected to our namespace. As the result, it looked
    at ->d_sb->s_magic of a dentry that might've been already freed by that point.
    All other callers had been careful enough to avoid that, but it's really
    a bad interface - it invites that kind of trouble.

    The fix is fairly straightforward, even though it's bigger than I'd like:
    * prepend_path() root argument becomes const.
    * __d_path() is never called with NULL/NULL root. It was a kludge
    to start with. Instead, we have an explicit function - d_absolute_root().
    Same as __d_path(), except that it doesn't get root passed and stops where
    it stops. apparmor and tomoyo are using it.
    * __d_path() returns NULL on path outside of root. The main
    caller is show_mountinfo() and that's precisely what we pass root for - to
    skip those outside chroot jail. Those who don't want that can (and do)
    use d_path().
    * __d_path() root argument becomes const. Everyone agrees, I hope.
    * apparmor does *NOT* try to use __d_path() or any of its variants
    when it sees that path->mnt is an internal vfsmount. In that case it's
    definitely not mounted anywhere and dentry_path() is exactly what we want
    there. Handling of sysctl()-triggered weirdness is moved to that place.
    * if apparmor is asked to do pathname relative to chroot jail
    and __d_path() tells it we it's not in that jail, the sucker just calls
    d_absolute_path() instead. That's the other remaining caller of __d_path(),
    BTW.
    * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
    the normal seq_file logics will take care of growing the buffer and redoing
    the call of ->show() just fine). However, if it gets path not reachable
    from root, it returns SEQ_SKIP. The only caller adjusted (i.e. stopped
    ignoring the return value as it used to do).

    Reviewed-by: John Johansen
    ACKed-by: John Johansen
    Signed-off-by: Al Viro
    Cc: stable@vger.kernel.org

    Al Viro
     

23 Nov, 2011

1 commit


17 Nov, 2011

2 commits

  • takes vfsmount and relative path, does lookup within that vfsmount
    (possibly triggering automounts) and returns the result as root
    of subtree suitable for return by ->mount() (i.e. a reference to
    dentry and an active reference to its superblock grabbed, superblock
    locked exclusive).

    btrfs and nfs switched to it instead of open-coding the sucker.

    Signed-off-by: Al Viro

    Al Viro
     
  • Life is much saner if create_mnt_ns(mnt) drops mnt in case of error...
    Switch it to such calling conventions, switch callers, fix double mntput() in
    fs/nfs/super.c one.

    Signed-off-by: Al Viro

    Al Viro
     

28 Oct, 2011

1 commit

  • nfsiostat was failing to find mounted filesystems on kernels after
    2.6.38 because of changes to show_vfsstat() by commit
    c7f404b40a3665d9f4e9a927cc5c1ee0479ed8f9. This patch adds back the
    "device" tag before the nfs server entry so scripts can parse the
    mountstats file correctly.

    Signed-off-by: Bryan Schumaker
    CC: stable@kernel.org [>=2.6.39]
    Signed-off-by: Christoph Hellwig

    Bryan Schumaker
     

27 Sep, 2011

1 commit

  • The concensus seems to be that system calls such as stat() etc should
    not trigger an automount. Neither should the l* versions.

    This patch therefore adds a LOOKUP_AUTOMOUNT flag to tag those lookups
    that _should_ trigger an automount on the last path element.

    Signed-off-by: Trond Myklebust
    [ Edited to leave out the cases that are already covered by LOOKUP_OPEN,
    LOOKUP_DIRECTORY and LOOKUP_CREATE - all of which also fundamentally
    force automounting for their own reasons - Linus ]
    Signed-off-by: Linus Torvalds

    Trond Myklebust
     

24 Jul, 2011

1 commit

  • For a number of file systems that don't have a mount point (e.g. sockfs
    and pipefs), they are not marked as long term. Therefore in
    mntput_no_expire, all locks in vfs_mount lock are taken instead of just
    local cpu's lock to aggregate reference counts when we release
    reference to file objects. In fact, only local lock need to have been
    taken to update ref counts as these file systems are in no danger of
    going away until we are ready to unregister them.

    The attached patch marks file systems using kern_mount without
    mount point as long term. The contentions of vfs_mount lock
    is now eliminated. Before un-registering such file system,
    kern_unmount should be called to remove the long term flag and
    make the mount point ready to be freed.

    Signed-off-by: Tim Chen
    Signed-off-by: Al Viro

    Tim Chen
     

21 Jul, 2011

1 commit

  • Moving the event counter into the dynamically allocated 'struc seq_file'
    allows poll() support without the need to allocate its own tracking
    structure.

    All current users are switched over to use the new counter.

    Requested-by: Andrew Morton akpm@linux-foundation.org
    Acked-by: NeilBrown
    Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
    Signed-off-by: Kay Sievers
    Signed-off-by: Al Viro

    Kay Sievers
     

26 May, 2011

1 commit

  • This issue was discovered by users of busybox. And the bug is actual for
    busybox users, I don't know how it affects others. Apparently, mount is
    called with and without MS_SILENT, and this affects mount() behaviour.
    But MS_SILENT is only supposed to affect kernel logging verbosity.

    The following script was run in an empty test directory:

    mkdir -p mount.dir mount.shared1 mount.shared2
    touch mount.dir/a mount.dir/b
    mount -vv --bind mount.shared1 mount.shared1
    mount -vv --make-rshared mount.shared1
    mount -vv --bind mount.shared2 mount.shared2
    mount -vv --make-rshared mount.shared2
    mount -vv --bind mount.shared2 mount.shared1
    mount -vv --bind mount.dir mount.shared2
    ls -R mount.dir mount.shared1 mount.shared2
    umount mount.dir mount.shared1 mount.shared2 2>/dev/null
    umount mount.dir mount.shared1 mount.shared2 2>/dev/null
    umount mount.dir mount.shared1 mount.shared2 2>/dev/null
    rm -f mount.dir/a mount.dir/b mount.dir/c
    rmdir mount.dir mount.shared1 mount.shared2

    mount -vv was used to show the mount() call arguments and result.
    Output shows that flag argument has 0x00008000 = MS_SILENT bit:

    mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
    mount: mount('','mount.shared1','',0x0010c000,''):0
    mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
    mount: mount('','mount.shared2','',0x0010c000,''):0
    mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
    mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
    mount.dir:
    a
    b

    mount.shared1:

    mount.shared2:
    a
    b

    After adding --loud option to remove MS_SILENT bit from just one mount cmd:

    mkdir -p mount.dir mount.shared1 mount.shared2
    touch mount.dir/a mount.dir/b
    mount -vv --bind mount.shared1 mount.shared1 2>&1
    mount -vv --make-rshared mount.shared1 2>&1
    mount -vv --bind mount.shared2 mount.shared2 2>&1
    mount -vv --loud --make-rshared mount.shared2 2>&1 # &1
    mount -vv --bind mount.dir mount.shared2 2>&1
    ls -R mount.dir mount.shared1 mount.shared2 2>&1
    umount mount.dir mount.shared1 mount.shared2 2>/dev/null
    umount mount.dir mount.shared1 mount.shared2 2>/dev/null
    umount mount.dir mount.shared1 mount.shared2 2>/dev/null
    rm -f mount.dir/a mount.dir/b mount.dir/c
    rmdir mount.dir mount.shared1 mount.shared2

    The result is different now - look closely at mount.shared1 directory listing.
    Now it does show files 'a' and 'b':

    mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
    mount: mount('','mount.shared1','',0x0010c000,''):0
    mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
    mount: mount('','mount.shared2','',0x00104000,''):0
    mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
    mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0

    mount.dir:
    a
    b

    mount.shared1:
    a
    b

    mount.shared2:
    a
    b

    The analysis shows that MS_SILENT flag which is ON by default in any
    busybox-> mount operations cames to flags_to_propagation_type function and
    causes the error return while is_power_of_2 checking because the function
    expects only one bit set. This doesn't allow to do busybox->mount with
    any --make-[r]shared, --make-[r]private etc options.

    Moreover, the recently added flags_to_propagation_type() function doesn't
    allow us to do such operations as --make-[r]private --make-[r]shared etc.
    when MS_SILENT is on. The idea or clearing the MS_SILENT flag came from
    to Denys Vlasenko.

    Signed-off-by: Roman Borisov
    Reported-by: Denys Vlasenko
    Cc: Chuck Ebbert
    Cc: Alexander Shishkin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Roman Borisov
     

13 Apr, 2011

1 commit

  • This reverts commit 93f1c20bc8cdb757be50566eff88d65c3b26881f.

    It turns out that libmount misparses it because it adds a '-' character
    in the uuid string, which libmount then incorrectly confuses with the
    separator string (" - ") at the end of all the optional arguments.

    Upstream libmount (in the util-linux tree) has been fixed, but until
    that fix actually percolates up to users, we'd better not expose this
    change in the kernel.

    Let's revisit this later (possibly by exposing the UUID without any '-'
    characters in it, avoiding the user-space bug).

    Reported-by: Dave Jones
    Cc: Aneesh Kumar K.V
    Cc: Al Viro
    Cc: Karel Zak
    Cc: Ram Pai
    Cc: Miklos Szeredi
    Cc: Eric Sandeen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Mar, 2011

1 commit

  • printk()s without a priority level default to KERN_WARNING. To reduce
    noise at KERN_WARNING, this patch set the priority level appriopriately
    for unleveled printks()s. This should be useful to folks that look at
    dmesg warnings closely.

    Signed-off-by: Mandeep Singh Baines
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     

18 Mar, 2011

4 commits

  • Have it nested inside ->i_mutex. Instead of using follow_down()
    under namespace_sem, followed by grabbing i_mutex and checking that
    mountpoint to be is not dead, do the following:
    grab i_mutex
    check that it's not dead
    grab namespace_sem
    see if anything is mounted there
    if not, we've won
    otherwise
    drop locks
    put_path on what we had
    replace with what's mounted
    retry everything with new mountpoint to be

    New helper (lock_mount()) does that. do_add_mount(), do_move_mount(),
    do_loopback() and pivot_root() switched to it; in case of the last
    two that eliminates a race we used to have - original code didn't
    do follow_down().

    Signed-off-by: Al Viro

    Al Viro
     
  • Don't hold vfsmount_lock over the loop traversing ->mnt_parent;
    do check_mnt(new.mnt) under namespace_sem instead; combined with
    namespace_sem held over all that code it'll guarantee the stability
    of ->mnt_parent chain all the way to the root.

    Doing check_mnt() outside of namespace_sem in case of pivot_root()
    is wrong anyway.

    Signed-off-by: Al Viro

    Al Viro
     
  • new function: mount_fs(). Does all work done by vfs_kern_mount()
    except the allocation and filling of vfsmount; returns root dentry
    or ERR_PTR().

    vfs_kern_mount() switched to using it and taken to fs/namespace.c,
    along with its wrappers.

    alloc_vfsmnt()/free_vfsmnt() made static.

    functions in namespace.c slightly reordered.

    Signed-off-by: Al Viro

    Al Viro
     
  • not needed anymore, since all users (->get_sb() instances) are gone.

    Signed-off-by: Al Viro

    Al Viro
     

17 Mar, 2011

3 commits

  • * 'mnt_devname' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    vfs: bury ->get_sb()
    nfs: switch NFS from ->get_sb() to ->mount()
    nfs: stop mangling ->mnt_devname on NFS
    vfs: new superblock methods to override /proc/*/mount{s,info}
    nfs: nfs_do_{ref,sub}mount() superblock argument is redundant
    nfs: make nfs_path() work without vfsmount
    nfs: store devname at disconnected NFS roots
    nfs: propagate devname to nfs{,4}_get_root()

    Linus Torvalds
     
  • a) ->show_devname(m, mnt) - what to put into devname columns in mounts,
    mountinfo and mountstats
    b) ->show_path(m, mnt) - what to put into relative path column in mountinfo

    Leaving those NULL gives old behaviour. NFS switched to using those.

    Signed-off-by: Al Viro

    Al Viro
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
    AppArmor: kill unused macros in lsm.c
    AppArmor: cleanup generated files correctly
    KEYS: Add an iovec version of KEYCTL_INSTANTIATE
    KEYS: Add a new keyctl op to reject a key with a specified error code
    KEYS: Add a key type op to permit the key description to be vetted
    KEYS: Add an RCU payload dereference macro
    AppArmor: Cleanup make file to remove cruft and make it easier to read
    SELinux: implement the new sb_remount LSM hook
    LSM: Pass -o remount options to the LSM
    SELinux: Compute SID for the newly created socket
    SELinux: Socket retains creator role and MLS attribute
    SELinux: Auto-generate security_is_socket_class
    TOMOYO: Fix memory leak upon file open.
    Revert "selinux: simplify ioctl checking"
    selinux: drop unused packet flow permissions
    selinux: Fix packet forwarding checks on postrouting
    selinux: Fix wrong checks for selinux_policycap_netpeer
    selinux: Fix check for xfrm selinux context algorithm
    ima: remove unnecessary call to ima_must_measure
    IMA: remove IMA imbalance checking
    ...

    Linus Torvalds
     

15 Mar, 2011

1 commit


08 Mar, 2011

1 commit


04 Mar, 2011

1 commit

  • The VFS mount code passes the mount options to the LSM. The LSM will remove
    options it understands from the data and the VFS will then pass the remaining
    options onto the underlying filesystem. This is how options like the
    SELinux context= work. The problem comes in that -o remount never calls
    into LSM code. So if you include an LSM specific option it will get passed
    to the filesystem and will cause the remount to fail. An example of where
    this is a problem is the 'seclabel' option. The SELinux LSM hook will
    print this word in /proc/mounts if the filesystem is being labeled using
    xattrs. If you pass this word on mount it will be silently stripped and
    ignored. But if you pass this word on remount the LSM never gets called
    and it will be passed to the FS. The FS doesn't know what seclabel means
    and thus should fail the mount. For example an ext3 fs mounted over loop

    # mount -o loop /tmp/fs /mnt/tmp
    # cat /proc/mounts | grep /mnt/tmp
    /dev/loop0 /mnt/tmp ext3 rw,seclabel,relatime,errors=continue,barrier=0,data=ordered 0 0
    # mount -o remount /mnt/tmp
    mount: /mnt/tmp not mounted already, or bad option
    # dmesg
    EXT3-fs (loop0): error: unrecognized mount option "seclabel" or missing value

    This patch passes the remount mount options to an new LSM hook.

    Signed-off-by: Eric Paris
    Reviewed-by: James Morris

    Eric Paris
     

24 Feb, 2011

1 commit


17 Jan, 2011

6 commits

  • do_add_mount() and mnt_clear_expiry() are not needed outside of
    namespace.c anymore, now that namei has finish_automount() to
    use.

    Signed-off-by: Al Viro

    Al Viro
     
  • That gets rid of the kludge in finish_automount() - we need
    to keep refcount on the vfsmount as-is until we evict it from
    expiry list.

    Signed-off-by: Al Viro

    Al Viro
     
  • ... and shift it from namei.c to namespace.c

    Signed-off-by: Al Viro

    Al Viro
     
  • mnt_longterm is there only on SMP

    Reported-and-tested-by: Joachim Eastwood
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Instead of splitting refcount between (per-cpu) mnt_count
    and (SMP-only) mnt_longrefs, make all references contribute
    to mnt_count again and keep track of how many are longterm
    ones.

    Accounting rules for longterm count:
    * 1 for each fs_struct.root.mnt
    * 1 for each fs_struct.pwd.mnt
    * 1 for having non-NULL ->mnt_ns
    * decrement to 0 happens only under vfsmount lock exclusive

    That allows nice common case for mntput() - since we can't drop the
    final reference until after mnt_longterm has reached 0 due to the rules
    above, mntput() can grab vfsmount lock shared and check mnt_longterm.
    If it turns out to be non-zero (which is the common case), we know
    that this is not the final mntput() and can just blindly decrement
    percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
    do usual decrement-and-check of percpu mnt_count.

    For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
    namespace.c uses the latter in places where we don't already hold
    vfsmount lock exclusive and opencodes a few remaining spots where
    we need to manipulate mnt_longterm.

    Note that we mostly revert the code outside of fs/namespace.c back
    to what we used to have; in particular, normal code doesn't need
    to care about two kinds of references, etc. And we get to keep
    the optimization Nick's variant had bought us...

    Signed-off-by: Al Viro

    Al Viro
     
  • Expiry-related code calls umount_tree() several times with
    the same list to collect vfsmounts to. Which is fine, except
    that umount_tree() implicitly assumed that the list would
    be empty on each call - it moves the victims over there and
    then iterates through the list kicking them out. It's *almost*
    idempotent, so everything nearly worked. However, mnt->ghosts
    handling (and thus expirability checks) had been broken - that
    part was not idempotent...

    The fix is trivial - use local temporary list, splice it to
    the the collector list when we are through.

    Signed-off-by: Al Viro

    Al Viro
     

16 Jan, 2011

2 commits

  • Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
    added rather than calling do_add_mount() itself. follow_automount() will then
    do the addition.

    This slightly complicates things as ->d_automount() normally wants to add the
    new vfsmount to an expiration list and start an expiration timer. The problem
    with that is that the vfsmount will be deleted if it has a refcount of 1 and
    the timer will not repeat if the expiration list is empty.

    To this end, we require the vfsmount to be returned from d_automount() with a
    refcount of (at least) 2. One of these refs will be dropped unconditionally.
    In addition, follow_automount() must get a 3rd ref around the call to
    do_add_mount() lest it eat a ref and return an error, leaving the mount we
    have open to being expired as we would otherwise have only 1 ref on it.

    d_automount() should also add the the vfsmount to the expiration list (by
    calling mnt_set_expiry()) and start the expiration timer before returning, if
    this mechanism is to be used. The vfsmount will be unlinked from the
    expiration list by follow_automount() if do_add_mount() fails.

    This patch also fixes the call to do_add_mount() for AFS to propagate the mount
    flags from the parent vfsmount.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
    sleep when it tries to transit away from one of that filesystem's directories
    during a pathwalk. The operation is keyed off a new dentry flag
    (DCACHE_MANAGE_TRANSIT).

    The filesystem is allowed to be selective about which processes it holds and
    which it permits to continue on or prohibits from transiting from each flagged
    directory. This will allow autofs to hold up client processes whilst letting
    its userspace daemon through to maintain the directory or the stuff behind it
    or mounted upon it.

    The ->d_manage() dentry operation:

    int (*d_manage)(struct path *path, bool mounting_here);

    takes a pointer to the directory about to be transited away from and a flag
    indicating whether the transit is undertaken by do_add_mount() or
    do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.

    It should return 0 if successful and to let the process continue on its way;
    -EISDIR to prohibit the caller from skipping to overmounted filesystems or
    automounting, and to use this directory; or some other error code to return to
    the user.

    ->d_manage() is called with namespace_sem writelocked if mounting_here is true
    and no other locks held, so it may sleep. However, if mounting_here is true,
    it may not initiate or wait for a mount or unmount upon the parameter
    directory, even if the act is actually performed by userspace.

    Within fs/namei.c, follow_managed() is extended to check with d_manage() first
    on each managed directory, before transiting away from it or attempting to
    automount upon it.

    follow_down() is renamed follow_down_one() and should only be used where the
    filesystem deliberately intends to avoid management steps (e.g. autofs).

    A new follow_down() is added that incorporates the loop done by all other
    callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
    and CIFS do use it, their use is removed by converting them to use
    d_automount()). The new follow_down() calls d_manage() as appropriate. It
    also takes an extra parameter to indicate if it is being called from mount code
    (with namespace_sem writelocked) which it passes to d_manage(). follow_down()
    ignores automount points so that it can be used to mount on them.

    __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
    DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
    sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
    that determine whether to abort or not itself. That would allow the autofs
    daemon to continue on in rcu-walk mode.

    Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
    required as every tranist from that directory will cause d_manage() to be
    invoked. It can always be set again when necessary.

    ==========================
    WHAT THIS MEANS FOR AUTOFS
    ==========================

    Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
    trigger the automounting of indirect mounts, and both of these can be called
    with i_mutex held.

    autofs knows that the i_mutex will be held by the caller in lookup(), and so
    can drop it before invoking the daemon - but this isn't so for d_revalidate(),
    since the lock is only held on _some_ of the code paths that call it. This
    means that autofs can't risk dropping i_mutex from its d_revalidate() function
    before it calls the daemon.

    The bug could manifest itself as, for example, a process that's trying to
    validate an automount dentry that gets made to wait because that dentry is
    expired and needs cleaning up:

    mkdir S ffffffff8014e05a 0 32580 24956
    Call Trace:
    [] :autofs4:autofs4_wait+0x674/0x897
    [] avc_has_perm+0x46/0x58
    [] autoremove_wake_function+0x0/0x2e
    [] :autofs4:autofs4_expire_wait+0x41/0x6b
    [] :autofs4:autofs4_revalidate+0x91/0x149
    [] __lookup_hash+0xa0/0x12f
    [] lookup_create+0x46/0x80
    [] sys_mkdirat+0x56/0xe4

    versus the automount daemon which wants to remove that dentry, but can't
    because the normal process is holding the i_mutex lock:

    automount D ffffffff8014e05a 0 32581 1 32561
    Call Trace:
    [] __mutex_lock_slowpath+0x60/0x9b
    [] do_path_lookup+0x2ca/0x2f1
    [] .text.lock.mutex+0xf/0x14
    [] do_rmdir+0x77/0xde
    [] tracesys+0x71/0xe0
    [] tracesys+0xd5/0xe0

    which means that the system is deadlocked.

    This patch allows autofs to hold up normal processes whilst the daemon goes
    ahead and does things to the dentry tree behind the automouter point without
    risking a deadlock as almost no locks are held in d_manage() and none in
    d_automount().

    Signed-off-by: David Howells
    Was-Acked-by: Ian Kent
    Signed-off-by: Al Viro

    David Howells
     

07 Jan, 2011

3 commits

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    which often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only taking the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Suggested by Andreas, mnt_ prefix is clearer namespace, follows kernel
    conventions better, and is easier for tab complete. I introduced these
    names so I'll admit they were not good choices.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Rather than keep a d_mounted count in the dentry, set a dentry flag instead.
    The flag can be cleared by checking the hash table to see if there are any
    mounts left, which is not time critical because it is performed at detach time.

    The mounted state of a dentry is only used to speculatively take a look in the
    mount hash table if it is set -- before following the mount, vfsmount lock is
    taken and mount re-checked without races.

    This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
    might use later (and some configs have larger than 32-bit spinlocks which might
    make use of the hole).

    Autofs4 conversion and changelog by Ian Kent :
    In autofs4, when expring direct (or offset) mounts we need to ensure that we
    block user path walks into the autofs mount, which is covered by another mount.
    To do this we clear the mounted status so that follows stop before walking into
    the mount and are essentially blocked until the expire is completed. The
    automount daemon still finds the correct dentry for the umount due to the
    follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
    an inode operation for direct and offset mounts only and is called following
    the lookup that stopped at the covered mount.

    At the end of the expire the covering mount probably has gone away so the
    mounted status need not be restored. But we need to check this and only restore
    the mounted status if the expire failed.

    XXX: autofs may not work right if we have other mounts go over the top of it?

    Signed-off-by: Nick Piggin

    Nick Piggin
     

18 Nov, 2010

1 commit


26 Oct, 2010

1 commit

  • If clone_mnt() happens while mnt_make_readonly() is running, the
    cloned mount might have MNT_WRITE_HOLD flag set, which results in
    mnt_want_write() spinning forever on this mount.

    Needs CAP_SYS_ADMIN to trigger deliberately and unlikely to happen
    accidentally. But if it does happen it can hang the machine.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

05 Oct, 2010

1 commit

  • After pushing down the BKL to the get_sb/fill_super operations of the
    filesystems that still make usage of the BKL it is safe to remove it from
    do_new_mount().

    I've read through all the code formerly covered by the BKL inside
    do_kern_mount() and have satisfied myself that it doesn't need the BKL
    any more.

    Signed-off-by: Jan Blunck
    Cc: Matthew Wilcox
    Signed-off-by: Arnd Bergmann

    Jan Blunck
     

08 Sep, 2010

1 commit

  • Sanity check the flags passed to change_mnt_propagation(). Exactly
    one flag should be set. Return EINVAL otherwise.

    Userspace can pass in arbitrary combinations of MS_* flags to mount().
    do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE,
    or MS_UNBINDABLE is set. do_change_type() clears MS_REC and then
    calls change_mnt_propagation() with the rest of the user-supplied
    flags. change_mnt_propagation() clearly assumes only one flag is set
    but do_change_type() does not check that this is true. For example,
    mount() with flags MS_SHARED | MS_RDONLY does not actually make the
    mount shared or read-only but does clear MNT_UNBINDABLE.

    Signed-off-by: Valerie Aurora
    Signed-off-by: Linus Torvalds

    Valerie Aurora
     

18 Aug, 2010

1 commit

  • fs: brlock vfsmount_lock

    Use a brlock for the vfsmount lock. It must be taken for write whenever
    modifying the mount hash or associated fields, and may be taken for read when
    performing mount hash lookups.

    A new lock is added for the mnt-id allocator, so it doesn't need to take
    the heavy vfsmount write-lock.

    The number of atomics should remain the same for fastpath rlock cases, though
    code would be slightly slower due to per-cpu access. Scalability is not not be
    much improved in common cases yet, due to other locks (ie. dcache_lock) getting
    in the way. However path lookups crossing mountpoints should be one case where
    scalability is improved (currently requiring the global lock).

    The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
    Altix system (high latency to remote nodes), a simple umount microbenchmark
    (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
    took 6.8s, afterwards took 7.1s, about 5% slower.

    Cc: Al Viro
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

11 Aug, 2010

2 commits