17 Jan, 2011

1 commit

  • Instead of splitting refcount between (per-cpu) mnt_count
    and (SMP-only) mnt_longrefs, make all references contribute
    to mnt_count again and keep track of how many are longterm
    ones.

    Accounting rules for longterm count:
    * 1 for each fs_struct.root.mnt
    * 1 for each fs_struct.pwd.mnt
    * 1 for having non-NULL ->mnt_ns
    * decrement to 0 happens only under vfsmount lock exclusive

    That allows nice common case for mntput() - since we can't drop the
    final reference until after mnt_longterm has reached 0 due to the rules
    above, mntput() can grab vfsmount lock shared and check mnt_longterm.
    If it turns out to be non-zero (which is the common case), we know
    that this is not the final mntput() and can just blindly decrement
    percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
    do usual decrement-and-check of percpu mnt_count.

    For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
    namespace.c uses the latter in places where we don't already hold
    vfsmount lock exclusive and opencodes a few remaining spots where
    we need to manipulate mnt_longterm.

    Note that we mostly revert the code outside of fs/namespace.c back
    to what we used to have; in particular, normal code doesn't need
    to care about two kinds of references, etc. And we get to keep
    the optimization Nick's variant had bought us...

    Signed-off-by: Al Viro

    Al Viro
     

16 Jan, 2011

1 commit

  • Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
    added rather than calling do_add_mount() itself. follow_automount() will then
    do the addition.

    This slightly complicates things as ->d_automount() normally wants to add the
    new vfsmount to an expiration list and start an expiration timer. The problem
    with that is that the vfsmount will be deleted if it has a refcount of 1 and
    the timer will not repeat if the expiration list is empty.

    To this end, we require the vfsmount to be returned from d_automount() with a
    refcount of (at least) 2. One of these refs will be dropped unconditionally.
    In addition, follow_automount() must get a 3rd ref around the call to
    do_add_mount() lest it eat a ref and return an error, leaving the mount we
    have open to being expired as we would otherwise have only 1 ref on it.

    d_automount() should also add the the vfsmount to the expiration list (by
    calling mnt_set_expiry()) and start the expiration timer before returning, if
    this mechanism is to be used. The vfsmount will be unlinked from the
    expiration list by follow_automount() if do_add_mount() fails.

    This patch also fixes the call to do_add_mount() for AFS to propagate the mount
    flags from the parent vfsmount.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

07 Jan, 2011

1 commit

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    which often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only taking the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

11 Aug, 2010

1 commit

  • Commit d0adde574b8487ef30f69e2d08bba769e4be513f added MNT_STRICTATIME
    but it isn't actually used (MS_STRICTATIME clears MNT_RELATIME and
    MNT_NOATIME rather than setting any mount flag).

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

28 Jul, 2010

1 commit


05 Mar, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    init: Open /dev/console from rootfs
    mqueue: fix typo "failues" -> "failures"
    mqueue: only set error codes if they are really necessary
    mqueue: simplify do_open() error handling
    mqueue: apply mathematics distributivity on mq_bytes calculation
    mqueue: remove unneeded info->messages initialization
    mqueue: fix mq_open() file descriptor leak on user-space processes
    fix race in d_splice_alias()
    set S_DEAD on unlink() and non-directory rename() victims
    vfs: add NOFOLLOW flag to umount(2)
    get rid of ->mnt_parent in tomoyo/realpath
    hppfs can use existing proc_mnt, no need for do_kern_mount() in there
    Mirror MS_KERNMOUNT in ->mnt_flags
    get rid of useless vfsmount_lock use in put_mnt_ns()
    Take vfsmount_lock to fs/internal.h
    get rid of insanity with namespace roots in tomoyo
    take check for new events in namespace (guts of mounts_poll()) to namespace.c
    Don't mess with generic_permission() under ->d_lock in hpfs
    sanitize const/signedness for udf
    nilfs: sanitize const/signedness in dealing with ->d_name.name
    ...

    Fix up fairly trivial (famous last words...) conflicts in
    drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c

    Linus Torvalds
     

04 Mar, 2010

3 commits


17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to fs.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.

    Signed-off-by: Tejun Heo
    Cc: "Theodore Ts'o"
    Cc: Trond Myklebust
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: Alexander Viro

    Tejun Heo
     

12 Jun, 2009

2 commits

  • This patch speeds up lmbench lat_mmap test by about another 2% after the
    first patch.

    Before:
    avg = 462.286
    std = 5.46106

    After:
    avg = 453.12
    std = 9.58257

    (50 runs of each, stddev gives a reasonable confidence)

    It does this by introducing mnt_clone_write, which avoids some heavyweight
    operations of mnt_want_write if called on a vfsmount which we know already
    has a write count; and mnt_want_write_file, which can call mnt_clone_write
    if the file is open for write.

    After these two patches, mnt_want_write and mnt_drop_write go from 7% on
    the profile down to 1.3% (including mnt_clone_write).

    [AV: mnt_want_write_file() should take file alone and derive mnt from it;
    not only all callers have that form, but that's the only mnt about which
    we know that it's already held for write if file is opened for write]

    Cc: Dave Hansen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • This patch speeds up lmbench lat_mmap test by about 8%. lat_mmap is set up
    basically to mmap a 64MB file on tmpfs, fault in its pages, then unmap it.
    A microbenchmark yes, but it exercises some important paths in the mm.

    Before:
    avg = 501.9
    std = 14.7773

    After:
    avg = 462.286
    std = 5.46106

    (50 runs of each, stddev gives a reasonable confidence, but there is quite
    a bit of variation there still)

    It does this by removing the complex per-cpu locking and counter-cache and
    replaces it with a percpu counter in struct vfsmount. This makes the code
    much simpler, and avoids spinlocks (although the msync is still pretty
    costly, unfortunately). It results in about 900 bytes smaller code too. It
    does increase the size of a vfsmount, however.

    It should also give a speedup on large systems if CPUs are frequently operating
    on different mounts (because the existing scheme has to operate on an atomic in
    the struct vfsmount when switching between mounts). But I'm most interested in
    the single threaded path performance for the moment.

    [AV: minor cleanup]

    Cc: Dave Hansen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

27 Mar, 2009

1 commit

  • Add support for explicitly requesting full atime updates. This makes it
    possible for kernels to default to relatime but still allow userspace to
    override it.

    Signed-off-by: Matthew Garrett
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     

17 Oct, 2008

1 commit


01 Aug, 2008

1 commit


27 Jul, 2008

1 commit


30 Apr, 2008

1 commit


23 Apr, 2008

2 commits

  • Add a unique ID to each peer group using the IDR infrastructure. The
    identifiers are reused after the peer group dissolves.

    The IDR structures are protected by holding namepspace_sem for write
    while allocating or deallocating IDs.

    IDs are allocated when a previously unshared vfsmount becomes the
    first member of a peer group. When a new member is added to an
    existing group, the ID is copied from one of the old members.

    IDs are freed when the last member of a peer group is unshared.

    Setting the MNT_SHARED flag on members of a subtree is done as a
    separate step, after all the IDs have been allocated. This way an
    allocation failure can be cleaned up easilty, without affecting the
    propagation state.

    Based on design sketch by Al Viro.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Add a unique ID to each vfsmount using the IDR infrastructure. The
    identifiers are reused after the vfsmount is freed.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

22 Apr, 2008

1 commit


19 Apr, 2008

3 commits

  • Originally from: Herbert Poetzl

    This is the core of the read-only bind mount patch set.

    Note that this does _not_ add a "ro" option directly to the bind mount
    operation. If you require such a mount, you must first do the bind, then
    follow it up with a 'mount -o remount,ro' operation:

    If you wish to have a r/o bind mount of /foo on bar:

    mount --bind /foo /bar
    mount -o remount,ro /bar

    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Hansen
     
  • This is the real meat of the entire series. It actually
    implements the tracking of the number of writers to a mount.
    However, it causes scalability problems because there can be
    hundreds of cpus doing open()/close() on files on the same mnt at
    the same time. Even an atomic_t in the mnt has massive scalaing
    problems because the cacheline gets so terribly contended.

    This uses a statically-allocated percpu variable. All want/drop
    operations are local to a cpu as long that cpu operates on the same
    mount, and there are no writer count imbalances. Writer count
    imbalances happen when a write is taken on one cpu, and released
    on another, like when an open/close pair is performed on two

    Upon a remount,ro request, all of the data from the percpu
    variables is collected (expensive, but very rare) and we determine
    if there are any outstanding writers to the mount.

    I've written a little benchmark to sit in a loop for a couple of
    seconds in several cpus in parallel doing open/write/close loops.

    http://sr71.net/~dave/linux/openbench.c

    The code in here is a a worst-possible case for this patch. It
    does opens on a _pair_ of files in two different mounts in parallel.
    This should cause my code to lose its "operate on the same mount"
    optimization completely. This worst-case scenario causes a 3%
    degredation in the benchmark.

    I could probably get rid of even this 3%, but it would be more
    complex than what I have here, and I think this is getting into
    acceptable territory. In practice, I expect writing more than 3
    bytes to a file, as well as disk I/O to mask any effects that this
    has.

    (To get rid of that 3%, we could have an #defined number of mounts
    in the percpu variable. So, instead of a CPU getting operate only
    on percpu data when it accesses only one mount, it could stay on
    percpu data when it only accesses N or fewer mounts.)

    [AV] merged fix for __clear_mnt_mount() stepping on freed vfsmount

    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Hansen
     
  • This patch adds two function mnt_want_write() and mnt_drop_write(). These are
    used like a lock pair around and fs operations that might cause a write to the
    filesystem.

    Before these can become useful, we must first cover each place in the VFS
    where writes are performed with a want/drop pair. When that is complete, we
    can actually introduce code that will safely check the counts before allowing
    r/wr/o transitions to occur.

    Acked-by: Serge Hallyn
    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Dave Hansen
    Signed-off-by: Al Viro

    Dave Hansen
     

28 Mar, 2008

2 commits


09 May, 2007

1 commit


12 Feb, 2007

1 commit

  • I noticed cache misses in touch_atime() that can be avoided if we keep
    mnt_count & mnt_expiry_mark in a different cache line than mnt_flags
    (mostly read)

    mnt_count & mnt_expiry_mark are modified each time a file is opened/closed
    in a file system.

    touch_atime() is called each time a file is read, and generally needs to
    read mnt_flags.

    Other fields of struct vfsmount are mostly read so I chose to move
    mnt_count & mnt_expiry_mark at the end of struct vfsmount. And adding a
    comment so that nobody tries to re-arrange fields to fill the holes :)

    On 64bits platforms, the new offsetof(mnt_count) is 0xC0
    On 32bits platforms, it is 0x60, so I didnot add a
    ____cacheline_aligned_in_smp because it would have a too big impact on the
    size of this object (in particular if CONFIG_X86_L1_CACHE_SHIFT=7)

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

14 Dec, 2006

1 commit

  • Add "relatime" (relative atime) support. Relative atime only updates the
    atime if the previous atime is older than the mtime or ctime. Like
    noatime, but useful for applications like mutt that need to know when a
    file has been read since it was last modified.

    A corresponding patch against mount(8) is available at
    http://userweb.kernel.org/~akpm/mount-relative-atime.txt

    Signed-off-by: Valerie Henson
    Cc: Mark Fasheh
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Karel Zak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valerie Henson
     

09 Dec, 2006

1 commit

  • Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
    other namespaces being developped for the containers : pid, uts, ipc, etc.
    'namespace' variables and attributes are also renamed to 'mnt_ns'

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

25 Jun, 2006

1 commit


23 Jun, 2006

1 commit

  • Give the statfs superblock operation a dentry pointer rather than a superblock
    pointer.

    This complements the get_sb() patch. That reduced the significance of
    sb->s_root, allowing NFS to place a fake root there. However, NFS does
    require a dentry to use as a target for the statfs operation. This permits
    the root in the vfsmount to be used instead.

    linux/mount.h has been added where necessary to make allyesconfig build
    successfully.

    Interest has also been expressed for use with the FUSE and XFS filesystems.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

09 Jun, 2006

2 commits

  • Allow a submount to be marked as being 'shrinkable' by means of the
    vfsmount->mnt_flags, and then add a function 'shrink_submounts()' which
    attempts to recursively unmount these submounts.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • do_kern_mount() does not allow the kernel to use private mount interfaces
    without exposing the same interfaces to userland. The problem is that the
    filesystem is referenced by name, thus meaning that it and its mount
    interface must be registered in the global filesystem list.

    vfs_kern_mount() passes the struct file_system_type as an explicit
    parameter in order to overcome this limitation.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

11 Jan, 2006

1 commit

  • Turn noatime and nodiratime into per-mount instead of per-sb flags.

    After all the preparations this is a rather trivial patch. The mount code
    needs to treat the two options as per-mount instead of per-superblock, and
    touch_atime needs to be changed to check the new MNT_ flags in addition to
    the MS_ flags that are kept for filesystems that are always
    noatime/nodiratime but not user settable anymore. Besides that core code
    only nfs needed an update because it's leaving atime updates to the server
    and thus sets the S_NOATIME flag on every inode, but needs to know whether
    it's a real noatime mount for an getattr optimization.

    While we're at it I've killed the IS_NOATIME/IS_NODIRATIME macros that were
    only used by touch_atime.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

09 Jan, 2006

1 commit


08 Nov, 2005

5 commits

  • An unbindable mount does not forward or receive propagation. Also
    unbindable mount disallows bind mounts. The semantics is as follows.

    Bind semantics:
    It is invalid to bind mount an unbindable mount.

    Move semantics:
    It is invalid to move an unbindable mount under shared mount.

    Clone-namespace semantics:
    If a mount is unbindable in the parent namespace, the corresponding
    cloned mount in the child namespace becomes unbindable too. Note:
    there is subtle difference, unbindable mounts cannot be bind mounted
    but can be cloned during clone-namespace.

    Signed-off-by: Ram Pai
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Ram Pai
     
  • A slave mount always has a master mount from which it receives
    mount/umount events. Unlike shared mount the event propagation does not
    flow from the slave mount to the master.

    Signed-off-by: Ram Pai
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Ram Pai
     
  • This creates shared mounts. A shared mount when bind-mounted to some
    mountpoint, propagates mount/umount events to each other. All the
    shared mounts that propagate events to each other belong to the same
    peer-group.

    Signed-off-by: Ram Pai
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Ram Pai
     
  • A private mount does not forward or receive propagation. This patch
    provides user the ability to convert any mount to private.

    Signed-off-by: Ram Pai
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Ram Pai
     
  • The way we currently deal with quota and process accounting that might
    keep vfsmount busy at umount time is inherently broken; we try to turn
    them off just in case (not quite correctly, at that) and

    a) pray umount doesn't fail (otherwise they'll stay turned off)
    b) pray nobody doesn anything funny just as we turn quota off

    Moreover, LSM provides hooks for doing the same sort of broken logics.

    The proper way to deal with that is to introduce the second kind of
    reference to vfsmount. Semantics:

    - when the last normal reference is dropped, all special ones are
    converted to normal ones and if there had been any, cleanup is done.
    - normal reference can be cloned into a special one
    - special reference can be converted to normal one; that's a no-op if
    we'd already passed the point of no return (i.e. mntput() had
    converted special references to normal and started cleanup).

    The way it works: e.g. starting process accounting converts the vfsmount
    reference pinned by the opened file into special one and turns it back
    to normal when it gets shut down; acct_auto_close() is done when no
    normal references are left. That way it does *not* obstruct umount(2)
    and it silently gets turned off when the last normal reference to
    vfsmount is gone. Which is exactly what we want...

    The same should be done by LSM module that holds some internal
    references to vfsmount and wants to shut them down on umount - it should
    make them special and security_sb_umount_close() will be called exactly
    when the last normal reference to vfsmount is gone.

    quota handling is even simpler - we don't use normal file IO anymore, so
    there's no need to hold vfsmounts at all. DQUOT_OFF() is done from
    deactivate_super(), where it really belongs.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro