07 Jan, 2011

1 commit

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    many of which go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only take the global sum when the local count reaches
    0 (this is difficult for vfsmounts, because we can't hold preempt off for the
    life of a reference, so a counter would need to be per-thread or tied strongly
    to a particular CPU, which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU increment, rather than an atomic increment; and mntput just requires a
    spinlock and a non-atomic decrement in the common case. However, the code is
    otherwise bigger and heavier, so single-threaded performance is basically a wash.
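
    A minimal single-threaded sketch of the counting scheme described above, in
    plain C (the names and NR_CPUS are made up for illustration, and all of the
    locking and preemption handling the real kernel code needs is omitted):

        #include <stdbool.h>

        #define NR_CPUS 4

        struct mnt {
                long percpu_count[NR_CPUS]; /* per-CPU increments minus decrements */
                long longrefs;              /* "long" refs: attached mounts, root/cwd */
        };

        /* Short reference: touch only this CPU's counter (no atomic needed). */
        static void mntget_short(struct mnt *m, int cpu) { m->percpu_count[cpu]++; }
        static void mntput_short(struct mnt *m, int cpu) { m->percpu_count[cpu]--; }

        /* The expensive global sum, only taken when the long refcount is 0,
         * i.e. the mount is detached and no process has its root or cwd on it. */
        static long mnt_total_refs(const struct mnt *m)
        {
                long sum = m->longrefs;
                for (int cpu = 0; cpu < NR_CPUS; cpu++)
                        sum += m->percpu_count[cpu];
                return sum;
        }

        static bool mnt_can_free(const struct mnt *m)
        {
                if (m->longrefs != 0)
                        return false;           /* common case: no cross-CPU traffic */
                return mnt_total_refs(m) == 0;  /* slow path: check for global zero */
        }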

    Signed-off-by: Nick Piggin

    Nick Piggin
     

18 Aug, 2010

1 commit

  • fs: brlock vfsmount_lock

    Use a brlock for the vfsmount lock. It must be taken for write whenever
    modifying the mount hash or associated fields, and may be taken for read when
    performing mount hash lookups.

    A new lock is added for the mnt-id allocator, so it doesn't need to take
    the heavy vfsmount write-lock.

    The number of atomics should remain the same for fastpath rlock cases, though
    code would be slightly slower due to per-cpu access. Scalability is not much
    improved in common cases yet, due to other locks (i.e. dcache_lock) getting
    in the way. However, path lookups crossing mountpoints should be one case where
    scalability is improved (currently requiring the global lock).

    The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
    Altix system (high latency to remote nodes), a simple umount microbenchmark
    (mount --bind mnt mnt2 ; umount mnt2, looped 1000 times) took 6.8s before this
    patch and 7.1s after, about 5% slower.
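
    A simplified userspace sketch of the brlock idea (pthread mutexes stand in
    for the per-CPU spinlocks; this is not the kernel's actual vfsmount_lock
    interface): readers take only their own CPU's lock, writers take every
    CPU's lock.

        #include <pthread.h>

        #define NR_CPUS 4

        static pthread_mutex_t brlock[NR_CPUS] = {
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
        };

        /* Read side (e.g. mount hash lookup): cheap, only the local CPU's lock. */
        static void br_read_lock(int cpu)   { pthread_mutex_lock(&brlock[cpu]); }
        static void br_read_unlock(int cpu) { pthread_mutex_unlock(&brlock[cpu]); }

        /* Write side (e.g. modifying the mount hash): take every CPU's lock,
         * which excludes all readers.  This is the slow, rarely taken path. */
        static void br_write_lock(void)
        {
                for (int i = 0; i < NR_CPUS; i++)
                        pthread_mutex_lock(&brlock[i]);
        }

        static void br_write_unlock(void)
        {
                for (int i = NR_CPUS - 1; i >= 0; i--)
                        pthread_mutex_unlock(&brlock[i]);
        }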

    Cc: Al Viro
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

04 Mar, 2010

1 commit

  • First of all, get_source() never results in CL_PROPAGATION
    alone. We either get CL_MAKE_SHARED (for the continuation
    of peer group) or CL_SLAVE (slave that is not shared) or both
    (beginning of peer group among slaves). Massage the code to
    make that explicit, kill CL_PROPAGATION test in clone_mnt()
    (nothing sets CL_MAKE_SHARED without CL_PROPAGATION and in
    clone_mnt() we are checking CL_PROPAGATION after we'd found
    that there's no CL_SLAVE, so the check for CL_MAKE_SHARED
    would do just as well).

    Fix comments, while we are at it...
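
    For illustration only, the three combinations described above could be
    spelled out like this (the CL_* flags are the ones in fs/pnode.h; the
    combination names are made up here, not part of the patch):

        /* What get_source() is said to produce, per the description above: */
        #define PEER_CONTINUATION (CL_MAKE_SHARED)            /* continuation of a peer group       */
        #define PLAIN_SLAVE       (CL_SLAVE)                  /* slave that is not shared           */
        #define SLAVE_PEER_START  (CL_SLAVE | CL_MAKE_SHARED) /* start of a peer group among slaves */
        /* CL_PROPAGATION alone never occurs, so once CL_SLAVE has been ruled out,
         * checking CL_MAKE_SHARED in clone_mnt() is as good as the old
         * CL_PROPAGATION test. */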

    Signed-off-by: Al Viro

    Al Viro
     

23 Apr, 2008

2 commits

  • Show peer group ID of nearest dominating group that has intersection
    with the mount's namespace.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Add a unique ID to each peer group using the IDR infrastructure. The
    identifiers are reused after the peer group dissolves.

    The IDR structures are protected by holding namespace_sem for write
    while allocating or deallocating IDs.

    IDs are allocated when a previously unshared vfsmount becomes the
    first member of a peer group. When a new member is added to an
    existing group, the ID is copied from one of the old members.

    IDs are freed when the last member of a peer group is unshared.

    Setting the MNT_SHARED flag on members of a subtree is done as a
    separate step, after all the IDs have been allocated. This way an
    allocation failure can be cleaned up easily, without affecting the
    propagation state.

    Based on design sketch by Al Viro.
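
    A rough sketch of the allocation pattern described above, using the current
    kernel IDR interface (idr_alloc()/idr_remove()); the helper names and the
    mnt_group_idr variable are hypothetical, the mnt_group_id field follows
    later kernels, and the original patch may have used an older IDR API:

        #include <linux/idr.h>

        static DEFINE_IDR(mnt_group_idr);       /* hypothetical name */

        /* Called when a previously unshared mount becomes the first member of a
         * peer group.  The caller holds namespace_sem for write. */
        static int mnt_alloc_group_id_sketch(struct mount *mnt)
        {
                int id = idr_alloc(&mnt_group_idr, mnt, 1, 0, GFP_KERNEL);

                if (id < 0)
                        return id;              /* allocation failure is easy to unwind */
                mnt->mnt_group_id = id;
                return 0;
        }

        /* Called when the last member of the peer group is unshared.
         * The caller holds namespace_sem for write. */
        static void mnt_release_group_id_sketch(struct mount *mnt)
        {
                idr_remove(&mnt_group_idr, mnt->mnt_group_id);
                mnt->mnt_group_id = 0;
        }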

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

07 Feb, 2008

1 commit

  • Some time ago ( http://lkml.org/lkml/2007/6/19/128 ) I wrote that it felt
    like a bug that MNT_UNBINDABLE is not reset by "mount --make-private".

    Today I happened to look at mount(8) and Documentation/sharedsubtree.txt, and
    both document the behaviour obtained by applying the little patch given there
    (and again below).

    So the present kernel code does not follow the specification and must be
    regarded as buggy.

    Specification in Documentation/sharedsubtree.txt:
    See state diagram: unbindable should become private upon make-private.

    Specification in mount(8):
    ... It's
    also possible to set up uni-directional propagation (with --make-
    slave), to make a mount point unavailable for --bind/--rbind (with
    --make-unbindable), and to undo any of these (with --make-private).

    Repeat of old fix-shared-subtrees-make-private.patch
    (due to Dirk Gerrits, René Gabriëls, Peter Kooijmans):

    Acked-by: Ram Pai
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andries E. Brouwer
     

09 May, 2007

1 commit

  • There are many places in the kernel where constructions like

    foo = list_entry(head->next, struct foo_struct, list);

    are used.
    The code would look more descriptive and neater using the macro

    #define list_first_entry(head, type, member) \
            list_entry((head)->next, type, member)

    Here is the macro itself and examples of its usage in the generic code.
    If it turns out to be useful, I can prepare a set of patches to introduce it
    into arch-specific code, drivers, networking, etc.
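
    A small usage example of the kind the macro is meant to tidy up (struct
    foo_struct and foo_list are made up for illustration; both forms assume the
    list is non-empty, as with list_entry itself):

        #include <linux/list.h>

        struct foo_struct {
                int value;
                struct list_head list;
        };

        static LIST_HEAD(foo_list);

        static struct foo_struct *peek_first(void)
        {
                /* open-coded, as it is written today */
                struct foo_struct *old_way =
                        list_entry(foo_list.next, struct foo_struct, list);

                /* with the new helper, the intent is explicit */
                struct foo_struct *new_way =
                        list_first_entry(&foo_list, struct foo_struct, list);

                (void)old_way;
                return new_way;
        }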

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Cc: Randy Dunlap
    Cc: Andi Kleen
    Cc: Zach Brown
    Cc: Davide Libenzi
    Cc: John McCutchan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Cc: Ram Pai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

09 Dec, 2006

1 commit

  • Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
    other namespaces being developed for containers: pid, uts, ipc, etc.
    'namespace' variables and attributes are also renamed to 'mnt_ns'.

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev