25 Apr, 2008

1 commit


23 Apr, 2008

5 commits

  • Show peer group ID of nearest dominating group that has intersection
    with the mount's namespace.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • [mszeredi@suse.cz] rewrite and split big patch into manageable chunks

    /proc/mounts in its current form lacks important information:

    - propagation state
    - root of mount for bind mounts
    - the st_dev value used within the filesystem
    - identifier for each mount and its parent

    It also suffers from the following problems:

    - not easily extendable
    - ambiguity of mountpoints within a chrooted environment
    - doesn't distinguish between filesystem dependent and independent options
    - doesn't distinguish between per mount and per super block options

    This patch introduces /proc/<pid>/mountinfo which attempts to address
    all these deficiencies.

    Code shared between /proc/<pid>/mounts and /proc/<pid>/mountinfo is
    extracted into separate functions.
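
    As an illustration of how a consumer might read the new file, here is a
    minimal user-space sketch. It assumes the field layout documented in
    proc(5) for later kernels (mount ID, parent ID, major:minor, root, mount
    point, per-mount options, optional fields, a "-" separator, filesystem
    type, source, per-superblock options); the format produced by this exact
    patch may differ. The struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields this sketch extracts. */
struct mountinfo_entry {
    int mount_id;
    int parent_id;
    unsigned int major, minor;  /* st_dev of files on this filesystem */
    char root[256];             /* root of the mount within the filesystem */
    char mount_point[256];      /* relative to the reading process's root */
    char mount_opts[256];       /* per-mount options */
    char fstype[64];
    char source[256];
    char super_opts[256];       /* per-superblock options */
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
static int parse_mountinfo_line(const char *line, struct mountinfo_entry *e)
{
    int consumed = 0;
    const char *sep;

    if (sscanf(line, "%d %d %u:%u %255s %255s %255s%n",
               &e->mount_id, &e->parent_id, &e->major, &e->minor,
               e->root, e->mount_point, e->mount_opts, &consumed) != 7)
        return -1;

    /* Optional fields (such as "shared:1") run up to a lone "-". */
    sep = strstr(line + consumed, " - ");
    if (sep == NULL)
        return -1;

    if (sscanf(sep + 3, "%63s %255s %255s",
               e->fstype, e->source, e->super_opts) != 3)
        return -1;
    return 0;
}
```

    Keeping the optional fields between the fixed prefix and the "-"
    separator is what lets a parser like this skip fields it does not know.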

    Thanks to Al Viro for the help in getting the design right.

    Signed-off-by: Ram Pai
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Ram Pai
     
  • Allow /proc/<pid>/mountinfo to use the root of <pid> to calculate
    mountpoints.

    - move definition of 'struct proc_mounts' to
    - add the process's namespace and root to this structure
    - pass a pointer to 'struct proc_mounts' into seq_operations

    In addition the following cleanups are made:

    - use a common open function for /proc/<pid>/{mounts,mountstat}
    - surround namespace.c part of these proc files with #ifdef CONFIG_PROC_FS
    - make the seq_operations structures const

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Add a unique ID to each peer group using the IDR infrastructure. The
    identifiers are reused after the peer group dissolves.

    The IDR structures are protected by holding namespace_sem for write
    while allocating or deallocating IDs.

    IDs are allocated when a previously unshared vfsmount becomes the
    first member of a peer group. When a new member is added to an
    existing group, the ID is copied from one of the old members.

    IDs are freed when the last member of a peer group is unshared.

    Setting the MNT_SHARED flag on members of a subtree is done as a
    separate step, after all the IDs have been allocated. This way an
    allocation failure can be cleaned up easily, without affecting the
    propagation state.
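
    The two-step approach can be sketched in user space as follows. This is
    not the kernel code: the struct, the flag value and the toy allocator
    (which fails once a small budget is used up) are all hypothetical
    stand-ins chosen to show why the flag is only set after every
    allocation has succeeded.

```c
#include <stddef.h>

#define FLAG_SHARED 0x1     /* stand-in for MNT_SHARED */

struct toy_mount {
    int flags;
    int group_id;           /* 0 means "no peer group" */
};

/* Toy allocator: hands out 1, 2, 3, ... until the budget runs out. */
static int ids_left = 2;
static int next_id = 1;

static int alloc_group_id(struct toy_mount *m)
{
    if (ids_left == 0)
        return -1;          /* simulated allocation failure */
    ids_left--;
    m->group_id = next_id++;
    return 0;
}

static void release_group_id(struct toy_mount *m)
{
    m->group_id = 0;
    ids_left++;
}

/* Make every mount in subtree[0..n) shared, or change nothing at all. */
static int make_subtree_shared(struct toy_mount *subtree, size_t n)
{
    size_t i;

    /* Step 1: allocate IDs only; propagation flags stay untouched. */
    for (i = 0; i < n; i++) {
        if (subtree[i].flags & FLAG_SHARED)
            continue;       /* already a member of a peer group */
        if (alloc_group_id(&subtree[i]) != 0)
            goto undo;
    }

    /* Step 2: cannot fail, so the state changes all at once. */
    for (i = 0; i < n; i++)
        subtree[i].flags |= FLAG_SHARED;
    return 0;

undo:
    /* Release only the IDs handed out by this call. */
    while (i-- > 0)
        if (!(subtree[i].flags & FLAG_SHARED))
            release_group_id(&subtree[i]);
    return -1;
}
```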

    Based on design sketch by Al Viro.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Add a unique ID to each vfsmount using the IDR infrastructure. The
    identifiers are reused after the vfsmount is freed.
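
    To illustrate what "reused" means here, a small user-space sketch of an
    allocator that always hands out the lowest free ID. It deliberately
    does not use the kernel's IDR API; the bitmap, the limit of 64 IDs and
    the function names are assumptions made for the example.

```c
#include <stdint.h>

#define MAX_IDS 64

static uint64_t used_ids;   /* bit n set means ID n is in use */

/* Return the lowest free ID, or -1 if all MAX_IDS are taken. */
static int toy_id_alloc(void)
{
    int id;

    for (id = 0; id < MAX_IDS; id++) {
        if (!(used_ids & (UINT64_C(1) << id))) {
            used_ids |= UINT64_C(1) << id;
            return id;
        }
    }
    return -1;
}

/* Mark an ID as free so that a later allocation can hand it out again. */
static void toy_id_free(int id)
{
    if (id >= 0 && id < MAX_IDS)
        used_ids &= ~(UINT64_C(1) << id);
}
```

    Because freed IDs come back, an ID read by userspace identifies a mount
    only for as long as that mount exists.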

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

22 Apr, 2008

3 commits


19 Apr, 2008

3 commits

  • Originally from: Herbert Poetzl

    This is the core of the read-only bind mount patch set.

    Note that this does _not_ add a "ro" option directly to the bind mount
    operation. If you require such a mount, you must first do the bind, then
    follow it up with a 'mount -o remount,ro' operation:

    If you wish to have a r/o bind mount of /foo on /bar:

    mount --bind /foo /bar
    mount -o remount,ro /bar
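
    The same two steps can be issued from C through mount(2). This is a
    sketch under assumptions: the helper name is hypothetical, the flag
    combination for the second call follows the mount(2) manual page for
    remounting a bind mount, and the flags the mount(8) utility actually
    passes depend on its version. The caller needs CAP_SYS_ADMIN.

```c
#include <stddef.h>
#include <sys/mount.h>

/*
 * Bind src onto dst, then remount that bind mount read-only.
 * Returns 0 on success, -1 with errno set on failure. If the second
 * step fails, the read-write bind mount from the first step is left
 * in place for the caller to deal with.
 */
static int bind_mount_readonly(const char *src, const char *dst)
{
    if (mount(src, dst, NULL, MS_BIND, NULL) != 0)
        return -1;
    if (mount(NULL, dst, NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL) != 0)
        return -1;
    return 0;
}
```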

    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Hansen
     
  • This is the real meat of the entire series. It actually
    implements the tracking of the number of writers to a mount.
    However, it causes scalability problems because there can be
    hundreds of cpus doing open()/close() on files on the same mnt at
    the same time. Even an atomic_t in the mnt has massive scaling
    problems because the cacheline gets so terribly contended.

    This uses a statically-allocated percpu variable. All want/drop
    operations are local to a cpu as long as that cpu operates on the same
    mount, and there are no writer count imbalances. Writer count
    imbalances happen when a write is taken on one cpu, and released
    on another, like when an open/close pair is performed on two
    different cpus.

    Upon a remount,ro request, all of the data from the percpu
    variables is collected (expensive, but very rare) and we determine
    if there are any outstanding writers to the mount.
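
    The counting idea can be sketched in user space as follows. This is a
    simplification and not the patch itself: instead of a per-cpu variable
    that tracks one mount at a time, the sketch gives each mount one
    counter slot per cpu. The struct, the cpu count and the function names
    are hypothetical.

```c
#define NR_TOY_CPUS 4

struct percpu_mnt {
    long writers[NR_TOY_CPUS];  /* one slot per cpu; a slot may go negative */
    int readonly;
};

/* Fast path: touch only the slot of the cpu we are running on. */
static void percpu_want_write(struct percpu_mnt *m, int cpu)
{
    m->writers[cpu]++;
}

static void percpu_drop_write(struct percpu_mnt *m, int cpu)
{
    m->writers[cpu]--;
}

/* Slow path: only the sum over all cpus is meaningful. */
static long percpu_count_writers(const struct percpu_mnt *m)
{
    long sum = 0;
    int cpu;

    for (cpu = 0; cpu < NR_TOY_CPUS; cpu++)
        sum += m->writers[cpu];
    return sum;
}

/* remount,ro: refuse while any writer is outstanding. */
static int percpu_remount_ro(struct percpu_mnt *m)
{
    if (percpu_count_writers(m) > 0)
        return -1;
    m->readonly = 1;
    return 0;
}
```

    A write taken on one cpu and released on another leaves +1 in one slot
    and -1 in the other, which is why individual slots say nothing and the
    remount path has to add them all up.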

    I've written a little benchmark to sit in a loop for a couple of
    seconds in several cpus in parallel doing open/write/close loops.

    http://sr71.net/~dave/linux/openbench.c

    The code in here is a worst-possible case for this patch. It
    does opens on a _pair_ of files in two different mounts in parallel.
    This should cause my code to lose its "operate on the same mount"
    optimization completely. This worst-case scenario causes a 3%
    degradation in the benchmark.

    I could probably get rid of even this 3%, but it would be more
    complex than what I have here, and I think this is getting into
    acceptable territory. In practice, I expect writing more than 3
    bytes to a file, as well as disk I/O to mask any effects that this
    has.

    (To get rid of that 3%, we could have an #defined number of mounts
    in the percpu variable. So, instead of a CPU getting to operate only
    on percpu data when it accesses only one mount, it could stay on
    percpu data when it only accesses N or fewer mounts.)

    [AV] merged fix for __clear_mnt_mount() stepping on freed vfsmount

    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Hansen
     
  • This patch adds two functions, mnt_want_write() and mnt_drop_write(). These are
    used like a lock pair around any fs operations that might cause a write to the
    filesystem.

    Before these can become useful, we must first cover each place in the VFS
    where writes are performed with a want/drop pair. When that is complete, we
    can actually introduce code that will safely check the counts before allowing
    r/w<->r/o transitions to occur.
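
    The bracketing pattern looks roughly like this. The sketch is a
    user-space mock with hypothetical toy_ names, not the kernel functions;
    returning -EROFS from the "want" side when the mount is read-only is
    an assumption about how the pair is meant to behave.

```c
#include <errno.h>

struct toy_mnt {
    int readonly;
    int writers;    /* number of want/drop pairs currently open */
};

static int toy_want_write(struct toy_mnt *m)
{
    if (m->readonly)
        return -EROFS;
    m->writers++;
    return 0;
}

static void toy_drop_write(struct toy_mnt *m)
{
    m->writers--;
}

/* A write-like operation bracketed by the want/drop pair. */
static int toy_touch(struct toy_mnt *m, int *timestamp)
{
    int err = toy_want_write(m);

    if (err)
        return err;     /* no drop: the want did not succeed */
    (*timestamp)++;     /* stand-in for the actual modification */
    toy_drop_write(m);
    return 0;
}
```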

    Acked-by: Serge Hallyn
    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Dave Hansen
    Signed-off-by: Al Viro

    Dave Hansen
     

28 Mar, 2008

5 commits


15 Feb, 2008

6 commits

  • seq_path() is always called with a dentry and a vfsmount from a struct path.
    Make seq_path() take it directly as an argument.

    Signed-off-by: Jan Blunck
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • In nearly all cases the set_fs_{root,pwd}() calls work on a struct
    path. Change the function to reflect this and use path_get() here.

    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • * Use struct path in fs_struct.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Jan Blunck
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • * Add path_put() functions for releasing a reference to the dentry and
    vfsmount of a struct path in the right order

    * Switch from path_release(nd) to path_put(&nd->path)

    * Rename dput_path() to path_put_conditional()

    [akpm@linux-foundation.org: fix cifs]
    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Cc:
    Cc: Al Viro
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • This is the central patch of a cleanup series. In most cases there is no good
    reason why someone would want to use a dentry for itself. This series reflects
    that fact and embeds a struct path into nameidata.

    Together with the other patches of this series
    - it enforces the correct order of getting/releasing the reference count on
      <dentry,vfsmount> pairs
    - it prepares the VFS for stacking support since it is essential to have a
      struct path in every place where the stack can be traversed
    - it reduces the overall code size:

    without patch series:
    text data bss dec hex filename
    5321639 858418 715768 6895825 6938d1 vmlinux

    with patch series:
    text data bss dec hex filename
    5320026 858418 715768 6894212 693284 vmlinux

    This patch:

    Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix cifs]
    [akpm@linux-foundation.org: fix smack]
    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • path_release_on_umount() should only be called from sys_umount(). I merged the
    function into sys_umount() instead of having it in namei.c.

    Signed-off-by: Jan Blunck
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     

09 Feb, 2008

2 commits

  • do_mount() uses a whopping 616 bytes of stack on x86_64 in 2.6.24-mm1,
    largely thanks to gcc inlining the various helper functions.

    noinlining these can slim it down a lot; on my box this patch gets it down
    to 168, which is mostly the struct nameidata nd; left on the stack.

    These functions are called only as do_mount() helpers; none of them should
    be in any path that would see a performance benefit from inlining...
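
    A minimal sketch of the technique, using the GCC/Clang noinline
    attribute directly. The function names and the buffer size are
    hypothetical, and the actual stack savings depend on the compiler and
    its options.

```c
#include <string.h>

#define toy_noinline __attribute__((noinline))

/*
 * Rarely used helper with a large local buffer. Kept out of line so
 * the buffer only occupies stack while the helper itself is running,
 * instead of being folded into the caller's frame.
 */
static toy_noinline int toy_helper(const char *opt)
{
    char scratch[512];

    strncpy(scratch, opt, sizeof(scratch) - 1);
    scratch[sizeof(scratch) - 1] = '\0';
    return (int)strlen(scratch);
}

/* Caller whose own frame stays small when the helper is not inlined. */
static int toy_do_mount(const char *opt, int need_helper)
{
    if (need_helper)
        return toy_helper(opt);
    return 0;
}
```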

    Signed-off-by: Eric Sandeen
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • Add a new s_options field to struct super_block. Filesystems can save
    mount options passed to them in mount or remount. It is automatically
    freed when the superblock is destroyed.

    A new helper function, generic_show_options() is introduced, which uses
    this field to display the mount options in /proc/mounts.

    Another helper function, save_mount_options() may be used by
    filesystems to save the options in the super block.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

07 Feb, 2008

1 commit

  • We can use ilog2() in fs/namespace.c to compute hash_bits and hash_mask at
    compile time, not runtime.
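
    The idea can be sketched outside the kernel with a constant-expression
    macro. TOY_ILOG2 below is a stand-in for the kernel's ilog2() and only
    covers values below 2^16; the table size of 256 and the hash function
    are hypothetical and not taken from fs/namespace.c.

```c
/* Compile-time floor(log2(n)) for 1 <= n < 2^16. */
#define TOY_ILOG2(n) ( \
    (n) >= (1u << 15) ? 15 : (n) >= (1u << 14) ? 14 : \
    (n) >= (1u << 13) ? 13 : (n) >= (1u << 12) ? 12 : \
    (n) >= (1u << 11) ? 11 : (n) >= (1u << 10) ? 10 : \
    (n) >= (1u <<  9) ?  9 : (n) >= (1u <<  8) ?  8 : \
    (n) >= (1u <<  7) ?  7 : (n) >= (1u <<  6) ?  6 : \
    (n) >= (1u <<  5) ?  5 : (n) >= (1u <<  4) ?  4 : \
    (n) >= (1u <<  3) ?  3 : (n) >= (1u <<  2) ?  2 : \
    (n) >= (1u <<  1) ?  1 : 0)

#define TOY_HASH_SIZE 256u  /* must be a power of two */

/* Both values are integer constant expressions, fixed at build time. */
enum {
    TOY_HASH_BITS = TOY_ILOG2(TOY_HASH_SIZE),
    TOY_HASH_MASK = TOY_HASH_SIZE - 1
};

/* Fold the key and mask it into the table range. */
static unsigned int toy_hash(unsigned long key)
{
    return (unsigned int)((key ^ (key >> TOY_HASH_BITS)) & TOY_HASH_MASK);
}
```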

    [akpm@linux-foundation.org: clean it all up]
    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

25 Jan, 2008

2 commits


21 Oct, 2007

1 commit


20 Oct, 2007

1 commit

  • This flag tells the .get_sb callback that this is a kern_mount() call, so
    it can trust the *data pointer to be a valid in-kernel one. If this flag is
    passed from a user process, it is cleared, since the *data pointer is then
    not a valid kernel object.

    Looking a few steps ahead: this will be needed for proc to create the
    superblock and store a valid pid namespace on it during namespace
    creation. The reason why the namespace cannot live without a proc mount is
    described in the appropriate patch.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

17 Oct, 2007

1 commit

  • Since the mempages parameter is actually not used, it should be removed.

    Now only files_init uses the mempages parameter:

    files_init(mempages);

    but I don't think the adaptation to mempages in files_init is really
    useful; and if files_init also changed to the prototype void (*func)(void),
    the wrapper vfs_caches_init would also not need the mempages parameter.

    Signed-off-by: Denis Cheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Cheng
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

17 Jul, 2007

4 commits

  • Every file should include the headers containing the prototypes for
    its global functions.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • While working on unshare support for the network namespace I noticed we
    were putting clone flags in an int. Which is weird because the syscall
    uses unsigned long and we at least need an unsigned to properly hold all of
    the unshare flags.

    So to make the code consistent, this patch updates the code to use
    unsigned long instead of int for the clone flags in those places
    where we get it wrong today.

    Signed-off-by: Eric W. Biederman
    Acked-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • dup_mnt_ns() and clone_uts_ns() return NULL on failure. This is wrong,
    create_new_namespaces() uses ERR_PTR() to catch an error. This means that the
    subsequent create_new_namespaces() will hit BUG_ON() in copy_mnt_ns() or
    copy_utsname().

    Modify create_new_namespaces() to also use the errors returned by the
    copy_*_ns routines and not to systematically return ENOMEM.
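
    The convention at issue can be reproduced in user space. The helpers
    below mirror the idea behind the kernel's ERR_PTR(), PTR_ERR() and
    IS_ERR() but are reimplemented under toy_ names, and the namespace
    struct and the two functions built on top are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in the top page of the address space. */
static inline void *toy_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline long toy_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int toy_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_ns {
    int id;
};

/* Returns a new namespace or an encoded error, never NULL. */
static struct toy_ns *toy_copy_ns(int simulate_error)
{
    struct toy_ns *ns;

    if (simulate_error)
        return toy_err_ptr(-simulate_error);
    ns = malloc(sizeof(*ns));
    if (ns == NULL)
        return toy_err_ptr(-ENOMEM);
    ns->id = 1;
    return ns;
}

/* Pass on the callee's error code instead of a blanket -ENOMEM. */
static long toy_create_namespaces(int simulate_error, struct toy_ns **out)
{
    struct toy_ns *ns = toy_copy_ns(simulate_error);

    if (toy_is_err(ns))
        return toy_ptr_err(ns);
    *out = ns;
    return 0;
}
```

    A callee that returns NULL on failure slips past a toy_is_err() style
    check, since NULL is not in the error range. That is the mismatch the
    commit message describes.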

    [oleg@tv-sign.ru: better changelog]
    Signed-off-by: Cedric Le Goater
    Cc: Serge E. Hallyn
    Cc: Badari Pulavarty
    Cc: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • One more simple and stupid switching to the new API.

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

09 May, 2007

4 commits

  • There's a missing check for CAP_SYS_ADMIN in do_change_type().

    Signed-off-by: Miklos Szeredi
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • There are many places in the kernel where constructions like

    foo = list_entry(head->next, struct foo_struct, list);

    are used.
    The code might look more descriptive and neat if it used the macro

    #define list_first_entry(head, type, member) \
            list_entry((head)->next, type, member)

    Here is the macro itself and the examples of its usage in the generic code.
    If it turns out to be useful, I can prepare the set of patches to
    inject it into arch-specific code, drivers, networking, etc.
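
    For readers without a kernel tree at hand, a self-contained user-space
    sketch of the macro in use. The list_head type and list_entry() are
    reimplemented here with offsetof() rather than taken from the kernel
    headers, and toy_list_add_tail() is a hypothetical helper.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Recover the containing structure from a pointer to its list member. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_first_entry(head, type, member) \
    list_entry((head)->next, type, member)

struct foo_struct {
    int value;
    struct list_head list;
};

/* Insert new_node just before head, i.e. at the tail of the list. */
static void toy_list_add_tail(struct list_head *new_node,
                              struct list_head *head)
{
    new_node->prev = head->prev;
    new_node->next = head;
    head->prev->next = new_node;
    head->prev = new_node;
}
```

    As with the open-coded form, the macro must only be used on a
    non-empty list; on an empty list head->next points back at the head
    itself, which is not embedded in a foo_struct.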

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Cc: Randy Dunlap
    Cc: Andi Kleen
    Cc: Zach Brown
    Cc: Davide Libenzi
    Cc: John McCutchan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Cc: Ram Pai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     
  • There's a slight problem with filesystem type representation in fuse
    based filesystems.

    From the kernel's view, there are just two filesystem types: fuse and
    fuseblk. From the user's view there are lots of different filesystem
    types. The user is not even much concerned if the filesystem is fuse based
    or not. So there's a conflict of interest in how this should be
    represented in fstab, mtab and /proc/mounts.

    The current scheme is to encode the real filesystem type in the mount
    source. So an sshfs mount looks like this:

    sshfs#user@server:/ /mnt/server fuse rw,nosuid,nodev,...

    This url-ish syntax works OK for sshfs and similar filesystems. However
    for block device based filesystems (ntfs-3g, zfs) it doesn't work, since
    the kernel expects the mount source to be a real device name.

    A possibly better scheme would be to encode the real type in the type
    field as "type.subtype". So fuse mounts would look like this:

    /dev/hda1 /mnt/windows fuseblk.ntfs-3g rw,...
    user@server:/ /mnt/server fuse.sshfs rw,nosuid,nodev,...

    This patch adds the necessary code to the kernel so that this can be
    correctly displayed in /proc/mounts.
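
    On the reading side, a tool could take such a type field apart like
    this. The helper is hypothetical, and splitting at the first dot is an
    assumption about the convention rather than something this message
    states.

```c
#include <stdio.h>
#include <string.h>

/*
 * Split "type.subtype" at the first dot. If there is no dot, subtype
 * becomes the empty string. Output is truncated to the buffer sizes.
 */
static void split_fstype(const char *full, char *type, size_t type_len,
                         char *subtype, size_t subtype_len)
{
    const char *dot = strchr(full, '.');
    size_t n = dot ? (size_t)(dot - full) : strlen(full);

    snprintf(type, type_len, "%.*s", (int)n, full);
    snprintf(subtype, subtype_len, "%s", dot ? dot + 1 : "");
}
```

    With this, "fuseblk.ntfs-3g" yields the kernel-visible type "fuseblk"
    and the user-visible subtype "ntfs-3g", while a plain "ext3" yields an
    empty subtype.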

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • sys_clone() and sys_unshare() both make copies of nsproxy and its associated
    namespaces. But they have different code paths.

    This patch merges all the nsproxy and its associated namespace copy/clone
    handling (as much as possible). Posted on container list earlier for
    feedback.

    - Create a new nsproxy and its associated namespaces and pass it back to
    caller to attach it to right process.

    - Changed all copy_*_ns() routines to return a new copy of namespace
    instead of attaching it to task->nsproxy.

    - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.

    - Removed unnecessary !ns checks from copy_*_ns() and added BUG_ON()
    just in case.

    - Get rid of all individual unshare_*_ns() routines and make use of
    copy_*_ns() instead.

    [akpm@osdl.org: cleanups, warning fix]
    [clg@fr.ibm.com: remove dup_namespaces() declaration]
    [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
    [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Cedric Le Goater
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty