07 Jan, 2012

2 commits

  • Currently remounting a superblock read-only is racy in a major way.

    With the per mount read-only infrastructure it is now possible to
    prevent most races, which this patch attempts.

    Before starting the remount read-only, iterate through all mounts
    belonging to the superblock and if none of them have any pending
    writes, set sb->s_readonly_remount. This indicates that remount is in
    progress and no further write requests are allowed. If the remount
    succeeds set MS_RDONLY and reset s_readonly_remount.

    If the remounting is unsuccessful just reset s_readonly_remount.
    This can result in transient EROFS errors, despite the fact the
    remount failed. Unfortunately holding off writes is difficult as
    remount itself may touch the filesystem (e.g. through load_nls())
    which would deadlock.

    A later patch deals with delayed writes due to nlink going to zero.
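
    As a rough illustration of the protocol described above, here is a
    single-threaded user-space sketch. It is not the kernel code: every
    toy_* name is hypothetical, the -EBUSY return for pending writers is an
    assumption, and the locking and memory ordering the real patch needs
    are ignored.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Toy model, not kernel code; all names are hypothetical. */
    struct toy_mount { int writers; };      /* pending write requests */
    struct toy_sb {
        bool rdonly;                        /* stands in for MS_RDONLY */
        bool readonly_remount;              /* stands in for s_readonly_remount */
        struct toy_mount *mounts;           /* every mount of this superblock */
        size_t nr_mounts;
    };

    /* A writer asks for write access through one mount. */
    int toy_want_write(struct toy_sb *sb, struct toy_mount *m)
    {
        if (sb->rdonly || sb->readonly_remount)
            return -EROFS;                  /* may be transient if remount fails */
        m->writers++;
        return 0;
    }

    void toy_drop_write(struct toy_mount *m) { m->writers--; }

    /* fs_remount stands in for the filesystem's own remount step. */
    int toy_remount_ro(struct toy_sb *sb, int (*fs_remount)(struct toy_sb *))
    {
        for (size_t i = 0; i < sb->nr_mounts; i++)
            if (sb->mounts[i].writers > 0)
                return -EBUSY;              /* assumed error code */

        sb->readonly_remount = true;        /* no new writers from here on */
        int err = fs_remount(sb);
        if (!err)
            sb->rdonly = true;
        sb->readonly_remount = false;       /* reset on success and failure */
        return err;
    }

    /* Two stand-in filesystem callbacks for experimenting. */
    int toy_fs_ok(struct toy_sb *sb)   { (void)sb; return 0; }
    int toy_fs_fail(struct toy_sb *sb) { (void)sb; return -EIO; }
    ```

    A failed toy_remount_ro() leaves the superblock writable again, which
    mirrors the "just reset s_readonly_remount" case in the text.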

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Al Viro
     

04 Jan, 2012

4 commits


20 Jul, 2011

2 commits

  • The per-sb shrinker has the same requirement as the writeback
    threads of ensuring that the superblock is usable and pinned for the
    time it takes to run the work. Both need to take a passive reference
    to the sb, take a read lock on the s_umount lock and then only
    continue if an unmount is not in progress.

    pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
    rename it to grab_super_passive() and export it via fs/internal.h
    for all the VFS code to be able to use.
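
    The three steps named above (passive reference, read lock, unmount
    check) could be modelled in user space roughly as follows. This is only
    a sketch: the struct fields and toy_* names are hypothetical, and
    treating the read lock as a trylock is an assumption, not something the
    message states.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Toy model, not kernel code; every name here is hypothetical. */
    struct toy_super {
        int  passive_refs;     /* stands in for the passive sb reference */
        int  umount_readers;   /* read holders of the s_umount stand-in */
        bool umount_writer;    /* true while someone holds it for write */
        bool unmounting;       /* true once an unmount is in progress */
    };

    /*
     * Pin sb for a short piece of work. On true the caller holds a passive
     * reference plus a read lock and must call toy_release_super().
     */
    bool toy_grab_super_passive(struct toy_super *sb)
    {
        sb->passive_refs++;
        if (!sb->umount_writer) {          /* read-trylock succeeded */
            sb->umount_readers++;
            if (!sb->unmounting)
                return true;
            sb->umount_readers--;          /* unmount in progress: back off */
        }
        sb->passive_refs--;
        return false;
    }

    void toy_release_super(struct toy_super *sb)
    {
        sb->umount_readers--;
        sb->passive_refs--;
    }
    ```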

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
    Allocates dentry, sets its ->d_sb to given superblock and sets
    ->d_op accordingly. Old d_alloc(NULL, name) callers are converted
    to that (all of them know what superblock they want). d_alloc()
    itself is left only for parent != NULL case; uses __d_alloc(),
    inserts result into the list of parent's children.

    Note that now ->d_sb is assign-once and never NULL *and*
    ->d_parent is never NULL either.
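
    A small user-space sketch of the split described above. The types and
    toy_* names are hypothetical; taking the dentry operations from the
    superblock and making a parentless dentry its own parent are
    assumptions made so that ->d_parent is never NULL in the model.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Toy model, not kernel code; all names are hypothetical. */
    struct toy_dsb { const void *s_d_op; };     /* per-sb default d_op */

    struct toy_dentry {
        struct toy_dsb    *d_sb;        /* assigned once, never NULL */
        const void        *d_op;
        struct toy_dentry *d_parent;    /* never NULL */
        struct toy_dentry *d_sibling;   /* next child of the same parent */
        struct toy_dentry *d_children;  /* first child */
        const char        *d_name;
    };

    /* Counterpart of __d_alloc(sb, name): no parent involved. */
    struct toy_dentry *toy_d_alloc_sb(struct toy_dsb *sb, const char *name)
    {
        struct toy_dentry *d = calloc(1, sizeof(*d));
        if (!d)
            return NULL;
        d->d_sb = sb;
        d->d_op = sb->s_d_op;
        d->d_parent = d;                /* assumption: self-parented */
        d->d_name = name;
        return d;
    }

    /* Counterpart of d_alloc(parent, name): parent must not be NULL. */
    struct toy_dentry *toy_d_alloc(struct toy_dentry *parent, const char *name)
    {
        struct toy_dentry *d = toy_d_alloc_sb(parent->d_sb, name);
        if (!d)
            return NULL;
        d->d_parent = parent;
        d->d_sibling = parent->d_children;   /* link into parent's children */
        parent->d_children = d;
        return d;
    }
    ```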

    Signed-off-by: Al Viro

    Al Viro
     

25 Mar, 2011

2 commits

  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the per-sb inode list with a new global lock
    inode_sb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

22 Mar, 2011

1 commit


18 Mar, 2011

1 commit

  • new function: mount_fs(). Does all work done by vfs_kern_mount()
    except the allocation and filling of vfsmount; returns root dentry
    or ERR_PTR().

    vfs_kern_mount() switched to using it and taken to fs/namespace.c,
    along with its wrappers.

    alloc_vfsmnt()/free_vfsmnt() made static.

    functions in namespace.c slightly reordered.

    Signed-off-by: Al Viro

    Al Viro
     

15 Mar, 2011

1 commit


14 Mar, 2011

2 commits

  • new function: file_open_root(dentry, mnt, name, flags) opens the file
    vfs_path_lookup would arrive at.

    Note that name can be empty; in that case the usual requirement that
    dentry should be a directory is lifted.

    open-coded equivalents switched to it, may_open() got down exactly
    one caller and became static.

    Signed-off-by: Al Viro

    Al Viro
     
  • take calculation of open_flags by open(2) arguments into new helper
    in fs/open.c, move filp_open() over there, have it and do_sys_open()
    use that helper, switch exec.c callers of do_filp_open() to explicit
    (and constant) struct open_flags.

    Signed-off-by: Al Viro

    Al Viro
     

24 Feb, 2011

1 commit

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change) so any
    data we hold becomes irrelevant.
    In the other, the device has changed size (check_disk_size_change)
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_device.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bdev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flush_disk then passes true from check_disk_change and false from
    check_disk_size_change.
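
    The effect of the new flag can be sketched in user space as below. This
    is a toy model under stated assumptions: the toy_* functions and the
    struct are hypothetical stand-ins, and only the "skip dirty inodes
    unless told to kill them" decision is modelled.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Toy model, not kernel code; names are hypothetical. */
    struct toy_inode { bool dirty; bool cached; };

    /*
     * Drop cached inodes. Dirty ones are dropped only when kill_dirty is
     * set. Returns how many inodes were kept in the cache.
     */
    size_t toy_invalidate_inodes(struct toy_inode *inodes, size_t n,
                                 bool kill_dirty)
    {
        size_t kept = 0;

        for (size_t i = 0; i < n; i++) {
            if (!inodes[i].cached)
                continue;
            if (inodes[i].dirty && !kill_dirty) {
                kept++;                 /* still needs writing back */
                continue;
            }
            inodes[i].cached = false;
        }
        return kept;
    }

    size_t toy_flush_disk(struct toy_inode *inodes, size_t n, bool kill_dirty)
    {
        return toy_invalidate_inodes(inodes, n, kill_dirty);
    }

    /* Device gone: nothing can ever be written back, so kill dirty data. */
    size_t toy_check_disk_change(struct toy_inode *inodes, size_t n)
    {
        return toy_flush_disk(inodes, n, true);
    }

    /* Device resized: dirty data is still valid, so keep it. */
    size_t toy_check_disk_size_change(struct toy_inode *inodes, size_t n)
    {
        return toy_flush_disk(inodes, n, false);
    }
    ```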

    dm avoids tripping over this problem by calling i_size_write directly
    rather than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a which causes
    check_disk_size_change to call flush_disk, so it is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
     

17 Jan, 2011

3 commits

  • do_add_mount() and mnt_clear_expiry() are not needed outside of
    namespace.c anymore, now that namei has finish_automount() to
    use.

    Signed-off-by: Al Viro

    Al Viro
     
  • ... and shift it from namei.c to namespace.c

    Signed-off-by: Al Viro

    Al Viro
     
  • Instead of splitting refcount between (per-cpu) mnt_count
    and (SMP-only) mnt_longrefs, make all references contribute
    to mnt_count again and keep track of how many are longterm
    ones.

    Accounting rules for longterm count:
    * 1 for each fs_struct.root.mnt
    * 1 for each fs_struct.pwd.mnt
    * 1 for having non-NULL ->mnt_ns
    * decrement to 0 happens only under vfsmount lock exclusive

    That allows nice common case for mntput() - since we can't drop the
    final reference until after mnt_longterm has reached 0 due to the rules
    above, mntput() can grab vfsmount lock shared and check mnt_longterm.
    If it turns out to be non-zero (which is the common case), we know
    that this is not the final mntput() and can just blindly decrement
    percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
    do usual decrement-and-check of percpu mnt_count.
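
    The fast path described above can be sketched like this. It is a
    single-threaded user-space model with hypothetical toy_* names; the
    per-cpu counter and the shared/exclusive vfsmount lock are reduced to
    a plain int and comments.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Toy model, not kernel code; names are hypothetical. */
    struct toy_mnt {
        int count;      /* all references, long-term ones included */
        int longterm;   /* how many of them are long-term */
    };

    void toy_mntget(struct toy_mnt *m)         { m->count++; }
    void toy_make_longterm(struct toy_mnt *m)  { m->longterm++; }
    void toy_make_shortterm(struct toy_mnt *m) { m->longterm--; }

    /* Returns true when this call dropped the final reference. */
    bool toy_mntput(struct toy_mnt *m)
    {
        /* real code: vfsmount lock taken shared here */
        if (m->longterm > 0) {
            m->count--;         /* cannot be the last reference */
            return false;
        }
        /* real code: vfsmount lock taken exclusive here */
        m->count--;
        return m->count == 0;
    }
    ```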

    For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
    namespace.c uses the latter in places where we don't already hold
    vfsmount lock exclusive and opencodes a few remaining spots where
    we need to manipulate mnt_longterm.

    Note that we mostly revert the code outside of fs/namespace.c back
    to what we used to have; in particular, normal code doesn't need
    to care about two kinds of references, etc. And we get to keep
    the optimization Nick's variant had bought us...

    Signed-off-by: Al Viro

    Al Viro
     

16 Jan, 2011

1 commit

  • Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
    added rather than calling do_add_mount() itself. follow_automount() will then
    do the addition.

    This slightly complicates things as ->d_automount() normally wants to add the
    new vfsmount to an expiration list and start an expiration timer. The problem
    with that is that the vfsmount will be deleted if it has a refcount of 1 and
    the timer will not repeat if the expiration list is empty.

    To this end, we require the vfsmount to be returned from d_automount() with a
    refcount of (at least) 2. One of these refs will be dropped unconditionally.
    In addition, follow_automount() must get a 3rd ref around the call to
    do_add_mount() lest it eat a ref and return an error, leaving the mount we
    have open to being expired as we would otherwise have only 1 ref on it.

    d_automount() should also add the vfsmount to the expiration list (by
    calling mnt_set_expiry()) and start the expiration timer before returning, if
    this mechanism is to be used. The vfsmount will be unlinked from the
    expiration list by follow_automount() if do_add_mount() fails.

    This patch also fixes the call to do_add_mount() for AFS to propagate the mount
    flags from the parent vfsmount.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

07 Jan, 2011

1 commit

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    and these lookups often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only take the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.
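
    A minimal user-space sketch of that last scheme, assuming a fixed
    number of CPUs and no concurrency. TOY_NR_CPUS and the toy_* names are
    hypothetical; the point is only that individual per-CPU slots may go
    negative while their sum stays meaningful, and that the sum is skipped
    while a long reference exists.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    #define TOY_NR_CPUS 4       /* hypothetical CPU count */

    /* Toy model, not kernel code; names are hypothetical. */
    struct toy_pcpu_mnt {
        int pcpu[TOY_NR_CPUS];  /* per-CPU (increments - decrements) */
        int longrefs;           /* slow, long-lasting references */
    };

    /* Short reference: only the local slot is touched. */
    void toy_pcpu_get(struct toy_pcpu_mnt *m, int cpu)
    {
        m->pcpu[cpu]++;
    }

    /* Returns true when the last reference of any kind is gone. */
    bool toy_pcpu_put(struct toy_pcpu_mnt *m, int cpu)
    {
        int sum = 0;

        m->pcpu[cpu]--;
        if (m->longrefs > 0)
            return false;       /* common case: no global sum needed */

        for (int i = 0; i < TOY_NR_CPUS; i++)
            sum += m->pcpu[i];
        return sum == 0;
    }
    ```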

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

29 Oct, 2010

1 commit


26 Oct, 2010

3 commits

  • Pull removal of fsnotify marks into generic_shutdown_super().
    Split umount-time work into a new function - evict_inodes().
    Make sure that invalidate_inodes() will be able to cope with
    I_FREEING once we change locking in iput().

    Signed-off-by: Al Viro

    Al Viro
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Signed-off-by: Al Viro

    Al Viro
     

18 Aug, 2010

2 commits

  • fs: brlock vfsmount_lock

    Use a brlock for the vfsmount lock. It must be taken for write whenever
    modifying the mount hash or associated fields, and may be taken for read when
    performing mount hash lookups.

    A new lock is added for the mnt-id allocator, so it doesn't need to take
    the heavy vfsmount write-lock.

    The number of atomics should remain the same for fastpath rlock cases, though
    code would be slightly slower due to per-cpu access. Scalability will not be
    much improved in common cases yet, due to other locks (ie. dcache_lock) getting
    in the way. However path lookups crossing mountpoints should be one case where
    scalability is improved (currently requiring the global lock).

    The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
    Altix system (high latency to remote nodes), a simple umount microbenchmark
    (mount --bind mnt mnt2 ; umount mnt2, looped 1000 times) took 6.8s before
    this patch and 7.1s afterwards, about 5% slower.

    Cc: Al Viro
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • tty: fix fu_list abuse

    tty code abuses fu_list, which causes a bug in remount,ro handling.

    If a tty device node is opened on a filesystem, then the last link to the inode
    is removed, the filesystem will be allowed to be remounted readonly. This is
    because fs_may_remount_ro does not find the 0 link tty inode on the file sb
    list (because the tty code incorrectly removed it to use for its own purpose).
    This can result in a filesystem with errors after it is marked "clean".

    Taking the idea from Christoph's initial patch, allocate a tty private struct
    at file->private_data and put our required list fields in there, linking
    file and tty. This makes tty nodes behave the same way as other device nodes,
    avoids meddling with the vfs, and avoids this bug.

    The error handling is not trivial in the tty code, so for this bugfix, I take
    the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
    This is not a problem because our allocator doesn't fail small allocs as a rule
    anyway. So proper error handling is left as an exercise for tty hackers.

    [ Arguably filesystem's device inode would ideally be divorced from the
    driver's pseudo inode when it is opened, but in practice it's not clear whether
    that will ever be worth implementing. ]

    Cc: linux-kernel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Alan Cox
    Cc: Greg Kroah-Hartman
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

22 May, 2010

1 commit


04 Mar, 2010

1 commit


23 Dec, 2009

1 commit


17 Dec, 2009

1 commit

  • All users outside of fs/ of get_empty_filp() have been removed. This patch
    moves the definition from the include/ directory to internal.h so no new
    users crop up and removes the EXPORT_SYMBOL. I'd love to see open intents
    stop using it too, but that's a problem for another day and a smarter
    developer!

    Signed-off-by: Eric Paris
    Acked-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Eric Paris
     

24 Sep, 2009

1 commit

  • sys_mount() reads/copies a whole page for its "type" parameter. When
    do_mount_root() passes a kernel address that points to an object which is
    smaller than a whole page, copy_mount_options() will happily go past this
    memory object, possibly dereferencing "wild" pointers that could be in any
    state (hence the kmemcheck warning, which shows that parts of the next
    page are not even allocated).

    (The likelihood of something going wrong here is pretty low -- first of
    all this only applies to kernel calls to sys_mount(), which are mostly
    found in the boot code. Secondly, I guess if the page was not mapped,
    exact_copy_from_user() _would_ in fact handle it correctly because of its
    access_ok(), etc. checks.)

    But it is much nicer to avoid the dubious reads altogether, by stopping as
    soon as we find a NUL byte. Is there a good reason why we can't do
    something like this, using the already existing strndup_from_user()?
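
    The idea of stopping at the first NUL instead of copying a whole page
    can be illustrated with a user-space analogue. toy_copy_mount_string()
    is a hypothetical name, not the helper added by the patch, and unlike a
    kernel helper it simply truncates at the limit, which may differ from
    what strndup_from_user() does.

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Toy analogue, not kernel code. Copies src up to the first NUL byte
     * or up to limit bytes, whichever comes first, and never reads past
     * the terminating NUL. The caller frees the result.
     */
    char *toy_copy_mount_string(const char *src, size_t limit)
    {
        size_t len = 0;
        char *dst;

        if (!src)
            return NULL;

        while (len < limit && src[len] != '\0')
            len++;              /* stop as soon as we find a NUL byte */

        dst = malloc(len + 1);
        if (!dst)
            return NULL;
        memcpy(dst, src, len);
        dst[len] = '\0';
        return dst;
    }
    ```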

    [akpm@linux-foundation.org: make copy_mount_string() static]
    [AV: fix compat mount breakage, which involves undoing akpm's change above]

    Reported-by: Ingo Molnar
    Signed-off-by: Vegard Nossum
    Cc: Al Viro
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: al

    Vegard Nossum
     

12 Jun, 2009

4 commits

  • do_remount_sb() is fs/internal.h fodder, fsync_no_super() is long gone.

    Signed-off-by: Al Viro

    Al Viro
     
  • It is unnecessarily fragile to have two places (fsync_super() and do_sync())
    doing data integrity sync of the filesystem. Alter __fsync_super() to
    accommodate needs of both callers and use it. So after this patch
    __fsync_super() is the only place where we gather all the calls needed to
    properly send all data on a filesystem to disk.

    Nice bonus is that we get a complete livelock avoidance and write_supers()
    is now only used for periodic writeback of superblocks.

    sync_blockdevs() introduced a couple of patches ago is gone now.

    [build fixes folded]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • So far, do_sync() called:
    sync_inodes(0);
    sync_supers();
    sync_filesystems(0);
    sync_filesystems(1);
    sync_inodes(1);

    This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
    submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
    transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
    not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
    and others are hit by this) when racing e.g. with background writeback. A
    similar problem hits also other filesystems (e.g. ext2) because of
    write_supers() being called before the sync_inodes(1).

    Change the ordering of calls in do_sync() - this requires a new function
    sync_blockdevs() to preserve the property that block devices are always synced
    after write_super() / sync_fs() call.

    The same issue is fixed in __fsync_super() function used on umount /
    remount read-only.

    [AV: build fixes]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • This function walks the s_files list, and operates primarily on the
    files in a superblock, so it better belongs here (eg. see also
    fs_may_remount_ro).

    [AV: ... and it shouldn't be static after that move]

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

01 Apr, 2009

2 commits

  • * all changes of current->fs are done under task_lock and write_lock of
    old fs->lock
    * refcount is not atomic anymore (same protection)
    * its decrements are done when removing reference from current; at the
    same time we decide whether to free it.
    * put_fs_struct() is gone
    * new field - ->in_exec. Set by check_unsafe_exec() if we are trying to do
    execve() and only subthreads share fs_struct. Cleared when finishing exec
    (success and failure alike). Makes CLONE_FS fail with -EAGAIN if set.
    * check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
    is in progress.

    Signed-off-by: Al Viro

    Al Viro
     
  • Pure code move; two new helper functions for nfsd and daemonize
    (unshare_fs_struct() and daemonize_fs_struct() resp.; for now -
    the same code as used to be in callers). unshare_fs_struct()
    exported (for nfsd, as copy_fs_struct()/exit_fs() used to be),
    copy_fs_struct() and exit_fs() don't need exports anymore.

    Signed-off-by: Al Viro

    Al Viro
     

29 Mar, 2009

1 commit

  • Joe Malicki reports that setuid sometimes doesn't: very rarely,
    a setuid root program does not get root euid; and, by the way,
    they have a health check running lsof every few minutes.

    Right, check_unsafe_exec() notes whether the files_struct is being
    shared by more threads than will get killed by the exec, and if so
    sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
    But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
    use of get_files_struct(), which also raises that sharing count.

    There's a rather simple fix for this: exec's check on files->count
    has been redundant ever since 2.6.1 made it unshare_files() (except
    while compat_do_execve() omitted to do so) - just remove that check.

    [Note to -stable: this patch will not apply before 2.6.29: earlier
    releases should just remove the files->count line from unsafe_exec().]

    Reported-by: Joe Malicki
    Narrowed-down-by: Michael Itz
    Tested-by: Joe Malicki
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

07 Feb, 2009

1 commit

  • The patch:

    commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
    CRED: Make execve() take advantage of copy-on-write credentials

    moved the point at which the 'safeness' of a SUID/SGID exec is determined to
    before de_thread() is called. This means that LSM_UNSAFE_SHARE is now
    calculated incorrectly. This flag is set if any of the usage counts for
    fs_struct, files_struct and sighand_struct are greater than 1 at the time the
    determination is made, all of which is true for threads created by the
    pthread library.

    However, since we wish to make the security calculation before irrevocably
    damaging the process so that we can return it an error code in the case where
    we decide we want to reject the exec request on this basis, we have to make the
    determination before calling de_thread().

    So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
    our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
    (CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
    so can be discounted by check_unsafe_exec().

    We do have to be careful because CLONE_THREAD does not imply FS or FILES.

    We _assume_ that there will be no extra references to these structs held by the
    threads we're going to kill.
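
    The counting rule described above might look like this in a user-space
    model. The structs and toy_unsafe_share() are hypothetical; siblings[]
    stands for the other CLONE_THREAD members of our thread group, which
    de_thread() would kill.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Toy model, not kernel code; names are hypothetical. */
    struct toy_res { int users; };      /* usage count of a shared struct */

    struct toy_task {
        struct toy_res *fs;             /* shared if CLONE_FS */
        struct toy_res *files;          /* shared if CLONE_FILES */
        struct toy_res *sighand;        /* shared if CLONE_SIGHAND */
    };

    /* True if the LSM_UNSAFE_SHARE condition would hold for self. */
    bool toy_unsafe_share(const struct toy_task *self,
                          const struct toy_task *siblings, size_t n)
    {
        int n_fs = 1, n_files = 1, n_sighand = 1;   /* our own references */

        /* CLONE_THREAD does not imply FS or FILES, so compare pointers. */
        for (size_t i = 0; i < n; i++) {
            if (siblings[i].fs == self->fs)
                n_fs++;
            if (siblings[i].files == self->files)
                n_files++;
            if (siblings[i].sighand == self->sighand)
                n_sighand++;
        }

        /* Any user not accounted for by our own threads is unsafe. */
        return self->fs->users > n_fs ||
               self->files->users > n_files ||
               self->sighand->users > n_sighand;
    }
    ```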

    This can be tested with the attached pair of programs. Build the two programs
    using the Makefile supplied, and run ./test1 as a non-root user. If
    successful, you should see something like:

    [dhowells@andromeda tmp]$ ./test1
    --TEST1--
    uid=4043, euid=4043 suid=4043
    exec ./test2
    --TEST2--
    uid=4043, euid=0 suid=0
    SUCCESS - Correct effective user ID

    and if unsuccessful, something like:

    [dhowells@andromeda tmp]$ ./test1
    --TEST1--
    uid=4043, euid=4043 suid=4043
    exec ./test2
    --TEST2--
    uid=4043, euid=4043 suid=4043
    ERROR - Incorrect effective user ID!

    The non-root user ID you see will depend on the user you run as.

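    [test1.c]
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <pthread.h>

    static void *thread_func(void *arg)
    {
        /* Spin forever so that a second thread exists at exec time. */
        while (1) {}
    }

    int main(int argc, char **argv)
    {
        pthread_t tid;
        uid_t uid, euid, suid;

        printf("--TEST1--\n");
        getresuid(&uid, &euid, &suid);
        printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

        /* pthread_create() returns an error number rather than -1. */
        if (pthread_create(&tid, NULL, thread_func, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            exit(1);
        }

        printf("exec ./test2\n");
        execlp("./test2", "test2", NULL);
        /* Only reached if the exec itself failed. */
        perror("./test2");
        _exit(1);
    }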

    [test2.c]
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        uid_t uid, euid, suid;

        getresuid(&uid, &euid, &suid);
        printf("--TEST2--\n");
        printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

        /* test2 is setuid root, so the effective UID should be 0. */
        if (euid != 0) {
            fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
            exit(1);
        }
        printf("SUCCESS - Correct effective user ID\n");
        exit(0);
    }

    [Makefile]
    CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
    all: test1 test2

    test1: test1.c
    	gcc $(CFLAGS) -o test1 test1.c -lpthread

    test2: test2.c
    	gcc $(CFLAGS) -o test2 test2.c
    	sudo chown root.root test2
    	sudo chmod +s test2

    Reported-by: David Smith
    Signed-off-by: David Howells
    Acked-by: David Smith
    Signed-off-by: James Morris

    David Howells