13 Oct, 2012

1 commit

  • ...and fix up the callers. For do_file_open_root, just declare a
    struct filename on the stack and fill out the .name field. For
    do_filp_open, make it also take a struct filename pointer, and fix up its
    callers to call it appropriately.

    For filp_open, add a variant that takes a struct filename pointer and turn
    filp_open into a wrapper around it.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
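
The wrapper arrangement described in this commit can be sketched in user space roughly as follows. This is an illustrative model only: `struct filename` is reduced to its `.name` field, and `struct named_file_model`, `open_by_filename()` and `open_by_string()` are hypothetical names that do not come from the commit. Nothing here touches a real filesystem.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct filename: only the .name field is modelled. */
struct filename {
    const char *name;
};

/* Hypothetical stand-in for an opened file; it just records the request. */
struct named_file_model {
    const char *name;
    int flags;
    int is_open;
};

/* Variant that takes a struct filename pointer (hypothetical name). */
int open_by_filename(const struct filename *fn, int flags,
                     struct named_file_model *out)
{
    assert(out != NULL);
    if (fn == NULL || fn->name == NULL)
        return -1;                  /* nothing to open */
    out->name = fn->name;
    out->flags = flags;
    out->is_open = 1;
    return 0;
}

/* String-based entry point: declare a struct filename on the stack,
 * fill out .name, and delegate to the struct-based variant. */
int open_by_string(const char *path, int flags, struct named_file_model *out)
{
    struct filename tmp = { .name = path };
    return open_by_filename(&tmp, flags, out);
}
```

Keeping the string-based function as a thin wrapper means existing callers stay unchanged while new callers can pass the structure directly.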
     

31 Jul, 2012

1 commit

  • Most of the places where we want freeze protection coincide with the places where
    we also have remount-ro protection. So make mnt_want_write() and
    mnt_drop_write() (and their _file alternatives) prevent freezing as well.
    For the few cases that are really interested only in remount-ro protection
    provide new function variants.

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

14 Jul, 2012

6 commits

  • Split inode_permission() into inode- and superblock-dependent parts.

    This is aimed at unionmounts where the superblock from the upper layer has to
    be checked rather than the superblock from the lower layer as the upper layer
    may be writable, thus allowing an unwritable file from the lower layer to be
    copied up and modified.

    Original-author: Valerie Aurora
    Signed-off-by: David Howells (Further development)
    Signed-off-by: Al Viro

    David Howells
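
A minimal sketch of the split, assuming a much-simplified permission model: the superblock part only checks for a read-only filesystem and the inode part only checks an immutable flag. The type and function names (`perm_sb_model`, `sb_part_permission()`, `inode_part_permission()`, `union_permission_model()`) are hypothetical and chosen for illustration; the commit text does not name the resulting functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAY_WRITE 0x2
#define MAY_READ  0x4

struct perm_sb_model { bool read_only; };

struct perm_inode_model {
    struct perm_sb_model *sb;
    bool immutable;
};

/* Superblock-dependent part: refuse writes to a read-only filesystem. */
int sb_part_permission(const struct perm_sb_model *sb, int mask)
{
    assert(sb != NULL);
    if ((mask & MAY_WRITE) && sb->read_only)
        return -EROFS;
    return 0;
}

/* Inode-dependent part: refuse writes to an immutable inode. */
int inode_part_permission(const struct perm_inode_model *inode, int mask)
{
    if ((mask & MAY_WRITE) && inode->immutable)
        return -EACCES;
    return 0;
}

/* Ordinary callers check the inode's own superblock, then the inode. */
int inode_permission_model(const struct perm_inode_model *inode, int mask)
{
    int err = sb_part_permission(inode->sb, mask);
    if (err)
        return err;
    return inode_part_permission(inode, mask);
}

/* A union-style caller substitutes the upper layer's superblock. */
int union_permission_model(const struct perm_sb_model *upper_sb,
                           const struct perm_inode_model *lower_inode,
                           int mask)
{
    int err = sb_part_permission(upper_sb, mask);
    if (err)
        return err;
    return inode_part_permission(lower_inode, mask);
}
```

With the two halves separated, a file on a read-only lower layer can pass the write check when the upper layer's superblock is writable, which is the copy-up case the commit describes.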
     
  • Just pass struct file *. Methods are happier that way...
    There's no need to return struct file * from finish_open() now,
    so let it return int. Next: saner prototypes for parts in
    namei.c

    Signed-off-by: Al Viro

    Al Viro
     
  • ->filp->f_path is there for a purpose...

    Signed-off-by: Al Viro

    Al Viro
     
  • All users of open intents have been converted to use ->atomic_{open,create}.

    This patch gets rid of nd->intent.open and related infrastructure.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Add a new inode operation which is called on the last component of an open.
    Using this the filesystem can look up, possibly create and open the file in one
    atomic operation. If it cannot perform this (e.g. the file type turned out to
    be wrong) it may signal this by returning NULL instead of an open struct file
    pointer.

    i_op->atomic_open() is only called if the last component is negative or needs
    lookup. Handling cached positive dentries here doesn't add much value: these
    can be opened using f_op->open(). If the cached file turns out to be invalid,
    the open can be retried, this time using ->atomic_open() with a fresh dentry.

    For now leave the old way of using open intents in lookup and revalidate in
    place. This will be removed once all the users are converted.

    David Howells noticed that if ->atomic_open() opens the file but does not create
    it, handle_truncate() will be called on it even if it is not a regular file.
    Fix this by checking the file type in this case too.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • it's enough to set ->mnt_ns of internal vfsmounts to something
    distinct from all struct mnt_namespace out there; then we can
    just use the check for ->mnt_ns != NULL in the fast path of
    mntput_no_expire()

    Signed-off-by: Al Viro

    Al Viro
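
The sentinel idea can be sketched as below. This is a single-threaded user-space model with hypothetical names (`vfsmount_model`, `mark_internal()`, `mntput_fast_path()`); the sentinel here is the address of a private static object, which is one possible way to obtain a pointer distinct from every real namespace and is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mnt_ns_model { int id; };

struct vfsmount_model {
    struct mnt_ns_model *mnt_ns;   /* NULL means "not attached anywhere" */
    int count;
};

/* Private object whose address never matches a real namespace. */
static struct mnt_ns_model internal_ns_marker;

/* Give an internal mount a non-NULL, never-valid namespace pointer. */
void mark_internal(struct vfsmount_model *m)
{
    m->mnt_ns = &internal_ns_marker;
}

bool is_internal(const struct vfsmount_model *m)
{
    return m->mnt_ns == &internal_ns_marker;
}

/* Returns true if the put was handled on the fast path; false means the
 * caller would have to fall back to the slow decrement-and-check path. */
bool mntput_fast_path(struct vfsmount_model *m)
{
    assert(m->count > 0);
    if (m->mnt_ns != NULL) {   /* one test covers attached and internal */
        m->count--;
        return true;
    }
    return false;
}
```

Because internal mounts carry a non-NULL marker, the fast path needs no separate "is this an internal mount?" test.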
     

02 Jun, 2012

1 commit

  • Split __dentry_open() into two functions:

    do_dentry_open() - does most of the actual work, doesn't put file on failure
    open_check_o_direct() - after a successful open, checks direct_IO method

    This will allow i_op->atomic_open to do just the file initialization and leave
    the direct_IO checking to the VFS.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
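
The two-step structure can be modelled in user space along these lines. All names carry a `_model` suffix to mark them as hypothetical; `O_DIRECT_MODEL` is an arbitrary flag value, and the `fs_open_fails` parameter is an assumption used to simulate a failing filesystem open.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define O_DIRECT_MODEL 0x4000

struct opened_file_model {
    int flags;
    bool opened;
    bool has_direct_io;    /* does the file provide a direct_IO method? */
};

/* Step 1: initialise the file. On failure the file is NOT released here;
 * the caller still owns it. */
int do_dentry_open_model(struct opened_file_model *f, int flags,
                         bool fs_open_fails)
{
    assert(f != NULL);
    f->flags = flags;
    f->opened = false;
    if (fs_open_fails)
        return -EIO;
    f->opened = true;
    return 0;
}

/* Step 2: after a successful open, reject O_DIRECT when the file has no
 * direct_IO method. */
int open_check_o_direct_model(const struct opened_file_model *f)
{
    if ((f->flags & O_DIRECT_MODEL) && !f->has_direct_io)
        return -EINVAL;
    return 0;
}

/* Combined path: a caller that only needs step 1 can stop after it and
 * leave step 2 to generic code. */
int open_model(struct opened_file_model *f, int flags, bool fs_open_fails)
{
    int err = do_dentry_open_model(f, flags, fs_open_fails);
    if (err)
        return err;
    err = open_check_o_direct_model(f);
    if (err)
        f->opened = false;    /* undo the open; stands in for releasing it */
    return err;
}
```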
     

30 May, 2012

1 commit

  • lglocks and brlocks are currently generated with some complicated macros
    in lglock.h. But there's no reason to not just use common utility
    functions and put all the data into a common data structure.

    Since there are at least two users it makes sense to share this code in a
    library. This is also easier to maintain than a macro forest.

    This will also make it later possible to dynamically allocate lglocks and
    also use them in modules (this would both still need some additional, but
    now straightforward, code)

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Rusty Russell
    Signed-off-by: Al Viro

    Andi Kleen
     

07 Jan, 2012

2 commits

  • Currently remounting superblock read-only is racy in a major way.

    With the per mount read-only infrastructure it is now possible to
    prevent most races, which this patch attempts.

    Before starting the remount read-only, iterate through all mounts
    belonging to the superblock and if none of them have any pending
    writes, set sb->s_readonly_remount. This indicates that remount is in
    progress and no further write requests are allowed. If the remount
    succeeds set MS_RDONLY and reset s_readonly_remount.

    If the remounting is unsuccessful just reset s_readonly_remount.
    This can result in transient EROFS errors, despite the fact the
    remount failed. Unfortunately holding off writes is difficult as
    remount itself may touch the filesystem (e.g. through load_nls())
    which would deadlock.

    A later patch deals with delayed writes due to nlink going to zero.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
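
The remount sequence described in this commit can be sketched as a single-threaded model. The structure and function names (`ro_sb_model`, `want_write_model()`, `remount_ro_model()`) are hypothetical, the `fs_remount_fails` parameter is an assumption used to simulate a failing remount, and no real locking is shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct mnt_writers_model { int writers; };

struct ro_sb_model {
    struct mnt_writers_model *mounts;
    size_t nr_mounts;
    bool readonly;            /* stands in for MS_RDONLY */
    bool readonly_remount;    /* stands in for sb->s_readonly_remount */
};

/* Writers are refused once read-only or while a remount is in progress. */
int want_write_model(struct ro_sb_model *sb, size_t idx)
{
    assert(idx < sb->nr_mounts);
    if (sb->readonly || sb->readonly_remount)
        return -EROFS;
    sb->mounts[idx].writers++;
    return 0;
}

void drop_write_model(struct ro_sb_model *sb, size_t idx)
{
    assert(idx < sb->nr_mounts && sb->mounts[idx].writers > 0);
    sb->mounts[idx].writers--;
}

int remount_ro_model(struct ro_sb_model *sb, bool fs_remount_fails)
{
    /* Refuse if any mount of this superblock has pending writes. */
    for (size_t i = 0; i < sb->nr_mounts; i++)
        if (sb->mounts[i].writers > 0)
            return -EBUSY;

    sb->readonly_remount = true;      /* new write requests now fail */
    if (!fs_remount_fails)
        sb->readonly = true;          /* success: set the read-only flag */
    sb->readonly_remount = false;     /* reset in both outcomes */
    return fs_remount_fails ? -EIO : 0;
}
```

In a concurrent setting, a writer arriving while `readonly_remount` is set would see the transient EROFS mentioned above even if the remount later fails; the single-threaded model cannot show that window.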
     
  • Al Viro
     

04 Jan, 2012

4 commits


20 Jul, 2011

2 commits

  • The per-sb shrinker has the same requirement as the writeback
    threads of ensuring that the superblock is usable and pinned for the
    time it takes to run the work. Both need to take a passive reference
    to the sb, take a read lock on the s_umount lock and then only
    continue if an unmount is not in progress.

    pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
    rename it to grab_super_passive(), and export it via fs/internal.h
    so that all the VFS code can use it.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
    Allocates dentry, sets its ->d_sb to given superblock and sets
    ->d_op accordingly. Old d_alloc(NULL, name) callers are converted
    to that (all of them know what superblock they want). d_alloc()
    itself is left only for parent != NULL case; uses __d_alloc(),
    inserts result into the list of parent's children.

    Note that now ->d_sb is assign-once and never NULL *and*
    ->d_parent is never NULL either.

    Signed-off-by: Al Viro

    Al Viro
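
The division of labour between the two allocators can be sketched in user space as follows. The names carry a `_model` suffix to mark them as hypothetical, the child list is a simple singly linked list, and making a parentless dentry its own parent is an assumption of this sketch chosen to satisfy the "never NULL" rule.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct alloc_sb_model { const void *default_d_op; };

struct dentry_model {
    char *name;
    struct alloc_sb_model *d_sb;
    const void *d_op;
    struct dentry_model *d_parent;
    struct dentry_model *first_child;
    struct dentry_model *next_sibling;
};

/* Core helper: allocate, set d_sb once, and derive d_op from it. */
struct dentry_model *d_alloc_sb_model(struct alloc_sb_model *sb,
                                      const char *name)
{
    assert(sb != NULL && name != NULL);
    struct dentry_model *d = calloc(1, sizeof(*d));
    if (d == NULL)
        return NULL;
    d->name = malloc(strlen(name) + 1);
    if (d->name == NULL) {
        free(d);
        return NULL;
    }
    strcpy(d->name, name);
    d->d_sb = sb;
    d->d_op = sb->default_d_op;
    d->d_parent = d;    /* never NULL: a parentless dentry is its own parent */
    return d;
}

/* Parent-based allocator: uses the helper, then links into the parent. */
struct dentry_model *d_alloc_model(struct dentry_model *parent,
                                   const char *name)
{
    assert(parent != NULL);
    struct dentry_model *d = d_alloc_sb_model(parent->d_sb, name);
    if (d == NULL)
        return NULL;
    d->d_parent = parent;
    d->next_sibling = parent->first_child;
    parent->first_child = d;
    return d;
}

void d_free_model(struct dentry_model *d)
{
    free(d->name);
    free(d);
}
```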
     

25 Mar, 2011

2 commits

  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the per-sb inode list with a new global lock
    inode_sb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

22 Mar, 2011

1 commit


18 Mar, 2011

1 commit

  • new function: mount_fs(). Does all work done by vfs_kern_mount()
    except the allocation and filling of vfsmount; returns root dentry
    or ERR_PTR().

    vfs_kern_mount() switched to using it and taken to fs/namespace.c,
    along with its wrappers.

    alloc_vfsmnt()/free_vfsmnt() made static.

    functions in namespace.c slightly reordered.

    Signed-off-by: Al Viro

    Al Viro
     

15 Mar, 2011

1 commit


14 Mar, 2011

2 commits

  • new function: file_open_root(dentry, mnt, name, flags) opens the file
    vfs_path_lookup would arrive at.

    Note that name can be empty; in that case the usual requirement that
    dentry should be a directory is lifted.

    open-coded equivalents switched to it, may_open() got down exactly
    one caller and became static.

    Signed-off-by: Al Viro

    Al Viro
     
  • take calculation of open_flags by open(2) arguments into new helper
    in fs/open.c, move filp_open() over there, have it and do_sys_open()
    use that helper, switch exec.c callers of do_filp_open() to explicit
    (and constant) struct open_flags.

    Signed-off-by: Al Viro

    Al Viro
     

24 Feb, 2011

1 commit

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change) so any
    data we hold becomes irrelevant.
    In the other, the device has changed size (check_disk_size_change)
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing devices.

    flush_disk calls __invalidate_device.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bdev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flush_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rather than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a which causes
    check_disk_size_change to call flush_disk, so it is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
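
The effect of the new flag can be sketched with a small user-space model. The function names mirror the ones in the commit message but carry a `_model` suffix, and the array-of-inodes cache is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cached_inode_model {
    bool dirty;
    bool cached;
};

/* Drop cached inodes; dirty ones are dropped only when kill_dirty is set.
 * Returns how many inodes were kept because they were dirty. */
size_t invalidate_inodes_model(struct cached_inode_model *inodes, size_t n,
                               bool kill_dirty)
{
    assert(inodes != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (!inodes[i].cached)
            continue;
        if (inodes[i].dirty && !kill_dirty) {
            kept++;               /* keep the data: it can still be written */
            continue;
        }
        inodes[i].cached = false;
        inodes[i].dirty = false;
    }
    return kept;
}

size_t flush_disk_model(struct cached_inode_model *inodes, size_t n,
                        bool kill_dirty)
{
    return invalidate_inodes_model(inodes, n, kill_dirty);
}

/* Device disappeared: dirty data has nowhere safe to go, so discard it. */
size_t check_disk_change_model(struct cached_inode_model *inodes, size_t n)
{
    return flush_disk_model(inodes, n, true);
}

/* Device changed size: dirty data is still valid, so keep it. */
size_t check_disk_size_change_model(struct cached_inode_model *inodes,
                                    size_t n)
{
    return flush_disk_model(inodes, n, false);
}
```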
     

17 Jan, 2011

3 commits

  • do_add_mount() and mnt_clear_expiry() are not needed outside of
    namespace.c anymore, now that namei has finish_automount() to
    use.

    Signed-off-by: Al Viro

    Al Viro
     
  • ... and shift it from namei.c to namespace.c

    Signed-off-by: Al Viro

    Al Viro
     
  • Instead of splitting refcount between (per-cpu) mnt_count
    and (SMP-only) mnt_longrefs, make all references contribute
    to mnt_count again and keep track of how many are longterm
    ones.

    Accounting rules for longterm count:
    * 1 for each fs_struct.root.mnt
    * 1 for each fs_struct.pwd.mnt
    * 1 for having non-NULL ->mnt_ns
    * decrement to 0 happens only under vfsmount lock exclusive

    That allows nice common case for mntput() - since we can't drop the
    final reference until after mnt_longterm has reached 0 due to the rules
    above, mntput() can grab vfsmount lock shared and check mnt_longterm.
    If it turns out to be non-zero (which is the common case), we know
    that this is not the final mntput() and can just blindly decrement
    percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
    do usual decrement-and-check of percpu mnt_count.

    For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
    namespace.c uses the latter in places where we don't already hold
    vfsmount lock exclusive and opencodes a few remaining spots where
    we need to manipulate mnt_longterm.

    Note that we mostly revert the code outside of fs/namespace.c back
    to what we used to have; in particular, normal code doesn't need
    to care about two kinds of references, etc. And we get to keep
    the optimization Nick's variant had bought us...

    Signed-off-by: Al Viro

    Al Viro
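
The accounting rules above can be sketched as a single-threaded model. The per-CPU counter and the vfsmount lock are collapsed into plain integers here, and the names (`longterm_mount_model`, `lt_mntget()`, `lt_mntput()`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct longterm_mount_model {
    int count;       /* every reference, short or long, is counted here */
    int longterm;    /* how many of those references are longterm ones */
};

void lt_mntget(struct longterm_mount_model *m)
{
    m->count++;
}

/* Mark one already-held reference as longterm (e.g. a process root). */
void lt_make_longterm(struct longterm_mount_model *m)
{
    assert(m->count > m->longterm);
    m->longterm++;
}

void lt_make_shortterm(struct longterm_mount_model *m)
{
    assert(m->longterm > 0);
    m->longterm--;
}

/* Returns true when this put dropped the final reference. */
bool lt_mntput(struct longterm_mount_model *m)
{
    assert(m->count > 0);
    if (m->longterm > 0) {
        /* Common case: a longterm reference exists, so this cannot be
         * the final put; decrement without checking for zero. */
        m->count--;
        return false;
    }
    /* No longterm references left: decrement and check. */
    return --m->count == 0;
}
```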
     

16 Jan, 2011

1 commit

  • Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
    added rather than calling do_add_mount() itself. follow_automount() will then
    do the addition.

    This slightly complicates things as ->d_automount() normally wants to add the
    new vfsmount to an expiration list and start an expiration timer. The problem
    with that is that the vfsmount will be deleted if it has a refcount of 1 and
    the timer will not repeat if the expiration list is empty.

    To this end, we require the vfsmount to be returned from d_automount() with a
    refcount of (at least) 2. One of these refs will be dropped unconditionally.
    In addition, follow_automount() must get a 3rd ref around the call to
    do_add_mount() lest it eat a ref and return an error, leaving the mount we
    have open to being expired as we would otherwise have only 1 ref on it.

    d_automount() should also add the vfsmount to the expiration list (by
    calling mnt_set_expiry()) and start the expiration timer before returning, if
    this mechanism is to be used. The vfsmount will be unlinked from the
    expiration list by follow_automount() if do_add_mount() fails.

    This patch also fixes the call to do_add_mount() for AFS to propagate the mount
    flags from the parent vfsmount.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

07 Jan, 2011

1 commit

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    which often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only take the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
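
The third scheme (per-CPU differences plus a single "long" refcount) can be sketched as below. This is a single-threaded model with hypothetical names; a fixed-size array stands in for per-CPU data, and the `cpu` argument is an assumption that replaces "whichever CPU the caller happens to run on".

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4

struct percpu_mount_model {
    int local[NR_CPUS_MODEL];   /* per-CPU increments minus decrements */
    int longrefs;               /* slow, long-lasting references */
};

/* Sum of all per-CPU differences: the number of short references. */
int percpu_sum(const struct percpu_mount_model *m)
{
    int sum = 0;
    for (int i = 0; i < NR_CPUS_MODEL; i++)
        sum += m->local[i];
    return sum;
}

void percpu_mntget(struct percpu_mount_model *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    m->local[cpu]++;
}

/* Drop a short reference. Returns true if it was the last reference. */
bool percpu_mntput(struct percpu_mount_model *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    m->local[cpu]--;
    if (m->longrefs > 0)
        return false;            /* cheap path: no global sum needed */
    return percpu_sum(m) == 0;   /* expensive path: sum over all CPUs */
}

/* Drop a long reference. Returns true if it was the last reference. */
bool percpu_mntput_long(struct percpu_mount_model *m)
{
    assert(m->longrefs > 0);
    m->longrefs--;
    return m->longrefs == 0 && percpu_sum(m) == 0;
}
```

An individual per-CPU slot may go negative when a reference is taken on one CPU and dropped on another; only the sum across all slots is meaningful.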
     

29 Oct, 2010

1 commit


26 Oct, 2010

3 commits

  • Pull removal of fsnotify marks into generic_shutdown_super().
    Split umount-time work into a new function - evict_inodes().
    Make sure that invalidate_inodes() will be able to cope with
    I_FREEING once we change locking in iput().

    Signed-off-by: Al Viro

    Al Viro
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Signed-off-by: Al Viro

    Al Viro
     

18 Aug, 2010

2 commits

  • fs: brlock vfsmount_lock

    Use a brlock for the vfsmount lock. It must be taken for write whenever
    modifying the mount hash or associated fields, and may be taken for read when
    performing mount hash lookups.

    A new lock is added for the mnt-id allocator, so it doesn't need to take
    the heavy vfsmount write-lock.

    The number of atomics should remain the same for fastpath rlock cases, though
    code would be slightly slower due to per-cpu access. Scalability will not be
    much improved in common cases yet, due to other locks (ie. dcache_lock) getting
    in the way. However path lookups crossing mountpoints should be one case where
    scalability is improved (currently requiring the global lock).

    The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
    Altix system (high latency to remote nodes), a simple umount microbenchmark
    (mount --bind mnt mnt2 ; umount mnt2, looped 1000 times) took 6.8s before
    this patch and 7.1s afterwards, about 5% slower.

    Cc: Al Viro
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • tty: fix fu_list abuse

    tty code abuses fu_list, which causes a bug in remount,ro handling.

    If a tty device node is opened on a filesystem and the last link to the inode
    is then removed, the filesystem will be allowed to be remounted readonly. This is
    because fs_may_remount_ro does not find the 0 link tty inode on the file sb
    list (because the tty code incorrectly removed it to use for its own purpose).
    This can result in a filesystem with errors after it is marked "clean".

    Taking the idea from Christoph's initial patch, allocate a tty private struct
    at file->private_data and put our required list fields in there, linking
    file and tty. This makes tty nodes behave the same way as other device nodes,
    avoids meddling with the vfs, and avoids this bug.

    The error handling is not trivial in the tty code, so for this bugfix, I take
    the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
    This is not a problem because our allocator doesn't fail small allocs as a rule
    anyway. So proper error handling is left as an exercise for tty hackers.

    [ Arguably filesystem's device inode would ideally be divorced from the
    driver's pseudo inode when it is opened, but in practice it's not clear whether
    that will ever be worth implementing. ]

    Cc: linux-kernel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Alan Cox
    Cc: Greg Kroah-Hartman
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     

22 May, 2010

1 commit


04 Mar, 2010

1 commit


23 Dec, 2009

1 commit