26 Oct, 2010

40 commits

  • Despite the comment above it, we cannot safely drop the lock here:
    invalidate_list is called from many other places than just umount.
    Also switch to proper list macros now that we never drop the lock.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The use of the same inode list structure (inode->i_list) for two
    different list constructs with different lifecycles and purposes
    makes it impossible to separate the locking of the different
    operations. Therefore, to enable the separation of the locking of
    the writeback and reclaim lists, split the inode->i_list into two
    separate lists dedicated to their specific tracking functions.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
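
    A minimal sketch of the split, assuming the two new fields carry the
    i_wb_list/i_lru names this series uses; shown as a reduced view rather
    than the full struct inode:

        #include <linux/list.h>

        struct inode_tracking_lists {
                /* before the split, a single i_list served both roles */
                struct list_head i_wb_list;     /* per-bdi writeback list */
                struct list_head i_lru;         /* inode reclaim LRU */
        };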
     
  • bdev inodes can remain dirty even after their last close. Hence the
    BDI associated with the bdev->inode gets modified during the last
    close to point to the default BDI. However, the bdev inode still
    needs to be moved to the dirty lists of the new BDI, otherwise it
    will corrupt the writeback list it was left on.

    Add a new function bdev_inode_switch_bdi() to move all the bdi state
    from the old bdi to the new one safely. This is only a temporary
    measure until the bdev inode <-> bdi lifecycle problems are sorted
    out.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
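
    Roughly the shape of such a helper, sketched under the assumption that
    inode_lock still protects the writeback lists and that the dirty list
    lives at bdi->wb.b_dirty:

        static void bdev_inode_switch_bdi(struct inode *inode,
                                          struct backing_dev_info *dst)
        {
                spin_lock(&inode_lock);
                inode->i_data.backing_dev_info = dst;
                /* a dirty bdev inode has to follow its pages to the new
                 * bdi, or it corrupts the writeback list it was left on */
                if (inode->i_state & I_DIRTY)
                        list_move(&inode->i_wb_list, &dst->wb.b_dirty);
                spin_unlock(&inode_lock);
        }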
     
  • We must not call invalidate_inode_buffers in invalidate_list unless the
    inode can be reclaimed. If we remove the buffer association of a busy
    inode, fsync won't find the buffers anymore. As invalidate_inode_buffers
    is called from various other sources than umount, this actually does
    matter in practice.

    While at it, change the loop to a more natural form and remove the
    WARN_ON for I_NEW, which we already tested a few lines above.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Use dget_parent instead of opencoding it. This simplifies the code, but
    more importantly prepares for the more complicated locking for a parent
    dget in the dcache scale patch series.

    It means we now grab a reference to the parent if it needs to be watched,
    but not with the specified mask. If this turns out to be a problem
    we'll have to revisit it, but for now let's keep as much of the dcache
    internals as possible inside dcache.[ch].

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
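
    The shape of the conversion, as a hedged sketch (grab_parent is a
    hypothetical wrapper; the fsnotify details around it are omitted):

        static struct dentry *grab_parent(struct dentry *dentry)
        {
                /* previously open-coded, relying on dcache_lock to pin
                 * d_parent:
                 *
                 *      spin_lock(&dcache_lock);
                 *      parent = dget(dentry->d_parent);
                 *      spin_unlock(&dcache_lock);
                 */

                /* now: let the dcache handle the locking internally */
                return dget_parent(dentry);     /* balance with dput() */
        }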
     
  • Use dget_parent instead of opencoding it. This simplifies the code, but
    more importantly prepares for the more complicated locking for a parent
    dget in the dcache scale patch series.

    Note that the d_time assignment in smb_renew_times moves out of d_lock,
    but it's a single atomic 32-bit value, and that's what other sites
    setting it do already.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Use dget_parent instead of opencoding it. This simplifies the code, but
    more importantly prepares for the more complicated locking for a parent
    dget in the dcache scale patch series.

    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • d_validate does a purely read-only lookup in the dentry hash, so use RCU
    read-side locking instead of dcache_lock. Split out from a larger patch
    by Nick Piggin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
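
    Illustratively, the read side swaps dcache_lock for an RCU read lock
    around the hash-bucket walk; a sketch using the era's four-argument
    hlist_for_each_entry_rcu() and a hypothetical bucket lookup:

        static bool dentry_in_hash(struct hlist_head *bucket,
                                   struct dentry *dentry)
        {
                struct hlist_node *node;
                struct dentry *d;
                bool found = false;

                rcu_read_lock();        /* was: spin_lock(&dcache_lock) */
                hlist_for_each_entry_rcu(d, node, bucket, d_hash) {
                        if (d == dentry) {
                                found = true;
                                break;
                        }
                }
                rcu_read_unlock();      /* was: spin_unlock(&dcache_lock) */
                return found;
        }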
     
  • Always do a list_del_init on the LRU to make sure the list_empty invariant
    for not being on the LRU always holds true, and fold dentry_lru_del_init
    into dentry_lru_del. Replace the dentry_lru_add_tail primitive with a
    dentry_lru_move_tail operation that is simpler when the dentry is already
    on the list, which it always is. Move the list_empty check into
    dentry_lru_add to fit the scheme of the other LRU helpers, and to simplify
    locking once we move to a separate LRU lock.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
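
    A sketch of the resulting helpers under that invariant (counter names
    as in fs/dcache.c of the time; treat this as illustrative rather than
    the exact patch):

        static void dentry_lru_del(struct dentry *dentry)
        {
                if (!list_empty(&dentry->d_lru)) {
                        /* list_del_init keeps "empty == not on the LRU" true */
                        list_del_init(&dentry->d_lru);
                        dentry->d_sb->s_nr_dentry_unused--;
                        dentry_stat.nr_unused--;
                }
        }

        static void dentry_lru_move_tail(struct dentry *dentry)
        {
                /* the caller guarantees the dentry is already on the LRU */
                list_move_tail(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
        }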
     
  • Currently __shrink_dcache_sb has an extremely awkward calling convention
    because it tries to please very different callers. Split out the
    main loop into a shrink_dentry_list helper, which gets called directly
    from shrink_dcache_sb for the cases where all dentries need to be pruned,
    or from __shrink_dcache_sb for pruning only a certain number of dentries.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • dentry referenced bit is only set when installing the dentry back
    onto the LRU. However with lazy LRU, the dentry can already be on
    the LRU list at dput time, thus missing out on setting the referenced
    bit. Fix this.

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
     
  • The nr_dentry stat is a globally touched cacheline and atomic operation
    twice over the lifetime of a dentry. It is used for the benefit of userspace
    only. Turn it into a per-cpu counter and always decrement it in d_free instead
    of doing various batching operations to reduce lock hold times in the callers.

    Based on an earlier patch from Nick Piggin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
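
    A minimal sketch of the per-cpu counter, assuming the this_cpu_*
    helpers and a summing slow path for the userspace-visible value
    (the wrapper names here are illustrative):

        static DEFINE_PER_CPU(unsigned int, nr_dentry);

        /* fast path: purely local, no shared cacheline, no atomics */
        static void nr_dentry_inc(void) { this_cpu_inc(nr_dentry); }
        static void nr_dentry_dec(void) { this_cpu_dec(nr_dentry); }

        /* slow path: only when userspace asks, e.g. via dentry-state */
        static int get_nr_dentry(void)
        {
                int cpu, sum = 0;

                for_each_possible_cpu(cpu)
                        sum += per_cpu(nr_dentry, cpu);
                return sum < 0 ? 0 : sum;
        }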
     
  • Remove d_callback and always call __d_free with an RCU head.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • All callers take dcache_lock just around the call to __d_path, so
    move the lock into it in preparation for getting rid of dcache_lock.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Instead of always assigning an increasing inode number in new_inode
    move the call to assign it into those callers that actually need it.
    For now the set of callers that need it is estimated conservatively;
    that is, the call is added to all filesystems that do not assign an i_ino
    by themselves. For a few more filesystems we can avoid assigning
    any inode number given that they aren't user visible, and for others
    it could be done lazily when an inode number is actually needed,
    but that's left for later patches.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • new_inode() dirties a contended cache line to get increasing
    inode numbers. This limits performance on workloads that cause
    significant parallel inode allocation.

    Solve this problem by using a per_cpu variable fed by the shared
    last_ino in batches of 1024 allocations. This reduces contention on
    the shared last_ino and gives the same spread of inode numbers as before
    (i.e. the same wraparound after 2^32 allocations).

    Signed-off-by: Eric Dumazet
    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Eric Dumazet
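
    A sketch of the batching scheme: each CPU takes a private window of
    1024 numbers from the shared counter and then hands them out locally
    (the function name is illustrative, not necessarily the final API):

        #define LAST_INO_BATCH 1024
        static DEFINE_PER_CPU(unsigned int, last_ino);

        static unsigned int next_ino(void)
        {
                unsigned int *p = &get_cpu_var(last_ino);
                unsigned int res = *p;

        #ifdef CONFIG_SMP
                if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
                        static atomic_t shared_last_ino;
                        int next = atomic_add_return(LAST_INO_BATCH,
                                                     &shared_last_ino);

                        res = next - LAST_INO_BATCH;    /* start of our batch */
                }
        #endif
                *p = ++res;
                put_cpu_var(last_ino);
                return res;
        }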
     
  • Clones an existing reference to an inode; the caller must already hold one.

    Signed-off-by: Al Viro

    Al Viro
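
    Presumably this introduces an ihold()-style helper; a one-line sketch,
    assuming the reference count is still the atomic i_count field at this
    point in the series:

        void ihold(struct inode *inode)
        {
                /* the caller already holds a reference, so this can never
                 * resurrect an inode whose count has already hit zero */
                atomic_inc(&inode->i_count);
        }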
     
  • Split up inode_add_to_list/__inode_add_to_list. Locking for the two
    lists will be split soon so these helpers really don't buy us much
    anymore.

    The __ prefixes for the sb list helpers will go away soon, but until
    inode_lock is gone we'll need them to distinguish between the locked
    and unlocked variants.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that iunique is not abusing find_inode anymore we can move the i_ref
    increment back to where it belongs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
    Introduce a new iunique_lock to protect the iunique counters once inode_lock
    is removed.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Before replacing the inode hash locking with a more scalable
    mechanism, factor the removal of the inode from the hashes rather
    than open coding it in several places.

    Based on a patch originally from Nick Piggin.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert the inode LRU to use lazy updates to reduce lock and
    cacheline traffic. We avoid moving inodes around in the LRU list
    during iget/iput operations so these frequent operations don't need
    to access the LRUs. Instead, we defer the refcount checks to
    reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
    reclaim that iget has touched the inode in the past. This means that
    only reclaim should be touching the LRU with any frequency, hence
    significantly reducing lock acquisitions and the amount of contention
    on LRU updates.

    This also removes the inode_in_use list, which means we now only
    have one list for tracking the inode LRU status. This makes it much
    simpler to split out the LRU list operations under its own lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
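
    The reclaim-side half of the scheme, sketched assuming a single global
    inode_lru list (the helper name is hypothetical): a referenced inode
    gets one more trip around the LRU instead of being reclaimed.

        /* returns true if the inode earned another pass around the LRU */
        static bool inode_referenced_recently(struct inode *inode)
        {
                if (!(inode->i_state & I_REFERENCED))
                        return false;
                inode->i_state &= ~I_REFERENCED;
                /* rotate back to the young end and look at it again later */
                list_move(&inode->i_lru, &inode_lru);
                return true;
        }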
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • If clone_mnt() happens while mnt_make_readonly() is running, the
    cloned mount might have MNT_WRITE_HOLD flag set, which results in
    mnt_want_write() spinning forever on this mount.

    This needs CAP_SYS_ADMIN to trigger deliberately and is unlikely to
    happen accidentally. But if it does happen it can hang the machine.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Make a node look as if it were on an hlist, with hlist_del()
    working correctly. Usable without any locking...

    Convert a couple of places where we want to do that to
    inode->i_hash.

    Signed-off-by: Al Viro

    Al Viro
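
    This is the hlist_add_fake()-style trick: point the node's pprev at its
    own next field, so hlist_unhashed() reports it as hashed while it sits
    on no list at all. A sketch:

        static inline void hlist_add_fake(struct hlist_node *n)
        {
                n->pprev = &n->next;    /* looks hashed, but is on no list */
        }

        /*
         * With n->next still NULL, a later hlist_del(n) merely stores NULL
         * back through n->pprev, i.e. into n->next itself; nothing outside
         * the node is touched, which is why no locking is needed.
         */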
     
  • note: for race-free uses you need inode_lock held

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • We are in fill_super(); again, no inodes with zero i_count could
    be around until we set MS_ACTIVE.

    Signed-off-by: Al Viro

    Al Viro
     
  • In fill_super() we don't have MS_ACTIVE set yet, so there won't
    be any inodes with zero i_count sitting around.

    In put_super() we already have MS_ACTIVE removed *and* we
    have called invalidate_inodes() since then. So again there
    won't be any inodes with zero i_count...

    Signed-off-by: Al Viro

    Al Viro
     
  • It's pointless - we *do* have busy inodes (root directory,
    for one), so that call will fail and the attempt to change
    the XIP flag will be ignored.

    Signed-off-by: Al Viro

    Al Viro
     
  • If we have the appropriate page already, call __block_write_begin()
    directly instead of releasing and regrabbing it inside of
    block_write_begin().

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
     
  • Since inode->i_mode shares its bits for S_IFMT, S_ISDIR should be
    used to distinguish whether it is a dir or not.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
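
    A small example of the point: i_mode packs the file type into its
    S_IFMT bits alongside the permission bits, so a direct comparison is
    wrong (the helper name below is hypothetical):

        static bool is_directory(const struct inode *inode)
        {
                /* not: inode->i_mode == S_IFDIR -- that fails as soon as
                 * any permission bits are set in i_mode */
                return S_ISDIR(inode->i_mode);  /* masks with S_IFMT first */
        }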
     
  • The aio batching code is using igrab to get an extra reference on the
    inode so it can safely batch. igrab will go ahead and take the global
    inode spinlock, which can be a bottleneck on large machines doing lots
    of AIO.

    In this case, igrab isn't required because we already have a reference
    on the file handle. It is safe to just bump the i_count directly
    on the inode.

    Benchmarking shows this patch brings IOP/s on tons of flash up by about
    2.5X.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Updated Documentation/filesystems/Locking to match the code.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     
  • bh->b_private is initialized within init_buffer(), thus the
    assignment should be redundant. Remove it.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Al Viro

    Namhyung Kim
     
  • Move the EXPORTFS kconfig symbol out of the NETWORK_FILESYSTEMS block
    since it provides a library function that can be (and is) used by other
    (non-network) filesystems.

    This also eliminates a kconfig dependency warning:

    warning: (XFS_FS && BLOCK || NFSD && NETWORK_FILESYSTEMS && INET && FILE_LOCKING && BKL) selects EXPORTFS which has unmet direct dependencies (NETWORK_FILESYSTEMS)

    Signed-off-by: Randy Dunlap
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Alex Elder
    Cc: xfs-masters@oss.sgi.com
    Signed-off-by: Al Viro

    Randy Dunlap
     
  • Use sync_dirty_buffer instead of incorrectly opencoding it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Currently, rw_verify_area() checks whether f_pos is negative and, if it
    is, returns -EINVAL.

    But some special files such as /dev/(k)mem and /proc/<pid>/mem have
    negative offsets, and we can't do any access via read/write to such a
    file (device).

    So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.

    Signed-off-by: Wu Fengguang
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Al Viro
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    KAMEZAWA Hiroyuki
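
    A sketch of the resulting offset check in rw_verify_area()-style code
    (verify_offset is a hypothetical helper; FMODE_UNSIGNED_OFFSET is the
    flag this patch introduces):

        static int verify_offset(struct file *file, loff_t pos)
        {
                if (pos >= 0)
                        return 0;
                /* negative offsets are only meaningful for files that opt
                 * in, e.g. /dev/mem-style devices */
                if (file->f_mode & FMODE_UNSIGNED_OFFSET)
                        return 0;
                return -EINVAL;
        }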