09 Jan, 2012

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: local functions should be static
    GFS2: We only need one ACL getting function
    GFS2: Fix multi-block allocation
    GFS2: decouple quota allocations from block allocations
    GFS2: split function rgblk_search
    GFS2: Fix up "off by one" in the previous patch
    GFS2: move toward a generic multi-block allocator
    GFS2: O_(D)SYNC support for fallocate
    GFS2: remove vestigial al_alloced
    GFS2: combine gfs2_alloc_block and gfs2_alloc_di
    GFS2: Add non-try locks back to get_local_rgrp
    GFS2: f_ra is always valid in dir readahead function
    GFS2: Fix very unlikely memory leak in ACL xattr code
    GFS2: More automated code analysis fixes
    GFS2: Add readahead to sequential directory traversal
    GFS2: Fix up REQ flags

    Linus Torvalds
     

07 Jan, 2012

1 commit


04 Jan, 2012

1 commit

  • Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
    it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
    the cost of taking it into inode_init_always() will be negligible for pipes
    and sockets and negative for everything else. Not to mention the removal of
    boilerplate code from ->destroy_inode() instances...

    Signed-off-by: Al Viro

    Al Viro
     

22 Nov, 2011

1 commit

  • This patch separates the code pertaining to allocations into two
    parts: quota-related information and block reservations.
    This patch also moves all the block reservation structure allocations to
    function gfs2_inplace_reserve to simplify the code, and moves
    the frees to function gfs2_inplace_release.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

21 Oct, 2011

6 commits

  • Unfortunately, it is not enough to just ignore locked buffers during
    the AIL flush from fsync. We need to be able to ignore all buffers
    which are locked, dirty or pinned at this stage as they might have
    been added subsequent to the log flush earlier in the fsync function.

    In addition, this means that we no longer need to rely on i_mutex to
    keep out writes during fsync, so we can, as a side-effect, remove
    that protection too.

    Signed-off-by: Steven Whitehouse
    Tested-By: Abhijith Das

    Steven Whitehouse
     
  • This means that after the initial allocation for any inode, the
    last used resource group is cached in the inode for future use.
    This drastically reduces the number of lookups of resource
    groups in the common case, and thus the contention on that
    data structure.

    The allocation algorithm is the same as previously, except that we
    always check to see if the goal block is within the cached rgrp
    first before going to the rbtree to look one up.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
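
The cached resource group check described in the entry above can be sketched as follows. This is a minimal user-space illustration, not the GFS2 code: the structure names, the `slow_lookups` counter and the sorted-array binary search (standing in for the rbtree lookup) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real GFS2 structures. */
struct sk_rgrp {
    uint64_t start;   /* first block covered by this resource group */
    uint64_t len;     /* number of blocks covered */
};

struct sk_inode_info {
    const struct sk_rgrp *cached_rgrp;   /* last group used, or NULL */
};

/* Counts how often the full search had to be used. */
unsigned long slow_lookups;

static int sk_rgrp_contains(const struct sk_rgrp *rgd, uint64_t block)
{
    return block >= rgd->start && block - rgd->start < rgd->len;
}

/* Binary search over a sorted array; stands in for the rbtree lookup. */
static const struct sk_rgrp *sk_rgrp_search(const struct sk_rgrp *tbl,
                                            size_t n, uint64_t block)
{
    size_t lo = 0, hi = n;

    slow_lookups++;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (sk_rgrp_contains(&tbl[mid], block))
            return &tbl[mid];
        if (block < tbl[mid].start)
            hi = mid;
        else
            lo = mid + 1;
    }
    return NULL;
}

/* Check the cached group first; search only when the goal lies outside it. */
const struct sk_rgrp *sk_rgrp_for_goal(struct sk_inode_info *ip,
                                       const struct sk_rgrp *tbl,
                                       size_t n, uint64_t goal)
{
    if (ip->cached_rgrp && sk_rgrp_contains(ip->cached_rgrp, goal))
        return ip->cached_rgrp;
    ip->cached_rgrp = sk_rgrp_search(tbl, n, goal);
    return ip->cached_rgrp;
}
```

Repeated calls with goal blocks inside the same group leave `slow_lookups` unchanged, which is the reduction in lookups the commit message refers to.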
     
  • Since we have ruled out supporting online filesystem shrink,
    it is possible to make the resource group list append only
    during the life of a super block. This gives several benefits:

    Firstly, we only need to read new rindex elements as they are added
    rather than needing to reread the whole rindex file each time one
    element is added.

    Secondly, the rindex glock can be held for much shorter periods of
    time, and is completely removed from the fast path for allocations.
    The lock is taken in shared mode only when updating the resource
    groups when the first allocation occurs, and after a grow has
    taken place.

    Thirdly, this results in a reduction in code size, and everything
    gets a lot simpler to understand in this area.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
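
The first benefit listed in the entry above (reading only newly added rindex elements) can be illustrated with a small sketch. It assumes a simplified model in which the rindex is a plain array of block addresses; `sk_rgrp_list`, `sk_rindex_update` and `SK_MAX_RGRPS` are hypothetical names, not GFS2 identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_RGRPS 16   /* arbitrary capacity for the sketch */

/* In-memory copy of the resource group index, append only. */
struct sk_rgrp_list {
    size_t   nr_read;               /* entries already read */
    uint64_t addr[SK_MAX_RGRPS];    /* address of each resource group */
};

/*
 * Bring the in-memory list up to date with an index that only ever grows.
 * Entries read on an earlier call are never read again.
 * Returns the number of entries read on this call.
 */
size_t sk_rindex_update(struct sk_rgrp_list *list,
                        const uint64_t *rindex, size_t nr_entries)
{
    size_t added = 0;

    /* Online shrink is ruled out, so the index can never get shorter. */
    assert(nr_entries >= list->nr_read);

    while (list->nr_read < nr_entries && list->nr_read < SK_MAX_RGRPS) {
        list->addr[list->nr_read] = rindex[list->nr_read];
        list->nr_read++;
        added++;
    }
    return added;
}
```

A call made after a grow reads only the tail of the index, and a call with no new entries does no work at all.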
     
  • The aim of this patch is to use the newly enhanced ->dirty_inode()
    super block operation to deal with atime updates, rather than
    piggy backing that code into ->write_inode() as is currently
    done.

    The net result is a simplification of the code in various places
    and a reduction of the number of gfs2_dinode_out() calls since
    this is now implied by ->dirty_inode().

    Some of the mark_inode_dirty() calls have been moved under glocks
    in order to take advantage of then being able to avoid locking in
    ->dirty_inode() when we already have suitable locks.

    One consequence is that generic_write_end() now correctly deals
    with file size updates, so that we do not need a separate check
    for that afterwards. This also, indirectly, means that fdatasync
    should work correctly on GFS2 - the current code always syncs the
    metadata whether it needs to or not.

    Has survived testing with postmark (with and without atime) and
    also fsx.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • If we have got far enough through the inode allocation code
    path that an inode has already been allocated, then we must
    call iput to dispose of it, if an error occurs during a
    later part of the process. This will always be the final iput
    since there will be no other references to the inode.

    Unlike when the inode has been unlinked, its block state will
    be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need
    to skip the test in ->evict_inode() for this one case in order
    to ensure that it will be deallocated correctly. This patch adds
    a new flag in order to ensure that this will happen correctly.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
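
The decision described in the entry above can be sketched as a small predicate. This is an illustration under stated assumptions: the enum mirrors the block states named in the message, while `SK_FLAG_ALLOC_FAILED` and `sk_evict_should_dealloc` are hypothetical names chosen for the example, since the message does not name the new flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified block states; only UNLINKED and INODE matter here. */
enum sk_blkst {
    SK_BLKST_FREE,
    SK_BLKST_USED,
    SK_BLKST_UNLINKED,
    SK_BLKST_INODE
};

/* Hypothetical flag: set when inode creation failed after allocation. */
#define SK_FLAG_ALLOC_FAILED (1u << 0)

/*
 * Decide whether eviction should deallocate the inode's blocks.
 * Normally only unlinked inodes qualify, but an inode whose creation
 * failed part way through is still in the INODE state, so the flag
 * bypasses the block state test for that one case.
 */
bool sk_evict_should_dealloc(enum sk_blkst state, unsigned int flags)
{
    if (flags & SK_FLAG_ALLOC_FAILED)
        return true;
    return state == SK_BLKST_UNLINKED;
}
```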
     
  • We do not need to start a transaction unless the atime
    check has proved positive. Also if we are going to flush
    the complete ail list anyway, we might as well skip the
    writeback for this specific inode's metadata, since that
    will be done as part of the ail writeback process in an
    order offering potentially more efficient I/O.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

15 Jul, 2011

1 commit

  • This patch adds a cache for the hash table to the directory code
    in order to help simplify the way in which the hash table is
    accessed. This is intended to be a first step towards introducing
    some performance improvements in the directory code.

    There are two follow ups that I'm hoping to see fairly shortly. One
    is to simplify the hash table reading code now that we always read the
    complete hash table, whether we want one entry or all of them. The
    other is to introduce readahead on the heads of the hash chains
    which are referred to from the table.

    The hash table is a maximum of 128k in size, so it is not worth trying
    to read it in small chunks.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

14 Jul, 2011

1 commit

  • This patch contains a few misc fixes which resolve a recently
    reported issue. This patch has been a real team effort and has
    received a lot of testing.

    The first issue is that the ail lock needs to be held over a few
    more operations. The lock that's added into gfs2_releasepage() may
    possibly be a candidate for replacing with RCU at some future
    point, but at this stage we've gone for the obvious fix.

    The second issue is that gfs2_write_inode() can end up calling
    a glock recursively when called from gfs2_evict_inode() via the
    syncing code, so it needs a guard added.

    The third issue is that we either need to not truncate the metadata
    pages of inodes which have zero link count, but which we cannot
    deallocate due to them still being in use by other nodes, or we need
    to ensure that those pages have all made it through the journal and
    ail lists first. This patch takes the former approach, but the
    latter has also been tested and there is nothing to choose between
    them performance-wise. So again, we could revise that decision
    in the future.

    Also, the inode eviction process is now better documented.

    Signed-off-by: Steven Whitehouse
    Tested-by: Bob Peterson
    Tested-by: Abhijith Das
    Reported-by: Barry J. Marson
    Reported-by: David Teigland

    Steven Whitehouse
     

09 May, 2011

2 commits


20 Apr, 2011

4 commits

  • This patch adds writeback_control to writing back the AIL
    list. This means that we can then take advantage of the
    information we get in ->write_inode() in order to set off
    some pre-emptive writeback.

    In addition, the AIL code is cleaned up a bit to make it
    a bit simpler to understand.

    There is still more which can usefully be done in this area,
    but this is a good start at least.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The GLF_LRU flag introduced in the previous patch can be
    used to check if a glock is on the lru list when a new
    holder is queued and if so remove it, without having first
    to get the lru_lock.

    The main purpose of this patch however is to optimise the
    glocks left over when an inode at end of life is being
    evicted. Previously such glocks were left with the GLF_LFLUSH
    flag set, so that when reclaimed, each one required a log flush.
    This patch resets the GLF_LFLUSH flag when there is nothing
    left to flush thus preventing later log flushes as glocks are
    reused or demoted.

    In order to do this, we need to keep track of the number of
    revokes which are outstanding, and also to clear the GLF_LFLUSH
    bit after a log commit when only revokes have been processed.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
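
The revoke accounting described in the entry above might look roughly like this. It is a deliberately simplified sketch: the `sk_` names are hypothetical, and the real patch has further conditions (for example, that only revokes were processed in the commit) that are not modelled here.

```c
#include <assert.h>

#define SK_GLF_LFLUSH (1u << 0)   /* "log flush needed" bit for the sketch */

struct sk_glock {
    unsigned int flags;
    unsigned int revokes;   /* revokes still outstanding for this glock */
};

/* A new revoke means the glock will need a log flush before reuse. */
void sk_add_revoke(struct sk_glock *gl)
{
    gl->revokes++;
    gl->flags |= SK_GLF_LFLUSH;
}

/* Called once per revoke after the log commit that wrote it out. */
void sk_revoke_committed(struct sk_glock *gl)
{
    assert(gl->revokes > 0);
    if (--gl->revokes == 0)
        gl->flags &= ~SK_GLF_LFLUSH;   /* nothing left to flush */
}

/* Would reclaiming or demoting this glock require a log flush? */
int sk_needs_log_flush(const struct sk_glock *gl)
{
    return (gl->flags & SK_GLF_LFLUSH) != 0;
}
```

Once the last outstanding revoke has been committed, the flag is cleared, so a later reclaim of the glock does not trigger another log flush.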
     
  • Rather than allowing the glocks to be scheduled for possible
    reclaim as soon as they have exited the journal, this patch
    delays their entry to the list until the glocks in question
    are no longer in use.

    This means that we will rely on the vm for writeback of all
    dirty data and metadata from now on. When glocks are added
    to the lru list they should be freeable much faster since all
    the I/O required to free them should have already been completed.

    This should lead to much better I/O patterns under low memory
    conditions.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The GFS2 ->write_inode function should be more aggressive at writing
    back to the filesystem. This adopts the XFS system of returning
    -EAGAIN when the writeback has not been completely done. Also, we
    now kick off in-place writeback when called with WB_SYNC_NONE,
    but we only wait for it and flush the log when WB_SYNC_ALL is
    requested.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
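
The return convention described in the entry above can be sketched in user space as follows. The sync modes are modelled by a local enum rather than the kernel's WB_SYNC_NONE and WB_SYNC_ALL, the page counters are hypothetical, and "waiting" is simulated by zeroing a counter, so this shows only the control flow, not real I/O.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for the kernel's WB_SYNC_NONE / WB_SYNC_ALL modes. */
enum sk_sync_mode { SK_SYNC_NONE, SK_SYNC_ALL };

struct sk_wb_inode {
    int dirty_pages;   /* pages not yet submitted for writeback */
    int in_flight;     /* pages submitted but not yet completed */
    int log_flushes;   /* number of log flushes performed */
};

/*
 * Always start in-place writeback. Only a SK_SYNC_ALL caller waits for
 * the I/O and flushes the log; other callers get -EAGAIN while writeback
 * is still incomplete, so that they know to come back later.
 */
int sk_write_inode(struct sk_wb_inode *ip, enum sk_sync_mode mode)
{
    ip->in_flight += ip->dirty_pages;   /* kick off writeback */
    ip->dirty_pages = 0;

    if (mode == SK_SYNC_ALL) {
        ip->in_flight = 0;              /* simulated wait for completion */
        ip->log_flushes++;
        return 0;
    }
    return ip->in_flight ? -EAGAIN : 0;
}
```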
     

18 Apr, 2011

2 commits

  • This patch fixes a deadlock in GFS2 where two processes are trying
    to reclaim an unlinked dinode:
    One holds the inode glock and calls gfs2_lookup_by_inum trying to look
    up the inode, which it can't, due to I_FREEING. The other has set
    I_FREEING from vfs and is at the beginning of gfs2_delete_inode
    waiting for the glock, which is held by the first. The solution is to
    add a new non_block parameter to the gfs2_iget function that causes it
    to return -ENOENT if the inode is being freed.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
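
The effect of the non_block parameter described in the entry above can be sketched with a toy lookup. Everything here is an assumption for illustration: the table-based lookup, the `sk_` names and the use of -EAGAIN to represent "a blocking caller would have to wait" are not taken from the GFS2 code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inode; "freeing" plays the role of the I_FREEING state. */
struct sk_inode {
    unsigned long ino;
    bool freeing;
};

/*
 * Look up an inode by number.
 * Returns 0 and sets *out on success. If the inode is being freed, a
 * non-blocking caller gets -ENOENT immediately instead of waiting (and
 * so cannot deadlock against the task doing the free); the blocking
 * case is represented by -EAGAIN because waiting is not modelled.
 */
int sk_iget(struct sk_inode *tbl, size_t n, unsigned long ino,
            bool non_block, struct sk_inode **out)
{
    size_t i;

    *out = NULL;
    for (i = 0; i < n; i++) {
        if (tbl[i].ino != ino)
            continue;
        if (tbl[i].freeing)
            return non_block ? -ENOENT : -EAGAIN;
        *out = &tbl[i];
        return 0;
    }
    return -ENOENT;   /* no such inode */
}
```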
     
  • This adds a couple of missing tests to prevent read-only nodes
    from attempting to deallocate unlinked inodes.

    Signed-off-by: Steven Whitehouse
    Reported-by: Michel Andre de la Porte

    Steven Whitehouse
     

31 Mar, 2011

1 commit


18 Jan, 2011

1 commit

  • When a file gets deleted on GFS2, if a node can't get an exclusive lock on the
    file's iopen glock, it punts on actually freeing up the space, because another
    node is using the file. When it does this, it needs to drop the iopen glock
    from its cache so that the other node can get an exclusive lock on it. Now,
    gfs2_delete_inode() sets GL_NOCACHE before dropping the shared lock on the
    iopen glock in preparation for grabbing it in the exclusive state. Since the
    node needs the glock in the exclusive state, dropping the shared lock from the
    cache doesn't slow down the case where no other nodes are using the file.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

07 Jan, 2011

1 commit

  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

26 Oct, 2010

1 commit

  • In fill_super() we hadn't MS_ACTIVE set yet, so there won't
    be any inodes with zero i_count sitting around.

    In put_super() we already have MS_ACTIVE removed *and* we
    had called invalidate_inodes() since then. So again there
    won't be any inodes with zero i_count...

    Signed-off-by: Al Viro

    Al Viro
     

29 Sep, 2010

1 commit


24 Sep, 2010

1 commit

  • This option has never done anything useful. Also at the same time
    this cleans up the sb checks which are done at mount time. The
    debug option will be accepted, but ignored in future. Since it
    didn't do anything, there didn't seem much point in retaining it.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

23 Sep, 2010

2 commits

  • This option defaulted to on for lock_nolock mounts and off
    otherwise. The only function was to avoid the revalidation of
    dentries. In the cluster case, that is entirely pointless and
    liable to cause coherency problems.

    The patch changes the revalidation to depend upon whether the
    fs is a local or cluster fs (i.e. it follows the existing default
    behaviour). I very much doubt anybody ever used this option as
    there is no reason to. Even so we will continue to accept it
    on the mount command line, but ignore it.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This has been a no-op for a very long time now. I'm pretty sure
    nobody uses it, but just in case we'll still accept it on the
    command line, but ignore it.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

20 Sep, 2010

1 commit

  • With the update of the truncate code, ip->i_disksize and
    inode->i_size are merely copies of each other. This means
    we can remove ip->i_disksize and use inode->i_size exclusively
    reducing the size of a GFS2 inode by 8 bytes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

2 commits


29 Jul, 2010

1 commit

  • Function gfs2_write_alloc_required always returned zero as its
    return code. Therefore, it doesn't need to return a return code
    at all. Given that, we can use the return value to return whether
    or not the dinode needs block allocations rather than passing
    that value in, which in turn simplifies a bunch of error checking.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
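
The interface change described in the entry above can be illustrated with a before and after pair. The parameter list is a simplification invented for the sketch (a single `allocated` byte count standing in for the block map), and the function names are hypothetical; only the shape of the change follows the commit message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Old shape: the answer comes back through a pointer and the return
 * code is always zero, so every caller checks an error that cannot occur.
 */
int sk_alloc_required_old(uint64_t offset, uint64_t len,
                          uint64_t allocated, int *alloc_required)
{
    *alloc_required = offset + len > allocated;
    return 0;
}

/*
 * New shape: the return value is the answer itself.
 * Returns 1 if the write extends past the allocated region, else 0.
 */
int sk_alloc_required_new(uint64_t offset, uint64_t len, uint64_t allocated)
{
    return offset + len > allocated;
}
```

Callers of the new form can write `if (sk_alloc_required_new(...))` directly, with no separate error path to handle.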
     

06 May, 2010

1 commit

  • The following patch adds a message to indicate when barriers have been
    disabled due to a block device which doesn't support them. You could
    already tell this via the mount options in /proc/mounts, but all the
    other filesystems also log a message at the same time.

    Also, the same mechanisms are used to indicate when the lock
    demote interface has been used (only ever used for debugging)
    which is a request from our support team.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

05 May, 2010

1 commit

  • This patch contains various tweaks to how log flushes and active item writeback
    work. gfs2_logd is now managed by a waitqueue, and gfs2_log_reserve now waits
    for gfs2_logd to do the log flushing. Multiple functions were rewritten to
    remove the need to call gfs2_log_lock(). Instead of using one test to see if
    gfs2_logd had work to do, there are now separate tests to check if there
    are too many buffers in the incore log or if there are too many items on the
    active items list.

    This patch is a port of a patch Steve Whitehouse wrote about a year ago, with
    some minor changes. Since gfs2_ail1_start always submits all the active items,
    it no longer needs to keep track of the first ai submitted, so this has been
    removed. In gfs2_log_reserve(), the order of the calls to
    prepare_to_wait_exclusive() and wake_up() when firing off the logd thread has
    been switched. If it called wake_up first there was a small window for a race,
    where logd could run and return before gfs2_log_reserve was ready to get woken
    up. If gfs2_logd ran, but did not free up enough blocks, gfs2_log_reserve()
    would be left waiting for gfs2_logd to eventually run because it timed out.
    Finally, gt_logd_secs, which controls how long to wait before gfs2_logd times
    out, and flushes the log, can now be set on mount with ar_commit.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

06 Mar, 2010

2 commits

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
    quota: stop using QUOTA_OK / NO_QUOTA
    dquot: cleanup dquot initialize routine
    dquot: move dquot initialization responsibility into the filesystem
    dquot: cleanup dquot drop routine
    dquot: move dquot drop responsibility into the filesystem
    dquot: cleanup dquot transfer routine
    dquot: move dquot transfer responsibility into the filesystem
    dquot: cleanup inode allocation / freeing routines
    dquot: cleanup space allocation / freeing routines
    ext3: add writepage sanity checks
    ext3: Truncate allocated blocks if direct IO write fails to update i_size
    quota: Properly invalidate caches even for filesystems with blocksize < pagesize
    quota: generalize quota transfer interface
    quota: sb_quota state flags cleanup
    jbd: Delay discarding buffers in journal_unmap_buffer
    ext3: quota_write cross block boundary behaviour
    quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
    quota: split out compat_sys_quotactl support from quota.c
    quota: split out netlink notification support from quota.c
    quota: remove invalid optimization from quota_sync_all
    ...

    Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c

    Linus Torvalds
     
  • This gives the filesystem more information about the writeback that
    is happening. Trond requested this for the NFS unstable write handling,
    and other filesystems might benefit from this too by being able to
    distinguish between the different callers in more detail.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

05 Mar, 2010

1 commit

  • Currently sync_quota_sb does a lot of sync and truncate action that only
    applies to "VFS" style quotas and is actively harmful for the sync
    performance in XFS. Move it into vfs_quota_sync and add a wait parameter
    to ->quota_sync to tell if we need it or not.

    My audit of the GFS2 code says it's also not needed given the way GFS2
    implements quotas, but I'd be happy if this can get a detailed review.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

01 Mar, 2010

2 commits

  • As a consequence of the previous patch, we can now remove the
    loop which used to be required due to the circular dependency
    between the inodes and glocks. Instead we can just invalidate
    the inodes, and then clear up any glocks which are left.

    Also we no longer need the rwsem since there is no longer any
    danger of the inode invalidation calling back into the glock
    code (and from there back into the inode code).

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Since the start of GFS2, an "extra" inode has been used to store
    the metadata belonging to each inode. The only reason for using
    this inode was to have an extra address space, the other fields
    were unused. This means that the memory usage was rather inefficient.

    The reason for keeping each inode's metadata in a separate address
    space is that when glocks are requested on remote nodes, we need to
    be able to efficiently locate the data and metadata relating
    to that glock (inode) in order to sync or sync and invalidate it
    (depending on the remotely requested lock mode).

    This patch adds a new type of glock which, in addition to
    its normal fields, has an address space. This applies to all
    inode and rgrp glocks (but to no other glock types which remain
    as before). As a result, we no longer need to have the second
    inode.

    This results in three major improvements:
    1. A saving of approx 25% of memory used in caching inodes
    2. A removal of the circular dependency between inodes and glocks
    3. No confusion between "normal" and "metadata" inodes in super.c

    Although the first of these is the more immediately apparent, the
    second is just as important as it now enables a number of clean
    ups at umount time. Those will be the subject of future patches.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse