05 Jan, 2009

7 commits

  • This patch removes the two daemons, gfs2_scand and gfs2_glockd
    and replaces them with a shrinker which is called from the VM.

    The net result is that GFS2 responds better when there is memory
    pressure, since it shrinks the glock cache at the same rate
    as the VFS shrinks the dcache and icache. There are no longer
    any time based criteria for shrinking glocks; they are kept
    until such time as the VM asks for more memory and then we
    demote just as many glocks as required.
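
    Illustration only, not taken from the patch: a minimal userspace sketch of
    the "demote just as many glocks as required" idea. The struct, the list
    layout and the function name are hypothetical stand-ins, not the kernel's
    shrinker API or struct gfs2_glock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached glock; not the kernel structure. */
struct glock {
    int id;
    int in_use;          /* non-zero: still referenced, must not be demoted */
    struct glock *next;  /* next entry on the LRU list, oldest first */
};

/*
 * Walk the LRU list from the oldest entry and unlink unused glocks until
 * nr_wanted of them have been reclaimed or the list is exhausted.
 * Returns the number of glocks actually demoted.
 */
static size_t shrink_glock_cache(struct glock **lru, size_t nr_wanted)
{
    size_t demoted = 0;
    struct glock **link = lru;

    while (*link != NULL && demoted < nr_wanted) {
        struct glock *gl = *link;

        if (gl->in_use) {
            link = &gl->next;  /* busy: leave it on the list */
            continue;
        }
        *link = gl->next;      /* unlink the idle glock from the LRU */
        gl->next = NULL;
        demoted++;
    }
    return demoted;
}
```

    The point of the sketch is that nothing is reclaimed on a timer: entries
    stay cached until a caller asks for a specific number to be released.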

    There are potential future changes to this code, including the
    possibility of sorting the glocks which are to be written back
    into inode number order, to get a better I/O ordering. It would
    be very useful to have an elevator based workqueue implementation
    for this, as that would automatically deal with the read I/O cases
    at the same time.

    This patch is my answer to Andrew Morton's remark, made during
    the initial review of GFS2, asking why GFS2 needs so many kernel
    threads, the answer being that it doesn't :-) This patch is a
    net loss of about 200 lines of code.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The final field in gfs2_dinode_host was the i_flags field. That's
    renamed to i_diskflags in order to avoid confusion with the existing
    inode flags, and moved into the inode proper at a suitable location
    to avoid creating a "hole".

    At that point struct gfs2_dinode_host is no longer needed and as
    promised (quite some time ago!) it can now be removed completely.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch moves the i_size field out of gfs2_dinode_host and,
    following the ext3 convention, renames it i_disksize.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This moves the di_eattr field out of gfs2_dinode_host and
    into the inode proper.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This moves the directory entry count into the proper inode.
    Potentially we could get this to share the space used by
    something else in the future, but this is one more step
    on the way to removing the gfs2_dinode_host structure.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This moves the generation number from the gfs2_dinode_host
    into the gfs2_inode structure. Eventually the plan is to get
    rid of the gfs2_dinode_host structure completely.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Move the contents of some headers which contained very
    little into more sensible places, and remove the original
    header files. This should make it easier to find things.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Steven Whitehouse
    Cc: cluster-devel@redhat.com
    Signed-off-by: James Morris

    David Howells
     

18 Sep, 2008

1 commit

  • Until now, we've used the same scheme as GFS1 for atime. This has failed
    since atime is a per vfsmnt flag, not a per fs flag and as such the
    "noatime" flag was not getting passed down to the filesystems. This
    patch removes all the "special casing" around atime updates and we
    simply use the VFS's atime code.

    The net result is that GFS2 will now support all the same atime related
    mount options as any other filesystem on a per-vfsmnt basis. We do lose
    the "lazy atime" updates, but we gain "relatime". We could add lazy
    atime to the VFS at a later date, if there is a requirement for that
    variant still - I suspect relatime will be enough.

    Also we lose about 100 lines of code after this patch has been applied,
    and I have a suspicion that it will speed things up a bit, even when
    atime is "on". So it seems like a nice clean up as well.

    From a user perspective, everything stays the same except the loss of
    the per-fs atime quantum tweakable (ought to be per-vfsmnt at the very
    least, and to be honest I don't think anybody ever used it) and that a
    number of options which were ignored before now work correctly.

    Please let me know if you've got any comments. I'm pushing this out
    early so that you can all see what my plans are.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

05 Sep, 2008

1 commit

  • In case of error, the function gfs2_inode_lookup returns an
    ERR pointer, but never returns a NULL pointer. So a NULL test that
    necessarily comes after an IS_ERR test should be deleted, and a NULL
    test that may come after a call to this function should be
    strengthened by an IS_ERR test.

    The semantic match that finds this problem is as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    //
    @match_bad_null_test@
    expression x, E;
    statement S1,S2;
    @@
    x = gfs2_inode_lookup(...)
    ... when != x = E
    * if (x != NULL)
    S1 else S2
    //
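
    Illustration only, not part of the patch: a userspace model of the
    ERR-pointer convention, showing why a plain NULL test lets an error
    through. The helpers imitate the kernel's ERR_PTR/IS_ERR/PTR_ERR but are
    defined locally, and lookup_inode() is a hypothetical stand-in for
    gfs2_inode_lookup().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Local imitations of the kernel helpers: errors live in the top 4095 addresses. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct inode { unsigned long ino; };

/* Hypothetical lookup: returns a valid pointer or an ERR pointer, never NULL. */
static struct inode *lookup_inode(struct inode *table, size_t n, unsigned long ino)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].ino == ino)
            return &table[i];
    return ERR_PTR(-ENOENT);
}

/* Correct caller: test with IS_ERR() and propagate the encoded errno. */
static long checked_lookup(struct inode *table, size_t n, unsigned long ino)
{
    struct inode *inode = lookup_inode(table, n, ino);

    if (IS_ERR(inode))
        return PTR_ERR(inode);
    return 0;
}
```

    A caller that only checks "inode != NULL" would treat the ERR pointer as
    a valid inode and dereference it.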

    Signed-off-by: Julien Brunel
    Signed-off-by: Julia Lawall
    Signed-off-by: Steven Whitehouse

    Julien Brunel
     

27 Aug, 2008

1 commit

  • This patch fixes a locking issue in the rename code by ensuring that we hold
    the per sb rename lock over both directory and "other" renames which involve
    different parent directories.

    At the same time, this moves the (only called from one place) function
    gfs2_ok_to_move into the file that it's called from, so we can mark it
    static. This should make the code a bit easier to follow.

    Signed-off-by: Steven Whitehouse
    Cc: Peter Staubach

    Steven Whitehouse
     

27 Jul, 2008

1 commit


10 Jul, 2008

1 commit


03 Jul, 2008

1 commit

  • GFS2 calls permission() to verify permissions after locks on the files
    have been taken.

    For this it's sufficient to call gfs2_permission() instead. This
    results in the following changes:

    - IS_RDONLY() check is not performed
    - IS_IMMUTABLE() check is not performed
    - devcgroup_inode_permission() is not called
    - security_inode_permission() is not called

    IS_RDONLY() should be unnecessary anyway, as the per-mount read-only
    flag should provide protection against read-only remounts during
    operations. do_gfs2_set_flags() has been fixed to perform
    mnt_want_write()/mnt_drop_write() to protect against remounting
    read-only.

    An IS_IMMUTABLE() check has been added to gfs2_permission().

    Repeating the security checks seems to be pointless, as they don't
    normally change, and if they do, it's independent of the filesystem
    state.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Steven Whitehouse

    Miklos Szeredi
     

12 May, 2008

1 commit

  • This patch fixes a GFS2 filesystem consistency error reported from
    function do_strip. The problem was caused by a timing window
    that allowed two vfs inodes to be created in memory that point
    to the same file. The problem is fixed by making the vfs's
    iget_test, iget_set mechanism check and set a new bit in the
    in-core gfs2_inode structure while the vfs inode spin_lock is held.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

10 Apr, 2008

1 commit

  • There are several places where GFP_KERNEL allocations happen under a glock,
    which will result in hangs if we're under memory pressure and go to re-enter the
    fs in order to flush stuff out. This patch changes the culprits to GFP_NOFS to
    keep this problem from happening.

    Signed-off-by: Josef Bacik
    Signed-off-by: Steven Whitehouse

    Josef Bacik
     

31 Mar, 2008

9 commits

  • gfs2_alloc_get may fail so we have to check it to prevent
    NULL pointer dereference.

    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: Steven Whitehouse

    Cyrill Gorcunov
     
  • A previous commit removed the call to init_special_inode from inode
    lookup, which causes problems such as:

    # mknod /mnt/gfs2/dev/null c 1 3
    # cat /mnt/gfs2/dev/null
    cat: /mnt/gfs2/dev/null: Invalid argument

    Without special inode support, GFS2 cannot handle character device files,
    block device files, FIFOs, or sockets, and so loses many features
    expected of a general-purpose file system.

    This one-line patch re-adds special inode support.

    Signed-off-by: Denis Cheng
    Signed-off-by: Steven Whitehouse

    Denis Cheng
     
  • struct inode_operations gfs2_dev_iops has been identical to gfs2_file_iops
    since Jan 2006, when GFS2 was merged into the mainline kernel.

    So one of them can be removed.

    Signed-off-by: Denis Cheng
    Signed-off-by: Steven Whitehouse

    Denis Cheng
     
  • We've previously been using a "try lock" in readpage on the basis that
    it would prevent deadlocks due to the inverted lock ordering (our normal
    lock ordering is glock first and then page lock). Unfortunately tests
    have shown that this isn't enough. If the glock has a demote request
    queued such that run_queue() in the glock code tries to do a demote when
    it's called under readpage then it will try to write out all the dirty
    pages which requires locking them. This then deadlocks with the page
    locked by readpage.

    The solution is to always require two calls into readpage. The first
    unlocks the page, gets the glock and returns AOP_TRUNCATED_PAGE, the
    second does the actual readpage and unlocks the glock & page as
    required.
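
    Illustration only, not the patch itself: a toy model of the two-call
    protocol. The structs, the "held" flag and the placeholder value given to
    AOP_TRUNCATED_PAGE are assumptions made for this sketch; real code tracks
    glock holders rather than a single flag.

```c
#include <assert.h>

/* Placeholder return code for "page was unlocked, please retry". */
#define AOP_TRUNCATED_PAGE 0x80001

struct page  { int locked; int uptodate; };
struct glock { int held; };

/*
 * First call: drop the page lock, take the glock (glock before page lock is
 * the normal ordering) and ask the caller to retry.
 * Second call: the glock is already held, so read the page, then release
 * the glock and the page lock.
 */
static int sketch_readpage(struct page *page, struct glock *gl)
{
    if (!gl->held) {
        page->locked = 0;   /* never wait for the glock with the page locked */
        gl->held = 1;
        return AOP_TRUNCATED_PAGE;
    }
    page->uptodate = 1;     /* stands in for the actual read */
    gl->held = 0;
    page->locked = 0;
    return 0;
}
```

    In this model the caller re-locks the page and calls again whenever it
    sees AOP_TRUNCATED_PAGE, so the page lock is never held while the glock
    is being acquired.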

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The blocks counter is almost a duplicate of the i_blocks
    field in the VFS inode. The only difference is that i_blocks
    can be only 32bits long for 32bit arch without large single file
    support. Since GFS2 doesn't handle the non-large single file
    case (for 32 bit anyway) this adds a new config dependency on
    64BIT || LSF. This has always been the case, however we've never
    explicitly said so before.

    Even if we do add support for the non-LSF case, we will still
    not require this field to be duplicated since we will not be
    able to access oversized files anyway.

    So the net result of all this is that we shave 8 bytes from a gfs2_inode
    and get our config deps correct.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • There were three fields being used to keep track of the location
    of the most recently allocated block for each inode. These have
    been merged into a single field in order to better keep the
    data and metadata for an inode close on disk, and also to reduce
    the space required for storage.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch forms a pair with the previous patch which shrunk
    di_height. Like that patch di_depth is renamed i_depth and moved
    into struct gfs2_inode directly. Also the field goes from 16 bits
    to 8 bits since it is also limited to a max value which is rather
    small (17 in this case). In addition we also now validate the field
    against this maximum value when it's read in.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • I noticed that the latest change to i_height got rid of the
    value from the inode dump. This patch adds it back.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This patch improves the calculation of the tree height in order to reduce
    the number of operations which are carried out on each call to gfs2_block_map.
    In the common case, we now make a single comparison, rather than calculating
    the required tree height from scratch each time. Also in the case that the
    tree does need some extra height, we start from the current height rather than
    from zero when we work out what the new height ought to be.

    In addition the di_height field is moved into the inode proper and reduced
    in size to a u8 since the value must be between 0 and GFS2_MAX_META_HEIGHT (10).
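
    Illustration only, not the patched code: a sketch of the height calculation
    described above. The heightsize[] table (largest size addressable at each
    height) and the function name are hypothetical; only the maximum of 10
    comes from the text.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_META_HEIGHT 10  /* upper bound quoted in the commit message */

/*
 * heightsize[h] is assumed to hold the largest file size a metadata tree of
 * height h can address; the table must have MAX_META_HEIGHT + 1 entries.
 * Returns the height needed for `size`, never less than cur_height.
 */
static uint8_t required_height(const uint64_t *heightsize,
                               uint8_t cur_height, uint64_t size)
{
    uint8_t h = cur_height;

    /* Common case: the existing tree is already tall enough. */
    if (size <= heightsize[h])
        return h;

    /* Otherwise grow from the current height instead of from zero. */
    while (h < MAX_META_HEIGHT && size > heightsize[h])
        h++;
    return h;
}
```

    Because the result is bounded by MAX_META_HEIGHT, it fits comfortably in
    a u8, which is the narrowing the message describes.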

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

08 Feb, 2008

1 commit


25 Jan, 2008

7 commits

  • I spotted this bug while I was digging around. Looks like it could cause
    a lockup in some rare error condition.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • It is possible to reduce the size of GFS2 inodes by taking the i_alloc
    structure out of the gfs2_inode. This patch allocates the i_alloc
    structure whenever it's needed, and frees it afterward. This decreases
    the amount of low memory we use at the expense of requiring a memory
    allocation for each page or partial page that we write. A quick test
    with postmark shows that the overhead is not measurable and I also note
    that OCFS2 uses the same approach.

    In the future I'd like to solve the problem by shrinking down the size
    of the members of the i_alloc structure, but for now, this reduces the
    immediate problem of using too much low-memory on x86 and doesn't add
    too much overhead.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • GFS2 supports two modes of locking: lock_nolock for single-node filesystems
    and lock_dlm for cluster-mode locking. The gfs2 lock methods are removed from
    the file operations table for the lock_nolock protocol. This allows the VFS to
    handle posix lock and flock logic just like other in-tree filesystems, without
    duplication.

    Signed-off-by: S. Wendy Cheng
    Signed-off-by: Steven Whitehouse

    Wendy Cheng
     
  • The only reason for adding glocks to the journal was to keep track
    of which locks required a log flush prior to release. We add a
    flag to the glock to allow this check to be made in a simpler way.

    This reduces the size of a glock (by 12 bytes on i386, 24 on x86_64)
    and means that we can avoid extra work during the journal flush.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Just like ext3 we now have three sets of address space operations
    to cover the cases of writeback, ordered and journalled data
    writes. This means that the individual operations can now become
    less complicated as we are able to remove some of the tests for
    file data mode from the code.
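
    Illustration only, not the patch: a sketch of choosing one operations
    table per data mode up front, so the individual operations need no mode
    tests. All names and the action flags are hypothetical; how the real
    code selects a table is not shown in this message.

```c
#include <assert.h>
#include <string.h>

/* Illustrative action flags; the real operations do far more than this. */
#define ACT_WRITE_IN_PLACE   0x1u
#define ACT_ORDER_WITH_LOG   0x2u
#define ACT_WRITE_TO_JOURNAL 0x4u

enum data_mode { DATA_WRITEBACK, DATA_ORDERED, DATA_JOURNALLED };

struct aops {
    const char *name;
    unsigned (*writepage_actions)(void);
};

/* Each operation handles exactly one mode, so it contains no mode checks. */
static unsigned writeback_actions(void) { return ACT_WRITE_IN_PLACE; }
static unsigned ordered_actions(void)   { return ACT_WRITE_IN_PLACE | ACT_ORDER_WITH_LOG; }
static unsigned jdata_actions(void)     { return ACT_WRITE_TO_JOURNAL; }

static const struct aops writeback_aops = { "writeback",  writeback_actions };
static const struct aops ordered_aops   = { "ordered",    ordered_actions };
static const struct aops jdata_aops     = { "journalled", jdata_actions };

/* The mode is tested once, when the table is chosen. */
static const struct aops *select_aops(enum data_mode mode)
{
    switch (mode) {
    case DATA_WRITEBACK:  return &writeback_aops;
    case DATA_ORDERED:    return &ordered_aops;
    case DATA_JOURNALLED: return &jdata_aops;
    }
    return &ordered_aops;  /* defensive default for out-of-range values */
}
```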

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The i_cache was designed to keep references to the indirect blocks
    used during block mapping so that they didn't have to be looked
    up continually. The idea failed because there are too many places
    where the i_cache needs to be freed, and this has in the past been
    the cause of many bugs.

    In addition there was no performance benefit being gained since the
    disk blocks in question were cached anyway. So this patch removes
    it in order to simplify the code to prepare for other changes which
    would otherwise have had to add further support for this feature.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • As requested by Christoph, this patch cleans up GFS2's internal
    read function so that it no longer uses the do_generic_mapping_read
    function. This function is obsolete and GFS2 is the last user of it.

    As a side effect the internal read code gets smaller and easier
    to read and gfs2_readpage is split into two. One function has the locking
    and the other function has the rest of the logic.

    Signed-off-by: Steven Whitehouse
    Cc: Christoph Hellwig

    Steven Whitehouse
     

10 Oct, 2007

2 commits

  • There is a possible deadlock between two processes on the same node, where one
    process is deleting an inode, and another process is looking for allocated but
    unused inodes to delete in order to create more space.

    Process A does an iput() on inode X, and its i_count drops to 0. This causes
    iput_final() to be called, which puts the inode into state I_FREEING at
    generic_delete_inode(). There is no point between when iput_final() is called
    and when I_FREEING is set where GFS2 could acquire any glocks. Once I_FREEING is
    set, no other process on that node can successfully look up that inode until
    the delete finishes.

    Process B locks the resource group for the same inode in get_local_rgrp(),
    which is called by gfs2_inplace_reserve_i().

    Process A tries to lock the resource group for the inode in
    gfs2_dinode_dealloc(), but it's already locked by process B.

    Process B waits in find_inode for the inode to have the I_FREEING state cleared.

    Deadlock.

    This patch solves the problem by adding an alternative to gfs2_iget(),
    gfs2_iget_skip(), that simply skips any inodes that are in the I_FREEING
    state. The alternate test function is just like the original one, except that
    it fails if the inode is being freed, and sets a skipped flag. The alternate
    set function is just like the original, except that it fails if the skipped
    flag is set. Only try_rgrp_unlink() calls gfs2_iget_skip() instead of
    gfs2_iget().
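
    Illustration only, not the patch: a userspace sketch of the alternate
    test/set pair described above. The struct layouts, the flag value and the
    function names are hypothetical; the real callbacks are invoked by the VFS
    inode lookup code.

```c
#include <assert.h>
#include <stdint.h>

#define I_FREEING 0x1u  /* illustrative value, not the kernel's */

struct inode { uint64_t ino; unsigned state; };

/* Per-lookup data shared between the test and set callbacks. */
struct skip_data { uint64_t ino_wanted; int skipped; };

/*
 * Test callback: matches like the ordinary one, except that a matching
 * inode which is being freed is rejected and the skip is recorded.
 */
static int iget_skip_test(const struct inode *inode, struct skip_data *data)
{
    if (inode->ino != data->ino_wanted)
        return 0;
    if (inode->state & I_FREEING) {
        data->skipped = 1;
        return 0;
    }
    return 1;
}

/*
 * Set callback: initialises a fresh inode like the ordinary one, except
 * that it fails (non-zero) when the test callback recorded a skip.
 */
static int iget_skip_set(struct inode *inode, const struct skip_data *data)
{
    if (data->skipped)
        return 1;
    inode->ino = data->ino_wanted;
    inode->state = 0;
    return 0;
}
```

    In this model the lookup neither waits for the inode being freed nor
    creates a second in-memory copy of it, which is what breaks the deadlock
    cycle.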

    Signed-off-by: Benjamin E. Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • Fix a nasty inode metadata corruption issue by keeping the buffer head in the
    icache array. This buffer needs to stay in memory until the journal flush occurs.
    Otherwise, gfs2_meta_inode_buffer could do a disk read before the inode hits
    disk, which results in metadata corruption. The buffer will be released as
    part of the existing journal flush logic.

    Signed-off-by: S. Wendy Cheng
    Signed-off-by: Steven Whitehouse

    Wendy Cheng
     

09 Jul, 2007

5 commits

  • GFS2 has been passing i_mode within the NFS file handle. Apart from the
    wrong assumption that there is always room for this extra 16-bit value,
    the current gfs2_get_dentry doesn't really need i_mode to work
    correctly. Note that the GFS2 NFS code does go through the same lookup code
    path as the direct file access route (where the mode is obtained from name
    lookup), but gfs2_get_dentry() is coded for a different purpose. It is not
    used at lookup time; it is part of the file access procedure call.
    When the call is invoked, if the on-disk inode is not in memory, it has to
    be read in. This makes passing i_mode a useless overhead.

    Signed-off-by: S. Wendy Cheng
    Signed-off-by: Steven Whitehouse

    Wendy Cheng
     
  • The GFS2 lookup code doesn't ask for a shared inode glock. This means that
    when an in-memory inode is created for an existing file, GFS2 does not read
    the inode contents from disk, which leaves no_formal_ino uninitialized at
    lookup time. The uninitialized no_formal_ino is subsequently encoded
    into the file handle, so clients get an ESTALE error whenever they try to
    access these files.

    Signed-off-by: S. Wendy Cheng
    Signed-off-by: Steven Whitehouse

    Wendy Cheng
     
  • There were two issues during deallocation of unlinked inodes. The
    first was relating to the use of a "try" lock which in the case of
    the inode lock wasn't trying hard enough to deallocate in all
    circumstances (now changed to a normal glock) and in the case of
    the iopen lock didn't wait for the demotion of the shared lock before
    attempting to get the exclusive lock, and thereby sometimes (timing dependent)
    not completing the deallocation when it should have done.

    The second issue related to the lack of a way to invalidate dcache entries
    on remote nodes (now fixed by this patch) which meant that unlinks were
    taking a long time to return disk space to the fs. By adding some code to
    invalidate the dcache entries across the cluster for unlinked inodes, that
    is now fixed.

    This patch was written jointly by Abhijith Das and Steven Whitehouse.

    Signed-off-by: Abhijith Das
    Signed-off-by: Steven Whitehouse

    Abhijith Das
     
  • fs/gfs2/inode.c: In function 'gfs2_lookupi':
    fs/gfs2/inode.c:392: warning: 'error' may be used uninitialized in this function

    Looks like a real bug to me.

    Cc: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Steven Whitehouse

    akpm@linux-foundation.org
     
  • Under certain circumstances it's possible (though rather unlikely) that
    inodes which were unlinked by one node while still open on another might
    get "lost" in the sense that they don't get deallocated if the node
    which held the inode open crashed before it was unlinked.

    This patch adds the recovery code which allows automatic deallocation of
    the inode if it's found during block allocation (the sensible time to
    look for such inodes since we are scanning the rgrp's bitmaps anyway at
    this time, so it adds no overhead to do this).

    Since the inode will have had its i_nlink set to zero, all we need to
    trigger recovery is a lookup and an iput(), and the normal deallocation
    code takes care of the rest.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse