11 Jan, 2012

1 commit

  • Function gfs2_inplace_release was trying to unlock a gfs2_holder
    structure associated with a reservation after that reservation had
    been freed, because the statements were in the wrong order.
    This patch corrects the order so that the reservation is freed after
    the gfs2_holder is unlocked.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
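
The ordering fix in the gfs2_inplace_release commit above can be illustrated with a minimal user-space sketch. The types `holder` and `reservation` and the functions `inplace_reserve`, `inplace_release` and `holder_unlock` are hypothetical stand-ins, not the kernel's definitions, and the sketch assumes the holder lives inside the reservation's memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a lock holder and a block reservation. */
struct holder {
    bool locked;
};

struct reservation {
    struct holder gh;   /* assumption: the holder is part of the reservation */
};

/* Counts unlocks so the ordering can be observed from outside. */
static int unlock_count;

static void holder_unlock(struct holder *gh)
{
    assert(gh->locked);
    gh->locked = false;
    unlock_count++;
}

/* Allocate a reservation whose holder starts out locked. */
static struct reservation *inplace_reserve(void)
{
    struct reservation *rs = calloc(1, sizeof(*rs));

    if (rs != NULL)
        rs->gh.locked = true;
    return rs;
}

/*
 * Unlock first, free second. Reversing these two steps would touch
 * rs->gh after the memory holding it had already been released.
 */
static void inplace_release(struct reservation **rsp)
{
    struct reservation *rs = *rsp;

    if (rs == NULL)
        return;
    if (rs->gh.locked)
        holder_unlock(&rs->gh);
    free(rs);
    *rsp = NULL;
}
```

Clearing the caller's pointer after the free is a choice made for this sketch only; it makes a later accidental use fail loudly instead of reading freed memory.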
     

22 Nov, 2011

3 commits

  • Clean up gfs2_alloc_blocks so that it takes the full extent length
    rather than just the number of non-inode blocks as an argument. That
    will only make a difference in the inode allocation case for now.

    Also, this fixes the extent length handling around gfs2_alloc_extent() so
    that multi block allocations will work again.

    The rd_last_alloc block is set to the final block in the allocated
    extent (as per the update to i_goal, but referenced to a different
    start point).

    This also removes the dinode argument to rgblk_search() which is no
    longer used.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch separates the code pertaining to allocations into two
    parts: quota-related information and block reservations.
    This patch also moves all the block reservation structure allocations to
    function gfs2_inplace_reserve to simplify the code, and moves
    the frees to function gfs2_inplace_release.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This patch splits function rgblk_search into a function that finds
    blocks to allocate (rgblk_search) and a function that assigns those
    blocks (gfs2_alloc_extent).

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

21 Nov, 2011

2 commits

  • The trace point should take extlen and not *ndata as the
    extent length.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch is a revision of the one I previously posted.
    I tried to integrate all the suggestions Steve gave.
    The purpose of the patch is to change function gfs2_alloc_block
    (allocate either a dinode block or an extent of data blocks)
    to a more generic gfs2_alloc_blocks function that can
    allocate both a dinode _and_ an extent of data blocks in the
    same call. This will ultimately help us create a multi-block
    reservation scheme to reduce file fragmentation.

    This patch moves more toward a generic multi-block allocator that
    takes a pointer to the number of data blocks to allocate, plus whether
    or not to allocate a dinode. In theory, it could be called to allocate
    (1) a single dinode block, (2) a group of one or more data blocks, or
    (3) a dinode plus several data blocks.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
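
The three call patterns listed in the commit above (a lone dinode, a run of data blocks, or a dinode followed by data blocks) can be sketched in user space as follows. This is a simplified illustration, not the kernel's gfs2_alloc_blocks: the function `alloc_blocks`, its parameter list, the byte-per-block map and the `BLK_*` constants are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block states, one byte per block for simplicity. */
enum { BLK_FREE = 0, BLK_USED = 1, BLK_DINODE = 3 };

/*
 * Allocate an optional dinode followed by up to *ndata contiguous data
 * blocks, starting at the first free block in the map.
 * On success returns 0, stores the first allocated block in *bn and
 * updates *ndata to the number of data blocks actually obtained.
 * Returns -1 if nothing was requested or no free block exists.
 */
static int alloc_blocks(unsigned char *map, size_t nblocks, size_t *bn,
                        unsigned int *ndata, bool dinode)
{
    size_t want = (size_t)*ndata + (dinode ? 1 : 0);
    size_t start, len;

    assert(map != NULL && bn != NULL);
    if (want == 0)
        return -1;

    /* Find the first free block. */
    for (start = 0; start < nblocks && map[start] != BLK_FREE; start++)
        ;
    if (start == nblocks)
        return -1;

    /* The first block of the extent is the dinode when one is requested. */
    map[start] = dinode ? BLK_DINODE : BLK_USED;

    /* Extend the extent while the following blocks are free. */
    for (len = 1; len < want && start + len < nblocks &&
                  map[start + len] == BLK_FREE; len++)
        map[start + len] = BLK_USED;

    *bn = start;
    *ndata = (unsigned int)(len - (dinode ? 1 : 0));
    return 0;
}
```

Passing the data block count by pointer lets the caller learn that a shorter extent than requested was obtained, which is the property the commit message relies on for later multi-block work.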
     

18 Nov, 2011

1 commit

  • This patch removes the vestigial variable al_alloced from
    the gfs2_alloc structure. This is another baby step toward
    multi-block reservations.

    My next planned step is to decouple the quota variables
    from the gfs2_alloc structure so we can use a different
    method for allocations.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

15 Nov, 2011

2 commits

  • GFS2 functions gfs2_alloc_block and gfs2_alloc_di do basically
    the same things, with a few exceptions. This patch combines
    the two functions into a slightly more generic gfs2_alloc_block.
    Having one centralized block allocation function will reduce
    code redundancy and make it easier to implement multi-block
    reservations to reduce file fragmentation in the future.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This upstream patch had what I believe is an unintended consequence:

    http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-3.0-nmw.git;a=commitdiff;h=beca42486749c1538a5ed58fe9dcc9f26d428c93

    The patch changed function get_local_rgrp such that it ONLY
    used TRY locks for RGRP searches. Prior to that patch, the code
    used TRY locks during the first loop, and if that was unsuccessful,
    it used normal blocking locks on subsequent searches. This patch
    changes it back to the old way.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
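
The restored behaviour of get_local_rgrp, TRY locks on the first pass and blocking locks afterwards, can be sketched as a small simulation. Everything here is hypothetical: `rgrp_lock`, `lock_rgrp` and this two-pass `get_local_rgrp` only model the idea in user space, and a "blocking" lock is simulated by pretending the other holder has released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock state for one resource group. */
struct rgrp_lock {
    bool held_elsewhere;  /* another node or process holds the lock */
    bool held_by_us;
};

enum lock_mode { LOCK_TRY, LOCK_BLOCKING };

static bool lock_rgrp(struct rgrp_lock *l, enum lock_mode mode)
{
    if (l->held_elsewhere) {
        if (mode == LOCK_TRY)
            return false;           /* give up at once, try the next rgrp */
        /* Simulation only: a real blocking lock would sleep here. */
        l->held_elsewhere = false;
    }
    l->held_by_us = true;
    return true;
}

static void unlock_rgrp(struct rgrp_lock *l)
{
    l->held_by_us = false;
}

/*
 * Pass 0 uses TRY locks so contended rgrps are skipped cheaply.
 * Pass 1 uses blocking locks so a contended rgrp with space is not missed.
 * Returns the index of the rgrp left locked, or -1 if none has room.
 */
static int get_local_rgrp(struct rgrp_lock *locks,
                          const unsigned int *free_blocks,
                          size_t n, unsigned int requested)
{
    assert(locks != NULL && free_blocks != NULL);
    for (int pass = 0; pass < 2; pass++) {
        enum lock_mode mode = pass == 0 ? LOCK_TRY : LOCK_BLOCKING;

        for (size_t i = 0; i < n; i++) {
            if (!lock_rgrp(&locks[i], mode))
                continue;
            if (free_blocks[i] >= requested)
                return (int)i;
            unlock_rgrp(&locks[i]);
        }
    }
    return -1;
}
```

With TRY locks only, a search in which the sole rgrp with free space is contended would fail even though space exists; the second, blocking pass covers that case.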
     

21 Oct, 2011

8 commits

  • The two variables being initialised in gfs2_inplace_reserve
    to track the file & line number of the caller are never
    used, so we might as well remove them.

    If something does go wrong, then a stack trace is probably
    more useful anyway.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Bob reported:

    I found an off-by-one problem with how I coded this section:
    It should be:

    + else if (blk >= cur->rd_data0 + cur->rd_data)

    In fact, cur->rd_data0 + cur->rd_data is the start of the next
    rgrp (the next ri_addr), so without the "=" check it can land on
    the wrong rgrp.

    In all normal cases, this won't be a problem: you're searching
    for a block _within_ the rgrp, which will pass the test properly.
    Where it gets into trouble is if you search the rgrps for the
    block exactly equal to ri_addr. I don't think anything in the
    kernel does this, but I found a place in gfs2-utils gfs2_edit
    where it does. So I definitely need to fix it in libgfs2. I'd
    like to suggest we fix it in the kernel as well for the sake of
    keeping the functions similar.

    So this patch fixes the above mentioned off by one error as well
    as removing the unused parent pointer.

    Reported-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
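
The off-by-one described above is easiest to see in a small lookup routine. The sketch below is a user-space illustration only: a sorted array and binary search stand in for the kernel's rbtree, and the simplified `struct rgrp` and the function `blk2rgrp` are hypothetical. The field names rd_addr, rd_data0 and rd_data follow the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified resource group covering blocks
 * [rd_addr, rd_data0 + rd_data). The next rgrp starts exactly at
 * rd_data0 + rd_data.
 */
struct rgrp {
    uint64_t rd_addr;   /* first block of the rgrp (its header) */
    uint64_t rd_data0;  /* first data block */
    uint64_t rd_data;   /* number of data blocks */
};

/*
 * Find the rgrp containing blk in an array sorted by rd_addr.
 * Returns NULL when blk lies outside every rgrp.
 */
static const struct rgrp *blk2rgrp(const struct rgrp *rgs, size_t n,
                                   uint64_t blk)
{
    size_t lo = 0, hi = n;

    assert(rgs != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct rgrp *cur = &rgs[mid];

        if (blk < cur->rd_addr)
            hi = mid;
        else if (blk >= cur->rd_data0 + cur->rd_data)  /* ">=", not ">" */
            lo = mid + 1;
        else
            return cur;
    }
    return NULL;
}
```

With `>` in place of `>=`, a search for the first block of an rgrp can stop at the preceding rgrp, because that block equals the preceding rgrp's rd_data0 + rd_data and neither comparison rejects it.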
     
  • The new goal block should be set to the end of the newly
    allocated extent, not the start of it.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Each block which is deallocated requires a call to gfs2_rlist_add(),
    and each of those calls was calling gfs2_blk2rgrpd() in order to
    figure out which rgrp the block belonged in. This can be sped up
    by making use of the rgrp cached in the inode. We also reset this
    cached rgrp in case the block has changed rgrp. This should provide
    a big reduction in gfs2_blk2rgrpd() calls during deallocation.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Given that a resource group has been locked, there is no reason why
    we should not be able to allocate as many blocks as are free. The
    al_requested parameter should really be considered as a minimum
    number of blocks to be available. Should this limit be overshot,
    there are other mechanisms which will prevent over allocation.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This means that after the initial allocation for any inode, the
    last used resource group is cached in the inode for future use.
    This drastically reduces the number of lookups of resource
    groups in the common case, and thus the contention on that
    data structure.

    The allocation algorithm is the same as previously, except that we
    always check to see if the goal block is within the cached rgrp
    first before going to the rbtree to look one up.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
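
The "check the cached rgrp first, then fall back to the tree" rule from the commit above can be sketched as follows. This is a user-space illustration under stated assumptions: `struct rgrp`, `struct inode_ctx`, `slow_lookup` and `rgrp_for_goal` are hypothetical names, and a linear scan stands in for the rbtree lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified resource group covering blocks [first, first + len). */
struct rgrp {
    uint64_t first;
    uint64_t len;
};

/* Hypothetical per-inode state: the last resource group used. */
struct inode_ctx {
    const struct rgrp *cached_rgd;
};

/* Counts slow-path lookups so the saving can be observed. */
static unsigned long tree_lookups;

static bool rgrp_contains(const struct rgrp *rgd, uint64_t blk)
{
    return blk >= rgd->first && blk < rgd->first + rgd->len;
}

/* Slow path: a linear scan stands in for the shared rbtree lookup. */
static const struct rgrp *slow_lookup(const struct rgrp *rgs, size_t n,
                                      uint64_t blk)
{
    tree_lookups++;
    for (size_t i = 0; i < n; i++)
        if (rgrp_contains(&rgs[i], blk))
            return &rgs[i];
    return NULL;
}

/* Use the cached rgrp when the goal block falls inside it. */
static const struct rgrp *rgrp_for_goal(struct inode_ctx *ip,
                                        const struct rgrp *rgs, size_t n,
                                        uint64_t goal)
{
    assert(ip != NULL);
    if (ip->cached_rgd != NULL && rgrp_contains(ip->cached_rgd, goal))
        return ip->cached_rgd;
    ip->cached_rgd = slow_lookup(rgs, n, goal);
    return ip->cached_rgd;
}
```

Successive allocations for one inode usually have goal blocks in the same resource group, so after the first call most lookups are answered from the cache without touching the shared structure.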
     
  • Since we have ruled out supporting online filesystem shrink,
    it is possible to make the resource group list append only
    during the life of a super block. This gives several benefits:

    Firstly, we only need to read new rindex elements as they are added
    rather than needing to reread the whole rindex file each time one
    element is added.

    Secondly, the rindex glock can be held for much shorter periods of
    time, and is completely removed from the fast path for allocations.
    The lock is taken in shared mode only when updating the resource
    groups when the first allocation occurs, and after a grow has
    taken place.

    Thirdly, this results in a reduction in code size, and everything
    gets a lot simpler to understand in this area.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Here is an update of Bob's original rbtree patch which, in addition, also
    resolves the rather strange ref counting that was being done relating to
    the bitmap blocks.

    Originally we had a dual system for journaling resource groups. The metadata
    blocks were journaled and also the rgrp itself was added to a list. The reason
    for adding the rgrp to the list in the journal was so that the "repolish
    clones" code could be run to update the free space, and potentially send any
    discard requests when the log was flushed. This was done by comparing the
    "cloned" bitmap with what had been written back on disk during the transaction
    commit.

    Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
    until the journal had been flushed. For that reason, there was a rather
    complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
    both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
    count on the buffers.

    However, the journal maintains a reference count on the buffers anyway, since
    they are being journaled as metadata buffers. So by moving the code which deals
    with the post-journal accounting for bitmap blocks to the metadata journaling
    code, we can entirely dispense with the rather strange buffer ref counting
    scheme and also the requirement to journal the rgrps.

    The net result of all this is that the ->sd_rindex_spin is left to do exactly
    one job, and that is to look after the rbtree of rgrps.

    This patch is designed to be a stepping stone towards using RCU for the rbtree
    of resource groups, however the reduction in the number of uses of the
    ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
    anyway.

    The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
    be removed in future in favour of calling the functions directly where required
    in the code. That will allow locking of resource groups without needing to
    actually read them in - something that could be useful in speeding up statfs.

    In the meantime though it is valid to dereference ->bi_bh only when the rgrp
    is locked. This is basically the same rule as before, modulo the references not
    being valid until the following journal flush.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Bob Peterson
    Cc: Benjamin Marzinski

    Bob Peterson
     

15 Jul, 2011

1 commit

  • __gfs2_free_data and __gfs2_free_meta are almost identical, and
    can be trivially combined.

    [This is as per Eric's original patch minus gfs2_free_data() which had
    no callers left and plus the conversion of the bmap.c calls to these
    functions. All in all, a nice clean up]

    Signed-off-by: Eric Sandeen
    Signed-off-by: Steven Whitehouse

    Eric Sandeen
     

21 May, 2011

1 commit

  • The deallocation code for directories in GFS2 is largely divided into
    two parts. The first part deallocates any directory leaf blocks and
    marks the directory as being a regular file when that is complete. The
    second stage was identical to deallocating regular files.

    Regular files have their data blocks in a different
    address space to directories, and thus what would have been normal data
    blocks in a regular file (the hash table in a GFS2 directory) were
    deallocated correctly. However, a reference to these blocks was left in the
    journal (assuming of course that some previous activity had resulted in
    those blocks being in the journal or ail list).

    This patch uses the i_depth as a test of whether the inode is an
    exhash directory (we cannot test the inode type as that has already
    been changed to a regular file at this stage in deallocation).

    The original issue was reported by Chris Hertel as an issue he encountered
    running bonnie++.

    Reported-by: Christopher R. Hertel
    Cc: Abhijith Das
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

20 Apr, 2011

2 commits

  • Rather than allowing the glocks to be scheduled for possible
    reclaim as soon as they have exited the journal, this patch
    delays their entry to the list until the glocks in question
    are no longer in use.

    This means that we will rely on the vm for writeback of all
    dirty data and metadata from now on. When glocks are added
    to the lru list they should be freeable much faster since all
    the I/O required to free them should have already been completed.

    This should lead to much better I/O patterns under low memory
    conditions.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • On rare occasions we encounter gfs2 problems where an
    invalid bitmap state transition is attempted. For example,
    trying to "unlink" a free block. In these cases, there
    is really no useful information logged to debug the problem.
    This patch adds more debug details that should allow us to
    more closely examine the problem and possibly solve it.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

18 Apr, 2011

1 commit

  • This patch fixes a deadlock in GFS2 where two processes are trying
    to reclaim an unlinked dinode:
    One holds the inode glock and calls gfs2_lookup_by_inum trying to look
    up the inode, which it can't, due to I_FREEING. The other has set
    I_FREEING from vfs and is at the beginning of gfs2_delete_inode
    waiting for the glock, which is held by the first. The solution is to
    add a new non_block parameter to the gfs2_iget function that causes it
    to return -ENOENT if the inode is being freed.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
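
The effect of the non_block parameter described above can be sketched with a toy inode table. This is not gfs2_iget: `inode_entry` and `inode_lookup` are hypothetical, and returning -EAGAIN in the blocking case is only a stand-in for "the caller would sleep here until the inode is gone".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory inode record. */
struct inode_entry {
    unsigned long ino;
    bool freeing;       /* models I_FREEING being set */
};

/*
 * Look up an inode by number.
 * Returns 0 and sets *out on success.
 * Returns -ENOENT if the inode is absent, or if it is being freed and
 * the caller asked not to block.
 * Returns -EAGAIN (simulation only) where a blocking caller would wait.
 */
static int inode_lookup(struct inode_entry *tbl, size_t n, unsigned long ino,
                        bool non_block, struct inode_entry **out)
{
    assert(out != NULL);
    *out = NULL;
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].ino != ino)
            continue;
        if (tbl[i].freeing)
            return non_block ? -ENOENT : -EAGAIN;
        *out = &tbl[i];
        return 0;
    }
    return -ENOENT;
}
```

A caller that already holds the inode glock would pass non_block as true, so it backs off with -ENOENT instead of waiting on a process that is itself waiting for that glock.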
     

24 Feb, 2011

1 commit

  • This patch is a performance improvement to GFS2's dealloc code.
    Rather than update the quota file and statfs file for every
    single block that's stripped off in unlink function do_strip,
    this patch keeps track and updates them once for every layer
    that's stripped. This is done entirely inside the existing
    transaction, so there should be no risk of corruption.
    The other functions that deallocate blocks will be unaffected
    because they are using wrapper functions that do the same
    thing that they do today.

    I tested this code on my roth cluster by creating 200
    files in a directory, each of which is 100MB, then on
    four nodes, I simultaneously deleted the files, thus competing
    for GFS2 resources (but different files). The commands
    I used were:

    [root@roth-01]# time for i in `seq 1 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
    [root@roth-02]# time for i in `seq 2 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
    [root@roth-03]# time for i in `seq 3 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
    [root@roth-05]# time for i in `seq 4 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done

    The performance increase was significant:

                 roth-01    roth-02    roth-03    roth-05
                ---------  ---------  ---------  ---------
    old: real   0m34.027s  0m25.021s  0m23.906s  0m35.646s
    new: real   0m22.379s  0m24.362s  0m24.133s  0m18.562s

    Total time spent deleting:
    old: 118.6s
    new:  89.4s

    For this particular case, this showed a 25% performance increase for
    GFS2 unlinks.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
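
The batching idea in the commit above, one quota update and one statfs update per stripped layer instead of one per block, can be sketched like this. The names `strip_layer`, `statfs_change`, `quota_change` and `free_block` are hypothetical user-space stand-ins, and a zero entry is assumed to mean an unallocated pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counters standing in for the statfs and quota files. */
static int64_t statfs_free, quota_used;
static unsigned long statfs_updates, quota_updates;

static void statfs_change(int64_t delta)
{
    statfs_free += delta;
    statfs_updates++;
}

static void quota_change(int64_t delta)
{
    quota_used += delta;
    quota_updates++;
}

/* Stand-in for clearing the block's bitmap state; no accounting here. */
static void free_block(uint64_t blk)
{
    (void)blk;
}

/*
 * Strip one layer of block pointers: free each allocated block, keep a
 * running count, then update statfs and quota once for the whole layer.
 */
static void strip_layer(const uint64_t *blocks, size_t n)
{
    int64_t freed = 0;

    assert(blocks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (blocks[i] == 0)
            continue;           /* assumption: 0 means no block here */
        free_block(blocks[i]);
        freed++;
    }
    if (freed > 0) {
        statfs_change(freed);
        quota_change(-freed);
    }
}
```

The totals are the same as with per-block updates; only the number of update calls drops, from one pair per block to one pair per layer.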
     

08 Dec, 2010

1 commit


30 Nov, 2010

2 commits

  • When you truncate the rindex file, you need to avoid calling gfs2_rindex_hold,
    since you already hold it. However, if you haven't already read in the
    resource groups, you need to do that.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • When GFS2 grew the filesystem, it was never rereading the rindex file during
    the grow. This is necessary for large grows when the filesystem is almost full,
    and GFS2 needs to use some of the space allocated earlier in the grow to
    complete it. Now, if GFS2 fails to reserve the necessary space and the rindex
    file is not up to date, it rereads it. Also, the only difference between
    gfs2_ri_update() and gfs2_ri_update_special() was that gfs2_ri_update_special()
    didn't clear out the existing resource groups, since you knew that it was only
    called when there were no resource groups. Attempting to clear out the
    resource groups when there are none takes almost no time, and rarely happens,
    so I simply removed gfs2_ri_update_special().

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

15 Nov, 2010

1 commit

  • This area of the code has always been a bit delicate due to the
    subtleties of lock ordering. The problem is that for "normal"
    alloc/dealloc, we always grab the inode locks first and the rgrp lock
    later.

    In order to ensure no races in looking up the unlinked, but still
    allocated inodes, we need to hold the rgrp lock when we do the lookup,
    which means that we can't take the inode glock.

    The solution is to borrow the technique already used by NFS to solve
    what is essentially the same problem (given an inode number, look up
    the inode carefully, checking that it really is in the expected
    state).

    We cannot do that directly from the allocation code (lock ordering
    again) so we give the job to the pre-existing delete workqueue and
    carry on with the allocation as normal.

    If we find there is no space, we do a journal flush (required anyway
    if space from a deallocation is to be released) which should block
    against the pending deallocations, so we should always get the space
    back.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

23 Oct, 2010

1 commit

  • * 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
    xen-blkfront: disable barrier/flush write support
    Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
    block: remove BLKDEV_IFL_WAIT
    aic7xxx_old: removed unused 'req' variable
    block: remove the BH_Eopnotsupp flag
    block: remove the BLKDEV_IFL_BARRIER flag
    block: remove the WRITE_BARRIER flag
    swap: do not send discards as barriers
    fat: do not send discards as barriers
    ext4: do not send discards as barriers
    jbd2: replace barriers with explicit flush / FUA usage
    jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
    jbd: replace barriers with explicit flush / FUA usage
    nilfs2: replace barriers with explicit flush / FUA usage
    reiserfs: replace barriers with explicit flush / FUA usage
    gfs2: replace barriers with explicit flush / FUA usage
    btrfs: replace barriers with explicit flush / FUA usage
    xfs: replace barriers with explicit flush / FUA usage
    block: pass gfp_mask and flags to sb_issue_discard
    dm: convey that all flushes are processed as empty
    ...

    Linus Torvalds
     

01 Oct, 2010

1 commit

  • This patch fixes a GFS2 problem whereby the first rename after a
    mount can result in a file system consistency error being flagged
    improperly and cause the file system to withdraw. The problem is
    that the rename code tries to run the rgrp list with function
    gfs2_blk2rgrpd before the rgrp list is guaranteed to be read in
    from disk. The patch makes the rename function hold the rindex
    glock (as the gfs2_unlink code does today) which reads in the rgrp
    list if need be. There were a total of three places in the rename
    code that improperly referenced the rgrp list without the rindex
    glock and this patch fixes all three.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

20 Sep, 2010

3 commits

  • This patch adds support for fallocate to gfs2. Since gfs2 does not support
    uninitialized data blocks, it must write out zeros to all the blocks. However,
    since it does not need to lock any pages to read from, gfs2 can write out the
    zero blocks much more efficiently. On a moderately full filesystem, fallocate
    works around 5 times faster on average. The fallocate call also allows gfs2 to
    add blocks to the file without changing the filesize, which will make it
    possible for gfs2 to preallocate space for the rindex file, so that gfs2 can
    grow a completely full filesystem.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • This adds a check to ensure that if we reach the block allocator,
    we don't try to proceed if there is no alloc structure
    hanging off the inode. This should only happen if there is a bug
    in GFS2. The error return code is distinctive in order that it
    will be easily spotted.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • With the update of the truncate code, ip->i_disksize and
    inode->i_size are merely copies of each other. This means
    we can remove ip->i_disksize and use inode->i_size exclusively
    reducing the size of a GFS2 inode by 8 bytes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

17 Sep, 2010

1 commit

  • All the blkdev_issue_* helpers can only sanely be used by synchronous
    callers. To issue cache flushes or barriers asynchronously the caller needs
    to set up a bio by itself with a completion callback to move the asynchronous
    state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
    specified when calling blkdev_issue_* and also remove the now unused flags
    argument to blkdev_issue_flush and blkdev_issue_zeroout. For
    blkdev_issue_discard we need to keep it for the secure discard flag, which
    gains a more descriptive name and loses the bitops vs flag confusion.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Sep, 2010

1 commit


25 May, 2010

1 commit


22 May, 2010

1 commit


21 May, 2010

1 commit

  • The previous patch I wrote for reclaiming unlinked dinodes
    had some shortcomings and did not prevent all hangs.
    This version is much cleaner and more logical, and has
    passed very difficult testing. Sorry for the churn.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

12 May, 2010

1 commit


29 Apr, 2010

1 commit


14 Apr, 2010

1 commit

  • This patch fixes several gfs2 problems with the reclaiming of
    unlinked dinodes. First, there were a couple of livelocks where
    everything would come to a halt waiting for a glock that was
    seemingly held by a process that no longer existed. In fact, the
    process did exist, it just had the wrong pid number in the holder
    information. Second, there was a lock ordering problem between
    inode locking and glock locking. Third, glock/inode contention
    could sometimes cause inodes to be improperly marked invalid by
    iget_failed.

    Signed-off-by: Bob Peterson

    Bob Peterson