22 Nov, 2011

1 commit

  • This patch separates the code pertaining to allocations into two
    parts: quota-related information and block reservations.
    This patch also moves all the block reservation structure allocations to
    function gfs2_inplace_reserve to simplify the code, and moves
    the frees to function gfs2_inplace_release.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

21 Nov, 2011

1 commit

  • This patch is a revision of the one I previously posted.
    I tried to integrate all the suggestions Steve gave.
    The purpose of the patch is to change function gfs2_alloc_block
    (allocate either a dinode block or an extent of data blocks)
    to a more generic gfs2_alloc_blocks function that can
    allocate both a dinode _and_ an extent of data blocks in the
    same call. This will ultimately help us create a multi-block
    reservation scheme to reduce file fragmentation.

    This patch moves more toward a generic multi-block allocator that
    takes a pointer to the number of data blocks to allocate, plus whether
    or not to allocate a dinode. In theory, it could be called to allocate
    (1) a single dinode block, (2) a group of one or more data blocks, or
    (3) a dinode plus several data blocks.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

15 Nov, 2011

1 commit

  • GFS2 functions gfs2_alloc_block and gfs2_alloc_di do basically
    the same things, with a few exceptions. This patch combines
    the two functions into a slightly more generic gfs2_alloc_block.
    Having one centralized block allocation function will reduce
    code redundancy and make it easier to implement multi-block
    reservations to reduce file fragmentation in the future.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

21 Oct, 2011

4 commits

  • The two variables being initialised in gfs2_inplace_reserve
    to track the file & line number of the caller are never
    used, so we might as well remove them.

    If something does go wrong, then a stack trace is probably
    more useful anyway.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Each block which is deallocated, requires a call to gfs2_rlist_add()
    and each of those calls was calling gfs2_blk2rgrpd() in order to
    figure out which rgrp the block belonged in. This can be speeded up
    by making use of the rgrp cached in the inode. We also reset this
    cached rgrp in case the block has changed rgrp. This should provide
    a big reduction in gfs2_blk2rgrpd() calls during deallocation.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Since we have ruled out supporting online filesystem shrink,
    it is possible to make the resource group list append only
    during the life of a super block. This gives several benefits:

    Firstly, we only need to read new rindex elements as they are added
    rather than needing to reread the whole rindex file each time one
    element is added.

    Secondly, the rindex glock can be held for much shorter periods of
    time, and is completely removed from the fast path for allocations.
    The lock is taken in shared mode only when updating the resource
    groups when the first allocation occurs, and after a grow has
    taken place.

    Thirdly, this results in a reduction in code size, and everything
    gets a lot simpler to understand in this area.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Here is an update of Bob's original rbtree patch which, in addition, also
    resolves the rather strange ref counting that was being done relating to
    the bitmap blocks.

    Originally we had a dual system for journaling resource groups. The metadata
    blocks were journaled and also the rgrp itself was added to a list. The reason
    for adding the rgrp to the list in the journal was so that the "repolish
    clones" code could be run to update the free space, and potentially send any
    discard requests when the log was flushed. This was done by comparing the
    "cloned" bitmap with what had been written back on disk during the transaction
    commit.

    Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
    until the journal had been flushed. For that reason, there was a rather
    complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
    both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
    count on the buffers.

    However, the journal maintains a reference count on the buffers anyway, since
    they are being journaled as metadata buffers. So by moving the code which deals
    with the post-journal accounting for bitmap blocks to the metadata journaling
    code, we can entirely dispense with the rather strange buffer ref counting
    scheme and also the requirement to journal the rgrps.

    The net result of all this is that the ->sd_rindex_spin is left to do exactly
    one job, and that is to look after the rbtree or rgrps.

    This patch is designed to be a stepping stone towards using RCU for the rbtree
    of resource groups, however the reduction in the number of uses of the
    ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
    anyway.

    The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also
    be removed in future in favour of calling the functions directly where required
    in the code. That will allow locking of resource groups without needing to
    actually read them in - something that could be useful in speeding up statfs.

    In the mean time though it is valid to dereference ->bi_bh only when the rgrp
    is locked. This is basically the same rule as before, modulo the references not
    being valid until the following journal flush.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Bob Peterson
    Cc: Benjamin Marzinski

    Bob Peterson
     

15 Jul, 2011

1 commit

  • __gfs2_free_data and __gfs2_free_meta are almost identical, and
    can be trivially combined.

    [This is as per Eric's original patch minus gfs2_free_data() which had
    no callers left and plus the conversion of the bmap.c calls to these
    functions. All in all, a nice clean up]

    Signed-off-by: Eric Sandeen
    Signed-off-by: Steven Whitehouse

    Eric Sandeen
     

24 Feb, 2011

1 commit

  • This patch is a performance improvement to GFS2's dealloc code.
    Rather than update the quota file and statfs file for every
    single block that's stripped off in unlink function do_strip,
    this patch keeps track and updates them once for every layer
    that's stripped. This is done entirely inside the existing
    transaction, so there should be no risk of corruption.
    The other functions that deallocate blocks will be unaffected
    because they are using wrapper functions that do the same
    thing that they do today.

    I tested this code on my roth cluster by creating 200
    files in a directory, each of which is 100MB, then on
    four nodes, I simultaneously deleted the files, thus competing
    for GFS2 resources (but different files). The commands
    I used were:

    [root@roth-01]# time for i in `seq 1 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
    [root@roth-02]# time for i in `seq 2 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
    [root@roth-03]# time for i in `seq 3 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
    [root@roth-05]# time for i in `seq 4 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done

    The performance increase was significant:

    roth-01 roth-02 roth-03 roth-05
    --------- --------- --------- ---------
    old: real 0m34.027 0m25.021s 0m23.906s 0m35.646s
    new: real 0m22.379s 0m24.362s 0m24.133s 0m18.562s

    Total time spent deleting:
    old: 118.6s
    new: 89.4

    For this particular case, this showed a 25% performance increase for
    GFS2 unlinks.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

30 Nov, 2010

1 commit


01 Oct, 2010

1 commit

  • This patch fixes a GFS2 problem whereby the first rename after a
    mount can result in a file system consistency error being flagged
    improperly and cause the file system to withdraw. The problem is
    that the rename code tries to run the rgrp list with function
    gfs2_blk2rgrpd before the rgrp list is guaranteed to be read in
    from disk. The patch makes the rename function hold the rindex
    glock (as the gfs2_unlink code does today) which reads in the rgrp
    list if need be. There were a total of three places in the rename
    code that improperly referenced the rgrp list without the rindex
    glock and this patch fixes all three.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

09 Sep, 2009

1 commit

  • There is a potential race in the inode deallocation code if two
    nodes try to deallocate the same inode at the same time. Most of
    the issue is solved by the iopen locking. There is still a small
    window which is not covered by the iopen lock. This patches fixes
    that and also makes the deallocation code more robust in the face of
    any errors in the rgrp bitmaps, or erroneous iopen callbacks from
    other nodes.

    This does introduce one extra disk read, but that is generally not
    an issue since its the same block that must be written to later
    in the deallocation process. The total disk accesses therefore stay
    the same,

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

17 Aug, 2009

1 commit

  • A little while back, block allocation was given some improved
    error handling which meant that -EIO was returned in the case
    of there being a problem in the resource group data. In addition
    a message is printed explaning what went wrong and how to fix it.
    This extends that error handling so that it also covers inode
    allocation too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

20 May, 2009

1 commit

  • This patch improves the error handling in the case where we
    discover that the summary information in the resource group
    doesn't match the bitmap information while in the process of
    allocating blocks. Originally this resulted in a kernel bug,
    but this patch changes that so that we return -EIO and print
    some messages explaining what went wrong, and how to fix it.

    We also remember locally not to try and allocate from the
    same rgrp again, so that a subsequent allocation in a
    different rgrp should succeed.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

31 Mar, 2008

3 commits

  • Rather than having to allocate a single block at a time, this patch
    allows the block allocator to allocate an extent. Since there is
    no difference (so far as the block allocator is concerned) between
    data blocks and indirect blocks, it is posible to allocate a single
    extent and for the caller to unrevoke just the blocks required
    for indirect blocks.

    Currently the only bit of GFS2 to make use of this feature is the
    build height function. The intention is that gfs2_block_map will
    be changed to make use of this feature in future patches.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Thanks to the preceeding patches, the only difference between
    these two functions is their name. We can thus merge them
    and call the new function gfs2_alloc_block to reflect the
    fact that it can allocate either kind of block.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch removed the unnecessary parameter from function
    gfs2_rlist_alloc. The parameter was always passed in as 0.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

25 Jan, 2008

1 commit

  • It is possible to reduce the size of GFS2 inodes by taking the i_alloc
    structure out of the gfs2_inode. This patch allocates the i_alloc
    structure whenever its needed, and frees it afterward. This decreases
    the amount of low memory we use at the expense of requiring a memory
    allocation for each page or partial page that we write. A quick test
    with postmark shows that the overhead is not measurable and I also note
    that OCFS2 use the same approach.

    In the future I'd like to solve the problem by shrinking down the size
    of the members of the i_alloc structure, but for now, this reduces the
    immediate problem of using too much low-memory on x86 and doesn't add
    too much overhead.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

09 Jul, 2007

1 commit

  • This addendum patch 2 corrects three things:

    1. It fixes a stupid mistake in the previous addendum that broke gfs2.
    Ref: https://www.redhat.com/archives/cluster-devel/2007-May/msg00162.html
    2. It fixes a problem that Dave Teigland pointed out regarding the
    external declarations in ops_address.h being in the wrong place.
    3. It recasts a couple more %llu printks to (unsigned long long)
    as requested by Steve Whitehouse.

    I would have loved to put this all in one revised patch, but there was
    a rush to get some patches for RHEL5. Therefore, the previous patches
    were applied to the git tree "as is" and therefore, I'm posting another
    addendum. Sorry.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Robert Peterson
     

13 Oct, 2006

1 commit


06 Sep, 2006

1 commit


05 Sep, 2006

2 commits


01 Sep, 2006

1 commit

  • As per comments from Jan Engelhardt this
    updates the copyright message to say "version" in full rather than
    "v.2". Also incore.h has been updated to remove forward structure
    declarations which are not required.

    The gfs2_quota_lvb structure has now had endianess annotations added
    to it. Also quota.c has been updated so that we now store the
    lvb data locally in endian independant format to avoid needing
    a structure in host endianess too. As a result the endianess
    conversions are done as required at various points and thus the
    conversion routines in lvb.[ch] are no longer required. I've
    moved the one remaining constant in lvb.h thats used into lm.h
    and removed the unused lvb.[ch].

    I have not changed the HIF_ constants. That is left to a later patch
    which I hope will unify the gh_flags and gh_iflags fields of the
    struct gfs2_holder.

    Cc: Jan Engelhardt
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

11 Jul, 2006

1 commit

  • This adds a generation number for the eventual use of NFS to the
    ondisk inode. Its backward compatible with the current code since
    it doesn't really matter what the generation number is to start with,
    and indeed since its set to zero, due to it being taken from padding
    in both the inode and rgrp header, it should be fine.

    The eventual plan is to use this rather than no_formal_ino in the
    NFS filehandles. At that point no_formal_ino will be unused.

    At the same time we also add a releasepages call back to the
    "normal" address space for gfs2 inodes. Also I've removed a
    one-linrer function thats not required any more.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

22 Jun, 2006

1 commit


15 Jun, 2006

1 commit

  • This patch fixes the way we have been dealing with unlinked,
    but still open files. It removes all limits (other than memory
    for inodes, as per every other filesystem) on numbers of these
    which we can support on GFS2. It also means that (like other
    fs) its the responsibility of the last process to close the file
    to deallocate the storage, rather than the person who did the
    unlinking. Note that with GFS2, those two events might take place
    on different nodes.

    Also there are a number of other changes:

    o We use the Linux inode subsystem as it was intended to be
    used, wrt allocating GFS2 inodes
    o The Linux inode cache is now the point which we use for
    local enforcement of only holding one copy of the inode in
    core at once (previous to this we used the glock layer).
    o We no longer use the unlinked "special" file. We just ignore it
    completely. This makes unlinking more efficient.
    o We now use the 4th block allocation state. The previously unused
    state is used to track unlinked but still open inodes.
    o gfs2_inoded is no longer needed
    o Several fields are now no longer needed (and removed) from the in
    core struct gfs2_inode
    o Several fields are no longer needed (and removed) from the in core
    superblock

    There are a number of future possible optimisations and clean ups
    which have been made possible by this patch.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

19 May, 2006

1 commit


17 Jan, 2006

1 commit