17 Nov, 2014

1 commit

  • The current gfs2 freezing code is considerably more complicated than it
    should be because it doesn't use the vfs freezing code on any node except
    the one that begins the freeze. This is because it needs to acquire a
    cluster glock before calling the vfs code to prevent a deadlock, and
    without the new freeze_super and thaw_super hooks, that was impossible. To
    deal with the issue, gfs2 had to do some hacky locking tricks to make sure
    that a frozen node couldn't be holding on to a lock it needed to do the
    unfreeze ioctl.

    This patch makes use of the new hooks to simplify the gfs2 locking code. Now,
    all the nodes in the cluster freeze and thaw in exactly the same way. Every
    node in the cluster caches the freeze glock in the shared state. The new
    freeze_super hook allows the freezing node to grab this freeze glock in
    the exclusive state without first calling the vfs freeze_super function.
    All the nodes in the cluster see this lock change, and call the vfs
    freeze_super function. The vfs locking code guarantees that the nodes can't
    get stuck holding the glocks necessary to unfreeze the system. To
    unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
    glock. Again, all the nodes notice this, reacquire the glock in shared mode
    and call the vfs thaw_super function.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
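
The freeze and thaw sequence described in the commit above can be sketched as a toy model. This is only an illustration of the ordering, assuming a single initiating node and instant lock callbacks; every name in it (`struct node`, `cluster_freeze()`, `cluster_thaw()`, the `LK_*` states) is hypothetical and none of it is gfs2 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, none is a gfs2 symbol. */
enum lock_state { LK_UNLOCKED, LK_SHARED, LK_EXCLUSIVE };

struct node {
    enum lock_state freeze_lock; /* this node's hold on the freeze lock */
    bool frozen;                 /* stands in for the vfs freeze state */
};

/* At mount time every node caches the freeze lock in the shared state. */
static void cluster_init(struct node *nodes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        nodes[i].freeze_lock = LK_SHARED;
        nodes[i].frozen = false;
    }
}

/* The initiator takes the lock exclusively; every node then freezes. */
static void cluster_freeze(struct node *nodes, size_t count, size_t initiator)
{
    assert(initiator < count);
    for (size_t i = 0; i < count; i++) {
        /* The other nodes give up their shared hold when the lock changes. */
        nodes[i].freeze_lock = (i == initiator) ? LK_EXCLUSIVE : LK_UNLOCKED;
        nodes[i].frozen = true; /* the vfs freeze step, same on all nodes */
    }
}

/* The initiator drops the lock; every node re-caches it shared and thaws. */
static void cluster_thaw(struct node *nodes, size_t count, size_t initiator)
{
    assert(initiator < count);
    assert(nodes[initiator].freeze_lock == LK_EXCLUSIVE);
    for (size_t i = 0; i < count; i++) {
        nodes[i].freeze_lock = LK_SHARED;
        nodes[i].frozen = false; /* the vfs thaw step, same on all nodes */
    }
}
```

The point the model tries to capture is that the initiating node is no longer a special case: apart from which lock state it ends up holding, it goes through the same freeze and thaw steps as every other node.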
     

08 Oct, 2014

1 commit


14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In its place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work for flushing out
    and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesystem simply involves dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

07 Mar, 2014

2 commits

  • Add pr_fmt, remove embedded "GFS2: " prefixes.
    This now consistently emits lower case "gfs2: " for each message.

    Other miscellanea around these changes:

    o Add missing newlines
    o Coalesce formats
    o Realign arguments

    Signed-off-by: Joe Perches
    Signed-off-by: Steven Whitehouse

    Joe Perches
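
The `pr_fmt` mechanism used by the commit above can be illustrated in userspace. This is a minimal sketch, not the kernel's implementation: `pr_info_buf()` and `has_gfs2_prefix()` are hypothetical stand-ins, and only the idea of a `pr_fmt(fmt)` macro that prepends a fixed prefix to every format string is taken from the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* As in the kernel, pr_fmt() must be defined before the logging macros. */
#define pr_fmt(fmt) "gfs2: " fmt

/*
 * Userspace stand-in for pr_info() that formats into a buffer. It needs at
 * least one argument after the format string to stay within standard C.
 */
#define pr_info_buf(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), __VA_ARGS__)

/* Returns 1 when a formatted message carries the lower-case prefix. */
static int has_gfs2_prefix(const char *msg)
{
    assert(msg != NULL);
    return strncmp(msg, "gfs2: ", 6) == 0;
}
```

Because the prefix is joined to the format string by literal concatenation at compile time, individual messages no longer need to embed it themselves.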
     
  • -All printk(KERN_foo converted to pr_foo().
    -Messages updated to fit in 80 columns.
    -fs_macros converted as well.
    -fs_printk removed.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Steven Whitehouse

    Fabian Frederick
     

25 Feb, 2014

2 commits

  • Now we have a master transaction into which other transactions
    are merged, the accounting can be done using this master
    transaction. We no longer require the superblock fields which
    were being used for this function.

    In addition, this allows for a clean up in calc_reserved()
    making it rather easier to understand. Also, by reducing the
    number of variables used to track the buffers being added
    and removed from the journal, a number of error checks are
    now no longer required.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Over time, we hope to be able to improve the concurrency available
    in the log code. This is one small step towards that, by moving
    the buffer lists from the super block, and into the transaction
    structure, so that each transaction builds its own buffer lists.

    At transaction commit time, the buffer lists are merged into
    the currently accumulating transaction. That transaction then
    is passed into the before and after commit functions at journal
    flush time. Thus there should be no change in overall behaviour
    yet.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
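
The two commits above describe per-transaction buffer lists that are merged into one accumulating transaction at commit time, which then also carries the accounting. A rough sketch of that merge, assuming a simple singly linked list and a single counter (the names `struct buf`, `struct trans`, `trans_add()` and `trans_merge()` are hypothetical and do not match the gfs2 structures):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a journaled buffer and a transaction. */
struct buf {
    struct buf *next;
    unsigned long blkno;
};

struct trans {
    struct buf *head, *tail; /* this transaction's own buffer list */
    unsigned int num_buf;    /* accounting lives with the list */
};

/* Append one buffer to a transaction's private list. */
static void trans_add(struct trans *tr, struct buf *b)
{
    assert(tr != NULL && b != NULL);
    b->next = NULL;
    if (tr->tail)
        tr->tail->next = b;
    else
        tr->head = b;
    tr->tail = b;
    tr->num_buf++;
}

/* At commit: splice src onto the accumulating transaction, leaving src empty. */
static void trans_merge(struct trans *dst, struct trans *src)
{
    if (!src->head)
        return;
    if (dst->tail)
        dst->tail->next = src->head;
    else
        dst->head = src->head;
    dst->tail = src->tail;
    dst->num_buf += src->num_buf;
    src->head = src->tail = NULL;
    src->num_buf = 0;
}
```

With the counts kept next to the lists they describe, the totals for a journal flush can be read from the accumulating transaction rather than from separate superblock fields.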
     

21 Feb, 2014

1 commit


20 Jun, 2013

1 commit


19 Jun, 2013

1 commit

  • This patch looks at all the outstanding blocks in all the transactions
    on the log, and moves the completed ones to the ail2 list. Then it
    issues revokes for these blocks. This will hopefully speed things up
    in situations where there is a lot of contention for glocks, especially
    if they are acquired serially.

    revoke_lo_before_commit will issue at most one log block's worth of these
    preemptive revokes. The amount of reserved log space that
    gfs2_log_reserve() ignores has been incremented to allow for this extra
    block.

    This patch also consolidates the common revoke instructions into one
    function, gfs2_add_revoke().

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

01 May, 2013

1 commit

  • Pull GFS2 updates from Steven Whitehouse:
    "There is not a whole lot of change this time - there are some further
    changes which are in the works, but those will be held over until next
    time.

    Here there are some clean ups to inode creation, the addition of an
    origin (local or remote) indicator to glock demote requests, removal
    of one of the remaining GFP_NOFAIL allocations during log flushes, one
    minor clean up, and a one liner bug fix."

    * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Flush work queue before clearing glock hash tables
    GFS2: Add origin indicator to glock demote tracing
    GFS2: Add origin indicator to glock callbacks
    GFS2: replace gfs2_ail structure with gfs2_trans
    GFS2: Remove vestigial parameter ip from function rs_deltree
    GFS2: Use gfs2_dinode_out() in the inode create path
    GFS2: Remove gfs2_refresh_inode from inode creation path
    GFS2: Clean up inode creation path

    Linus Torvalds
     

29 Apr, 2013

1 commit


08 Apr, 2013

1 commit

  • In order to allow transactions and log flushes to happen at the same
    time, gfs2 needs to move the transaction accounting and active items
    list code into the gfs2_trans structure. As a first step toward this,
    this patch removes the gfs2_ail structure, and handles the active items
    list in the gfs2_trans structure. This keeps gfs2 from allocating an ail
    structure on log flushes, and gives us a structure that can later be used
    to store the transaction accounting outside of the gfs2 superblock
    structure.

    With this patch, at the end of a transaction, gfs2 will add the
    gfs2_trans structure to the superblock if there is not one already.
    This structure now has the active items fields that were previously in
    gfs2_ail. This is not necessary in the case where the transaction was
    simply used to add revokes, since these are never written outside of the
    journal, and thus, don't need an active items list.

    Also, in order to make sure that the transaction structure is not
    removed while it's still in use by gfs2_trans_end, unlocking the
    sd_log_flush_lock has to happen slightly later in ending the
    transaction.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

29 Jan, 2013

5 commits

  • Instead of using a list of buffers to write ahead of the journal
    flush, this now uses a list of inodes and calls ->writepages
    via filemap_fdatawrite() in order to achieve the same thing. For
    most use cases this results in a shorter ordered write list,
    as well as much larger i/os being issued.

    The ordered write list is sorted by inode number before writing
    in order to retain the disk block ordering between inodes as
    per the previous code.

    The previous ordered write code used to conflict in its assumptions
    about how to write out the disk blocks with mpage_writepages()
    so that with this updated version we can also use mpage_writepages()
    for GFS2's ordered write, writepages implementation. So we will
    also send larger i/os from writeback too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
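
Sorting the ordered write list by inode number, as the commit above describes, might look like the following in userspace. This is a sketch under assumptions: `struct ordered_inode` and `sort_ordered_list()` are hypothetical, the list is modelled as an array, and `qsort()` stands in for whatever list sort the kernel code actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry on the ordered write list; only the inode number matters. */
struct ordered_inode {
    uint64_t ino;
};

/* Three-way comparison that avoids overflow from subtracting 64-bit values. */
static int cmp_ino(const void *a, const void *b)
{
    const struct ordered_inode *x = a;
    const struct ordered_inode *y = b;

    return (x->ino > y->ino) - (x->ino < y->ino);
}

/* Sort ascending by inode number so writes are issued in disk order. */
static void sort_ordered_list(struct ordered_inode *list, size_t n)
{
    assert(list != NULL || n == 0);
    qsort(list, n, sizeof(*list), cmp_ino);
}
```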
     
  • The locking in gfs2_attach_bufdata() was type specific (data/meta)
    which made the function rather confusing. This patch moves the core
    of gfs2_attach_bufdata() into trans.c renaming it gfs2_alloc_bufdata()
    and moving the locking into gfs2_trans_add_data()/gfs2_trans_add_meta()

    As a result all of the locking related to adding data and metadata to
    the journal is now in these two functions. This should help to clarify
    what is going on, and give us some opportunities to simplify in
    some cases.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch copies the body of gfs2_trans_add_bh into the two newly
    added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can
    then move the .lo_add functions from lops.c into trans.c and call
    them directly.

    As a result of this, we no longer need to use the .lo_add functions
    at all, so that is removed from the log operations structure.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • There is little common content in gfs2_trans_add_bh() between the data
    and meta classes by the time that the functions which it calls are
    taken into account. The intent here is to split this into two
    separate functions. Stage one is to introduce gfs2_trans_add_data()
    and gfs2_trans_add_meta() and update the callers accordingly.

    Later patches will then pull in the content of gfs2_trans_add_bh()
    and its dependent functions in order to clean up the code in this
    area.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This moves the lo_add function for revokes into trans.c, removing
    a function call and making the code easier to read.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

07 Nov, 2012

1 commit

  • In gfs2_trans_add_bh(), gfs2 was testing if there was a bd attached to the
    buffer without having the gfs2_log_lock held. It was then assuming it would
    stay attached for the rest of the function. However, without either the log
    lock being held or the buffer locked, __gfs2_ail_flush() could detach bd at any
    time. This patch moves the locking before the test. If there isn't a bd
    already attached, gfs2 can safely allocate one and attach it before locking.
    There is no way that the newly allocated bd could be on the ail list,
    and thus no way for __gfs2_ail_flush() to detach it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

31 Jul, 2012

1 commit

  • We update gfs2_page_mkwrite() to use new freeze protection and the transaction
    code to use freeze protection while the transaction is running. That is needed
    to stop iput() of unlinked file from modifying the filesystem. The rest is
    handled by the generic code.

    CC: cluster-devel@redhat.com
    CC: Steven Whitehouse
    Acked-by: Steven Whitehouse
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

02 May, 2012

1 commit

  • This patch eliminates the gfs2_log_element data structure and
    rolls its two components into the gfs2_bufdata. This makes the code
    easier to understand and makes it easier to migrate to a rbtree
    to keep the list sorted.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

24 Apr, 2012

1 commit

  • This is another clean up in the logging code. This per-transaction
    list was largely unused. Its main function was to ensure that the
    number of buffers in a transaction was correct, however that counter
    was only used to check the number of buffers in the bd_list_tr, plus
    an assert at the end of each transaction. With the assert now changed
    to use the calculated buffer counts, we can remove both bd_list_tr and
    its associated counter.

    This should make the code easier to understand as well as shrinking
    a couple of structures.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

21 Oct, 2011

1 commit

  • Here is an update of Bob's original rbtree patch which, in addition, also
    resolves the rather strange ref counting that was being done relating to
    the bitmap blocks.

    Originally we had a dual system for journaling resource groups. The metadata
    blocks were journaled and also the rgrp itself was added to a list. The reason
    for adding the rgrp to the list in the journal was so that the "repolish
    clones" code could be run to update the free space, and potentially send any
    discard requests when the log was flushed. This was done by comparing the
    "cloned" bitmap with what had been written back on disk during the transaction
    commit.

    Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
    until the journal had been flushed. For that reason, there was a rather
    complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
    both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
    count on the buffers.

    However, the journal maintains a reference count on the buffers anyway, since
    they are being journaled as metadata buffers. So by moving the code which deals
    with the post-journal accounting for bitmap blocks to the metadata journaling
    code, we can entirely dispense with the rather strange buffer ref counting
    scheme and also the requirement to journal the rgrps.

    The net result of all this is that the ->sd_rindex_spin is left to do exactly
    one job, and that is to look after the rbtree of rgrps.

    This patch is designed to be a stepping stone towards using RCU for the rbtree
    of resource groups, however the reduction in the number of uses of the
    ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
    anyway.

    The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
    be removed in future in favour of calling the functions directly where required
    in the code. That will allow locking of resource groups without needing to
    actually read them in - something that could be useful in speeding up statfs.

    In the mean time though it is valid to dereference ->bi_bh only when the rgrp
    is locked. This is basically the same rule as before, modulo the references not
    being valid until the following journal flush.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Bob Peterson
    Cc: Benjamin Marzinski

    Bob Peterson
     

05 May, 2010

1 commit

  • This patch contains various tweaks to how log flushes and active item writeback
    work. gfs2_logd is now managed by a waitqueue, and gfs2_log_reserve now waits
    for gfs2_logd to do the log flushing. Multiple functions were rewritten to
    remove the need to call gfs2_log_lock(). Instead of using one test to see if
    gfs2_logd had work to do, there are now separate tests to check if there
    are too many buffers in the incore log or if there are too many items on the
    active items list.

    This patch is a port of a patch Steve Whitehouse wrote about a year ago, with
    some minor changes. Since gfs2_ail1_start always submits all the active items,
    it no longer needs to keep track of the first ai submitted, so this has been
    removed. In gfs2_log_reserve(), the order of the calls to
    prepare_to_wait_exclusive() and wake_up() when firing off the logd thread has
    been switched. If it called wake_up first there was a small window for a race,
    where logd could run and return before gfs2_log_reserve was ready to get woken
    up. If gfs2_logd ran, but did not free up enough blocks, gfs2_log_reserve()
    would be left waiting for gfs2_logd to eventually run because it timed out.
    Finally, gt_logd_secs, which controls how long to wait before gfs2_logd times
    out, and flushes the log, can now be set on mount with ar_commit.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

13 May, 2009

1 commit


24 Mar, 2009

2 commits

  • This patch fixes a deadlock when the journal is flushed and there
    are dirty inodes other than the one which caused the journal flush.
    Originally the journal flushing code was trying to obtain the
    transaction glock while running the flush code for an inode glock.
    We no longer require the transaction glock at this point in time
    since we know that any attempt to get the transaction glock from
    another node will result in a journal flush. So if we are flushing
    the journal, we can be sure that the transaction lock is still
    cached from when the transaction was started.

    By inlining a version of gfs2_trans_begin() (minus the bit which
    gets the transaction glock) we can avoid the deadlock problems
    caused if there is a demote request queued up on the transaction
    glock.

    In addition I've also moved the umount rwsem so that it covers
    the glock workqueue, since all demotions are done by this
    workqueue now. That fixes a bug on umount which I came across
    while fixing the original problem.

    Reported-by: David Teigland
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This is the big patch that I've been working on for some time
    now. There are many reasons for wanting to make this change
    such as:
    o Reducing overhead by eliminating duplicated fields between structures
    o Simplification of the code (reduces the code size by a fair bit)
    o The locking interface is now the DLM interface itself as proposed
    some time ago.
    o Fewer lookups of glocks when processing replies from the DLM
    o Fewer memory allocations/deallocations for each glock
    o Scope to do further optimisations in the future (but this patch is
    more than big enough for now!)

    Please note that (a) this patch relates to the lock_dlm module and
    not the DLM itself, that is still a separate module; and (b) that
    we retain the ability to build GFS2 as a standalone single node
    filesystem without requiring the DLM.

    This patch needs a lot of testing, hence my keeping it since I restarted
    my -git tree after the last merge window. That way, this has the maximum
    exposure before it's merged. This is (modulo a few minor bug fixes) the
    same patch that I've been posting on and off for the last three months
    and it's passed a number of different tests so far.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

31 Mar, 2008

1 commit

  • By adding an extra argument to gfs2_trans_add_unrevoke we can now
    specify an extent length of blocks to unrevoke. This means that
    we only need to make one pass through the list for each extent
    rather than each block. Currently the only extent length which
    is used is 1, but that will change in the future.

    Also gfs2_trans_add_unrevoke is removed from gfs2_alloc_meta
    since it's the only difference between this and gfs2_alloc_data
    which is left. This will allow a future patch to merge these
    two functions into one (i.e. one call to allocate both data
    and metadata in a single extent in the future).

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
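
The benefit of passing an extent length to the unrevoke call, as described in the commit above, is that one walk of the revoke list can cover a whole run of blocks. A minimal sketch of that idea, assuming a singly linked revoke list in which each block number appears at most once (`struct revoke` and `unrevoke_extent()` are hypothetical names, not the gfs2 ones):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical revoke entry on a singly linked list. */
struct revoke {
    struct revoke *next;
    uint64_t blkno;
};

/*
 * Unlink every revoke whose block lies in [blkno, blkno + len) in a single
 * pass over the list. Returns the number of entries removed. The early exit
 * assumes each block number is on the list at most once.
 */
static unsigned int unrevoke_extent(struct revoke **head, uint64_t blkno,
                                    unsigned int len)
{
    unsigned int found = 0;
    struct revoke **pp = head;

    assert(head != NULL);
    while (*pp && found < len) {
        struct revoke *r = *pp;

        if (r->blkno >= blkno && r->blkno < blkno + len) {
            *pp = r->next; /* unlink without restarting the walk */
            r->next = NULL;
            found++;
        } else {
            pp = &r->next;
        }
    }
    return found;
}
```

Calling this once with a length of N replaces N separate walks of the list, which is the saving the commit message refers to; with a length of 1 it behaves like the per-block version.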
     

25 Jan, 2008

1 commit

  • The only reason for adding glocks to the journal was to keep track
    of which locks required a log flush prior to release. We add a
    flag to the glock to allow this check to be made in a simpler way.

    This reduces the size of a glock (by 12 bytes on i386, 24 on x86_64)
    and means that we can avoid extra work during the journal flush.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

10 Oct, 2007

3 commits

  • The following alters gfs2_trans_add_revoke() to take a struct
    gfs2_bufdata as an argument. This eliminates the memory allocation which
    was previously required by making use of the already existing struct
    gfs2_bufdata. It makes some sanity checks to ensure that the
    gfs2_bufdata has been removed from all the lists before it's recycled as
    a revoke structure. This saves one memory allocation and one free per
    revoke structure.

    Also as a result, and to simplify the locking, since there is no longer
    any blocking code in gfs2_trans_add_revoke() we must hold the log lock
    whenever this function is called. This reduces the number of times we
    take and unlock the log lock.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The old revoke structure was allocated using kmalloc/kfree but
    there is a slab cache for gfs2_bufdata, so we should use that
    now that the structures have been converted.

    This is part two of the patch series to merge the revoke
    and gfs2_bufdata structures.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Both the revoke structure and the bufdata structure are quite similar.
    They are basically small tags which are put on lists. In addition to
    which the revoke structure is always allocated when there is a bufdata
    structure which is (or can be) freed. As such it should be possible to
    reduce the number of frees and allocations by using the same structure
    for both purposes.

    This patch is the first step along that path. It replaces existing uses
    of the revoke structure with the bufdata structure.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

19 Sep, 2006

1 commit

  • lm_interface.h has a few out of the tree clients such as GFS1
    and userland tools.

    Right now, these clients keep a copy of the file in their build tree
    that can go out of sync.

    Move lm_interface.h to include/linux, export it to userland and
    clean up fs/gfs2 to use the new location.

    Signed-off-by: Fabio M. Di Nitto
    Signed-off-by: Steven Whitehouse

    Fabio Massimo Di Nitto
     

05 Sep, 2006

1 commit


01 Sep, 2006

1 commit

  • As per comments from Jan Engelhardt this
    updates the copyright message to say "version" in full rather than
    "v.2". Also incore.h has been updated to remove forward structure
    declarations which are not required.

    The gfs2_quota_lvb structure has now had endianness annotations added
    to it. Also quota.c has been updated so that we now store the
    lvb data locally in endian independent format to avoid needing
    a structure in host endianness too. As a result the endianness
    conversions are done as required at various points and thus the
    conversion routines in lvb.[ch] are no longer required. I've
    moved the one remaining constant in lvb.h that's used into lm.h
    and removed the unused lvb.[ch].

    I have not changed the HIF_ constants. That is left to a later patch
    which I hope will unify the gh_flags and gh_iflags fields of the
    struct gfs2_holder.

    Cc: Jan Engelhardt
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
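
Keeping data in an endian independent format and converting only at the point of use, as the commit above describes for the quota lvb, can be sketched as follows. This is a userspace illustration under assumptions: big-endian storage is assumed, and `put_be64()`, `get_be64()` and `be64_add()` are hypothetical helpers rather than the kernel's own conversion routines.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Store a 64-bit value in big-endian byte order, whatever the host order. */
static void put_be64(uint8_t *p, uint64_t v)
{
    assert(p != NULL);
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Read a big-endian 64-bit value back into host order. */
static uint64_t get_be64(const uint8_t *p)
{
    uint64_t v = 0;

    assert(p != NULL);
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Convert at the point of use: decode, modify, and re-encode in place. */
static void be64_add(uint8_t *p, uint64_t delta)
{
    put_be64(p, get_be64(p) + delta);
}
```

Since the stored bytes are the only copy, there is no second host-order structure to keep in sync, which is the duplication the commit removes.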
     

19 May, 2006

2 commits


27 Apr, 2006

1 commit


12 Apr, 2006

1 commit

  • A small update to the journaling code to change the way that
    the "extra" blocks are accounted for in the journal. These are
    used at a rate of one per 503 metadata blocks or one per 251
    journaled data blocks (or just one if the total number of journaled
    blocks in the transaction is smaller). Since we are using them at
    two different rates the old method of accounting for them no longer
    works and we count them up as required.

    Since the "per transaction" accounting can't handle this (there is no
    fixed number of header blocks per transaction) we have to account for
    it in the general journal code. We now require that each transaction
    reserves more blocks than it actually needs to take account of the
    possible extra blocks.

    Also a final fix to dir.c to ensure that all ref counts are handled
    correctly.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
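
The rates quoted in the commit above (one extra block per 503 metadata blocks, one per 251 journaled data blocks) suggest a round-up calculation. The sketch below is one reading of that description, not the actual gfs2 accounting: `extra_blocks()` and the two constants' names are hypothetical, and it assumes metadata and journaled data are counted separately and each rounded up on its own.

```c
#include <assert.h>

/* Rates taken from the commit message; the macro names are hypothetical. */
#define META_PER_EXTRA 503u
#define DATA_PER_EXTRA 251u

/* Integer division rounding up, so any non-zero count needs at least one. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
    assert(d != 0);
    return (n + d - 1) / d;
}

/* Extra blocks a transaction should reserve on top of the blocks it journals. */
static unsigned int extra_blocks(unsigned int nmeta, unsigned int ndata)
{
    return div_round_up(nmeta, META_PER_EXTRA) +
           div_round_up(ndata, DATA_PER_EXTRA);
}
```

Under this reading a transaction with 504 metadata blocks reserves two extra blocks, while one with only a handful of blocks still reserves one, matching the "or just one if the total number is smaller" remark.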
     

07 Apr, 2006

1 commit

  • This fixes a ref count bug that sometimes showed up at umount time
    (causing it to hang) but is otherwise mostly harmless. At the same
    time there are some clean ups including making the log operations
    structures const, moving a memory allocation so that it's not done
    in the fast path of checking to see if there is an outstanding
    transaction related to a particular glock.

    Removes the sd_log_wrap variable which was updated, but never actually
    used anywhere. Updates the gfs2 ioctl() to run without the kernel lock
    (which it never needed anyway). Removes the "invalidate inodes" loop
    from GFS2's put_super routine. This is done in kill super anyway so
    we don't need to do it here. The loop was also bogus in that if there
    are any inodes "stuck" at this point it's a bug and we need to know
    about it rather than hide it by hanging forever.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse