11 Jan, 2012

3 commits

  • If the first mounter fails to recover one of the journals
    during mount, the mount should fail.

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     
  • Previously, a spectator mount would not even attempt to do
    journal recovery for a failed node. This meant that if all
    mounted nodes were spectators, everyone would be stuck after
    a node failed, all waiting for recovery to be performed.
    This is unnecessary since the failed node had a clean journal.

    Instead, allow a spectator mount to do a partial "read only"
    recovery, which means it will check if the failed journal is
    clean, and if so, report a successful recovery. If the failed
    journal is not clean, it reports that journal recovery failed.
    This makes it work the same as a read only mount on a read only
    block device.

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
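
    The check itself is tiny. A minimal user-space sketch of the idea
    follows, assuming an illustrative "journal is clean" flag rather
    than the real on-disk log header format:

        #include <stdio.h>

        #define LOG_HEAD_UNMOUNT 0x01  /* illustrative "clean journal" flag */

        struct log_header {
            unsigned int flags;
        };

        /* Return 0 if the journal is clean (recovery "succeeds"),
         * -1 if it is dirty and a real replay would be needed. */
        static int spectator_recover(const struct log_header *last_lh)
        {
            if (last_lh->flags & LOG_HEAD_UNMOUNT)
                return 0;   /* clean: report a successful recovery */
            return -1;      /* dirty: a read-only mounter cannot replay */
        }

        int main(void)
        {
            struct log_header clean = { .flags = LOG_HEAD_UNMOUNT };
            struct log_header dirty = { .flags = 0 };

            printf("clean journal -> %d\n", spectator_recover(&clean));
            printf("dirty journal -> %d\n", spectator_recover(&dirty));
            return 0;
        }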
     
  • This new method of managing recovery is an alternative to
    the previous approach of using the userland gfs_controld.

    - use dlm slot numbers to assign journal ids (see the sketch
      after this entry)
    - use dlm recovery callbacks to initiate journal recovery
    - use a dlm lock to determine the first node to mount fs
    - use a dlm lock to track journals that need recovery

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
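
    A toy of the first bullet point: the dlm hands each mounted node a
    small, unique, 1-based slot number, and the 0-based journal id can
    be derived from it directly. The exact arithmetic is an assumption
    about the scheme, not the kernel code:

        #include <stdio.h>

        static int jid_from_slot(int slot)
        {
            return slot - 1;   /* 1-based dlm slot -> 0-based journal id */
        }

        int main(void)
        {
            for (int slot = 1; slot <= 3; slot++)
                printf("dlm slot %d -> journal%d\n", slot, jid_from_slot(slot));
            return 0;
        }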
     

22 Nov, 2011

1 commit

  • This patch separates the code pertaining to allocations into two
    parts: quota-related information and block reservations.
    This patch also moves all the block reservation structure allocations to
    function gfs2_inplace_reserve to simplify the code, and moves
    the frees to function gfs2_inplace_release.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
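
    The shape of that pairing can be sketched as a self-contained toy,
    with all names illustrative: the reservation structure is created
    in one well-defined place and freed in its counterpart, so callers
    never manage it directly.

        #include <stdio.h>
        #include <stdlib.h>

        struct blk_reservation { unsigned long requested; };
        struct toy_inode { struct blk_reservation *res; };

        /* allocate the reservation in one place... */
        static int inplace_reserve(struct toy_inode *ip, unsigned long blocks)
        {
            ip->res = malloc(sizeof(*ip->res));
            if (!ip->res)
                return -1;
            ip->res->requested = blocks;
            return 0;
        }

        /* ...and free it only in its counterpart */
        static void inplace_release(struct toy_inode *ip)
        {
            free(ip->res);
            ip->res = NULL;
        }

        int main(void)
        {
            struct toy_inode ip = { 0 };

            if (inplace_reserve(&ip, 16) == 0) {
                printf("reserved %lu blocks\n", ip.res->requested);
                inplace_release(&ip);
            }
            return 0;
        }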
     

18 Nov, 2011

1 commit

  • This patch removes the vestigial variable al_alloced from
    the gfs2_alloc structure. This is another baby step toward
    multi-block reservations.

    My next planned step is to decouple the quota variables
    from the gfs2_alloc structure so we can use a different
    method for allocations.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

21 Oct, 2011

6 commits

  • The two variables being initialised in gfs2_inplace_reserve
    to track the file & line number of the caller are never
    used, so we might as well remove them.

    If something does go wrong, then a stack trace is probably
    more useful anyway.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • GFS2's fallocate code currently goes through the page cache. Since it's only
    writing to the end of the file or to holes in it, it doesn't need to, and it
    was causing issues on low memory environments. This patch pulls in some of
    Steve's block allocation work, and uses it to simply allocate the blocks for
    the file, and zero them out at allocation time. It provides a slight
    performance increase, and it dramatically simplifies the code.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • This means that after the initial allocation for any inode, the
    last used resource group is cached in the inode for future use.
    This drastically reduces the number of lookups of resource
    groups in the common case, and thus the contention on that
    data structure.

    The allocation algorithm is the same as previously, except that we
    always check to see if the goal block is within the cached rgrp
    first before going to the rbtree to look one up.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
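
    A toy version of the fast path, with a flat array standing in for
    the rbtree and all names illustrative:

        #include <stdio.h>

        struct rgrp { unsigned long start, len; };

        static struct rgrp rgrps[] = {
            { 0, 1000 }, { 1000, 1000 }, { 2000, 1000 },
        };

        /* stands in for the rbtree walk */
        static struct rgrp *tree_lookup(unsigned long block)
        {
            for (size_t i = 0; i < sizeof(rgrps) / sizeof(rgrps[0]); i++)
                if (block >= rgrps[i].start &&
                    block < rgrps[i].start + rgrps[i].len)
                    return &rgrps[i];
            return NULL;
        }

        struct toy_inode { struct rgrp *rgd; /* last used rgrp */ };

        static struct rgrp *rgrp_for_goal(struct toy_inode *ip,
                                          unsigned long goal)
        {
            struct rgrp *rgd = ip->rgd;

            /* fast path: the goal block is still in the cached rgrp */
            if (rgd && goal >= rgd->start && goal < rgd->start + rgd->len)
                return rgd;

            /* slow path: look it up, then cache it for next time */
            ip->rgd = tree_lookup(goal);
            return ip->rgd;
        }

        int main(void)
        {
            struct toy_inode ip = { 0 };

            printf("goal 1500 -> rgrp at %lu (tree lookup)\n",
                   rgrp_for_goal(&ip, 1500)->start);
            printf("goal 1600 -> rgrp at %lu (cached)\n",
                   rgrp_for_goal(&ip, 1600)->start);
            return 0;
        }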
     
  • Since we have ruled out supporting online filesystem shrink,
    it is possible to make the resource group list append only
    during the life of a super block. This gives several benefits:

    Firstly, we only need to read new rindex elements as they are added
    rather than needing to reread the whole rindex file each time one
    element is added.

    Secondly, the rindex glock can be held for much shorter periods of
    time, and is completely removed from the fast path for allocations.
    The lock is taken in shared mode only when updating the resource
    groups when the first allocation occurs, and after a grow has
    taken place.

    Thirdly, this results in a reduction in code size, and everything
    gets a lot simpler to understand in this area.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
    Here is an update of Bob's original rbtree patch which also
    resolves the rather strange ref counting that was being done
    relating to the bitmap blocks.

    Originally we had a dual system for journaling resource groups. The metadata
    blocks were journaled and also the rgrp itself was added to a list. The reason
    for adding the rgrp to the list in the journal was so that the "repolish
    clones" code could be run to update the free space, and potentially send any
    discard requests when the log was flushed. This was done by comparing the
    "cloned" bitmap with what had been written back on disk during the transaction
    commit.

    Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
    until the journal had been flushed. For that reason, there was a rather
    complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
    both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
    count on the buffers.

    However, the journal maintains a reference count on the buffers anyway, since
    they are being journaled as metadata buffers. So by moving the code which deals
    with the post-journal accounting for bitmap blocks to the metadata journaling
    code, we can entirely dispense with the rather strange buffer ref counting
    scheme and also the requirement to journal the rgrps.

    The net result of all this is that the ->sd_rindex_spin is left to do exactly
    one job, and that is to look after the rbtree of rgrps.

    This patch is designed to be a stepping stone towards using RCU for the rbtree
    of resource groups, however the reduction in the number of uses of the
    ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
    anyway.

    The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
    be removed in future in favour of calling the functions directly where required
    in the code. That will allow locking of resource groups without needing to
    actually read them in - something that could be useful in speeding up statfs.

    In the mean time though it is valid to dereference ->bi_bh only when the rgrp
    is locked. This is basically the same rule as before, modulo the references not
    being valid until the following journal flush.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Bob Peterson
    Cc: Benjamin Marzinski

    Bob Peterson
     
  • If we have got far enough through the inode allocation code
    path that an inode has already been allocated, then we must
    call iput to dispose of it, if an error occurs during a
    later part of the process. This will always be the final iput
    since there will be no other references to the inode.

    Unlike when the inode has been unlinked, its block state will
    be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need
    to skip the test in ->evict_inode() for this one case in order
    to ensure that it will be deallocated correctly. This patch adds
    a new flag in order to ensure that this will happen correctly.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
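
    The flow can be modelled with a refcounted toy object; the flag
    and function names are illustrative, not the kernel's:

        #include <stdio.h>
        #include <stdlib.h>

        struct toy_inode { int refs; int alloc_failed; };

        static void toy_evict(struct toy_inode *ip)
        {
            /* the new flag tells eviction to deallocate even though
             * the block state is "inode" rather than "unlinked" */
            if (ip->alloc_failed)
                puts("evict: deallocating despite GFS2_BLKST_INODE state");
            free(ip);
        }

        static void toy_iput(struct toy_inode *ip)
        {
            if (--ip->refs == 0)
                toy_evict(ip);      /* the final iput disposes of it */
        }

        static struct toy_inode *toy_create(int fail_late)
        {
            struct toy_inode *ip = calloc(1, sizeof(*ip));

            if (!ip)
                return NULL;
            ip->refs = 1;
            if (fail_late) {        /* error after the inode exists: */
                ip->alloc_failed = 1;
                toy_iput(ip);       /* must go through iput, not free() */
                return NULL;
            }
            return ip;
        }

        int main(void)
        {
            if (!toy_create(1))
                puts("creation failed; inode disposed of via final iput");
            return 0;
        }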
     

15 Jul, 2011

2 commits

  • This patch is a performance improvement for GFS2 in a clustered
    environment. It makes the glock hold time self-adjusting.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
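
    One plausible self-adjusting scheme, as a toy model (the real
    bounds and adjustment rule may differ):

        #include <stdio.h>

        #define HOLD_MIN 10    /* arbitrary units */
        #define HOLD_MAX 100

        /* lengthen the hold time while the glock keeps getting local
         * users; shorten it when demotes arrive with no local use */
        static int adjust_hold(int hold, int used_locally)
        {
            hold = used_locally ? hold * 2 : hold / 2;
            if (hold < HOLD_MIN)
                hold = HOLD_MIN;
            if (hold > HOLD_MAX)
                hold = HOLD_MAX;
            return hold;
        }

        int main(void)
        {
            int hold = HOLD_MIN;
            int pattern[] = { 1, 1, 1, 0, 0, 1 };

            for (int i = 0; i < 6; i++) {
                hold = adjust_hold(hold, pattern[i]);
                printf("tick %d: hold time = %d\n", i, hold);
            }
            return 0;
        }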
     
  • This patch adds a cache for the hash table to the directory code
    in order to help simplify the way in which the hash table is
    accessed. This is intended to be a first step towards introducing
    some performance improvements in the directory code.

    There are two follow ups that I'm hoping to see fairly shortly. One
    is to simplify the hash table reading code now that we always read the
    complete hash table, whether we want one entry or all of them. The
    other is to introduce readahead on the heads of the hash chains
    which are referred to from the table.

    The hash table is a maximum of 128k in size, so it is not worth trying
    to read it in small chunks.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
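
    The caching pattern, sketched in user-space C with the disk read
    simulated:

        #include <stdio.h>
        #include <stdlib.h>

        #define HT_SIZE (128 * 1024)   /* the table's maximum size */

        struct toy_dir { unsigned char *hash_cache; };

        /* read the whole table on first use, then reuse the copy
         * whether one entry or all of them is wanted */
        static unsigned char *get_hash_table(struct toy_dir *d)
        {
            if (!d->hash_cache) {
                d->hash_cache = calloc(1, HT_SIZE); /* "disk read" */
                if (d->hash_cache)
                    puts("hash table read from disk (once)");
            }
            return d->hash_cache;
        }

        int main(void)
        {
            struct toy_dir d = { 0 };

            get_hash_table(&d);   /* reads */
            get_hash_table(&d);   /* cached: no second read */
            free(d.hash_cache);
            return 0;
        }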
     

12 Jul, 2011

1 commit

  • There is a potential race during filesystem mounting which has recently
    been reported. It occurs when the userland gfs_controld is able to
    process requests fast enough that it tries to use the sysfs interface
    before the lock module is properly initialised. This is a pretty
    unusual case as normally the lock module initialisation is very quick
    compared with gfs_controld.

    This patch adds an interruptible completion which is used to ensure that
    userland will wait for the initialisation of the lock module to
    complete.

    There are other potential solutions to this problem, but this is the
    quickest at this stage and has been tested both with and without
    mount.gfs2 present in the system.

    Signed-off-by: Steven Whitehouse
    Reported-by: David Booher

    Steven Whitehouse
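
    A user-space analogue of that completion, built on a pthread
    condition variable; build with cc -pthread:

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
        static int initialised;

        /* analogue of complete_all() at the end of lock module init */
        static void signal_init_done(void)
        {
            pthread_mutex_lock(&lock);
            initialised = 1;
            pthread_cond_broadcast(&done);
            pthread_mutex_unlock(&lock);
        }

        /* analogue of waiting on the completion in the sysfs code */
        static void wait_for_init(void)
        {
            pthread_mutex_lock(&lock);
            while (!initialised)
                pthread_cond_wait(&done, &lock);
            pthread_mutex_unlock(&lock);
        }

        static void *lock_module_init(void *arg)
        {
            (void)arg;
            sleep(1);              /* pretend init takes a while */
            signal_init_done();
            return NULL;
        }

        int main(void)
        {
            pthread_t t;

            pthread_create(&t, NULL, lock_module_init, NULL);
            wait_for_init();       /* the sysfs request blocks here */
            puts("sysfs request proceeds only after init completes");
            pthread_join(t, NULL);
            return 0;
        }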
     

20 Apr, 2011

3 commits

  • This patch adds a writeback_control argument to the writing back
    of the AIL list. This means that we can then take advantage of the
    information we get in ->write_inode() in order to set off
    some pre-emptive writeback.

    In addition, the AIL code is cleaned up a bit to make it
    a bit simpler to understand.

    There is still more which can usefully be done in this area,
    but this is a good start at least.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The GLF_LRU flag introduced in the previous patch can be
    used to check if a glock is on the lru list when a new
    holder is queued and if so remove it, without having first
    to get the lru_lock.

    The main purpose of this patch however is to optimise the
    glocks left over when an inode at end of life is being
    evicted. Previously such glocks were left with the GLF_LFLUSH
    flag set, so that when reclaimed, each one required a log flush.
    This patch resets the GLF_LFLUSH flag when there is nothing
    left to flush thus preventing later log flushes as glocks are
    reused or demoted.

    In order to do this, we need to keep track of the number of
    revokes which are outstanding, and also to clear the GLF_LFLUSH
    bit after a log commit when only revokes have been processed.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This adds support for two new flags. One keeps track of whether
    the glock is on the LRU list or not. The other isn't really a
    flag as such, but an indication of whether the glock has an
    attached object or not. This indication is reported without
    any locking, which is ok since we do not dereference the object
    pointer but merely report whether it is NULL or not.

    Also, this fixes one place where a tracepoint was missing, which
    was at the point we remove deallocated blocks from the journal.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

11 Mar, 2011

1 commit

  • The log lock is currently used to protect the AIL lists and
    the movements of buffers into and out of them. The lists
    are self contained and no log specific items outside the
    lists are accessed when starting or emptying the AIL lists.

    Hence the operation of the AIL does not require the protection
    of the log lock so split them out into a new AIL specific lock
    to reduce the amount of traffic on the log lock. This will
    also reduce the amount of serialisation that occurs when
    the gfs2_logd pushes on the AIL to move it forward.

    This reduces the impact of log pushing on sequential write
    throughput.

    Signed-off-by: Dave Chinner
    Signed-off-by: Steven Whitehouse

    Dave Chinner
     

09 Mar, 2011

1 commit

  • Immediately after being synced to disk, cached quotas are zeroed out and a
    subsequent access of the cached quotas results in incorrect zero values. This
    meant that gfs2 assumed the actual usage to be the zero (or near-zero) usage
    values it found in the cached quotas and comparison against warn/limits never
    triggered a quota violation.

    This patch adds a new flag QDF_REFRESH that is set after a sync so that the
    cached quotas are forcefully refreshed from disk on a subsequent access on
    seeing this flag set.

    Resolves: rhbz#675944
    Signed-off-by: Abhi Das
    Signed-off-by: Steven Whitehouse

    Abhijith Das
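
    A toy model of the QDF_REFRESH idea: the flag forces the next
    access to re-read the authoritative on-disk value instead of
    trusting the freshly zeroed cache.

        #include <stdbool.h>
        #include <stdio.h>

        static long disk_usage;   /* stands in for the on-disk quota file */

        struct toy_quota { long usage; bool refresh; };

        /* sync writes the cached usage out and zeroes the cache --
         * the situation that used to fool the limit checks */
        static void quota_sync(struct toy_quota *q)
        {
            disk_usage += q->usage;
            q->usage = 0;
            q->refresh = true;    /* mark the cached copy stale */
        }

        static long quota_get(struct toy_quota *q)
        {
            if (q->refresh) {     /* forcefully re-read from "disk" */
                q->usage = disk_usage;
                q->refresh = false;
            }
            return q->usage;
        }

        int main(void)
        {
            struct toy_quota q = { .usage = 4096, .refresh = false };

            quota_sync(&q);
            printf("usage seen after sync = %ld (not the bogus 0)\n",
                   quota_get(&q));
            return 0;
        }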
     

21 Jan, 2011

1 commit

  • This conversion of the glock hash table to RCU has a number of
    advantages:

    - Reduces contention on the hash table lock
    - Makes the code smaller and simpler
    - Should speed up glock dumps when under load
    - Removes ref count changing in examine_bucket
    - No longer need hash chain lock in glock_put() in common case

    There are some further changes which this enables and which
    we may do in the future. One is to look at using SLAB_RCU,
    and another is to look at using a per-cpu counter for the
    per-sb glock counter, since that is touched twice in the
    lifetime of each glock (but only used at umount time).

    Signed-off-by: Steven Whitehouse
    Cc: Paul E. McKenney

    Steven Whitehouse
     

30 Nov, 2010

1 commit

  • We can only merge the fields into a bitfield if the locking
    rules for them are the same. In this case gl_spin covers all
    of the fields (write side) but a couple of them are used
    with GLF_LOCK as the read side lock, which should be ok
    since we know that the field in question won't be changing
    at the time.

    The gl_req setting has to be done earlier (in glock.c) in order
    to place it under gl_spin. The gl_reply setting also has to be
    brought under gl_spin in order to comply with the new rules.

    This saves 4*sizeof(unsigned int) per glock.

    Signed-off-by: Steven Whitehouse
    Cc: Bob Peterson

    Steven Whitehouse
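
    An illustration of the saving; the field names and widths here are
    invented, not the actual glock layout:

        #include <stdio.h>

        struct glock_before {          /* five separate counters */
            unsigned int state;
            unsigned int target;
            unsigned int demote_state;
            unsigned int req;
            unsigned int reply;
        };

        struct glock_after {           /* packed into one word */
            unsigned int state:4;
            unsigned int target:4;
            unsigned int demote_state:4;
            unsigned int req:4;
            unsigned int reply:8;
        };

        int main(void)
        {
            printf("before: %zu bytes, after: %zu bytes\n",
                   sizeof(struct glock_before),
                   sizeof(struct glock_after));
            return 0;
        }

    This prints 20 versus 4 bytes, i.e. the 4*sizeof(unsigned int)
    saving mentioned above. It also shows why the locking rules must
    match: a store to any one bitfield member rewrites the shared
    word, so all of the fields need the same write-side lock.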
     

29 Sep, 2010

1 commit

  • Recently a feature was added to GFS2 to allow journal id allocation
    via sysfs. This patch builds upon that so that a negative journal id
    will be treated as an error code to be passed back as the return code
    from mount. This allows termination of the mount process if there is
    a failure.

    Also, the process has been updated so that the kernel will wait
    for a journal id, even in the "spectator" case. This is required
    in order to avoid mounting a filesystem in case there is an error
    while joining the cluster. In the spectator case, 0 is written into
    the file to indicate that all is well, and that mount should continue.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
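
    A toy of the handshake logic as described; this is not the
    kernel's actual sysfs store function:

        #include <stdio.h>

        /* returns the journal id (>= 0), or a negative errno that
         * becomes the return code from mount */
        static int handle_jid_write(int value, int spectator)
        {
            if (value < 0)
                return value;    /* cluster join failed: abort mount */
            if (spectator)
                return 0;        /* 0 just means "all is well" */
            return value;        /* journal id to use */
        }

        int main(void)
        {
            printf("jid write -22 -> %d (mount fails, -EINVAL)\n",
                   handle_jid_write(-22, 0));
            printf("spectator write 0 -> %d (mount continues)\n",
                   handle_jid_write(0, 1));
            return 0;
        }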
     

24 Sep, 2010

1 commit

  • This option has never done anything useful. Also at the same time
    this cleans up the sb checks which are done at mount time. The
    debug option will be accepted, but ignored in future. Since it
    didn't do anything, there didn't seem much point in retaining it.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

23 Sep, 2010

2 commits

  • This option defaulted to on for lock_nolock mounts and off
    otherwise. The only function was to avoid the revalidation of
    dentries. In the cluster case, that is entirely pointless and
    liable to cause coherency problems.

    The patch changes the revalidation to depend upon whether the
    fs is a local or cluster fs (i.e. it follows the existing default
    behaviour). I very much doubt anybody ever used this option as
    there is no reason to. Even so we will continue to accept it
    on the mount command line, but ignore it.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This has been a no-op for a very long time now. I'm pretty sure
    nobody uses it, but just in case we'll still accept it on the
    command line, but ignore it.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

20 Sep, 2010

3 commits

  • Due to the design of the VFS, it is quite usual for operations on GFS2
    to consist of a lookup (requiring a shared lock) followed by an
    operation requiring an exclusive lock. If a remote node has cached an
    exclusive lock, then it will receive two demote events in rapid
    succession: firstly to a shared lock and then to unlocked. The existing min hold time
    code was triggering in this case, even if the node was otherwise idle
    since the state change time was being updated by the initial demote.

    This patch introduces logic to skip the min hold timer in the case that
    a "double demote" of this kind has occurred. The min hold timer will
    still be used in all other cases.

    A new glock flag is introduced which is used to keep track of whether
    there have been any newly queued holders since the last glock state
    change. The min hold time is only applied if the flag is set.

    Signed-off-by: Steven Whitehouse
    Tested-by: Abhijith Das

    Steven Whitehouse
     
  • This patch adds support for fallocate to gfs2. Since gfs2 does not support
    uninitialized data blocks, it must write out zeros to all the blocks. However,
    since it does not need to lock any pages to read from, gfs2 can write out the
    zero blocks much more efficiently. On a moderately full filesystem, fallocate
    works around 5 times faster on average. The fallocate call also allows gfs2 to
    add blocks to the file without changing the filesize, which will make it
    possible for gfs2 to preallocate space for the rindex file, so that gfs2 can
    grow a completely full filesystem.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
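
    From user space the new call is reached via fallocate(2). With
    FALLOC_FL_KEEP_SIZE the blocks are allocated without changing the
    file size, which is the behaviour mentioned above for growing the
    rindex:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("testfile", O_CREAT | O_WRONLY, 0644);

            if (fd < 0)
                return 1;

            /* preallocate 16 MiB of zeroed blocks; KEEP_SIZE leaves
             * the file size unchanged */
            if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 16 << 20) < 0)
                perror("fallocate");

            close(fd);
            return 0;
        }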
     
  • With the update of the truncate code, ip->i_disksize and
    inode->i_size are merely copies of each other. This means
    we can remove ip->i_disksize and use inode->i_size exclusively,
    reducing the size of a GFS2 inode by 8 bytes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

08 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits)
    workqueue: mark init_workqueues() as early_initcall()
    workqueue: explain for_each_*cwq_cpu() iterators
    fscache: fix build on !CONFIG_SYSCTL
    slow-work: kill it
    gfs2: use workqueue instead of slow-work
    drm: use workqueue instead of slow-work
    cifs: use workqueue instead of slow-work
    fscache: drop references to slow-work
    fscache: convert operation to use workqueue instead of slow-work
    fscache: convert object to use workqueue instead of slow-work
    workqueue: fix how cpu number is stored in work->data
    workqueue: fix mayday_mask handling on UP
    workqueue: fix build problem on !CONFIG_SMP
    workqueue: fix locking in retry path of maybe_create_worker()
    async: use workqueue for worker pool
    workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead
    workqueue: implement unbound workqueue
    workqueue: prepare for WQ_UNBOUND implementation
    libata: take advantage of cmwq and remove concurrency limitations
    workqueue: fix worker management invocation without pending works
    ...

    Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in
    include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c

    Linus Torvalds
     

29 Jul, 2010

1 commit

  • This patch implements a wait for the journal id in the case that it has
    not been specified on the command line. This is to allow the future
    removal of the mount.gfs2 helper. The journal id would instead be
    directly communicated by gfs_controld to the file system. Here is a
    comparison of the two systems:

    Current:
    1. mount calls mount.gfs2
    2. mount.gfs2 connects to gfs_controld to retrieve the journal id
    3. mount.gfs2 adds the journal id to the mount command line and calls
    the mount system call
    4. gfs_controld receives the status of the mount request via a uevent

    Proposed:
    1. mount calls the mount system call (no mount.gfs2 helper)
    2. gfs_controld receives a uevent for a gfs2 fs which it doesn't know
    about already
    3. gfs_controld assigns a journal id to it via sysfs (see the
    sketch after this entry)
    4. the mount system call then completes as normal (sending a uevent
    according to status)

    The advantage of the proposed system is that it is completely backward
    compatible with the current system both at the kernel and at the
    userland levels. The "first" parameter can also be set the same way,
    with the restriction that it must be set before the journal id is
    assigned.

    In addition, if mount becomes stuck waiting for a reply from
    gfs_controld which never arrives, then it is killable and will abort the
    mount gracefully.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
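
    Step 3 of the proposed sequence might look like this from
    gfs_controld's side; the exact sysfs path here is an assumption
    and may differ:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            /* assumed path layout: /sys/fs/gfs2/<fsname>/lock_module/jid */
            const char *path = "/sys/fs/gfs2/mycluster:myfs/lock_module/jid";
            char buf[16];
            int fd = open(path, O_WRONLY);

            if (fd < 0) {
                perror("open");
                return 1;
            }
            snprintf(buf, sizeof(buf), "%d", 0);  /* assign journal id 0 */
            if (write(fd, buf, strlen(buf)) < 0)
                perror("write");
            close(fd);
            return 0;
        }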
     

23 Jul, 2010

1 commit

  • Workqueue can now handle high concurrency. Convert gfs to use
    workqueue instead of slow-work.

    * Steven pointed out that recovery path might be run from allocation
    path and thus requires forward progress guarantee without memory
    allocation. Create and use gfs_recovery_wq with rescuer. Please
    note that forward progress wasn't guaranteed with slow-work.

    * Updated to use non-reentrant workqueue.

    Signed-off-by: Tejun Heo
    Acked-by: Steven Whitehouse

    Tejun Heo
     

06 May, 2010

1 commit

  • The following patch adds a message to indicate when barriers have been
    disabled due to a block device which doesn't support them. You could
    already tell this via the mount options in /proc/mounts, but all the
    other filesystems also log a message at the same time.

    Also, the same mechanisms are used to indicate when the lock
    demote interface has been used (only ever used for debugging)
    which is a request from our support team.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

05 May, 2010

1 commit

  • This patch contains various tweaks to how log flushes and active item writeback
    work. gfs2_logd is now managed by a waitqueue, and gfs2_log_reserve now waits
    for gfs2_logd to do the log flushing. Multiple functions were rewritten to
    remove the need to call gfs2_log_lock(). Instead of using one test to see if
    gfs2_logd had work to do, there are now separate tests to check if there
    are too many buffers in the incore log or if there are too many items on the
    active items list.

    This patch is a port of a patch Steve Whitehouse wrote about a year ago, with
    some minor changes. Since gfs2_ail1_start always submits all the active items,
    it no longer needs to keep track of the first ai submitted, so this has been
    removed. In gfs2_log_reserve(), the order of the calls to
    prepare_to_wait_exclusive() and wake_up() when firing off the logd thread has
    been switched. If it called wake_up first there was a small window for a race,
    where logd could run and return before gfs2_log_reserve was ready to get woken
    up. If gfs2_logd ran, but did not free up enough blocks, gfs2_log_reserve()
    would be left waiting for gfs2_logd to eventually run because it timed out.
    Finally, gt_logd_secs, which controls how long to wait before gfs2_logd times
    out, and flushes the log, can now be set on mount with ar_commit.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

11 Mar, 2010

1 commit

  • GFS2 tracks the number of revokes and unrevokes that are part of committed
    transactions via sd_log_commited_revoke. It is possible for one process to add
    revokes during its transaction, while another process unrevokes them during its
    transaction. If the second process finishes its transaction first,
    sd_log_commited_revoke will be decremented by the number of unrevokes that the
    second process did, without first being incremented by the number of revokes
    the first process did. This is fine, since all started transactions must be
    completed before the journal can be flushed. However, sd_log_commited_revoke
    is an unsigned integer, and log_refund() causes an assertion failure if it
    would go negative at the end of a transaction. This patch makes
    sd_log_commited_revoke a signed integer and allows it to go negative.
    __gfs2_log_flush() still checks that it matches the actual number of revokes.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
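
    A two-line demonstration of why the counter must be signed:

        #include <stdio.h>

        int main(void)
        {
            int s = 0;           /* signed, as in the fix */
            unsigned int u = 0;  /* unsigned, as before the fix */

            /* the second process commits its 2 unrevokes before the
             * first process commits the matching 2 revokes */
            s -= 2;
            u -= 2;
            printf("signed total:   %d\n", s);  /* -2: fine */
            printf("unsigned total: %u\n", u);  /* wraps to a huge value */

            /* once the first transaction commits, it balances out */
            s += 2;
            printf("signed total after both commit: %d\n", s);
            return 0;
        }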
     

01 Mar, 2010

2 commits

  • As a consequence of the previous patch, we can now remove the
    loop which used to be required due to the circular dependency
    between the inodes and glocks. Instead we can just invalidate
    the inodes, and then clear up any glocks which are left.

    Also we no longer need the rwsem since there is no longer any
    danger of the inode invalidation calling back into the glock
    code (and from there back into the inode code).

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Since the start of GFS2, an "extra" inode has been used to store
    the metadata belonging to each inode. The only reason for using
    this inode was to have an extra address space, the other fields
    were unused. This means that the memory usage was rather inefficient.

    The reason for keeping each inode's metadata in a separate address
    space is that when glocks are requested on remote nodes, we need to
    be able to efficiently locate the data and metadata which relate
    to that glock (inode) in order to sync or sync and invalidate it
    (depending on the remotely requested lock mode).

    This patch adds a new type of glock which, in addition to
    its normal fields, has an address space. This applies to all
    inode and rgrp glocks (but to no other glock types which remain
    as before). As a result, we no longer need to have the second
    inode.

    This results in three major improvements:
    1. A saving of approx 25% of memory used in caching inodes
    2. A removal of the circular dependency between inodes and glocks
    3. No confusion between "normal" and "metadata" inodes in super.c

    Although the first of these is the more immediately apparent, the
    second is just as important as it now enables a number of clean
    ups at umount time. Those will be the subject of future patches.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
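
    The arithmetic behind the first improvement, with structure sizes
    invented purely for illustration (the commit reports roughly 25%
    in practice):

        #include <stdio.h>

        /* all sizes below are made up to make the sums concrete */
        struct address_space_like { char pad[160]; };
        struct inode_like  { char pad[560];
                             struct address_space_like map; };
        struct glock_like  { char pad[150]; };

        /* after the patch: the glock carries its own address space */
        struct glock_with_as { struct glock_like gl;
                               struct address_space_like map; };

        int main(void)
        {
            /* before: inode + glock + a second, mostly-unused inode
             * kept only for its extra address space */
            size_t before = 2 * sizeof(struct inode_like)
                          + sizeof(struct glock_like);
            /* after: inode + glock that embeds an address space */
            size_t after = sizeof(struct inode_like)
                         + sizeof(struct glock_with_as);

            printf("per-inode footprint: before=%zu after=%zu "
                   "(~%.0f%% saved)\n",
                   before, after,
                   100.0 * (before - after) / before);
            return 0;
        }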
     

03 Feb, 2010

1 commit

  • This patch adds a wait on umount between the point at which we
    dispose of all glocks and the point at which we unmount the
    lock protocol. This ensures that we've received all the replies
    to our unlock requests before we stop the locking.

    Signed-off-by: Steven Whitehouse
    Reported-by: Fabio M. Di Nitto

    Steven Whitehouse
     

03 Dec, 2009

1 commit

  • Currently gfs2 issues barriers unconditionally. There are various reasons
    to disable them, be that just for testing or for stupid devices flushing
    large battery-backed caches. Add a nobarrier option that matches xfs and
    btrfs for this. Also add a symmetric barrier option to turn it back on
    at remount time.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Steven Whitehouse

    Christoph Hellwig