25 Dec, 2015

1 commit

  • When gfs2 releases the glock of an inode, it must invalidate all
    information cached for that inode, including the page cache and acls.
    Use the new security_inode_invalidate_secctx hook to also invalidate
    security labels in that case. These items will be reread from disk
    when needed after reacquiring the glock.

    Signed-off-by: Andreas Gruenbacher
    Acked-by: Bob Peterson
    Acked-by: Steven Whitehouse
    Cc: cluster-devel@redhat.com
    [PM: fixed spelling errors and description line lengths]
    Signed-off-by: Paul Moore

    Andreas Gruenbacher
     

30 Oct, 2015

1 commit

  • Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a
    gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in
    gl_lockref). Remove that define to make the references to gl_lockref.lock more
    obvious.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

04 Sep, 2015

1 commit

  • What uniquely identifies a glock in the glock hash table is not
    gl_name, but gl_name and its superblock pointer. This patch makes
    the gl_name field correspond to a unique glock identifier. That will
    allow us to simplify hashing with a future patch, since the hash
    algorithm can then take the gl_name and hash its components in one
    operation.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Steven Whitehouse

    Bob Peterson
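
The idea of folding the superblock into the glock name can be illustrated with a small userspace model. This is only a sketch: `LockName`, its field names, and `bucket` are hypothetical stand-ins, not the kernel's structures. It shows that once the name carries every identifying component, a single hash over the name is enough to tell apart glocks with the same number and type on different filesystems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockName:
    """Toy stand-in for a glock name; all field names are illustrative."""
    number: int     # lock number, e.g. a block address
    lock_type: int  # lock type (inode, rgrp, ...)
    sbd: int        # identity of the owning superblock (assumed integer id)


def bucket(name: LockName, nbuckets: int = 1 << 15) -> int:
    # Every component of the name is hashed in one operation.
    return hash(name) % nbuckets


# Same number and type, different superblocks: two distinct table entries.
table = {}
a = LockName(number=1234, lock_type=2, sbd=1)
b = LockName(number=1234, lock_type=2, sbd=2)
table[a] = "glock on filesystem 1"
table[b] = "glock on filesystem 2"
```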
     

19 Jun, 2015

2 commits

  • This patch allows the block allocation code to retain the buffers
    for the resource groups so they don't need to be re-read from buffer
    cache with every request. This is a performance improvement that's
    especially noticeable when resource groups are very large. For
    example, with 2GB resource groups and 4K blocks, there can be 33
    blocks for every resource group. This patch allows those 33 buffers
    to be kept around and not read in and thrown away with every
    operation. The buffers are released when the resource group is
    either synced or invalidated.

    Signed-off-by: Bob Peterson
    Reviewed-by: Steven Whitehouse
    Reviewed-by: Benjamin Marzinski

    Bob Peterson
     
  • The glocks used for resource groups often come and go hundreds of
    thousands of times per second. Adding them to the lru list just
    adds unnecessary contention for the lru_lock spin_lock, especially
    considering we're almost certainly going to re-use the glock and
    take it back off the lru microseconds later. We never want the
    glock shrinker to cull them anyway. This patch adds a new bit in
    the glops that determines which glock types get put onto the lru
    list and which ones don't.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
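
For the first of the two commits above, the figure of 33 blocks per resource group can be reproduced with a rough calculation. The sketch below assumes two bitmap bits per filesystem block, a 128-byte header in the first bitmap block, and a 24-byte header in each later one; these sizes and the function name are assumptions made for illustration and are not stated in the commit message.

```python
import math


def rgrp_bitmap_blocks(rgrp_bytes, block_size,
                       rgrp_header=128,   # assumed header size, first block
                       meta_header=24,    # assumed header size, later blocks
                       bits_per_block=2): # assumed bitmap bits per block
    """Estimate how many blocks are needed to hold one rgrp's bitmap."""
    blocks = rgrp_bytes // block_size
    bitmap_bytes = math.ceil(blocks * bits_per_block / 8)
    first_capacity = block_size - rgrp_header
    if bitmap_bytes <= first_capacity:
        return 1
    later_capacity = block_size - meta_header
    return 1 + math.ceil((bitmap_bytes - first_capacity) / later_capacity)


# 2GB resource group with 4K blocks, as in the commit message.
buffers = rgrp_bitmap_blocks(2 * 1024**3, 4096)
```

Under these assumptions `buffers` comes out as 33, which is the number of buffers the patch keeps attached to the resource group instead of looking them up again on every allocation.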
     

17 Nov, 2014

1 commit

  • The current gfs2 freezing code is considerably more complicated than it
    should be because it doesn't use the vfs freezing code on any node except
    the one that begins the freeze. This is because it needs to acquire a
    cluster glock before calling the vfs code to prevent a deadlock, and
    without the new freeze_super and thaw_super hooks, that was impossible. To
    deal with the issue, gfs2 had to do some hacky locking tricks to make sure
    that a frozen node couldn't be holding on to a lock it needed to do the
    unfreeze ioctl.

    This patch makes use of the new hooks to simplify the gfs2 locking code. Now,
    all the nodes in the cluster freeze and thaw in exactly the same way. Every
    node in the cluster caches the freeze glock in the shared state. The new
    freeze_super hook allows the freezing node to grab this freeze glock in
    the exclusive state without first calling the vfs freeze_super function.
    All the nodes in the cluster see this lock change, and call the vfs
    freeze_super function. The vfs locking code guarantees that the nodes can't
    get stuck holding the glocks necessary to unfreeze the system. To
    unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
    glock. Again, all the nodes notice this, reacquire the glock in shared mode
    and call the vfs thaw_super function.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
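
The freeze and thaw sequence described above can be pictured with a toy model. This is a sketch under loose assumptions: the `Node` and `Cluster` classes and the state strings are invented for illustration, and the lock-manager messaging is collapsed into a simple loop over nodes.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.frozen = False
        self.freeze_lock = "shared"  # cached in the shared state after mount


class Cluster:
    """Toy model: every node freezes and thaws in the same way."""

    def __init__(self, names):
        self.nodes = {n: Node(n) for n in names}
        self.freezer = None

    def freeze(self, name):
        if self.freezer is not None:
            raise RuntimeError("filesystem is already frozen")
        self.freezer = name
        for node in self.nodes.values():
            # The freezing node takes the freeze lock exclusively, so every
            # other node loses its cached shared copy and notices the change.
            node.freeze_lock = "exclusive" if node.name == name else "unlocked"
            node.frozen = True  # stands in for the vfs freeze on each node

    def thaw(self, name):
        if self.freezer != name:
            raise RuntimeError("only the freezing node can thaw")
        self.freezer = None
        for node in self.nodes.values():
            # Dropping the exclusive lock lets every node reacquire it shared.
            node.freeze_lock = "shared"
            node.frozen = False  # stands in for the vfs thaw on each node
```

For example, `Cluster(["a", "b", "c"]).freeze("a")` leaves all three nodes frozen, with only node `a` holding the lock exclusively.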
     

08 Oct, 2014

1 commit


18 Jul, 2014

1 commit


04 Jun, 2014

1 commit

  • …teve/gfs2-3.0-nmw into next

    Pull gfs2 updates from Steven Whitehouse:
    "This must be about the smallest merge window patch set ever for GFS2.
    It is probably also the first one without a single patch from me.
    That is down to a combination of factors, and I have some things in
    the works that are not quite ready yet, that I hope to put in next
    time around.

    Returning to what is here this time... we have 3 patches which fix
    various warnings. Two are bug fixes (for quotas and also a rare
    recovery race condition). The final patch, from Ben Marzinski, is an
    important change in the freeze code which has been in progress for
    some time. This removes the need to take and drop the transaction
    lock for every single transaction, when the only time it was used, was
    at file system freeze time. Ben's patch integrates the freeze
    operation into the journal flush code as an alternative with lower
    overheads and also lands up resolving some difficult to fix races at
    the same time"

    * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Prevent recovery before the local journal is set
    GFS2: fs/gfs2/file.c: kernel-doc warning fixes
    GFS2: fs/gfs2/bmap.c: kernel-doc warning fixes
    GFS2: remove transaction glock
    GFS2: lops.c: replace 0 by NULL for pointers
    GFS2: quotas not being refreshed in gfs2_adjust_quota

    Linus Torvalds
     

14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In its place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work for flushing out
    and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesystem simply involves dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Feb, 2014

1 commit

  • Over time, we hope to be able to improve the concurrency available
    in the log code. This is one small step towards that, by moving
    the buffer lists from the super block, and into the transaction
    structure, so that each transaction builds its own buffer lists.

    At transaction commit time, the buffer lists are merged into
    the currently accumulating transaction. That transaction then
    is passed into the before and after commit functions at journal
    flush time. Thus there should be no change in overall behaviour
    yet.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

16 Jan, 2014

1 commit

  • Al Viro has tactfully pointed out that we are using the incorrect
    error code in some cases. This patch fixes that, and also removes
    the (unused) return value for glock dumping.

    > * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
    > "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
    > we don't support STREAMS it's probably fair game, but... what the hell?

    Signed-off-by: Steven Whitehouse
    Cc: Al Viro

    Steven Whitehouse
     

03 Jan, 2014

2 commits

  • Prior to this patch, GFS2 had one address space for each rgrp,
    stored in the glock. This patch changes them to use a single
    address space in the super block. This therefore saves
    (sizeof(struct address_space) * nr_of_rgrps) bytes of memory
    and for large filesystems, that can be significant.

    It would be nice to be able to do something similar and merge
    the inode metadata address space into the same global
    address space. However, that is rather more complicated as the
    on-disk location doesn't have a 1:1 mapping with the inodes in
    general. So while it could be done, it will be a more complicated
    operation as it requires changing a lot more code paths.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Each rgrp header is represented as a single extent on disk, so we
    can calculate the position within the address space, since we are
    using address spaces mapped 1:1 to the disk. This means that it
    is possible to use the range based versions of filemap_fdatawrite/wait
    and for invalidating the page cache.

    Our eventual intent is to then be able to merge the address spaces
    used for rgrps into a single address space, rather than to have
    one for each glock, saving memory and reducing complexity.

    Since during umount, the rgrp structures are disposed of before
    the glocks, we need to store the extent information in the glock
    so that it is available for a final invalidation. This patch uses
    a field which is otherwise unused in rgrp glocks to do that, so
    that we do not have to expand the size of a glock.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
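
For the range-based writeback commit above, the mapping from an rgrp extent to a byte range in a 1:1 address space can be sketched as follows. The function name, its parameters, and the rounding to page boundaries are assumptions for illustration; the commit message only says that the position can be calculated from the extent.

```python
def rgrp_byte_range(addr, length, block_size, page_size=4096):
    """Return (start, end) byte offsets, inclusive, covering an rgrp header.

    addr and length are in filesystem blocks. Offsets are rounded outwards
    to page boundaries (an assumption), since a page cache works in pages.
    """
    start = (addr * block_size) & ~(page_size - 1)       # round down
    end_raw = (addr + length) * block_size
    end = -(-end_raw // page_size) * page_size - 1       # round up, inclusive
    return start, end


# An rgrp header of 33 blocks starting at block 17, with 4K blocks.
start, end = rgrp_byte_range(17, 33, 4096)
```

A range like this is what a range-based write or invalidate call would be given, instead of operating on a whole per-rgrp address space.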
     

20 Dec, 2013

1 commit

  • We need to wait for any outstanding DIO to complete in a couple
    of situations. Firstly, in case we are changing out of deferred
    mode (in inode_go_sync) where GLF_DIRTY will not be set. That
    call could be prefixed with a test for gl_state == LM_ST_DEFERRED
    but it doesn't seem worth it bearing in mind that the test for
    outstanding DIO is very quick anyway, in the usual case that there
    is none.

    The second case is in inode_go_lock which will catch the cases
    where we have a cached EX lock, but where we grant deferred locks
    against it so that there is no glock state transition. We only
    need to wait if the state is not deferred, since DIO is valid
    anyway in that state.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

15 Oct, 2013

1 commit

  • Currently glocks have an atomic reference count and also a spinlock
    which covers various internal fields, such as the state. The intent of
    this patch is to replace the spinlock and the atomic reference count
    with a lockref structure. This contains a spinlock which we can continue
    to use as before, and a reference counter which is used in conjunction
    with the spinlock to replace the previous atomic counter.

    As a result of this there are some new rules for reference counting on
    glocks. We need to distinguish between reference count changes under
    gl_spin (which are now just increment or decrement of the new counter,
    provided the count cannot hit zero) and those which are outside of
    gl_spin, but which now take gl_spin internally.

    The conversion is relatively straightforward. There is probably some
    further clean up which can be done, but the priority at this stage is to
    make the change in as simple a manner as possible.

    A consequence of this change is that the reference count is being
    decoupled from the lru list processing. This should allow future
    adoption of the lru_list code with glocks in due course.

    The reason for using the "dead" state and not just relying on 0 being
    the "invalid state" is so that in due course 0 ref counts can be
    allowable. The intent is to eventually be able to remove the ref count
    changes which are currently hidden away in state_change().

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
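
The two kinds of reference count change described above can be modelled in userspace. This is a toy sketch, not the kernel's lockref: the class, its method names, and the `DEAD` marker value are assumptions chosen for illustration. It separates changes made while the caller already holds the lock from changes that take the lock internally, and it uses an explicit "dead" marker rather than treating a count of zero as invalid.

```python
import threading

DEAD = -128  # assumed marker for a dead object; any negative value would do


class Lockref:
    """Toy model of a spinlock paired with a reference count."""

    def __init__(self, count=1):
        self.lock = threading.Lock()
        self.count = count

    def hold_locked(self):
        # Caller already holds self.lock: a plain increment is enough.
        self.count += 1

    def get_not_dead(self):
        # Caller does not hold the lock: take it internally.
        with self.lock:
            if self.count < 0:
                return False
            self.count += 1
            return True

    def put(self):
        """Drop a reference; return True if the count reached zero."""
        with self.lock:
            self.count -= 1
            return self.count == 0

    def mark_dead(self):
        with self.lock:
            self.count = DEAD
```

Because "dead" is a distinct state, a count of zero does not by itself mean the object is invalid, which is the property the commit message wants to keep available for later work.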
     

19 Aug, 2013

1 commit

  • When run during fsync, a gfs2_log_flush could happen between the
    time when gfs2_ail_flush checked the number of blocks to revoke,
    and when it actually started the transaction to do those revokes.
    This occasionally caused it to need more revokes than it reserved,
    causing gfs2 to crash.

    Instead of just reserving enough revokes to handle the blocks that
    currently need them, this patch makes gfs2_ail_flush reserve the
    maximum number of revokes it can, without increasing the total number
    of reserved log blocks. This patch also passes the number of reserved
    revokes to __gfs2_ail_flush() so that it doesn't go over its limit
    and cause a crash like we're seeing. Non-fsync calls to __gfs2_ail_flush
    will still cause a BUG() if necessary revokes are skipped.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
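
The reservation strategy above amounts to rounding the number of revokes up to whatever fits in the log blocks that would be reserved anyway. The sketch below assumes 8-byte revoke entries, a 72-byte descriptor header in the first revoke block, and a 24-byte header in later ones; these sizes and all function names are assumptions for illustration, not values taken from the commit.

```python
import math


def revoke_geometry(block_size=4096, descriptor_hdr=72, meta_hdr=24, entry=8):
    """Revokes that fit in the first and in each later log block (assumed sizes)."""
    first = (block_size - descriptor_hdr) // entry
    later = (block_size - meta_hdr) // entry
    return first, later


def blocks_for_revokes(n, **geom):
    first, later = revoke_geometry(**geom)
    if n <= 0:
        return 0
    if n <= first:
        return 1
    return 1 + math.ceil((n - first) / later)


def max_revokes_in(blocks, **geom):
    first, later = revoke_geometry(**geom)
    if blocks <= 0:
        return 0
    return first + (blocks - 1) * later


def reserve_for_ail_flush(needed_now, **geom):
    """Reserve the most revokes that fit without adding log blocks."""
    blocks = blocks_for_revokes(needed_now, **geom)
    return blocks, max_revokes_in(blocks, **geom)
```

With these assumed sizes, a flush that currently needs 10 revokes reserves one block but is allowed up to 503 revokes, so a log flush that slips in and adds a few more blocks to revoke no longer exceeds the reservation.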
     

19 Jun, 2013

1 commit

  • This patch looks at all the outstanding blocks in all the transactions
    on the log, and moves the completed ones to the ail2 list. Then it
    issues revokes for these blocks. This will hopefully speed things up
    in situations where there is a lot of contention for glocks, especially
    if they are acquired serially.

    revoke_lo_before_commit will issue at most one log block's worth of these
    preemptive revokes. The amount of reserved log space that
    gfs2_log_reserve() ignores has been incremented to allow for this extra
    block.

    This patch also consolidates the common revoke instructions into one
    function, gfs2_add_revoke().

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

10 Apr, 2013

1 commit

  • This patch adds a bool indicating whether the demote
    request was originated locally or remotely. This is then
    used by the iopen ->go_callback() to make 100% sure that
    it will only respond to remote callbacks.

    Since ->evict_inode() uses GL_NOCACHE when it attempts to
    get an exclusive lock on the iopen lock, this may result
    in extra scheduling of the workqueue in case that the
    exclusive promotion request failed. This patch prevents
    that from happening.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
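
The effect of passing the origin of a demote request to the callback can be shown with a very small model. The class and the `queued` list are hypothetical; the only behaviour taken from the commit message is that the iopen callback ignores requests that originated locally.

```python
from dataclasses import dataclass, field


@dataclass
class IopenGlock:
    """Toy iopen glock that records the work it schedules."""
    queued: list = field(default_factory=list)

    def go_callback(self, remote: bool) -> None:
        # A locally originated demote (for example after a failed exclusive
        # promotion with GL_NOCACHE) must not schedule any extra work.
        if not remote:
            return
        self.queued.append("demote")


gl = IopenGlock()
gl.go_callback(remote=False)  # ignored
gl.go_callback(remote=True)   # schedules work
```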
     

13 Feb, 2013

1 commit

  • When reading dinodes from the disk convert uids and gids
    into kuids and kgids to store in vfs data structures.

    When writing dinodes to the disk convert kuids and kgids
    in the in memory structures into plain uids and gids.

    For now all on disk data structures are assumed to be
    stored in the initial user namespace.

    Cc: Steven Whitehouse
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

15 Nov, 2012

1 commit


07 Nov, 2012

2 commits

  • [Editorial: This is a nit, but has been a minor irritation for a long time:]

    This patch renames glops structure item for go_xmote_th to go_sync.
    The functionality is unchanged; it's just for readability.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • Two of the bug traps here could really be warnings. The others are
    converted from BUG() to GLOCK_BUG_ON() since we'll most likely
    need to know the glock state in order to debug any issues which
    arise. As a result of this, __dump_glock has to be renamed and
    is no longer static.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

24 Sep, 2012

1 commit

  • gfs2_ail_empty_gl() contains an "inline version" of gfs2_trans_begin(),
    so it needs an explicit sb_start_intwrite() as well, to balance the
    sb_end_intwrite() which will be called by gfs2_trans_end().

    With this, xfstest 068 passes on lock_nolock local gfs2.
    Without it, we reach a writer count of -1 and get stuck.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Steven Whitehouse

    Eric Sandeen
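
The imbalance described above can be reproduced with a toy counter. The class and function names below are invented for illustration and only mimic the pairing of a start call with an end call; they are not the kernel functions.

```python
class WriterCount:
    """Toy stand-in for a superblock's internal writer count."""

    def __init__(self):
        self.count = 0

    def start_intwrite(self):
        self.count += 1

    def end_intwrite(self):
        self.count -= 1


def inline_trans(sb, balanced):
    # Models an open-coded transaction begin followed by the normal end.
    if balanced:
        sb.start_intwrite()  # the explicit call added by the fix
    sb.end_intwrite()        # always called by the transaction end path
    return sb.count


before_fix = inline_trans(WriterCount(), balanced=False)
after_fix = inline_trans(WriterCount(), balanced=True)
```

Without the explicit start call the count ends at -1, matching the stuck state mentioned in the commit message; with it the count returns to 0.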
     

08 May, 2012

1 commit


24 Apr, 2012

1 commit

  • This is another clean up in the logging code. This per-transaction
    list was largely unused. Its main function was to ensure that the
    number of buffers in a transaction was correct, however that counter
    was only used to check the number of buffers in the bd_list_tr, plus
    an assert at the end of each transaction. With the assert now changed
    to use the calculated buffer counts, we can remove both bd_list_tr and
    its associated counter.

    This should make the code easier to understand as well as shrinking
    a couple of structures.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

02 Nov, 2011

1 commit


21 Oct, 2011

5 commits

  • Unfortunately, it is not enough to just ignore locked buffers during
    the AIL flush from fsync. We need to be able to ignore all buffers
    which are locked, dirty or pinned at this stage as they might have
    been added subsequent to the log flush earlier in the fsync function.

    In addition, this means that we no longer need to rely on i_mutex to
    keep out writes during fsync, so we can, as a side-effect, remove
    that protection too.

    Signed-off-by: Steven Whitehouse
    Tested-By: Abhijith Das

    Steven Whitehouse
     
  • Since we have ruled out supporting online filesystem shrink,
    it is possible to make the resource group list append only
    during the life of a super block. This gives several benefits:

    Firstly, we only need to read new rindex elements as they are added
    rather than needing to reread the whole rindex file each time one
    element is added.

    Secondly, the rindex glock can be held for much shorter periods of
    time, and is completely removed from the fast path for allocations.
    The lock is taken in shared mode only when updating the resource
    groups when the first allocation occurs, and after a grow has
    taken place.

    Thirdly, this results in a reduction in code size, and everything
    gets a lot simpler to understand in this area.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Here is an update of Bob's original rbtree patch which, in addition, also
    resolves the rather strange ref counting that was being done relating to
    the bitmap blocks.

    Originally we had a dual system for journaling resource groups. The metadata
    blocks were journaled and also the rgrp itself was added to a list. The reason
    for adding the rgrp to the list in the journal was so that the "repolish
    clones" code could be run to update the free space, and potentially send any
    discard requests when the log was flushed. This was done by comparing the
    "cloned" bitmap with what had been written back on disk during the transaction
    commit.

    Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
    until the journal had been flushed. For that reason, there was a rather
    complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
    both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
    count on the buffers.

    However, the journal maintains a reference count on the buffers anyway, since
    they are being journaled as metadata buffers. So by moving the code which deals
    with the post-journal accounting for bitmap blocks to the metadata journaling
    code, we can entirely dispense with the rather strange buffer ref counting
    scheme and also the requirement to journal the rgrps.

    The net result of all this is that the ->sd_rindex_spin is left to do exactly
    one job, and that is to look after the rbtree of rgrps.

    This patch is designed to be a stepping stone towards using RCU for the rbtree
    of resource groups, however the reduction in the number of uses of the
    ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
    anyway.

    The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
    be removed in future in favour of calling the functions directly where required
    in the code. That will allow locking of resource groups without needing to
    actually read them in - something that could be useful in speeding up statfs.

    In the meantime though it is valid to dereference ->bi_bh only when the rgrp
    is locked. This is basically the same rule as before, modulo the references not
    being valid until the following journal flush.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Bob Peterson
    Cc: Benjamin Marzinski

    Bob Peterson
     
  • Journaled data requires that a complete flush of all dirty data for
    the file is done, in order that the ail flush which comes after
    will succeed.

    Also the recently enhanced bug trap can trigger falsely in case
    an ail flush from fsync races with a page read. This updates the
    bug trap such that it will ignore buffers which are locked and
    only trigger on dirty and/or pinned buffers when the ail flush
    is run from fsync. The original bug trap is retained when ail
    flush is run from ->go_sync()

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The assert was being tested under the wrong lock, a
    legacy of the original code. Also, if it does trigger,
    the resulting information was not always a lot of help.

    This moves the assert under the correct lock and also
    prints out more useful information in tracking down the
    source of the problem.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

15 Jul, 2011

3 commits

  • This adds S_NOSEC support to GFS2. We set/reset the flag either when
    a user calls setattr or when we have just regained the glock
    from another node. The flag is only set if there are no xattrs
    on the inode and there is no suid bit set.

    Signed-off-by: Steven Whitehouse
    Reviewed-by: Andi Kleen
    Cc: Al Viro

    Steven Whitehouse
     
  • This patch is a performance improvement for GFS2 in a clustered
    environment. It makes the glock hold time self-adjusting.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This patch adds a cache for the hash table to the directory code
    in order to help simplify the way in which the hash table is
    accessed. This is intended to be a first step towards introducing
    some performance improvements in the directory code.

    There are two follow ups that I'm hoping to see fairly shortly. One
    is to simplify the hash table reading code now that we always read the
    complete hash table, whether we want one entry or all of them. The
    other is to introduce readahead on the heads of the hash chains
    which are referred to from the table.

    The hash table is a maximum of 128k in size, so it is not worth trying
    to read it in small chunks.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

14 Jul, 2011

1 commit

  • This patch contains a few misc fixes which resolve a recently
    reported issue. This patch has been a real team effort and has
    received a lot of testing.

    The first issue is that the ail lock needs to be held over a few
    more operations. The lock that's added into gfs2_releasepage() may
    possibly be a candidate for replacing with RCU at some future
    point, but at this stage we've gone for the obvious fix.

    The second issue is that gfs2_write_inode() can end up calling
    a glock recursively when called from gfs2_evict_inode() via the
    syncing code, so it needs a guard added.

    The third issue is that we either need to not truncate the metadata
    pages of inodes which have zero link count, but which we cannot
    deallocate due to them still being in use by other nodes, or we need
    to ensure that those pages have all made it through the journal and
    ail lists first. This patch takes the former approach, but the
    latter has also been tested and there is nothing to choose between
    them performance-wise. So again, we could revise that decision
    in the future.

    Also, the inode eviction process is now better documented.

    Signed-off-by: Steven Whitehouse
    Tested-by: Bob Peterson
    Tested-by: Abhijith Das
    Reported-by: Barry J. Marson
    Reported-by: David Teigland

    Steven Whitehouse
     

12 Jul, 2011

1 commit

  • Right now, there is nothing that forces the log to get flushed when a node
    drops its rindex glock so that another node can grow the filesystem. If the
    log doesn't get flushed, GFS2 can corrupt the sd_log_le_rg list in the
    following way.

    A node puts an rgd on the list in rg_lo_add(), and then the rindex glock is
    dropped so the other node can grow the filesystem. When the node reacquires the
    rindex glock, that rgd gets deleted in clear_rgrpdi() before ever being
    removed from the list by gfs2_log_flush().

    This code simply forces a log flush when the rindex glock is invalidated,
    solving the problem.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

09 May, 2011

1 commit


20 Apr, 2011

1 commit

  • This patch is designed to clean up GFS2's fsync
    implementation and ensure that it really does get everything on
    disk. Since ->write_inode() has been updated, we can call that
    via the vfs library function sync_inode_metadata() and the only
    remaining thing that has to be done is to ensure that we get
    any revoke records in the log after the inode has been written back.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse