06 Jun, 2020

3 commits


05 Jun, 2020

1 commit

  • Before this patch, asserts based on glocks did not print the glock with
    the error. This patch introduces a new macro, gfs2_glock_assert_withdraw
    which first prints the glock, then takes the assert.

    This also changes a few glock asserts to the new macro.
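
As a rough illustration of the "print first, then assert" idea, here is a minimal userspace sketch. It is not the kernel macro: `glock_sketch`, `dump_glock_sketch`, `withdraw_sketch` and the two counters are hypothetical stand-ins, and the "withdraw" is only recorded rather than acted on.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for struct gfs2_glock: only a type and a number. */
struct glock_sketch {
	unsigned int type;
	unsigned long long number;
};

static int dumps;     /* how many times a glock was printed */
static int withdraws; /* how many times an assertion failed */

/* Print the glock so the failing lock is visible in the log. */
static void dump_glock_sketch(const struct glock_sketch *gl)
{
	assert(gl != NULL);
	fprintf(stderr, "G: t:%u n:%llx\n", gl->type, gl->number);
	dumps++;
}

/* Stand-in for the withdraw path: report the failed expression. */
static void withdraw_sketch(const char *expr)
{
	fprintf(stderr, "assertion \"%s\" failed\n", expr);
	withdraws++;
}

/* Dump the glock first, then take the assert. */
#define glock_assert_withdraw_sketch(gl, x)		\
	do {						\
		if (!(x)) {				\
			dump_glock_sketch(gl);		\
			withdraw_sketch(#x);		\
		}					\
	} while (0)
```

A caller would write `glock_assert_withdraw_sketch(gl, condition);` wherever a glock-based assert used to be, so the glock dump always precedes the assertion message.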

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

05 Sep, 2019

1 commit

  • Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
    reverse the roles of which directories are "old" and which are "new" for
    the purposes of rename. This can cause deadlocks where two nodes end up
    waiting for each other.

    There can be several layers of directory dependencies across many nodes.

    This patch fixes the problem by acquiring all gfs2_rename's inode glocks
    asynchronously and waiting for all glocks to be acquired. That way all
    inodes are locked regardless of the order.

    The timeout value for multiple asynchronous glocks is calculated to be
    the total of the individual wait times for each glock times two.
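
The timeout rule can be written down as a small helper. This is only a sketch of the arithmetic described above: the function name, the array of per-glock wait times and the time unit are assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Timeout for a group of asynchronous lock requests: twice the sum of
 * the individual wait times. The unit is whatever the caller uses for
 * the per-lock values (assumed here, not specified by the commit text).
 */
static unsigned long async_wait_timeout_sketch(const unsigned long *wait,
					       size_t n)
{
	unsigned long total = 0;

	assert(wait != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += wait[i];
	return total * 2;
}
```

For per-glock waits of 10, 20 and 30 units, the group timeout would be 120 units.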

    Since gfs2_exchange is very similar to gfs2_rename, both functions are
    patched in the same way.

    A new async glock wait queue, sd_async_glock_wait, keeps a list of
    waiters for these events. If gfs2's holder_wake function detects an
    async holder, it wakes up any waiters for the event. The waiter only
    tests whether any of its requests are still pending.

    Since the glocks are sent to dlm asynchronously, the wait function needs
    to check to see which glocks, if any, were granted.

    If a glock is granted by dlm (and therefore held), its minimum hold time
    is checked and adjusted as necessary, as other glock grants do.

    If the event times out, all glocks held thus far must be dequeued to
    resolve any existing deadlocks. Then, if there are any outstanding
    locking requests, we need to loop around and wait for dlm to respond to
    those requests too. After we release all requests, we return -ESTALE to
    the caller (vfs rename) which loops around and retries the request.

           Node1                   Node2
           ---------               ---------
    1.     Enqueue A               Enqueue B
    2.     Enqueue B               Enqueue A
    3.     A granted
    6.                             B granted
    7.     Wait for B
    8.                             Wait for A
    9.                             A times out (since Node 1 holds A)
    10.                            Dequeue B (since it was granted)
    11.                            Wait for all requests from DLM
    12.    B Granted (since Node2 released it in step 10)
    13.    Rename
    14.    Dequeue A
    15.                            DLM Grants A
    16.                            Dequeue A (due to the timeout and since we
                                   no longer have B held for our task).
    17.    Dequeue B
    18.                            Return -ESTALE to vfs
    19.                            VFS retries the operation, goto step 1.
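
The all-or-nothing outcome of the wait in the sequence above can be sketched as follows. This is a simplified userspace model, not the gfs2 code: `req_state` and `finish_async_wait_sketch` are hypothetical names, and the step of waiting for DLM to answer outstanding requests before releasing them is collapsed into a single state change.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-request state for one asynchronous lock request. */
enum req_state { REQ_PENDING, REQ_GRANTED, REQ_RELEASED };

/*
 * Called once the wait has ended. If every request was granted, the
 * caller keeps all locks and proceeds (return 0). Otherwise the wait
 * timed out: give back everything, granted or not, and return -ESTALE
 * so that the caller retries the whole operation.
 */
static int finish_async_wait_sketch(enum req_state *req, size_t n)
{
	bool all_granted = true;

	assert(req != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (req[i] != REQ_GRANTED)
			all_granted = false;
	if (all_granted)
		return 0;

	/* Release held locks to break any deadlock with another node. */
	for (size_t i = 0; i < n; i++)
		req[i] = REQ_RELEASED;
	return -ESTALE;
}
```

Releasing every lock, rather than only the one that timed out, is what lets the other node make progress in steps 10 to 13.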

    This release-all-locks / acquire-all-locks may slow rename / exchange
    down as both nodes struggle in the same way and do the same thing.
    However, this will only happen when there is contention for the same
    inodes, which ought to be rare.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

28 Jun, 2019

1 commit

  • Before this patch, if a glock error was encountered, the glock with
    the problem was dumped. But sometimes you may have lots of file systems
    mounted, and that doesn't tell you which file system it was for.

    This patch adds a new boolean parameter fsid to the dump_glock family
    of functions. For non-error cases, such as dumping the glocks debugfs
    file, the fsid is not dumped in order to keep lock dumps and glocktop
    as clean as possible. For all error cases, such as GLOCK_BUG_ON, the
    file system id is now printed. This will make it easier to debug.
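
A minimal sketch of a dump helper with such a boolean parameter is shown below. The function name, its arguments and the exact output format are assumptions made for illustration and do not reproduce the real dump_glock output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format one glock line into buf. When fsid is true (error paths), the
 * line is prefixed with the file system id; when false (debugfs-style
 * dumps), the prefix is omitted to keep the output compact.
 * Returns the number of characters snprintf would have written.
 */
static int dump_glock_line_sketch(char *buf, size_t len, const char *fs_id,
				  unsigned int type, unsigned long long num,
				  bool fsid)
{
	assert(buf != NULL && fs_id != NULL);
	if (fsid)
		return snprintf(buf, len, "fsid=%s: G: t:%u n:%llx",
				fs_id, type, num);
	return snprintf(buf, len, "G: t:%u n:%llx", type, num);
}
```

Error paths would pass `true`, while routine dumps would pass `false` and keep their existing format.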

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher

    Bob Peterson
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this copyrighted material is made available to anyone wishing to use
    modify copy or redistribute it subject to the terms and conditions
    of the gnu general public license version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 44 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531081038.653000175@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

23 Jan, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    There is no need to save the dentries for the debugfs files, so drop
    those variables to save a bit of space and make the code simpler.

    Cc: Bob Peterson
    Cc: Andreas Gruenbacher
    Cc: cluster-devel@redhat.com
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Andreas Gruenbacher

    Greg Kroah-Hartman
     

12 Dec, 2018

1 commit


10 Aug, 2017

1 commit

  • gfs2_evict_inode is called to free inodes under memory pressure. The
    function calls into DLM when an inode's last cluster-wide reference goes
    away (remote unlink) and to release the glock and associated DLM lock
    before finally destroying the inode. However, if DLM is blocked waiting
    for memory to become available, calling into DLM again will deadlock.

    Avoid that by decoupling releasing glocks from destroying inodes in that
    case: with gfs2_glock_queue_put, glocks will be dequeued asynchronously
    in work queue context, when the associated inodes have likely already
    been destroyed.

    With this change, inodes can end up being unlinked, remote-unlink can be
    triggered, and then the inode can be reallocated before all
    remote-unlink callbacks are processed. To detect that, revalidate the
    link count in gfs2_evict_inode to make sure we're not deleting an
    allocated, referenced inode.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

21 Jul, 2017

1 commit

  • This patch introduces a new helper function in glock.h that
    clears gl_object, with an added integrity check. An additional
    integrity check has been added to glock_set_object, plus comments.
    This is step 1 in a series to ensure gl_object integrity.

    Signed-off-by: Bob Peterson
    Reviewed-by: Andreas Gruenbacher

    Bob Peterson
     

05 Jul, 2017

1 commit

  • So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock
    work queue to make sure that inode glops which dereference gl->gl_object
    have finished running before the inode is destroyed. However, flushing
    the work queue may do more work than needed, and in particular, it may
    call into DLM, which we want to avoid here. Use a bit lock
    (GIF_GLOP_PENDING) to synchronize between the inode glops and
    gfs2_evict_inode instead to get rid of the flushing.
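
The general shape of a bit lock can be sketched in userspace with C11 atomics. This only illustrates the idea of using one flag bit to mark work in progress; the names are hypothetical, and the kernel's actual bit-wait helpers and sleeping behaviour are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for GIF_GLOP_PENDING. */
#define GLOP_PENDING_SKETCH 0x1u

/* Try to mark a glop as running; returns false if one already is. */
static bool glop_begin_sketch(atomic_uint *flags)
{
	assert(flags != NULL);
	return !(atomic_fetch_or(flags, GLOP_PENDING_SKETCH) &
		 GLOP_PENDING_SKETCH);
}

/* Clear the bit once the glop has finished with the inode. */
static void glop_end_sketch(atomic_uint *flags)
{
	assert(flags != NULL);
	atomic_fetch_and(flags, ~GLOP_PENDING_SKETCH);
}

/* The evict side would wait until this returns false. */
static bool glop_pending_sketch(atomic_uint *flags)
{
	assert(flags != NULL);
	return (atomic_load(flags) & GLOP_PENDING_SKETCH) != 0;
}
```

The point of the design is that the evict path waits on one bit of one inode instead of flushing a whole work queue.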

    In addition, flush the work queues of existing glocks before reusing
    them for new inodes to get those glocks into a known state: the glock
    state engine currently doesn't handle glock re-appropriation correctly.
    (We may be able to fix the glock state engine instead later.)

    Based on a patch by Steven Whitehouse.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

27 Jun, 2016

1 commit

  • Make the code more readable by cleaning up the different ways of
    initializing lock holders and checking for initialized lock holders:
    mark lock holders as uninitialized by setting the holder's glock to NULL
    (gfs2_holder_mark_uninitialized) instead of zeroing out the entire
    object or using a separate flag. Recognize initialized holders by their
    non-NULL glock (gfs2_holder_initialized). Don't zero out holder objects
    which are immediately initialized via gfs2_holder_init or
    gfs2_glock_nq_init.
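
The convention described here can be sketched with two small helpers. The struct below is a reduced, hypothetical stand-in for the real holder, keeping only the glock pointer; the helper names are modelled on the ones in the commit message but are not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct glock_sketch;			/* opaque, hypothetical */

/* Reduced stand-in for a lock holder: only the glock pointer matters. */
struct holder_sketch {
	struct glock_sketch *gh_gl;
};

/* An uninitialized holder is one whose glock pointer is NULL. */
static void holder_mark_uninitialized_sketch(struct holder_sketch *gh)
{
	assert(gh != NULL);
	gh->gh_gl = NULL;
}

/* A holder counts as initialized as soon as it points at a glock. */
static bool holder_initialized_sketch(const struct holder_sketch *gh)
{
	assert(gh != NULL);
	return gh->gh_gl != NULL;
}
```

Cleanup paths can then test `holder_initialized_sketch(&gh)` before dequeueing, with no separate flag and no need to zero the whole object.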

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

15 Dec, 2015

1 commit


30 Oct, 2015

1 commit

  • Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a
    gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in
    gl_lockref). Remove that define to make the references to gl_lockref.lock more
    obvious.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

16 Jan, 2014

1 commit

  • Al Viro has tactfully pointed out that we are using the incorrect
    error code in some cases. This patch fixes that, and also removes
    the (unused) return value for glock dumping.

    > * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
    > "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
    > we don't support STREAMS it's probably fair game, but... what the hell?

    Signed-off-by: Steven Whitehouse
    Cc: Al Viro

    Steven Whitehouse
     

15 Oct, 2013

1 commit

  • Currently glocks have an atomic reference count and also a spinlock
    which covers various internal fields, such as the state. The intent of
    this patch is to replace the spinlock and the atomic reference count
    with a lockref structure. This contains a spinlock which we can continue
    to use as before, and a reference counter which is used in conjunction
    with the spinlock to replace the previous atomic counter.

    As a result of this there are some new rules for reference counting on
    glocks. We need to distinguish between reference count changes under
    gl_spin (which are now just increment or decrement of the new counter,
    provided the count cannot hit zero) and those which are outside of
    gl_spin, but which now take gl_spin internally.

    The conversion is relatively straightforward. There is probably some
    further clean up which can be done, but the priority at this stage is to
    make the change in as simple a manner as possible.

    A consequence of this change is that the reference count is being
    decoupled from the lru list processing. This should allow future
    adoption of the lru_list code with glocks in due course.

    The reason for using the "dead" state and not just relying on 0 being
    the "invalid state" is so that in due course 0 ref counts can be
    allowable. The intent is to eventually be able to remove the ref count
    changes which are currently hidden away in state_change().
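
The two kinds of reference count change, and the separate "dead" marker, can be sketched in userspace as below. This is a simplified model and not the kernel's lockref: the spinlock is a bare C11 atomic flag, all names are hypothetical, and the value used for the dead state is arbitrary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LR_DEAD_SKETCH (-1)	/* arbitrary marker, distinct from 0 */

/* A spinlock and a plain counter that is only changed under it. */
struct lockref_sketch {
	atomic_flag lock;
	int count;
};

static void lr_lock(struct lockref_sketch *lr)
{
	while (atomic_flag_test_and_set_explicit(&lr->lock,
						 memory_order_acquire))
		;	/* spin */
}

static void lr_unlock(struct lockref_sketch *lr)
{
	atomic_flag_clear_explicit(&lr->lock, memory_order_release);
}

/* Caller already holds the lock: a plain increment is enough. */
static void lr_get_locked(struct lockref_sketch *lr)
{
	assert(lr->count > 0);
	lr->count++;
}

/* Caller does not hold the lock: take it internally. */
static void lr_get(struct lockref_sketch *lr)
{
	lr_lock(lr);
	lr_get_locked(lr);
	lr_unlock(lr);
}

/* Drop a reference; on the last one, mark the object dead. */
static bool lr_put(struct lockref_sketch *lr)
{
	bool last;

	lr_lock(lr);
	assert(lr->count > 0);
	last = (--lr->count == 0);
	if (last)
		lr->count = LR_DEAD_SKETCH;
	lr_unlock(lr);
	return last;
}
```

Keeping "dead" distinct from zero is what would later allow a zero count to be a legal state for a live object.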

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

08 Apr, 2013

1 commit

  • The original method for creating inodes used in GFS2 was to fill
    out a buffer, with all the information, and then to read that
    buffer into the in-core inode, using gfs2_refresh_inode().

    The problem with this approach is that all the inode's fields
    need to be calculated ahead of time, and were stored in various
    variables making the code rather complicated.

    The new approach is simply to allocate the in-core inode earlier
    and fill in as many fields as possible ahead of time. These can
    then be used to initialise the on disk representation. The
    code has been working towards the point where it is possible
    to remove gfs2_refresh_inode() because all the fields are
    correctly initialised ahead of time. We've now reached that
    milestone, and have reversed the order of setting up the in
    core and on disk inodes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

07 Nov, 2012

1 commit

  • Two of the bug traps here could really be warnings. The others are
    converted from BUG() to GLOCK_BUG_ON() since we'll most likely
    need to know the glock state in order to debug any issues which
    arise. As a result of this, __dump_glock has to be renamed and
    is no longer static.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

11 Jan, 2012

1 commit

  • This new method of managing recovery is an alternative to
    the previous approach of using the userland gfs_controld.

    - use dlm slot numbers to assign journal id's
    - use dlm recovery callbacks to initiate journal recovery
    - use a dlm lock to determine the first node to mount fs
    - use a dlm lock to track journals that need recovery

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     

01 Nov, 2011

1 commit

  • Standardize the style for compiler based printf format verification.
    Standardized the location of __printf too.

    Done via script and a little typing.

    $ grep -rPl --include=*.[ch] -w "__attribute__" * | \
    grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
    xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'
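
For readers unfamiliar with the attribute, the effect of the substitution can be sketched in userspace C. `PRINTF_LIKE` is a hypothetical macro standing in for the kernel's `__printf`, and `fmt_sketch` is an illustrative function; the block assumes a GCC-compatible compiler.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Short spelling of the format attribute (stand-in for __printf). */
#define PRINTF_LIKE(a, b) __attribute__((format(printf, a, b)))

/*
 * Argument 3 is the format string and the variadic arguments start at
 * position 4, so the compiler can check callers' format specifiers.
 */
static PRINTF_LIKE(3, 4) int fmt_sketch(char *buf, size_t len,
					const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf != NULL);
	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return n;
}
```

With the attribute in place, a call such as `fmt_sketch(buf, len, "%s", 42)` would be flagged at compile time when format warnings are enabled.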

    [akpm@linux-foundation.org: revert arch bits]
    Signed-off-by: Joe Perches
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

15 Jul, 2011

1 commit


20 Apr, 2011

1 commit

  • Rather than allowing the glocks to be scheduled for possible
    reclaim as soon as they have exited the journal, this patch
    delays their entry to the list until the glocks in question
    are no longer in use.

    This means that we will rely on the vm for writeback of all
    dirty data and metadata from now on. When glocks are added
    to the lru list they should be freeable much faster since all
    the I/O required to free them should have already been completed.

    This should lead to much better I/O patterns under low memory
    conditions.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

09 Mar, 2011

1 commit

  • This patch fixes a race in deallocating glocks which was introduced
    in the RCU glock patch. We need to ensure that the glock count is
    kept correct even in the case that there is a race to add a new
    glock into the hash table. Also, to avoid having to wait for an
    RCU grace period, the glock counter can be decremented before
    call_rcu() is called.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

21 Jan, 2011

1 commit

  • This has a number of advantages:

    - Reduces contention on the hash table lock
    - Makes the code smaller and simpler
    - Should speed up glock dumps when under load
    - Removes ref count changing in examine_bucket
    - No longer need hash chain lock in glock_put() in common case

    There are some further changes which this enables and which
    we may do in the future. One is to look at using SLAB_RCU,
    and another is to look at using a per-cpu counter for the
    per-sb glock counter, since that is touched twice in the
    lifetime of each glock (but only used at umount time).

    Signed-off-by: Steven Whitehouse
    Cc: Paul E. McKenney

    Steven Whitehouse
     

30 Nov, 2010

3 commits


18 Oct, 2010

1 commit


01 Mar, 2010

1 commit

  • Since the start of GFS2, an "extra" inode has been used to store
    the metadata belonging to each inode. The only reason for using
    this inode was to have an extra address space, the other fields
    were unused. This means that the memory usage was rather inefficient.

    The reason for keeping each inode's metadata in a separate address
    space is that when glocks are requested on remote nodes, we need to
    be able to efficiently locate the data and metadata relating
    to that glock (inode) in order to sync or sync and invalidate it
    (depending on the remotely requested lock mode).

    This patch adds a new type of glock which, in addition to
    its normal fields, has an address space. This applies to all
    inode and rgrp glocks (but to no other glock types which remain
    as before). As a result, we no longer need to have the second
    inode.

    This results in three major improvements:
    1. A saving of approx 25% of memory used in caching inodes
    2. A removal of the circular dependency between inodes and glocks
    3. No confusion between "normal" and "metadata" inodes in super.c

    Although the first of these is the more immediately apparent, the
    second is just as important as it now enables a number of clean
    ups at umount time. Those will be the subject of future patches.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

03 Feb, 2010

1 commit

  • Although all glocks are, by the time of the umount glock wait,
    scheduled for demotion, some of them haven't made it far
    enough through the process for the original set of waiting
    code to wait for them.

    This extends the ref count to the whole glock lifetime in order
    to ensure that the waiting does catch all glocks. It does make
    it a bit more invasive, but it seems the only sensible solution
    at the moment.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

03 Dec, 2009

1 commit


30 Jul, 2009

1 commit

  • When a file is deleted from a gfs2 filesystem on one node, a dcache
    entry for it may still exist on other nodes in the cluster. If this
    happens, gfs2 will be unable to free this file on disk. Because of this,
    it's possible to have a gfs2 filesystem with no files on it and no free
    space. With this patch, when a node receives a callback notifying it
    that the file is being deleted on another node, it schedules a new
    workqueue thread to remove the file's dcache entry.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

24 Mar, 2009

1 commit

  • This is the big patch that I've been working on for some time
    now. There are many reasons for wanting to make this change
    such as:
    o Reducing overhead by eliminating duplicated fields between structures
    o Simplification of the code (reduces the code size by a fair bit)
    o The locking interface is now the DLM interface itself as proposed
      some time ago.
    o Fewer lookups of glocks when processing replies from the DLM
    o Fewer memory allocations/deallocations for each glock
    o Scope to do further optimisations in the future (but this patch is
      more than big enough for now!)

    Please note that (a) this patch relates to the lock_dlm module and
    not the DLM itself, that is still a separate module; and (b) that
    we retain the ability to build GFS2 as a standalone single node
    filesystem without requiring the DLM.

    This patch needs a lot of testing, hence my keeping it I restarted
    my -git tree after the last merge window. That way, this has the maximum
    exposure before it's merged. This is (modulo a few minor bug fixes) the
    same patch that I've been posting on and off for the last three months
    and it's passed a number of different tests so far.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

05 Jan, 2009

4 commits

  • This reverts commit 78802499912f1ba31ce83a94c55b5a980f250a43.

    The original patch is causing problems in relation to order of
    operations at umount in relation to jdata files. I need to fix
    this a different way.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • There was a use-after-free with the GFS2 super block during
    umount. This patch moves almost all of the umount code from
    ->put_super into ->kill_sb, the only bit that cannot be moved
    being the glock hash clearing which has to remain as ->put_super
    due to umount ordering requirements. As a result it's now obvious
    that the kfree is the final operation, whereas before it was
    hidden in ->put_super.

    Also gfs2_jindex_free is then only referenced from a single file
    so that's moved and marked static too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch removes the two daemons, gfs2_scand and gfs2_glockd
    and replaces them with a shrinker which is called from the VM.

    The net result is that GFS2 responds better when there is memory
    pressure, since it shrinks the glock cache at the same rate
    as the VFS shrinks the dcache and icache. There are no longer
    any time based criteria for shrinking glocks, they are kept
    until such time as the VM asks for more memory and then we
    demote just as many glocks as required.

    There are potential future changes to this code, including the
    possibility of sorting the glocks which are to be written back
    into inode number order, to get a better I/O ordering. It would
    be very useful to have an elevator based workqueue implementation
    for this, as that would automatically deal with the read I/O cases
    at the same time.

    This patch is my answer to Andrew Morton's remark, made during
    the initial review of GFS2, asking why GFS2 needs so many kernel
    threads, the answer being that it doesn't :-) This patch is a
    net loss of about 200 lines of code.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Following on from the recent clean up of gfs2_quotad, this patch moves
    the processing of "truncate in progress" inodes from the glock workqueue
    into gfs2_quotad. This fixes a hang due to the "truncate in progress"
    processing requiring glocks in order to complete.

    It might seem odd to use gfs2_quotad for this particular item, but
    we have to use a pre-existing thread since creating a thread implies
    a GFP_KERNEL memory allocation which is not allowed from the glock
    workqueue context. Of the existing threads, gfs2_logd and gfs2_recoverd
    may deadlock if used for this operation. gfs2_scand and gfs2_glockd are
    both scheduled for removal at some (hopefully not too distant) future
    point. That leaves only gfs2_quotad whose workload is generally fairly
    light and is easily adapted for this extra task.

    Also, as a result of this change, it opens the way for a future patch to
    make the reading of the inode's information asynchronous with respect to
    the glock workqueue, which is another improvement that has been on the list
    for some time now.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

18 Sep, 2008

1 commit

  • Until now, we've used the same scheme as GFS1 for atime. This has failed
    since atime is a per vfsmnt flag, not a per fs flag and as such the
    "noatime" flag was not getting passed down to the filesystems. This
    patch removes all the "special casing" around atime updates and we
    simply use the VFS's atime code.

    The net result is that GFS2 will now support all the same atime related
    mount options of any other filesystem on a per-vfsmnt basis. We do lose
    the "lazy atime" updates, but we gain "relatime". We could add lazy
    atime to the VFS at a later date, if there is a requirement for that
    variant still - I suspect relatime will be enough.

    Also we lose about 100 lines of code after this patch has been applied,
    and I have a suspicion that it will speed things up a bit, even when
    atime is "on". So it seems like a nice clean up as well.

    From a user perspective, everything stays the same except the loss of
    the per-fs atime quantum tweakable (ought to be per-vfsmnt at the very
    least, and to be honest I don't think anybody ever used it) and that a
    number of options which were ignored before now work correctly.

    Please let me know if you've got any comments. I'm pushing this out
    early so that you can all see what my plans are.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

27 Jun, 2008

1 commit

  • There are several reasons why this is undesirable:

    1. It never happens during normal operation anyway
    2. If it does happen it causes performance to be very, very poor
    3. It isn't likely to solve the original problem (memory shortage
       on remote DLM node) it was supposed to solve
    4. It uses a bunch of arbitrary constants which are unlikely to be
       correct for any particular situation and for which the tuning seems
       to be a black art.
    5. In an N node cluster, only 1/N of the dropped locks will actually
       contribute to solving the problem on average.

    So all in all we are better off without it. This also makes merging
    the lock_dlm module into GFS2 a bit easier.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse