30 Jul, 2009

4 commits

  • When a file is deleted from a gfs2 filesystem on one node, a dcache
    entry for it may still exist on other nodes in the cluster. If this
    happens, gfs2 will be unable to free this file on disk. Because of this,
    it's possible to have a gfs2 filesystem with no files on it and no free
    space. With this patch, when a node receives a callback notifying it
    that the file is being deleted on another node, it schedules a new
    workqueue thread to remove the file's dcache entry.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • GFS2 was placing far too many glocks on the reclaim list that were not good
    candidates for freeing up from cache. These locks would sit there and
    repeatedly get scanned to see if they could be reclaimed, wasting a lot
    of time when there was memory pressure. This fix does more checks on the
    locks to see if they are actually likely to be removable from cache.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • It is possible for gfs2_shrink_glock_memory() to check a glock for
    demotion that's in the process of being freed by gfs2_glock_put(). In
    this case, gfs2_shrink_glock_memory() will acquire a new reference to
    this glock, and then try to free the glock itself when it drops the
    reference. To solve this, gfs2_shrink_glock_memory() just needs to
    check if the glock is in the process of being freed, and if so skip it
    without ever unlocking the lru_lock.

    Signed-off-by: Benjamin Marzinski
    Acked-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • This patch removes some of the special cases that the shrinker
    was trying to deal with. As a result we leave fewer items on
    the list and none at all which cannot be demoted. This makes
    the list scanning more efficient and solves some issues seen
    with large numbers of inodes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

12 Jun, 2009

1 commit

  • This patch adds the ability to trace various aspects of the GFS2
    filesystem. The trace points are divided into three groups,
    glocks, logging and bmap. These points have been chosen because
    they allow inspection of the major internal functions of GFS2
    and they are also generic enough that they are unlikely to need
    any major changes as the filesystem evolves.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

19 May, 2009

1 commit

  • This patch fixes a race condition where we can receive recovery
    requests part way through processing a umount. This was causing
    problems since the recovery thread had already gone away.

    Looking in more detail at the recovery code, it was really trying
    to implement a slight variation on a work queue, and that happens to
    align nicely with the recently introduced slow-work subsystem. As a
    result I've updated the code to use slow-work, rather than its own home
    grown variety of work queue.

    When using the wait_on_bit() function, I noticed that the wait function
    that was supplied as an argument was appearing in the WCHAN field, so
    I've updated the function names in order to produce more meaningful
    output.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

09 May, 2009

1 commit

  • Depending on the ordering of events as we go around the
    glock shrinker loop, it is possible to drop the ref count
    of a glock incorrectly. It doesn't happen very often. This
    patch corrects the got_ref variable, fixing the problem.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

15 Apr, 2009

1 commit


24 Mar, 2009

4 commits

  • This adds a sysfs file called demote_rq to GFS2's
    per filesystem directory. It's possible to use this
    file to demote arbitrary glocks in exactly the same
    way as if a request had come in from a remote node.

    This is intended for testing issues relating to caching
    of data under glocks. Despite that, the interface is
    generic enough to send requests to any type of glock,
    but be careful as it's not always safe to send an
    arbitrary message to an arbitrary glock. For that reason
    and to prevent DoS, this interface is restricted to root
    only.

    The messages look like this:

    <type>:<number> <mode>

    Example:

    echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq

    Which means "please demote inode glock (type 2) number 13324 so that
    I can get an EX (exclusive) lock". The lock modes are those which
    would normally be sent by a remote node in its callback so if you
    want to unlock a glock, you use EX, to demote to shared, use SH or PR
    (depending on whether you like GFS2 or DLM lock modes better!).

    If the glock doesn't exist, you'll get -ENOENT returned. If the
    arguments don't make sense, you'll get -EINVAL returned.

    The plan is that this interface will be used in combination with
    the blktrace patch which I recently posted for comments although
    it is, of course, still useful in its own right.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
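
The demote_rq message format described in the commit above can be illustrated with a small userspace parser. This is only a sketch under stated assumptions: parse_demote_rq(), struct demote_msg and the DEMOTE_TO_* names are hypothetical and are not taken from the GFS2 source, and only the modes named in the commit message (EX, SH, PR) are handled.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names: what the glock is demoted to, per the commit text. */
enum demote_target { DEMOTE_TO_UNLOCKED, DEMOTE_TO_SHARED };

struct demote_msg {
	unsigned int type;          /* glock type, e.g. 2 for an inode glock */
	unsigned long long number;  /* glock number */
	enum demote_target target;
};

/*
 * Parse "<type>:<number> <mode>", e.g. "2:13324 EX".
 * Returns 0 on success, -EINVAL if the arguments don't make sense.
 */
static int parse_demote_rq(const char *buf, struct demote_msg *msg)
{
	char mode[3];
	int consumed = 0;

	if (sscanf(buf, "%u:%llu %2s%n", &msg->type, &msg->number,
		   mode, &consumed) != 3)
		return -EINVAL;
	/* Reject trailing junk, but allow the newline a plain echo adds. */
	if (buf[consumed] != '\0' && strcmp(buf + consumed, "\n") != 0)
		return -EINVAL;

	if (strcmp(mode, "EX") == 0)
		msg->target = DEMOTE_TO_UNLOCKED;  /* remote EX: we must unlock */
	else if (strcmp(mode, "SH") == 0 || strcmp(mode, "PR") == 0)
		msg->target = DEMOTE_TO_SHARED;
	else
		return -EINVAL;
	return 0;
}
```

With this sketch, "2:13324 EX" parses to type 2, number 13324 and a demotion to unlocked, while a message with an unknown mode is rejected with -EINVAL, mirroring the error described above. Looking the glock up (and returning -ENOENT when it does not exist) is outside the scope of the example.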
     
  • This patch fixes a deadlock when the journal is flushed and there
    are dirty inodes other than the one which caused the journal flush.
    Originally the journal flushing code was trying to obtain the
    transaction glock while running the flush code for an inode glock.
    We no longer require the transaction glock at this point in time
    since we know that any attempt to get the transaction glock from
    another node will result in a journal flush. So if we are flushing
    the journal, we can be sure that the transaction lock is still
    cached from when the transaction was started.

    By inlining a version of gfs2_trans_begin() (minus the bit which
    gets the transaction glock) we can avoid the deadlock problems
    caused if there is a demote request queued up on the transaction
    glock.

    In addition I've also moved the umount rwsem so that it covers
    the glock workqueue, since all demotions are done by this
    workqueue now. That fixes a bug on umount which I came across
    while fixing the original problem.

    Reported-by: David Teigland
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The time stamp field is unused in the glock now that we are
    using a shrinker, so we can remove it and save sizeof(unsigned long)
    bytes in each glock.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This is the big patch that I've been working on for some time
    now. There are many reasons for wanting to make this change
    such as:
    o Reducing overhead by eliminating duplicated fields between structures
    o Simplification of the code (reduces the code size by a fair bit)
    o The locking interface is now the DLM interface itself as proposed
    some time ago.
    o Fewer lookups of glocks when processing replies from the DLM
    o Fewer memory allocations/deallocations for each glock
    o Scope to do further optimisations in the future (but this patch is
    more than big enough for now!)

    Please note that (a) this patch relates to the lock_dlm module and
    not the DLM itself, that is still a separate module; and (b) that
    we retain the ability to build GFS2 as a standalone single node
    filesystem without requiring the DLM.

    This patch needs a lot of testing, hence my keeping it until I restarted
    my -git tree after the last merge window. That way, this has the maximum
    exposure before it's merged. This is (modulo a few minor bug fixes) the
    same patch that I've been posting on and off for the last three months
    and it's passed a number of different tests so far.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

05 Jan, 2009

7 commits

  • SPIN_LOCK_UNLOCKED is deprecated. The following makes the change suggested
    in Documentation/spinlocks.txt

    The semantic patch that makes this change is as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    // <smpl>
    @@
    declarer name DEFINE_SPINLOCK;
    identifier xxx_lock;
    @@

    - spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED;
    + DEFINE_SPINLOCK(xxx_lock);
    // </smpl>

    Signed-off-by: Julia Lawall
    Signed-off-by: Steven Whitehouse

    Julia Lawall
     
  • This reverts commit 78802499912f1ba31ce83a94c55b5a980f250a43.

    The original patch is causing problems with the order of
    operations at umount for jdata files. I need to fix
    this a different way.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • There was a use-after-free with the GFS2 super block during
    umount. This patch moves almost all of the umount code from
    ->put_super into ->kill_sb, the only bit that cannot be moved
    being the glock hash clearing which has to remain as ->put_super
    due to umount ordering requirements. As a result it's now obvious
    that the kfree is the final operation, whereas before it was
    hidden in ->put_super.

    Also gfs2_jindex_free is then only referenced from a single file
    so that's moved and marked static too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The functions which are being moved can all be marked
    static in their new locations, since they only have
    a single caller each. Their new locations are more
    logical than before and some of the functions are
    small enough that the compiler might well inline them.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch removes the two daemons, gfs2_scand and gfs2_glockd
    and replaces them with a shrinker which is called from the VM.

    The net result is that GFS2 responds better when there is memory
    pressure, since it shrinks the glock cache at the same rate
    as the VFS shrinks the dcache and icache. There are no longer
    any time based criteria for shrinking glocks, they are kept
    until such time as the VM asks for more memory and then we
    demote just as many glocks as required.

    There are potential future changes to this code, including the
    possibility of sorting the glocks which are to be written back
    into inode number order, to get a better I/O ordering. It would
    be very useful to have an elevator based workqueue implementation
    for this, as that would automatically deal with the read I/O cases
    at the same time.

    This patch is my answer to Andrew Morton's remark, made during
    the initial review of GFS2, asking why GFS2 needs so many kernel
    threads, the answer being that it doesn't :-) This patch is a
    net loss of about 200 lines of code.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
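
The "demote just as many glocks as required" behaviour described above can be modelled in userspace as follows. This is a sketch, not the GFS2 shrinker: the array stands in for the LRU list, shrink_glock_cache() is a hypothetical name, and returning the number of entries left cached is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * cached[i] is true while glock i is still cached; index 0 is the least
 * recently used entry. Demote at most nr_to_scan cached entries, oldest
 * first, and return how many entries remain cached afterwards.
 */
static size_t shrink_glock_cache(bool *cached, size_t n, size_t nr_to_scan)
{
	size_t remaining = 0;

	for (size_t i = 0; i < n; i++) {
		if (!cached[i])
			continue;
		if (nr_to_scan > 0) {
			cached[i] = false;  /* stands in for demoting the glock */
			nr_to_scan--;
		} else {
			remaining++;        /* kept: the VM did not ask for more */
		}
	}
	return remaining;
}
```

Note that nothing in the model looks at how old an entry is: entries are only given up when the caller asks for memory, which is the point the commit message makes about dropping the time-based criteria.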
     
  • Following on from the recent clean up of gfs2_quotad, this patch moves
    the processing of "truncate in progress" inodes from the glock workqueue
    into gfs2_quotad. This fixes a hang due to the "truncate in progress"
    processing requiring glocks in order to complete.

    It might seem odd to use gfs2_quotad for this particular item, but
    we have to use a pre-existing thread since creating a thread implies
    a GFP_KERNEL memory allocation which is not allowed from the glock
    workqueue context. Of the existing threads, gfs2_logd and gfs2_recoverd
    may deadlock if used for this operation. gfs2_scand and gfs2_glockd are
    both scheduled for removal at some (hopefully not too distant) future
    point. That leaves only gfs2_quotad whose workload is generally fairly
    light and is easily adapted for this extra task.

    Also, as a result of this change, it opens the way for a future patch to
    make the reading of the inode's information asynchronous with respect to
    the glock workqueue, which is another improvement that has been on the list
    for some time now.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • fs/gfs2/glock.c:308:5: warning: context problem in 'do_promote': '_spin_unlock' expected different context
    fs/gfs2/glock.c:308:5: context '*gl+28': wanted >= 1, got 0
    fs/gfs2/glock.c:529:2: warning: context problem in 'do_xmote': '_spin_unlock' expected different context
    fs/gfs2/glock.c:529:2: context '*gl+28': wanted >= 1, got 0
    fs/gfs2/glock.c:925:3: warning: context problem in 'add_to_queue': '_spin_unlock' expected different context
    fs/gfs2/glock.c:925:3: context '*gl+28': wanted >= 1, got 0

    Signed-off-by: Harvey Harrison
    Signed-off-by: Steven Whitehouse

    Harvey Harrison
     

18 Sep, 2008

1 commit

  • Until now, we've used the same scheme as GFS1 for atime. This has failed
    since atime is a per vfsmnt flag, not a per fs flag and as such the
    "noatime" flag was not getting passed down to the filesystems. This
    patch removes all the "special casing" around atime updates and we
    simply use the VFS's atime code.

    The net result is that GFS2 will now support all the same atime related
    mount options of any other filesystem on a per-vfsmnt basis. We do lose
    the "lazy atime" updates, but we gain "relatime". We could add lazy
    atime to the VFS at a later date, if there is a requirement for that
    variant still - I suspect relatime will be enough.

    Also we lose about 100 lines of code after this patch has been applied,
    and I have a suspicion that it will speed things up a bit, even when
    atime is "on". So it seems like a nice clean up as well.

    From a user perspective, everything stays the same except the loss of
    the per-fs atime quantum tweakable (ought to be per-vfsmnt at the very
    least, and to be honest I don't think anybody ever used it) and that a
    number of options which were ignored before now work correctly.

    Please let me know if you've got any comments. I'm pushing this out
    early so that you can all see what my plans are.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

05 Sep, 2008

1 commit

  • In the case that a request for a glock arrives right after the
    grant reply has arrived, it sometimes means that the gl_tstamp
    field hasn't been updated recently enough. The net result is that
    the min-hold time for the glock is ignored. If this happens
    often enough, it leads to poor performance.

    This patch adds an additional test, so that if the reply pending
    bit is set on a glock, then it will select the maximum length of
    time for the min-hold time, rather than looking at gl_tstamp.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
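
The min-hold logic described above might look roughly like the following userspace sketch. The function name demote_delay() and its parameters are hypothetical; only the idea (use the full min-hold time when a reply is pending, otherwise derive the delay from the time stamp) comes from the commit message.

```c
#include <stdbool.h>

/*
 * Return how long (in ticks) a demote request should be delayed.
 * tstamp is when the glock was last granted and min_hold is the
 * minimum hold time, both in the same tick units as now.
 */
static unsigned long demote_delay(unsigned long now, unsigned long tstamp,
				  unsigned long min_hold, bool reply_pending)
{
	unsigned long holdtime = tstamp + min_hold;

	/* The grant reply has not been processed yet, so tstamp is stale:
	 * assume the lock was only just granted and hold it for the
	 * maximum permitted time. */
	if (reply_pending)
		return min_hold;

	/* Wraparound-safe "now is before holdtime" comparison. */
	if ((long)(holdtime - now) > 0)
		return holdtime - now;

	return 0;  /* already held long enough: demote immediately */
}
```

Without the reply_pending test, a stale tstamp would make the second branch compute a delay of zero, which is the "min-hold time is ignored" symptom the commit describes.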
     

13 Aug, 2008

1 commit


07 Jul, 2008

2 commits

  • We already allow local SH locks while we hold a cached EX glock, so here
    we allow DF locks as well. This works only because we rely on the VFS's
    invalidation for locally cached data, and because if we hold an EX lock,
    then we know that no other node can be caching data relating to this
    file.

    It dramatically speeds up initial writes to O_DIRECT files since we fall
    back to buffered I/O for this and would otherwise bounce between DF and
    EX modes on each and every write call. The lessons to be learned from
    that are to ensure that (for the time being anyway) O_DIRECT files are
    preallocated and that they are written to using reasonably large I/O
    sizes. Even so this change fixes that corner case nicely.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
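
The compatibility rule described above can be written down as a small predicate. This is a userspace sketch with hypothetical names (enum lock_state, local_request_compatible()); the "exact match is always compatible" case is an assumption added so that the function is complete, and is not stated in the commit message.

```c
#include <stdbool.h>

enum lock_state { ST_UNLOCKED, ST_EXCLUSIVE, ST_DEFERRED, ST_SHARED };

/*
 * Can a new local holder in mode `wanted` be granted while this node
 * has the glock cached in mode `cached`, without a mode change?
 */
static bool local_request_compatible(enum lock_state cached,
				     enum lock_state wanted)
{
	/* Assumption for the sketch: a request for the cached mode is fine. */
	if (cached == wanted)
		return true;

	/* A cached EX lock means no other node can be caching data for
	 * this file, so local SH and (with this change) DF holders are
	 * allowed too. */
	if (cached == ST_EXCLUSIVE &&
	    (wanted == ST_SHARED || wanted == ST_DEFERRED))
		return true;

	return false;
}
```

The reverse direction is deliberately not allowed: holding SH or DF says nothing about what other nodes hold, so an EX request still needs a real lock conversion.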
     
  • There is a race in the delayed demote code where it does the wrong thing
    if a demotion to UN has occurred for other reasons before the delay has
    expired. This patch adds an assert to catch that condition as well as
    fixing the root cause by adding an additional check for the UN state.

    Signed-off-by: Steven Whitehouse
    Cc: Bob Peterson

    Steven Whitehouse
     

27 Jun, 2008

3 commits

  • There are several reasons why this is undesirable:

    1. It never happens during normal operation anyway
    2. If it does happen it causes performance to be very, very poor
    3. It isn't likely to solve the original problem (memory shortage
    on remote DLM node) it was supposed to solve
    4. It uses a bunch of arbitrary constants which are unlikely to be
    correct for any particular situation and for which the tuning seems
    to be a black art.
    5. In an N node cluster, only 1/N of the dropped locks will actually
    contribute to solving the problem on average.

    So all in all we are better off without it. This also makes merging
    the lock_dlm module into GFS2 a bit easier.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch merges the lock_nolock module into GFS2 itself. As well as removing
    some of the overhead of the module, it also means that it's now impossible to
    build GFS2 without a lock module (which would be a pointless thing to do
    anyway).

    We also plan to merge lock_dlm into GFS2 in the future, but that is a more
    tricky task, and will therefore be a separate patch.

    Signed-off-by: Steven Whitehouse
    Cc: David Teigland

    Steven Whitehouse
     
  • This patch implements a number of cleanups to the core of the
    GFS2 glock code. As a result a lot of code is removed. It looks
    like a really big change, but actually a large part of this patch
    is either removing or moving existing code.

    There are some new bits too though, such as the new run_queue()
    function which is considerably streamlined. Highlights of this
    patch include:

    o Fixes a cluster coherency bug during SH -> EX lock conversions
    o Removes the "glmutex" code in favour of a single bit lock
    o Removes the ->go_xmote_bh() for inodes since it was duplicating
    ->go_lock()
    o We now only use the ->lm_lock() function for both locks and
    unlocks (i.e. unlock is a lock with target mode LM_ST_UNLOCKED)
    o The fast path is considerably shorter, giving performance gains
    especially with lock_nolock
    o The glock_workqueue is now used for all the callbacks from the DLM
    which allows us to simplify the lock_dlm module (see following patch)
    o The way is now open to make further changes such as eliminating the two
    threads (gfs2_glockd and gfs2_scand) in favour of a more efficient
    scheme.

    This patch has undergone extensive testing with various test suites
    so it should be pretty stable by now.

    Signed-off-by: Steven Whitehouse
    Cc: Bob Peterson

    Steven Whitehouse
     

31 Mar, 2008

8 commits

  • GFS2 wasn't invalidating its cache before it called into the lock manager
    with a request that could potentially drop a lock. This was leaving a
    window where the lock could actually be held by another node, but the
    file's page cache would still appear valid, causing coherency problems.
    This patch moves the cache invalidation to before the lock manager call
    when dropping a lock. It also adds the option to the lock_dlm lock
    manager to not use conversion mode deadlock avoidance, which, on a
    conversion from shared to exclusive, could internally drop the lock, and
    then reacquire it. GFS2 now asks lock_dlm to not do this. Instead, GFS2
    manually drops the lock and reacquires it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • As a result of an earlier patch, drop_bh was being called in cases
    when it shouldn't have been. Since we never have a gh in the drop
    case and we always have a gh in the promote case, we can use that
    extra information to tell which case has been seen.

    Signed-off-by: Steven Whitehouse
    Cc: Bob Peterson

    Steven Whitehouse
     
  • This patch further reduces GFS2's memory requirements by
    replacing the 64-bit version number fields with
    a couple of bits.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • The functions in lm.c were just wrappers which were mostly
    only used in one other file. By moving the functions to
    the files where they are being used, they can be marked
    static and also this will usually result in them being inlined
    since they are often only used from one point in the code.

    A couple of really trivial functions have been inlined by hand
    into the function which called them as it makes the code clearer
    to do that.

    We also gain from one fewer function call in the glock lock and
    unlock paths.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch further reduces the memory needs of GFS2 by
    eliminating the gl_req_bh variable from struct gfs2_glock.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This patch reduces memory by replacing the int variable
    gl_waiters2 by a single bit in the gl_flags.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
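
The saving described above comes from folding a boolean-like int into a flags word that already exists. A minimal userspace sketch of the idea follows; the struct, the helper names and the bit number are hypothetical, and the helpers are plain (non-atomic) stand-ins for the kernel's set_bit(), clear_bit() and test_bit().

```c
#include <stdbool.h>

/* Hypothetical bit number for the "waiters2" condition. */
#define GLF_MODEL_WAITERS2 0

struct glock_model {
	unsigned long flags;  /* one word shared by all boolean flags */
};

static void model_set_bit(struct glock_model *gl, unsigned int bit)
{
	gl->flags |= 1UL << bit;
}

static void model_clear_bit(struct glock_model *gl, unsigned int bit)
{
	gl->flags &= ~(1UL << bit);
}

static bool model_test_bit(const struct glock_model *gl, unsigned int bit)
{
	return (gl->flags >> bit) & 1UL;
}
```

Because the flags word is already part of the structure, each int converted this way removes sizeof(int) bytes (plus any padding) from every glock.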
     
  • gfs2_glock_hold() can now become static.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Steven Whitehouse

    Adrian Bunk
     
  • This patch only wakes up the glock reclaim daemon if there is
    actually something to be reclaimed.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

08 Feb, 2008

2 commits

  • The gl_owner_pid field is used to get the lock-owning task by its pid, so
    do this properly, i.e. by using the struct pid pointer and the pid_task()
    function.

    pid_task() is now exported for the gfs2 module.

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Acked-by: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The gl_owner_pid field is used to get the holder task by its pid and check
    whether the current task is a holder, so do this properly, i.e. via
    struct pid * manipulations.

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Acked-by: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

25 Jan, 2008

3 commits

  • This patch optimizes the function gfs2_glmutex_lock.
    The basic theory is: Why bother initializing a holder, setting up
    wait bits and then waiting on them, if you know the glock can be
    yours. So the holder stuff is placed inside the if checking if the
    glock is locked. This one needs careful scrutiny because changing
    anything to do with locking should strike terror into one's heart.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
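
The optimisation described above (only pay for the holder when the lock is actually contended) can be sketched in userspace like this. All names are hypothetical, and the model is single-threaded: the plain bool stands in for a test-and-set on a lock bit, which in real code would have to be done atomically or under a spinlock.

```c
#include <stdbool.h>

struct holder_model {
	bool initialised;  /* set only when the slow path is taken */
};

struct glmutex_model {
	bool locked;
	unsigned int waiters;
};

/*
 * Try to take the mutex. Returns true if it was free and is now ours.
 * Only when it is already held do we set up the holder and queue it;
 * false then means "queued, wait to be woken".
 */
static bool glmutex_lock_or_queue(struct glmutex_model *m,
				  struct holder_model *gh)
{
	if (!m->locked) {
		m->locked = true;  /* fast path: gh is never touched */
		return true;
	}
	gh->initialised = true;    /* slow path: now the holder is needed */
	m->waiters++;
	return false;
}

static void glmutex_unlock(struct glmutex_model *m)
{
	m->locked = false;
}
```

In the uncontended case the function returns without ever writing to the holder, which is the saving the commit message is after.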
     
  • The issue is indeed UP vs SMP and it is totally random.

    spin_is_locked() is a bad assertion because there is no correct answer on UP.
    On UP, spin_is_locked() has to return either one value or another, always.

    This means that in my setup I am lucky enough to trigger the issue and you
    are lucky enough not to.

    The attached patch removes the bogus calls to BUG_ON and, according to David
    (in CC, and thanks for the long explanation of the problem), we can rely upon
    things like lockdep to find the problems they might be trying to catch.

    Signed-off-by: Fabio M. Di Nitto
    Cc: David S. Miller
    Signed-off-by: Steven Whitehouse

    Fabio Massimo Di Nitto
     
  • The only reason for adding glocks to the journal was to keep track
    of which locks required a log flush prior to release. We add a
    flag to the glock to allow this check to be made in a simpler way.

    This reduces the size of a glock (by 12 bytes on i386, 24 on x86_64)
    and means that we can avoid extra work during the journal flush.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse