12 Mar, 2014

1 commit

  • This patch closes a small timing window in which a request to hold the
    transaction glock can get stuck. The problem is that after the DLM has
    granted the lock, GFS2 can get into a state where it doesn't transition
    the glock to a held state, having failed to requeue the glock state
    machine to finish the transition.
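
    As a sketch of the mechanics (a userspace model; all names are
    hypothetical rather than taken from the patch): the grant callback
    has to requeue the state machine, or nothing ever completes the
    transition.

    #include <stdbool.h>
    #include <stdio.h>

    enum gl_state { GL_UNLOCKED, GL_GRANTED, GL_HELD };

    struct glock {
            enum gl_state state;
            bool work_queued;       /* is the state machine scheduled? */
    };

    /* The state machine only makes progress when its work is queued. */
    static void glock_work(struct glock *gl)
    {
            if (gl->state == GL_GRANTED)
                    gl->state = GL_HELD;    /* finish the transition */
            gl->work_queued = false;
    }

    /* DLM grant callback. The bug was returning without requeuing the
     * work, leaving the glock stuck in GL_GRANTED forever. */
    static void grant_callback(struct glock *gl)
    {
            gl->state = GL_GRANTED;
            gl->work_queued = true;         /* the fix: requeue */
    }

    int main(void)
    {
            struct glock gl = { GL_UNLOCKED, false };

            grant_callback(&gl);
            while (gl.work_queued)
                    glock_work(&gl);
            printf("final state: %s\n", gl.state == GL_HELD ? "held" : "stuck");
            return 0;
    }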

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

07 Mar, 2014

2 commits

  • Add pr_fmt, remove embedded "GFS2: " prefixes.
    This now consistently emits lower case "gfs2: " for each message.

    Other miscellanea around these changes:

    o Add missing newlines
    o Coalesce formats
    o Realign arguments
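
    As a rough userspace model of the pr_fmt convention adopted here
    (fprintf stands in for the kernel's logging; only the macro names
    are taken from the kernel, the rest is illustrative):

    #include <stdio.h>

    /* Defining pr_fmt once gives every pr_*() call a consistent
     * "gfs2: " prefix without embedding it in each format string. */
    #define pr_fmt(fmt) "gfs2: " fmt
    #define pr_warn(fmt, ...) fprintf(stderr, pr_fmt(fmt), ##__VA_ARGS__)

    int main(void)
    {
            pr_warn("fsid=%s: error recovering journal\n", "demo:fs1");
            return 0;
    }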

    Signed-off-by: Joe Perches
    Signed-off-by: Steven Whitehouse

    Joe Perches
     
  • - All printk(KERN_foo ...) calls converted to pr_foo().
    - Messages updated to fit in 80 columns.
    - fs_macros converted as well.
    - fs_printk removed.

    Signed-off-by: Fabian Frederick
    Signed-off-by: Steven Whitehouse

    Fabian Frederick
     

16 Jan, 2014

1 commit

  • Al Viro has tactfully pointed out that we are using the incorrect
    error code in some cases. This patch fixes that, and also removes
    the (unused) return value for glock dumping.

    > * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
    > "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
    > we don't support STREAMS it's probably fair game, but... what the hell?

    Signed-off-by: Steven Whitehouse
    Cc: Al Viro

    Steven Whitehouse
     

21 Nov, 2013

1 commit

  • Commit [e66cf1610: GFS2: Use lockref for glocks] replaced the call:
    atomic_read(&gi->gl->gl_ref) == 0
    with:
    __lockref_is_dead(&gl->gl_lockref)
    thereby changing how gl is accessed, from gi->gl to plain gl.
    However, gl can be a NULL pointer, and so gi->gl needs to be
    used instead (which is guaranteed not to be NULL because of
    the while loop checking that condition).
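
    A compact model of the bug (structures simplified; the real code
    uses struct gfs2_glock and a lockref): the local `gl` can be NULL
    here, while the loop condition guarantees `gi->gl` is not.

    #include <stdio.h>

    struct glock { int dead; };
    struct glock_iter { struct glock *gl; };

    /* Crashes if handed a NULL pointer, as the buggy code could be. */
    static int lockref_is_dead(const struct glock *gl)
    {
            return gl->dead;
    }

    int main(void)
    {
            struct glock g = { 0 };
            struct glock_iter gi = { &g };
            struct glock *gl = NULL;    /* the possibly-NULL local */

            (void)gl;  /* buggy form dereferenced it: lockref_is_dead(gl) */
            printf("dead=%d\n", lockref_is_dead(gi.gl));  /* the fix */
            return 0;
    }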

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Steven Whitehouse

    Michal Nazarewicz
     

15 Oct, 2013

1 commit

  • Currently glocks have an atomic reference count and also a spinlock
    which covers various internal fields, such as the state. The intent
    of this patch is to replace the spinlock and the atomic reference
    count with a lockref structure. This contains a spinlock which we can
    continue to use as before, and a reference counter which is used in
    conjunction
    with the spinlock to replace the previous atomic counter.

    As a result of this there are some new rules for reference counting on
    glocks. We need to distinguish between reference count changes under
    gl_spin (which are now just increment or decrement of the new counter,
    provided the count cannot hit zero) and those which are outside of
    gl_spin, but which now take gl_spin internally.

    The conversion is relatively straightforward. There is probably some
    further clean up which can be done, but the priority at this stage is to
    make the change in as simple a manner as possible.

    A consequence of this change is that the reference count is being
    decoupled from the lru list processing. This should allow future
    adoption of the lru_list code with glocks in due course.

    The reason for using the "dead" state and not just relying on 0 being
    the "invalid state" is so that in due course 0 ref counts can be
    allowable. The intent is to eventually be able to remove the ref count
    changes which are currently hidden away in state_change().
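
    A userspace sketch of the lockref idea (a pthread mutex standing in
    for gl_spin; the kernel's real lockref is more elaborate): one
    structure carries both the lock and the count, and a "get" can
    refuse to resurrect a count that has already hit zero.

    #include <pthread.h>
    #include <stdbool.h>

    struct lockref {
            pthread_mutex_t lock;   /* plays the part of gl_spin */
            unsigned int count;     /* replaces the old atomic gl_ref */
    };

    /* Take a reference, but never resurrect a zero count. Callers
     * already holding ->lock can simply do count++ directly. */
    static bool lockref_get_not_zero(struct lockref *lr)
    {
            bool ok = false;

            pthread_mutex_lock(&lr->lock);
            if (lr->count != 0) {
                    lr->count++;
                    ok = true;
            }
            pthread_mutex_unlock(&lr->lock);
            return ok;
    }

    int main(void)
    {
            struct lockref lr = { PTHREAD_MUTEX_INITIALIZER, 1 };

            return lockref_get_not_zero(&lr) ? 0 : 1;
    }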

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

11 Sep, 2013

2 commits

  • Convert the filesystem shrinkers to use the new API, and standardise some
    of the behaviours of the shrinkers at the same time. For example,
    nr_to_scan means the number of objects to scan, not the number of objects
    to free.

    I refactored the CIFS idmap shrinker a little - it really needs to be
    broken up into a shrinker per tree and keep an item count with the tree
    root so that we don't need to walk the tree every time the shrinker needs
    to count the number of objects in the tree (i.e. all the time under
    memory pressure).
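
    A minimal model of the new API's shape (userspace; the names echo
    the kernel's struct shrinker, but the types are simplified):
    counting and scanning are separate callbacks, and scan_objects
    returns how many objects were actually freed.

    #include <stdio.h>

    struct shrink_control { unsigned long nr_to_scan; };

    struct shrinker {
            unsigned long (*count_objects)(struct shrink_control *sc);
            unsigned long (*scan_objects)(struct shrink_control *sc);
    };

    static unsigned long cached = 1000;     /* stand-in object pool */

    static unsigned long demo_count(struct shrink_control *sc)
    {
            (void)sc;
            return cached;                  /* cheap: frees nothing */
    }

    static unsigned long demo_scan(struct shrink_control *sc)
    {
            unsigned long freed = sc->nr_to_scan < cached ?
                                  sc->nr_to_scan : cached;
            cached -= freed;
            return freed;                   /* objects freed this call */
    }

    int main(void)
    {
            struct shrinker s = { demo_count, demo_scan };
            struct shrink_control sc = { .nr_to_scan = 128 };
            unsigned long count = s.count_objects(&sc);
            unsigned long freed = s.scan_objects(&sc);

            printf("count=%lu freed=%lu left=%lu\n", count, freed, cached);
            return 0;
    }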

    [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
    [assorted fixes folded in]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Acked-by: Artem Bityutskiy
    Acked-by: Jan Kara
    Acked-by: Steven Whitehouse
    Cc: Adrian Hunter
    Cc: "Theodore Ts'o"
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • The sysctl knob sysctl_vfs_cache_pressure is used to determine which
    percentage of the shrinkable objects in our cache we should actively try
    to shrink.

    It works great in situations in which we have many objects (at least more
    than 100), because the approximation errors will be negligible. But
    if this is not the case, especially when total_objects < 100, we may
    end up
    concluding that we have no objects at all (total / 100 = 0, if total <
    100).

    This is certainly not the biggest killer in the world, but may matter in
    very low kernel memory situations.
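
    The arithmetic at issue, as a short demo (the fix multiplies before
    dividing; the numbers here are illustrative):

    #include <stdio.h>

    int main(void)
    {
            unsigned long total = 42, pressure = 100;

            /* Dividing first truncates small counts to zero. */
            printf("divide first:   %lu\n", total / 100 * pressure);
            /* Multiplying first preserves them. */
            printf("multiply first: %lu\n", total * pressure / 100);
            return 0;
    }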

    Signed-off-by: Glauber Costa
    Reviewed-by: Carlos Maiolino
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     

20 Aug, 2013

1 commit

  • We need to check the glock ref counter in a race-free way
    in order to ensure that the gfs2_glock_hold() call will
    succeed. The easiest way to do that is to simply take the
    reference count early in the common code of examine_bucket,
    skipping any glocks with zero ref count.

    That means that the examiner functions all need to put their
    reference on the glock once they've performed their function.
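
    The usual shape of such a race-free "take the reference early"
    helper, in userspace C11 (the kernel code differs in detail):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct glock { atomic_int ref; };

    /* Take a reference only while the count is still non-zero; a zero
     * count means the glock is already dying and must be skipped. */
    static bool glock_tryget(struct glock *gl)
    {
            int old = atomic_load(&gl->ref);

            do {
                    if (old == 0)
                            return false;
            } while (!atomic_compare_exchange_weak(&gl->ref, &old, old + 1));
            return true;
    }

    int main(void)
    {
            struct glock live = { 1 }, dying = { 0 };

            printf("live: %d, dying: %d\n",
                   glock_tryget(&live), glock_tryget(&dying));
            return 0;
    }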

    Signed-off-by: Steven Whitehouse
    Reported-by: David Teigland
    Tested-by: David Teigland

    Steven Whitehouse
     

01 May, 2013

1 commit

  • Pull GFS2 updates from Steven Whitehouse:
    "There is not a whole lot of change this time - there are some further
    changes which are in the works, but those will be held over until next
    time.

    Here there are some clean ups to inode creation, the addition of an
    origin (local or remote) indicator to glock demote requests, removal
    of one of the remaining GFP_NOFAIL allocations during log flushes, one
    minor clean up, and a one liner bug fix."

    * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Flush work queue before clearing glock hash tables
    GFS2: Add origin indicator to glock demote tracing
    GFS2: Add origin indicator to glock callbacks
    GFS2: replace gfs2_ail structure with gfs2_trans
    GFS2: Remove vestigial parameter ip from function rs_deltree
    GFS2: Use gfs2_dinode_out() in the inode create path
    GFS2: Remove gfs2_refresh_inode from inode creation path
    GFS2: Clean up inode creation path

    Linus Torvalds
     

26 Apr, 2013

1 commit

  • There was a timing window when a GFS2 file system was unmounted
    that caused GFS2 to call BUG() and panic the kernel. The call
    to BUG() is meant to ensure that the glock reference count,
    gl_ref, never gets down to zero and bounces back up again. What was
    happening during umount is that function gfs2_put_super was dequeuing
    its glocks for well-known files. In particular, we saw it on the
    journal glock, sd_jinode_gh. The dequeue caused delayed work to be
    queued for the glock state machine, to transition the lock to an
    "unlocked" state. While the work was still queued, gfs2_put_super
    called gfs2_gl_hash_clear to clear out the glock hash tables.
    If the timing was just so, the glock work function would drop the
    reference count at the time when it was being checked for zero,
    and that caused BUG() to be called. This patch calls
    flush_workqueue before clearing the glock hash tables, thereby
    ensuring that the delayed work is executed before the hash tables
    are cleared, and therefore the reference count never goes to zero
    until the glock is cleared.
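
    A tiny pthread model of the ordering the fix enforces (pthread_join
    plays the role of flush_workqueue; everything else is invented):

    #include <assert.h>
    #include <pthread.h>

    static int gl_ref = 1;

    /* Stand-in for the queued glock work, which drops its reference. */
    static void *glock_work(void *arg)
    {
            (void)arg;
            gl_ref--;
            return NULL;
    }

    int main(void)
    {
            pthread_t worker;

            pthread_create(&worker, NULL, glock_work, NULL);
            /* The flush_workqueue() analogue: wait for all queued work
             * to finish (join also orders the memory accesses). */
            pthread_join(worker, NULL);
            assert(gl_ref == 0);    /* now safe to clear hash tables */
            return 0;
    }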

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

10 Apr, 2013

2 commits

  • This adds the origin indicator to the trace point for glock
    demotion, so that it is possible to see where demote requests
    have come from.

    Note that requests generated from the demote_rq sysfs interface
    will show as remote, since they are intended to replicate
    exactly the effect of a demote request from a remote node. It
    is still possible to tell these apart by looking at the process
    which initiated the demote request.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch adds a bool indicating whether the demote
    request was originated locally or remotely. This is then
    used by the iopen ->go_callback() to make 100% sure that
    it will only respond to remote callbacks.

    Since ->evict_inode() uses GL_NOCACHE when it attempts to
    get an exclusive lock on the iopen lock, this may result
    in extra scheduling of the workqueue if the exclusive
    promotion request fails. This patch prevents
    that from happening.
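
    In outline (simplified from the patch; the real callback also takes
    the glock itself):

    #include <stdbool.h>
    #include <stdio.h>

    /* Only remote demote requests should make the iopen callback do
     * any work; acting on local ones just schedules needless work. */
    static void iopen_go_callback(bool remote)
    {
            if (!remote)
                    return;
            puts("queueing work to drop the iopen glock");
    }

    int main(void)
    {
            iopen_go_callback(false);   /* local: ignored */
            iopen_go_callback(true);    /* remote: acted upon */
            return 0;
    }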

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

08 Apr, 2013

1 commit

  • The original method for creating inodes used in GFS2 was to fill
    out a buffer, with all the information, and then to read that
    buffer into the in-core inode, using gfs2_refresh_inode().

    The problem with this approach is that all the inode's fields
    needed to be calculated ahead of time and stored in various
    variables, making the code rather complicated.

    The new approach is simply to allocate the in-core inode earlier
    and fill in as many fields as possible ahead of time. These can
    then be used to initialise the on disk representation. The
    code has been working towards the point where it is possible
    to remove gfs2_refresh_inode() because all the fields are
    correctly initialised ahead of time. We've now reached that
    milestone, and have reversed the order of setting up the in
    core and on disk inodes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

02 Feb, 2013

1 commit

  • The intent here is to split the processing of the glock lru
    list into two parts, so that the selection of glocks and the
    disposal are separate functions. The plan is that further updates
    can then be made to these functions in the future
    to improve the selection of glocks and also the efficiency of
    glock disposal.

    The new feature which this patch brings is sorting the
    glocks to be disposed of into glock number (and thus also
    disk block number) order. Not all glocks will need i/o in
    order to dispose of them, but some will, and at least we'll
    generate mostly disk block order i/o now.
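
    An illustration of the new disposal ordering (qsort stands in for
    the kernel's list_sort; the field name is modelled on the glock's
    ln_number, which equals the disk block number):

    #include <stdio.h>
    #include <stdlib.h>

    struct glock { unsigned long ln_number; };  /* == disk block no. */

    static int glock_cmp(const void *a, const void *b)
    {
            const struct glock *ga = a, *gb = b;

            if (ga->ln_number != gb->ln_number)
                    return ga->ln_number < gb->ln_number ? -1 : 1;
            return 0;
    }

    int main(void)
    {
            struct glock victims[] = { {37}, {5}, {19} };
            size_t i, n = sizeof(victims) / sizeof(victims[0]);

            /* Dispose in block-number order => mostly sequential I/O. */
            qsort(victims, n, sizeof(victims[0]), glock_cmp);
            for (i = 0; i < n; i++)
                    printf("%lu\n", victims[i].ln_number);
            return 0;
    }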

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

16 Dec, 2012

1 commit

  • Pull GFS2 updates from Steven Whitehouse:
    "The main feature this time is the new Orlov allocator and the patches
    leading up to it which allow us to allocate new inodes from their own
    allocation context, rather than borrowing that of their parent
    directory. It is this change which then allows us to choose a
    different location for subdirectories when required. This works
    exactly as per the ext3 implementation from the user's point of view.

    In addition to that, we've got a speed up in gfs2_rbm_from_block()
    from Bob Peterson, three locking related improvements from Dave
    Teigland plus a selection of smaller bug fixes and clean ups."

    * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
    GFS2: Set gl_object during inode create
    GFS2: add error check while allocating new inodes
    GFS2: don't reference inode's glock during block allocation trace
    GFS2: remove redundant lvb pointer
    GFS2: only use lvb on glocks that need it
    GFS2: skip dlm_unlock calls in unmount
    GFS2: Fix one RG corner case
    GFS2: Eliminate redundant buffer_head manipulation in gfs2_unlink_inode
    GFS2: Use dirty_inode in gfs2_dir_add
    GFS2: Fix truncation of journaled data files
    GFS2: Add Orlov allocator
    GFS2: Use proper allocation context for new inodes
    GFS2: Add test for resource group congestion status
    GFS2: Rename glops go_xmote_th to go_sync
    GFS2: Speed up gfs2_rbm_from_block
    GFS2: Review bug traps in glops.c

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • Overhaul struct address_space.assoc_mapping, renaming it to
    address_space.private_data and redefining its type as void *. This
    approach consistently names the .private_* elements of struct
    address_space, and allows extended use of address_space association
    with other data structures through ->private_data.

    Also, all users of the old ->assoc_mapping element are converted to
    reflect its new name and type change (->private_data).

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

14 Nov, 2012

1 commit

  • When unmounting, gfs2 does a full dlm_unlock operation on every
    cached lock. This can create a very large amount of work and can
    take a long time to complete. However, the vast majority of these
    dlm unlock operations are unnecessary because after all the unlocks
    are done, gfs2 leaves the dlm lockspace, which automatically clears
    the locks of the leaving node, without unlocking each one individually.
    So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
    remove the locks implicitly. The one exception is when the lock's lvb is
    being used. In this case, dlm_unlock is called because it may update the
    lvb of the resource.
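
    The decision boils down to a per-lock test like this sketch
    (userspace model, names invented):

    #include <stdbool.h>
    #include <stdio.h>

    struct cached_lock { unsigned long num; bool has_lvb; };

    /* Only locks whose value block may need writing back get an
     * explicit dlm_unlock; everything else is dropped in one shot
     * by leaving the lockspace. */
    static bool needs_explicit_unlock(const struct cached_lock *lk)
    {
            return lk->has_lvb;
    }

    int main(void)
    {
            struct cached_lock locks[] = { {1, false}, {2, true} };
            size_t i;

            for (i = 0; i < 2; i++)
                    printf("lock %lu: %s\n", locks[i].num,
                           needs_explicit_unlock(&locks[i]) ?
                           "dlm_unlock" : "skip");
            return 0;
    }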

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     

07 Nov, 2012

2 commits

  • [Editorial: This is a nit, but has been a minor irritation for a long time:]

    This patch renames the glops structure item go_xmote_th to go_sync.
    The functionality is unchanged; it's just for readability.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • Two of the bug traps here could really be warnings. The others are
    converted from BUG() to GLOCK_BUG_ON() since we'll most likely
    need to know the glock state in order to debug any issues which
    arise. As a result of this, __dump_glock has to be renamed and
    is no longer static.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

08 Jun, 2012

2 commits

  • Instead of reading in the resource groups when gfs2 is checking
    for free space to allocate from, gfs2 can store the necessary
    information in the resource group's lvb. Also, instead of searching
    for unlinked inodes in every resource group that's checked for free
    space, gfs2 can store the number of unlinked, but not yet deleted,
    inodes in the lvb, and only check for unlinked inodes if it will
    find some.

    The first time a resource group is locked, the lvb must be initialized.
    Since this involves counting the unlinked inodes in the resource group,
    this takes a little extra time. But after that, if the resource group
    is locked with GL_SKIP, the buffer head won't be read in unless it's
    actually needed.

    Enabling the resource group lvbs is done via the rgrplvb mount option. If
    this option isn't set, the lvbs will still be set and updated, but they won't
    be verified or used by the filesystem. To safely turn on this option, all of
    the nodes mounting the filesystem must be running code with this patch, and
    the filesystem must have been completely unmounted since they were updated.
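
    The sort of summary that fits in an LVB, as a sketch (field names
    hypothetical; the on-disk layout used by the patch differs):

    #include <stdint.h>
    #include <stdio.h>

    struct rgrp_lvb {
            uint32_t magic;      /* detects an uninitialized LVB */
            uint32_t free;       /* free blocks in the rgrp */
            uint32_t dinodes;    /* allocated inodes */
            uint32_t unlinked;   /* unlinked-but-not-deleted inodes */
    };

    int main(void)
    {
            /* An LVB is small (GFS2 traditionally uses 32 bytes), so
             * only compact summaries like this can be cached in it. */
            printf("lvb payload: %zu bytes\n", sizeof(struct rgrp_lvb));
            return 0;
    }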

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     
  • For the glocks and glstats seq_files, which are exposed via
    debugfs, we should cache the most recent hash bucket, along with
    the offset
    into that bucket. This allows us to restart from that point, rather
    than having to begin at the beginning each time.

    This is an idea from Eric Dumazet; however, I've slightly extended it
    so that if the position from which we are due to start is at any
    point beyond the last cached point, we start from the last cached
    point, plus whatever is the appropriate offset. I don't really expect
    people to be lseeking around these files, but if they did so with only
    positive offsets, then we'd still get some of the benefit of using a
    cached offset.

    With my simple test of around 200k entries in the file, I'm seeing
    an approx 10x speed up.
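
    The caching idea in miniature (names hypothetical; the real code
    walks hash buckets of glocks rather than a flat sequence):

    #include <stdio.h>

    /* Toy iterator state: remember where the last read stopped. */
    struct iter_cache {
            long pos;       /* seq position already reached */
            int bucket;     /* hash bucket holding that position */
            long offset;    /* offset within that bucket */
    };

    /* If the requested position is at or past the cached one, resume
     * from the cache instead of rescanning from bucket zero. */
    static void iter_start(const struct iter_cache *c, long want,
                           int *bucket, long *skip)
    {
            if (want >= c->pos) {
                    *bucket = c->bucket;
                    *skip = (want - c->pos) + c->offset;
            } else {
                    *bucket = 0;
                    *skip = want;   /* backwards seek: full rescan */
            }
    }

    int main(void)
    {
            struct iter_cache c = { 150000, 731, 3 };
            int bucket;
            long skip;

            iter_start(&c, 150001, &bucket, &skip);
            printf("resume at bucket %d, skipping %ld\n", bucket, skip);
            return 0;
    }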

    Cc: Eric Dumazet
    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

29 Feb, 2012

1 commit

  • The stats are divided into two sets: those relating to the
    super block and those relating to an individual glock. The
    super block stats are done on a per cpu basis in order to
    try and reduce the overhead of gathering them. They are also
    further divided by glock type.

    The same information is gathered for both the super block
    and glock statistics. The super
    block statistics are used to provide default values for
    most of the glock statistics, so that newly created glocks
    should have, as far as possible, a sensible starting point.

    The statistics are divided into three pairs of mean and
    variance, plus two counters. The mean/variance pairs are
    smoothed exponential estimates and the algorithm used is
    one which will be very familiar to anyone used to the calculation
    of round trip times in network code.

    The three pairs of mean/variance measure the following
    things:

    1. DLM lock time (non-blocking requests)
    2. DLM lock time (blocking requests)
    3. Inter-request time (again to the DLM)

    A non-blocking request is one which will complete right
    away, whatever the state of the DLM lock in question. That
    currently means any requests when (a) the current state of
    the lock is exclusive (b) the requested state is either null
    or unlocked or (c) the "try lock" flag is set. A blocking
    request covers all the other lock requests.

    There are two counters. The first is there primarily to show
    how many lock requests have been made, and thus how much data
    has gone into the mean/variance calculations. The other counter
    counts queueing of holders at the top layer of the glock
    code. Hopefully that number will be a lot larger than the number
    of dlm lock requests issued.

    So why gather these statistics? There are several reasons
    we'd like to get a better idea of these timings:

    1. To be able to better set the glock "min hold time"
    2. To spot performance issues more easily
    3. To improve the algorithm for selecting resource groups for
    allocation (to base it on lock wait time, rather than blindly
    using a "try lock")
    Due to the smoothing action of the updates, a step change in
    some input quantity being sampled will only fully be taken
    into account after 8 samples (or 4 for the variance) and this
    needs to be carefully considered when interpreting the
    results.

    Knowing both the time it takes a lock request to complete and
    the average time between lock requests for a glock means we
    can compute the total percentage of the time for which the
    node is able to use a glock vs. time that the rest of the
    cluster has its share. That will be very useful when setting
    the lock min hold time.

    The other point to remember is that all times are in
    nanoseconds. Great care has been taken to ensure that we
    measure exactly the quantities that we want, as accurately
    as possible. There are always inaccuracies in any
    measuring system, but I hope this is as accurate as we
    can reasonably make it.
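
    The update step described above, in miniature (userspace sketch in
    essentially the SRTT/RTTVAR style the text refers to; the kernel
    code works on integer nanosecond samples):

    #include <stdio.h>
    #include <stdlib.h>

    /* The smoothed mean gains 1/8 of each error and the variance
     * estimate 1/4, matching the "8 samples (4 for the variance)"
     * note above. */
    static void update_stats(long long *mean, long long *var,
                             long long sample)
    {
            long long delta = sample - *mean;

            *mean += delta >> 3;
            *var += (llabs(delta) - *var) >> 2;
    }

    int main(void)
    {
            long long mean = 0, var = 0;
            int i;

            for (i = 0; i < 16; i++)
                    update_stats(&mean, &var, 1000);
            printf("mean=%lld var=%lld\n", mean, var);
            return 0;
    }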

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

11 Jan, 2012

1 commit

  • This new method of managing recovery is an alternative to
    the previous approach of using the userland gfs_controld.

    - use dlm slot numbers to assign journal id's
    - use dlm recovery callbacks to initiate journal recovery
    - use a dlm lock to determine the first node to mount fs
    - use a dlm lock to track journals that need recovery

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     
