15 Mar, 2016

1 commit


15 Dec, 2015

3 commits

  • gfs2 currently returns 31 bits of filename hash as a cookie that readdir
    uses for an offset into the directory. When there are a large number of
    directory entries, the likelihood of a collision goes up way too
    quickly. GFS2 will now return cookies that are guaranteed unique for a
    while, and then fail back to using 30 bits of filename hash.
    Specifically, the directory leaf blocks are divided up into chunks based
    on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
    cookie is based off the chunk where it starts, in the linked list of
    leaf blocks that it hashes to (there are 131072 hash buckets). Directory
    entries will have unique names until they take reach chunk 8192.
    Assuming the largest filenames possible, and the least efficient spacing
    possible, this new method will still be able to return unique names when
    the previous method has statistically more than a 99% chance of a
    collision. The non-unique names it fails back to are guaranteed to not
    collide with the unique names.

    unique cookies will be in this format:
    - 1 bit "0" to make sure the the returned cookie is positive
    - 17 bits for the hash table index
    - 1 bit for the mode "0"
    - 13 bits for the offset

    non-unique cookies will be in this format:
    - 1 bit "0" to make sure the the returned cookie is positive
    - 17 bits for the hash table index
    - 1 bit for the mode "1"
    - 13 more bits of the name hash

    Another benefit of location based cookies, is that once a directory's
    exhash table is fully extended (so that multiple hash table indexs do
    not use the same leaf blocks), gfs2 can skip sorting the directory
    entries until it reaches the non-unique ones, and then it only needs to
    sort these. This provides a significant speed up for directory reads of
    very large directories.

    The only issue is that for these cookies to continue to point to the
    correct entry as files are added and removed from the directory, gfs2
    must keep the entries at the same offset in the leaf block when they are
    split (see my previous patch). This means that until all the nodes in a
    cluster are running with code that will split the directory leaf blocks
    this way, none of the nodes can use the new cookie code. To deal with
    this, gfs2 now has the mount option loccookie, which, if set, will make
    it return these new location based cookies. This option must not be set
    until all nodes in the cluster are at least running this version of the
    kernel code, and you have guaranteed that there are no outstanding
    cookies required by other software, such as NFS.

    gfs2 uses some of the extra space at the end of the gfs2_dirent
    structure to store the calculated readdir cookies. This keeps us from
    needing to allocate a seperate array to hold these values. gfs2
    recomputes the cookie stored in de_cookie for every readdir call. The
    time it takes to do so is small, and if gfs2 expected this value to be
    saved on disk, the new code wouldn't work correctly on filesystems
    created with an earlier version of gfs2.

    One issue with adding de_cookie to the union in the gfs2_dirent
    structure is that it caused the union to align itself to a 4 byte
    boundary, instead of its previous 2 byte boundary. This changed the
    offset of de_rahead. To solve that, I pulled de_rahead out of the union,
    since it does not need to be there.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
     
  • This patch makes no functional changes. Its goal is to reduce the
    size of the gfs2 inode in memory by rearranging structures and
    changing the size of some variables within the structure.

    Signed-off-by: Bob Peterson

    Bob Peterson
     
  • Before this patch, multi-block reservation structures were allocated
    from a special slab. This patch folds the structure into the gfs2_inode
    structure. The disadvantage is that the gfs2_inode needs more memory,
    even when a file is opened read-only. The advantages are: (a) we don't
    need the special slab and the extra time it takes to allocate and
    deallocate from it. (b) we no longer need to worry that the structure
    exists for things like quota management. (c) This also allows us to
    remove the calls to get_write_access and put_write_access since we
    know the structure will exist.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

24 Nov, 2015

1 commit

  • This patch basically reverts the majority of patch 5407e24.
    That patch eliminated the gfs2_qadata structure in favor of just
    using the reservations structure. The problem with doing that is that
    it increases the size of the reservations structure. That is not an
    issue until it comes time to fold the reservations structure into the
    inode in memory so we know it's always there. By separating out the
    quota structure again, we aren't punishing the non-quota users by
    making all the inodes bigger, requiring more slab space. This patch
    creates a new slab area to allocate the quota stuff so it's managed
    a little more sanely.

    Signed-off-by: Bob Peterson

    Bob Peterson
     

17 Nov, 2015

1 commit

  • When gfs2 allocates an inode and its extended attribute block next to
    each other at inode create time, the inode's directory entry indicates
    that in de_rahead. In that case, we can readahead the extended
    attribute block when we read in the inode.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

30 Oct, 2015

1 commit

  • Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a
    gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in
    gl_lockref). Remove that define to make the references to gl_lockref.lock more
    obvious.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson

    Andreas Gruenbacher
     

04 Sep, 2015

3 commits

  • None of these statistics can meaningfully be negative, and the
    numerator for do_div() must have the type u64. The generic
    implementation of do_div() used on some 32-bit architectures asserts
    that, resulting in a compiler error in gfs2_rgrp_congested().

    Fixes: 0166b197c2ed ("GFS2: Average in only non-zero round-trip times ...")

    Signed-off-by: Ben Hutchings
    Signed-off-by: Bob Peterson
    Acked-by: Andreas Gruenbacher

    Ben Hutchings
     
  • This patch changes the glock hash table from a normal hash table to
    a resizable hash table, which scales better. This also simplifies
    a lot of code.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Steven Whitehouse

    Bob Peterson
     
  • What uniquely identifies a glock in the glock hash table is not
    gl_name, but gl_name and its superblock pointer. This patch makes
    the gl_name field correspond to a unique glock identifier. That will
    allow us to simplify hashing with a future patch, since the hash
    algorithm can then take the gl_name and hash its components in one
    operation.

    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Steven Whitehouse

    Bob Peterson
     

19 Jun, 2015

1 commit

  • The glocks used for resource groups often come and go hundreds of
    thousands of times per second. Adding them to the lru list just
    adds unnecessary contention for the lru_lock spin_lock, especially
    considering we're almost certainly going to re-use the glock and
    take it back off the lru microseconds later. We never want the
    glock shrinker to cull them anyway. This patch adds a new bit in
    the glops that determines which glock types get put onto the lru
    list and which ones don't.

    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Bob Peterson
     

03 Jun, 2015

1 commit

  • This patch makes the quota subsystem only report once that a
    particular user/group has exceeded their allotted quota.

    Previously, it was possible for a program to continuously try
    exceeding quota (despite receiving EDQUOT) and in turn trigger
    gfs2 to issue a kernel log message about quota exceed. In theory,
    this could get out of hand and flood the log and the filesystem
    hosting the log files.

    Signed-off-by: Abhi Das
    Signed-off-by: Bob Peterson

    Abhi Das
     

19 Mar, 2015

2 commits

  • struct gfs2_alloc_parms is passed to gfs2_quota_check() and
    gfs2_inplace_reserve() with ap->target containing the number of
    blocks being requested for allocation in the current operation.

    We add a new field to struct gfs2_alloc_parms called 'allowed'.
    gfs2_quota_check() and gfs2_inplace_reserve() return the max
    blocks allowed by quota and the max blocks allowed by the chosen
    rgrp respectively in 'allowed'.

    A new field 'min_target', when non-zero, tells gfs2_quota_check()
    and gfs2_inplace_reserve() to not return -EDQUOT/-ENOSPC when
    there are atleast 'min_target' blocks allowable/available. The
    assumption is that the caller is ok with just 'min_target' blocks
    and will likely proceed with allocating them.

    Signed-off-by: Abhi Das
    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Abhi Das
     
  • Use struct gfs2_alloc_parms as an argument to gfs2_quota_check()
    and gfs2_quota_lock_check() to check for quota violations while
    accounting for the new blocks requested by the current operation
    in ap->target.

    Previously, the number of new blocks requested during an operation
    were not accounted for during quota_check and would allow these
    operations to exceed quota. This was not very apparent since most
    operations allocated only 1 block at a time and quotas would get
    violated in the next operation. i.e. quota excess would only be by
    1 block or so. With fallocate, (where we allocate a bunch of blocks
    at once) the quota excess is non-trivial and is addressed by this
    patch.

    Signed-off-by: Abhi Das
    Signed-off-by: Bob Peterson
    Acked-by: Steven Whitehouse

    Abhi Das
     

17 Nov, 2014

1 commit

  • The current gfs2 freezing code is considerably more complicated than it
    should be because it doesn't use the vfs freezing code on any node except
    the one that begins the freeze. This is because it needs to acquire a
    cluster glock before calling the vfs code to prevent a deadlock, and
    without the new freeze_super and thaw_super hooks, that was impossible. To
    deal with the issue, gfs2 had to do some hacky locking tricks to make sure
    that a frozen node couldn't be holding on a lock it needed to do the
    unfreeze ioctl.

    This patch makes use of the new hooks to simply the gfs2 locking code. Now,
    all the nodes in the cluster freeze and thaw in exactly the same way. Every
    node in the cluster caches the freeze glock in the shared state. The new
    freeze_super hook allows the freezing node to grab this freeze glock in
    the exclusive state without first calling the vfs freeze_super function.
    All the nodes in the cluster see this lock change, and call the vfs
    freeze_super function. The vfs locking code guarantees that the nodes can't
    get stuck holding the glocks necessary to unfreeze the system. To
    unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
    glock. Again, all the nodes notice this, reacquire the glock in shared mode
    and call the vfs thaw_super function.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

04 Nov, 2014

1 commit


11 Sep, 2014

1 commit

  • MAXQUOTAS value defines maximum number of quota types VFS supports.
    This isn't necessarily the number of types gfs2 supports and with
    addition of project quotas these two numbers stop matching. So make gfs2
    use its private definition.

    CC: cluster-devel@redhat.com
    Signed-off-by: Jan Kara
    Signed-off-by: Steven Whitehouse

    Jan Kara
     

03 Jun, 2014

1 commit


14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In it's place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work for flushing out
    the and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesytem simply involes dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

31 Mar, 2014

1 commit

  • When gfs2_create_inode() fails due to quota violation, the VFS
    inode is not completely uninitialized. This can cause a list
    corruption error.

    This patch correctly uninitializes the VFS inode when a quota
    violation occurs in the gfs2_create_inode codepath.

    Resolves: rhbz#1059808
    Signed-off-by: Abhi Das
    Signed-off-by: Steven Whitehouse

    Abhi Das
     

07 Mar, 2014

1 commit

  • If multiple nodes fail and their recovery work runs simultaneously, they
    would use the same unprotected variables in the superblock. For example,
    they would stomp on each other's revoked blocks lists, which resulted
    in file system metadata corruption. This patch moves the necessary
    variables so that each journal has its own separate area for tracking
    its journal replay.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

03 Mar, 2014

1 commit

  • This patch fixes a long standing issue in mapping the journal
    extents. Most journals will consist of only a single extent,
    and although the cache took account of that by merging extents,
    it did not actually map large extents, but instead was doing a
    block by block mapping. Since the journal was only being mapped
    on mount, this was not normally noticeable.

    With the updated code, it is now possible to use the same extent
    mapping system during journal recovery (which will be added in a
    later patch). This will allow checking of the integrity of the
    journal before any reply of the journal content is attempted. For
    this reason the code is moving to bmap.c, since it will be used
    more widely in due course.

    An exercise left for the reader is to compare the new function
    gfs2_map_journal_extents() with gfs2_write_alloc_required()

    Additionally, should there be a failure, the error reporting is
    also updated to show more detail about what went wrong.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

25 Feb, 2014

2 commits

  • Now we have a master transaction into which other transactions
    are merged, the accounting can be done using this master
    transaction. We no longer require the superblock fields which
    were being used for this function.

    In addition, this allows for a clean up in calc_reserved()
    making it rather easier understand. Also, by reducing the
    number of variables used to track the buffers being added
    and removed from the journal, a number of error checks are
    now no longer required.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Over time, we hope to be able to improve the concurrency available
    in the log code. This is one small step towards that, by moving
    the buffer lists from the super block, and into the transaction
    structure, so that each transaction builds its own buffer lists.

    At transaction commit time, the buffer lists are merged into
    the currently accumulating transaction. That transaction then
    is passed into the before and after commit functions at journal
    flush time. Thus there should be no change in overall behaviour
    yet.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

21 Feb, 2014

1 commit


16 Jan, 2014

1 commit

  • Al Viro has tactfully pointed out that we are using the incorrect
    error code in some cases. This patch fixes that, and also removes
    the (unused) return value for glock dumping.

    > * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
    > "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
    > we don't support STREAMS it's probably fair game, but... what the hell?

    Signed-off-by: Steven Whitehouse
    Cc: Al Viro

    Steven Whitehouse
     

15 Jan, 2014

3 commits

  • Gradually, the global qd_lock is being used for less and less.
    After this patch it will only be used for the per super block
    list whose purpose is to allow syncing of changes back to the
    master quota file from the local quota changes file. Fixing
    up that process to make it more efficient will be the subject
    of a later patch, however this patch removes another barrier
    to doing that.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • Quota slot allocation has historically used a vector of pages
    and a set of homegrown find/test/set/clear bit functions. Since
    the size of the bitmap is likely to be based on the default
    qc file size, thats a couple of pages at most. So we ought
    to be able to allocate that as a single chunk, with a vmalloc
    fallback, just in case of memory fragmentation.

    We are then able to use the kernel's own find/test/set/clear
    bit functions, rather than rolling our own.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • Prior to this patch, GFS2 kept all the quotas for each
    super block in a single linked list. This is rather slow
    when there are large numbers of quotas.

    This patch introduces a hlist_bl based hash table, similar
    to the one used for glocks. The initial look up of the quota
    is now lockless in the case where it is already cached,
    although we still have to take the per quota spinlock in
    order to bump the ref count. Either way though, this is a
    big improvement on what was there before.

    The qd_lock and the per super block list is preserved, for
    the time being. However it is intended that since this is no
    longer used for its original role, it should be possible to
    shrink the number of items on that list in due course and
    remove the requirement to take qd_lock in qd_get.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das
    Cc: Paul E. McKenney

    Steven Whitehouse
     

03 Jan, 2014

3 commits

  • Prior to this patch, GFS2 had one address space for each rgrp,
    stored in the glock. This patch changes them to use a single
    address space in the super block. This therefore saves
    (sizeof(struct address_space) * nr_of_rgrps) bytes of memory
    and for large filesystems, that can be significant.

    It would be nice to be able to do something similar and merge
    the inode metadata address space into the same global
    address space. However, that is rather more complicated as the
    on-disk location doesn't have a 1:1 mapping with the inodes in
    general. So while it could be done, it will be a more complicated
    operation as it requires changing a lot more code paths.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Each rgrp header is represented as a single extent on disk, so we
    can calculate the position within the address space, since we are
    using address spaces mapped 1:1 to the disk. This means that it
    is possible to use the range based versions of filemap_fdatawrite/wait
    and for invalidating the page cache.

    Our eventual intent is to then be able to merge the address spaces
    used for rgrps into a single address space, rather than to have
    one for each glock, saving memory and reducing complexity.

    Since during umount, the rgrp structures are disposed of before
    the glocks, we need to store the extent information in the glock
    so that is is available for a final invalidation. This patch uses
    a field which is otherwise unused in rgrp glocks to do that, so
    that we do not have to expand the size of a glock.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • With the preceding patch, we started accepting block reservations
    smaller than the ideal size, which requires a lot more parsing of the
    bitmaps. To reduce the amount of bitmap searching, this patch
    implements a scheme whereby each rgrp keeps track of the point
    at this multi-block reservations will fail.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

04 Nov, 2013

2 commits

  • By using the generic list_lru code, we can now separate the
    per sb quota list locking from the lru locking. The lru
    lock is made into the inner-most lock.

    As a result of this new lock order, we may occasionally see
    items on the per-sb quota list which are "dead" so that the
    two places where we traverse that list are updated to take
    account of that.

    As a result of this patch, the gfs2 quota shrinker is now
    NUMA zone aware, and we are also laying the foundations for
    further improvments in due course.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Abhijith Das
    Tested-by: Abhijith Das
    Cc: Dave Chinner

    Steven Whitehouse
     
  • This patch adds reflink support to the quota data cache. It
    looks a bit strange because we still don't have a sensible
    split in the lookup by id and the lru list. That is coming in
    later patches though.

    The intent here is just to swap the current ref count for
    reflinks in all cases with as little as possible other change.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Abhijith Das
    Tested-by: Abhijith Das

    Steven Whitehouse
     

15 Oct, 2013

1 commit

  • Currently glocks have an atomic reference count and also a spinlock
    which covers various internal fields, such as the state. This intent of
    this patch is to replace the spinlock and the atomic reference count
    with a lockref structure. This contains a spinlock which we can continue
    to use as before, and a reference counter which is used in conjuction
    with the spinlock to replace the previous atomic counter.

    As a result of this there are some new rules for reference counting on
    glocks. We need to distinguish between reference count changes under
    gl_spin (which are now just increment or decrement of the new counter,
    provided the count cannot hit zero) and those which are outside of
    gl_spin, but which now take gl_spin internally.

    The conversion is relatively straight forward. There is probably some
    further clean up which can be done, but the priority at this stage is to
    make the change in as simple a manner as possible.

    A consequence of this change is that the reference count is being
    decoupled from the lru list processing. This should allow future
    adoption of the lru_list code with glocks in due course.

    The reason for using the "dead" state and not just relying on 0 being
    the "invalid state" is so that in due course 0 ref counts can be
    allowable. The intent is to eventually be able to remove the ref count
    changes which are currently hidden away in state_change().

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

04 Oct, 2013

2 commits

  • Now that gfs2_quota_sync can be potentially called from multiple
    threads, we should protect this bit of code, and the sync generation
    number in particular in order to ensure that there are no races
    when syncing quotas.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • There is no need for a paramater which relates to the internals
    of quota to be exposed to users. The only possible use would be
    to turn it up so large that the memory allocation fails. So lets
    remove it and set it to a sensible value which ensures that we
    don't ask for multipage allocations.

    Currently the size of struct gfs2_holder means that the caluclated
    value is identical to the previous default value, so there should
    be no functional change.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     

02 Oct, 2013

1 commit

  • This patch adds a structure to contain allocation parameters with
    the intention of future expansion of this structure. The idea is
    that we should be able to add more information about the allocation
    in the future in order to allow the allocator to make a better job
    of placing the requests on-disk.

    There is no functional difference from applying this patch.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

18 Sep, 2013

1 commit

  • This is a respin of the original patch. As Steve pointed out, the
    introduction of field bii makes it easy to eliminate bi itself.
    This revised patch does just that, replacing bi with bii.

    This patch adds a new field to the rbm structure, called bii,
    which is an index into the array of bitmaps for an rgrp.
    This replaces *bi which was a pointer to the bitmap.
    This is being done for further optimizations.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     

17 Sep, 2013

1 commit

  • This patch introduces a new field in the bitmap structure called
    bi_blocks. Its purpose is to save us from constantly multiplying
    bi_len by the constant GFS2_NBBY. It also paves the way for more
    optimization in a future patch.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson