17 Nov, 2014

1 commit

  • The current gfs2 freezing code is considerably more complicated than it
    should be because it doesn't use the vfs freezing code on any node except
    the one that begins the freeze. This is because it needs to acquire a
    cluster glock before calling the vfs code to prevent a deadlock, and
    without the new freeze_super and thaw_super hooks, that was impossible. To
    deal with the issue, gfs2 had to do some hacky locking tricks to make sure
    that a frozen node couldn't be holding on to a lock it needed to do the
    unfreeze ioctl.

    This patch makes use of the new hooks to simplify the gfs2 locking code. Now,
    all the nodes in the cluster freeze and thaw in exactly the same way. Every
    node in the cluster caches the freeze glock in the shared state. The new
    freeze_super hook allows the freezing node to grab this freeze glock in
    the exclusive state without first calling the vfs freeze_super function.
    All the nodes in the cluster see this lock change, and call the vfs
    freeze_super function. The vfs locking code guarantees that the nodes can't
    get stuck holding the glocks necessary to unfreeze the system. To
    unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
    glock. Again, all the nodes notice this, reacquire the glock in shared mode
    and call the vfs thaw_super function.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
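
The freeze and thaw sequence described in the commit above can be sketched as a small user-space simulation. This is only an illustration under stated assumptions: the `sim_*` names are hypothetical, the `vfs_frozen` flag stands in for the vfs freeze_super/thaw_super calls, and no real glock or dlm interaction is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states for one node's view of the cluster-wide freeze glock. */
enum sim_lock_state { SIM_UNLOCKED, SIM_SHARED, SIM_EXCLUSIVE };

struct sim_node {
	enum sim_lock_state freeze_gl; /* cached state of the freeze glock */
	bool vfs_frozen;               /* stands in for the vfs freeze/thaw call */
};

/* Mount: every node caches the freeze glock in the shared state. */
static void sim_mount(struct sim_node *nodes, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		nodes[i].freeze_gl = SIM_SHARED;
		nodes[i].vfs_frozen = false;
	}
}

/*
 * Freeze: the initiating node takes the glock exclusively, so the other
 * nodes lose their shared copy. Every node then freezes in the same way.
 */
static void sim_freeze(struct sim_node *nodes, size_t n, size_t initiator)
{
	assert(initiator < n);
	for (size_t i = 0; i < n; i++) {
		nodes[i].freeze_gl = (i == initiator) ? SIM_EXCLUSIVE
						      : SIM_UNLOCKED;
		nodes[i].vfs_frozen = true;
	}
}

/* Thaw: the glock is dropped; every node reacquires it shared and thaws. */
static void sim_thaw(struct sim_node *nodes, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		nodes[i].freeze_gl = SIM_SHARED;
		nodes[i].vfs_frozen = false;
	}
}
```

The point of the sketch is that the initiator and the other nodes run the same freeze and thaw steps; only the glock state they end up holding differs.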
     

04 Nov, 2014

1 commit


11 Sep, 2014

1 commit

  • MAXQUOTAS value defines maximum number of quota types VFS supports.
    This isn't necessarily the number of types gfs2 supports and with
    addition of project quotas these two numbers stop matching. So make gfs2
    use its private definition.

    CC: cluster-devel@redhat.com
    Signed-off-by: Jan Kara
    Signed-off-by: Steven Whitehouse

    Jan Kara
     

03 Jun, 2014

1 commit


14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In its place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work for flushing out
    and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesystem simply involves dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

31 Mar, 2014

1 commit

  • When gfs2_create_inode() fails due to quota violation, the VFS
    inode is not completely uninitialized. This can cause a list
    corruption error.

    This patch correctly uninitializes the VFS inode when a quota
    violation occurs in the gfs2_create_inode codepath.

    Resolves: rhbz#1059808
    Signed-off-by: Abhi Das
    Signed-off-by: Steven Whitehouse

    Abhi Das
     

07 Mar, 2014

1 commit

  • If multiple nodes fail and their recovery work runs simultaneously, they
    would use the same unprotected variables in the superblock. For example,
    they would stomp on each other's revoked blocks lists, which resulted
    in file system metadata corruption. This patch moves the necessary
    variables so that each journal has its own separate area for tracking
    its journal replay.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
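
A minimal sketch of the idea in the commit above: replay state that used to be shared is kept per journal, so two recoveries running at once cannot overwrite each other's revoke lists. The `sk_*` names, the fixed-size array and the helper functions are assumptions made for illustration and are not the gfs2 structures.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_REVOKES 8 /* arbitrary small bound for the sketch */

/* Hypothetical per-journal replay area (one instance per journal). */
struct sk_journal {
	unsigned int jid;
	size_t nr_revokes;
	uint64_t revokes[SK_MAX_REVOKES];
};

/* Record a revoked block for this journal only. Returns 0, or -1 if full. */
static int sk_add_revoke(struct sk_journal *jd, uint64_t blkno)
{
	if (jd->nr_revokes >= SK_MAX_REVOKES)
		return -1;
	jd->revokes[jd->nr_revokes++] = blkno;
	return 0;
}

/* Returns 1 if blkno was revoked in this journal, 0 otherwise. */
static int sk_is_revoked(const struct sk_journal *jd, uint64_t blkno)
{
	for (size_t i = 0; i < jd->nr_revokes; i++) {
		if (jd->revokes[i] == blkno)
			return 1;
	}
	return 0;
}
```

Because each journal carries its own list, a revoke recorded while replaying one journal is invisible to the replay of another.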
     

03 Mar, 2014

1 commit

  • This patch fixes a long standing issue in mapping the journal
    extents. Most journals will consist of only a single extent,
    and although the cache took account of that by merging extents,
    it did not actually map large extents, but instead was doing a
    block by block mapping. Since the journal was only being mapped
    on mount, this was not normally noticeable.

    With the updated code, it is now possible to use the same extent
    mapping system during journal recovery (which will be added in a
    later patch). This will allow checking of the integrity of the
    journal before any replay of the journal content is attempted. For
    this reason the code is moving to bmap.c, since it will be used
    more widely in due course.

    An exercise left for the reader is to compare the new function
    gfs2_map_journal_extents() with gfs2_write_alloc_required().

    Additionally, should there be a failure, the error reporting is
    also updated to show more detail about what went wrong.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
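
To illustrate why an extent list is cheaper than a block by block mapping, here is a hedged user-space sketch. It merges a per-block mapping into extents and then translates a journal-relative block through them. The `sk_*` names and the array-based input are assumptions for the example; this is not the code of gfs2_map_journal_extents().

```c
#include <stddef.h>
#include <stdint.h>

struct sk_extent {
	uint64_t lblock; /* first logical (journal-relative) block */
	uint64_t dblock; /* first disk block */
	uint64_t blocks; /* length of the extent in blocks */
};

/*
 * Merge a per-block mapping (dblocks[i] is the disk block of logical
 * block i) into extents. Returns the extent count, or -1 if out[] is
 * too small.
 */
static int sk_build_extents(const uint64_t *dblocks, size_t n,
			    struct sk_extent *out, size_t max_out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (count > 0) {
			struct sk_extent *last = &out[count - 1];

			/* Contiguous on disk: grow the current extent. */
			if (last->dblock + last->blocks == dblocks[i]) {
				last->blocks++;
				continue;
			}
		}
		if (count == max_out)
			return -1;
		out[count].lblock = i;
		out[count].dblock = dblocks[i];
		out[count].blocks = 1;
		count++;
	}
	return (int)count;
}

/* Translate a logical block via the extent list. Returns 0 or -1. */
static int sk_lookup(const struct sk_extent *ext, size_t n,
		     uint64_t lblock, uint64_t *dblock)
{
	for (size_t i = 0; i < n; i++) {
		if (lblock >= ext[i].lblock &&
		    lblock - ext[i].lblock < ext[i].blocks) {
			*dblock = ext[i].dblock + (lblock - ext[i].lblock);
			return 0;
		}
	}
	return -1;
}
```

A journal stored as a single run on disk collapses to one extent, so a lookup is one range check instead of a walk over every block.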
     

25 Feb, 2014

2 commits

  • Now we have a master transaction into which other transactions
    are merged, the accounting can be done using this master
    transaction. We no longer require the superblock fields which
    were being used for this function.

    In addition, this allows for a clean up in calc_reserved()
    making it rather easier to understand. Also, by reducing the
    number of variables used to track the buffers being added
    and removed from the journal, a number of error checks are
    now no longer required.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Over time, we hope to be able to improve the concurrency available
    in the log code. This is one small step towards that, by moving
    the buffer lists from the super block, and into the transaction
    structure, so that each transaction builds its own buffer lists.

    At transaction commit time, the buffer lists are merged into
    the currently accumulating transaction. That transaction then
    is passed into the before and after commit functions at journal
    flush time. Thus there should be no change in overall behaviour
    yet.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
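
The buffer-list commit above describes each transaction building its own list, which is then merged into the accumulating transaction at commit time. A minimal user-space sketch of that merge follows. The circular list is written out by hand in the style of the kernel's list helpers, and every `sk_*` name is hypothetical.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of the kernel's lists. */
struct sk_list {
	struct sk_list *next, *prev;
};

static void sk_list_init(struct sk_list *h)
{
	h->next = h;
	h->prev = h;
}

static int sk_list_empty(const struct sk_list *h)
{
	return h->next == h;
}

static void sk_list_add_tail(struct sk_list *item, struct sk_list *h)
{
	item->prev = h->prev;
	item->next = h;
	h->prev->next = item;
	h->prev = item;
}

/* Move everything on 'from' to the tail of 'to', leaving 'from' empty. */
static void sk_list_splice_tail_init(struct sk_list *from, struct sk_list *to)
{
	if (sk_list_empty(from))
		return;

	struct sk_list *first = from->next;
	struct sk_list *last = from->prev;

	first->prev = to->prev;
	to->prev->next = first;
	last->next = to;
	to->prev = last;
	sk_list_init(from);
}

/* Hypothetical transaction carrying its own buffer list and count. */
struct sk_trans {
	struct sk_list buf_list;
	unsigned int num_buf;
};

/* Commit: fold the finished transaction into the accumulating one. */
static void sk_trans_merge(struct sk_trans *acc, struct sk_trans *tr)
{
	sk_list_splice_tail_init(&tr->buf_list, &acc->buf_list);
	acc->num_buf += tr->num_buf;
	tr->num_buf = 0;
}

static unsigned int sk_list_count(const struct sk_list *h)
{
	unsigned int n = 0;

	for (const struct sk_list *p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

The merge is a constant-time splice, so building lists per transaction adds no per-buffer work at commit.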
     

21 Feb, 2014

1 commit


16 Jan, 2014

1 commit

  • Al Viro has tactfully pointed out that we are using the incorrect
    error code in some cases. This patch fixes that, and also removes
    the (unused) return value for glock dumping.

    > * gfs2_iget() - ENOBUFS instead of ENOMEM. ENOBUFS is
    > "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
    > we don't support STREAMS it's probably fair game, but... what the hell?

    Signed-off-by: Steven Whitehouse
    Cc: Al Viro

    Steven Whitehouse
     

15 Jan, 2014

3 commits

  • Gradually, the global qd_lock is being used for less and less.
    After this patch it will only be used for the per super block
    list whose purpose is to allow syncing of changes back to the
    master quota file from the local quota changes file. Fixing
    up that process to make it more efficient will be the subject
    of a later patch, however this patch removes another barrier
    to doing that.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • Quota slot allocation has historically used a vector of pages
    and a set of homegrown find/test/set/clear bit functions. Since
    the size of the bitmap is likely to be based on the default
    qc file size, that's a couple of pages at most. So we ought
    to be able to allocate that as a single chunk, with a vmalloc
    fallback, just in case of memory fragmentation.

    We are then able to use the kernel's own find/test/set/clear
    bit functions, rather than rolling our own.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • Prior to this patch, GFS2 kept all the quotas for each
    super block in a single linked list. This is rather slow
    when there are large numbers of quotas.

    This patch introduces a hlist_bl based hash table, similar
    to the one used for glocks. The initial look up of the quota
    is now lockless in the case where it is already cached,
    although we still have to take the per quota spinlock in
    order to bump the ref count. Either way though, this is a
    big improvement on what was there before.

    The qd_lock and the per super block list are preserved, for
    the time being. However it is intended that since this is no
    longer used for its original role, it should be possible to
    shrink the number of items on that list in due course and
    remove the requirement to take qd_lock in qd_get.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das
    Cc: Paul E. McKenney

    Steven Whitehouse
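
The hash table commit above replaces a walk over one long list with a lookup in a short per-bucket chain. The following user-space sketch shows only that lookup shape. It assumes a plain chained table keyed by quota id with hypothetical `sk_*` names, and it omits the hlist_bl bucket locks, the RCU-style lockless read and the per quota spinlock that the commit message mentions.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_HASH_SHIFT 4
#define SK_HASH_SIZE (1u << SK_HASH_SHIFT)

/* Hypothetical cached quota entry. */
struct sk_quota {
	uint32_t id;
	int refcount;
	struct sk_quota *next; /* next entry in the same bucket */
};

struct sk_quota_table {
	struct sk_quota *buckets[SK_HASH_SIZE];
};

/* Multiplicative hash: keep the top SK_HASH_SHIFT bits of the product. */
static unsigned int sk_hash(uint32_t id)
{
	return (uint32_t)(id * UINT32_C(2654435761)) >> (32 - SK_HASH_SHIFT);
}

/* Look up an id; on a hit, bump the reference count and return the entry. */
static struct sk_quota *sk_quota_find(struct sk_quota_table *t, uint32_t id)
{
	for (struct sk_quota *qd = t->buckets[sk_hash(id)]; qd; qd = qd->next) {
		if (qd->id == id) {
			qd->refcount++;
			return qd;
		}
	}
	return NULL;
}

/* Insert a new entry at the head of its bucket with one reference. */
static void sk_quota_insert(struct sk_quota_table *t, struct sk_quota *qd)
{
	unsigned int h = sk_hash(qd->id);

	qd->refcount = 1;
	qd->next = t->buckets[h];
	t->buckets[h] = qd;
}
```

With a reasonable hash, the chain walked per lookup stays short even when the total number of cached quotas is large.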
     

03 Jan, 2014

3 commits

  • Prior to this patch, GFS2 had one address space for each rgrp,
    stored in the glock. This patch changes them to use a single
    address space in the super block. This therefore saves
    (sizeof(struct address_space) * nr_of_rgrps) bytes of memory
    and for large filesystems, that can be significant.

    It would be nice to be able to do something similar and merge
    the inode metadata address space into the same global
    address space. However, that is rather more complicated as the
    on-disk location doesn't have a 1:1 mapping with the inodes in
    general. So while it could be done, it will be a more complicated
    operation as it requires changing a lot more code paths.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Each rgrp header is represented as a single extent on disk, so we
    can calculate the position within the address space, since we are
    using address spaces mapped 1:1 to the disk. This means that it
    is possible to use the range based versions of filemap_fdatawrite/wait
    and for invalidating the page cache.

    Our eventual intent is to then be able to merge the address spaces
    used for rgrps into a single address space, rather than to have
    one for each glock, saving memory and reducing complexity.

    Since during umount, the rgrp structures are disposed of before
    the glocks, we need to store the extent information in the glock
    so that it is available for a final invalidation. This patch uses
    a field which is otherwise unused in rgrp glocks to do that, so
    that we do not have to expand the size of a glock.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • With the preceding patch, we started accepting block reservations
    smaller than the ideal size, which requires a lot more parsing of the
    bitmaps. To reduce the amount of bitmap searching, this patch
    implements a scheme whereby each rgrp keeps track of the point
    at which multi-block reservations will fail.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
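
One way to picture the failure-point scheme from the reservation commit above is the sketch below. The commit message does not spell out the bookkeeping, so the details are assumptions: the field `extfail_pt`, the reset when blocks are freed, and all `sk_*` names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-rgrp record of where multi-block reservations fail. */
struct sk_rgrp {
	/* Smallest reservation size known to fail; UINT32_MAX if none. */
	uint32_t extfail_pt;
};

static void sk_rgrp_init(struct sk_rgrp *rg)
{
	rg->extfail_pt = UINT32_MAX;
}

/* Returns 1 if a request of 'want' blocks is already known to fail. */
static int sk_should_skip(const struct sk_rgrp *rg, uint32_t want)
{
	return want >= rg->extfail_pt;
}

/* A bitmap search for 'want' contiguous blocks failed: remember it. */
static void sk_note_failure(struct sk_rgrp *rg, uint32_t want)
{
	if (want < rg->extfail_pt)
		rg->extfail_pt = want;
}

/* Blocks were freed, so earlier failures no longer tell us anything. */
static void sk_blocks_freed(struct sk_rgrp *rg)
{
	rg->extfail_pt = UINT32_MAX;
}
```

A caller would check `sk_should_skip()` before searching the bitmaps, which is where the saving in bitmap parsing would come from.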
     

04 Nov, 2013

2 commits

  • By using the generic list_lru code, we can now separate the
    per sb quota list locking from the lru locking. The lru
    lock is made into the inner-most lock.

    As a result of this new lock order, we may occasionally see
    items on the per-sb quota list which are "dead" so that the
    two places where we traverse that list are updated to take
    account of that.

    As a result of this patch, the gfs2 quota shrinker is now
    NUMA zone aware, and we are also laying the foundations for
    further improvements in due course.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Abhijith Das
    Tested-by: Abhijith Das
    Cc: Dave Chinner

    Steven Whitehouse
     
  • This patch adds lockref support to the quota data cache. It
    looks a bit strange because we still don't have a sensible
    split between the lookup by id and the lru list. That is coming
    in later patches though.

    The intent here is just to swap the current ref count for
    lockrefs in all cases with as little other change as possible.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Abhijith Das
    Tested-by: Abhijith Das

    Steven Whitehouse
     

15 Oct, 2013

1 commit

  • Currently glocks have an atomic reference count and also a spinlock
    which covers various internal fields, such as the state. The intent of
    this patch is to replace the spinlock and the atomic reference count
    with a lockref structure. This contains a spinlock which we can continue
    to use as before, and a reference counter which is used in conjunction
    with the spinlock to replace the previous atomic counter.

    As a result of this there are some new rules for reference counting on
    glocks. We need to distinguish between reference count changes under
    gl_spin (which are now just increment or decrement of the new counter,
    provided the count cannot hit zero) and those which are outside of
    gl_spin, but which now take gl_spin internally.

    The conversion is relatively straightforward. There is probably some
    further clean up which can be done, but the priority at this stage is to
    make the change in as simple a manner as possible.

    A consequence of this change is that the reference count is being
    decoupled from the lru list processing. This should allow future
    adoption of the lru_list code with glocks in due course.

    The reason for using the "dead" state and not just relying on 0 being
    the "invalid state" is so that in due course 0 ref counts can be
    allowable. The intent is to eventually be able to remove the ref count
    changes which are currently hidden away in state_change().

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
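
The combination of a lock and a counter with a separate "dead" state, as described in the commit above, can be sketched in user space as follows. This is not the kernel's lockref implementation: the `sk_*` names, the C11 `atomic_flag` standing in for the spinlock, and the choice of -128 as the dead marker are all assumptions of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>

#define SK_DEAD (-128) /* arbitrary negative marker for a dead object */

struct sk_lockref {
	atomic_flag lock; /* stands in for the spinlock */
	int count;        /* reference count, protected by lock */
};

static void sk_lockref_init(struct sk_lockref *lr, int count)
{
	atomic_flag_clear(&lr->lock);
	lr->count = count;
}

static void sk_lock(struct sk_lockref *lr)
{
	while (atomic_flag_test_and_set(&lr->lock))
		; /* spin until the flag was previously clear */
}

static void sk_unlock(struct sk_lockref *lr)
{
	atomic_flag_clear(&lr->lock);
}

/* Take a reference unless the object is dead. Returns 1 on success. */
static int sk_get_not_dead(struct sk_lockref *lr)
{
	int ok = 0;

	sk_lock(lr);
	if (lr->count >= 0) { /* a count of 0 is still a live object */
		lr->count++;
		ok = 1;
	}
	sk_unlock(lr);
	return ok;
}

/* Drop a reference; the last put marks the object dead. Returns 1 then. */
static int sk_put(struct sk_lockref *lr)
{
	int dead = 0;

	sk_lock(lr);
	assert(lr->count > 0);
	if (--lr->count == 0) {
		lr->count = SK_DEAD;
		dead = 1;
	}
	sk_unlock(lr);
	return dead;
}
```

Because liveness is tested with `count >= 0` instead of `count != 0`, a later change could allow a live object with a zero count without touching the lookup path, which matches the reasoning given in the commit for using a distinct dead state.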
     

04 Oct, 2013

2 commits

  • Now that gfs2_quota_sync can be potentially called from multiple
    threads, we should protect this bit of code, and the sync generation
    number in particular in order to ensure that there are no races
    when syncing quotas.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • There is no need for a parameter which relates to the internals
    of quota to be exposed to users. The only possible use would be
    to turn it up so large that the memory allocation fails. So let's
    remove it and set it to a sensible value which ensures that we
    don't ask for multipage allocations.

    Currently the size of struct gfs2_holder means that the calculated
    value is identical to the previous default value, so there should
    be no functional change.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     

02 Oct, 2013

1 commit

  • This patch adds a structure to contain allocation parameters with
    the intention of future expansion of this structure. The idea is
    that we should be able to add more information about the allocation
    in the future in order to allow the allocator to make a better job
    of placing the requests on-disk.

    There is no functional difference from applying this patch.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

18 Sep, 2013

1 commit

  • This is a respin of the original patch. As Steve pointed out, the
    introduction of field bii makes it easy to eliminate bi itself.
    This revised patch does just that, replacing bi with bii.

    This patch adds a new field to the rbm structure, called bii,
    which is an index into the array of bitmaps for an rgrp.
    This replaces *bi which was a pointer to the bitmap.
    This is being done for further optimizations.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
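
A short sketch of what replacing a bitmap pointer with an index can look like. The structures below are simplified stand-ins with hypothetical `sk_*` names, not the gfs2 definitions; the accessor shows how the pointer can be recomputed on demand from the rgrp and the index.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the bitmap, rgrp and rbm structures. */
struct sk_bitmap {
	uint32_t bi_start;
	uint32_t bi_len;
};

struct sk_rgrp {
	struct sk_bitmap *bits; /* array of bitmaps for this rgrp */
	int nr_bits;            /* number of entries in bits[] */
};

struct sk_rbm {
	struct sk_rgrp *rgd;
	uint32_t offset; /* position within the current bitmap */
	int bii;         /* index into rgd->bits, replacing a pointer */
};

/* Recover the bitmap pointer from the index when it is needed. */
static struct sk_bitmap *sk_rbm_bi(const struct sk_rbm *rbm)
{
	assert(rbm->bii >= 0 && rbm->bii < rbm->rgd->nr_bits);
	return rbm->rgd->bits + rbm->bii;
}

/* Advance to the next bitmap, wrapping to the first. Returns 1 on wrap. */
static int sk_rbm_next(struct sk_rbm *rbm)
{
	rbm->offset = 0;
	rbm->bii++;
	if (rbm->bii == rbm->rgd->nr_bits) {
		rbm->bii = 0;
		return 1;
	}
	return 0;
}
```

With an index, stepping to the next bitmap and detecting wrap-around are plain integer operations, and comparing two positions no longer involves pointer arithmetic.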
     

17 Sep, 2013

1 commit

  • This patch introduces a new field in the bitmap structure called
    bi_blocks. Its purpose is to save us from constantly multiplying
    bi_len by the constant GFS2_NBBY. It also paves the way for more
    optimization in a future patch.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
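
The saving described in the commit above amounts to computing the product once and storing it. A minimal sketch, assuming the blocks-per-byte constant is 4 (two bitmap bits per block); `SK_NBBY` and the `sk_*` names are stand-ins and not the gfs2 definitions.

```c
#include <stdint.h>

#define SK_NBBY 4u /* assumed: blocks described by one bitmap byte */

struct sk_bitmap {
	uint32_t bi_len;    /* bitmap length in bytes */
	uint32_t bi_blocks; /* cached bi_len * SK_NBBY */
};

/* Compute the block count once, when the bitmap is set up. */
static void sk_bitmap_init(struct sk_bitmap *bi, uint32_t len)
{
	bi->bi_len = len;
	bi->bi_blocks = len * SK_NBBY;
}

/* Hot-path check that no longer needs a multiplication. */
static int sk_offset_valid(const struct sk_bitmap *bi, uint32_t offset)
{
	return offset < bi->bi_blocks;
}
```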
     

10 Apr, 2013

1 commit

  • This patch adds a bool indicating whether the demote
    request was originated locally or remotely. This is then
    used by the iopen ->go_callback() to make 100% sure that
    it will only respond to remote callbacks.

    Since ->evict_inode() uses GL_NOCACHE when it attempts to
    get an exclusive lock on the iopen lock, this may result
    in extra scheduling of the workqueue in case that the
    exclusive promotion request failed. This patch prevents
    that from happening.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

08 Apr, 2013

1 commit

  • In order to allow transactions and log flushes to happen at the same
    time, gfs2 needs to move the transaction accounting and active items
    list code into the gfs2_trans structure. As a first step toward this,
    this patch removes the gfs2_ail structure, and handles the active items
    list in the gfs2_trans structure. This keeps gfs2 from allocating an ail
    structure on log flushes, and gives us a structure that can later be used
    to store the transaction accounting outside of the gfs2 superblock
    structure.

    With this patch, at the end of a transaction, gfs2 will add the
    gfs2_trans structure to the superblock if there is not one already.
    This structure now has the active items fields that were previously in
    gfs2_ail. This is not necessary in the case where the transaction was
    simply used to add revokes, since these are never written outside of the
    journal, and thus, don't need an active items list.

    Also, in order to make sure that the transaction structure is not
    removed while it's still in use by gfs2_trans_end, unlocking the
    sd_log_flush_lock has to happen slightly later in ending the
    transaction.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

04 Apr, 2013

1 commit


26 Feb, 2013

1 commit

  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhancements to the user
    namespace. reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     

13 Feb, 2013

2 commits


29 Jan, 2013

3 commits

  • Instead of using a list of buffers to write ahead of the journal
    flush, this now uses a list of inodes and calls ->writepages
    via filemap_fdatawrite() in order to achieve the same thing. For
    most use cases this results in a shorter ordered write list,
    as well as much larger i/os being issued.

    The ordered write list is sorted by inode number before writing
    in order to retain the disk block ordering between inodes as
    per the previous code.

    The previous ordered write code used to conflict in its assumptions
    about how to write out the disk blocks with mpage_writepages()
    so that with this updated version we can also use mpage_writepages()
    for GFS2's ordered write, writepages implementation. So we will
    also send larger i/os from writeback too.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The freeze code has not been looked at a lot recently. Upstream has
    moved on, and this is an attempt to catch us back up again. There
    is a vfs level interface for the freeze code which can be called
    from our (obsolete, but kept for backward compatibility purposes)
    sysfs freeze interface. This means freezing this way vs. doing it
    from the ioctl should now work in identical fashion.

    As a result of this, the freeze function is only called once
    and we can drop our own special purpose code for counting the
    number of freezes.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • This patch copies the body of gfs2_trans_add_bh into the two newly
    added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can
    then move the .lo_add functions from lops.c into trans.c and call
    them directly.

    As a result of this, we no longer need to use the .lo_add functions
    at all, so that is removed from the log operations structure.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

15 Nov, 2012

2 commits


14 Nov, 2012

1 commit

  • When unmounting, gfs2 does a full dlm_unlock operation on every
    cached lock. This can create a very large amount of work and can
    take a long time to complete. However, the vast majority of these
    dlm unlock operations are unnecessary because after all the unlocks
    are done, gfs2 leaves the dlm lockspace, which automatically clears
    the locks of the leaving node, without unlocking each one individually.
    So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
    remove the locks implicitly. The one exception is when the lock's lvb is
    being used. In this case, dlm_unlock is called because it may update the
    lvb of the resource.

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     

07 Nov, 2012

2 commits

  • [Editorial: This is a nit, but has been a minor irritation for a long time:]

    This patch renames glops structure item for go_xmote_th to go_sync.
    The functionality is unchanged; it's just for readability.

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
     
  • This patch is a rewrite of function gfs2_rbm_from_block. Rather than
    looping to find the right bitmap, the code now does a few simple
    math calculations.

    I compared the performance of both algorithms side by side and the new
    algorithm is noticeably faster. Sample instrumentation output from a
    "fast" machine:

    5 million calls: millisec spent: Orig: 166 New: 113
    5 million calls: millisec spent: Orig: 189 New: 114

    In addition, I ran postmark (on a somewhat slower CPU) before and after
    the new algorithm was put in place and postmark showed a decent
    improvement:

    Before the new algorithm:
    -------------------------
    Time:
    645 seconds total
    584 seconds of transactions (171 per second)

    Files:
    150087 created (232 per second)
    Creation alone: 100000 files (2083 per second)
    Mixed with transactions: 50087 files (85 per second)
    49995 read (85 per second)
    49991 appended (85 per second)
    150087 deleted (232 per second)
    Deletion alone: 100174 files (7705 per second)
    Mixed with transactions: 49913 files (85 per second)

    Data:
    273.42 megabytes read (434.08 kilobytes per second)
    852.13 megabytes written (1.32 megabytes per second)

    With the new algorithm:
    -----------------------
    Time:
    599 seconds total
    530 seconds of transactions (188 per second)

    Files:
    150087 created (250 per second)
    Creation alone: 100000 files (1886 per second)
    Mixed with transactions: 50087 files (94 per second)
    49995 read (94 per second)
    49991 appended (94 per second)
    150087 deleted (250 per second)
    Deletion alone: 100174 files (6260 per second)
    Mixed with transactions: 49913 files (94 per second)

    Data:
    273.42 megabytes read (467.42 kilobytes per second)
    852.13 megabytes written (1.42 megabytes per second)

    Signed-off-by: Bob Peterson
    Signed-off-by: Steven Whitehouse

    Bob Peterson
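
The difference between looping to find the right bitmap and computing it directly can be sketched as below. The layout is an assumption made for the example: the first bitmap covers `first_blocks` blocks and every later one covers `per_bitmap` blocks. The `sk_*` names are hypothetical and this is not the body of gfs2_rbm_from_block.

```c
#include <stdint.h>

/* Assumed layout: one short first bitmap, then equally sized bitmaps. */
struct sk_layout {
	uint32_t first_blocks; /* blocks covered by bitmap 0 */
	uint32_t per_bitmap;   /* blocks covered by each later bitmap */
	uint32_t nr_bitmaps;
};

/* Loop version: walk the bitmaps until the block falls inside one. */
static int sk_find_loop(const struct sk_layout *l, uint32_t rblock,
			uint32_t *index, uint32_t *offset)
{
	uint32_t start = 0;

	for (uint32_t i = 0; i < l->nr_bitmaps; i++) {
		uint32_t len = (i == 0) ? l->first_blocks : l->per_bitmap;

		if (rblock < start + len) {
			*index = i;
			*offset = rblock - start;
			return 0;
		}
		start += len;
	}
	return -1;
}

/* Arithmetic version: one comparison, one division, one remainder. */
static int sk_find_math(const struct sk_layout *l, uint32_t rblock,
			uint32_t *index, uint32_t *offset)
{
	if (l->nr_bitmaps == 0)
		return -1;
	if (rblock < l->first_blocks) {
		*index = 0;
		*offset = rblock;
		return 0;
	}
	if (l->nr_bitmaps < 2 || l->per_bitmap == 0)
		return -1;

	uint32_t rest = rblock - l->first_blocks;
	uint32_t i = 1 + rest / l->per_bitmap;

	if (i >= l->nr_bitmaps)
		return -1;
	*index = i;
	*offset = rest % l->per_bitmap;
	return 0;
}
```

Both functions return the same index and offset for every block in range; the arithmetic version does so in constant time regardless of how many bitmaps the rgrp has.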