15 Dec, 2015

1 commit

  • gfs2 currently returns 31 bits of filename hash as a cookie that readdir
    uses for an offset into the directory. When there are a large number of
    directory entries, the likelihood of a collision goes up way too
    quickly. GFS2 will now return cookies that are guaranteed unique for a
    while, and then fall back to using 30 bits of filename hash.
    Specifically, the directory leaf blocks are divided up into chunks based
    on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
    cookie is based off the chunk where it starts, in the linked list of
    leaf blocks that it hashes to (there are 131072 hash buckets). Directory
    entries will have unique cookies until they reach chunk 8192.
    Assuming the largest filenames possible, and the least efficient spacing
    possible, this new method will still be able to return unique cookies
    when the previous method has statistically more than a 99% chance of a
    collision. The non-unique cookies it falls back to are guaranteed to not
    collide with the unique cookies.

    unique cookies will be in this format:
    - 1 bit "0" to make sure the returned cookie is positive
    - 17 bits for the hash table index
    - 1 bit for the mode "0"
    - 13 bits for the offset

    non-unique cookies will be in this format:
    - 1 bit "0" to make sure the returned cookie is positive
    - 17 bits for the hash table index
    - 1 bit for the mode "1"
    - 13 more bits of the name hash
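
    The two layouts above can be sketched in C as follows. This is only an
    illustration, not the patch itself: the names make_cookie, chunk_of and
    chunks_per_leaf are hypothetical, and taking the cookie bits from the
    hash shifted right by one is an assumption made so that the sign bit is
    always 0.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_DIRENT_SIZE 48u                   /* minimum gfs2 dirent size */
#define OFFSET_BITS     13u
#define MODE_FLAG       (1u << OFFSET_BITS)   /* bit 13: 0 = unique, 1 = hash */
#define INDEX_MASK      (0x1FFFFu << (OFFSET_BITS + 1u)) /* 17 index bits */

/* Chunk where an entry starts, counted along the chain of leaf blocks.
 * chunks_per_leaf is a hypothetical per-filesystem constant. */
static uint32_t chunk_of(uint32_t leaf_nr, uint32_t byte_offset,
                         uint32_t chunks_per_leaf)
{
    return leaf_nr * chunks_per_leaf + byte_offset / MIN_DIRENT_SIZE;
}

/* Build a cookie from the 32-bit name hash and the entry's chunk number. */
static uint32_t make_cookie(uint32_t hash, uint32_t chunk)
{
    uint32_t cookie = hash >> 1;              /* assumption: keeps bit 31 clear */

    if (chunk >= MODE_FLAG)                   /* chunk 8192 or later */
        cookie |= MODE_FLAG;                  /* 17 index bits + 13 hash bits */
    else
        cookie = (cookie & INDEX_MASK) | chunk; /* 17 index bits + offset */

    assert((cookie & 0x80000000u) == 0);      /* cookie is always positive */
    return cookie;
}
```

    Because the mode bit differs between the two formats, a cookie produced
    by the fallback path can never equal a location based one.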

    Another benefit of location based cookies is that once a directory's
    exhash table is fully extended (so that multiple hash table indexes do
    not use the same leaf blocks), gfs2 can skip sorting the directory
    entries until it reaches the non-unique ones, and then it only needs to
    sort these. This provides a significant speed up for directory reads of
    very large directories.

    The only issue is that for these cookies to continue to point to the
    correct entry as files are added and removed from the directory, gfs2
    must keep the entries at the same offset in the leaf block when they are
    split (see my previous patch). This means that until all the nodes in a
    cluster are running with code that will split the directory leaf blocks
    this way, none of the nodes can use the new cookie code. To deal with
    this, gfs2 now has the mount option loccookie, which, if set, will make
    it return these new location based cookies. This option must not be set
    until all nodes in the cluster are at least running this version of the
    kernel code, and you have guaranteed that there are no outstanding
    cookies required by other software, such as NFS.

    gfs2 uses some of the extra space at the end of the gfs2_dirent
    structure to store the calculated readdir cookies. This keeps us from
    needing to allocate a separate array to hold these values. gfs2
    recomputes the cookie stored in de_cookie for every readdir call. The
    time it takes to do so is small, and if gfs2 expected this value to be
    saved on disk, the new code wouldn't work correctly on filesystems
    created with an earlier version of gfs2.

    One issue with adding de_cookie to the union in the gfs2_dirent
    structure is that it caused the union to align itself to a 4 byte
    boundary, instead of its previous 2 byte boundary. This changed the
    offset of de_rahead. To solve that, I pulled de_rahead out of the union,
    since it does not need to be there.
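
    The alignment effect described above can be reproduced with a small C
    sketch. The struct names and the 26 bytes of leading fields are
    illustrative assumptions, not the real gfs2_dirent definition, and the
    offsets assume a typical ABI where a 32-bit integer is 4-byte aligned
    and a 16-bit integer is 2-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: the strictest union member is 16-bit, so the union (and the
 * rahead field at its start) can sit at offset 26. */
struct dirent_before {
    uint8_t fixed[26];                 /* stand-in for the leading fields */
    union {
        uint8_t pad[14];
        struct { uint16_t rahead; uint8_t pad2[12]; } s;
    } u;
};

/* Naive change: a 32-bit cookie in the union raises its alignment to 4,
 * which pushes the union, and rahead with it, to offset 28. */
struct dirent_naive {
    uint8_t fixed[26];
    union {
        uint8_t pad[14];
        struct { uint16_t rahead; uint8_t pad2[12]; } s;
        uint32_t cookie;
    } u;
};

/* Fix: rahead moves out of the union and stays at offset 26; the union
 * holding the cookie now starts at offset 28. */
struct dirent_fixed {
    uint8_t  fixed[26];
    uint16_t rahead;
    union {
        uint8_t  pad[12];
        uint32_t cookie;
    } u;
};

static_assert(offsetof(struct dirent_before, u) == 26, "old rahead offset");
static_assert(offsetof(struct dirent_naive, u) == 28, "rahead would move");
static_assert(offsetof(struct dirent_fixed, rahead) == 26, "rahead kept");
static_assert(sizeof(struct dirent_fixed) == sizeof(struct dirent_before),
              "overall size unchanged");
```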

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Bob Peterson

    Benjamin Marzinski
     

14 May, 2014

1 commit

  • GFS2 has a transaction glock, which must be grabbed for every
    transaction, whose purpose is to deal with freezing the filesystem.
    Aside from this involving a large amount of locking, it is very easy to
    make the current fsfreeze code hang on unfreezing.

    This patch rewrites how gfs2 handles freezing the filesystem. The
    transaction glock is removed. In its place is a freeze glock, which is
    cached (but not held) in a shared state by every node in the cluster
    when the filesystem is mounted. This lock only needs to be grabbed on
    freezing, and actions which need to be safe from freezing, like
    recovery.

    When a node wants to freeze the filesystem, it grabs this glock
    exclusively. When the freeze glock state changes on the nodes (either
    from shared to unlocked, or shared to exclusive), the filesystem does a
    special log flush. gfs2_log_flush() does all the work for flushing out
    and shutting down the incore log, and then it tries to grab the
    freeze glock in a shared state again. Since the filesystem is stuck in
    gfs2_log_flush, no new transaction can start, and nothing can be written
    to disk. Unfreezing the filesystem simply involves dropping the freeze
    glock, allowing gfs2_log_flush() to grab and then release the shared
    lock, so it is cached for next time.

    However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
    shared lock on the filesystem root directory inode to check permissions.
    If that glock has already been grabbed exclusively, fsfreeze will be
    unable to get the shared lock and unfreeze the filesystem.

    In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
    on the filesystem root directory during the freeze, and hold it until it
    unfreezes the filesystem. The functions which need to grab a shared
    lock in order to allow the unfreeze ioctl to be issued now use the lock
    grabbed by the freeze code instead.

    The freeze and unfreeze code take care to make sure that this shared
    lock will not be dropped while another process is using it.

    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

07 Feb, 2014

1 commit

  • The intent of this new field in the directory entry is to
    allow a subsequent lookup to know how many blocks, which
    are contiguous with the inode, contain metadata which relates
    to the inode. This will then allow the issuing of a single
    read to read these blocks, rather than reading the inode
    first, and then issuing a second read for the metadata.

    This only works under some fairly strict conditions. Since
    we do not have back pointers from inodes to directory entries,
    we must ensure that the blocks referenced in this way will
    always belong to the inode.

    This rules out being able to use this system for indirect
    blocks, as these can change as a result of truncate/rewrite.

    So the idea here is to restrict this to xattr blocks only
    for the time being. For most inodes, that means only a
    single block. Also, when using ACLs and/or SELinux or
    other LSMs, these will be added at inode creation time
    so that they will be contiguous with the inode on disk and
    also will almost always be needed when we read the inode in
    for permissions checks.

    Once an xattr block for an inode is allocated, it will never
    change until the inode is deallocated.

    This patch adds the new field; a further patch will add the
    readahead in due course.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

04 Feb, 2014

1 commit

  • This is another step towards improving the allocation of xattr
    blocks at inode allocation time. Here we take advantage of
    Christoph's recent work on ACLs to allocate a block for the
    xattrs early if we know that we will be adding ACLs to the
    inode later on. The advantage of that is that it is much
    more likely that we'll get a contiguous run of two blocks
    where the first is the inode and the second is the xattr block.

    We still have to fall back to the original system in case we
    don't get the requested two contiguous blocks, or in case the
    ACLs are too large to fit into the block.

    Future patches will move more of the ACL setting code further
    up the gfs2_inode_create() function. Also, I'd like to be
    able to do the same thing with the xattrs from LSMs in
    due course, too. That way we should be able to slowly reduce
    the number of independent transactions, at least in the
    most common cases.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

08 Jan, 2014

1 commit

  • This patch adds four new fields to directory leaf blocks.
    The intent is not to use them in the kernel itself, although
    perhaps we may be able to use them as hints at some later date,
    but instead to provide more information for debug/fsck use.

    One new field adds a pointer to the inode to which the leaf
    belongs. This can be useful if the pointer to the leaf block
    has become corrupt, as it will allow us to know which inode
    this block should be associated with. This field is set when
    the leaf is created and never changed over its lifetime.

    The second field is a "distance from the hash table" field.
    The meaning is as follows:
    0 = An old leaf in which this value has not been set
    1 = This leaf is pointed to directly from the hash table
    2+ = This leaf is part of a chain, pointed to by another leaf
    block, the value gives the position in the chain.

    The third and fourth fields combine to give a time stamp of
    the most recent directory insertion or deletion from this
    leaf block. The time stamp is not updated when a new leaf
    block is chained from the current one. The code is currently
    written such that the timestamp on the dir inode will match
    that of the leaf block for the most recent insertion/deletion.

    For backwards compatibility, any of these new fields which is
    zero should be considered to be "unknown".
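
    As a rough illustration of how a debug or fsck tool might interpret the
    "distance from the hash table" value, here is a minimal C sketch. The
    enum and function names are hypothetical, and the value is assumed to
    have already been converted from its on-disk form to CPU byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of the distance field described above. */
enum leaf_place {
    LEAF_UNKNOWN = 0,   /* old leaf, value was never set */
    LEAF_DIRECT,        /* pointed to directly from the hash table */
    LEAF_CHAINED        /* reached through another leaf block */
};

static_assert(LEAF_UNKNOWN == 0, "zero must mean unknown");

/* Map a distance value to one of the three cases listed in the message. */
static enum leaf_place classify_dist(uint32_t dist)
{
    if (dist == 0)
        return LEAF_UNKNOWN;
    if (dist == 1)
        return LEAF_DIRECT;
    return LEAF_CHAINED;   /* dist >= 2: dist is the position in the chain */
}
```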

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

13 Oct, 2012

1 commit