24 Oct, 2013

1 commit

  • All of the buffer operations structures need to be exported for
    xfs_db, so move them all to a common location rather than
    spreading them all over the place. They verify the on-disk
    format, so while xfs_format.h might seem like a natural home, they
    are not themselves part of the on-disk format.

    Hence we need to create a new header file in which to centralise
    these related definitions. Start by moving the buffer operations
    structures, and then also move all the other definitions that have
    crept into xfs_log_format.h and xfs_format.h, as there was no other
    shared header file to put them in.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
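
    As a rough illustration of the end result (in mainline the new header
    became xfs_shared.h, but treat the exact names here as an editor's
    sketch rather than a quote of the patch), the shared header mostly
    collects extern declarations of the verifier ops so that both the
    kernel and the xfs_db/xfsprogs builds can reference them:

        /* sketch: centralised declarations of the buffer verifier ops */
        extern const struct xfs_buf_ops xfs_sb_buf_ops;
        extern const struct xfs_buf_ops xfs_agf_buf_ops;
        extern const struct xfs_buf_ops xfs_agi_buf_ops;
        extern const struct xfs_buf_ops xfs_agfl_buf_ops;
        extern const struct xfs_buf_ops xfs_inode_buf_ops;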
     

13 Aug, 2013

1 commit

  • The struct xfs_perag has many kernel-only definitions in it,
    requiring a __KERNEL__ guard so that userspace can use it too. Move
    it to xfs_mount.h so that it is kernel-only, and let userspace
    define its own version of the structure containing only what it
    needs. This gets rid of another __KERNEL__ check in the XFS header
    files.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

22 Apr, 2013

3 commits

  • The same set of changes made to the AGF needs to be made to the
    AGI. This patch has a similar history to the AGF one, hence a
    similar sign-off chain.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Add CRC checks, location information and a magic number to the AGFL.
    Previously the AGFL was just a block containing nothing but the
    free block pointers. The new AGFL has a real header with the usual
    boilerplate instead, so that we can verify it is not corrupted and
    was written to the right place (a sketch of the new layout follows
    the AGF entry below).

    [dchinner@redhat.com] Added LSN field, reworked significantly to fit
    into the new verifier structure and growfs structure, and enabled
    full verifier functionality now that there is a header to verify and
    we can guarantee an initialised AGFL.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • The AGF already has some self-identifying fields (e.g. the sequence
    number) so we only need to add the uuid to it to identify the
    filesystem it belongs to. The location is fixed based on the
    sequence number, so there's no need to add a block number, either.

    Hence the only additional fields are the CRC and LSN fields. These
    are unlogged, so leave some space between the end of the logged
    fields and the new unlogged ones, so that future logged fields can
    be placed adjacent to the existing logged fields and hence not
    complicate the field-derived, range-based logging we currently
    have.

    Based originally on a patch from myself, modified further by
    Christoph Hellwig and then modified again to fit into the
    verifier structure with additional fields by myself. The multiple
    signed-off-by tags indicate the age and history of this patch.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
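
    Referring back to the AGFL entry above: the new self-describing AGFL
    block ends up with a small header ahead of the free block list,
    roughly like the sketch below (field names and types are an
    approximation of the on-disk structure, not a quote of the patch):

        /* sketch of a CRC-enabled AGFL: header followed by the free list */
        struct xfs_agfl {
                __be32          agfl_magicnum;  /* magic number */
                __be32          agfl_seqno;     /* AG number = location info */
                uuid_t          agfl_uuid;      /* filesystem identifier */
                __be64          agfl_lsn;       /* LSN of last write (unlogged) */
                __be32          agfl_crc;       /* CRC of this block (unlogged) */
                __be32          agfl_bno[];     /* the free block pointers */
        };

    The AGF entry immediately above follows the same pattern: only a UUID
    plus unlogged LSN and CRC fields are added, with spare padding left
    between the logged and unlogged regions.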
     

16 Nov, 2012

1 commit

  • To separate the verifiers from the iodone functions and associate
    read and write verifiers at the same time, introduce a buffer
    verifier operations structure and attach it to the xfs_buf.

    This avoids the need for assigning the write verifier, clearing the
    iodone function and re-running ioend processing in the read
    verifier, and gets rid of the nasty "b_pre_io" name for the write
    verifier function pointer. With an ops structure in place it will
    also be easier to add further content-specific callbacks to a
    buffer if we ever need to.

    We also avoid needing to export verifier functions; instead we
    can simply export the ops structures for those that are needed
    outside the file they are defined in.

    This patch also fixes a directory block readahead verifier issue
    it exposed.

    This patch also adds ops callbacks to the inode/alloc btree blocks
    initialised by growfs. These will need more work before they will
    work with CRCs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
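
    Following on from the entry above, the verifier ops structure is
    essentially a pair of callbacks hung off the buffer. A minimal sketch
    of the idea (the structure grew additional members in later kernels,
    and the xfs_trans_read_buf call shown in the comment is illustrative):

        /* sketch: read and write verifiers bundled into one ops structure */
        struct xfs_buf_ops {
                void    (*verify_read)(struct xfs_buf *bp);  /* after read I/O completes */
                void    (*verify_write)(struct xfs_buf *bp); /* before write I/O is issued */
        };

        /*
         * Callers pass the ops when reading a buffer, and the same ops
         * are then used automatically when the buffer is later written:
         *
         *     error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, daddr,
         *                                numblks, 0, &bp, &xfs_agf_buf_ops);
         */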
     

09 Nov, 2012

1 commit

  • Add the XFS_ICI_EOFBLOCKS_TAG inode tag to identify inodes with
    speculatively preallocated blocks beyond EOF. An inode is tagged
    when speculative preallocation occurs and untagged either via
    truncate down or when post-EOF blocks are freed via release or
    reclaim.

    The tag management is intentionally not aggressive to prefer
    simplicity over the complexity of handling all the corner cases
    under which post-EOF blocks could be freed (i.e., forward
    truncation, fallocate, write error conditions, etc.). This means
    that a tagged inode may or may not have post-EOF blocks after a
    period of time. The tag is eventually cleared when the inode is
    released or reclaimed.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Brian Foster
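
    The tag itself is just a radix tree tag on the per-AG inode cache.
    Setting and clearing it looks roughly like this sketch (the lock and
    field names are the usual per-AG cache names from the mainline tree
    of that era, shown for illustration rather than quoted from the patch):

        /* sketch: mark an inode as holding speculative post-EOF blocks */
        spin_lock(&pag->pag_ici_lock);
        radix_tree_tag_set(&pag->pag_ici_root,
                           XFS_INO_TO_AGINO(mp, ip->i_ino),
                           XFS_ICI_EOFBLOCKS_TAG);
        spin_unlock(&pag->pag_ici_lock);

        /* ...and clear the tag once the post-EOF blocks have been freed */
        spin_lock(&pag->pag_ici_lock);
        radix_tree_tag_clear(&pag->pag_ici_root,
                             XFS_INO_TO_AGINO(mp, ip->i_ino),
                             XFS_ICI_EOFBLOCKS_TAG);
        spin_unlock(&pag->pag_ici_lock);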
     

15 May, 2012

1 commit

  • To make it easier to handle userspace code merges, move all the busy
    extent handling out of the allocation code and into its own file.
    The userspace code does not need the busy extent code, so this
    simplifies the merging of the kernel code into the userspace
    xfsprogs library.

    Because the busy extent code has been almost completely rewritten
    over the past couple of years, also update the copyright on this new
    file to include the authors that made all those changes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

26 Jul, 2011

1 commit


25 May, 2011

2 commits

  • Blocks for the allocation btree are allocated from and released to
    the AGFL, and thus frequently reused. Even worse, we do not have an
    easy way to avoid using an AGFL block while it is being discarded,
    due to the simple FILO list of free blocks, and thus can frequently
    stall on blocks that are currently undergoing a discard.

    Add a flag to the busy extent tracking structure to skip the discard
    for allocation btree blocks. In normal operation these blocks are
    reused frequently enough that there is no need to discard them
    anyway, but if they spill over to the allocation btree as part of a
    balance we "leak" blocks that we would otherwise discard. We could
    fix this by adding another flag and keeping these blocks in the
    rbtree even after they aren't busy any more so that we could discard
    them when they migrate out of the AGFL. Given that this would cause
    significant overhead I don't think it's worthwhile for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Now that we have reliable tracking of deleted extents in a
    transaction we can easily implement "online" discard support,
    which calls blkdev_issue_discard once a transaction commits.

    The actual discard is a two-stage operation: we first have
    to mark the busy extent as not available for reuse before we
    can start the actual discard. Note that we don't bother
    supporting discard for the non-delaylog mode.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
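
    The discard step described above amounts to walking the busy extents
    attached to the committed transaction and issuing a discard for each
    one. A sketch (structure, list and member names are approximate, and
    the blkdev_issue_discard signature has changed across kernel releases):

        /* sketch: discard each busy extent once its transaction is on disk */
        struct xfs_busy_extent  *busyp;

        list_for_each_entry(busyp, &busy_extent_list, list) {
                blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
                                XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
                                XFS_FSB_TO_BB(mp, busyp->length),
                                GFP_NOFS, 0);
        }

    Only after the discards complete are the extents removed from the
    busy tracking and made available for reuse again.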
     

29 Apr, 2011

1 commit

  • Update the extent tree in case we have to reuse a busy extent, so that it
    is always kept up to date. This is done by replacing the busy list searches
    with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
    in case of a reuse. This allows us to reuse metadata extents
    unconditionally, and thus avoid log forces, especially for allocation btree
    blocks.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

16 Dec, 2010

1 commit

  • Now that we are using RCU protection for the inode cache lookups,
    the lock is only needed on the modification side. Hence it is not
    necessary for the lock to be an rwlock as there are no read-side
    holders anymore. Convert it to a spin lock to reflect its exclusive
    nature.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

19 Oct, 2010

3 commits

  • The buffer cache hash is showing typical hash scalability problems.
    In large scale testing the number of cached items grows far larger
    than the hash can efficiently handle. Hence we need to move to a
    self-scaling cache indexing mechanism.

    I have selected rbtrees for indexing because they can have O(log n)
    search scalability, and insert and remove cost is not excessive,
    even on large trees. Hence we should be able to cache large numbers
    of buffers without incurring the excessive cache miss search
    penalties that the hash is imposing on us.

    To ensure we still have parallel access to the cache, we need
    multiple trees. Rather than hashing the buffers by disk address to
    select a tree, it seems more sensible to separate trees by typical
    access patterns. Most operations use buffers from within a single AG
    at a time, so rather than searching lots of different lists,
    separate the buffer indexes out into per-AG rbtrees. This means that
    searches during metadata operations have a much higher chance of
    hitting cache resident nodes, and that updates of the tree are less
    likely to disturb trees being accessed on other CPUs doing
    independent operations.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • Memory reclaim via shrinkers has a terrible habit of having N+M
    concurrent shrinker executions (N = num CPUs, M = num kswapds) all
    trying to shrink the same cache. When the cache they are all working
    on is protected by a single spinlock, massive contention and
    slowdowns occur.

    Wrap the per-ag inode caches with a reclaim mutex to serialise
    reclaim access to the AG. This will block concurrent reclaim in each
    AG but still allow reclaim to scan multiple AGs concurrently. Allow
    a shrinker to move on to the next AG if it can't get the lock, and
    if it can't get any AG, then start blocking on locks.

    To prevent reclaimers from continually scanning the same inodes in
    each AG, add a cursor that tracks where the last reclaim got up to
    and start from that point on the next reclaim. This should avoid
    only ever scanning a small number of inodes at the start of each AG
    and not making progress. If we have a non-shrinker based reclaim
    pass, ignore the cursor and reset it to zero once we are done.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • When we start taking a reference to the per-ag for every cached
    buffer in the system, kernel lockstat profiling on an 8-way create
    workload shows the mp->m_perag_lock has higher acquisition rates
    than the inode lock and has significantly more contention. That is,
    it becomes the highest contended lock in the system.

    The perag lookup is trivial to convert to lock-less RCU lookups
    because perag structures never go away. Hence the only thing we need
    to protect against is tree structure changes during a grow. This can
    be done simply by replacing the locking in xfs_perag_get() with RCU
    read locking. This removes the mp->m_perag_lock completely from this
    path.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
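
    The resulting lock-less lookup described in the last entry boils down
    to something like the sketch below (close to the shape of the final
    code, but presented as an illustration rather than a quote):

        /* sketch: reference a per-AG structure without mp->m_perag_lock */
        struct xfs_perag *
        xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno)
        {
                struct xfs_perag        *pag;

                rcu_read_lock();
                pag = radix_tree_lookup(&mp->m_perag_tree, agno);
                if (pag)
                        atomic_inc(&pag->pag_ref);      /* pin while in use */
                rcu_read_unlock();
                return pag;
        }

    The m_perag_lock is still taken on the modification side, i.e. when
    growfs inserts new perag structures into the radix tree.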
     

29 May, 2010

1 commit

  • If a filesystem is mounted without the inode64 mount option we
    should still be able to access inodes not fitting into 32 bits, just
    not create new ones. For this to work we need to make sure the
    inode cache radix tree is initialized for all allocation groups, not
    just those we plan to allocate inodes from. This patch makes sure
    we initialize the inode cache radix tree for all allocation groups,
    and also cleans xfs_initialize_perag up a bit to separate the
    inode32 logic from the general perag structure setup.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

24 May, 2010

1 commit

  • When we free a metadata extent, we record it in the per-AG busy
    extent array so that it is not re-used before the freeing
    transaction hits the disk. This array is fixed size, so when it
    overflows we make further allocation transactions synchronous
    because we cannot track more freed extents until those transactions
    hit the disk and are completed. Under heavy mixed allocation and
    freeing workloads with large log buffers, we can overflow this array
    quite easily.

    Further, the array is sparsely populated, which means that inserts
    need to search for a free slot, and array searches often have to
    scan many more slots than are actually used to check all the
    busy extents. Quite inefficient, really.

    To enable this aspect of extent freeing to scale better, we need
    a structure that can grow dynamically. While in other areas of
    XFS we have used radix trees, the extents being freed are at random
    locations on disk so are better suited to being indexed by an rbtree.

    So, use a per-AG rbtree indexed by block number to track busy
    extents. This incurs a memory allocation when marking an extent
    busy, but should not occur too often in low memory situations. This
    should scale to an arbitrary number of extents so should not be a
    limitation for features such as in-memory aggregation of
    transactions.

    However, there are still situations where we can't avoid allocating
    busy extents (such as allocation from the AGFL). To minimise the
    overhead of such occurrences, we need to avoid doing a synchronous
    log force while holding the AGF locked, to ensure that the previous
    transactions are safely on disk before we use the extent. We can do
    this by marking the transaction doing the allocation as synchronous
    rather than issuing a log force.

    Because of the locking involved and the ordering of transactions,
    the synchronous transaction provides the same guarantees as a
    synchronous log force because it ensures that all the prior
    transactions are already on disk when the synchronous transaction
    hits the disk. i.e. it preserves the free->allocate order of the
    extent correctly in recovery.

    By doing this, we avoid holding the AGF locked while log writes are
    in progress, hence reducing the length of time the lock is held and
    therefore we increase the rate at which we can allocate and free
    from the allocation group, thereby increasing overall throughput.

    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), the buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting for
    the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for log
    force to occur.

    Ideally we should not reallocate busy extents. That is a much more
    complex fix to the problem as it involves direct intervention in the
    allocation btree searches in many places. This is left to a future
    set of modifications.

    Finally, now that we track busy extents in allocated memory, we
    don't need the descriptors in the transaction structure to point to
    them. We can replace the complex busy chunk infrastructure with a
    simple linked list of busy extents. This allows us to remove a large
    chunk of code, making the overall change a net reduction in code
    size.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
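
    A sketch of the per-AG rbtree insert described above, keyed by the
    extent's starting block (simplified, with approximate function,
    structure and field names; the real code also detects and handles
    overlaps with already-busy extents):

        /* sketch: track a newly freed extent in the per-AG busy rbtree */
        static void
        xfs_busy_extent_insert(struct xfs_perag *pag,
                               struct xfs_busy_extent *new)
        {
                struct rb_node  **rbp = &pag->pagb_tree.rb_node;
                struct rb_node  *parent = NULL;

                spin_lock(&pag->pagb_lock);
                while (*rbp) {
                        struct xfs_busy_extent *busyp =
                                rb_entry(*rbp, struct xfs_busy_extent, rb_node);

                        parent = *rbp;
                        if (new->bno < busyp->bno)
                                rbp = &(*rbp)->rb_left;
                        else
                                rbp = &(*rbp)->rb_right;
                }
                rb_link_node(&new->rb_node, parent, rbp);
                rb_insert_color(&new->rb_node, &pag->pagb_tree);
                spin_unlock(&pag->pagb_lock);
        }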
     

30 Apr, 2010

1 commit

  • On low memory boxes or those with highmem, the kernel can OOM before
    background inode reclaim via xfssyncd gets a chance to run. Add a
    shrinker to run inode reclaim so that inode reclaim is expedited
    when memory is low.

    This is more complex than it needs to be because the VM folk don't
    want a context added to the shrinker infrastructure. Hence we need
    to add a global list of XFS mount structures so the shrinker can
    traverse them.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
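
    The "global list of XFS mount structures" is the interesting part: the
    shrinker simply walks it and asks each filesystem to reclaim a few
    inodes. A conceptual sketch only - the helper names and the m_mplist
    member are illustrative assumptions, and the shrinker callback
    signature itself has changed repeatedly between kernel versions, so
    just the traversal is shown:

        /* sketch: every mounted XFS filesystem is kept on a global list */
        static LIST_HEAD(xfs_mount_list);
        static DEFINE_MUTEX(xfs_mount_list_lock);

        /* hypothetical helper invoked from the shrinker callback */
        static void
        xfs_reclaim_inodes_from_all_mounts(int nr_to_scan)
        {
                struct xfs_mount        *mp;

                mutex_lock(&xfs_mount_list_lock);
                list_for_each_entry(mp, &xfs_mount_list, m_mplist)
                        xfs_reclaim_some_inodes(mp, nr_to_scan); /* hypothetical */
                mutex_unlock(&xfs_mount_list_lock);
        }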
     

16 Jan, 2010

3 commits

  • Now that the perag structure is allocated memory rather than held in
    an array, we don't need to have the busy extent array external to
    the structure. Embed it into the perag structure to avoid needing an
    extra allocation when setting up.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Uninline xfs_perag_{get,put} so that tracepoints can be inserted
    into them to speed debugging of reference count problems.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Reference count the per-ag structures to ensure that we keep get/put
    pairs balanced. Assert that the reference counts are zero at unmount
    time to catch leaks. In future, reference counts will enable us to
    safely remove perag structures by allowing us to detect when they
    are no longer in use.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

15 Dec, 2009

1 commit

  • Convert the old xfs tracing support that could only be used with the
    out of tree kdb and xfsidbg patches to use the generic event tracer.

    To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
    all xfs trace channels by:

    echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

    or alternatively enable single events by just doing the same in one
    event subdirectory, e.g.

    echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable

    or set more complex filters, etc. In Documentation/trace/events.txt
    all this is described in more detail. To read the events do a

    cat /sys/kernel/debug/tracing/trace

    Compared to the last posting this patch converts the tracing mostly to
    the one tracepoint per callsite model that other users of the new
    tracing facility also employ. This allows a very fine-grained control
    of the tracing, a cleaner output of the traces and also enables the
    perf tool to use each tracepoint as a virtual performance counter,
    allowing us to e.g. count how often certain workloads hit various
    spots in XFS. Take a look at

    http://lwn.net/Articles/346470/

    for some examples.

    Also, the btree tracing isn't included at all yet as it will require
    additional core tracing features that are not yet in mainline; I plan
    to deliver it later.

    And the really nice thing about this patch is that it actually removes
    many lines of code while adding this nice functionality:

    fs/xfs/Makefile | 8
    fs/xfs/linux-2.6/xfs_acl.c | 1
    fs/xfs/linux-2.6/xfs_aops.c | 52 -
    fs/xfs/linux-2.6/xfs_aops.h | 2
    fs/xfs/linux-2.6/xfs_buf.c | 117 +--
    fs/xfs/linux-2.6/xfs_buf.h | 33
    fs/xfs/linux-2.6/xfs_fs_subr.c | 3
    fs/xfs/linux-2.6/xfs_ioctl.c | 1
    fs/xfs/linux-2.6/xfs_ioctl32.c | 1
    fs/xfs/linux-2.6/xfs_iops.c | 1
    fs/xfs/linux-2.6/xfs_linux.h | 1
    fs/xfs/linux-2.6/xfs_lrw.c | 87 --
    fs/xfs/linux-2.6/xfs_lrw.h | 45 -
    fs/xfs/linux-2.6/xfs_super.c | 104 ---
    fs/xfs/linux-2.6/xfs_super.h | 7
    fs/xfs/linux-2.6/xfs_sync.c | 1
    fs/xfs/linux-2.6/xfs_trace.c | 75 ++
    fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
    fs/xfs/linux-2.6/xfs_vnode.h | 4
    fs/xfs/quota/xfs_dquot.c | 110 ---
    fs/xfs/quota/xfs_dquot.h | 21
    fs/xfs/quota/xfs_qm.c | 40 -
    fs/xfs/quota/xfs_qm_syscalls.c | 4
    fs/xfs/support/ktrace.c | 323 ---------
    fs/xfs/support/ktrace.h | 85 --
    fs/xfs/xfs.h | 16
    fs/xfs/xfs_ag.h | 14
    fs/xfs/xfs_alloc.c | 230 +-----
    fs/xfs/xfs_alloc.h | 27
    fs/xfs/xfs_alloc_btree.c | 1
    fs/xfs/xfs_attr.c | 107 ---
    fs/xfs/xfs_attr.h | 10
    fs/xfs/xfs_attr_leaf.c | 14
    fs/xfs/xfs_attr_sf.h | 40 -
    fs/xfs/xfs_bmap.c | 507 +++------------
    fs/xfs/xfs_bmap.h | 49 -
    fs/xfs/xfs_bmap_btree.c | 6
    fs/xfs/xfs_btree.c | 5
    fs/xfs/xfs_btree_trace.h | 17
    fs/xfs/xfs_buf_item.c | 87 --
    fs/xfs/xfs_buf_item.h | 20
    fs/xfs/xfs_da_btree.c | 3
    fs/xfs/xfs_da_btree.h | 7
    fs/xfs/xfs_dfrag.c | 2
    fs/xfs/xfs_dir2.c | 8
    fs/xfs/xfs_dir2_block.c | 20
    fs/xfs/xfs_dir2_leaf.c | 21
    fs/xfs/xfs_dir2_node.c | 27
    fs/xfs/xfs_dir2_sf.c | 26
    fs/xfs/xfs_dir2_trace.c | 216 ------
    fs/xfs/xfs_dir2_trace.h | 72 --
    fs/xfs/xfs_filestream.c | 8
    fs/xfs/xfs_fsops.c | 2
    fs/xfs/xfs_iget.c | 111 ---
    fs/xfs/xfs_inode.c | 67 --
    fs/xfs/xfs_inode.h | 76 --
    fs/xfs/xfs_inode_item.c | 5
    fs/xfs/xfs_iomap.c | 85 --
    fs/xfs/xfs_iomap.h | 8
    fs/xfs/xfs_log.c | 181 +----
    fs/xfs/xfs_log_priv.h | 20
    fs/xfs/xfs_log_recover.c | 1
    fs/xfs/xfs_mount.c | 2
    fs/xfs/xfs_quota.h | 8
    fs/xfs/xfs_rename.c | 1
    fs/xfs/xfs_rtalloc.c | 1
    fs/xfs/xfs_rw.c | 3
    fs/xfs/xfs_trans.h | 47 +
    fs/xfs/xfs_trans_buf.c | 62 -
    fs/xfs/xfs_vnodeops.c | 8
    70 files changed, 2151 insertions(+), 2592 deletions(-)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

02 Sep, 2009

1 commit

  • Don't search too far - if the search goes outside a certain radius,
    abort it and simply do a linear search for the first free inode. In
    AGs with a million inodes this can speed up allocation by 3-4x.

    [hch: ported to the new xfs_ialloc.c world order]

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Felix Blyakher

    Dave Chinner
     

01 Sep, 2009

2 commits


08 Jun, 2009

1 commit

  • Given that we walk across the per-ag inode lists so often, it makes sense to
    introduce an iterator for this.

    Convert the sync and reclaim code to use this new iterator; the quota
    code will follow in the next patch.

    Also change xfs_reclaim_inode to return -EAGAIN instead of 1 for an
    inode already under reclaim. This simplifies the AG iterator and
    doesn't matter for the only other caller.

    [hch: merged the lookup and execute callbacks back into one to get the
    pag_ici_lock locking correct and simplify the code flow]

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen

    Dave Chinner
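
    A rough sketch of the iterator shape being introduced (simplified: the
    real helper batches radix tree gang lookups and handles -EAGAIN from
    inodes already under reclaim; the names here are approximations, not
    the exact ones from the patch):

        /* sketch: apply 'execute' to every cached inode in every AG */
        int
        xfs_inode_ag_iterator(struct xfs_mount *mp,
                        int (*execute)(struct xfs_inode *ip, int flags),
                        int flags)
        {
                xfs_agnumber_t  agno;
                int             error = 0;

                for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
                        /*
                         * Walk this AG's cached inodes (its pag_ici_root
                         * radix tree), calling execute() on each one found.
                         */
                        error = xfs_inode_ag_walk(mp, agno, execute, flags);
                        if (error && error != -EAGAIN)
                                break;
                }
                return error;
        }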
     

09 Feb, 2009

1 commit


16 Jan, 2009

1 commit


09 Jan, 2009

1 commit


01 Dec, 2008

2 commits


30 Oct, 2008

2 commits


07 Feb, 2008

1 commit

  • Un-obfuscate pagb_lock, remove mutex_lock->spin_lock macros, call
    spin_lock directly, remove extraneous cookie holdover from old xfs code,
    and change lock type to spinlock_t.

    SGI-PV: 970382
    SGI-Modid: xfs-linux-melb:xfs-kern:29743a

    Signed-off-by: Eric Sandeen
    Signed-off-by: Donald Douwsma
    Signed-off-by: Tim Shimmin

    Eric Sandeen
     

15 Oct, 2007

1 commit

  • One of the perpetual scaling problems XFS has is indexing its incore
    inodes. We currently use hashes, and the default hash sizes chosen can
    only ever be a tradeoff between memory consumption and the maximum
    realistic size of the cache.

    As a result, anyone who has millions of inodes cached on a filesystem
    needs to tune the size of the cache via the ihashsize mount option to
    allow decent scalability with inode cache operations.

    A further problem is the separate inode cluster hash, whose size is based
    on the ihashsize but is smaller, and so under certain conditions (sparse
    cluster cache population) this can become a limitation long before the
    inode hash is causing issues.

    The following patchset removes the inode hash and cluster hash and
    replaces them with radix trees to avoid the scalability limitations of the
    hashes. It also reduces the size of the inodes by 3 pointers....

    SGI-PV: 969561
    SGI-Modid: xfs-linux-melb:xfs-kern:29481a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner
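
    The replacement index is a per-AG radix tree keyed by the AG-relative
    inode number. A minimal sketch of the lookup side (lock and field
    names follow the later mainline code and are shown purely as an
    illustration; the pag_ici_lock was still an rwlock at this point):

        /* sketch: find a cached inode via the per-AG radix tree */
        xfs_agino_t             agino = XFS_INO_TO_AGINO(mp, ino);
        struct xfs_inode        *ip;

        read_lock(&pag->pag_ici_lock);
        ip = radix_tree_lookup(&pag->pag_ici_root, agino);
        read_unlock(&pag->pag_ici_lock);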
     

14 Jul, 2007

2 commits

  • In media spaces, video is often stored in a frame-per-file format. When
    dealing with uncompressed realtime HD video streams in this format, it is
    crucial that files do not get fragmented and that multiple files are
    placed contiguously on disk.

    When multiple streams are being ingested and played out at the same time,
    it is critical that the filesystem does not cross the streams and
    interleave them together as this creates seek and readahead cache miss
    latency and prevents both ingest and playout from meeting frame rate
    targets.

    This patch set introduces a "stream of files" concept into the allocator to
    place all the data from a single stream contiguously on disk so that RAID
    array readahead can be used effectively. Each additional stream gets
    placed in different allocation groups within the filesystem, thereby
    ensuring that we don't cross any streams. When an AG fills up, we select a
    new AG for the stream that is not in use.

    The core of the functionality is the stream tracking - each inode that we
    create in a directory needs to be associated with the directory's stream.
    Hence every time we create a file, we look up the directory's stream
    object and associate the new file with that object.

    Once we have a stream object for a file, we use the AG that the stream
    object points to for allocations. If we can't allocate in that AG (e.g. it
    is full) we move the entire stream to another AG. Other inodes in the same
    stream are moved to the new AG on their next allocation (i.e. lazy
    update).

    Stream objects are kept in a cache and hold a reference on the inode.
    Hence the inode cannot be reclaimed while there is an outstanding stream
    reference. This means that on unlink we need to remove the stream
    association and we also need to flush all the associations on certain
    events that want to reclaim all unreferenced inodes (e.g. filesystem
    freeze).

    SGI-PV: 964469
    SGI-Modid: xfs-linux-melb:xfs-kern:29096a

    Signed-off-by: David Chinner
    Signed-off-by: Barry Naujok
    Signed-off-by: Donald Douwsma
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin
    Signed-off-by: Vlad Apostolov

    David Chinner
     
  • When we have a couple of hundred transactions on the fly at once, they all
    typically modify the on disk superblock in some way.
    create/unlink/mkdir/rmdir modify inode counts, allocation/freeing modify
    free block counts.

    When these counts are modified in a transaction, they must eventually lock
    the superblock buffer and apply the mods. The buffer then remains locked
    until the transaction is committed into the incore log buffer. The result
    of this is that with enough transactions on the fly the incore superblock
    buffer becomes a bottleneck.

    The result of contention on the incore superblock buffer is that
    transaction rates fall - the more pressure that is put on the superblock
    buffer, the slower things go.

    The key to removing the contention is to not require the superblock fields
    in question to be locked. We do that by not marking the superblock dirty
    in the transaction. IOWs, we modify the incore superblock but do not
    modify the cached superblock buffer. In short, we do not log superblock
    modifications to critical fields in the superblock on every transaction.
    In fact we only do it just before we write the superblock to disk every
    sync period or just before unmount.

    This creates an interesting problem - if we don't log or write out the
    fields in every transaction, then how do the values get recovered after a
    crash? The answer is simple - we keep enough duplicate, logged information
    in other structures that we can reconstruct the correct count after log
    recovery has been performed.

    It is the AGF and AGI structures that contain the duplicate information;
    after recovery, we walk every AGI and AGF and sum their individual
    counters to get the correct value, and we do a transaction into the log to
    correct them. An optimisation of this is that if we have a clean unmount
    record, we know the value in the superblock is correct, so we can avoid
    the summation walk under normal conditions and so mount/recovery times do
    not change under normal operation.

    One wrinkle that was discovered during development was that the blocks
    used in the freespace btrees are never accounted for in the AGF counters.
    This was once a valid optimisation to make; when the filesystem is full,
    the free space btrees are empty and consume no space. Hence when it
    matters, the "accounting" is correct. But that means that when we do the
    AGF summations, we would not have a correct count and xfs_check would
    complain. Hence a new counter was added to track the number of blocks used
    by the free space btrees. This is an *on-disk format change*.

    As a result of this, lazy superblock counters are a mkfs option and at the
    moment on linux there is no way to convert an old filesystem. This is
    possible - xfs_db can be used to twiddle the right bits and then
    xfs_repair will do the format conversion for you. Similarly, you can
    convert backwards as well. At some point we'll add functionality to
    xfs_admin to do the bit twiddling easily....

    SGI-PV: 964999
    SGI-Modid: xfs-linux-melb:xfs-kern:28652a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner
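
    The post-recovery summation described above is conceptually just a
    walk of every AG header. A sketch (error handling and locking omitted;
    the helper and macro names follow the kernel code of that era but are
    given here as an illustration, not a quote of the patch):

        /* sketch: rebuild the lazy superblock counters from the AG headers */
        uint64_t        icount = 0, ifree = 0, fdblocks = 0;
        xfs_agnumber_t  agno;

        for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
                struct xfs_buf  *agibp, *agfbp;
                struct xfs_agi  *agi;
                struct xfs_agf  *agf;

                xfs_ialloc_read_agi(mp, NULL, agno, &agibp);    /* read AGI */
                xfs_alloc_read_agf(mp, NULL, agno, 0, &agfbp);  /* read AGF */
                agi = XFS_BUF_TO_AGI(agibp);
                agf = XFS_BUF_TO_AGF(agfbp);

                icount   += be32_to_cpu(agi->agi_count);
                ifree    += be32_to_cpu(agi->agi_freecount);
                fdblocks += be32_to_cpu(agf->agf_freeblks) +
                            be32_to_cpu(agf->agf_flcount) +
                            be32_to_cpu(agf->agf_btreeblks);

                xfs_buf_relse(agibp);
                xfs_buf_relse(agfbp);
        }
        /* then log the corrected totals into the superblock in one transaction */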
     

28 Sep, 2006

1 commit


29 Mar, 2006

1 commit