30 May, 2012

1 commit


15 May, 2012

4 commits

  • With the removal of xfs_rw.h and other changes over time, xfs_bit.h
    is being included in many files that don't actually need it. Clean
    up the includes as necessary.

    Also move the only-used-once xfs_ialloc_find_free() static inline
    function out of a header file that is widely included to reduce
    the number of needless dependencies on xfs_bit.h.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Now that the busy extent tracking has been moved out of the
    allocation files, clean up the namespace it uses to
    "xfs_extent_busy" rather than a mix of "xfs_busy" and
    "xfs_alloc_busy".

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • To make it easier to handle userspace code merges, move all the busy
    extent handling out of the allocation code and into it's own file.
    The userspace code does not need the busy extent code, so this
    simplifies the merging of the kernel code into the userspace
    xfsprogs library.

    Because the busy extent code has been almost completely rewritten
    over the past couple of years, also update the copyright on this new
    file to include the authors that made all those changes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Untangle the header file includes a bit by moving the definition of
    xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
    xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
    xfs_ag.h.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

23 Feb, 2012

1 commit

  • Split the log regrant case out of xfs_log_reserve into a separate function,
    and merge xlog_grant_log_space and xlog_regrant_write_log_space into their
    respective callers. Also replace the XFS_LOG_PERM_RESERV flag, which easily
    got misused before the previous cleanups with a simple boolean parameter.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

14 Feb, 2012

1 commit


09 Dec, 2011

3 commits

  • Outside the now removed nodelaylog code this field is only used for
    asserts and can be safely removed now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Now that the nodelaylog mode is gone we can simplify the transaction commit
    path a bit by removing the xfs_trans_commit_cil routine. Restoring the
    process flags is merged into xfs_trans_commit which already does it for
    the error path, and allocating the log vectors is merged into
    xlog_cil_format_items, which already fills them with data, thus avoiding
    one loop over all log items.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • The delaylog mode has been the default for a long time, and the nodelaylog
    option has been scheduled for removal in Linux 3.3. Remove it and code
    only used by it now that we have opened the 3.3 window.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

12 Oct, 2011

2 commits

  • There is no reason to keep a reference to the inode even if we unlock
    it during transaction commit because we never drop a reference between
    the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref
    back into xfs_trans_ijoin - the third argument decides if an unlock
    is needed now.

    I'm actually starting to wonder if allowing inodes to be unlocked
    at transaction commit really is worth the effort. The only real
    benefit is that they can be unlocked earlier when commiting a
    synchronous transactions, but that could be solved by doing the
    log force manually after the unlock, too.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Only read the LSN we need to push to with the ilock held, and then release
    it before we do the log force to improve concurrency.

    This also removes the only direct caller of _xfs_trans_commit, thus
    allowing it to be merged into the plain xfs_trans_commit again.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

21 Jul, 2011

1 commit

  • Delayed logging can insert tens of thousands of log items into the
    AIL at the same LSN. When the committing of log commit records
    occur, we can get insertions occurring at an LSN that is not at the
    end of the AIL. If there are thousands of items in the AIL on the
    tail LSN, each insertion has to walk the AIL to find the correct
    place to insert the new item into the AIL. This can consume large
    amounts of CPU time and block other operations from occurring while
    the traversals are in progress.

    To avoid this repeated walk, use a AIL cursor to record
    where we should be inserting the new items into the AIL without
    having to repeat the walk. The cursor infrastructure already
    provides this functionality for push walks, so is a simple extension
    of existing code. While this will not avoid the initial walk, it
    will avoid repeating it tens of thousands of times during a single
    checkpoint commit.

    This version includes logic improvements from Christoph Hellwig.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

11 Jul, 2011

1 commit

  • This reverts commit 7a249cf83da1813cfa71cfe1e265b40045eceb47.

    That commit created a situation that could lead to a filesystem
    hang. As Dave Chinner pointed out, xfs_trans_alloc() could hold a
    reference to m_active_trans (i.e., keep it non-zero) and then wait
    for SB_FREEZE_TRANS to complete. Meanwhile a filesystem freeze
    request could set SB_FREEZE_TRANS and then wait for m_active_trans
    to drop to zero. Nobody benefits from this sequence of events...

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Alex Elder
     

08 Jul, 2011

1 commit

  • As pointed out by Jan xfs_trans_alloc can race with a concurrent filesystem
    freeze when it sleeps during the memory allocation. Fix this by moving the
    wait_for_freeze call after the memory allocation. This means moving the
    freeze into the low-level _xfs_trans_alloc helper, which thus grows a new
    argument. Also fix up some comments in that area while at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

07 Jul, 2011

1 commit

  • When inodes are marked stale in a transaction, they are treated
    specially when the inode log item is being inserted into the AIL.
    It tries to avoid moving the log item forward in the AIL due to a
    race condition with the writing the underlying buffer back to disk.
    The was "fixed" in commit de25c18 ("xfs: avoid moving stale inodes
    in the AIL").

    To avoid moving the item forward, we return a LSN smaller than the
    commit_lsn of the completing transaction, thereby trying to trick
    the commit code into not moving the inode forward at all. I'm not
    sure this ever worked as intended - it assumes the inode is already
    in the AIL, but I don't think the returned LSN would have been small
    enough to prevent moving the inode. It appears that the reason it
    worked is that the lower LSN of the inodes meant they were inserted
    into the AIL and flushed before the inode buffer (which was moved to
    the commit_lsn of the transaction).

    The big problem is that with delayed logging, the returning of the
    different LSN means insertion takes the slow, non-bulk path. Worse
    yet is that insertion is to a position -before- the commit_lsn so it
    is doing a AIL traversal on every insertion, and has to walk over
    all the items that have already been inserted into the AIL. It's
    expensive.

    To compound the matter further, with delayed logging inodes are
    likely to go from clean to stale in a single checkpoint, which means
    they aren't even in the AIL at all when we come across them at AIL
    insertion time. Hence these were all getting inserted into the AIL
    when they simply do not need to be as inodes marked XFS_ISTALE are
    never written back.

    Transactional/recovery integrity is maintained in this case by the
    other items in the unlink transaction that were modified (e.g. the
    AGI btree blocks) and committed in the same checkpoint.

    So to fix this, simply unpin the stale inodes directly in
    xfs_inode_item_committed() and return -1 to indicate that the AIL
    insertion code does not need to do any further processing of these
    inodes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

25 May, 2011

1 commit

  • Now that we have reliably tracking of deleted extents in a
    transaction we can easily implement "online" discard support
    which calls blkdev_issue_discard once a transaction commits.

    The actual discard is a two stage operation as we first have
    to mark the busy extent as not available for reuse before we
    can start the actual discard. Note that we don't bother
    supporting discard for the non-delaylog mode.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

29 Apr, 2011

1 commit


28 Jan, 2011

2 commits

  • Failure to commit a transaction into the CIL is not handled
    correctly. This currently can only happen when racing with a
    shutdown and requires an explicit shutdown check, so it rare and can
    be avoided. Remove the shutdown check and make the CIL commit a void
    function to indicate it will always succeed, thereby removing the
    incorrectly handled failure case.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • After test 139, kmemleak shows:

    unreferenced object 0xffff880078b405d8 (size 400):
    comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s)
    hex dump (first 32 bytes):
    60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff `..y....`..y....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x2d/0x60
    [] kmem_cache_alloc+0x13f/0x2b0
    [] kmem_zone_alloc+0x77/0xf0
    [] kmem_zone_zalloc+0x1e/0x50
    [] xfs_efi_init+0x4b/0xb0
    [] xfs_trans_get_efi+0x58/0x90
    [] xfs_bmap_finish+0x8b/0x1d0
    [] xfs_itruncate_finish+0x2c4/0x5d0
    [] xfs_setattr+0x8df/0xa70
    [] xfs_vn_setattr+0x1b/0x20
    [] notify_change+0x170/0x2e0
    [] do_truncate+0x66/0xa0
    [] sys_ftruncate+0xdb/0xe0
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    The cause of the leak is that the "remove" parameter of IOP_UNPIN()
    is never set when a CIL push is aborted. This means that the EFI
    item is never freed if it was in the push being cancelled. The
    problem is specific to delayed logging, but has uncovered a couple
    of problems with the handling of IOP_UNPIN(remove).

    Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN()
    in the CIL commit failure path or the iclog write failure path
    because for delayed loging we have no transaction context. Hence we
    must only call xfs_trans_del_item() if the log item being unpinned
    has an active log item descriptor.

    Secondly, xfs_trans_uncommit() does not handle log item descriptor
    freeing during the traversal of log items on a transaction. It can
    reference a freed log item descriptor when unpinning an EFI item.
    Hence it needs to use a safe list traversal method to allow items to
    be removed from the transaction during IOP_UNPIN().

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     

12 Jan, 2011

1 commit

  • In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out:
    label we have this:
    ASSERT(error = 0);
    I believe a comparison was intended, not an assignment. If I'm
    right, the patch below fixes that up.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Alex Elder

    Jesper Juhl
     

20 Dec, 2010

1 commit

  • When inserting items into the AIL from the transaction committed
    callbacks, we take the AIL lock for every single item that is to be
    inserted. For a CIL checkpoint commit, this can be tens of thousands
    of individual inserts, yet almost all of the items will be inserted
    at the same point in the AIL because they have the same index.

    To reduce the overhead and contention on the AIL lock for such
    operations, introduce a "bulk insert" operation which allows a list
    of log items with the same LSN to be inserted in a single operation
    via a list splice. To do this, we need to pre-sort the log items
    being committed into a temporary list for insertion.

    The complexity is that not every log item will end up with the same
    LSN, and not every item is actually inserted into the AIL. Items
    that don't match the commit LSN will be inserted and unpinned as per
    the current one-at-a-time method (relatively rare), while items that
    are not to be inserted will be unpinned and freed immediately. Items
    that are to be inserted at the given commit lsn are placed in a
    temporary array and inserted into the AIL in bulk each time the
    array fills up.

    As a result of this, we trade off AIL hold time for a significant
    reduction in traffic. lock_stat output shows that the worst case
    hold time is unchanged, but contention from AIL inserts drops by an
    order of magnitude and the number of lock traversal decreases
    significantly.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

19 Oct, 2010

4 commits


24 Aug, 2010

1 commit

  • When we commit a transaction using delayed logging, we need to
    unlock the items in the transaciton before we unlock the CIL context
    and allow it to be checkpointed. If we unlock them after we release
    the CIl context lock, the CIL can checkpoint and complete before
    we free the log items. This breaks stale buffer item unlock and
    unpin processing as there is an implicit assumption that the unlock
    will occur before the unpin.

    Also, some log items need to store the LSN of the transaction commit
    in the item (inodes and EFIs) and so can race with other transaction
    completions if we don't prevent the CIL from checkpointing before
    the unlock occurs.

    Cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

27 Jul, 2010

6 commits

  • xfs_trans_add_item() is called with ip->i_ilock held, which means it
    is unsafe for memory reclaim to recurse back into the filesystem
    (ilock is required in writeback). Hence the allocation needs to be
    KM_NOFS to avoid recursion.

    Lockdep report indicating memory allocation being called with the
    ip->i_ilock held is as follows:

    [ 1749.866796] =================================
    [ 1749.867788] [ INFO: inconsistent lock state ]
    [ 1749.868327] 2.6.35-rc3-dgc+ #25
    [ 1749.868741] ---------------------------------
    [ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
    [ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes:
    [ 1749.868741] (&(&ip->i_lock)->mr_lock){++++?.}, at: [] xfs_ilock+0x10b/0x190
    [ 1749.868741] {IN-RECLAIM_FS-W} state was registered at:
    [ 1749.868741] [] __lock_acquire+0x437/0x1450
    [ 1749.868741] [] lock_acquire+0xa6/0x160
    [ 1749.868741] [] down_write_nested+0x65/0xb0
    [ 1749.868741] [] xfs_ilock+0x10b/0x190
    [ 1749.868741] [] xfs_reclaim_inode+0x99/0x310
    [ 1749.868741] [] xfs_inode_ag_walk+0x8b/0x150
    [ 1749.868741] [] xfs_inode_ag_iterator+0x8b/0xf0
    [ 1749.868741] [] xfs_reclaim_inode_shrink+0x88/0x90
    [ 1749.868741] [] shrink_slab+0x137/0x1a0
    [ 1749.868741] [] balance_pgdat+0x421/0x6a0
    [ 1749.868741] [] kswapd+0x11d/0x320
    [ 1749.868741] [] kthread+0x96/0xa0
    [ 1749.868741] [] kernel_thread_helper+0x4/0x10
    [ 1749.868741] irq event stamp: 4234335
    [ 1749.868741] hardirqs last enabled at (4234335): [] kmem_cache_free+0x115/0x220
    [ 1749.868741] hardirqs last disabled at (4234334): [] kmem_cache_free+0x3d/0x220
    [ 1749.868741] softirqs last enabled at (4233112): [] __do_softirq+0x142/0x260
    [ 1749.868741] softirqs last disabled at (4233095): [] call_softirq+0x1c/0x50
    [ 1749.868741]
    [ 1749.868741] other info that might help us debug this:
    [ 1749.868741] 2 locks held by dd/2835:
    [ 1749.868741] #0: (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [] xfs_ilock_nowait+0xed/0x200
    [ 1749.868741] #1: (&(&ip->i_lock)->mr_lock){++++?.}, at: [] xfs_ilock+0x10b/0x190
    [ 1749.868741]
    [ 1749.868741] stack backtrace:
    [ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25
    [ 1749.868741] Call Trace:
    [ 1749.868741] [] print_usage_bug+0x18a/0x190
    [ 1749.868741] [] ? save_stack_trace+0x2f/0x50
    [ 1749.868741] [] ? check_usage_backwards+0x0/0xf0
    [ 1749.868741] [] mark_lock+0x331/0x400
    [ 1749.868741] [] mark_held_locks+0x67/0x90
    [ 1749.868741] [] lockdep_trace_alloc+0xa1/0xe0
    [ 1749.868741] [] kmem_cache_alloc+0x39/0x1e0
    [ 1749.868741] [] kmem_zone_alloc+0x94/0xe0
    [ 1749.868741] [] kmem_zone_zalloc+0x1e/0x50
    [ 1749.868741] [] xfs_trans_add_item+0x72/0xb0
    [ 1749.868741] [] xfs_trans_ijoin+0xa1/0xd0
    [ 1749.868741] [] xfs_itruncate_finish+0x312/0x5d0
    [ 1749.868741] [] xfs_free_eofblocks+0x227/0x280
    [ 1749.868741] [] xfs_release+0x138/0x190
    [ 1749.868741] [] xfs_file_release+0x15/0x20
    [ 1749.868741] [] fput+0x13f/0x260
    [ 1749.868741] [] filp_close+0x52/0x80
    [ 1749.868741] [] sys_close+0xb9/0x120
    [ 1749.868741] [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
    joining it to a transaction via xfs_trans_ijoin.

    This patches instead makes xfs_trans_ijoin usable on it's own by doing
    an implicity xfs_trans_ihold, which also allows us to drop the third
    argument. For the case where we want to hold a reference on the inode
    a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
    the inode for needing an xfs_iput. In addition to the cleaner interface
    to the caller this also simplifies the implementation.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • The unpin_remove item operation instances always share most of the
    implementation with the respective unpin implementation. So instead
    of keeping two different entry points add a remove flag to the unpin
    operation and share the code more easily.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Currently we track log item descriptor belonging to a transaction using a
    complex opencoded chunk allocator. This code has been there since day one
    and seems to work around the lack of an efficient slab allocator.

    This patch replaces it with dynamically allocated log item descriptors
    from a dedicated slab pool, linked to the transaction by a linked list.

    This allows to greatly simplify the log item descriptor tracking to the
    point where it's just a couple hundred lines in xfs_trans.c instead of
    a separate file. The external API has also been simplified while we're
    at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
    delete items from a transaction have been simplified to the bare minium,
    and the xfs_trans_find_item function is replaced with a direct dereference
    of the li_desc field. All debug code walking the list of log items in
    a transaction is down to a simple list_for_each_entry.

    Note that we could easily use a singly linked list here instead of the
    double linked list from list.h as the fastpath only does deletion from
    sequential traversal. But given that we don't have one available as
    a library function yet I use the list.h functions for simplicity.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Dmapi support was never merged upstream, but we still have a lot of hooks
    bloating XFS for it, all over the fast pathes of the filesystem.

    This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
    support in mainline at least the namespace events can be done much saner
    in the VFS instead of the individual filesystem, so it's not like this
    is much help for future work.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

29 May, 2010

1 commit

  • Instead of having small helper functions calling big macros do the
    calculations for the log reservations directly in the functions.
    These are mostly 1:1 from the macros execept that the macros kept
    the quota calculations in their callers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

24 May, 2010

2 commits

  • The delayed logging code only changes in-memory structures and as
    such can be enabled and disabled with a mount option. Add the mount
    option and emit a warning that this is an experimental feature that
    should not be used in production yet.

    We also need infrastructure to track committed items that have not
    yet been written to the log. This is what the Committed Item List
    (CIL) is for.

    The log item also needs to be extended to track the current log
    vector, the associated memory buffer and it's location in the Commit
    Item List. Extend the log item and log vector structures to enable
    this tracking.

    To maintain the current log format for transactions with delayed
    logging, we need to introduce a checkpoint transaction and a context
    for tracking each checkpoint from initiation to transaction
    completion. This includes adding a log ticket for tracking space
    log required/used by the context checkpoint.

    To track all the changes we need an io vector array per log item,
    rather than a single array for the entire transaction. Using the new
    log vector structure for this requires two passes - the first to
    allocate the log vector structures and chain them together, and the
    second to fill them out. This log vector chain can then be passed
    to the CIL for formatting, pinning and insertion into the CIL.

    Formatting of the log vector chain is relatively simple - it's just
    a loop over the iovecs on each log vector, but it is made slightly
    more complex because we re-write the iovec after the copy to point
    back at the memory buffer we just copied into.

    This code also needs to pin log items. If the log item is not
    already tracked in this checkpoint context, then it needs to be
    pinned. Otherwise it is already pinned and we don't need to pin it
    again.

    The only other complexity is calculating the amount of new log space
    the formatting has consumed. This needs to be accounted to the
    transaction in progress, and the accounting is made more complex
    becase we need also to steal space from it for log metadata in the
    checkpoint transaction. Calculate all this at insert time and update
    all the tickets, counters, etc correctly.

    Once we've formatted all the log items in the transaction, attach
    the busy extents to the checkpoint context so the busy extents live
    until checkpoint completion and can be processed at that point in
    time. Transactions can then be freed at this point in time.

    Now we need to issue checkpoints - we are tracking the amount of log space
    used by the items in the CIL, so we can trigger background checkpoints when the
    space usage gets to a certain threshold. Otherwise, checkpoints need ot be
    triggered when a log synchronisation point is reached - a log force event.

    Because the log write code already handles chained log vectors, writing the
    transaction is trivial, too. Construct a transaction header, add it
    to the head of the chain and write it into the log, then issue a
    commit record write. Then we can release the checkpoint log ticket
    and attach the context to the log buffer so it can be called during
    Io completion to complete the checkpoint.

    We also need to allow for synchronising multiple in-flight
    checkpoints. This is needed for two things - the first is to ensure
    that checkpoint commit records appear in the log in the correct
    sequence order (so they are replayed in the correct order). The
    second is so that xfs_log_force_lsn() operates correctly and only
    flushes and/or waits for the specific sequence it was provided with.

    To do this we need a wait variable and a list tracking the
    checkpoint commits in progress. We can walk this list and wait for
    the checkpoints to change state or complete easily, an this provides
    the necessary synchronisation for correct operation in both cases.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • When we free a metadata extent, we record it in the per-AG busy
    extent array so that it is not re-used before the freeing
    transaction hits the disk. This array is fixed size, so when it
    overflows we make further allocation transactions synchronous
    because we cannot track more freed extents until those transactions
    hit the disk and are completed. Under heavy mixed allocation and
    freeing workloads with large log buffers, we can overflow this array
    quite easily.

    Further, the array is sparsely populated, which means that inserts
    need to search for a free slot, and array searches often have to
    search many more slots that are actually used to check all the
    busy extents. Quite inefficient, really.

    To enable this aspect of extent freeing to scale better, we need
    a structure that can grow dynamically. While in other areas of
    XFS we have used radix trees, the extents being freed are at random
    locations on disk so are better suited to being indexed by an rbtree.

    So, use a per-AG rbtree indexed by block number to track busy
    extents. This incures a memory allocation when marking an extent
    busy, but should not occur too often in low memory situations. This
    should scale to an arbitrary number of extents so should not be a
    limitation for features such as in-memory aggregation of
    transactions.

    However, there are still situations where we can't avoid allocating
    busy extents (such as allocation from the AGFL). To minimise the
    overhead of such occurences, we need to avoid doing a synchronous
    log force while holding the AGF locked to ensure that the previous
    transactions are safely on disk before we use the extent. We can do
    this by marking the transaction doing the allocation as synchronous
    rather issuing a log force.

    Because of the locking involved and the ordering of transactions,
    the synchronous transaction provides the same guarantees as a
    synchronous log force because it ensures that all the prior
    transactions are already on disk when the synchronous transaction
    hits the disk. i.e. it preserves the free->allocate order of the
    extent correctly in recovery.

    By doing this, we avoid holding the AGF locked while log writes are
    in progress, hence reducing the length of time the lock is held and
    therefore we increase the rate at which we can allocate and free
    from the allocation group, thereby increasing overall throughput.

    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), then buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting
    the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for log
    force to occur.

    Ideally we should not reallocate busy extents. That is a much more
    complex fix to the problem as it involves direct intervention in the
    allocation btree searches in many places. This is left to a future
    set of modifications.

    Finally, now that we track busy extents in allocated memory, we
    don't need the descriptors in the transaction structure to point to
    them. We can replace the complex busy chunk infrastructure with a
    simple linked list of busy extents. This allows us to remove a large
    chunk of code, making the overall change a net reduction in code
    size.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

19 May, 2010

4 commits

  • Now that the code has been factored, clean up all the remaining
    style cruft, simplify the code and re-order functions so that it
    doesn't need forward declarations.

    Also move the remaining functions that require forward declarations
    (xfs_trans_uncommit, xfs_trans_free) so that all the forward
    declarations can be removed from the file.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • The function header to xfs-trans_committed has long had this
    comment:

    * THIS SHOULD BE REWRITTEN TO USE xfs_trans_next_item()

    To prepare for different methods of committing items, convert the
    code to use xfs_trans_next_item() and factor the code into smaller,
    more digestible chunks.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • > +shut_us_down:
    > + shutdown = XFS_FORCED_SHUTDOWN(mp) ? EIO : 0;
    > + if (!(tp->t_flags & XFS_TRANS_DIRTY) || shutdown) {
    > + xfs_trans_unreserve_and_mod_sb(tp);
    > + /*

    This whole area in _xfs_trans_commit is still a complete mess.

    So while touching this code, unravel this mess as well to make the
    whole flow of the function simpler and clearer.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Split the the part of xfs_trans_commit() that deals with writing the
    transaction into the iclog into a separate function. This isolates the
    physical commit process from the logical commit operation and makes
    it easier to insert different transaction commit paths without affecting
    the existing algorithm adversely.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner