12 Oct, 2011

1 commit

  • Currently we have a few issues with the way the workqueue code is used to
    implement AIL pushing:

    - it accidentally uses the same workqueue as the syncer action, and thus
      can be prevented from running if there are enough sync actions active
      in the system.
    - it doesn't use the HIGHPRI flag to queue at the head of the queue of
      work items

    At this point I'm not confident enough in getting all the workqueue flags and
    tweaks right to provide a perfectly reliable execution context for AIL
    pushing, which is the most important piece in XFS to make forward progress
    when the log fills.

    Revert to using a kthread per filesystem, which fixes all the above issues
    at the cost of having a task struct and stack around for each mounted
    filesystem. In addition, this gives us much better ways to diagnose
    any issues involving hung AIL pushing and removes a small amount of code.

    Signed-off-by: Christoph Hellwig
    Reported-by: Stefan Priebe
    Tested-by: Stefan Priebe
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

21 Jul, 2011

2 commits

  • The list of active AIL cursors uses a roll-your-own linked list with
    special casing for the AIL push cursor. Simplify this code by
    replacing the list with standard struct list_head lists, and use a
    separate list_head to track the active cursors. This allows us to
    treat the AIL push cursor as a generic cursor rather than as a
    special case, further simplifying the code.

    Further, fix the duplicate push cursor initialisation that the
    special case handling was hiding, and clean up all the comments
    around the active cursor list handling.
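
    As an illustrative sketch of the pattern (the identifiers here are
    assumptions, not the actual XFS names), each cursor embeds a standard
    list_head and the AIL keeps one list of all active cursors, so the
    push cursor needs no special casing:

        #include <linux/list.h>

        struct ail_cursor {
                struct list_head  cur_list;   /* links active cursors */
                struct log_item   *cur_next;  /* next item to visit */
        };

        struct ail {
                struct list_head  ail_cursors; /* all active cursors */
        };

        /* every cursor, the push cursor included, is set up the same way */
        static void ail_cursor_init(struct ail *ail, struct ail_cursor *cur)
        {
                cur->cur_next = NULL;
                list_add_tail(&cur->cur_list, &ail->ail_cursors);
        }

        static void ail_cursor_done(struct ail_cursor *cur)
        {
                list_del_init(&cur->cur_list);
        }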

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Delayed logging can insert tens of thousands of log items into the
    AIL at the same LSN. When the committing of log commit records
    occurs, we can get insertions occurring at an LSN that is not at the
    end of the AIL. If there are thousands of items in the AIL on the
    tail LSN, each insertion has to walk the AIL to find the correct
    place to insert the new item into the AIL. This can consume large
    amounts of CPU time and block other operations from occurring while
    the traversals are in progress.

    To avoid this repeated walk, use an AIL cursor to record
    where we should be inserting the new items into the AIL without
    having to repeat the walk. The cursor infrastructure already
    provides this functionality for push walks, so this is a simple extension
    of existing code. While this will not avoid the initial walk, it
    will avoid repeating it tens of thousands of times during a single
    checkpoint commit.
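
    A hedged sketch of the idea (names and helpers are assumptions): the
    cursor remembers where the previous insert landed, so each subsequent
    insert at the same LSN resumes the walk from there rather than from
    the head of the AIL:

        #include <linux/list.h>
        #include <linux/types.h>

        struct log_item {
                struct list_head  li_ail;
                u64               li_lsn;
        };

        struct ail;                     /* opaque in this sketch */
        struct ail_cursor {
                struct list_head  cur_list;
                struct log_item   *cur_next;
        };

        /* traversal helpers, assumed to exist elsewhere */
        struct log_item *ail_first(struct ail *ail);
        struct log_item *ail_next(struct ail *ail, struct log_item *lip);
        void ail_insert_before(struct ail *ail, struct log_item *pos,
                               struct log_item *lip);

        static void ail_insert_with_cursor(struct ail *ail,
                                           struct ail_cursor *cur,
                                           struct log_item *lip, u64 lsn)
        {
                /* resume from the remembered position if there is one */
                struct log_item *pos = cur->cur_next ?: ail_first(ail);

                /* walk forward to the first item with a higher LSN */
                while (pos && pos->li_lsn <= lsn)
                        pos = ail_next(ail, pos);

                ail_insert_before(ail, pos, lip);   /* NULL pos == tail */
                cur->cur_next = pos;    /* the next insert starts here */
        }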

    This version includes logic improvements from Christoph Hellwig.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

08 Apr, 2011

2 commits

    When we are short on memory, we want to expedite the cleaning of
    dirty objects. Hence we need to kick the AIL flushing into action to
    clean as many dirty objects as quickly as possible. To implement
    this, sample the LSN of the log item at the head of the AIL and use
    that as the push target for the AIL flush.

    Further, we keep items in the AIL that are dirty but are not
    tracked any other way, so we can get objects sitting in the AIL that
    don't get written back until the AIL is pushed. Hence to get the
    filesystem to the idle state, we might need to push the AIL to flush
    out any remaining dirty objects sitting in the AIL. This requires
    the same push mechanism as the reclaim push.

    This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to
    match the new xfs_ail_max_lsn() function introduced in this patch.
    Similarly for xfs_trans_ail_push -> xfs_ail_push.
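
    A minimal sketch of the max-LSN sampling (names assumed; the real
    function and its locking details differ):

        #include <linux/list.h>
        #include <linux/spinlock.h>
        #include <linux/types.h>

        struct log_item {
                struct list_head  li_ail;
                u64               li_lsn;
        };

        struct ail {
                spinlock_t        ail_lock;
                struct list_head  ail_head;
        };

        /* sample the LSN of the highest item; 0 means the AIL is empty */
        static u64 ail_max_lsn(struct ail *ail)
        {
                u64 lsn = 0;

                spin_lock(&ail->ail_lock);
                if (!list_empty(&ail->ail_head))
                        lsn = list_last_entry(&ail->ail_head,
                                              struct log_item, li_ail)->li_lsn;
                spin_unlock(&ail->ail_lock);

                return lsn;     /* used as the push target for the flush */
        }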

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • Similar to the xfssyncd, the per-filesystem xfsaild threads can be
    converted to a global workqueue and run periodically as delayed
    work items. This makes sense for AIL pushing because it uses
    variable timeouts depending on the work that needs to be done.

    By removing the xfsaild, we simplify the AIL pushing code and
    remove the need to spread the code to implement the threading
    and pushing across multiple files.
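
    A sketch of the delayed-work shape described here (all names are
    assumptions; note that the 12 Oct 2011 entry above reverts this
    change):

        #include <linux/workqueue.h>

        struct ail {
                struct delayed_work  ail_work;
                /* ... */
        };

        static struct workqueue_struct *ail_wq;   /* one global workqueue */

        /* assumed: pushes the AIL, returns the next timeout in jiffies */
        unsigned long ail_do_push(struct ail *ail);

        static void ail_push_work(struct work_struct *work)
        {
                struct ail *ail = container_of(to_delayed_work(work),
                                               struct ail, ail_work);

                /* re-arm with a variable timeout depending on the work done */
                queue_delayed_work(ail_wq, &ail->ail_work, ail_do_push(ail));
        }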

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

20 Dec, 2010

4 commits

  • We now have two copies of AIL delete operations that are mostly
    duplicate functionality. The single log item deletes can be
    implemented via the bulk updates by turning xfs_trans_ail_delete()
    into a simple wrapper. This removes all the duplicate delete
    functionality and associated helpers.
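
    The wrapper pattern, sketched with assumed names (the insert-side
    commit below has exactly the same shape):

        /* bulk removal: one AIL lock round-trip for nr items (assumed) */
        void ail_delete_bulk(struct ail *ail, struct log_item **items, int nr);

        /* the single-item delete becomes a one-element bulk delete */
        static inline void ail_delete(struct ail *ail, struct log_item *lip)
        {
                ail_delete_bulk(ail, &lip, 1);
        }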

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • We now have two copies of AIL insert operations that are mostly
    duplicate functionality. The single log item updates can be
    implemented via the bulk updates by turning xfs_trans_ail_update()
    into a simple wrapper. This removes all the duplicate insert
    functionality and associated helpers.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • When inode buffer IO completes, usually all of the inodes are removed from the
    AIL. This involves processing them one at a time and taking the AIL lock once
    for every inode. When all CPUs are processing inode IO completions, this causes
    excessive amounts of contention on the AIL lock.

    Instead, change the way we process inode IO completion in the buffer
    IO done callback. Allow the inode IO done callback to walk the list
    of IO done callbacks and pull all the inodes off the buffer in one
    go and then process them as a batch.

    Once all the inodes for removal are collected, take the AIL lock
    once and do a bulk removal operation to minimise traffic on the AIL
    lock.
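
    A sketch of the batched completion (names and the buffer's IO-done
    list layout are assumptions):

        #include <linux/list.h>
        #include <linux/spinlock.h>

        struct log_item {
                struct list_head  li_bio_list;  /* on the buffer's list */
        };

        /* assumed: removes one item with the AIL lock already held */
        void ail_delete_locked(struct ail *ail, struct log_item *lip);

        static void inode_buf_iodone(struct ail *ail, struct list_head *iodone)
        {
                struct log_item *lip, *n;
                LIST_HEAD(batch);

                /* pull every inode log item off the buffer in one pass */
                list_for_each_entry_safe(lip, n, iodone, li_bio_list)
                        list_move_tail(&lip->li_bio_list, &batch);

                /* then take the AIL lock once for the whole batch */
                spin_lock(&ail->ail_lock);
                list_for_each_entry_safe(lip, n, &batch, li_bio_list)
                        ail_delete_locked(ail, lip);
                spin_unlock(&ail->ail_lock);
        }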

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • When inserting items into the AIL from the transaction committed
    callbacks, we take the AIL lock for every single item that is to be
    inserted. For a CIL checkpoint commit, this can be tens of thousands
    of individual inserts, yet almost all of the items will be inserted
    at the same point in the AIL because they have the same index.

    To reduce the overhead and contention on the AIL lock for such
    operations, introduce a "bulk insert" operation which allows a list
    of log items with the same LSN to be inserted in a single operation
    via a list splice. To do this, we need to pre-sort the log items
    being committed into a temporary list for insertion.

    The complexity is that not every log item will end up with the same
    LSN, and not every item is actually inserted into the AIL. Items
    that don't match the commit LSN will be inserted and unpinned as per
    the current one-at-a-time method (relatively rare), while items that
    are not to be inserted will be unpinned and freed immediately. Items
    that are to be inserted at the given commit lsn are placed in a
    temporary array and inserted into the AIL in bulk each time the
    array fills up.
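
    A sketch of the bulk insert (names assumed): items destined for the
    commit LSN accumulate in a local list that is spliced into the AIL
    at the cursor position in one operation:

        #include <linux/list.h>
        #include <linux/spinlock.h>
        #include <linux/types.h>

        struct log_item {
                struct list_head  li_ail;
                u64               li_lsn;
        };

        /* assumed: the list position the cursor remembers for this lsn */
        struct list_head *ail_cursor_pos(struct ail *ail,
                                         struct ail_cursor *cur, u64 lsn);

        static void ail_insert_bulk(struct ail *ail, struct ail_cursor *cur,
                                    struct log_item **items, int nr, u64 lsn)
        {
                LIST_HEAD(tmp);
                int i;

                for (i = 0; i < nr; i++) {
                        items[i]->li_lsn = lsn;
                        list_add_tail(&items[i]->li_ail, &tmp);
                }

                spin_lock(&ail->ail_lock);
                /* one splice replaces nr individual ordered inserts */
                list_splice_tail(&tmp, ail_cursor_pos(ail, cur, lsn));
                spin_unlock(&ail->ail_lock);
        }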

    As a result of this, we trade off AIL hold time for a significant
    reduction in traffic. lock_stat output shows that the worst case
    hold time is unchanged, but contention from AIL inserts drops by an
    order of magnitude and the number of lock traversals decreases
    significantly.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

24 Aug, 2010

1 commit

  • When we commit a transaction using delayed logging, we need to
    unlock the items in the transaction before we unlock the CIL context
    and allow it to be checkpointed. If we unlock them after we release
    the CIL context lock, the CIL can checkpoint and complete before
    we free the log items. This breaks stale buffer item unlock and
    unpin processing as there is an implicit assumption that the unlock
    will occur before the unpin.

    Also, some log items need to store the LSN of the transaction commit
    in the item (inodes and EFIs) and so can race with other transaction
    completions if we don't prevent the CIL from checkpointing before
    the unlock occurs.

    Cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

27 Jul, 2010

1 commit

    Currently we track the log item descriptors belonging to a transaction using a
    complex opencoded chunk allocator. This code has been there since day one
    and seems to work around the lack of an efficient slab allocator.

    This patch replaces it with dynamically allocated log item descriptors
    from a dedicated slab pool, linked to the transaction by a linked list.

    This allows us to greatly simplify the log item descriptor tracking to the
    point where it's just a couple hundred lines in xfs_trans.c instead of
    a separate file. The external API has also been simplified while we're
    at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
    delete items from a transaction have been simplified to the bare minimum,
    and the xfs_trans_find_item function is replaced with a direct dereference
    of the li_desc field. All debug code walking the list of log items in
    a transaction is down to a simple list_for_each_entry.

    Note that we could easily use a singly linked list here instead of the
    doubly linked list from list.h, as the fastpath only does deletion
    during sequential traversal. But given that we don't have one available
    as a library function yet, I use the list.h functions for simplicity.
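
    A sketch of the simplified tracking (the function and field names
    follow the text above; the exact layout is an assumption, and error
    handling is elided):

        #include <linux/list.h>
        #include <linux/slab.h>

        struct log_item_desc {
                struct log_item   *lid_item;
                struct list_head  lid_trans;    /* entry in t_items */
        };

        struct log_item {
                struct log_item_desc *li_desc;  /* direct back pointer */
        };

        struct trans {
                struct list_head  t_items;
        };

        static struct kmem_cache *lid_cache;    /* dedicated slab pool */

        static void trans_add_item(struct trans *tp, struct log_item *lip)
        {
                struct log_item_desc *lidp;

                lidp = kmem_cache_alloc(lid_cache, GFP_NOFS);
                lidp->lid_item = lip;
                lip->li_desc = lidp;    /* makes xfs_trans_find_item a deref */
                list_add_tail(&lidp->lid_trans, &tp->t_items);
        }

        static void trans_del_item(struct log_item *lip)
        {
                list_del(&lip->li_desc->lid_trans);
                kmem_cache_free(lid_cache, lip->li_desc);
                lip->li_desc = NULL;
        }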

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

24 May, 2010

2 commits

  • The delayed logging code only changes in-memory structures and as
    such can be enabled and disabled with a mount option. Add the mount
    option and emit a warning that this is an experimental feature that
    should not be used in production yet.

    We also need infrastructure to track committed items that have not
    yet been written to the log. This is what the Committed Item List
    (CIL) is for.

    The log item also needs to be extended to track the current log
    vector, the associated memory buffer and its location in the Committed
    Item List. Extend the log item and log vector structures to enable
    this tracking.

    To maintain the current log format for transactions with delayed
    logging, we need to introduce a checkpoint transaction and a context
    for tracking each checkpoint from initiation to transaction
    completion. This includes adding a log ticket for tracking the log
    space required/used by the checkpoint context.

    To track all the changes we need an IO vector array per log item,
    rather than a single array for the entire transaction. Using the new
    log vector structure for this requires two passes - the first to
    allocate the log vector structures and chain them together, and the
    second to fill them out. This log vector chain can then be passed
    to the CIL for formatting, pinning and insertion into the CIL.

    Formatting of the log vector chain is relatively simple - it's just
    a loop over the iovecs on each log vector, but it is made slightly
    more complex because we re-write the iovec after the copy to point
    back at the memory buffer we just copied into.
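
    A sketch of that formatting loop (structure names assumed):

        #include <linux/string.h>

        struct log_iovec {
                void    *i_addr;
                int     i_len;
        };

        struct log_vec {
                struct log_vec    *lv_next;     /* chained in pass one */
                struct log_iovec  *lv_iovecp;
                int               lv_niovecs;
                char              *lv_buf;      /* allocated in pass one */
        };

        /* pass two: copy each iovec, then re-point it at the copy */
        static void format_log_vector(struct log_vec *lv)
        {
                char *ptr = lv->lv_buf;
                int i;

                for (i = 0; i < lv->lv_niovecs; i++) {
                        struct log_iovec *vec = &lv->lv_iovecp[i];

                        memcpy(ptr, vec->i_addr, vec->i_len);
                        vec->i_addr = ptr;
                        ptr += vec->i_len;
                }
        }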

    This code also needs to pin log items. If the log item is not
    already tracked in this checkpoint context, then it needs to be
    pinned. Otherwise it is already pinned and we don't need to pin it
    again.

    The only other complexity is calculating the amount of new log space
    the formatting has consumed. This needs to be accounted to the
    transaction in progress, and the accounting is made more complex
    because we also need to steal space from it for log metadata in the
    checkpoint transaction. Calculate all this at insert time and update
    all the tickets, counters, etc correctly.

    Once we've formatted all the log items in the transaction, attach
    the busy extents to the checkpoint context so the busy extents live
    until checkpoint completion and can be processed at that point in
    time. The transactions themselves can then be freed.

    Now we need to issue checkpoints - we are tracking the amount of log space
    used by the items in the CIL, so we can trigger background checkpoints when the
    space usage gets to a certain threshold. Otherwise, checkpoints need to be
    triggered when a log synchronisation point is reached - a log force event.
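
    A sketch of the background trigger (the threshold value and the work
    mechanism are assumptions):

        #include <linux/workqueue.h>

        #define CIL_SPACE_LIMIT (8 * 1024 * 1024)   /* assumed threshold */

        static struct workqueue_struct *cil_wq;

        struct cil {
                unsigned long       space_used;
                struct work_struct  push_work;
        };

        /* kick a background checkpoint once the CIL grows large enough */
        static void cil_push_background(struct cil *cil)
        {
                if (cil->space_used < CIL_SPACE_LIMIT)
                        return;

                queue_work(cil_wq, &cil->push_work);
        }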

    Because the log write code already handles chained log vectors, writing the
    transaction is trivial, too. Construct a transaction header, add it
    to the head of the chain and write it into the log, then issue a
    commit record write. Then we can release the checkpoint log ticket
    and attach the context to the log buffer so it can be called during
    IO completion to complete the checkpoint.

    We also need to allow for synchronising multiple in-flight
    checkpoints. This is needed for two things - the first is to ensure
    that checkpoint commit records appear in the log in the correct
    sequence order (so they are replayed in the correct order). The
    second is so that xfs_log_force_lsn() operates correctly and only
    flushes and/or waits for the specific sequence it was provided with.

    To do this we need a wait variable and a list tracking the
    checkpoint commits in progress. We can walk this list and wait for
    the checkpoints to change state or complete easily, and this provides
    the necessary synchronisation for correct operation in both cases.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • When we free a metadata extent, we record it in the per-AG busy
    extent array so that it is not re-used before the freeing
    transaction hits the disk. This array is fixed size, so when it
    overflows we make further allocation transactions synchronous
    because we cannot track more freed extents until those transactions
    hit the disk and are completed. Under heavy mixed allocation and
    freeing workloads with large log buffers, we can overflow this array
    quite easily.

    Further, the array is sparsely populated, which means that inserts
    need to search for a free slot, and array searches often have to
    search many more slots than are actually used to check all the
    busy extents. Quite inefficient, really.

    To enable this aspect of extent freeing to scale better, we need
    a structure that can grow dynamically. While in other areas of
    XFS we have used radix trees, the extents being freed are at random
    locations on disk so are better suited to being indexed by an rbtree.

    So, use a per-AG rbtree indexed by block number to track busy
    extents. This incurs a memory allocation when marking an extent
    busy, but should not occur too often in low memory situations. This
    should scale to an arbitrary number of extents so should not be a
    limitation for features such as in-memory aggregation of
    transactions.
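
    A sketch of the rbtree keyed by start block (names assumed):

        #include <linux/rbtree.h>
        #include <linux/types.h>

        struct busy_extent {
                struct rb_node  be_node;
                u64             be_bno;     /* start block in the AG */
                u32             be_len;
        };

        static void busy_extent_insert(struct rb_root *root,
                                       struct busy_extent *new)
        {
                struct rb_node **p = &root->rb_node, *parent = NULL;

                while (*p) {
                        struct busy_extent *be;

                        parent = *p;
                        be = rb_entry(parent, struct busy_extent, be_node);
                        if (new->be_bno < be->be_bno)
                                p = &(*p)->rb_left;
                        else
                                p = &(*p)->rb_right;
                }

                rb_link_node(&new->be_node, parent, p);
                rb_insert_color(&new->be_node, root);
        }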

    However, there are still situations where we can't avoid allocating
    busy extents (such as allocation from the AGFL). To minimise the
    overhead of such occurrences, we need to avoid doing a synchronous
    log force while holding the AGF locked to ensure that the previous
    transactions are safely on disk before we use the extent. We can do
    this by marking the transaction doing the allocation as synchronous
    rather than issuing a log force.

    Because of the locking involved and the ordering of transactions,
    the synchronous transaction provides the same guarantees as a
    synchronous log force because it ensures that all the prior
    transactions are already on disk when the synchronous transaction
    hits the disk. i.e. it preserves the free->allocate order of the
    extent correctly in recovery.

    By doing this, we avoid holding the AGF locked while log writes are
    in progress, hence reducing the length of time the lock is held and
    therefore we increase the rate at which we can allocate and free
    from the allocation group, thereby increasing overall throughput.

    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), the buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting for
    the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for log
    force to occur.

    Ideally we should not reallocate busy extents. That is a much more
    complex fix to the problem as it involves direct intervention in the
    allocation btree searches in many places. This is left to a future
    set of modifications.

    Finally, now that we track busy extents in allocated memory, we
    don't need the descriptors in the transaction structure to point to
    them. We can replace the complex busy chunk infrastructure with a
    simple linked list of busy extents. This allows us to remove a large
    chunk of code, making the overall change a net reduction in code
    size.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

30 Oct, 2008

6 commits

  • Change all the remaining AIL API functions that are passed struct
    xfs_mount pointers to pass pointers directly to the struct xfs_ail being
    used. With this conversion, all external access to the AIL is via the
    struct xfs_ail. Hence the operation and referencing of the AIL is almost
    entirely independent of the xfs_mount that is using it - it is now much
    more tightly tied to the log and the items it is tracking in the log than
    it is tied to the xfs_mount.

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32353a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
  • Bring the ail lock inside the struct xfs_ail. This means the AIL can be
    entirely manipulated via the struct xfs_ail rather than needing both the
    struct xfs_mount and the struct xfs_ail.

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32350a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
    When copying LSNs from the log item to the inode or dquot flush LSN, we
    currently grab the AIL lock. We do this because the LSN is a 64 bit
    quantity and it needs to be read atomically. The lock is used to guarantee
    atomicity for 32 bit platforms.

    Make the LSN copying a small function, and make the function used
    conditional on BITS_PER_LONG so that 64 bit machines don't need to take
    the AIL lock in these places.
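
    A sketch of the conditional helper (the actual XFS function differs
    in naming and types):

        #include <linux/spinlock.h>
        #include <linux/types.h>

        /* a 64 bit store/load is atomic on 64 bit; only 32 bit needs the lock */
        static inline void copy_lsn(spinlock_t *ail_lock, u64 *dst, u64 *src)
        {
        #if BITS_PER_LONG == 32
                spin_lock(ail_lock);
                *dst = *src;
                spin_unlock(ail_lock);
        #else
                *dst = *src;
        #endif
        }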

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32349a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
  • With the new cursor interface, it makes sense to make all the traversing
    code use the cursor interface and make the old one go away. This means
    more of the AIL interfacing is done by passing struct xfs_ail pointers
    around the place instead of struct xfs_mount pointers.

    We can replace the use of xfs_trans_first_ail() in xfs_log_need_covered()
    as it is only checking if the AIL is empty. We can do that with a call to
    xfs_trans_ail_tail() instead, where a zero LSN returned indicates an
    empty AIL.

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32348a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
    Replace the current generation number used to ensure sanity of AIL
    traversal with an external cursor that is linked to the AIL.

    Basically, we store the next item in the cursor whenever we want to drop
    the AIL lock to do something to the current item. When we regain the lock,
    the current item may already be free, so we can't reference it, but the
    next item in the traversal is already held in the cursor.

    When we move or delete an object, we search all the active cursors and if
    there is an item match we clear the cursor(s) that point to the object.
    This forces the traversal to restart transparently.

    We don't invalidate the cursor on insert because the cursor still points
    to a valid item. If the item is inserted between the current item and the
    cursor it does not matter; the traversal is considered to be past the
    insertion point so it will be picked up in the next traversal.

    Hence traversal restarts pretty much disappear altogether with this method
    of traversal, which should substantially reduce the overhead of pushing on
    a busy AIL.
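
    A sketch of the invalidation on move or delete (names assumed; the
    real code may flag the cursor for restart rather than clearing it):

        #include <linux/list.h>

        struct ail_cursor {
                struct list_head  cur_list;
                struct log_item   *cur_next;
        };

        struct ail {
                struct list_head  ail_cursors;
        };

        /* called whenever an item is moved in or deleted from the AIL */
        static void ail_cursor_clear(struct ail *ail, struct log_item *lip)
        {
                struct ail_cursor *cur;

                list_for_each_entry(cur, &ail->ail_cursors, cur_list) {
                        if (cur->cur_next == lip)
                                cur->cur_next = NULL;   /* forces a restart */
                }
        }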

    Version 2:
    - add restart logic
    - comment cursor interface
    - minor cleanups

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32347a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
  • Rather than embedding the struct xfs_ail in the struct xfs_mount, allocate
    it during AIL initialisation. Add a back pointer to the struct xfs_ail so
    that we can pass around the xfs_ail and still be able to access the
    xfs_mount if need be. This is the first step involved in isolating the AIL
    implementation from the surrounding filesystem code.
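
    A sketch of the allocation and back pointer (names assumed):

        #include <linux/list.h>
        #include <linux/slab.h>

        struct fs_mount;                /* stand-in for struct xfs_mount */

        struct ail {
                struct fs_mount   *ail_mount;   /* the new back pointer */
                struct list_head  ail_head;     /* the AIL itself */
        };

        static struct ail *ail_alloc(struct fs_mount *mp)
        {
                struct ail *ail;

                ail = kzalloc(sizeof(*ail), GFP_KERNEL);
                if (!ail)
                        return NULL;

                ail->ail_mount = mp;    /* reach the mount when needed */
                INIT_LIST_HEAD(&ail->ail_head);
                return ail;             /* the mount stores only this pointer */
        }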

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32346a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     

07 Feb, 2008

2 commits

  • When many hundreds to thousands of threads all try to do simultaneous
    transactions and the log is in a tail-pushing situation (i.e. full), we
    can get multiple threads walking the AIL list and contending on the AIL
    lock.

    The AIL push is, in effect, a simple I/O dispatch algorithm complicated by
    the ordering constraints placed on it by the transaction subsystem. It
    really does not need multiple threads to push on it - even when only a
    single CPU is pushing the AIL, it can push the I/O out far faster than
    pretty much any disk subsystem can handle.

    So, to avoid contention problems stemming from multiple list walkers, move
    the list walk off into another thread and simply provide a "target" to
    push to. When a thread requires a push, it sets the target and wakes the
    push thread, then goes to sleep waiting for the required amount of space
    to become available in the log.
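
    A sketch of the hand-off (names assumed, LSN comparison simplified):
    waiters only ever move the target forward and wake the push thread:

        #include <linux/sched.h>
        #include <linux/spinlock.h>
        #include <linux/types.h>

        struct ail {
                spinlock_t         ail_lock;
                u64                ail_target;
                struct task_struct *ail_task;   /* the single pusher */
        };

        static void ail_push(struct ail *ail, u64 threshold_lsn)
        {
                spin_lock(&ail->ail_lock);
                if (threshold_lsn > ail->ail_target)    /* forward only */
                        ail->ail_target = threshold_lsn;
                spin_unlock(&ail->ail_lock);

                wake_up_process(ail->ail_task);
                /* the caller then sleeps until enough log space is free */
        }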

    This mechanism should also be a lot fairer under heavy load as the waiters
    will queue in arrival order, rather than queuing in "who completed a push
    first" order.

    Also, by moving the pushing to a separate thread we can do overload
    detection and prevention more effectively, as we can keep context from
    loop iteration to loop iteration. That is, we can push only part of the
    list each loop and not have to loop back to the start of the list every
    time we run. This should also help by reducing the number of times we
    try to lock and/or push items that we cannot move.

    Note that this patch is not intended to solve the inefficiencies in the
    AIL structure and the associated issues with extremely large list
    contents. That needs to be addressed separately; parallel access would
    cause problems to any new structure as well, so I'm only aiming to isolate
    the structure from unbounded parallelism here.

    SGI-PV: 972759
    SGI-Modid: xfs-linux-melb:xfs-kern:30371a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    David Chinner
     
  • SGI-PV: 970382
    SGI-Modid: xfs-linux-melb:xfs-kern:29739a

    Signed-off-by: Donald Douwsma
    Signed-off-by: Eric Sandeen
    Signed-off-by: Tim Shimmin

    Donald Douwsma
     

28 Sep, 2006

1 commit

  • xfs_trans_update_ail and xfs_trans_delete_ail get called with the AIL lock
    held, and release it. Add lock annotations to these two functions so that
    sparse can check callers for lock pairing, and so that sparse will not
    complain about these functions since they intentionally use locks in this
    manner.
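
    The annotation shape this describes, sketched on the prototypes
    (argument lists are assumptions):

        #include <linux/compiler.h>

        /* sparse: entered with m_ail_lock held, returns with it released */
        void xfs_trans_update_ail(struct xfs_mount *mp,
                                  struct xfs_log_item *lip, xfs_lsn_t lsn)
                __releases(mp->m_ail_lock);

        void xfs_trans_delete_ail(struct xfs_mount *mp,
                                  struct xfs_log_item *lip)
                __releases(mp->m_ail_lock);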

    SGI-PV: 954580
    SGI-Modid: xfs-linux-melb:xfs-kern:26807a

    Signed-off-by: Josh Triplett
    Signed-off-by: Nathan Scott
    Signed-off-by: Tim Shimmin

    Josh Triplett
     

02 Nov, 2005

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds