09 Nov, 2011

1 commit

  • The log item ops aren't necessarily the biggest exploit vector, but marking
    them const is easy enough (see the sketch below). Also remove the unused
    xfs_item_ops_t typedef while we're at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Alex Elder

    Christoph Hellwig
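
    A minimal illustration of why const ops tables matter (a generic C
    sketch, not the actual xfs_item_ops layout): a const table of function
    pointers is emitted into read-only memory, so its pointers cannot be
    redirected at runtime.

    /* Simplified stand-in for a log item ops vtable. */
    struct item_ops {
        unsigned int (*iop_size)(void *item);
        void (*iop_format)(void *item, void *buf);
    };

    static unsigned int dummy_size(void *item) { return 0; }
    static void dummy_format(void *item, void *buf) { }

    /*
     * const: the compiler places this table in .rodata, so an attacker
     * with a kernel write primitive cannot overwrite the pointers.
     */
    static const struct item_ops dummy_item_ops = {
        .iop_size   = dummy_size,
        .iop_format = dummy_format,
    };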
     

12 Oct, 2011

3 commits

  • There is no reason to keep a reference to the inode even if we unlock
    it during transaction commit, because we never drop a reference between
    the ijoin and the commit. Also use this fact to merge xfs_trans_ijoin_ref
    back into xfs_trans_ijoin - the third argument now decides if an unlock
    is needed (see the sketch below).

    I'm actually starting to wonder if allowing inodes to be unlocked
    at transaction commit really is worth the effort. The only real
    benefit is that they can be unlocked earlier when committing a
    synchronous transaction, but that could be solved by doing the
    log force manually after the unlock, too.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
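
    A sketch of the merged interface described above; the third argument
    carries the lock flags that commit should release, with 0 meaning the
    caller keeps the inode locked (hedged reconstruction, not verbatim
    kernel code):

    void xfs_trans_ijoin(struct xfs_trans *tp, struct xfs_inode *ip,
                         uint lock_flags);

    /* Typical caller under these semantics: */
    xfs_ilock(ip, XFS_ILOCK_EXCL);
    xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);   /* commit unlocks for us */
    /* ... make modifications ... */
    error = xfs_trans_commit(tp, 0);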
     
  • Only read the LSN we need to push to with the ilock held, and then release
    it before we do the log force to improve concurrency (see the sketch
    below).

    This also removes the only direct caller of _xfs_trans_commit, thus
    allowing it to be merged into the plain xfs_trans_commit again.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
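
    The locking pattern being described, as a hedged sketch (field and
    flag names approximate the era's code, not guaranteed verbatim):
    sample the LSN under the ilock, drop the lock, then issue the
    potentially slow log force.

    xfs_lsn_t lsn = 0;

    xfs_ilock(ip, XFS_ILOCK_SHARED);
    if (xfs_ipincount(ip) > 0)
        lsn = ip->i_itemp->ili_last_lsn;    /* the LSN we must push to */
    xfs_iunlock(ip, XFS_ILOCK_SHARED);

    /* The slow part now runs without the ilock held. */
    if (lsn)
        xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC);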
     
  • We need to check for pinned buffers even in .iop_pushbuf, given that inode
    items flush into the same buffers that may be pinned directly due to
    operations on the unlinked inode list operating directly on buffers. To do
    this, add a return value to .iop_pushbuf that tells the AIL push about this
    and use the existing log force mechanisms to unpin it (see the sketch
    below).

    Signed-off-by: Christoph Hellwig
    Reported-by: Stefan Priebe
    Tested-by: Stefan Priebe
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
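
    A sketch of the interface change (the counter name is illustrative):
    .iop_pushbuf gains a return value so the AIL push can tell that a
    pinned buffer was encountered and schedule a log force to unpin it.

    /* The ops table entry now reports whether the buffer could be pushed: */
    bool (*iop_pushbuf)(struct xfs_log_item *lip);

    /* In the AIL push loop: */
    if (!lip->li_ops->iop_pushbuf(lip))
        ailp->ail_log_flush++;   /* pinned buffer seen: force the log soon */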
     

23 Feb, 2011

1 commit

  • Currently we return inodes from xfs_ialloc with just a single reference held.
    But we need two references, as one is dropped during transaction commit and
    the second needs to be transferred to the VFS. Change xfs_ialloc to use
    xfs_iget plus xfs_trans_ijoin_ref to grab two references to the inode,
    and remove the now superfluous IHOLD calls from all callers. This also
    greatly simplifies the error handling in xfs_create and allows us to remove
    xfs_trans_iget, as no other callers are left.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

02 Dec, 2010

1 commit

  • Now that the buffer reclaim infrastructure can handle different reclaim
    priorities for different types of buffers, reconnect the hooks in the
    XFS code that have been sitting dormant since it was ported to Linux. This
    should finally give us reclaim prioritisation that is on a par with the
    functionality that Irix provided XFS 15 years ago.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

19 Oct, 2010

2 commits

  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Under heavy multi-way parallel create workloads, the VFS struggles
    to write back all the inodes that have been changed in age order.
    The bdi flusher thread becomes CPU bound, spending 85% of its time
    in the VFS code, mostly traversing the superblock dirty inode list
    to separate dirty inodes old enough to flush.

    We already keep an index of all metadata changes in age order - in
    the AIL - and continued log pressure will do age-ordered writeback
    without any extra overhead at all. If there is no pressure on the
    log, the xfssyncd will periodically write back metadata in ascending
    disk address offset order, so it will be very efficient.

    Hence we can stop marking VFS inodes dirty during transaction commit
    or when changing timestamps during transactions. This will limit the
    inodes in the superblock dirty list to those containing data or
    unlogged metadata changes.

    However, the timestamp changes are slightly more complex than this -
    there are a couple of places that do unlogged updates of the
    timestamps, and the VFS needs to be informed of these. Hence add a
    new function, xfs_trans_ichgtime(), for transactional changes,
    and leave xfs_ichgtime() for the non-transactional changes (see the
    sketch below).

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
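
    A sketch of the resulting split (bodies abbreviated; the logged
    variant deliberately does not call mark_inode_dirty(), since the
    log/AIL now own writeback of logged changes):

    /* Transactional: update timestamps, let the caller log the inode. */
    void xfs_trans_ichgtime(struct xfs_trans *tp, struct xfs_inode *ip,
                            int flags)
    {
        struct timespec tv = current_fs_time(VFS_I(ip)->i_sb);

        if (flags & XFS_ICHGTIME_MOD)
            VFS_I(ip)->i_mtime = tv;
        if (flags & XFS_ICHGTIME_CHG)
            VFS_I(ip)->i_ctime = tv;
        /* caller follows up with xfs_trans_log_inode() */
    }

    /* xfs_ichgtime() remains for unlogged updates and still dirties
     * the VFS inode so the flusher knows about the change. */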
     

27 Jul, 2010

4 commits

  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
    joining it to a transaction via xfs_trans_ijoin.

    This patch instead makes xfs_trans_ijoin usable on its own by doing
    an implicit xfs_trans_ihold, which also allows us to drop the third
    argument. For the case where we want to hold a reference on the inode,
    an xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
    the inode as needing an xfs_iput (see the sketch below). In addition to
    the cleaner interface for the caller, this also simplifies the
    implementation.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
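
    A sketch of the wrapper relationship (reconstructed, not verbatim):
    the plain join does the hold implicitly, and the _ref variant layers
    a reference on top that commit will drop.

    /* Join with implicit hold; caller keeps its own inode reference. */
    void xfs_trans_ijoin(struct xfs_trans *tp, struct xfs_inode *ip);

    /* Join plus a reference that transaction commit releases. */
    void xfs_trans_ijoin_ref(struct xfs_trans *tp, struct xfs_inode *ip,
                             uint lock_flags)
    {
        xfs_trans_ijoin(tp, ip);
        IHOLD(ip);                                 /* extra reference */
        ip->i_itemp->ili_lock_flags = lock_flags;  /* commit does xfs_iput */
    }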
     
  • The unpin_remove item operation instances always share most of their
    implementation with the respective unpin implementation. So instead
    of keeping two different entry points, add a remove flag to the unpin
    operation and share the code more easily (see the sketch below).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
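
    The ops-table change, sketched: one entry point with a remove flag
    replaces two near-identical implementations.

    /* Before: two entry points sharing most of their code. */
    void (*iop_unpin)(struct xfs_log_item *lip);
    void (*iop_unpin_remove)(struct xfs_log_item *lip,
                             struct xfs_trans *tp);

    /* After: a single entry point with a remove flag. */
    void (*iop_unpin)(struct xfs_log_item *lip, int remove);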
     
  • Currently we track the log item descriptors belonging to a transaction
    using a complex open-coded chunk allocator. This code has been there since
    day one and seems to work around the lack of an efficient slab allocator.

    This patch replaces it with dynamically allocated log item descriptors
    from a dedicated slab pool, linked to the transaction by a linked list.

    This allows us to greatly simplify the log item descriptor tracking to
    the point where it's just a couple hundred lines in xfs_trans.c instead of
    a separate file. The external API has also been simplified while we're
    at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
    delete items from a transaction have been simplified to the bare minimum,
    and the xfs_trans_find_item function is replaced with a direct dereference
    of the li_desc field (see the sketch below). All debug code walking the
    list of log items in a transaction is down to a simple list_for_each_entry.

    Note that we could easily use a singly linked list here instead of the
    doubly linked list from list.h, as the fastpath only does deletion from
    sequential traversal. But given that we don't have one available as
    a library function yet, I use the list.h functions for simplicity.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
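
    A sketch of the simplified tracking (the slab cache name is assumed;
    the list and slab calls are the standard kernel APIs):

    struct xfs_log_item_desc {
        struct xfs_log_item *lid_item;
        struct list_head     lid_trans;   /* links into tp->t_items */
    };

    void xfs_trans_add_item(struct xfs_trans *tp, struct xfs_log_item *lip)
    {
        struct xfs_log_item_desc *lidp;

        lidp = kmem_cache_zalloc(xfs_log_item_desc_cache, GFP_NOFS);
        lidp->lid_item = lip;
        lip->li_desc = lidp;            /* direct back pointer ...      */
        list_add_tail(&lidp->lid_trans, &tp->t_items);
    }

    void xfs_trans_del_item(struct xfs_log_item *lip)
    {
        list_del(&lip->li_desc->lid_trans);   /* ... replaces find_item */
        kmem_cache_free(xfs_log_item_desc_cache, lip->li_desc);
        lip->li_desc = NULL;
    }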
     

29 May, 2010

1 commit

  • Instead of having small helper functions that call big macros, do the
    calculations for the log reservations directly in the functions.
    These are mostly 1:1 from the macros, except that the macros kept
    the quota calculations in their callers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

24 May, 2010

3 commits

  • With delayed logging, we can get inode allocation buffers in the
    same transaction as inode unlink buffers. We don't currently mark inode
    allocation buffers in the log, so inode unlink buffers take
    precedence over allocation buffers.

    The result is that when they are combined into the same checkpoint,
    only the unlinked inode chain fields are replayed, resulting in
    uninitialised inode buffers being detected when the next inode
    modification is replayed.

    To fix this, we need to ensure that we do not set the inode buffer
    flag in the buffer log item format flags if the inode allocation has
    not already hit the log. To avoid requiring a change to log
    recovery, we really need to make this a modification that relies
    only on in-memory state.

    We can do this by checking during buffer log formatting (while the
    CIL cannot be flushed) if we are still in the same sequence when we
    commit the unlink transaction as the inode allocation transaction.
    If we are, then we do not add the inode buffer flag to the buffer
    log format item flags. This means the entire buffer will be
    replayed, not just the unlinked fields. We do this while
    CIL flushes are locked out to ensure that we don't race with the
    sequence numbers changing and hence fail to put the inode buffer
    flag in the buffer format flags when we really need to (see the
    sketch below).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
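
    A hedged sketch of the check (flag and field spellings approximate
    the delayed logging code): while CIL flushes are locked out, compare
    the committing item's sequence with the current CIL sequence, and only
    set the inode-buffer flag when the allocation has already moved on to
    an earlier sequence.

    /* During buffer log item formatting, with CIL flushes locked out: */
    if (bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) {
        if (bip->bli_item.li_seq == cil->xc_ctx->sequence) {
            /*
             * The allocation hasn't reached the log yet: leave the
             * flag clear so recovery replays the entire buffer.
             */
        } else {
            bip->bli_format.blf_flags |= XFS_BLF_INODE_BUF;
        }
    }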
     
  • The delayed logging code only changes in-memory structures and as
    such can be enabled and disabled with a mount option. Add the mount
    option and emit a warning that this is an experimental feature that
    should not be used in production yet.

    We also need infrastructure to track committed items that have not
    yet been written to the log. This is what the Committed Item List
    (CIL) is for.

    The log item also needs to be extended to track the current log
    vector, the associated memory buffer and its location in the
    Committed Item List. Extend the log item and log vector structures
    to enable this tracking.

    To maintain the current log format for transactions with delayed
    logging, we need to introduce a checkpoint transaction and a context
    for tracking each checkpoint from initiation to transaction
    completion. This includes adding a log ticket for tracking the log
    space required/used by the context checkpoint.

    To track all the changes we need an io vector array per log item,
    rather than a single array for the entire transaction. Using the new
    log vector structure for this requires two passes - the first to
    allocate the log vector structures and chain them together, and the
    second to fill them out (see the sketch below). This log vector chain
    can then be passed to the CIL for formatting, pinning and insertion
    into the CIL.

    Formatting of the log vector chain is relatively simple - it's just
    a loop over the iovecs on each log vector, but it is made slightly
    more complex because we re-write the iovec after the copy to point
    back at the memory buffer we just copied into.

    This code also needs to pin log items. If the log item is not
    already tracked in this checkpoint context, then it needs to be
    pinned. Otherwise it is already pinned and we don't need to pin it
    again.

    The only other complexity is calculating the amount of new log space
    the formatting has consumed. This needs to be accounted to the
    transaction in progress, and the accounting is made more complex
    because we also need to steal space from it for log metadata in the
    checkpoint transaction. Calculate all this at insert time and update
    all the tickets, counters, etc. correctly.

    Once we've formatted all the log items in the transaction, attach
    the busy extents to the checkpoint context so the busy extents live
    until checkpoint completion and can be processed at that point in
    time. Transactions can then be freed at this point in time.

    Now we need to issue checkpoints - we are tracking the amount of log
    space used by the items in the CIL, so we can trigger background
    checkpoints when the space usage gets to a certain threshold.
    Otherwise, checkpoints need to be triggered when a log synchronisation
    point is reached - a log force event.

    Because the log write code already handles chained log vectors, writing the
    transaction is trivial, too. Construct a transaction header, add it
    to the head of the chain and write it into the log, then issue a
    commit record write. Then we can release the checkpoint log ticket
    and attach the context to the log buffer so it can be called during
    I/O completion to complete the checkpoint.

    We also need to allow for synchronising multiple in-flight
    checkpoints. This is needed for two things - the first is to ensure
    that checkpoint commit records appear in the log in the correct
    sequence order (so they are replayed in the correct order). The
    second is so that xfs_log_force_lsn() operates correctly and only
    flushes and/or waits for the specific sequence it was provided with.

    To do this we need a wait variable and a list tracking the
    checkpoint commits in progress. We can walk this list and wait for
    the checkpoints to change state or complete easily, and this provides
    the necessary synchronisation for correct operation in both cases.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
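
    A sketch of the two-pass log vector build described above (structures
    simplified relative to the real ones):

    struct xfs_log_vec {
        struct xfs_log_vec   *lv_next;    /* chains vectors across the tx */
        struct xfs_log_item  *lv_item;
        struct xfs_log_iovec *lv_iovecp;  /* iovec array for this item */
        int                   lv_niovecs;
    };

    /*
     * Pass 1: walk the logged items, ask each for its iovec count, and
     * allocate and chain one xfs_log_vec (plus its data buffer) per item.
     *
     * Pass 2: walk again, format each item into its vector, then rewrite
     * each iovec to point at the copied data, so the CIL owns a stable
     * snapshot independent of the live in-memory structures.
     */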
     
  • When we free a metadata extent, we record it in the per-AG busy
    extent array so that it is not re-used before the freeing
    transaction hits the disk. This array is fixed size, so when it
    overflows we make further allocation transactions synchronous
    because we cannot track more freed extents until those transactions
    hit the disk and are completed. Under heavy mixed allocation and
    freeing workloads with large log buffers, we can overflow this array
    quite easily.

    Further, the array is sparsely populated, which means that inserts
    need to search for a free slot, and array searches often have to
    scan many more slots than are actually used in order to check all the
    busy extents. Quite inefficient, really.

    To enable this aspect of extent freeing to scale better, we need
    a structure that can grow dynamically. While in other areas of
    XFS we have used radix trees, the extents being freed are at random
    locations on disk so are better suited to being indexed by an rbtree.

    So, use a per-AG rbtree indexed by block number to track busy
    extents (see the sketch below). This incurs a memory allocation when
    marking an extent busy, but that should not occur too often, even in
    low memory situations. This should scale to an arbitrary number of
    extents, so it should not be a limitation for features such as
    in-memory aggregation of transactions.

    However, there are still situations where we can't avoid allocating
    busy extents (such as allocation from the AGFL). To minimise the
    overhead of such occurrences, we need to avoid doing a synchronous
    log force while holding the AGF locked to ensure that the previous
    transactions are safely on disk before we use the extent. We can do
    this by marking the transaction doing the allocation as synchronous
    rather than issuing a log force.

    Because of the locking involved and the ordering of transactions,
    the synchronous transaction provides the same guarantees as a
    synchronous log force because it ensures that all the prior
    transactions are already on disk when the synchronous transaction
    hits the disk. i.e. it preserves the free->allocate order of the
    extent correctly in recovery.

    By doing this, we avoid holding the AGF locked while log writes are
    in progress, hence reducing the length of time the lock is held and
    therefore we increase the rate at which we can allocate and free
    from the allocation group, thereby increasing overall throughput.

    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), the buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting for
    the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for a log
    force to occur.

    Ideally we should not reallocate busy extents. That is a much more
    complex fix to the problem as it involves direct intervention in the
    allocation btree searches in many places. This is left to a future
    set of modifications.

    Finally, now that we track busy extents in allocated memory, we
    don't need the descriptors in the transaction structure to point to
    them. We can replace the complex busy chunk infrastructure with a
    simple linked list of busy extents. This allows us to remove a large
    chunk of code, making the overall change a net reduction in code
    size.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
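
    A sketch of the per-AG rbtree insert keyed by start block (the kernel
    rbtree calls are the standard API; the struct is simplified and
    overlap handling is omitted):

    struct xfs_busy_extent {
        struct rb_node  rb_node;    /* indexed by bno in the per-AG tree */
        xfs_agblock_t   bno;
        xfs_extlen_t    length;
    };

    static void xfs_busy_insert(struct rb_root *root,
                                struct xfs_busy_extent *new)
    {
        struct rb_node **p = &root->rb_node, *parent = NULL;

        while (*p) {
            struct xfs_busy_extent *busyp;

            parent = *p;
            busyp = rb_entry(parent, struct xfs_busy_extent, rb_node);
            if (new->bno < busyp->bno)
                p = &(*p)->rb_left;
            else
                p = &(*p)->rb_right;
        }
        rb_link_node(&new->rb_node, parent, p);
        rb_insert_color(&new->rb_node, root);
    }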
     

19 May, 2010

2 commits

  • Currently there is no tracing in log recovery, so it is difficult to
    determine what is going on when something goes wrong.

    Add tracing for log item recovery to provide visibility into the log
    recovery process. The tracing added shows regions being extracted
    from the log transactions and added to the transaction hash forming
    recovery items, followed by the reordering, cancelling and finally
    recovery of the items.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • The staleness of an object being unpinned can be directly derived
    from the object itself - there is no need to extract it from the
    object and then pass it as a parameter into IOP_UNPIN().

    This means we can kill the XFS_LID_BUF_STALE flag - it is set,
    checked and cleared in the same places as the XFS_BLI_STALE flag in
    the xfs_buf_log_item, so it is now redundant and hence safe to remove.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

02 Mar, 2010

1 commit

  • Currently we pass opaque xfs_log_ticket_t handles instead of
    struct xlog_ticket pointers, and void pointers instead of
    struct xlog_in_core pointers, to various log manager functions.
    Instead pass properly typed pointers after adding forward
    declarations for them to xfs_log.h (see the sketch below), and adjust
    the touched function prototypes to the standard XFS style while at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
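
    The mechanics, sketched: forward declarations in xfs_log.h are enough
    to pass typed pointers without exposing the structure definitions
    (prototypes approximate the era's interfaces):

    /* xfs_log.h */
    struct xlog_ticket;
    struct xlog_in_core;

    xfs_lsn_t xfs_log_done(struct xfs_mount *mp, struct xlog_ticket *ticket,
                           struct xlog_in_core **iclog, uint flags);
    int       xfs_log_release_iclog(struct xfs_mount *mp,
                                    struct xlog_in_core *iclog);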
     

02 Feb, 2010

1 commit

  • All buffers logged into the AIL are marked as delayed write.
    When the AIL needs to push the buffer out, it issues an async write of the
    buffer. This means that IO patterns are dependent on the order of
    buffers in the AIL.

    Instead of flushing the buffer, promote the buffer in the delayed
    write list so that the next time the xfsbufd is run the buffer will
    be flushed by the xfsbufd. Return state to the xfsaild indicating that
    the buffer was promoted, so that the xfsaild knows it needs to wake the
    xfsbufd to flush the promoted buffers (see the sketch below).

    Using the xfsbufd for issuing the IO allows us to dispatch all
    buffer IO from the one queue. This means that we can make much more
    enlightened decisions on what order to flush buffers to disk as
    we don't have multiple places issuing IO. Optimisations to xfsbufd
    will be in a future patch.

    Version 2
    - kill XFS_ITEM_FLUSHING as it is now unused.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
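
    A hedged sketch of the hand-off (the status code and field names are
    illustrative): the push path promotes the buffer instead of writing
    it, and the xfsaild wakes the xfsbufd once the walk is done.

    /* In the buffer item push path: */
    xfs_buf_delwri_promote(bp);    /* move to the head of the delwri queue */
    return XFS_ITEM_PUSHED;        /* tell xfsaild a buffer was promoted */

    /* In xfsaild, after walking the AIL: */
    if (pushed)
        wake_up_process(mp->m_ddev_targp->bt_task);   /* run xfsbufd now */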
     

15 Dec, 2009

1 commit

  • Convert the old xfs tracing support that could only be used with the
    out of tree kdb and xfsidbg patches to use the generic event tracer.

    To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
    all xfs trace channels by:

    echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

    or alternatively enable single events by just doing the same in one
    event subdirectory, e.g.

    echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable

    or set more complex filters, etc. All this is described in more
    detail in Documentation/trace/events.txt. To read the events do a

    cat /sys/kernel/debug/tracing/trace

    Compared to the last posting, this patch converts the tracing mostly to
    the one-tracepoint-per-callsite model that other users of the new
    tracing facility also employ. This allows very fine-grained control
    of the tracing, a cleaner output of the traces and also enables the
    perf tool to use each tracepoint as a virtual performance counter,
    allowing us to e.g. count how often certain workloads hit various
    spots in XFS. Take a look at

    http://lwn.net/Articles/346470/

    for some examples.

    Also the btree tracing isn't included at all yet, as it will require
    additional core tracing features not yet in mainline; I plan to
    deliver it later.

    And the really nice thing about this patch is that it actually removes
    many lines of code while adding this nice functionality:

    fs/xfs/Makefile | 8
    fs/xfs/linux-2.6/xfs_acl.c | 1
    fs/xfs/linux-2.6/xfs_aops.c | 52 -
    fs/xfs/linux-2.6/xfs_aops.h | 2
    fs/xfs/linux-2.6/xfs_buf.c | 117 +--
    fs/xfs/linux-2.6/xfs_buf.h | 33
    fs/xfs/linux-2.6/xfs_fs_subr.c | 3
    fs/xfs/linux-2.6/xfs_ioctl.c | 1
    fs/xfs/linux-2.6/xfs_ioctl32.c | 1
    fs/xfs/linux-2.6/xfs_iops.c | 1
    fs/xfs/linux-2.6/xfs_linux.h | 1
    fs/xfs/linux-2.6/xfs_lrw.c | 87 --
    fs/xfs/linux-2.6/xfs_lrw.h | 45 -
    fs/xfs/linux-2.6/xfs_super.c | 104 ---
    fs/xfs/linux-2.6/xfs_super.h | 7
    fs/xfs/linux-2.6/xfs_sync.c | 1
    fs/xfs/linux-2.6/xfs_trace.c | 75 ++
    fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
    fs/xfs/linux-2.6/xfs_vnode.h | 4
    fs/xfs/quota/xfs_dquot.c | 110 ---
    fs/xfs/quota/xfs_dquot.h | 21
    fs/xfs/quota/xfs_qm.c | 40 -
    fs/xfs/quota/xfs_qm_syscalls.c | 4
    fs/xfs/support/ktrace.c | 323 ---------
    fs/xfs/support/ktrace.h | 85 --
    fs/xfs/xfs.h | 16
    fs/xfs/xfs_ag.h | 14
    fs/xfs/xfs_alloc.c | 230 +-----
    fs/xfs/xfs_alloc.h | 27
    fs/xfs/xfs_alloc_btree.c | 1
    fs/xfs/xfs_attr.c | 107 ---
    fs/xfs/xfs_attr.h | 10
    fs/xfs/xfs_attr_leaf.c | 14
    fs/xfs/xfs_attr_sf.h | 40 -
    fs/xfs/xfs_bmap.c | 507 +++------------
    fs/xfs/xfs_bmap.h | 49 -
    fs/xfs/xfs_bmap_btree.c | 6
    fs/xfs/xfs_btree.c | 5
    fs/xfs/xfs_btree_trace.h | 17
    fs/xfs/xfs_buf_item.c | 87 --
    fs/xfs/xfs_buf_item.h | 20
    fs/xfs/xfs_da_btree.c | 3
    fs/xfs/xfs_da_btree.h | 7
    fs/xfs/xfs_dfrag.c | 2
    fs/xfs/xfs_dir2.c | 8
    fs/xfs/xfs_dir2_block.c | 20
    fs/xfs/xfs_dir2_leaf.c | 21
    fs/xfs/xfs_dir2_node.c | 27
    fs/xfs/xfs_dir2_sf.c | 26
    fs/xfs/xfs_dir2_trace.c | 216 ------
    fs/xfs/xfs_dir2_trace.h | 72 --
    fs/xfs/xfs_filestream.c | 8
    fs/xfs/xfs_fsops.c | 2
    fs/xfs/xfs_iget.c | 111 ---
    fs/xfs/xfs_inode.c | 67 --
    fs/xfs/xfs_inode.h | 76 --
    fs/xfs/xfs_inode_item.c | 5
    fs/xfs/xfs_iomap.c | 85 --
    fs/xfs/xfs_iomap.h | 8
    fs/xfs/xfs_log.c | 181 +----
    fs/xfs/xfs_log_priv.h | 20
    fs/xfs/xfs_log_recover.c | 1
    fs/xfs/xfs_mount.c | 2
    fs/xfs/xfs_quota.h | 8
    fs/xfs/xfs_rename.c | 1
    fs/xfs/xfs_rtalloc.c | 1
    fs/xfs/xfs_rw.c | 3
    fs/xfs/xfs_trans.h | 47 +
    fs/xfs/xfs_trans_buf.c | 62 -
    fs/xfs/xfs_vnodeops.c | 8
    70 files changed, 2151 insertions(+), 2592 deletions(-)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

12 Dec, 2009

1 commit

  • When completing I/O requests we must not allow the memory allocator to
    recurse into the filesystem, as we might deadlock on waiting for the
    I/O completion otherwise. The only thing currently allocating normal
    GFP_KERNEL memory is the allocation of the transaction structure for
    the unwritten extent conversion. Add a memflags argument to
    _xfs_trans_alloc to allow controlling the allocator behaviour.

    Signed-off-by: Christoph Hellwig
    Reported-by: Thomas Neumann
    Tested-by: Thomas Neumann
    Reviewed-by: Alex Elder
    Signed-off-by: Alex Elder

    Christoph Hellwig
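
    A sketch of the call-site distinction (type and flag spellings
    approximate the era's code): I/O completion must not let the
    allocator recurse into the filesystem, so it passes a NOFS-style
    flag, while ordinary callers keep the sleeping default.

    /* Ordinary transaction setup: full GFP_KERNEL behaviour is fine. */
    tp = _xfs_trans_alloc(mp, XFS_TRANS_GROWFS, KM_SLEEP);

    /* Unwritten extent conversion at I/O completion: no FS recursion,
     * or we may deadlock waiting on the very I/O we are completing. */
    tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);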
     

02 Sep, 2009

1 commit

  • The guarantees for O_SYNC are exactly the same as the ones we need to
    make for an fsync call (and given that Linux O_SYNC is O_DSYNC the
    equivalent is fdatasync, but we treat both the same in XFS), except
    with a ranged data writeout. Jan Kara has started unifying these two
    paths for filesystems using the generic helpers, and I've started to
    look at XFS.

    The actual transaction committed by xfs_fsync and xfs_write_sync_logforce
    has a different transaction number, but actually is exactly the same.
    We'll only use the fsync transaction going forward. One major difference
    is that xfs_write_sync_logforce never issues a cache flush unless we
    commit a transaction causing that as a side-effect, which is an obvious
    bug in the O_SYNC handling. Second, all the locking and i_update_size
    vs i_update_core changes from 978b7237123d007b9fa983af6e0e2fa8f97f9934
    never made it to xfs_write_sync_logforce, so we add them back.

    To make xfs_fsync easily usable from the O_SYNC path, the filemap_fdatawait
    call is moved up to xfs_file_fsync, so that we don't wait on the whole
    file after we already waited for our portion in xfs_write.

    We'll also use a plain call to filemap_write_and_wait_range instead
    of the previous sync_page_range, which did it in two steps including
    a half-hearted inode write out that doesn't help us.

    Once we're done with this also remove the now useless i_update_size
    tracking.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Felix Blyakher
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     

09 Feb, 2009

1 commit

  • xfs_ialloc_btree.h has a couple of macros that only obfuscate the code
    but don't provide any abstraction benefits. This patch removes those
    and cleans up the remaining definitions a little.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

30 Oct, 2008

4 commits

  • Change all the remaining AIL API functions that are passed struct
    xfs_mount pointers to pass pointers directly to the struct xfs_ail being
    used. With this conversion, all external access to the AIL is via the
    struct xfs_ail. Hence the operation and referencing of the AIL is almost
    entirely independent of the xfs_mount that is using it - it is now much
    more tightly tied to the log and the items it is tracking in the log than
    it is tied to the xfs_mount.

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32353a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
  • Add an xfs_ail pointer to log items so that the log items can reference
    the AIL directly during callbacks without needed a struct xfs_mount.

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32352a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
  • With the new cursor interface, it makes sense to make all the traversing
    code use the cursor interface and make the old one go away. This means
    more of the AIL interfacing is done by passing struct xfs_ail pointers
    around the place instead of struct xfs_mount pointers.

    We can replace the use of xfs_trans_first_ail() in xfs_log_need_covered(),
    as it is only checking if the AIL is empty. We can do that with a call to
    xfs_trans_ail_tail() instead, where a zero LSN returned indicates an
    empty AIL (see the sketch below).

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32348a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
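
    The replacement check, sketched (names approximate): a zero LSN from
    xfs_trans_ail_tail() is all xfs_log_need_covered() needs to know.

    /* Before: peek at the first item via the old traversal interface.
     * After: */
    if (!xfs_trans_ail_tail(log->l_ailp)) {
        /* AIL is empty - the log may need a covering transaction. */
    }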
     
  • SGI-PV: 986558

    SGI-Modid: xfs-linux-melb:xfs-kern:32231a

    Signed-off-by: Barry Naujok
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy

    Barry Naujok
     

13 Aug, 2008

2 commits

  • SGI-PV: 981498

    SGI-Modid: xfs-linux-melb:xfs-kern:31758a

    Signed-off-by: Eric Sandeen
    Signed-off-by: Niv Sardi
    Signed-off-by: Lachlan McIlroy

    Eric Sandeen
     
  • Move it from the attr code to the transaction code and make
    the attr code call the new function.

    The rolltrans function is really useful whenever we want to use
    rolling transactions, so it should be generic - it isn't dependent on
    any part of the attr code anyway (usage sketched below).

    We use this excuse to change all the:

    if ((error = xfs_attr_rolltrans()))

    calls into:

    error = xfs_trans_roll();
    if (error)
    SGI-PV: 981498

    SGI-Modid: xfs-linux-melb:xfs-kern:31729a

    Signed-off-by: Niv Sardi
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy

    Niv Sardi
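
    A sketch of the generalised helper in use (error handling
    abbreviated): roll commits the current transaction and starts the
    next one within the same permanent log reservation, keeping the
    inode locked and joined throughout.

    /* Inside a multi-step operation: */
    error = xfs_trans_roll(&tp, dp);   /* commit step N, open step N+1 */
    if (error)
        goto out;
    /* dp is still locked and joined to the new transaction here. */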
     

18 Apr, 2008

2 commits

  • Replace the xfs_ail_entry_t with a struct list_head and clean the
    surrounding code up. Also fixes a livelock in xfs_trans_first_push_ail()
    by terminating the loop at the head of the list correctly.

    SGI-PV: 978682
    SGI-Modid: xfs-linux-melb:xfs-kern:30636a

    Signed-off-by: Josef 'Jeff' Sipek
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Josef 'Jeff' Sipek
     
  • It's completely unused so we might as well kill it. Note that there is
    another t_sema in struct xlog_ticket, which is used and actually an sv_t
    despite the name. That one is left untouched by this patch.

    SGI-PV: 971186
    SGI-Modid: xfs-linux-melb:xfs-kern:30591a

    Signed-off-by: Niv Sardi
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy

    Niv Sardi
     

07 Feb, 2008

2 commits

  • When many hundreds to thousands of threads all try to do simultaneous
    transactions and the log is in a tail-pushing situation (i.e. full), we
    can get multiple threads walking the AIL list and contending on the AIL
    lock.

    The AIL push is, in effect, a simple I/O dispatch algorithm complicated by
    the ordering constraints placed on it by the transaction subsystem. It
    really does not need multiple threads to push on it - even when only a
    single CPU is pushing the AIL, it can push the I/O out far faster than
    pretty much any disk subsystem can handle.

    So, to avoid contention problems stemming from multiple list walkers, move
    the list walk off into another thread and simply provide a "target" to
    push to (see the sketch below). When a thread requires a push, it sets the
    target and wakes the push thread, then goes to sleep waiting for the
    required amount of space to become available in the log.

    This mechanism should also be a lot fairer under heavy load as the waiters
    will queue in arrival order, rather than queuing in "who completed a push
    first" order.

    Also, by moving the pushing to a separate thread we can do more
    effective overload detection and prevention, as we can keep context from
    loop iteration to loop iteration. That is, we can push only part of the
    list each loop and not have to loop back to the start of the list every
    time we run. This should also help by reducing the number of times we try
    to lock and/or push items that we cannot move.

    Note that this patch is not intended to solve the inefficiencies in the
    AIL structure and the associated issues with extremely large list
    contents. That needs to be addressed separately; parallel access would
    cause problems to any new structure as well, so I'm only aiming to isolate
    the structure from unbounded parallelism here.

    SGI-PV: 972759
    SGI-Modid: xfs-linux-melb:xfs-kern:30371a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    David Chinner
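
    The hand-off, as a hedged sketch (field names illustrative): waiters
    publish the farthest target LSN and wake the single push thread, then
    sleep until the log has the space they need.

    /* Requester side: */
    spin_lock(&ailp->ail_lock);
    if (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
        ailp->ail_target = threshold_lsn;   /* only ever moves forward */
    spin_unlock(&ailp->ail_lock);
    wake_up_process(ailp->ail_task);        /* one walker does all the I/O */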
     
  • These are mostly locking annotations, marking things static, casts where
    needed and declaring stuff in header files.

    SGI-PV: 971186
    SGI-Modid: xfs-linux-melb:xfs-kern:30002a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy

    David Chinner
     

14 Jul, 2007

1 commit

  • When we have a couple of hundred transactions in flight at once, they all
    typically modify the on-disk superblock in some way.
    create/unlink/mkdir/rmdir modify inode counts; allocation/freeing modify
    free block counts.
    free block counts.

    When these counts are modified in a transaction, they must eventually lock
    the superblock buffer and apply the mods. The buffer then remains locked
    until the transaction is committed into the incore log buffer. The result
    of this is that with enough transactions on the fly the incore superblock
    buffer becomes a bottleneck.

    The result of contention on the incore superblock buffer is that
    transaction rates fall - the more pressure that is put on the superblock
    buffer, the slower things go.

    The key to removing the contention is to not require the superblock fields
    in question to be locked. We do that by not marking the superblock dirty
    in the transaction. IOWs, we modify the incore superblock but do not
    modify the cached superblock buffer. In short, we do not log superblock
    modifications to critical fields in the superblock on every transaction.
    In fact we only do it just before we write the superblock to disk every
    sync period or just before unmount.

    This creates an interesting problem - if we don't log or write out the
    fields in every transaction, then how do the values get recovered after a
    crash? The answer is simple - we keep enough duplicate, logged information
    in other structures that we can reconstruct the correct count after log
    recovery has been performed.

    It is the AGF and AGI structures that contain the duplicate information;
    after recovery, we walk every AGI and AGF and sum their individual
    counters to get the correct value, and we do a transaction into the log to
    correct them (see the sketch below). An optimisation of this is that if we
    have a clean unmount record, we know the value in the superblock is
    correct, so we can avoid the summation walk under normal conditions, and
    so mount/recovery times do not change under normal operation.

    One wrinkle that was discovered during development was that the blocks
    used in the freespace btrees are never accounted for in the AGF counters.
    This was once a valid optimisation to make; when the filesystem is full,
    the free space btrees are empty and consume no space. Hence when it
    matters, the "accounting" is correct. But that means that when we do the
    AGF summations, we would not have a correct count, and xfs_check would
    complain. Hence a new counter was added to track the number of blocks used
    by the free space btrees. This is an *on-disk format change*.

    As a result of this, lazy superblock counters are a mkfs option and at the
    moment on linux there is no way to convert an old filesystem. This is
    possible - xfs_db can be used to twiddle the right bits and then
    xfs_repair will do the format conversion for you. Similarly, you can
    convert backwards as well. At some point we'll add functionality to
    xfs_admin to do the bit twiddling easily....

    SGI-PV: 964999
    SGI-Modid: xfs-linux-melb:xfs-kern:28652a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner
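
    A sketch of the recovery-time summation (field names follow the
    on-disk AGF/AGI headers; buffer reading and error handling omitted):

    uint64_t icount = 0, ifree = 0, fdblocks = 0;
    xfs_agnumber_t agno;

    for (agno = 0; agno < sbp->sb_agcount; agno++) {
        /* ... read this AG's AGI and AGF headers ... */
        icount   += be32_to_cpu(agi->agi_count);
        ifree    += be32_to_cpu(agi->agi_freecount);
        fdblocks += be32_to_cpu(agf->agf_freeblks) +
                    be32_to_cpu(agf->agf_flcount) +
                    be32_to_cpu(agf->agf_btreeblks);  /* the new counter */
    }
    /* then log a transaction correcting sb_icount/sb_ifree/sb_fdblocks */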
     

10 Feb, 2007

1 commit

  • The free block modification code has a 32-bit interface, limiting the size
    to which the filesystem can be grown even on 64 bit machines. On 32 bit
    machines, there are other 32-bit variables in transaction structures and
    interfaces that need to be expanded to allow this to work.

    SGI-PV: 959978
    SGI-Modid: xfs-linux-melb:xfs-kern:27894a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner