14 Mar, 2012

4 commits

  • Add an in-memory-only flag to record that we logged timestamps only, and
    use it to check whether fdatasync can optimize away the log force.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
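
    A minimal sketch of the idea as a self-contained C model - the names
    and layout are illustrative, not the actual XFS structures:

        #include <stdbool.h>

        struct inode_log_item_model {
            unsigned int fields;     /* dirty flags destined for the log */
            bool timestamp_only;     /* in-memory: only timestamps logged */
        };

        /* fdatasync can skip the log force when the only dirty metadata
         * is a timestamp update, which fdatasync is allowed to drop. */
        static bool fdatasync_needs_log_force(const struct inode_log_item_model *ili)
        {
            return !ili->timestamp_only;
        }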
     
  • Add a new ili_fields member to the inode log item to isolate the in-memory
    flags from the ones that actually go to the log. This will allow tracking
    timestamp-only updates for fdatasync and O_DSYNC in the next patch and
    prepares for divorcing the on-disk log format from the in-memory log item
    a little further down the road.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
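
    A rough C sketch of the separation described above; the member names
    mirror the commit text, but everything else is illustrative:

        struct log_format_model { unsigned int ilf_fields; };

        struct ili_model {
            unsigned int ili_fields;            /* in-memory dirty state */
            struct log_format_model ili_format; /* what goes to the log */
        };

        #define ILOG_TIMESTAMP_MODEL 0x4000u    /* hypothetical memory-only bit */

        /* Only at format time is the on-disk subset copied over, with
         * the in-memory-only bits stripped out. */
        static void format_fields(struct ili_model *iip)
        {
            iip->ili_format.ilf_fields =
                iip->ili_fields & ~ILOG_TIMESTAMP_MODEL;
        }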
     
  • Move all code messing with the inode log item flags into xfs_inode_item_format
    to make sure xfs_inode_item_size really only calculates the number of
    vectors, but doesn't modify any state of the inode item.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
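
    A toy illustration of the invariant this establishes - the size
    callback is a pure computation and all flag fiddling happens at
    format time (abbreviated, not the real xfs_inode_item code):

        struct item_model { unsigned int flags; int data_bytes; };

        static int item_size(const struct item_model *it)  /* no side effects */
        {
            return 1 + (it->data_bytes ? 1 : 0);  /* number of log iovecs */
        }

        static void item_format(struct item_model *it)
        {
            if (!it->data_bytes)
                it->flags &= ~0x2u;  /* clear stale flags here, not in size() */
            /* ... then fill in the iovecs ... */
        }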
     
  • Timestamps on regular files are the last metadata that XFS does not update
    transactionally. Now that we use the delaylog mode exclusively and have made
    the log code scale extremely well, there is no need to bypass that code for
    timestamp updates. Logging all updates allows us to drop a lot of code, and
    will allow for further performance improvements later on.

    Note that this patch drops optimized handling of fdatasync - it will be
    added back in a separate commit.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

23 Feb, 2012

1 commit

  • There is no reason to wake up log space waiters when unlocking inodes or
    dquots, and the commit log has no explanation for this function either.

    Given that we now have exact log space wakeups everywhere we can assume
    the reason for this function was to paper over log space races in earlier
    XFS versions.

    Reviewed-by: Mark Tinguely
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

18 Jan, 2012

3 commits

  • Replace i_pin_wait, which is only used during synchronous inode flushing,
    with a bit waitqueue. This trades off a much smaller inode against
    slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit)
    bytes in the XFS inode.

    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
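
    A self-contained model of waiting on a flag bit instead of a
    dedicated waitqueue; the kernel sleeps on a shared bit waitqueue
    where this portable sketch merely yields, and the bit number is
    made up:

        #include <stdatomic.h>
        #include <sched.h>

        #define IPINNED_BIT 0   /* illustrative bit in the inode flags word */

        static void wait_for_unpin(atomic_uint *flags)
        {
            while (atomic_load(flags) & (1u << IPINNED_BIT))
                sched_yield();  /* kernel: sleep on the bit waitqueue */
        }

        static void unpin(atomic_uint *flags)
        {
            atomic_fetch_and(flags, ~(1u << IPINNED_BIT));
            /* kernel: wake_up_bit() for any sleeper */
        }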
     
  • We almost never block on i_flock; the exception is synchronous inode
    flushing. Instead of bloating the inode with a 16/24-byte completion
    that we abuse as a semaphore, just implement it as a bitlock that uses
    a bit waitqueue for the rare sleeping path. This is primarily a
    tradeoff between a much smaller inode with a faster non-blocking
    path versus faster wakeups, and we are much better off with the former.

    A small downside is that we will lose lockdep checking for i_flock, but
    given that it's always taken inside the ilock that should be acceptable.

    Note that for example the inode writeback locking is implemented in a
    very similar way.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Ben Myers

    Christoph Hellwig
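
    The same trick as a trylock-style bitlock, again as a portable C
    model rather than the actual kernel primitives:

        #include <stdatomic.h>
        #include <stdbool.h>

        #define IFLOCK_BIT 1    /* illustrative bit number */

        /* Fast path: one atomic test-and-set on the existing flags word,
         * no 16/24-byte completion needed in the inode. */
        static bool iflock_nowait(atomic_uint *flags)
        {
            return !(atomic_fetch_or(flags, 1u << IFLOCK_BIT)
                     & (1u << IFLOCK_BIT));
        }

        static void ifunlock(atomic_uint *flags)
        {
            atomic_fetch_and(flags, ~(1u << IFLOCK_BIT));
            /* kernel: wake_up_bit() handles the rare sleeping waiter */
        }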
     
  • We spent a lot of effort maintaining this field, but it always equals the
    fork size divided by the constant size of an extent. The prime use of it is
    to assert that the two stay in sync. Just divide the fork size by the extent
    size in the few places that we actually use it and remove the overhead
    of maintaining it. Also introduce a few helpers to consolidate the places
    where we actually care about the value.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
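
    A sketch of the replacement helper - derive the extent count on
    demand instead of maintaining it (the record layout is a stand-in
    for the real 16-byte on-disk extent record):

        #include <stddef.h>

        typedef struct { unsigned long long l0, l1; } bmbt_rec_model_t;
        struct ifork_model { int if_bytes; };

        static int ifork_nextents(const struct ifork_model *ifp)
        {
            return ifp->if_bytes / (int)sizeof(bmbt_rec_model_t);
        }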
     

09 Nov, 2011

1 commit

  • The log item ops aren't necessarily the biggest exploit vector, but marking
    them const is easy enough. Also remove the unused xfs_item_ops_t typedef
    while we're at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Alex Elder

    Christoph Hellwig
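
    The shape of the change, sketched with an abbreviated ops struct
    (the real xfs_item_ops has more methods):

        struct item_ops_model {
            int  (*iop_size)(void *item);
            void (*iop_format)(void *item, void *iovecs);
        };

        static int  model_size(void *item) { (void)item; return 1; }
        static void model_format(void *item, void *iovecs)
        { (void)item; (void)iovecs; }

        /* const moves the vector into read-only data, so it cannot be
         * overwritten at runtime to redirect the callbacks. */
        static const struct item_ops_model inode_item_ops_model = {
            .iop_size   = model_size,
            .iop_format = model_format,
        };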
     

12 Oct, 2011

2 commits

  • There is no reason to keep a reference to the inode even if we unlock
    it during transaction commit because we never drop a reference between
    the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref
    back into xfs_trans_ijoin - the third argument decides if an unlock
    is needed now.

    I'm actually starting to wonder if allowing inodes to be unlocked
    at transaction commit really is worth the effort. The only real
    benefit is that they can be unlocked earlier when committing a
    synchronous transaction, but that could be solved by doing the
    log force manually after the unlock, too.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
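
    A userspace model of the merged interface; names and fields are
    illustrative, not the kernel signatures:

        struct inode_model { int ilock_held; };
        struct trans_model {
            struct inode_model *ip;
            unsigned int unlock_flags;  /* 0: caller keeps the lock */
        };

        #define ILOCK_EXCL_MODEL 0x1u

        static void trans_ijoin(struct trans_model *tp, struct inode_model *ip,
                                unsigned int lock_flags)
        {
            tp->ip = ip;
            tp->unlock_flags = lock_flags;  /* no extra inode reference */
        }

        static void trans_commit(struct trans_model *tp)
        {
            /* ... commit to the log ... */
            if (tp->unlock_flags & ILOCK_EXCL_MODEL)
                tp->ip->ilock_held = 0;     /* unlock at commit if asked */
        }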
     
  • We need to check for pinned buffers even in .iop_pushbuf, given that inode
    items flush into the same buffers that may be pinned directly due to
    operations on the unlinked inode list acting directly on buffers. To do
    this, add a return value to .iop_pushbuf that tells the AIL push code
    about this and use the existing log force mechanisms to unpin it.

    Signed-off-by: Christoph Hellwig
    Reported-by: Stefan Priebe
    Tested-by: Stefan Priebe
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
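
    The new contract in miniature - a boolean result lets the AIL push
    react to a pinned buffer with a log force (illustrative model):

        #include <stdbool.h>

        struct buf_model { int pin_count; };

        static bool iop_pushbuf_model(struct buf_model *bp)
        {
            if (bp->pin_count > 0)
                return true;    /* caller should force the log to unpin */
            /* ... otherwise queue the buffer for delayed write ... */
            return false;
        }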
     

08 Jul, 2011

1 commit

  • Remove the transaction pointer in the inode. It's only used to avoid
    passing down an argument in the bmap code, and for a few asserts in
    the transaction code right now.

    Also use the local variable ip in a few more places in xfs_inode_item_unlock,
    so that it isn't only used for debug builds after the above change.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

07 Jul, 2011

1 commit

  • When inodes are marked stale in a transaction, they are treated
    specially when the inode log item is being inserted into the AIL.
    It tries to avoid moving the log item forward in the AIL due to a
    race condition with writing the underlying buffer back to disk.
    This was "fixed" in commit de25c18 ("xfs: avoid moving stale inodes
    in the AIL").

    To avoid moving the item forward, we return a LSN smaller than the
    commit_lsn of the completing transaction, thereby trying to trick
    the commit code into not moving the inode forward at all. I'm not
    sure this ever worked as intended - it assumes the inode is already
    in the AIL, but I don't think the returned LSN would have been small
    enough to prevent moving the inode. It appears that the reason it
    worked is that the lower LSN of the inodes meant they were inserted
    into the AIL and flushed before the inode buffer (which was moved to
    the commit_lsn of the transaction).

    The big problem is that with delayed logging, the returning of the
    different LSN means insertion takes the slow, non-bulk path. Worse
    yet is that insertion is to a position -before- the commit_lsn so it
    is doing an AIL traversal on every insertion, and has to walk over
    all the items that have already been inserted into the AIL. It's
    expensive.

    To compound the matter further, with delayed logging inodes are
    likely to go from clean to stale in a single checkpoint, which means
    they aren't even in the AIL at all when we come across them at AIL
    insertion time. Hence these were all getting inserted into the AIL
    when they simply do not need to be as inodes marked XFS_ISTALE are
    never written back.

    Transactional/recovery integrity is maintained in this case by the
    other items in the unlink transaction that were modified (e.g. the
    AGI btree blocks) and committed in the same checkpoint.

    So to fix this, simply unpin the stale inodes directly in
    xfs_inode_item_committed() and return -1 to indicate that the AIL
    insertion code does not need to do any further processing of these
    inodes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
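
    A compact model of the fix; the sentinel -1 follows the commit text,
    while the field names are illustrative:

        typedef long long lsn_model_t;

        struct inode_item_model { int stale; int pin_count; };

        static lsn_model_t inode_item_committed(struct inode_item_model *iip,
                                                lsn_model_t commit_lsn)
        {
            if (iip->stale) {
                iip->pin_count--;       /* unpin directly, no writeback */
                return (lsn_model_t)-1; /* skip AIL insertion entirely */
            }
            return commit_lsn;          /* normal items move to commit_lsn */
        }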
     

29 Apr, 2011

1 commit

  • Follow these guidelines:
    - leave initialization in the declaration block if it fits the line
    - move to the code where it's more suitable ('for' init block)

    The last chunk was modified from David's original to be a correct
    fix for what appeared to be a duplicate initialization.

    Signed-off-by: David Sterba
    Signed-off-by: Alex Elder
    Reviewed-by: Dave Chinner

    David Sterba
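
    A toy function showing both guidelines side by side:

        static int sum_to(int n)
        {
            int total = 0;  /* initialization fits the declaration line */
            int i;

            for (i = 0; i < n; i++)  /* init moved into the 'for' block */
                total += i;
            return total;
        }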
     

08 Apr, 2011

1 commit

  • When formatting an inode item, we have to allocate a separate buffer
    to hold extents when there are delayed allocation extents on the
    inode and it is in extent format. The allocation size is derived
    from the in-core data fork representation, which accounts for
    delayed allocation extents, while the on-disk representation does
    not contain any delalloc extents.

    As a result of this mismatch, the allocated buffer can be far larger
    than needed to hold the real extent list which, due to the fact the
    inode is in extent format, is limited to the size of the literal
    area of the inode. However, we can have thousands of delalloc
    extents, resulting in an allocation size orders of magnitude larger
    than is needed to hold all the real extents.

    Fix this by limiting the size of the buffer being allocated to the
    size of the literal area of the inodes in the filesystem (i.e. the
    maximum size an inode fork can grow to).

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
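
    The essence of the fix as a hedged sketch - clamp the allocation to
    the largest size an inode fork can reach (the literal area value
    here is made up):

        #include <stddef.h>

        #define LITERAL_AREA_BYTES 400   /* illustrative per-inode limit */

        static size_t extent_buf_size(size_t incore_fork_bytes)
        {
            /* the in-core size counts delalloc extents the on-disk fork
             * omits, so it can vastly overestimate what format needs */
            return incore_fork_bytes < LITERAL_AREA_BYTES
                   ? incore_fork_bytes : LITERAL_AREA_BYTES;
        }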
     

26 Mar, 2011

1 commit

  • There is an ABBA deadlock between synchronous inode flushing in
    xfs_reclaim_inode and xfs_ifree_cluster. xfs_ifree_cluster locks the
    buffer, then takes inode ilocks, whilst synchronous reclaim takes
    the ilock followed by the buffer lock in xfs_iflush().

    To avoid this deadlock, separate the inode cluster buffer locking
    semantics from the synchronous inode flush semantics, allowing
    callers to attempt to lock the buffer but still issue synchronous IO
    if it can get the buffer. This requires xfs_iflush() calls that
    currently use non-blocking semantics to pass SYNC_TRYLOCK rather
    than 0 as the flags parameter.

    This allows xfs_reclaim_inode to avoid the deadlock on the buffer
    lock and detect the failure so that it can drop the inode ilock and
    restart the reclaim attempt on the inode. This allows
    xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
    release it and hence defuse the deadlock situation. It also has the
    pleasant side effect of avoiding IO in xfs_reclaim_inode when it
    tries to next reclaim the inode as it is now marked stale.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
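
    The lock-ordering fix, modelled with pthreads - try the buffer lock
    and back off rather than blocking in the inverse order:

        #include <pthread.h>
        #include <stdbool.h>

        static bool reclaim_flush(pthread_mutex_t *ilock,
                                  pthread_mutex_t *buf_lock)
        {
            pthread_mutex_lock(ilock);
            if (pthread_mutex_trylock(buf_lock) != 0) {
                pthread_mutex_unlock(ilock);  /* drop ilock, retry later */
                return false;                 /* defuses the ABBA deadlock */
            }
            /* ... issue the synchronous flush ... */
            pthread_mutex_unlock(buf_lock);
            pthread_mutex_unlock(ilock);
            return true;
        }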
     

20 Dec, 2010

1 commit

  • When inode buffer IO completes, usually all of the inodes are removed from
    the AIL. This involves processing them one at a time and taking the AIL
    lock once for every inode. When all CPUs are processing inode IO
    completions, this causes excessive amounts of contention on the AIL lock.

    Instead, change the way we process inode IO completion in the buffer
    IO done callback. Allow the inode IO done callback to walk the list
    of IO done callbacks and pull all the inodes off the buffer in one
    go and then process them as a batch.

    Once all the inodes for removal are collected, take the AIL lock
    once and do a bulk removal operation to minimise traffic on the AIL
    lock.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
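
    The batching pattern in outline - collect first, then one lock
    round-trip for the whole batch (list handling elided):

        #include <pthread.h>

        struct ail_model { pthread_mutex_t lock; /* ordered item list */ };

        static void ail_delete_bulk(struct ail_model *ail,
                                    void **items, int count)
        {
            int i;

            pthread_mutex_lock(&ail->lock);  /* once per buffer, not per inode */
            for (i = 0; i < count; i++) {
                /* unlink items[i] from the AIL list */
                (void)items[i];
            }
            pthread_mutex_unlock(&ail->lock);
        }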
     

01 Dec, 2010

1 commit

  • When an inode has been marked stale because the cluster is being
    freed, we don't want to (re-)insert this inode into the AIL. There
    is a race condition where the cluster buffer may be unpinned before
    the inode is inserted into the AIL during transaction committed
    processing. If the buffer is unpinned before the inode item has been
    committed and inserted, then it is possible for the buffer to be
    released, and hence the stale inode callbacks processed, before the inode
    is inserted into the AIL.

    In this case, we then insert a clean, stale inode into the AIL which
    will never get removed by an IO completion. It will, however, get
    reclaimed and that triggers an assert in xfs_inode_free()
    complaining about freeing an inode still in the AIL.

    This race can be avoided by not moving stale inodes forward in the AIL
    during transaction commit completion processing. This closes the
    race condition by ensuring we never insert clean stale inodes into
    the AIL. It is safe to do this because a dirty stale inode, by
    definition, must already be in the AIL.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

19 Oct, 2010

1 commit

  • Under heavy multi-way parallel create workloads, the VFS struggles
    to write back all the inodes that have been changed in age order.
    The bdi flusher thread becomes CPU bound, spending 85% of its time
    in the VFS code, mostly traversing the superblock dirty inode list
    to separate dirty inodes old enough to flush.

    We already keep an index of all metadata changes in age order - in
    the AIL - and continued log pressure will do age ordered writeback
    without any extra overhead at all. If there is no pressure on the
    log, the xfssyncd will periodically write back metadata in ascending
    disk address offset order, so it will be very efficient.

    Hence we can stop marking VFS inodes dirty during transaction commit
    or when changing timestamps during transactions. This limits the
    inodes on the superblock dirty list to those containing data or
    unlogged metadata changes.

    However, the timestamp changes are slightly more complex than this -
    there are a couple of places that do unlogged updates of the
    timestamps, and the VFS needs to be informed of these. Hence add a
    new function xfs_trans_ichgtime() for transactional changes,
    and leave xfs_ichgtime() for the non-transactional changes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

19 May, 2010

3 commits

  • The staleness of an object being unpinned can be directly derived
    from the object itself - there is no need to extract it from the
    object and then pass it as a parameter into IOP_UNPIN().

    This means we can kill the XFS_LID_BUF_STALE flag - it is set,
    checked and cleared in the same places as the XFS_BLI_STALE flag in
    the xfs_buf_log_item, so it is now redundant and hence safe to remove.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • We don't record pin counts in inode events right now, and this makes
    it difficult to track down problems related to pinning inodes. Add
    the pin count to the inode trace class and add trace events for
    pinning and unpinning inodes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • Each log item type does manual initialisation of the log item.
    Delayed logging introduces new fields that need initialisation, so
    factor all the open coded initialisation into a common function
    first.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
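
    What the factored-out initializer might look like in outline (the
    real log item has more fields; this is a trimmed model):

        struct log_item_model {
            int type;
            unsigned int flags;
            const void *ops;
            /* delayed logging fields now get initialised here too */
        };

        static void log_item_init(struct log_item_model *item, int type,
                                  const void *ops)
        {
            item->type  = type;
            item->flags = 0;
            item->ops   = ops;
        }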
     

02 Mar, 2010

1 commit

  • Inodes are only pinned/unpinned via the inode item methods, and lots of
    code relies on that fact. So remove the separate xfs_ipin/xfs_iunpin
    helpers and merge them into their only callers. This also fixes up
    various duplicate and/or incorrect comments.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

06 Feb, 2010

1 commit

  • We currently do background inode flushing asynchronously, resulting in
    inodes being written in whatever order the background writeback
    issues them. Not only that, there are also blocking and non-blocking
    asynchronous inode flushes, depending on where the flush comes from.

    This patch completely removes asynchronous inode writeback. It
    removes all the strange writeback modes and replaces them with
    either a synchronous flush or a non-blocking delayed write flush.
    That is, inode flushes will only issue IO directly if they are
    synchronous, and background flushing may do nothing if the operation
    would block (e.g. on a pinned inode or buffer lock).

    Delayed write flushes will now result in the inode buffer sitting in
    the delwri queue of the buffer cache to be flushed by either an AIL
    push or by the xfsbufd timing out the buffer. This will allow
    accumulation of dirty inode buffers in memory and allow optimisation
    of inode cluster writeback at the xfsbufd level where we have much
    greater queue depths than the block layer elevators. We will also
    get adjacent inode cluster buffer IO merging for free when a later
    patch in the series allows sorting of the delayed write buffers
    before dispatch.

    This effectively means that any inode that is written back by
    background writeback will be seen as flush locked during AIL
    pushing, and will result in the buffers being pushed from there.
    This writeback path is currently non-optimal, but the next patch
    in the series will fix that problem.

    A side effect of this delayed write mechanism is that background
    inode reclaim will no longer directly flush inodes, nor can it wait
    on the flush lock. The result is that inode reclaim must leave the
    inode in the reclaimable state until it is clean. Hence attempts to
    reclaim a dirty inode in the background will simply skip the inode
    until it is clean and this allows other mechanisms (i.e. xfsbufd) to
    do more optimal writeback of the dirty buffers. As a result, the
    inode reclaim code has been rewritten so that it no longer relies on
    the ambiguous return values of xfs_iflush() to determine whether it
    is safe to reclaim an inode.

    Portions of this patch are derived from patches by Christoph
    Hellwig.

    Version 2:
    - cleanup reclaim code as suggested by Christoph
    - log background reclaim inode flush errors
    - just pass sync flags to xfs_iflush

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

02 Feb, 2010

1 commit

  • All buffers logged into the AIL are marked as delayed write.
    When the AIL needs to push the buffer out, it issues an async write of the
    buffer. This means that IO patterns are dependent on the order of
    buffers in the AIL.

    Instead of flushing the buffer, promote the buffer in the delayed
    write list so that the next time the xfsbufd is run the buffer will
    be flushed by the xfsbufd. Return the state to the xfsaild that the
    buffer was promoted so that the xfsaild knows that it needs to cause
    the xfsbufd to run to flush the buffers that were promoted.

    Using the xfsbufd for issuing the IO allows us to dispatch all
    buffer IO from the one queue. This means that we can make much more
    enlightened decisions on what order to flush buffers to disk as
    we don't have multiple places issuing IO. Optimisations to xfsbufd
    will be in a future patch.

    Version 2
    - kill XFS_ITEM_FLUSHING as it is now unused.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

22 Jan, 2010

2 commits

  • Remove the XFS_LOG_FORCE argument, which was always set, and the
    XFS_LOG_URGE define, which was never used.

    Split xfs_log_force into two helpers - xfs_log_force, which forces
    the whole log, and xfs_log_force_lsn, which forces up to the
    specified LSN. The underlying implementations already were entirely
    separate, as were the users.

    Also re-indent the new _xfs_log_force/_xfs_log_force_lsn, which
    previously had a weird coding style.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • This macro only obfuscates the log item type assignments, so kill it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig