14 Mar, 2012

2 commits

  • Add an in-memory only flag to say we logged timestamps only, and use it to
    check if fdatasync can optimize away the log force.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Timestamps on regular files are the last metadata that XFS does not update
    transactionally. Now that we use the delaylog mode exclusively and made
    the log scode scale extremly well there is no need to bypass that code for
    timestamp updates. Logging all updates allows to drop a lot of code, and
    will allow for further performance improvements later on.

    Note that this patch drops optimized handling of fdatasync - it will be
    added back in a separate commit.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

18 Jan, 2012

4 commits

  • With all the size field updates out of the way xfs_file_aio_write can
    be further simplified by pushing all iolock handling into
    xfs_file_dio_aio_write and xfs_file_buffered_aio_write and using
    the generic generic_write_sync helper for synchronous writes.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • While xfs_iunlock is fine with 0 lockflags the calling conventions are much
    cleaner if xfs_file_aio_write_checks never returns without the iolock held.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Now that we use the VFS i_size field throughout XFS there is no need for the
    i_new_size field any more given that the VFS i_size field gets updated
    in ->write_end before unlocking the page, and thus is always uptodate when
    writeback could see a page. Removing i_new_size also has the advantage that
    we will never have to trim back di_size during a failed buffered write,
    given that it never gets updated past i_size.

    Note that currently the generic direct I/O code only updates i_size after
    calling our end_io handler, which requires a small workaround to make
    sure di_size actually makes it to disk. I hope to fix this properly in
    the generic code.

    A downside is that we lose the support for parallel non-overlapping O_DIRECT
    appending writes that recently was added. I don't think keeping the complex
    and fragile i_new_size infrastructure for this is a good tradeoff - if we
    really care about parallel appending writers we should investigate turning
    the iolock into a range lock, which would also allow for parallel
    non-overlapping buffered writers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • There is no fundamental need to keep an in-memory inode size copy in the XFS
    inode. We already have the on-disk value in the dinode, and the separate
    in-memory copy that we need for regular files only in the XFS inode.

    Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
    VFS inode i_size field for regular files. Switch code that was directly
    accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
    we are limited to regular files direct access of the VFS inode i_size field.

    This also allows dropping some fairly complicated code in the write path
    which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
    that is getting updated inside ->write_end.

    Note that we do not bother resetting the VFS i_size when truncating a file
    that gets freed to zero as there is no point in doing so because the VFS inode
    is no longer in use at this point. Just relax the assert in xfs_ifree to
    only check the on-disk size instead.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

02 Dec, 2011

1 commit


12 Oct, 2011

8 commits

  • Directories are only updated transactionally, which means fsync only
    needs to flush the log the inode is currently dirty, but not bother
    with checking for dirty data, non-transactional updates, and most
    importanly doesn't have to flush disk caches except as part of a
    transaction commit.

    While the first two optimizations can't easily be measured, the
    latter actually makes a difference when doing lots of fsync that do
    not actually have to commit the inode, e.g. because an earlier fsync
    already pushed the log far enough.

    The new xfs_dir_fsync is identical to xfs_nfs_commit_metadata except
    for the prototype, but I'm not sure creating a common helper for the
    two is worth it given how simple the functions are.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • There is no reason to keep a reference to the inode even if we unlock
    it during transaction commit because we never drop a reference between
    the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref
    back into xfs_trans_ijoin - the third argument decides if an unlock
    is needed now.

    I'm actually starting to wonder if allowing inodes to be unlocked
    at transaction commit really is worth the effort. The only real
    benefit is that they can be unlocked earlier when commiting a
    synchronous transactions, but that could be solved by doing the
    log force manually after the unlock, too.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Only read the LSN we need to push to with the ilock held, and then release
    it before we do the log force to improve concurrency.

    This also removes the only direct caller of _xfs_trans_commit, thus
    allowing it to be merged into the plain xfs_trans_commit again.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • xfs_bmapi() currently handles both extent map reading and
    allocation. As a result, the code is littered with "if (wr)"
    branches to conditionally do allocation operations if required.
    This makes the code much harder to follow and causes significant
    indent issues with the code.

    Given that read mapping is much simpler than allocation, we can
    split out read mapping from xfs_bmapi() and reuse the logic that
    we have already factored out do do all the hard work of handling the
    extent map manipulations. The results in a much simpler function for
    the common extent read operations, and will allow the allocation
    code to be simplified in another commit.

    Once xfs_bmapi_read() is implemented, convert all the callers of
    xfs_bmapi() that are only reading extents to use the new function.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Currently a buffered reader or writer can add pages to the pagecache
    while we are waiting for the iolock in xfs_file_dio_aio_write. Prevent
    this by re-checking mapping->nrpages after we got the iolock, and if
    nessecary upgrade the lock to exclusive mode. To simplify this a bit
    only take the ilock inside of xfs_file_aio_write_checks.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • We now have an i_dio_count filed and surrounding infrastructure to wait
    for direct I/O completion instead of i_icount, and we have never needed
    to iocount waits for buffered I/O given that we only set the page uptodate
    after finishing all required work. Thus remove i_iocount, and replace
    the actually needed waits with calls to inode_dio_wait.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • For append write workloads, extending the file requires a certain
    amount of exclusive locking to be done up front to ensure sanity in
    things like ensuring that we've zeroed any allocated regions
    between the old EOF and the start of the new IO.

    For single threads, this typically isn't a problem, and for large
    IOs we don't serialise enough for it to be a problem for two
    threads on really fast block devices. However for smaller IO and
    larger thread counts we have a problem.

    Take 4 concurrent sequential, single block sized and aligned IOs.
    After the first IO is submitted but before it completes, we end up
    with this state:

    IO 1 IO 2 IO 3 IO 4
    +-------+-------+-------+-------+
    ^ ^
    | |
    | |
    | |
    | \- ip->i_new_size
    \- ip->i_size

    And the IO is done without exclusive locking because offset i_size. When we submit IO 2, we see offset > ip->i_size, and
    grab the IO lock exclusive, because there is a chance we need to do
    EOF zeroing. However, there is already an IO in progress that avoids
    the need for IO zeroing because offset i_new_size. hence we
    could avoid holding the IO lock exlcusive for this. Hence after
    submission of the second IO, we'd end up this state:

    IO 1 IO 2 IO 3 IO 4
    +-------+-------+-------+-------+
    ^ ^
    | |
    | |
    | |
    | \- ip->i_new_size
    \- ip->i_size

    There is no need to grab the i_mutex of the IO lock in exclusive
    mode if we don't need to invalidate the page cache. Taking these
    locks on every direct IO effective serialises them as taking the IO
    lock in exclusive mode has to wait for all shared holders to drop
    the lock. That only happens when IO is complete, so effective it
    prevents dispatch of concurrent direct IO writes to the same inode.

    And so you can see that for the third concurrent IO, we'd avoid
    exclusive locking for the same reason we avoided the exclusive lock
    for the second IO.

    Fixing this is a bit more complex than that, because we need to hold
    a write-submission local value of ip->i_new_size to that clearing
    the value is only done if no other thread has updated it before our
    IO completes.....

    Signed-off-by: Dave Chinner
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • There is no need to grab the i_mutex of the IO lock in exclusive
    mode if we don't need to invalidate the page cache. Taking these
    locks on every direct IO effective serialises them as taking the IO
    lock in exclusive mode has to wait for all shared holders to drop
    the lock. That only happens when IO is complete, so effective it
    prevents dispatch of concurrent direct IO reads to the same inode.

    Fix this by taking the IO lock shared to check the page cache state,
    and only then drop it and take the IO lock exclusively if there is
    work to be done. Hence for the normal direct IO case, no exclusive
    locking will occur.

    Signed-off-by: Dave Chinner
    Tested-by: Joern Engel
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

13 Aug, 2011

1 commit

  • Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
    annoying subdirectories in the XFS source code. Besides the large
    amount of file rename the only changes are to the Makefile, a few
    files including headers with the subdirectory prefix, and the binary
    sysctl compat code that includes a header under fs/xfs/ from
    kernel/.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig