30 Jul, 2012

2 commits


22 Jul, 2012

2 commits

  • There is no need to keep this helper around, opencoding it in the only
    caller is just as clear.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • All callers of xfs_imap_to_bp want the dinode pointer, so let's calculate it
    inside xfs_imap_to_bp. Once that is done xfs_itobp becomes a fairly pointless
    wrapper which can be replaced with direct calls to xfs_imap_to_bp.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

15 May, 2012

3 commits

  • The only thing left in xfs_rw.h is a function prototype for an inode
    function. Move that to xfs_inode.h, and kill xfs_rw.h.

    Also move the function implementing the prototype from xfs_rw.c to
    xfs_inode.c so we only have one function left in xfs_rw.c

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
    and write back the buffers per-process instead of by waking up xfsbufd.

    This is now easily doable given that we have very few places left that write
    delwri buffers:

    - log recovery:
    Only done at mount time, and already forcing out the buffers
    synchronously using xfs_flush_buftarg

    - quotacheck:
    Same story.

    - dquot reclaim:
    Writes out dirty dquots on the LRU under memory pressure. We might
    want to look into doing more of this via xfsaild, but it's already
    more optimal than the synchronous inode reclaim that writes each
    buffer synchronously.

    - xfsaild:
    This is the main beneficiary of the change. By keeping a local list
    of buffers to write we reduce latency of writing out buffers, and
    more importably we can remove all the delwri list promotions which
    were hitting the buffer cache hard under sustained metadata loads.

    The implementation is very straight forward - xfs_buf_delwri_queue now gets
    a new list_head pointer that it adds the delwri buffers to, and all callers
    need to eventually submit the list using xfs_buf_delwi_submit or
    xfs_buf_delwi_submit_nowait. Buffers that already are on a delwri list are
    skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
    list. The biggest change to pass down the buffer list was done to the AIL
    pushing. Now that we operate on buffers the trylock, push and pushbuf log
    item methods are merged into a single push routine, which tries to lock the
    item, and if possible add the buffer that needs writeback to the buffer list.
    This leads to much simpler code than the previous split but requires the
    individual IOP_PUSH instances to unlock and reacquire the AIL around calls
    to blocking routines.

    Given that xfsailds now also handle writing out buffers, the conditions for
    log forcing and the sleep times needed some small changes. The most
    important one is that we consider an AIL busy as long we still have buffers
    to push, and the other one is that we do increment the pushed LSN for
    buffers that are under flushing at this moment, but still count them towards
    the stuck items for restart purposes. Without this we could hammer on stuck
    items without ever forcing the log and not make progress under heavy random
    delete workloads on fast flash storage devices.

    [ Dave Chinner:
    - rebase on previous patches.
    - improved comments for XBF_DELWRI_Q handling
    - fix XBF_ASYNC handling in queue submission (test 106 failure)
    - rename delwri submit function buffer list parameters for clarity
    - xfs_efd_item_push() should return XFS_ITEM_PINNED ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Instead of writing the buffer directly from inside xfs_iflush return it to
    the caller and let the caller decide what to do with the buffer. Also
    remove the pincount check in xfs_iflush that all non-blocking callers already
    implement and the now unused flags parameter.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

27 Mar, 2012

1 commit

  • When we read inodes via bulkstat, we generally only read them once
    and then throw them away - they never get used again. If we retain
    them in cache, then it simply causes the working set of inodes and
    other cached items to be reclaimed just so the inode cache can grow.

    Avoid this problem by marking inodes read by bulkstat not to be
    cached and check this flag in .drop_inode to determine whether the
    inode should be added to the VFS LRU or not. If the inode lookup
    hits an already cached inode, then don't set the flag. If the inode
    lookup hits an inode marked with no cache flag, remove the flag and
    allow it to be cached once the current reference goes away.

    Inodes marked as not cached will get cleaned up by the background
    inode reclaim or via memory pressure, so they will still generate
    some short term cache pressure. They will, however, be reclaimed
    much sooner and in preference to cache hot inodes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     

14 Mar, 2012

1 commit

  • Timestamps on regular files are the last metadata that XFS does not update
    transactionally. Now that we use the delaylog mode exclusively and made
    the log scode scale extremly well there is no need to bypass that code for
    timestamp updates. Logging all updates allows to drop a lot of code, and
    will allow for further performance improvements later on.

    Note that this patch drops optimized handling of fdatasync - it will be
    added back in a separate commit.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

06 Mar, 2012

1 commit

  • Replace xfs_ioend_new_eof with a new inline xfs_new_eof helper that
    doesn't require and ioend, and is available also outside of xfs_aops.c.

    Also make the code a bit more clear by using a normal if statement
    instead of a slightly misleading MIN().

    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

23 Feb, 2012

1 commit

  • There is no reason to wake up log space waiters when unlocking inodes or
    dquots, and the commit log has no explanation for this function either.

    Given that we now have exact log space wakeups everywhere we can assume
    the reason for this function was to paper over log space races in earlier
    XFS versions.

    Reviewed-by: Mark Tinguely
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

18 Jan, 2012

6 commits

  • Now that we use the VFS i_size field throughout XFS there is no need for the
    i_new_size field any more given that the VFS i_size field gets updated
    in ->write_end before unlocking the page, and thus is always uptodate when
    writeback could see a page. Removing i_new_size also has the advantage that
    we will never have to trim back di_size during a failed buffered write,
    given that it never gets updated past i_size.

    Note that currently the generic direct I/O code only updates i_size after
    calling our end_io handler, which requires a small workaround to make
    sure di_size actually makes it to disk. I hope to fix this properly in
    the generic code.

    A downside is that we lose the support for parallel non-overlapping O_DIRECT
    appending writes that recently was added. I don't think keeping the complex
    and fragile i_new_size infrastructure for this is a good tradeoff - if we
    really care about parallel appending writers we should investigate turning
    the iolock into a range lock, which would also allow for parallel
    non-overlapping buffered writers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • There is no fundamental need to keep an in-memory inode size copy in the XFS
    inode. We already have the on-disk value in the dinode, and the separate
    in-memory copy that we need for regular files only in the XFS inode.

    Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
    VFS inode i_size field for regular files. Switch code that was directly
    accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
    we are limited to regular files direct access of the VFS inode i_size field.

    This also allows dropping some fairly complicated code in the write path
    which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
    that is getting updated inside ->write_end.

    Note that we do not bother resetting the VFS i_size when truncating a file
    that gets freed to zero as there is no point in doing so because the VFS inode
    is no longer in use at this point. Just relax the assert in xfs_ifree to
    only check the on-disk size instead.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Replace i_pin_wait, which is only used during synchronous inode flushing
    with a bit waitqueue. This trades off a much smaller inode against
    slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit)
    bytes in the XFS inode.

    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We almost never block on i_flock, the exception is synchronous inode
    flushing. Instead of bloating the inode with a 16/24-byte completion
    that we abuse as a semaphore just implement it as a bitlock that uses
    a bit waitqueue for the rare sleeping path. This primarily is a
    tradeoff between a much smaller inode and a faster non-blocking
    path vs faster wakeups, and we are much better off with the former.

    A small downside is that we will lose lockdep checking for i_flock, but
    given that it's always taken inside the ilock that should be acceptable.

    Note that for example the inode writeback locking is implemented in a
    very similar way.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • To be used for bit wakeup i_flags needs to be an unsigned long or we'll
    run into trouble on big endian systems. Because of the 1-byte i_update
    field right after it this actually causes a fairly large size increase
    on its own (4 or 8 bytes), but that increase will be more than offset
    by the next two patches.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We spent a lot of effort to maintain this field, but it always equals to the
    fork size divided by the constant size of an extent. The prime use of it is
    to assert that the two stay in sync. Just divide the fork size by the extent
    size in the few places that we actually use it and remove the overhead
    of maintaining it. Also introduce a few helpers to consolidate the places
    where we actually care about the value.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

14 Jan, 2012

2 commits

  • .. and the just as dead bhv_desc forward declaration while we're at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • This wrapper isn't overly useful, not to say rather confusing.

    Around the call to xfs_itruncate_extents it does:

    - add tracing
    - add a few asserts in debug builds
    - conditionally update the inode size in two places
    - log the inode

    Both the tracing and the inode logging can be moved to xfs_itruncate_extents
    as they are useful for the attribute fork as well - in fact the attr code
    already does an equivalent xfs_trans_log_inode call just after calling
    xfs_itruncate_extents. The conditional size updates are a mess, and there
    was no reason to do them in two places anyway, as the first one was
    conditional on the inode having extents - but without extents we
    xfs_itruncate_extents would be a no-op and the placement wouldn't matter
    anyway. Instead move the size assignments and the asserts that make sense
    to the callers that want it.

    As a side effect of this clean up xfs_setattr_size by introducing variables
    for the old and new inode size, and moving the size updates into a common
    place.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

04 Jan, 2012

1 commit


30 Nov, 2011

1 commit

  • If we are doing synchronous inode reclaim we block the VM from making
    progress in memory reclaim. So if we encouter a flush locked inode
    promote it in the delwri list and wake up xfsbufd to write it out now.
    Without this we can get hangs of up to 30 seconds during workloads hitting
    synchronous inode reclaim.

    The scheme is copied from what we do for dquot reclaims.

    Reported-by: Simon Kirby
    Signed-off-by: Christoph Hellwig
    Tested-by: Simon Kirby
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

12 Oct, 2011

1 commit

  • We now have an i_dio_count filed and surrounding infrastructure to wait
    for direct I/O completion instead of i_icount, and we have never needed
    to iocount waits for buffered I/O given that we only set the page uptodate
    after finishing all required work. Thus remove i_iocount, and replace
    the actually needed waits with calls to inode_dio_wait.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

27 Jul, 2011

1 commit


08 Jul, 2011

3 commits

  • Remove the transaction pointer in the inode. It's only used to avoid
    passing down an argument in the bmap code, and for a few asserts in
    the transaction code right now.

    Also use the local variable ip in a few more places in xfs_inode_item_unlock,
    so that it isn't only used for debug builds after the above change.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Split the guts of xfs_itruncate_finish that loop over the existing extents
    and calls xfs_bunmapi on them into a new helper, xfs_itruncate_externs.
    Make xfs_attr_inactive call it directly instead of xfs_itruncate_finish,
    which allows to simplify the latter a lot, by only letting it deal with
    the data fork. As a result xfs_itruncate_finish is renamed to
    xfs_itruncate_data to make its use case more obvious.

    Also remove the sync parameter from xfs_itruncate_data, which has been
    unessecary since the introduction of the busy extent list in 2002, and
    completely dead code since 2003 when the XFS_BMAPI_ASYNC parameter was
    made a no-op.

    I can't actually see why the xfs_attr_inactive needs to set the transaction
    sync, but let's keep this patch simple and without changes in behaviour.

    Also avoid passing a useless argument to xfs_isize_check, and make it
    private to xfs_inode.c.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • xfs_itruncate_start is a rather length wrapper that evaluates to a call
    to xfs_ioend_wait and xfs_tosspages, and only has two callers.

    Instead of using the complicated checks left over from IRIX where we
    can to truncate the pagecache just call xfs_tosspages
    (aka truncate_inode_pages) directly as we want to get rid of all data
    after i_size, and truncate_inode_pages handles incorrect alignments
    and too large offsets just fine.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

24 Jun, 2011

1 commit

  • XFS inodes has several per-lifetime state fields that determine the
    behaviour of the inode. These state fields are not all reset when an
    inode is reused from the reclaimable state.

    This can lead to unexpected behaviour of the new inode such as
    speculative preallocation not being truncated away in the expected
    manner for local files until the inode is subsequently truncated,
    freed or cycles out of the cache. It can also lead to an inode being
    considered to be a filestream inode or having been truncated when
    that is not the case.

    Rework the reinitialisation of the inode when it is recycled to
    ensure that it is pristine before it is reused. While there, also
    fix the resetting of state flags in the recycling error paths so the
    inode does not become unreclaimable.

    Signed-off-by: Dave Chinner
    Signed-off-by: Alex Elder

    Dave Chinner
     

25 May, 2011

1 commit

  • The if_lastex field in struct xfs_ifork is only used as a temporary
    index during xfs_bmapi and xfs_bunmapi. Instead of using the inode
    fork to store it keep it local in the callchain. Fortunately this
    is very easy as we already pass a stack copy of it down the whole
    chain which can simplify be changed to be passed by reference.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

31 Mar, 2011

1 commit


08 Feb, 2011

1 commit

  • The rt bitmap and summary inodes do not participate in the normal inode
    locking protocol. Instead the rt bitmap inode can be locked in any
    transaction involving rt allocations, and the both of the rt inodes can
    be locked at the same time. Add specific lockdep subclasses for the rt
    inodes to prevent lockdep from blowing up.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

23 Dec, 2010

2 commits

  • A long standing problem for streaming writeѕ through the NFS server
    has been that the NFS server opens and closes file descriptors on an
    inode for every write. The result of this behaviour is that the
    ->release() function is called on every close and that results in
    XFS truncating speculative preallocation beyond the EOF. This has
    an adverse effect on file layout when multiple files are being
    written at the same time - they interleave their extents and can
    result in severe fragmentation.

    To avoid this problem, keep track of ->release calls made on a dirty
    inode. For most cases, an inode is only going to be opened once for
    writing and then closed again during it's lifetime in cache. Hence
    if there are multiple ->release calls when the inode is dirty, there
    is a good chance that the inode is being accessed by the NFS server.
    Hence set a flag the first time ->release is called while there are
    delalloc blocks still outstanding on the inode.

    If this flag is set when ->release is next called, then do no
    truncate away the speculative preallocation - leave it there so that
    subsequent writes do not need to reallocate the delalloc space. This
    will prevent interleaving of extents of different inodes written
    concurrently to the same AG.

    If we get this wrong, it is not a big deal as we truncate
    speculative allocation beyond EOF anyway in xfs_inactive() when the
    inode is thrown out of the cache.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • The XFS iolock needs to be re-initialised to a new lock class before
    it enters reclaim to prevent lockdep false positives. Unfortunately,
    this is not sufficient protection as inodes in the XFS_IRECLAIMABLE
    state can be recycled and not re-initialised before being reused.

    We need to re-initialise the lock state when transfering out of
    XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same
    class as if the inode was just allocated. Hence we need a specific
    lockdep class variable for the iolock so that both initialisations
    use the same class.

    While there, add a specific class for inodes in the reclaim state so
    that it is easy to tell from lockdep reports what state the inode
    was in that generated the report.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

26 Oct, 2010

1 commit


19 Oct, 2010

3 commits

  • This patch adds support for 32bit project quota identifiers.

    On disk format is backward compatible with 16bit projid numbers. projid
    on disk is now kept in two 16bit values - di_projid_lo (which holds the
    same position as old 16bit projid value) and new di_projid_hi (takes
    existing padding) and converts from/to 32bit value on the fly.

    xfs_admin (for existing fs), mkfs.xfs (for new fs) needs to be used
    to enable PROJID32BIT support.

    Signed-off-by: Arkadiusz Miśkiewicz
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Arkadiusz Mi?kiewicz
     
  • We're not actually passing around credentials inside XFS for a while
    now, so remove all xfs_cred.h with it's cred_t typedef and all
    instances of it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Under heavy multi-way parallel create workloads, the VFS struggles
    to write back all the inodes that have been changed in age order.
    The bdi flusher thread becomes CPU bound, spending 85% of it's time
    in the VFS code, mostly traversing the superblock dirty inode list
    to separate dirty inodes old enough to flush.

    We already keep an index of all metadata changes in age order - in
    the AIL - and continued log pressure will do age ordered writeback
    without any extra overhead at all. If there is no pressure on the
    log, the xfssyncd will periodically write back metadata in ascending
    disk address offset order so will be very efficient.

    Hence we can stop marking VFS inodes dirty during transaction commit
    or when changing timestamps during transactions. This will keep the
    inodes in the superblock dirty list to those containing data or
    unlogged metadata changes.

    However, the timstamp changes are slightly more complex than this -
    there are a couple of places that do unlogged updates of the
    timestamps, and the VFS need to be informed of these. Hence add a
    new function xfs_trans_ichgtime() for transactional changes,
    and leave xfs_ichgtime() for the non-transactional changes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

27 Jul, 2010

3 commits

  • xfs_ireclaim has to get and put te pag structure because it is only
    called with the inode to reclaim. The one caller of this function
    already has a reference on the pag and a pointer to is, so move the
    radix tree delete to the caller and remove xfs_ireclaim completely.
    This avoids a xfs_perag_get/put on every inode being reclaimed.

    The overhead was noticed in a bug report at:

    https://bugzilla.kernel.org/show_bug.cgi?id=16348

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • xfs_iput is just a small wrapper for xfs_iunlock + IRELE. Having this
    out of line wrapper means the trace events in those two can't track
    their caller properly. So just remove the wrapper and opencode the
    unlock + rele in the few callers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • We never get an i_mode of 0 or a locked VFS inode until we pass in the
    XFS_IGET_CREATE flag to xfs_iget, which makes xfs_iput_new equivalent to
    xfs_iput for the only caller. In addition to that xfs_nfs_get_inode
    does not even need to lock the inode given that the generation never changes
    for a life inode, so just pass a 0 lock_flags to xfs_iget and release
    the inode using IRELE in the error path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

24 Jun, 2010

1 commit

  • The block number comes from bulkstat based inode lookups to shortcut
    the mapping calculations. We ar enot able to trust anything from
    bulkstat, so drop the block number as well so that the correct
    lookups and mappings are always done.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner