31 Oct, 2013

1 commit

  • Lots of the dir code now goes through switches to determine what is
    the correct on-disk format to parse. It generally involves a
    "xfs_sbversion_hasfoo" check, deferencing the superblock version and
    feature fields and hence touching several cache lines per operation
    in the process. Some operations do multiple checks because they nest
    conditional operations and they don't pass the information in a
    direct fashion between each other.

    Hence, add an ops vector to the xfs_inode structure that is
    configured when the inode is initialised to point to all the correct
    decode and encoding operations. This will significantly reduce the
    branchiness and cacheline footprint of the directory object decoding
    and encoding.

    This is the first patch in a series of conversion patches. It will
    introduce the ops structure, the setup of it and add the first
    operation to the vector. Subsequent patches will convert directory
    ops one at a time to keep the changes simple and obvious.

    Just this patch shows the benefit of such an approach on code size.
    Just converting the two shortform dir operations as this patch does
    decreases the built binary size by ~1500 bytes:

    $ size fs/xfs/xfs.o.orig fs/xfs/xfs.o.p1
    text data bss dec hex filename
    794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig
    792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1
    $

    That's a significant decrease in the instruction cache footprint of
    the directory code for such a simple change, and indicates that this
    approach is definitely worth pursuing further.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     

24 Oct, 2013

1 commit

  • Currently the xfs_inode.h header has a dependency on the definition
    of the BMAP btree records as the inode fork includes an array of
    xfs_bmbt_rec_host_t objects in it's definition.

    Move all the btree format definitions from xfs_btree.h,
    xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
    xfs_format.h to continue the process of centralising the on-disk
    format definitions. With this done, the xfs inode definitions are no
    longer dependent on btree header files.

    The enables a massive culling of unnecessary includes, with close to
    200 #include directives removed from the XFS kernel code base.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     

09 Oct, 2013

1 commit


13 Aug, 2013

6 commits

  • There are a few small helper functions in xfs_util, all related to
    xfs_inode modifications. Move them all to xfs_inode.c so all
    xfs_inode operations are consiolidated in the one place.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Now we have xfs_inode.c for holding kernel-only XFS inode
    operations, move all the inode operations from xfs_vnodeops.c to
    this new file as it holds another set of kernel-only inode
    operations. The name of this file traces back to the days of Irix
    and it's vnodes which we don't have anymore.

    Essentially this move consolidates the inode locking functions
    and a bunch of XFS inode operations into the one file. Eventually
    the high level functions will be merged into the VFS interface
    functions in xfs_iops.c.

    This leaves only internal preallocation, EOF block manipulation and
    hole punching functions in vnodeops.c. Move these to xfs_bmap_util.c
    where we are already consolidating various in-kernel physical extent
    manipulation and querying functions.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The only thing remaining in xfs_inode.[ch] are the operations that
    read, write or verify physical inodes in their underlying buffers.
    Move all this code to xfs_inode_buf.[ch] and so we can stop sharing
    xfs_inode.[ch] with userspace.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The inode fork definitions are a combination of on-disk format
    definition and in-memory tracking and manipulation. They are both
    shared with userspace, so move them all into their own file so
    sharing is easy to do and track. This removes all inode fork
    related information from xfs_inode.h.

    Do the same for the all the C code that currently resides in
    xfs_inode.c for the same reason.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The log item format definitions are shared with userspace. Split
    them out of header files that contain kernel only defintions to make
    it simple to shared them.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

11 Jul, 2013

1 commit

  • Add project quota changes to all the places where group quota field
    is used:
    * add separate project quota members into various structures
    * split project quota and group quotas so that instead of overriding
    the group quota members incore, the new project quota members are
    used instead
    * get rid of usage of the OQUOTA flag incore, in favor of separate
    group and project quota flags.
    * add a project dquot argument to various functions.

    Not using the pquotino field from superblock yet.

    Signed-off-by: Chandra Seetharaman
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Chandra Seetharaman
     

22 Apr, 2013

1 commit

  • Add a new inode version with a larger core. The primary objective is
    to allow for a crc of the inode, and location information (uuid and ino)
    to verify it was written in the right place. We also extend it by:

    a creation time (for Samba);
    a changecount (for NFSv4);
    a flush sequence (in LSN format for recovery);
    an additional inode flags field; and
    some additional padding.

    These additional fields are not implemented yet, but already laid
    out in the structure.

    [dchinner@redhat.com] Added LSN and flags field, some factoring and rework to
    capture all the necessary information in the crc calculation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

15 Mar, 2013

1 commit


07 Feb, 2013

1 commit

  • In xfs_ifunlock() there is a call to wake_up_bit() after clearing
    the flush lock on the xfs inode. This is not guaranteed to be safe,
    as noted in the comments above wake_up_bit() beginning with:

    In order for this to function properly, as it uses
    waitqueue_active() internally, some kind of memory
    barrier must be done prior to calling this.

    Signed-off-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Alex Elder
     

16 Nov, 2012

4 commits

  • To separate the verifiers from iodone functions and associate read
    and write verifiers at the same time, introduce a buffer verifier
    operations structure to the xfs_buf.

    This avoids the need for assigning the write verifier, clearing the
    iodone function and re-running ioend processing in the read
    verifier, and gets rid of the nasty "b_pre_io" name for the write
    verifier function pointer. If we ever need to, it will also be
    easier to add further content specific callbacks to a buffer with an
    ops structure in place.

    We also avoid needing to export verifier functions, instead we
    can simply export the ops structures for those that are needed
    outside the function they are defined in.

    This patch also fixes a directory block readahead verifier issue
    it exposed.

    This patch also adds ops callbacks to the inode/alloc btree blocks
    initialised by growfs. These will need more work before they will
    work with CRCs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Metadata buffers that are read from disk have write verifiers
    already attached to them, but newly allocated buffers do not. Add
    appropriate write verifiers to all new metadata buffers.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • These verifiers are essentially the same code as the read verifiers,
    but do not require ioend processing. Hence factor the read verifier
    functions and add a new write verifier wrapper that is used as the
    callback.

    This is done as one large patch for all verifiers rather than one
    patch per verifier as the change is largely mechanical. This
    includes hooking up the write verifier via the read verifier
    function.

    Hooking up the write verifier for buffers obtained via
    xfs_trans_get_buf() will be done in a separate patch as that touches
    code in many different places rather than just the verifier
    functions.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Add an btree block verify callback function and pass it into the
    buffer read functions. Because each different btree block type
    requires different verification, add a function to the ops structure
    that is called from the generic code.

    Also, propagate the verification callback functions through the
    readahead functions, and into the external bmap and bulkstat inode
    readahead code that uses the generic btree buffer read functions.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     

09 Nov, 2012

1 commit

  • This check is used in multiple places to determine whether we
    should check for (and potentially free) post EOF blocks on an
    inode. Add a helper to consolidate the check.

    Note that when we remove an inode from the cache (xfs_inactive()),
    we are required to trim post-EOF blocks even if the inode is marked
    preallocated or append-only to maintain correct space accounting.
    The 'force' parameter to xfs_can_free_eofblocks() specifies whether
    we should ignore the prealloc/append-only status of the inode.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Brian Foster
     

18 Oct, 2012

1 commit

  • The inode cache functions remaining in xfs_iget.c can be moved to xfs_icache.c
    along with the other inode cache functions. This removes all functionality from
    xfs_iget.c, so the file can simply be removed.

    This move results in various functions now only having the scope of a single
    file (e.g. xfs_inode_free()), so clean up all the definitions and exported
    prototypes in xfs_icache.[ch] and xfs_inode.h appropriately.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

30 Jul, 2012

2 commits


22 Jul, 2012

2 commits

  • There is no need to keep this helper around, opencoding it in the only
    caller is just as clear.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • All callers of xfs_imap_to_bp want the dinode pointer, so let's calculate it
    inside xfs_imap_to_bp. Once that is done xfs_itobp becomes a fairly pointless
    wrapper which can be replaced with direct calls to xfs_imap_to_bp.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

15 May, 2012

3 commits

  • The only thing left in xfs_rw.h is a function prototype for an inode
    function. Move that to xfs_inode.h, and kill xfs_rw.h.

    Also move the function implementing the prototype from xfs_rw.c to
    xfs_inode.c so we only have one function left in xfs_rw.c

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
    and write back the buffers per-process instead of by waking up xfsbufd.

    This is now easily doable given that we have very few places left that write
    delwri buffers:

    - log recovery:
    Only done at mount time, and already forcing out the buffers
    synchronously using xfs_flush_buftarg

    - quotacheck:
    Same story.

    - dquot reclaim:
    Writes out dirty dquots on the LRU under memory pressure. We might
    want to look into doing more of this via xfsaild, but it's already
    more optimal than the synchronous inode reclaim that writes each
    buffer synchronously.

    - xfsaild:
    This is the main beneficiary of the change. By keeping a local list
    of buffers to write we reduce latency of writing out buffers, and
    more importably we can remove all the delwri list promotions which
    were hitting the buffer cache hard under sustained metadata loads.

    The implementation is very straight forward - xfs_buf_delwri_queue now gets
    a new list_head pointer that it adds the delwri buffers to, and all callers
    need to eventually submit the list using xfs_buf_delwi_submit or
    xfs_buf_delwi_submit_nowait. Buffers that already are on a delwri list are
    skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
    list. The biggest change to pass down the buffer list was done to the AIL
    pushing. Now that we operate on buffers the trylock, push and pushbuf log
    item methods are merged into a single push routine, which tries to lock the
    item, and if possible add the buffer that needs writeback to the buffer list.
    This leads to much simpler code than the previous split but requires the
    individual IOP_PUSH instances to unlock and reacquire the AIL around calls
    to blocking routines.

    Given that xfsailds now also handle writing out buffers, the conditions for
    log forcing and the sleep times needed some small changes. The most
    important one is that we consider an AIL busy as long we still have buffers
    to push, and the other one is that we do increment the pushed LSN for
    buffers that are under flushing at this moment, but still count them towards
    the stuck items for restart purposes. Without this we could hammer on stuck
    items without ever forcing the log and not make progress under heavy random
    delete workloads on fast flash storage devices.

    [ Dave Chinner:
    - rebase on previous patches.
    - improved comments for XBF_DELWRI_Q handling
    - fix XBF_ASYNC handling in queue submission (test 106 failure)
    - rename delwri submit function buffer list parameters for clarity
    - xfs_efd_item_push() should return XFS_ITEM_PINNED ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Instead of writing the buffer directly from inside xfs_iflush return it to
    the caller and let the caller decide what to do with the buffer. Also
    remove the pincount check in xfs_iflush that all non-blocking callers already
    implement and the now unused flags parameter.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

27 Mar, 2012

1 commit

  • When we read inodes via bulkstat, we generally only read them once
    and then throw them away - they never get used again. If we retain
    them in cache, then it simply causes the working set of inodes and
    other cached items to be reclaimed just so the inode cache can grow.

    Avoid this problem by marking inodes read by bulkstat not to be
    cached and check this flag in .drop_inode to determine whether the
    inode should be added to the VFS LRU or not. If the inode lookup
    hits an already cached inode, then don't set the flag. If the inode
    lookup hits an inode marked with no cache flag, remove the flag and
    allow it to be cached once the current reference goes away.

    Inodes marked as not cached will get cleaned up by the background
    inode reclaim or via memory pressure, so they will still generate
    some short term cache pressure. They will, however, be reclaimed
    much sooner and in preference to cache hot inodes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     

14 Mar, 2012

1 commit

  • Timestamps on regular files are the last metadata that XFS does not update
    transactionally. Now that we use the delaylog mode exclusively and made
    the log scode scale extremly well there is no need to bypass that code for
    timestamp updates. Logging all updates allows to drop a lot of code, and
    will allow for further performance improvements later on.

    Note that this patch drops optimized handling of fdatasync - it will be
    added back in a separate commit.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

06 Mar, 2012

1 commit

  • Replace xfs_ioend_new_eof with a new inline xfs_new_eof helper that
    doesn't require and ioend, and is available also outside of xfs_aops.c.

    Also make the code a bit more clear by using a normal if statement
    instead of a slightly misleading MIN().

    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

23 Feb, 2012

1 commit

  • There is no reason to wake up log space waiters when unlocking inodes or
    dquots, and the commit log has no explanation for this function either.

    Given that we now have exact log space wakeups everywhere we can assume
    the reason for this function was to paper over log space races in earlier
    XFS versions.

    Reviewed-by: Mark Tinguely
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

18 Jan, 2012

6 commits

  • Now that we use the VFS i_size field throughout XFS there is no need for the
    i_new_size field any more given that the VFS i_size field gets updated
    in ->write_end before unlocking the page, and thus is always uptodate when
    writeback could see a page. Removing i_new_size also has the advantage that
    we will never have to trim back di_size during a failed buffered write,
    given that it never gets updated past i_size.

    Note that currently the generic direct I/O code only updates i_size after
    calling our end_io handler, which requires a small workaround to make
    sure di_size actually makes it to disk. I hope to fix this properly in
    the generic code.

    A downside is that we lose the support for parallel non-overlapping O_DIRECT
    appending writes that recently was added. I don't think keeping the complex
    and fragile i_new_size infrastructure for this is a good tradeoff - if we
    really care about parallel appending writers we should investigate turning
    the iolock into a range lock, which would also allow for parallel
    non-overlapping buffered writers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • There is no fundamental need to keep an in-memory inode size copy in the XFS
    inode. We already have the on-disk value in the dinode, and the separate
    in-memory copy that we need for regular files only in the XFS inode.

    Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
    VFS inode i_size field for regular files. Switch code that was directly
    accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
    we are limited to regular files direct access of the VFS inode i_size field.

    This also allows dropping some fairly complicated code in the write path
    which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
    that is getting updated inside ->write_end.

    Note that we do not bother resetting the VFS i_size when truncating a file
    that gets freed to zero as there is no point in doing so because the VFS inode
    is no longer in use at this point. Just relax the assert in xfs_ifree to
    only check the on-disk size instead.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Replace i_pin_wait, which is only used during synchronous inode flushing
    with a bit waitqueue. This trades off a much smaller inode against
    slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit)
    bytes in the XFS inode.

    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We almost never block on i_flock, the exception is synchronous inode
    flushing. Instead of bloating the inode with a 16/24-byte completion
    that we abuse as a semaphore just implement it as a bitlock that uses
    a bit waitqueue for the rare sleeping path. This primarily is a
    tradeoff between a much smaller inode and a faster non-blocking
    path vs faster wakeups, and we are much better off with the former.

    A small downside is that we will lose lockdep checking for i_flock, but
    given that it's always taken inside the ilock that should be acceptable.

    Note that for example the inode writeback locking is implemented in a
    very similar way.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • To be used for bit wakeup i_flags needs to be an unsigned long or we'll
    run into trouble on big endian systems. Because of the 1-byte i_update
    field right after it this actually causes a fairly large size increase
    on its own (4 or 8 bytes), but that increase will be more than offset
    by the next two patches.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We spent a lot of effort to maintain this field, but it always equals to the
    fork size divided by the constant size of an extent. The prime use of it is
    to assert that the two stay in sync. Just divide the fork size by the extent
    size in the few places that we actually use it and remove the overhead
    of maintaining it. Also introduce a few helpers to consolidate the places
    where we actually care about the value.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

14 Jan, 2012

2 commits

  • .. and the just as dead bhv_desc forward declaration while we're at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • This wrapper isn't overly useful, not to say rather confusing.

    Around the call to xfs_itruncate_extents it does:

    - add tracing
    - add a few asserts in debug builds
    - conditionally update the inode size in two places
    - log the inode

    Both the tracing and the inode logging can be moved to xfs_itruncate_extents
    as they are useful for the attribute fork as well - in fact the attr code
    already does an equivalent xfs_trans_log_inode call just after calling
    xfs_itruncate_extents. The conditional size updates are a mess, and there
    was no reason to do them in two places anyway, as the first one was
    conditional on the inode having extents - but without extents we
    xfs_itruncate_extents would be a no-op and the placement wouldn't matter
    anyway. Instead move the size assignments and the asserts that make sense
    to the callers that want it.

    As a side effect of this clean up xfs_setattr_size by introducing variables
    for the old and new inode size, and moving the size updates into a common
    place.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

04 Jan, 2012

1 commit


30 Nov, 2011

1 commit

  • If we are doing synchronous inode reclaim we block the VM from making
    progress in memory reclaim. So if we encouter a flush locked inode
    promote it in the delwri list and wake up xfsbufd to write it out now.
    Without this we can get hangs of up to 30 seconds during workloads hitting
    synchronous inode reclaim.

    The scheme is copied from what we do for dquot reclaims.

    Reported-by: Simon Kirby
    Signed-off-by: Christoph Hellwig
    Tested-by: Simon Kirby
    Signed-off-by: Ben Myers

    Christoph Hellwig