12 Jan, 2017

1 commit

  • commit 5d829300bee000980a09ac2ccb761cb25867b67c upstream.

    The open-coded pattern:

    ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

    is all over the xfs code; provide a new helper
    xfs_iext_count(ifp) to count the number of inline extents
    in an inode fork.

    [dchinner: pick up several missed conversions]

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     

06 Oct, 2016

1 commit

  • Create a per-inode extent size allocator hint for copy-on-write. This
    hint is separate from the existing extent size hint so that CoW can
    take advantage of the fragmentation-reducing properties of extent size
    hints without disabling delalloc for regular writes.

    The extent size hint that's fed to the allocator during a copy on
    write operation is the greater of the cowextsize and regular extsize
    hint.

    During reflink, if we're sharing the entire source file to the entire
    destination file and the destination file doesn't already have a
    cowextsize hint, propagate the source file's cowextsize hint to the
    destination file.

    Furthermore, zero the bulkstat buffer prior to setting the fields
    so that we don't copy kernel memory contents into userspace.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

22 Jul, 2016

1 commit

  • One of the problems we currently have with delayed logging is that
    under serious memory pressure we can deadlock memory reclaim. THis
    occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
    inodes and issues a log force to unpin inodes that are dirty in the
    CIL.

    The CIL is pushed, but this will only occur once it gets the CIL
    context lock to ensure that all committing transactions are complete
    and no new transactions start being committed to the CIL while the
    push switches to a new context.

    The deadlock occurs when the CIL context lock is held by a
    committing process that is doing memory allocation for log vector
    buffers, and that allocation is then blocked on memory reclaim
    making progress. Memory reclaim, however, is blocked waiting for
    a log force to make progress, and so we effectively deadlock at this
    point.

    To solve this problem, we have to move the CIL log vector buffer
    allocation outside of the context lock so that memory reclaim can
    always make progress when it needs to force the log. The problem
    with doing this is that a CIL push can take place while we are
    determining if we need to allocate a new log vector buffer for
    an item and hence the current log vector may go away without
    warning. That means we canot rely on the existing log vector being
    present when we finally grab the context lock and so we must have a
    replacement buffer ready to go at all times.

    To ensure this, introduce a "shadow log vector" buffer that is
    always guaranteed to be present when we gain the CIL context lock
    and format the item. This shadow buffer may or may not be used
    during the formatting, but if the log item does not have an existing
    log vector buffer or that buffer is too small for the new
    modifications, we swap it for the new shadow buffer and format
    the modifications into that new log vector buffer.

    The result of this is that for any object we modify more than once
    in a given CIL checkpoint, we double the memory required
    to track dirty regions in the log. For single modifications then
    we consume the shadow log vectorwe allocate on commit, and that gets
    consumed by the checkpoint. However, if we make multiple
    modifications, then the second transaction commit will allocate a
    shadow log vector and hence we will end up with double the memory
    usage as only one of the log vectors is consumed by the CIL
    checkpoint. The remaining shadow vector will be freed when th elog
    item is freed.

    This can probably be optimised in future - access to the shadow log
    vector is serialised by the object lock (as opposited to the active
    log vector, which is controlled by the CIL context lock) and so we
    can probably free shadow log vector from some objects when the log
    item is marked clean on removal from the AIL.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

20 May, 2016

1 commit


06 Apr, 2016

2 commits

  • These three warnings are fixed:

    fs/xfs/xfs_inode.c:1033:44: warning: Using plain integer as NULL pointer
    fs/xfs/xfs_inode_item.c:525:20: warning: context imbalance in 'xfs_inode_item_push' - unexpected unlock
    fs/xfs/xfs_dquot.c:696:1: warning: symbol 'xfs_dq_get_next_id' was not declared. Should it be static?

    Signed-off-by: Eryu Guan
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eryu Guan
     
  • By overallocating the in-core inode fork data buffer and zero
    terminating the link target in xfs_init_local_fork we can avoid
    the memory allocation in ->follow_link.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

09 Feb, 2016

8 commits

  • Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
    the xfs_icdinode byte another 2 bytes and collapsing another 2 byte hole
    in the structure.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • We can store the di_changecount in the i_version field of the VFS
    inode and remove another 8 bytes from the xfs_icdinode.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Pull another 4 bytes out of the xfs_icdinode.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The VFS tracks the inode nlink just like the xfs_icdinode. We can
    remove the variable from the icdinode and use the VFS inode variable
    everywhere, reducing the size of the xfs_icdinode by a further 4
    bytes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • So we don't have to carry an di_onlink variable around anymore, move
    the inode conversion from v1 inode format to v2 inode format into
    xfs_inode_from_disk(). This means we can remove the di_onlink fields
    from the struct xfs_icdinode.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Now that the struct xfs_icdinode is not directly related to the
    on-disk format, we can cull things in it we really don't need to
    store:

    - magic number never changes
    - padding is not necessary
    - next_unlinked is never used
    - inode number is redundant
    - uuid is redundant
    - lsn is accessed directly from dinode
    - inode CRC is only accessed directly from dinode

    Hence we can remove these from the struct xfs_icdinode and redirect
    the code that uses them to the xfs_dinode appripriately. This
    reduces the size of the struct icdinode from 152 bytes to 88 bytes,
    and removes a fair chunk of unnecessary code, too.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The struct xfs_inode has two copies of the current timestamps in it,
    one in the vfs inode and one in the struct xfs_icdinode. Now that we
    no longer log the struct xfs_icdinode directly, we don't need to
    keep the timestamps in this structure. instead we can copy them
    straight out of the VFS inode when formatting the inode log item or
    the on-disk inode.

    This reduces the struct xfs_inode in size by 24 bytes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • We currently carry around and log an entire inode core in the
    struct xfs_inode. A lot of the information in the inode core is
    duplicated in the VFS inode, but we cannot remove this duplication
    of infomration because the inode core is logged directly in
    xfs_inode_item_format().

    Add a new function xfs_inode_item_format_core() that copies the
    inode core data into a struct xfs_icdinode that is pulled directly
    from the log vector buffer. This means we no longer directly
    copy the inode core, but copy the structures one member at a time.
    This will be slightly less efficient than copying, but will allow us
    to remove duplicate and unnecessary items from the struct xfs_inode.

    To enable us to do this, call the new structure a xfs_log_dinode,
    so that we know it's different to the physical xfs_dinode and the
    in-core xfs_icdinode.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

03 Nov, 2015

1 commit

  • xfs: timestamp updates cause excessive fdatasync log traffic

    Sage Weil reported that a ceph test workload was writing to the
    log on every fdatasync during an overwrite workload. Event tracing
    showed that the only metadata modification being made was the
    timestamp updates during the write(2) syscall, but fdatasync(2)
    is supposed to ignore them. The key observation was that the
    transactions in the log all looked like this:

    INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32

    And contained a flags field of 0x45 or 0x85, and had data and
    attribute forks following the inode core. This means that the
    timestamp updates were triggering dirty relogging of previously
    logged parts of the inode that hadn't yet been flushed back to
    disk.

    There are two parts to this problem. The first is that XFS relogs
    dirty regions in subsequent transactions, so it carries around the
    fields that have been dirtied since the last time the inode was
    written back to disk, not since the last time the inode was forced
    into the log.

    The second part is that on v5 filesystems, the inode change count
    update during inode dirtying also sets the XFS_ILOG_CORE flag, so
    on v5 filesystems this makes a timestamp update dirty the entire
    inode.

    As a result when fdatasync is run, it looks at the dirty fields in
    the inode, and sees more than just the timestamp flag, even though
    the only metadata change since the last fdatasync was just the
    timestamps. Hence we force the log on every subsequent fdatasync
    even though it is not needed.

    To fix this, add a new field to the inode log item that tracks
    changes since the last time fsync/fdatasync forced the log to flush
    the changes to the journal. This flag is updated when we dirty the
    inode, but we do it before updating the change count so it does not
    carry the "core dirty" flag from timestamp updates. The fields are
    zeroed when the inode is marked clean (due to writeback/freeing) or
    when an fsync/datasync forces the log. Hence if we only dirty the
    timestamps on the inode between fsync/fdatasync calls, the fdatasync
    will not trigger another log force.

    Over 100 runs of the test program:

    Ext4 baseline:
    runtime: 1.63s +/- 0.24s
    avg lat: 1.59ms +/- 0.24ms
    iops: ~2000

    XFS, vanilla kernel:
    runtime: 2.45s +/- 0.18s
    avg lat: 2.39ms +/- 0.18ms
    log forces: ~400/s
    iops: ~1000

    XFS, patched kernel:
    runtime: 1.49s +/- 0.26s
    avg lat: 1.46ms +/- 0.25ms
    log forces: ~30/s
    iops: ~1500

    Reported-by: Sage Weil
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

19 Aug, 2015

1 commit


28 Nov, 2014

3 commits


03 Oct, 2014

1 commit

  • Commit 3013683 ("xfs: remove all the inodes on a buffer from the AIL
    in bulk") made the xfs inode flush callback more efficient by
    combining all the inode writes on the buffer and the deletions of
    the inode log item from AIL.

    The initial loop in this patch should be looping through all
    the log items on the buffer to see which items have
    xfs_iflush_done as their callback function. But currently,
    only the log item passed to the function has its callback
    compared to xfs_iflush_done. If the log item pointer passed to
    the function does have the xfs_iflush_done callback function,
    then all the log items on the buffer are removed from the
    li_bio_list on the buffer b_fspriv and could be removed from
    the AIL even though they may have not been written yet.

    This problem is masked by the fact that currently all inodes on a
    buffer will have the same calback function - either xfs_iflush_done
    or xfs_istale_done - and hence the bug cannot manifest in any way.
    Still, we need to remove the landmine so that if we add new
    callbacks in future this doesn't cause us problems.

    Signed-off-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Mark Tinguely
     

25 Jun, 2014

1 commit

  • Convert all the errors the core XFs code to negative error signs
    like the rest of the kernel and remove all the sign conversion we
    do in the interface layers.

    Errors for conversion (and comparison) found via searches like:

    $ git grep " E" fs/xfs
    $ git grep "return E" fs/xfs
    $ git grep " E[A-Z].*;$" fs/xfs

    Negation points found via searches like:

    $ git grep "= -[a-z,A-Z]" fs/xfs
    $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
    $ git grep " -[a-z].*;" fs/xfs

    [ with some bits I missed from Brian Foster ]

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

20 May, 2014

1 commit

  • mkfs has turned on the XFS_SB_VERSION_NLINKBIT feature bit by
    default since November 2007. It's about time we simply made the
    kernel code turn it on by default and so always convert v1 inodes to
    v2 inodes when reading them in from disk or allocating them. This
    This removes needless version checks and modification when bumping
    link counts on inodes, and will take code out of a few common code
    paths.

    text data bss dec hex filename
    783251 100867 616 884734 d7ffe fs/xfs/xfs.o.orig
    782664 100867 616 884147 d7db3 fs/xfs/xfs.o.patched

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

13 Dec, 2013

6 commits

  • No need to keep the inode log format around all the time, we can
    easily generate it at iop_format time.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • With the new iop_format scheme there is no need to have a temporary buffer
    to format logged extents into, we can do so directly into the CIL. This
    also allows to remove the shortcut for big endian systems that probably
    hasn't gotten a lot of test coverage for a long time.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Instead of setting up pointers to memory locations in iop_format which then
    get copied into the CIL linear buffer after return move the copy into
    the individual inode items. This avoids the need to always have a memory
    block in the exact same layout that gets written into the log around, and
    allow the log items to be much more flexible in their in-memory layouts.

    The only caveat is that we need to properly align the data for each
    iovec so that don't have structures misaligned in subsequent iovecs.

    Note that all log item format routines now need to be careful to modify
    the copy of the item that was placed into the CIL after calls to
    xlog_copy_iovec instead of the in-memory copy.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Add a helper to abstract out filling the log iovecs in the log item
    format handlers. This will allow us to change the way we do the log
    item formatting more easily.

    The copy in the name is a bit confusing for now as it just assigns a
    pointer and lets the CIL code perform the copy, but that will change
    soon.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Split out a function to handle the data and attr fork, as well as a
    helper for the really old v1 inodes.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Split out two helpers to size the data and attribute to make the
    function more readable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

24 Oct, 2013

2 commits

  • Currently the xfs_inode.h header has a dependency on the definition
    of the BMAP btree records as the inode fork includes an array of
    xfs_bmbt_rec_host_t objects in it's definition.

    Move all the btree format definitions from xfs_btree.h,
    xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
    xfs_format.h to continue the process of centralising the on-disk
    format definitions. With this done, the xfs inode definitions are no
    longer dependent on btree header files.

    The enables a massive culling of unnecessary includes, with close to
    200 #include directives removed from the XFS kernel code base.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • xfs_trans.h has a dependency on xfs_log.h for a couple of
    structures. Most code that does transactions doesn't need to know
    anything about the log, but this dependency means that they have to
    include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
    files and clean up the includes to be in dependency order.

    In doing this, remove the direct include of xfs_trans_reserve.h from
    xfs_trans.h so that we remove the dependency between xfs_trans.h and
    xfs_mount.h. Hence the xfs_trans.h include can be moved to the
    indicate the actual dependencies other header files have on it.

    Note that these are kernel only header files, so this does not
    translate to any userspace changes at all.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     

14 Aug, 2013

1 commit

  • To begin optimising the CIL commit process, we need to have IOP_SIZE
    return both the number of vectors and the size of the data pointed
    to by the vectors. This enables us to calculate the size ofthe
    memory allocation needed before the formatting step and reduces the
    number of memory allocations per item by one.

    While there, kill the IOP_SIZE macro.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

22 Apr, 2013

1 commit

  • Add a new inode version with a larger core. The primary objective is
    to allow for a crc of the inode, and location information (uuid and ino)
    to verify it was written in the right place. We also extend it by:

    a creation time (for Samba);
    a changecount (for NFSv4);
    a flush sequence (in LSN format for recovery);
    an additional inode flags field; and
    some additional padding.

    These additional fields are not implemented yet, but already laid
    out in the structure.

    [dchinner@redhat.com] Added LSN and flags field, some factoring and rework to
    capture all the necessary information in the crc calculation.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

18 Dec, 2012

1 commit


22 Jun, 2012

1 commit

  • An inode in the AIL can be flush locked and marked stale if
    a cluster free transaction occurs at the right time. The
    inode item is then marked as flushing, which causes xfsaild
    to spin and leaves the filesystem stalled. This is
    reproduced by running xfstests 273 in a loop for an
    extended period of time.

    Check for stale inodes before the flush lock. This marks
    the inode as pinned, leads to a log flush and allows the
    filesystem to proceed.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Brian Foster
     

15 May, 2012

5 commits

  • With the removal of xfs_rw.h and other changes over time, xfs_bit.h
    is being included in many files that don't actually need it. Clean
    up the includes as necessary.

    Also move the only-used-once xfs_ialloc_find_free() static inline
    function out of a header file that is widely included to reduce
    the number of needless dependencies on xfs_bit.h.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Untangle the header file includes a bit by moving the definition of
    xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
    xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
    xfs_ag.h.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • xfs_trans_ail_delete_bulk() can be called from different contexts so
    if the item is not in the AIL we need different shutdown for each
    context. Pass in the shutdown method needed so the correct action
    can be taken.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
    and write back the buffers per-process instead of by waking up xfsbufd.

    This is now easily doable given that we have very few places left that write
    delwri buffers:

    - log recovery:
    Only done at mount time, and already forcing out the buffers
    synchronously using xfs_flush_buftarg

    - quotacheck:
    Same story.

    - dquot reclaim:
    Writes out dirty dquots on the LRU under memory pressure. We might
    want to look into doing more of this via xfsaild, but it's already
    more optimal than the synchronous inode reclaim that writes each
    buffer synchronously.

    - xfsaild:
    This is the main beneficiary of the change. By keeping a local list
    of buffers to write we reduce latency of writing out buffers, and
    more importably we can remove all the delwri list promotions which
    were hitting the buffer cache hard under sustained metadata loads.

    The implementation is very straight forward - xfs_buf_delwri_queue now gets
    a new list_head pointer that it adds the delwri buffers to, and all callers
    need to eventually submit the list using xfs_buf_delwi_submit or
    xfs_buf_delwi_submit_nowait. Buffers that already are on a delwri list are
    skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
    list. The biggest change to pass down the buffer list was done to the AIL
    pushing. Now that we operate on buffers the trylock, push and pushbuf log
    item methods are merged into a single push routine, which tries to lock the
    item, and if possible add the buffer that needs writeback to the buffer list.
    This leads to much simpler code than the previous split but requires the
    individual IOP_PUSH instances to unlock and reacquire the AIL around calls
    to blocking routines.

    Given that xfsailds now also handle writing out buffers, the conditions for
    log forcing and the sleep times needed some small changes. The most
    important one is that we consider an AIL busy as long we still have buffers
    to push, and the other one is that we do increment the pushed LSN for
    buffers that are under flushing at this moment, but still count them towards
    the stuck items for restart purposes. Without this we could hammer on stuck
    items without ever forcing the log and not make progress under heavy random
    delete workloads on fast flash storage devices.

    [ Dave Chinner:
    - rebase on previous patches.
    - improved comments for XBF_DELWRI_Q handling
    - fix XBF_ASYNC handling in queue submission (test 106 failure)
    - rename delwri submit function buffer list parameters for clarity
    - xfs_efd_item_push() should return XFS_ITEM_PINNED ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Instead of writing the buffer directly from inside xfs_iflush return it to
    the caller and let the caller decide what to do with the buffer. Also
    remove the pincount check in xfs_iflush that all non-blocking callers already
    implement and the now unused flags parameter.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

14 Mar, 2012

1 commit