27 Jul, 2010

8 commits

  • The b_strat callback is used by xfs_buf_iostrategy to perform additional
    checks before submitting a buffer. It is used in xfs_bwrite and when
    writing out delayed buffers. In xfs_bwrite it we can de-virtualize the
    call easily as b_strat is set a few lines above the call to
    xfs_buf_iostrategy. For the delayed buffers the rationale is a bit
    more complicated:

    - there are three callers of xfs_buf_delwri_queue, which places buffers
    on the delwri list:
    (1) xfs_bdwrite - this sets up b_strat, so it's fine
    (2) xfs_buf_iorequest. None of the callers can have XBF_DELWRI set:
    - xlog_bdstrat is only used for log buffers, which are never delwri
    - _xfs_buf_read explicitly clears the delwri flag
    - xfs_buf_iodone_work retries log buffers only
    - xfsbdstrat - only used for reads, superblock writes without the
    delwri flag, log I/O and file zeroing with explicitly allocated
    buffers.
    - xfs_buf_iostrategy - only calls xfs_buf_iorequest if b_strat is
    not set
    (3) xfs_buf_unlock
    - only puts the buffer on the delwri list if the DELWRI flag is
    already set. The DELWRI flag is only ever set in xfs_bwrite,
    xfs_buf_iodone_callbacks, or xfs_trans_log_buf. For
    xfs_buf_iodone_callbacks and xfs_trans_log_buf we require
    an initialized buf item, which means b_strat was set to
    xfs_bdstrat_cb in xfs_buf_item_init.

    Conclusion: we can just get rid of the callback and replace it with
    explicit calls to xfs_bdstrat_cb.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • By making this member a void pointer we can get rid of a lot of pointless
    casts.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode
    them in their only callers, just like we did for the inode pinning a while
    ago. Also remove duplicate trace points - the bufitem tracepoints cover
    all the information that is present in a buffer tracepoint.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Stop the function pointer casting madness and give all the li_cb instances
    correct prototype.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Stop the function pointer casting madness and give all the xfs_item_ops the
    correct prototypes.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • The unpin_remove item operation instances always share most of the
    implementation with the respective unpin implementation. So instead
    of keeping two different entry points add a remove flag to the unpin
    operation and share the code more easily.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Currently we track log item descriptor belonging to a transaction using a
    complex opencoded chunk allocator. This code has been there since day one
    and seems to work around the lack of an efficient slab allocator.

    This patch replaces it with dynamically allocated log item descriptors
    from a dedicated slab pool, linked to the transaction by a linked list.

    This allows to greatly simplify the log item descriptor tracking to the
    point where it's just a couple hundred lines in xfs_trans.c instead of
    a separate file. The external API has also been simplified while we're
    at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
    delete items from a transaction have been simplified to the bare minium,
    and the xfs_trans_find_item function is replaced with a direct dereference
    of the li_desc field. All debug code walking the list of log items in
    a transaction is down to a simple list_for_each_entry.

    Note that we could easily use a singly linked list here instead of the
    double linked list from list.h as the fastpath only does deletion from
    sequential traversal. But given that we don't have one available as
    a library function yet I use the list.h functions for simplicity.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Dmapi support was never merged upstream, but we still have a lot of hooks
    bloating XFS for it, all over the fast pathes of the filesystem.

    This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
    support in mainline at least the namespace events can be done much saner
    in the VFS instead of the individual filesystem, so it's not like this
    is much help for future work.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

24 May, 2010

3 commits

  • With delayed logging, we can get inode allocation buffers in the
    same transaction inode unlink buffers. We don't currently mark inode
    allocation buffers in the log, so inode unlink buffers take
    precedence over allocation buffers.

    The result is that when they are combined into the same checkpoint,
    only the unlinked inode chain fields are replayed, resulting in
    uninitialised inode buffers being detected when the next inode
    modification is replayed.

    To fix this, we need to ensure that we do not set the inode buffer
    flag in the buffer log item format flags if the inode allocation has
    not already hit the log. To avoid requiring a change to log
    recovery, we really need to make this a modification that relies
    only on in-memory sate.

    We can do this by checking during buffer log formatting (while the
    CIL cannot be flushed) if we are still in the same sequence when we
    commit the unlink transaction as the inode allocation transaction.
    If we are, then we do not add the inode buffer flag to the buffer
    log format item flags. This means the entire buffer will be
    replayed, not just the unlinked fields. We do this while
    CIL flusheѕ are locked out to ensure that we don't race with the
    sequence numbers changing and hence fail to put the inode buffer
    flag in the buffer format flags when we really need to.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Clean up the buffer log format (XFS_BLI_*) flags because they have a
    polluted namespace. They XFS_BLI_ prefix is used for both in-memory
    and on-disk flag feilds, but have overlapping values for different
    flags. Rename the buffer log format flags to use the XFS_BLF_*
    prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed
    flags.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • The buffer log item reference counts used to take referenceѕ for every
    transaction, similar to the pin counting. This is symmetric (like the
    pin/unpin) with respect to transaction completion, but with dleayed logging
    becomes assymetric as the pinning becomes assymetric w.r.t. transaction
    completion.

    To make both cases the same, allow the buffer pinning to take a reference to
    the buffer log item and always drop the reference the transaction has on it
    when being unlocked. This is balanced correctly because the unpin operation
    always drops a reference to the log item. Hence reference counting becomes
    symmetric w.r.t. item pinning as well as w.r.t active transactions and as a
    result the reference counting model remain consistent between normal and
    delayed logging.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

19 May, 2010

2 commits

  • The staleness of a object being unpinned can be directly derived
    from the object itself - there is no need to extract it from the
    object then pass it as a parameter into IOP_UNPIN().

    This means we can kill the XFS_LID_BUF_STALE flag - it is set,
    checked and cleared in the same places XFS_BLI_STALE flag in the
    xfs_buf_log_item so it is now redundant and hence safe to remove.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • Each log item type does manual initialisation of the log item.
    Delayed logging introduces new fields that need initialisation, so
    factor all the open coded initialisation into a common function
    first.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

02 Feb, 2010

1 commit

  • All buffers logged into the AIL are marked as delayed write.
    When the AIL needs to push the buffer out, it issues an async write of the
    buffer. This means that IO patterns are dependent on the order of
    buffers in the AIL.

    Instead of flushing the buffer, promote the buffer in the delayed
    write list so that the next time the xfsbufd is run the buffer will
    be flushed by the xfsbufd. Return the state to the xfsaild that the
    buffer was promoted so that the xfsaild knows that it needs to cause
    the xfsbufd to run to flush the buffers that were promoted.

    Using the xfsbufd for issuing the IO allows us to dispatch all
    buffer IO from the one queue. This means that we can make much more
    enlightened decisions on what order to flush buffers to disk as
    we don't have multiple places issuing IO. Optimisations to xfsbufd
    will be in a future patch.

    Version 2
    - kill XFS_ITEM_FLUSHING as it is now unused.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

22 Jan, 2010

1 commit


15 Dec, 2009

1 commit

  • Convert the old xfs tracing support that could only be used with the
    out of tree kdb and xfsidbg patches to use the generic event tracer.

    To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
    all xfs trace channels by:

    echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

    or alternatively enable single events by just doing the same in one
    event subdirectory, e.g.

    echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable

    or set more complex filters, etc. In Documentation/trace/events.txt
    all this is desctribed in more detail. To reads the events do a

    cat /sys/kernel/debug/tracing/trace

    Compared to the last posting this patch converts the tracing mostly to
    the one tracepoint per callsite model that other users of the new
    tracing facility also employ. This allows a very fine-grained control
    of the tracing, a cleaner output of the traces and also enables the
    perf tool to use each tracepoint as a virtual performance counter,
    allowing us to e.g. count how often certain workloads git various
    spots in XFS. Take a look at

    http://lwn.net/Articles/346470/

    for some examples.

    Also the btree tracing isn't included at all yet, as it will require
    additional core tracing features not in mainline yet, I plan to
    deliver it later.

    And the really nice thing about this patch is that it actually removes
    many lines of code while adding this nice functionality:

    fs/xfs/Makefile | 8
    fs/xfs/linux-2.6/xfs_acl.c | 1
    fs/xfs/linux-2.6/xfs_aops.c | 52 -
    fs/xfs/linux-2.6/xfs_aops.h | 2
    fs/xfs/linux-2.6/xfs_buf.c | 117 +--
    fs/xfs/linux-2.6/xfs_buf.h | 33
    fs/xfs/linux-2.6/xfs_fs_subr.c | 3
    fs/xfs/linux-2.6/xfs_ioctl.c | 1
    fs/xfs/linux-2.6/xfs_ioctl32.c | 1
    fs/xfs/linux-2.6/xfs_iops.c | 1
    fs/xfs/linux-2.6/xfs_linux.h | 1
    fs/xfs/linux-2.6/xfs_lrw.c | 87 --
    fs/xfs/linux-2.6/xfs_lrw.h | 45 -
    fs/xfs/linux-2.6/xfs_super.c | 104 ---
    fs/xfs/linux-2.6/xfs_super.h | 7
    fs/xfs/linux-2.6/xfs_sync.c | 1
    fs/xfs/linux-2.6/xfs_trace.c | 75 ++
    fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
    fs/xfs/linux-2.6/xfs_vnode.h | 4
    fs/xfs/quota/xfs_dquot.c | 110 ---
    fs/xfs/quota/xfs_dquot.h | 21
    fs/xfs/quota/xfs_qm.c | 40 -
    fs/xfs/quota/xfs_qm_syscalls.c | 4
    fs/xfs/support/ktrace.c | 323 ---------
    fs/xfs/support/ktrace.h | 85 --
    fs/xfs/xfs.h | 16
    fs/xfs/xfs_ag.h | 14
    fs/xfs/xfs_alloc.c | 230 +-----
    fs/xfs/xfs_alloc.h | 27
    fs/xfs/xfs_alloc_btree.c | 1
    fs/xfs/xfs_attr.c | 107 ---
    fs/xfs/xfs_attr.h | 10
    fs/xfs/xfs_attr_leaf.c | 14
    fs/xfs/xfs_attr_sf.h | 40 -
    fs/xfs/xfs_bmap.c | 507 +++------------
    fs/xfs/xfs_bmap.h | 49 -
    fs/xfs/xfs_bmap_btree.c | 6
    fs/xfs/xfs_btree.c | 5
    fs/xfs/xfs_btree_trace.h | 17
    fs/xfs/xfs_buf_item.c | 87 --
    fs/xfs/xfs_buf_item.h | 20
    fs/xfs/xfs_da_btree.c | 3
    fs/xfs/xfs_da_btree.h | 7
    fs/xfs/xfs_dfrag.c | 2
    fs/xfs/xfs_dir2.c | 8
    fs/xfs/xfs_dir2_block.c | 20
    fs/xfs/xfs_dir2_leaf.c | 21
    fs/xfs/xfs_dir2_node.c | 27
    fs/xfs/xfs_dir2_sf.c | 26
    fs/xfs/xfs_dir2_trace.c | 216 ------
    fs/xfs/xfs_dir2_trace.h | 72 --
    fs/xfs/xfs_filestream.c | 8
    fs/xfs/xfs_fsops.c | 2
    fs/xfs/xfs_iget.c | 111 ---
    fs/xfs/xfs_inode.c | 67 --
    fs/xfs/xfs_inode.h | 76 --
    fs/xfs/xfs_inode_item.c | 5
    fs/xfs/xfs_iomap.c | 85 --
    fs/xfs/xfs_iomap.h | 8
    fs/xfs/xfs_log.c | 181 +----
    fs/xfs/xfs_log_priv.h | 20
    fs/xfs/xfs_log_recover.c | 1
    fs/xfs/xfs_mount.c | 2
    fs/xfs/xfs_quota.h | 8
    fs/xfs/xfs_rename.c | 1
    fs/xfs/xfs_rtalloc.c | 1
    fs/xfs/xfs_rw.c | 3
    fs/xfs/xfs_trans.h | 47 +
    fs/xfs/xfs_trans_buf.c | 62 -
    fs/xfs/xfs_vnodeops.c | 8
    70 files changed, 2151 insertions(+), 2592 deletions(-)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

22 Dec, 2008

1 commit


11 Dec, 2008

1 commit

  • Replace the b_fspriv pointer and it's ugly accessors with a properly types
    xfs_mount pointer. Also switch log reocvery over to it instead of using
    b_fspriv for the mount pointer.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Lachlan McIlroy

    Christoph Hellwig
     

30 Oct, 2008

3 commits

  • Change all the remaining AIL API functions that are passed struct
    xfs_mount pointers to pass pointers directly to the struct xfs_ail being
    used. With this conversion, all external access to the AIL is via the
    struct xfs_ail. Hence the operation and referencing of the AIL is almost
    entirely independent of the xfs_mount that is using it - it is now much
    more tightly tied to the log and the items it is tracking in the log than
    it is tied to the xfs_mount.

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32353a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
  • Add an xfs_ail pointer to log items so that the log items can reference
    the AIL directly during callbacks without needed a struct xfs_mount.

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32352a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     
  • Bring the ail lock inside the struct xfs_ail. This means the AIL can be
    entirely manipulated via the struct xfs_ail rather than needing both the
    struct xfs_mount and the struct xfs_ail.

    SGI-PV: 988143

    SGI-Modid: xfs-linux-melb:xfs-kern:32350a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    David Chinner
     

17 Sep, 2008

1 commit

  • We have a use-after-free issue where log completions access buffers via
    the buffer log item and the buffer has already been freed. Fix this by
    taking a reference on the buffer when attaching the buffer log item and
    release the hold when the buffer log item is detached and we no longer
    need the buffer. Also create a new function xfs_buf_item_free() to combine
    some common code.

    SGI-PV: 985757

    SGI-Modid: xfs-linux-melb:xfs-kern:32025a

    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Christoph Hellwig

    Lachlan McIlroy
     

13 Aug, 2008

2 commits

  • Use KM_NOFS to prevent recursion back into the filesystem which can cause
    deadlocks.

    In the case of xfs_iread() we hold the lock on the inode cluster buffer
    while allocating memory for the trace buffers. If we recurse back into XFS
    to flush data that may require a transaction to allocate extents which
    needs log space. This can deadlock with the xfsaild thread which can't
    push the tail of the log because it is trying to get the inode cluster
    buffer lock.

    SGI-PV: 981498

    SGI-Modid: xfs-linux-melb:xfs-kern:31838a

    Signed-off-by: Lachlan McIlroy
    Signed-off-by: David Chinner

    Lachlan McIlroy
     
  • The xfs_buf_t b_iodonesema is really just a semaphore that wants to be a
    completion. Change it to a completion and remove the last user of the
    sema_t from XFS.

    SGI-PV: 981498

    SGI-Modid: xfs-linux-melb:xfs-kern:31815a

    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    David Chinner
     

28 Jul, 2008

1 commit

  • kmem_free() function takes (ptr, size) arguments but doesn't actually use
    second one.

    This patch removes size argument from all callsites.

    SGI-PV: 981498
    SGI-Modid: xfs-linux-melb:xfs-kern:31050a

    Signed-off-by: Denys Vlasenko
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Denys Vlasenko
     

18 Apr, 2008

1 commit

  • xfs_bawrite() can return immediate error status on async writes. Unlike
    xfsbdstrat() we don't ever check the error on the buffer after the call,
    so we currently do not catch errors at all here. Ensure we catch and
    propagate or warn to the syslog about up-front async write errors.

    SGI-PV: 980084
    SGI-Modid: xfs-linux-melb:xfs-kern:30824a

    Signed-off-by: David Chinner
    Signed-off-by: Niv Sardi
    Signed-off-by: Lachlan McIlroy

    David Chinner
     

07 Feb, 2008

1 commit


15 Oct, 2007

1 commit

  • One of the perpetual scaling problems XFS has is indexing it's incore
    inodes. We currently uses hashes and the default hash sizes chosen can
    only ever be a tradeoff between memory consumption and the maximum
    realistic size of the cache.

    As a result, anyone who has millions of inodes cached on a filesystem
    needs to tunes the size of the cache via the ihashsize mount option to
    allow decent scalability with inode cache operations.

    A further problem is the separate inode cluster hash, whose size is based
    on the ihashsize but is smaller, and so under certain conditions (sparse
    cluster cache population) this can become a limitation long before the
    inode hash is causing issues.

    The following patchset removes the inode hash and cluster hash and
    replaces them with radix trees to avoid the scalability limitations of the
    hashes. It also reduces the size of the inodes by 3 pointers....

    SGI-PV: 969561
    SGI-Modid: xfs-linux-melb:xfs-kern:29481a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

    David Chinner
     

14 Jul, 2007

1 commit

  • xfs_count_bits is only called once, and is then compared to 0. IOW, what
    it really wants to know is, is the bitmap empty. This can be done more
    simply, certainly.

    SGI-PV: 966503
    SGI-Modid: xfs-linux-melb:xfs-kern:28944a

    Signed-off-by: Eric Sandeen
    Signed-off-by: David Chinner
    Signed-off-by: Tim Shimmin

    Eric Sandeen
     

10 Feb, 2007

1 commit

  • gcc-4.1 and more recent aggressively inline static functions which
    increases XFS stack usage by ~15% in critical paths. Prevent this from
    occurring by adding noinline to the STATIC definition.

    Also uninline some functions that are too large to be inlined and were
    causing problems with CONFIG_FORCED_INLINING=y.

    Finally, clean up all the different users of inline, __inline and
    __inline__ and put them under one STATIC_INLINE macro. For debug kernels
    the STATIC_INLINE macro uninlines those functions.

    SGI-PV: 957159
    SGI-Modid: xfs-linux-melb:xfs-kern:27585a

    Signed-off-by: David Chinner
    Signed-off-by: David Chatterton
    Signed-off-by: Tim Shimmin

    David Chinner
     

28 Sep, 2006

2 commits


20 Jun, 2006

1 commit


09 Jun, 2006

2 commits


29 Mar, 2006

1 commit


02 Nov, 2005

2 commits


02 Sep, 2005

1 commit


21 Jun, 2005

1 commit