17 Dec, 2010

1 commit

  • Opencode the xfs_iomap code in its two callers. The overlap of
    passed flags was already minimal and will be further reduced in the
    next patch.

    As a side effect the BMAPI_* flags for xfs_bmapi and the IO_* flags
    for I/O end processing are merged into a single set of flags, which
    should be a bit more descriptive of the operation we perform.

    Also improve the tracing by giving each caller its own set of
    tracepoints.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

23 Oct, 2010

1 commit

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (36 commits)
    xfs: semaphore cleanup
    xfs: Extend project quotas to support 32bit project ids
    xfs: remove xfs_buf wrappers
    xfs: remove xfs_cred.h
    xfs: remove xfs_globals.h
    xfs: remove xfs_version.h
    xfs: remove xfs_refcache.h
    xfs: fix the xfs_trans_committed
    xfs: remove unused t_callback field in struct xfs_trans
    xfs: fix bogus m_maxagi check in xfs_iget
    xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters
    xfs: do not use xfs_mod_incore_sb for per-cpu counters
    xfs: remove XFS_MOUNT_NO_PERCPU_SB
    xfs: pack xfs_buf structure more tightly
    xfs: convert buffer cache hash to rbtree
    xfs: serialise inode reclaim within an AG
    xfs: batch inode reclaim lookup
    xfs: implement batched inode lookups for AG walking
    xfs: split out inode walk inode grabbing
    xfs: split inode AG walking into separate code for reclaim
    ...

    Linus Torvalds
     

19 Oct, 2010

2 commits

  • The reclaim walk requires different locking and has a slightly
    different walk algorithm, so separate it out so that it can be
    optimised separately.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • xfs_buf_get_nodaddr() is really used to allocate buffers that are
    uncached. While they are not directly assigned a disk address, the
    fact that they are not cached is a more important distinction. With
    the upcoming uncached buffer read primitive, we should be consistent
    with this distinction.

    While there, make page allocation in xfs_buf_get_nodaddr() safe
    against memory reclaim re-entrancy into the filesystem by allowing
    a flags parameter to be passed.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

10 Sep, 2010

1 commit


10 Aug, 2010

1 commit


27 Jul, 2010

5 commits

  • When CONFIG_XFS_POSIX_ACL is not set, "xfs_check_acl" is #defined
    to NULL, which breaks the code attempting to add a tracepoint
    on this function.

    Only define the tracepoint when the function exists (a sketch is
    included below).

    Signed-off-by: Tony Luck
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Tony Luck
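
    A sketch of the shape of the fix; the event class and prototype shown
    here are stand-ins for whatever xfs_trace.h actually uses:

    #ifdef CONFIG_XFS_POSIX_ACL
    /* only emit the tracepoint when xfs_check_acl() is a real function */
    DEFINE_EVENT(xfs_inode_class, xfs_check_acl,
        TP_PROTO(struct inode *inode),
        TP_ARGS(inode));
    #endif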
     
  • Replace the xfs_itrace_entry catchall with specific trace points. For
    most simple callers we now use the simple inode class, which used to
    be the iget class, but add more detailed tracing for namespace events,
    which now includes the names of the directory entries manipulated.

    Remove the xfs_inactive trace point, which is a duplicate of the clear_inode
    one, and the xfs_change_file_space trace point, which is immediately
    followed by the more specific alloc/free space trace points.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • The xfs_iget_alloc/found tracepoints are a bit misnamed and misplaced.
    Rename them to xfs_iget_hit/xfs_iget_miss and move them to the beginning
    of the xfs_iget_cache_hit/miss functions. Add a new xfs_iget_reclaim_fail
    tracepoint for the case where we fail to re-initialize a VFS inode,
    and add a second instance of the xfs_iget_skip tracepoint for the case
    of a failed igrab() call.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • The writepage implementation in XFS still tries to deal with dirty but
    unmapped buffers, which used to be caused by writes through shared mmaps. Since
    the introduction of ->page_mkwrite these can't happen anymore, so remove the
    code dealing with them.

    Note that the all_bh variable, which causes us to start I/O on all buffers on
    the page, was controlled by the count of unmapped buffers, which also
    included those not actually dirty. It's now unconditionally initialized to
    0 but set to 1 for the case of small file size extensions. It probably can
    be removed entirely, but that's left for another patch.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode
    them in their only callers, just like we did for the inode pinning a while
    ago. Also remove duplicate trace points - the bufitem tracepoints cover
    all the information that is present in a buffer tracepoint.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

20 Jul, 2010

1 commit

  • https://bugzilla.kernel.org/show_bug.cgi?id=16348

    When the filesystem grows to a large number of allocation groups,
    the summing of reclaimable inodes gets expensive. In many cases,
    most AGs won't have any reclaimable inodes and so we are wasting CPU
    time aggregating over these AGs. This is particularly important for
    the inode shrinker that gets called frequently under memory
    pressure.

    To avoid the overhead, track AGs with reclaimable inodes in the
    per-ag radix tree so that we can find all the AGs with reclaimable
    inodes via a simple gang tag lookup. This involves setting the tag
    when the first reclaimable inode is tracked in the AG, and removing
    the tag when the last reclaimable inode is removed from the tree.
    Then the summation process becomes a loop walking the radix tree
    summing AGs with the reclaim tag set.

    This significantly reduces the overhead of scanning: a 6400 AG
    filesystem now only uses about 25% of a CPU in kswapd while slab
    reclaim progresses, instead of being permanently stuck at 100% CPU
    and making little progress. Clean filesystems will see no overhead,
    and the overhead only increases linearly with the number of dirty
    AGs. (A minimal sketch of the radix-tree tagging is included below.)

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
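
    A minimal sketch of the tagging scheme, using invented demo_* names,
    a simplified per-AG structure and an assumed tag index rather than
    the real XFS helpers:

    #include <linux/radix-tree.h>
    #include <linux/types.h>

    #define DEMO_RECLAIM_TAG 0              /* assumed tag index */

    struct demo_perag {
        u32 agno;                           /* allocation group number */
        int reclaimable;                    /* reclaimable inode count */
    };

    /* set the tag when the first reclaimable inode is tracked in the AG */
    static void demo_set_reclaim_tag(struct radix_tree_root *tree, u32 agno)
    {
        radix_tree_tag_set(tree, agno, DEMO_RECLAIM_TAG);
    }

    /* clear the tag when the last reclaimable inode leaves the AG */
    static void demo_clear_reclaim_tag(struct radix_tree_root *tree, u32 agno)
    {
        radix_tree_tag_clear(tree, agno, DEMO_RECLAIM_TAG);
    }

    /* sum reclaimable inodes by visiting only the tagged AGs */
    static int demo_count_reclaimable(struct radix_tree_root *tree)
    {
        struct demo_perag *batch[16];
        unsigned long first = 0;
        int total = 0, found, i;

        do {
            found = radix_tree_gang_lookup_tag(tree, (void **)batch,
                                               first, 16, DEMO_RECLAIM_TAG);
            for (i = 0; i < found; i++) {
                total += batch[i]->reclaimable;
                first = batch[i]->agno + 1;
            }
        } while (found > 0);
        return total;
    }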
     

29 May, 2010

1 commit

  • Use DECLARE_EVENT_CLASS, and save ~15K:

    text data bss dec hex filename
    171949 43028 48 215025 347f1 fs/xfs/linux-2.6/xfs_trace.o.orig
    156521 43028 36 199585 30ba1 fs/xfs/linux-2.6/xfs_trace.o

    No change in functionality.

    Signed-off-by: Li Zefan
    Acked-by: Steven Rostedt
    Signed-off-by: Alex Elder

    Li Zefan
     

24 May, 2010

1 commit

  • When we free a metadata extent, we record it in the per-AG busy
    extent array so that it is not re-used before the freeing
    transaction hits the disk. This array is fixed size, so when it
    overflows we make further allocation transactions synchronous
    because we cannot track more freed extents until those transactions
    hit the disk and are completed. Under heavy mixed allocation and
    freeing workloads with large log buffers, we can overflow this array
    quite easily.

    Further, the array is sparsely populated, which means that inserts
    need to search for a free slot, and array searches often have to
    search many more slots than are actually used to check all the
    busy extents. Quite inefficient, really.

    To enable this aspect of extent freeing to scale better, we need
    a structure that can grow dynamically. While in other areas of
    XFS we have used radix trees, the extents being freed are at random
    locations on disk so are better suited to being indexed by an rbtree.

    So, use a per-AG rbtree indexed by block number to track busy
    extents (a minimal sketch is included below). This incurs a memory
    allocation when marking an extent busy, but should not occur too
    often in low memory situations. This should scale to an arbitrary
    number of extents, so it should not be a limitation for features
    such as in-memory aggregation of transactions.

    However, there are still situations where we can't avoid allocating
    busy extents (such as allocation from the AGFL). To minimise the
    overhead of such occurrences, we need to avoid doing a synchronous
    log force while holding the AGF locked to ensure that the previous
    transactions are safely on disk before we use the extent. We can do
    this by marking the transaction doing the allocation as synchronous
    rather than issuing a log force.

    Because of the locking involved and the ordering of transactions,
    the synchronous transaction provides the same guarantees as a
    synchronous log force because it ensures that all the prior
    transactions are already on disk when the synchronous transaction
    hits the disk. i.e. it preserves the free->allocate order of the
    extent correctly in recovery.

    By doing this, we avoid holding the AGF locked while log writes are
    in progress, hence reducing the length of time the lock is held and
    therefore we increase the rate at which we can allocate and free
    from the allocation group, thereby increasing overall throughput.

    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), the buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting for
    the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for log
    force to occur.

    Ideally we should not reallocate busy extents. That is a much more
    complex fix to the problem as it involves direct intervention in the
    allocation btree searches in many places. This is left to a future
    set of modifications.

    Finally, now that we track busy extents in allocated memory, we
    don't need the descriptors in the transaction structure to point to
    them. We can replace the complex busy chunk infrastructure with a
    simple linked list of busy extents. This allows us to remove a large
    chunk of code, making the overall change a net reduction in code
    size.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
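
    A minimal sketch of a per-AG busy-extent rbtree keyed by start block,
    with simplified structures and invented demo_* names; the real code
    also has to deal with overlapping ranges:

    #include <linux/errno.h>
    #include <linux/rbtree.h>
    #include <linux/types.h>

    struct demo_busy_extent {
        struct rb_node rb_node;
        u32 bno;                            /* AG-relative start block */
        u32 length;
    };

    /* insert a newly freed extent into the per-AG busy tree */
    static int demo_busy_insert(struct rb_root *root,
                                struct demo_busy_extent *new)
    {
        struct rb_node **p = &root->rb_node;
        struct rb_node *parent = NULL;

        while (*p) {
            struct demo_busy_extent *cur;

            parent = *p;
            cur = rb_entry(parent, struct demo_busy_extent, rb_node);
            if (new->bno < cur->bno)
                p = &(*p)->rb_left;
            else if (new->bno > cur->bno)
                p = &(*p)->rb_right;
            else
                return -EEXIST;             /* block already tracked */
        }
        rb_link_node(&new->rb_node, parent, p);
        rb_insert_color(&new->rb_node, root);
        return 0;
    }

    /* does [bno, bno + len) overlap any busy extent? */
    static bool demo_busy_overlaps(struct rb_root *root, u32 bno, u32 len)
    {
        struct rb_node *n = root->rb_node;

        while (n) {
            struct demo_busy_extent *cur =
                rb_entry(n, struct demo_busy_extent, rb_node);

            if (bno + len <= cur->bno)
                n = n->rb_left;
            else if (bno >= cur->bno + cur->length)
                n = n->rb_right;
            else
                return true;
        }
        return false;
    }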
     

19 May, 2010

4 commits

  • Convert the dquot hash list on the filesystem to use the standard
    list_head infrastructure rather than the roll-your-own lists in the
    quota code (a small sketch is included below).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
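
    A small sketch of a hash bucket built on the standard list_head API;
    the demo_* names are invented, and the real buckets are keyed by
    mount and dquot id and protected by a lock:

    #include <linux/list.h>
    #include <linux/types.h>

    struct demo_dqhash {
        struct list_head qh_list;           /* bucket of dquots */
    };

    struct demo_dquot {
        struct list_head q_hashlist;        /* link into a bucket */
        u32 q_id;
    };

    static void demo_dqhash_init(struct demo_dqhash *h)
    {
        INIT_LIST_HEAD(&h->qh_list);
    }

    static void demo_dqhash_insert(struct demo_dqhash *h,
                                   struct demo_dquot *dqp)
    {
        list_add(&dqp->q_hashlist, &h->qh_list);
    }

    static struct demo_dquot *demo_dqhash_lookup(struct demo_dqhash *h,
                                                 u32 id)
    {
        struct demo_dquot *dqp;

        list_for_each_entry(dqp, &h->qh_list, q_hashlist)
            if (dqp->q_id == id)
                return dqp;
        return NULL;
    }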
     
  • The dquot shaker and the free-list reclaim code use exactly the same
    algorithm but the code is duplicated and slightly different in each
    case. Make the shaker code use the single dquot reclaim code to
    remove the code duplication.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • Currently there is no tracing in log recovery, so it is difficult to
    determine what is going on when something goes wrong.

    Add tracing for log item recovery to provide visibility into the log
    recovery process. The tracing added shows regions being extracted
    from the log transactions and added to the transaction hash forming
    recovery items, followed by the reordering, cancelling and finally
    recovery of the items.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • We don't record pin counts in inode events right now, and this makes
    it difficult to track down problems related to pinning inodes. Add
    the pin count to the inode trace class and add trace events for
    pinning and unpinning inodes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

02 Mar, 2010

2 commits

  • Using a static buffer in xfs_fmtfsblock means we can corrupt traces if
    multiple CPUs hit this code path at the same time. Just remove xfs_fmtfsblock
    for now and print the block number purely numerically. If we want the
    NULLFSBLOCK and NULLSTARTBLOCK formatting back the best way would be
    a decoding plugin in the trace-cmd userspace command.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The be32_to_cpu in the TP_printk output breaks automatic parsing of
    the trace format by the trace-cmd tools, so we have to move it into
    the TP_fast_assign block (a sketch is included below). While we're at
    it, also make the format for the quota limits more regular and easier
    to parse.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
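
    A sketch of the idea with a single invented tracepoint and one
    __be32 field; the real dquot tracepoints carry the full set of limit
    fields, and the macros below live in a trace header with the usual
    TRACE_SYSTEM boilerplate:

    #include <linux/tracepoint.h>

    TRACE_EVENT(xfs_demo_dquot,
        TP_PROTO(u32 id, __be32 btimer),
        TP_ARGS(id, btimer),
        TP_STRUCT__entry(
            __field(u32, id)
            __field(u32, btimer)
        ),
        TP_fast_assign(
            __entry->id = id;
            /* convert here, not in TP_printk, so trace-cmd can parse it */
            __entry->btimer = be32_to_cpu(btimer);
        ),
        TP_printk("id 0x%x btimer %u", __entry->id, __entry->btimer)
    );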
     

02 Feb, 2010

1 commit

  • All buffers logged into the AIL are marked as delayed write.
    When the AIL needs to push the buffer out, it issues an async write of the
    buffer. This means that IO patterns are dependent on the order of
    buffers in the AIL.

    Instead of flushing the buffer, promote it in the delayed write
    list so that the buffer is flushed the next time the xfsbufd runs.
    Return state to the xfsaild indicating that the buffer was promoted,
    so that the xfsaild knows it needs to make the xfsbufd run and flush
    the promoted buffers.

    Using the xfsbufd for issuing the IO allows us to dispatch all
    buffer IO from the one queue. This means that we can make much more
    enlightened decisions on what order to flush buffers to disk as
    we don't have multiple places issuing IO. Optimisations to xfsbufd
    will be in a future patch.

    Version 2
    - kill XFS_ITEM_FLUSHING as it is now unused.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

16 Jan, 2010

2 commits


11 Jan, 2010

2 commits

  • When we search for and find a busy extent during allocation we
    force the log out to ensure the extent free transaction is on
    disk before the allocation transaction. The current implementation
    has a subtle bug: it does not handle multiple overlapping ranges.

    That is, if we free lots of little extents into a single
    contiguous extent, then allocate the contiguous extent, the busy
    search code stops searching at the first extent it finds that
    overlaps the allocated range. It then forces the log out to that
    transaction's commit LSN.

    Unfortunately, the other busy ranges might have more recent
    commit LSNs than the first busy extent that is found, and this
    results in xfs_alloc_search_busy() returning before all the
    extent free transactions are on disk for the range being
    allocated. This can lead to potential metadata corruption or
    stale data exposure after a crash because log replay won't replay
    all the extent free transactions that cover the allocation range.

    Modified-by: Alex Elder

    (Dropped the "found" argument from the xfs_alloc_busysearch trace
    event.)

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Using DECLARE_EVENT_CLASS allows us to reuse trace event code
    instead of duplicating it in the binary (a condensed sketch is
    included below). This was not available before 2.6.33, so it had to
    be done as a separate step once the prerequisite was merged.

    This only requires changes to xfs_trace.h and the results are
    rather impressive:

    hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
    text data bss dec hex filename
    607732 41884 3616 653232 9f7b0 fs/xfs/xfs.o
    1026732 41884 3808 1072424 105d28 fs/xfs/xfs.o.old

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
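
    A condensed sketch of the DECLARE_EVENT_CLASS/DEFINE_EVENT pattern
    with an invented xfs_demo class; the real classes in xfs_trace.h
    carry many more fields:

    #undef TRACE_SYSTEM
    #define TRACE_SYSTEM xfs_demo

    #if !defined(_TRACE_XFS_DEMO_H) || defined(TRACE_HEADER_MULTI_READ)
    #define _TRACE_XFS_DEMO_H

    #include <linux/tracepoint.h>

    /* one copy of the entry layout, assignment and printk code ... */
    DECLARE_EVENT_CLASS(xfs_demo_class,
        TP_PROTO(dev_t dev, unsigned long ino),
        TP_ARGS(dev, ino),
        TP_STRUCT__entry(
            __field(dev_t, dev)
            __field(unsigned long, ino)
        ),
        TP_fast_assign(
            __entry->dev = dev;
            __entry->ino = ino;
        ),
        TP_printk("dev %u ino 0x%lx",
                  (unsigned int)__entry->dev, __entry->ino)
    );

    /* ... shared by every event defined from the class */
    DEFINE_EVENT(xfs_demo_class, xfs_demo_lookup,
        TP_PROTO(dev_t dev, unsigned long ino),
        TP_ARGS(dev, ino));

    DEFINE_EVENT(xfs_demo_class, xfs_demo_release,
        TP_PROTO(dev_t dev, unsigned long ino),
        TP_ARGS(dev, ino));

    #endif /* _TRACE_XFS_DEMO_H */

    /* this part of the trace-header boilerplate generates the code */
    #include <trace/define_trace.h>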
     

09 Jan, 2010

1 commit


15 Dec, 2009

1 commit

  • Convert the old xfs tracing support that could only be used with the
    out of tree kdb and xfsidbg patches to use the generic event tracer.

    To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
    all xfs trace channels by:

    echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

    or alternatively enable single events by just doing the same in one
    event subdirectory, e.g.

    echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable

    or set more complex filters, etc. In Documentation/trace/events.txt
    all this is described in more detail. To read the events do a

    cat /sys/kernel/debug/tracing/trace

    Compared to the last posting this patch converts the tracing mostly to
    the one tracepoint per callsite model that other users of the new
    tracing facility also employ. This allows a very fine-grained control
    of the tracing, a cleaner output of the traces and also enables the
    perf tool to use each tracepoint as a virtual performance counter,
    allowing us to e.g. count how often certain workloads hit various
    spots in XFS. Take a look at

    http://lwn.net/Articles/346470/

    for some examples.

    Also, the btree tracing isn't included at all yet, as it will require
    additional core tracing features that are not yet in mainline; I plan
    to deliver it later.

    And the really nice thing about this patch is that it actually removes
    many lines of code while adding this nice functionality:

    fs/xfs/Makefile | 8
    fs/xfs/linux-2.6/xfs_acl.c | 1
    fs/xfs/linux-2.6/xfs_aops.c | 52 -
    fs/xfs/linux-2.6/xfs_aops.h | 2
    fs/xfs/linux-2.6/xfs_buf.c | 117 +--
    fs/xfs/linux-2.6/xfs_buf.h | 33
    fs/xfs/linux-2.6/xfs_fs_subr.c | 3
    fs/xfs/linux-2.6/xfs_ioctl.c | 1
    fs/xfs/linux-2.6/xfs_ioctl32.c | 1
    fs/xfs/linux-2.6/xfs_iops.c | 1
    fs/xfs/linux-2.6/xfs_linux.h | 1
    fs/xfs/linux-2.6/xfs_lrw.c | 87 --
    fs/xfs/linux-2.6/xfs_lrw.h | 45 -
    fs/xfs/linux-2.6/xfs_super.c | 104 ---
    fs/xfs/linux-2.6/xfs_super.h | 7
    fs/xfs/linux-2.6/xfs_sync.c | 1
    fs/xfs/linux-2.6/xfs_trace.c | 75 ++
    fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
    fs/xfs/linux-2.6/xfs_vnode.h | 4
    fs/xfs/quota/xfs_dquot.c | 110 ---
    fs/xfs/quota/xfs_dquot.h | 21
    fs/xfs/quota/xfs_qm.c | 40 -
    fs/xfs/quota/xfs_qm_syscalls.c | 4
    fs/xfs/support/ktrace.c | 323 ---------
    fs/xfs/support/ktrace.h | 85 --
    fs/xfs/xfs.h | 16
    fs/xfs/xfs_ag.h | 14
    fs/xfs/xfs_alloc.c | 230 +-----
    fs/xfs/xfs_alloc.h | 27
    fs/xfs/xfs_alloc_btree.c | 1
    fs/xfs/xfs_attr.c | 107 ---
    fs/xfs/xfs_attr.h | 10
    fs/xfs/xfs_attr_leaf.c | 14
    fs/xfs/xfs_attr_sf.h | 40 -
    fs/xfs/xfs_bmap.c | 507 +++------------
    fs/xfs/xfs_bmap.h | 49 -
    fs/xfs/xfs_bmap_btree.c | 6
    fs/xfs/xfs_btree.c | 5
    fs/xfs/xfs_btree_trace.h | 17
    fs/xfs/xfs_buf_item.c | 87 --
    fs/xfs/xfs_buf_item.h | 20
    fs/xfs/xfs_da_btree.c | 3
    fs/xfs/xfs_da_btree.h | 7
    fs/xfs/xfs_dfrag.c | 2
    fs/xfs/xfs_dir2.c | 8
    fs/xfs/xfs_dir2_block.c | 20
    fs/xfs/xfs_dir2_leaf.c | 21
    fs/xfs/xfs_dir2_node.c | 27
    fs/xfs/xfs_dir2_sf.c | 26
    fs/xfs/xfs_dir2_trace.c | 216 ------
    fs/xfs/xfs_dir2_trace.h | 72 --
    fs/xfs/xfs_filestream.c | 8
    fs/xfs/xfs_fsops.c | 2
    fs/xfs/xfs_iget.c | 111 ---
    fs/xfs/xfs_inode.c | 67 --
    fs/xfs/xfs_inode.h | 76 --
    fs/xfs/xfs_inode_item.c | 5
    fs/xfs/xfs_iomap.c | 85 --
    fs/xfs/xfs_iomap.h | 8
    fs/xfs/xfs_log.c | 181 +----
    fs/xfs/xfs_log_priv.h | 20
    fs/xfs/xfs_log_recover.c | 1
    fs/xfs/xfs_mount.c | 2
    fs/xfs/xfs_quota.h | 8
    fs/xfs/xfs_rename.c | 1
    fs/xfs/xfs_rtalloc.c | 1
    fs/xfs/xfs_rw.c | 3
    fs/xfs/xfs_trans.h | 47 +
    fs/xfs/xfs_trans_buf.c | 62 -
    fs/xfs/xfs_vnodeops.c | 8
    70 files changed, 2151 insertions(+), 2592 deletions(-)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig