25 May, 2011

2 commits

  • Blocks for the allocation btree are allocated from and released to
    the AGFL, and thus frequently reused. Even worse, we do not have an
    easy way to avoid using an AGFL block when it is discarded due to
    the simple FILO list of free blocks, and thus can frequently stall
    on blocks that are currently undergoing a discard.

    Add a flag to the busy extent tracking structure to skip the discard
    for allocation btree blocks. In normal operation these blocks are
    reused frequently enough that there is no need to discard them
    anyway, but if they spill over to the allocation btree as part of a
    balance we "leak" blocks that we would otherwise discard. We could
    fix this by adding another flag and keeping these blocks in the
    rbtree even after they aren't busy any more so that we could discard
    them when they migrate out of the AGFL. Given that this would cause
    significant overhead I don't think it's worthwhile for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Now that we have reliable tracking of deleted extents in a
    transaction, we can easily implement "online" discard support,
    which calls blkdev_issue_discard once a transaction commits.

    The actual discard is a two-stage operation, as we first have
    to mark the busy extent as not available for reuse before we
    can start the actual discard. Note that we don't bother
    supporting discard for the non-delaylog mode. (A conceptual
    sketch of this two-stage flow follows this entry.)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
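    As a purely conceptual illustration of the two-stage discard described in
    the last entry above, here is a small userspace sketch. All names are
    illustrative (they are not the kernel's), and a printf stands in for the
    blkdev_issue_discard call; the point is only the ordering: block reuse
    first, discard second, release last.

        #include <stdbool.h>
        #include <stdio.h>

        struct busy_extent {
                unsigned int bno;      /* start block */
                unsigned int len;      /* length in blocks */
                bool reused;           /* reallocated before the commit? */
                bool no_reuse;         /* stage 1: blocked from reuse */
        };

        static void transaction_committed(struct busy_extent *list, int n)
        {
                for (int i = 0; i < n; i++) {
                        struct busy_extent *be = &list[i];

                        if (be->reused)
                                continue;         /* nothing left to discard */

                        be->no_reuse = true;      /* stage 1: block reuse */
                        printf("discard blocks %u..%u\n",         /* stage 2 */
                               be->bno, be->bno + be->len - 1);
                        /* only now may the extent be dropped from busy
                         * tracking and handed back to the allocator */
                }
        }

        int main(void)
        {
                struct busy_extent list[] = {
                        { .bno = 128, .len = 8,  .reused = false },
                        { .bno = 512, .len = 16, .reused = true  },
                };

                transaction_committed(list, 2);
                return 0;
        }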

20 May, 2011

1 commit

  • When allocating an extent that is long enough to consume the
    remaining free space in an AG, we need to ensure that the allocation
    leaves enough space in the AG for any subsequent bmap btree blocks
    that are needed to track the new extent. These have to be allocated
    in the same AG as we only reserve enough blocks in an allocation
    transaction for modification of the freespace trees in a single AG.

    xfs_alloc_fix_minleft() has been considering blocks on the AGFL as
    free blocks available for extent and bmbt block allocation, which is
    not correct - blocks on the AGFL are there exclusively for the use
    of the free space btrees. As a result, when minleft is less than the
    number of blocks on the AGFL, xfs_alloc_fix_minleft() does not trim
    the given extent to leave minleft blocks available for bmbt
    allocation, and hence we can fail allocation during bmbt record
    insertion. (A simplified sketch of the corrected trim logic
    follows this entry.)

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
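    As a rough illustration of the fix described above, the sketch below shows
    the corrected accounting in plain userspace C. The names and types are
    illustrative only (they are not the kernel's xfs_alloc_fix_minleft); the
    point is simply that AGFL blocks are excluded before the minleft trim is
    applied.

        #include <stdint.h>
        #include <stdio.h>

        /* Trim a found extent so that 'minleft' blocks remain available for
         * bmbt allocation, counting only blocks that are genuinely free for
         * general allocation (i.e. excluding the AGFL).  Assumes
         * ag_free_blocks >= agfl_blocks. */
        static uint32_t fix_minleft(uint32_t found_len, uint32_t ag_free_blocks,
                                    uint32_t agfl_blocks, uint32_t minleft)
        {
                uint32_t available = ag_free_blocks - agfl_blocks;

                if (available < minleft)
                        return 0;                        /* allocation must fail */
                if (found_len > available - minleft)
                        found_len = available - minleft; /* leave minleft behind */
                return found_len;
        }

        int main(void)
        {
                /* 100 free blocks, 4 of them on the AGFL, keep 2 for the bmbt */
                printf("trimmed length: %u\n", fix_minleft(96, 100, 4, 2)); /* 94 */
                return 0;
        }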

29 Apr, 2011

4 commits

  • Instead of finding the per-AG structure and then taking and releasing the
    pagb_lock for every single busy extent completed, sort the list of busy
    extents and only switch between AGs where necessary. This becomes
    especially important with the online discard support, which will hit this
    lock more often.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Update the extent tree in case we have to reuse a busy extent, so that it
    is always kept up to date. This is done by replacing the busy list searches
    with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
    in case of a reuse. This lets us allow unconditional reuse of metadata
    extents, and thus avoid log forces, especially for allocation btree
    blocks.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Every time we reallocate a busy extent, we cause a synchronous log force
    to occur to ensure the freeing transaction is on disk before we continue
    and use the newly allocated extent. This is extremely sub-optimal as we
    have to mark every transaction with blocks that get reused as synchronous.

    Instead of searching the busy extent list after deciding on the extent to
    allocate, check whether each candidate extent is in the busy list as part
    of the allocation decision. If it is, trim the busy range out of the
    extent we have found and determine if the trimmed range is still OK for
    allocation. In many cases, this check can be incorporated into the
    allocation extent alignment code, which already trims the found extent
    before determining if it is a valid candidate for allocation. (A
    simplified sketch of this trimming follows the last entry for this date.)

    Based on earlier patches from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • While we need to make sure we do not reuse busy extents, there is no need
    to force out busy extents when moving them between the AGFL and the
    freespace btree as we still take care of that when doing the real allocation.

    To avoid the log force when just moving extents between the different
    free space tracking structures, move the busy search out of
    xfs_alloc_get_freelist into the callers that need it, and move the busy
    list insert from xfs_free_ag_extent which is used both by AGFL refills
    and real allocation to xfs_free_extent, which is only used by the latter.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
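    As a simplified illustration of the busy-range trimming described in the
    third entry above, the userspace sketch below trims a busy range out of a
    candidate free extent and keeps the larger remaining piece. It is a sketch
    of the idea only; all names are illustrative and the kernel code also has
    to handle alignment and other constraints.

        #include <stdint.h>
        #include <stdio.h>

        struct extent {
                uint32_t bno;   /* start block */
                uint32_t len;   /* length in blocks */
        };

        /* Trim the busy range out of *found, keeping the larger unbusy piece.
         * Returns 0 if nothing usable remains. */
        static int trim_busy(struct extent *found, const struct extent *busy)
        {
                uint32_t fend = found->bno + found->len;
                uint32_t bend = busy->bno + busy->len;
                uint32_t head, tail;

                if (bend <= found->bno || busy->bno >= fend)
                        return 1;               /* no overlap, keep as found */

                head = busy->bno > found->bno ? busy->bno - found->bno : 0;
                tail = bend < fend ? fend - bend : 0;

                if (head == 0 && tail == 0)
                        return 0;               /* candidate is entirely busy */

                if (head >= tail) {
                        found->len = head;      /* keep the front piece */
                } else {
                        found->bno = bend;      /* keep the back piece */
                        found->len = tail;
                }
                return 1;
        }

        int main(void)
        {
                struct extent found = { .bno = 100, .len = 50 };  /* 100..149 */
                struct extent busy  = { .bno = 130, .len = 40 };  /* 130..169 */

                if (trim_busy(&found, &busy))
                        printf("usable: %u..%u\n", found.bno,
                               found.bno + found.len - 1);        /* 100..129 */
                return 0;
        }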

08 Apr, 2011

1 commit

  • A fuzzed filesystem crashed a kernel when freeing an extent with a
    block number beyond the end of the filesystem. Convert all the debug
    asserts in xfs_free_extent() to active checks so that we catch bad
    extents and return a corruption error rather than crashing.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

12 Jan, 2011

1 commit

  • Allow manual discards from userspace using the FITRIM ioctl. This is not
    intended to be run during normal workloads, as the freespace btree walks
    can cause large performance degradation. (A minimal userspace sketch of
    issuing FITRIM follows this entry.)

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
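    A minimal userspace sketch of issuing such a manual trim is below. It
    assumes a kernel and headers that provide FITRIM (so the struct
    fstrim_range definition comes from linux/fs.h), needs CAP_SYS_ADMIN, and
    keeps error handling to a minimum.

        #include <fcntl.h>
        #include <limits.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>          /* FITRIM, struct fstrim_range */

        int main(int argc, char **argv)
        {
                struct fstrim_range range;
                int fd;

                if (argc != 2) {
                        fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                        return 1;
                }

                fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                memset(&range, 0, sizeof(range));
                range.len = ULLONG_MAX;        /* trim the whole filesystem */
                range.minlen = 0;              /* let the filesystem choose */

                if (ioctl(fd, FITRIM, &range) < 0) {
                        perror("FITRIM");
                        return 1;
                }

                /* on return, range.len holds the number of bytes trimmed */
                printf("trimmed %llu bytes\n", (unsigned long long)range.len);
                return 0;
        }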

19 Oct, 2010

1 commit

  • Ionut Gabriel Popescu submitted a simple change
    to eliminate some "may be used uninitialized" warnings when building
    XFS. The reported condition seems to be something that GCC did not
    previously recognize or report. The warnings were produced by:

    gcc version 4.5.0 20100604
    [gcc-4_5-branch revision 160292] (SUSE Linux)

    Signed-off-by: Ionut Gabriel Popescu
    Signed-off-by: Alex Elder

    Poyo VL
     

24 May, 2010

1 commit

  • When we free a metadata extent, we record it in the per-AG busy
    extent array so that it is not re-used before the freeing
    transaction hits the disk. This array is fixed size, so when it
    overflows we make further allocation transactions synchronous
    because we cannot track more freed extents until those transactions
    hit the disk and are completed. Under heavy mixed allocation and
    freeing workloads with large log buffers, we can overflow this array
    quite easily.

    Further, the array is sparsely populated, which means that inserts
    need to search for a free slot, and array searches often have to
    scan many more slots than are actually in use to check all the
    busy extents. Quite inefficient, really.

    To enable this aspect of extent freeing to scale better, we need
    a structure that can grow dynamically. While in other areas of
    XFS we have used radix trees, the extents being freed are at random
    locations on disk so are better suited to being indexed by an rbtree.

    So, use a per-AG rbtree indexed by block number to track busy
    extents. This incurs a memory allocation when marking an extent
    busy, but should not occur too often in low memory situations. This
    should scale to an arbitrary number of extents so should not be a
    limitation for features such as in-memory aggregation of
    transactions. (A simplified userspace sketch of this block-number
    indexed overlap check follows this entry.)

    However, there are still situations where we can't avoid allocating
    busy extents (such as allocation from the AGFL). To minimise the
    overhead of such occurrences, we need to avoid doing a synchronous
    log force while holding the AGF locked to ensure that the previous
    transactions are safely on disk before we use the extent. We can do
    this by marking the transaction doing the allocation as synchronous
    rather than issuing a log force.

    Because of the locking involved and the ordering of transactions,
    the synchronous transaction provides the same guarantees as a
    synchronous log force because it ensures that all the prior
    transactions are already on disk when the synchronous transaction
    hits the disk. i.e. it preserves the free->allocate order of the
    extent correctly in recovery.

    By doing this, we avoid holding the AGF locked while log writes are
    in progress, hence reducing the length of time the lock is held and
    therefore we increase the rate at which we can allocate and free
    from the allocation group, thereby increasing overall throughput.

    The only problem with this approach is that when a metadata buffer is
    marked stale (e.g. a directory block is removed), the buffer remains
    pinned and locked until the log goes to disk. The issue here is that
    if that stale buffer is reallocated in a subsequent transaction, the
    attempt to lock that buffer in the transaction will hang waiting for
    the log to go to disk to unlock and unpin the buffer. Hence if
    someone tries to lock a pinned, stale, locked buffer we need to
    push on the log to get it unlocked ASAP. Effectively we are trading
    off a guaranteed log force for a much less common trigger for a log
    force to occur.

    Ideally we should not reallocate busy extents. That is a much more
    complex fix to the problem as it involves direct intervention in the
    allocation btree searches in many places. This is left to a future
    set of modifications.

    Finally, now that we track busy extents in allocated memory, we
    don't need the descriptors in the transaction structure to point to
    them. We can replace the complex busy chunk infrastructure with a
    simple linked list of busy extents. This allows us to remove a large
    chunk of code, making the overall change a net reduction in code
    size.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
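    The sketch below illustrates the block-number-indexed overlap check the
    entry above describes, in plain userspace C. A simple binary search tree
    stands in for the kernel rbtree, all names are illustrative, and it relies
    on the same property the kernel does: tracked busy extents never overlap
    one another.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct busy_extent {
                uint32_t bno;                   /* start block within the AG */
                uint32_t len;                   /* length in blocks */
                struct busy_extent *left, *right;
        };

        static struct busy_extent *busy_insert(struct busy_extent *root,
                                               uint32_t bno, uint32_t len)
        {
                if (!root) {
                        struct busy_extent *be = calloc(1, sizeof(*be));

                        if (!be)
                                abort();        /* sketch: no error handling */
                        be->bno = bno;
                        be->len = len;
                        return be;
                }
                if (bno < root->bno)
                        root->left = busy_insert(root->left, bno, len);
                else
                        root->right = busy_insert(root->right, bno, len);
                return root;
        }

        /* Return 1 if [bno, bno + len) overlaps any tracked busy extent. */
        static int busy_overlaps(struct busy_extent *root, uint32_t bno,
                                 uint32_t len)
        {
                while (root) {
                        if (bno + len <= root->bno)
                                root = root->left;   /* entirely below node */
                        else if (bno >= root->bno + root->len)
                                root = root->right;  /* entirely above node */
                        else
                                return 1;            /* ranges intersect */
                }
                return 0;
        }

        int main(void)
        {
                struct busy_extent *busy = NULL;

                busy = busy_insert(busy, 100, 16);   /* blocks 100..115 busy */
                busy = busy_insert(busy, 400, 8);    /* blocks 400..407 busy */

                printf("%d\n", busy_overlaps(busy, 90, 20));   /* 1: hits 100..109 */
                printf("%d\n", busy_overlaps(busy, 200, 32));  /* 0: nothing busy */
                return 0;
        }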

22 Jan, 2010

2 commits

  • Remove the XFS_LOG_FORCE argument which was always set, and the
    XFS_LOG_URGE define, which was never used.

    Split xfs_log_force into two helpers - xfs_log_force, which forces
    the whole log, and xfs_log_force_lsn, which forces up to the
    specified LSN. The underlying implementations already were entirely
    separate, as were the users.

    Also re-indent the new _xfs_log_force/_xfs_log_force which
    previously had a weird coding style.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Currently we define aliases for the buffer flags in various
    namespaces, which only adds confusion. Remove all but the XBF_
    flags to clean this up a bit.

    Note that we still abuse XFS_B_ASYNC/XBF_ASYNC for some non-buffer
    uses, but I'll clean that up later.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

16 Jan, 2010

3 commits

  • Now that the perag structure is allocated memory rather than held in
    an array, we don't need to have the busy extent array external to
    the structure. Embed it into the perag structure to avoid needing an
    extra allocation when setting up.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • The use of an array for the per-ag structures requires reallocation
    of the array when growing the filesystem. This requires locking
    access to the array to avoid use after free situations, and the
    locking is difficult to get right. To avoid needing to reallocate an
    array, change the per-ag structures to an allocated object per ag
    and index them using a tree structure.

    The AGs are always densely indexed (hence the use of an array), but
    the number supported is 2^32 and lookups tend to be random and hence
    indexing needs to scale. A simple choice is a radix tree - it works
    well with this sort of index. This change also removes another
    large contiguous allocation from the mount/growfs path in XFS.

    The growing process now needs to change to only initialise the new
    AGs required for the extra space, and as such only needs to
    exclusively lock the tree for inserts. The rest of the code only
    needs to lock the tree while doing lookups, and hence this will
    remove all the deadlocks that currently occur on the m_perag_lock as
    it is now an innermost lock. The lock is also changed to a spinlock
    from a read/write lock as the hold time is now extremely short.

    To complete the picture, the per-ag structures will need to be
    reference counted to ensure that we don't free/modify them while
    they are still in use. This will be done in a subsequent patch. (A
    toy userspace sketch of refcounted per-AG lookup follows the last
    entry for this date.)

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Start abstracting the perag references so that the indexing of the
    structures is not directly coded into all the places that use the
    perag structures. This will allow us to separate the use of the
    perag structure and the way it is indexed and hence avoid the known
    deadlocks related to growing a busy filesystem.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
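    The toy sketch below illustrates the reference-counted per-AG lookup the
    entries above introduce (the real accessors are xfs_perag_get and
    xfs_perag_put). A plain array stands in for the radix tree and a plain int
    for the atomic reference count; everything else here is illustrative.

        #include <stdio.h>
        #include <stdlib.h>

        #define NR_AGS 4

        struct perag {
                unsigned int agno;      /* allocation group number */
                int ref;                /* reference count */
        };

        static struct perag *ag_table[NR_AGS];  /* stand-in for the radix tree */

        static struct perag *perag_get(unsigned int agno)
        {
                struct perag *pag = agno < NR_AGS ? ag_table[agno] : NULL;

                if (pag)
                        pag->ref++;     /* pin the structure while it is used */
                return pag;
        }

        static void perag_put(struct perag *pag)
        {
                pag->ref--;             /* caller is done with the structure */
        }

        int main(void)
        {
                for (unsigned int i = 0; i < NR_AGS; i++) {
                        ag_table[i] = calloc(1, sizeof(struct perag));
                        if (!ag_table[i])
                                abort();        /* sketch: no error handling */
                        ag_table[i]->agno = i;
                }

                struct perag *pag = perag_get(2);
                printf("ag %u refcount %d\n", pag->agno, pag->ref);
                perag_put(pag);
                return 0;
        }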

11 Jan, 2010

1 commit

  • When we search for and find a busy extent during allocation we
    force the log out to ensure the extent free transaction is on
    disk before the allocation transaction. The current implementation
    has a subtle bug: it does not handle multiple overlapping
    ranges.

    That is, if we free lots of little extents into a single
    contiguous extent, then allocate the contiguous extent, the busy
    search code stops searching at the first extent it finds that
    overlaps the allocated range. It then forces the log out to the
    commit LSN of that transaction.

    Unfortunately, the other busy ranges might have more recent
    commit LSNs than the first busy extent that is found, and this
    results in xfs_alloc_search_busy() returning before all the
    extent free transactions are on disk for the range being
    allocated. This can lead to potential metadata corruption or
    stale data exposure after a crash because log replay won't replay
    all the extent free transactions that cover the allocation range.
    (A small sketch of tracking the highest commit LSN across all
    overlapping busy ranges follows this entry.)

    Modified-by: Alex Elder

    (Dropped the "found" argument from the xfs_alloc_busysearch trace
    event.)

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
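    The small sketch below shows the corrected behaviour in userspace C:
    rather than stopping at the first overlapping busy range, collect the
    highest commit LSN across all of them and force the log to that. Names
    and types are illustrative only.

        #include <stdint.h>
        #include <stdio.h>

        struct busy_range {
                uint32_t bno;           /* start block */
                uint32_t len;           /* length in blocks */
                uint64_t commit_lsn;    /* LSN of the freeing transaction */
        };

        /* Return the highest commit LSN of any busy range overlapping
         * [bno, bno + len), or 0 if none overlap. */
        static uint64_t lsn_to_force(const struct busy_range *busy, int n,
                                     uint32_t bno, uint32_t len)
        {
                uint64_t lsn = 0;

                for (int i = 0; i < n; i++) {
                        const struct busy_range *b = &busy[i];

                        if (b->bno + b->len <= bno || b->bno >= bno + len)
                                continue;       /* no overlap with this one */
                        if (b->commit_lsn > lsn)
                                lsn = b->commit_lsn;
                }
                return lsn;
        }

        int main(void)
        {
                struct busy_range busy[] = {
                        { .bno = 10, .len = 10, .commit_lsn = 100 },
                        { .bno = 20, .len = 10, .commit_lsn = 300 },
                        { .bno = 30, .len = 10, .commit_lsn = 200 },
                };

                /* allocating blocks 10..39 must wait for LSN 300, not 100 */
                printf("force log to LSN %llu\n",
                       (unsigned long long)lsn_to_force(busy, 3, 10, 30));
                return 0;
        }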

15 Dec, 2009

1 commit

  • Convert the old xfs tracing support that could only be used with the
    out of tree kdb and xfsidbg patches to use the generic event tracer.

    To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
    all xfs trace channels by:

    echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

    or alternatively enable single events by just doing the same in one
    event subdirectory, e.g.

    echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable

    or set more complex filters, etc. All this is described in more
    detail in Documentation/trace/events.txt. To read the events do a

    cat /sys/kernel/debug/tracing/trace

    Compared to the last posting this patch converts the tracing mostly to
    the one tracepoint per callsite model that other users of the new
    tracing facility also employ. This allows very fine-grained control
    of the tracing, a cleaner output of the traces and also enables the
    perf tool to use each tracepoint as a virtual performance counter,
    allowing us to e.g. count how often certain workloads hit various
    spots in XFS. Take a look at

    http://lwn.net/Articles/346470/

    for some examples.

    Also the btree tracing isn't included at all yet, as it will require
    additional core tracing features not yet in mainline; I plan to
    deliver it later.

    And the really nice thing about this patch is that it actually removes
    many lines of code while adding this nice functionality:

    fs/xfs/Makefile | 8
    fs/xfs/linux-2.6/xfs_acl.c | 1
    fs/xfs/linux-2.6/xfs_aops.c | 52 -
    fs/xfs/linux-2.6/xfs_aops.h | 2
    fs/xfs/linux-2.6/xfs_buf.c | 117 +--
    fs/xfs/linux-2.6/xfs_buf.h | 33
    fs/xfs/linux-2.6/xfs_fs_subr.c | 3
    fs/xfs/linux-2.6/xfs_ioctl.c | 1
    fs/xfs/linux-2.6/xfs_ioctl32.c | 1
    fs/xfs/linux-2.6/xfs_iops.c | 1
    fs/xfs/linux-2.6/xfs_linux.h | 1
    fs/xfs/linux-2.6/xfs_lrw.c | 87 --
    fs/xfs/linux-2.6/xfs_lrw.h | 45 -
    fs/xfs/linux-2.6/xfs_super.c | 104 ---
    fs/xfs/linux-2.6/xfs_super.h | 7
    fs/xfs/linux-2.6/xfs_sync.c | 1
    fs/xfs/linux-2.6/xfs_trace.c | 75 ++
    fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
    fs/xfs/linux-2.6/xfs_vnode.h | 4
    fs/xfs/quota/xfs_dquot.c | 110 ---
    fs/xfs/quota/xfs_dquot.h | 21
    fs/xfs/quota/xfs_qm.c | 40 -
    fs/xfs/quota/xfs_qm_syscalls.c | 4
    fs/xfs/support/ktrace.c | 323 ---------
    fs/xfs/support/ktrace.h | 85 --
    fs/xfs/xfs.h | 16
    fs/xfs/xfs_ag.h | 14
    fs/xfs/xfs_alloc.c | 230 +-----
    fs/xfs/xfs_alloc.h | 27
    fs/xfs/xfs_alloc_btree.c | 1
    fs/xfs/xfs_attr.c | 107 ---
    fs/xfs/xfs_attr.h | 10
    fs/xfs/xfs_attr_leaf.c | 14
    fs/xfs/xfs_attr_sf.h | 40 -
    fs/xfs/xfs_bmap.c | 507 +++------------
    fs/xfs/xfs_bmap.h | 49 -
    fs/xfs/xfs_bmap_btree.c | 6
    fs/xfs/xfs_btree.c | 5
    fs/xfs/xfs_btree_trace.h | 17
    fs/xfs/xfs_buf_item.c | 87 --
    fs/xfs/xfs_buf_item.h | 20
    fs/xfs/xfs_da_btree.c | 3
    fs/xfs/xfs_da_btree.h | 7
    fs/xfs/xfs_dfrag.c | 2
    fs/xfs/xfs_dir2.c | 8
    fs/xfs/xfs_dir2_block.c | 20
    fs/xfs/xfs_dir2_leaf.c | 21
    fs/xfs/xfs_dir2_node.c | 27
    fs/xfs/xfs_dir2_sf.c | 26
    fs/xfs/xfs_dir2_trace.c | 216 ------
    fs/xfs/xfs_dir2_trace.h | 72 --
    fs/xfs/xfs_filestream.c | 8
    fs/xfs/xfs_fsops.c | 2
    fs/xfs/xfs_iget.c | 111 ---
    fs/xfs/xfs_inode.c | 67 --
    fs/xfs/xfs_inode.h | 76 --
    fs/xfs/xfs_inode_item.c | 5
    fs/xfs/xfs_iomap.c | 85 --
    fs/xfs/xfs_iomap.h | 8
    fs/xfs/xfs_log.c | 181 +----
    fs/xfs/xfs_log_priv.h | 20
    fs/xfs/xfs_log_recover.c | 1
    fs/xfs/xfs_mount.c | 2
    fs/xfs/xfs_quota.h | 8
    fs/xfs/xfs_rename.c | 1
    fs/xfs/xfs_rtalloc.c | 1
    fs/xfs/xfs_rw.c | 3
    fs/xfs/xfs_trans.h | 47 +
    fs/xfs/xfs_trans_buf.c | 62 -
    fs/xfs/xfs_vnodeops.c | 8
    70 files changed, 2151 insertions(+), 2592 deletions(-)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

30 Oct, 2008

9 commits

  • Always use the generic xfs_btree_block type instead of the short / long
    structures. Add XFS_BTREE_SBLOCK_LEN / XFS_BTREE_LBLOCK_LEN defines for
    the length of a short / long form block. The rationale for this is that we
    will grow more btree block header variants to support CRCs and other RAS
    information, and always accessing them through the same datatype with
    unions for the short / long form pointers makes implementing this much
    easier. (A rough userspace sketch of such a unified block header appears
    at the end of this section.)

    SGI-PV: 988146

    SGI-Modid: xfs-linux-melb:xfs-kern:32300a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Donald Douwsma
    Signed-off-by: David Chinner
    Signed-off-by: Lachlan McIlroy

    Christoph Hellwig
     
  • SGI-PV: 987683

    SGI-Modid: xfs-linux-melb:xfs-kern:32232a

    Signed-off-by: Barry Naujok
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy

    Barry Naujok
     
  • Not really much reason to make it generic given that it's so small, but
    this is the last non-method in xfs_alloc_btree.c and xfs_ialloc_btree.c,
    so it makes the whole btree implementation more structured.

    SGI-PV: 985583

    SGI-Modid: xfs-linux-melb:xfs-kern:32206a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Bill O'Donnell
    Signed-off-by: David Chinner

    Christoph Hellwig
     
  • Make the btree delete code generic. Based on a patch from David Chinner
    with lots of changes to follow the original btree implementations more
    closely. While this loses some of the generic helper routines for
    inserting/moving/removing records it also solves some of the one-off bugs
    in the original code and makes it easier to verify.

    SGI-PV: 985583

    SGI-Modid: xfs-linux-melb:xfs-kern:32205a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Bill O'Donnell
    Signed-off-by: David Chinner

    Christoph Hellwig
     
  • Make the btree insert code generic. Based on a patch from David Chinner
    with lots of changes to follow the original btree implementations more
    closely. While this loses some of the generic helper routines for
    inserting/moving/removing records it also solves some of the one-off bugs
    in the original code and makes it easier to verify.

    SGI-PV: 985583

    SGI-Modid: xfs-linux-melb:xfs-kern:32202a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Bill O'Donnell
    Signed-off-by: David Chinner

    Christoph Hellwig
     
  • From: Dave Chinner

    The most complicated part here is the lastrec tracking for the alloc
    btree. Most logic is in the update_lastrec method which has to do some
    hopefully good enough dirty magic to maintain it.

    [hch: split out from bigger patch and a rework of the lastrec

    logic]

    SGI-PV: 985583

    SGI-Modid: xfs-linux-melb:xfs-kern:32194a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Bill O'Donnell
    Signed-off-by: David Chinner

    Christoph Hellwig
     
  • From: Dave Chinner

    [hch: split out from bigger patch and minor adaptions]

    SGI-PV: 985583

    SGI-Modid: xfs-linux-melb:xfs-kern:32192a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Bill O'Donnell
    Signed-off-by: David Chinner

    Christoph Hellwig
     
  • From: Dave Chinner

    [hch: split out from bigger patch and minor adaptions]

    SGI-PV: 985583

    SGI-Modid: xfs-linux-melb:xfs-kern:32191a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Bill O'Donnell
    Signed-off-by: David Chinner

    Christoph Hellwig
     
  • From: Dave Chinner

    Because this is the first major generic btree routine this patch includes
    some infrastructure: first, a few routines to deal with a btree block that
    can be either in short or long form; second, xfs_btree_read_buf_block,
    which is the new central routine to read a btree block given a cursor; and
    third, the new xfs_btree_ptr_addr routine to calculate the address for a
    given btree pointer record.

    [hch: split out from bigger patch and minor adaptions]

    SGI-PV: 985583

    SGI-Modid: xfs-linux-melb:xfs-kern:32190a

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Lachlan McIlroy
    Signed-off-by: Bill O'Donnell
    Signed-off-by: David Chinner

    Christoph Hellwig
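
    Referring back to the first entry for this date (the unified
    xfs_btree_block type), the sketch below gives a rough userspace picture of
    a combined short/long block header. It is an approximation only: the real
    on-disk fields are big-endian and the kernel definition differs in detail,
    so plain host-endian integers are used here just to make the sketch
    compile standalone.

        #include <stdint.h>
        #include <stdio.h>

        struct btree_block {
                uint32_t bb_magic;      /* identifies the btree type */
                uint16_t bb_level;      /* 0 for leaf blocks */
                uint16_t bb_numrecs;    /* records stored in this block */
                union {
                        struct {        /* short form: 32-bit sibling pointers */
                                uint32_t bb_leftsib;
                                uint32_t bb_rightsib;
                        } s;
                        struct {        /* long form: 64-bit sibling pointers */
                                uint64_t bb_leftsib;
                                uint64_t bb_rightsib;
                        } l;
                } bb_u;
        };

        /* Header lengths in the spirit of XFS_BTREE_SBLOCK_LEN /
         * XFS_BTREE_LBLOCK_LEN: records begin after the short header in the
         * AG btrees and after the long header in the inode bmap btree. */
        #define SBLOCK_LEN (8 + 2 * sizeof(uint32_t))
        #define LBLOCK_LEN (8 + 2 * sizeof(uint64_t))

        int main(void)
        {
                printf("short header: %zu bytes, long header: %zu bytes\n",
                       (size_t)SBLOCK_LEN, (size_t)LBLOCK_LEN);
                return 0;
        }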