16 Nov, 2012

6 commits

  • To separate the verifiers from iodone functions and associate read
    and write verifiers at the same time, introduce a buffer verifier
    operations structure to the xfs_buf.

    This avoids the need for assigning the write verifier, clearing the
    iodone function and re-running ioend processing in the read
    verifier, and gets rid of the nasty "b_pre_io" name for the write
    verifier function pointer. If we ever need to, it will also be
    easier to add further content specific callbacks to a buffer with an
    ops structure in place.

    We also avoid needing to export verifier functions, instead we
    can simply export the ops structures for those that are needed
    outside the function they are defined in.

    This patch also fixes a directory block readahead verifier issue
    it exposed.

    This patch also adds ops callbacks to the inode/alloc btree blocks
    initialised by growfs. These will need more work before they will
    work with CRCs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
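
    A minimal sketch of the ops structure this commit describes, with read
    and write verifiers attached to the buffer. The AGF names below are
    used purely as an example; treat the exact field and function names as
    approximations of the kernel code rather than a definitive listing.

        struct xfs_buf_ops {
                void    (*verify_read)(struct xfs_buf *);
                void    (*verify_write)(struct xfs_buf *);
        };

        /* each metadata type supplies one ops structure ... */
        const struct xfs_buf_ops xfs_agf_buf_ops = {
                .verify_read    = xfs_agf_read_verify,
                .verify_write   = xfs_agf_write_verify,
        };

        /* ... and it is attached to the buffer in one step, replacing
         * the old iodone/b_pre_io function pair */
        bp->b_ops = &xfs_agf_buf_ops;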
     
  • Metadata buffers that are read from disk have write verifiers
    already attached to them, but newly allocated buffers do not. Add
    appropriate write verifiers to all new metadata buffers.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • These verifiers are essentially the same code as the read verifiers,
    but do not require ioend processing. Hence factor the read verifier
    functions and add a new write verifier wrapper that is used as the
    callback.

    This is done as one large patch for all verifiers rather than one
    patch per verifier as the change is largely mechanical. This
    includes hooking up the write verifier via the read verifier
    function.

    Hooking up the write verifier for buffers obtained via
    xfs_trans_get_buf() will be done in a separate patch as that touches
    code in many different places rather than just the verifier
    functions.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
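
    A sketch of the mechanical pattern described above, using a made-up
    "foo" block type as a placeholder for the real verifiers (b_pre_io is
    the write verifier pointer of this era, as noted in the later ops
    structure commit):

        /* common structural checks shared by both directions */
        static void
        xfs_foo_verify(struct xfs_buf *bp)
        {
                /* check magic numbers, counters, etc.; mark the buffer
                 * with an error if verification fails */
        }

        /* write side: the same checks, but no ioend processing */
        static void
        xfs_foo_write_verify(struct xfs_buf *bp)
        {
                xfs_foo_verify(bp);
        }

        /* read side: verify, hook up the write verifier, then finish
         * the normal ioend processing */
        static void
        xfs_foo_read_verify(struct xfs_buf *bp)
        {
                xfs_foo_verify(bp);
                bp->b_pre_io = xfs_foo_write_verify;
                bp->b_iodone = NULL;
                xfs_buf_ioend(bp, 0);
        }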
     
  • Add an AGFL block verify callback function and pass it into the
    buffer read functions.

    While this commit adds verification code to the AGFL, it cannot be
    used reliably until the CRC format change comes along as mkfs does
    not initialise the full AGFL. Hence it can be full of garbage at the
    first mount and will fail verification right now. CRC enabled
    filesystems won't have this problem, so leave the code that has
    already been written ifdef'd out until the proper time.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Add an AGF block verify callback function and pass it into the
    buffer read functions. This replaces the existing verification that
    is done after the read completes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Add a verifier function callback capability to the buffer read
    interfaces. This will be used by the callers to supply a function
    that verifies the contents of the buffer when it is read from disk.
    This patch does not provide callback functions, but simply modifies
    the interfaces to allow them to be called.

    The reason for adding this to the read interfaces is that it is very
    difficult to tell from the outside whether a buffer was just read from
    disk or simply pulled out of the cache. Supplying a callback allows
    the buffer cache to use its internal knowledge of the buffer to
    execute it only when the buffer is read from disk.

    It is intended that the verifier functions will mark the buffer with
    an EFSCORRUPTED error when verification fails. This allows the
    reading context to distinguish a verification error from an IO
    error, and potentially take further actions on the buffer (e.g.
    attempt repair) based on the error reported.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
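
    A sketch of how a caller-supplied verifier is expected to behave with
    this interface. The check helper xfs_foo_block_ok() is hypothetical,
    and the read function signature is approximated from the description
    above.

        static void
        xfs_foo_read_verify(struct xfs_buf *bp)
        {
                if (!xfs_foo_block_ok(bp))      /* hypothetical check */
                        xfs_buf_ioerror(bp, EFSCORRUPTED);
        }

        /* at the call site, the reading context can now distinguish
         * corruption from an IO error */
        bp = xfs_buf_read(mp->m_ddev_targp, blkno, numblks, 0,
                          xfs_foo_read_verify);
        if (bp && bp->b_error == EFSCORRUPTED) {
                /* e.g. report corruption or attempt repair */
        }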
     

19 Oct, 2012

3 commits

  • Switching stacks at xfs_alloc_vextent can cause deadlocks when we
    run out of worker threads on the allocation workqueue. This can
    occur because xfs_bmap_btalloc can make multiple calls to
    xfs_alloc_vextent() and, even if xfs_alloc_vextent() fails, it can
    return with the AGF locked in the current allocation transaction.

    If we then need to make another allocation, and all the allocation
    worker contexts are exhausted because they are blocked waiting for
    the AGF lock, the holder of the AGF cannot get its xfs_alloc_vextent
    work completed to release the AGF. Hence allocation effectively
    deadlocks.

    To avoid this, move the stack switch one layer up to
    xfs_bmapi_allocate() so that all of the allocation attempts in a
    single switched stack transaction occur in a single worker context.
    This avoids the problem of an allocation being blocked waiting for
    a worker thread whilst holding the AGF.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Certain allocation paths through xfs_bmapi_write() are in situations
    where we have limited stack available. These are almost always in
    the buffered IO writeback path when converting delayed allocation
    extents to real extents.

    The current stack switch occurs for userdata allocations, which
    means we also do stack switches for preallocation, direct IO and
    unwritten extent conversion, even though these call chains have never
    been implicated in a stack overrun.

    Hence, let's target just the single stack overrun offender for stack
    switches. To do that, introduce an XFS_BMAPI_STACK_SWITCH flag that
    the caller can pass to xfs_bmapi_write() to indicate it should switch
    stacks if it needs to do allocation.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
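
    A sketch of how the flag is intended to be used, combined with the
    stack-switch placement described in the previous entry. The bit value
    and the worker-queueing helper are illustrative, not the exact kernel
    code.

        /* xfs_bmap.h: new caller-visible flag (bit value illustrative) */
        #define XFS_BMAPI_STACK_SWITCH  0x800

        /* in the allocation path: only switch stacks when asked to */
        if (bma->flags & XFS_BMAPI_STACK_SWITCH)
                error = xfs_bmapi_allocate_via_worker(bma); /* hypothetical */
        else
                error = __xfs_bmapi_allocate(bma);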
     
  • Zero the kernel stack space that makes up the xfs_alloc_arg structures.

    Signed-off-by: Mark Tinguely
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Mark Tinguely
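
    The change is essentially the following pattern at each on-stack
    declaration site (a sketch, not the full diff):

        struct xfs_alloc_arg    args;

        memset(&args, 0, sizeof(struct xfs_alloc_arg));
        args.tp = tp;
        args.mp = mp;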
     

14 Jul, 2012

2 commits

  • Almost all metadata allocations come from shallow stack usage
    situations. Avoid the overhead of switching the allocation to a
    workqueue as we are not in danger of running out of stack when
    making these allocations. Metadata allocations are already marked
    through the args that are passed down, so this is trivial to do.

    Signed-off-by: Dave Chinner
    Reported-by: Mel Gorman
    Tested-by: Mel Gorman
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The current cursor is reallocated when retrying the allocation, so
    the existing cursor needs to be destroyed in both the restart and
    the failure cases.

    Signed-off-by: Dave Chinner
    Tested-by: Mike Snitzer
    Signed-off-by: Ben Myers

    Dave Chinner
     

22 Jun, 2012

1 commit

  • When we fail to find a matching extent near the requested extent
    specification during a left-right distance search in
    xfs_alloc_ag_vextent_near, we fail to free the original cursor that
    we used to look up the XFS_BTNUM_CNT tree and hence leak it.

    Reported-by: Chris J Arges
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     

21 Jun, 2012

1 commit

  • Fengguang reports:

    [ 780.529603] XFS (vdd): Ending clean mount
    [ 781.454590] ODEBUG: object is on stack, but not annotated
    [ 781.455433] ------------[ cut here ]------------
    [ 781.455433] WARNING: at /c/kernel-tests/sound/lib/debugobjects.c:301 __debug_object_init+0x173/0x1f1()
    [ 781.455433] Hardware name: Bochs
    [ 781.455433] Modules linked in:
    [ 781.455433] Pid: 26910, comm: kworker/0:2 Not tainted 3.4.0+ #51
    [ 781.455433] Call Trace:
    [ 781.455433] [] warn_slowpath_common+0x83/0x9b
    [ 781.455433] [] warn_slowpath_null+0x1a/0x1c
    [ 781.455433] [] __debug_object_init+0x173/0x1f1
    [ 781.455433] [] debug_object_init+0x14/0x16
    [ 781.455433] [] __init_work+0x20/0x22
    [ 781.455433] [] xfs_alloc_vextent+0x6c/0xd5

    Use INIT_WORK_ONSTACK in xfs_alloc_vextent instead of INIT_WORK.

    Reported-by: Wu Fengguang
    Signed-off-by: Jie Liu
    Signed-off-by: Ben Myers

    Jeff Liu
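
    The fix itself is a one-line change in xfs_alloc_vextent; a sketch,
    assuming the worker name introduced by the allocation workqueue
    commit further below:

        /* the work item lives on the caller's stack, so it must be
         * initialised with the on-stack variant to keep debugobjects
         * happy */
        INIT_WORK_ONSTACK(&args->work, xfs_alloc_vextent_worker);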
     

15 May, 2012

4 commits

  • Commit e459df5, 'xfs: move busy extent handling to it's own file'
    moved some code from xfs_alloc.c into xfs_extent_busy.c for
    convenience in userspace code merges. One of the functions moved is
    xfs_extent_busy_trim (formerly xfs_alloc_busy_trim) which is defined
    STATIC. Unfortunately this function is still used in xfs_alloc.c, and
    this results in an undefined symbol in xfs.ko.

    Make xfs_extent_busy_trim not static and add its prototype to
    xfs_extent_busy.h.

    Signed-off-by: Ben Myers
    Reviewed-by: Mark Tinguely

    Ben Myers
     
  • Now that the busy extent tracking has been moved out of the
    allocation files, clean up the namespace it uses to
    "xfs_extent_busy" rather than a mix of "xfs_busy" and
    "xfs_alloc_busy".

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • To make it easier to handle userspace code merges, move all the busy
    extent handling out of the allocation code and into its own file.
    The userspace code does not need the busy extent code, so this
    simplifies the merging of the kernel code into the userspace
    xfsprogs library.

    Because the busy extent code has been almost completely rewritten
    over the past couple of years, also update the copyright on this new
    file to include the authors that made all those changes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Untangle the header file includes a bit by moving the definition of
    xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
    xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
    xfs_ag.h.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
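
    For reference, the definition being relocated is a single typedef,
    roughly as it appears in the headers of that era:

        typedef __uint32_t      xfs_agino_t;    /* inode number within an AG */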
     

28 Mar, 2012

1 commit

  • xfs_ioc_fstrim() doesn't treat the incoming offset and length
    correctly. It treats them as a filesystem block address, rather than
    a disk address. This is wrong because the range passed in is a
    linear representation, while the filesystem block address notation
    is a sparse representation. Hence we cannot convert the range direct
    to filesystem block units and then use that for calculating the
    range to trim.

    While this sounds dangerous, the problem is limited to calculating
    what AGs need to be trimmed. The code that calculates the actual
    ranges to trim gets the right result (i.e. only ever discards free
    space), even though it uses the wrong ranges to limit what is
    trimmed. Hence this is not a bug that endangers user data.

    Fix this by treating the range as a disk address range and use the
    appropriate functions to convert the range into the desired formats
    for calculations.

    Further, fix the first free extent lookup (the longest) to actually
    find the largest free extent. Currently this lookup uses a
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     

23 Mar, 2012

1 commit

  • We currently have significant issues with the amount of stack that
    allocation in XFS uses, especially in the writeback path. We can
    easily consume 4k of stack between mapping the page, manipulating
    the bmap btree and allocating blocks from the free list. Not to
    mention btree block readahead and other functionality that issues IO
    in the allocation path.

    As a result, we can no longer fit allocation in the writeback path
    in the stack space provided on x86_64. To alleviate this problem,
    introduce an allocation workqueue and move all allocations to a
    separate context. This can be easily added as an interposing layer
    into xfs_alloc_vextent(), which takes a single argument structure
    and does not return until the allocation is complete or has failed.

    To do this, add a work structure and a completion to the allocation
    args structure. This allows xfs_alloc_vextent to queue the args onto
    the workqueue and wait for it to be completed by the worker. This
    can be done completely transparently to the caller.

    The worker function needs to ensure that it sets and clears the
    PF_TRANS flag appropriately as it is being run in an active
    transaction context. Work can also be queued in a memory reclaim
    context, so a rescuer is needed for the workqueue.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
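
    A condensed sketch of the interposing layer described above. The new
    xfs_alloc_arg fields, the __xfs_alloc_vextent name for the original
    body of the function, and the xfs_alloc_wq workqueue follow the
    description here and are approximations of the kernel code.

        /* fields added to struct xfs_alloc_arg (sketch) */
        struct completion       *done;
        struct work_struct      work;
        int                     result;

        /* the workqueue needs a rescuer because work can be queued from
         * memory reclaim, e.g.:
         * xfs_alloc_wq = alloc_workqueue("xfsalloc", WQ_MEM_RECLAIM, 0); */

        static void
        xfs_alloc_vextent_worker(struct work_struct *work)
        {
                struct xfs_alloc_arg *args = container_of(work,
                                        struct xfs_alloc_arg, work);

                /* runs in an active transaction context, so the PF_TRANS
                 * flag is set around the call and cleared afterwards */
                args->result = __xfs_alloc_vextent(args);
                complete(args->done);
        }

        int                             /* error */
        xfs_alloc_vextent(struct xfs_alloc_arg *args)
        {
                DECLARE_COMPLETION_ONSTACK(done);

                args->done = &done;
                INIT_WORK(&args->work, xfs_alloc_vextent_worker);
                queue_work(xfs_alloc_wq, &args->work);
                wait_for_completion(&done);
                return args->result;
        }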
     

25 May, 2011

2 commits

  • Blocks for the allocation btree are allocated from and released to
    the AGFL, and thus frequently reused. Even worse we do not have an
    easy way to avoid using an AGFL block when it is discarded due to
    the simple FILO list of free blocks, and thus can frequently stall
    on blocks that are currently undergoing a discard.

    Add a flag to the busy extent tracking structure to skip the discard
    for allocation btree blocks. In normal operation these blocks are
    reused frequently enough that there is no need to discard them
    anyway, but if they spill over to the allocation btree as part of a
    balance we "leak" blocks that we would otherwise discard. We could
    fix this by adding another flag and keeping these blocks in the
    rbtree even after they aren't busy any more so that we could discard
    them when they migrate out of the AGFL. Given that this would cause
    significant overhead I don't think it's worthwhile for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Now that we have reliable tracking of deleted extents in a
    transaction, we can easily implement "online" discard support
    which calls blkdev_issue_discard once a transaction commits.

    The actual discard is a two stage operation as we first have
    to mark the busy extent as not available for reuse before we
    can start the actual discard. Note that we don't bother
    supporting discard for the non-delaylog mode.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
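
    The second stage boils down to a blkdev_issue_discard() call per busy
    extent once the commit has completed; a rough sketch with an
    illustrative helper (the real code walks the busy extents recorded
    for the transaction):

        static int
        xfs_discard_one_extent(struct xfs_mount *mp, xfs_agnumber_t agno,
                               xfs_agblock_t agbno, xfs_extlen_t len)
        {
                /* the extent has already been marked unavailable for
                 * reuse, so it is now safe to discard it */
                return blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
                                XFS_AGB_TO_DADDR(mp, agno, agbno),
                                XFS_FSB_TO_BB(mp, len), GFP_NOFS, 0);
        }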
     

20 May, 2011

1 commit

  • When allocating an extent that is long enough to consume the
    remaining free space in an AG, we need to ensure that the allocation
    leaves enough space in the AG for any subsequent bmap btree blocks
    that are needed to track the new extent. These have to be allocated
    in the same AG as we only reserve enough blocks in an allocation
    transaction for modification of the freespace trees in a single AG.

    xfs_alloc_fix_minleft() has been considering blocks on the AGFL as
    free blocks available for extent and bmbt block allocation, which is
    not correct - blocks on the AGFL are there exclusively for the use
    of the free space btrees. As a result, when minleft is less than the
    number of blocks on the AGFL, xfs_alloc_fix_minleft() does not trim
    the given extent to leave minleft blocks available for bmbt
    allocation, and hence we can fail allocation during bmbt record
    insertion.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

29 Apr, 2011

4 commits

  • Instead of finding the per-ag structure and then taking and releasing
    the pagb_lock for every single busy extent completed, sort the list
    of busy extents and only switch between AGs where necessary. This
    becomes especially important with the online discard support, which
    will hit this lock more often.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
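
    A sketch of sorting the list by AG with the kernel's list_sort()
    before walking it; the structure and helper names approximate the
    busy extent code of that era:

        static int
        xfs_busy_extent_ag_cmp(void *priv, struct list_head *l1,
                               struct list_head *l2)
        {
                struct xfs_busy_extent  *b1, *b2;

                b1 = container_of(l1, struct xfs_busy_extent, list);
                b2 = container_of(l2, struct xfs_busy_extent, list);

                /* group by AG so pagb_lock is taken once per AG, not
                 * once per extent */
                return b1->agno - b2->agno;
        }

        /* in the log commit completion path: */
        list_sort(NULL, &busy_extents, xfs_busy_extent_ag_cmp);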
     
  • Update the extent tree in case we have to reuse a busy extent, so
    that it is always kept up to date. This is done by replacing the busy
    list searches with a new xfs_alloc_busy_reuse helper, which updates
    the busy extent tree in case of a reuse. This allows us to reuse
    metadata extents unconditionally, and thus avoid log forces,
    especially for allocation btree blocks.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Every time we reallocate a busy extent, we cause a synchronous log force
    to occur to ensure the freeing transaction is on disk before we continue
    and use the newly allocated extent. This is extremely sub-optimal as we
    have to mark every transaction with blocks that get reused as synchronous.

    Instead of searching the busy extent list after deciding on the extent to
    allocate, check each candidate extent during the allocation decisions as
    to whether they are in the busy list. If they are in the busy list, we
    trim the busy range out of the extent we have found and determine if that
    trimmed range is still OK for allocation. In many cases, this check can
    be incorporated into the allocation extent alignment code which already
    does trimming of the found extent before determining if it is a valid
    candidate for allocation.

    Based on earlier patches from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • While we need to make sure we do not reuse busy extents, there is no need
    to force out busy extents when moving them between the AGFL and the
    freespace btree as we still take care of that when doing the real allocation.

    To avoid the log force when just moving extents from the different free
    space tracking structures, move the busy search out of
    xfs_alloc_get_freelist into the callers that need it, and move the busy
    list insert from xfs_free_ag_extent which is used both by AGFL refills
    and real allocation to xfs_free_extent, which is only used by the latter.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

08 Apr, 2011

1 commit

  • A fuzzed filesystem crashed a kernel when freeing an extent with a
    block number beyond the end of the filesystem. Convert all the debug
    asserts in xfs_free_extent() to active checks so that we catch bad
    extents and return that the filesystem is corrupted rather than
    crashing.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
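
    A fragment illustrating the kind of conversion, from a debug-only
    assert to an active check that returns a corruption error; the actual
    checks in xfs_free_extent() are more extensive than this:

        /* before: only caught on debug kernels */
        ASSERT(args.agno < args.mp->m_sb.sb_agcount);

        /* after: reject bad block numbers on production kernels too */
        if (args.agno >= args.mp->m_sb.sb_agcount ||
            args.agbno >= args.mp->m_sb.sb_agblocks) {
                error = EFSCORRUPTED;
                goto error0;
        }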
     

12 Jan, 2011

1 commit

  • Allow manual discards from userspace using the FITRIM ioctl. This is
    not intended to be run during normal workloads, as the freespace
    btree walks can cause large performance degradation.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
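
    For context, a minimal userspace sketch of issuing such a manual
    discard through the FITRIM ioctl (error handling kept to the
    essentials):

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>

        int main(int argc, char **argv)
        {
                struct fstrim_range range;
                int fd;

                if (argc != 2) {
                        fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                        return 1;
                }
                fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                memset(&range, 0, sizeof(range));
                range.len = UINT64_MAX;         /* trim all free space */

                if (ioctl(fd, FITRIM, &range) < 0) {
                        perror("FITRIM");
                        return 1;
                }
                printf("trimmed %llu bytes\n",
                       (unsigned long long)range.len);
                return 0;
        }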
     

19 Oct, 2010

1 commit

  • Ionut Gabriel Popescu submitted a simple change to eliminate some
    "may be used uninitialized" warnings when building XFS. The reported
    condition seems to be something that GCC did not previously recognize
    or report. The warnings were produced by:

    gcc version 4.5.0 20100604
    [gcc-4_5-branch revision 160292] (SUSE Linux)

    Signed-off-by: Ionut Gabriel Popescu
    Signed-off-by: Alex Elder

    Poyo VL
     
