03 Aug, 2016

40 commits

  • Originally-From: Dave Chinner

    So such blocks can be correctly identified and have their operations
    structures attached to validate that recovery has not resulted in a
    corrupt block.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Originally-From: Dave Chinner

    So xfs_info and other userspace utilities know the filesystem is
    using this feature.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When we map, unmap, or convert an extent in a file's data or attr
    fork, schedule a respective update in the rmapbt. Previous versions
    of this patch required a 1:1 correspondence between bmap and rmap,
    but this is no longer true as we now have the ability to make
    interval queries against the rmapbt.

    We use the deferred operations code to handle redo operations
    atomically and deadlock free. This plumbs in all five rmap actions
    (map, unmap, convert extent, alloc, free); we'll use the first three
    now for file data, and reflink will want the last two. We also add
    an error injection site to test log recovery. (A rough sketch of the
    deferred-update idea follows below.)

    Finally, we need to fix the bmap shift extent code to adjust the
    rmaps correctly.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
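
    A rough sketch of the deferred-update idea referenced above (the type
    and function names are invented for illustration, not the kernel's
    API): each of the five actions becomes a small work item that is
    queued now and applied to the rmapbt later.

        #include <stdio.h>
        #include <stdint.h>

        /* The five deferred rmap actions named above. */
        enum rmap_action { RMAP_MAP, RMAP_UNMAP, RMAP_CONVERT,
                           RMAP_ALLOC, RMAP_FREE };

        /* One pending update: which owner maps which physical extent. */
        struct rmap_work_item {
                enum rmap_action action;
                uint64_t         owner;       /* e.g. an inode number */
                uint64_t         startblock;  /* physical start block */
                uint64_t         blockcount;  /* length in blocks */
                uint64_t         offset;      /* logical offset in the owner */
        };

        /* The "finish" step applies one queued item to the rmapbt. */
        static void rmap_finish_one(const struct rmap_work_item *wi)
        {
                static const char *names[] = {
                        "map", "unmap", "convert", "alloc", "free"
                };

                printf("rmap %s: owner %llu block %llu len %llu\n",
                       names[wi->action],
                       (unsigned long long)wi->owner,
                       (unsigned long long)wi->startblock,
                       (unsigned long long)wi->blockcount);
        }

        int main(void)
        {
                struct rmap_work_item wi = {
                        .action = RMAP_MAP, .owner = 131,
                        .startblock = 8192, .blockcount = 16, .offset = 0,
                };

                rmap_finish_one(&wi);
                return 0;
        }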
     
  • Connect the xfs_defer mechanism with the pieces that we'll need to
    handle deferred rmap updates. We'll wire up the existing code to
    our new deferred mechanism later.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Provide a mechanism for higher levels to create RUI/RUD items, submit
    them to the log, and a stub function to deal with recovered RUI items.
    These parts will be connected to the rmapbt in a later patch.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Create rmap update intent/done log items to record redo information in
    the log. Because we need to roll transactions between updating the
    bmbt mapping and updating the reverse mapping, we also have to track
    the status of the metadata updates that will be recorded in the
    post-roll transactions, just in case we crash before committing the
    final transaction. This mechanism enables log recovery to finish what
    was already started; a sketch of the intent/done pairing follows
    below.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
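
    A minimal sketch of the intent/done pairing (purely illustrative;
    these are not the kernel's RUI/RUD structures): recovery replays any
    intent that has no matching done item in the log.

        #include <stdbool.h>
        #include <stdio.h>

        /* A grossly simplified log entry: either an intent or a done. */
        struct log_item {
                bool          is_done;  /* false = intent, true = done */
                unsigned long id;       /* pairs an intent with its done */
        };

        /* Replay every intent that never got a matching done item. */
        static void recover(const struct log_item *log, int nitems)
        {
                for (int i = 0; i < nitems; i++) {
                        bool finished = false;

                        if (log[i].is_done)
                                continue;
                        for (int j = 0; j < nitems; j++)
                                if (log[j].is_done && log[j].id == log[i].id)
                                        finished = true;
                        if (!finished)
                                printf("replaying unfinished intent %lu\n",
                                       log[i].id);
                }
        }

        int main(void)
        {
                struct log_item log[] = {
                        { false, 1 }, { true, 1 },  /* update 1 completed */
                        { false, 2 },               /* crashed before done 2 */
                };

                recover(log, 3);  /* prints: replaying unfinished intent 2 */
                return 0;
        }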
     
  • Add a couple of helper functions to encapsulate rmap btree insert and
    delete operations. Add tracepoints to the update function.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Provide a function to convert an unwritten rmap extent to a real one
    and vice versa.

    [ dchinner: Note that this algorithm and code was derived from the
    existing bmapbt unwritten extent conversion code in
    xfs_bmap_add_extent_unwritten_real(). ]

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Originally-From: Dave Chinner

    Now that we have records in the rmap btree, we need to remove them
    when extents are freed. This needs to find the relevant record in
    the btree and remove/trim/split it accordingly.

    [darrick.wong@oracle.com: make rmap routines handle the enlarged keyspace]
    [dchinner: remove remaining unused debug printks]
    [darrick: fix a bug when growfs in an AG with an rmap ending at EOFS]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Originally-From: Dave Chinner

    Now all the btree, free space and transaction infrastructure is in
    place, we can finally add the code to insert reverse mappings to the
    rmap btree. Freeing will be done in a separate patch, so just the
    addition operation can be focussed on here.

    [darrick: handle owner offsets when adding rmaps]
    [dchinner: remove remaining debug printk statements]
    [darrick: move unwritten bit to rm_offset]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Now that the generic btree code supports querying all records within a
    range of keys, use that functionality to allow us to ask for all the
    extents mapped to a range of physical blocks.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Now that the generic btree code supports overlapping intervals, plug
    in the rmap btree to this functionality. We will need it to find
    potential left neighbors in xfs_rmap_{alloc,free} later in the patch
    set.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Originally-From: Dave Chinner

    Implement the generic btree operations needed to manipulate rmap
    btree blocks. This is very similar to the per-ag freespace btree
    implementation, and uses the AGFL for allocation and freeing of
    blocks.

    Adapt the rmap btree to store owner offsets within each rmap record,
    and to handle the primary key being redefined as the tuple
    [agblk, owner, offset]. The expansion of the primary key is crucial
    to allowing multiple owners per extent (see the key-ordering sketch
    below).

    [darrick: adapt the btree ops to deal with offsets]
    [darrick: remove init_rec_from_key]
    [darrick: move unwritten bit to rm_offset]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
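
    The effect of the wider key can be shown with a simple comparison
    function (a sketch with invented names, not the kernel's btree ops):
    records sort by start block, then owner, then offset, so several
    owners of the same physical block remain distinct keys.

        #include <stdint.h>

        struct rmap_key {
                uint64_t agblk;   /* AG block number of the extent */
                uint64_t owner;   /* inode number or metadata owner */
                uint64_t offset;  /* logical offset within the owner */
        };

        /* Lexicographic compare over the [agblk, owner, offset] tuple. */
        static int rmap_key_cmp(const struct rmap_key *a,
                                const struct rmap_key *b)
        {
                if (a->agblk != b->agblk)
                        return a->agblk < b->agblk ? -1 : 1;
                if (a->owner != b->owner)
                        return a->owner < b->owner ? -1 : 1;
                if (a->offset != b->offset)
                        return a->offset < b->offset ? -1 : 1;
                return 0;
        }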
     
  • Originally-From: Dave Chinner

    The rmap btree is allocated from the AGFL, which means we have to
    ensure ENOSPC is reported to userspace before we run out of free
    space in each AG. The last allocation in an AG can cause a full
    height rmap btree split, and that means we have to reserve at least
    this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
    Update the various space calculation functions to handle this; a
    simplified sketch of the worst-case reservation follows below.

    Also, because the macros are now executing conditional code and are
    called quite frequently, convert them to functions that initialise
    variables in the struct xfs_mount, use the new variables everywhere
    and document the calculations better.

    [darrick.wong@oracle.com: don't reserve blocks if !rmap]
    [dchinner@redhat.com: update m_ag_max_usable after growfs]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
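
    One way to picture the reservation (a simplified illustration, not
    the kernel's exact formula): the free list must be able to feed a
    full-height split of every btree that allocates from it, so the
    per-AG reserve grows with the maximum height of each such tree.

        /* Simplified: reserve one block per level of each AGFL-fed btree
         * so a full-height split can still be fed at ENOSPC. */
        static unsigned int agfl_min_blocks(unsigned int bnobt_maxlevels,
                                            unsigned int cntbt_maxlevels,
                                            unsigned int rmapbt_maxlevels,
                                            int has_rmapbt)
        {
                unsigned int min_free = bnobt_maxlevels + cntbt_maxlevels;

                if (has_rmapbt)
                        min_free += rmapbt_maxlevels;
                return min_free;
        }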
     
  • The rmap btrees will use the AGFL as the block allocation source, so
    we need to ensure that the transaction reservations reflect the fact
    this tree is modified by allocation and freeing. Hence we need to
    extend all the extent allocation/free reservations used in
    transactions to handle this.

    Note that this also gets rid of the unused XFS_ALLOCFREE_LOG_RES
    macro, as we now do buffer reservations based on the number of
    buffers logged via xfs_calc_buf_res(). Hence we only need the buffer
    count calculation now.

    [darrick: use rmap_maxlevels when calculating log block resv]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Originally-From: Dave Chinner

    Now we can read and write rmap btree blocks, we can add support to
    the growfs code to initialise new rmap btree blocks.

    [darrick.wong@oracle.com: fill out the rmap offset fields]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Originally-From: Dave Chinner

    Now we have all the surrounding call infrastructure in place, we can
    start filling out the rmap btree implementation. Start with the
    on-disk btree format; add everything needed to read, write and
    manipulate rmap btree blocks. This prepares the way for adding the
    btree operations implementation. (A sketch of how flags such as the
    unwritten bit are folded into the offset field follows below.)

    [darrick: record owner and offset info in rmap btree]
    [darrick: fork, bmbt and unwritten state in rmap btree]
    [darrick: flags are a separate field in xfs_rmap_irec]
    [darrick: calculate maxlevels separately]
    [darrick: move the 'unwritten' bit into unused parts of rm_offset]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
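
    A sketch of how a flag such as the unwritten bit can be folded into
    the offset field, as the bracketed notes above describe (the exact
    bit positions here are illustrative assumptions, not a statement of
    the on-disk format): logical file offsets never need all 64 bits, so
    the top bits of rm_offset are free to carry per-record state.

        #include <stdint.h>

        /* Illustrative layout: low bits hold the logical offset, a few
         * borrowed high bits hold per-record flags. */
        #define RMAP_OFF_ATTR_FORK   (1ULL << 63)  /* attr fork extent */
        #define RMAP_OFF_BMBT_BLOCK  (1ULL << 62)  /* bmbt block, not data */
        #define RMAP_OFF_UNWRITTEN   (1ULL << 61)  /* unwritten extent */
        #define RMAP_OFF_MASK        ((1ULL << 54) - 1)

        static uint64_t rmap_pack_offset(uint64_t offset, int unwritten)
        {
                uint64_t v = offset & RMAP_OFF_MASK;

                if (unwritten)
                        v |= RMAP_OFF_UNWRITTEN;
                return v;
        }

        static uint64_t rmap_logical_offset(uint64_t packed)
        {
                return packed & RMAP_OFF_MASK;
        }

        static int rmap_is_unwritten(uint64_t packed)
        {
                return (packed & RMAP_OFF_UNWRITTEN) != 0;
        }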
     
  • Originally-From: Dave Chinner

    Add the stubs into the extent allocation and freeing paths that the
    rmap btree implementation will hook into. While doing this, add the
    trace points that will be used to track rmap btree extent
    manipulations.

    [darrick.wong@oracle.com: Extend the stubs to take full owner info.]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • For the rmap btree to work, we have to feed the extent owner
    information to the allocation and freeing functions. This
    information is what will end up in the rmap btree that tracks
    allocated extents. While we technically don't need the owner
    information when freeing extents, passing it allows us to validate
    that the extent we are removing from the rmap btree actually
    belonged to the owner we expected it to belong to.

    We also define a special set of owner values for internal metadata
    that would otherwise have no owner. This allows us to tell the
    difference between metadata owned by different per-ag btrees, as
    well as static fs metadata (e.g. AG headers) and internal journal
    blocks (a sketch follows below).

    There are also a couple of special cases we need to take care of -
    during EFI recovery, we don't actually know who the original owner
    was, so we need to pass a wildcard to indicate that we aren't
    checking the owner for validity. We also need special handling in
    growfs, as we "free" the space in the last AG when extending it, but
    because it's new space it has no actual owner...

    While touching the xfs_bmap_add_free() function, re-order the
    parameters to put the struct xfs_mount first.

    Extend the owner field to include both the owner type and some sort
    of index within the owner. The index field will be used to support
    reverse mappings when reflink is enabled.

    When we're freeing extents from an EFI, we don't have the owner
    information available (rmap updates have their own redo items).
    xfs_free_extent therefore doesn't need to do an rmap update. Make
    sure that the log replay code signals this correctly.

    This is based upon a patch originally from Dave Chinner. It has been
    extended to add more owner information with the intent of helping
    recovery operations when things go wrong (e.g. offset of user data
    block in a file).

    [dchinner: de-shout the xfs_rmap_*_owner helpers]
    [darrick: minor style fixes suggested by Christoph Hellwig]

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
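
    A sketch of the owner information described above (names and values
    invented for illustration): ordinary extents are owned by an inode
    number, metadata without an inode gets a reserved pseudo-owner, and
    a wildcard owner lets EFI recovery skip the ownership check.

        #include <stdint.h>

        /* Reserved pseudo-owners for metadata that has no inode. */
        enum special_owner {
                OWN_UNKNOWN = 1,  /* wildcard: don't validate the owner */
                OWN_FS,           /* static fs metadata, e.g. AG headers */
                OWN_LOG,          /* internal journal blocks */
                OWN_AG,           /* per-AG btree blocks and the AGFL */
                OWN_INODES,       /* inode chunks */
        };

        /* What the allocation/freeing paths now pass alongside an extent. */
        struct owner_info {
                uint64_t     owner;   /* inode number or special owner */
                uint64_t     offset;  /* offset of the block in the owner */
                unsigned int flags;   /* attr fork, bmbt block, ... */
        };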
     
  • Originally-From: Dave Chinner

    XFS reserves a small amount of space in each AG for the minimum
    number of free blocks needed for operation. Adding the rmap btree
    increases the number of reserved blocks, but it also increases the
    complexity of the calculation as the free inode btree is optional
    (like the rmapbt).

    Rather than calculate the prealloc blocks every time we need to
    check it, add a function to calculate it at mount time and store it
    in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
    to just use the xfs_mount variable directly (a sketch of this
    compute-once pattern follows below).

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
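
    A sketch of the compute-once-at-mount pattern mentioned above (the
    field names and block counts are illustrative, not the kernel's):

        /* Values that a macro used to rederive on every call. */
        struct mount_info {
                int          has_finobt;
                int          has_rmapbt;
                unsigned int prealloc_blocks;  /* filled in once at mount */
        };

        /* Called once at mount time; callers then read the cached field
         * instead of re-running the conditional calculation. */
        static void calc_prealloc_blocks(struct mount_info *mp)
        {
                unsigned int blocks = 4;  /* headers + free space btrees */

                if (mp->has_finobt)
                        blocks++;
                if (mp->has_rmapbt)
                        blocks++;
                mp->prealloc_blocks = blocks;
        }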
     
  • Originally-From: Dave Chinner

    The rmap btree will require the same stats as all the other generic
    btrees, so add all the code for that now.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Originally-From: Dave Chinner

    Add new per-ag rmap btree definitions to the per-ag structures. The
    rmap btree will sit in the empty slots on disk after the free space
    btrees, and hence form a part of the array of space management
    btrees. This requires the definition of the btree to be contiguous
    with the free space btrees.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • By my calculations, a 1,073,741,824 block AG with a 1k block size
    can attain a maximum height of 9. Assuming a record size of 24
    bytes, a key/ptr size of 44 bytes, and half-full btree nodes, we'd
    need 53,687,092 blocks for the records and ~5.4 million blocks for
    the keys. That requires a btree of height 9, based on the following
    derivation (replayed as a small program below):

    Block size = 1024b
    sblock CRC header = 56b
    == 1024-56 = 968 bytes for tree data

    rmapbt record = 24b
    == 40 records per leaf block

    rmapbt ptr/key = 44b
    == 22 ptr/keys per block

    Worst case, each block is half full, so 20 records and 11 ptrs per block.

    1073741824 rmap records / 20 records per block
    == 53687092 leaf blocks

    53687092 leaves / 11 ptrs per block
    == 4880645 level 1 blocks
    == 443695 level 2 blocks
    == 40336 level 3 blocks
    == 3667 level 4 blocks
    == 334 level 5 blocks
    == 31 level 6 blocks
    == 3 level 7 blocks
    == 1 level 8 block

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
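
    The derivation above can be replayed with a few lines of C using the
    same numbers given in the text:

        #include <stdio.h>

        int main(void)
        {
                unsigned long long n = 1073741824ULL; /* one record per AG block */
                unsigned int recs_per_block = 20;  /* half of 40 per leaf */
                unsigned int ptrs_per_block = 11;  /* half of 22 per node */
                unsigned int height = 1;           /* count the leaf level */

                n = (n + recs_per_block - 1) / recs_per_block;
                printf("leaf blocks: %llu\n", n);
                while (n > 1) {
                        n = (n + ptrs_per_block - 1) / ptrs_per_block;
                        printf("level %u blocks: %llu\n", height, n);
                        height++;
                }
                printf("height = %u\n", height);   /* 9 for these inputs */
                return 0;
        }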
     
  • Add a couple of tracepoints for the deferred extent free operation and
    a site for injecting errors while finishing the operation. This makes
    it easier to debug deferred ops and test log redo.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Refactor the EFI intent item recovery (and cancellation) functions
    into a general function that scans the AIL and an intent item type
    specific handler. Move the function that recovers a single EFI item
    into the extent free item code. We'll want the generalized function
    when we start wiring up more redo item types.

    Furthermore, ensure that log recovery only replays the redo items
    that were in the AIL prior to recovery by checking the item LSN
    against the largest LSN seen during log scanning (the check is
    sketched below). As written, this should never happen, but we can be
    defensive anyway.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
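
    The LSN check can be sketched as follows (illustrative only): items
    logged at or after the largest LSN seen during the scan were created
    by recovery itself and must not be replayed again.

        typedef unsigned long long lsn_t;

        /* Replay only intents that were already in the log (and hence
         * the AIL) before recovery started. */
        static int should_replay(lsn_t item_lsn, lsn_t last_scanned_lsn)
        {
                return item_lsn < last_scanned_lsn;
        }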
     
  • Mechanical change of flist/free_list to dfops, since they're now
    deferred ops, not just a freeing list.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Drop the compatibility shims that we were using to integrate the new
    deferred operation mechanism into the existing code. No new code.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Restructure everything that used xfs_bmap_free to use xfs_defer_ops
    instead. For now we'll just remove the old symbols and play some
    cpp magic to make it work; in the next patch we'll actually rename
    everything.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Connect the xfs_defer mechanism with the pieces that we'll need to
    handle deferred extent freeing. We'll wire up the existing code to
    our new deferred mechanism later.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Replace structure typedefs with struct xfs_foo_* in the EFI/EFD
    handling code in preparation to move it over to deferred ops.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Add tracepoints for the internals of the deferred ops mechanism
    and tracepoint classes for clients of the dops, to make debugging
    easier.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • All the code around struct xfs_bmap_free basically implements a
    deferred operation framework through which we can roll transactions
    (to unlock buffers and avoid violating lock order rules) while
    managing all the necessary log redo items. Previously we only used
    this code to free extents after some sort of mapping operation, but
    with the advent of rmap and reflink, we suddenly need to do more than
    that.

    With that in mind, xfs_bmap_free really becomes a deferred ops control
    structure. Rename the structure and move the deferred ops into their
    own file to avoid further bloating of the bmap code. (A stripped-down
    sketch of the roll-and-finish loop follows below.)

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
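
    A stripped-down sketch of the roll-and-finish loop (the real
    xfs_defer code also manages intent/done log items, re-joined inodes
    and buffers, and error handling; names here are illustrative):

        struct work_item {
                struct work_item *next;
                void             (*finish)(void *arg);
                void             *arg;
        };

        struct defer_ops {
                struct work_item *pending;
        };

        /* Queue work to be done after the current transaction commits. */
        static void defer_add(struct defer_ops *dop, struct work_item *wi)
        {
                wi->next = dop->pending;
                dop->pending = wi;
        }

        /* Stand-in for committing the current tx and opening a fresh one. */
        static void roll_transaction(void) { }

        /* Finish the deferred work, rolling the transaction between items
         * so no single transaction grows too large or holds locks across
         * the whole sequence. */
        static void defer_finish(struct defer_ops *dop)
        {
                while (dop->pending) {
                        struct work_item *wi = dop->pending;

                        dop->pending = wi->next;
                        roll_transaction();
                        wi->finish(wi->arg);
                }
        }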
     
  • Refactor the btree_change_owner function into a more generic apparatus
    which visits all blocks in a btree. We'll use this in a subsequent
    patch for counting btree blocks for AG reservations.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Create a function to enable querying of btree records mapping to a
    range of keys. This will be used in subsequent patches to allow
    querying the reverse mapping btree to find the extents mapped to a
    range of physical blocks, though the generic code can be used for
    any range query (a toy sketch of the callback interface follows
    below).

    The overlapped query range function needs to use the btree get_block
    helper because the root block could be an inode, in which case
    bc_bufs[nlevels-1] will be NULL.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
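
    A toy sketch of the callback interface (over a sorted array rather
    than a btree; names are illustrative): visit every record whose key
    falls in [low, high] and hand it to the caller's function.

        #include <stdio.h>
        #include <stdint.h>

        typedef int (*query_fn)(uint64_t key, void *priv);

        /* recs[] must be sorted; a btree would seek to 'low' directly. */
        static void query_range(const uint64_t *recs, int nrecs,
                                uint64_t low, uint64_t high,
                                query_fn fn, void *priv)
        {
                for (int i = 0; i < nrecs; i++) {
                        if (recs[i] < low)
                                continue;
                        if (recs[i] > high)
                                break;      /* past the right edge: stop */
                        fn(recs[i], priv);
                }
        }

        static int print_rec(uint64_t key, void *priv)
        {
                (void)priv;
                printf("%llu\n", (unsigned long long)key);
                return 0;
        }

        int main(void)
        {
                uint64_t recs[] = { 2, 5, 9, 14, 21 };

                query_range(recs, 5, 5, 15, print_rec, NULL); /* 5 9 14 */
                return 0;
        }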
     
  • On a filesystem with both reflink and reverse mapping enabled, it's
    possible to have multiple rmap records referring to the same blocks on
    disk. When overlapping intervals are possible, querying a classic
    btree to find all records intersecting a given interval is inefficient
    because we cannot use the left side of the search interval to filter
    out non-matching records the same way that we can use the existing
    btree key to filter out records coming after the right side of the
    search interval. This will become important once we want to use the
    rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.

    (For the non-overlapping case, we can perform such queries trivially
    by starting at the left side of the interval and walking the tree
    until we pass the right side.)

    Therefore, extend the btree code to come closer to supporting
    intervals as a first-class record attribute. This involves widening
    the btree node's key space to store both the lowest key reachable via
    the node pointer (as the btree does now) and the highest key reachable
    via the same pointer, and teaching the btree modifying functions to
    keep the highest-key records up to date (the pruning rule this
    enables is sketched below).

    This behavior can be turned on via a new btree ops flag so that btrees
    that cannot store overlapping intervals don't pay the overhead costs
    in terms of extra code and disk format changes.

    When we're deleting a record in a btree that supports overlapped
    interval records and the deletion results in two btree blocks being
    joined, we defer updating the high/low keys until after all possible
    joining (at higher levels in the tree) has finished. At this point,
    the btree pointers at all levels have been updated to remove the empty
    blocks and we can update the low and high keys.

    When we're doing this, we must be careful to update the keys of all
    node pointers up to the root instead of stopping at the first set of
    keys that don't need updating. This is because it's possible for a
    single deletion to cause joining of multiple levels of tree, and so
    we need to update everything going back to the root.

    The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than,
    equal to, or greater than key2, respectively. This is consistent
    with the rest of the kernel and the C library.

    In btree_updkeys(), we need to evaluate the force_all parameter before
    running the key diff to avoid reading uninitialized memory when we're
    forcing a key update. This happens when we've allocated an empty slot
    at level N + 1 to point to a new block at level N and we're in the
    process of filling out the new keys.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
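
    The pruning rule mentioned above can be sketched as follows (an
    illustration, not the kernel's node format): with overlapping
    intervals, a child pointer can only be skipped during an interval
    query if the query range misses the child's whole [low key, high key]
    range; with the low key alone, nothing on the left side of the query
    can be safely pruned.

        #include <stdint.h>

        /* Key pair kept per child pointer in an overlapping-interval btree. */
        struct node_keys {
                uint64_t low;   /* lowest key reachable via this pointer */
                uint64_t high;  /* highest key reachable via this pointer */
        };

        /* A query for [qlow, qhigh] must descend into the child unless the
         * child's [low, high] range and the query range do not overlap. */
        static int must_descend(const struct node_keys *k,
                                uint64_t qlow, uint64_t qhigh)
        {
                return k->low <= qhigh && qlow <= k->high;
        }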
     
  • Add some function pointers to bc_ops to get the btree keys for
    leaf and node blocks, and to update parent keys of a block.
    Convert the _btree_updkey calls to use our new pointer, and
    modify the tree shape changing code to call the appropriate
    get_*_keys pointer instead of _btree_copy_keys, because the
    overlapping btree has to calculate high key values. (A sketch of
    such an ops table follows below.)

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
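
    A sketch of such an ops table (the signatures are invented; the point
    is that each btree type supplies its own key-derivation callbacks,
    and an overlapping btree supplies versions that compute both the low
    and the high key):

        /* Per-btree-type key callbacks hung off the cursor's ops vector. */
        struct btree_key_ops {
                void (*get_leaf_keys)(const void *block, void *key);
                void (*get_node_keys)(const void *block, void *key);
                int  (*update_keys)(void *cursor, int level);
        };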
     
  • When a btree block has to be split, we pass the new block's ptr from
    xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
    however, we pass the block's key through the cursor's record. It is a
    little weird to "initialize" a record from a key since the non-key
    attributes will have garbage values.

    When we go to add support for interval queries, we have to be able to
    pass the lowest and highest keys accessible via a pointer. There's no
    clean way to pass this back through the cursor's record field.
    Therefore, pass the key directly back to xfs_btree_insert() the same
    way that we pass the btree_ptr.

    As a bonus, we no longer need init_rec_from_key and can drop it from the
    codebase.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • If we make the inode root block of a btree unfull by expanding the
    root, we must set *stat to 1 to signal success, rather than leaving
    it uninitialized.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When we're deleting realtime extents, we need to lock the summary
    inode in case we need to update the summary info; this prevents an
    assert on the rsumip inode lock on a debug kernel. While we're at
    it, fix the locking annotations so that we avoid triggering lockdep
    warnings.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong