07 Oct, 2020

5 commits

  • There's a subtle design flaw in the deferred log item code that can lead
    to pinning the log tail. Taking up the defer ops chain examples from
    the previous commit, we can get trapped in sequences like this:

    Caller hands us a transaction t0 with D0-D3 attached. The defer ops
    chain will look like the following if the transaction rolls succeed:

    t1: D0(t0), D1(t0), D2(t0), D3(t0)
    t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
    t3: d5(t1), D1(t0), D2(t0), D3(t0)
    ...
    t9: d9(t7), D3(t0)
    t10: D3(t0)
    t11: d10(t10), d11(t10)
    t12: d11(t10)

    In transaction 9, we finish d9 and try to roll to t10 while holding onto
    an intent item for D3 that we logged in t0.

    The previous commit changed the order in which we place new defer ops in
    the defer ops processing chain to reduce the maximum chain length. Now
    make xfs_defer_finish_noroll capable of relogging the entire chain
    periodically so that we can always move the log tail forward. Most
    chains will never get relogged, except for operations that generate very
    long chains (large extents containing many blocks with different sharing
    levels) or are on filesystems with small logs and a lot of ongoing
    metadata updates.

    Callers are now required to ensure that the transaction reservation is
    large enough to handle logging done items and new intent items for the
    maximum possible chain length. Most callers are careful to keep the
    chain lengths low, so the overhead should be minimal.

    The decision to relog an intent item is made based on whether the intent
    was logged in a previous checkpoint, since there's no point in relogging
    an intent into the same checkpoint.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • In xfs_bui_item_recover, there exists a use-after-free bug with regards
    to the inode that is involved in the bmap replay operation. If the
    mapping operation does not complete, we call xfs_bmap_unmap_extent to
    create a deferred op to finish the unmapping work, and we retain a
    pointer to the incore inode.

    Unfortunately, the very next thing we do is commit the transaction and
    drop the inode. If reclaim tears down the inode before we try to finish
    the defer ops, we dereference garbage and blow up. Therefore, create a
    way to join inodes to the defer ops freezer so that we can maintain the
    xfs_inode reference until we're done with the inode.

    Note: This imposes the requirement that there be enough memory to keep
    every incore inode in memory throughout recovery.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • In most places in XFS, we have a specific order in which we gather
    resources: grab the inode, allocate a transaction, then lock the inode.
    xfs_bui_item_recover doesn't do it in that order, so fix it to be more
    consistent. This also makes the error bailout code a bit less weird.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • The bmap intent item checking code in xfs_bui_item_recover is spread all
    over the function. We should check the recovered log item at the top
    before we allocate any resources or do anything else, so do that.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • When we replay unfinished intent items that have been recovered from the
    log, it's possible that the replay will cause the creation of more
    deferred work items. As outlined in commit 509955823cc9c ("xfs: log
    recovery should replay deferred ops in order"), later work items have an
    implicit ordering dependency on earlier work items. Therefore, recovery
    must replay the items (both recovered and created) in the same order
    that they would have been during normal operation.

    For log recovery, we enforce this ordering by using an empty transaction
    to collect deferred ops that get created in the process of recovering a
    log intent item to prevent them from being committed before the rest of
    the recovered intent items. After we finish committing all the
    recovered log items, we allocate a transaction with an enormous block
    reservation, splice our huge list of created deferred ops into that
    transaction, and commit it, thereby finishing all those ops.

    This is /really/ hokey -- it's the one place in XFS where we allow
    nested transactions; the splicing of the defer ops list is is inelegant
    and has to be done twice per recovery function; and the broken way we
    handle inode pointers and block reservations cause subtle use-after-free
    and allocator problems that will be fixed by this patch and the two
    patches after it.

    Therefore, replace the hokey empty transaction with a structure designed
    to capture each chain of deferred ops that are created as part of
    recovering a single unfinished log intent. Finally, refactor the loop
    that replays those chains to do so using one transaction per chain.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

23 Sep, 2020

3 commits

  • Nowadays, log recovery will call ->release on the recovered intent items
    if recovery fails. Therefore, it's redundant to release them from
    inside the ->recover functions when they're about to return an error.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     
  • In the bmap intent item recovery code, we must be careful to attach the
    inode to its dquots (if quotas are enabled) so that a change in the
    shape of the bmap btree doesn't cause the quota counters to be
    incorrect.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     
  • During a code inspection, I found a serious bug in the log intent item
    recovery code when an intent item cannot complete all the work and
    decides to requeue itself to get that done. When this happens, the
    item recovery creates a new incore deferred op representing the
    remaining work and attaches it to the transaction that it allocated. At
    the end of _item_recover, it moves the entire chain of deferred ops to
    the dummy parent_tp that xlog_recover_process_intents passed to it, but
    fail to log a new intent item for the remaining work before committing
    the transaction for the single unit of work.

    xlog_finish_defer_ops logs those new intent items once recovery has
    finished dealing with the intent items that it recovered, but this isn't
    sufficient. If the log is forced to disk after a recovered log item
    decides to requeue itself and the system goes down before we call
    xlog_finish_defer_ops, the second log recovery will never see the new
    intent item and therefore has no idea that there was more work to do.
    It will finish recovery leaving the filesystem in a corrupted state.

    The same logic applies to /any/ deferred ops added during intent item
    recovery, not just the one handling the remaining work.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     

29 Jul, 2020

1 commit

  • Use kmem_cache_zalloc() directly.

    With the exception of xlog_ticket_alloc() which will be dealt on the
    next patch for readability.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Carlos Maiolino
     

08 May, 2020

8 commits


07 May, 2020

1 commit

  • Various intent log items call xfs_trans_ail_remove() with a log I/O
    error shutdown type, but this helper historically checks whether an
    item is in the AIL before calling xfs_trans_ail_delete(). This means
    the shutdown check is essentially a no-op for users of
    xfs_trans_ail_remove().

    It is possible that some items might not be AIL resident when the
    AIL remove attempt occurs, but this should be isolated to cases
    where the filesystem has already shutdown. For example, this
    includes abort of the transaction committing the intent and I/O
    error of the iclog buffer committing the intent to the log.
    Therefore, update these callsites to use xfs_trans_ail_delete() to
    provide AIL state validation for the common path of items being
    released and removed when associated done items commit to the
    physical log.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Allison Collins
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

05 May, 2020

5 commits

  • Given how XFS is all based around btrees it doesn't make much sense
    to offer a totally generic state when we can just use the btree cursor.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • All defer op instance place their own extension of the log item into
    the dfp_done field. Replace that with a xfs_log_item to improve type
    safety and make the code easier to follow.

    Also use the opportunity to improve the ->finish_item calling conventions
    to place the done log item as the higher level structure before the
    list_entry used for the individual items.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • All defer op instance place their own extension of the log item into
    the dfp_intent field. Replace that with a xfs_log_item to improve type
    safety and make the code easier to follow.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • This avoids a per-item indirect call, and also simplifies the interface
    a bit.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • These are aways called together, and my merging them we reduce the amount
    of indirect calls, improve type safety and in general clean up the code
    a bit.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

19 Nov, 2019

1 commit


11 Nov, 2019

1 commit


05 Nov, 2019

1 commit


28 Aug, 2019

1 commit


27 Aug, 2019

1 commit


29 Jun, 2019

8 commits

  • There are many, many xfs header files which are included but
    unneeded (or included twice) in the xfs code, so remove them.

    nb: xfs_linux.h includes about 9 headers for everyone, so those
    explicit includes get removed by this. I'm not sure what the
    preference is, but if we wanted explicit includes everywhere,
    a followup patch could remove those xfs_*.h includes from
    xfs_linux.h and move them into the files that need them.
    Or it could be left as-is.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     
  • Keep all bmap item related code together.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • There is no good reason to keep these two functions separate.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • We have various items that are released from ->iop_comitting. Add a
    flag to just call ->iop_release from the commit path to avoid tons
    of boilerplate code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • The iop_unlock method is called when comitting or cancelling a
    transaction. In the latter case, the transaction may or may not be
    aborted. While there is no known problem with the current code in
    practice, this implementation is limited in that any log item
    implementation that might want to differentiate between a commit and a
    cancellation must rely on the aborted state. The aborted bit is only
    set when the cancelled transaction is dirty, however. This means that
    there is no way to distinguish between a commit and a clean transaction
    cancellation.

    For example, intent log items currently rely on this distinction. The
    log item is either transferred to the CIL on commit or released on
    transaction cancel. There is currently no possibility for a clean intent
    log item in a transaction, but if that state is ever introduced a cancel
    of such a transaction will immediately result in memory leaks of the
    associated log item(s). This is an interface deficiency and landmine.

    To clean this up, replace the iop_unlock method with an iop_release
    method that is specific to transaction cancel. The existing
    iop_committing method occurs at the same time as iop_unlock in the
    commit path and there is no need for two separate callbacks here.
    Overload the iop_committing method with the current commit time
    iop_unlock implementations to eliminate the need for the latter and
    further simplify the interface.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Just check if they are present first.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • The inode geometry structure isn't related to ondisk format; it's
    support for the mount structure. Move it to xfs_shared.h.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

03 Aug, 2018

4 commits

  • The majority of remaining references to struct xfs_defer_ops in XFS
    are associated with xfs_defer_add(). At this point, there are no
    more external xfs_defer_ops users left. All instances of
    xfs_defer_ops are embedded in the transaction, which means we can
    safely pass the transaction down to the dfops add interface.

    Update xfs_defer_add() to receive the transaction as a parameter.
    Various subsystems implement wrappers to allocate and construct the
    context specific data structures for the associated deferred
    operation type. Update these to also carry the transaction down as
    needed and clean up unused dfops parameters along the way.

    This removes most of the remaining references to struct
    xfs_defer_ops throughout the code and facilitates removal of the
    structure.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    [darrick: fix unused variable warnings with ftrace disabled]
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • The dfops infrastructure ->finish_item() callback passes the
    transaction and dfops as separate parameters. Since dfops is always
    part of a transaction, the latter parameter is no longer necessary.
    Remove it from the various callbacks.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • All callers pass ->t_dfops of the associated transactions. Refactor
    the helpers to receive the transactions and facilitate further
    cleanups between xfs_defer_ops and xfs_trans.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • Log intent recovery is the last user of an external (on-stack)
    dfops. The pattern exists because the dfops is used to collect
    additional deferred operations queued during the whole recovery
    sequence. The dfops is finished with a new transaction after intent
    recovery completes.

    We already have a mechanism to create an empty, container-like
    transaction to support the scrub infrastructure. We can reuse that
    mechanism here to drop the final user of external dfops. This
    facilitates folding dfops state (i.e., dop_low) into the
    transaction, the elimination of now unused external dfops support
    and also eliminates the only caller of __xfs_defer_cancel().

    Replace the on-stack dfops with an empty transaction and pass it
    around to the various helpers that queue and finish deferred
    operations during intent recovery.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster