09 Nov, 2011

1 commit

  • The log item ops aren't nessecarily the biggest exploit vector, but marking
    them const is easy enough. Also remove the unused xfs_item_ops_t typedef
    while we're at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Alex Elder

    Christoph Hellwig
     

18 Oct, 2011

1 commit


12 Oct, 2011

5 commits

  • Use xfs_ioerror_alert instead of opencoding a very similar error
    message.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • For each call to xfs_buf_stale we call xfs_buf_delwri_dequeue either
    directly before or after it, or are guaranteed by the surrounding
    conditionals that we are never called on delwri buffers. Simply
    this situation by moving the call to xfs_buf_delwri_dequeue into
    xfs_buf_stale.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Unify the ways we add buffers to the delwri queue by always calling
    xfs_buf_delwri_queue directly. The xfs_bdwrite functions is removed and
    opencoded in its callers, and the two places setting XBF_DELWRI while a
    buffer is locked and expecting xfs_buf_unlock to pick it up are converted
    to call xfs_buf_delwri_queue directly, too. Also replace the
    XFS_BUF_UNDELAYWRITE macro with direct calls to xfs_buf_delwri_dequeue
    to make the explicit queuing/dequeuing more obvious.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • We need to check for pinned buffers even in .iop_pushbuf given that inode
    items flush into the same buffers that may be pinned directly due operations
    on the unlinked inode list operating directly on buffers. To do this add a
    return value to .iop_pushbuf that tells the AIL push about this and use
    the existing log force mechanisms to unpin it.

    Signed-off-by: Christoph Hellwig
    Reported-by: Stefan Priebe
    Tested-by: Stefan Priebe
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

11 Aug, 2011

1 commit

  • xfs: fix for hang during synchronous buffer write error

    If removed storage while synchronous buffer write underway,
    "xfslogd" hangs.

    Detailed log http://oss.sgi.com/archives/xfs/2011-07/msg00740.html

    Related work bfc60177f8ab509bc225becbb58f7e53a0e33e81
    "xfs: fix error handling for synchronous writes"

    Given that xfs_bwrite actually does the shutdown already after
    waiting for the b_iodone completion and given that we actually
    found that calling xfs_force_shutdown from inside
    xfs_buf_iodone_callbacks was a major contributor the problem
    it better to drop this call.

    Signed-off-by: Ajeet Yadav
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Ajeet Yadav
     

26 Jul, 2011

8 commits


13 Jul, 2011

3 commits


08 Jul, 2011

1 commit

  • Rename xfs_buf_cond_lock and reverse it's return value to fit most other
    trylock operations in the Kernel and XFS (with the exception of down_trylock,
    after which xfs_buf_cond_lock was modelled), and replace xfs_buf_lock_val
    with an xfs_buf_islocked for use in asserts, or and opencoded variant in
    tracing. remove the XFS_BUF_* wrappers for all the locking helpers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

31 Mar, 2011

1 commit


07 Mar, 2011

1 commit


28 Jan, 2011

1 commit

  • After test 139, kmemleak shows:

    unreferenced object 0xffff880078b405d8 (size 400):
    comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s)
    hex dump (first 32 bytes):
    60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff `..y....`..y....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x2d/0x60
    [] kmem_cache_alloc+0x13f/0x2b0
    [] kmem_zone_alloc+0x77/0xf0
    [] kmem_zone_zalloc+0x1e/0x50
    [] xfs_efi_init+0x4b/0xb0
    [] xfs_trans_get_efi+0x58/0x90
    [] xfs_bmap_finish+0x8b/0x1d0
    [] xfs_itruncate_finish+0x2c4/0x5d0
    [] xfs_setattr+0x8df/0xa70
    [] xfs_vn_setattr+0x1b/0x20
    [] notify_change+0x170/0x2e0
    [] do_truncate+0x66/0xa0
    [] sys_ftruncate+0xdb/0xe0
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    The cause of the leak is that the "remove" parameter of IOP_UNPIN()
    is never set when a CIL push is aborted. This means that the EFI
    item is never freed if it was in the push being cancelled. The
    problem is specific to delayed logging, but has uncovered a couple
    of problems with the handling of IOP_UNPIN(remove).

    Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN()
    in the CIL commit failure path or the iclog write failure path
    because for delayed loging we have no transaction context. Hence we
    must only call xfs_trans_del_item() if the log item being unpinned
    has an active log item descriptor.

    Secondly, xfs_trans_uncommit() does not handle log item descriptor
    freeing during the traversal of log items on a transaction. It can
    reference a freed log item descriptor when unpinning an EFI item.
    Hence it needs to use a safe list traversal method to allow items to
    be removed from the transaction during IOP_UNPIN().

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     

12 Jan, 2011

1 commit

  • If we get an IO error on a synchronous superblock write, we attach an
    error release function to it so that when the last reference goes away
    the release function is called and the buffer is invalidated and
    unlocked. The buffer is left locked until the release function is
    called so that other concurrent users of the buffer will be locked out
    until the buffer error is fully processed.

    Unfortunately, for the superblock buffer the filesyetm itself holds a
    reference to the buffer which prevents the reference count from
    dropping to zero and the release function being called. As a result,
    once an IO error occurs on a sync write, the buffer will never be
    unlocked and all future attempts to lock the buffer will hang.

    To make matters worse, this problems is not unique to such buffers;
    if there is a concurrent _xfs_buf_find() running, the lookup will grab
    a reference to the buffer and then wait on the buffer lock, preventing
    the reference count from ever falling to zero and hence unlocking the
    buffer.

    As such, the whole b_relse function implementation is broken because it
    cannot rely on the buffer reference count falling to zero to unlock the
    errored buffer. The synchronous write error path is the only path that
    uses this callback - it is used to ensure that the synchronous waiter
    gets the buffer error before the error state is cleared from the buffer
    by the release function.

    Given that the only sychronous buffer writes now go through xfs_bwrite
    and the error path in question can only occur for a write of a dirty,
    logged buffer, we can move most of the b_relse processing to happen
    inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
    In addition to that we make sure the error is not cleared in
    xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
    Given that xfs_bwrite keeps the buffer locked until it has waited for
    it and checked the error this allows to reliably propagate the error
    to the caller, and make sure that the buffer is reliably unlocked.

    Given that xfs_buf_iodone_callbacks was the only instance of the
    b_relse callback we can remove it entirely.

    Based on earlier patches by Dave Chinner and Ajeet Yadav.

    Signed-off-by: Christoph Hellwig
    Reported-by: Ajeet Yadav
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

03 Dec, 2010

1 commit

  • To allow buffer iodone callbacks to consume multiple items off the
    callback list, first we need to convert the xfs_buf_do_callbacks()
    to consume items and always pull the next item from the head of the
    list.

    The means the item list walk is never dependent on knowing the
    next item on the list and hence allows callbacks to remove items
    from the list as well. This allows callbacks to do bulk operations
    by scanning the list for identical callbacks, consuming them all
    and then processing them in bulk, negating the need for multiple
    callbacks of that type.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

19 Oct, 2010

2 commits


27 Jul, 2010

8 commits

  • The b_strat callback is used by xfs_buf_iostrategy to perform additional
    checks before submitting a buffer. It is used in xfs_bwrite and when
    writing out delayed buffers. In xfs_bwrite it we can de-virtualize the
    call easily as b_strat is set a few lines above the call to
    xfs_buf_iostrategy. For the delayed buffers the rationale is a bit
    more complicated:

    - there are three callers of xfs_buf_delwri_queue, which places buffers
    on the delwri list:
    (1) xfs_bdwrite - this sets up b_strat, so it's fine
    (2) xfs_buf_iorequest. None of the callers can have XBF_DELWRI set:
    - xlog_bdstrat is only used for log buffers, which are never delwri
    - _xfs_buf_read explicitly clears the delwri flag
    - xfs_buf_iodone_work retries log buffers only
    - xfsbdstrat - only used for reads, superblock writes without the
    delwri flag, log I/O and file zeroing with explicitly allocated
    buffers.
    - xfs_buf_iostrategy - only calls xfs_buf_iorequest if b_strat is
    not set
    (3) xfs_buf_unlock
    - only puts the buffer on the delwri list if the DELWRI flag is
    already set. The DELWRI flag is only ever set in xfs_bwrite,
    xfs_buf_iodone_callbacks, or xfs_trans_log_buf. For
    xfs_buf_iodone_callbacks and xfs_trans_log_buf we require
    an initialized buf item, which means b_strat was set to
    xfs_bdstrat_cb in xfs_buf_item_init.

    Conclusion: we can just get rid of the callback and replace it with
    explicit calls to xfs_bdstrat_cb.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • By making this member a void pointer we can get rid of a lot of pointless
    casts.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode
    them in their only callers, just like we did for the inode pinning a while
    ago. Also remove duplicate trace points - the bufitem tracepoints cover
    all the information that is present in a buffer tracepoint.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Stop the function pointer casting madness and give all the li_cb instances
    correct prototype.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Stop the function pointer casting madness and give all the xfs_item_ops the
    correct prototypes.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • The unpin_remove item operation instances always share most of the
    implementation with the respective unpin implementation. So instead
    of keeping two different entry points add a remove flag to the unpin
    operation and share the code more easily.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Currently we track log item descriptor belonging to a transaction using a
    complex opencoded chunk allocator. This code has been there since day one
    and seems to work around the lack of an efficient slab allocator.

    This patch replaces it with dynamically allocated log item descriptors
    from a dedicated slab pool, linked to the transaction by a linked list.

    This allows to greatly simplify the log item descriptor tracking to the
    point where it's just a couple hundred lines in xfs_trans.c instead of
    a separate file. The external API has also been simplified while we're
    at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
    delete items from a transaction have been simplified to the bare minium,
    and the xfs_trans_find_item function is replaced with a direct dereference
    of the li_desc field. All debug code walking the list of log items in
    a transaction is down to a simple list_for_each_entry.

    Note that we could easily use a singly linked list here instead of the
    double linked list from list.h as the fastpath only does deletion from
    sequential traversal. But given that we don't have one available as
    a library function yet I use the list.h functions for simplicity.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Dmapi support was never merged upstream, but we still have a lot of hooks
    bloating XFS for it, all over the fast pathes of the filesystem.

    This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
    support in mainline at least the namespace events can be done much saner
    in the VFS instead of the individual filesystem, so it's not like this
    is much help for future work.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

24 May, 2010

3 commits

  • With delayed logging, we can get inode allocation buffers in the
    same transaction inode unlink buffers. We don't currently mark inode
    allocation buffers in the log, so inode unlink buffers take
    precedence over allocation buffers.

    The result is that when they are combined into the same checkpoint,
    only the unlinked inode chain fields are replayed, resulting in
    uninitialised inode buffers being detected when the next inode
    modification is replayed.

    To fix this, we need to ensure that we do not set the inode buffer
    flag in the buffer log item format flags if the inode allocation has
    not already hit the log. To avoid requiring a change to log
    recovery, we really need to make this a modification that relies
    only on in-memory sate.

    We can do this by checking during buffer log formatting (while the
    CIL cannot be flushed) if we are still in the same sequence when we
    commit the unlink transaction as the inode allocation transaction.
    If we are, then we do not add the inode buffer flag to the buffer
    log format item flags. This means the entire buffer will be
    replayed, not just the unlinked fields. We do this while
    CIL flusheѕ are locked out to ensure that we don't race with the
    sequence numbers changing and hence fail to put the inode buffer
    flag in the buffer format flags when we really need to.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Clean up the buffer log format (XFS_BLI_*) flags because they have a
    polluted namespace. They XFS_BLI_ prefix is used for both in-memory
    and on-disk flag feilds, but have overlapping values for different
    flags. Rename the buffer log format flags to use the XFS_BLF_*
    prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed
    flags.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • The buffer log item reference counts used to take referenceѕ for every
    transaction, similar to the pin counting. This is symmetric (like the
    pin/unpin) with respect to transaction completion, but with dleayed logging
    becomes assymetric as the pinning becomes assymetric w.r.t. transaction
    completion.

    To make both cases the same, allow the buffer pinning to take a reference to
    the buffer log item and always drop the reference the transaction has on it
    when being unlocked. This is balanced correctly because the unpin operation
    always drops a reference to the log item. Hence reference counting becomes
    symmetric w.r.t. item pinning as well as w.r.t active transactions and as a
    result the reference counting model remain consistent between normal and
    delayed logging.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

19 May, 2010

2 commits

  • The staleness of a object being unpinned can be directly derived
    from the object itself - there is no need to extract it from the
    object then pass it as a parameter into IOP_UNPIN().

    This means we can kill the XFS_LID_BUF_STALE flag - it is set,
    checked and cleared in the same places XFS_BLI_STALE flag in the
    xfs_buf_log_item so it is now redundant and hence safe to remove.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • Each log item type does manual initialisation of the log item.
    Delayed logging introduces new fields that need initialisation, so
    factor all the open coded initialisation into a common function
    first.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner