22 Jun, 2012

1 commit

  • Revert commit 1307bbd, which uses the s_umount semaphore to provide
    exclusion between xfs_sync_worker and unmount, in favor of shutting down
    the sync worker before freeing the log in xfs_log_unmount. This is a
    cleaner way of resolving the race between xfs_sync_worker and unmount
    than using s_umount.

    Signed-off-by: Ben Myers
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner

    Ben Myers
     

16 May, 2012

1 commit

  • xfs_sync_worker checks the MS_ACTIVE flag in s_flags to avoid doing
    work during mount and unmount. This flag can be cleared by unmount
    after the xfs_sync_worker checks it but before the work is completed.
    This has caused crashes in the completion handler for the dummy
    transaction committed by xfs_sync_worker:

    PID: 27544 TASK: ffff88013544e040 CPU: 3 COMMAND: "kworker/3:0"
    #0 [ffff88016fdff930] machine_kexec at ffffffff810244e9
    #1 [ffff88016fdff9a0] crash_kexec at ffffffff8108d053
    #2 [ffff88016fdffa70] oops_end at ffffffff813ad1b8
    #3 [ffff88016fdffaa0] no_context at ffffffff8102bd48
    #4 [ffff88016fdffaf0] __bad_area_nosemaphore at ffffffff8102c04d
    #5 [ffff88016fdffb40] bad_area_nosemaphore at ffffffff8102c12e
    #6 [ffff88016fdffb50] do_page_fault at ffffffff813afaee
    #7 [ffff88016fdffc60] page_fault at ffffffff813ac635
    [exception RIP: xlog_get_lowest_lsn+0x30]
    RIP: ffffffffa04a9910 RSP: ffff88016fdffd10 RFLAGS: 00010246
    RAX: ffffc90014e48000 RBX: ffff88014d879980 RCX: ffff88014d879980
    RDX: ffff8802214ee4c0 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88016fdffd10 R8: ffff88014d879a80 R9: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802214ee400
    R13: ffff88014d879980 R14: 0000000000000000 R15: ffff88022fd96605
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #8 [ffff88016fdffd18] xlog_state_do_callback at ffffffffa04aa186 [xfs]
    #9 [ffff88016fdffd98] xlog_state_done_syncing at ffffffffa04aa568 [xfs]

    Protect xfs_sync_worker by using the s_umount semaphore at the read
    level to provide exclusion with unmount while work is progressing.

    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Ben Myers
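
    A minimal userspace sketch of the exclusion described in this commit, not
    the kernel patch itself. The toy_* names and the single atomic counter are
    assumptions made for illustration; in the kernel the same pattern would
    presumably be a read trylock on s_umount. The worker holds the lock shared
    for the whole pass, so unmount (the writer) cannot proceed mid-pass, and a
    worker that finds unmount in progress skips the pass.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a superblock. umount_state < 0 means the
 * writer (unmount) holds the lock; > 0 counts active readers. */
struct toy_sb {
    atomic_int umount_state;
    int        work_done;      /* completed sync passes */
};

void toy_sb_init(struct toy_sb *sb)
{
    atomic_init(&sb->umount_state, 0);
    sb->work_done = 0;
}

/* Take the lock shared without sleeping; false if a writer holds it. */
bool toy_read_trylock(struct toy_sb *sb)
{
    int cur = atomic_load(&sb->umount_state);

    while (cur >= 0) {
        /* On failure, cur is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&sb->umount_state, &cur, cur + 1))
            return true;
    }
    return false;
}

void toy_read_unlock(struct toy_sb *sb)
{
    int prev = atomic_fetch_sub(&sb->umount_state, 1);

    assert(prev > 0);
    (void)prev;
}

/* Unmount side: succeeds only when there are no readers or writers. */
bool toy_write_trylock(struct toy_sb *sb)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&sb->umount_state, &expected, -1);
}

void toy_write_unlock(struct toy_sb *sb)
{
    atomic_store(&sb->umount_state, 0);
}

/* Worker: hold the lock across the whole pass instead of checking a
 * flag once and hoping it stays valid. Returns false if the pass was
 * skipped because unmount is in progress. */
bool toy_sync_worker(struct toy_sb *sb)
{
    if (!toy_read_trylock(sb))
        return false;
    sb->work_done++;           /* stand-in for the real sync work */
    toy_read_unlock(sb);
    return true;
}
```

    The 22 Jun entry at the top of this log reverts this approach in favor of
    stopping the worker before the log is freed, so this sketch illustrates
    the intermediate fix only.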
     

15 May, 2012

7 commits

  • With the removal of xfs_rw.h and other changes over time, xfs_bit.h
    is being included in many files that don't actually need it. Clean
    up the includes as necessary.

    Also move the only-used-once xfs_ialloc_find_free() static inline
    function out of a header file that is widely included to reduce
    the number of needless dependencies on xfs_bit.h.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • xfs_trans_ail_delete_bulk() can be called from different contexts so
    if the item is not in the AIL we need different shutdown for each
    context. Pass in the shutdown method needed so the correct action
    can be taken.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
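
    A rough sketch of the idea of passing the shutdown method down to the bulk
    delete routine. Everything here (toy_* names, the enum values, the
    "first reason wins" policy) is hypothetical and does not reproduce the
    kernel's actual flags or signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shutdown reasons; the real kernel flags are not shown. */
enum toy_shutdown {
    TOY_SHUTDOWN_NONE = 0,
    TOY_SHUTDOWN_CORRUPT,       /* in-memory state is inconsistent */
    TOY_SHUTDOWN_LOG_IO_ERROR   /* failure seen from log I/O completion */
};

struct toy_item  { bool in_ail; };
struct toy_mount { enum toy_shutdown shutdown; };

/* Remove items from the AIL and return how many were removed. If an
 * item turns out not to be in the AIL, shut down using the method the
 * caller selected for its context (the first recorded reason is kept). */
size_t toy_ail_delete_bulk(struct toy_mount *mp, struct toy_item **items,
                           size_t nr, enum toy_shutdown how)
{
    size_t removed = 0;

    assert(how != TOY_SHUTDOWN_NONE);
    for (size_t i = 0; i < nr; i++) {
        if (!items[i]->in_ail) {
            if (mp->shutdown == TOY_SHUTDOWN_NONE)
                mp->shutdown = how;
            continue;
        }
        items[i]->in_ail = false;
        removed++;
    }
    return removed;
}
```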
     
  • Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
    and write back the buffers per-process instead of by waking up xfsbufd.

    This is now easily doable given that we have very few places left that write
    delwri buffers:

    - log recovery:
    Only done at mount time, and already forcing out the buffers
    synchronously using xfs_flush_buftarg

    - quotacheck:
    Same story.

    - dquot reclaim:
    Writes out dirty dquots on the LRU under memory pressure. We might
    want to look into doing more of this via xfsaild, but it's already
    more optimal than the synchronous inode reclaim that writes each
    buffer synchronously.

    - xfsaild:
    This is the main beneficiary of the change. By keeping a local list
    of buffers to write we reduce latency of writing out buffers, and
    more importantly we can remove all the delwri list promotions which
    were hitting the buffer cache hard under sustained metadata loads.

    The implementation is very straightforward - xfs_buf_delwri_queue now gets
    a new list_head pointer that it adds the delwri buffers to, and all callers
    need to eventually submit the list using xfs_buf_delwri_submit or
    xfs_buf_delwri_submit_nowait. Buffers that already are on a delwri list are
    skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
    list. The biggest change to pass down the buffer list was done to the AIL
    pushing. Now that we operate on buffers the trylock, push and pushbuf log
    item methods are merged into a single push routine, which tries to lock the
    item, and if possible add the buffer that needs writeback to the buffer list.
    This leads to much simpler code than the previous split but requires the
    individual IOP_PUSH instances to unlock and reacquire the AIL around calls
    to blocking routines.

    Given that xfsailds now also handle writing out buffers, the conditions for
    log forcing and the sleep times needed some small changes. The most
    important one is that we consider an AIL busy as long as we still have buffers
    to push, and the other one is that we do increment the pushed LSN for
    buffers that are under flushing at this moment, but still count them towards
    the stuck items for restart purposes. Without this we could hammer on stuck
    items without ever forcing the log and not make progress under heavy random
    delete workloads on fast flash storage devices.

    [ Dave Chinner:
    - rebase on previous patches.
    - improved comments for XBF_DELWRI_Q handling
    - fix XBF_ASYNC handling in queue submission (test 106 failure)
    - rename delwri submit function buffer list parameters for clarity
    - xfs_efd_item_push() should return XFS_ITEM_PINNED ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
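
    The queue-then-submit pattern from this commit can be sketched in plain C
    as follows. This is an illustration under assumptions, not the XFS code:
    the toy_* types and the singly linked list are made up, and "writing" a
    buffer just sets a flag. It shows the two properties the message calls
    out: the list is owned by the caller, and a buffer that is already queued
    somewhere is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_buf {
    int             blkno;
    bool            delwri_q;   /* already on some delwri list */
    bool            written;
    struct toy_buf *next;
};

/* Caller-owned list, typically an automatic (on-stack) variable. */
struct toy_buf_list {
    struct toy_buf  *head;
    struct toy_buf **tailp;
};

void toy_buf_list_init(struct toy_buf_list *list)
{
    list->head = NULL;
    list->tailp = &list->head;
}

/* Queue bp on the caller's list. Returns false, leaving the list
 * untouched, if bp is already queued on a delwri list. */
bool toy_delwri_queue(struct toy_buf *bp, struct toy_buf_list *list)
{
    if (bp->delwri_q)
        return false;
    bp->delwri_q = true;
    bp->next = NULL;
    *list->tailp = bp;
    list->tailp = &bp->next;
    return true;
}

/* "Write" every queued buffer, empty the list, return the count. */
size_t toy_delwri_submit(struct toy_buf_list *list)
{
    size_t n = 0;
    struct toy_buf *bp = list->head;

    while (bp) {
        struct toy_buf *next = bp->next;

        assert(bp->delwri_q);
        bp->written = true;     /* stand-in for issuing the I/O */
        bp->delwri_q = false;
        bp->next = NULL;
        n++;
        bp = next;
    }
    toy_buf_list_init(list);
    return n;
}
```

    A caller would declare a struct toy_buf_list locally, queue buffers while
    walking its items, and submit once at the end of the pass.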
     
  • Instead of writing the buffer directly from inside xfs_iflush return it to
    the caller and let the caller decide what to do with the buffer. Also
    remove the pincount check in xfs_iflush that all non-blocking callers already
    implement and the now unused flags parameter.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • We already flush dirty inodes through the AIL regularly, so there is no reason
    to have a second thread compete with it and disturb the I/O pattern. We still
    do write inodes when doing a synchronous reclaim from the shrinker or during
    unmount for now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Now that we write back all metadata either synchronously or through
    the AIL we can simply implement metadata freezing in terms of
    emptying the AIL.

    The implementation for this is fairly simple and straightforward:
    A new routine is added that asks the xfsaild to push the AIL to the
    end and waits for it to complete and send a wakeup. The routine will
    then loop if the AIL is not actually empty, and continue to do so
    until the AIL is completely empty.

    We keep an inode reclaim pass in the freeze process to avoid having
    memory pressure have to reclaim inodes that require dirtying the
    filesystem to be reclaimed after the freeze has completed. This
    means we can also treat unmount in the exact same way as freeze.

    As an upside we can now remove the radix tree based inode writeback
    and xfs_unmountfs_writesb.

    [ Dave Chinner:
    - Cleaned up commit message.
    - Added inode reclaim passes back into freeze.
    - Cleaned up wakeup mechanism to avoid the use of a new
    sleep counter variable. ]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
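
    The "push to the end, wait, and loop until empty" routine described in
    this commit has roughly the following shape. This is a toy model under
    stated assumptions: the AIL is reduced to an item count, and per_pass is a
    made-up knob for how many items a single push pass manages to retire.

```c
#include <assert.h>
#include <stddef.h>

/* Toy AIL: an item count plus how many items one push pass retires. */
struct toy_ail {
    size_t nitems;
    size_t per_pass;   /* must be > 0 or the loop cannot make progress */
};

/* One push-to-the-end pass; stands in for asking the AIL thread to
 * push and waiting for its wakeup. */
void toy_ail_push_all(struct toy_ail *ail)
{
    size_t done = ail->nitems < ail->per_pass ? ail->nitems : ail->per_pass;

    ail->nitems -= done;
}

/* Push, wait, and repeat until the AIL is really empty.
 * Returns the number of passes that were needed. */
size_t toy_ail_push_all_sync(struct toy_ail *ail)
{
    size_t passes = 0;

    assert(ail->per_pass > 0);
    while (ail->nitems > 0) {
        toy_ail_push_all(ail);
        passes++;
    }
    return passes;
}
```

    The loop matters because a single pass is not guaranteed to empty the
    list; the caller only returns once the emptiness check itself succeeds.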
     
  • If a filesystem has been forced shutdown we are never going to write inodes
    to disk, which means the inode items will stay in the AIL until we free
    the inode. Currently that is not a problem, but a pending change requires us
    to empty the AIL before shutting down the filesystem. In that case leaving
    the inode in the AIL is lethal. Make sure to remove the log item from the AIL
    to allow emptying the AIL on shutdown filesystems.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

18 Apr, 2012

1 commit

  • Because the mount process can run a quotacheck and consume lots of
    inodes, we need to be able to run periodic inode reclaim during the
    mount process. This will prevent running the system out of memory
    during quota checks.

    This essentially reverts 2bcf6e97, but that is safe to do now that
    the quota sync code that was causing problems during long quotacheck
    executions is now gone.

    The reclaim work is currently protected from running during the
    unmount process by a check against MS_ACTIVE. Unfortunately, this
    also means that the reclaim work cannot run during mount. The
    unmount process should stop the reclaim cleanly before freeing
    anything that the reclaim work depends on, so there is no need to
    have this guard in place.

    Also, the inode reclaim work is demand driven, so there is no need
    to start it immediately during mount. It will be started the moment
    an inode is queued for reclaim, so quotacheck will trigger it just
    fine.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

14 Mar, 2012

1 commit

  • Timestamps on regular files are the last metadata that XFS does not update
    transactionally. Now that we use the delaylog mode exclusively and made
    the log code scale extremely well there is no need to bypass that code for
    timestamp updates. Logging all updates allows us to drop a lot of code, and
    will allow for further performance improvements later on.

    Note that this patch drops optimized handling of fdatasync - it will be
    added back in a separate commit.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

26 Feb, 2012

1 commit

  • At the end of xfs_reclaim_inode(), the inode is locked in order to
    wait for a possible concurrent lookup to complete before the
    inode is freed. This synchronization step was taking both the ILOCK
    and the IOLOCK, but the latter was causing lockdep to produce
    reports of the possibility of deadlock.

    It turns out that there's no need to acquire the IOLOCK at this
    point anyway. It may have been required in some earlier version of
    the code, but there should be no need to take the IOLOCK in
    xfs_iget(), so there's no longer any need to get it here for
    synchronization. Add an assertion in xfs_iget() as a reminder
    of this assumption.

    Dave Chinner diagnosed this on IRC, and Christoph Hellwig suggested
    no longer including the IOLOCK. I just put together the patch.

    Signed-off-by: Alex Elder
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Alex Elder
     

18 Jan, 2012

1 commit

  • We almost never block on i_flock; the exception is synchronous inode
    flushing. Instead of bloating the inode with a 16/24-byte completion
    that we abuse as a semaphore just implement it as a bitlock that uses
    a bit waitqueue for the rare sleeping path. This primarily is a
    tradeoff between a much smaller inode and a faster non-blocking
    path vs faster wakeups, and we are much better off with the former.

    A small downside is that we will lose lockdep checking for i_flock, but
    given that it's always taken inside the ilock that should be acceptable.

    Note that for example the inode writeback locking is implemented in a
    very similar way.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Ben Myers

    Christoph Hellwig
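
    A small sketch of a lock implemented as one bit in an existing flags word,
    as this commit describes. It is written with C11 atomics for userspace;
    the toy_* names and the bit position are hypothetical. Only the
    non-blocking fast path is shown. The rare sleeping path, which the commit
    says uses a bit waitqueue, is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_IFLOCK (1u << 0)    /* hypothetical bit position */

struct toy_inode {
    atomic_uint flags;          /* shared with other per-inode state bits */
};

/* Non-blocking fast path: atomically set the bit and succeed only if
 * it was clear before. */
bool toy_iflock_nowait(struct toy_inode *ip)
{
    return !(atomic_fetch_or(&ip->flags, TOY_IFLOCK) & TOY_IFLOCK);
}

bool toy_isiflocked(struct toy_inode *ip)
{
    return (atomic_load(&ip->flags) & TOY_IFLOCK) != 0;
}

/* Release the lock. A kernel version would also wake any sleeper
 * waiting on this bit; that part is omitted here. */
void toy_ifunlock(struct toy_inode *ip)
{
    unsigned int old = atomic_fetch_and(&ip->flags, ~TOY_IFLOCK);

    assert(old & TOY_IFLOCK);
    (void)old;
}
```

    The lock costs no extra storage beyond a bit in a word the structure
    already carries, which is the size tradeoff the commit message argues for.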
     

09 Jan, 2012

1 commit

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (22 commits)
    xfs: mark the xfssyncd workqueue as non-reentrant
    xfs: simplify xfs_qm_detach_gdquots
    xfs: fix acl count validation in xfs_acl_from_disk()
    xfs: remove unused XBT_FORCE_SLEEP bit
    xfs: remove XFS_QMOPT_DQSUSER
    xfs: kill xfs_qm_idtodq
    xfs: merge xfs_qm_dqinit_core into the only caller
    xfs: add a xfs_dqhold helper
    xfs: simplify xfs_qm_dqattach_grouphint
    xfs: nest qm_dqfrlist_lock inside the dquot qlock
    xfs: flatten the dquot lock ordering
    xfs: implement lazy removal for the dquot freelist
    xfs: remove XFS_DQ_INACTIVE
    xfs: cleanup xfs_qm_dqlookup
    xfs: cleanup dquot locking helpers
    xfs: remove the sync_mode argument to xfs_qm_dqflush_all
    xfs: remove xfs_qm_sync
    xfs: make sure to really flush all dquots in xfs_qm_quotacheck
    xfs: untangle SYNC_WAIT and SYNC_TRYLOCK meanings for xfs_qm_dqflush
    xfs: remove the lid_size field in struct log_item_desc
    ...

    Fix up trivial conflict in fs/xfs/xfs_sync.c

    Linus Torvalds
     

24 Dec, 2011

1 commit

  • Since Linux 2.6.36 the writeback code has introduced various measures for
    live lock prevention during sync(). Unfortunately some of these are
    actively harmful for the XFS model, where the inode gets marked dirty for
    metadata from the data I/O handler.

    Among these are the older_than_this checks, which are now more strictly enforced since

    writeback: avoid livelocking WB_SYNC_ALL writeback

    by only calling into __writeback_inodes_sb and thus only sampling the
    current cut off time once. But on slow enough devices the previous
    asynchronous sync pass might not have fully completed yet, and thus XFS
    might mark metadata dirty only after that sampling of the cut off time for
    the blocking pass already happened. I have not reproduced this
    myself on a real system, but by introducing artificial delay into the
    XFS I/O completion workqueues it can be reproduced easily.

    Fix this by iterating over all XFS inodes in ->sync_fs and log all that
    are dirty. This might log inodes that only got redirtied after the
    previous pass, but given how cheap delayed logging of inodes is it
    isn't a major concern for performance.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Tested-by: Mark Tinguely
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
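
    To make the cut-off problem concrete, here is a toy model contrasting a
    pass that honors a sampled cut-off time with a pass that walks every inode
    and logs whatever is dirty. The struct, the integer timestamps and both
    function names are assumptions for illustration and do not mirror the
    kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
    bool          dirty;
    unsigned long dirtied_when;   /* toy timestamp */
    bool          logged;
};

/* Cut-off based pass: only handles inodes dirtied at or before the
 * time that was sampled when the pass started. */
size_t toy_writeback_older_than(struct toy_inode *inodes, size_t nr,
                                unsigned long older_than_this)
{
    size_t n = 0;

    assert(inodes != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (inodes[i].dirty && inodes[i].dirtied_when <= older_than_this) {
            inodes[i].logged = true;
            inodes[i].dirty = false;
            n++;
        }
    }
    return n;
}

/* sync_fs-style pass: log every dirty inode, whatever its timestamp. */
size_t toy_log_dirty_inodes(struct toy_inode *inodes, size_t nr)
{
    size_t n = 0;

    assert(inodes != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (inodes[i].dirty) {
            inodes[i].logged = true;
            inodes[i].dirty = false;
            n++;
        }
    }
    return n;
}
```

    An inode that is marked dirty after the cut-off was sampled is missed by
    the first function and picked up by the second.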
     

13 Dec, 2011

1 commit

  • Now that we can't have any dirty dquots around that aren't in the AIL we
    can get rid of the explicit dquot syncing from xfssyncd and xfs_fs_sync_fs
    and instead rely on AIL pushing to write out any quota updates.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

30 Nov, 2011

1 commit

  • If we are doing synchronous inode reclaim we block the VM from making
    progress in memory reclaim. So if we encounter a flush locked inode
    promote it in the delwri list and wake up xfsbufd to write it out now.
    Without this we can get hangs of up to 30 seconds during workloads hitting
    synchronous inode reclaim.

    The scheme is copied from what we do for dquot reclaims.

    Reported-by: Simon Kirby
    Signed-off-by: Christoph Hellwig
    Tested-by: Simon Kirby
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

12 Oct, 2011

3 commits

  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • We now have an i_dio_count field and surrounding infrastructure to wait
    for direct I/O completion instead of i_iocount, and we have never needed
    iocount waits for buffered I/O given that we only set the page uptodate
    after finishing all required work. Thus remove i_iocount, and replace
    the actually needed waits with calls to inode_dio_wait.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Remove the xfs_buf_relse from xfs_bwrite and let the caller handle it to
    mirror the delwri and read paths.

    Also remove the mount pointer passed to xfs_bwrite, which is superfluous now
    that we have a mount pointer in the buftarg.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

13 Aug, 2011

1 commit

  • Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
    annoying subdirectories in the XFS source code. Besides the large
    number of file renames the only changes are to the Makefile, a few
    files including headers with the subdirectory prefix, and the binary
    sysctl compat code that includes a header under fs/xfs/ from
    kernel/.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig