02 Oct, 2014

5 commits

  • xfs_buf_read_uncached() has two failure modes. If can either return
    NULL or bp->b_error != 0 depending on the type of failure, and not
    all callers check for both. Fix it so that xfs_buf_read_uncached()
    always returns the error status, and the buffer is returned as a
    function parameter. The buffer will only be returned on success.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • There is a lot of cookie-cutter code that looks like:

    if (shutdown)
    handle buffer error
    xfs_buf_iorequest(bp)
    error = xfs_buf_iowait(bp)
    if (error)
    handle buffer error

    spread through XFS. There's significant complexity now in
    xfs_buf_iorequest() to specifically handle this sort of synchronous
    IO pattern, but there's all sorts of nasty surprises in different
    error handling code dependent on who owns the buffer references and
    the locks.

    Pull this pattern into a single helper, where we can hide all the
    synchronous IO warts and hence make the error handling for all the
    callers much saner. This removes the need for a special extra
    reference to protect IO completion processing, as we can now hold a
    single reference across dispatch and waiting, simplifying the sync
    IO smeantics and error handling.

    In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
    make it explicitly handle on asynchronous IO. This forces all users
    to be switched specifically to one interface or the other and
    removes any ambiguity between how the interfaces are to be used. It
    also means that xfs_buf_iowait() goes away.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • There is only one caller now - xfs_trans_read_buf_map() - and it has
    very well defined call semantics - read, synchronous, and b_iodone
    is NULL. Hence it's pretty clear what error handling is necessary
    for this case. The bigger problem of untangling
    xfs_trans_read_buf_map error handling is left to a future patch.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Currently the report of a bio error from completion
    immediately marks the buffer with an error. The issue is that this
    is racy w.r.t. synchronous IO - the submitter can see b_error being
    set before the IO is complete, and hence we cannot differentiate
    between submission failures and completion failures.

    Add an internal b_io_error field protected by the b_lock to catch IO
    completion errors, and only propagate that to the buffer during
    final IO completion handling. Hence we can tell in xfs_buf_iorequest
    if we've had a submission failure bey checking bp->b_error before
    dropping our b_io_remaining reference - that reference will prevent
    b_io_error values from being propagated to b_error in the event that
    completion races with submission.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • We do some work in xfs_buf_ioend, and some work in
    xfs_buf_iodone_work, but much of that functionality is the same.
    This work can all be done in a single function, leaving
    xfs_buf_iodone just a wrapper to determine if we should execute it
    by workqueue or directly. hence rename xfs_buf_iodone_work to
    xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
    need async processing.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

25 Jun, 2014

1 commit

  • Convert all the errors the core XFs code to negative error signs
    like the rest of the kernel and remove all the sign conversion we
    do in the interface layers.

    Errors for conversion (and comparison) found via searches like:

    $ git grep " E" fs/xfs
    $ git grep "return E" fs/xfs
    $ git grep " E[A-Z].*;$" fs/xfs

    Negation points found via searches like:

    $ git grep "= -[a-z,A-Z]" fs/xfs
    $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
    $ git grep " -[a-z].*;" fs/xfs

    [ with some bits I missed from Brian Foster ]

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

10 Jun, 2014

1 commit


06 Jun, 2014

1 commit

  • Most of the callers are just calling ASSERT(!xfs_buf_geterror())
    which means they are checking for bp->b_error == 0. If bp is null in
    this case, we will assert fail, and hence it's no different in
    result to oopsing because of a null bp. In some cases, errors have
    already been checked for or the function returning the buffer can't
    return a buffer with an error, so it's just a redundant assert.
    Either way, the assert can either be removed.

    The other two non-assert callers can just test for a buffer and
    error properly.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

14 Apr, 2014

2 commits


27 Feb, 2014

2 commits


25 Jan, 2014

3 commits

  • Some time ago, mkfs.xfs started picking the storage physical
    sector size as the default filesystem "sector size" in order
    to avoid RMW costs incurred by doing IOs at logical sector
    size alignments.

    However, this means that for a filesystem made with i.e.
    a 4k sector size on an "advanced format" 4k/512 disk,
    512-byte direct IOs are no longer allowed. This means
    that XFS has essentially turned this AF drive into a hard
    4K device, from the filesystem on up.

    XFS's mkfs-specified "sector size" is really just controlling
    the minimum size & alignment of filesystem metadata.

    There is no real need to tightly couple XFS's minimal
    metadata size to the minimum allowed direct IO size;
    XFS can continue doing metadata in optimal sizes, but
    still allow smaller DIOs for apps which issue them,
    for whatever reason.

    This patch adds a new field to the xfs_buftarg, so that
    we now track 2 sizes:

    1) The metadata sector size, which is the minimum unit and
    alignment of IO which will be performed by metadata operations.
    2) The device logical sector size

    The first is used internally by the file system for metadata
    alignment and IOs.
    The second is used for the minimum allowed direct IO alignment.

    This has passed xfstests on filesystems made with 4k sectors,
    including when run under the patch I sent to ignore
    XFS_IOC_DIOINFO, and issue 512 DIOs anyway. I also directly
    tested end of block behavior on preallocated, sparse, and
    existing files when we do a 512 IO into a 4k file on a
    4k-sector filesystem, to be sure there were no unexpected
    behaviors.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Eric Sandeen
     
  • In preparation for adding new members to the structure,
    give these old ones more descriptive names:

    bt_ssize -> bt_meta_sectorsize
    bt_smask -> bt_meta_sectormask

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Eric Sandeen
     
  • Clean up the xfs_buftarg structure a bit:
    - remove bt_bsize which is never used
    - replace bt_sshift with bt_ssize; we only ever shift it back

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Eric Sandeen
     

17 Dec, 2013

2 commits

  • If we are doing aysnc writeback of metadata, we can get write errors
    but have nobody to report them to. At the moment, we simply attempt
    to reissue the write from io completion in the hope that it's a
    transient error.

    When it's not a transient error, the buffer is stuck forever in
    this loop, and we cannot break out of it. Eventually, unmount will
    hang because the AIL cannot be emptied and everything goes downhill
    from them.

    To solve this problem, only retry the write IO once before aborting
    it. We don't throw the buffer away because some transient errors can
    last minutes (e.g. FC path failover) or even hours (thin
    provisioned devices that have run out of backing space) before they
    go away. Hence we really want to keep trying until we can't try any
    more.

    Because the buffer was not cleaned, however, it does not get removed
    from the AIL and hence the next pass across the AIL will start IO on
    it again. As such, we still get the "retry forever" semantics that
    we currently have, but we allow other access to the buffer in the
    mean time. Meanwhile the filesystem can continue to modify the
    buffer and relog it, so the IO errors won't hang the log or the
    filesystem.

    Now when we are pushing the AIL, we can see all these "permanent IO
    error" buffers and we can issue a warning about failures before we
    retry the IO. We can also catch these buffers when unmounting an
    issue a corruption warning, too.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that
    handles the case of a shut down filesystem. Most of the users have private,
    uncached buffers that can just be freed in this case, but the complex error
    handling in xfs_bioerror_relse messes up the case when it's called without
    a locked buffer.

    Remove xfsbdstrat and opencode the error handling in the callers. All but
    one can simply return an error and don't need to deal with buffer state,
    and the one caller that cares about the buffer state could do with a major
    cleanup as well, but we'll defer that to later.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

11 Sep, 2013

2 commits

  • In converting the buffer lru lists to use the generic code, the locking
    for marking the buffers as on the dispose list was lost. This results in
    confusion in LRU buffer tracking and acocunting, resulting in reference
    counts being mucked up and filesystem beig unmountable.

    To fix this, introduce an internal buffer spinlock to protect the state
    field that holds the dispose list information. Because there is now
    locking needed around xfs_buf_lru_add/del, and they are used in exactly
    one place each two lines apart, get rid of the wrappers and code the logic
    directly in place.

    Further, the LRU emptying code used on unmount is less than optimal.
    Convert it to use a dispose list as per a normal shrinker walk, and repeat
    the walk that fills the dispose list until the LRU is empty. Thi avoids
    needing to drop and regain the LRU lock for every item being freed, and
    allows the same logic as the shrinker isolate call to be used. Simpler,
    easier to understand.

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Convert the buftarg LRU to use the new generic LRU list and take advantage
    of the functionality it supplies to make the buffer cache shrinker node
    aware.

    Signed-off-by: Glauber Costa
    Signed-off-by: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     

17 Jan, 2013

1 commit

  • Commits starting at 77c1a08 introduced a multiple segment support
    to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment
    buffer in the transaction because it was looking at the single segment
    block number rather than the multi-segment b_maps[0].bm.bn. This
    results on a recursive buffer lock that can never be satisfied.

    This patch:
    1) Changed the remaining b_map accesses to be b_maps[0] accesses.
    2) Renames the single segment b_map structure to __b_map to avoid
    future confusion.

    Signed-off-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Mark Tinguely
     

16 Nov, 2012

3 commits

  • To separate the verifiers from iodone functions and associate read
    and write verifiers at the same time, introduce a buffer verifier
    operations structure to the xfs_buf.

    This avoids the need for assigning the write verifier, clearing the
    iodone function and re-running ioend processing in the read
    verifier, and gets rid of the nasty "b_pre_io" name for the write
    verifier function pointer. If we ever need to, it will also be
    easier to add further content specific callbacks to a buffer with an
    ops structure in place.

    We also avoid needing to export verifier functions, instead we
    can simply export the ops structures for those that are needed
    outside the function they are defined in.

    This patch also fixes a directory block readahead verifier issue
    it exposed.

    This patch also adds ops callbacks to the inode/alloc btree blocks
    initialised by growfs. These will need more work before they will
    work with CRCs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Add a callback to the buffer write path to enable verification of
    the buffer and CRC calculation prior to issuing the write to the
    underlying storage.

    If the callback function detects some kind of failure or error
    condition, it must mark the buffer with an error so that the caller
    can take appropriate action. In the case of xfs_buf_ioapply(), a
    corrupt metadta buffer willt rigger a shutdown of the filesystem,
    because something is clearly wrong and we can't allow corrupt
    metadata to be written to disk.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Add a verifier function callback capability to the buffer read
    interfaces. This will be used by the callers to supply a function
    that verifies the contents of the buffer when it is read from disk.
    This patch does not provide callback functions, but simply modifies
    the interfaces to allow them to be called.

    The reason for adding this to the read interfaces is that it is very
    difficult to tell fom the outside is a buffer was just read from
    disk or whether we just pulled it out of cache. Supplying a callbck
    allows the buffer cache to use it's internal knowledge of the buffer
    to execute it only when the buffer is read from disk.

    It is intended that the verifier functions will mark the buffer with
    an EFSCORRUPTED error when verification fails. This allows the
    reading context to distinguish a verification error from an IO
    error, and potentially take further actions on the buffer (e.g.
    attempt repair) based on the error reported.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     

30 Aug, 2012

1 commit

  • While xfs_buftarg_shrink() is freeing buffers from the dispose list (filled with
    buffers from lru list), there is a possibility to have xfs_buf_stale() racing
    with it, and removing buffers from dispose list before xfs_buftarg_shrink() does
    it.

    This happens because xfs_buftarg_shrink() handle the dispose list without
    locking and the test condition in xfs_buf_stale() checks for the buffer being in
    *any* list:

    if (!list_empty(&bp->b_lru))

    If the buffer happens to be on dispose list, this causes the buffer counter of
    lru list (btp->bt_lru_nr) to be decremented twice (once in xfs_buftarg_shrink()
    and another in xfs_buf_stale()) causing a wrong account usage of the lru list.

    This may cause xfs_buftarg_shrink() to return a wrong value to the memory
    shrinker shrink_slab(), and such account error may also cause an underflowed
    value to be returned; since the counter is lower than the current number of
    items in the lru list, a decrement may happen when the counter is 0, causing
    an underflow on the counter.

    The fix uses a new flag field (and a new buffer flag) to serialize buffer
    handling during the shrink process. The new flag field has been designed to use
    btp->bt_lru_lock/unlock instead of xfs_buf_lock/unlock mechanism.

    dchinner, sandeen, aquini and aris also deserve credits for this.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Ben Myers
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Carlos Maiolino
     

14 Jul, 2012

1 commit

  • xfs_bdstrat_cb only adds a check for a shutdown filesystem over
    xfs_buf_iorequest, but xfs_buf_iodone_callbacks just checked for a shut down
    filesystem a little earlier. In addition the shutdown handling in
    xfs_bdstrat_cb is not very suitable for this caller.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

02 Jul, 2012

3 commits

  • With the internal interfaces supporting discontiguous buffer maps,
    add external lookup, read and get interfaces so they can start to be
    used.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • While the external interface currently uses separate blockno/length
    variables, we need to move internal interfaces to passing and
    parsing vector maps. This will then allow us to add external
    interfaces to support discontiguous buffer maps as the internal code
    will already support them.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • To support discontiguous buffers in the buffer cache, we need to
    separate the cache index variables from the I/O map. While this is
    currently a 1:1 mapping, discontiguous buffer support will break
    this relationship.

    However, for caching purposes, we can still treat them the same as a
    contiguous buffer - the block number of the first block and the
    length of the buffer - as that is still a unique representation.
    Also, the only way we will ever access the discontiguous regions of
    buffers is via bulding the complete buffer in the first place, so
    using the initial block number and entire buffer length is a sane
    way to index the buffers.

    Add a block mapping vector construct to the xfs_buf and use it in
    the places where we are doing IO instead of the current
    b_bn/b_length variables.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     

15 May, 2012

10 commits

  • Rather than specifying XBF_MAPPED for almost all buffers, introduce
    XBF_UNMAPPED for the couple of users that use unmapped buffers.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Just about all callers of xfs_buf_read() and xfs_buf_get() use XBF_DONTBLOCK.
    This is used to make memory allocation use GFP_NOFS rather than GFP_KERNEL to
    avoid recursion through memory reclaim back into the filesystem.

    All the blocking get calls in growfs occur inside a transaction, even though
    they are no part of the transaction, so all allocation will be GFP_NOFS due to
    the task flag PF_TRANS being set. The blocking read calls occur during log
    recovery, so they will probably be unaffected by converting to GFP_NOFS
    allocations.

    Hence make XBF_DONTBLOCK behaviour always occur for buffers and kill the flag.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Buffers are always returned locked from the lookup routines. Hence
    we don't need to tell the lookup routines to return locked buffers,
    on to try and lock them. Remove XBF_LOCK from all the callers and
    from internal buffer cache usage.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • xfs_buf_btoc and friends are simple macros that do basic block
    to page index conversion and vice versa. These aren't widely used,
    and we use open coded masking and shifting everywhere else. Hence
    remove the macros and open code the work they do.

    Also, use of PAGE_CACHE_{SIZE|SHIFT|MASK} for these macros is now
    incorrect - we are using pages directly and not the page cache, so
    use PAGE_{SIZE|MASK|SHIFT} instead.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Now that we pass block counts everywhere, and index buffers by block
    number and length in units of blocks, convert the desired IO size
    into block counts rather than bytes. Convert the code to use block
    counts, and those that need byte counts get converted at the time of
    use.

    Rename the b_desired_count variable to something closer to it's
    purpose - b_io_length - as it is only used to specify the length of
    an IO for a subset of the buffer. The only time this is used is for
    log IO - both writing iclogs and during log recovery. In all other
    cases, the b_io_length matches b_length, and hence a lot of code
    confuses the two. e.g. the buf item code uses the io count
    exclusively when it should be using the buffer length. Fix these
    apprpriately as they are found.

    Also, remove the XFS_BUF_{SET_}COUNT() macros that are just wrappers
    around the desired IO length. They only serve to make the code
    shouty loud, don't actually add any real value, and are often used
    incorrectly.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Now that we pass block counts everywhere, and index buffers by block
    number, track the length of the buffer in units of blocks rather
    than bytes. Convert the code to use block counts, and those that
    need byte counts get converted at the time of use.

    Also, remove the XFS_BUF_{SET_}SIZE() macros that are just wrappers
    around the buffer length. They only serve to make the code shouty
    loud and don't actually add any real value.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Seeing as we pass block numbers around everywhere in the buffer
    cache now, it makes no sense to index everything by byte offset.
    Replace all the byte offset indexing with block number based
    indexing, and replace all uses of the byte offset with direct
    conversion from the block index.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The xfs_buf_get/read API is not consistent in the units it uses, and
    does not use appropriate or consistent units/types for the
    variables.

    Convert the API to use disk addresses and block counts for all
    buffer get and read calls. Use consistent naming for all the
    functions and their declarations, and convert the internal functions
    to use disk addresses and block counts to avoid need to convert them
    from one type to another and back again.

    Fix all the callers to use disk addresses and block counts. In many
    cases, this removes an additional conversion from the function call
    as the callers already have a block count.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • If we call xfs_buf_iowait() on a buffer that failed dispatch due to
    an IO error, it will wait forever for an Io that does not exist.
    This is hndled in xfs_buf_read, but there is other code that calls
    xfs_buf_iowait directly that doesn't.

    Rather than make the call sites have to handle checking for dispatch
    errors and then checking for completion errors, make
    xfs_buf_iowait() check for dispatch errors on the buffer before
    waiting. This means we handle both dispatch and completion errors
    with one set of error handling at the caller sites.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
    and write back the buffers per-process instead of by waking up xfsbufd.

    This is now easily doable given that we have very few places left that write
    delwri buffers:

    - log recovery:
    Only done at mount time, and already forcing out the buffers
    synchronously using xfs_flush_buftarg

    - quotacheck:
    Same story.

    - dquot reclaim:
    Writes out dirty dquots on the LRU under memory pressure. We might
    want to look into doing more of this via xfsaild, but it's already
    more optimal than the synchronous inode reclaim that writes each
    buffer synchronously.

    - xfsaild:
    This is the main beneficiary of the change. By keeping a local list
    of buffers to write we reduce latency of writing out buffers, and
    more importably we can remove all the delwri list promotions which
    were hitting the buffer cache hard under sustained metadata loads.

    The implementation is very straight forward - xfs_buf_delwri_queue now gets
    a new list_head pointer that it adds the delwri buffers to, and all callers
    need to eventually submit the list using xfs_buf_delwi_submit or
    xfs_buf_delwi_submit_nowait. Buffers that already are on a delwri list are
    skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
    list. The biggest change to pass down the buffer list was done to the AIL
    pushing. Now that we operate on buffers the trylock, push and pushbuf log
    item methods are merged into a single push routine, which tries to lock the
    item, and if possible add the buffer that needs writeback to the buffer list.
    This leads to much simpler code than the previous split but requires the
    individual IOP_PUSH instances to unlock and reacquire the AIL around calls
    to blocking routines.

    Given that xfsailds now also handle writing out buffers, the conditions for
    log forcing and the sleep times needed some small changes. The most
    important one is that we consider an AIL busy as long we still have buffers
    to push, and the other one is that we do increment the pushed LSN for
    buffers that are under flushing at this moment, but still count them towards
    the stuck items for restart purposes. Without this we could hammer on stuck
    items without ever forcing the log and not make progress under heavy random
    delete workloads on fast flash storage devices.

    [ Dave Chinner:
    - rebase on previous patches.
    - improved comments for XBF_DELWRI_Q handling
    - fix XBF_ASYNC handling in queue submission (test 106 failure)
    - rename delwri submit function buffer list parameters for clarity
    - xfs_efd_item_push() should return XFS_ITEM_PINNED ]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

29 Mar, 2012

1 commit


17 Dec, 2011

1 commit