25 May, 2011

1 commit

  • Now that we have reliable tracking of deleted extents in a
    transaction, we can easily implement "online" discard support,
    which calls blkdev_issue_discard once a transaction commits.

    The actual discard is a two-stage operation: we first have to
    mark the busy extent as not available for reuse before we can
    start the actual discard (see the sketch after this entry). Note
    that we don't bother supporting discard for the non-delaylog mode.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
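
    A minimal sketch of the two-stage discard described above. The
    busy-extent structure, its fields and the marking helper are
    illustrative names rather than the real XFS interfaces; the
    blkdev_issue_discard() call uses the 2011-era signature.

        static void
        xlog_discard_busy_extents(struct xfs_mount *mp,
                                  struct list_head *busy_list)
        {
            struct xfs_busy_extent  *busyp;

            list_for_each_entry(busyp, busy_list, list) {
                /* Stage 1: mark the extent so it cannot be reused. */
                if (!xfs_busy_extent_mark_discarding(busyp))
                    continue;   /* extent already reused, skip it */

                /* Stage 2: discard the extent's range on disk. */
                blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
                        XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
                        XFS_FSB_TO_BB(mp, busyp->length),
                        GFP_NOFS, 0);
            }
        }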
     

29 Apr, 2011

1 commit


28 Jan, 2011

1 commit

  • Failure to commit a transaction into the CIL is not handled
    correctly. This currently can only happen when racing with a
    shutdown and requires an explicit shutdown check, so it is rare and can
    be avoided. Remove the shutdown check and make the CIL commit a void
    function to indicate it will always succeed, thereby removing the
    incorrectly handled failure case.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

27 Jan, 2011

1 commit

  • The kmemleak detector shows this after test 139:

    unreferenced object 0xffff880079b88bb0 (size 264):
    comm "xfs_io", pid 4904, jiffies 4294909382 (age 276.824s)
    hex dump (first 32 bytes):
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
    ff ff ff ff ff ff ff ff 48 7b c9 82 ff ff ff ff ........H{......
    backtrace:
    [] kmemleak_alloc+0x2d/0x60
    [] kmem_cache_alloc+0x13f/0x2b0
    [] kmem_zone_alloc+0x77/0xf0
    [] kmem_zone_zalloc+0x1e/0x50
    [] xlog_ticket_alloc+0x34/0x170
    [] xlog_cil_push+0xa4/0x3f0
    [] xlog_cil_force_lsn+0x15a/0x160
    [] _xfs_log_force_lsn+0x75/0x2d0
    [] _xfs_trans_commit+0x2bd/0x2f0
    [] xfs_iomap_write_allocate+0x1ad/0x350
    [] xfs_map_blocks+0x21f/0x370
    [] xfs_vm_writepage+0x1c7/0x550
    [] __writepage+0x1a/0x50
    [] write_cache_pages+0x1c2/0x4c0
    [] generic_writepages+0x27/0x30
    [] xfs_vm_writepages+0x5d/0x80

    By inspection, the leak occurs when xlog_write() returns an error
    and we jump to the abort path without dropping the reference on
    the active ticket (see the sketch after this entry).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
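
    A minimal sketch of the fix, following the shape of xlog_cil_push()'s
    error handling in that era; the local names and labels are
    illustrative, and the xfs_log_ticket_put() on the abort path is the
    point of the change.

        error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0);
        if (error)
            goto out_abort_free_ticket;

        /* ... write the commit record and release the ticket as normal ... */

    out_abort_free_ticket:
        xfs_log_ticket_put(tic);    /* previously missing: the ticket leaked */
    out_abort:
        xlog_cil_committed(ctx, XFS_LI_ABORTED);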
     

21 Dec, 2010

1 commit

  • The log grant queues are one of the few places left using sv_t
    constructs for waiting. Given that we are touching this code, we
    should convert them to plain wait queues (see the sketch after
    this entry). While there, convert all the other sv_t users in the
    log code as well.

    Seeing as this removes the last users of the sv_t type, remove the
    header file defining the wrapper and the fragments that still
    reference it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
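
    The conversion replaces the sv_t wrapper with the stock wait queue
    primitives. A helper in the spirit of the one this change introduces
    might look like the sketch below; the l_flush_wait field is only an
    example waiter.

        /* Sleep on a wait queue, dropping a spinlock the caller holds. */
        static inline void
        xlog_wait(wait_queue_head_t *wq, spinlock_t *lock)
        {
            DECLARE_WAITQUEUE(wait, current);

            add_wait_queue_exclusive(wq, &wait);
            __set_current_state(TASK_UNINTERRUPTIBLE);
            spin_unlock(lock);
            schedule();
            remove_wait_queue(wq, &wait);
        }

        /* Wakeups become plain wait queue calls, e.g.: */
        wake_up_all(&log->l_flush_wait);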
     

20 Dec, 2010

1 commit

  • When inserting items into the AIL from the transaction committed
    callbacks, we take the AIL lock for every single item that is to be
    inserted. For a CIL checkpoint commit, this can be tens of thousands
    of individual inserts, yet almost all of the items will be inserted
    at the same point in the AIL because they have the same index.

    To reduce the overhead and contention on the AIL lock for such
    operations, introduce a "bulk insert" operation which allows a list
    of log items with the same LSN to be inserted in a single operation
    via a list splice (sketched after this entry). To do this, we need
    to pre-sort the log items being committed into a temporary list for
    insertion.

    The complexity is that not every log item will end up with the same
    LSN, and not every item is actually inserted into the AIL. Items
    that don't match the commit LSN will be inserted and unpinned as per
    the current one-at-a-time method (relatively rare), while items that
    are not to be inserted will be unpinned and freed immediately. Items
    that are to be inserted at the given commit LSN are placed in a
    temporary array and inserted into the AIL in bulk each time the
    array fills up.

    As a result of this, we trade off AIL hold time for a significant
    reduction in lock traffic. lock_stat output shows that the worst
    case hold time is unchanged, but contention from AIL inserts drops
    by an order of magnitude and the number of lock traversals
    decreases significantly.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
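
    A minimal sketch of the bulk insert idea: items sharing the commit
    LSN are linked onto a private list and spliced into the AIL under a
    single lock acquisition. xfs_ail_find_insert_point() is a
    hypothetical helper returning the first AIL item with a higher LSN,
    and the function name is illustrative rather than the exact
    interface added by the commit.

        void
        xfs_trans_ail_insert_bulk(
            struct xfs_ail      *ailp,
            struct xfs_log_item **log_items,
            int                 nr_items,
            xfs_lsn_t           lsn)
        {
            struct xfs_log_item *next;
            LIST_HEAD(tmp);
            int                 i;

            /* Build an ordered private list outside the AIL lock. */
            for (i = 0; i < nr_items; i++) {
                log_items[i]->li_lsn = lsn;
                list_add_tail(&log_items[i]->li_ail, &tmp);
            }

            /* One lock round trip splices the whole batch into place. */
            spin_lock(&ailp->xa_lock);
            next = xfs_ail_find_insert_point(ailp, lsn);  /* hypothetical */
            if (next)
                list_splice_tail(&tmp, &next->li_ail);    /* before 'next' */
            else
                list_splice_tail(&tmp, &ailp->xa_ail);    /* at the AIL tail */
            spin_unlock(&ailp->xa_lock);
        }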
     

19 Oct, 2010

1 commit

  • When committing a transaction, we do a CIL state lock round trip
    on every single log vector we insert into the CIL. This results
    in the lock being as hot as the inode and dcache locks on 8-way
    create workloads. Rework the insertion loops to bring the number
    of lock round trips down to one per transaction for the log
    vectors, plus one more for the busy extents (see the sketch after
    this entry).

    Also change the allocation of the log vector buffer not to zero it
    as we copy over the entire allocated buffer anyway.

    This patch also includes a structural cleanup to the CIL item
    insertion provided by Christoph Hellwig.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
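
    Roughly, the insertion path changes from taking the CIL lock once
    per log vector to taking it once per transaction. A sketch of the
    shape of the change, with xlog_cil_insert_item() as an illustrative
    helper and the other names following the style of the CIL code:

        /* Before: one lock round trip per log vector. */
        for (lv = log_vector; lv; lv = lv->lv_next) {
            spin_lock(&cil->xc_cil_lock);
            xlog_cil_insert_item(cil, lv);
            spin_unlock(&cil->xc_cil_lock);
        }

        /* After: one round trip covers every log vector in the commit... */
        spin_lock(&cil->xc_cil_lock);
        for (lv = log_vector; lv; lv = lv->lv_next)
            xlog_cil_insert_item(cil, lv);
        spin_unlock(&cil->xc_cil_lock);

        /* ...plus one more to record the transaction's busy extents. */
        spin_lock(&cil->xc_cil_lock);
        list_splice_init(&tp->t_busy, &ctx->busy_extents);
        spin_unlock(&cil->xc_cil_lock);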
     

29 Sep, 2010

1 commit

  • I have been seeing occasional pauses in transaction throughput up to
    30s long under heavy parallel workloads. The only notable thing was
    that the xfsaild was trying to be active during the pauses, but
    making no progress. It was running exactly 20 times a second (on the
    50ms no-progress backoff), and the number of pushbuf events was
    constant across this time as well. IOWs, the xfsaild appeared to be
    stuck on buffers that it could not push out.

    Further investigation indicated that it was trying to push out inode
    buffers that were pinned and/or locked. The xfsbufd was also getting
    woken at the same frequency (by the xfsaild, no doubt) to push out
    delayed write buffers. The xfsbufd was not making any progress
    because all the buffers in the delwri queue were pinned. This scan-
    and-make-no-progress dance went on in the trace for some seconds,
    before the xfssyncd came along and issued a log force, and then
    things started going again.

    However, I noticed something strange about the log force - there
    were way too many IOs issued. 516 log buffers were written, to be
    exact. That added up to 129MB of log IO, which got me very
    interested because it's almost exactly 25% of the size of the log.
    The delayed logging code is supposed to aggregate the minimum of
    25% of the log or 8MB worth of changes before flushing. That's
    what really puzzled me - why did a log force write 129MB instead
    of only 8MB?

    Essentially what has happened is that no CIL pushes had occurred
    since the previous tail push which cleared out 25% of the log space.
    That caused all the new transactions to block because there wasn't
    log space for them, but they kicked the xfsaild to push the tail.
    However, the xfsaild was not making progress because there were
    buffers it could not lock and flush, and the xfsbufd could not flush
    them because they were pinned. As a result, both the xfsaild and the
    xfsbufd could not move the tail of the log forward without the CIL
    first committing.

    The cause of the problem was that the background CIL push, which
    should happen when 8MB of aggregated changes have been committed, is
    being held off by the concurrent transaction commit load. The
    background push does a down_write_trylock() which will fail if there
    is a concurrent transaction commit holding the push lock in read
    mode. With 8 CPUs all doing transactions as fast as they can, there
    were enough concurrent transaction commits to hold off the background
    push until tail-pushing could no longer free log space, and the halt
    would occur.

    It should be noted that there is no reason why it would halt at 25%
    of log space used by a single CIL checkpoint. This bug could
    definitely violate the "no transaction should be larger than half
    the log" requirement and hence result in corruption if the system
    crashed under heavy load. This sort of bug is exactly the reason why
    delayed logging was tagged as experimental....

    The fix is to start blocking background pushes once the threshold
    has been exceeded (see the sketch after this entry). Rework the
    threshold calculations to keep the amount of log space a CIL
    checkpoint can use below the AIL push threshold, avoiding the
    problem completely.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
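
    A sketch of the fixed decision logic, assuming a background limit
    and a higher hard limit plus a hypothetical xlog_cil_push_locked()
    helper; the real code is structured differently but to the same
    effect.

        static void
        xlog_cil_push_background(struct log *log)   /* illustrative */
        {
            struct xfs_cil  *cil = log->l_cilp;

            if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
                return;         /* not enough aggregated change yet */

            if (cil->xc_ctx->space_used < XLOG_CIL_HARD_SPACE_LIMIT(log)) {
                /* Opportunistic: back off if a commit holds the lock. */
                if (!down_write_trylock(&cil->xc_ctx_lock))
                    return;
            } else {
                /* Over the hard limit: block until the push can run. */
                down_write(&cil->xc_ctx_lock);
            }

            xlog_cil_push_locked(log);  /* hypothetical: push, lock held */
            up_write(&cil->xc_ctx_lock);
        }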
     

24 Aug, 2010

3 commits

  • Formatting items requires memory allocation when using delayed
    logging. Currently that memory allocation is done while holding the
    CIL context lock in read mode. This means that if memory allocation
    takes some time (e.g. enters reclaim), we cannot push on the CIL
    until the allocation(s) required by formatting complete. This can
    stall CIL pushes for some time, and once a push is stalled so are
    all new transaction commits.

    Fix this by splitting the item formatting into two steps. The
    first step, which does the allocation and memcpy() into the
    allocated buffer, is now done outside the CIL context lock; only
    the CIL insert is done inside the CIL context lock (see the sketch
    after this entry). This avoids the stall issue.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
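
    A sketch of the split, with illustrative helper and field names:
    the allocation and copy happen with no CIL locks held, and the
    locked section only links the already-formatted vector into the CIL.

        /*
         * Step 1 - no CIL locks held: allocate the buffer and copy the
         * formatted item into it (both can block, e.g. in memory reclaim).
         */
        lv->lv_buf = kmalloc(buf_size, GFP_NOFS);
        xlog_cil_format_into(lv);           /* memcpy() of the iovecs */

        /*
         * Step 2 - the only work done under the CIL locks is linking
         * the formatted vector's log item into the CIL.
         */
        spin_lock(&cil->xc_cil_lock);
        list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
        spin_unlock(&cil->xc_cil_lock);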
     
  • Delayed logging adds some serialisation to the log force process to
    ensure that it does not dereference a bad commit context structure
    when determining if a CIL push is necessary or not. It does this by
    grabbing the CIL context lock exclusively, then dropping it before
    pushing the CIL if necessary. This causes serialisation of all log
    forces and pushes regardless of whether a force is necessary or not.
    As a result, fsync-heavy workloads (like dbench) can be significantly
    slower with delayed logging than without.

    To avoid this penalty, copy the current sequence from the context to
    the CIL structure when they are swapped. This allows us to do
    unlocked checks on the current sequence without having to worry
    about dereferencing context structures that may have already been
    freed. Hence we can remove the CIL context locking in the forcing
    code and only call into the push code if the current context matches
    the sequence we need to force (see the sketch after this entry).

    By passing the sequence into the push code, we can check the
    sequence again once we have the CIL lock held exclusive and abort if
    the sequence has already been pushed. This avoids a lock round-trip
    and unnecessary CIL pushes when we have racing push calls.

    The result is that the regression in dbench performance goes away -
    this change improves dbench performance on a ramdisk from ~2100MB/s
    to ~2500MB/s. This compares favourably to not using delayed logging,
    which returns ~2500MB/s for the same workload.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
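
    The unlocked check might look like the sketch below, where
    xc_current_sequence is the copy of the current sequence kept in the
    CIL itself; the surrounding flow of xlog_cil_force_lsn() is omitted.

        /*
         * The sequence counter only ever moves forward, so if it no
         * longer matches, the sequence we were asked to force has
         * already been pushed and we can go straight to waiting for
         * its commit record.
         */
        if (cil->xc_current_sequence == sequence)
            xlog_cil_push(log, sequence);   /* rechecks under the CIL lock */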
     
  • When we commit a transaction using delayed logging, we need to
    unlock the items in the transaction before we unlock the CIL context
    and allow it to be checkpointed. If we unlock them after we release
    the CIL context lock, the CIL can checkpoint and complete before
    we free the log items. This breaks stale buffer item unlock and
    unpin processing, as there is an implicit assumption that the unlock
    will occur before the unpin.

    Also, some log items need to store the LSN of the transaction commit
    in the item (inodes and EFIs) and so can race with other transaction
    completions if we don't prevent the CIL from checkpointing before
    the unlock occurs.

    Cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

27 Jul, 2010

3 commits


24 May, 2010

3 commits

  • With delayed logging, we can get inode allocation buffers in the
    same transaction as inode unlink buffers. We don't currently mark
    inode allocation buffers in the log, so inode unlink buffers take
    precedence over allocation buffers.

    The result is that when they are combined into the same checkpoint,
    only the unlinked inode chain fields are replayed, resulting in
    uninitialised inode buffers being detected when the next inode
    modification is replayed.

    To fix this, we need to ensure that we do not set the inode buffer
    flag in the buffer log item format flags if the inode allocation has
    not already hit the log. To avoid requiring a change to log
    recovery, we really need to make this a modification that relies
    only on in-memory state.

    We can do this by checking during buffer log formatting (while the
    CIL cannot be flushed) if we are still in the same sequence when we
    commit the unlink transaction as the inode allocation transaction.
    If we are, then we do not add the inode buffer flag to the buffer
    log format item flags. This means the entire buffer will be
    replayed, not just the unlinked fields. We do this while
    CIL flushes are locked out to ensure that we don't race with the
    sequence numbers changing and hence fail to put the inode buffer
    flag in the buffer format flags when we really need to.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • If we let the CIL grow without bound, it will grow large enough to
    violate recovery constraints (there must be at least one complete
    transaction in the log at all times) or take forever to write out
    through the log buffers. Hence we need a check during asynchronous
    transaction commits as to whether the CIL needs to be pushed.

    We track the amount of log space the CIL consumes, so it is
    relatively simple to limit it on a pure size basis. Make the limit
    the minimum of just under half the log size (the recovery
    constraint) or 8MB of log space (which is an awful lot of
    metadata). See the sketch after this entry.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
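
    A sketch of the limit described above, assuming an l_logsize field
    holding the log size in bytes; "just under half the log" is
    approximated as half here, and the real macro and push call may
    differ in detail.

        #define XLOG_CIL_SPACE_LIMIT(log)   \
            min((log)->l_logsize >> 1, 8 * 1024 * 1024)

        /* Checked when an asynchronous transaction commits: */
        if (cil->xc_ctx->space_used > XLOG_CIL_SPACE_LIMIT(log))
            xlog_cil_push(log, 0);  /* background push, not a log force */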
     
  • The delayed logging code only changes in-memory structures and as
    such can be enabled and disabled with a mount option. Add the mount
    option and emit a warning that this is an experimental feature that
    should not be used in production yet.

    We also need infrastructure to track committed items that have not
    yet been written to the log. This is what the Committed Item List
    (CIL) is for.

    The log item also needs to be extended to track the current log
    vector, the associated memory buffer and its location in the
    Committed Item List. Extend the log item and log vector structures
    to enable this tracking.

    To maintain the current log format for transactions with delayed
    logging, we need to introduce a checkpoint transaction and a context
    for tracking each checkpoint from initiation to transaction
    completion. This includes adding a log ticket for tracking the log
    space required/used by the checkpoint context.

    To track all the changes we need an io vector array per log item,
    rather than a single array for the entire transaction. Using the new
    log vector structure for this requires two passes - the first to
    allocate the log vector structures and chain them together, and the
    second to fill them out. This log vector chain can then be passed
    to the CIL for formatting, pinning and insertion into the CIL.

    Formatting of the log vector chain is relatively simple - it's just
    a loop over the iovecs on each log vector, but it is made slightly
    more complex because we re-write the iovec after the copy to point
    back at the memory buffer we just copied into (see the sketch after
    this entry).

    This code also needs to pin log items. If the log item is not
    already tracked in this checkpoint context, then it needs to be
    pinned. Otherwise it is already pinned and we don't need to pin it
    again.

    The only other complexity is calculating the amount of new log space
    the formatting has consumed. This needs to be accounted to the
    transaction in progress, and the accounting is made more complex
    because we also need to steal space from it for log metadata in the
    checkpoint transaction. Calculate all this at insert time and update
    all the tickets, counters, etc. correctly.

    Once we've formatted all the log items in the transaction, attach
    the busy extents to the checkpoint context so the busy extents live
    until checkpoint completion and can be processed at that point in
    time. Transactions can then be freed at this point in time.

    Now we need to issue checkpoints - we are tracking the amount of
    log space used by the items in the CIL, so we can trigger
    background checkpoints when the space usage gets to a certain
    threshold. Otherwise, checkpoints need to be triggered when a log
    synchronisation point is reached - a log force event.

    Because the log write code already handles chained log vectors,
    writing the transaction is trivial, too. Construct a transaction
    header, add it to the head of the chain and write it into the log,
    then issue a commit record write. Then we can release the
    checkpoint log ticket and attach the context to the log buffer so
    it can be called during IO completion to complete the checkpoint.

    We also need to allow for synchronising multiple in-flight
    checkpoints. This is needed for two things - the first is to ensure
    that checkpoint commit records appear in the log in the correct
    sequence order (so they are replayed in the correct order). The
    second is so that xfs_log_force_lsn() operates correctly and only
    flushes and/or waits for the specific sequence it was provided with.

    To do this we need a wait variable and a list tracking the
    checkpoint commits in progress. We can walk this list and wait for
    the checkpoints to change state or complete easily, and this
    provides the necessary synchronisation for correct operation in
    both cases.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
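
    A sketch of the formatting loop described above, using field names
    in the style of struct xfs_log_vec and xfs_log_iovec; buffer
    ownership and sizing details are illustrative.

        char    *ptr = lv->lv_buf;  /* buffer allocated for this vector */
        int     i;

        for (i = 0; i < lv->lv_niovecs; i++) {
            struct xfs_log_iovec    *vec = &lv->lv_iovecp[i];

            memcpy(ptr, vec->i_addr, vec->i_len);
            vec->i_addr = ptr;      /* re-point the iovec at the copy */
            ptr += vec->i_len;
        }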