28 May, 2011

1 commit

  • Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

    Conflicts:
    fs/btrfs/disk-io.c
    fs/btrfs/extent-tree.c
    fs/btrfs/free-space-cache.c
    fs/btrfs/inode.c
    fs/btrfs/transaction.c

    Signed-off-by: Chris Mason

    Chris Mason
     

24 May, 2011

4 commits

  • Originally this was going to be used as a way to give hints to the allocator,
    but frankly we can get much better hints elsewhere, and it's not even used at all
    for anything useful. In addition to being completely useless, when we initialize
    an inode we try to find a freeish block group to set as the inode's block group,
    and with a completely full 40GB fs this takes _forever_, so I imagine with, say,
    a 1TB fs this is just unbearable. So just axe the thing altogether; we don't
    need it, and it saves us 8 bytes in the inode and 500 microseconds per
    inode lookup in my testcase. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We use trans_mutex for lots of things; here's a basic list:

    1) To serialize trans_handles joining the currently running transaction
    2) To make sure that no new trans handles are started while we are committing
    3) To protect the dead_roots list and the transaction lists

    Serializing trans_handles joining is really not too hard, and mostly comes down
    to acquiring a reference to the transaction. So replace the trans_mutex with a
    trans_lock spinlock and use it to do the following (a sketch of the join fast
    path follows the list):

    1) Protect fs_info->running_transaction. All trans handles have to do is check
    this, and then take a reference of the transaction and keep on going.
    2) Protect the fs_info->trans_list. This doesn't get used too much; basically
    it just holds the current transactions, which will usually just be the currently
    committing transaction and the currently running transaction at most.
    3) Protect the dead roots list. This is only ever processed by splicing the
    list so this is relatively simple.
    4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
    the trans_mutex before, so this is a pretty straightforward change.
    5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
    the entirety of the commit we need to have a way to block new people from
    creating a new transaction while we're doing our work. So we set no_trans_join
    and in join_transaction we test to see if that is set, and if it is we do a
    wait_on_commit.
    6) Make the transaction use count atomic so we don't need to take locks to
    modify it when we're dropping references.
    7) Add a commit_lock to the transaction to make sure multiple people trying to
    commit the same transaction don't race and commit at the same time.
    8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
    trans.
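
    As a rough illustration, here is a minimal sketch of the join fast path under
    this scheme, using the names from this message; the real join_transaction()
    also has to allocate a fresh transaction when none is running:

        static struct btrfs_transaction *join_transaction(struct btrfs_fs_info *info)
        {
                struct btrfs_transaction *cur;

                spin_lock(&info->trans_lock);
                if (info->no_trans_join) {              /* (5): commit critical section */
                        spin_unlock(&info->trans_lock);
                        return NULL;                    /* caller falls back to wait_on_commit */
                }
                cur = info->running_transaction;        /* (1) */
                if (cur)
                        atomic_inc(&cur->use_count);    /* (6): no further locking needed */
                spin_unlock(&info->trans_lock);
                return cur;
        }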

    I have tested this with xfstests, but obviously it is a pretty hairy change so
    lots of testing is greatly appreciated. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We currently track trans handles in current->journal_info, but we don't actually
    use it. This patch fixes that, covering the case where we have multiple
    people starting transactions down the call chain. It keeps us from having to
    allocate a new handle and all of that; we just increase the use count of the
    current handle, save the old block_rsv, and return. I tested this with xfstests
    and it worked out fine. Thanks,
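
    A hedged sketch of that reuse path (a fragment; orig_rsv is an illustrative
    name for wherever the old block_rsv gets saved):

        struct btrfs_trans_handle *h = current->journal_info;

        if (h) {                             /* nested start: reuse the outer handle */
                h->use_count++;
                h->orig_rsv = h->block_rsv;  /* save the old block_rsv */
                h->block_rsv = NULL;
                return h;
        }
        /* otherwise allocate a new handle and join a transaction as before */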

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I keep forgetting that btrfs_join_transaction() just ignores the num_items
    argument, which leads me to sending pointless patches and looking stupid :). So
    just kill the num_items argument from btrfs_join_transaction and
    btrfs_start_ioctl_transaction, since neither of them uses it. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

21 May, 2011

1 commit

  • Changelog V5 -> V6:
    - Fix oom when the memory load is high, by storing the delayed nodes into the
    root's radix tree, and letting btrfs inodes go.

    Changelog V4 -> V5:
    - Fix the race on adding the delayed node to the inode, which was spotted by
    Chris Mason.
    - Merge Chris Mason's incremental patch into this patch.
    - Fix a deadlock between readdir() and memory fault, which was reported by
    Itaru Kitayama.

    Changelog V3 -> V4:
    - Fix a nested lock, which was reported by Itaru Kitayama, by updating the
    space cache inode in time.

    Changelog V2 -> V3:
    - Fix the race between the delayed worker and the task which does delayed-item
    balancing, which was reported by Tsutomu Itoh.
    - Modify the patch to address David Sterba's comments.
    - Fix the spinlock recursion bug reported by Chris Mason.

    Changelog V1 -> V2:
    - break up the global rb-tree; use a list to manage the delayed nodes,
    which are created for every directory and file, and used to manage the
    delayed directory name index items and the delayed inode item.
    - introduce a worker to deal with the delayed nodes.

    Compared with ext3/4, the performance of file creation and deletion on btrfs
    is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
    such as the inode item, directory name item, directory name index and so on.

    If we can delay some of the b+ tree insertions or deletions, we can improve
    the performance, so we made this patch, which implements delayed directory
    name index insertion/deletion and delayed inode updates.

    Implementation:
    - introduce a delayed root object into the filesystem, which uses two lists to
    manage the delayed nodes which are created for every file/directory.
    One is used to manage all the delayed nodes that have delayed items. The
    other is used to manage the delayed nodes which are waiting to be dealt with
    by the work thread.
    - Every delayed node has two rb-trees: one is used to manage the directory
    name indexes which are going to be inserted into the b+ tree, and the other
    is used to manage the directory name indexes which are going to be deleted
    from the b+ tree.
    - introduce a worker to deal with the delayed operations. This worker is used
    to deal with the delayed directory name index item insertions and deletions
    and the delayed inode updates.
    When the number of delayed items exceeds the lower limit, we create works for
    some delayed nodes, insert them into the work queue of the worker, and then
    go back.
    When the number of delayed items exceeds the upper bound, we create works for
    all the delayed nodes that haven't been dealt with, insert them into the work
    queue of the worker, and then wait until the number of untreated items falls
    below some threshold value.
    - When we want to insert a directory name index into the b+ tree, we just add
    the information into the delayed inserting rb-tree.
    Then we check the number of delayed items and do delayed item balancing.
    (The balance policy is described above.)
    - When we want to delete a directory name index from the b+ tree, we search
    for it in the inserting rb-tree first. If we find it, we just drop it. If not,
    we add its key into the delayed deleting rb-tree.
    As with the delayed inserting rb-tree, we then check the number of delayed
    items and do delayed item balancing.
    - When we want to update the metadata of some inode, we cache the data of the
    inode in the delayed node. The worker will flush it into the b+ tree after
    dealing with the delayed insertions and deletions.
    - We move the delayed node to the tail of the list after we access it. This
    way, we can cache more delayed items and merge more inode updates.
    - If we want to commit the transaction, we deal with all the delayed nodes.
    - The delayed node is freed when we free the btrfs inode.
    - Before we log the inode items, we commit all the directory name index items
    and the delayed inode update.
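
    A rough sketch of the objects described above, based only on this
    description (not the exact kernel definitions):

        struct btrfs_delayed_root {
                struct list_head node_list;     /* nodes that have delayed items */
                struct list_head prepare_list;  /* nodes waiting for the worker */
        };

        struct btrfs_delayed_node {
                struct rb_root ins_root;  /* dir-index items to insert into the b+ tree */
                struct rb_root del_root;  /* dir-index keys to delete from the b+ tree */
                /* plus the cached inode item for the delayed inode update */
        };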

    I did a quick test with the benchmark tool [1] and found that we can improve
    the performance of file creation by ~15%, and file deletion by ~20%.

    Before applying this patch:
    Create files:
    Total files: 50000
    Total time: 1.096108
    Average time: 0.000022
    Delete files:
    Total files: 50000
    Total time: 1.510403
    Average time: 0.000030

    After applying this patch:
    Create files:
    Total files: 50000
    Total time: 0.932899
    Average time: 0.000019
    Delete files:
    Total files: 50000
    Total time: 1.215732
    Average time: 0.000024

    [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

    Many thanks for Kitayama-san's help!

    Signed-off-by: Miao Xie
    Reviewed-by: David Sterba
    Tested-by: Tsutomu Itoh
    Tested-by: Itaru Kitayama
    Signed-off-by: Chris Mason

    Miao Xie
     

12 Apr, 2011

1 commit

  • I've been working on making our O_DIRECT latency not suck, and I noticed we were
    taking the trans_mutex in btrfs_end_transaction. To avoid that, we convert
    num_writers and use_count to atomic_t and just decrement them in
    btrfs_end_transaction. Instead of deleting the transaction from the trans list
    in put_transaction, we do that in btrfs_commit_transaction(), since that's the
    only time it actually needs to be removed from the list. Thanks,
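
    A hedged sketch of the resulting put path (simplified; the real patch does
    a bit more):

        static void put_transaction(struct btrfs_transaction *t)
        {
                WARN_ON(atomic_read(&t->use_count) == 0);
                if (atomic_dec_and_test(&t->use_count)) {
                        /* list removal now happens in btrfs_commit_transaction() */
                        memset(t, 0, sizeof(*t));
                        kmem_cache_free(btrfs_transaction_cachep, t);
                }
        }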

    Signed-off-by: Josef Bacik

    Josef Bacik
     

23 Dec, 2010

1 commit

  • Usage:

    Set BTRFS_SUBVOL_RDONLY in btrfs_ioctl_vol_args_v2->flags, and call
    ioctl(BTRFS_IOC_SNAP_CREATE_V2).
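
    For illustration, a minimal userspace sketch of that call (the header
    location is an assumption; error handling elided):

        #include <string.h>
        #include <sys/ioctl.h>
        #include "ioctl.h"      /* assumed home of the btrfs ioctl definitions */

        /* create a read-only snapshot, named name, of the subvol open at src_fd */
        int snapshot_ro(int destdir_fd, int src_fd, const char *name)
        {
                struct btrfs_ioctl_vol_args_v2 args;

                memset(&args, 0, sizeof(args));
                args.fd = src_fd;
                args.flags = BTRFS_SUBVOL_RDONLY;
                strncpy(args.name, name, BTRFS_SUBVOL_NAME_MAX);
                return ioctl(destdir_fd, BTRFS_IOC_SNAP_CREATE_V2, &args);
        }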

    Implementation:

    - Set readonly bit of btrfs_root_item->flags.
    - Add readonly checks in btrfs_permission (inode_permission),
    btrfs_setattr, btrfs_set/remove_xattr and some ioctls.

    Changelog for v3:

    - Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
    - Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.

    Signed-off-by: Li Zefan

    Li Zefan
     

30 Oct, 2010

2 commits

  • START_SYNC will start a sync/commit, but not wait for it to
    complete. Any modification started after the ioctl returns is
    guaranteed not to be included in the commit. If a non-NULL
    pointer is passed, the transaction id will be returned to
    userspace.

    WAIT_SYNC will wait for any in-progress commit to complete. If a
    transaction id is specified, the ioctl will block and then
    return (success) when the specified transaction has committed.
    If it has already committed when we call the ioctl, it returns
    immediately. If the specified transaction doesn't exist, it
    returns EINVAL.

    If no transaction id is specified, WAIT_SYNC will wait for the
    currently committing transaction to finish its commit to disk.
    If there is no currently committing transaction, it returns
    success.

    These ioctls are useful for applications which want to impose an
    ordering on when fs modifications reach disk, but do not want to
    wait for the full (slow) commit process to do so.

    Picky callers can take the transid returned by START_SYNC and
    feed it to WAIT_SYNC, and be certain to wait only as long as
    necessary for the transaction _they_ started to reach disk.

    Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
    and provided they didn't wait too long between the calls, they
    will get the same result. However, if a second commit starts
    before they call WAIT_SYNC, they may end up waiting longer for
    it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
    guarantees that any operation completed before the START_SYNC
    reaches disk.
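
    A minimal userspace sketch of the picky-caller pattern (the header location
    is an assumption; error handling elided):

        #include <sys/ioctl.h>
        #include <linux/types.h>
        #include "ioctl.h"      /* assumed home of the btrfs ioctl definitions */

        int commit_and_wait(int fs_fd)
        {
                __u64 transid = 0;

                /* start a commit; returns immediately, reporting its transid */
                if (ioctl(fs_fd, BTRFS_IOC_START_SYNC, &transid))
                        return -1;
                /* ... modifications made here may miss this commit ... */
                /* block until exactly that transaction is on disk */
                return ioctl(fs_fd, BTRFS_IOC_WAIT_SYNC, &transid);
        }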

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     
  • Add support for an async transaction commit that is ordered such that any
    subsequent operations will join the following transaction, but does not
    wait until the current commit is fully on disk. This avoids much of the
    latency associated with btrfs_commit_transaction for callers concerned
    with serialization and not safety.

    The wait_for_unblock flag controls whether we wait for the 'middle' portion
    of commit_transaction to complete, which is necessary if the caller expects
    some of the modifications contained in the commit to be available (this is
    the case for subvol/snapshot creation).

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     

29 Oct, 2010

1 commit

  • In order to save the free space cache, we need an inode to hold the data, and
    we need a special item to point at the right inode for the right block group.
    So first, create a special item that will point to the right inode and hold
    the number of extent entries and the number of bitmaps we will have. We
    truncate and pre-allocate space every time to make sure it's up to date.
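
    A sketch of what such an item could look like, going only by this
    description (field names are guesses, not the exact on-disk format):

        struct free_space_header {
                struct btrfs_disk_key location;  /* points at the cache inode */
                __le64 generation;               /* assumption: how a stale cache is spotted */
                __le64 num_entries;              /* number of extent entries */
                __le64 num_bitmaps;              /* number of bitmaps */
        };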

    This feature will be turned on as soon as you mount with -o space_cache.
    However, it is safe to boot into old kernels; they will just generate the
    cache the old-fashioned way. When you boot back into a newer kernel, we will
    notice that the fs was modified but the cache was not, and automatically
    discard the cache.

    Signed-off-by: Josef Bacik

    Josef Bacik
     

25 May, 2010

3 commits

  • Reserve metadata space for extent tree, checksum tree and root tree

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • Besides simplifying the code, this change makes sure all metadata
    reservations for normal metadata operations are released after
    committing the transaction.

    Changes since V1:

    Add code that checks whether unlink and rmdir will free space.

    Add ENOSPC handling for the clone ioctl.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • Introducing metadata reservation contexts has two major advantages.
    First, it makes metadata reservation more traceable. Second, a context
    can reclaim freed space and re-add it to itself after the transaction
    commits.

    Besides adding the btrfs_block_rsv structure and related helper functions,
    this patch contains the following changes:

    Move the code that decides if a freed tree block should be pinned into
    btrfs_free_tree_block().

    Make space accounting more accurate, mainly for handling read only
    block groups.
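
    A simplified sketch of such a reservation context, per this description
    (not the exact kernel definition):

        struct btrfs_block_rsv {
                u64 size;         /* bytes this context wants reserved */
                u64 reserved;     /* bytes it currently holds */
                struct btrfs_space_info *space_info;  /* pool the bytes come from */
                spinlock_t lock;
        };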

    Signed-off-by: Chris Mason

    Yan, Zheng
     

16 Dec, 2009

1 commit

  • We allow two log transactions at a time, but use the same flag
    to mark dirty tree-log btree blocks. So we may flush dirty
    blocks belonging to newer log transaction when committing a
    log transaction. This patch fixes the issue by using two
    flags to mark dirty tree-log btree blocks.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

14 Oct, 2009

2 commits

  • Syncing the tree log is a 3 phase operation.

    1) write and wait for all the tree log blocks for a given root.

    2) write and wait for all the tree log blocks for the
    tree of tree log roots.

    3) write and wait for the super blocks (barriers here)

    This isn't as efficient as it could be because there is
    no requirement to wait for the blocks from step one to hit the disk
    before we start writing the blocks from step two. This commit
    changes the sequence so that we don't start waiting until
    all the tree blocks from both steps one and two have been sent
    to disk.

    We do this by breaking up btrfs_write_wait_marked_extents into
    two functions, which is trivial because it was already broken
    up into two parts.
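
    Conceptually, the new ordering looks like this (names for the two halves
    are plausible guesses; signatures simplified):

        /* steps 1 and 2: issue all the writes without waiting */
        btrfs_write_marked_extents(log, &log->dirty_log_pages);
        btrfs_write_marked_extents(log_root_tree, &log_root_tree->dirty_log_pages);

        /* only now wait for everything to hit the disk */
        btrfs_wait_marked_extents(log, &log->dirty_log_pages);
        btrfs_wait_marked_extents(log_root_tree, &log_root_tree->dirty_log_pages);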

    Signed-off-by: Chris Mason

    Chris Mason
     
  • rpm has a habit of running fdatasync when the file hasn't
    changed. We already detect if a file hasn't been changed
    in the current transaction but it might have been sent to
    the tree-log in this transaction and not changed since
    the last call to fsync.

    In this case, we want to avoid a tree log sync, which includes
    a number of synchronous writes and barriers. This commit
    extends the existing tracking of the last transaction to change
    a file to also track the last sub-transaction.

    The end result is that rpm -ivh and -Uvh are roughly twice as fast,
    and on par with ext3.
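
    An illustrative fragment of the resulting fsync fast path (names are
    approximate, not the exact kernel fields):

        /* file unchanged since its last tree-log sync: nothing to do */
        if (inode->last_trans <= fs_info->last_trans_committed &&
            inode->last_sub_trans <= root->last_log_commit)
                return 0;       /* skip the synchronous writes and barriers */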

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Jun, 2009

1 commit

  • This commit introduces a new kind of back reference for btrfs metadata.
    Once a filesystem has been mounted with this commit, IT WILL NO LONGER
    BE MOUNTABLE BY OLDER KERNELS.

    When a tree block in a subvolume tree is cow'd, the reference counts of all
    extents it points to are increased by one. At transaction commit time,
    the old root of the subvolume is recorded in a "dead root" data structure,
    and the btree it points to is later walked, dropping reference counts
    and freeing any blocks where the reference count goes to 0.

    The increments done during cow and decrements done after commit cancel out,
    and the walk is a very expensive way to go about freeing the blocks that
    are no longer referenced by the new btree root. This commit reduces the
    transaction overhead by avoiding the need for dead root records.

    When a non-shared tree block is cow'd, we free the old block at once, and the
    new block inherits the old block's references. When a tree block with reference
    count > 1 is cow'd, we increase the reference counts of all extents
    the new block points to by one, and decrease the old block's reference count by
    one.

    This dead tree avoidance code removes the need to modify the reference
    counts of lower level extents when a non-shared tree block is cow'd.
    But we still need to update back ref for all pointers in the block.
    This is because the location of the block is recorded in the back ref
    item.

    We can solve this by introducing a new type of back ref. The new
    back ref provides information about the pointer's key, its level, and which
    tree the pointer lives in. This information allows us to find the pointer
    by searching the tree. The shortcoming of the new back ref is that it
    only works for pointers in tree blocks referenced by their owner trees.

    This is mostly a problem for snapshots, where resolving one of these
    fuzzy back references would be O(number_of_snapshots) and quite slow.
    The solution used here is to use the fuzzy back references in the common
    case where a given tree block is only referenced by one root,
    and use the full back references when multiple roots have a reference
    on a given block.
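
    A purely conceptual sketch of the two flavors, based on the description
    above (not the on-disk item layout):

        struct fuzzy_backref {          /* block referenced only by its owner tree */
                u64 root;               /* which tree the pointer lives in */
                int level;              /* level of the referencing block */
                struct btrfs_key key;   /* key used to search down to the pointer */
        };

        struct full_backref {           /* block shared by multiple roots */
                u64 parent;             /* location of the referencing block */
        };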

    This commit adds a per-subvolume red-black tree to keep track of cached
    inodes. The red-black tree helps the balancing code find cached
    inodes whose inode numbers are within a given range.

    This commit improves the balancing code by introducing several data
    structures to keep the state of balancing. The most important one
    is the back ref cache. It caches how the upper level tree blocks are
    referenced. This greatly reduces the overhead of checking back refs.

    The improved balancing code scales significantly better with a large
    number of snapshots.

    This is a very large commit and was written in a number of
    pieces. But, they depend heavily on the disk format change and were
    squashed together to make sure git bisect didn't end up in a
    bad state wrt space balancing or the format change.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

25 Mar, 2009

2 commits

  • To avoid deadlocks and reduce latencies during some critical operations, some
    transaction writers are allowed to jump into the running transaction and make
    it run a little longer, while others sit around and wait for the commit to
    finish.

    This is a bit unfair, especially when the callers that jump in do a bunch
    of IO that makes all the other procs on the box wait. This commit
    reduces the stalls this produces by pre-reading file extent pointers
    during btrfs_finish_ordered_io before the transaction is joined.

    It also tunes the drop_snapshot code to politely wait for transactions
    that have started writing out their delayed refs to finish. This avoids
    new delayed refs being flooded into the queue while we're trying to
    close off the transaction.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The extent allocation tree maintains a reference count and full
    back reference information for every extent allocated in the
    filesystem. For subvolume and snapshot trees, every time
    a block goes through COW, the new copy of the block adds a reference
    on every block it points to.

    If a btree node points to 150 leaves, then the COW code needs to go
    and add backrefs on 150 different extents, which might be spread all
    over the extent allocation tree.

    These updates currently happen during btrfs_cow_block, and most COWs
    happen during btrfs_search_slot. btrfs_search_slot has locks held
    on both the parent and the node we are COWing, and so we really want
    to avoid IO during the COW if we can.

    This commit adds an rbtree of pending reference count updates and extent
    allocations. The tree is ordered by byte number of the extent and byte number
    of the parent for the back reference. The tree allows us to:

    1) Modify back references in something close to disk order, reducing seeks
    2) Significantly reduce the number of modifications made as block pointers
    are balanced around
    3) Do all of the extent insertion and back reference modifications outside
    of the performance critical btrfs_search_slot code.

    #3 has the added benefit of greatly reducing the btrfs stack footprint.
    The extent allocation tree modifications are done without the deep
    (and somewhat recursive) call chains used in the past.

    These delayed back reference updates must be done before the transaction
    commits, and so the rbtree is tied to the transaction. Throttling is
    implemented to help keep the queue of backrefs at a reasonable size.
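
    A simplified sketch of that ordering (fields trimmed to the ones the rbtree
    is keyed on):

        struct delayed_ref_node {
                u64 bytenr;              /* extent this update applies to */
                u64 parent;              /* parent block, for the back reference */
                struct rb_node rb_node;  /* indexed by (bytenr, parent) */
        };

        /* near-disk-order comparison used to index pending updates */
        static int comp_refs(struct delayed_ref_node *a, struct delayed_ref_node *b)
        {
                if (a->bytenr != b->bytenr)
                        return a->bytenr < b->bytenr ? -1 : 1;
                if (a->parent != b->parent)
                        return a->parent < b->parent ? -1 : 1;
                return 0;
        }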

    Since there was a similar mechanism in place for the extent tree
    extents, that is removed and replaced by the delayed reference tree.

    Yan Zheng helped review and fixup this code.

    Signed-off-by: Chris Mason

    Chris Mason
     

12 Dec, 2008

1 commit

  • The block group structs are referenced in many different
    places, and it's not safe to free them while balancing. So, those block
    group structs were simply leaked instead.

    This patch replaces the block group pointer in the inode with the starting byte
    offset of the block group and adds reference counting to the block group
    struct.

    Signed-off-by: Yan Zheng

    Yan Zheng
     

18 Nov, 2008

1 commit

  • Before, all snapshots and subvolumes lived in a single flat directory. This
    was awkward and confusing because the single flat directory was only writable
    with the ioctls.

    This commit changes the ioctls to create subvols and snapshots at any
    point in the directory tree. This requires making separate ioctls for
    snapshot and subvol creation instead of combining them into one.

    The subvol ioctl does:

    btrfsctl -S subvol_name parent_dir

    After the ioctl is done subvol_name lives inside parent_dir.

    The snapshot ioctl does:

    btrfsctl -s path_for_snapshot root_to_snapshot

    path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up
    into directory and basename components.

    root_to_snapshot can be any file or directory in the FS. The snapshot
    is taken of the entire root where that file lives.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Sep, 2008

14 commits

  • This is the same way the transaction code makes sure that all the
    other tree blocks are safely on disk. There's an extent_io tree
    for each root, and any blocks allocated to the tree logs are
    recorded in that tree.

    At tree-log sync, the extent_io tree is walked to flush down the
    dirty pages and wait for them.

    The main benefit is less time spent walking the tree log and skipping
    clean pages, and getting sequential IO down to the drive.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • File syncs and directory syncs are optimized by copying their
    items into a special (copy-on-write) log tree. There is one log tree per
    subvolume and the btrfs super block points to a tree of log tree roots.

    After a crash, items are copied out of the log tree and back into the
    subvolume. See tree-log.c for all the details.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This trivial patch contains two locking fixes and an off-by-one fix.

    ---

    Signed-off-by: Chris Mason

    Yan Zheng
     
  • Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in
    progress) breaks the transaction start/stop ioctls by making
    btrfs_start_transaction conditionally wait for the next transaction to
    start. If an application is artificially holding a transaction open,
    things deadlock.

    This workaround maintains a count of open ioctl-initiated transactions in
    fs_info, and avoids wait_current_trans() if any are currently open (in
    start_transaction() and btrfs_throttle()). The start transaction ioctl
    uses a new btrfs_start_ioctl_transaction() that _does_ call
    wait_current_trans(), effectively pushing the join/wait decision to the
    outer ioctl-initiated transaction.

    This more or less neuters btrfs_throttle() when ioctl-initiated
    transactions are in use, but that seems like a pretty fundamental
    consequence of wrapping lots of write()'s in a transaction. Btrfs has no
    way to tell if the application considers a given operation as part of its
    transaction.

    Obviously, if the transaction start/stop ioctls aren't being used, there
    is no effect on current behavior.
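
    The guard this workaround amounts to, sketched (simplified):

        /* only wait on the running transaction if no app holds one open */
        if (fs_info->open_ioctl_trans == 0)
                wait_current_trans(root);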

    Signed-off-by: Sage Weil
    ---
    ctree.h | 1 +
    ioctl.c | 12 +++++++++++-
    transaction.c | 18 +++++++++++++-----
    transaction.h | 2 ++
    4 files changed, 27 insertions(+), 6 deletions(-)

    Signed-off-by: Chris Mason

    Sage Weil
     
  • To check whether a given file extent is referenced by multiple snapshots, the
    checker walks down the fs tree through the dead root and checks all tree
    blocks in the path.

    We can easily detect whether a given tree block is directly referenced by
    another snapshot. We can also detect any indirect reference from another
    snapshot by checking the reference's generation. The checker can always detect
    multiple references, but can't reliably detect cases of a single reference. So
    btrfs may do file data cow even when there is only one reference.

    Signed-off-by: Chris Mason

    Yan Zheng
     
  • A large reference cache is directly related to a lot of work pending
    for the cleaner thread. This throttles back new operations based on
    the size of the reference cache so the cleaner thread will be able to keep
    up.

    Overall, this actually makes the FS faster because the cleaner thread will
    be more likely to find things in cache.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • btrfs_commit_transaction has to loop waiting for any writers in the
    transaction to finish before it can proceed. btrfs_start_transaction
    should be polite and not join a transaction that is in the process
    of being finished off.

    There are a few places that can't wait, basically the ones doing IO that
    might be needed to finish the transaction. For them, btrfs_join_transaction
    is added.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The old data=ordered code would force commit to wait until
    all the data extents from the transaction were fully on disk. This
    introduced large latencies into the commit and stalled new writers
    in the transaction for a long time.

    The new code changes the way data allocations and extents work:

    * When delayed allocation is filled, data extents are reserved, and
    the extent bit EXTENT_ORDERED is set on the entire range of the extent.
    A struct btrfs_ordered_extent is allocated and inserted into a per-inode
    rbtree to track the pending extents.

    * As each page is written, EXTENT_ORDERED is cleared on the bytes corresponding
    to that page.

    * When all of the bytes corresponding to a single struct btrfs_ordered_extent
    are written, the previously reserved extent is inserted into the FS
    btree and into the extent allocation trees. The checksums for the file
    data are also updated.
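
    A trimmed sketch of the tracking object described above (the real struct
    carries more, e.g. the checksum bookkeeping mentioned above):

        struct btrfs_ordered_extent {
                u64 file_offset;         /* start of the range in the file */
                u64 len;                 /* length of the reserved extent */
                struct rb_node rb_node;  /* linked into the per-inode rbtree */
        };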

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The btree defragger wasn't making forward progress because the new key wasn't
    being saved by the btrfs_search_forward function.

    This also disables the automatic btree defrag, it wasn't scaling well to
    huge filesystems. The auto-defrag needs to be done differently.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This creates one kthread for commits and one kthread for
    deleting old snapshots. All the work queues are removed.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The existing throttle mechanism was often not sufficient to prevent
    new writers from coming in and making a given transaction run forever.
    This adds an explicit wait at the end of most operations so they will
    allow the current transaction to close.

    There is no wait inside file_write, inode updates, or cow filling, all of
    which have different deadlock possibilities.

    This is a temporary measure until better asynchronous commit support is
    added. This code leads to stalls as it waits for data=ordered
    writeback, and it really needs to be fixed.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • There is now extent_map for mapping offsets in the file to disk and
    extent_io for state tracking, IO submission and extent_buffers.

    The new extent_map code shifts from [start,end] pairs to [start,len], and
    pushes the locking out into the caller. This allows a few performance
    optimizations and is easier to use.

    A number of extent_map usage bugs were fixed, mostly with failing
    to remove extent_map entries when changing the file.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • It is very difficult to create a consistent snapshot of the btree when
    other writers may update the btree before the commit is done.

    This changes the snapshot creation to happen during the commit, while
    no other updates are possible.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This forces file data extents down the disk along with the metadata that
    references them. The current implementation is fairly simple, and just
    writes out all of the dirty pages in an inode before the commit.

    Signed-off-by: Chris Mason

    Chris Mason