21 Feb, 2009

1 commit

  • This is a step in the direction of better -ENOSPC handling. Instead of
    checking the global bytes counter, we check the space_info bytes counters
    to make sure we have enough space.

    If we don't, we go ahead and try to allocate a new chunk, and if that fails
    we return -ENOSPC. This patch adds two counters to btrfs_space_info:
    bytes_delalloc and bytes_may_use.

    bytes_delalloc accounts for extents we've actually set up for delalloc and
    that will be allocated at some point down the line.

    bytes_may_use keeps track of how many bytes we may use for delalloc at
    some point. When we actually set the extent_bit for the delalloc bytes, we
    subtract the reserved bytes from the bytes_may_use counter. This keeps us
    from ending up unable to allocate space for any delalloc bytes (see the
    sketch after this entry).

    Signed-off-by: Josef Bacik

    Josef Bacik
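
    A minimal user-space sketch of the accounting described above, assuming
    simplified, hypothetical struct fields and helpers rather than the actual
    btrfs definitions:

        #include <errno.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Simplified stand-in for btrfs_space_info with the two new counters. */
        struct space_info {
            uint64_t total_bytes;     /* bytes in allocated chunks */
            uint64_t bytes_used;      /* bytes already written */
            uint64_t bytes_delalloc;  /* delalloc extents set up, not yet on disk */
            uint64_t bytes_may_use;   /* bytes we may still use for delalloc */
        };

        /* Hypothetical chunk allocator: pretend we can grow by 1GB, once. */
        static int alloc_chunk(struct space_info *info)
        {
            static int grown;

            if (grown)
                return -ENOSPC;
            grown = 1;
            info->total_bytes += 1024ULL * 1024 * 1024;
            return 0;
        }

        /* Check the space_info counters instead of a global bytes counter; if
         * the reservation does not fit, try to allocate a new chunk first. */
        static int reserve_delalloc(struct space_info *info, uint64_t bytes)
        {
            uint64_t committed = info->bytes_used + info->bytes_delalloc +
                                 info->bytes_may_use;

            if (committed + bytes > info->total_bytes && alloc_chunk(info))
                return -ENOSPC;
            info->bytes_may_use += bytes;
            return 0;
        }

        /* When the delalloc extent bit is actually set, move the reservation
         * from bytes_may_use to bytes_delalloc. */
        static void set_delalloc(struct space_info *info, uint64_t bytes)
        {
            info->bytes_may_use -= bytes;
            info->bytes_delalloc += bytes;
        }

        int main(void)
        {
            struct space_info info = { .total_bytes = 1024ULL * 1024 * 1024 };

            if (reserve_delalloc(&info, 64 * 1024) == 0)
                set_delalloc(&info, 64 * 1024);
            printf("delalloc=%llu may_use=%llu\n",
                   (unsigned long long)info.bytes_delalloc,
                   (unsigned long long)info.bytes_may_use);
            return 0;
        }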
     

20 Feb, 2009

1 commit


22 Jan, 2009

1 commit

  • To improve performance, btrfs_sync_log merges tree log sync
    requests. But it wrongly merges sync requests for different
    tree logs. If multiple tree logs are synced at the same time,
    only one of them actually gets synced.

    This patch makes the following changes to fix the bug:

    Move most tree-log-related fields from btrfs_fs_info to
    btrfs_root. This allows sync requests to be merged separately
    for each tree log.

    Don't insert the root item into the log root tree immediately
    after the log tree is allocated. The root item for a log tree is
    inserted when that log tree is synced for the first time. This
    allows syncing the log root tree without first syncing all
    log trees.

    At tree-log sync, btrfs_sync_log first syncs the log tree, then
    updates the corresponding root item in the log root tree, then
    syncs the log root tree, and finally updates the super block
    (see the sketch after this entry).

    Signed-off-by: Yan Zheng

    Yan Zheng
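
    A schematic sketch of the new sync ordering, with all types and helpers as
    illustrative stubs rather than the real tree-log code:

        #include <stdio.h>

        struct log_root {
            int root_item_inserted;   /* root item in the log root tree yet? */
        };

        /* Stubbed-out steps; the real code writes tree blocks and waits on IO. */
        static void write_and_wait_log_tree(struct log_root *log)  { (void)log; }
        static void insert_root_item(struct log_root *log)
        {
            log->root_item_inserted = 1;
        }
        static void update_root_item(struct log_root *log)         { (void)log; }
        static void write_and_wait_log_root_tree(void)             { }
        static void write_super(void)                              { }

        /* btrfs_sync_log-style flow; sync requests are merged per log tree, so
         * this runs against a single subvolume's log root at a time. */
        static void sync_one_log(struct log_root *log)
        {
            write_and_wait_log_tree(log);     /* 1: sync this log tree          */
            if (!log->root_item_inserted)     /* 2: first sync inserts the root */
                insert_root_item(log);        /*    item into the log root tree */
            else
                update_root_item(log);        /*    later syncs just update it  */
            write_and_wait_log_root_tree();   /* 3: sync the log root tree      */
            write_super();                    /* 4: update the super block      */
        }

        int main(void)
        {
            struct log_root log = { 0 };

            sync_one_log(&log);   /* first sync: inserts the root item */
            sync_one_log(&log);   /* later sync: updates it instead    */
            printf("root item inserted: %d\n", log.root_item_inserted);
            return 0;
        }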
     

21 Jan, 2009

1 commit


06 Jan, 2009

3 commits


12 Dec, 2008

1 commit

  • Checksums on data can be disabled by a mount option, so it's
    possible some data extents don't have checksums or have
    invalid checksums. This causes trouble for data relocation.
    This patch makes the following changes so data relocation
    works (see the sketch after this entry):

    1) Make the nodatasum/nodatacow mount options only affect new
    files. Checksums and COW on data are controlled only by the
    inode flags.

    2) Check for the existence of a checksum in the nodatacow checker.
    If a checksum exists, force COW of the data extent. This ensures that
    the checksum for a given block is either valid or does not exist.

    3) Update the data relocation code to properly handle the case
    of missing checksums.

    Signed-off-by: Yan Zheng

    Yan Zheng
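
    A simplified sketch of the check described in point 2; the inode flags and
    checksum lookup below are illustrative placeholders, not the real nodatacow
    checker:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Illustrative per-inode flags: for existing files, checksums and COW
         * are controlled by these rather than by the mount options. */
        #define INODE_NODATACOW  (1u << 0)
        #define INODE_NODATASUM  (1u << 1)

        struct inode_stub {
            unsigned int flags;
        };

        /* Placeholder for "does a checksum item exist for this extent?" */
        static bool extent_has_csum(uint64_t bytenr)
        {
            return bytenr % 2 == 0;   /* fake answer for the example */
        }

        /* Even if the inode allows overwriting in place, force COW when a
         * checksum exists for the extent, so a block's checksum is always
         * either valid or absent. */
        static bool must_cow(const struct inode_stub *inode, uint64_t bytenr)
        {
            if (!(inode->flags & INODE_NODATACOW))
                return true;                   /* normal COW path       */
            if (extent_has_csum(bytenr))
                return true;                   /* avoid stale checksums */
            return false;                      /* safe to overwrite     */
        }

        int main(void)
        {
            struct inode_stub in = { .flags = INODE_NODATACOW | INODE_NODATASUM };

            printf("extent 1000: cow=%d\n", must_cow(&in, 1000));
            printf("extent 1001: cow=%d\n", must_cow(&in, 1001));
            return 0;
        }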
     

09 Dec, 2008

2 commits

  • The fsync logging code makes sure to only copy the relevant checksums for
    each extent, based on the file extent pointers it finds.

    But for compressed extents, it needs to copy the checksum for the
    entire extent.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This adds a sequence number to the btrfs inode that is increased on
    every update. NFS will be able to use that to detect when an inode has
    changed, without relying on inaccurate time fields.

    While we're here, this also:

    Puts reserved space into the super block and inode

    Adds a log root transid to the super so we can pick the newest super
    based on the fsync log as well as the main transaction ID. For now
    the log root transid is always zero, but that'll get fixed.

    Adds a starting offset to the dev_item. This will let us do better
    alignment calculations if we know the start of a partition on the disk.

    Signed-off-by: Chris Mason

    Chris Mason
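
    A minimal sketch of the sequence-number idea for change detection; the
    struct and update hook are simplified placeholders, not the btrfs inode
    code:

        #include <stdint.h>
        #include <stdio.h>

        /* Simplified in-memory inode with a change sequence number. */
        struct inode_stub {
            uint64_t sequence;   /* bumped on every update             */
            uint64_t size;       /* one example of mutable inode state */
        };

        /* Every modification bumps the sequence, so an NFS-style client can
         * compare sequence numbers instead of coarse or inaccurate times. */
        static void inode_update(struct inode_stub *inode, uint64_t new_size)
        {
            inode->size = new_size;
            inode->sequence++;
        }

        int main(void)
        {
            struct inode_stub in = { 0 };
            uint64_t seen = in.sequence;   /* client's cached value */

            inode_update(&in, 4096);
            if (in.sequence != seen)
                printf("inode changed (seq %llu -> %llu)\n",
                       (unsigned long long)seen,
                       (unsigned long long)in.sequence);
            return 0;
        }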
     

02 Dec, 2008

1 commit


13 Nov, 2008

1 commit

  • When an extent needs to be split, btrfs_mark_extent_written truncates the
    extent first, then inserts a new extent and increases the reference count.

    The race happens if someone else deletes the old extent before the new extent
    is inserted. The fix here is to increase the reference count in advance, as
    sketched after this entry. This race is similar to the race in
    btrfs_drop_extents that was recently fixed.

    Signed-off-by: Yan Zheng

    Yan Zheng
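
    A schematic sketch of the reordering described above; the names and the
    reference counting are placeholders, not the btrfs_mark_extent_written
    code:

        /* Placeholder extent with a reference count. */
        struct extent_stub {
            int refs;
        };

        static void inc_extent_ref(struct extent_stub *e)    { e->refs++; }
        static void truncate_extent(struct extent_stub *e)   { (void)e; }
        static void insert_new_extent(struct extent_stub *e) { (void)e; }

        /* Take the extra reference before touching the tree, so the extent
         * cannot be freed in the window between truncating the old item and
         * inserting the new one. */
        static void split_extent(struct extent_stub *e)
        {
            inc_extent_ref(e);      /* 1: hold the reference up front */
            truncate_extent(e);     /* 2: shrink the existing item    */
            insert_new_extent(e);   /* 3: insert the second half      */
        }

        int main(void)
        {
            struct extent_stub e = { .refs = 1 };

            split_extent(&e);
            return 0;
        }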
     

11 Nov, 2008

2 commits

  • btrfs_drop_extents will drop paths and search again when it needs to
    force COW of higher nodes. It was using the key it found during the last
    search as the offset for the next search.

    But this wasn't always correct. The key could be from before our desired
    range, and because we're dropping the path, it is possible for the file's
    items to change while we do the search again.

    The fix here is to make sure we don't search for something smaller than
    the offset btrfs_drop_extents was called with.

    Signed-off-by: Chris Mason

    Yan Zheng
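
    A one-function sketch of the clamping described above; the variable names
    are illustrative:

        #include <stdint.h>
        #include <stdio.h>

        /* When btrfs_drop_extents drops the path and searches again, never
         * search below the offset it was originally called with, even if the
         * last key found was smaller. */
        static uint64_t next_search_start(uint64_t last_key_offset,
                                          uint64_t drop_start)
        {
            return last_key_offset > drop_start ? last_key_offset : drop_start;
        }

        int main(void)
        {
            /* The last key found was from before the range being dropped. */
            printf("%llu\n",
                   (unsigned long long)next_search_start(4096, 65536));
            return 0;
        }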
     
  • This makes sure the orig_start field in struct extent_map gets set
    everywhere the extent_map structs are created or modified.

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Nov, 2008

1 commit

  • The decompress code doesn't take the logical offset in the extent
    pointer into account. If the logical offset isn't zero, data
    will be decompressed into the wrong pages.

    The solution used here is to record the starting offset of the extent
    in the file separately from the logical start of the extent_map struct.
    This allows us to avoid problems inserting overlapping extents.

    Signed-off-by: Yan Zheng

    Yan Zheng
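
    An illustrative sketch of recording the extent's starting file offset
    separately from the start of each mapping; the struct is a simplified
    stand-in for extent_map:

        #include <stdint.h>
        #include <stdio.h>

        /* Simplified mapping of part of a compressed extent. */
        struct extent_map_stub {
            uint64_t start;       /* logical start of this mapping in the file */
            uint64_t len;         /* length of this mapping                    */
            uint64_t orig_start;  /* file offset where the whole on-disk
                                     extent begins; may be before 'start'      */
        };

        /* Offset into the decompressed data for a given file position,
         * measured from the start of the whole extent, not of this mapping. */
        static uint64_t offset_in_extent(const struct extent_map_stub *em,
                                         uint64_t file_pos)
        {
            return file_pos - em->orig_start;
        }

        int main(void)
        {
            struct extent_map_stub em = {
                .start = 8192, .len = 4096, .orig_start = 0,
            };

            /* The page at file offset 8192 sits 8192 bytes into the
             * decompressed extent, not at its beginning. */
            printf("%llu\n", (unsigned long long)offset_in_extent(&em, 8192));
            return 0;
        }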
     

07 Nov, 2008

1 commit

  • When reading compressed extents, try to put pages into the page cache
    for any pages covered by the compressed extent that readpages didn't already
    preload.

    Add an async work queue to handle transformations at delayed allocation processing
    time. Right now this is just compression. The workflow is:

    1) Find offsets in the file marked for delayed allocation
    2) Lock the pages
    3) Lock the state bits
    4) Call the async delalloc code

    The async delalloc code clears the state lock bits and delalloc bits. It is
    important this happens before the range goes into the work queue because
    otherwise it might deadlock with other work queue items that try to lock
    those extent bits.

    The file pages are compressed, and if the compression doesn't work the
    pages are written back directly.

    An ordered work queue is used to make sure the inodes are written in the same
    order that pdflush or writepages sent them down.

    This changes extent_write_cache_pages to let the writepage function
    update the wbc nr_written count.

    Signed-off-by: Chris Mason

    Chris Mason
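
    A high-level sketch of the four-step flow listed above; every helper here
    is a stub standing in for the page/state locking and the work queue, not
    the actual async delalloc code:

        #include <stdbool.h>
        #include <stdint.h>

        /* Placeholder range of a file marked for delayed allocation. */
        struct delalloc_range {
            uint64_t start;
            uint64_t end;
        };

        /* Stubbed steps; the real code operates on pages and extent state. */
        static bool find_delalloc_range(struct delalloc_range *r)
        {
            static bool done;

            if (done)
                return false;
            done = true;
            r->start = 0;
            r->end = 128 * 1024 - 1;
            return true;
        }
        static void lock_pages(struct delalloc_range *r)               { (void)r; }
        static void lock_state_bits(struct delalloc_range *r)          { (void)r; }
        static void clear_state_and_delalloc(struct delalloc_range *r) { (void)r; }
        static void queue_async_work(struct delalloc_range *r)         { (void)r; }

        /* Async delalloc entry: clear the lock and delalloc bits before the
         * range goes onto the work queue, so other queued items can't
         * deadlock trying to lock those extent bits. */
        static void async_delalloc(struct delalloc_range *r)
        {
            clear_state_and_delalloc(r);
            queue_async_work(r);
        }

        static void run_delalloc(void)
        {
            struct delalloc_range r;

            while (find_delalloc_range(&r)) {   /* 1: find delalloc offsets */
                lock_pages(&r);                 /* 2: lock the pages        */
                lock_state_bits(&r);            /* 3: lock the state bits   */
                async_delalloc(&r);             /* 4: call the async code   */
            }
        }

        int main(void)
        {
            run_delalloc();
            return 0;
        }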
     

01 Nov, 2008

1 commit

  • Make sure we keep page->mapping NULL on the pages we're getting
    via alloc_page. It gets set so a few of the callbacks can do the right
    thing, but in general these pages don't have a mapping.

    Don't try to truncate compressed inline items in btrfs_drop_extents.
    The whole compressed item must be preserved.

    Don't try to create multi-page inline compressed items. When we try to
    overwrite just the first page of the file, we would have to read in and recow
    all the pages after it in the same compressed inline item. For now, only
    create single-page inline items.

    Make sure we lock pages in the correct order during delalloc. The
    search into the state tree for delalloc bytes can return bytes before
    the page we already have locked.

    Signed-off-by: Chris Mason

    Chris Mason
     

31 Oct, 2008

3 commits

  • This patch updates btrfs-progs for fallocate support.

    fallocate is a little different in Btrfs because we need to tell the
    COW system that a given preallocated extent doesn't need to be
    cow'd as long as there are no snapshots of it. This leverages the
    -o nodatacow checks.

    Signed-off-by: Yan Zheng

    Yan Zheng
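
    A small sketch of the COW decision for preallocated extents described
    above; the fields and the snapshot check are illustrative placeholders
    built on the same idea as the -o nodatacow checks:

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative state for a fallocate'd (preallocated) extent. */
        struct extent_stub {
            bool preallocated;   /* created by fallocate, not yet written */
            bool snapshotted;    /* some snapshot also references it      */
        };

        /* A preallocated extent can be written in place (no COW) as long as
         * no snapshot holds a reference to it; otherwise it must be COW'd
         * like any other shared extent. */
        static bool needs_cow(const struct extent_stub *e)
        {
            if (!e->preallocated)
                return true;
            return e->snapshotted;
        }

        int main(void)
        {
            struct extent_stub fresh = { .preallocated = true };
            struct extent_stub snap  = { .preallocated = true, .snapshotted = true };

            printf("fresh prealloc cow=%d, snapshotted prealloc cow=%d\n",
                   needs_cow(&fresh), needs_cow(&snap));
            return 0;
        }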
     
  • When dropping middle part of an extent, btrfs_drop_extents truncates
    the extent at first, then inserts a bookend extent.

    Since truncation and insertion can't be done atomically, there is a small
    window during which the bookend extent isn't in the tree. This causes problems
    for functions that search the tree for file extent items. The fix is to lock
    the range of the bookend extent before truncation.

    Signed-off-by: Yan Zheng

    Yan Zheng
     
  • This patch splits the hole insertion code out of btrfs_setattr
    into btrfs_cont_expand and updates btrfs_get_extent to properly
    handle the case where file extent items are not contiguous.

    Signed-off-by: Yan Zheng

    Yan Zheng
     

30 Oct, 2008

1 commit

  • This is a large change for adding compression on reading and writing,
    both for inline and regular extents. It does some fairly large
    surgery to the writeback paths.

    Compression is off by default and enabled by mount -o compress. Even
    when the -o compress mount option is not used, it is possible to read
    compressed extents off the disk.

    If compression for a given set of pages fails to make them smaller, the
    file is flagged to avoid future compression attempts.

    * While finding delalloc extents, the pages are locked before being sent down
    to the delalloc handler. This allows the delalloc handler to do complex things
    such as cleaning the pages, marking them writeback and starting IO on their
    behalf.

    * Inline extents are inserted at delalloc time now. This allows us to compress
    the data before inserting the inline extent, and it allows us to insert
    an inline extent that spans multiple pages.

    * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
    are changed to record both an in-memory size and an on disk size, as well
    as a flag for compression.

    From a disk format point of view, the extent pointers in the file are changed
    to record the on disk size of a given extent and some encoding flags.
    Space in the disk format is allocated for compression encoding, as well
    as encryption and a generic 'other' field. Neither the encryption nor the
    'other' field is currently used.

    In order to limit the amount of data read for a single random read in the
    file, the size of a compressed extent is limited to 128k. This is a
    software-only limit; the disk format supports u64-sized compressed extents.

    In order to limit the ram consumed while processing extents, the uncompressed
    size of a compressed extent is limited to 256k. This is a software-only limit
    and will be subject to tuning later.

    Checksumming is still done on compressed extents, and it is done on the
    uncompressed version of the data. This way additional encodings can be
    layered on without having to figure out which encoding to checksum.

    Compression happens at delalloc time, which is basically single threaded
    because it is usually done by a single pdflush thread. This makes it
    tricky to spread the compression load across all the cpus on the box.
    We'll have to look at parallel pdflush walks of dirty inodes at a later time.

    Decompression is hooked into readpages and it does spread across CPUs nicely.

    Signed-off-by: Chris Mason

    Chris Mason
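
    An illustrative sketch of the size limits and the dual in-memory/on-disk
    sizes described above; the struct and constants mirror the text, not the
    actual disk format definitions:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Software-only limits from the description above. */
        #define MAX_COMPRESSED_SIZE    (128 * 1024)  /* on-disk size of one extent */
        #define MAX_UNCOMPRESSED_SIZE  (256 * 1024)  /* ram covered by one extent  */

        /* Simplified extent record with both sizes plus a compression flag. */
        struct extent_stub {
            uint64_t ram_bytes;    /* uncompressed (in-memory) size */
            uint64_t disk_bytes;   /* on-disk size                  */
            bool     compressed;
        };

        /* Store the data compressed only if it actually shrank and both
         * software limits above are respected. */
        static struct extent_stub make_extent(uint64_t ram, uint64_t compressed)
        {
            struct extent_stub e = { .ram_bytes = ram };

            if (compressed < ram &&
                compressed <= MAX_COMPRESSED_SIZE &&
                ram <= MAX_UNCOMPRESSED_SIZE) {
                e.disk_bytes = compressed;
                e.compressed = true;
            } else {
                e.disk_bytes = ram;       /* fall back to uncompressed */
                e.compressed = false;
            }
            return e;
        }

        int main(void)
        {
            struct extent_stub e = make_extent(256 * 1024, 90 * 1024);

            printf("compressed=%d ram=%llu disk=%llu\n", e.compressed,
                   (unsigned long long)e.ram_bytes,
                   (unsigned long long)e.disk_bytes);
            return 0;
        }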
     

09 Oct, 2008

2 commits

  • The offset field in struct btrfs_extent_ref records the position
    inside the file at which the file extent is referenced. In the new back
    reference system, tree leaves holding references to a file extent
    are recorded explicitly. We can scan these tree leaves very quickly, so the
    offset field is not required.

    This patch also makes the back reference system check the objectid
    when extents are being deleted.

    Signed-off-by: Yan Zheng

    Yan Zheng
     
  • This patch makes btrfs count space allocated to files in bytes instead
    of 512-byte sectors.

    Everything else in btrfs uses a byte count instead of sector sizes or
    block sizes, so this fits better.

    Signed-off-by: Yan Zheng

    Yan Zheng
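
    A tiny arithmetic sketch of the unit change (bytes instead of 512-byte
    sectors); the variable names are illustrative:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint64_t allocated_bytes = 12288;          /* three 4K blocks */

            /* Old style: a 512-byte sector count, as i_blocks uses. */
            uint64_t sectors = allocated_bytes >> 9;   /* 24 sectors      */

            /* New style: keep the byte count directly, matching the rest of
             * btrfs, and convert only where a sector count is expected. */
            printf("bytes=%llu sectors=%llu\n",
                   (unsigned long long)allocated_bytes,
                   (unsigned long long)sectors);
            return 0;
        }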
     

04 Oct, 2008

1 commit

  • This reworks the btrfs O_DIRECT write code a bit. It had always fallen
    back to buffered IO and done an invalidate, but needed to be updated
    for the data=ordered code. The invalidate wasn't actually removing pages
    because they were still inside an ordered extent.

    This also combines the O_DIRECT/O_SYNC paths where possible, and kicks
    off IO in the main btrfs_file_write loop to keep the pipe down to the
    disk full as we process long writes.

    Signed-off-by: Chris Mason

    Chris Mason
     

30 Sep, 2008

1 commit

  • This improves the comments at the top of many functions. It didn't
    dive into the guts of functions because I was trying to
    avoid merging problems with the new allocator and back reference work.

    extent-tree.c and volumes.c were both skipped, and there is definitely
    more work to do in cleaning up and commenting the code.

    Signed-off-by: Chris Mason

    Chris Mason
     

26 Sep, 2008

2 commits

  • * Add an EXTENT_BOUNDARY state bit to keep the writepage code
    from merging data extents that are in the process of being
    relocated. This allows us to do accounting for them properly.

    * The balancing code relocates data extents independent of the underlying
    inode. The extent_map code was modified to properly account for
    things moving around (invalidating extent_map caches in the inode).

    * Don't take the drop_mutex in the create_subvol ioctl. It isn't
    required.

    * Fix walking of the ordered extent list to avoid races with sys_unlink

    * Change the lock ordering rules. Transaction start goes outside
    the drop_mutex. This allows btrfs_commit_transaction to directly
    drop the relocation trees.

    Signed-off-by: Chris Mason

    Zheng Yan
     
  • Btrfs had compatibility code for kernels back to 2.6.18. That code has
    been removed and will be maintained in a separate backport
    git tree from now on.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Sep, 2008

13 commits

  • This patch makes the back reference system explicitly record the
    location of the parent node for all types of extents. The location of
    the parent node is placed into the offset field of the backref key. Every
    time a tree block is balanced, the back references for the affected
    lower-level extents are updated.

    Signed-off-by: Chris Mason

    Zheng Yan
     
  • Drop i_mutex during the commit

    Don't bother doing the fsync at all unless the dir is marked as dirtied
    and needing fsync in this transaction. For directories, this means
    that someone has unlinked a file from the dir without fsyncing the
    file.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • File syncs and directory syncs are optimized by copying their
    items into a special (copy-on-write) log tree. There is one log tree per
    subvolume and the btrfs super block points to a tree of log tree roots.

    After a crash, items are copied out of the log tree and back into the
    subvolume. See tree-log.c for all the details.

    Signed-off-by: Chris Mason

    Chris Mason
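
    A structural sketch of the arrangement described above: one log tree per
    subvolume, with the super block pointing at a tree of log tree roots. All
    types here are simplified placeholders; see tree-log.c for the real
    details:

        #include <stdio.h>

        /* Simplified placeholders for the trees involved. */
        struct log_tree {
            const char *subvol_name;   /* which subvolume this log belongs to */
        };

        struct log_root_tree {
            struct log_tree *logs[8];  /* one entry per subvolume's log tree  */
            int nr;
        };

        struct super_stub {
            struct log_root_tree *log_root_tree;  /* pointed to by the super  */
        };

        /* After a crash, replay walks the log root tree and copies the logged
         * items of each subvolume's log tree back into that subvolume. */
        static void replay_all(struct super_stub *sb)
        {
            for (int i = 0; i < sb->log_root_tree->nr; i++)
                printf("replaying log for %s\n",
                       sb->log_root_tree->logs[i]->subvol_name);
        }

        int main(void)
        {
            struct log_tree a = { "subvol_a" }, b = { "subvol_b" };
            struct log_root_tree lrt = { .logs = { &a, &b }, .nr = 2 };
            struct super_stub sb = { .log_root_tree = &lrt };

            replay_all(&sb);
            return 0;
        }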
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • While dropping snapshots, walk_down_tree does most of the work of checking
    reference counts and limiting tree traversal to just the blocks that
    we are freeing.

    It dropped and held the allocation mutex in strange and confusing ways;
    this commit changes it to only hold the mutex while actually freeing a block.

    The rest of the checks around reference counts should be safe without the lock
    because we only allow one process in btrfs_drop_snapshot at a time. Other
    processes dropping reference counts should not drop it to 1 because
    their tree roots already have an extra ref on the block.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • This avoids waiting for transactions with pages locked by breaking out
    the code to wait for the current transaction to close into a function
    called by btrfs_throttle.

    It also lowers the limits for where we start throttling.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Add a couple of #if's to follow API changes.

    Signed-off-by: Sven Wegener
    Signed-off-by: Chris Mason

    Sven Wegener
     
  • The memory reclaiming issue happens when snapshots exist. In that
    case, some cache entries may not be used while an old snapshot is being
    dropped, so they will remain in the cache until umount.

    The patch adds a field to struct btrfs_leaf_ref to record the create time.
    It also links all dead roots of a given snapshot together in order of
    create time. After an old snapshot has been completely dropped, we check
    the dead root list and remove all cache entries created before the oldest
    dead root in the list.

    Signed-off-by: Chris Mason

    Yan
     
  • A large reference cache is directly related to a lot of work pending
    for the cleaner thread. This throttles back new operations based on
    the size of the reference cache so the cleaner thread will be able to keep
    up.

    Overall, this actually makes the FS faster because the cleaner thread will
    be more likely to find things in cache.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This changes the reference cache to make a single cache per root
    instead of one cache per transaction, and to key by the byte number
    of the disk block instead of the keys inside.

    This makes it much less likely to have cache misses if a snapshot
    or something has an extra reference on a higher node or a leaf while
    the first transaction that added the leaf into the cache is dropping.

    Some throttling is added to functions that free blocks heavily so they
    wait for old transactions to drop.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Stress testing was showing data checksum errors, most of which were caused
    by a lookup bug in the extent_map tree. The tree was caching the last
    pointer returned, and searches would check the last pointer first.

    But, search callers also expect the search to return the very first
    matching extent in the range, which wasn't always true with the last
    pointer usage.

    For now, the code to cache the last return value is just removed. It is
    easy to fix, but I think lookups are rare enough that it isn't required anymore.

    This commit also replaces do_sync_mapping_range with a local copy of the
    related functions.

    Signed-off-by: Chris Mason

    Chris Mason
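
    A minimal sketch of the lookup behavior the fix restores: always return
    the first extent overlapping the requested range, with no cached "last"
    pointer. The tree is modeled as a sorted array for brevity:

        #include <stdint.h>
        #include <stdio.h>

        struct extent_stub {
            uint64_t start;
            uint64_t len;
        };

        /* Return the first extent overlapping [start, start + len), scanning
         * in order rather than trusting a cached last-returned pointer, which
         * is what handed back the wrong extent before. */
        static const struct extent_stub *
        lookup_first(const struct extent_stub *ems, int nr,
                     uint64_t start, uint64_t len)
        {
            for (int i = 0; i < nr; i++) {
                if (ems[i].start < start + len &&
                    ems[i].start + ems[i].len > start)
                    return &ems[i];
            }
            return NULL;
        }

        int main(void)
        {
            const struct extent_stub ems[] = {
                { 0, 4096 }, { 4096, 4096 }, { 8192, 4096 },
            };
            const struct extent_stub *em = lookup_first(ems, 3, 2048, 8192);

            if (em)
                printf("first match starts at %llu\n",
                       (unsigned long long)em->start);
            return 0;
        }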