22 Dec, 2010

1 commit


29 Nov, 2010

1 commit

  • The new DIO bio splitting code has problems when the bio
    spans more than one ordered extent. This will happen as the
    generic DIO code merges our get_blocks calls together into
    a bigger single bio.

    This fixes things by walking forward in the ordered extent
    code finding all the overlapping ordered extents and completing them
    all at once.

    Signed-off-by: Chris Mason

    Chris Mason
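
    The forward walk described in this commit can be pictured with a minimal
    sketch. Everything here is an assumption made for illustration: the
    struct example_ordered type, the sorted array standing in for the ordered
    tree, and the complete_io_range() helper are hypothetical and are not the
    btrfs implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an ordered extent. */
struct example_ordered {
    uint64_t file_offset;  /* first byte covered by the extent */
    uint64_t len;          /* length of the extent in bytes */
    uint64_t bytes_left;   /* bytes still waiting for IO to finish */
};

/*
 * Account the finished IO range [start, start + len) against every ordered
 * extent it overlaps. The array is assumed to be sorted by file_offset with
 * no overlap between entries. Returns how many extents this call completed.
 */
size_t complete_io_range(struct example_ordered *extents, size_t count,
                         uint64_t start, uint64_t len)
{
    uint64_t end = start + len;
    size_t completed = 0;

    for (size_t i = 0; i < count; i++) {
        struct example_ordered *oe = &extents[i];
        uint64_t oe_end = oe->file_offset + oe->len;

        if (oe_end <= start || oe->bytes_left == 0)
            continue;  /* before the IO range, or already complete */
        if (oe->file_offset >= end)
            break;     /* sorted input: nothing further can overlap */

        /* Clip the IO range to this extent. */
        uint64_t lo = start > oe->file_offset ? start : oe->file_offset;
        uint64_t hi = end < oe_end ? end : oe_end;

        assert(hi - lo <= oe->bytes_left);
        oe->bytes_left -= hi - lo;
        if (oe->bytes_left == 0)
            completed++;
    }
    return completed;
}
```

    A single merged bio covering two adjacent extents would complete both in
    one call, instead of only the first one found.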
     

25 May, 2010

1 commit

  • This provides basic DIO support for reading and writing. It does not do the
    work to recover from mismatching checksums; that will come later. A few design
    changes have been made from Jim's code (sorry Jim!)

    1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO
    code in order to account for all of BTRFS's oddities, but thanks to that work it
    seems like the best bet is to just ignore compression and such and just opt to
    fall back on buffered IO.

    2) Fall back on buffered IO for compressed or inline extents. Jim's code did
    its own buffering to make DIO with compressed extents work. Now we just
    fall back to normal buffered IO.

    3) Use ordered extents for the writes so that all of the

    lock_extent()
    lookup_ordered()

    type checks continue to work.

    4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
    DIO writes.

    I've tested this with fsx and everything works great. This patch depends on my
    dio and filemap.c patches to work. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
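
    Points 3 and 4 of this commit rely on a lock, look up, retry loop. The
    following is a rough sketch of that pattern only. The struct and the
    example_* helpers are hypothetical stand-ins for lock_extent() and
    lookup_ordered(), and waiting is simulated by decrementing a counter
    rather than by blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for one file range. */
struct example_range {
    bool locked;            /* range lock held */
    int ordered_in_flight;  /* ordered extents overlapping the range */
    int waits;              /* how often the reader had to wait */
};

void example_lock(struct example_range *r)
{
    assert(!r->locked);
    r->locked = true;
}

void example_unlock(struct example_range *r)
{
    assert(r->locked);
    r->locked = false;
}

/* Stand-in for blocking until one ordered extent has finished. */
void example_wait_ordered(struct example_range *r)
{
    r->ordered_in_flight--;
    r->waits++;
}

/*
 * Returns with the range locked and no ordered extent inside it. The lock
 * is dropped before waiting so the writer can make progress.
 */
void example_lock_for_read(struct example_range *r)
{
    for (;;) {
        example_lock(r);
        if (r->ordered_in_flight == 0)
            return;
        example_unlock(r);
        example_wait_ordered(r);
    }
}
```

    The reader only proceeds once the check succeeds while the lock is held,
    which is what keeps it from racing with an in-flight DIO write.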
     

15 Mar, 2010

2 commits

  • When finishing io we run btrfs_dec_test_ordered_pending, and then immediately
    run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that
    already, so we're searching twice when we don't have to. This patch lets us
    pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do
    complete io on that ordered extent we can just use the one we found then instead
    of having to do another btrfs_lookup_ordered_extent. This made my fio job with
    the other patch go from 24 MB/s to 29 MB/s.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The ordered tree used to need a mutex, but currently all we use it for is to
    protect the rb_tree, and a spin_lock is just fine for that. Using a spin_lock
    instead makes dbench run a little faster, 58 MB/s instead of 51 MB/s, and have
    less latency, 3445.138 ms instead of 3820.633 ms.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
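
    The first of the two commits above avoids a second search by handing the
    ordered extent back to the caller. A minimal sketch of that idea follows.
    The array-backed tree, the lookup counter and the example_* names are
    assumptions for illustration and do not mirror the real function
    signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct example_ordered {
    uint64_t file_offset;
    uint64_t len;
    uint64_t bytes_left;
};

int example_lookups;  /* counts searches, to make the saving visible */

struct example_ordered *example_lookup(struct example_ordered *tree,
                                       size_t count, uint64_t offset)
{
    example_lookups++;
    for (size_t i = 0; i < count; i++) {
        if (offset >= tree[i].file_offset &&
            offset < tree[i].file_offset + tree[i].len)
            return &tree[i];
    }
    return NULL;
}

/*
 * Subtract io_size from the extent containing offset. Returns 1 when this
 * IO finished the extent, and in that case stores the extent in *cached so
 * the caller does not have to search for it again.
 */
int example_dec_test_pending(struct example_ordered *tree, size_t count,
                             struct example_ordered **cached,
                             uint64_t offset, uint64_t io_size)
{
    struct example_ordered *oe = example_lookup(tree, count, offset);

    *cached = NULL;
    if (oe == NULL)
        return 0;
    assert(io_size <= oe->bytes_left);
    oe->bytes_left -= io_size;
    if (oe->bytes_left != 0)
        return 0;
    *cached = oe;
    return 1;
}
```

    With the out parameter, finishing an extent costs one search per IO
    completion instead of two.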
     

09 Mar, 2010

1 commit

  • btrfs initializes rb trees in quite a number of places by setting rb_node =
    NULL. The problem with this is that 17d9ddc72fb8bba0d4f678 in the
    linux-next tree adds a new field to that struct which needs to be NULL for
    the new rbtree library code to work properly. This patch uses RB_ROOT as
    the initializer so all of the relevant fields will be NULL'd. Without the
    patch I get a panic.

    Signed-off-by: Eric Paris
    Acked-by: Venkatesh Pallipadi
    Signed-off-by: Chris Mason

    Eric Paris
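
    Why a whole-struct initializer is safer than assigning one member can be
    shown with a small sketch. The struct example_root type and the
    EXAMPLE_ROOT macro are hypothetical and only imitate the shape of the
    idea; the extra member stands in for the newly added field, whose real
    name is not given in the commit message.

```c
#include <stddef.h>

struct example_node;

/* Hypothetical root with a member that was added later. */
struct example_root {
    struct example_node *node;
    void *extra;  /* stands in for the newly added field */
};

/* Whole-struct initializer: every member is reset, including new ones. */
#define EXAMPLE_ROOT (struct example_root) { NULL, NULL }

struct example_cache {
    struct example_root tree;
    int nr_items;
};

void example_cache_init(struct example_cache *cache)
{
    /* Assigning only cache->tree.node = NULL would leave extra untouched. */
    cache->tree = EXAMPLE_ROOT;
    cache->nr_items = 0;
}
```

    Callers that use the macro keep working when the struct grows, because
    only the macro definition has to learn about the new member.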
     

18 Dec, 2009

2 commits

  • iput() can trigger new transactions if we are dropping the
    final reference, so calling it in btrfs_commit_transaction
    may end up deadlocking. This patch adds delayed iput to avoid
    the issue.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • In some cases file extents are inserted without involving the
    ordered struct. In these cases, we update disk_i_size directly,
    without checking pending ordered extents and the DELALLOC bit. This
    patch extends btrfs_ordered_update_i_size() to handle these cases.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
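
    The delayed iput idea from the first commit above can be sketched as a
    simple deferral list. This is an illustration under assumptions: the
    example_* types and functions are hypothetical, the reference count is a
    plain integer, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

struct example_inode {
    int refcount;
    struct example_inode *next_delayed;  /* link in the deferral list */
};

struct example_fs {
    struct example_inode *delayed_iputs;  /* references to drop later */
};

/* Stand-in for iput(): dropping the last reference may start new work. */
void example_iput(struct example_inode *inode)
{
    assert(inode->refcount > 0);
    inode->refcount--;
}

/* Called where dropping the reference directly would be unsafe. */
void example_add_delayed_iput(struct example_fs *fs,
                              struct example_inode *inode)
{
    inode->next_delayed = fs->delayed_iputs;
    fs->delayed_iputs = inode;
}

/* Called later, from a context that is allowed to start a transaction. */
int example_run_delayed_iputs(struct example_fs *fs)
{
    int dropped = 0;

    while (fs->delayed_iputs != NULL) {
        struct example_inode *inode = fs->delayed_iputs;

        fs->delayed_iputs = inode->next_delayed;
        inode->next_delayed = NULL;
        example_iput(inode);
        dropped++;
    }
    return dropped;
}
```

    The commit path only queues the inode; the reference is dropped after
    the commit, where triggering a new transaction cannot deadlock.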
     

02 Oct, 2009

1 commit

  • Use filemap_fdatawrite_range and filemap_fdatawait_range instead of
    local copies of the functions. For filemap_fdatawait_range that
    also means replacing the awkward old wait_on_page_writeback_range
    calling convention with the regular filemap byte offsets.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Chris Mason

    Christoph Hellwig
     

12 Sep, 2009

1 commit

  • Btrfs writes go through delalloc to the data=ordered code. This
    makes sure that all of the data is on disk before the metadata
    that references it. The tracking means that we have to make sure
    each page in an extent is fully written before we add that extent into
    the on-disk btree.

    This was done in the past by setting the EXTENT_ORDERED bit for the
    range of an extent when it was added to the data=ordered code, and then
    clearing the EXTENT_ORDERED bit in the extent state tree as each page
    finished IO.

    One of the reasons we had to do this was because sometimes pages are
    magically dirtied without page_mkwrite being called. The EXTENT_ORDERED
    bit is checked at writepage time, and if it isn't there, our page became
    dirty without going through the proper path.

    These bit operations make for a number of rbtree searches for each page,
    and can cause considerable lock contention.

    This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
    As pages go into the ordered code, PagePrivate2 is set on each one.
    This is a cheap operation because we already have all the pages locked
    and ready to go.

    As IO finishes, the PagePrivate2 bit is cleared and the ordered
    accounting is updated for each page.

    At writepage time, if the PagePrivate2 bit is missing, we go into the
    writepage fixup code to handle improperly dirtied pages.

    Signed-off-by: Chris Mason

    Chris Mason
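
    The per-page flag life cycle described in this commit can be sketched
    with plain bit operations. The flag values, struct example_page and the
    helper names are hypothetical; the real page flags and their accessors
    are not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical flag bits, not the kernel's page flag values. */
#define EXAMPLE_PG_DIRTY    (1u << 0)
#define EXAMPLE_PG_PRIVATE2 (1u << 1)

struct example_page {
    unsigned int flags;
};

/* Pages entering the ordered code get the marker while already locked. */
void example_mark_ordered(struct example_page *pages, int nr)
{
    for (int i = 0; i < nr; i++)
        pages[i].flags |= EXAMPLE_PG_PRIVATE2;
}

/* At IO completion: clear the marker and report whether it was set. */
bool example_test_clear_ordered(struct example_page *page)
{
    bool was_set = (page->flags & EXAMPLE_PG_PRIVATE2) != 0;

    page->flags &= ~EXAMPLE_PG_PRIVATE2;
    return was_set;
}

/* At writepage time: a dirty page without the marker needs the fixup. */
bool example_needs_fixup(const struct example_page *page)
{
    return (page->flags & EXAMPLE_PG_DIRTY) != 0 &&
           (page->flags & EXAMPLE_PG_PRIVATE2) == 0;
}
```

    Each check touches only the page itself, which is why it avoids the
    per-page tree searches the commit message mentions.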
     

01 Apr, 2009

1 commit

  • Renames and truncates are both common ways to replace old data with new
    data. The filesystem can make an effort to make sure the new data is
    on disk before actually replacing the old data.

    This is especially important for rename, which many applications use as
    though it were atomic for both the data and the metadata involved. The
    current btrfs code will happily replace a file that is fully on disk
    with one that was just created and still has pending IO.

    If we crash after transaction commit but before the IO is done, we'll end
    up replacing a good file with a zero length file. The solution used
    here is to create a list of inodes that need special ordering and force
    them to disk before the commit is done. This is similar to the
    ext3 style data=ordered mode, except it is only done on selected files.

    Btrfs is able to get away with this because it does not wait on commits
    very often, even for fsync (which uses a sub-commit).

    For renames, we order the file when it wasn't already
    on disk and when it is replacing an existing file. Larger files
    are sent to filemap_flush right away (before the transaction handle is
    opened).

    For truncates, we order if the file goes from non-zero size down to
    zero size. This is a little different, because at the time of the
    truncate the file has no dirty bytes to order. But, we flag the inode
    so that it is added to the ordered list on close (via release method). We
    also immediately add it to the ordered list of the current transaction
    so that we can try to flush down any writes the application sneaks in
    before commit.

    Signed-off-by: Chris Mason

    Chris Mason
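
    The two ordering rules in this commit reduce to small predicates. The
    sketch below is an assumption-laden illustration: "wasn't already on
    disk" is modeled as a boolean supplied by the caller, and both function
    names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Rename rule: order the file when it still has data that is not on disk
 * and the rename replaces an existing file.
 */
bool example_rename_needs_ordering(bool source_has_pending_io,
                                   bool replaces_existing_file)
{
    return source_has_pending_io && replaces_existing_file;
}

/*
 * Truncate rule: order the file when it shrinks from a non-zero size
 * down to zero.
 */
bool example_truncate_needs_ordering(uint64_t old_size, uint64_t new_size)
{
    return old_size != 0 && new_size == 0;
}
```

    Keeping the rules this narrow is what limits the forced writeback to
    selected files rather than to all dirty data.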
     

09 Dec, 2008

1 commit

  • Btrfs stores checksums for each data block. Until now, they have
    been stored in the subvolume trees, indexed by the inode that is
    referencing the data block. This means that when we read the inode,
    we've probably read in at least some checksums as well.

    But, this has a few problems:

    * The checksums are indexed by logical offset in the file. When
    compression is on, this means we have to do the expensive checksumming
    on the uncompressed data. It would be faster if we could checksum
    the compressed data instead.

    * If we implement encryption, we'll be checksumming the plain text and
    storing that on disk. This is significantly less secure.

    * For either compression or encryption, we have to get the plain text
    back before we can verify the checksum as correct. This makes the raid
    layer balancing and extent moving much more expensive.

    * It makes the front end caching code more complex, as we have to touch
    the subvolume and inodes as we cache extents.

    * There is potentially one copy of the checksum in each subvolume
    referencing an extent.

    The solution used here is to store the extent checksums in a dedicated
    tree. This allows us to index the checksums by physical extent
    start and length. It means:

    * The checksum is against the data stored on disk, after any compression
    or encryption is done.

    * The checksum is stored in a central location, and can be verified without
    following back references, or reading inodes.

    This makes compression significantly faster by reducing the amount of
    data that needs to be checksummed. It will also allow much faster
    raid management code in general.

    The checksums are indexed by a key with a fixed objectid (a magic value
    in ctree.h) and offset set to the starting byte of the extent. This
    allows us to copy the checksum items into the fsync log tree directly (or
    any other tree), without having to invent a second format for them.

    Signed-off-by: Chris Mason

    Chris Mason
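
    The key layout described in the last paragraph of this commit can be
    sketched as follows. The constants are placeholders chosen for the
    example, not the magic value from ctree.h, and the type member of the
    key is an assumption added so that the comparison is complete.

```c
#include <stdint.h>

/* Placeholder values for illustration only. */
#define EXAMPLE_CSUM_OBJECTID UINT64_C(0xC50000)
#define EXAMPLE_CSUM_TYPE     128

struct example_key {
    uint64_t objectid;  /* fixed for every checksum item */
    uint8_t type;
    uint64_t offset;    /* starting byte of the extent on disk */
};

/* Build the key for the checksums of the extent starting at extent_start. */
struct example_key example_csum_key(uint64_t extent_start)
{
    struct example_key key = {
        EXAMPLE_CSUM_OBJECTID, EXAMPLE_CSUM_TYPE, extent_start
    };
    return key;
}

/* Order keys by objectid, then type, then offset. */
int example_key_cmp(const struct example_key *a, const struct example_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}
```

    Because nothing in the key names an inode or a subvolume, the same item
    can be copied into another tree without translation.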
     

31 Oct, 2008

2 commits

  • This patch updates btrfs-progs for fallocate support.

    fallocate is a little different in Btrfs because we need to tell the
    COW system that a given preallocated extent doesn't need to be
    cow'd as long as there are no snapshots of it. This leverages the
    -o nodatacow checks.

    Signed-off-by: Yan Zheng

    Yan Zheng
     
  • This patch simplifies the nodatacow checker. If all references
    were created after the latest snapshot, then we can avoid COW
    safely. This patch also updates run_delalloc_nocow to do more
    fine-grained checking.

    Signed-off-by: Yan Zheng

    Yan Zheng
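
    The rule stated in the nodatacow commit above, that COW can be skipped
    when every reference is newer than the latest snapshot, can be sketched
    as a generation comparison. The representation of references as an array
    of generation numbers and the function name are assumptions made for
    this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true when every reference to the extent was created after the
 * latest snapshot, so no snapshot can still point at the old data.
 */
bool example_can_skip_cow(const uint64_t *ref_generations, size_t nr_refs,
                          uint64_t last_snapshot)
{
    for (size_t i = 0; i < nr_refs; i++) {
        if (ref_generations[i] <= last_snapshot)
            return false;  /* a snapshot may share this extent */
    }
    return true;
}
```

    A single reference at or before the snapshot generation is enough to
    force COW for the extent.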
     

30 Oct, 2008

1 commit

  • This is a large change for adding compression on reading and writing,
    both for inline and regular extents. It does some fairly large
    surgery to the writeback paths.

    Compression is off by default and enabled by mount -o compress. Even
    when the -o compress mount option is not used, it is possible to read
    compressed extents off the disk.

    If compression for a given set of pages fails to make them smaller, the
    file is flagged to avoid future compression attempts later.

    * While finding delalloc extents, the pages are locked before being sent down
    to the delalloc handler. This allows the delalloc handler to do complex things
    such as cleaning the pages, marking them writeback and starting IO on their
    behalf.

    * Inline extents are inserted at delalloc time now. This allows us to compress
    the data before inserting the inline extent, and it allows us to insert
    an inline extent that spans multiple pages.

    * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
    are changed to record both an in-memory size and an on disk size, as well
    as a flag for compression.

    From a disk format point of view, the extent pointers in the file are changed
    to record the on disk size of a given extent and some encoding flags.
    Space in the disk format is allocated for compression encoding, as well
    as encryption and a generic 'other' field. Neither the encryption nor the
    'other' field are currently used.

    In order to limit the amount of data read for a single random read in the
    file, the size of a compressed extent is limited to 128k. This is a
    software only limit; the disk format supports u64 sized compressed extents.

    In order to limit the ram consumed while processing extents, the uncompressed
    size of a compressed extent is limited to 256k. This is a software only limit
    and will be subject to tuning later.

    Checksumming is still done on compressed extents, and it is done on the
    uncompressed version of the data. This way additional encodings can be
    layered on without having to figure out which encoding to checksum.

    Compression happens at delalloc time, which is basically single threaded because
    it is usually done by a single pdflush thread. This makes it tricky to
    spread the compression load across all the cpus on the box. We'll have to
    look at parallel pdflush walks of dirty inodes at a later time.

    Decompression is hooked into readpages and it does spread across CPUs nicely.

    Signed-off-by: Chris Mason

    Chris Mason
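
    The size limits and the "stop trying" flag from this commit can be
    combined into one acceptance check. This is a sketch under assumptions:
    the function, the struct and the order of the checks are hypothetical,
    and only the 128k and 256k figures come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_MAX_COMPRESSED   (128 * 1024)  /* on-disk size limit */
#define EXAMPLE_MAX_UNCOMPRESSED (256 * 1024)  /* in-memory size limit */

struct example_file {
    bool no_compress;  /* set once compression failed to shrink the data */
};

/*
 * Decide whether the compressed form of a range should be written.
 * raw_len is the uncompressed size, packed_len the compressed size.
 */
bool example_use_compressed(struct example_file *file,
                            size_t raw_len, size_t packed_len)
{
    if (file->no_compress)
        return false;
    if (raw_len > EXAMPLE_MAX_UNCOMPRESSED)
        return false;  /* caller is expected to split the range first */
    if (packed_len >= raw_len) {
        file->no_compress = true;  /* avoid future attempts on this file */
        return false;
    }
    return packed_len <= EXAMPLE_MAX_COMPRESSED;
}
```

    Once the flag is set, later ranges of the same file skip compression
    even if they would have compressed well.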
     

04 Oct, 2008

1 commit

  • This reworks the btrfs O_DIRECT write code a bit. It had always fallen
    back to buffered IO and done an invalidate, but needed to be updated
    for the data=ordered code. The invalidate wasn't actually removing pages
    because they were still inside an ordered extent.

    This also combines the O_DIRECT/O_SYNC paths where possible, and kicks
    off IO in the main btrfs_file_write loop to keep the pipe down to the
    disk full as we process long writes.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Sep, 2008

16 commits