07 May, 2013

2 commits

  • A user sent me a btrfs-image of a file system that was panicking on mount during
    the log recovery. I had originally thought these problems were from a bug in
    the free space cache code, but that was just a symptom of the problem. The
    problem is if your application does something like this:

    [prealloc][prealloc][prealloc]

    the internal extent maps will merge those all together into one extent map, even
    though on disk they are 3 separate extents. So if you go to write into one of
    these ranges the extent map will be right since we use the physical extent when
    doing the write, but when we log the extents they will use the wrong sizes for
    the remainder prealloc space. If this doesn't happen to trip up the free space
    cache (which it won't in a lot of cases) then you will get bogus entries in your
    extent tree which will screw stuff up later. The data and such will still work,
    but everything else is broken. This patch fixes this by not allowing extents
    that are on the modified list to be merged. This has the side effect that we
    are no longer adding everything to the modified list all the time, which means
    we now have to call btrfs_drop_extents every time we log an extent into the
    tree. So this allows me to drop all this speciality code I was using to get
    around calling btrfs_drop_extents. With this patch the testcase I've created no
    longer creates a bogus file system after replaying the log. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
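The merge rule described in this commit can be illustrated with a toy model. This is a minimal sketch in Python, not btrfs code: the `ExtentMap` fields, `mergeable` and `merge_all` are hypothetical names, and the check is simplified to "adjacent in the file and of the same kind".

```python
from dataclasses import dataclass


@dataclass
class ExtentMap:
    start: int              # file offset in bytes
    length: int             # bytes covered
    prealloc: bool = False
    modified: bool = False  # on the "modified" list, i.e. waiting to be logged

    @property
    def end(self):
        return self.start + self.length


def mergeable(prev, nxt):
    """Adjacent maps of the same kind may merge, unless one is waiting to be logged."""
    if prev.modified or nxt.modified:
        return False
    return prev.end == nxt.start and prev.prealloc == nxt.prealloc


def merge_all(maps):
    """Return a new list in which every mergeable neighbour has been merged."""
    merged = []
    for em in sorted(maps, key=lambda e: e.start):
        if merged and mergeable(merged[-1], em):
            merged[-1].length += em.length
        else:
            merged.append(ExtentMap(em.start, em.length, em.prealloc, em.modified))
    return merged


K = 4096
clean = [ExtentMap(0, K, True), ExtentMap(K, K, True), ExtentMap(2 * K, K, True)]
dirty = [ExtentMap(0, K, True), ExtentMap(K, K, True, modified=True),
         ExtentMap(2 * K, K, True)]

merged_clean = merge_all(clean)   # collapses into a single 12k map
merged_dirty = merge_all(dirty)   # stays as three maps with their own sizes
```

With the middle map on the modified list, each of the three maps keeps its own length, so the sizes seen at logging time still match the three separate extents.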
     
  • When logging changed extents I was logging ram_bytes as the current length,
    which isn't correct; it's supposed to be the ram bytes of the original extent.
    This is for compression where even if we split the extent we need to know the
    ram bytes so when we uncompress the extent we know how big it will be. This was
    still working out right with compression for some reason but I think we were
    getting lucky. It was definitely off for prealloc which is why I noticed it,
    btrfsck was complaining about it. With this patch btrfsck no longer complains
    after a log replay. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
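To make the ram_bytes point concrete, here is a small sketch of splitting an extent while keeping the original uncompressed size on both halves. It is a toy model in Python under assumed names (`FileExtent`, `split_for_log` and the field layout are hypothetical), not the kernel's logging code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileExtent:
    file_offset: int    # where this piece starts in the file
    num_bytes: int      # length of this piece
    extent_offset: int  # offset of this piece inside the original extent
    ram_bytes: int      # uncompressed size of the ORIGINAL extent


def split_for_log(orig, split_at):
    """Split orig at file offset split_at; both halves keep orig.ram_bytes."""
    if not orig.file_offset < split_at < orig.file_offset + orig.num_bytes:
        raise ValueError("split point must fall strictly inside the extent")
    left_len = split_at - orig.file_offset
    left = FileExtent(orig.file_offset, left_len,
                      orig.extent_offset, orig.ram_bytes)
    right = FileExtent(split_at, orig.num_bytes - left_len,
                       orig.extent_offset + left_len, orig.ram_bytes)
    return left, right


original = FileExtent(file_offset=0, num_bytes=131072,
                      extent_offset=0, ram_bytes=131072)
left, right = split_for_log(original, 4096)
```

Logging `num_bytes` in place of `ram_bytes` here would record 4096 for the left half, which is the kind of mismatch the commit describes.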
     

25 Jan, 2013

1 commit

  • We drop the extent map tree lock while we're logging extents, so somebody
    could come in and merge another extent into this one and screw up our
    logging, or they could even remove us from the list which would keep us from
    logging the extent or freeing our ref on it, so we need to make sure to not
    clear LOGGING until after the extent is logged, and then we can merge it to
    adjacent extents. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

17 Dec, 2012

2 commits

  • We are going to use EM's to log extents in the future, so we need to not
    mark them as prealloc if they aren't actually prealloc extents. Instead
    mark them with FILLING so we know to amend mod_start/mod_len and that way
    we don't confuse the extent logging code. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If we've written to a prealloc extent we need to know the original block len
    for the extent. We can't figure this out currently since ->block_len is
    just set to the extent length. So introduce ->orig_block_len so that we
    know how many bytes were in the original extent for proper extent logging
    that future patches will need. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

04 Oct, 2012

1 commit

  • Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This
    is because I'm an idiot and didn't realize that rwlock's were spin locks, so
    we've been holding this thing while doing allocations and such which is not
    good. This patch fixes this by dropping the write lock before we do
    anything heavy and re-acquire it when it is done. We also need to take a
    ref on the em's in case their corresponding pages are evicted and mark them
    as being logged so that releasepage does not remove them and doesn't remove
    them from our local list. Thanks,

    Reported-by: Dave Sterba
    Signed-off-by: Josef Bacik

    Josef Bacik
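The locking pattern from this fix (take a reference and mark the map under the lock, drop the lock for the slow work, clear the mark only afterwards) can be sketched as follows. This is an illustrative Python model with hypothetical names; `threading.Lock` merely stands in for the tree's rwlock, and `try_release` is a toy version of what releasepage would do.

```python
import threading


class ExtentMap:
    def __init__(self, start):
        self.start = start
        self.refs = 1          # the tree's own reference
        self.logging = False   # set while fsync is writing this map to the log


class ExtentTree:
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for the tree's rwlock
        self.modified = []             # maps waiting to be logged


def try_release(tree, em):
    """Toy releasepage: refuse to drop a map that is being logged."""
    with tree.lock:
        if em.logging:
            return False
        if em in tree.modified:
            tree.modified.remove(em)
        em.refs -= 1
        return True


def log_modified(tree, write_item):
    """Log every modified map, calling write_item (which may sleep) unlocked."""
    with tree.lock:
        batch = list(tree.modified)
        tree.modified.clear()
        for em in batch:
            em.refs += 1        # keep the map alive while the lock is dropped
            em.logging = True   # tell everyone else to leave it alone
    for em in batch:
        write_item(em)          # heavy work happens without the lock held
        with tree.lock:
            em.logging = False  # cleared only after the map has been logged
            em.refs -= 1
```

While `write_item` runs, `try_release` returns `False` for the map being logged, so it can neither vanish nor be taken off the local batch.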
     

02 Oct, 2012

2 commits

  • This is based on Josef's "Btrfs: turbo charge fsync".

    Josef's patch above performs very well in the random sync write test,
    because there are not many extents to merge.

    However, it does not perform well on this test:
    dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync

    The reason is that when we do sequential sync writes, we need to merge the
    current extent with the previous one, so we end up with accumulated
    extents to log:

    A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...

    So we'll have to flush more and more checksums into the log tree, which is
    the bottleneck according to my tests.

    But we can avoid this by telling fsync the real extents that are needed
    to be logged.

    With this, I did the above dd sync write test (size=50m),

               w/o (orig)    w/ (josef's)    w/ (this)
    SATA       104KB/s       109KB/s         121KB/s
    ramdisk    1.5MB/s       1.5MB/s         10.7MB/s (613%)

    Signed-off-by: Liu Bo

    Liu Bo
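A rough way to see why the accumulated extents hurt is to count the 4k blocks whose checksums get written to the log. The sketch below is a simplified model: it assumes every fsync re-logs the whole merged extent and that no transaction commit resets the accumulation during the run. The function names are made up for illustration.

```python
def blocks_logged_merged(n_writes):
    """The i-th fsync logs an extent of i blocks: 1 + 2 + ... + n."""
    return sum(range(1, n_writes + 1))


def blocks_logged_exact(n_writes):
    """Each fsync logs only the one block it actually wrote."""
    return n_writes


merged_total = blocks_logged_merged(12500)  # the dd test above: count=12500
exact_total = blocks_logged_exact(12500)
```

Under those assumptions the merged case covers 78,131,250 blocks against 12,500, i.e. quadratic rather than linear growth in the number of writes.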
     
  • At least for the vm workload. Currently on fsync we will

    1) Truncate all items in the log tree for the given inode if they exist

    and

    2) Copy all items for a given inode into the log

    The problem with this is that for things like VMs you can have lots of
    extents from the fragmented writing behavior, and worse yet you may have
    only modified a few extents, not the entire thing. This patch fixes this
    problem by tracking which transid modified our extent, and then when we do
    the tree logging we find all of the extents we've modified in our current
    transaction, sort them and commit them. We also only truncate up to the
    xattrs of the inode and copy that stuff in normally, and then just drop any
    extents in the range we have that exist in the log already. Here are some
    numbers of a 50 meg fio job that does random writes and fsync()s after every
    write

                    Original    Patched
    SATA drive      82KB/s      140KB/s
    Fusion drive    431KB/s     2532KB/s

    So around 2-6 times faster depending on your hardware. There are a few
    corner cases, for example if you truncate at all we have to do it the old
    way since there is no way to be sure what is in the log is ok. This
    probably could be done smarter, but if you write-fsync-truncate-write-fsync
    you deserve what you get. All this work is in RAM of course so if your
    inode gets evicted from cache and you read it in and fsync it we'll do it
    the slow way if we are still in the same transaction that we last modified
    the inode in.

    The biggest cool part of this is that it requires no changes to the recovery
    code, so if you fsync with this patch and crash and load an old kernel, it
    will run the recovery and be a-ok. I have tested this pretty thoroughly
    with an fsync tester and everything comes back fine, as well as xfstests.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
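The selection step described here (find the extents touched in the current transaction, sort them, log only those) might look like this in a toy model. The `generation` field and `extents_to_log` are hypothetical names chosen for this sketch, not identifiers taken from the patch.

```python
from dataclasses import dataclass


@dataclass
class ExtentMap:
    start: int       # file offset in bytes
    length: int      # bytes covered
    generation: int  # transid of the transaction that last modified this extent


def extents_to_log(extent_maps, current_transid):
    """Pick the maps modified in the running transaction, sorted by file offset."""
    changed = [em for em in extent_maps if em.generation == current_transid]
    return sorted(changed, key=lambda em: em.start)


maps = [
    ExtentMap(8192, 4096, generation=7),
    ExtentMap(0, 4096, generation=5),      # untouched since an older transaction
    ExtentMap(4096, 4096, generation=7),
]
to_log = extents_to_log(maps, current_transid=7)
```

Only the two maps from transaction 7 are returned, so an fsync on a heavily fragmented file does work proportional to what changed rather than to the whole file.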
     

15 Feb, 2012

1 commit


02 May, 2011

2 commits


22 Dec, 2010

1 commit


19 Sep, 2009

1 commit


12 Sep, 2009

2 commits

  • Data COW means that whenever we write to a file, we replace any old
    extent pointers with new ones. There was a window where a readpage
    might find the old extent pointers on disk and cache them in the
    extent_map tree in ram in the middle of a given write replacing them.

    Even though both the readpage and the write had their respective bytes
    in the file locked, the extent readpage inserts may cover more bytes than
    it had locked down.

    This commit closes the race by keeping the new extent pinned in the extent
    map tree until after the on-disk btree is properly set up with the new
    extent pointers.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • There are two main users of the extent_map tree. The
    first is regular file inodes, where it is evenly spread
    between readers and writers.

    The second is the chunk allocation tree, which maps blocks from
    logical addresses to physical ones, and it is 99.99% reads.

    The mapping tree is a point of lock contention during heavy IO
    workloads, so this commit switches things to a rw lock.

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Nov, 2008

1 commit

  • The decompress code doesn't take the logical offset in extent
    pointer into account. If the logical offset isn't zero, data
    will be decompressed into wrong pages.

    The solution used here is to record the starting offset of the extent
    in the file separately from the logical start of the extent_map struct.
    This allows us to avoid problems inserting overlapping extents.

    Signed-off-by: Yan Zheng

    Yan Zheng
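The offset problem can be shown with a user-space sketch. A file range may point into the middle of a compressed extent, so the reader has to decompress the extent and then honour the offset. Python's `zlib` stands in for the kernel's decompressor here, and `read_extent` and its parameters are hypothetical names.

```python
import zlib


def read_extent(compressed, extent_offset, num_bytes):
    """Decompress the whole extent, then return the slice the file range maps to."""
    data = zlib.decompress(compressed)
    return data[extent_offset:extent_offset + num_bytes]


original = bytes(range(256)) * 64        # 16 KiB of sample data
compressed = zlib.compress(original)

# A file range that starts 4 KiB into the extent and covers 4 KiB.
piece = read_extent(compressed, extent_offset=4096, num_bytes=4096)
```

Ignoring `extent_offset` would hand back `original[0:4096]` instead of `original[4096:8192]`, which corresponds to data ending up in the wrong pages.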
     

31 Oct, 2008

2 commits

  • This patch updates btrfs-progs for fallocate support.

    fallocate is a little different in Btrfs because we need to tell the
    COW system that a given preallocated extent doesn't need to be
    cow'd as long as there are no snapshots of it. This leverages the
    -o nodatacow checks.

    Signed-off-by: Yan Zheng

    Yan Zheng
     
  • This patch splits the hole insertion code out of btrfs_setattr
    into btrfs_cont_expand and updates btrfs_get_extent to properly
    handle the case that file extent items are not continuous.

    Signed-off-by: Yan Zheng

    Yan Zheng
     

30 Oct, 2008

1 commit

  • This is a large change for adding compression on reading and writing,
    both for inline and regular extents. It does some fairly large
    surgery to the writeback paths.

    Compression is off by default and enabled by mount -o compress. Even
    when the -o compress mount option is not used, it is possible to read
    compressed extents off the disk.

    If compression for a given set of pages fails to make them smaller, the
    file is flagged to avoid future compression attempts later.

    * While finding delalloc extents, the pages are locked before being sent down
    to the delalloc handler. This allows the delalloc handler to do complex things
    such as cleaning the pages, marking them writeback and starting IO on their
    behalf.

    * Inline extents are inserted at delalloc time now. This allows us to compress
    the data before inserting the inline extent, and it allows us to insert
    an inline extent that spans multiple pages.

    * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
    are changed to record both an in-memory size and an on disk size, as well
    as a flag for compression.

    From a disk format point of view, the extent pointers in the file are changed
    to record the on disk size of a given extent and some encoding flags.
    Space in the disk format is allocated for compression encoding, as well
    as encryption and a generic 'other' field. Neither the encryption nor the
    'other' field is currently used.

    In order to limit the amount of data read for a single random read in the
    file, the size of a compressed extent is limited to 128k. This is a
    software only limit, the disk format supports u64 sized compressed extents.

    In order to limit the ram consumed while processing extents, the uncompressed
    size of a compressed extent is limited to 256k. This is a software only limit
    and will be subject to tuning later.

    Checksumming is still done on compressed extents, and it is done on the
    uncompressed version of the data. This way additional encodings can be
    layered on without having to figure out which encoding to checksum.

    Compression happens at delalloc time, which is basically single threaded because
    it is usually done by a single pdflush thread. This makes it tricky to
    spread the compression load across all the cpus on the box. We'll have to
    look at parallel pdflush walks of dirty inodes at a later time.

    Decompression is hooked into readpages and it does spread across CPUs nicely.

    Signed-off-by: Chris Mason

    Chris Mason
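The size limits and the "stop trying if it doesn't help" rule from this commit can be sketched in user space. This is a simplified Python model: `zlib` stands in for the kernel's compressor, `write_extents` is a hypothetical helper, and falling back to raw storage when the compressed result exceeds 128k is an assumption of this sketch rather than something the commit message states.

```python
import zlib

MAX_UNCOMPRESSED = 256 * 1024  # cap on the uncompressed size of a compressed extent
MAX_COMPRESSED = 128 * 1024    # cap on the on-disk size of a compressed extent


def write_extents(data, nocompress=False):
    """Split data into extents; return (extents, updated nocompress flag)."""
    extents = []
    for pos in range(0, len(data), MAX_UNCOMPRESSED):
        chunk = data[pos:pos + MAX_UNCOMPRESSED]
        if not nocompress:
            packed = zlib.compress(chunk)
            if len(packed) < len(chunk) and len(packed) <= MAX_COMPRESSED:
                extents.append(("zlib", len(chunk), packed))
                continue
            nocompress = True   # compression did not help: stop trying for this file
        extents.append(("raw", len(chunk), chunk))
    return extents, nocompress


zeros_extents, zeros_flag = write_extents(bytes(600 * 1024))
tiny_extents, tiny_flag = write_extents(b"a")
```

Highly compressible input is cut into 256k pieces and stored compressed, while input that does not shrink is stored raw and flips the flag so later writes skip the attempt.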
     

25 Sep, 2008

21 commits