30 May, 2016

1 commit

  • When we do a device replace, for each device extent we find on the
    source device, we set the corresponding block group to readonly mode to
    prevent writes into it from happening while we are copying the device
    extent from the source to the target device. However, just before we set
    the block group to readonly mode, some concurrent task might have already
    allocated an extent from it or decided it could perform a nocow write
    into one of its extents. This can make the device replace process miss
    copying an extent, since it uses the extent tree's commit root to
    search for extents and only sets the left cursor to the logical end
    address of the block group once it finishes searching for all extents
    belonging to that block group. This is a problem if the respective
    ordered extents finish while we are searching for extents using the
    extent tree's commit root and no transaction commit happens while we
    are iterating the tree, since it's the delayed references created by the
    ordered extents (when they complete) that insert the extent items into
    the extent tree (using the non-commit root of course).
    Example:

         CPU 1                                          CPU 2

    btrfs_dev_replace_start()
      btrfs_scrub_dev()
        scrub_enumerate_chunks()
          --> finds device extent belonging
              to block group X

                                             starts buffered write
                                             against some inode

                                             writepages is run against
                                             that inode forcing delalloc
                                             to run

                                             btrfs_writepages()
                                               extent_writepages()
                                                 extent_write_cache_pages()
                                                   __extent_writepage()
                                                     writepage_delalloc()
                                                       run_delalloc_range()
                                                         cow_file_range()
                                                           btrfs_reserve_extent()
                                                             --> allocates an extent
                                                                 from block group X
                                                                 (which is not yet
                                                                  in RO mode)
                                                           btrfs_add_ordered_extent()
                                                             --> creates ordered extent Y
                                                       flush_epd_write_bio()
                                                         --> bio against the extent from
                                                             block group X is submitted

      btrfs_inc_block_group_ro(bg X)
        --> sets block group X to readonly

      scrub_chunk(bg X)
        scrub_stripe(device extent from srcdev)
          --> keeps searching for extent items
              belonging to the block group using
              the extent tree's commit root
          --> it never blocks due to
              fs_info->scrub_pause_req as no
              one tries to commit transaction N
          --> copies all extents found from the
              source device into the target device
          --> finishes search loop

                                             bio completes

                                             ordered extent Y completes
                                             and creates delayed data
                                             reference which will add an
                                             extent item to the extent
                                             tree when run (typically
                                             at transaction commit time)

                                             --> so the task doing the
                                                 scrub/device replace
                                                 at CPU 1 misses this
                                                 and does not copy this
                                                 extent into the new/target
                                                 device

      btrfs_dec_block_group_ro(bg X)
        --> turns block group X back to RW mode

      dev_replace->cursor_left is set to the
        logical end offset of block group X

    So fix this by waiting for all cow and nocow writes after setting a block
    group to readonly mode.
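
    The shape of that fix can be modeled in plain userspace C (an
    illustrative sketch only, with hypothetical names, not the btrfs code):
    writers register before touching the block group and bail out if it is
    already readonly, while the replace task flips the readonly flag first
    and then waits for the in-flight write count to drain before scanning.

    #include <pthread.h>
    #include <stdbool.h>

    /* Toy model of a block group: an RO flag plus in-flight writes. */
    struct block_group {
        pthread_mutex_t lock;
        pthread_cond_t drained;
        bool readonly;
        int inflight_writes;      /* cow + nocow writes still running */
    };

    /* Writers call this before allocating from / writing into the group. */
    static bool bg_enter_write(struct block_group *bg)
    {
        pthread_mutex_lock(&bg->lock);
        if (bg->readonly) {       /* too late, pick another block group */
            pthread_mutex_unlock(&bg->lock);
            return false;
        }
        bg->inflight_writes++;
        pthread_mutex_unlock(&bg->lock);
        return true;
    }

    static void bg_exit_write(struct block_group *bg)
    {
        pthread_mutex_lock(&bg->lock);
        if (--bg->inflight_writes == 0)
            pthread_cond_broadcast(&bg->drained);
        pthread_mutex_unlock(&bg->lock);
    }

    /* Replace task: set RO first, then wait out every write that got in
     * before the flag flipped, and only then scan and copy extents. */
    static void bg_set_ro_and_wait(struct block_group *bg)
    {
        pthread_mutex_lock(&bg->lock);
        bg->readonly = true;
        while (bg->inflight_writes > 0)
            pthread_cond_wait(&bg->drained, &bg->lock);
        pthread_mutex_unlock(&bg->lock);
    }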

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana
     

13 May, 2016

1 commit

  • Before the relocation process of a block group starts, it sets the block
    group to readonly mode, then flushes all delalloc writes and then finally
    waits for all ordered extents to complete. This last step includes waiting
    for ordered extents destined for extents allocated in other block groups,
    wasting time unnecessarily.

    So improve this by waiting only for ordered extents that fall into the
    block group's range.
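
    As a sketch (illustrative userspace C with hypothetical names), the
    range filter is just an interval-overlap test applied while walking the
    ordered extents, so only those inside the block group's
    [start, start + len) range are waited on:

    #include <stdbool.h>
    #include <stdint.h>

    struct ordered_extent {
        uint64_t file_offset;            /* logical start */
        uint64_t len;                    /* length in bytes */
        struct ordered_extent *next;
    };

    /* Hypothetical stand-in for waiting on one ordered extent. */
    extern void wait_ordered_extent(struct ordered_extent *oe);

    static bool ordered_in_range(const struct ordered_extent *oe,
                                 uint64_t start, uint64_t len)
    {
        /* Overlaps [start, start + len) iff it neither ends before the
         * range starts nor starts at/after the range's end. */
        return oe->file_offset + oe->len > start &&
               oe->file_offset < start + len;
    }

    static void wait_ordered_range(struct ordered_extent *head,
                                   uint64_t start, uint64_t len)
    {
        for (struct ordered_extent *oe = head; oe; oe = oe->next)
            if (ordered_in_range(oe, start, len))
                wait_ordered_extent(oe);
    }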

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik
    Reviewed-by: Liu Bo

    Filipe Manana
     

22 Oct, 2015

1 commit

  • We have a mechanism to make sure we don't lose updates for ordered extents that
    were logged in the transaction that is currently running. We add the ordered
    extent to a transaction list and then the transaction waits on all the ordered
    extents in that list. However, on substantially large file systems this
    list can become extremely large and can give us soft lockups, since the
    ordered extents don't remove themselves from the list when they complete.

    To fix this we simply add a counter to the transaction that is incremented any
    time we have a logged extent that needs to be completed in the current
    transaction. Then when the ordered extent finally completes it decrements the
    per transaction counter and wakes up the transaction if we are the last ones.
    This will eliminate the softlockup. Thanks,
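
    A minimal userspace model of that counter (hypothetical names; the real
    patch keeps the count in the btrfs transaction): logging an extent bumps
    the counter, each completion drops it, and the last completion wakes the
    committing transaction, with no list to walk.

    #include <pthread.h>

    struct transaction {
        pthread_mutex_t lock;
        pthread_cond_t all_done;
        unsigned long pending_ordered;   /* logged, not yet completed */
    };

    /* Called when an ordered extent is logged in this transaction. */
    static void trans_add_pending(struct transaction *t)
    {
        pthread_mutex_lock(&t->lock);
        t->pending_ordered++;
        pthread_mutex_unlock(&t->lock);
    }

    /* Called when such an ordered extent completes; the last one wakes
     * up the committing transaction. */
    static void trans_finish_pending(struct transaction *t)
    {
        pthread_mutex_lock(&t->lock);
        if (--t->pending_ordered == 0)
            pthread_cond_broadcast(&t->all_done);
        pthread_mutex_unlock(&t->lock);
    }

    /* Commit waits on the counter instead of walking a huge list. */
    static void trans_wait_pending(struct transaction *t)
    {
        pthread_mutex_lock(&t->lock);
        while (t->pending_ordered > 0)
            pthread_cond_wait(&t->all_done, &t->lock);
        pthread_mutex_unlock(&t->lock);
    }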

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

10 Jun, 2015

1 commit

  • Commit 3a8b36f37806 ("Btrfs: fix data loss in the fast fsync path")
    introduced a performance regression that causes an unnecessary sync of the
    log trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
    against a file, with no writes or any metadata updates to the inode in
    between them, and a transaction is committed before the second fsync is
    called.

    Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
    after a sysbench test that measured a 62% decrease in file io
    requests per second for that test's workload.

    The test is:

    echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
    echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
    echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
    mkfs -t btrfs /dev/sda2
    mount -t btrfs /dev/sda2 /fs/sda2
    cd /fs/sda2
    for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
    sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
    --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
    --file-num=1024 run

    A test on kvm guest, running a debug kernel gave me the following results:

    Without 3a8b36f378060d: 16.01 reqs/sec
    With 3a8b36f378060d: 3.39 reqs/sec
    With 3a8b36f378060d and this patch: 16.04 reqs/sec

    Reported-by: Huang Ying
    Tested-by: Huang, Ying
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

03 Jun, 2015

1 commit

  • After commit 8407f553268a
    ("Btrfs: fix data corruption after fast fsync and writeback error"),
    during wait_ordered_extents() we wait for the ordered extent to set
    BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've
    already got the checksum information, so we don't need to check
    (csum_bytes_left == 0) anywhere in the logging path.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     

22 Nov, 2014

2 commits

  • Instead of collecting all ordered extents from the inode's ordered tree
    and then wait for all of them to complete, just collect the ones that
    overlap the fsync range.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Liu Bo pointed out that my previous fix would lose the generation update in the
    scenario I described. It is actually much worse than that: we could lose
    the entire extent if we lose power right after the transaction commits.
    Consider the following:

    write extent 0-4k
    log extent in log tree
    commit transaction
    < power fail happens here
    ordered extent completes

    We would lose the 0-4k extent because it hasn't been updated in the actual
    fs tree, and the transaction commit will reset the log so it isn't
    replayed. If we lose power before the transaction commit we are safe;
    otherwise we are not.

    Fix this by keeping track of all extents we logged in this transaction. Then
    when we go to commit the transaction make sure we wait for all of those ordered
    extents to complete before proceeding. This will make sure that if we lose
    power after the transaction commit we still have our data. This also fixes the
    problem of the improperly updated extent generation. Thanks,
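
    This is the list-based scheme that the 22 Oct, 2015 entry above later
    replaces with a counter. A toy model of the invariant (illustrative C,
    hypothetical names): commit may only proceed once every ordered extent
    logged in the transaction has completed.

    #include <pthread.h>
    #include <stdbool.h>

    struct logged_extent {
        pthread_mutex_t lock;
        pthread_cond_t done;
        bool completed;
        struct logged_extent *next;      /* per-transaction list */
    };

    static void wait_logged_extent(struct logged_extent *le)
    {
        pthread_mutex_lock(&le->lock);
        while (!le->completed)
            pthread_cond_wait(&le->done, &le->lock);
        pthread_mutex_unlock(&le->lock);
    }

    /* Commit path: only after every logged ordered extent completed is
     * it safe to commit and reset the log; otherwise losing power right
     * after the commit loses extents the log will no longer replay. */
    static void commit_wait_logged(struct logged_extent *list)
    {
        for (struct logged_extent *le = list; le; le = le->next)
            wait_logged_extent(le);
    }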

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

15 Aug, 2014

1 commit

  • Truncates and renames are often used to replace old versions of a file
    with new versions. Applications often expect this to be an atomic
    replacement, even if they haven't done anything to make sure the new
    version is fully on disk.

    Btrfs has strict flushing in place to make sure that renaming over an
    old file with a new file will fully flush out the new file before
    allowing the transaction commit with the rename to complete.

    This ordering means the commit code needs to be able to lock file pages,
    and there are a few paths in the filesystem where we will try to end a
    transaction with the page lock held. It's rare, but these things can
    deadlock.

    This patch removes the ordered flushes and switches to a best effort
    filemap_flush like ext4 uses. It's not perfect, but it should fix the
    deadlocks.

    Signed-off-by: Chris Mason

    Chris Mason
     

11 Mar, 2014

4 commits

  • Since the "_struct" suffix is mainly used for distinguish the differnt
    btrfs_work between the original and the newly created one,
    there is no need using the suffix since all btrfs_workers are changed
    into btrfs_workqueue.

    Also this patch fixed some codes whose code style is changed due to the
    too long "_struct" suffix.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->endio_* workqueues with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->submit_workers with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • There was a problem in the old code:
    if we failed to log the csum, we would free all the ordered extents in the
    log list, including those that were logged successfully, which would make
    the log committer not wait for the completion of those ordered extents.

    With this patch we don't insert the ordered extents that are about to be
    logged into a global list; instead, we insert them into a local list. If we
    log the ordered extents successfully, we splice them onto the global list;
    otherwise we throw them away and do a full sync. This also reduces lock
    contention and the time spent traversing the list.
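
    In the kernel this is the usual list_head splice idiom; a sketch with a
    plain singly linked list (hypothetical names) shows the idea: the local
    list is built privately and only published, in O(1), once everything on
    it was logged successfully.

    #include <stddef.h>

    struct oe_node {
        struct oe_node *next;
        /* ... ordered extent fields elided ... */
    };

    struct oe_list {
        struct oe_node *head;
        struct oe_node *tail;
    };

    /* On success: splice the private local list onto the global one.
     * On failure the local list is simply freed and a full sync done,
     * without ever touching entries that were logged earlier. */
    static void oe_list_splice(struct oe_list *global, struct oe_list *local)
    {
        if (!local->head)
            return;
        if (global->tail)
            global->tail->next = local->head;
        else
            global->head = local->head;
        global->tail = local->tail;
        local->head = local->tail = NULL;
    }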

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

12 Nov, 2013

2 commits

  • It is very likely that there are lots of ordered extents in the
    filesystem. If we wait for the completion of all of them when we want to
    reclaim some space for the metadata space reservation, we can be blocked
    for a long time, and performance drops sharply for an extended period.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Miao Xie
     
  • I noticed that if the free space cache hits an error writing out its data
    it won't actually error out, it will just carry on. This is because it
    doesn't check the return value of btrfs_wait_ordered_range, which didn't
    actually return anything. So fix this in order to keep us from making the
    free space cache look valid when it really isn't. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

21 Sep, 2013

1 commit

  • This is a leftover of how we used to wait for ordered extents, which was to
    grab the inode and then run filemap flush on it. However if we have an ordered
    extent then we already are holding a ref on the inode, and we just use
    btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on
    the inode to start work on the ordered extent. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

01 Sep, 2013

1 commit

  • We currently have this problem where you can truncate pages that have not yet
    been written for an ordered extent. We do this because the truncate will be
    coming behind to clean us up anyway so what's the harm right? Well if truncate
    fails for whatever reason we leave an orphan item around for the file to be
    cleaned up later. But if the user then truncates the file back up and tries
    to read from the area that had been discarded previously, they will get a
    csum error because we never actually wrote that data out.

    This patch fixes this by allowing us to either discard the ordered extent
    completely, by which I mean we just free up the space we had allocated and not
    add the file extent, or adjust the length of the file extent we write. We do
    this by setting the length we truncated down to in the ordered extent, and then
    we set the file extent length and ram bytes to this length. The total disk
    space stays unchanged since we may be compressed and we can't just chop off the
    disk space, but at least this way the file extent only points to the valid data.
    Then when the file extent is freed, the extent and csums will be freed normally.

    This patch is needed for the next series which will give us more graceful
    recovery of failed truncates. Thanks,
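
    A toy model of the bookkeeping (illustrative C; in the real patch this
    state lives in the btrfs ordered extent): the ordered extent remembers
    the smallest length it was truncated down to, and completion clamps the
    file extent to that length while leaving the on-disk byte count alone.

    #include <stdint.h>

    struct ordered_extent_model {
        uint64_t len;             /* originally allocated length */
        uint64_t truncated_len;   /* smallest truncate seen; starts at len */
        uint64_t disk_len;        /* on-disk bytes; kept (compression) */
    };

    /* Called when a truncate discards the tail of a pending write. */
    static void oe_truncate(struct ordered_extent_model *oe, uint64_t new_len)
    {
        if (new_len < oe->truncated_len)
            oe->truncated_len = new_len;
    }

    /* At completion: the file extent's num_bytes/ram_bytes only cover
     * valid data; a result of 0 means drop the file extent entirely and
     * just free the reserved space. */
    static uint64_t oe_file_extent_len(const struct ordered_extent_model *oe)
    {
        return oe->truncated_len < oe->len ? oe->truncated_len : oe->len;
    }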

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

02 Jul, 2013

1 commit

  • Using the structure btrfs_sector_sum to keep the checksum value is
    unnecessary: because the extents that btrfs_sector_sum points to are
    contiguous, we can find out the expected checksums by btrfs_ordered_sum's
    bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
    removing bytenr, there is only one member in the structure, so it makes
    no sense to keep the structure, just remove it, and use a u32 array to
    store the checksum value.

    By this change, we don't use the while loop to get the checksums one by
    one anymore. Now we can get several checksum values at a time, which
    improved performance by ~74% on my SSD (31MB/s -> 54MB/s).

    test command:
    # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
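
    The resulting lookup is plain index arithmetic over the flat array (a
    sketch with hypothetical names): because the covered range is contiguous,
    the csum of any sector is found from its distance to the ordered sum's
    starting bytenr.

    #include <stdint.h>

    struct ordered_sum_model {
        uint64_t bytenr;      /* start of the contiguous range */
        int len;              /* bytes covered */
        uint32_t *sums;       /* one csum per sector (e.g. crc32c) */
    };

    static uint32_t lookup_csum(const struct ordered_sum_model *os,
                                uint64_t bytenr, uint32_t sectorsize)
    {
        uint64_t index = (bytenr - os->bytenr) / sectorsize;
        return os->sums[index];   /* no btrfs_sector_sum, no list walk */
    }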

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

07 May, 2013

1 commit

  • It is very likely that there are several blocks in a bio, and it is very
    inefficient to get their csums one by one. This patch improves on that
    by getting the csums in batches.

    According to the result of the following test, the execution time of
    __btrfs_lookup_bio_sums() is down by ~28% (300us -> 217us).

    # dd if=/file of=/dev/null bs=1M count=1024

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

21 Feb, 2013

2 commits

  • …fs-next into for-linus-3.9

    Signed-off-by: Chris Mason <chris.mason@fusionio.com>

    Conflicts:
    fs/btrfs/disk-io.c

    Chris Mason
     
  • Miao made the ordered operations stuff run async, which introduced a
    deadlock where we could get somebody (sync) racing in and committing the
    transaction while a commit was already happening. The new committer would
    try and flush ordered operations which would hang waiting for the commit to
    finish because it is done asynchronously and no longer inherits the caller's
    trans handle. To fix this we need to make the ordered operations list a per
    transaction list. We can get new inodes added to the ordered operation list
    by truncating them and then having another process writing to them, so this
    makes it so that anybody trying to add an ordered operation _must_ start a
    transaction in order to add itself to the list, which will keep new inodes
    from getting added to the ordered operations list after we start committing.
    This should fix the deadlock and also keeps us from doing a lot more work
    than we need to during commit. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

20 Feb, 2013

1 commit

  • Since we don't actually copy the extent information from the source tree in
    the fast case we don't need to wait for ordered io to be completed in order
    to fsync, we just need to wait for the io to be completed. So when we're
    logging our file just attach all of the ordered extents to the log, and then
    when the log syncs just wait for IO_DONE on the ordered extents and then
    write the super. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

19 Dec, 2012

1 commit

  • Pull btrfs update from Chris Mason:
    "A big set of fixes and features.

    In terms of line count, most of the code comes from Stefan, who added
    the ability to replace a single drive in place. This is different
    from how btrfs normally replaces drives, and is much much much faster.

    Josef is plowing through our synchronous write performance. This pull
    request does not include the DIO_OWN_WAITING patch that was discussed
    on the list, but it has a number of other improvements to cut down our
    latencies and CPU time during fsync/O_DIRECT writes.

    Miao Xie has a big series of fixes and is spreading out ordered
    operations over more CPUs. This improves performance and reduces
    contention.

    I've put in fixes for error handling around hash collisions. These
    are going back to individual stable kernels as I test against them.

    Otherwise we have a lot of fixes and cleanups, thanks everyone!
    raid5/6 is being rebased against the device replacement code. I'll
    have it posted this Friday along with a nice series of benchmarks."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
    Btrfs: fix a bug of per-file nocow
    Btrfs: fix hash overflow handling
    Btrfs: don't take inode delalloc mutex if we're a free space inode
    Btrfs: fix autodefrag and umount lockup
    Btrfs: fix permissions of empty files not affected by umask
    Btrfs: put raid properties into global table
    Btrfs: fix BUG() in scrub when first superblock reading gives EIO
    Btrfs: do not call file_update_time in aio_write
    Btrfs: only unlock and relock if we have to
    Btrfs: use tokens where we can in the tree log
    Btrfs: optimize leaf_space_used
    Btrfs: don't memset new tokens
    Btrfs: only clear dirty on the buffer if it is marked as dirty
    Btrfs: move checks in set_page_dirty under DEBUG
    Btrfs: log changed inodes based on the extent map tree
    Btrfs: add path->really_keep_locks
    Btrfs: do not mark ems as prealloc if we are writing to them
    Btrfs: keep track of the extents original block length
    Btrfs: inline csums if we're fsyncing
    Btrfs: don't bother copying if we're only logging the inode
    ...

    Linus Torvalds
     

02 Oct, 2012

2 commits

  • The ordered extent allocation is in the fast path of the IO, so use a slab
    to improve the speed of the allocation.

    "Size of the struct is 280, so this will fall into the size-512 bucket,
    giving 8 objects per page, while own slab will pack 14 objects into a page.

    Another benefit I see is to check for leaked objects when the module is
    removed (and the cache destroy takes place)."
    -- David Sterba
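
    A kernel-style sketch of such a dedicated cache (the flags and exact
    name here are illustrative, not necessarily what the patch uses):

    static struct kmem_cache *ordered_extent_cache;

    static int __init ordered_data_init(void)
    {
        /* A dedicated slab packs ~14 of the 280-byte objects into a
         * 4 KiB page, versus 8 from the generic size-512 bucket, and
         * leaked objects are reported when the cache is destroyed. */
        ordered_extent_cache = kmem_cache_create("btrfs_ordered_extent",
                                sizeof(struct btrfs_ordered_extent),
                                0, SLAB_MEM_SPREAD, NULL);
        return ordered_extent_cache ? 0 : -ENOMEM;
    }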

    Signed-off-by: Miao Xie

    Miao Xie
     
  • If a snapshot is created while we are writing some data into the file,
    the i_size of the corresponding file in the snapshot will be wrong: it will
    be beyond the end of the last file extent, and btrfsck will report:
    root 256 inode 257 errors 100

    Steps to reproduce:
    # mkfs.btrfs <dev>
    # mount <dev> <mnt>
    # cd <mnt>
    # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
    # for ((i = 0; i < 4; i++))
    > do
    >     btrfs sub snap . $i
    > done

    This is because the algorithm of the disk_i_size update is wrong: though
    there are some ordered extents behind the current one which we use to
    update disk_i_size, it doesn't mean those extents will be dealt with in the
    same transaction, so we shouldn't use the offsets of those extents to
    update disk_i_size, or we will get the wrong i_size in the snapshot.

    We fix this problem by recording the max real i_size. If we find there is
    an ordered extent which is in front of the current one and hasn't
    completed, we will record the end of the current one into that ordered
    extent. Likewise, if the current extent holds the end of another extent
    (it must be greater than the current one because it is behind the current
    one), we will record the number that the current extent holds. In this
    way, we can exclude the ordered extents that may not be dealt with in the
    same transaction, and easily know the real disk_i_size.
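
    The rule can be condensed into a small model (illustrative C, a
    simplification of the real algorithm): walking the ordered extents in
    file-offset order, disk_i_size may only advance across completed, gapless
    extents, and never past the recorded max real i_size.

    #include <stdbool.h>
    #include <stdint.h>

    struct ordered_ent {
        uint64_t offset;
        uint64_t len;
        bool completed;
        struct ordered_ent *next;        /* sorted by offset */
    };

    static uint64_t new_disk_i_size(const struct ordered_ent *list,
                                    uint64_t disk_i_size,
                                    uint64_t max_real_i_size)
    {
        for (const struct ordered_ent *e = list; e; e = e->next) {
            if (e->offset > disk_i_size)
                break;               /* hole: cannot advance across it */
            if (!e->completed)
                break;               /* pending extent: must stop here  */
            if (e->offset + e->len > disk_i_size)
                disk_i_size = e->offset + e->len;
        }
        /* A snapshot taken now sees an i_size backed by file extents. */
        return disk_i_size < max_real_i_size ? disk_i_size
                                             : max_real_i_size;
    }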

    Signed-off-by: Miao Xie

    Miao Xie
     

30 May, 2012

1 commit

  • We noticed that the ordered extent completion doesn't really rely on having
    a page and that it could be done independently of ending the writeback on a
    page. This patch makes us not do the threaded endio stuff for normal
    buffered writes and direct writes so we can end page writeback as soon as
    possible (in irq context) and only start threads to do the ordered work when
    it is actually done. Compression needs to be reworked some to take
    advantage of this as well, but atm it has to do a find_get_page in its endio
    handler so it must be done in its own thread. This makes direct writes
    quite a bit faster. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

29 Nov, 2010

1 commit

  • The new DIO bio splitting code has problems when the bio
    spans more than one ordered extent. This will happen as the
    generic DIO code merges our get_blocks calls together into
    a bigger single bio.

    This fixes things by walking forward in the ordered extent
    code finding all the overlapping ordered extents and completing them
    all at once.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 May, 2010

1 commit

  • This provides basic DIO support for reading and writing. It does not do the
    work to recover from mismatching checksums; that will come later. A few design
    changes have been made from Jim's code (sorry Jim!)

    1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO
    code in order to account for all of BTRFS's oddities, but thanks to that work it
    seems like the best bet is to just ignore compression and such and just opt to
    fallback on buffered IO.

    2) Fallback on buffered IO for compressed or inline extents. Jim's code did
    its own buffering to make dio with compressed extents work. Now we just
    fallback onto normal buffered IO.

    3) Use ordered extents for the writes so that all of the

    lock_extent()
    lookup_ordered()

    type checks continue to work.

    4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
    DIO writes.
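
    The loop in point 4 follows a lock/lookup/retry pattern; a hedged sketch
    with hypothetical helper names standing in for the kernel's
    extent-locking and ordered-extent APIs:

    #include <stdint.h>

    struct ordered_extent;
    extern void lock_extent_range(uint64_t start, uint64_t end);
    extern void unlock_extent_range(uint64_t start, uint64_t end);
    extern struct ordered_extent *lookup_ordered(uint64_t start, uint64_t end);
    extern void wait_ordered(struct ordered_extent *oe);

    static void lock_and_wait_ordered(uint64_t start, uint64_t end)
    {
        for (;;) {
            lock_extent_range(start, end);
            struct ordered_extent *oe = lookup_ordered(start, end);
            if (!oe)
                return;     /* range locked, no DIO write racing us */
            /* An in-flight ordered extent overlaps: drop the lock,
             * wait for it to finish, then try again. */
            unlock_extent_range(start, end);
            wait_ordered(oe);
        }
    }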

    I've tested this with fsx and everything works great. This patch depends on my
    dio and filemap.c patches to work. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

15 Mar, 2010

2 commits

  • When finishing io we run btrfs_dec_test_ordered_pending, and then immediately
    run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that
    already, so we're searching twice when we don't have to. This patch lets us
    pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do
    complete io on that ordered extent we can just use the one we found then instead
    of having to do another btrfs_lookup_ordered_extent. This made my fio job with
    the other patch go from 24 mb/s to 29 mb/s.
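
    The shape of the change (illustrative C, hypothetical names): the
    dec-and-test helper grows an out-parameter that hands back the ordered
    extent it already looked up, so the completion path never searches the
    tree a second time.

    #include <stdbool.h>
    #include <stdint.h>

    struct ordered_extent;
    extern struct ordered_extent *ordered_tree_lookup(uint64_t file_offset);
    extern bool ordered_dec_and_test(struct ordered_extent *oe);

    static bool dec_test_ordered_pending(uint64_t file_offset,
                                         struct ordered_extent **cached)
    {
        struct ordered_extent *oe = ordered_tree_lookup(file_offset);
        if (!oe)
            return false;
        if (!ordered_dec_and_test(oe))
            return false;    /* io still pending on this extent */
        *cached = oe;        /* caller reuses it; no second lookup */
        return true;
    }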

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The ordered tree used to need a mutex, but currently all we use it for is to
    protect the rb_tree, and a spin_lock is just fine for that. Using a spin_lock
    instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have
    less latency, 3445.138 ms instead of 3820.633 ms.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

09 Mar, 2010

1 commit

  • btrfs initializes rb trees in quite a number of places by setting rb_node =
    NULL; the problem with this is that 17d9ddc72fb8bba0d4f678 in the
    linux-next tree adds a new field to that struct which needs to be NULL for
    the new rbtree library code to work properly. This patch uses RB_ROOT as
    the initializer so all of the relevant fields will be NULL'd. Without the
    patch I get a panic.
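
    The difference is a single line (kernel-style sketch; RB_ROOT is the
    real macro from <linux/rbtree.h>, the surrounding names are made up):

    #include <linux/rbtree.h>

    struct some_tree {
        struct rb_root root;
    };

    static void some_tree_init(struct some_tree *t)
    {
        /* Fragile once rb_root grows fields: t->root.rb_node = NULL; */
        t->root = RB_ROOT;   /* initializes every field of rb_root */
    }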

    Signed-off-by: Eric Paris
    Acked-by: Venkatesh Pallipadi
    Signed-off-by: Chris Mason

    Eric Paris
     

18 Dec, 2009

1 commit

  • iput() can trigger new transactions if we are dropping the
    final reference, so calling it in btrfs_commit_transaction
    may end up deadlocking. This patch adds delayed iput to avoid
    the issue.
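
    A toy model of the mechanism (illustrative userspace C, hypothetical
    names; locking elided): contexts where dropping the final reference is
    unsafe queue the inode instead, and the queued iputs run later from a
    safe context.

    #include <stdlib.h>

    struct inode_model;
    extern void real_iput(struct inode_model *inode);  /* hypothetical */

    struct delayed_iput {
        struct inode_model *inode;
        struct delayed_iput *next;
    };

    static struct delayed_iput *delayed_iputs;

    /* Called instead of iput() from e.g. the commit path, where the
     * final reference drop could start a new transaction and deadlock. */
    static void add_delayed_iput(struct inode_model *inode)
    {
        struct delayed_iput *di = malloc(sizeof(*di));
        if (!di)
            return;          /* model only: drop on allocation failure */
        di->inode = inode;
        di->next = delayed_iputs;
        delayed_iputs = di;
    }

    /* Run later, outside the commit path, where iput() is safe. */
    static void run_delayed_iputs(void)
    {
        while (delayed_iputs) {
            struct delayed_iput *di = delayed_iputs;
            delayed_iputs = di->next;
            real_iput(di->inode);
            free(di);
        }
    }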

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng