07 May, 2013

1 commit


21 Feb, 2013

2 commits

  • Currently, we can do unlocked dio reads, but the following race
    is possible:

    dio_read_task                            truncate_task
                                             ->btrfs_setattr()
    ->btrfs_direct_IO
      ->__blockdev_direct_IO
        ->btrfs_get_block
                                             ->btrfs_truncate()
                                              #alloc truncated blocks
                                              #to other inode
        ->submit_io()
         #INFORMATION LEAK

    In order to avoid this problem, we must serialize unlocked dio reads with
    truncate. There are two approaches:
    - use extent lock to protect the extent that we truncate
    - use inode_dio_wait() to make sure the truncating task will wait for
    the read DIO.

    If we use the 1st one, we will run into the endless truncation problem due to
    the nonlocked read DIO once we implement the nonlocked write DIO, because
    we would still need to invoke inode_dio_wait() to avoid the race between write
    DIO and truncation. By that time, we would have to introduce

    btrfs_inode_{block, resume}_nolock_dio()

    again. That is, we would have to implement this patch again, so I chose the 2nd
    way to fix the problem.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
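
    A minimal userspace sketch of the counter-and-wait idea behind
    inode_dio_wait() as used here (names and the spin-wait are illustrative;
    this is not the kernel implementation):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int i_dio_count;          /* DIO reads currently in flight */

    static void dio_read_begin(void) { atomic_fetch_add(&i_dio_count, 1); }
    static void dio_read_end(void)   { atomic_fetch_sub(&i_dio_count, 1); }

    static void inode_dio_wait_model(void)
    {
        /* truncate must not free blocks while any DIO read is still running */
        while (atomic_load(&i_dio_count) > 0)
            ;                               /* the kernel sleeps here instead of spinning */
    }

    int main(void)
    {
        dio_read_begin();
        /* ... the direct IO read would map blocks and submit bios here ... */
        dio_read_end();

        inode_dio_wait_model();             /* truncate path waits before freeing blocks */
        puts("truncate may proceed");
        return 0;
    }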
     
  • We need not use a global lock to protect the delalloc_bytes of the
    inode, just use its own lock. In this way, we can reduce the lock
    contention and ->delalloc_lock will just protect delalloc inode
    list.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
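
    A rough sketch of the locking split described above, with made-up structure
    names and pthread mutexes standing in for the kernel locks: the per-inode
    lock guards delalloc_bytes, while the fs-wide delalloc_lock is left guarding
    only the delalloc inode list:

    #include <pthread.h>

    struct model_inode {
        pthread_mutex_t lock;               /* per-inode: protects delalloc_bytes */
        unsigned long long delalloc_bytes;
    };

    struct model_fs_info {
        pthread_mutex_t delalloc_lock;      /* now only protects the delalloc inode list */
    };

    static void add_delalloc_bytes(struct model_inode *inode, unsigned long long len)
    {
        pthread_mutex_lock(&inode->lock);
        inode->delalloc_bytes += len;       /* no fs-wide contention on this hot path */
        pthread_mutex_unlock(&inode->lock);
    }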
     

17 Dec, 2012

2 commits

  • The tree logging stuff needs the csums to be on the ordered extents in order
    to log them properly, so mark that we're sync and inline the csum creation
    so we don't have to wait on the csumming to be done when logging extents
    that are still in flight. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Currently we copy all of the file information into the log: the inode item, the
    refs, xattrs, etc. But most of this doesn't change from fsync to fsync;
    just the inode item changes. So set a flag if an xattr changes or a link is
    added, and otherwise only log the inode item. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
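
    A minimal sketch of the idea (flag and function names invented for
    illustration): remember whether anything besides the inode item changed
    since the last log, and only then fall back to copying everything:

    #include <stdbool.h>

    struct model_inode {
        bool log_full;                      /* set when an xattr changes or a link is added */
    };

    static void on_xattr_change(struct model_inode *inode) { inode->log_full = true; }
    static void on_link_added(struct model_inode *inode)   { inode->log_full = true; }

    static void log_inode(struct model_inode *inode)
    {
        if (inode->log_full) {
            /* copy the inode item, refs, xattrs, ... into the log */
            inode->log_full = false;
        } else {
            /* only the inode item needs to be copied */
        }
    }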
     

02 Oct, 2012

2 commits

  • This is based on Josef's "Btrfs: turbo charge fsync".

    The current btrfs code checks whether an inode is in the log by comparing
    the root's last_log_commit to the inode's last_sub_trans[1].

    But the problem is that this root->last_log_commit is shared among
    inodes.

    Say we have N inodes to be logged: after the first inode,
    the root's last_log_commit is updated and the N-1 remaining files will
    be skipped.

    This fixes the bug by keeping a local copy of the root's last_log_commit
    inside each inode; this local copy is maintained independently.

    [1]: we regard each log transaction as a subset of btrfs's transaction,
    i.e. sub_trans

    Signed-off-by: Liu Bo

    Liu Bo
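
    A small model of the check after this fix (field names follow the
    description above; this is illustrative, not the btrfs code): each inode
    carries its own copy of last_log_commit, so logging one inode no longer
    makes its siblings look logged:

    #include <stdbool.h>

    struct model_inode {
        unsigned long long last_sub_trans;   /* sub-transaction that last changed this inode */
        unsigned long long last_log_commit;  /* per-inode copy, not shared via the root */
    };

    static bool inode_in_log(const struct model_inode *inode)
    {
        /* already logged if nothing changed since this inode's last log commit */
        return inode->last_sub_trans <= inode->last_log_commit;
    }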
     
  • At least for the vm workload. Currently on fsync we will

    1) Truncate all items in the log tree for the given inode if they exist

    and

    2) Copy all items for a given inode into the log

    The problem with this is that for things like VMs you can have lots of
    extents from the fragmented writing behavior, and worse yet you may have
    only modified a few extents, not the entire thing. This patch fixes this
    problem by tracking which transid modified our extent, and then when we do
    the tree logging we find all of the extents we've modified in our current
    transaction, sort them and commit them. We also only truncate up to the
    xattrs of the inode and copy that stuff in normally, and then just drop any
    extents in the range we have that exist in the log already. Here are some
    numbers of a 50 meg fio job that does random writes and fsync()s after every
    write

                      Original    Patched
    SATA drive         82KB/s     140KB/s
    Fusion drive      431KB/s    2532KB/s

    So around 2-6 times faster depending on your hardware. There are a few
    corner cases, for example if you truncate at all we have to do it the old
    way since there is no way to be sure what is in the log is ok. This
    probably could be done smarter, but if you write-fsync-truncate-write-fsync
    you deserve what you get. All this work is in RAM of course so if your
    inode gets evicted from cache and you read it in and fsync it we'll do it
    the slow way if we are still in the same transaction that we last modified
    the inode in.

    The biggest cool part of this is that it requires no changes to the recovery
    code, so if you fsync with this patch and crash and load an old kernel, it
    will run the recovery and be a-ok. I have tested this pretty thoroughly
    with an fsync tester and everything comes back fine, as well as xfstests.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
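
    A rough model of the bookkeeping described above (structure and field names
    are invented for illustration): remember which transaction last touched each
    extent, then at fsync time collect only the extents from the current
    transaction, sorted by file offset:

    #include <stdlib.h>

    struct model_extent {
        unsigned long long start;           /* file offset */
        unsigned long long transid;         /* transaction that last modified it */
    };

    static int cmp_start(const void *a, const void *b)
    {
        const struct model_extent *ea = a, *eb = b;
        return (ea->start > eb->start) - (ea->start < eb->start);
    }

    /* Copy the extents modified in cur_transid into out[], sorted by offset. */
    static size_t extents_to_log(const struct model_extent *all, size_t n,
                                 unsigned long long cur_transid,
                                 struct model_extent *out)
    {
        size_t m = 0;
        for (size_t i = 0; i < n; i++)
            if (all[i].transid == cur_transid)
                out[m++] = all[i];
        qsort(out, m, sizeof(*out), cmp_start);
        return m;
    }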
     

24 Jul, 2012

3 commits


15 Jun, 2012

1 commit

  • I removed this in an earlier commit and I was wrong. Because compression
    can return from filemap_fdatawrite() without having actually marked any of its
    pages for writeback, it can make filemap_fdatawait() do essentially nothing,
    and then we won't find any ordered extents because they may not have been
    created yet. So not only does this make fsync() completely useless, but it
    will also screw up if you truncate at a non-page-aligned offset, since we
    zero out the end, then wait on ordered extents, and then call drop caches.
    We can drop the cache before the io completes, and then when we try to unpin
    the extent we just wrote we won't find it and everything goes sideways. So fix
    this by putting it back, and put a giant comment there to keep me from trying
    to remove it in the future. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

30 May, 2012

4 commits

  • We have this check down in the actual logging code, but this is after we
    start a transaction and all that good stuff. So move the helper
    inode_in_log() out so we can call it in fsync() and avoid starting a
    transaction altogether and just exit if we've already fsync()'ed this file
    recently. You would notice this issue if you fsync()'ed a file over and
    over again until the transaction committed. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Ceph was hitting this race where we would remove an inode from the per-root
    orphan list before we would release the space we had reserved for the inode.
    We actually don't need a list or anything, we just need to make sure the
    root doesn't try to free up the orphan reserve until after the inodes have
    released their reservations. So use an atomic counter instead of a list on
    the root and only decrement the counter after we've released our
    reservation. I've tested this as well as several others and we no longer
    see the warnings that you would see while running ceph. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
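
    A toy model of the fix (names are illustrative): replace the per-root orphan
    list with an atomic count of inodes that still hold an orphan reservation,
    and only let the root free its orphan block rsv once that count drops to
    zero:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int orphan_inodes;        /* inodes with an outstanding orphan reservation */

    static void orphan_reserve(void)
    {
        atomic_fetch_add(&orphan_inodes, 1);
    }

    static void orphan_release(void)
    {
        /* release the inode's reservation first, then drop the count */
        atomic_fetch_sub(&orphan_inodes, 1);
    }

    static bool root_may_free_orphan_rsv(void)
    {
        return atomic_load(&orphan_inodes) == 0;
    }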
     
  • Miao pointed out, while I was working on an orphan problem, that messing
    with a bitfield where different ranges are protected by different locks
    doesn't work out right. Turns out we've been doing this forever, where we
    have different parts of the bitfield protected by either no lock at all or
    by different locks, which could cause all sorts of weird problems including
    the issue I was hitting. So instead make a runtime_flags field that we use
    the normal bit operations on, which are all atomic, so we can keep having our
    no/different locking for the different flags, and then make force_compress
    its own field so it can be treated normally. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
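
    The gist, as a standalone sketch (flag names invented): bits packed into one
    bitfield word cannot safely be updated under different locks, but atomic bit
    operations on a shared flags word can be mixed freely:

    #include <stdatomic.h>

    enum {
        MODEL_FLAG_ORPHAN_RESERVED,         /* each runtime flag gets its own bit */
        MODEL_FLAG_DELALLOC_META,
    };

    static _Atomic unsigned long runtime_flags;

    static void set_flag(int bit)   { atomic_fetch_or(&runtime_flags, 1UL << bit); }
    static void clear_flag(int bit) { atomic_fetch_and(&runtime_flags, ~(1UL << bit)); }
    static int  test_flag(int bit)  { return (atomic_load(&runtime_flags) >> bit) & 1; }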
     
  • We've been keeping around the inode sequence number in hopes that somebody
    would use it, but nobody uses it and people actually use i_version which
    serves the same purpose, so use i_version where we used the incore inode's
    sequence number and that way the sequence is updated properly across the
    board, and not just in file write. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

17 Jan, 2012

1 commit

  • I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
    that and there's no real way to get rid of those, so just stop using i_mutex to
    protect delalloc metadata reservations and use a delalloc mutex instead. This
    shouldn't be contended often at all, only if you are writing and mmap-writing to
    the file at the same time. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

09 Nov, 2011

1 commit

  • People have been reporting ENOSPC crashes in finish_ordered_io. This is because
    we try to steal from the delalloc block rsv to satisfy a reservation to update
    the inode. The problem with this is we don't explicitly save space for updating
    the inode when doing delalloc. This is kind of a problem and we've gotten away
    with this because way back when we just stole from the delalloc reserve without
    any questions, and this worked out fine because generally speaking the leaf had
    been modified either by the mtime update when we did the original write or
    because we just updated the leaf when we inserted the file extent item, only on
    rare occasions had the leaf not actually been modified, and that was still ok
    because we'd just use a block or two out of the over-reservation that is
    delalloc.

    Then came the delayed inode stuff. This is amazing, except it wants a full
    reservation for updating the inode since it may do it at some point down the
    road after we've written the blocks and we have to recow everything again. This
    worked out because the delayed inode stuff just stole from the global reserve,
    that is until recently when I changed that because it caused other problems.

    So here we are, we're doing everything right and being screwed for it. So take
    an extra reservation for the inode at delalloc reservation time and carry it
    through the life of the delalloc reservation. If we need it we can steal it in
    the delayed inode stuff. If we have already stolen it try and do a normal
    metadata reservation. If that fails try to steal from the delalloc reservation.
    If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
    solve this and in the meantime we'll steal from the global reserve.

    With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
    any problems.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

20 Oct, 2011

3 commits

  • We have not been reserving enough space for checksums. We were just reserving
    bytes for the checksum items themselves, without taking into account having
    to cow the tree and such. This patch adds a csum_bytes counter to the inode for
    keeping track of the number of bytes outstanding we have for checksums. Then we
    calculate how many leaves would be required for the checksums we are given and
    use that to reserve space. This adds a significant amount of bytes to our
    reservations, but we will handle this later. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
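
    Back-of-the-envelope version of the reservation math described above. The
    constants are placeholders, not btrfs's real on-disk numbers; the point is
    only that the number of leaves, not just the item bytes, drives the
    reservation:

    #include <stdio.h>

    #define MODEL_BLOCK_SIZE      4096ULL   /* one checksum per data block (assumed) */
    #define MODEL_CSUMS_PER_LEAF  1024ULL   /* checksums that fit in one leaf (assumed) */

    static unsigned long long csum_leaves_needed(unsigned long long csum_bytes)
    {
        unsigned long long nr_csums = csum_bytes / MODEL_BLOCK_SIZE;
        /* round up: even a single checksum may cow a whole leaf */
        return (nr_csums + MODEL_CSUMS_PER_LEAF - 1) / MODEL_CSUMS_PER_LEAF;
    }

    int main(void)
    {
        printf("leaves for 1GiB of outstanding data: %llu\n",
               csum_leaves_needed(1ULL << 30));
        return 0;
    }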
     
  • reserved_bytes is not used for anything in the inode, remove it.

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Moving things around to give us better packing in the btrfs_inode. This reduces
    the size of our inode by 8 bytes. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
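
    A standalone illustration of the kind of saving such reordering buys (not
    the btrfs_inode layout itself): grouping members by size removes padding
    holes:

    #include <stdio.h>

    struct before {                         /* 8-byte member after a 4-byte one leaves a hole */
        int a;
        long long b;
        int c;
    };                                      /* typically 24 bytes on x86_64 */

    struct after {                          /* same members, grouped by size */
        long long b;
        int a;
        int c;
    };                                      /* typically 16 bytes */

    int main(void)
    {
        printf("before: %zu bytes, after: %zu bytes\n",
               sizeof(struct before), sizeof(struct after));
        return 0;
    }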
     

11 Sep, 2011

1 commit

  • We can reproduce this oops via the following steps:

    $ mkfs.btrfs /dev/sdb7
    $ mount /dev/sdb7 /mnt/btrfs
    $ for ((i=0; i< ...

    ... inode->i_ino is changed
    to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID,
    while the snapshot's location.objectid remains unchanged.

    However, btrfs_ino() does not take this into account, and returns a wrong ino,
    which causes the oops.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     

28 Jul, 2011

2 commits

  • Now that we are using regular file crcs for the free space cache,
    we can deadlock if we try to read the free_space_inode while we are
    updating the crc tree.

    This commit fixes things by using the commit_root to read the crcs. This is
    safe because the free space cache file would already be loaded if
    that block group had been changed in the current transaction.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • So I had this brilliant idea to use atomic counters for outstanding and reserved
    extents, but this turned out to be a bad idea. Consider this where we have 1
    outstanding extent and 1 reserved extent

    Reserver                                   Releaser
                                               atomic_dec(outstanding) now 0
    atomic_read(outstanding)+1 get 1
    atomic_read(reserved) get 1
    don't actually reserve anything because
    they are the same
                                               atomic_cmpxchg(reserved, 1, 0)
    atomic_inc(outstanding)
    atomic_add(0, reserved)
                                               free reserved space for 1 extent

    Then the reserver now has no actual space reserved for it, and when it goes to
    finish the ordered IO it won't have enough space to do its allocation and you
    get those lovely warnings.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
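
    A minimal model of one way to close this race (names are illustrative; it
    follows the commit's point that the two counters must be read and updated
    together): do the read-compare-update in a single critical section, so a
    releaser can no longer slip in between the reads and the decision:

    #include <pthread.h>

    struct model_inode {
        pthread_mutex_t lock;               /* one lock covers both counters */
        int outstanding_extents;
        int reserved_extents;
    };

    /* Decide how much more to reserve when one new extent becomes outstanding. */
    static int reserve_for_one_extent(struct model_inode *inode)
    {
        int need;

        pthread_mutex_lock(&inode->lock);
        inode->outstanding_extents++;
        need = inode->outstanding_extents - inode->reserved_extents;
        if (need > 0)
            inode->reserved_extents += need;
        pthread_mutex_unlock(&inode->lock);

        return need > 0 ? need : 0;
    }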
     

28 May, 2011

1 commit

  • git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

    Conflicts:
    fs/btrfs/disk-io.c
    fs/btrfs/extent-tree.c
    fs/btrfs/free-space-cache.c
    fs/btrfs/inode.c
    fs/btrfs/transaction.c

    Signed-off-by: Chris Mason

    Chris Mason
     

27 May, 2011

1 commit

  • This will detect small random writes into files and
    queue them up for an auto defrag process. It isn't well suited to
    database workloads yet, but works for smaller files such as rpm, sqlite
    or bdb databases.

    Signed-off-by: Chris Mason

    Chris Mason
     

24 May, 2011

1 commit

  • Originally this was going to be used as a way to give hints to the allocator,
    but frankly we can get much better hints elsewhere and it's not even used at all
    for anything useful. In addition to being completely useless, when we initialize
    an inode we try to find a freeish block group to set as the inode's block group,
    and with a completely full 40gb fs this takes _forever_, so I imagine with, say,
    a 1tb fs this is just unbearable. So just axe the thing altogether; we don't
    need it, and it saves us 8 bytes in the inode and saves us 500 microseconds per
    inode lookup in my testcase. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

22 May, 2011

1 commit


21 May, 2011

1 commit

  • Changelog V5 -> V6:
    - Fix oom when the memory load is high, by storing the delayed nodes into the
    root's radix tree, and letting btrfs inodes go.

    Changelog V4 -> V5:
    - Fix the race on adding the delayed node to the inode, which was spotted by
    Chris Mason.
    - Merge Chris Mason's incremental patch into this patch.
    - Fix deadlock between readdir() and memory fault, which was reported by
    Itaru Kitayama.

    Changelog V3 -> V4:
    - Fix nested lock, which was reported by Itaru Kitayama, by updating the space
    cache inode in time.

    Changelog V2 -> V3:
    - Fix the race between the delayed worker and the task which does delayed items
    balance, which was reported by Tsutomu Itoh.
    - Modify the patch to address David Sterba's comments.
    - Fix the bug of the cpu recursion spinlock, reported by Chris Mason

    Changelog V1 -> V2:
    - break up the global rb-tree, use a list to manage the delayed nodes,
    which are created for every directory and file, and used to manage the
    delayed directory name index items and the delayed inode item.
    - introduce a worker to deal with the delayed nodes.

    Compared with Ext3/4, the performance of file creation and deletion on btrfs
    is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
    such as the inode item, directory name item, directory name index and so on.

    If we can delay some of the b+ tree insertions or deletions, we can improve the
    performance, so we made this patch, which implements delayed directory name
    index insertion/deletion and delayed inode update.

    Implementation:
    - introduce a delayed root object into the filesystem, which uses two lists to
    manage the delayed nodes that are created for every file/directory.
    One is used to manage all the delayed nodes that have delayed items, and the
    other is used to manage the delayed nodes that are waiting to be dealt with
    by the work thread.
    - Every delayed node has two rb-trees: one is used to manage the directory name
    indexes that are going to be inserted into the b+ tree, and the other is used to
    manage the directory name indexes that are going to be deleted from the b+ tree.
    - introduce a worker to deal with the delayed operations. This worker handles
    the delayed directory name index item insertions
    and deletions and the delayed inode updates.
    When the number of delayed items exceeds the lower limit, we create works for some
    delayed nodes, insert them into the work queue of the worker, and then
    go back.
    When the number of delayed items exceeds the upper bound, we create works for all
    the delayed nodes that haven't been dealt with, insert them into the work
    queue of the worker, and then wait until the number of untreated items drops below
    some threshold value.
    - When we want to insert a directory name index into the b+ tree, we just add the
    information into the delayed inserting rb-tree.
    Then we check the number of delayed items and do delayed items
    balance. (The balance policy is described above.)
    - When we want to delete a directory name index from the b+ tree, we first search
    for it in the inserting rb-tree. If we find it, just drop it. If not,
    add its key into the delayed deleting rb-tree.
    As with the delayed inserting rb-tree, we also check the number of
    delayed items and do delayed items balance.
    (The same as for the inserting manipulation.)
    - When we want to update the metadata of some inode, we cache the data of the
    inode in the delayed node. The worker will flush it into the b+ tree after
    dealing with the delayed insertions and deletions.
    - We move the delayed node to the tail of the list after we access the
    delayed node. This way, we can cache more delayed items and merge more
    inode updates.
    - If we want to commit a transaction, we deal with all the delayed nodes.
    - The delayed node is freed when we free the btrfs inode.
    - Before we log the inode items, we commit all the directory name index items
    and the delayed inode update.

    I did a quick test with the benchmark tool[1] and found that we can improve the
    performance of file creation by ~15%, and file deletion by ~20%.

    Before applying this patch:
    Create files:
    Total files: 50000
    Total time: 1.096108
    Average time: 0.000022
    Delete files:
    Total files: 50000
    Total time: 1.510403
    Average time: 0.000030

    After applying this patch:
    Create files:
    Total files: 50000
    Total time: 0.932899
    Average time: 0.000019
    Delete files:
    Total files: 50000
    Total time: 1.215732
    Average time: 0.000024

    [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

    Many thanks for Kitayama-san's help!

    Signed-off-by: Miao Xie
    Reviewed-by: David Sterba
    Tested-by: Tsutomu Itoh
    Tested-by: Itaru Kitayama
    Signed-off-by: Chris Mason

    Miao Xie
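
    A structural sketch of the objects described above (names are illustrative,
    and the rb-trees and lists are reduced to bare pointers): a per-fs delayed
    root with two lists of delayed nodes, and a per-inode delayed node holding
    an insertion tree and a deletion tree of directory name index items:

    #include <pthread.h>

    struct delayed_item;                    /* one pending dir-index insert or delete */

    struct delayed_node {                   /* one per file/directory with pending work */
        struct delayed_item *ins_tree;      /* dir-index items to insert (an rb-tree in btrfs) */
        struct delayed_item *del_tree;      /* dir-index items to delete (an rb-tree in btrfs) */
        int inode_dirty;                    /* delayed inode item update pending */
    };

    struct delayed_root {
        pthread_mutex_t lock;
        struct delayed_node **node_list;    /* all delayed nodes that have delayed items */
        struct delayed_node **prepare_list; /* nodes queued for the worker thread */
        unsigned long nr_items;             /* drives the lower/upper balance thresholds */
    };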
     

25 Apr, 2011

1 commit

  • There's a potential problem on 32bit systems when we exhaust 32bit inode
    numbers and start to allocate big inode numbers, because btrfs uses
    inode->i_ino in many places.

    So here we always use BTRFS_I(inode)->location.objectid, which is a
    u64 variable.

    There are 2 exceptions where BTRFS_I(inode)->location.objectid !=
    inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
    and inode->i_ino will still be used in those cases.

    Another reason to make this change is I'm going to use a special inode
    to save free ino cache, and the inode number must be > (u64)-256.

    Signed-off-by: Li Zefan

    Li Zefan
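
    A simplified model of the helper this change suggests (not the kernel code;
    the constants follow the two exceptions listed above): return the 64-bit
    location.objectid, except for the btree inode and empty subvol dirs, where
    i_ino remains the meaningful number:

    #include <stdint.h>

    struct model_btrfs_inode {
        uint64_t location_objectid;         /* BTRFS_I(inode)->location.objectid */
        uint64_t i_ino;                     /* VFS inode->i_ino */
    };

    static uint64_t model_btrfs_ino(const struct model_btrfs_inode *bi)
    {
        /* btree inode: objectid 0 vs i_ino 1; empty subvol dir: 256 vs 2 */
        if (bi->location_objectid == 0 || bi->i_ino == 2)
            return bi->i_ino;
        return bi->location_objectid;
    }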
     

18 Mar, 2011

1 commit

  • We track delayed allocation per inode via 2 counters: outstanding_extents
    and reserved_extents. outstanding_extents is already an
    atomic_t, but reserved_extents is not and is protected by a spinlock. So
    convert it to an atomic_t and, instead of using a spinlock, use atomic_cmpxchg
    when releasing delalloc bytes. This makes our inode 72 bytes smaller, and
    reduces locking overhead (albeit it was minimal to begin with). Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

22 Dec, 2010

1 commit


25 May, 2010

2 commits


15 Mar, 2010

1 commit

  • The btrfs defrag ioctl was limited to doing the entire file. This
    commit adds a new interface that can defrag a specific range inside
    the file.

    It can also force compression on the file, allowing you to selectively
    compress individual files after they were created, even when mount -o
    compress isn't turned on.

    Signed-off-by: Chris Mason

    Chris Mason
     

18 Dec, 2009

1 commit

  • There are some cases where file extents are inserted without involving
    the ordered struct. In these cases, we update disk_i_size directly,
    without checking the pending ordered extent and the DELALLOC bit. This
    patch extends btrfs_ordered_update_i_size() to handle these cases.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

14 Oct, 2009

1 commit

  • rpm has a habit of running fdatasync when the file hasn't
    changed. We already detect if a file hasn't been changed
    in the current transaction but it might have been sent to
    the tree-log in this transaction and not changed since
    the last call to fsync.

    In this case, we want to avoid a tree log sync, which includes
    a number of synchronous writes and barriers. This commit
    extends the existing tracking of the last transaction to change
    a file to also track the last sub-transaction.

    The end result is that rpm -ivh and -Uvh are roughly twice as fast,
    and on par with ext3.

    Signed-off-by: Chris Mason

    Chris Mason
     

09 Oct, 2009

1 commit

  • This patch fixes an issue with the delalloc metadata space reservation
    code. The problem is we used to free the reservation as soon as we
    allocated the delalloc region. The problem with this is if we are not
    inserting an inline extent, we don't actually insert the extent item until
    after the ordered extent is written out. This patch does 3 things,

    1) It moves the reservation clearing stuff into the ordered code, so when
    we remove the ordered extent we remove the reservation.
    2) It adds an EXTENT_DO_ACCOUNTING flag that gets passed when we clear
    delalloc bits in the cases where we want to clear the metadata reservation
    when we clear the delalloc extent, in the case that we do an inline extent
    or we invalidate the page.
    3) It adds another waitqueue to the space info so that when we start a fs
    wide delalloc flush, anybody else who also hits that area will simply wait
    for the flush to finish and then try to make their allocation.

    This has been tested thoroughly to make sure we did not regress on
    performance.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

29 Sep, 2009

1 commit

  • At the start of a transaction we do a btrfs_reserve_metadata_space() and
    specify how many items we plan on modifying. Then once we've done our
    modifications and such, just call btrfs_unreserve_metadata_space() for
    the same number of items we reserved.

    For keeping track of metadata needed for data I've had to add an extent_io op
    for when we merge extents. This lets us track space properly when we are doing
    sequential writes, so we don't end up reserving way more metadata space than
    what we need.

    The only place where the metadata space accounting is not done is in the
    relocation code. This is because Yan is going to be reworking that code in the
    near future, so running btrfs-vol -b could still possibly result in an ENOSPC
    related panic. This patch also turns off the metadata_ratio stuff in order to
    allow users to more efficiently use their disk space.

    This patch makes it so we track how much metadata we need for an inode's
    delayed allocation extents by tracking how many extents are currently
    waiting for allocation. It introduces two new callbacks for the
    extent_io trees, merge_extent_hook and split_extent_hook. These help
    us keep track of when we merge delalloc extents together and split them
    up. Reservations are handled before any actual dirtying occurs,
    and then we unreserve after we dirty.

    btrfs_unreserve_metadata_for_delalloc() will make the appropriate
    unreservations as needed based on the number of reservations we
    currently have and the number of extents we currently have. Doing the
    reservation outside of doing any of the actual dirty'ing lets us do
    things like filemap_flush() the inode to try and force delalloc to
    happen, or as a last resort actually start allocation on all delalloc
    inodes in the fs. This has survived dbench, fs_mark and an fsx torture
    test.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
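
    A stripped-down model of the per-inode accounting described above (names are
    illustrative): the count of outstanding delalloc extents is what the
    metadata reservation is sized against, and the merge and split hooks keep it
    accurate:

    struct model_inode {
        int outstanding_extents;            /* delalloc extents still waiting for allocation */
    };

    static void merge_extent_hook(struct model_inode *inode)
    {
        /* two adjacent delalloc extents became one: one fewer reservation needed */
        inode->outstanding_extents--;
    }

    static void split_extent_hook(struct model_inode *inode)
    {
        /* one delalloc extent became two */
        inode->outstanding_extents++;
    }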
     

22 Sep, 2009

1 commit

  • btrfs allows subvolumes and snapshots anywhere in the directory tree.
    If we snapshot a subvolume that contains a link to another subvolume
    called subvolA, subvolA can be accessed through both the original
    subvolume and the snapshot. This is similar to creating a hard link to a
    directory, and has very similar problems.

    The aim of this patch is to enforce that there is only one access point to
    each subvolume. Only the first directory entry (the one added when
    the subvolume/snapshot was created) is treated as a valid access point.
    The first directory entry is distinguished by checking the root forward
    reference. If the corresponding root forward reference is missing,
    we know the entry is not the first one.

    This patch also adds snapshot/subvolume rename support; the code
    allows renaming a subvolume link across subvolumes.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

24 Jun, 2009

1 commit