09 Nov, 2011

1 commit

  • People have been reporting ENOSPC crashes in finish_ordered_io. This happens
    because we try to steal from the delalloc block rsv to satisfy a reservation
    for updating the inode. The problem is that we don't explicitly save space for
    updating the inode when doing delalloc. We've gotten away with this because,
    way back when, we just stole from the delalloc reserve without any questions,
    and that worked out fine: generally speaking the leaf had already been
    modified, either by the mtime update when we did the original write or because
    we updated the leaf when we inserted the file extent item. Only on rare
    occasions had the leaf not actually been modified, and even that was still ok
    because we'd just use a block or two out of the delalloc over-reservation.

    Then came the delayed inode stuff. This is amazing, except it wants a full
    reservation for updating the inode, since it may do the update at some point
    down the road, after we've written the blocks, and we'd have to re-cow
    everything. This worked out because the delayed inode code just stole from the
    global reserve, that is until recently, when I changed that because it caused
    other problems.

    So here we are, doing everything right and being screwed for it. So take
    an extra reservation for the inode at delalloc reservation time and carry it
    through the life of the delalloc reservation. If we need it, we can steal it in
    the delayed inode code. If we have already stolen it, try to do a normal
    metadata reservation. If that fails, try to steal from the delalloc
    reservation. If _that_ fails, we'll get a WARN_ON() so I can start thinking of
    a better way to solve this; in the meantime we'll steal from the global
    reserve.
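The fallback chain the message describes can be sketched in a few lines (Python pseudocode; the dict-based reserves and names such as inode_update_slots are illustrative stand-ins for the kernel's block-rsv structures, not its actual API):

```python
def reserve_inode_update(delalloc_rsv, space_info, global_rsv, need=1):
    """Sketch of the fallback order: extra delalloc slot -> normal
    reservation -> steal from delalloc -> WARN and use the global reserve."""
    # 1) Use the extra reservation carried with the delalloc reservation.
    if delalloc_rsv.get("inode_update_slots", 0) >= need:
        delalloc_rsv["inode_update_slots"] -= need
        return "delalloc_extra"
    # 2) Already stolen: try a normal metadata reservation.
    if space_info.get("free", 0) >= need:
        space_info["free"] -= need
        return "normal"
    # 3) That failed: try to steal from the delalloc reservation itself.
    if delalloc_rsv.get("reserved", 0) >= need:
        delalloc_rsv["reserved"] -= need
        return "steal_delalloc"
    # 4) Last resort: warn and dip into the global reserve.
    print("WARN_ON: stealing from the global reserve")
    global_rsv["reserved"] -= need
    return "global"
```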

    With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
    any problems.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

20 Oct, 2011

3 commits

  • We have not been reserving enough space for checksums. We were only reserving
    bytes for the checksum items themselves, without taking into account having
    to cow the tree and such. This patch adds a csum_bytes counter to the inode
    for keeping track of the number of bytes outstanding we have for checksums.
    Then we calculate how many leaves would be required for the checksums we are
    given and use that to reserve space. This adds a significant number of bytes
    to our reservations, but we will handle that later. Thanks,
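The leaf arithmetic can be illustrated like this (the constants are assumptions for the sketch: a 4-byte crc32c per 4KiB block and a made-up per-leaf capacity; the kernel derives the real values from the leaf size):

```python
CSUM_SIZE = 4         # bytes per crc32c checksum (assumption)
BLOCK_SIZE = 4096     # data bytes covered by one checksum (assumption)
CSUMS_PER_LEAF = 800  # hypothetical number of checksums one leaf can hold

def csum_leaves(csum_bytes):
    """How many btree leaves the outstanding checksums could require."""
    num_csums = -(-csum_bytes // BLOCK_SIZE)        # ceiling division
    return -(-num_csums // CSUMS_PER_LEAF)

def csum_reservation(csum_bytes, leaf_size=4096, cow_depth=8):
    """Reserve enough metadata to cow every required leaf down the tree,
    not just the checksum items themselves."""
    return csum_leaves(csum_bytes) * leaf_size * cow_depth
```

With these toy numbers, 100MiB of dirty data needs checksums for 25600 blocks, i.e. 32 leaves, so the reservation is far larger than the raw checksum bytes, which is why the message warns about the added overhead.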

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • reserved_bytes is not used for anything in the inode, remove it.

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Moving things around to give us better packing in the btrfs_inode. This reduces
    the size of our inode by 8 bytes. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

11 Sep, 2011

1 commit

  • We can reproduce this oops via the following steps:

    $ mkfs.btrfs /dev/sdb7
    $ mount /dev/sdb7 /mnt/btrfs
    $ for ((i=0; i< […]

    […] we set the snapshot's inode->i_ino
    to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID,
    while the snapshot's location.objectid remains unchanged.

    However, btrfs_ino() does not take this into account, and returns a wrong ino,
    and causes the oops.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     

28 Jul, 2011

2 commits

  • Now that we are using regular file crcs for the free space cache,
    we can deadlock if we try to read the free_space_inode while we are
    updating the crc tree.

    This commit fixes things by using the commit_root to read the crcs. This is
    safe because the free space cache file would already be loaded if
    that block group had been changed in the current transaction.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • So I had this brilliant idea to use atomic counters for outstanding and reserved
    extents, but this turned out to be a bad idea. Consider this where we have 1
    outstanding extent and 1 reserved extent

    Reserver vs. Releaser:

      atomic_dec(outstanding)            now 0
      atomic_read(outstanding)+1         get 1
      atomic_read(reserved)              get 1
      don't actually reserve anything because they are the same
      atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      free reserved space for 1 extent

    Then the reserver has no actual space reserved for it, and when it goes to
    finish the ordered IO it won't have enough space to do its allocation and you
    get those lovely warnings.
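The schedule can be replayed deterministically in plain code (single-threaded Python standing in for the atomics; which actor performs each step is reconstructed from context, not stated in the original):

```python
# Replay of the racy schedule above, executed sequentially so the bad
# outcome is deterministic. Start: 1 outstanding extent, 1 reserved extent.
outstanding = 1
reserved = 1
reserved_space = 1                    # extents actually backed by space

outstanding -= 1                      # Releaser: atomic_dec(outstanding) -> 0

if outstanding + 1 == reserved:       # Reserver: reads 1 and 1, sees them
    pass                              # equal, so it reserves nothing

if reserved == 1:                     # Releaser: atomic_cmpxchg(reserved, 1, 0)
    reserved = 0
outstanding += 1                      # Releaser: atomic_inc(outstanding)
reserved += 0                         # Releaser: atomic_add(0, reserved)
reserved_space -= 1                   # Releaser: frees space for 1 extent

# Result: one outstanding extent with no reserved space behind it.
assert outstanding == 1 and reserved_space == 0
```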

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

28 May, 2011

1 commit

  • git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

    Conflicts:
    fs/btrfs/disk-io.c
    fs/btrfs/extent-tree.c
    fs/btrfs/free-space-cache.c
    fs/btrfs/inode.c
    fs/btrfs/transaction.c

    Signed-off-by: Chris Mason

    Chris Mason
     

27 May, 2011

1 commit

  • This will detect small random writes into files and
    queue them up for an auto defrag process. It isn't well suited to
    database workloads yet, but works for smaller files such as rpm, sqlite
    or bdb databases.

    Signed-off-by: Chris Mason

    Chris Mason
     

24 May, 2011

1 commit

  • Originally this was going to be used as a way to give hints to the allocator,
    but frankly we can get much better hints elsewhere, and it's not even used
    for anything useful. In addition to being completely useless, when we
    initialize an inode we try to find a freeish block group to set as the inode's
    block group, and with a completely full 40gb fs this takes _forever_, so I
    imagine with say a 1tb fs this is just unbearable. So just axe the thing
    altogether; we don't need it, and it saves us 8 bytes in the inode and 500
    microseconds per inode lookup in my testcase. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

22 May, 2011

1 commit


21 May, 2011

1 commit

  • Changelog V5 -> V6:
    - Fix oom when the memory load is high, by storing the delayed nodes into the
    root's radix tree, and letting btrfs inodes go.

    Changelog V4 -> V5:
    - Fix the race on adding the delayed node to the inode, which is spotted by
    Chris Mason.
    - Merge Chris Mason's incremental patch into this patch.
    - Fix deadlock between readdir() and memory fault, which is reported by
    Itaru Kitayama.

    Changelog V3 -> V4:
    - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
    inode in time.

    Changelog V2 -> V3:
    - Fix the race between the delayed worker and the task which does delayed items
    balance, which is reported by Tsutomu Itoh.
    - Modify the patch address David Sterba's comment.
    - Fix the bug of the cpu recursion spinlock, reported by Chris Mason

    Changelog V1 -> V2:
    - break up the global rb-tree, use a list to manage the delayed nodes,
    which is created for every directory and file, and used to manage the
    delayed directory name index items and the delayed inode item.
    - introduce a worker to deal with the delayed nodes.

    Compared with Ext3/4, the performance of file creation and deletion on btrfs
    is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
    such as inode item, directory name item, directory name index and so on.

    If we can delay some of the b+ tree insertions or deletions, we can improve
    the performance, so we made this patch, which implements delayed directory
    name index insertion/deletion and delayed inode updates.

    Implementation:
    - introduce a delayed root object into the filesystem, which uses two lists to
    manage the delayed nodes created for every file/directory.
    One list manages all the delayed nodes that have delayed items, and the
    other manages the delayed nodes that are waiting to be dealt with
    by the work thread.
    - Every delayed node has two rb-trees: one manages the directory name
    indexes that are going to be inserted into the b+ tree, and the other
    manages the directory name indexes that are going to be deleted from the
    b+ tree.
    - introduce a worker to deal with the delayed operations. This worker handles
    the delayed directory name index item insertions and deletions and the
    delayed inode updates.
    When the number of delayed items goes beyond the lower limit, we create work
    items for some delayed nodes, insert them into the worker's queue, and then
    go back.
    When the number of delayed items goes beyond the upper bound, we create work
    items for all the delayed nodes that haven't been dealt with, insert them
    into the worker's queue, and then wait until the number of untreated items
    falls below some threshold value.
    - When we want to insert a directory name index into the b+ tree, we just add
    the information to the delayed inserting rb-tree.
    Then we check the number of delayed items and do delayed item balancing.
    (The balance policy is described above.)
    - When we want to delete a directory name index from the b+ tree, we first
    search for it in the inserting rb-tree. If we find it, we just drop it. If
    not, we add its key to the delayed deleting rb-tree.
    As with the inserting rb-tree, we then check the number of delayed items and
    do delayed item balancing.
    - When we want to update the metadata of some inode, we cache the data in the
    delayed node. The worker will flush it into the b+ tree after dealing with
    the delayed insertions and deletions.
    - We move a delayed node to the tail of the list after we access it. This way
    we can cache more delayed items and merge more inode updates.
    - If we want to commit the transaction, we deal with all the delayed nodes.
    - A delayed node is freed when we free the btrfs inode.
    - Before we log the inode items, we commit all the directory name index items
    and the delayed inode update.
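The insert/delete cancellation rule for a delayed node can be sketched as follows (plain dicts stand in for the rb-trees, since only the lookup/insert/drop semantics matter here):

```python
class DelayedNode:
    """Per-inode delayed node (sketch): one tree of pending dir-index
    insertions, one tree of pending deletions."""
    def __init__(self):
        self.ins_tree = {}   # items waiting to be inserted into the b+ tree
        self.del_tree = {}   # keys waiting to be deleted from the b+ tree

    def delayed_insert(self, key, item):
        self.ins_tree[key] = item

    def delayed_delete(self, key):
        # If the item is still pending insertion, just drop it; otherwise
        # remember the key so the worker deletes it from the on-disk tree.
        if key in self.ins_tree:
            del self.ins_tree[key]
        else:
            self.del_tree[key] = True
```

Creating and then quickly removing a name cancels out inside the delayed node and never touches the b+ tree, which is where much of the speedup comes from.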

    I did a quick test with the benchmark tool[1] and found we can improve the
    performance of file creation by ~15%, and file deletion by ~20%.

    Before applying this patch:
    Create files:
    Total files: 50000
    Total time: 1.096108
    Average time: 0.000022
    Delete files:
    Total files: 50000
    Total time: 1.510403
    Average time: 0.000030

    After applying this patch:
    Create files:
    Total files: 50000
    Total time: 0.932899
    Average time: 0.000019
    Delete files:
    Total files: 50000
    Total time: 1.215732
    Average time: 0.000024

    [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

    Many thanks to Kitayama-san for his help!

    Signed-off-by: Miao Xie
    Reviewed-by: David Sterba
    Tested-by: Tsutomu Itoh
    Tested-by: Itaru Kitayama
    Signed-off-by: Chris Mason

    Miao Xie
     

25 Apr, 2011

1 commit

  • There's a potential problem on 32-bit systems when we exhaust 32-bit inode
    numbers and start to allocate big inode numbers, because btrfs uses
    inode->i_ino in many places.

    So here we always use BTRFS_I(inode)->location.objectid, which is an
    u64 variable.

    There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
    inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
    and inode->i_ino will be used in those cases.

    Another reason to make this change is I'm going to use a special inode
    to save free ino cache, and the inode number must be > (u64)-256.
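The rule reads roughly like this in sketch form (the exact comparison in the kernel helper may differ; the two exceptional inodes from the message are objectid 0 / i_ino 1 and objectid 256 / i_ino 2):

```python
BTRFS_FIRST_FREE_OBJECTID = 256

def btrfs_ino(location_objectid, i_ino):
    """Prefer the 64-bit location.objectid; fall back to i_ino for the
    special inodes whose objectid is not a real inode number (sketch)."""
    if location_objectid <= BTRFS_FIRST_FREE_OBJECTID:
        return i_ino                    # btree inode, empty subvol dirs
    return location_objectid
```

Because the return value is taken from a u64 field, inode numbers keep working even after the 32-bit i_ino space wraps.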

    Signed-off-by: Li Zefan

    Li Zefan
     

18 Mar, 2011

1 commit

  • We track delayed allocation per inode via two counters: outstanding_extents
    and reserved_extents. outstanding_extents is already an atomic_t, but
    reserved_extents is not and is protected by a spinlock. So convert it to an
    atomic_t and, instead of using a spinlock, use atomic_cmpxchg when releasing
    delalloc bytes. This makes our inode 72 bytes smaller and reduces locking
    overhead (albeit it was minimal to begin with). Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

22 Dec, 2010

1 commit


25 May, 2010

2 commits


15 Mar, 2010

1 commit

  • The btrfs defrag ioctl was limited to doing the entire file. This
    commit adds a new interface that can defrag a specific range inside
    the file.

    It can also force compression on the file, allowing you to selectively
    compress individual files after they were created, even when mount -o
    compress isn't turned on.

    Signed-off-by: Chris Mason

    Chris Mason
     

18 Dec, 2009

1 commit

  • There are some cases where file extents are inserted without involving
    the ordered struct. In these cases, we update disk_i_size directly,
    without checking pending ordered extent and DELALLOC bit. This
    patch extends btrfs_ordered_update_i_size() to handle these cases.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

14 Oct, 2009

1 commit

  • rpm has a habit of running fdatasync when the file hasn't
    changed. We already detect if a file hasn't been changed
    in the current transaction but it might have been sent to
    the tree-log in this transaction and not changed since
    the last call to fsync.

    In this case, we want to avoid a tree log sync, which includes
    a number of synchronous writes and barriers. This commit
    extends the existing tracking of the last transaction to change
    a file to also track the last sub-transaction.

    The end result is that rpm -ivh and -Uvh are roughly twice as fast,
    and on par with ext3.
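The skip decision can be sketched as a single comparison (field names here are illustrative, not the actual btrfs_inode members):

```python
def need_tree_log_sync(inode_last_sub_trans, last_synced_sub_trans):
    """fsync only needs the expensive tree-log sync if the inode changed
    in a sub-transaction newer than the last one we synced (sketch)."""
    return inode_last_sub_trans > last_synced_sub_trans

# rpm running fdatasync twice on an unchanged file:
inode_sub_trans = 5
synced = 0
assert need_tree_log_sync(inode_sub_trans, synced)        # first fsync pays
synced = inode_sub_trans
assert not need_tree_log_sync(inode_sub_trans, synced)    # second is cheap
```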

    Signed-off-by: Chris Mason

    Chris Mason
     

09 Oct, 2009

1 commit

  • This patch fixes an issue with the delalloc metadata space reservation
    code. The problem is we used to free the reservation as soon as we
    allocated the delalloc region. The problem with this is if we are not
    inserting an inline extent, we don't actually insert the extent item until
    after the ordered extent is written out. This patch does 3 things,

    1) It moves the reservation clearing stuff into the ordered code, so when
    we remove the ordered extent we remove the reservation.
    2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear
    delalloc bits in the cases where we want to clear the metadata reservation
    when we clear the delalloc extent, in the case that we do an inline extent
    or we invalidate the page.
    3) It adds another waitqueue to the space info so that when we start a fs
    wide delalloc flush, anybody else who also hits that area will simply wait
    for the flush to finish and then try to make their allocation.

    This has been tested thoroughly to make sure we did not regress on
    performance.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

29 Sep, 2009

1 commit

  • At the start of a transaction we do a btrfs_reserve_metadata_space() and
    specify how many items we plan on modifying. Then once we've done our
    modifications and such, just call btrfs_unreserve_metadata_space() for
    the same number of items we reserved.

    For keeping track of metadata needed for data I've had to add an extent_io op
    for when we merge extents. This lets us track space properly when we are doing
    sequential writes, so we don't end up reserving way more metadata space than
    what we need.

    The only place where the metadata space accounting is not done is in the
    relocation code. This is because Yan is going to be reworking that code in the
    near future, so running btrfs-vol -b could still possibly result in a ENOSPC
    related panic. This patch also turns off the metadata_ratio stuff in order to
    allow users to more efficiently use their disk space.

    This patch makes it so we track how much metadata we need for an inode's
    delayed allocation extents by tracking how many extents are currently
    waiting for allocation. It introduces two new callbacks for the
    extent_io trees, merge_extent_hook and split_extent_hook. These help
    us keep track of when we merge delalloc extents together and split them
    up. Reservations are handled prior to any actual dirtying, and we
    unreserve after we dirty.

    btrfs_unreserve_metadata_for_delalloc() will make the appropriate
    unreservations as needed based on the number of reservations we
    currently have and the number of extents we currently have. Doing the
    reservation outside of the actual dirtying lets us do things like
    filemap_flush() the inode to try and force delalloc to happen, or as a
    last resort actually start allocation on all delalloc inodes in the fs.
    This has survived dbench, fs_mark and an fsx torture test.
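The hook-based extent counting might look like this in outline (the per-extent metadata cost is an arbitrary placeholder; real reservations depend on tree height and leaf size):

```python
class DelallocTracker:
    """Sketch of per-inode outstanding-extent tracking via the merge/split
    hooks described above (illustrative, not kernel code)."""
    def __init__(self, per_extent_metadata=16384):
        self.outstanding_extents = 0
        self.reserved_metadata = 0
        self.per_extent = per_extent_metadata  # assumed cost per extent

    def reserve(self):                 # a new delalloc extent is dirtied
        self.outstanding_extents += 1
        self.reserved_metadata += self.per_extent

    def merge_extent_hook(self):       # two adjacent extents merged
        self.outstanding_extents -= 1
        self.reserved_metadata -= self.per_extent

    def split_extent_hook(self):       # one extent was split in two
        self.outstanding_extents += 1
        self.reserved_metadata += self.per_extent
```

Sequential writes that merge into one extent give most of their reservation back through the merge hook, which is exactly the over-reservation this tracking avoids.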

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

22 Sep, 2009

1 commit

  • btrfs allows subvolumes and snapshots anywhere in the directory tree.
    If we snapshot a subvolume that contains a link to another subvolume,
    called subvolA, then subvolA can be accessed through both the original
    subvolume and the snapshot. This is similar to creating a hard link to a
    directory, and has very similar problems.

    The aim of this patch is enforcing there is only one access point to
    each subvolume. Only the first directory entry (the one added when
    the subvolume/snapshot was created) is treated as valid access point.
    The first directory entry is distinguished by checking root forward
    reference. If the corresponding root forward reference is missing,
    we know the entry is not the first one.

    This patch also adds snapshot/subvolume rename support; the code
    allows renaming a subvolume link across subvolumes.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

24 Jun, 2009

1 commit


10 Jun, 2009

2 commits

  • Add support for the standard attributes set via chattr and read via
    lsattr. Currently we store the attributes in the flags value in
    the btrfs inode, but I wonder whether we should split it into two so
    that we don't have to keep converting between the two formats.

    Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
    as they were confusing the existing code and got in the way of the
    new additions.

    Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
    trivial.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Chris Mason

    Christoph Hellwig
     
  • This commit introduces a new kind of back reference for btrfs metadata.
    Once a filesystem has been mounted with this commit, IT WILL NO LONGER
    BE MOUNTABLE BY OLDER KERNELS.

    When a tree block in subvolume tree is cow'd, the reference counts of all
    extents it points to are increased by one. At transaction commit time,
    the old root of the subvolume is recorded in a "dead root" data structure,
    and the btree it points to is later walked, dropping reference counts
    and freeing any blocks where the reference count goes to 0.

    The increments done during cow and decrements done after commit cancel out,
    and the walk is a very expensive way to go about freeing the blocks that
    are no longer referenced by the new btree root. This commit reduces the
    transaction overhead by avoiding the need for dead root records.

    When a non-shared tree block is cow'd, we free the old block at once, and the
    new block inherits old block's references. When a tree block with reference
    count > 1 is cow'd, we increase the reference counts of all extents
    the new block points to by one, and decrease the old block's reference count by
    one.

    This dead tree avoidance code removes the need to modify the reference
    counts of lower level extents when a non-shared tree block is cow'd.
    But we still need to update back ref for all pointers in the block.
    This is because the location of the block is recorded in the back ref
    item.

    We can solve this by introducing a new type of back ref. The new
    back ref provides information about pointer's key, level and in which
    tree the pointer lives. This information allow us to find the pointer
    by searching the tree. The shortcoming of the new back ref is that it
    only works for pointers in tree blocks referenced by their owner trees.

    This is mostly a problem for snapshots, where resolving one of these
    fuzzy back references would be O(number_of_snapshots) and quite slow.
    The solution used here is to use the fuzzy back references in the common
    case where a given tree block is only referenced by one root,
    and use the full back references when multiple roots have a reference
    on a given block.
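The policy for choosing between the two styles can be stated as a one-liner (a sketch of the trade-off described above, not the on-disk format):

```python
def backref_style(num_referencing_roots, num_snapshots):
    """Implicit ("fuzzy") back refs are compact but cost roughly
    O(num_snapshots) to resolve, so they are only used while a block is
    referenced by a single root; shared blocks get full back refs."""
    if num_referencing_roots == 1:
        return ("implicit", num_snapshots)  # style, worst-case resolve cost
    return ("full", 1)
```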

    This commit adds a per-subvolume red-black tree to keep track of cached
    inodes. The red-black tree helps the balancing code find cached
    inodes whose inode numbers fall within a given range.

    This commit improves the balancing code by introducing several data
    structures to keep the state of balancing. The most important one
    is the back ref cache, which caches how the upper level tree blocks are
    referenced. This greatly reduces the overhead of checking back refs.

    The improved balancing code scales significantly better with a large
    number of snapshots.

    This is a very large commit and was written in a number of
    pieces. But, they depend heavily on the disk format change and were
    squashed together to make sure git bisect didn't end up in a
    bad state wrt space balancing or the format change.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

01 Apr, 2009

1 commit

  • Renames and truncates are both common ways to replace old data with new
    data. The filesystem can make an effort to make sure the new data is
    on disk before actually replacing the old data.

    This is especially important for rename, which many application use as
    though it were atomic for both the data and the metadata involved. The
    current btrfs code will happily replace a file that is fully on disk
    with one that was just created and still has pending IO.

    If we crash after transaction commit but before the IO is done, we'll end
    up replacing a good file with a zero length file. The solution used
    here is to create a list of inodes that need special ordering and force
    them to disk before the commit is done. This is similar to the
    ext3 style data=ordering, except it is only done on selected files.

    Btrfs is able to get away with this because it does not wait on commits
    very often, even for fsync (which uses a sub-commit).

    For renames, we order the file when it wasn't already
    on disk and when it is replacing an existing file. Larger files
    are sent to filemap_flush right away (before the transaction handle is
    opened).

    For truncates, we order if the file goes from non-zero size down to
    zero size. This is a little different, because at the time of the
    truncate the file has no dirty bytes to order. But, we flag the inode
    so that it is added to the ordered list on close (via release method). We
    also immediately add it to the ordered list of the current transaction
    so that we can try to flush down any writes the application sneaks in
    before commit.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Mar, 2009

1 commit

  • The tree logging code allows individual files or directories to be logged
    without including operations on other files and directories in the FS.
    It tries to commit the minimal set of changes to disk in order to
    fsync the single file or directory that was sent to fsync or O_SYNC.

    The tree logging code was allowing files and directories to be unlinked
    if they were part of a rename operation where only one directory
    in the rename was in the fsync log. This patch adds a few new rules
    to the tree logging.

    1) on rename or unlink, if the inode being unlinked isn't in the fsync
    log, we must force a full commit before doing an fsync of the directory
    where the unlink was done. The commit isn't done during the unlink,
    but it is forced the next time we try to log the parent directory.

    Solution: record transid of last unlink/rename per directory when the
    directory wasn't already logged. For renames this is only done when
    renaming to a different directory.

    mkdir foo/some_dir
    normal commit
    rename foo/some_dir foo2/some_dir
    mkdir foo/some_dir
    fsync foo/some_dir/some_file

    The fsync above will unlink the original some_dir without recording
    it in its new location (foo2). After a crash, some_dir will be gone
    unless the fsync of some_file forces a full commit

    2) we must log any new names for any file or dir that is in the fsync
    log. This way we make sure not to lose files that are unlinked during
    the same transaction.

    2a) we must log any new names for any file or dir during rename
    when the directory they are being removed from was logged.

    2a is actually the more important variant. Without the extra logging
    a crash might unlink the old name without recreating the new one

    3) after a crash, we must go through any directories with a link count
    of zero and redo the rm -rf

    mkdir f1/foo
    normal commit
    rm -rf f1/foo
    fsync(f1)

    The directory f1 was fully removed from the FS, but fsync was never
    called on f1, only its parent dir. After a crash the rm -rf must
    be replayed. This must be able to recurse down the entire
    directory tree. The inode link count fixup code takes care of the
    ugly details.
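Rule 1 amounts to a per-directory transid check at log time (a sketch with illustrative names):

```python
def must_force_full_commit(dir_last_unlink_trans, current_transid):
    """If an unlink/rename of a not-logged inode touched this directory in
    the current transaction, the next attempt to log the directory must
    fall back to a full transaction commit instead (sketch of rule 1)."""
    return dir_last_unlink_trans == current_transid
```

The check is deferred on purpose: the unlink itself stays cheap, and the cost is only paid if someone actually fsyncs the parent directory in the same transaction.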

    Signed-off-by: Chris Mason

    Chris Mason
     

21 Feb, 2009

1 commit

  • This is a step in the direction of better -ENOSPC handling. Instead of
    checking the global bytes counter we check the space_info bytes counters to
    make sure we have enough space.

    If we don't we go ahead and try to allocate a new chunk, and then if that fails
    we return -ENOSPC. This patch adds two counters to btrfs_space_info,
    bytes_delalloc and bytes_may_use.

    bytes_delalloc accounts for extents we've actually set up for delalloc and
    that will be allocated at some point down the line.

    bytes_may_use keeps track of how many bytes we may use for delalloc at
    some point. When we actually set the extent bit for the delalloc bytes, we
    subtract the reserved bytes from the bytes_may_use counter. This keeps us
    from promising delalloc space that we can't actually allocate later.
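The two counters interact like this (a toy sketch; real space_info accounting has more states and locking):

```python
class SpaceInfo:
    """Toy model of the two delalloc counters described above."""
    def __init__(self, total):
        self.total = total
        self.bytes_delalloc = 0  # extents actually set up for delalloc
        self.bytes_may_use = 0   # bytes we might still need for delalloc

    def available(self):
        return self.total - self.bytes_delalloc - self.bytes_may_use

    def reserve_may_use(self, n):
        # Fail early instead of promising space we cannot deliver; the
        # real code would try to allocate a new chunk before giving up.
        if self.available() < n:
            raise MemoryError("ENOSPC")
        self.bytes_may_use += n

    def set_delalloc(self, n):
        # When the delalloc extent bit is actually set, the bytes move
        # from "may use" over to "delalloc".
        self.bytes_may_use -= n
        self.bytes_delalloc += n
```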

    Signed-off-by: Josef Bacik

    Josef Bacik
     

12 Dec, 2008

1 commit

  • The block group structs are referenced in many different
    places, and it's not safe to free while balancing. So, those block
    group structs were simply leaked instead.

    This patch replaces the block group pointer in the inode with the starting byte
    offset of the block group and adds reference counting to the block group
    struct.

    Signed-off-by: Yan Zheng

    Yan Zheng
     

09 Dec, 2008

1 commit

  • This adds a sequence number to the btrfs inode that is increased on
    every update. NFS will be able to use that to detect when an inode has
    changed, without relying on inaccurate time fields.

    While we're here, this also:

    Puts reserved space into the super block and inode

    Adds a log root transid to the super so we can pick the newest super
    based on the fsync log as well as the main transaction ID. For now
    the log root transid is always zero, but that'll get fixed.

    Adds a starting offset to the dev_item. This will let us do better
    alignment calculations if we know the start of a partition on the disk.

    Signed-off-by: Chris Mason

    Chris Mason
     

30 Sep, 2008

1 commit

  • This improves the comments at the top of many functions. It didn't
    dive into the guts of functions because I was trying to
    avoid merging problems with the new allocator and back reference work.

    extent-tree.c and volumes.c were both skipped, and there is definitely
    more work to do in cleaning up and commenting the code.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Sep, 2008

8 commits