25 May, 2010

2 commits


15 Mar, 2010

1 commit

  • The btrfs defrag ioctl was limited to doing the entire file. This
    commit adds a new interface that can defrag a specific range inside
    the file.

    It can also force compression on the file, allowing you to selectively
    compress individual files after they were created, even when mount -o
    compress isn't turned on.
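
    As a rough illustration of the range interface described above, the
    sketch below packs an argument buffer for a range defrag request. The
    field layout and flag values are assumptions based on my reading of the
    btrfs ioctl header and should be checked against your kernel headers;
    the helper name pack_defrag_range_args is hypothetical.

```python
import struct

# Assumed flag values; verify against the kernel's btrfs ioctl header.
DEFRAG_RANGE_COMPRESS = 1 << 0   # force compression on the range
DEFRAG_RANGE_START_IO = 1 << 1   # start writeback right away

# Assumed layout: u64 start, u64 len, u64 flags, u32 extent_thresh,
# followed by five reserved u32 slots (48 bytes in total).
_ARGS_FMT = "=QQQI5I"


def pack_defrag_range_args(start, length, flags=0, extent_thresh=0):
    """Build the byte buffer a range defrag request would carry."""
    return struct.pack(_ARGS_FMT, start, length, flags, extent_thresh,
                       0, 0, 0, 0, 0)


# Example: defrag and compress the first 1 MiB after a 4 KiB offset.
args = pack_defrag_range_args(4096, 1 << 20, DEFRAG_RANGE_COMPRESS)
```

    The buffer would be passed to an ioctl call on an open file descriptor
    together with the range defrag request number, which is not reproduced
    here.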

    Signed-off-by: Chris Mason

    Chris Mason
     

18 Dec, 2009

1 commit

  • There are some cases where file extents are inserted without involving
    the ordered struct. In these cases, we update disk_i_size directly,
    without checking for pending ordered extents or the DELALLOC bit. This
    patch extends btrfs_ordered_update_i_size() to handle these cases.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

14 Oct, 2009

1 commit

  • rpm has a habit of running fdatasync when the file hasn't
    changed. We already detect if a file hasn't been changed
    in the current transaction but it might have been sent to
    the tree-log in this transaction and not changed since
    the last call to fsync.

    In this case, we want to avoid a tree log sync, which includes
    a number of synchronous writes and barriers. This commit
    extends the existing tracking of the last transaction to change
    a file to also track the last sub-transaction.

    The end result is that rpm -ivh and -Uvh are roughly twice as fast,
    and on par with ext3.
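
    The check described above can be sketched as a toy model. This is not
    the kernel code: the class names LogInode and LogFs and all of their
    fields are hypothetical, chosen only to show how tracking the last
    sub-transaction lets a repeated fsync skip the tree log sync.

```python
from dataclasses import dataclass


@dataclass
class LogInode:
    last_trans: int = 0       # last transaction that changed the file
    last_sub_trans: int = 0   # last sub-transaction that changed the file
    last_log_commit: int = 0  # sub-transaction covered by the last log sync


class LogFs:
    def __init__(self):
        self.committed_trans = 0  # last fully committed transaction
        self.running_trans = 1    # transaction currently open
        self.log_transid = 1      # current sub-transaction id
        self.log_syncs = 0        # number of expensive log syncs performed

    def write(self, inode):
        inode.last_trans = self.running_trans
        inode.last_sub_trans = self.log_transid

    def fsync(self, inode):
        # Not changed in the current transaction: nothing to do.
        if inode.last_trans <= self.committed_trans:
            return "skip"
        # Changed in this transaction, but already logged and untouched
        # since the last fsync: avoid the tree log sync.
        if inode.last_sub_trans <= inode.last_log_commit:
            return "skip"
        self.log_syncs += 1
        inode.last_log_commit = self.log_transid
        self.log_transid += 1
        return "log_sync"
```

    In this model, a second fsync with no write in between returns "skip"
    instead of paying for another log sync.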

    Signed-off-by: Chris Mason

    Chris Mason
     

09 Oct, 2009

1 commit

  • This patch fixes an issue with the delalloc metadata space reservation
    code. We used to free the reservation as soon as we allocated the
    delalloc region. But if we are not inserting an inline extent, we don't
    actually insert the extent item until after the ordered extent is
    written out. This patch does three things:

    1) It moves the reservation clearing stuff into the ordered code, so when
    we remove the ordered extent we remove the reservation.
    2) It adds an EXTENT_DO_ACCOUNTING flag that gets passed when we clear
    delalloc bits in the cases where we also want to clear the metadata
    reservation, namely when we do an inline extent or we invalidate the
    page.
    3) It adds another waitqueue to the space info so that when we start a fs
    wide delalloc flush, anybody else who also hits that area will simply wait
    for the flush to finish and then try to make their allocation.

    This has been tested thoroughly to make sure we did not regress on
    performance.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

29 Sep, 2009

1 commit

  • At the start of a transaction we do a btrfs_reserve_metadata_space() and
    specify how many items we plan on modifying. Then, once we've done our
    modifications, we call btrfs_unreserve_metadata_space() for
    the same number of items we reserved.

    For keeping track of metadata needed for data I've had to add an extent_io op
    for when we merge extents. This lets us track space properly when we are doing
    sequential writes, so we don't end up reserving way more metadata space than
    what we need.

    The only place where the metadata space accounting is not done is in the
    relocation code. This is because Yan is going to be reworking that code in the
    near future, so running btrfs-vol -b could still possibly result in an
    ENOSPC-related panic. This patch also turns off the metadata_ratio stuff in
    order to allow users to more efficiently use their disk space.

    This patch makes it so we track how much metadata we need for an inode's
    delayed allocation extents by tracking how many extents are currently
    waiting for allocation. It introduces two new callbacks for the
    extent_io trees, merge_extent_hook and split_extent_hook. These help
    us keep track of when we merge delalloc extents together and split them
    up. Reservations are made before any actual dirtying occurs,
    and then we unreserve after we dirty.

    btrfs_unreserve_metadata_for_delalloc() will make the appropriate
    unreservations as needed based on the number of reservations we
    currently have and the number of extents we currently have. Doing the
    reservation outside of the actual dirtying lets us do
    things like filemap_flush() the inode to try and force delalloc to
    happen, or as a last resort actually start allocation on all delalloc
    inodes in the fs. This has survived dbench, fs_mark and an fsx torture
    test.
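
    The extent counting described above can be modelled in a few lines.
    This is a toy sketch, not the kernel logic: the class DelallocAccounting
    and its method names (other than the two hook names taken from the
    text) are hypothetical, and one reservation per extent is an assumption.

```python
class DelallocAccounting:
    """Toy per-inode counters for delalloc metadata reservations."""

    def __init__(self):
        self.outstanding_extents = 0  # delalloc extents awaiting allocation
        self.reserved_extents = 0     # extents we hold reservations for

    def reserve(self):
        # Done before any dirtying: assume the write needs a new extent.
        self.reserved_extents += 1

    def set_delalloc(self):
        # The write created a new delalloc extent.
        self.outstanding_extents += 1

    def merge_extent_hook(self):
        # Two adjacent delalloc extents became one.
        self.outstanding_extents -= 1

    def split_extent_hook(self):
        # One delalloc extent was split in two.
        self.outstanding_extents += 1

    def unreserve(self):
        # Done after dirtying: give back whatever exceeds the extent count.
        excess = max(0, self.reserved_extents - self.outstanding_extents)
        self.reserved_extents -= excess
        return excess


acct = DelallocAccounting()
for i in range(3):              # three sequential, adjacent writes
    acct.reserve()
    acct.set_delalloc()
    if i > 0:
        acct.merge_extent_hook()
    acct.unreserve()
```

    After the three sequential writes the model holds a reservation for a
    single extent, which is the over-reservation the merge hook is meant to
    avoid.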

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

22 Sep, 2009

1 commit

  • btrfs allows subvolumes and snapshots anywhere in the directory tree.
    If we snapshot a subvolume that contains a link to another subvolume
    called subvolA, subvolA can be accessed through both the original
    subvolume and the snapshot. This is similar to creating a hard link to
    a directory, and has very similar problems.

    The aim of this patch is to enforce that there is only one access point
    to each subvolume. Only the first directory entry (the one added when
    the subvolume/snapshot was created) is treated as a valid access point.
    The first directory entry is distinguished by checking the root forward
    reference. If the corresponding root forward reference is missing,
    we know the entry is not the first one.
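
    A minimal sketch of the forward reference check, assuming a simple
    dictionary in place of the on-disk root references. The class
    SubvolLinks, its methods, and the root ids used in the example are
    hypothetical.

```python
class SubvolLinks:
    """Toy table of root forward references."""

    def __init__(self):
        # (parent_root, child_root) -> (dir_id, name) of the first entry
        self.forward_refs = {}

    def create_subvol(self, parent_root, child_root, dir_id, name):
        # Creating a subvolume/snapshot adds the one forward reference.
        self.forward_refs[(parent_root, child_root)] = (dir_id, name)

    def is_valid_access_point(self, root, child_root, dir_id, name):
        # A directory entry is valid only if a matching forward
        # reference exists for the root that contains it.
        return self.forward_refs.get((root, child_root)) == (dir_id, name)


links = SubvolLinks()
# Root 256 contains the link to subvolA (root 257).
links.create_subvol(256, 257, dir_id=1, name="subvolA")
# Root 258 is a snapshot of 256: it inherits the directory entry,
# but no forward reference (258 -> 257) was ever created for it.
original_ok = links.is_valid_access_point(256, 257, 1, "subvolA")
snapshot_ok = links.is_valid_access_point(258, 257, 1, "subvolA")
```

    Here original_ok is true and snapshot_ok is false, so only the entry in
    the original subvolume gives access to subvolA.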

    This patch also adds snapshot/subvolume rename support; the code
    allows renaming a subvolume link across subvolumes.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

24 Jun, 2009

1 commit


10 Jun, 2009

2 commits

  • Add support for the standard attributes set via chattr and read via
    lsattr. Currently we store the attributes in the flags value in
    the btrfs inode, but I wonder whether we should split it into two so
    that we don't have to keep converting between the two formats.

    Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
    as they were confusing the existing code and got in the way of the
    new additions.

    Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
    trivial.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Chris Mason

    Christoph Hellwig
     
  • This commit introduces a new kind of back reference for btrfs metadata.
    Once a filesystem has been mounted with this commit, IT WILL NO LONGER
    BE MOUNTABLE BY OLDER KERNELS.

    When a tree block in a subvolume tree is cow'd, the reference counts of all
    extents it points to are increased by one. At transaction commit time,
    the old root of the subvolume is recorded in a "dead root" data structure,
    and the btree it points to is later walked, dropping reference counts
    and freeing any blocks where the reference count goes to 0.

    The increments done during cow and decrements done after commit cancel out,
    and the walk is a very expensive way to go about freeing the blocks that
    are no longer referenced by the new btree root. This commit reduces the
    transaction overhead by avoiding the need for dead root records.

    When a non-shared tree block is cow'd, we free the old block at once, and the
    new block inherits the old block's references. When a tree block with reference
    count > 1 is cow'd, we increase the reference counts of all extents
    the new block points to by one, and decrease the old block's reference count by
    one.
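
    The two cases above can be shown with a toy model. This is only a
    sketch of the reference count rule as stated in the text; the class
    TreeBlock and the function cow_block are hypothetical and ignore back
    references entirely.

```python
class TreeBlock:
    def __init__(self, children=()):
        self.refs = 1                  # reference count of this block
        self.children = list(children) # extents this block points to
        self.freed = False


def cow_block(old):
    """Copy-on-write a tree block following the rule described above."""
    new = TreeBlock(old.children)
    if old.refs == 1:
        # Non-shared: free the old block at once. The new block inherits
        # its references, so the children's counts stay untouched.
        old.refs = 0
        old.freed = True
    else:
        # Shared: everything the new block points to gains a reference,
        # and the old block loses the one we held.
        for child in new.children:
            child.refs += 1
        old.refs -= 1
    return new


leaf_a, leaf_b = TreeBlock(), TreeBlock()

private = TreeBlock([leaf_a, leaf_b])
cow_block(private)   # leaves keep refs == 1, private is freed

shared = TreeBlock([leaf_a, leaf_b])
shared.refs = 2      # e.g. also referenced by a snapshot
cow_block(shared)    # leaves go to refs == 2, shared drops to 1
```

    The non-shared case touches no lower level reference counts, which is
    where the saving comes from.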

    This dead tree avoidance code removes the need to modify the reference
    counts of lower level extents when a non-shared tree block is cow'd.
    But we still need to update back ref for all pointers in the block.
    This is because the location of the block is recorded in the back ref
    item.

    We can solve this by introducing a new type of back ref. The new
    back ref provides information about the pointer's key, its level and the
    tree in which the pointer lives. This information allows us to find the
    pointer by searching the tree. The shortcoming of the new back ref is that
    it only works for pointers in tree blocks referenced by their owner trees.

    This is mostly a problem for snapshots, where resolving one of these
    fuzzy back references would be O(number_of_snapshots) and quite slow.
    The solution used here is to use the fuzzy back references in the common
    case where a given tree block is only referenced by one root,
    and use the full back references when multiple roots have a reference
    on a given block.

    This commit adds a per-subvolume red-black tree to keep track of cached
    inodes. The red-black tree helps the balancing code to find cached
    inodes whose inode numbers fall within a given range.

    This commit improves the balancing code by introducing several data
    structures to keep the state of balancing. The most important one
    is the back ref cache. It caches how the upper level tree blocks are
    referenced. This greatly reduces the overhead of checking back refs.

    The improved balancing code scales significantly better with a large
    number of snapshots.

    This is a very large commit and was written in a number of
    pieces. But, they depend heavily on the disk format change and were
    squashed together to make sure git bisect didn't end up in a
    bad state wrt space balancing or the format change.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

01 Apr, 2009

1 commit

  • Renames and truncates are both common ways to replace old data with new
    data. The filesystem can make an effort to make sure the new data is
    on disk before actually replacing the old data.

    This is especially important for rename, which many applications use as
    though it were atomic for both the data and the metadata involved. The
    current btrfs code will happily replace a file that is fully on disk
    with one that was just created and still has pending IO.

    If we crash after transaction commit but before the IO is done, we'll end
    up replacing a good file with a zero length file. The solution used
    here is to create a list of inodes that need special ordering and force
    them to disk before the commit is done. This is similar to the
    ext3 style data=ordered, except it is only done on selected files.

    Btrfs is able to get away with this because it does not wait on commits
    very often, even for fsync (which uses a sub-commit).

    For renames, we order the file when it wasn't already
    on disk and when it is replacing an existing file. Larger files
    are sent to filemap_flush right away (before the transaction handle is
    opened).

    For truncates, we order if the file goes from non-zero size down to
    zero size. This is a little different, because at the time of the
    truncate the file has no dirty bytes to order. But, we flag the inode
    so that it is added to the ordered list on close (via release method). We
    also immediately add it to the ordered list of the current transaction
    so that we can try to flush down any writes the application sneaks in
    before commit.
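
    The ordering policy above, reduced to a toy model. The function and
    class names are hypothetical, and the size cutoff for flushing larger
    files right away is left out because the message does not state it.

```python
def order_on_rename(source_fully_on_disk, replaces_existing_file):
    """Order the file if it still has pending IO and replaces a file."""
    return (not source_fully_on_disk) and replaces_existing_file


def order_on_truncate(old_size, new_size):
    """Order the file if it goes from non-zero size down to zero."""
    return old_size > 0 and new_size == 0


class OrderedTransaction:
    """Toy transaction that flushes ordered inodes before committing."""

    def __init__(self):
        self.ordered_inodes = []
        self.flushed = []

    def add_ordered(self, inode):
        if inode not in self.ordered_inodes:
            self.ordered_inodes.append(inode)

    def commit(self):
        # Force the data of every ordered inode to disk first, so a crash
        # after the commit cannot expose a zero length replacement.
        self.flushed.extend(self.ordered_inodes)
        self.ordered_inodes.clear()
        return "committed"


trans = OrderedTransaction()
if order_on_rename(source_fully_on_disk=False, replaces_existing_file=True):
    trans.add_ordered("new.conf")
if order_on_truncate(old_size=4096, new_size=0):
    trans.add_ordered("log.txt")
trans.commit()
```

    Only the two selected files are flushed at commit; every other dirty
    file in the transaction is left alone.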

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Mar, 2009

1 commit

  • The tree logging code allows individual files or directories to be logged
    without including operations on other files and directories in the FS.
    It tries to commit the minimal set of changes to disk in order to
    fsync the single file or directory that was sent to fsync or O_SYNC.

    The tree logging code was allowing files and directories to be unlinked
    if they were part of a rename operation where only one directory
    in the rename was in the fsync log. This patch adds a few new rules
    to the tree logging.

    1) on rename or unlink, if the inode being unlinked isn't in the fsync
    log, we must force a full commit before doing an fsync of the directory
    where the unlink was done. The commit isn't done during the unlink,
    but it is forced the next time we try to log the parent directory.

    Solution: record transid of last unlink/rename per directory when the
    directory wasn't already logged. For renames this is only done when
    renaming to a different directory.

    mkdir foo/some_dir
    normal commit
    rename foo/some_dir foo2/some_dir
    mkdir foo/some_dir
    fsync foo/some_dir/some_file

    The fsync above will unlink the original some_dir without recording
    it in its new location (foo2). After a crash, some_dir will be gone
    unless the fsync of some_file forces a full commit.

    2) we must log any new names for any file or dir that is in the fsync
    log. This way we make sure not to lose files that are unlinked during
    the same transaction.

    2a) we must log any new names for any file or dir during rename
    when the directory they are being removed from was logged.

    2a is actually the more important variant. Without the extra logging,
    a crash might unlink the old name without recreating the new one.

    3) after a crash, we must go through any directories with a link count
    of zero and redo the rm -rf

    mkdir f1/foo
    normal commit
    rm -rf f1/foo
    fsync(f1)

    The directory f1/foo was fully removed from the FS, but fsync was never
    called on f1/foo, only on its parent dir f1. After a crash the rm -rf
    must be replayed. This must be able to recurse down the entire
    directory tree. The inode link count fixup code takes care of the
    ugly details.
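
    Rule 1 and its solution can be sketched as follows. This is a toy
    model under simplifying assumptions, not the tree log code: the classes
    LoggedDir and TreeLogFs and their fields are hypothetical, and a
    directory counts as "logged" once it has been through one fsync.

```python
class LoggedDir:
    def __init__(self):
        self.logged = False         # already part of the fsync log?
        self.last_unlink_trans = 0  # transid of the last unlink/rename


class TreeLogFs:
    def __init__(self):
        self.running_trans = 1
        self.last_committed = 0
        self.full_commits = 0

    def record_unlink(self, directory):
        # Record the transid only when the directory wasn't already logged.
        if not directory.logged:
            directory.last_unlink_trans = self.running_trans

    def full_commit(self):
        self.last_committed = self.running_trans
        self.running_trans += 1
        self.full_commits += 1

    def fsync_dir(self, directory):
        # The commit is not done at unlink time; it is forced the next
        # time we try to log the directory.
        if directory.last_unlink_trans > self.last_committed:
            self.full_commit()
            return "full_commit"
        directory.logged = True
        return "log"
```

    In this model the first fsync of a directory after an unrecorded unlink
    pays for a full commit, and later fsyncs go back to the cheap log path.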

    Signed-off-by: Chris Mason

    Chris Mason
     

21 Feb, 2009

1 commit

  • This is a step in the direction of better -ENOSPC handling. Instead of
    checking the global bytes counter we check the space_info bytes counters to
    make sure we have enough space.

    If we don't, we go ahead and try to allocate a new chunk, and then if that fails
    we return -ENOSPC. This patch adds two counters to btrfs_space_info,
    bytes_delalloc and bytes_may_use.

    bytes_delalloc accounts for extents we've actually set up for delalloc and will
    be allocated at some point down the line.

    bytes_may_use keeps track of how many bytes we may use for delalloc at
    some point. When we actually set the extent_bit for the delalloc bytes we
    subtract the reserved bytes from the bytes_may_use counter. This ensures
    that space is always available when the delalloc bytes are finally allocated.
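
    A toy model of the two counters, to make the flow concrete. Only
    bytes_delalloc and bytes_may_use come from the text; the class
    SpaceInfoModel, its other fields, and the chunk allocation stand-in are
    assumptions for illustration.

```python
import errno


class SpaceInfoModel:
    def __init__(self, total_bytes, chunk_size=0, chunks_available=0):
        self.total_bytes = total_bytes
        self.bytes_used = 0
        self.bytes_delalloc = 0   # set up for delalloc, allocated later
        self.bytes_may_use = 0    # reserved, may become delalloc
        self.chunk_size = chunk_size
        self.chunks_available = chunks_available

    def _free(self):
        return (self.total_bytes - self.bytes_used
                - self.bytes_delalloc - self.bytes_may_use)

    def _try_alloc_chunk(self):
        if self.chunks_available > 0:
            self.chunks_available -= 1
            self.total_bytes += self.chunk_size

    def reserve(self, nbytes):
        """Reserve space up front; returns 0 or -ENOSPC."""
        if self._free() < nbytes:
            self._try_alloc_chunk()
        if self._free() < nbytes:
            return -errno.ENOSPC
        self.bytes_may_use += nbytes
        return 0

    def set_delalloc(self, nbytes):
        # The reserved bytes are now real delalloc bytes.
        self.bytes_may_use -= nbytes
        self.bytes_delalloc += nbytes
```

    Because the reservation is taken before the write is accepted, the
    failure shows up as -ENOSPC at write time rather than later, when the
    delalloc bytes have to be allocated.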

    Signed-off-by: Josef Bacik

    Josef Bacik
     

12 Dec, 2008

1 commit

  • The block group structs are referenced in many different
    places, and it's not safe to free while balancing. So, those block
    group structs were simply leaked instead.

    This patch replaces the block group pointer in the inode with the starting byte
    offset of the block group and adds reference counting to the block group
    struct.
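
    A minimal sketch of the scheme, assuming a dictionary keyed by starting
    byte offset in place of the real block group cache. The class and
    method names are hypothetical.

```python
class BlockGroupModel:
    def __init__(self, start):
        self.start = start
        self.refs = 1        # the cache itself holds the first reference
        self.freed = False


class BlockGroupCacheModel:
    def __init__(self):
        self.by_start = {}

    def add(self, start):
        self.by_start[start] = BlockGroupModel(start)

    def lookup(self, start):
        """Find a block group by offset and take a reference on it."""
        bg = self.by_start.get(start)
        if bg is not None:
            bg.refs += 1
        return bg

    def put(self, bg):
        """Drop a reference; the struct is freed with the last one."""
        bg.refs -= 1
        if bg.refs == 0:
            bg.freed = True

    def remove(self, start):
        """Balancing removed the group: drop the cache's reference."""
        self.put(self.by_start.pop(start))


cache = BlockGroupCacheModel()
cache.add(1 << 20)
inode_block_group = 1 << 20              # the inode stores only the offset
held = cache.lookup(inode_block_group)   # a user takes a reference
cache.remove(1 << 20)                    # removed while still in use
stale = cache.lookup(inode_block_group)  # the offset no longer resolves
cache.put(held)                          # last reference: now it is freed
```

    The inode never holds a pointer that can dangle; a stale offset simply
    fails to resolve.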

    Signed-off-by: Yan Zheng

    Yan Zheng
     

09 Dec, 2008

1 commit

  • This adds a sequence number to the btrfs inode that is increased on
    every update. NFS will be able to use that to detect when an inode has
    changed, without relying on inaccurate time fields.
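
    To illustrate why a counter is more reliable than timestamps, here is
    a toy model in which two updates land within the same clock tick. The
    class SeqInode and the one-second time granularity are assumptions.

```python
class SeqInode:
    def __init__(self):
        self.sequence = 0  # increased on every update
        self.mtime = 0     # coarse timestamp, whole seconds

    def update(self, now_seconds):
        self.sequence += 1
        self.mtime = now_seconds


def changed_since(cached_sequence, inode):
    """A client compares its cached counter with the current one."""
    return cached_sequence != inode.sequence


inode = SeqInode()
inode.update(now_seconds=100)
cached_seq, cached_mtime = inode.sequence, inode.mtime
inode.update(now_seconds=100)   # second update within the same second
```

    The cached mtime still matches after the second update, while the
    sequence comparison reports the change.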

    While we're here, this also:

    Puts reserved space into the super block and inode

    Adds a log root transid to the super so we can pick the newest super
    based on the fsync log as well as the main transaction ID. For now
    the log root transid is always zero, but that'll get fixed.

    Adds a starting offset to the dev_item. This will let us do better
    alignment calculations if we know the start of a partition on the disk.

    Signed-off-by: Chris Mason

    Chris Mason
     

30 Sep, 2008

1 commit

  • This improves the comments at the top of many functions. It didn't
    dive into the guts of functions because I was trying to
    avoid merging problems with the new allocator and back reference work.

    extent-tree.c and volumes.c were both skipped, and there is definitely
    more work to do in cleaning and commenting the code.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Sep, 2008

17 commits


28 Aug, 2007

1 commit


11 Aug, 2007

1 commit


14 Jun, 2007

1 commit

  • Attached below are some of the code cleanups that I came across while
    reading the code.

    a) alloc_path already calls init_path.
    b) Mention that btrfs_inode is the in-memory copy. Ext4 has ext4_inode_info as
    the in-memory copy and ext4_inode as the disk copy.

    Signed-off-by: Chris Mason

    Aneesh
     

12 Jun, 2007

1 commit


01 May, 2007

1 commit


11 Apr, 2007

1 commit