01 Sep, 2013

40 commits

  • If the filesystem was mounted with an old kernel that was not
    aware of the UUID tree, this is detected by looking at the
    uuid_tree_generation field of the superblock (similar to how
    the free space cache does it). If a mismatch is detected
    at mount time, a thread is started that does two things:
    1. Iterate through the UUID tree, check each entry, and delete
       those entries that are no longer valid (i.e., the subvol no
       longer exists or the value changed).
    2. Iterate through the root tree and, for each subvolume found,
       add the UUID tree entries for the subvolume (if they are not
       already there).

    This mechanism is also used to handle and repair errors that
    happened during the initial creation and filling of the tree.
    The update of the uuid_tree_generation field (which indicates
    that the state of the UUID tree is up to date) is blocked until
    all create and repair operations are successfully completed.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • In order to be able to detect the case that a filesystem was mounted
    with an old kernel, add a uuid-tree-gen field, as the free space
    cache does. It is part of the super block and written with each
    commit. Old kernels do not know this field and don't update it.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
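
A minimal sketch of the detection idea described above, assuming a simplified in-memory model. The struct and function names (sb_model, commit_new_kernel, commit_old_kernel, uuid_tree_needs_check) are hypothetical and are not the kernel's own; the model also ignores that the real update is held back until create and repair operations complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the on-disk super block. */
struct sb_model {
    uint64_t generation;           /* advanced by every commit */
    uint64_t uuid_tree_generation; /* only written by UUID-tree-aware kernels */
};

/* A UUID-tree-aware kernel writes the field with each commit. */
static void commit_new_kernel(struct sb_model *sb)
{
    assert(sb != NULL);
    sb->generation++;
    sb->uuid_tree_generation = sb->generation;
}

/* An old kernel advances the generation but never touches the field. */
static void commit_old_kernel(struct sb_model *sb)
{
    assert(sb != NULL);
    sb->generation++;
}

/* At mount time, a mismatch means the UUID tree may be out of date. */
static bool uuid_tree_needs_check(const struct sb_model *sb)
{
    assert(sb != NULL);
    return sb->uuid_tree_generation != sb->generation;
}
```

In this model, a single commit by the old kernel is enough to make the two values diverge, which is what triggers the check on the next mount.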
     
  • When the UUID tree is initially created, a task is spawned that
    walks through the root tree. For each subvolume root_item found,
    the uuid and received_uuid entries are added to the UUID tree.
    This is such a quick operation that, if somebody wants to
    unmount the filesystem while the task is still running, the
    unmount is delayed until the UUID tree building task has finished.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • When a new subvolume or snapshot is created, a new UUID item is added
    to the UUID tree. Such items are removed when the subvolume is deleted.
    The ioctl to set the received subvolume UUID is also touched and will
    now also add this received UUID into the UUID tree together with the
    ID of the subvolume. The latter is also done when read-only snapshots
    are created which inherit all the send/receive information from the
    parent subvolume.

    User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and
    read in the UUID tree.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This tree is not created by mkfs.btrfs. Therefore when a filesystem
    is mounted writable and the UUID tree does not exist, this tree is
    created if required. The tree is also added to the fs_info structure
    and initialized, but this commit does not yet read or write UUID tree
    elements.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This commit adds support to print UUID tree elements to print-tree.c.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • Mapping UUIDs to subvolume IDs is an expensive operation today.
    The current algorithm even has quadratic effort (in the number
    of existing subvolumes), which means that it takes minutes to
    send/receive a single subvolume if 10,000 subvolumes exist. Even
    linear effort would be wasteful, and the data structures that
    allow mapping UUIDs to subvolume IDs are created every time a
    btrfs send/receive instance is started.

    It is much more efficient to maintain a searchable persistent data
    structure in the filesystem, one that is updated whenever a
    subvolume/snapshot is created and deleted, and when the received
    subvolume UUID is set by the btrfs-receive tool.

    Therefore, this commit adds kernel code that maintains data
    structures in the filesystem that allow searching quickly for a
    given UUID and retrieving the data assigned to it, such as the
    subvolume ID related to this UUID.

    This commit adds a new tree to hold UUID-to-data mapping items. The
    key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
    Multiple data blocks can be stored for a given UUID; a
    type/length/value scheme is used.

    What follows is the lengthy justification for adding a new tree
    instead of using the existing root tree:

    The first approach was to not create another tree that holds UUID
    items. Instead, the items should just go into the top root tree.
    Unfortunately this confused the algorithm to assign the objectid
    of subvolumes and snapshots. The reason is that
    btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
    the first created subvol or snapshot after mounting a filesystem,
    and this function simply searches for the largest used objectid in
    the root tree keys to pick the next objectid to assign. Of course,
    the UUID keys have always been the ones with the highest offset
    value, and the next assigned subvol ID was wastefully huge.

    Using any other existing tree did not look proper. Applying a
    workaround, such as setting the objectid to zero in the UUID item
    key and implementing collision handling, would add either
    limitations (with a btrfs_extend_item() approach to handle the
    collisions) or a lot of complexity and source code (if a key that
    is free of collisions had to be looked up). Adding new code that
    introduces limitations is not good, and adding code that is
    complex and lengthy for no good reason is also not good. That is
    the justification for introducing a completely new tree.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
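
A sketch of how a 16-byte UUID could fill a tree key, as the commit message above describes ("the full UUID plus the key type"). Splitting the UUID into two little-endian 64-bit halves is an assumption of this sketch, and key_model, load_le64, uuid_to_key and the MODEL_UUID_KEY_TYPE value are hypothetical placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UUID_SIZE 16
#define MODEL_UUID_KEY_TYPE 251 /* placeholder value for this sketch */

/* Simplified (objectid, type, offset) key, as used by btrfs trees. */
struct key_model {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

/* Read 8 bytes as a little-endian 64-bit integer, independent of host order. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Assumption: first UUID half goes into objectid, second half into offset. */
static void uuid_to_key(const uint8_t uuid[UUID_SIZE], struct key_model *key)
{
    assert(uuid != NULL && key != NULL);
    key->objectid = load_le64(uuid);
    key->type = MODEL_UUID_KEY_TYPE;
    key->offset = load_le64(uuid + 8);
}
```

Because the whole UUID is spread over objectid and offset, such keys would land at arbitrary, often very large objectids, which is the situation the justification above describes for the root tree.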
     
  • Cc: Josef Bacik
    Cc: Chris Mason
    Signed-off-by: Sergei Trofimovich
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Sergei Trofimovich
     
  • make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__

    I tried to filter out the warnings for which patches have already
    been sent to the mailing list and are pending inclusion in btrfs-next.

    All these changes should be obviously safe.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • If the inode ref key was not found and the current leaf slot
    was 0 (first item in the leaf) the code would always return
    -ENOENT. This was not correct because the desired inode ref
    item might be the last item in the previous leaf.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
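
The slot-0 case above can be sketched with a toy model in which leaves are plain sorted arrays. The names leaf_model and previous_item are hypothetical; the real code walks a btrfs path rather than an array of leaves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy leaf: a sorted array of keys. */
struct leaf_model {
    const uint64_t *keys;
    size_t nritems;
};

/*
 * Return the item just before position (leaf, slot). When slot is 0 the
 * previous item, if any, is the last item of an earlier leaf, so -ENOENT
 * is only correct once no earlier leaf holds an item.
 */
static int previous_item(const struct leaf_model *leaves, size_t leaf,
                         size_t slot, uint64_t *found)
{
    assert(leaves != NULL && found != NULL);

    if (slot > 0) {
        *found = leaves[leaf].keys[slot - 1];
        return 0;
    }
    while (leaf > 0) {
        leaf--;
        if (leaves[leaf].nritems > 0) {
            *found = leaves[leaf].keys[leaves[leaf].nritems - 1];
            return 0;
        }
    }
    return -ENOENT;
}
```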
     
  • If the path doesn't fit in the input buffer, return ENAMETOOLONG
    instead of returning with a success code (0) and a partially
    filled, right-justified buffer.

    Also remove the useless buffer pointer check outside the while loop.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
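
A minimal sketch of the behaviour described above, assuming the path is assembled from the end of the buffer toward the front (which is why a partial result would be right-justified). build_path and its parameters are hypothetical names for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Build "a/b/c" from components listed leaf-first ("c", "b", "a") by
 * writing backwards from the end of buf. On success *start points at the
 * first character of the NUL-terminated path inside buf.
 */
static int build_path(const char *const *names, size_t count,
                      char *buf, size_t size, char **start)
{
    assert(names != NULL && buf != NULL && start != NULL);
    if (size == 0)
        return -ENAMETOOLONG;

    char *ptr = buf + size - 1;
    *ptr = '\0';

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(names[i]);
        size_t need = len + (i > 0 ? 1 : 0); /* '/' before all but the leaf */

        /* Report the overflow instead of returning a truncated path. */
        if ((size_t)(ptr - buf) < need)
            return -ENAMETOOLONG;
        if (i > 0)
            *--ptr = '/';
        ptr -= len;
        memcpy(ptr, names[i], len);
    }
    *start = ptr;
    return 0;
}
```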
     
  • We have already checked 'quota_root' with qgroup_ioctl_lock held, so
    the check here is redundant; remove it.

    Signed-off-by: Wang Shilong
    Reviewed-by: Miao Xie
    Reviewed-by: Arne Jansen
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • btrfs_free_qgroup_config() is called not only by open/close_ctree(), but
    also by btrfs_disable_quota(). For btrfs_disable_quota(), we have set
    'quota_root' to NULL before calling btrfs_free_qgroup_config(), so it
    is safe to clean up the in-memory structures without the lock held.

    Signed-off-by: Wang Shilong
    Reviewed-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • When disabling quota, we should clear out the 'dirty_qgroups' list;
    otherwise, we will get an oops when enabling quota again. Fix this by
    factoring out the similar code from del_qgroup_rb().

    Signed-off-by: Wang Shilong
    Reviewed-by: Miao Xie
    Reviewed-by: Arne Jansen
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • If you are sending a snapshot and specifying a parent snapshot, we walk the
    trees, figure out where they differ, and send only the differences. We check
    for differences by checking whether the leaves are the same and whether the
    keys within the leaves are the same. So if the leaves are not the same (i.e.,
    the leaf has been COWed from the parent snapshot), we walk each item in the
    send root and check it against the parent root. If the items match exactly,
    we don't do anything. This doesn't quite work for inode refs, since they only
    have the name and the parent objectid. If you move the file out of a
    directory, remove that directory, re-create a directory with the same inode
    number as the old directory, and then move the file back into that directory,
    we will assume that nothing changed and you will get errors when you try to
    receive.

    In order to fix this we need to do extra checking to see if the inode ref really
    is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the
    items match. Then if the key type is an inode ref we can do some extra
    checking, otherwise we just keep processing. The extra checking is to look up
    the generation of the directory in the parent volume and compare it to the
    generation of the send volume. If they match then they are the same directory
    and we are good to go. If they don't we have to add them to the changed refs
    list.

    This means we have to track the generation of the ref we're trying to look up
    when we iterate over all the refs for a particular inode. So when looking for
    new refs we have to get the generation from the parent volume, and when
    looking for deleted refs we have to get the generation from the send volume
    to compare with.

    There was also the issue of using a ulist to keep track of the directories we
    needed to check. Because we can get a deleted ref and a new ref for the same
    inode number the ulist won't work since it indexes based on the value. So
    instead just dup any directory ref we find and add it to a local list, and then
    process that list as normal and do away with using a ulist for this altogether.

    Before, we would fail all of the tests in far-progs related to moving
    directories (test group 32). With this patch we now pass these tests, and all
    of the tests in the far-progs send testing suite. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
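
The extra check described above boils down to comparing generations as well as inode numbers. A minimal sketch, assuming a simplified directory record; dir_model and same_directory are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of a directory inode as seen in one root. */
struct dir_model {
    uint64_t ino;        /* inode number; can be reused after deletion */
    uint64_t generation; /* transaction in which the inode was created */
};

/*
 * A matching inode number alone is not proof of identity: a directory that
 * was removed and re-created can reuse the number, but not the generation.
 */
static bool same_directory(const struct dir_model *send_dir,
                           const struct dir_model *parent_dir)
{
    assert(send_dir != NULL && parent_dir != NULL);
    return send_dir->ino == parent_dir->ino &&
           send_dir->generation == parent_dir->generation;
}
```

When the generations differ, the ref would be treated as changed even though name and parent objectid look identical.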
     
  • The plan is to have a bunch of unit tests that run when btrfs is loaded when you
    build with the appropriate config option. My ultimate goal is to have a test
    for every non-static function we have, but at first I'm going to focus on the
    things that cause us the most problems. To start out with this just adds a
    tests/ directory and moves the existing free space cache tests into that
    directory and sets up all of the infrastructure. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • I noticed while looking at a deadlock that we are always starting a transaction
    in cow_file_range(). This isn't really needed since we only need a transaction
    if we are doing an inline extent, or if the allocator needs to allocate a chunk.
    So push down all the transaction start stuff to be closer to where we actually
    need a transaction in all of these cases. This will hopefully reduce our write
    latency when we are committing often. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • I added a patch where we started taking the ordered operations mutex when we
    waited on ordered extents. We need this because we splice the list and process
    it, so if a flusher came in during this scenario it would think the list was
    empty and we'd usually get an early ENOSPC. The problem with this is that this
    lock is used in transaction committing. So we end up with something like this

    Transaction commit
      -> wait on writers

    Delalloc flusher
      -> run_ordered_operations (holds mutex)
        -> wait for filemap-flush to do its thing

    flush task
      -> cow_file_range
        -> wait on btrfs_join_transaction because we're committing

    some other task
      -> commit_transaction because we notice trans->transaction->flush is set
        -> run_ordered_operations (hang on mutex)

    We need to disentangle the ordered operations flushing from the delalloc
    flushing, since they are separate things. This solves the deadlock issue I was
    seeing. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • There are several places where we BUG_ON() if we fail to remove the orphan items
    and such, which is not ok, so remove those and either abort or just carry on.
    This also fixes a problem where if we couldn't start a transaction we wouldn't
    actually remove the orphan item reserve for the inode. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Eric pointed out that btrfs will happily allow you to delete the default subvol.
    This is obviously a problem, since the next time you go to mount the file
    system it will freak out because it can't find the root. Fix this by adding a
    check to see if our default subvol points to the subvol we are trying to
    delete and, if it does, not allowing the deletion to happen. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • We have logic to see if we've already created a parent directory by checking
    whether an inode inside that directory has a lower inode number than the one
    we are currently processing. The logic is that if there is a lower inode
    number, then we would have had to make sure the directory was created at that
    earlier point. The problem is that subvol inode numbers count from the lowest
    objectid in the root tree, which may be less than our current progress. So
    just skip the check if our dir item key is a root item. This fixes the
    original test and the xfstest version I made that added an extra subvol
    create. Thanks,

    Reported-by: Emil Karlson
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
    de-duplicate a list of extents across a range of files.

    Internally, the ioctl re-uses code from the clone ioctl. This avoids
    rewriting a large chunk of extent handling code.

    Userspace passes in an array of (file, offset) pairs along with a length
    argument. For each dedupe, the ioctl then does a byte-by-byte comparison
    of the user data before deduping the extent. The status and the number of
    bytes deduped are returned for each operation.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Zach Brown
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Mark Fasheh
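
A userspace model of the compare-then-dedupe flow described above, operating on in-memory buffers instead of files. This is not the ioctl's argument layout; dedupe_target, dedupe_model and MODEL_DATA_DIFFERS are hypothetical names, and the -EINVAL range check is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_DATA_DIFFERS 1 /* hypothetical status: contents did not match */

/* One destination; status and bytes_deduped are filled in per operation. */
struct dedupe_target {
    const uint8_t *data; /* stand-in for the destination file's contents */
    size_t size;
    size_t offset;
    int status;           /* out: 0, MODEL_DATA_DIFFERS or -EINVAL */
    size_t bytes_deduped; /* out */
};

static void dedupe_model(const uint8_t *src, size_t src_size, size_t src_off,
                         size_t len, struct dedupe_target *targets,
                         size_t count)
{
    assert(src != NULL && targets != NULL);

    for (size_t i = 0; i < count; i++) {
        struct dedupe_target *t = &targets[i];

        t->bytes_deduped = 0;
        /* Reject ranges that fall outside either buffer. */
        if (src_off > src_size || len > src_size - src_off ||
            t->offset > t->size || len > t->size - t->offset) {
            t->status = -EINVAL;
            continue;
        }
        /* Byte-by-byte comparison before anything is shared. */
        if (memcmp(src + src_off, t->data + t->offset, len) != 0) {
            t->status = MODEL_DATA_DIFFERS;
            continue;
        }
        t->status = 0;
        t->bytes_deduped = len; /* a real implementation would share extents here */
    }
}
```

Each target gets its own status, so one mismatching destination does not prevent the others from being processed.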
     
  • We want this for btrfs_extent_same. Basically readpage and friends do their
    own extent locking but for the purposes of dedupe, we want to have both
    files locked down across a set of readpage operations (so that we can
    compare data). Introduce this variant and a flag which can be set for
    extent_read_full_page() to indicate that we are already locked.

    Partial credit for this patch goes to Gabriel de Perthuis
    as I have included a fix from him to the original patch which avoids a
    deadlock on compressed extents.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Mark Fasheh
     
  • There's some 250+ lines here that are easily encapsulated into their own
    function. I don't change how anything works here, just create and document
    the new btrfs_clone() function from btrfs_ioctl_clone() code.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Mark Fasheh
     
  • The range locking in btrfs_ioctl_clone is trivially broken out into its own
    function. This reduces the complexity of btrfs_ioctl_clone() by a small bit
    and makes that locking code available to future functions in
    fs/btrfs/ioctl.c.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Mark Fasheh
     
  • Signed-off-by: Wang Shilong
    Reviewed-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success)
    when the target space was full and chunk allocation was needed.
    However, later on in that same function we return -ENOSPC if
    btrfs_alloc_chunk() fails (and chunk allocation was needed) and
    set the space's full flag.

    This was inconsistent, as -ENOSPC should be returned if the space
    is full and a chunk allocation needs to be performed. If the space is
    full but no chunk allocation is needed, just return 0 (success).

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
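
The return-value rule stated above can be written as a small decision function. This is a sketch under simplifying assumptions; space_model and chunk_alloc_model are hypothetical names, and alloc_ok stands in for whether btrfs_alloc_chunk() would succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified space info: only the "full" flag matters for this sketch. */
struct space_model {
    bool full;
};

static int chunk_alloc_model(struct space_model *space, bool need_alloc,
                             bool alloc_ok)
{
    assert(space != NULL);

    /* Full space: an error only if a chunk was actually needed. */
    if (space->full)
        return need_alloc ? -ENOSPC : 0;
    if (!need_alloc)
        return 0;
    /* A failed allocation marks the space full and reports -ENOSPC. */
    if (!alloc_ok) {
        space->full = true;
        return -ENOSPC;
    }
    return 0;
}
```

With this shape, the early exit and the late failure path agree: both report -ENOSPC exactly when a needed chunk could not be provided.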
     
  • tree-log.c was ignoring the return value from btrfs_run_delayed_items()
    in several places.

    Signed-off-by: Filipe David Borba Manana
    Reviewed-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • In tree-log.c:replay_one_name(), if memory allocation for
    the name fails, ensure we iput the dir inode we got earlier
    before we return.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • The rule originally comes from nocow writing, but snapshot-aware
    defrag is a different case: the extent has already been written, and
    we're not going to change the extent, only add a reference to the data.

    So we're able to allow such compressed extents to be merged into
    one bigger extent if they're pointing to the same data.

    Reviewed-by: Miao Xie
    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     
  • It's hardcoded to 30 seconds, which is fine for most users. Higher values
    defer data being synced to permanent storage, with obvious consequences
    when the system crashes. The upper bound is not enforced, but a warning is
    printed if it's more than 300 seconds (5 minutes).

    Signed-off-by: David Sterba
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    David Sterba
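
A sketch of the policy described above: 30 seconds as the default, no hard upper bound, and a warning above 300 seconds. pick_commit_interval is a hypothetical helper, and falling back to the default for non-positive values is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_COMMIT_INTERVAL 30 /* seconds, the previously hardcoded value */
#define WARN_COMMIT_INTERVAL 300   /* seconds, warn above this */

/*
 * Return the interval to use and set *warn when the caller should print a
 * warning. Assumption: non-positive requests fall back to the default.
 */
static unsigned int pick_commit_interval(int requested, bool *warn)
{
    assert(warn != NULL);
    *warn = false;

    if (requested <= 0)
        return DEFAULT_COMMIT_INTERVAL;
    if (requested > WARN_COMMIT_INTERVAL)
        *warn = true; /* accepted anyway: the upper bound is not enforced */
    return (unsigned int)requested;
}
```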
     
  • There is no reason we can't just set the path to blocking and then do normal
    GFP_NOFS allocations for these extent buffers. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
    tree mod log. Instead of BUG_ON()'ing in this case pass up ENOMEM. I looked
    back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
    in this path. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • When doing a send with a parent subvol we will check to see if the file we are
    acting on is being overwritten and move it if we think it may be needed further
    down the line during the send. We check this by checking its directory and
    making sure it existed in the parent and making sure the file existed in the
    parent. The problem with this check is that if we create a directory and a file
    in that directory, and then snapshot, and then remove and re-create that same
    directory and file with different inode numbers and then try to snapshot and
    send with the original parent we will try and save the original file inside of
    that directory. This is a problem because during the receive we move the
    directory out of the way because it is a completely new inode, which makes us
    unable to find the old file inside of the directory when we try to move that out
    of the way for the overwrite. We fix this by checking the parent directory of
    the inode we think we are overwriting. If the parent directory generation in
    the send root != the parent directory generation in the parent root then we know
    it is a completely new directory and we need not bother with moving the file out
    of the way because it would have been completely destroyed. This fixes bz
    60673. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait
    forever if a drive failed. This is because we just bail out if there is an
    error while trying to cache a block group and don't update anybody who may be
    waiting. So this introduces a new enum for the cache state in case of error and
    makes everybody bail out if we have an error. Alex tested and verified this
    patch fixed his problem. This fixes bz 59431. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
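
A minimal sketch of why an explicit error state lets waiters bail out. The enum and helper names are hypothetical, and returning -EIO for the error state is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical cache states; the error state is the new addition. */
enum cache_state_model {
    CACHE_NO,
    CACHE_STARTED,
    CACHE_FINISHED,
    CACHE_ERROR
};

/* Without CACHE_ERROR, a failed caching attempt never satisfies this. */
static bool wait_is_over(enum cache_state_model state)
{
    return state == CACHE_FINISHED || state == CACHE_ERROR;
}

/* Result handed to a waiter once the wait is over. */
static int wait_result(enum cache_state_model state)
{
    assert(wait_is_over(state));
    return state == CACHE_ERROR ? -EIO : 0;
}
```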
     
  • Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • If we bail out when the stripe alloc fails, we need to undo the
    earlier allocation of raid_map.

    Signed-off-by: Dave Jones
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Dave Jones
     
  • Signed-off-by: Filipe David Borba Manana
    Reviewed-by: Jan Schmidt
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • After reading all device items from the chunk tree, don't
    exit the loop and then navigate down the tree again to find
    the chunk items. Instead just read all device items and
    chunk items with a single tree search. This is possible
    because all device items are found before any chunk item in
    the chunk tree.

    Signed-off-by: Filipe David Borba Manana
    Reviewed-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
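
The single-pass idea above can be sketched over a flat, already sorted list of items. item_model, scan_chunk_tree and the MODEL_* type values are hypothetical; counting stands in for actually reading each device or chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DEV_ITEM 1   /* placeholder item types for this sketch */
#define MODEL_CHUNK_ITEM 2

struct item_model {
    uint64_t objectid;
    uint8_t type;
};

/*
 * Walk the items once, dispatching on the item type. Because device items
 * sort before chunk items, every device has been seen by the time the
 * first chunk is processed, so no second search is needed.
 */
static void scan_chunk_tree(const struct item_model *items, size_t n,
                            size_t *ndev, size_t *nchunk)
{
    assert(items != NULL && ndev != NULL && nchunk != NULL);
    *ndev = 0;
    *nchunk = 0;

    for (size_t i = 0; i < n; i++) {
        switch (items[i].type) {
        case MODEL_DEV_ITEM:
            (*ndev)++;   /* stand-in for reading one device item */
            break;
        case MODEL_CHUNK_ITEM:
            (*nchunk)++; /* stand-in for reading one chunk item */
            break;
        default:
            break;
        }
    }
}
```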
     
  • There is no reason for this sort of jackassery. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik