Eric Lee / smarc-fsl-linux-kernel

01 Sep, 2013

40 commits

70f801754 Btrfs: check UUID tree during mount if required

If the filesystem was mounted with an old kernel that was not
aware of the UUID tree, this is detected by looking at the
uuid_tree_generation field of the superblock (similar to how
the free space cache does it). If a mismatch is detected
at mount time, a thread is started that does two things:
1. Iterate through the UUID tree, check each entry, and delete those
   entries that are no longer valid (i.e., the subvol does not
   exist anymore or the value changed).
2. Iterate through the root tree and, for each subvolume found, add
   the UUID tree entries for the subvolume (if they are not
   already there).

This mechanism is also used to handle and repair errors that
happened during the initial creation and filling of the tree.
The update of the uuid_tree_generation field (which indicates
that the state of the UUID tree is up to date) is blocked until
all create and repair operations are successfully completed.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Stefan Behrens
2013-09-01 20:15:58 +0800
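
The mount-time detection described in "check UUID tree during mount if required" can be sketched roughly as follows. This is a minimal illustration, not the kernel code: the struct `sb_sketch` and the function `uuid_tree_needs_check()` are hypothetical names, and only the field name `uuid_tree_generation` comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the two superblock fields involved. */
struct sb_sketch {
    uint64_t generation;           /* advanced by every kernel on commit */
    uint64_t uuid_tree_generation; /* written only by UUID-tree-aware kernels */
};

/*
 * An old kernel keeps advancing 'generation' but never touches
 * 'uuid_tree_generation', so the two values drift apart. A mismatch at
 * mount time means the UUID tree may be stale and has to be rechecked.
 */
static bool uuid_tree_needs_check(const struct sb_sketch *sb)
{
    return sb->uuid_tree_generation != sb->generation;
}
```

When the check reports a mismatch, the two-step repair thread from the commit message would be started; the sketch covers only the detection.
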
26432799c Btrfs: introduce uuid-tree-gen field

In order to be able to detect the case that a filesystem is mounted
with an old kernel, add a uuid-tree-gen field, as the free space
cache does. It is part of the super block and written with
each commit. Old kernels do not know this field and don't update it.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Stefan Behrens
2013-09-01 20:15:57 +0800
803b2f54f Btrfs: fill UUID tree initially

When the UUID tree is initially created, a task is spawned that
walks through the root tree. For each found subvolume root_item,
the uuid and received_uuid entries in the UUID tree are added.
This operation is quick enough that, if somebody wants
to unmount the filesystem while the task is still running, the
unmount is delayed until the UUID tree building task is finished.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Stefan Behrens
2013-09-01 20:15:56 +0800
dd5f9615f Btrfs: maintain subvolume items in the UUID tree

When a new subvolume or snapshot is created, a new UUID item is added
to the UUID tree. Such items are removed when the subvolume is deleted.
The ioctl to set the received subvolume UUID is also touched and will
now also add this received UUID into the UUID tree together with the
ID of the subvolume. The latter is also done when read-only snapshots
are created which inherit all the send/receive information from the
parent subvolume.

User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and
read in the UUID tree.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Stefan Behrens
2013-09-01 20:15:55 +0800
f7a81ea4c Btrfs: create UUID tree if required

This tree is not created by mkfs.btrfs. Therefore when a filesystem
is mounted writable and the UUID tree does not exist, this tree is
created if required. The tree is also added to the fs_info structure
and initialized, but this commit does not yet read or write UUID tree
elements.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Stefan Behrens
2013-09-01 20:15:54 +0800
8f8ae8e21 Btrfs: support printing UUID tree elements

This commit adds support to print UUID tree elements to print-tree.c.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Stefan Behrens
2013-09-01 20:15:53 +0800
07b30a49d Btrfs: introduce a tree for items that map UUIDs to something

Mapping UUIDs to subvolume IDs is an expensive operation
today. The current algorithm even has quadratic cost (in the
number of existing subvolumes), which means that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear cost would be too much, since it is wasted work: the
data structures that allow mapping UUIDs to subvolume IDs are rebuilt
every time a btrfs send/receive instance is started.

It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.

Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that make it possible to
quickly search for a given UUID and to retrieve data that is assigned
to this UUID, such as which subvolume ID is related to this UUID.

This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID; a
type/length/value scheme is used.

Now follows the lengthy justification for why a new tree was added
instead of using the existing root tree:

The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.

To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Stefan Behrens
2013-09-01 20:15:52 +0800
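
As a rough illustration of "the key of the items is the full UUID plus the key type", the sketch below packs a 16-byte UUID into the two 64-bit fields of a three-part key. It assumes the UUID is split into two little-endian halves; the struct `sketch_key`, the helper names, and the numeric value of `SKETCH_UUID_KEY` are placeholders, not the kernel definitions.

```c
#include <stdint.h>

#define SKETCH_UUID_SIZE 16
#define SKETCH_UUID_KEY  251 /* placeholder stand-in for BTRFS_UUID_KEY */

/* Hypothetical three-part key: (objectid, type, offset). */
struct sketch_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Read 8 bytes as a little-endian 64-bit value, independent of host order. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Assumed layout: first half of the UUID -> objectid, second half -> offset. */
static void uuid_to_key(const uint8_t uuid[SKETCH_UUID_SIZE],
                        struct sketch_key *key)
{
    key->objectid = load_le64(uuid);
    key->type = SKETCH_UUID_KEY;
    key->offset = load_le64(uuid + 8);
}
```

With random UUIDs the key fields are effectively random 64-bit numbers. That fits the problem described above: placing such keys in the root tree would confuse a search for the largest used objectid.
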
171170c1c btrfs: mark some local function as 'static'

Cc: Josef Bacik
Cc: Chris Mason
Signed-off-by: Sergei Trofimovich
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Sergei Trofimovich
2013-09-01 20:15:51 +0800
35a3621be Btrfs: get rid of sparse warnings

make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__

I tried to filter out the warnings for which patches have already
been sent to the mailing list and are pending inclusion in btrfs-next.

All these changes should be obviously safe.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Stefan Behrens
2013-09-01 20:15:50 +0800
18674c6cc Btrfs: don't miss inode ref items in BTRFS_IOC_INO_LOOKUP

If the inode ref key was not found and the current leaf slot
was 0 (first item in the leaf) the code would always return
-ENOENT. This was not correct because the desired inode ref
item might be the last item in the previous leaf.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2013-09-01 20:15:49 +0800
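
The slot-0 case fixed in "don't miss inode ref items in BTRFS_IOC_INO_LOOKUP" can be pictured with a small model in which a tree level is an array of leaves. This is only a sketch under that assumption: `leaf_sketch` and `step_to_previous_item()` are hypothetical names and do not mirror the kernel's path handling.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical leaf: only the number of items matters for this sketch. */
struct leaf_sketch {
    size_t nritems;
};

/*
 * Move (leaf_idx, slot) back by one item. Being at slot 0 is not the
 * same as "not found": the wanted item may be the last item of the
 * previous leaf. -ENOENT is returned only if no earlier item exists.
 */
static int step_to_previous_item(const struct leaf_sketch *leaves,
                                 size_t *leaf_idx, size_t *slot)
{
    if (*slot > 0) {
        (*slot)--;
        return 0;
    }
    /* Slot 0: walk back over earlier leaves, skipping empty ones. */
    while (*leaf_idx > 0) {
        (*leaf_idx)--;
        if (leaves[*leaf_idx].nritems > 0) {
            *slot = leaves[*leaf_idx].nritems - 1;
            return 0;
        }
    }
    return -ENOENT;
}
```
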
a696cf352 Btrfs: add missing error code to BTRFS_IOC_INO_LOOKUP handler

If the path doesn't fit in the input buffer, return ENAMETOOLONG
instead of returning with a success code (0) and a partially
filled, right-justified buffer.

Also removed useless buffer pointer check outside the while loop.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2013-09-01 20:15:48 +0800
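
The behaviour described in "add missing error code to BTRFS_IOC_INO_LOOKUP handler" (a path assembled from the right end of a buffer, with an error when it does not fit) can be sketched in user-space C as below. The function `build_path_backwards()` and its parameters are illustrative assumptions, not the ioctl handler itself.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Build a path such as "a/b/c/" from components given leaf-first
 * ("c", "b", "a"), filling 'buf' from its right end. On success,
 * '*start' points at the first character of the path. If a component
 * does not fit, return -ENAMETOOLONG rather than reporting success
 * with a partially filled buffer.
 */
static int build_path_backwards(const char *const *names, size_t count,
                                char *buf, size_t buflen, char **start)
{
    char *ptr;

    if (buflen == 0)
        return -ENAMETOOLONG;

    ptr = buf + buflen - 1;
    *ptr = '\0';

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(names[i]);
        size_t room = (size_t)(ptr - buf);

        if (len + 1 > room) /* component plus its '/' must fit */
            return -ENAMETOOLONG;

        ptr -= len + 1;
        memcpy(ptr, names[i], len);
        ptr[len] = '/';
    }

    *start = ptr;
    return 0;
}
```
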
b006b2e4f Btrfs: remove reduplicate check when disabling quota

We have already checked 'quota_root' with qgroup_ioctl_lock held, so
the check here is redundant. Remove it.

Signed-off-by: Wang Shilong
Reviewed-by: Miao Xie
Reviewed-by: Arne Jansen
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Wang Shilong
2013-09-01 20:15:47 +0800
e685da14a Btrfs: move btrfs_free_qgroup_config() out of spin_lock and fix comments

btrfs_free_qgroup_config() is called not only by open/close_ctree(), but
also by btrfs_disable_quota(). For btrfs_disable_quota(), we have set
'quota_root' to NULL before calling btrfs_free_qgroup_config(), so it
is safe to clean up the in-memory structures without the lock held.

Signed-off-by: Wang Shilong
Reviewed-by: Miao Xie
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Wang Shilong
2013-09-01 20:15:46 +0800
4082bd3d7 Btrfs: fix oops when writing dirty qgroups to disk

When disabling quota, we should clear out the list 'dirty_qgroups'; otherwise,
we will get an oops if quota is enabled again. Fix this by abstracting similar
code from del_qgroup_rb().

Signed-off-by: Wang Shilong
Reviewed-by: Miao Xie
Reviewed-by: Arne Jansen
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Wang Shilong
2013-09-01 20:15:45 +0800
ba5e8f2e2 Btrfs: fix send issues related to inode number reuse

If you are sending a snapshot and specifying a parent snapshot we will walk the
trees and figure out where they differ and send the differences only. The way
we check for differences is whether the leaves aren't the same and whether the
keys are not the same within the leaves. So if the leaves are not the same (i.e.
the leaf has been cow'ed from the parent snapshot) we walk each item in the send
root and check it against the parent root. If the items match exactly then we
don't do anything. This doesn't quite work for inode refs, since they will just
have the name and the parent objectid. If you move the file out of a directory,
remove that directory, re-create a directory with the same inode number as
the old directory, and then move that file back into that directory, we will
assume that nothing changed and you will get errors when you try to receive.

In order to fix this we need to do extra checking to see if the inode ref really
is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the
items match. Then if the key type is an inode ref we can do some extra
checking, otherwise we just keep processing. The extra checking is to look up
the generation of the directory in the parent volume and compare it to the
generation of the send volume. If they match then they are the same directory
and we are good to go. If they don't we have to add them to the changed refs
list.

This means we have to track the generation of the ref we're trying to look up
when we iterate all the refs for a particular inode. So in the case of looking
for new refs we have to get the generation from the parent volume, and in the
case of looking for deleted refs we have to get the generation from the send
volume to compare with.

There was also the issue of using a ulist to keep track of the directories we
needed to check. Because we can get a deleted ref and a new ref for the same
inode number the ulist won't work since it indexes based on the value. So
instead just dup any directory ref we find and add it to a local list, and then
process that list as normal and do away with using a ulist for this altogether.

Previously we would fail all of the tests in the far-progs that related to moving
directories (test group 32). With this patch we now pass these tests, and all
of the tests in the far-progs send testing suite. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:15:44 +0800
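
The extra check from "fix send issues related to inode number reuse" comes down to treating a directory as unchanged only if both its inode number and its generation match. A minimal sketch, using the hypothetical names `dir_id_sketch` and `same_directory()`:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identity of a directory as seen in one snapshot. */
struct dir_id_sketch {
    uint64_t ino; /* inode number, which may be reused after deletion */
    uint64_t gen; /* generation in which the inode was created */
};

/*
 * Matching inode numbers alone are not enough: a deleted and re-created
 * directory can get the same number. Only if the generation matches as
 * well is it really the same directory in the send and parent roots.
 */
static bool same_directory(const struct dir_id_sketch *send_root_dir,
                           const struct dir_id_sketch *parent_root_dir)
{
    return send_root_dir->ino == parent_root_dir->ino &&
           send_root_dir->gen == parent_root_dir->gen;
}
```

When the comparison fails, the ref would be added to the changed refs list, as the commit message describes.
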
dc11dd5d7 Btrfs: separate out tests into their own directory

The plan is to have a bunch of unit tests that run when btrfs is loaded when you
build with the appropriate config option. My ultimate goal is to have a test
for every non-static function we have, but at first I'm going to focus on the
things that cause us the most problems. To start out with this just adds a
tests/ directory and moves the existing free space cache tests into that
directory and sets up all of the infrastructure. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:15:38 +0800
00361589d Btrfs: avoid starting a transaction in the write path

I noticed while looking at a deadlock that we are always starting a transaction
in cow_file_range(). This isn't really needed since we only need a transaction
if we are doing an inline extent, or if the allocator needs to allocate a chunk.
So push down all the transaction start stuff to be closer to where we actually
need a transaction in all of these cases. This will hopefully reduce our write
latency when we are committing often. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:05:05 +0800
9ffba8cda Btrfs: fix heavy delalloc related deadlock

I added a patch where we started taking the ordered operations mutex when we
waited on ordered extents. We need this because we splice the list and process
it, so if a flusher came in during this scenario it would think the list was
empty and we'd usually get an early ENOSPC. The problem with this is that this
lock is used in transaction committing. So we end up with something like this:

Transaction commit
    -> wait on writers

Delalloc flusher
    -> run_ordered_operations (holds mutex)
        -> wait for filemap-flush to do its thing

flush task
    -> cow_file_range
        -> wait on btrfs_join_transaction because we're committing

some other task
    -> commit_transaction because we notice trans->transaction->flush is set
        -> run_ordered_operations (hang on mutex)

We need to disentangle the ordered operations flushing from the delalloc
flushing, since they are separate things. This solves the deadlock issue I was
seeing. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:05:04 +0800
4ef31a45a Btrfs: fix the error handling wrt orphan items

There are several places where we BUG_ON() if we fail to remove the orphan items
and such, which is not ok, so remove those and either abort or just carry on.
This also fixes a problem where if we couldn't start a transaction we wouldn't
actually remove the orphan item reserve for the inode. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:05:03 +0800
175a2b871 Btrfs: don't allow a subvol to be deleted if it is the default subovl

Eric pointed out that btrfs will happily allow you to delete the default subvol.
This is obviously a problem, since the next time you go to mount the file system
it will freak out because it can't find the root. Fix this by adding a check to
see if our default subvol points to the subvol we are trying to delete, and if
it does, not allowing the deletion to happen. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:05:02 +0800
a05254143 Btrfs: skip subvol entries when checking if we've created a dir already

We have logic to see if we've already created a parent directory by checking to
see if an inode inside of that directory has a lower inode number than the one we
are currently processing. The logic is that if there is a lower inode number
then we would have had to make sure the directory was created at that previous
point. The problem is that subvol inode numbers count from the lowest objectid
in the root tree, which may be less than our current progress. So just skip if
our dir item key is a root item. This fixes the original test and the xfstest
version I made that added an extra subvol create. Thanks,

Reported-by: Emil Karlson
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:05:01 +0800
416161db9 btrfs: offline dedupe

This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.

Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.

Userspace passes in an array of (file, offset) pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.

Signed-off-by: Mark Fasheh
Reviewed-by: Zach Brown
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Mark Fasheh
2013-09-01 20:05:00 +0800
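
The per-operation behaviour described in "offline dedupe" (compare the user data byte by byte, then report a status and a byte count) can be modelled in memory as below. This sketch does not use the real ioctl or its argument structures; `dedupe_result_sketch`, `DEDUPE_DATA_DIFFERS` and `dedupe_one()` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define DEDUPE_DATA_DIFFERS 1 /* hypothetical status code for this sketch */

/* Hypothetical per-operation result: a status plus the bytes deduped. */
struct dedupe_result_sketch {
    int    status;
    size_t bytes_deduped;
};

/*
 * Compare the two ranges byte by byte before "deduping". The extents
 * would be shared only if the data is identical; here the sharing step
 * is replaced by recording how many bytes would have been deduped.
 */
static struct dedupe_result_sketch dedupe_one(const unsigned char *src,
                                              const unsigned char *dst,
                                              size_t len)
{
    struct dedupe_result_sketch res = { 0, 0 };

    if (memcmp(src, dst, len) != 0) {
        res.status = DEDUPE_DATA_DIFFERS;
        return res;
    }
    res.bytes_deduped = len;
    return res;
}
```
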
4b384318a btrfs: Introduce extent_read_full_page_nolock()

We want this for btrfs_extent_same. Basically readpage and friends do their
own extent locking but for the purposes of dedupe, we want to have both
files locked down across a set of readpage operations (so that we can
compare data). Introduce this variant and a flag which can be set for
extent_read_full_page() to indicate that we are already locked.

Partial credit for this patch goes to Gabriel de Perthuis
as I have included a fix from him to the original patch which avoids a
deadlock on compressed extents.

Signed-off-by: Mark Fasheh
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Mark Fasheh
2013-09-01 20:04:59 +0800
32b7c687c btrfs_ioctl_clone: Move clone code into it's own function

There's some 250+ lines here that are easily encapsulated into their own
function. I don't change how anything works here, just create and document
the new btrfs_clone() function from btrfs_ioctl_clone() code.

Signed-off-by: Mark Fasheh
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Mark Fasheh
2013-09-01 20:04:58 +0800
77fe20dc6 btrfs: abtract out range locking in clone ioctl()

The range locking in btrfs_ioctl_clone is trivially broken out into its own
function. This reduces the complexity of btrfs_ioctl_clone() by a small bit
and makes that locking code available to future functions in
fs/btrfs/ioctl.c.

Signed-off-by: Mark Fasheh
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Mark Fasheh
2013-09-01 20:04:57 +0800
a4fdb61e8 Btrfs: fix possible memory leak in find_parent_nodes()

Signed-off-by: Wang Shilong
Reviewed-by: Miao Xie
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Wang Shilong
2013-09-01 20:04:56 +0800
09fb99a69 Btrfs: return ENOSPC when target space is full

In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success)
when the target space was full and chunk allocation was needed.
However, later on in that same function we return -ENOSPC if
btrfs_alloc_chunk() fails (and chunk allocation was needed) and
set the space's full flag.

This was inconsistent, as -ENOSPC should be returned if the space
is full and a chunk allocation needs to be performed. If the space is
full but no chunk allocation is needed, just return 0 (success).

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2013-09-01 20:04:55 +0800
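
The return-value rule stated in "return ENOSPC when target space is full" for an already full space can be written as a one-line decision. The function name below is hypothetical and covers only that early-exit case, not the rest of do_chunk_alloc().

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Result for the case where the target space is already marked full:
 * a needed chunk allocation cannot succeed, so report -ENOSPC; if no
 * allocation was needed, there is nothing to do and 0 is returned.
 */
static int result_when_space_full(bool chunk_alloc_needed)
{
    return chunk_alloc_needed ? -ENOSPC : 0;
}
```
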
ada9af215 Btrfs: don't ignore errors from btrfs_run_delayed_items

tree-log.c was ignoring the return value from btrfs_run_delayed_items()
in several places.

Signed-off-by: Filipe David Borba Manana
Reviewed-by: Miao Xie
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2013-09-01 20:04:54 +0800
2bac325ea Btrfs: fix inode leak on kmalloc failure in tree-log.c

In tree-log.c:replay_one_name(), if memory allocation for
the name fails, ensure we iput the dir inode we got earlier
before we return.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2013-09-01 20:04:53 +0800
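
The leak fixed in "fix inode leak on kmalloc failure in tree-log.c" follows a common pattern: a reference taken early must be dropped on every exit path, including the allocation-failure path. The user-space sketch below models this with a plain counter; `dir_ref_sketch`, `replay_name_sketch()` and the `fail_alloc` test switch are all hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an inode with a reference count. */
struct dir_ref_sketch {
    int refcount;
};

static void dir_get(struct dir_ref_sketch *d) { d->refcount++; }
static void dir_put(struct dir_ref_sketch *d) { d->refcount--; }

/*
 * Take a reference on 'dir', then allocate a copy of the name. If the
 * allocation fails, jump to the common exit label so the reference is
 * still dropped; a bare 'return -ENOMEM' here would leak it.
 * 'fail_alloc' only simulates an allocation failure for this sketch.
 */
static int replay_name_sketch(struct dir_ref_sketch *dir, const char *src,
                              size_t len, int fail_alloc)
{
    char *name;
    int ret = 0;

    dir_get(dir);

    name = fail_alloc ? NULL : malloc(len + 1);
    if (!name) {
        ret = -ENOMEM;
        goto out;
    }
    memcpy(name, src, len);
    name[len] = '\0';
    /* ... the name would be used here ... */
    free(name);
out:
    dir_put(dir);
    return ret;
}
```
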
116e0024c Btrfs: allow compressed extents to be merged during defragment

The rule originally comes from nocow writing, but snapshot-aware
defrag is a different case: the extent has been written and we're
not going to change the extent, only add a reference on the data.

So we're able to allow such compressed extents to be merged into
one bigger extent if they're pointing to the same data.

Reviewed-by: Miao Xie
Signed-off-by: Liu Bo
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Liu Bo
2013-09-01 20:04:52 +0800
8b87dc17f btrfs: add mount option to set commit interval

It's hardcoded to 30 seconds, which is fine for most users. Higher values
defer data being synced to permanent storage with obvious consequences
when the system crashes. The upper bound is not forced, but a warning is
printed if it's more than 300 seconds (5 minutes).

Signed-off-by: David Sterba
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

David Sterba
2013-09-01 20:04:51 +0800
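
The policy described in "add mount option to set commit interval" (30 seconds by default, no enforced upper bound, a warning above 300 seconds) might be sketched as follows. The function `pick_commit_interval()` is hypothetical, and falling back to the default for non-positive values is an assumption of this sketch, not something the commit message states.

```c
#include <stdio.h>

#define DEFAULT_COMMIT_INTERVAL 30  /* seconds, from the commit message */
#define WARN_COMMIT_INTERVAL    300 /* seconds; warn above this, do not clamp */

/*
 * Choose the commit interval for a requested value in seconds.
 * Assumption: a non-positive request falls back to the default.
 * Large values are accepted as-is; '*warned' reports whether a
 * warning was printed.
 */
static unsigned int pick_commit_interval(int requested, int *warned)
{
    *warned = 0;
    if (requested <= 0)
        return DEFAULT_COMMIT_INTERVAL;
    if (requested > WARN_COMMIT_INTERVAL) {
        fprintf(stderr, "warning: excessive commit interval %d\n", requested);
        *warned = 1;
    }
    return (unsigned int)requested;
}
```
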
9ec726775 Btrfs: stop using GFP_ATOMIC when allocating rewind ebs

There is no reason we can't just set the path to blocking and then do normal
GFP_NOFS allocations for these extent buffers. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:04:50 +0800
db7f3436c Btrfs: deal with enomem in the rewind path

We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
tree mod log. Instead of BUG_ON()'ing in this case pass up ENOMEM. I looked
back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
in this path. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:04:49 +0800
ebdad913a Btrfs: check our parent dir when doing a compare send

When doing a send with a parent subvol we will check to see if the file we are
acting on is being overwritten and move it if we think it may be needed further
down the line during the send. We check this by checking its directory and
making sure it existed in the parent and making sure the file existed in the
parent. The problem with this check is that if we create a directory and a file
in that directory, and then snapshot, and then remove and re-create that same
directory and file with different inode numbers and then try to snapshot and
send with the original parent we will try and save the original file inside of
that directory. This is a problem because during the receive we move the
directory out of the way because it is a completely new inode, which makes us
unable to find the old file inside of the directory when we try to move that out
of the way for the overwrite. We fix this by checking the parent directory of
the inode we think we are overwriting. If the parent directory generation in
the send root != the parent directory generation in the parent root then we know
it is a completely new directory and we need not bother with moving the file out
of the way because it would have been completely destroyed. This fixes bz
60673. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:04:48 +0800
36cce9228 Btrfs: handle errors when doing slow caching

Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait
forever if a drive failed. This is because we just bail out if there is an
error while trying to cache a block group; we don't update anybody who may be
waiting. So this introduces a new enum for the cache state in case of error and
makes everybody bail out if we have an error. Alex tested and verified this
patch fixed his problem. This fixes bz 59431. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:04:47 +0800
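
The idea in "handle errors when doing slow caching" is that waiters need a terminal error state to wake up on, in addition to the success state. A minimal sketch with hypothetical enum and function names, and with -EIO chosen as an assumed error code:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical cache states; the error state is the new addition. */
enum cache_state_sketch {
    CACHE_NOT_STARTED,
    CACHE_STARTED,
    CACHE_FINISHED,
    CACHE_ERROR
};

/* A waiter may stop waiting once caching has finished or has failed. */
static bool cache_wait_done(enum cache_state_sketch state)
{
    return state == CACHE_FINISHED || state == CACHE_ERROR;
}

/* After waking up, the waiter propagates the failure instead of hanging. */
static int cache_wait_result(enum cache_state_sketch state)
{
    return state == CACHE_ERROR ? -EIO : 0;
}
```

Without the error state, a failed caching attempt would leave the state at "started", so `cache_wait_done()` would never become true.
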
0f0fe8f71 Btrfs: add missing error handling to read_tree_block

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2013-09-01 20:04:46 +0800
eb2067f71 Fix leak in __btrfs_map_block error path

If we bail out when the stripe alloc fails, we need to undo the
earlier allocation of raid_map.

Signed-off-by: Dave Jones
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Dave Jones
2013-09-01 20:04:45 +0800
f5929cd81 Btrfs: add missing error check to find_parent_nodes

Signed-off-by: Filipe David Borba Manana
Reviewed-by: Jan Schmidt
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2013-09-01 20:04:44 +0800
395927a9d Btrfs: optimize function btrfs_read_chunk_tree

After reading all device items from the chunk tree, don't
exit the loop and then navigate down the tree again to find
the chunk items. Instead just read all device items and
chunk items with a single tree search. This is possible
because all device items are found before any chunk item in
the chunk tree.

Signed-off-by: Filipe David Borba Manana
Reviewed-by: Miao Xie
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2013-09-01 20:04:43 +0800
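
The single-pass idea from "optimize function btrfs_read_chunk_tree" can be sketched over a flat, already sorted array of items. It relies on the ordering stated in the commit message (device items come before chunk items); the types and the function `read_chunk_tree_sketch()` are hypothetical, and counting stands in for the real per-item work.

```c
#include <stddef.h>

/* Hypothetical item kinds found in the chunk tree. */
enum item_type_sketch { ITEM_DEV, ITEM_CHUNK };

struct item_sketch {
    enum item_type_sketch type;
};

struct read_counts_sketch {
    size_t devices;
    size_t chunks;
};

/*
 * Handle device items and chunk items in one walk instead of walking
 * the tree once for devices and then searching again for chunks.
 * Because device items sort first, every device has been seen before
 * the first chunk item is reached.
 */
static struct read_counts_sketch
read_chunk_tree_sketch(const struct item_sketch *items, size_t nritems)
{
    struct read_counts_sketch counts = { 0, 0 };

    for (size_t i = 0; i < nritems; i++) {
        if (items[i].type == ITEM_DEV)
            counts.devices++; /* stands in for reading one device item */
        else
            counts.chunks++;  /* stands in for reading one chunk item */
    }
    return counts;
}
```
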
6596a9281 Btrfs: don't bug_on when we fail when cleaning up transactions

There is no reason for this sort of jackassery. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-09-01 20:04:42 +0800