21 Feb, 2009
1 commit
-
This is a step in the direction of better -ENOSPC handling. Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.
If we don't, we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC. This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.
bytes_delalloc accounts for extents we've actually set up for delalloc and that will
be allocated at some point down the line.
bytes_may_use keeps track of how many bytes we may use for delalloc at
some point. When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter. This keeps us from
not actually being able to allocate space for any delalloc bytes.
Signed-off-by: Josef Bacik
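The accounting described in the commit above can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel code: only the counter names bytes_delalloc and bytes_may_use come from the commit message, while SpaceInfo, reserve, set_delalloc, chunk_size and chunks_left are hypothetical names chosen for illustration.

```python
import errno
from dataclasses import dataclass


@dataclass
class SpaceInfo:
    """Toy stand-in for btrfs_space_info (hypothetical, heavily simplified)."""
    total_bytes: int
    bytes_used: int = 0
    bytes_delalloc: int = 0  # extents already set up for delalloc
    bytes_may_use: int = 0   # bytes reserved for possible future delalloc

    def free_bytes(self) -> int:
        return (self.total_bytes - self.bytes_used
                - self.bytes_delalloc - self.bytes_may_use)


def reserve(info: SpaceInfo, nbytes: int, chunk_size: int, chunks_left: int) -> int:
    """Reserve nbytes against the space_info counters.

    Returns 0 on success or -ENOSPC. chunk_size and chunks_left are
    assumptions that stand in for real chunk allocation.
    """
    if info.free_bytes() < nbytes:
        if chunks_left <= 0:
            return -errno.ENOSPC
        info.total_bytes += chunk_size  # "try to allocate a new chunk"
        if info.free_bytes() < nbytes:
            return -errno.ENOSPC
    info.bytes_may_use += nbytes
    return 0


def set_delalloc(info: SpaceInfo, nbytes: int) -> None:
    """Move a reservation to bytes_delalloc once the extent bit is set."""
    info.bytes_may_use -= nbytes
    info.bytes_delalloc += nbytes
```

Because the reservation is taken before the delalloc bit is set, a later set_delalloc call only moves bytes between the two counters and never needs new space.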
20 Feb, 2009
1 commit
-
fsync can be called by NFS with a null file pointer, and btrfs was
oopsing in this case.
Signed-off-by: Chris Mason
22 Jan, 2009
1 commit
-
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.
This patch makes the following changes to fix the bug:
Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.
Don't insert the root item into the log root tree immediately
after the log tree is allocated. The root item for a log tree is
inserted when the log tree gets synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.
At tree-log sync, btrfs_sync_log first syncs the log tree,
then updates the corresponding root item in the log root tree,
syncs the log root tree, and then updates the super block.
Signed-off-by: Yan Zheng
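The sync order listed in the commit above, together with the lazy insertion of the root item, can be illustrated with a small event-recording model. This is only a sketch: FsModel, sync_log and the event strings are hypothetical, and nothing here touches a real filesystem.

```python
from typing import Dict, List


class FsModel:
    """Toy model: one log tree per subvolume root plus one log root tree."""

    def __init__(self) -> None:
        # root name -> last synced log version (hypothetical bookkeeping)
        self.log_root_items: Dict[str, int] = {}
        self.events: List[str] = []

    def sync_log(self, root: str, version: int) -> None:
        # 1) sync the log tree of this root only
        self.events.append(f"sync log tree of {root}")
        # 2) the root item is inserted on the first sync, updated afterwards
        verb = "update" if root in self.log_root_items else "insert"
        self.log_root_items[root] = version
        self.events.append(f"{verb} root item for {root}")
        # 3) sync the log root tree, 4) update the super block
        self.events.append("sync log root tree")
        self.events.append("update super block")
```

Keeping the state per root is what lets sync requests for different log trees proceed without being merged into one another.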
21 Jan, 2009
1 commit
-
Removed unused #includes in btrfs
Signed-off-by: Huang Weiyi
Signed-off-by: Chris Mason
06 Jan, 2009
3 commits
-
btrfs_drop_extents doesn't change the file extent's ram_bytes
in the case of a bookend extent. To be consistent, we should
also not change ram_bytes when truncating an existing extent.
Signed-off-by: Yan Zheng
-
There were many, most are fixed now. struct-funcs.c generates some warnings
but these are bogus.
Signed-off-by: Chris Mason
-
Signed-off-by: Chris Mason
12 Dec, 2008
1 commit
-
Checksums on data can be disabled by mount option, so it's
possible some data extents don't have checksums or have
invalid checksums. This causes trouble for data relocation.
This patch contains the following changes to make data relocation
work.
1) Make the nodatasum/nodatacow mount options only affect new
files. Checksums and COW on data are only controlled by the
inode flags.
2) Check for the existence of a checksum in the nodatacow checker.
If checksums exist, force COW of the data extent. This ensures that
the checksum for a given block is either valid or does not exist.
3) Update the data relocation code to properly handle the case
of a missing checksum.
Signed-off-by: Yan Zheng
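Points 1) and 2) of the commit above amount to a small decision rule, sketched here. InodeFlags, flags_for_new_file and needs_cow are hypothetical names, and the real nodatacow checker considers more than is modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InodeFlags:
    """Per-inode flags (hypothetical simplification)."""
    nodatasum: bool = False
    nodatacow: bool = False


def flags_for_new_file(mount_nodatasum: bool, mount_nodatacow: bool) -> InodeFlags:
    # Mount options are only consulted when a file is created.
    return InodeFlags(nodatasum=mount_nodatasum, nodatacow=mount_nodatacow)


def needs_cow(flags: InodeFlags, extent_has_checksum: bool) -> bool:
    """Decide whether a write to an existing data extent must be COWed."""
    if not flags.nodatacow:
        return True
    # Overwriting in place would leave a stale checksum behind, so an
    # extent that has checksums is COWed even on a nodatacow inode.
    return extent_has_checksum
```

An existing file keeps its flags after a remount with different options, so the rule depends only on the inode and the extent.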
09 Dec, 2008
2 commits
-
The fsync logging code makes sure to only copy the relevant checksum for each
extent based on the file extent pointers it finds.
But for compressed extents, it needs to copy the checksum for the
entire extent.
Signed-off-by: Chris Mason
-
This adds a sequence number to the btrfs inode that is increased on
every update. NFS will be able to use that to detect when an inode has
changed, without relying on inaccurate time fields.
While we're here, this also:
Puts reserved space into the super block and inode
Adds a log root transid to the super so we can pick the newest super
based on the fsync log as well as the main transaction ID. For now
the log root transid is always zero, but that'll get fixed.
Adds a starting offset to the dev_item. This will let us do better
alignment calculations if we know the start of a partition on the disk.
Signed-off-by: Chris Mason
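One way to read "pick the newest super based on the fsync log as well as the main transaction ID" is as a two-level comparison, sketched below. The ordering (transaction ID first, log root transid as tie-breaker) is an assumption, and SuperCopy and newest_super are hypothetical names.

```python
from typing import NamedTuple, Sequence


class SuperCopy(NamedTuple):
    """The two fields of a super block copy that matter for this comparison."""
    generation: int        # main transaction ID
    log_root_transid: int  # always zero for now, per the commit message


def newest_super(copies: Sequence[SuperCopy]) -> SuperCopy:
    # Assumed ordering: compare the transaction ID first and fall back to
    # the log root transid only when the transaction IDs are equal.
    return max(copies, key=lambda s: (s.generation, s.log_root_transid))
```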
02 Dec, 2008
1 commit
-
Signed-off-by: Chris Mason
13 Nov, 2008
1 commit
-
When an extent needs to be split, btrfs_mark_extent_written truncates the extent
first, then inserts a new extent and increases the reference count.
The race happens if someone else deletes the old extent before the new extent
is inserted. The fix here is to increase the reference count in advance. This race
is similar to the race in btrfs_drop_extents that was recently fixed.
Signed-off-by: Yan Zheng
11 Nov, 2008
2 commits
-
btrfs_drop_extents will drop paths and search again when it needs to
force COW of higher nodes. It was using the key it found during the last
search as the offset for the next search.
But this wasn't always correct. The key could be from before our desired
range, and because we're dropping the path, it is possible for the file's items
to change while we do the search again.
The fix here is to make sure we don't search for something smaller than
the offset btrfs_drop_extents was called with.
Signed-off-by: Chris Mason
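The fix in the commit above boils down to clamping the restart offset, as this one-function sketch shows. The function and parameter names are hypothetical.

```python
def next_search_offset(found_key_offset: int, start: int) -> int:
    """Offset to use when the search is restarted after dropping the path.

    found_key_offset is the offset of the key found by the previous search;
    start is the offset btrfs_drop_extents was originally called with.
    """
    # A key from before the requested range must not pull the search backwards.
    return max(found_key_offset, start)
```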
-
This makes sure the orig_start field in struct extent_map gets set
everywhere the extent_map structs are created or modified.
Signed-off-by: Chris Mason
10 Nov, 2008
1 commit
-
The decompress code doesn't take the logical offset in the extent
pointer into account. If the logical offset isn't zero, data
will be decompressed into the wrong pages.
The solution used here is to record the starting offset of the extent
in the file separately from the logical start of the extent_map struct.
This allows us to avoid problems inserting overlapping extents.
Signed-off-by: Yan Zheng
07 Nov, 2008
1 commit
-
When reading compressed extents, try to put pages into the page cache
for any pages covered by the compressed extent that readpages didn't already
preload.
Add an async work queue to handle transformations at delayed allocation processing
time. Right now this is just compression. The workflow is:
1) Find offsets in the file marked for delayed allocation
2) Lock the pages
3) Lock the state bits
4) Call the async delalloc code
The async delalloc code clears the state lock bits and delalloc bits. It is
important this happens before the range goes into the work queue because
otherwise it might deadlock with other work queue items that try to lock
those extent bits.
The file pages are compressed, and if the compression doesn't work the
pages are written back directly.
An ordered work queue is used to make sure the inodes are written in the same
order that pdflush or writepages sent them down.
This changes extent_write_cache_pages to let the writepage function
update the wbc nr_written count.
Signed-off-by: Chris Mason
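The idea of an ordered work queue, where work runs in parallel but results are consumed in submission order, can be sketched in userspace with a thread pool. This is an illustration only: zlib stands in for the compression step, and compress_range and process_in_order are hypothetical names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def compress_range(data: bytes) -> Tuple[bool, bytes]:
    """Compress one range; fall back to the raw bytes if that does not help."""
    packed = zlib.compress(data)
    if len(packed) < len(data):
        return True, packed
    return False, data  # "written back directly"


def process_in_order(ranges: List[bytes], workers: int = 4) -> List[Tuple[bool, bytes]]:
    """Run compress_range on every range in parallel.

    Executor.map yields results in the order the inputs were submitted,
    regardless of which worker finishes first, which mirrors the ordering
    guarantee described above.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_range, ranges))
```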
01 Nov, 2008
1 commit
-
Make sure we keep page->mapping NULL on the pages we're getting
via alloc_page. It gets set so a few of the callbacks can do the right
thing, but in general these pages don't have a mapping.
Don't try to truncate compressed inline items in btrfs_drop_extents.
The whole compressed item must be preserved.
Don't try to create multipage inline compressed items. When we try to
overwrite just the first page of the file, we would have to read in and recow
all the pages after it in the same compressed inline items. For now, only
create single page inline items.
Make sure we lock pages in the correct order during delalloc. The
search into the state tree for delalloc bytes can return bytes before
the page we already have locked.
Signed-off-by: Chris Mason
31 Oct, 2008
3 commits
-
This patch updates btrfs-progs for fallocate support.
fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it. This leverages the
-o nodatacow checks.
Signed-off-by: Yan Zheng
-
When dropping the middle part of an extent, btrfs_drop_extents truncates
the extent first, then inserts a bookend extent.
Since truncation and insertion can't be done atomically, there is a small
period during which the bookend extent isn't in the tree. This causes problems for
functions that search the tree for file extent items. The way to fix this is to
lock the range of the bookend extent before truncation.
Signed-off-by: Yan Zheng
-
This patch splits the hole insertion code out of btrfs_setattr
into btrfs_cont_expand and updates btrfs_get_extent to properly
handle the case where file extent items are not contiguous.
Signed-off-by: Yan Zheng
30 Oct, 2008
1 commit
-
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason
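The size limits and the "flag the file when compression does not help" rule from the commit above can be put together in a small userspace sketch. The commit message does not say how the two limits interact, so the reading used here (cut the input at 256k of uncompressed data, and store a chunk raw if its compressed form is not smaller or exceeds 128k) is an assumption. zlib is used as a stand-in compressor and write_extents is a hypothetical name.

```python
import zlib
from typing import List, Tuple

MAX_UNCOMPRESSED = 256 * 1024  # RAM-side limit quoted in the commit message
MAX_COMPRESSED = 128 * 1024    # on-disk limit quoted in the commit message


def write_extents(data: bytes, nocompress: bool = False) -> Tuple[List[Tuple[str, bytes]], bool]:
    """Split data into extents and return (extents, updated nocompress flag).

    Each extent is a (encoding, payload) pair with encoding "zlib" or "raw".
    """
    extents: List[Tuple[str, bytes]] = []
    for off in range(0, len(data), MAX_UNCOMPRESSED):
        chunk = data[off:off + MAX_UNCOMPRESSED]
        if not nocompress:
            packed = zlib.compress(chunk)
            if len(packed) < len(chunk) and len(packed) <= MAX_COMPRESSED:
                extents.append(("zlib", packed))
                continue
            # Compression did not pay off: flag the file so that later
            # writes skip the attempt.
            nocompress = True
        extents.append(("raw", chunk))
    return extents, nocompress
```

The returned flag plays the role of the per-file marker mentioned above; passing it back in on the next call skips compression entirely.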
09 Oct, 2008
2 commits
-
The offset field in struct btrfs_extent_ref records the position
inside the file at which the file extent is referenced. In the new back
reference system, tree leaves holding references to a file extent
are recorded explicitly. We can scan these tree leaves very quickly, so the
offset field is not required.
This patch also makes the back reference system check the objectid
when extents are being deleted.
Signed-off-by: Yan Zheng
-
This patch makes btrfs count space allocated to a file in bytes instead
of 512 byte sectors.
Everything else in btrfs uses a byte count instead of sector sizes or
block sizes, so this fits better.
Signed-off-by: Yan Zheng
04 Oct, 2008
1 commit
-
This reworks the btrfs O_DIRECT write code a bit. It had always fallen
back to buffered IO and done an invalidate, but needed to be updated
for the data=ordered code. The invalidate wasn't actually removing pages
because they were still inside an ordered extent.
This also combines the O_DIRECT/O_SYNC paths where possible, and kicks
off IO in the main btrfs_file_write loop to keep the pipe down to the
disk full as we process long writes.
Signed-off-by: Chris Mason
30 Sep, 2008
1 commit
-
This improves the comments at the top of many functions. It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.
extent-tree.c and volumes.c were both skipped, and there is definitely
more work to do in cleaning and commenting the code.
Signed-off-by: Chris Mason
26 Sep, 2008
2 commits
-
* Add an EXTENT_BOUNDARY state bit to keep the writepage code
from merging data extents that are in the process of being
relocated. This allows us to do accounting for them properly.
* The balancing code relocates data extents independent of the underlying
inode. The extent_map code was modified to properly account for
things moving around (invalidating extent_map caches in the inode).
* Don't take the drop_mutex in the create_subvol ioctl. It isn't
required.
* Fix walking of the ordered extent list to avoid races with sys_unlink
* Change the lock ordering rules. Transaction start goes outside
the drop_mutex. This allows btrfs_commit_transaction to directly
drop the relocation trees.
Signed-off-by: Chris Mason
-
Btrfs had compatibility code for kernels back to 2.6.18. This code has
been removed, and will be maintained in a separate backport
git tree from now on.
Signed-off-by: Chris Mason
25 Sep, 2008
13 commits
-
This patch makes the back reference system explicitly record the
location of the parent node for all types of extents. The location of the
parent node is placed into the offset field of the backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.
Signed-off-by: Chris Mason
-
Drop i_mutex during the commit
Don't bother doing the fsync at all unless the dir is marked as dirtied
and needing fsync in this transaction. For directories, this means
that someone has unlinked a file from the dir without fsyncing the
file.
Signed-off-by: Chris Mason
-
File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.
After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.
Signed-off-by: Chris Mason
-
Signed-off-by: Chris Mason
-
Signed-off-by: Chris Mason
-
While dropping snapshots, walk_down_tree does most of the work of checking
reference counts and limiting tree traversal to just the blocks that
we are freeing.
It dropped and held the allocation mutex in strange and confusing ways.
This commit changes it to only hold the mutex while actually freeing a block.
The rest of the checks around reference counts should be safe without the lock
because we only allow one process in btrfs_drop_snapshot at a time. Other
processes dropping reference counts should not drop it to 1 because
their tree roots already have an extra ref on the block.
Signed-off-by: Chris Mason
-
Signed-off-by: Chris Mason
-
This avoids waiting for transactions with pages locked by breaking out
the code to wait for the current transaction to close into a function
called by btrfs_throttle.
It also lowers the limits for where we start throttling.
Signed-off-by: Chris Mason
-
Add a couple of #if's to follow API changes.
Signed-off-by: Sven Wegener
Signed-off-by: Chris Mason
-
The memory reclaiming issue happens when a snapshot exists. In that
case, some cache entries may not be used while an old snapshot is being dropped,
so they will remain in the cache until umount.
The patch adds a field to struct btrfs_leaf_ref to record the create time. In addition,
the patch links all dead roots of a given snapshot together in order of
create time. After an old snapshot has been completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.
Signed-off-by: Chris Mason
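The pruning rule from the commit above, which removes every cache entry created before the oldest remaining dead root, can be sketched as follows. The dictionary layout and the name prune_leaf_refs are hypothetical, and leaving the cache untouched when no dead roots remain is an assumption, since the commit message does not cover that case.

```python
from typing import Dict, List


def prune_leaf_refs(cache: Dict[int, int], dead_root_times: List[int]) -> Dict[int, int]:
    """Return the cache without entries older than the oldest dead root.

    cache maps a block number to the create time recorded for its leaf ref;
    dead_root_times holds the create times of the remaining dead roots.
    """
    if not dead_root_times:
        return dict(cache)  # assumption: nothing to compare against
    oldest = min(dead_root_times)
    # "created before" is read as strictly older than the oldest dead root
    return {block: t for block, t in cache.items() if t >= oldest}
```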
-
A large reference cache is directly related to a lot of work pending
for the cleaner thread. This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.
Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.
Signed-off-by: Chris Mason
-
This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.
This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.
Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.
Signed-off-by: Chris Mason
-
Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree. The tree was caching the last
pointer returned, and searches would check the last pointer first.
But search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.
For now, the code to cache the last return value is just removed. It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.
This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason
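The lookup bug described in the commit above can be reproduced in a small model: a range lookup that trusts a cached "last returned" extent may return a matching extent that is not the first one in the range. This is a sketch over a sorted list rather than the kernel's tree, and both function names are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple

Extent = Tuple[int, int]  # (start, length); the list is sorted, non-overlapping


def lookup_first(extents: List[Extent], start: int, length: int) -> Optional[Extent]:
    """Return the lowest-offset extent overlapping [start, start + length)."""
    end = start + length
    starts = [s for s, _ in extents]
    i = bisect.bisect_right(starts, start) - 1  # last extent starting at or before start
    if i >= 0 and extents[i][0] + extents[i][1] > start:
        return extents[i]
    i += 1  # otherwise the first extent that starts after start
    if i < len(extents) and extents[i][0] < end:
        return extents[i]
    return None


def lookup_with_last_hint(extents: List[Extent], start: int, length: int,
                          last: Optional[Extent]) -> Optional[Extent]:
    """Buggy variant: checks the cached extent first, as described above."""
    if last is not None and last[0] < start + length and last[0] + last[1] > start:
        return last  # overlaps the range, but may not be the first match
    return lookup_first(extents, start, length)
```

With three adjacent 4096-byte extents and a range covering all of them, the hinted variant returns whichever extent was cached, while lookup_first always returns the one at offset 0.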