12 Apr, 2014

1 commit

  • Pull second set of btrfs updates from Chris Mason:
    "The most important changes here are from Josef, fixing a btrfs
    regression in 3.14 that can cause corruptions in the extent allocation
    tree when snapshots are in use.

    Josef also fixed some deadlocks in send/recv and other assorted races
    when balance is running"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
    Btrfs: fix compile warnings on on avr32 platform
    btrfs: allow mounting btrfs subvolumes with different ro/rw options
    btrfs: export global block reserve size as space_info
    btrfs: fix crash in remount(thread_pool=) case
    Btrfs: abort the transaction when we don't find our extent ref
    Btrfs: fix EINVAL checks in btrfs_clone
    Btrfs: fix unlock in __start_delalloc_inodes()
    Btrfs: scrub raid56 stripes in the right way
    Btrfs: don't compress for a small write
    Btrfs: more efficient io tree navigation on wait_extent_bit
    Btrfs: send, build path string only once in send_hole
    btrfs: filter invalid arg for btrfs resize
    Btrfs: send, fix data corruption due to incorrect hole detection
    Btrfs: kmalloc() doesn't return an ERR_PTR
    Btrfs: fix snapshot vs nocow writting
    btrfs: Change the expanding write sequence to fix snapshot related bug.
    btrfs: make device scan less noisy
    btrfs: fix lockdep warning with reclaim lock inversion
    Btrfs: hold the commit_root_sem when getting the commit root during send
    Btrfs: remove transaction from send
    ...

    Linus Torvalds
     

08 Apr, 2014

2 commits

  • Introduce a block group type bit for the global reserve and fill in the space
    info for the SPACE_INFO ioctl. This should replace the newly added ioctl
    (01e219e8069516cdb98594d417b8bb8d906ed30d) that returns just the 'size' part
    of the global reserve, while the actual usage can now be visible in the
    'btrfs fi df' output during ENOSPC stress.

    The unpatched userspace tools will show the blockgroup as 'unknown'.

    CC: Jeff Mahoney
    CC: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • We currently rely too heavily on roots being read-only to save us from just
    accessing root->commit_root. We can easily balance blocks out from underneath a
    read only root, so to save us from getting screwed make sure we only access
    root->commit_root under the commit root sem. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

07 Apr, 2014

1 commit

  • Let's try this again. We can deadlock the box if we send on a box and try to
    write onto the same fs with the app that is trying to listen to the send pipe.
    This is because the writer could get stuck waiting for a transaction commit
    which is being blocked by the send. So fix this by making sure looking at the
    commit roots is always going to be consistent. We do this by keeping track of
    which roots need to have their commit roots swapped during commit, and then
    taking the commit_root_sem and swapping them all at once. Then make sure we
    take a read lock on the commit_root_sem in cases where we search the commit root
    to make sure we're always looking at a consistent view of the commit roots.
    Previously we had problems with this because we would swap a fs tree commit root
    and then swap the extent tree commit root independently which would cause the
    backref walking code to screw up sometimes. With this patch we no longer
    deadlock and pass all the weird send/receive corner cases. Thanks,

    Reported-by: Hugo Mills
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

05 Apr, 2014

1 commit

  • Pull btrfs changes from Chris Mason:
    "This is a pretty long stream of bug fixes and performance fixes.

    Qu Wenruo has replaced the btrfs async threads with regular kernel
    workqueues. We'll keep an eye out for performance differences, but
    it's nice to be using more generic code for this.

    We still have some corruption fixes and other patches coming in for
    the merge window, but this batch is tested and ready to go"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
    Btrfs: fix a crash of clone with inline extents's split
    btrfs: fix uninit variable warning
    Btrfs: take into account total references when doing backref lookup
    Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
    Btrfs: fix incremental send's decision to delay a dir move/rename
    Btrfs: remove unnecessary inode generation lookup in send
    Btrfs: fix race when updating existing ref head
    btrfs: Add trace for btrfs_workqueue alloc/destroy
    Btrfs: less fs tree lock contention when using autodefrag
    Btrfs: return EPERM when deleting a default subvolume
    Btrfs: add missing kfree in btrfs_destroy_workqueue
    Btrfs: cache extent states in defrag code path
    Btrfs: fix deadlock with nested trans handles
    Btrfs: fix possible empty list access when flushing the delalloc inodes
    Btrfs: split the global ordered extents mutex
    Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
    Btrfs: reclaim delalloc metadata more aggressively
    Btrfs: remove unnecessary lock in may_commit_transaction()
    Btrfs: remove the unnecessary flush when preparing the pages
    Btrfs: just do dirty page flush for the inode with compression before direct IO
    ...

    Linus Torvalds
     

11 Mar, 2014

22 commits

  • We didn't have a lock to protect access to the delalloc inodes list, so we
    might access an empty delalloc inodes list if someone started flushing
    delalloc inodes, because the delalloc inodes are moved into another list
    temporarily. Fix it by wrapping the access with a lock.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • When we create a snapshot, we just need to wait for the ordered extents in
    the source fs/file root. But because we used a global mutex to protect the
    ordered extents list of the source fs/file root to avoid accessing an empty
    list, if someone else took the mutex to access the ordered extents list
    of another fs/file root, we had to wait.

    This patch splits the above global mutex, now every fs/file root has
    its own mutex to protect its own list.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • We needn't flush all delalloc inodes when we don't get the s_umount lock,
    or we would make the tasks wait for a long time.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • If the snapshot creation happens after a nocow write but before the dirty
    data flush, we will fail to flush the dirty data because of lack of space.

    So we must keep track of when those nocow write operations start and when
    they end; if there are nocow writers, the snapshot creators must wait. To
    implement this, I introduce btrfs_{start, end}_nocow_write(),
    which is similar to mnt_{want,drop}_write().

    These two functions are only used for nocow file write operations.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
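    The start/end pairing described above behaves like a reader-counted gate.
    A minimal userspace sketch, with pthreads standing in for the kernel's
    wait machinery (all names here are illustrative, not the kernel's):

```c
/* Sketch: nocow writers bump a counter; the snapshot creator flips a
 * "blocked" flag and waits for the counter to drain. */
#include <pthread.h>
#include <stdbool.h>

struct nocow_gate {
    pthread_mutex_t lock;
    pthread_cond_t drained;
    int writers;        /* nocow writes in flight */
    bool blocked;       /* snapshot creation pending */
};

/* Returns false if a snapshot is being created; the caller falls
 * back to the COW write path instead. */
bool start_nocow_write(struct nocow_gate *g)
{
    bool ok;
    pthread_mutex_lock(&g->lock);
    ok = !g->blocked;
    if (ok)
        g->writers++;
    pthread_mutex_unlock(&g->lock);
    return ok;
}

void end_nocow_write(struct nocow_gate *g)
{
    pthread_mutex_lock(&g->lock);
    if (--g->writers == 0)
        pthread_cond_broadcast(&g->drained);
    pthread_mutex_unlock(&g->lock);
}

/* Snapshot creator: block new nocow writers, then wait for the
 * in-flight ones to finish. */
void wait_for_nocow_writers(struct nocow_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->blocked = true;
    while (g->writers > 0)
        pthread_cond_wait(&g->drained, &g->lock);
    pthread_mutex_unlock(&g->lock);
}
```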
     
  • Since the "_struct" suffix is mainly used to distinguish the different
    btrfs_work types between the original implementation and the newly created
    one, there is no need for the suffix now that all btrfs_workers are changed
    into btrfs_workqueue.

    This patch also fixes some code whose style was affected by the overly
    long "_struct" suffix.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Since all the btrfs_workers are replaced with the newly created
    btrfs_workqueue, the old code can be easily removed.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->scrub_* with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->qgroup_rescan_worker with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->delayed_workers with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->fixup_workers with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->readahead_workers with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->cache_workers with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->rmw_workers with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->endio_* workqueues with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Replace the fs_info->submit_workers with the newly created
    btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Much like the fs_info->workers, replace the fs_info->submit_workers
    with the same btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Much like the fs_info->workers, replace the fs_info->delalloc_workers
    with the same btrfs_workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Use the newly created btrfs_workqueue_struct to replace the original
    fs_info->workers.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • We might commit a log sub-transaction which didn't contain the metadata we
    logged. This was because we didn't record the log transid and just selected
    the current log sub-transaction to commit, but the right one might already
    have been committed by another task. In that case we actually needn't do
    anything, and it is safe to just return.

    This patch improves the log sync using the above idea. We record the transid
    of the log sub-transaction in which we logged the metadata, and the transid
    of the log sub-transaction we have committed. If the committed transid
    is >= the transid we recorded when logging the metadata, we just return.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
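    The skip check described above reduces to a single comparison. A tiny
    sketch (field names are illustrative; the kernel keeps equivalents on the
    root and its log context):

```c
/* Sketch of the "skip redundant log commits" decision. */
#include <stdbool.h>

struct log_root_state {
    int last_log_commit;  /* transid of the last committed log sub-transaction */
};

/* fsync logged its metadata in log sub-transaction @log_transid. If a
 * sub-transaction with transid >= log_transid was already committed by
 * another task, our metadata is on disk and we can return directly. */
bool log_commit_needed(const struct log_root_state *s, int log_transid)
{
    return s->last_log_commit < log_transid;
}
```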
     
  • It is possible for many tasks to sync the log tree at the same time, but
    only one task can do the sync work; the others wait for it. However, the
    waiting tasks didn't get the result of the log sync and returned 0 when the
    wait ended. This caused them to skip error handling, and the serious
    problem was that they told users the file sync succeeded when in fact it
    had failed.

    This patch fixes the problem by introducing a log context structure that
    we insert into a global list. When the sync fails, we set the error number
    on every log context in the list; the waiting tasks then read the error
    number from their log context and handle the error if needed.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
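    The propagation scheme above can be sketched as a plain linked list of
    contexts (names are illustrative, not the kernel's):

```c
/* Sketch: each waiter registers a context on a global list; the task
 * that performs the sync writes the result into every registered
 * context, so waiters no longer assume success. */
#include <stddef.h>

struct log_ctx {
    int log_ret;            /* result of the sync, filled in by the syncer */
    struct log_ctx *next;   /* linkage on the global context list */
};

static struct log_ctx *ctx_list;  /* contexts waiting for the current sync */

void log_ctx_register(struct log_ctx *ctx)
{
    ctx->log_ret = 0;
    ctx->next = ctx_list;
    ctx_list = ctx;
}

/* On failure, the syncing task propagates the error to every waiter;
 * each one then returns ctx->log_ret instead of an unconditional 0. */
void log_ctx_set_error(int err)
{
    for (struct log_ctx *c = ctx_list; c; c = c->next)
        c->log_ret = err;
    ctx_list = NULL;
}
```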
     
  • The log trans id is initialized to 0 every time we create a log tree,
    and the log tree needs to be re-created after a new transaction is started,
    so the log trans id is unlikely to be a huge number. We can therefore use a
    signed integer instead of an unsigned long to save a bit of space.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • During a device replace test, we hit a null pointer dereference (it was very
    easy to reproduce by running xfstests' btrfs/011 on devices with the virtio
    scsi driver). Two bugs caused this problem:
    - We might allocate new chunks on the replaced device after we updated
    the mapping tree, and we forgot to replace the source device in the
    mappings of those new chunks.
    - We might get mapping information which included the source device
    before the mapping information update, and then submit a bio based
    on that mapping information after we freed the source device.

    For the first bug, we can fix it by doing mapping tree update and source
    device remove in the same context of the chunk mutex. The chunk mutex is
    used to protect the allocable device list, the above method can avoid
    the new chunk allocation, and after we remove the source device, all
    the new chunks will be allocated on the new device. So it can fix
    the first bug.

    For the second bug, we need to make sure all in-flight bios are finished
    and no new bios are produced while we are removing the source device. To
    fix this problem, we introduced a global @bio_counter; we not only inc/dec
    @bio_counter outside of map_blocks, but also inc it before submitting a bio
    and dec @bio_counter when ending bios.

    Since Raid56 is a little different and device replace doesn't support raid56
    yet, it is not addressed in this patch, and I added comments to make sure we
    will fix it in the future.

    Reported-by: Qu Wenruo
    Signed-off-by: Wang Shilong
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
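    The @bio_counter protocol above can be sketched in a few lines, with C11
    atomics standing in for the kernel's percpu counter (names are
    illustrative):

```c
/* Sketch: bios hold a reference for their whole lifetime, and device
 * replace waits until the count drains before freeing the source
 * device, so no bio built from the old mapping can still be in flight. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int bio_counter;

/* Taken before submitting a bio (and around map_blocks). */
void bio_counter_inc(void) { atomic_fetch_add(&bio_counter, 1); }

/* Dropped when the bio ends. */
void bio_counter_dec(void) { atomic_fetch_sub(&bio_counter, 1); }

/* The replace finish path may only tear down the source device once
 * the counter has drained. */
bool can_free_source_device(void)
{
    return atomic_load(&bio_counter) == 0;
}
```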
     

31 Jan, 2014

1 commit

  • Pull btrfs updates from Chris Mason:
    "This is a pretty big pull, and most of these changes have been
    floating in btrfs-next for a long time. Filipe's properties work is a
    cool building block for inheriting attributes like compression down on
    a per inode basis.

    Jeff Mahoney kicked in code to export filesystem info into sysfs.

    Otherwise, lots of performance improvements, cleanups and bug fixes.

    Looks like there are still a few other small pending incrementals, but
    I wanted to get the bulk of this in first"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
    Btrfs: fix spin_unlock in check_ref_cleanup
    Btrfs: setup inode location during btrfs_init_inode_locked
    Btrfs: don't use ram_bytes for uncompressed inline items
    Btrfs: fix btrfs_search_slot_for_read backwards iteration
    Btrfs: do not export ulist functions
    Btrfs: rework ulist with list+rb_tree
    Btrfs: fix memory leaks on walking backrefs failure
    Btrfs: fix send file hole detection leading to data corruption
    Btrfs: add a reschedule point in btrfs_find_all_roots()
    Btrfs: make send's file extent item search more efficient
    Btrfs: fix to catch all errors when resolving indirect ref
    Btrfs: fix protection between walking backrefs and root deletion
    btrfs: fix warning while merging two adjacent extents
    Btrfs: fix infinite path build loops in incremental send
    btrfs: undo sysfs when open_ctree() fails
    Btrfs: fix snprintf usage by send's gen_unique_name
    btrfs: fix defrag 32-bit integer overflow
    btrfs: sysfs: list the NO_HOLES feature
    btrfs: sysfs: don't show reserved incompat feature
    btrfs: call permission checks earlier in ioctls and return EPERM
    ...

    Linus Torvalds
     

29 Jan, 2014

12 commits

  • If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
    the new size. The fix uses the size directly from the item header when
    reading uncompressed inlines, and also fixes truncate to update the
    size as it goes.

    Reported-by: Jens Axboe
    Signed-off-by: Chris Mason
    CC: stable@vger.kernel.org

    Chris Mason
     
  • It is better for the lock to be placed close to the data it protects,
    because they may then share a cache line and we will load fewer cache lines
    when accessing them. So we rearrange the members of the btrfs_space_info
    structure to place the lock closer to its data.

    Signed-off-by: Miao Xie
    Reviewed-by: David Sterba
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Miao Xie
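    The locality argument above amounts to a layout choice. A made-up analog
    of the idea (this is not btrfs_space_info's real layout):

```c
/* Sketch: keep the lock immediately adjacent to the hot counters it
 * guards, so lock and data are likely fetched in the same cache line. */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

struct space_info_like {
    /* rarely-written configuration, grouped separately */
    uint64_t flags;

    /* the lock sits right before the fields it protects */
    pthread_mutex_t lock;
    uint64_t bytes_used;
    uint64_t bytes_reserved;
};
```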
     
  • Add a noinode_cache mount option for btrfs.

    Since the inode map cache involves the btrfs_find_free_ino/return_ino
    machinery, just toggling the mount option would mean an inode number
    obtained from the inode map cache is never returned to it.

    To keep finding and returning inodes consistent with each other,
    a new mount_opt bit, CHANGE_INODE_CACHE, is introduced.
    CHANGE_INODE_CACHE is set/cleared on remount, and the original
    INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
    successful transaction.
    Since finding/returning an inode happens entirely between
    btrfs_start_transaction and btrfs_commit_transaction, this keeps the
    behavior consistent.

    Also, the noinode_cache mount option does not stop the caching_kthread.

    Cc: David Sterba
    Signed-off-by: Miao Xie
    Signed-off-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Qu Wenruo
     
  • There is a bug when using btrfs_previous_item() to search for a metadata
    extent item. In btrfs_previous_item() we require the types to match;
    however, since skinny metadata was introduced by Josef, we may mix these
    two types, so just using btrfs_previous_item() does not work right.

    To keep btrfs_previous_item() behaving like a normal tree search, I
    introduce another function, btrfs_previous_extent_item().

    Signed-off-by: Wang Shilong
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • On one of our gluster clusters we noticed some pretty big lag spikes. This
    turned out to be because our transaction commit was taking like 3 minutes to
    complete. This is because we have like 30 gigs of metadata, so our global
    reserve would end up being the max which is like 512 mb. So our throttling code
    would allow a ridiculous amount of delayed refs to build up and then they'd all
    get run at transaction commit time, and for a cold mounted file system that
    could take up to 3 minutes to run. So fix the throttling to be based on both
    the size of the global reserve and how long it takes us to run delayed refs.
    This patch tracks the time it takes to run delayed refs and then only allows 1
    second's worth of outstanding delayed refs at a time. This way it will auto-tune
    itself from cold cache up to when everything is in memory and it no longer has
    to go to disk. This makes our transaction commits take much less time to run.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
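    The throttling rule described above reduces to simple arithmetic: cap the
    backlog at roughly one second's worth of refs, derived from the measured
    average cost of running one. A sketch (names are illustrative):

```c
/* Sketch: how many outstanding delayed refs we tolerate before
 * forcing the caller to help run them, given a running average of
 * how long one ref takes. */
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

uint64_t max_delayed_refs(uint64_t avg_ref_runtime_ns)
{
    if (avg_ref_runtime_ns == 0)
        return UINT64_MAX;   /* nothing measured yet: don't throttle */
    return NSEC_PER_SEC / avg_ref_runtime_ns;
}
```

    With a cold cache a ref might take, say, 1 ms (mostly disk reads), allowing
    only ~1000 outstanding refs; as everything lands in memory the per-ref cost
    drops and the allowance grows, which is the auto-tuning behavior.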
     
  • This change adds infrastructure to allow for generic properties for
    inodes. Properties are name/value pairs that can be associated with
    inodes for different purposes. They are stored as xattrs with the
    prefix "btrfs."

    Properties can be inherited - this means when a directory inode has
    inheritable properties set, these are added to new inodes created
    under that directory. Further, subvolumes can also have properties
    associated with them, and they can be inherited from their parent
    subvolume. Naturally, directory properties have priority over subvolume
    properties (in practice a subvolume property is just a regular
    property associated with the root inode, objectid 256, of the
    subvolume's fs tree).

    This change also adds one specific property implementation, named
    "compression", whose values can be "lzo" or "zlib" and it's an
    inheritable property.

    The corresponding changes to btrfs-progs were also implemented.
    A patch with xfstests for this feature will follow once there's
    agreement on this change/feature.

    Further, the script at the bottom of this commit message was used to
    do some benchmarks to measure any performance penalties of this feature.

    Basically the tests correspond to:

    Test 1 - create a filesystem and mount it with compress-force=lzo,
    then sequentially create N files of 64Kb each, measure how long it took
    to create the files, unmount the filesystem, mount the filesystem and
    perform an 'ls -lha' against the test directory holding the N files, and
    report the time the command took.

    Test 2 - create a filesystem and don't use any compression option when
    mounting it - instead set the compression property of the subvolume's
    root to 'lzo'. Then create N files of 64Kb, and report the time it took.
    Then unmount the filesystem, mount it again and perform an 'ls -lha' like
    in the former test. This means every single file ends up with a property
    (xattr) associated to it.

    Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
    compression property, have no real effect other than adding more work
    when inheriting properties and taking more btree leaf space.

    Test 4 - same as test 3 but with 10 properties per file.

    Results (in seconds, and averages of 5 runs each), for different N
    numbers of files follow.

    * Without properties (test 1)

                          file creation time   ls -lha time
    10 000 files                 3.49              0.76
    100 000 files               47.19              8.37
    1 000 000 files            518.51            107.06

    * With 1 property (compression property set to lzo - test 2)

                          file creation time   ls -lha time
    10 000 files                 3.63              0.93
    100 000 files               48.56              9.74
    1 000 000 files            537.72            125.11

    * With 4 properties (test 3)

                          file creation time   ls -lha time
    10 000 files                 3.94              1.20
    100 000 files               52.14             11.48
    1 000 000 files            572.70            142.13

    * With 10 properties (test 4)

                          file creation time   ls -lha time
    10 000 files                 4.61              1.35
    100 000 files               58.86             13.83
    1 000 000 files            656.01            177.61

    The increased latencies with properties are essentially because of:

    *) When creating an inode, we now synchronously write 1 more item
    (an xattr item) for each property inherited from the parent dir
    (or subvolume). This could be done in an asynchronous way such
    as we do for dir index items (delayed-inode.c), which could help
    reduce the file creation latency;

    *) With properties, we now have larger fs trees. For this particular
    test each xattr item uses 75 bytes of leaf space in the fs tree.
    This could be less by using a new item for xattr items, instead of
    the current btrfs_dir_item, since we could cut the 'location' and
    'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
    total of 26 bytes per xattr item) from the btrfs_dir_item type.

    Also tried batching the xattr insertions (ignoring proper hash
    collision handling, since it didn't exist) when creating files that
    inherit properties from their parent inode/subvolume, but the end
    results were (surprisingly) essentially the same.

    Test script:

    $ cat test.pl
    #!/usr/bin/perl -w

    use strict;
    use Time::HiRes qw(time);
    use constant NUM_FILES => 10_000;
    use constant FILE_SIZES => (64 * 1024);
    use constant DEV => '/dev/sdb4';
    use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
    use constant TEST_DIR => (MNT_POINT . '/testdir');

    system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";

    # following line for testing without properties
    #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";

    # following 2 lines for testing with properties
    system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
    system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";

    system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
    my ($t1, $t2);

    $t1 = time();
    for (my $i = 1; $i <= NUM_FILES; $i++) {
        my $p = TEST_DIR . '/file_' . $i;
        open(my $f, '>', $p) or die "Error opening file!";
        $f->autoflush(1);
        for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
            print $f ('A' x 4096) or die "Error writing to file!";
        }
        close($f);
    }
    $t2 = time();
    print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
    system("umount", DEV) == 0 or die "umount failed!";
    system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";

    $t1 = time();
    system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
    $t2 = time();
    print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
    system("umount", DEV) == 0 or die "umount failed!";

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • When writing to a file we drop existing file extent items that cover the
    write range and then add a new file extent item that represents that write
    range.

    Before this change we were doing a tree lookup to remove the file extent
    items, and then after we did another tree lookup to insert the new file
    extent item.
    Most of the time all the file extent items we need to drop are located
    within a single leaf - this is the leaf where our new file extent item ends
    up at. Therefore, in this common case just combine these 2 operations into
    a single one.

    By avoiding the second btree navigation for insertion of the new file extent
    item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
    COW operations, CPU time on btree node/leaf key binary searches, etc.

    Besides for file writes, this is an operation that happens for file fsync's
    as well. However log btrees are much less likely to big as big as regular
    fs btrees, therefore the impact of this change is smaller.

    The following benchmark was performed against an SSD drive and a
    HDD drive, both for random and sequential writes:

    sysbench --test=fileio --file-num=4096 --file-total-size=8G \
    --file-test-mode=[rndwr|seqwr] --num-threads=512 \
    --file-block-size=8192 --max-requests=1000000 \
    --file-fsync-freq=0 --file-io-mode=sync [prepare|run]

    All results below are averages of 10 runs of the respective test.

    ** SSD sequential writes

    Before this change: 225.88 Mb/sec
    After this change: 277.26 Mb/sec

    ** SSD random writes

    Before this change: 49.91 Mb/sec
    After this change: 56.39 Mb/sec

    ** HDD sequential writes

    Before this change: 68.53 Mb/sec
    After this change: 69.87 Mb/sec

    ** HDD random writes

    Before this change: 13.04 Mb/sec
    After this change: 14.39 Mb/sec

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • Convert all applicable cases of printk and pr_* to the btrfs_* macros.

    Fix all uses of the BTRFS prefix.

    Signed-off-by: Frank Holton
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Frank Holton
     
  • All the subvolumes that are involved in send must be read-only during the
    whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
    status to read-write, and the result of the send stream is undefined if the
    data changes unexpectedly.

    Fix that by adding a refcount for all involved roots and verify that
    there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
    read-only -> read-write transition.

    We need refcounts because there are no restrictions on number of send
    parallel operations currently run on a single subvolume, be it source,
    parent or one of the multiple clone sources.

    Kernel is silent when the RO checks fail and returns EPERM. The same set
    of checks is done already in userspace before send starts.

    Signed-off-by: David Sterba
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    David Sterba
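    The refcount scheme described above can be sketched as follows; struct and
    function names here are illustrative, not the kernel's:

```c
/* Sketch: every send operation using a root (source, parent, or clone
 * source) takes a reference, and the RO->RW transition in
 * SUBVOL_SETFLAGS is refused with EPERM while any are held. */
#include <stdatomic.h>
#include <errno.h>

struct subvol_root {
    atomic_int send_in_progress;
    int readonly;
};

void send_hold(struct subvol_root *r)    { atomic_fetch_add(&r->send_in_progress, 1); }
void send_release(struct subvol_root *r) { atomic_fetch_sub(&r->send_in_progress, 1); }

/* The RO -> RW flip is only allowed when no send references the root. */
int set_subvol_rw(struct subvol_root *r)
{
    if (atomic_load(&r->send_in_progress) > 0)
        return -EPERM;
    r->readonly = 0;
    return 0;
}
```

    Refcounting (rather than a single flag) matters because any number of send
    operations may run on one subvolume in parallel.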
     
  • It's not used anywhere, so just drop it.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • I need to create a fake tree to test qgroups and I don't want to have to setup a
    fake btree_inode. The fact is we only use the radix tree for the fs_info, so
    everybody else who allocates an extent_io_tree is just wasting the space anyway.
    This patch moves the radix tree and its lock into btrfs_fs_info so there is less
    stuff I have to fake to do qgroup sanity tests. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • The kernel macro pr_debug is defined as an empty statement when DEBUG is
    not defined. Make btrfs_debug match pr_debug to avoid spamming
    the kernel log with debug messages.

    Signed-off-by: Frank Holton
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Frank Holton