Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

12 Apr, 2014

1 commit

3123bca71 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull second set of btrfs updates from Chris Mason:
"The most important changes here are from Josef, fixing a btrfs
regression in 3.14 that can cause corruptions in the extent allocation
tree when snapshots are in use.

Josef also fixed some deadlocks in send/recv and other assorted races
when balance is running"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
Btrfs: fix compile warnings on on avr32 platform
btrfs: allow mounting btrfs subvolumes with different ro/rw options
btrfs: export global block reserve size as space_info
btrfs: fix crash in remount(thread_pool=) case
Btrfs: abort the transaction when we don't find our extent ref
Btrfs: fix EINVAL checks in btrfs_clone
Btrfs: fix unlock in __start_delalloc_inodes()
Btrfs: scrub raid56 stripes in the right way
Btrfs: don't compress for a small write
Btrfs: more efficient io tree navigation on wait_extent_bit
Btrfs: send, build path string only once in send_hole
btrfs: filter invalid arg for btrfs resize
Btrfs: send, fix data corruption due to incorrect hole detection
Btrfs: kmalloc() doesn't return an ERR_PTR
Btrfs: fix snapshot vs nocow writting
btrfs: Change the expanding write sequence to fix snapshot related bug.
btrfs: make device scan less noisy
btrfs: fix lockdep warning with reclaim lock inversion
Btrfs: hold the commit_root_sem when getting the commit root during send
Btrfs: remove transaction from send
...

Linus Torvalds
2014-04-12 05:16:53 +0800

08 Apr, 2014

4 commits

36523e951 btrfs: export global block reserve size as space_info ... Browse Code »

Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.

The unpatched userspace tools will show the blockgroup as 'unknown'.

CC: Jeff Mahoney
CC: Josef Bacik
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

David Sterba
2014-04-08 01:41:53 +0800
3a29bc092 Btrfs: fix EINVAL checks in btrfs_clone ... Browse Code »

btrfs_drop_extents can now return -EINVAL, but only one caller
in btrfs_clone was checking for it. This adds it to the
caller for inline extents, which is where we really need it.

Signed-off-by: Chris Mason

Chris Mason
2014-04-08 00:08:50 +0800
9a40f1222 btrfs: filter invalid arg for btrfs resize ... Browse Code »

Originally following cmds will work:
# btrfs fi resize -10A
# btrfs fi resize -10Gaha
Filter the arg by checking the return pointer of memparse.

Signed-off-by: Gui Hecheng
Signed-off-by: Chris Mason

Gui Hecheng
2014-04-08 00:08:45 +0800
84dbeb87d Btrfs: kmalloc() doesn't return an ERR_PTR ... Browse Code »

The error handling was copy and pasted from memdup_user(). It should be
checking for NULL obviously.

Fixes: abccd00f8af2 ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
Signed-off-by: Dan Carpenter
Signed-off-by: Chris Mason

Dan Carpenter
2014-04-08 00:08:44 +0800

05 Apr, 2014

1 commit

53c566625 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs changes from Chris Mason:
"This is a pretty long stream of bug fixes and performance fixes.

Qu Wenruo has replaced the btrfs async threads with regular kernel
workqueues. We'll keep an eye out for performance differences, but
it's nice to be using more generic code for this.

We still have some corruption fixes and other patches coming in for
the merge window, but this batch is tested and ready to go"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
Btrfs: fix a crash of clone with inline extents's split
btrfs: fix uninit variable warning
Btrfs: take into account total references when doing backref lookup
Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
Btrfs: fix incremental send's decision to delay a dir move/rename
Btrfs: remove unnecessary inode generation lookup in send
Btrfs: fix race when updating existing ref head
btrfs: Add trace for btrfs_workqueue alloc/destroy
Btrfs: less fs tree lock contention when using autodefrag
Btrfs: return EPERM when deleting a default subvolume
Btrfs: add missing kfree in btrfs_destroy_workqueue
Btrfs: cache extent states in defrag code path
Btrfs: fix deadlock with nested trans handles
Btrfs: fix possible empty list access when flushing the delalloc inodes
Btrfs: split the global ordered extents mutex
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
Btrfs: reclaim delalloc metadata more aggressively
Btrfs: remove unnecessary lock in may_commit_transaction()
Btrfs: remove the unnecessary flush when preparing the pages
Btrfs: just do dirty page flush for the inode with compression before direct IO
...

Linus Torvalds
2014-04-05 06:31:36 +0800

22 Mar, 2014

1 commit

00fdf13a2 Btrfs: fix a crash of clone with inline extents's split ... Browse Code »
13

xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
of inline extents in __btrfs_drop_extents().

For inline extents, we cannot duplicate another EXTENT_DATA item, because
it breaks the rule of inline extents, that is, 'start offset' needs to be 0.

We have set limitations for the source inode's compressed inline extents,
because it needs to decompress and recompress. Now the destination inode's
inline extents also need similar limitations.

With this, xfstests btrfs/035 doesn't run into panic.

Signed-off-by: Liu Bo
Signed-off-by: Chris Mason

Liu Bo
2014-03-22 08:35:18 +0800

21 Mar, 2014

3 commits

f094c9bd3 Btrfs: less fs tree lock contention when using autodefrag ... Browse Code »

When finding new extents during an autodefrag, don't do so many fs tree
lookups to find an extent with a size smaller then the target treshold.
Instead, after each fs tree forward search immediately unlock upper
levels and process the entire leaf while holding a read lock on the leaf,
since our leaf processing is very fast.
This reduces lock contention, allowing for higher concurrency when other
tasks want to write/update items related to other inodes in the fs tree,
as we're not holding read locks on upper tree levels while processing the
leaf and we do less tree searches.

Test:

sysbench --test=fileio --file-num=512 --file-total-size=16G \
--file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
--file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
--max-requests=10000000000 [prepare|run]

(fileystem mounted with -o autodefrag, averages of 5 runs)

Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb
After this change: 63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb

Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Chris Mason

Filipe Manana
2014-03-21 08:15:27 +0800
72de6b539 Btrfs: return EPERM when deleting a default subvolume ... Browse Code »

The error message is confusing:

# btrfs sub delete /mnt/mysub/
Delete subvolume '/mnt/mysub'
ERROR: cannot delete '/mnt/mysub' - Directory not empty

The error message does not make sense to me: It's not about deleting a
directory but it's a subvolume, and it doesn't matter if the subvolume is
empty or not.

Maybe EPERM or is more appropriate in this case, combined with an explanatory
kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
it is configured as default subvolume.")

Reported-by: Koen De Wit
Signed-off-by: Guangyu Sun
Reviewed-by: David Sterba
Signed-off-by: Chris Mason

Guangyu Sun
2014-03-21 08:15:27 +0800
308d9800b Btrfs: cache extent states in defrag code path ... Browse Code »

When locking file ranges in the inode's io_tree, cache the first
extent state that belongs to the target range, so that when unlocking
the range we don't need to search in the io_tree again, reducing cpu
time and making and therefore holding the io_tree's lock for a shorter
period.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Chris Mason

Filipe Manana
2014-03-21 08:15:27 +0800

11 Mar, 2014

6 commits

6c255e67c Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock ... Browse Code »

We needn't flush all delalloc inodes when we doesn't get s_umount lock,
or we would make the tasks wait for a long time.

Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik

Miao Xie
2014-03-11 03:17:27 +0800
8257b2dc3 Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume ... Browse Code »

If the snapshot creation happened after the nocow write but before the dirty
data flush, we would fail to flush the dirty data because of no space.

So we must keep track of when those nocow write operations start and when they
end, if there are nocow writers, the snapshot creators must wait. In order
to implement this function, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().

These two functions are only used for nocow file write operations.

Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik

Miao Xie
2014-03-11 03:17:22 +0800
e2127cf00 Btrfs: make defrag not fragment files when using prealloc extents ... Browse Code »

When using prealloc extents, a file defragment operation may actually
fragment the file and increase the amount of data space used by the file.
This change fixes that behaviour.

Example:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt
$ cd /mnt
$ xfs_io -f -c 'falloc 0 1048576' foobar && sync
$ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
$ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
$ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync

Before defragmenting the file:

$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

$ btrfs-debug-tree /dev/sdb3
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 0 nr 4096
item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 4096 nr 102400 ram 1048576
extent compression 0
item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 106496 nr 90112
item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 196608 nr 106496 ram 1048576
extent compression 0
item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 303104 nr 593920
item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 897024 nr 106496 ram 1048576
extent compression 0
item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 1003520 nr 45056
(...)

Now defragmenting the file results in more data space used than before:

$ btrfs filesystem defragment -f foobar && sync
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.55MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

And the corresponding file extent items are now no longer perfectly sequential
as before, and we're now needlessly using more space from data block groups:

$ btrfs-debug-tree /dev/sdb3
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 0 nr 4096 ram 1048576
extent compression 0
item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
extent data disk byte 13893632 nr 102400
extent data offset 0 nr 102400 ram 102400
extent compression 0
item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 106496 nr 90112 ram 1048576
extent compression 0
item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
extent data disk byte 13996032 nr 106496
extent data offset 0 nr 106496 ram 106496
extent compression 0
item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 303104 nr 593920
item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
extent data disk byte 14102528 nr 106496
extent data offset 0 nr 106496 ram 106496
extent compression 0
item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 1003520 nr 45056 ram 1048576
extent compression 0
(...)

With this change, the above example will no longer cause allocation of new data
space nor change the sequentiality of the file extents, that is, defragment will
be effectless, leaving all extent items pointing to the extent starting at disk
byte 12845056.

In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
400Mb each, initially consisting of a single prealloc extent of 400Mb, having
random writes happening at a low rate, lead to a total of over ~17Gb of data
space used, not far from eventually reaching an ENOSPC state.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik

Filipe Manana
2014-03-11 03:17:17 +0800
dec8ef905 Btrfs: correctly flush data on defrag when compression is enabled ... Browse Code »

When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
enabled, we weren't flushing completely, as writing compressed extents
is a 2 steps process, one to compress the data and another one to write
the compressed data to disk.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik

Filipe Manana
2014-03-11 03:17:16 +0800
abccd00f8 btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl ... Browse Code »

The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
and 64-bit systems. This means that it is impossible to use btrfs
receive on a system with a 64-bit kernel and 32-bit userspace, because
the structure size (and hence the ioctl number) is different.

This patch adds a compatibility structure and ioctl to deal with the
above case.

Signed-off-by: Hugo Mills
Signed-off-by: Josef Bacik

Hugo Mills
2014-03-11 03:15:40 +0800
23ad5b17d btrfs: Return EXDEV for cross file system snapshot ... Browse Code »

EXDEV seems an appropriate error if an operation fails bacause it
crosses file system boundaries.

Reviewed-by: David Sterba
Signed-off-by: Kusanagi Kouichi
Signed-off-by: Josef Bacik

Kusanagi Kouichi
2014-03-11 03:15:37 +0800

17 Feb, 2014

1 commit

3962dfbe2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"We have a small collection of fixes in my for-linus branch.

The big thing that stands out is a revert of a new ioctl. Users
haven't shipped yet in btrfs-progs, and Dave Sterba found a better way
to export the information"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use right clone root offset for compressed extents
btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105
Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol
Btrfs: fix max_inline mount option
Btrfs: fix a lockdep warning when cleaning up aborted transaction
Revert "btrfs: add ioctl to export size of global metadata reservation"

Linus Torvalds
2014-02-17 03:05:27 +0800

15 Feb, 2014

1 commit

11bcac89c Revert "btrfs: add ioctl to export size of global metadata reservation" ... Browse Code »

This reverts commit 01e219e8069516cdb98594d417b8bb8d906ed30d.

David Sterba found a different way to provide these features without adding a new
ioctl. We haven't released any progs with this ioctl yet, so I'm taking this out
for now until we finalize things.

Signed-off-by: Chris Mason
Signed-off-by: David Sterba
CC: Jeff Mahoney

Chris Mason
2014-02-15 05:42:13 +0800

10 Feb, 2014

1 commit

9c1db7798 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"This is a small collection of fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix data corruption when reading/updating compressed extents
Btrfs: don't loop forever if we can't run because of the tree mod log
btrfs: reserve no transaction units in btrfs_ioctl_set_features
btrfs: commit transaction after setting label and features
Btrfs: fix assert screwup for the pending move stuff

Linus Torvalds
2014-02-10 03:12:26 +0800

09 Feb, 2014

2 commits

8051aa1a3 btrfs: reserve no transaction units in btrfs_ioctl_set_features ... Browse Code »

Added in patch "btrfs: add ioctls to query/change feature bits online"
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.

Signed-off-by: David Sterba
Signed-off-by: Chris Mason

David Sterba
2014-02-09 09:57:15 +0800
d0270aca8 btrfs: commit transaction after setting label and features ... Browse Code »

The set_fslabel ioctl uses btrfs_end_transaction, which means it's
possible that the change will be lost if the system crashes, same for
the newly set features. Let's use btrfs_commit_transaction instead.

Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

Jeff Mahoney
2014-02-09 09:57:15 +0800

31 Jan, 2014

1 commit

e7651b819 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs updates from Chris Mason:
"This is a pretty big pull, and most of these changes have been
floating in btrfs-next for a long time. Filipe's properties work is a
cool building block for inheriting attributes like compression down on
a per inode basis.

Jeff Mahoney kicked in code to export filesystem info into sysfs.

Otherwise, lots of performance improvements, cleanups and bug fixes.

Looks like there are still a few other small pending incrementals, but
I wanted to get the bulk of this in first"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
Btrfs: fix spin_unlock in check_ref_cleanup
Btrfs: setup inode location during btrfs_init_inode_locked
Btrfs: don't use ram_bytes for uncompressed inline items
Btrfs: fix btrfs_search_slot_for_read backwards iteration
Btrfs: do not export ulist functions
Btrfs: rework ulist with list+rb_tree
Btrfs: fix memory leaks on walking backrefs failure
Btrfs: fix send file hole detection leading to data corruption
Btrfs: add a reschedule point in btrfs_find_all_roots()
Btrfs: make send's file extent item search more efficient
Btrfs: fix to catch all errors when resolving indirect ref
Btrfs: fix protection between walking backrefs and root deletion
btrfs: fix warning while merging two adjacent extents
Btrfs: fix infinite path build loops in incremental send
btrfs: undo sysfs when open_ctree() fails
Btrfs: fix snprintf usage by send's gen_unique_name
btrfs: fix defrag 32-bit integer overflow
btrfs: sysfs: list the NO_HOLES feature
btrfs: sysfs: don't show reserved incompat feature
btrfs: call permission checks earlier in ioctls and return EPERM
...

Linus Torvalds
2014-01-31 12:08:20 +0800

29 Jan, 2014

14 commits

c41570c9d btrfs: fix defrag 32-bit integer overflow ... Browse Code »
1

When defragging a very large file, the cluster variable can wrap its 32-bit
signed int type and become negative, which eventually gets passed to
btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms,
this eventually results in an Oops from the SLAB allocator.

Change the cluster and max_cluster signed int variables to unsigned long to
match the readahead functions. This also allows the min() comparison in
btrfs_defrag_file() to work as intended.

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Justin Maggard
2014-01-29 05:20:43 +0800
bd60ea0fe btrfs: call permission checks earlier in ioctls and return EPERM ... Browse Code »

The owner and capability checks in IOC_SUBVOL_SETFLAGS and
SET_RECEIVED_SUBVOL should be called before any other checks are done.

Also unify the error code to EPERM.

Signed-off-by: David Sterba
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

David Sterba
2014-01-29 05:20:41 +0800
d02420613 btrfs: restrict snapshotting to own subvolumes ... Browse Code »

Currently, any user can snapshot any subvolume if the path is accessible and
thus indirectly create and keep files he does not own under his direcotries.
This is not possible with traditional directories.

In security context, a user can snapshot root filesystem and pin any
potentially buggy binaries, even if the updates are applied.

All the snapshots are visible to the administrator, so it's possible to
verify if there are suspicious snapshots.

Another more practical problem is that any user can pin the space used
by eg. root and cause ENOSPC.

Original report:
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786

CC: stable@vger.kernel.org
Signed-off-by: David Sterba
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

David Sterba
2014-01-29 05:20:40 +0800
e4355f34e Btrfs: faster file extent item search in clone ioctl ... Browse Code »

When we are looking for file extent items that intersect the cloning
range, for each one that falls completely outside the range, don't
release the path and do another full tree search - just move on
to the next slot and copy the file extent item into our buffer only
if the item intersects the cloning range.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2014-01-29 05:20:35 +0800
c57c2b3ed Btrfs: unlock inodes in correct order in clone ioctl ... Browse Code »

In the clone ioctl, when the source and target inodes are different,
we can acquire their mutexes in 2 possible different orders. After
we're done cloning, we were releasing the mutexes always in the same
order - the most correct way of doing it is to release them by the
reverse order they were acquired.

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2014-01-29 05:20:30 +0800
de6e82006 Btrfs: release subvolume's block_rsv before transaction commit ... Browse Code »

We don't have to keep subvolume's block_rsv during transaction commit,
and within transaction commit, we may also need the free space reclaimed
from this block_rsv to process delayed refs.

Signed-off-by: Liu Bo
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Liu Bo
2014-01-29 05:20:29 +0800
63541927c Btrfs: add support for inode properties ... Browse Code »

This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."

Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).

This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.

The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.

Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.

Basically the tests correspond to:

Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.

Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
The unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.

Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.

Test 4 - same as test 3 but with 10 properties per file.

Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.

* Without properties (test 1)

file creation time ls -lha time
10 000 files 3.49 0.76
100 000 files 47.19 8.37
1 000 000 files 518.51 107.06

* With 1 property (compression property set to lzo - test 2)

file creation time ls -lha time
10 000 files 3.63 0.93
100 000 files 48.56 9.74
1 000 000 files 537.72 125.11

* With 4 properties (test 3)

file creation time ls -lha time
10 000 files 3.94 1.20
100 000 files 52.14 11.48
1 000 000 files 572.70 142.13

* With 10 properties (test 4)

file creation time ls -lha time
10 000 files 4.61 1.35
100 000 files 58.86 13.83
1 000 000 files 656.01 177.61

The increased latencies with properties are essencialy because of:

*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir intex items (delayed-inode.c), which could help
reduce the file creation latency;

*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.

Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.

Test script:

$ cat test.pl
#!/usr/bin/perl -w

use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');

system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";

# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";

# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";

system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);

$t1 = time();
for (my $i = 1; $i autoflush(1);
for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
print $f ('A' x 4096) or die "Error writing to file!";
}
close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";

$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";

Signed-off-by: Filipe David Borba Manana
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Filipe David Borba Manana
2014-01-29 05:20:24 +0800
eb8052e01 fs/btrfs: Integer overflow in btrfs_ioctl_resize() ... Browse Code »

The local variable 'new_size' comes from userspace. If a large number
was passed, there would be an integer overflow in the following line:
new_size = old_size + new_size;

Signed-off-by: Wenliang Fan
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Wenliang Fan
2014-01-29 05:20:11 +0800
efe120a06 Btrfs: convert printk to btrfs_ and fix BTRFS prefix ... Browse Code »

Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Frank Holton
2014-01-29 05:20:05 +0800
2c6865378 btrfs: Check read-only status of roots during send ... Browse Code »

All the subvolues that are involved in send must be read-only during the
whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
status to read-write and the result of send stream is undefined if the
data change unexpectedly.

Fix that by adding a refcount for all involved roots and verify that
there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
read-only -> read-write transition.

We need refcounts because there are no restrictions on number of send
parallel operations currently run on a single subvolume, be it source,
parent or one of the multiple clone sources.

Kernel is silent when the RO checks fail and returns EPERM. The same set
of checks is done already in userspace before send starts.

Signed-off-by: David Sterba
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

David Sterba
2014-01-29 05:20:01 +0800
5662344b3 Btrfs: fix error check of btrfs_lookup_dentry() ... Browse Code »

Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT)
instead. This keeps the return value convention consistent.

Callers who use btrfs_lookup_dentry() require a trivial update.

create_snapshot() in particular looks like it can also lose a BUG_ON(!inode)
which is not really needed - there seems less harm in returning ENOENT to
userspace at that point in the stack than there is to crash the machine.

Signed-off-by: Tsutomu Itoh
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Tsutomu Itoh
2014-01-29 05:19:56 +0800
01e219e80 btrfs: add ioctl to export size of global metadata reservation ... Browse Code »

btrfs filesystem df output will show the size of the metadata space
and how much of it is used, and the user assumes that the difference
is all usable space. Since that's not actually the case due to the
global metadata reservation, we should provide the full picture to the
user.

This patch adds an ioctl that exports the size of the global metadata
reservation so that btrfs filesystem df can report it.

Signed-off-by: Jeff Mahoney
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Jeff Mahoney
2014-01-29 05:19:28 +0800
3b02a68a6 btrfs: use feature attribute names to print better error messages ... Browse Code »

Now that we have the feature name strings available in the kernel via
the sysfs attributes, we can use them for printing better failure
messages from the ioctl path.

Signed-off-by: Jeff Mahoney
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Jeff Mahoney
2014-01-29 05:19:28 +0800
2eaa055fa btrfs: add ioctls to query/change feature bits online ... Browse Code »

There are some feature bits that require no offline setup and can
be enabled online. I've only reviewed extended irefs, but there will
probably be more.

We introduce three new ioctls:
- BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
- BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
basis, as well as querying for which features are changeable with mounted.
- BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.

We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
allow us to define which features are safe to change at runtime.

The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
- Enabling a completely unsupported feature: warns and returns -ENOTSUPP
- Enabling a feature that can only be done offline: warns and returns -EPERM

Signed-off-by: Jeff Mahoney
Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Jeff Mahoney
2014-01-29 05:19:23 +0800

25 Jan, 2014

1 commit

1c1c8747c btrfs: sanitize BTRFS_IOC_FILE_EXTENT_SAME ... Browse Code »

* don't assume that ->dest_count won't change between copy_from_user()
and memdup_user()
* use fdget instead of fget
* don't bother comparing superblocks when we'd already compared vfsmounts
* get rid of excessive goto
* use file_inode() instead of open-coding the sucker

Signed-off-by: Al Viro

Al Viro
2014-01-25 16:13:03 +0800

12 Dec, 2013

1 commit

e43f998e4 btrfs: call mnt_drop_write after interrupted subvol deletion ... Browse Code »
2

If btrfs_ioctl_snap_destroy blocks on the mutex and the process is
killed, mnt_write count is unbalanced and leads to unmountable
filesystem.

CC: stable@vger.kernel.org
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

David Sterba
2013-12-12 23:11:38 +0800

15 Nov, 2013

2 commits

54563d41a btrfs: get rid of fdentry() ... Browse Code »

3 of 4 callers actually want file_inode()...

Signed-off-by: Al Viro
Signed-off-by: Chris Mason

Al Viro
2013-11-15 22:18:14 +0800
46e0f66a0 btrfs: fix empty_zero_page misusage ... Browse Code »

Heiko Carstens noticed that btrfs was using empty_zero_page
incorrectly. He explained:

The definition of empty_zero_page is architecture specific. It
is (currently) either a character array, an unsigned long
containing the address of the empty_zero_page, or even worse
only the address of the struct page belonging to the
empty_zero_page.

This commit changes btrfs to use a for-loop instead. On x86
the resulting .ko is smaller, and we're no longer worrying about
how each arch builds its zeros.

Reported-by: Heiko Carstens
Signed-off-by: Chris Mason

Chris Mason
2013-11-15 22:17:47 +0800