Eric Lee / smarc-fsl-linux-kernel

25 Jun, 2016

3 commits

da2f6aba4 Merge branch 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/… ... Browse Code »

…git/mason/linux-btrfs

Pull btrfs fixes part 2 from Chris Mason:
"This has one patch from Omar to bring iterate_shared back to btrfs.

We have a tree of work we queue up for directory items and it doesn't
lend itself well to shared access. While we're cleaning it up, Omar
has changed things to use an exclusive lock when there are delayed
items"

* 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

Linus Torvalds
2016-06-25 23:53:38 +0800
b971712af Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"I have a two part pull this time because one of the patches Dave
Sterba collected needed to be against v4.7-rc2 or higher (we used
rc4). I try to make my for-linus-xx branch testable on top of the
last major so we can hand fixes to people on the list more easily, so
I've split this pull in two.

This first part has some fixes and two performance improvements that
we've been testing for some time.

Josef's two performance fixes are most notable. The transid tracking
patch makes a big improvement on pretty much every workload"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: Force stripesize to the value of sectorsize
btrfs: fix disk_i_size update bug when fallocate() fails
Btrfs: fix error handling in map_private_extent_buffer
Btrfs: fix error return code in btrfs_init_test_fs()
Btrfs: don't do nocow check unless we have to
btrfs: fix deadlock in delayed_ref_async_start
Btrfs: track transid for delayed ref flushing

Linus Torvalds
2016-06-25 23:42:31 +0800
02dbfc99b Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes ... Browse Code »

Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.

This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

while make tinyconfig && make -j4 && git clean dqfx; do
:
done

along with a bunch of parallel finds in another shell:

while true; do
for ((i=0; i/dev/null &
done
wait
done

Signed-off-by: Omar Sandoval
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

Omar Sandoval
2016-06-25 21:20:10 +0800

24 Jun, 2016

4 commits

b7f67055d Btrfs: Force stripesize to the value of sectorsize ... Browse Code »

Btrfs code currently assumes stripesize to be same as
sectorsize. However Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.

This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.

Signed-off-by: Chandan Rajendra
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

Chandan Rajendra
2016-06-24 01:44:42 +0800
c0d2f6104 btrfs: fix disk_i_size update bug when fallocate() fails ... Browse Code »

When doing truncate operation, btrfs_setsize() will first call
truncate_setsize() to set new inode->i_size, but if later
btrfs_truncate() fails, btrfs_setsize() will call
"i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
inmemory inode size, now bug occurs. It's because for truncate
case btrfs_ordered_update_i_size() directly uses inode->i_size
to update BTRFS_I(inode)->disk_i_size, indeed we should use the
"offset" argument to update disk_i_size. Here is the call graph:
==>btrfs_truncate()
====>btrfs_truncate_inode_items()
======>btrfs_ordered_update_i_size(inode, last_size, NULL);
Here btrfs_ordered_update_i_size()'s offset argument is last_size.

And below test case can reveal this bug:

dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint

echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i /dev/null
done
sync

truncate --size 0 testfile
ls -l testfile
du -sh testfile
exit

In this case, truncate operation will fail for enospc reason and
"du -sh testfile" returns value greater than 0, but testfile's
size is 0, we need to reflect correct inode->i_size.

Signed-off-by: Wang Xiaoguang
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

Wang Xiaoguang
2016-06-24 01:44:41 +0800
415b35a55 Btrfs: fix error handling in map_private_extent_buffer ... Browse Code »

map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
than pagesize,
2. when it detects something insane.

The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.

Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.

Reported-by: Vegard Nossum
Signed-off-by: Liu Bo
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

Liu Bo
2016-06-24 01:44:40 +0800
04e1b65af Btrfs: fix error return code in btrfs_init_test_fs() ... Browse Code »

Fix to return a negative error code from the kern_mount() error handling
case instead of 0(ret is set to 0 by register_filesystem), as done
elsewhere in this function.

Signed-off-by: Wei Yongjun
Reviewed-by: Omar Sandoval
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

Wei Yongjun
2016-06-24 01:44:39 +0800

23 Jun, 2016

3 commits

c6887cd11 Btrfs: don't do nocow check unless we have to ... Browse Code »

Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow. So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence

fallocate -l10737418240 /mnt/btrfs-test/file
cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
--end_fsync=1

we get the worst case scenario where we have to fall back on to doing the check
anyway.

Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

We get twice the throughput, half of the runtime, and half of the average
latency. Thanks,

Signed-off-by: Josef Bacik
[ PAGE_CACHE_ removal related fixups ]
Signed-off-by: David Sterba
Signed-off-by: Chris Mason

Josef Bacik
2016-06-23 08:57:14 +0800
0f873eca8 btrfs: fix deadlock in delayed_ref_async_start ... Browse Code »

"Btrfs: track transid for delayed ref flushing" was deadlocking on
btrfs_attach_transaction because its not safe to call from the async
delayed ref start code. This commit brings back btrfs_join_transaction
instead and checks for a blocked commit.

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Chris Mason
2016-06-23 08:54:18 +0800
31b9655f4 Btrfs: track transid for delayed ref flushing ... Browse Code »

Using the offwakecputime bpf script I noticed most of our time was spent waiting
on the delayed ref throttling. This is what is supposed to happen, but
sometimes the transaction can commit and then we're waiting for throttling that
doesn't matter anymore. So change this stuff to be a little smarter by tracking
the transid we were in when we initiated the throttling. If the transaction we
get is different then we can just bail out. This resulted in a 50% speedup in
my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
over the entire run (which is about 30 minutes). Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2016-06-23 08:54:18 +0800

18 Jun, 2016

9 commits

4c6459f94 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"The most user visible change here is a fix for our recent superblock
validation checks that were causing problems on non-4k pagesized
systems"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize
btrfs: remove build fixup for qgroup_account_snapshot
btrfs: use new error message helper in qgroup_account_snapshot
btrfs: avoid blocking open_ctree from cleaner_kthread
Btrfs: don't BUG_ON() in btrfs_orphan_add
btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
Btrfs: check if extent buffer is aligned to sectorsize
btrfs: Use correct format specifier

Linus Torvalds
2016-06-18 23:57:59 +0800
dd5c93111 Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize ... Browse Code »

Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence
restricting stripesize to be equal to sectorsize would cause super block
validation to return an error on architectures where PAGE_SIZE is not
equal to 4096.

Hence as a workaround, this commit allows stripesize to be set to 4096
bytes.

Signed-off-by: Chandan Rajendra
Signed-off-by: David Sterba

Chandan Rajendra
2016-06-18 00:32:49 +0800
89c5a5441 btrfs: remove build fixup for qgroup_account_snapshot ... Browse Code »

Introduced in 2c1984f244838477aab ("btrfs: build fixup for
qgroup_account_snapshot") as temporary bisectability build fixup.

Signed-off-by: David Sterba

David Sterba
2016-06-18 00:32:40 +0800
f7af3934c btrfs: use new error message helper in qgroup_account_snapshot ... Browse Code »

We've renamed btrfs_std_error, this one is left from last merge.

Signed-off-by: David Sterba

David Sterba
2016-06-18 00:32:40 +0800
90c711ab3 btrfs: avoid blocking open_ctree from cleaner_kthread ... Browse Code »

This fixes a problem introduced in commit 2f3165ecf103599f82bf0ea254039db335fb5005
"btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".

open_ctree eventually calls btrfs_replay_log which in turn calls
btrfs_commit_super which tries to lock the cleaner_mutex, causing a
recursive mutex deadlock during mount.

Instead of playing whack-a-mole trying to keep up with all the
functions that may want to lock cleaner_mutex, put all the cleaner_mutex
lockers back where they were, and attack the problem more directly:
keep cleaner_kthread asleep until the filesystem is mounted.

When filesystems are mounted read-only and later remounted read-write,
open_ctree did not set fs_info->open and neither does anything else.
Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
nor cleaner_kthread get confused by the common case of "/" filesystem
read-only mount followed by read-write remount.

Signed-off-by: Zygo Blaxell
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Zygo Blaxell
2016-06-18 00:32:40 +0800
3b6571c18 Btrfs: don't BUG_ON() in btrfs_orphan_add ... Browse Code »

This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled. Thanks,

Signed-off-by: Josef Bacik
Reviewed-by: Mark Fasheh
Signed-off-by: David Sterba

Josef Bacik
2016-06-18 00:32:40 +0800
64c12921e btrfs: account for non-CoW'd blocks in btrfs_abort_transaction ... Browse Code »

The test for !trans->blocks_used in btrfs_abort_transaction is
insufficient to determine whether it's safe to drop the transaction
handle on the floor. btrfs_cow_block, informed by should_cow_block,
can return blocks that have already been CoW'd in the current
transaction. trans->blocks_used is only incremented for new block
allocations. If an operation overlaps the blocks in the current
transaction entirely and must abort the transaction, we'll happily
let it clean up the trans handle even though it may have modified
the blocks and will commit an incomplete operation.

In the long-term, I'd like to do closer tracking of when the fs
is actually modified so we can still recover as gracefully as possible,
but that approach will need some discussion. In the short term,
since this is the only code using trans->blocks_used, let's just
switch it to a bool indicating whether any blocks were used and set
it when should_cow_block returns false.

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Jeff Mahoney
Reviewed-by: Filipe Manana
Signed-off-by: David Sterba

Jeff Mahoney
2016-06-18 00:32:40 +0800
c871b0f2f Btrfs: check if extent buffer is aligned to sectorsize ... Browse Code »

Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer(). An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.

This adds a warning to let us quickly know what was happening.

Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.

Signed-off-by: Liu Bo
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Liu Bo
2016-06-18 00:32:40 +0800
16ff4b454 btrfs: Use correct format specifier ... Browse Code »

Component mirror_num of struct btrfsic_block is defined
as unsigned int. Use %u as format specifier.

Signed-off-by: Heinrich Schuchardt
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Heinrich Schuchardt
2016-06-18 00:32:40 +0800

11 Jun, 2016

1 commit

3d0f0b6a5 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"Has some fixes and some new self tests for btrfs. The self tests are
usually disabled in the .config file (unless you're doing btrfs dev
work), and this bunch is meant to find problems with the 64K page size
patches.

Jeff has a patch to help people see if they are using the hardware
assist crc32c module, which really helps us nail down problems when
people ask why crcs are using so much CPU.

Otherwise, it's small fixes"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system
Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
Btrfs: self-tests: Use macros instead of constants and add missing newline
Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE
btrfs: advertise which crc32c implementation is being used at module load
Btrfs: add validadtion checks for chunk loading
Btrfs: add more validation checks for superblock
Btrfs: clear uptodate flags of pages in sys_array eb
Btrfs: self-tests: Support non-4k page size
Btrfs: Fix integer overflow when calculating bytes_per_bitmap
Btrfs: test_check_exists: Fix infinite loop when searching for free space entries
Btrfs: end transaction if we abort when creating uuid root
btrfs: Use __u64 in exported linux/btrfs.h.

Linus Torvalds
2016-06-11 05:13:27 +0800

09 Jun, 2016

2 commits

719da39a6 Merge branch 'misc-fixes-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/k… ... Browse Code »

…dave/linux into for-linus-4.7

Chris Mason
2016-06-09 05:36:12 +0800
4c5299008 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/… ... Browse Code »

…linux into for-linus-4.7

Chris Mason
2016-06-09 05:35:11 +0800

06 Jun, 2016

9 commits

34b3e6c92 Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system ... Browse Code »

In __test_eb_bitmaps(), we write random data to a bitmap. Then copy
the bitmap to another bitmap that resides inside an extent buffer.
Later we verify the values of corresponding bits in the bitmap and the
bitmap inside the extent buffer. However, extent_buffer_test_bit()
reads in byte granularity while test_bit() reads in unsigned long
granularity. Hence we end up comparing wrong bits on big-endian
systems such as ppc64. This commit fixes the issue by reading the
bitmap in byte granularity.

Reviewed-by: Josef Bacik
Reviewed-by: Chandan Rajendra
Signed-off-by: Feifei Xu
Signed-off-by: David Sterba

Feifei Xu
2016-06-06 23:17:12 +0800
36b3dc05b Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize ... Browse Code »

With 64K sectorsize, 1G sized block group cannot span across bitmaps.
To execute test_bitmaps() function, this commit allocates
"BITS_PER_BITMAP * sectorsize + PAGE_SIZE" sized block group.

Reviewed-by: Josef Bacik
Reviewed-by: Chandan Rajendra
Signed-off-by: Feifei Xu
Signed-off-by: David Sterba

Feifei Xu
2016-06-06 23:17:12 +0800
ef9f2db36 Btrfs: self-tests: Use macros instead of constants and add missing newline ... Browse Code »

This commit replaces numerical constants with appropriate
preprocessor macros.

Reviewed-by: Josef Bacik
Signed-off-by: Chandan Rajendra
Signed-off-by: Feifei Xu
Signed-off-by: David Sterba

Feifei Xu
2016-06-06 23:17:12 +0800
d94f43b4c Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes ... Browse Code »

To test all possible sectorsizes, this commit adds a sectorsize
array. This commit executes the tests for all possible sectorsizes and
nodesizes.

Reviewed-by: Josef Bacik
Signed-off-by: Chandan Rajendra
Signed-off-by: Feifei Xu
Signed-off-by: David Sterba

Feifei Xu
2016-06-06 23:17:12 +0800
ed9e4afdb Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE ... Browse Code »

On ppc64, PAGE_SIZE is 64k which is same as BTRFS_MAX_METADATA_BLOCKSIZE.
In such a scenario, we will never be able to have an extent buffer
containing more than one page. Hence in such cases this commit does not
execute the page straddling tests.

Reviewed-by: Josef Bacik
Signed-off-by: Feifei Xu
Signed-off-by: Chandan Rajendra
Signed-off-by: David Sterba

Feifei Xu
2016-06-06 23:17:11 +0800
5f9e1059d btrfs: advertise which crc32c implementation is being used at module load ... Browse Code »

Since several architectures support hardware-accelerated crc32c
calculation, it would be nice to confirm that btrfs is actually using it.

We can see an elevated use count for the module, but it doesn't actually
show who the users are. This patch simply prints the name of the driver
after successfully initializing the shash.

Signed-off-by: Jeff Mahoney
[ added a helper and used in module load-time message ]
Signed-off-by: David Sterba

Jeff Mahoney
2016-06-06 20:08:28 +0800
e06cd3dd7 Btrfs: add validadtion checks for chunk loading ... Browse Code »

To prevent fuzzed filesystem images from panic the whole system,
we need various validation checks to refuse to mount such an image
if btrfs finds any invalid value during loading chunks, including
both sys_array and regular chunks.

Note that these checks may not be sufficient to cover all corner cases,
feel free to add more checks.

Reported-by: Vegard Nossum
Reported-by: Quentin Casasnovas
Reviewed-by: David Sterba
Signed-off-by: Liu Bo
Signed-off-by: David Sterba

Liu Bo
2016-06-06 16:57:09 +0800
99e3ecfcb Btrfs: add more validation checks for superblock ... Browse Code »

This adds validation checks for super_total_bytes, super_bytes_used and
super_stripesize, super_num_devices.

Reported-by: Vegard Nossum
Reported-by: Quentin Casasnovas
Reviewed-by: David Sterba
Signed-off-by: Liu Bo
Signed-off-by: David Sterba

Liu Bo
2016-06-06 16:41:53 +0800
d865177a5 Btrfs: clear uptodate flags of pages in sys_array eb ... Browse Code »

We set uptodate flag to pages in the temporary sys_array eb,
but do not clear the flag after free eb. As the special
btree inode may still hold a reference on those pages, the
uptodate flag can remain alive in them.

If btrfs_super_chunk_root has been intentionally changed to the
offset of this sys_array eb, reading chunk_root will read content
of sys_array and it will skip our beautiful checks in
btree_readpage_end_io_hook() because of
"pages of eb are uptodate => eb is uptodate"

This adds the 'clear uptodate' part to force it to read from disk.

Reviewed-by: Josef Bacik
Signed-off-by: Liu Bo
Signed-off-by: David Sterba

Liu Bo
2016-06-06 16:14:40 +0800

05 Jun, 2016

1 commit

b2d5ad822 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"The important part of this pull is Filipe's set of fixes for btrfs
device replacement. Filipe fixed a few issues seen on the list and a
number he found on his own"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
Btrfs: fix race between device replace and read repair
Btrfs: fix race between device replace and discard
Btrfs: fix race between device replace and chunk allocation
Btrfs: fix race setting block group back to RW mode during device replace
Btrfs: fix unprotected assignment of the left cursor for device replace
Btrfs: fix race setting block group readonly during device replace
Btrfs: fix race between device replace and block group removal
Btrfs: fix race between readahead and device replace/removal

Linus Torvalds
2016-06-05 02:56:28 +0800

04 Jun, 2016

1 commit

8dff9c853 Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent ... Browse Code »

When dealing with inline extents, btrfs_get_extent will incorrectly try
to insert a duplicate extent_map. The dup hits -EEXIST from
add_extent_map, but then we try to merge with the existing one and end
up trying to insert a zero length extent_map.

This actually works most of the time, except when there are extent maps
past the end of the inline extent. rocksdb will trigger this sometimes
because it preallocates an extent and then truncates down.

Josef made a script to trigger with xfs_io:

#!/bin/bash

xfs_io -f -c "pwrite 0 1000" inline
xfs_io -c "falloc -k 4k 1M" inline
xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
xfs_io -c "fadvise -d 0 1000" inline
cat inline

You'll get EIOs trying to read inline after this because add_extent_map
is returning EEXIST

Signed-off-by: Chris Mason

Chris Mason
2016-06-04 03:32:34 +0800

03 Jun, 2016

3 commits

b9ef22ded Btrfs: self-tests: Support non-4k page size ... Browse Code »

self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).

Reviewed-by: Josef Bacik
Signed-off-by: Feifei Xu
Signed-off-by: Chandan Rajendra
Signed-off-by: David Sterba

Feifei Xu
2016-06-03 01:23:14 +0800
0ef6447a3 Btrfs: Fix integer overflow when calculating bytes_per_bitmap ... Browse Code »

On ppc64, bytes_per_bitmap will be (65536*8*65536). Hence append UL to
fix integer overflow.

Reviewed-by: Josef Bacik
Reviewed-by: Chandan Rajendra
Signed-off-by: Feifei Xu
Signed-off-by: David Sterba

Feifei Xu
2016-06-03 01:22:49 +0800
5473e0c42 Btrfs: test_check_exists: Fix infinite loop when searching for free space entries ... Browse Code »

On a ppc64 machine using 64K as the block size, assume that the RB
tree at btrfs_free_space_ctl->free_space_offset contains following
two entries:

1. A bitmap entry having an offset value of 0 and having the bits
corresponding to the address range [128M+512K, 128M+768K] set.
2. An extent entry corresponding to the address range
[128M-256K, 128M-128K]

In such a scenario, test_check_exists() invoked for checking the
existence of address range [128M+768K, 256M] can lead to an
infinite loop as explained below:

- Checking for the extent entry fails.
- Checking for a bitmap entry results in the free space info in
range [128M+512K, 128M+768K] beng returned.
- rb_prev(info) returns NULL because the bitmap entry starting from
offset 0 comes first in the RB tree.
- current_node = bitmap node.
- while (current_node)
tmp = rb_next(bitmap_node);/*tmp is extent based free space entry*/
Since extent based free space entry's last address is smaller
than the address being searched for (i.e. 128M+768K) we
incorrectly again obtain the extent node as the "next right node"
of the RB tree and thus end up looping infinitely.

This patch fixes the issue by checking the "tmp" variable which point
to the most recently searched free space node.

Reviewed-by: Josef Bacik
Reviewed-by: Chandan Rajendra
Signed-off-by: Feifei Xu
Signed-off-by: David Sterba

Feifei Xu
2016-06-03 01:22:34 +0800

01 Jun, 2016

1 commit

65d4f4c15 Btrfs: end transaction if we abort when creating uuid root ... Browse Code »

We still need to call btrfs_end_transaction if we call btrfs_abort_transaction,
otherwise we hang and make me super grumpy. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2016-06-01 06:32:42 +0800

31 May, 2016

2 commits

b5de8d0df Btrfs: fix race between device replace and read repair ... Browse Code »

While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".

So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.

Signed-off-by: Filipe Manana
Reviewed-by: Josef Bacik

Filipe Manana
2016-05-31 08:00:03 +0800
2999241da Btrfs: fix race between device replace and discard ... Browse Code »

While we are finishing a device replace operation, we can make a discard
operation (fs mounted with -o discard) do an invalid memory access like
the one reported by the following trace:

[ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
[ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
[ 3206.388595] RIP: 0010:[] [] blkdev_issue_discard+0x5c/0x2a7
[ 3206.388595] RSP: 0018:ffff880171b9bb80 EFLAGS: 00010246
[ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
[ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
[ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
[ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
[ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
[ 3206.388595] FS: 00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
[ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
[ 3206.388595] Stack:
[ 3206.388595] 0000000000004878 0000000000000000 0000000002400040 0000000000000000
[ 3206.388595] 0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
[ 3206.388595] ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
[ 3206.388595] Call Trace:
[ 3206.388595] [] btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595] [] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595] [] btrfs_discard_extent+0x87/0xde [btrfs]
[ 3206.388595] [] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
[ 3206.388595] [] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595] [] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
[ 3206.388595] [] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595] [] btrfs_sync_file+0x38f/0x428 [btrfs]
[ 3206.388595] [] vfs_fsync_range+0x8c/0x9e
[ 3206.388595] [] vfs_fsync+0x1c/0x1e
[ 3206.388595] [] do_fsync+0x31/0x4a
[ 3206.388595] [] SyS_fsync+0x10/0x14
[ 3206.388595] [] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 3206.388595] [] ? time_hardirqs_off+0x9/0x14
[ 3206.388595] [] ? trace_hardirqs_off_caller+0x1f/0xaa

This happens because when we call btrfs_map_block() from
btrfs_discard_extent() to get a btrfs_bio structure, the device replace
operation has not finished yet, but before we use the device of one of the
stripes from the returned btrfs_bio structure, the device object is freed.

This is illustrated by the following diagram.

CPU 1 CPU 2

btrfs_dev_replace_start()

(...)

btrfs_dev_replace_finishing()

btrfs_start_transaction()
btrfs_commit_transaction()

(...)

btrfs_sync_file()
btrfs_start_transaction()

(...)

btrfs_commit_transaction()
btrfs_finish_extent_commit()
btrfs_discard_extent()
btrfs_map_block()
--> returns a struct btrfs_bio
with a stripe that has a
device field pointing to
source device of the replace
operation (the device that
is being replaced)

mutex_lock(&uuid_mutex)
mutex_lock(&fs_info->fs_devices->device_list_mutex)
mutex_lock(&fs_info->chunk_mutex)

btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates the mapping tree and for each
extent map that has a stripe pointing to
the source device, it updates the stripe
to point to the target device instead

btrfs_rm_dev_replace_blocked()
--> waits for fs_info->bio_counter to go down to 0

btrfs_rm_dev_replace_remove_srcdev()
--> removes source device from the list of devices

mutex_unlock(&fs_info->chunk_mutex)
mutex_unlock(&fs_info->fs_devices->device_list_mutex)
mutex_unlock(&uuid_mutex)

btrfs_rm_dev_replace_free_srcdev()
--> frees the source device

--> iterates over all stripes
of the returned struct
btrfs_bio
--> for each stripe it
dereferences its device
pointer
--> it ends up finding a
pointer to the device
used as the source
device for the replace
operation and that was
already freed

So fix this by surrounding the call to btrfs_map_block(), and the code
that uses the returned struct btrfs_bio, with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
the finishing phase of the device replace operation blocks until the
the bio counter decreases to zero before it frees the source device.
This is the same approach we do at btrfs_map_bio() for example.

Signed-off-by: Filipe Manana
Reviewed-by: Josef Bacik

Filipe Manana
2016-05-31 07:59:44 +0800

30 May, 2016

1 commit

22ab04e81 Btrfs: fix race between device replace and chunk allocation ... Browse Code »

While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group will get into both the source and
target devices. This left cursor is also used for resuming the device
replace operation at mount time.

However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start dellaloc, wait for all ordered extents
and then commit the current transaction, we might have got new block
groups allocated that are now using a device extent that has an offset
greater then or equals to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.

For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:

[ extent bg A ] [ hole, unallocated space ] [extent bg B ]
3Gb 4Gb 5Gb

While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:

CPU 1 CPU 2

scrub_enumerate_chunks()
--> searches the device tree for
extents belonging to the source
device using the device tree's
commit root
--> 1st iteration finds extent belonging to
block group A

--> sets block group A to RO mode
(btrfs_inc_block_group_ro)

--> sets cursor left to found_key.offset
which is 3Gb

--> scrub_chunk() starts
copies all allocated extents from
block group's A stripe at source
device into target device

btrfs_alloc_chunk()
--> allocates device extent
in the range [4Gb, 5Gb[
from the source device for
a new block group C

extent allocated from block
group C for a direct IO,
buffered write or btree node/leaf

extent is written to, perhaps
in response to a writepages()
call from the VM or directly
through direct IO

the write is made only against
the source device and not against
the target device because the
extent's offset is in the interval
[4Gb, 5Gb[ which is larger then
the value of cursor_left (3Gb)

--> scrub_chunks() finishes

--> updates left cursor from 3Gb to
4Gb

--> btrfs_dec_block_group_ro() sets
block group A back to RW mode

--> 2nd iteration finds extent belonging to
block group B - it did not find the new
extent in the range [4Gb, 5Gb[ for block
group C because we are using the device
tree's commit root or even because the
block group's items are not all yet
inserted in the respective btrees, that is,
the block group is still attached to some
transaction handle's new_bgs list and
btrfs_create_pending_block_groups() was
not called yet against that transaction
handle, so the device extent items were
not yet inserted into the devices tree

--> so we end not copying anything from the newly
allocated device extent from the source device
to the target device

So fix this by making __btrfs_map_block() always redirect writes to the
target device as well, independently of the left cursor's value. With
this change the left cursor is now used only for the purpose of tracking
progress and allow a mount operation to resume a device replace.

Signed-off-by: Filipe Manana
Reviewed-by: Josef Bacik

Filipe Manana
2016-05-30 19:58:26 +0800