30 Apr, 2019

2 commits

  • Commit 41bd6067692382 ("Btrfs: fix fsync of files with multiple hard
    links in new directories") introduced a path that makes fsync fall back
    to a full transaction commit in order to avoid losing hard links and new
    ancestors of the fsynced inode. That path is triggered only when the
    inode has more than one hard link and either has a new hard link created
    in the current transaction or the inode was evicted and reloaded in the
    current transaction.

    That path ends up getting triggered very often (hundreds of times) during
    the course of pgbench benchmarks, resulting in performance drops of about
    20%.

    This change restores the performance by not triggering the full
    transaction commit in those cases, and instead iterating the
    fs/subvolume tree in search of all possible new ancestors, for all hard
    links, to log them.
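
    A rough sketch of the new strategy (the helper and field names here are
    illustrative, not necessarily the exact ones from the patch):

    /* Sketch: instead of returning "commit the whole transaction", walk
     * the fs/subvolume tree and log every new ancestor of every hard
     * link of the inode being fsynced. */
    if (S_ISREG(inode->vfs_inode.i_mode) && inode->vfs_inode.i_nlink > 1) {
            ret = log_all_new_ancestors(trans, inode, ctx);
            if (ret)
                    goto end_trans;
    }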

    Reported-by: Zhao Yuhu
    Tested-by: James Wang
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • Deduplicate the btrfs file type conversion implementation - file systems
    that use the same file types as defined by POSIX do not need to define
    their own versions and can use the common helper functions declared in
    fs_types.h and implemented in fs_types.c.

    The common implementation can be found in commit bbe7449e2599 ("fs:
    common implementation of file type").
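
    For example, a filesystem can map an inode mode to a file type with the
    common helper (a sketch assuming the helpers declared in fs_types.h,
    such as fs_umode_to_ftype()):

    #include <linux/fs_types.h>

    /* Sketch: replace a filesystem-private conversion table with the
     * common helper. */
    static u8 example_inode_type(struct inode *inode)
    {
            return fs_umode_to_ftype(inode->i_mode);        /* FT_* value */
    }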

    Reviewed-by: Jan Kara
    Signed-off-by: Amir Goldstein
    Signed-off-by: Phillip Potter
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Phillip Potter
     

17 Dec, 2018

4 commits

  • The log tree has a long-standing problem: when a file is fsync'ed we
    only check for new ancestors, created in the current transaction, by
    following the hard link for which the fsync was issued. We follow the
    ancestors using the VFS' dget_parent() API. This means that if we create
    a new link for a file in a new directory (or in any other new ancestor
    directory) and then fsync the file using an old hard link, we end up not
    logging the new ancestor, and on log replay that new hard link and
    ancestor do not exist. In some cases, involving renames, the file will
    not exist at all.

    Example:

    mkfs.btrfs -f /dev/sdb
    mount /dev/sdb /mnt

    mkdir /mnt/A
    touch /mnt/foo
    ln /mnt/foo /mnt/A/bar
    xfs_io -c fsync /mnt/foo

    In this example, after log replay only the hard link named 'foo' exists
    and directory A does not exist, which is unexpected. In other major
    Linux filesystems, such as ext4, xfs and f2fs, both hard links exist and
    so does directory A after mounting the filesystem again.

    Checking whether any ancestors are new and need to be logged was added
    in 2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename
    fixes"), however only for the ancestors of the hard link (dentry) for
    which the fsync was issued, instead of checking all ancestors of all of
    the inode's hard links.

    So fix this by tracking the id of the last transaction in which a hard
    link was created for an inode, and on fsync falling back to a full
    transaction commit when the inode has more than one hard link and at
    least one new hard link was created in the current transaction. This is
    the simplest solution since this is not a common use case (frequently
    adding hard links for which there's an ancestor created in the current
    transaction and then fsyncing the file). In case it ever becomes a
    common use case, a solution that consists of iterating the fs/subvol
    btree for each hard link and checking if any ancestor is new could be
    implemented.
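
    A minimal sketch of that fallback check (the last_link_trans field is
    from the description above; the surrounding code is illustrative):

    /* Force a full transaction commit on fsync when the inode has
     * multiple hard links and at least one of them was created in the
     * current transaction. */
    if (inode->vfs_inode.i_nlink > 1 &&
        inode->last_link_trans > last_committed) {
            ret = 1;        /* caller falls back to a transaction commit */
            goto end_no_trans;
    }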

    This solves many unexpected scenarios reported by Jayashree Mohan and
    Vijay Chidambaram, and for which there is a new test case for fstests
    under review.

    Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
    CC: stable@vger.kernel.org # 4.4+
    Reported-by: Vijay Chidambaram
    Reported-by: Jayashree Mohan
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • The first value auto-assigned to an enum is 0, so we can rely on that
    and not explicitly initialize members where the auto-increment does the
    same. This is used only for values that are not part of the on-disk
    format.
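
    A sketch of the pattern (the member names are illustrative):

    /* Before: explicit initializers that just mirror the auto-increment. */
    enum { EXAMPLE_FLAG_A = 0, EXAMPLE_FLAG_B = 1, EXAMPLE_FLAG_C = 2 };

    /* After: the first enumerator is 0 by definition, so let the compiler
     * auto-increment. Safe only for values never written to disk. */
    enum { EXAMPLE_FLAG_A, EXAMPLE_FLAG_B, EXAMPLE_FLAG_C };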

    Reviewed-by: Omar Sandoval
    Reviewed-by: Qu Wenruo
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • Snapshot is expected to be fast. But if there are writers steadily
    creating dirty pages in our subvolume, the snapshot may take a very long
    time to complete. To fix the problem, we use tagged writepage for the
    snapshot flusher, as the generic write_cache_pages() does, so we can
    omit pages dirtied after the snapshot command.

    This does not change the semantics regarding which data get into the
    snapshot if pages are being dirtied during the snapshotting operation.
    There's a sync called before the snapshot is taken in both the old and
    the new code; any IO in flight just after that may end up in the
    snapshot, but this depends on other system effects that might still sync
    the IO.
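
    The tagged-writepage pattern, roughly as used by the generic
    write_cache_pages() (a sketch, not the btrfs patch itself):

    /* Tag the pages that are dirty right now, then write back only
     * tagged pages; pages dirtied later (after the snapshot command)
     * are not tagged and therefore skipped. */
    if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
            tag = PAGECACHE_TAG_TOWRITE;
    else
            tag = PAGECACHE_TAG_DIRTY;
    if (tag == PAGECACHE_TAG_TOWRITE)
            tag_pages_for_writeback(mapping, index, end);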

    We do a simple snapshot speed test on an Intel D-1531 box:

    fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
    --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
    --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
    time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

    original: 1m58sec
    patched: 6.54sec

    This is the best case for this patch since for a sequential write case,
    we omit nearly all pages dirtied after the snapshot command.

    For a multi-writer, random write test:

    fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
    --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
    --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
    time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

    original: 15.83sec
    patched: 10.35sec

    The improvement is smaller compared to the sequential write case,
    since we omit only half of the pages dirtied after the snapshot command.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Ethan Lien
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Ethan Lien
     
  • This will be used in future patches that remove the optional
    extent_io_ops callbacks.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

15 Oct, 2018

1 commit

  • There are two members in struct btrfs_root which indicate root's
    objectid: objectid and root_key.objectid.

    They are both set to the same value in __setup_root():

    static void __setup_root(struct btrfs_root *root,
                             struct btrfs_fs_info *fs_info,
                             u64 objectid)
    {
            ...
            root->objectid = objectid;
            ...
            root->root_key.objectid = objectid;
            ...
    }

    and not changed to other value after initialization.

    grep in btrfs directory shows both are used in many places:
    $ grep -rI "root->root_key.objectid" | wc -l
    133
    $ grep -rI "root->objectid" | wc -l
    55
    (4.17, inc. some noise)

    It is confusing to have two similar variable names, and there seems to
    be no rule about which should be used in a given case.

    Since ->root_key itself is needed for the tree reloc tree, let's remove
    the 'objectid' member and unify the code to use ->root_key.objectid in
    all places.

    Signed-off-by: Misono Tomohiro
    Reviewed-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Misono Tomohiro
     

06 Aug, 2018

1 commit

  • While the regular inode timestamps all use timespec64 now, the i_otime
    field is btrfs specific and still needs to be converted to correctly
    represent times beyond 2038.
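
    The change itself is small (a sketch; the struct is abbreviated):

    struct btrfs_inode {
            /* ... */
            struct timespec64 i_otime;      /* was: struct timespec i_otime; */
            /* ... */
    };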

    Signed-off-by: Arnd Bergmann
    Reviewed-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Arnd Bergmann
     

29 May, 2018

3 commits

  • We got rid of BTRFS_INODE_HAS_ORPHAN_ITEM and
    BTRFS_INODE_ORPHAN_META_RESERVED, so we can renumber the flags to make
    them consecutive again.

    Signed-off-by: Omar Sandoval
    [ switch them to enums so we don't have to do that again ]
    Signed-off-by: David Sterba

    Omar Sandoval
     
  • Now that we don't keep long-standing reservations for orphan items,
    root->orphan_block_rsv isn't used. We can get rid of it, along with:

    - root->orphan_lock, which was used to protect root->orphan_block_rsv
    - root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
    - BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
    in root->orphan_block_rsv
    - btrfs_orphan_commit_root(), which was the last user of any of these
    and does nothing else

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba

    Omar Sandoval
     
  • Now that we don't add orphan items for truncate, there can't be races on
    adding or deleting an orphan item, so this bit is unnecessary.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba

    Omar Sandoval
     


26 Mar, 2018

1 commit

  • delayed_iput_count was supposed to be used to implement, well, delayed
    iput. The idea is that we keep accumulating the number of iputs we do
    until eventually the inode is deleted. Turns out we never really
    switched delayed_iput_count from 0 to 1, hence all conditional code
    relying on the value of that member being different than 0 was never
    executed. This, as it turns out, didn't cause any problem due to the
    simple fact that the generic inode's i_count member was always used to
    count the number of iputs. So let's just remove the unused member and
    all unused code. This patch provides no functional changes. While at
    it, also add proper documentation for btrfs_add_delayed_iput.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    [ reformat comment ]
    Signed-off-by: David Sterba

    Nikolay Borisov
     

02 Nov, 2017

3 commits

  • The way we handle delalloc metadata reservations has gotten
    progressively more complicated over the years. There is so much cruft
    and weirdness around keeping the reserved count and outstanding counters
    consistent and handling the error cases that it's impossible to
    understand.

    Fix this by making the delalloc block rsv per-inode. This way we can
    calculate the actual size of the outstanding metadata reservations every
    time we make a change, and then reserve the delta based on that amount.
    This greatly simplifies the code everywhere, and makes the error
    handling in btrfs_delalloc_reserve_metadata far less terrifying.
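
    A rough sketch of the delta-based idea (the function and helper names
    are illustrative, not the actual patch):

    /* Recompute the full requirement from the current outstanding
     * metadata, then adjust the per-inode rsv by the delta only. */
    static int update_inode_block_rsv(struct btrfs_inode *inode, u64 needed)
    {
            struct btrfs_block_rsv *rsv = &inode->block_rsv;
            s64 delta = (s64)needed - (s64)rsv->size;

            if (delta > 0)
                    return reserve_metadata_bytes(rsv, delta);      /* grow */
            if (delta < 0)
                    release_metadata_bytes(rsv, -delta);            /* shrink */
            return 0;
    }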

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • This is handy for tracing problems with modifying the outstanding
    extents counters.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Right now we do a lot of weird hoops around outstanding_extents in order
    to keep the extent count consistent. This is because we logically
    transfer the outstanding_extent count from the initial reservation
    through the set_delalloc_bits. This makes it pretty difficult to get a
    handle on how and when we need to mess with outstanding_extents.

    Fix this by revamping the rules of how we deal with outstanding_extents.
    Now instead everybody that is holding on to a delalloc extent is
    required to increase the outstanding extents count for itself. This
    means we'll have something like this

    btrfs_delalloc_reserve_metadata - outstanding_extents = 1
    btrfs_set_extent_delalloc - outstanding_extents = 2
    btrfs_delalloc_release_extents - outstanding_extents = 1

    for an initial file write. Now take the append write where we extend an
    existing delalloc range but still under the maximum extent size

    btrfs_delalloc_reserve_metadata - outstanding_extents = 2
    btrfs_set_extent_delalloc
    btrfs_set_bit_hook - outstanding_extents = 3
    btrfs_merge_extent_hook - outstanding_extents = 2
    btrfs_delalloc_release_extents - outstanding_extents = 1

    In order to make the ordered extent transition we of course must now
    make ordered extents carry their own outstanding_extent reservation, so
    for cow_file_range we end up with

    btrfs_add_ordered_extent - outstanding_extents = 2
    clear_extent_bit - outstanding_extents = 1
    btrfs_remove_ordered_extent - outstanding_extents = 0

    This makes all manipulations of outstanding_extents much more explicit.
    Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
    combined with btrfs_delalloc_release_extents, even in the error case, as
    that is the only function that actually modifies the
    outstanding_extents counter.
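
    A sketch of the required pairing (error handling elided; the exact
    signatures vary between kernel versions):

    ret = btrfs_delalloc_reserve_metadata(inode, len);
    if (ret)
            return ret;
    ret = do_the_buffered_write(inode, pos, len);   /* illustrative */
    /* Must run on both the success and the error path: this is the only
     * function that actually decrements outstanding_extents. */
    btrfs_delalloc_release_extents(inode, len);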

    The drawback to this is now we are much more likely to have transient
    cases where outstanding_extents is much larger than it actually should
    be. This could happen before as we manipulated the delalloc bits, but
    now it happens basically at every write. This may put more pressure on
    the ENOSPC flushing code, but I think making this code simpler is worth
    the cost. I have another change coming to mitigate this side-effect
    somewhat.

    I also added trace points for the counter manipulation. These were used
    by a bpf script I wrote to help track down leak issues.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

16 Aug, 2017

3 commits

  • Add a new value for compression to distinguish between defrag and
    property. Previously, a single variable was used and this caused clashes
    when the per-file 'compression' property was set and a defrag -c was
    called.

    The property compression is loaded when the file is opened; defrag would
    overwrite the same variable and reset it to 0 (ie. NONE) when the file
    defragmentation finished. That's considered a usability bug.

    Now we won't touch the property value and use the defrag compression
    instead. The precedence of defrag is higher than that of the property
    (and of the whole filesystem).
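
    A sketch of the resulting precedence when picking the compression type
    (assuming per-inode fields along the lines of defrag_compress and
    prop_compress):

    /* defrag -c wins over the per-file property, which wins over the
     * filesystem-wide mount option. */
    static int effective_compress_type(struct btrfs_inode *inode,
                                       struct btrfs_fs_info *fs_info)
    {
            if (inode->defrag_compress)
                    return inode->defrag_compress;
            if (inode->prop_compress)
                    return inode->prop_compress;
            return fs_info->compress_type;
    }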

    Signed-off-by: David Sterba

    David Sterba
     
  • This is preparatory work for separating inode compression requested by
    defrag from compression set via properties. This will fix a usability
    bug where defrag resets the compression type to NONE: if the file has
    compression set via property, it would no longer apply (until the next
    mount or a reset through the command line).

    We're going to fix that by adding another variable just for the defrag
    call, without touching the property. Defrag will have higher priority
    when deciding whether to compress the data.

    Signed-off-by: David Sterba

    David Sterba
     
  • Tracepoint arguments are all read-only. If we mark the arguments
    as const, we're able to keep or convert those arguments to const
    where appropriate.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     

09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.
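
    The conversion pattern at a bio completion handler, as a sketch (the
    handler body is illustrative):

    static void example_end_io(struct bio *bio)
    {
            blk_status_t status = bio->bi_status;   /* was: bio->bi_error */

            if (status)
                    pr_err("I/O failed: %d\n", blk_status_to_errno(status));
            bio_put(bio);
    }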

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

26 Apr, 2017

1 commit

  • Currently when there are buffered writes that were not yet flushed and
    they fall within allocated ranges of the file (that is, not in holes or
    beyond eof, assuming there are no prealloc extents beyond eof), btrfs
    simply reports an incorrect number of used blocks through the stat(2)
    system call (or any of its variants), regardless of mount options or
    inode flags (compress, compress-force, nodatacow). This is because the
    number of blocks used that is reported is based on the current number
    of bytes in the vfs inode plus the number of delalloc bytes in the btrfs
    inode. The latter covers bytes that fall both within allocated regions
    of the file and within holes.

    Example scenarios where the number of reported blocks is wrong while the
    buffered writes are not flushed:

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt/sdc

    $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
    wrote 65536/65536 bytes at offset 0
    64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)

    $ sync

    $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
    wrote 65536/65536 bytes at offset 0
    64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)

    # The following should have reported 64K...
    $ du -h /mnt/sdc/foo1
    128K /mnt/sdc/foo1

    $ sync

    # After flushing the buffered write, it now reports the correct value.
    $ du -h /mnt/sdc/foo1
    64K /mnt/sdc/foo1

    $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
    wrote 65536/65536 bytes at offset 0
    64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)

    $ sync

    $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
    wrote 65536/65536 bytes at offset 65536
    64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)

    # The following should have reported 128K...
    $ du -h /mnt/sdc/foo2
    192K /mnt/sdc/foo2

    $ sync

    # After flushing the buffered write, it now reports the correct value.
    $ du -h /mnt/sdc/foo2
    128K /mnt/sdc/foo2

    So the number of used file blocks is simply incorrect, unlike in other
    filesystems such as ext4 and xfs for example, but only while the buffered
    writes are not flushed.

    Fix this by tracking the number of delalloc bytes that fall within holes
    and beyond eof of a file, and use instead this new counter when reporting
    the number of used blocks for an inode.

    Another, different problem is that the delalloc bytes counter is reset
    when writeback starts (by clearing the EXTENT_DELALLOC flag from the
    respective range in the inode's io tree) while the vfs inode's bytes
    counter is only incremented when writeback finishes (through
    insert_reserved_file_extent()). Therefore while writeback is ongoing we
    simply report a wrong number of blocks used by an inode if the write
    operation covers a range previously unallocated. While this change does
    not fix that problem, it does minimize it a lot by shortening that time
    window, as the new delalloc bytes counter (new_delalloc_bytes) is only
    decremented when writeback finishes, right before updating the vfs
    inode's bytes counter. Fully fixing this second problem is not trivial
    and will be addressed later by a different patch.
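
    A sketch of the resulting stat(2) computation (new_delalloc_bytes is the
    counter named above; the surrounding code is simplified):

    /* Report the bytes already accounted in the vfs inode plus only the
     * delalloc bytes that fall in holes or beyond eof, avoiding double
     * counting of unflushed writes over allocated ranges. */
    u64 delalloc;

    spin_lock(&BTRFS_I(inode)->lock);
    delalloc = BTRFS_I(inode)->new_delalloc_bytes;
    spin_unlock(&BTRFS_I(inode)->lock);
    stat->blocks = (ALIGN(inode_get_bytes(inode), blocksize) +
                    ALIGN(delalloc, blocksize)) >> 9;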

    Signed-off-by: Filipe Manana

    Filipe Manana
     

17 Feb, 2017

1 commit

  • The original csum error message only outputs the inode number, offset,
    checksum and expected checksum.

    However, no root objectid is output, which sometimes makes debugging
    quite painful in the multi-subvolume case (including relocation).

    Also the checksum output is decimal, which seldom makes sense for
    users/developers and is hard to read most of the time.

    This patch adds the root objectid, which will be %lld for rootids larger
    than LAST_FREE_OBJECTID, and hex csum output for better readability.
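
    The improved message would look roughly like this (a sketch of the
    format, not necessarily verbatim):

    btrfs_warn_rl(fs_info,
            "csum failed root %lld ino %llu off %llu csum 0x%08x expected csum 0x%08x",
            root->root_key.objectid, btrfs_ino(inode), offset,
            csum, csum_expected);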

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     

14 Feb, 2017

2 commits

  • Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Currently btrfs_ino takes a struct inode and this causes a lot of
    internal btrfs functions which consume this ino to take a VFS inode,
    rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
    of VFS structs into the internals of btrfs, it's first necessary to
    eliminate all uses of struct inode for the purpose of obtaining the ino.
    This patch does that by using BTRFS_I to convert an inode to a
    btrfs_inode. With this problem eliminated, subsequent patches will start
    eliminating the passing of struct inode altogether, eventually resulting
    in much cleaner code.
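
    BTRFS_I() is the standard container_of() helper from the VFS inode to
    the btrfs inode, so the call-site conversion is mechanical (a sketch):

    static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
    {
            return container_of(inode, struct btrfs_inode, vfs_inode);
    }

    /* call sites: */
    u64 ino = btrfs_ino(BTRFS_I(inode));    /* was: btrfs_ino(inode) */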

    Signed-off-by: Nikolay Borisov
    [ fix btrfs_get_extent tracepoint prototype ]
    Signed-off-by: David Sterba

    Nikolay Borisov
     

26 Sep, 2016

1 commit

  • We have a lot of random ints in btrfs_fs_info that can be put into
    flags. This is mostly equivalent, with the exception of how we deal with
    quota going on or off: now we set a flag when we are turning it on or
    off and deal with that appropriately, rather than just having a pending
    state to which the current quota_enabled gets set. Thanks,
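
    A sketch of the pattern (BTRFS_FS_QUOTA_ENABLED is used as an example
    bit; the accounting call is illustrative):

    /* Before: 'int quota_enabled' read and assigned directly.
     * After: one unsigned long bitmap with atomic bit operations. */
    set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
    if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
            do_qgroup_accounting();         /* illustrative */
    clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);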

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

13 May, 2016

1 commit

  • Due to the optimization of lockless direct IO writes (the inode's
    i_mutex is not held) introduced in commit 38851cc19adb ("Btrfs:
    implement unlocked dio write"), we started having races between such
    writes and concurrent fsync operations that use the fast fsync path.
    These races were addressed in the patches titled "Btrfs: fix race
    between fsync and lockless direct IO writes" and "Btrfs: fix race
    between fsync and direct IO writes for prealloc extents". The races
    happened because the direct IO path, like every other write path,
    creates extent maps followed by the corresponding ordered extents, while
    the fast fsync path collected ordered extents first and then collected
    extent maps. This made it possible to log file extent items (based on
    the collected extent maps) without waiting for the corresponding ordered
    extents to complete (get their IO done). The two fixes mentioned before
    added a solution that consists of making the direct IO path create the
    ordered extents first and then the extent maps, while the fsync path
    attempts to collect any new ordered extents once it collects the extent
    maps. This was simple and did not require adding any synchronization
    primitive to any data structure (struct btrfs_inode for example), but it
    makes things more fragile for future development endeavours and adds an
    exceptional approach compared to the other write paths.

    This change adds a read-write semaphore to the btrfs inode structure and
    makes the direct IO path create the extent maps and the ordered extents
    while holding read access on that semaphore, while the fast fsync path
    collects extent maps and ordered extents while holding write access on
    that semaphore. The logic for direct IO write path is encapsulated in a
    new helper function that is used both for cow and nocow direct IO writes.
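
    A sketch of the locking scheme (the semaphore lives in struct
    btrfs_inode; the name dio_sem and the helpers are illustrative):

    /* Direct IO write: ordered extent first, then extent map, under read
     * access so concurrent dio writes don't serialize each other. */
    down_read(&inode->dio_sem);
    create_ordered_extent_then_extent_map(inode);   /* illustrative */
    up_read(&inode->dio_sem);

    /* Fast fsync: collect both under write access, excluding dio writes. */
    down_write(&inode->dio_sem);
    collect_extent_maps_and_ordered_extents(inode); /* illustrative */
    up_write(&inode->dio_sem);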

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana
     

07 Jan, 2016

1 commit

  • Inodes for delayed iput allocate a trivial helper structure; let's
    place the list hook directly into the inode and save a kmalloc (killing
    a __GFP_NOFAIL as a bonus), at the cost of increasing the size of
    btrfs_inode.

    The inode can be put onto the delayed_iputs list more than once and we
    have to keep the count. This means we can't use list_splice to process a
    bunch of inodes, because we'd lose track of the count if the inode is
    put onto the delayed iputs list again while it's being processed.
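
    A simplified sketch of why the processing loop pops inodes one at a time
    instead of using list_splice (the re-queue counting is omitted):

    spin_lock(&fs_info->delayed_iput_lock);
    while (!list_empty(&fs_info->delayed_iputs)) {
            struct btrfs_inode *inode;

            inode = list_first_entry(&fs_info->delayed_iputs,
                                     struct btrfs_inode, delayed_iput);
            list_del_init(&inode->delayed_iput);
            spin_unlock(&fs_info->delayed_iput_lock);
            /* iput() may re-add the inode to delayed_iputs; a spliced
             * private list would never see that. */
            iput(&inode->vfs_inode);
            spin_lock(&fs_info->delayed_iput_lock);
    }
    spin_unlock(&fs_info->delayed_iput_lock);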

    Signed-off-by: David Sterba

    David Sterba
     

22 Sep, 2015

1 commit

  • The following call trace is seen when the generic/095 test is executed:

    WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
    Modules linked in:
    CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
    ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
    0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
    ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
    Call Trace:
    [] dump_stack+0x45/0x57
    [] warn_slowpath_common+0x85/0xc0
    [] warn_slowpath_null+0x15/0x20
    [] btrfs_destroy_inode+0x284/0x2a0
    [] destroy_inode+0x37/0x60
    [] evict+0x109/0x170
    [] dispose_list+0x35/0x50
    [] evict_inodes+0xaa/0x100
    [] generic_shutdown_super+0x47/0xf0
    [] kill_anon_super+0x11/0x20
    [] btrfs_kill_super+0x13/0x110
    [] deactivate_locked_super+0x39/0x70
    [] deactivate_super+0x5f/0x70
    [] cleanup_mnt+0x3e/0x90
    [] __cleanup_mnt+0xd/0x10
    [] task_work_run+0x96/0xb0
    [] do_notify_resume+0x3d/0x50
    [] int_signal+0x12/0x17

    This means that the inode had non-zero "outstanding extents" during
    eviction. This occurs because, during direct I/O, a task which
    successfully used up its reserved data space would set the
    BTRFS_INODE_DIO_READY bit and not clear it after finishing the DIO
    write. A future DIO write could then actually fail, and the unused
    reserved space would not be freed because of the previously set
    BTRFS_INODE_DIO_READY bit.

    Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
    following issue:
    |-----------------------------------+-------------------------------------|
    | Task A                            | Task B                              |
    |-----------------------------------+-------------------------------------|
    | Start direct i/o write on inode X.|                                     |
    |   reserve space                   |                                     |
    |   Allocate ordered extent         |                                     |
    |   release reserved space          |                                     |
    |   Set BTRFS_INODE_DIO_READY bit.  |                                     |
    |                                   | splice()                            |
    |                                   |   Transfer data from pipe buffer to |
    |                                   |   destination file.                 |
    |                                   |   - kmap(pipe buffer page)          |
    |                                   |   - Start direct i/o write on       |
    |                                   |     inode X.                        |
    |                                   |     - reserve space                 |
    |                                   |     - dio_refill_pages()            |
    |                                   |       - sdio->blocks_available == 0 |
    |                                   |       - Since a kernel address is   |
    |                                   |         being passed instead of a   |
    |                                   |         user space address,         |
    |                                   |         iov_iter_get_pages() returns|
    |                                   |         -EFAULT.                    |
    |                                   |   - Since BTRFS_INODE_DIO_READY is  |
    |                                   |     set, we don't release reserved  |
    |                                   |     space.                          |
    |                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
    | -EIOCBQUEUED is returned.         |                                     |
    |-----------------------------------+-------------------------------------|

    Hence this commit introduces "struct btrfs_dio_data" to track the usage of
    reserved data space. The remaining unused "reserve space" can now be freed
    reliably.
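
    A sketch of the new structure (the name is from the description; the
    members are illustrative):

    /* Per-call tracking of dio reservations, instead of an inode runtime
     * bit that racing writers can clobber. */
    struct btrfs_dio_data {
            u64 outstanding_extents;
            u64 reserve;    /* reserved data space not yet consumed */
    };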

    Signed-off-by: Chandan Rajendra
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    chandan
     

02 Jul, 2015

1 commit

  • While running generic/019, dmesg got several warnings from
    btrfs_free_reserved_data_space().

    Test generic/019 produces some disk failures, so the submitted dio will
    get errors, in which case btrfs_direct_IO() goes to the error handling
    path and frees bytes_may_use; but the problem is that bytes_may_use has
    already been freed during get_block().

    This adds a runtime flag to show whether we've gone through get_block();
    if so, we don't do the cleanup work.
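
    A sketch of the flag's use (the bit name and cleanup call are
    illustrative):

    /* In get_block(): remember the reservation was already consumed. */
    set_bit(BTRFS_INODE_DIO_READY, &BTRFS_I(inode)->runtime_flags);

    /* In the btrfs_direct_IO() error path: clean up only if get_block()
     * didn't already free bytes_may_use. */
    if (!test_and_clear_bit(BTRFS_INODE_DIO_READY,
                            &BTRFS_I(inode)->runtime_flags))
            btrfs_free_reserved_data_space(inode, offset, count);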

    Signed-off-by: Liu Bo
    Reviewed-by: Filipe Manana
    Tested-by: Filipe Manana
    Signed-off-by: Chris Mason

    Liu Bo