Eric Lee / smarc-fsl-linux-kernel

02 Dec, 2019

1 commit

0da522107 Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground ... Browse Code »

Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.

In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.

After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.

This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"

* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...

Linus Torvalds
2019-12-02 05:46:15 +0800

19 Nov, 2019

39 commits

fa17ed069 btrfs: drop bdev argument from submit_extent_page ... Browse Code »

After previous patches removing bdev being passed around to set it to
bio, it has become unused in submit_extent_page. So it now has "only" 13
parameters.

Signed-off-by: David Sterba

David Sterba
2019-11-19 06:43:58 +0800
a019e9e19 btrfs: remove extent_map::bdev ... Browse Code »

We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.

Signed-off-by: David Sterba

David Sterba
2019-11-19 06:43:44 +0800
1a4180270 btrfs: drop bio_set_dev where not needed ... Browse Code »

bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
also changing some state bits if there was a different bdev set before.
This is one thing that's not needed.

Another thing is that setting a bdev at bio allocation time is too early
and actually does not work with plain redundancy profiles, where each
time we submit a bio to a device, the bdev is set correctly.

In many places the bio bdev is set to latest_bdev that seems to serve as
a stub pointer "just to put something to bio". But we don't have to do
that.

Where do we know which bdev to set:

* for regular IO: submit_stripe_bio that's called by btrfs_map_bio

* repair IO: repair_io_failure, read or write from specific device

* super block write (using buffer_heads but uses raw bdev) and barriers

* scrub: this does not use all regular IO paths as it needs to reach all
copies, verify and fixup eventually, and for that all bdev management
is independent

* raid56: rbio_add_io_page, for the RMW write

* integrity-checker: does it's own low-level block tracking

Signed-off-by: David Sterba

David Sterba
2019-11-19 06:39:30 +0800
429aebc0a btrfs: get bdev directly from fs_devices in submit_extent_page ... Browse Code »

This is preparatory patch to remove @bdev parameter from
submit_extent_page. It can't be removed completely, because the cgroups
need it for wbc when initializing the bio

wbc_init_bio
bio_associate_blkg_from_css
dereference bdev->bi_disk->queue

The bdev pointer is the same as latest_bdev, thus no functional change.
We can retrieve it from fs_devices that's reachable through several
dereferences. The local variable shadows the parameter, but that's only
temporary.

Signed-off-by: David Sterba

David Sterba
2019-11-19 06:38:46 +0800
3e1740993 btrfs: record all roots for rename exchange on a subvol ... Browse Code »

Testing with the new fsstress support for subvolumes uncovered a pretty
bad problem with rename exchange on subvolumes. We're modifying two
different subvolumes, but we only start the transaction on one of them,
so the other one is not added to the dirty root list. This is caught by
btrfs_cow_block() with a warning because the root has not been updated,
however if we do not modify this root again we'll end up pointing at an
invalid root because the root item is never updated.

Fix this by making sure we add the destination root to the trans list,
the same as we do with normal renames. This fixes the corruption.

Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Filipe Manana
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba

Josef Bacik
2019-11-19 03:08:31 +0800
042528f8d Btrfs: fix block group remaining RO forever after error during device replace ... Browse Code »

When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
set the block group to RO mode and then wait for any ongoing writes into
extents of the block group to complete. While doing that wait we overwrite
the value of the variable 'ret' and can break out of the loop if an error
happens without turning the block group back into RW mode. So what happens
is the following:

1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
to RO mode (its ->ro field set to 1 or incremented to some value > 1);

2) Then btrfs_wait_ordered_roots() returns a value > 0;

3) Then if either joining or committing the transaction fails, we break
out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
the block group in RO mode forever.

To fix this, just remove the code that waits for ongoing writes to extents
of the block group, since it's not needed because in the initial setup
phase of a device replace operation, before starting to find all chunks
and their extents, we set the target device for replace while holding
fs_info->dev_replace->rwsem, which ensures that after releasing that
semaphore, any writes into the source device are made to the target device
as well (__btrfs_map_block() guarantees that). So while at
scrub_enumerate_chunks() we only need to worry about finding and copying
extents (from the source device to the target device) that were written
before we started the device replace operation.

Fixes: f0e9b7d6401959 ("Btrfs: fix race setting block group readonly during device replace")
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba

Filipe Manana
2019-11-19 01:07:55 +0800
b12de5289 btrfs: scrub: Don't check free space before marking a block group RO ... Browse Code »

[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:

btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...

And with the following call trace:

BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---

[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()

This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.

The root cause is, we have the following bad loop of creating tons of
system chunks:

1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.

If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.

[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.

To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.

For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.

Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.

Reviewed-by: Filipe Manana
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba

Qu Wenruo
2019-11-19 01:07:55 +0800
7f0432d0d btrfs: change btrfs_fs_devices::rotating to bool ... Browse Code »

struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.

Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.

Reviewed-by: Qu Wenruo
Reviewed-by: Anand Jain
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2019-11-19 00:51:51 +0800
0395d84f8 btrfs: change btrfs_fs_devices::seeding to bool ... Browse Code »

struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.

Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.

Reviewed-by: Qu Wenruo
Reviewed-by: Anand Jain
Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2019-11-19 00:51:51 +0800
32da5386d btrfs: rename btrfs_block_group_cache ... Browse Code »

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.

Suggested-by: Nikolay Borisov
Reviewed-by: Qu Wenruo
Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:51 +0800
d49a2ddb1 btrfs: block-group: Reuse the item key from caller of read_one_block_group() ... Browse Code »

For read_one_block_group(), its only caller has already got the item key
to search next block group item.

So we can use that key directly without doing our own convertion on
stack.

Also, since that key used in btrfs_read_block_groups() is vital for
block group item search, add 'const' keyword for that parameter to
prevent read_one_block_group() to modify it.

Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Qu Wenruo
2019-11-19 00:51:50 +0800
ffb9e0f05 btrfs: block-group: Refactor btrfs_read_block_groups() ... Browse Code »

Refactor the work inside the loop of btrfs_read_block_groups() into one
separate function, read_one_block_group().

This allows read_one_block_group to be reused for later BG_TREE feature.

The refactor does the following extra fix:
- Use btrfs_fs_incompat() to replace open-coded feature check

Reviewed-by: Johannes Thumshirn
Reviewed-by: Anand Jain
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Qu Wenruo
2019-11-19 00:51:50 +0800
d4e253bbb btrfs: document extent buffer locking ... Browse Code »

Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:50 +0800
a4477988c btrfs: access eb::blocking_writers according to ACCESS_ONCE policies ... Browse Code »

A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
once policies can be found here
https://lwn.net/Articles/799218/#Access-Marking%20Policies .

The locked and unlocked access to eb::blocking_writers should be
annotated accordingly, following this:

Writes:

- locked write must use ONCE, may use plain read
- unlocked write must use ONCE

Reads:

- unlocked read must use ONCE
- locked read may use plain read iff not mixed with unlocked read
- unlocked read then locked must use ONCE

There's one difference on the assembly level, where
btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
value and did not reevaluate it after taking the lock. This could have
missed some opportunities to take the lock in case blocking writers
changed between the calls, but the window is just a few instructions
long. As this is in try-lock, the callers handle that.

Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:50 +0800
40d38f53d btrfs: set blocking_writers directly, no increment or decrement ... Browse Code »

The increment and decrement was inherited from previous version that
used atomics, switched in commit 06297d8cefca ("btrfs: switch
extent_buffer blocking_writers from atomic to int"). The only possible
values are 0 and 1 so we can set them directly.

The generated assembly (gcc 9.x) did the direct value assignment in
btrfs_set_lock_blocking_write (asm diff after change in 06297d8cefca):

5d: test %eax,%eax
5f: je 62
61: retq

- 62: lock incl 0x44(%rdi)
- 66: add $0x50,%rdi
- 6a: jmpq 6f

+ 62: movl $0x1,0x44(%rdi)
+ 69: add $0x50,%rdi
+ 6d: jmpq 72

The part in btrfs_tree_unlock did a decrement because
BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
otherwise the output looks safe:

- lock decl 0x44(%rdi)

+ sub $0x1,%eax
+ mov %eax,0x44(%rdi)

Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:50 +0800
f5c2a5259 btrfs: merge blocking_writers branches in btrfs_tree_read_lock ... Browse Code »

There are two ifs that use eb::blocking_writers. As this is a variable
modified inside and outside of locks, we could minimize number of
accesses to avoid problems with getting different results at different
times.

The access here is locked so this can only race with btrfs_tree_unlock
that sets blocking_writers to 0 without lock and unsets the lock owner.

The first branch is taken only if the same thread already holds the
lock, the second if checks for blocking writers. Here we'd either unlock
and wait, or proceed. Both are valid states of the locking protocol.

Reviewed-by: Johannes Thumshirn
Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:50 +0800
9c907446d btrfs: drop incompat bit for raid1c34 after last block group is gone ... Browse Code »

When there are no raid1c3 or raid1c4 block groups left after balance
(either convert or with other filters applied), remove the incompat bit.
This is already done for RAID56, do the same for RAID1C34.

Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:49 +0800
cfbb825c7 btrfs: add incompat for raid1 with 3, 4 copies ... Browse Code »

The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.

Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:49 +0800
8d6fac008 btrfs: add support for 4-copy replication (raid1c4) ... Browse Code »

Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.

The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.

Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:49 +0800
47e6f7423 btrfs: add support for 3-copy replication (raid1c3) ... Browse Code »

Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.

The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.

Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:49 +0800
fac07d2b0 btrfs: sink write flags to cow_file_range_async ... Browse Code »

In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
cow_file_range_async gained wbc as a parameter and this makes passing
write flags redundant. Set it inside the function and remove the
parameter.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Nikolay Borisov
Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:49 +0800
57e5ffeb8 btrfs: sink write_flags to __extent_writepage_io ... Browse Code »

__extent_writepage reads write flags from wbc and passes both to
__extent_writepage_io. This makes write_flags redundant and we can
remove it.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Nikolay Borisov
Signed-off-by: David Sterba

David Sterba
2019-11-19 00:51:48 +0800
fd0ddbe25 Btrfs: send, skip backreference walking for extents with many references ... Browse Code »

Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use
too much memory when extents have many references. This change simply
skips backreference walking when an extent has more than 64 references,
in which case we fallback to a write operation instead of a clone
operation. This limit is conservative and in practice I observed no
signicant slowdown with up to 100 references and still low memory usage
up to that limit.

This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.

Reported-by: Atemu
Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba

Filipe Manana
2019-11-19 00:51:48 +0800
11f2069c1 Btrfs: send, allow clone operations within the same file ... Browse Code »

For send we currently skip clone operations when the source and
destination files are the same. This is so because clone didn't support
this case in its early days, but support for it was added back in May
2013 by commit a96fbc72884fcb ("Btrfs: allow file data clone within a
file"). This change adds support for it.

Example:

$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt/sdd

$ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
$ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar

$ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap

$ mkfs.btrfs -f /dev/sde
$ mount /dev/sde /mnt/sde

$ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde

Without this change file foobar at the destination has a single 128Kb
extent:

$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof
/mnt/sde/snap/foobar: 1 extent found

With this we get a single 64Kb extent that is shared at file offsets 0
and 64K, just like in the source filesystem:

$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 15: 3328.. 3343: 16: shared
1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof
/mnt/sde/snap/foobar: 2 extents found

Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Filipe Manana
2019-11-19 00:51:48 +0800
6b7faadd9 btrfs: Ensure we trim ranges across block group boundary ... Browse Code »

[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:

btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%

type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len

Thus leaving some unused space not discarded.

[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:

btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()

However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.

When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:

|||
|//////||/////|

To discard that range, we have the following calls:

btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.

So we will only discard the range in BG1, not the remaining part in BG2.

Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.

[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.

- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.

Reviewed-by: Filipe Manana
Reviewed-by: Nikolay Borisov
Tested-by: Nikolay Borisov
Reviewed-by: Josef Bacik
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba

Qu Wenruo
2019-11-19 00:51:48 +0800
2d974619a btrfs: volumes: Use more straightforward way to calculate map length ... Browse Code »

The old code goes:

offset = logical - em->start;
length = min_t(u64, em->len - offset, length);

Where @length calculation is dependent on offset, it can take reader
several more seconds to find it's just the same code as:

offset = logical - em->start;
length = min_t(u64, em->start + em->len - logical, length);

Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Josef Bacik
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Qu Wenruo
2019-11-19 00:51:48 +0800
153a6d299 btrfs: tree-checker: Check item size before reading file extent type ... Browse Code »

In check_extent_data_item(), we read file extent type without verifying
if the item size is valid.

Add such check to ensure the file extent type we read is correct.

The check is not as accurate as we need to cover both inline and regular
extents, so it only checks if the item size is larger or equal to inline
header.
So the existing size checks on inline/regular extents are still needed.

Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Qu Wenruo
2019-11-19 00:51:48 +0800
3ec17a67c btrfs: clean up locking name in scrub_enumerate_chunks() ... Browse Code »

The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
the same lock but Smatch is not clever enough to figure that out so it
leads to static checker warnings. It's better to use it consistently
anyway.

Signed-off-by: Dan Carpenter
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Dan Carpenter
2019-11-19 00:51:47 +0800
6ef108dd0 btrfs: Streamline btrfs_fs_info::backup_root_index semantics ... Browse Code »

The backup_root_index member stores the index at which the backup root
should be saved upon next transaction commit. However, there is a
small deviation from this behavior in the form of a check in
backup_super_roots which checks if current root generation equals to the
generation of the previous root. This can trigger in the following
scenario:

slot0: gen-2
slot1: gen-1
slot2: gen
slot3: unused

Now suppose slot3 (which is also the root specified in the super block)
is corrupted hence init_tree_roots chooses to use the backup root at
slot2, meaning read_backup_root will read slot2 and assign the
superblock generation to gen-1. Despite this backup_root_index will
point at slot3 because its init happens in init_backup_root_slot, long
before any parsing of the backup roots occur. Then on next transaction
start, gen-1 will be incremented by 1 making the root's generation
equal gen. Subsequently, on transaction commit the following check
triggers:

if (btrfs_backup_tree_root_gen(root_backup) ==
btrfs_header_generation(info->tree_root->node))

This causes the 'next_backup', which is the index at which the backup is
going to be written to, to set to last_backup, which will be slot2.

All of this is a very confusing way of expressing the following
invariant:

Always write a backup root at the index following the last used backup
root.

This commit streamlines this logic by setting backup_root_index to the
next index after the one used for mount.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-11-19 00:51:47 +0800
4ac039ad7 btrfs: Rename find_oldest_super_backup to init_backup_root_slot ... Browse Code »

The old name name was an awful misnomer because it didn't really find
the oldest super backup per-se but rather its slot. For example if we
have:

slot0: gen - 2
slot1: gen - 1
slot2: gen
slot3: empty

init_backup_root_slot will return slot3 and not slot0.

The new name is more appropriate since the function doesn't care whether
there is a valid backup in the returned slot or not.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-11-19 00:51:47 +0800
260eb11bd btrfs: Remove unused next_root_backup function ... Browse Code »

This function has been superseded by previous commits and is no longer
used so just remove it.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-11-19 00:51:47 +0800
336a0d8df btrfs: Don't use objectid_mutex during mount ... Browse Code »

Since the filesystem is not well formed and no trees are loaded it's
pointless holding the objectid_mutex. Just remove its usage.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-11-19 00:51:47 +0800
b8522a1e5 btrfs: Factor out tree roots initialization during mount ... Browse Code »

The code responsible for reading and initializing tree roots is
scattered in open_ctree among 2 labels, emulating a loop. This is rather
confusing to reason about. Instead, factor the code to a new function,
init_tree_roots which implements the same logical flow.

There are a couple of notable differences, namely:

* Instead of using next_backup_root it's using the newly introduced
read_backup_root.

* If read_backup_root returns an error init_tree_roots propagates the
error and there is no special handling of that case e.g. the code jumps
straight to 'fail_tree_roots' label. The old code, however, was
(erroneously) jumping to 'fail_block_groups' label if next_backup_root
did fail, this was unnecessary since the tree roots init logic doesn't
modify the state of block groups.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-11-19 00:51:46 +0800
bd2336b2a btrfs: Add read_backup_root ... Browse Code »

This function will replace next_root_backup with a much saner/cleaner
interface.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-11-19 00:51:46 +0800
fc2e4c5b3 btrfs: Remove newest_gen argument from find_oldest_super_backup ... Browse Code »

It's no longer needed following cleanups around find_newest_backup_root

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-11-19 00:51:46 +0800
01f0f9da9 btrfs: Cleanup and simplify find_newest_super_backup ... Browse Code »

Backup roots are always written in a circular manner. By definition we
can only ever have 1 backup root whose generation equals to that of the
superblock. Hence, the 'if' in the for loop will trigger at most once.
This is sufficient to return the newest backup root.

Furthermore the newest_gen parameter is always set to the generation of
the superblock. This value can be obtained from the fs_info.

This patch removes the unnecessary code dealing with the wraparound
case and makes 'newest_gen' a local variable.

Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Nikolay Borisov
2019-11-19 00:51:46 +0800
16ad3be17 Btrfs: remove unnecessary delalloc mutex for inodes ... Browse Code »

The inode delalloc mutex was added a long time ago by commit f248679e86fea
("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
the reason for its introduction is not very clear from the change log. It
claims it solves bogus warnings from lockdep, however it lacks an example
report/warning from lockdep, or any explanation.

Since we have enough concurrentcy protection from the locks of the space
info and block reserve objects, and such lockdep warnings don't seem to
exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
the size of the btrfs_inode structure. With some quick fio tests doing
direct IO and mmap writes I couldn't observe any significant performance
increase either (direct IO writes that don't increase the file's size
don't hold the inode's lock for their entire duration and mmap writes
don't hold the inode's lock at all), which are the only type of writes
that could see any performance gain due to less serialization.

Review feedback from Josef:

The problem was taking the i_mutex in mmap, which is how I was
protecting delalloc reservations originally. The delalloc mutex didn't
come with all of the other dependencies. That's what the lockdep
messages were about, removing the lock isn't going to make them appear
again.

We _had_ to lock around this because we used to do tricks to keep from
over-reserving, and if we didn't serialize delalloc reservations we'd
end up with ugly accounting problems when we tried to clean things up.

However with my recentish changes this isn't the case anymore. Every
operation is responsible for reserving its space, and then adding it to
the inode. Then cleaning up is straightforward and can't be mucked up
by other users. So we no longer need the delalloc mutex to safe us from
ourselves.

Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba

Filipe Manana
2019-11-19 00:51:46 +0800
bf2df5aed Btrfs: remove wait queue from space_info structure ... Browse Code »

It is not used anymore since commit 957780eb2788d8 ("Btrfs: introduce
ticketed enospc infrastructure"), so just remove it.

Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Filipe Manana
2019-11-19 00:51:46 +0800
f5389f330 btrfs: remove cached space_info in btrfs_statfs() ... Browse Code »

In btrfs_statfs() we cache fs_info::space_info in a local variable only
to use it once in a list_for_each_rcu() statement.

Not only is the local variable unnecessary it even makes the code harder
to follow as it's not clear which list it is iterating.

Signed-off-by: Johannes Thumshirn
Reviewed-by: David Sterba
Signed-off-by: David Sterba

Johannes Thumshirn
2019-11-19 00:51:45 +0800