24 Aug, 2018

1 commit

  • [ Upstream commit 665d4953cde6d9e75c62a07ec8f4f8fd7d396ade ]

    In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device
    replace") we removed the branch of copy_nocow_pages() to avoid
    corruption for compressed nodatasum extents.

    However, the above commit only solves the problem in scrub_extent(): if
    we fail to read some pages during scrub_pages(),
    sblock->no_io_error_seen will be zero and we go to the fixup function
    scrub_handle_errored_block().

    In scrub_handle_errored_block(), for a block without csum (no matter
    whether we're doing replace or scrub) we go to the
    scrub_fixup_nodatasum() routine, which does a similar thing to
    copy_nocow_pages() but without the extra check that copy_nocow_pages()
    performs.

    So for test cases like btrfs/100, where we emulate read errors during
    replace/scrub, we could corrupt compressed extent data again.

    This patch fixes it by avoiding any "optimization" for nodatasum and
    falling back to the normal fixup routine, which tries to read from any
    good copy.

    This also solves the WARN_ON() and deadlock caused by the lame backref
    iteration in the scrub_fixup_nodatasum() routine.

    The deadlock and WARN_ON() weren't triggered before commit ac0b4145d662
    ("btrfs: scrub: Don't use inode pages for device replace") since
    copy_nocow_pages() has better locking and an extra check for data
    extents, and it already does the fixup work by trying to read data from
    any good copy, so it never reaches scrub_fixup_nodatasum() anyway.

    This patch only disables the faulty code; the code will be removed
    completely in a followup patch. A sketch of the disabled path follows
    this entry.

    Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
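
    A minimal sketch of the disabled path described above (illustrative
    names; the real scrub_handle_errored_block() differs in detail):

        #include <stdbool.h>

        enum fixup_path { FIXUP_NODATASUM, FIXUP_FROM_GOOD_COPY };

        /* Model of the dispatch: before the fix, data blocks without a
         * csum took the nodatasum shortcut, which writes inode page cache
         * back to disk and corrupts compressed nodatasum extents. The fix
         * disables that branch so everything falls back to the normal
         * fixup that reads from any good copy. */
        static enum fixup_path choose_fixup(bool is_metadata, bool have_csum)
        {
                if (0 && !is_metadata && !have_csum)
                        return FIXUP_NODATASUM; /* disabled faulty path */
                return FIXUP_FROM_GOOD_COPY;
        }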
     

26 Jun, 2018

1 commit

  • commit ac0b4145d662a3b9e34085dea460fb06ede9b69b upstream.

    [BUG]
    Btrfs can create a compressed extent without checksum (even though it
    shouldn't), and if we then try to replace a device containing such an
    extent, the resulting device will contain the uncompressed data instead
    of the compressed data.

    Test case already submitted to fstests:
    https://patchwork.kernel.org/patch/10442353/

    [CAUSE]
    When handling a compressed extent without checksum, device replace goes
    into the copy_nocow_pages() function.

    In that function, btrfs gets all inodes referring to this data extent
    and then uses find_or_create_page() to get pages directly from those
    inodes.

    The problem is that pages taken directly from the inode are always
    uncompressed, and for a compressed data extent they do not match the
    on-disk data. This leads to a corrupted compressed data extent being
    written to the replace target device (see the sketch after this entry).

    [FIX]
    In this attempt, we just remove the "optimization" branch and let the
    unified scrub_pages() handle it.

    Since scrub_pages() won't bother reusing the page cache, it will be a
    little slower, but it does the correct csum checking and won't cause
    the data corruption that the "optimization" did.

    Note about the fix: this is the minimal fix that can be backported to
    older stable trees without conflicts. The whole callchain from
    copy_nocow_pages() can be deleted, and will be in followup patches.

    Fixes: ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk")
    CC: stable@vger.kernel.org # 4.4+
    Reported-by: James Harvey
    Reviewed-by: James Harvey
    Signed-off-by: Qu Wenruo
    [ remove code removal, add note why ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
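
    A hedged sketch of the flawed pattern removed here (simplified;
    write_page_to_target() is a hypothetical stand-in for the real
    dev-replace write path):

        #include <linux/pagemap.h>

        static void copy_via_page_cache(struct inode *inode, pgoff_t index)
        {
                struct page *page;

                page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
                if (!page)
                        return;
                /* BUG for compressed extents: the page cache holds the
                 * uncompressed file data, which does not match the raw
                 * on-disk bytes the replace target must receive. */
                write_page_to_target(page);     /* hypothetical helper */
                unlock_page(page);
                put_page(page);
        }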
     

21 Jun, 2018

1 commit

  • [ Upstream commit 762221f095e3932669093466aaf4b85ed9ad2ac1 ]

    The raid6 corruption case is this: suppose that all disks can be read
    without problems but the content that was read out doesn't match its
    checksum. Currently, raid6 btrfs retries at most twice:

    - the 1st retry rebuilds from all other stripes, which eventually
    amounts to a raid5 xor rebuild;
    - if the 1st retry fails, the 2nd retry deliberately fails parity P so
    that a raid6 style rebuild is done;

    however, chances are that another non-parity stripe is also corrupted,
    in which case the above retries cannot return the correct content.

    We've fixed normal reads to rebuild raid6 correctly with more retries
    in the patch "Btrfs: make raid6 rebuild retry more" [1]; this fixes
    scrub to do exactly the same rebuild process (sketched after this
    entry).

    [1]: https://patchwork.kernel.org/patch/10091755/

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
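
    A rough model of the retry policy (not kernel code; the strategy
    numbering and try_rebuild() helper are illustrative):

        #include <stdbool.h>

        /* Assumed helper: attempts one rebuild combination and returns
         * true only if the result matches its checksum. */
        bool try_rebuild(int strategy);

        /* Strategy 0: plain read; 1: raid5-style xor rebuild; 2..n:
         * raid6 rebuild with a chosen data stripe assumed bad, so that
         * two-failure combinations are eventually covered. */
        static bool read_with_retries(int num_strategies)
        {
                for (int s = 0; s < num_strategies; s++)
                        if (try_rebuild(s))
                                return true;
                return false;   /* truly unrecoverable */
        }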
     

10 Sep, 2017

1 commit

  • Pull btrfs updates from David Sterba:
    "The changes range through all types: cleanups, core chagnes, sanity
    checks, fixes, other user visible changes, detailed list below:

    - deprecated: user transaction ioctl

    - mount option ssd does not change allocation alignments

    - degraded read-write mount is allowed if all the raid profile
    constraints are met, now based on more accurate check

    - defrag: do not reset compression afterwards; the NOCOMPRESS flag
    can now be overridden by defrag

    - prep work for better extent reference tracking (related to the
    qgroup slowness with balance)

    - prep work for compression heuristics

    - memory allocation reductions (may help latencies on a loaded
    system)

    - better accounting for io waiting states

    - error handling improvements (removed BUGs)

    - added more sanity checks for shared refs

    - fix readdir vs pagefault deadlock under some circumstances

    - fix for 'no-hole' mode, certain combination of compressed and
    inline extents

    - send: fix emission of invalid clone operations

    - fixup file mode if setting acls fail

    - more fixes from fuzzing

    - oher cleanups"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
    btrfs: submit superblock io with REQ_META and REQ_PRIO
    btrfs: remove unnecessary memory barrier in btrfs_direct_IO
    btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
    btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
    btrfs: pass fs_info to btrfs_del_root instead of tree_root
    Btrfs: add one more sanity check for shared ref type
    Btrfs: remove BUG_ON in __add_tree_block
    Btrfs: remove BUG() in add_data_reference
    Btrfs: remove BUG() in print_extent_item
    Btrfs: remove BUG() in btrfs_extent_inline_ref_size
    Btrfs: convert to use btrfs_get_extent_inline_ref_type
    Btrfs: add a helper to retrive extent inline ref type
    btrfs: scrub: simplify scrub worker initialization
    btrfs: scrub: clean up division in scrub_find_csum
    btrfs: scrub: clean up division in __scrub_mark_bitmap
    btrfs: scrub: use bool for flush_all_writes
    btrfs: preserve i_mode if __btrfs_set_acl() fails
    btrfs: Remove extraneous chunk_objectid variable
    btrfs: Remove chunk_objectid argument from btrfs_make_block_group
    btrfs: Remove extra parentheses from condition in copy_items()
    ...

    Linus Torvalds
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue, and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
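
    A sketch of the resulting submission pattern, assuming the post-change
    bio fields (bi_disk, bi_partno) and the bio_set_dev() helper from this
    series:

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        static void submit_to_partition(struct bio *bio,
                                        struct block_device *bdev)
        {
                /* Fills bio->bi_disk and bio->bi_partno; the partition
                 * remapping happens later in generic_make_request(). */
                bio_set_dev(bio, bdev);
                generic_make_request(bio);
        }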
     

21 Aug, 2017

4 commits


18 Aug, 2017

1 commit


16 Aug, 2017

2 commits


06 Jul, 2017

1 commit

  • Pull btrfs updates from David Sterba:
    "The core updates improve error handling (mostly related to bios), with
    the usual incremental work on the GFP_NOFS (mis)use removal, and
    refactoring or cleanups. Except for the two top patches, all have been
    in for-next for an extensive amount of time.

    User visible changes:

    - statx support

    - quota override tunable

    - improved compression thresholds

    - obsoleted mount option alloc_start

    Core updates:

    - bio-related updates:
    - faster bio cloning
    - no allocation failures
    - preallocated flush bios

    - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates

    - prep work for btree_inode removal

    - dir-item validation

    - qgroup fixes and updates

    - cleanups:
    - removed unused struct members, unused code, refactoring
    - argument refactoring (fs_info/root, caller -> callee sink)
    - SEARCH_TREE ioctl docs"

    * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
    btrfs: Remove false alert when fiemap range is smaller than on-disk extent
    btrfs: Don't clear SGID when inheriting ACLs
    btrfs: fix integer overflow in calc_reclaim_items_nr
    btrfs: scrub: fix target device intialization while setting up scrub context
    btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
    btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
    btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
    btrfs: qgroup: Return actually freed bytes for qgroup release or free data
    btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
    btrfs: qgroup: Add quick exit for non-fs extents
    Btrfs: rework delayed ref total_bytes_pinned accounting
    Btrfs: return old and new total ref mods when adding delayed refs
    Btrfs: always account pinned bytes when dropping a tree block ref
    Btrfs: update total_bytes_pinned when pinning down extents
    Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
    Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
    btrfs: fix validation of XATTR_ITEM dir items
    btrfs: Verify dir_item in iterate_object_props
    btrfs: Check name_len before in btrfs_del_root_ref
    btrfs: Check name_len before reading btrfs_get_name
    ...

    Linus Torvalds
     

30 Jun, 2017

2 commits

  • Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
    v4.12-rc6. This was because commit 70e7af244 made it possible for
    calc_reclaim_items_nr() to return a negative number. It's not really a
    bug in that commit, it just didn't go far enough down the stack to find
    all the possible 64->32 bit overflows.

    This switches calc_reclaim_items_nr() to return a u64 and changes
    everyone that uses the result of that math to u64 as well (see the
    truncation model after this entry).

    Reported-by: Dave Jones
    Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Chris Mason
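
    A runnable model of the 64->32 bit truncation (numbers are
    illustrative, not from the report):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t to_reclaim = 8ULL << 40;       /* 8 TiB in bytes */
                int nr32 = (int)(to_reclaim >> 12);     /* 4K items: truncates */
                uint64_t nr64 = to_reclaim >> 12;       /* widened: stays sane */

                /* on common two's-complement ABIs nr32 wraps to
                 * -2147483648, tripping a WARN_ON(nr < 0) style check */
                printf("int nr = %d, u64 nr = %llu\n",
                       nr32, (unsigned long long)nr64);
                return 0;
        }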
     
  • The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
    helper but wrongly set up the target device. Incidentally, there's a
    local variable with the same name as a parameter of the previous
    function, so this got caught at runtime as a crash in test btrfs/027
    (modeled in the sketch after this entry).

    Reported-by: Chris Mason
    Signed-off-by: David Sterba

    David Sterba
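
    A minimal model of the shadowing hazard (hypothetical names, not the
    actual scrub code):

        #include <stdio.h>

        struct ctx { int wr_tgtdev; };

        static void setup_ctx(struct ctx *sctx, int replace_tgtdev)
        {
                int tgtdev = 0; /* leftover local shadowing the intent */

                /* BUG-style: picks up the local instead of the argument;
                 * correct would be: sctx->wr_tgtdev = replace_tgtdev; */
                sctx->wr_tgtdev = tgtdev;
                (void)replace_tgtdev;
        }

        int main(void)
        {
                struct ctx c;

                setup_ctx(&c, 42);
                printf("tgtdev = %d (expected 42)\n", c.wr_tgtdev);
                return 0;
        }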
     

20 Jun, 2017

9 commits


09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
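
    A sketch of the conversion pattern, assuming the blk_status_t API this
    commit introduces (BLK_STS_* codes and blk_status_to_errno()):

        #include <linux/bio.h>
        #include <linux/blk_types.h>
        #include <linux/printk.h>

        static void example_end_io(struct bio *bio)
        {
                /* old: bio->bi_error held a negative errno;
                 * new: bi_status has its own dedicated value space. */
                if (bio->bi_status != BLK_STS_OK)
                        pr_err("I/O failed: %d\n",
                               blk_status_to_errno(bio->bi_status));
                bio_put(bio);
        }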
     

18 Apr, 2017

14 commits

  • When scrubbing a RAID5 which has recoverable data corruption (only one
    data stripe is corrupted), sometimes scrub will report more csum errors
    than expected, and sometimes even an unrecoverable error is reported.

    The problem can be easily reproduced by the following steps:
    1) Create a btrfs with RAID5 data profile with 3 devs
    2) Mount it with nospace_cache or space_cache=v2, to avoid extra data
       space usage
    3) Create a 128K file and sync the fs, then unmount it
       (now the 128K file lies at the beginning of the data chunk)
    4) Locate the physical bytenr of the data chunk on dev3
       (dev3 is the 1st data stripe)
    5) Corrupt the first 64K of the data chunk stripe on dev3
    6) Mount the fs and scrub it

    The correct csum error count should be 16 (assuming x86_64's 4K page
    size: 64K / 4K = 16 pages). A larger csum error count is reported with
    roughly 1/3 probability, and an unrecoverable error with roughly 1/10
    probability.

    The root cause of the problem is that the RAID5/6 recovery code has a
    race condition, due to the fact that a full scrub is initiated per
    device. For the other, mirror-based profiles, each mirror is
    independent of the others, so the race doesn't cause any big problem.

    For example:

    Scrub dev3 (D1)      | Scrub dev2 (D2)      | Scrub dev1 (P)
    (corrupted)          | (correct)            | (correct)
    ---------------------+----------------------+---------------------
    Read out D1          | Read out D2          | Read full stripe
    Check csum           | Check csum           | Check parity
    Csum mismatch        | Csum match, continue | Parity mismatch
    handle_errored_block |                      | handle_errored_block
    Read out full stripe |                      | Read out full stripe
    D1 csum error(err++) |                      | D1 csum error(err++)
    Recover D1           |                      | Recover D1

    So D1's csum error is accounted twice, just because
    handle_errored_block() doesn't have enough protection, and the race can
    happen.

    In an even worse case, for example when D1's recovery code is
    re-writing D1/D2/P while P's recovery code is just reading out the full
    stripe, we can end up with an unrecoverable error.

    This patch uses the previously introduced lock_full_stripe() and
    unlock_full_stripe() to protect the whole scrub_handle_errored_block()
    function for RAID56 recovery, so there are no extra csum errors and no
    unrecoverable errors.

    Reported-by: Goffredo Baroncelli
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
    whole full stripe, and without proper protection this can easily cause
    a race condition.

    Introduce 2 new functions for RAID5/6: lock_full_stripe() and
    unlock_full_stripe(). They store an rb_tree of mutexes for full
    stripes, so scrub callers can use them to lock a full stripe and avoid
    races (a sketch of such a lock table follows this entry).

    Signed-off-by: Qu Wenruo
    Reviewed-by: Liu Bo
    Reviewed-by: David Sterba
    [ minor comment adjustments ]
    Signed-off-by: David Sterba

    Qu Wenruo
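
    The lock table such a pair of functions needs might look like this (a
    sketch based on the description above; the in-kernel structures may
    differ in detail):

        #include <linux/mutex.h>
        #include <linux/rbtree.h>
        #include <linux/types.h>

        /* One lock per full stripe, keyed by the stripe's start logical
         * address; looked up or created by lock_full_stripe(). */
        struct full_stripe_lock {
                struct rb_node node;
                u64 logical;        /* rb_tree key: full stripe start */
                u64 refs;           /* concurrent holders/waiters     */
                struct mutex mutex; /* serializes recovery on stripe  */
        };

        /* The tree itself needs a lock too. */
        struct full_stripe_locks_tree {
                struct rb_root root;
                struct mutex lock;
        };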
     
  • This fixes code where div_u64 was passed a u64 divisor; div_u64 takes
    only a u32 divisor, so the value was silently truncated.

    Cc: David Sterba
    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     
  • Commit 3d8da6781760 ("Btrfs: fix divide error upon chunk's stripe_len")
    changed stripe_len in struct map_lookup to u64, but didn't update
    stripe_len in struct scrub_parity.

    This updates the type and switches to div64_u64_rem to match u64 divisor.

    Cc: David Sterba
    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
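
    The hazard behind both of the division fixes above, sketched (the
    function is illustrative; the helpers are from linux/math64.h):

        #include <linux/math64.h>
        #include <linux/types.h>

        static u64 offset_in_stripe(u64 offset, u64 stripe_len)
        {
                u64 rem;

                /* div_u64(offset, stripe_len) takes a u32 divisor and
                 * would silently truncate a 64-bit stripe_len; the
                 * div64_u64* variants keep the full width. */
                div64_u64_rem(offset, stripe_len, &rem);
                return rem;
        }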
     
  • There's a helper to clear a whole page, with arch-specific optimized
    code. The replaced cases do not seem to be in performance-critical
    code, but we still might get some percent gain.

    Signed-off-by: David Sterba

    David Sterba
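
    The replacement pattern, sketched (assuming the clear_highpage()
    helper from linux/highmem.h; the commit may use a different variant):

        #include <linux/highmem.h>

        static void zero_page_contents(struct page *page)
        {
                /* before: memset(page_address(page), 0, PAGE_SIZE); */
                clear_highpage(page);   /* arch-optimized page clear */
        }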
     
  • scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
    bbio without the protection of bio_counter.

    This can lead to a use-after-free if racing with dev-replace cancel.

    Fix it by increasing bio_counter before calling btrfs_map_sblock() and
    decreasing it when the corresponding recovery is finished (pattern
    sketched after this entry).

    Cc: Liu Bo
    Reported-by: Liu Bo
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
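
    The protection pattern described above, sketched; the btrfs types and
    the btrfs_map_sblock() signature are approximated for this kernel era:

        static int map_with_bio_counter(struct btrfs_fs_info *fs_info,
                                        u64 logical, u64 *mapped_length,
                                        struct btrfs_bio **bbio)
        {
                int ret;

                /* blocks while a replace is pausing new bios */
                btrfs_bio_counter_inc_blocked(fs_info);
                ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
                                       logical, mapped_length, bbio);
                if (ret || !*bbio) {
                        btrfs_bio_counter_dec(fs_info);
                        return -EIO;
                }
                /* Caller uses *bbio and calls btrfs_bio_counter_dec()
                 * only once the corresponding recovery has finished. */
                return 0;
        }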
     
  • When raid56 dev-replace is cancelled by a running scrub, we will free
    the target device without waiting for in-flight bios, causing the
    following NULL pointer dereference or general protection failure.

    BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
    IP: generic_make_request_checks+0x4d/0x610
    CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G O 4.11.0-rc2 #72
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
    task: ffff88002875b4c0 task.stack: ffffc90001334000
    RIP: 0010:generic_make_request_checks+0x4d/0x610
    Call Trace:
    ? generic_make_request+0xc7/0x360
    generic_make_request+0x24/0x360
    ? generic_make_request+0xc7/0x360
    submit_bio+0x64/0x120
    ? page_in_rbio+0x4d/0x80 [btrfs]
    ? rbio_orig_end_io+0x80/0x80 [btrfs]
    finish_rmw+0x3f4/0x540 [btrfs]
    validate_rbio_for_rmw+0x36/0x40 [btrfs]
    raid_rmw_end_io+0x7a/0x90 [btrfs]
    bio_endio+0x56/0x60
    end_workqueue_fn+0x3c/0x40 [btrfs]
    btrfs_scrubparity_helper+0xef/0x620 [btrfs]
    btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
    process_one_work+0x2af/0x720
    ? process_one_work+0x22b/0x720
    worker_thread+0x4b/0x4f0
    kthread+0x10f/0x150
    ? process_one_work+0x720/0x720
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x2e/0x40
    RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8

    In btrfs_dev_replace_finishing(), we call
    btrfs_rm_dev_replace_blocked() to wait for bios before destroying the
    target device when scrub finishes normally.

    However, when dev-replace is aborted, either due to an error or because
    it was cancelled by scrub, we don't wait for bios, and this can lead to
    a use-after-free if there are bios still holding the target device.

    Furthermore, for raid56 scrub, at least 2 places call
    btrfs_map_sblock() without the protection of bio_counter, leading to
    the same problem.

    This patch fixes the problem by:
    1) Waiting for bio_counter before freeing the target device when
    cancelling the replace
    2) Using bio_counter to protect calls to btrfs_map_sblock() for raid56
    (the same pattern sketched after the previous entry)

    Cc: Liu Bo
    Signed-off-by: Qu Wenruo
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • In the following situation, scrub will calculate wrong parity to
    overwrite the correct one:

    RAID5 full stripe:

    Before:

    | Dev 1         | Dev 2         | Dev 3         |
    | Data stripe 1 | Data stripe 2 | Parity stripe |
    ------------------------------------------------- 0
    | 0x0000 (Bad)  | 0xcdcd        | 0x0000        |
    ------------------------------------------------- 4K
    | 0xcdcd        | 0xcdcd        | 0x0000        |
    ...
    | 0xcdcd        | 0xcdcd        | 0x0000        |
    ------------------------------------------------- 64K

    After scrubbing dev3 only:

    | Dev 1         | Dev 2         | Dev 3         |
    | Data stripe 1 | Data stripe 2 | Parity stripe |
    ------------------------------------------------- 0
    | 0xcdcd (Good) | 0xcdcd        | 0xcdcd (Bad)  |
    ------------------------------------------------- 4K
    | 0xcdcd        | 0xcdcd        | 0x0000        |
    ...
    | 0xcdcd        | 0xcdcd        | 0x0000        |
    ------------------------------------------------- 64K

    The reason is that after the raid56 read-rebuild, rbio->stripe_pages
    are all correctly recovered (0xcd for the data stripes).

    However, when we check and repair parity in
    scrub_parity_check_and_repair(), we append the pages in the
    sparity->spages list to rbio->bio_pages[], which still contain the old
    on-disk data.

    And when we submit the parity data to disk, we calculate parity using
    rbio->bio_pages[] first; only if a page is not found there do we fall
    back to rbio->stripe_pages[] (this preference is modeled after this
    entry).

    The patch fixes it by not appending the pages from sparity->spages, so
    finish_parity_scrub() will use rbio->stripe_pages[], which is correct.

    Signed-off-by: Qu Wenruo
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Qu Wenruo
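
    A model of the page-preference rule described above (illustrative
    types, not the kernel code):

        #include <stddef.h>

        /* When computing parity, a page staged in bio_pages[] wins over
         * the rebuilt stripe_pages[]. The fix stops staging stale pages
         * from sparity->spages, so the rebuilt data is used. */
        static void *page_for_parity(void **bio_pages,
                                     void **stripe_pages, int i)
        {
                if (bio_pages[i])
                        return bio_pages[i];    /* may be stale on-disk data */
                return stripe_pages[i];         /* recovered, known-good */
        }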
     
  • All callers pass 0 for mirror_num and 1 for need_raid_map.

    Signed-off-by: David Sterba

    David Sterba
     
  • Scrub repairs data in units called scrub_block, which may contain
    several pages. Scrub always tries to look up a good copy of a whole
    block, but if there's no such copy, it tries to repair page by page.

    If we don't set the page's io_error flag when checking this bad copy,
    then in the last step we may skip this page when repairing the bad copy
    from the good copy.

    Cc: David Sterba
    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
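
    A minimal model of the flag logic the fix adds (illustrative struct,
    not the kernel's scrub_page):

        #include <stdbool.h>

        struct spage { bool io_error; };

        static void check_page(struct spage *p, bool csum_ok)
        {
                if (!csum_ok)
                        p->io_error = true; /* the step that was missing */
        }

        static bool page_needs_repair(const struct spage *p)
        {
                return p->io_error; /* repair pass copies marked pages */
        }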
     
  • refcount_t type and corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
     
  • refcount_t type and corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
     
  • refcount_t type and corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
     
  • refcount_t type and corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
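
    The conversion pattern all four commits apply, sketched with a
    hypothetical object (refcount_t saturates instead of wrapping on
    overflow):

        #include <linux/refcount.h>
        #include <linux/slab.h>

        struct my_obj {
                refcount_t refs;        /* was: atomic_t refs */
        };

        static void my_obj_get(struct my_obj *o)
        {
                refcount_inc(&o->refs);         /* was: atomic_inc() */
        }

        static void my_obj_put(struct my_obj *o)
        {
                /* was: if (atomic_dec_and_test(&o->refs)) */
                if (refcount_dec_and_test(&o->refs))
                        kfree(o);
        }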
     

28 Feb, 2017

1 commit