06 Dec, 2018
1 commit
-
commit 42a657f57628402c73237547f0134e083e2f6764 upstream.
The function relocate_block_group calls btrfs_end_transaction to release
trans when update_backref_cache returns 1, and then continues the loop
body. If btrfs_block_rsv_refill fails this time, it will jump out the
loop and the freed trans will be accessed. This may result in a
use-after-free bug. The patch assigns NULL to trans after trans is
released so that it will not be accessed.Fixes: 0647bf564f1 ("Btrfs: improve forever loop when doing balance relocation")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo
Signed-off-by: Pan Bian
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman
14 Nov, 2018
1 commit
-
commit 65c6e82becec33731f48786e5a30f98662c86b16 upstream.
[BUG]
When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
when trying to recover balance:kernel BUG at fs/btrfs/extent-tree.c:8956!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
Call Trace:
walk_up_tree+0x172/0x1f0 [btrfs]
btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
merge_reloc_roots+0xe1/0x1d0 [btrfs]
btrfs_recover_relocation+0x3ea/0x420 [btrfs]
open_ctree+0x1af3/0x1dd0 [btrfs]
btrfs_mount_root+0x66b/0x740 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
btrfs_mount+0x16d/0x890 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
do_mount+0x1fd/0xda0
ksys_mount+0xba/0xd0
__x64_sys_mount+0x21/0x30
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe[CAUSE]
Extent tree corruption. In this particular case, reloc tree root's
owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
corrupted and we failed the owner check in walk_up_tree().[FIX]
It's pretty hard to take care of every extent tree corruption, but at
least we can remove such BUG_ON() and exit more gracefully.And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
the same root (which is obviously invalid), we needs to make
__del_reloc_root() more robust to detect such invalid sharing to avoid
possible NULL dereference as root->node can be NULL in this case.Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman
15 Sep, 2018
1 commit
-
[ Upstream commit 389305b2aa68723c754f88d9dbd268a400e10664 ]
Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
does some cleanup of the reloc roots.It turns out that fs_info::reloc_ctl can be NULL in
btrfs_recover_relocation() as we allocate relocation control after all
reloc roots have been verified.
So when we hit: note, we haven't called set_reloc_control() thus
fs_info::reloc_ctl is still NULL.Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
Reported-by: Xu Wen
Signed-off-by: Qu Wenruo
Tested-by: Gu Jinxiang
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
26 Sep, 2017
1 commit
-
__del_reloc_root should be called before freeing up reloc_root->node.
If not, calling __del_reloc_root() dereference reloc_root->node, causing
the system BUG.Fixes: 6bdf131fac23 ("Btrfs: don't leak reloc root nodes on error")
Cc: # 4.9
Signed-off-by: Naohiro Aota
Reviewed-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
21 Aug, 2017
3 commits
-
The BUG_ON() can be triggered when the caller is processing an invalid
extent inline ref, e.g.a shared data ref is offered instead of an extent data ref, such that
it tries to find a non-existent tree block and then btrfs_search_slot
returns 1 for no such item.This replaces the BUG_ON() with a WARN() followed by calling
btrfs_print_leaf() to show more details about what's going on and
returning -EINVAL to upper callers.Signed-off-by: Liu Bo
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Now that we have a helper to report invalid value of extent inline ref
type, we need to quit gracefully instead of throwing out a kernel panic.Signed-off-by: Liu Bo
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
Since we have a helper which can do sanity check, this converts all
btrfs_extent_inline_ref_type to it.Signed-off-by: Liu Bo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
16 Aug, 2017
1 commit
-
One of the error handling paths in __add_reloc_root contains btrfs_panic()
followed by some other code. As the name implies what it does is print
some error message and call BUG, naturally what follow afterwards is not
invoked. So remove this extra code.Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
30 Jun, 2017
3 commits
-
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6. This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number. It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.This switches calc_reclaim_items_nr() to return a u64 and changes everyone
that uses the results of that math to u64 as well.Reported-by: Dave Jones
Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
Signed-off-by: Chris Mason
Reviewed-by: David Sterba
Signed-off-by: David Sterba -
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)Task A | Task B
----------------------------------------------------------------------
Buffered_write [0, 2K) |
|- check_data_free_space() |
| |- qgroup_reserve_data() |
| Range aligned to page |
| range [0, 4K) <<< |
| 4K bytes reserved <<< |
|- copy pages to page cache |
| Buffered_write [2K, 4K)
| |- check_data_free_space()
| | |- qgroup_reserved_data()
| | Range alinged to page
| | range [0, 4K)
| | Already reserved by A <<<
| | 0 bytes reserved <<<
| |- delalloc_reserve_metadata()
| | And it *FAILED* (Maybe EQUOTA)
| |- free_reserved_data_space()
|- qgroup_free_data()
Range aligned to page range
[0, 4K)
Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.Reported-by: Chandan Rajendra
Signed-off-by: Qu Wenruo
Reviewed-by: Chandan Rajendra
Tested-by: Chandan Rajendra
Signed-off-by: David Sterba -
Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.This will lead to qgroup reserved space underflow.
Reviewed-by: Chandan Rajendra
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
20 Jun, 2017
1 commit
-
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.Signed-off-by: Josef Bacik
Reviewed-by: Chandan Rajendra
Reviewed-by: David Sterba
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba
28 Feb, 2017
4 commits
-
Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba -
Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba -
Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba -
Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba
17 Feb, 2017
2 commits
-
The free space cache APIs accept a root but always use the tree root.
Also, btrfs_truncate_free_space_cache accepts a root AND an inode but
the inode always points to the root anyway, so let's just pass the inode.Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba -
btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree. Let's convert
to passing an fs_info and using the extent root.Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba
14 Feb, 2017
2 commits
-
This goes as a separate patch because fixing that inside the patches
caused too many many conflicts.Signed-off-by: David Sterba
-
Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.Signed-off-by: Nikolay Borisov
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba
14 Dec, 2016
1 commit
-
…dmanana/linux into for-linus-4.10
Patches queued up by Filipe:
The most important change is still the fix for the extent tree
corruption that happens due to balance when qgroups are enabled (a
regression introduced in 4.7 by a fix for a regression from the last
qgroups rework). This has been hitting SLE and openSUSE users and QA
very badly, where transactions keep getting aborted when running
delayed references leaving the root filesystem in RO mode and nearly
unusable. There are fixes here that allow us to run xfstests again
with the integrity checker enabled, which has been impossible since 4.8
(apparently I'm the only one running xfstests with the integrity
checker enabled, which is useful to validate dirtied leafs, like
checking if there are keys out of order, etc). The rest are just some
trivial fixes, most of them tagged for stable, and two cleanups.Signed-off-by: Chris Mason <clm@fb.com>
06 Dec, 2016
5 commits
-
Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba -
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba -
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba -
We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba -
There are many functions that are always called with the same root
argument. Rather than passing the same root every time, we can
pass an fs_info pointer instead and have the function get the root
pointer itself.Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba
30 Nov, 2016
4 commits
-
Commit 62b99540a1d91e464 (btrfs: relocation: Fix leaking qgroups numbers
on data extents) only fixes the problem partly.The previous fix is to trace all new data extents at transaction commit
time when balance finishes.However balance is not done in a large transaction, every path
replacement can happen in its own transaction.
This makes the fix useless if transaction commits during relocation.For example:
relocate_block_group()
|-merge_reloc_roots()
| |- merge_reloc_root()
| |- btrfs_start_transaction()
Signed-off-by: Qu Wenruo
Reviewed-and-Tested-by: Goldwyn Rodrigues
Signed-off-by: David Sterba -
Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
btrfs_qgroup_trace_extent(_nolock), according to the new
reserve/trace/account naming schema.Signed-off-by: Qu Wenruo
Reviewed-and-Tested-by: Goldwyn Rodrigues
Signed-off-by: David Sterba -
The only memset we do is to 0, so sink the parameter to the function and
simplify all calls. Rename the function to reflect the behaviour.Signed-off-by: David Sterba
-
They're not even documented anywhere, letting users with no recourse but
to RTFS. It's no big burden to output the bitfield as words.Also, display unknown flags as hex.
Signed-off-by: Adam Borowski
Tested-by: Holger Hoffstätte
Signed-off-by: David Sterba
19 Nov, 2016
2 commits
-
In commit 5bc7247ac47c (Btrfs: fix broken nocow after balance) we started
abusing the rtransid and otransid fields of root items from relocation
trees to fix some issues with nodatacow mode. However later in commit
ba8b0289333a (Btrfs: do not reset last_snapshot after relocation) we
dropped the code that made use of those fields but did not remove
the code that sets those fields.So just remove them to avoid confusion.
Signed-off-by: Filipe Manana
Reviewed-by: Josef Bacik -
During relocation of a data block group we create a relocation tree
for each fs/subvol tree by making a snapshot of each tree using
btrfs_copy_root() and the tree's commit root, and then setting the last
snapshot field for the fs/subvol tree's root to the value of the current
transaction id minus 1. However this can lead to relocation later
dropping references that it did not create if we have qgroups enabled,
leaving the filesystem in an inconsistent state that keeps aborting
transactions.Lets consider the following example to explain the problem, which requires
qgroups to be enabled.We are relocating data block group Y, we have a subvolume with id 258 that
has a root at level 1, that subvolume is used to store directory entries
for snapshots and we are currently at transaction 3404.When committing transaction 3404, we have a pending snapshot and therefore
we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot()
in order to create its dentry at subvolume 258. This results in COWing
leaf A from root 258 in order to add the dentry. Note that leaf A
also contains file extent items referring to extents from some other
block group X (we are currently relocating block group Y). Later on, still
at create_pending_snapshot() we call qgroup_account_snapshot(), which
switches the commit root for root 258 when it calls switch_commit_roots(),
so now the COWed version of leaf A, lets call it leaf A', is accessible
from the commit root of tree 258. At the end of qgroup_account_snapshot(),
we call record_root_in_trans() with 258 as its argument, which results
in btrfs_init_reloc_root() being called, which in turn calls
relocation.c:create_reloc_root() in order to create a relocation tree
associated to root 258, which results in assigning the value of 3403
(which is the current transaction id minus 1 = 3404 - 1) to the
last_snapshot field of root 258. When creating the relocation tree root
at ctree.c:btrfs_copy_root() we add a shared reference for leaf A',
corresponding to the relocation tree's root, when we call btrfs_inc_ref()
against the COWed root (a copy of the commit root from tree 258), which
is at level 1. So at this point leaf A' has 2 references, one normal
reference corresponding to root 258 and one shared reference corresponding
to the root of the relocation tree.Transaction 3404 finishes its commit and transaction 3405 is started by
relocation when calling merge_reloc_root() for the relocation tree
associated to root 258. In the meanwhile leaf A' is COWed again, in
response to some filesystem operation, when we are still at transaction
3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we
call btrfs_block_can_be_shared() in order to figure out if other trees
refer to the leaf and if any such trees exists, add a full back reference
to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false
because the following condition is false:btrfs_header_generation(buf) root_item)
which evaluates to 3404 refs[0] is 1, it does call
btrfs_dec_ref() against leaf A', which results in removing the single
references that the extents from block group X have which are associated
to root 258 - the expectation was to have each of these extents with 2
references - one reference for root 258 and one shared reference related
to the root of the relocation tree, and so we would drop only the shared
reference (because leaf A' was supposed to have the flag
BTRFS_BLOCK_FLAG_FULL_BACKREF set).This leaves the filesystem in an inconsistent state as we now have file
extent items in a subvolume tree that point to extents from block group X
without references in the extent tree. So later on when we try to decrement
the references for these extents, for example due to a file unlink operation,
truncate operation or overwriting ranges of a file, we fail because the
expected references do not exist in the extent tree.This leads to warnings and transaction aborts like the following:
[ 588.965795] ------------[ cut here ]------------
[ 588.965815] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
[ 588.965816] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
[ 588.965831] CPU: 2 PID: 2479 Comm: kworker/u8:7 Not tainted 4.7.3-3-default-fdm+ #1
[ 588.965832] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 588.965844] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[ 588.965845] 0000000000000000 ffff8802263bfa28 ffffffff813af542 0000000000000000
[ 588.965847] 0000000000000000 ffff8802263bfa68 ffffffff81081e8b 0000065900000000
[ 588.965848] ffff8801db2af000 000000012bbe2000 0000000000000000 ffff880215703b48
[ 588.965849] Call Trace:
[ 588.965852] [] dump_stack+0x63/0x81
[ 588.965854] [] __warn+0xcb/0xf0
[ 588.965855] [] warn_slowpath_null+0x1d/0x20
[ 588.965863] [] lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
[ 588.965865] [] ? trace_clock_local+0x10/0x30
[ 588.965867] [] ? rb_reserve_next_event+0x6f/0x460
[ 588.965875] [] insert_inline_extent_backref+0x55/0xd0 [btrfs]
[ 588.965882] [] __btrfs_inc_extent_ref.isra.55+0x8f/0x240 [btrfs]
[ 588.965890] [] __btrfs_run_delayed_refs+0x74a/0x1260 [btrfs]
[ 588.965892] [] ? cpuacct_charge+0x86/0xa0
[ 588.965900] [] btrfs_run_delayed_refs+0x9f/0x2c0 [btrfs]
[ 588.965908] [] delayed_ref_async_start+0x94/0xb0 [btrfs]
[ 588.965918] [] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
[ 588.965928] [] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[ 588.965930] [] process_one_work+0x1f3/0x4e0
[ 588.965931] [] worker_thread+0x48/0x4e0
[ 588.965932] [] ? process_one_work+0x4e0/0x4e0
[ 588.965934] [] kthread+0xc9/0xe0
[ 588.965936] [] ret_from_fork+0x1f/0x40
[ 588.965937] [] ? kthread_worker_fn+0x170/0x170
[ 588.965938] ---[ end trace 34e5232c933a1749 ]---
[ 588.966187] ------------[ cut here ]------------
[ 588.966196] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:2966 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[ 588.966196] BTRFS: Transaction aborted (error -5)
[ 588.966197] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
[ 588.966206] CPU: 2 PID: 2479 Comm: kworker/u8:7 Tainted: G W 4.7.3-3-default-fdm+ #1
[ 588.966207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 588.966217] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[ 588.966217] 0000000000000000 ffff8802263bfc98 ffffffff813af542 ffff8802263bfce8
[ 588.966219] 0000000000000000 ffff8802263bfcd8 ffffffff81081e8b 00000b96345ee000
[ 588.966220] ffffffffa021ae1c ffff880215703b48 00000000000005fe ffff8802345ee000
[ 588.966221] Call Trace:
[ 588.966223] [] dump_stack+0x63/0x81
[ 588.966224] [] __warn+0xcb/0xf0
[ 588.966225] [] warn_slowpath_fmt+0x4f/0x60
[ 588.966233] [] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
[ 588.966241] [] delayed_ref_async_start+0x94/0xb0 [btrfs]
[ 588.966250] [] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
[ 588.966259] [] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
[ 588.966260] [] process_one_work+0x1f3/0x4e0
[ 588.966261] [] worker_thread+0x48/0x4e0
[ 588.966263] [] ? process_one_work+0x4e0/0x4e0
[ 588.966264] [] kthread+0xc9/0xe0
[ 588.966265] [] ret_from_fork+0x1f/0x40
[ 588.966267] [] ? kthread_worker_fn+0x170/0x170
[ 588.966268] ---[ end trace 34e5232c933a174a ]---
[ 588.966269] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2966: errno=-5 IO failure
[ 588.966270] BTRFS info (device sda2): forced readonlyThis was happening often on openSUSE and SLE systems using btrfs as the
root filesystem (with its default layout where multiple subvolumes are
used) where balance happens in the background triggered by a cron job and
snapshots are automatically created before/after package installations,
upgrades and removals. The issue could be triggered simply by running the
following loop on the first system boot post installation:while true; do
zypper -n in nfs-kernel-server
zypper -n rm nfs-kernel-server
done(If we were fast enough and made that loop before the cron job triggered
a balance operation and the balance finished)So fix by setting the last_snapshot field of the root to the value of the
generation of its commit root. Like this btrfs_block_can_be_shared()
behaves correctly for the case where the relocation root is created during
a transaction commit and for the case where it's created before a
transaction commit.Fixes: 6426c7ad697d (btrfs: qgroup: Fix qgroup accounting when creating snapshot)
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Filipe Manana
Reviewed-by: Josef Bacik
17 Oct, 2016
1 commit
-
While updating btree, we try to push items between sibling
nodes/leaves in order to keep height as low as possible.
But we don't memset the original places with zero when
pushing items so that we could end up leaving stale content
in nodes/leaves. One may read the above stale content by
increasing btree blocks' @nritems.One case I've come across is that in fs tree, a leaf has two
parent nodes, hence running balance ends up with processing
this leaf with two parent nodes, but it can only reach the
valid parent node through btrfs_search_slot, so it'd be like,do_relocation
for P in all parent nodes of block A:
if !P->eb:
btrfs_search_slot(key); --> get path from P to A.
if lowest:
BUG_ON(A->bytenr != bytenr of A recorded in P);
btrfs_cow_block(P, A); --> change A's bytenr in P.After btrfs_cow_block, P has the new bytenr of A, but with the
same @key, we get the same path again, and get panic by BUG_ON.Note that this is only happening in a corrupted fs, for a
regular fs in which we have correct @nritems so that we won't
read stale content in any case.Reviewed-by: Josef Bacik
Signed-off-by: Liu Bo
Signed-off-by: David Sterba
27 Sep, 2016
2 commits
-
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."This patch unsplits user-visible strings.
Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba -
We don't track the reloc roots in any sort of normal way, so the only way the
root/commit_root nodes get free'd is if the relocation finishes successfully and
the reloc root is deleted. Fix this by free'ing them in free_reloc_roots.
Thanks,Signed-off-by: Josef Bacik
Signed-off-by: David Sterba
26 Sep, 2016
3 commits
-
When relocating tree blocks, we firstly get block information from
back references in the extent tree, we then search fs tree to try to
find all parents of a block.However, if fs tree is corrupted, eg. if there're some missing
items, we could come across these WARN_ONs and BUG_ONs.This makes us print some error messages and return gracefully
from balance.Signed-off-by: Liu Bo
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba -
We have a lot of random ints in btrfs_fs_info that can be put into flags. This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to. Thanks,Signed-off-by: Josef Bacik
Signed-off-by: David Sterba -
…e and subpage size patchset
Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
parameters for both in-band dedupe and subpage sector size patchset.This should reduce conflict of both patchset and the effort to rebase
them.Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
01 Sep, 2016
1 commit
-
Qgroup function may overwrite the saved error 'err' with 0
in case quota is not enabled, and this ends up with a
endless loop in balance because we keep going back to balance
the same block group.It really should use 'ret' instead.
Signed-off-by: Liu Bo
Reviewed-by: Qu Wenruo
Signed-off-by: David Sterba
25 Aug, 2016
1 commit
-
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((64*1024*1024))" testfileAbove script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
| bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
|-> btrfs_reserve_extent()
|-> btrfs_add_reserved_bytes()
| alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
| change bytes_may_use, and bytes_reserved += 64M. Now
| bytes_may_use + bytes_reserved == 128M, which is greater
| than btrfs_space_info's total_bytes, false enospc occurs.
| Note, the bytes_may_use decrease operation will be done in
| end of btrfs_fallocate(), which is too late.Here is another simple case for buffered write:
CPU 1 | CPU 2
|
|-> cow_file_range() |-> __btrfs_buffered_write()
|-> btrfs_reserve_extent() | |
| | |
| | |
| ..... | |-> btrfs_check_data_free_space()
| |
| |
|-> extent_clear_unlock_delalloc() |In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allcate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both sucessful and failed
path, btrfs_prealloc_file_range()'s callers does not need to call
btrfs_free_reserved_data_space() any more.Signed-off-by: Wang Xiaoguang
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba
Signed-off-by: Chris Mason