06 Aug, 2018
3 commits
-
Currently the function uses 2 goto labels to properly handle allocation
failures. This could be simplified by re-arranging the code so that
allocations are at the beginning of the function. This allows using
simple return statements. No functional changes.
Signed-off-by: Nikolay Borisov
Reviewed-by: Su Yue
Signed-off-by: David Sterba
-
This function is always called with a valid transaction handle, from
which fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov
Reviewed-by: Qu Wenruo
Signed-off-by: David Sterba
-
This function is always called with a valid transaction handle, from
which fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov
Reviewed-by: Qu Wenruo
Signed-off-by: David Sterba
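For the first commit above (replacing the two goto labels), a minimal userspace sketch of the "allocate first, then use plain returns" shape may help. The struct names, fields and add_ref_sketch() are hypothetical, and malloc()/free() stand in for the kernel's slab allocation; this is not the btrfs function the commit touched.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the objects a delayed-ref helper allocates. */
struct ref_node_sketch { int action; };
struct ref_head_sketch { int nr_refs; };

/*
 * Every allocation happens before any other work, so each failure is a
 * plain return instead of a jump to an unwinding label.
 */
static int add_ref_sketch(struct ref_head_sketch **head_out,
			  struct ref_node_sketch **node_out, int action)
{
	struct ref_node_sketch *node;
	struct ref_head_sketch *head;

	assert(head_out && node_out);

	node = malloc(sizeof(*node));
	if (!node)
		return -ENOMEM;

	head = malloc(sizeof(*head));
	if (!head) {
		free(node);	/* only one earlier allocation to undo */
		return -ENOMEM;
	}

	/* Nothing below can fail, so no cleanup paths are needed. */
	node->action = action;
	head->nr_refs = 1;
	*node_out = node;
	*head_out = head;
	return 0;
}
```

With only two allocations the second failure path undoes one object inline; the benefit is that the rest of the function has no error paths at all.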
29 May, 2018
11 commits
-
add_delayed_ref_head really performed 2 independent operations:
initialising the ref head and adding it to a list. Now that the init
part is in a separate function, let's complete the separation between
both operations. This results in a much simpler interface for
add_delayed_ref_head, since the function now deals solely with either
adding the newly initialised delayed ref head or merging it into an
existing delayed ref head. The function signature is vastly simplified,
since 5 arguments are dropped. The only other thing worth
mentioning is that due to this split the WARN_ON catching reinit of
existing. In this patch the condition is extended such that:
  qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved
is added. This is done because the two qgroup_* prefixed members are
set only if both ref_root and reserved are passed. So functionally
it's equivalent to the old WARN_ON and allows removing the two args
from add_delayed_ref_head.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
Use the newly introduced function when initialising the head_ref in
add_delayed_ref_head. No functional changes.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
add_delayed_ref_head implements the logic to both initialize a head_ref
structure and perform the necessary operations to add it to the
delayed ref machinery. This has resulted in a very cumbersome interface
with loads of parameters and code, which, at first glance, looks very
unwieldy. Begin untangling it by first extracting the initialization-only
code into its own function. It's more or less a verbatim copy of the
first part of add_delayed_ref_head.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
Now that the initialization part and the critical section code have been
split, it's a lot easier to open code add_delayed_data_ref. Do so in the
following manner:
1. The common init function is put immediately after the memory to be
   initialized is allocated, followed by the specific data ref initialization.
2. The only piece of code that remains in the critical section is the
   insert_delayed_ref call.
3. Tracing and memory freeing code is moved outside of the critical
   section.
No functional changes, just an overall shorter critical section.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
Now that the initialization part and the critical section code have been
split, it's a lot easier to open code add_delayed_tree_ref. Do so in the
following manner:
1. The common init code is put immediately after the memory to be
   initialized is allocated, followed by the ref-specific member initialization.
2. The only piece of code that remains in the critical section is the
   insert_delayed_ref call.
3. Tracing and memory freeing code is put outside of the critical
   section as well.
The only real change here is an overall shorter critical section when
dealing with delayed tree refs. From a functional point of view, the code
is unchanged.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
Use the newly introduced helper and remove the duplicate code. No
functional changes.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
Use the newly introduced common helper. No functional changes.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
The majority of the init code for struct btrfs_delayed_ref_node is
duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out
the common bits into init_delayed_ref_common. This function is going to be
used in future patches to clean that up. No functional changes.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
It's provided by the transaction handle.
Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba
-
It's provided by the transaction handle.
Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba
-
It's provided by the transaction handle.
Signed-off-by: Nikolay Borisov
Signed-off-by: David Sterba
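The two commits above that open code add_delayed_data_ref and add_delayed_tree_ref describe the same three-step shape. The sketch below models it in userspace under stated assumptions: a toy atomic_flag spinlock stands in for the kernel lock, the structs and function names are hypothetical, and "merging" is reduced to bumping a counter on a matching entry. Initialisation touches only the fresh allocation, the lock covers just the insertion, and freeing a merged (now redundant) ref happens after unlocking.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-ins for a delayed ref and its head. */
struct ref_sketch {
	unsigned long long root;	/* fields that decide whether refs merge */
	int action;
	int ref_mod;			/* how many references this entry carries */
	struct ref_sketch *next;
};

struct head_sketch {
	atomic_flag lock;		/* toy spinlock standing in for the kernel lock */
	struct ref_sketch *refs;
	int nr_refs;
};

static void head_lock(struct head_sketch *h)
{
	while (atomic_flag_test_and_set_explicit(&h->lock, memory_order_acquire))
		;	/* spin */
}

static void head_unlock(struct head_sketch *h)
{
	atomic_flag_clear_explicit(&h->lock, memory_order_release);
}

/* Must be called with the lock held: merge into a match or link the new ref. */
static int insert_ref_locked(struct head_sketch *h, struct ref_sketch *ref)
{
	for (struct ref_sketch *cur = h->refs; cur; cur = cur->next) {
		if (cur->root == ref->root && cur->action == ref->action) {
			cur->ref_mod += ref->ref_mod;
			return 1;	/* merged: caller still owns 'ref' */
		}
	}
	ref->next = h->refs;
	h->refs = ref;
	h->nr_refs++;
	return 0;
}

/* Returns 1 if merged, 0 if inserted, -ENOMEM on allocation failure. */
static int add_ref_sketch(struct head_sketch *h, unsigned long long root,
			  int action)
{
	struct ref_sketch *ref;
	int merged;

	assert(h != NULL);
	ref = malloc(sizeof(*ref));
	if (!ref)
		return -ENOMEM;

	/* 1. Initialise right after allocation; 'ref' is private, so no lock. */
	ref->root = root;
	ref->action = action;
	ref->ref_mod = 1;
	ref->next = NULL;

	/* 2. Only the insertion touches shared state. */
	head_lock(h);
	merged = insert_ref_locked(h, ref);
	head_unlock(h);

	/* 3. Freeing (and, in the kernel, tracing) happens after unlocking. */
	if (merged)
		free(ref);
	return merged;
}
```

Moving the free() after head_unlock() does not change the outcome, only how long the lock is held, which matches the "no functional changes, just a shorter critical section" description.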
28 May, 2018
1 commit
-
It's only used to print its pointer in a debug statement, which doesn't
bring any useful information to the error message.
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
21 Apr, 2018
1 commit
-
When the delayed refs for a head are all run, eventually
cleanup_ref_head is called, which (in case of deletion) obtains a
reference to the relevant btrfs_space_info struct by querying the bg
for the range. This is problematic because when the last extent of a
bg is deleted, a race window emerges between removal of that bg and the
subsequent invocation of cleanup_ref_head. This can result in the
looked-up block group cache being NULL and either a NULL pointer
dereference or an assertion failure:
task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
FS: 00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
btrfs_run_delayed_refs+0x68/0x250 [btrfs]
btrfs_should_end_transaction+0x42/0x60 [btrfs]
btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
evict+0xc6/0x190
do_unlinkat+0x19c/0x300
do_syscall_64+0x74/0x140
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fbf589c57a7
To fix this, introduce a new flag "is_system" to head_ref structs,
which is populated at insertion time. This allows decoupling the
querying for the space_info from querying the possibly deleted bg.
Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
CC: stable@vger.kernel.org # 4.14+
Suggested-by: Omar Sandoval
Signed-off-by: Nikolay Borisov
Reviewed-by: Omar Sandoval
Signed-off-by: David Sterba
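A small userspace model of the fix: read what cleanup needs while the block group is known to exist, and store it in the head. Everything here is hypothetical (struct names, the linear lookup, the KIND_* values), and the value recorded at insertion is simply passed in; the point is only that cleanup no longer depends on a lookup that can return NULL once the bg is gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { KIND_OTHER = 0, KIND_SYSTEM = 1 };

/* Hypothetical stand-ins; only the idea of caching is_system is modelled. */
struct bg_sketch {
	unsigned long long start, len;
	bool is_system;
};

struct head_ref_sketch {
	unsigned long long bytenr;
	bool is_system;		/* captured when the head is inserted */
};

/* Pretend block-group lookup: returns NULL once the bg has been removed. */
static struct bg_sketch *lookup_bg(struct bg_sketch *bgs, size_t n,
				   unsigned long long bytenr)
{
	for (size_t i = 0; i < n; i++)
		if (bytenr >= bgs[i].start && bytenr < bgs[i].start + bgs[i].len)
			return &bgs[i];
	return NULL;
}

/* At insertion time the answer is known, so record it in the head. */
static void insert_head(struct head_ref_sketch *head, unsigned long long bytenr,
			bool is_system)
{
	assert(head != NULL);
	head->bytenr = bytenr;
	head->is_system = is_system;
}

/* Old approach (modelled): derive the kind from the block group at cleanup. */
static int kind_from_bg(struct bg_sketch *bgs, size_t n, unsigned long long bytenr)
{
	struct bg_sketch *bg = lookup_bg(bgs, n, bytenr);

	if (!bg)
		return -1;	/* where the kernel hit a NULL deref or ASSERT */
	return bg->is_system ? KIND_SYSTEM : KIND_OTHER;
}

/* New approach (modelled): use the flag recorded at insertion time. */
static int kind_from_head(const struct head_ref_sketch *head)
{
	return head->is_system ? KIND_SYSTEM : KIND_OTHER;
}
```

In the usage below, shrinking the bg array to zero entries plays the role of the bg being removed between the last extent's deletion and cleanup_ref_head.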
12 Apr, 2018
1 commit
-
Remove the GPL boilerplate text (long, short, one-line) and keep the rest,
i.e. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba
31 Mar, 2018
1 commit
-
Using lockdep_assert_held is preferred; replace assert_spin_locked.
Signed-off-by: David Sterba
26 Mar, 2018
1 commit
-
The __cold functions are placed in a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
the compiler decide which branches are more/less likely to be taken
without any other annotations needed.
Though we can't add more __exit annotations, it's still possible to add
__cold (which __exit also includes). That way the following function
categories are tagged:
- printf wrappers, error messages
- exit helpers
Signed-off-by: David Sterba
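GCC and Clang expose this as the `cold` function attribute, which the kernel's __cold macro wraps. A small userspace sketch, with a hypothetical error-printing wrapper and a locally defined COLD macro (named differently to avoid clashing with system headers); the attribute only affects code placement and branch prediction, not behaviour.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Same attribute the kernel's __cold uses; renamed for this sketch. */
#define COLD __attribute__((__cold__))

static int nr_errors;	/* counts calls, only so the example can be checked */

/* A printf wrapper for error messages: rarely called, so mark it cold. */
static COLD void report_error(const char *fmt, ...)
{
	va_list args;

	assert(fmt != NULL);
	nr_errors++;
	va_start(args, fmt);
	vfprintf(stderr, fmt, args);
	va_end(args);
}

static int checked_div(int a, int b)
{
	/* The compiler may treat this branch as unlikely because it calls a cold function. */
	if (b == 0) {
		report_error("division by zero (a=%d)\n", a);
		return 0;
	}
	return a / b;
}
```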
02 Feb, 2018
1 commit
-
Running generic/019 with qgroups enabled on the scratch device is almost
guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed
to trigger only on -ENOMEM; in reality, however, it's possible to get
-EIO from btrfs_qgroup_trace_extent_post. This function just finds the
roots of the extent being tracked and sets the qrecord->old_roots list.
If this operation fails, nothing critical happens except that the quota
accounting can be considered wrong. In such a case, just set the
INCONSISTENT flag for the quota and print a warning, rather than killing
off the system. Additionally, it's possible to trigger a BUG_ON in
btrfs_truncate_inode_items as well.
Signed-off-by: Nikolay Borisov
Reviewed-by: Qu Wenruo
[ error message adjustments ]
Signed-off-by: David Sterba
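A minimal sketch of the "degrade instead of crash" pattern described in this commit. The function names, the simulated -EIO and the flag value are all hypothetical (btrfs has its own BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT); the sketch only shows a non-fatal failure being recorded in a status flag and reported, with the caller allowed to continue.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical flag bit; its value here is arbitrary. */
#define QGROUP_STATUS_INCONSISTENT_SKETCH (1ULL << 2)

struct qgroup_state_sketch {
	unsigned long long status_flags;
};

/* Hypothetical stand-in for the root lookup that can fail with -EIO. */
static int find_old_roots_sketch(int simulate_eio)
{
	return simulate_eio ? -EIO : 0;
}

/*
 * A failed old_roots lookup only makes accounting unreliable, so record
 * that and keep going instead of crashing the machine with BUG_ON().
 */
static int trace_extent_post_sketch(struct qgroup_state_sketch *qs,
				    int simulate_eio)
{
	int ret;

	assert(qs != NULL);
	ret = find_old_roots_sketch(simulate_eio);
	if (ret < 0) {
		qs->status_flags |= QGROUP_STATUS_INCONSISTENT_SKETCH;
		fprintf(stderr, "warning: qgroup accounting marked inconsistent (%d)\n",
			ret);
	}
	return 0;	/* never fatal for the caller */
}
```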
22 Jan, 2018
1 commit
-
Adding the __init macro gives the kernel a hint that this function is only
used during the initialization phase and its memory resources can be freed
up afterwards.
Signed-off-by: Liu Bo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
02 Nov, 2017
3 commits
-
If we get a significant amount of delayed refs for a single block (think
modifying multiple snapshots), we can end up spending an ungodly amount
of time looping through all of the entries trying to see if they can be
merged. This is because we only add them to a list, so we have O(2n)
for every ref head. This doesn't make any sense, as we likely have refs
for different roots, and so they cannot be merged. Tracking in a tree
will allow us to break as soon as we hit an entry that doesn't match,
making our worst case O(n).
With this we can also merge entries more easily. Before, we had to hope
that matching refs were on the ends of our list, but with the tree we
can search down to exact matches and merge them at insert time.
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba
-
Instead of open-coding the delayed ref comparisons, add a helper to do
the comparisons generically and use that everywhere. We compare
sequence numbers last for following patches.
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba
-
Make it more consistent: we want the inserted ref to be compared against
what's already in there. This will make the order go from lowest seq ->
highest seq, which will make us more likely to make forward progress if
there's a seqlock currently held.
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba
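The comparison-helper commit (and the tree commit before it) can be illustrated with a short userspace sketch. Assumptions: struct ref_key_sketch flattens a delayed ref into four made-up fields, and a sorted array stands in for the rbtree the commit introduces. Type and identifying fields are compared first and the sequence number last, and only when asked, so refs that could merge sit next to each other and a scan can stop at the first mismatch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, flattened delayed-ref key; the real node has more fields. */
struct ref_key_sketch {
	int type;			/* e.g. tree ref vs. data ref */
	unsigned long long root;
	unsigned long long parent;
	unsigned long long seq;
};

static int cmp_u64(unsigned long long a, unsigned long long b)
{
	return (a > b) - (a < b);
}

/* Type first, then identifying fields, then (optionally) seq last. */
static int comp_refs_sketch(const struct ref_key_sketch *a,
			    const struct ref_key_sketch *b, bool check_seq)
{
	int ret;

	assert(a && b);
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	ret = cmp_u64(a->root, b->root);
	if (ret)
		return ret;
	ret = cmp_u64(a->parent, b->parent);
	if (ret)
		return ret;
	if (check_seq)
		return cmp_u64(a->seq, b->seq);	/* lower seq sorts first */
	return 0;
}

/* In sorted order, refs mergeable with refs[i] are contiguous: stop early. */
static int count_mergeable(const struct ref_key_sketch *refs, int n, int i)
{
	int count = 0;

	for (int j = i + 1; j < n; j++) {
		if (comp_refs_sketch(&refs[i], &refs[j], false) != 0)
			break;	/* first mismatch: nothing further can merge */
		count++;
	}
	return count;
}
```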
30 Oct, 2017
2 commits
-
We can get this from the ref we've passed in.
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
This is just excessive information in the ref_head, and makes the code
complicated. It is a relic from when we had the heads and the refs in
the same tree, which is no longer the case. With this removal I've
cleaned up a bunch of the cruft around this old assumption as well.
Signed-off-by: Josef Bacik
Reviewed-by: David Sterba
Signed-off-by: David Sterba
30 Jun, 2017
1 commit
-
We need this to decide when to account pinned bytes.
Signed-off-by: Omar Sandoval
Tested-by: Holger Hoffstätte
Signed-off-by: David Sterba
18 Apr, 2017
1 commit
-
The refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows avoiding accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
Signed-off-by: David Sterba
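A simplified userspace model of why a dedicated refcount type helps: a plain atomic counter wraps to 0 on overflow, which can lead to a premature free, while a saturating counter sticks at a ceiling and leaks the object instead. This uses C11 atomics and made-up names; it is not the kernel's refcount_t implementation, whose saturation value and warning behaviour differ.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturating counter; NOT the kernel's refcount_t. */
typedef struct { atomic_uint refs; } refcount_sketch_t;

#define REFCOUNT_SKETCH_SATURATED UINT_MAX

static void refcount_sketch_set(refcount_sketch_t *r, unsigned int n)
{
	atomic_store(&r->refs, n);
}

/* Increment, but never wrap: once saturated, the value stays put. */
static void refcount_sketch_inc(refcount_sketch_t *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == REFCOUNT_SKETCH_SATURATED)
			return;
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
}

/* Returns true when the caller dropped the last reference and should free. */
static bool refcount_sketch_dec_and_test(refcount_sketch_t *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		assert(old != 0);	/* an underflow would be a bug */
		if (old == REFCOUNT_SKETCH_SATURATED)
			return false;	/* saturated objects are never freed */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));
	return old == 1;
}
```

The compare-and-swap loop reloads `old` on failure, so concurrent updates are retried rather than lost.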
17 Feb, 2017
1 commit
-
Just as Filipe pointed out, the most time-consuming parts of qgroup are
btrfs_qgroup_account_extents() and
btrfs_qgroup_prepare_account_extents(),
which both call btrfs_find_all_roots() to get the old_roots and new_roots
ulists.
What makes things worse is that we're calling that expensive
btrfs_find_all_roots() at transaction commit time with
TRANS_STATE_COMMIT_DOING, which blocks all incoming transactions.
Such behavior is necessary for the @new_roots search, as the current
btrfs_find_all_roots() can't do it correctly, so we do call it just
before switching commit roots.
However, for the @old_roots search it's not necessary, as such a search is
based on commit_root, so it will always be correct and we can move it
out of transaction commit.
This patch moves the @old_roots search part out of
commit_transaction(), so in theory we can halve the qgroup time
consumption at commit_transaction().
But please note that this won't speed up qgroup overall: the total time
consumption is still the same, it just reduces the performance stall.
Cc: Filipe Manana
Signed-off-by: Qu Wenruo
Reviewed-by: Filipe Manana
Signed-off-by: David Sterba
14 Feb, 2017
2 commits
-
All we need is @delayed_refs. All callers have already got it before calling
btrfs_find_delayed_ref_head, since the lock needs to be acquired first,
so there is no reason to dereference it again inside the function.
Signed-off-by: Liu Bo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
-
btrfs_add_delayed_data_ref is always called with a NULL extent_op,
so let's drop the argument.
Signed-off-by: Jeff Mahoney
Reviewed-by: David Sterba
Signed-off-by: David Sterba
30 Nov, 2016
2 commits
-
This issue was found when I tried to delete a heavily reflinked file.
When deleting such files, other transaction operations will not have a
chance to make progress; for example, start_transaction() will be blocked
in wait_current_trans(root) for a long time, sometimes even triggering
soft lockups, and the time taken to delete such a heavily reflinked file
is also very large, often hundreds of seconds. Using perf top, it reports
that:
PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
---------------------------------------------------------------------------------------
84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80
11.02% [kernel] [k] delay_tsc
0.79% [kernel] [k] _raw_spin_unlock_irq
0.78% [kernel] [k] _raw_spin_unlock_irqrestore
0.45% [kernel] [k] do_raw_spin_lock
0.18% [kernel] [k] __slab_alloc
It seems __btrfs_run_delayed_refs() took most of the CPU time. After some
debugging, I found it's select_delayed_ref() causing this issue: for a delayed
head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but
select_delayed_ref() will first try to iterate the node list to find
BTRFS_ADD_DELAYED_REF nodes. Obviously that's a disaster in this case and
wastes much time.
To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head;
then in select_delayed_ref(), if this list is not empty, we can directly use
nodes in this list. With this patch, it took only about 10~15 seconds to
delete the same file. Now using perf top, it reports that:
PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
----------------------------------------------------------------------------------------
20.74% [kernel] [k] _raw_spin_unlock_irqrestore
16.33% [kernel] [k] __slab_alloc
5.41% [kernel] [k] lock_acquired
4.42% [kernel] [k] lock_acquire
4.05% [kernel] [k] lock_release
3.37% [kernel] [k] _raw_spin_unlock_irq
For normal files, this patch also helps: at least we do not need to
iterate the whole list to find BTRFS_ADD_DELAYED_REF nodes.
Signed-off-by: Wang Xiaoguang
Reviewed-by: Liu Bo
Tested-by: Holger Hoffstätte
Signed-off-by: David Sterba
-
Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
btrfs_qgroup_trace_extent(_nolock), according to the new
reserve/trace/account naming scheme.
Signed-off-by: Qu Wenruo
Reviewed-and-Tested-by: Goldwyn Rodrigues
Signed-off-by: David Sterba
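The ref_add_list commit in this group can be sketched as a head that keeps a second, add-only list next to the list of all refs. The struct and function names are hypothetical, singly linked lists replace the kernel's list_head, and removing a selected ref from both lists is left out. Selecting an ADD ref becomes O(1) instead of a walk past every DROP ref.

```c
#include <assert.h>
#include <stddef.h>

enum ref_action_sketch { ADD_REF_SKETCH, DROP_REF_SKETCH };

/* Hypothetical, simplified delayed ref with two list links. */
struct ref_sketch {
	enum ref_action_sketch action;
	struct ref_sketch *next;	/* link in the head's list of all refs */
	struct ref_sketch *add_next;	/* link in the add-only list */
};

struct head_sketch {
	struct ref_sketch *refs;	/* every ref, adds and drops */
	struct ref_sketch *add_refs;	/* only ADD refs (the ref_add_list idea) */
};

static void head_insert(struct head_sketch *h, struct ref_sketch *ref)
{
	ref->next = h->refs;
	h->refs = ref;
	ref->add_next = NULL;
	if (ref->action == ADD_REF_SKETCH) {
		ref->add_next = h->add_refs;
		h->add_refs = ref;
	}
}

/* Prefer an ADD ref in O(1); otherwise fall back to the first ref, if any. */
static struct ref_sketch *select_ref_sketch(struct head_sketch *h)
{
	assert(h != NULL);
	if (h->add_refs)
		return h->add_refs;
	return h->refs;
}
```

In the heavily reflinked case from the commit, the ADD ref may be buried behind thousands of DROP refs in the main list, but it stays at the front of the add-only list.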
27 Sep, 2016
1 commit
-
For many printks, we want to know which file system issued the message.
This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.
fs/btrfs/check-integrity.c is left alone for another day.
Signed-off-by: Jeff Mahoney
Reviewed-by: David Sterba
Signed-off-by: David Sterba
26 Sep, 2016
1 commit
-
We have a lot of random ints in btrfs_fs_info that can be put into flags. This
is mostly equivalent, with the exception of how we deal with quota going on or
off: now we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to. Thanks,
Signed-off-by: Josef Bacik
Signed-off-by: David Sterba
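Folding several int fields into one flags word can be sketched as below. The bit names are hypothetical (btrfs defines its own bit numbers for fs_info->flags), and these helpers are plain, non-atomic bit operations, whereas the kernel uses its atomic set_bit()/clear_bit()/test_bit() family.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit numbers, one per former int field. */
enum fs_flag_sketch {
	FS_QUOTA_ENABLED_SKETCH,
	FS_QUOTA_OVERRIDE_SKETCH,
};

struct fs_info_sketch {
	unsigned long flags;	/* one word of bits instead of several ints */
};

/* Non-atomic helpers for illustration only. */
static void fs_set_flag(struct fs_info_sketch *fs, enum fs_flag_sketch bit)
{
	assert((unsigned int)bit < 8 * sizeof(fs->flags));
	fs->flags |= 1UL << bit;
}

static void fs_clear_flag(struct fs_info_sketch *fs, enum fs_flag_sketch bit)
{
	assert((unsigned int)bit < 8 * sizeof(fs->flags));
	fs->flags &= ~(1UL << bit);
}

static bool fs_test_flag(const struct fs_info_sketch *fs, enum fs_flag_sketch bit)
{
	return (fs->flags >> bit) & 1UL;
}
```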
25 Aug, 2016
1 commit
-
Refactor btrfs_qgroup_insert_dirty_extent() into two functions:
1. btrfs_qgroup_insert_dirty_extent_nolock()
   Almost the same as the original code.
   For delayed_ref usage, which has delayed refs locked.
   Change the return value type to int, since the caller never needs the
   pointer, but only needs to know if they need to free the allocated
   memory.
2. btrfs_qgroup_insert_dirty_extent()
   The more encapsulated version.
   Will do the delayed_refs lock, memory allocation, quota enabled check
   and other things.
The original design was to keep exported functions to a minimum, but since
more btrfs hacks are exposed, like replacing paths in balance, we need to
record dirty extents manually, so we have to add such functions.
Also, add comments for both functions, to inform developers how to keep
qgroup correct when doing hacks.
Cc: Mark Fasheh
Signed-off-by: Qu Wenruo
Reviewed-and-Tested-by: Goldwyn Rodrigues
Signed-off-by: David Sterba
Signed-off-by: Chris Mason
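The nolock/wrapper split above follows a common pattern, sketched here in single-threaded userspace C. All names are hypothetical and a bool stands in for the delayed_refs spinlock. The _nolock variant assumes the lock is held and returns an int that only says whether the caller must free its record; the wrapper does the quota check, allocation, locking and freeing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical record and tracking structure, reduced to the essentials. */
struct dirty_extent_sketch {
	unsigned long long bytenr;
	struct dirty_extent_sketch *next;
};

struct tracker_sketch {
	bool locked;		/* stands in for the delayed_refs spinlock */
	bool quota_enabled;
	struct dirty_extent_sketch *records;
};

/*
 * Caller must hold the lock. Returns 1 if an equal record already exists
 * (so the caller should free the one it allocated), 0 if it was inserted.
 */
static int insert_dirty_extent_nolock(struct tracker_sketch *t,
				      struct dirty_extent_sketch *rec)
{
	assert(t->locked);
	for (struct dirty_extent_sketch *cur = t->records; cur; cur = cur->next)
		if (cur->bytenr == rec->bytenr)
			return 1;
	rec->next = t->records;
	t->records = rec;
	return 0;
}

/* Encapsulated version: quota check, allocation, locking and freeing. */
static int insert_dirty_extent(struct tracker_sketch *t, unsigned long long bytenr)
{
	struct dirty_extent_sketch *rec;
	int ret;

	if (!t->quota_enabled)
		return 0;

	rec = malloc(sizeof(*rec));
	if (!rec)
		return -ENOMEM;
	rec->bytenr = bytenr;
	rec->next = NULL;

	t->locked = true;	/* where the real code would take the lock */
	ret = insert_dirty_extent_nolock(t, rec);
	t->locked = false;	/* ... and release it */

	if (ret > 0)
		free(rec);	/* duplicate: the int result says to free */
	return 0;
}
```

Returning an int rather than a pointer keeps the _nolock interface minimal: the only decision the caller has to make is whether to free its record.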
06 Aug, 2016
1 commit
-
…fdmanana/linux into for-linus-4.8
03 Aug, 2016
1 commit
-
No longer used as of commit 5846a3c26873 ("btrfs: qgroup: Fix a race in
delayed_ref which leads to abort trans").
Signed-off-by: Filipe Manana
26 Jul, 2016
2 commits
-
When using trace events to debug a problem, it's impossible to determine
which file system generated a particular event. This patch adds a
macro to prefix standard information to the head of a trace event.
The extent_state alloc/free events are all that's left without an
fs_info available.
Signed-off-by: Jeff Mahoney
Signed-off-by: David Sterba
-
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs from being shrunk. This
means those caches are not in fact reclaimable.
To fix this, remove SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction-related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
Signed-off-by: David Sterba