29 May, 2018

11 commits

  • add_delayed_ref_head really performed 2 independent operations -
    initialising the ref head and adding it to a list. Now that the init
    part is in a separate function, let's complete the separation between
    both operations. This results in a much simpler interface for
    add_delayed_ref_head, since the function now deals solely with either
    adding the newly initialised delayed ref head or merging it into an
    existing delayed ref head. The function signature is also vastly
    simplified, since 5 arguments are dropped. The only other thing worth
    mentioning is that, due to this split, the WARN_ON catching re-init of
    an existing head changes: the condition is extended such that

    qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved

    is added. This is done because the two qgroup_* prefixed members are
    set only if both ref_root and reserved are passed, so functionally
    it's equivalent to the old WARN_ON and allows removing the two args
    from add_delayed_ref_head.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
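
    A minimal sketch of what the function can look like after the split
    described above. The helper names (htree_insert,
    update_existing_head_ref) and the exact signature are approximations
    of the btrfs code of that period, not a verbatim copy of the patch:

        static noinline struct btrfs_delayed_ref_head *
        add_delayed_ref_head(struct btrfs_trans_handle *trans,
                             struct btrfs_delayed_ref_head *head_ref,
                             struct btrfs_qgroup_extent_record *qrecord,
                             int action, int *qrecord_inserted_ret,
                             int *old_ref_mod, int *new_ref_mod)
        {
            struct btrfs_delayed_ref_root *delayed_refs;
            struct btrfs_delayed_ref_head *existing;

            delayed_refs = &trans->transaction->delayed_refs;
            /* the head was fully initialised by the caller */
            existing = htree_insert(&delayed_refs->href_root,
                                    &head_ref->href_node);
            if (existing) {
                /*
                 * The qgroup_* members are only set when both ref_root
                 * and reserved were passed, so this check is equivalent
                 * to the old WARN_ON.
                 */
                WARN_ON(qrecord && head_ref->qgroup_ref_root &&
                        head_ref->qgroup_reserved);
                update_existing_head_ref(delayed_refs, existing,
                                         head_ref, old_ref_mod);
                head_ref = existing;
            } else {
                /* brand new head, just account it */
                delayed_refs->num_heads++;
                delayed_refs->num_heads_ready++;
                atomic_inc(&delayed_refs->num_entries);
            }
            return head_ref;
        }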
     
  • Use the newly introduced function when initialising the head_ref in
    add_delayed_ref_head. No functional changes.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • add_delayed_ref_head implements the logic to both initialize a head_ref
    structure and perform the necessary operations to add it to the
    delayed ref machinery. This has resulted in a very cumbersome interface
    with loads of parameters and code which, at first glance, looks very
    unwieldy. Begin untangling it by first extracting the initialization-only
    code into its own function. It's a more or less verbatim copy of the
    first part of add_delayed_ref_head.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Now that the initialization part and the critical section code have been
    split it's a lot easier to open code add_delayed_data_ref. Do so in the
    following manner:

    1. The call to the common init function is placed immediately after the
    memory to be initialized is allocated, followed by the specific data ref
    initialization.

    2. The only piece of code that remains in the critical section is the
    insert_delayed_ref call.

    3. Tracing and memory freeing code is moved outside of the critical
    section.

    No functional changes, just an overall shorter critical section.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
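
    A rough sketch of the resulting shape of btrfs_add_delayed_data_ref(),
    following the three steps above. Error handling is trimmed and the
    helper names/signatures are approximations of that era of the code,
    not the exact upstream diff:

        ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
        if (!ref)
            return -ENOMEM;

        /* 1) common init right after allocation, then the data-specific part */
        init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
                                ref_root, action, ref_type);
        ref->root = ref_root;
        ref->parent = parent;
        ref->objectid = owner;
        ref->offset = offset;

        spin_lock(&delayed_refs->lock);
        head_ref = add_delayed_ref_head(trans, head_ref, record, action,
                                        &qrecord_inserted, old_ref_mod,
                                        new_ref_mod);
        /* 2) the only work left inside the critical section */
        ret = insert_delayed_ref(trans, delayed_refs, head_ref, &ref->node);
        spin_unlock(&delayed_refs->lock);

        /* 3) tracing and freeing of a merged-away node happen unlocked */
        trace_add_delayed_data_ref(trans->fs_info, &ref->node, ref, action);
        if (ret > 0)
            kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);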
     
  • Now that the initialization part and the critical section code have been
    split it's a lot easier to open code add_delayed_tree_ref. Do so in the
    following manner:

    1. The common init code is placed immediately after the memory to be
    initialized is allocated, followed by the ref-specific member
    initialization.

    2. The only piece of code that remains in the critical section is the
    insert_delayed_ref call.

    3. Tracing and memory freeing code is put outside of the critical
    section as well.

    The only real change here is an overall shorter critical section when
    dealing with delayed tree refs. From a functional point of view the
    code is unchanged.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Use the newly introduced helper and remove the duplicate code. No
    functional changes.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Use the newly introduced common helper. No functional changes.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The majority of the init code for struct btrfs_delayed_ref_node is
    duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out
    the common bits into init_delayed_ref_common. This function is going to
    be used in future patches to clean that up. No functional changes.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
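
    A sketch of such a shared init helper, showing only the common
    btrfs_delayed_ref_node members; the name init_delayed_ref_common comes
    from the text above, the exact parameter list and field set are an
    approximation:

        static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
                                            struct btrfs_delayed_ref_node *ref,
                                            u64 bytenr, u64 num_bytes,
                                            u64 ref_root, int action,
                                            u8 ref_type)
        {
            u64 seq = 0;

            if (action == BTRFS_ADD_DELAYED_EXTENT)
                action = BTRFS_ADD_DELAYED_REF;

            if (is_fstree(ref_root))
                seq = atomic64_read(&fs_info->tree_mod_seq);

            refcount_set(&ref->refs, 1);
            ref->bytenr = bytenr;
            ref->num_bytes = num_bytes;
            ref->ref_mod = 1;
            ref->action = action;
            ref->seq = seq;
            ref->type = ref_type;
            RB_CLEAR_NODE(&ref->ref_node);
            INIT_LIST_HEAD(&ref->add_list);
        }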
     
  • It's provided by the transaction handle.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • It's provided by the transaction handle.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • It's provided by the transaction handle.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     

21 Apr, 2018

1 commit

  • When the delayed refs for a head are all run, eventually
    cleanup_ref_head is called, which (in case of deletion) obtains a
    reference to the relevant btrfs_space_info struct by querying the bg
    for the range. This is problematic because when the last extent of a
    bg is deleted, a race window emerges between removal of that bg and the
    subsequent invocation of cleanup_ref_head. This can result in the cache
    being NULL and either a NULL pointer dereference or an assertion failure.

    task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
    RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
    RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
    RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
    RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
    R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
    R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
    FS: 00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
    btrfs_run_delayed_refs+0x68/0x250 [btrfs]
    btrfs_should_end_transaction+0x42/0x60 [btrfs]
    btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
    btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
    evict+0xc6/0x190
    do_unlinkat+0x19c/0x300
    do_syscall_64+0x74/0x140
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x7fbf589c57a7

    To fix this, introduce a new flag, "is_system", in the head_ref struct,
    which is populated at insertion time. This allows decoupling the
    space_info lookup from the possibly deleted bg.

    Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
    CC: stable@vger.kernel.org # 4.14+
    Suggested-by: Omar Sandoval
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Omar Sandoval
    Signed-off-by: David Sterba

    Nikolay Borisov
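
    A sketch of the idea: the flag is recorded when the head is added and
    later used in cleanup_ref_head() to derive the space_info flags
    directly, instead of looking up the possibly removed block group.
    Member and helper names are approximations of that era of the code:

        /* at insertion time, e.g. when queueing a tree ref */
        head_ref->is_system = (ref_root == BTRFS_CHUNK_TREE_OBJECTID);

        /* in cleanup_ref_head(), for the deletion case */
        if (head->is_data)
            flags = BTRFS_BLOCK_GROUP_DATA;
        else if (head->is_system)
            flags = BTRFS_BLOCK_GROUP_SYSTEM;
        else
            flags = BTRFS_BLOCK_GROUP_METADATA;
        space_info = __find_space_info(fs_info, flags);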
     

26 Mar, 2018

1 commit

  • The __cold functions are placed into a special section, as they're
    expected to be called rarely. This could help i-cache prefetches or help
    the compiler decide which branches are more/less likely to be taken
    without any other annotations needed.

    Though we can't add more __exit annotations, it's still possible to add
    __cold (which __exit already implies). That way the following function
    categories are tagged:

    - printf wrappers, error messages
    - exit helpers

    Signed-off-by: David Sterba

    David Sterba
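
    An illustrative, made-up example of the annotation pattern; btrfs_err()
    is real, the wrapper below is hypothetical:

        /* error reporting is rare, keep it out of the hot text */
        static void __cold report_bad_block(struct btrfs_fs_info *fs_info,
                                            u64 bytenr)
        {
            btrfs_err(fs_info, "bad tree block at %llu", bytenr);
        }

        /* exit helpers already get __cold implicitly through __exit */
        static void __exit exit_btrfs_fs(void);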
     

02 Feb, 2018

1 commit

  • Running generic/019 with qgroups enabled on the scratch device is almost
    guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed
    to trigger only on -ENOMEM; in reality, however, it's possible to get
    -EIO from btrfs_qgroup_trace_extent_post. This function just finds the
    roots of the extent being tracked and sets the qrecord->old_roots list.
    If this operation fails, nothing critical happens except that the quota
    accounting can be considered wrong. In such a case just set the
    INCONSISTENT flag for the quota and print a warning, rather than killing
    off the system. Additionally, it's possible to trigger a BUG_ON in
    btrfs_truncate_inode_items as well.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    [ error message adjustments ]
    Signed-off-by: David Sterba

    Nikolay Borisov
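
    A sketch of the error handling this describes, at a simplified call
    site (the exact place in the code and the message wording are
    assumptions, not the literal patch):

        ret = btrfs_qgroup_trace_extent_post(fs_info, qrecord);
        if (ret < 0) {
            /* not fatal, the accounting is merely stale from now on */
            fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
            btrfs_warn(fs_info,
                       "error %d accounting extent for qgroup, quota is out of sync",
                       ret);
            ret = 0;
        }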
     

02 Nov, 2017

3 commits

  • If we get a significant amount of delayed refs for a single block (think
    modifying multiple snapshots) we can end up spending an ungodly amount
    of time looping through all of the entries trying to see if they can be
    merged. This is because we only add them to a list, so we have O(2n)
    for every ref head. This doesn't make any sense as we likely have refs
    for different roots, and so they cannot be merged. Tracking in a tree
    will allow us to break as soon as we hit an entry that doesn't match,
    making our worst case O(n).

    With this we can also merge entries more easily. Before we had to hope
    that matching refs were on the ends of our list, but with the tree we
    can search down to exact matches and merge them at insert time.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
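
    A sketch of the insert-time lookup this enables (comparator and member
    names approximate): refs under a head are kept in an rbtree ordered by
    a full comparison, so an exact match is found during insertion and can
    be merged right away, and a walk can stop at the first mismatch.

        static struct btrfs_delayed_ref_node *
        tree_insert(struct rb_root *root, struct btrfs_delayed_ref_node *ins)
        {
            struct rb_node **p = &root->rb_node;
            struct rb_node *parent = NULL;
            struct btrfs_delayed_ref_node *entry;

            while (*p) {
                int comp;

                parent = *p;
                entry = rb_entry(parent, struct btrfs_delayed_ref_node,
                                 ref_node);
                comp = comp_refs(ins, entry, true);
                if (comp < 0)
                    p = &(*p)->rb_left;
                else if (comp > 0)
                    p = &(*p)->rb_right;
                else
                    return entry;   /* existing ref to merge with */
            }

            rb_link_node(&ins->ref_node, parent, p);
            rb_insert_color(&ins->ref_node, root);
            return NULL;
        }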
     
  • Instead of open-coding the delayed ref comparisons, add a helper to do
    the comparisons generically and use it everywhere. Sequence numbers are
    compared last, for the benefit of following patches.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Make it more consistent: we want the inserted ref to be compared against
    what's already in there. This will make the order go from lowest seq ->
    highest seq, which makes us more likely to make forward progress if
    there's a seqlock currently held.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

17 Feb, 2017

1 commit

  • Just as Filipe pointed out, the most time consuming parts of qgroups are
    btrfs_qgroup_account_extents() and
    btrfs_qgroup_prepare_account_extents(), both of which call
    btrfs_find_all_roots() to get the old_roots and new_roots ulists.

    What makes things worse is that we're calling the expensive
    btrfs_find_all_roots() at transaction commit time with
    TRANS_STATE_COMMIT_DOING set, which blocks all incoming transactions.

    Such behavior is necessary for the @new_roots search, as the current
    btrfs_find_all_roots() can't do it correctly, so we do call it just
    before switching the commit roots.

    However, for the @old_roots search it's not necessary, as that search is
    based on the commit_root, so it will always be correct and we can move
    it out of the transaction commit.

    This patch moves the @old_roots search part out of
    commit_transaction(), so in theory we can halve the qgroup time
    consumption at commit_transaction().

    But please note that this won't speed up qgroups overall; the total time
    consumption is still the same, it just reduces the performance stall.

    Cc: Filipe Manana
    Signed-off-by: Qu Wenruo
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba

    Qu Wenruo
     

30 Nov, 2016

2 commits

  • This issue was found when I tried to delete a heavily reflinked file.
    When deleting such files, other transaction operations do not have a
    chance to make progress; for example, start_transaction() will block
    in wait_current_trans(root) for a long time, sometimes even triggering
    soft lockups, and the time taken to delete such a heavily reflinked
    file is also very large, often hundreds of seconds. Using perf top, it
    reports:

    PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
    ---------------------------------------------------------------------------------------
    84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80
    11.02% [kernel] [k] delay_tsc
    0.79% [kernel] [k] _raw_spin_unlock_irq
    0.78% [kernel] [k] _raw_spin_unlock_irqrestore
    0.45% [kernel] [k] do_raw_spin_lock
    0.18% [kernel] [k] __slab_alloc

    It seems __btrfs_run_delayed_refs() took most of the cpu time. After
    some debug work, I found that select_delayed_ref() is causing this
    issue: for a delayed head, in our case, the node list will be full of
    BTRFS_DROP_DELAYED_REF nodes, but select_delayed_ref() will firstly try
    to iterate the node list to find BTRFS_ADD_DELAYED_REF nodes, which is
    obviously a disaster in this case and wastes much time.

    To fix this issue, we introduce a new ref_add_list in struct
    btrfs_delayed_ref_head; then in select_delayed_ref(), if this list is
    not empty, we can directly use the nodes in it. With this patch, it took
    just about 10~15 seconds to delete the same file. Now using perf top, it
    reports:

    PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
    ----------------------------------------------------------------------------------------

    20.74% [kernel] [k] _raw_spin_unlock_irqrestore
    16.33% [kernel] [k] __slab_alloc
    5.41% [kernel] [k] lock_acquired
    4.42% [kernel] [k] lock_acquire
    4.05% [kernel] [k] lock_release
    3.37% [kernel] [k] _raw_spin_unlock_irq

    For normal files this patch also helps: at least we no longer need to
    iterate the whole list to find BTRFS_ADD_DELAYED_REF nodes.

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Liu Bo
    Tested-by: Holger Hoffstätte
    Signed-off-by: David Sterba

    Wang Xiaoguang
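
    A sketch of the fast path this adds (member names taken from the text,
    surrounding code approximate): ADD refs are additionally linked on the
    per-head ref_add_list when queued, so select_delayed_ref() can grab one
    directly instead of scanning a drop-heavy node list.

        /* when queueing a ref under its head */
        if (ref->action == BTRFS_ADD_DELAYED_REF)
            list_add_tail(&ref->add_list, &href->ref_add_list);

        /* in select_delayed_ref() */
        if (!list_empty(&head->ref_add_list))
            return list_first_entry(&head->ref_add_list,
                                    struct btrfs_delayed_ref_node,
                                    add_list);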
     
  • Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
    btrfs_qgroup_trace_extent(_nolock), according to the new
    reserve/trace/account naming schema.

    Signed-off-by: Qu Wenruo
    Reviewed-and-Tested-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba

    Qu Wenruo
     

27 Sep, 2016

1 commit

  • For many printks, we want to know which file system issued the message.

    This patch converts most pr_* calls to use the btrfs_* versions instead.
    In some cases, this means adding plumbing to allow call sites access to
    an fs_info pointer.

    fs/btrfs/check-integrity.c is left alone for another day.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
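
    An illustrative before/after of the conversion (the message text is
    made up); btrfs_warn() and friends prefix the output with information
    identifying the affected file system:

        pr_warn("BTRFS: could not read tree block %llu\n", bytenr);
        /* becomes */
        btrfs_warn(fs_info, "could not read tree block %llu", bytenr);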
     

26 Sep, 2016

1 commit

  • We have a lot of random ints in btrfs_fs_info that can be put into flags.
    This is mostly equivalent, with the exception of how we deal with quota
    going on or off: now we set a flag when we are turning it on or off and
    deal with that appropriately, rather than just having a pending state
    that the current quota_enabled gets set to. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
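
    Illustrative use of the resulting flag bits (BTRFS_FS_QUOTA_ENABLED is
    one of the real bits from this work; the surrounding code is
    simplified):

        /* quota was turned on */
        set_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);

        /* callers test the bit instead of a plain int member */
        if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
            return 0;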
     

25 Aug, 2016

1 commit

  • Refactor the btrfs_qgroup_insert_dirty_extent() function into two
    functions:
    1. btrfs_qgroup_insert_dirty_extent_nolock()
    Almost the same as the original code.
    For delayed_ref usage, which has the delayed refs locked.

    Change the return value type to int, since the caller never needs the
    pointer, only whether it has to free the allocated memory.

    2. btrfs_qgroup_insert_dirty_extent()
    The more encapsulated version.

    Will do the delayed_refs locking, memory allocation, quota enabled check
    and other things.

    The original design was to keep the exported functions to a minimum, but
    since more btrfs hacks are exposed, like replacing paths in balance, we
    need to record dirty extents manually, so we have to add such functions.

    Also, add comments for both functions to inform developers how to keep
    qgroups correct when doing such hacks.

    Cc: Mark Fasheh
    Signed-off-by: Qu Wenruo
    Reviewed-and-Tested-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Qu Wenruo
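
    A sketch of how the encapsulated variant can be built on top of the
    _nolock one, per the description above (the structure, checks and
    member names are approximations, not the exact patch):

        int btrfs_qgroup_insert_dirty_extent(struct btrfs_trans_handle *trans,
                                             struct btrfs_fs_info *fs_info,
                                             u64 bytenr, u64 num_bytes,
                                             gfp_t gfp_flag)
        {
            struct btrfs_qgroup_extent_record *record;
            struct btrfs_delayed_ref_root *delayed_refs;
            int ret;

            /* quota enabled check and trivial-range check first */
            if (!fs_info->quota_enabled || bytenr == 0 || num_bytes == 0)
                return 0;

            record = kmalloc(sizeof(*record), gfp_flag);
            if (!record)
                return -ENOMEM;

            record->bytenr = bytenr;
            record->num_bytes = num_bytes;
            record->old_roots = NULL;

            delayed_refs = &trans->transaction->delayed_refs;
            spin_lock(&delayed_refs->lock);
            ret = btrfs_qgroup_insert_dirty_extent_nolock(fs_info,
                                                          delayed_refs,
                                                          record);
            spin_unlock(&delayed_refs->lock);
            /* > 0 means a record for this bytenr already existed */
            if (ret > 0)
                kfree(record);
            return 0;
        }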
     

26 Jul, 2016

2 commits

  • When using trace events to debug a problem, it's impossible to determine
    which file system generated a particular event. This patch adds a
    macro to prefix standard information to the head of a trace event.

    The extent_state alloc/free events are all that's left without an
    fs_info available.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • BTRFS uses a variety of slab caches to satisfy internal needs.
    Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT
    flag, meaning allocations from the caches are going to be accounted as
    SReclaimable. At the same time btrfs is not registering any shrinkers
    whatsoever, thus preventing memory from the slabs from being shrunk.
    This means those caches are not in fact reclaimable.

    To fix this, remove SLAB_RECLAIM_ACCOUNT from all caches apart from the
    inode cache, since that one is freed by the generic VFS super_block
    shrinker. Also mark the transaction related caches as SLAB_TEMPORARY,
    to better document the lifetime of the objects (it just translates
    to SLAB_RECLAIM_ACCOUNT).

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
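
    Illustrative cache setup after the change (flag combinations are
    approximate): the inode cache keeps SLAB_RECLAIM_ACCOUNT because the
    VFS super_block shrinker can actually free it, while a transaction
    cache is tagged SLAB_TEMPORARY instead:

        btrfs_inode_cachep = kmem_cache_create("btrfs_inode",
                sizeof(struct btrfs_inode), 0,
                SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD | SLAB_ACCOUNT,
                init_once);

        btrfs_trans_handle_cachep = kmem_cache_create("btrfs_trans_handle",
                sizeof(struct btrfs_trans_handle), 0,
                SLAB_TEMPORARY | SLAB_MEM_SPREAD, NULL);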