27 Sep, 2016

1 commit

  • For many printks, we want to know which file system issued the message.

    This patch converts most pr_* calls to use the btrfs_* versions instead.
    In some cases, this means adding plumbing to allow call sites access to
    an fs_info pointer.

    fs/btrfs/check-integrity.c is left alone for another day.
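
    A minimal before/after sketch of the conversion (the call site and
    message text are illustrative; btrfs_info() is one of the helpers the
    patch converts to):

        /* before: no way to tell which mounted file system spoke up */
        pr_info("BTRFS: checking UUID tree\n");

        /* after: the helper prefixes the message with the fs identity */
        btrfs_info(fs_info, "checking UUID tree");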

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
     

26 Sep, 2016

1 commit

  • We have a lot of random ints in btrfs_fs_info that can be put into flags. This
    is mostly equivalent, with the exception of how we deal with quota going on or
    off: instead of just having a pending state that the current quota_enabled gets
    set to, we now set a flag when we are turning it on or off and deal with that
    appropriately. Thanks,
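
    A hedged sketch of the flag-based approach; the bit names are assumed,
    and the call site is illustrative:

        /* a standalone int such as fs_info->quota_enabled becomes a bit
         * in a shared fs_info->flags word */
        if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
                return 0;       /* quota already on */
        set_bit(BTRFS_FS_QUOTA_ENABLING, &fs_info->flags);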

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

25 Aug, 2016

1 commit

  • Refactor the btrfs_qgroup_insert_dirty_extent() function into two functions:
    1. btrfs_qgroup_insert_dirty_extent_nolock()
    Almost the same as the original code.
    For delayed_ref usage, where the caller already holds the delayed refs lock.

    Change the return value type to int, since the caller never needs the
    pointer, but only needs to know whether it has to free the allocated
    memory.

    2. btrfs_qgroup_insert_dirty_extent()
    The more encapsulated version.

    It takes the delayed_refs lock, allocates the memory, checks whether
    quota is enabled, and so on (see the sketch below).

    The original design was to keep the exported functions to a minimum, but
    as more btrfs hacks get exposed, like replacing paths during balance, we
    need to record dirty extents manually, so we have to add such functions.

    Also, add comments for both functions, to inform developers how to keep
    qgroups correct when doing such hacks.
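
    A rough sketch of the encapsulated variant from point 2, assuming the
    record structure and locking look like the surrounding qgroup code of
    that era (not the exact upstream body):

        int btrfs_qgroup_insert_dirty_extent(struct btrfs_trans_handle *trans,
                                             struct btrfs_fs_info *fs_info,
                                             u64 bytenr, u64 num_bytes)
        {
                struct btrfs_qgroup_extent_record *record;
                struct btrfs_delayed_ref_root *delayed_refs;
                int ret;

                /* quota enabled check */
                if (!fs_info->quota_enabled)
                        return 0;

                /* memory allocation */
                record = kmalloc(sizeof(*record), GFP_NOFS);
                if (!record)
                        return -ENOMEM;

                record->bytenr = bytenr;
                record->num_bytes = num_bytes;

                /* delayed_refs lock, then the _nolock worker */
                delayed_refs = &trans->transaction->delayed_refs;
                spin_lock(&delayed_refs->lock);
                ret = btrfs_qgroup_insert_dirty_extent_nolock(fs_info,
                                delayed_refs, record);
                spin_unlock(&delayed_refs->lock);

                /* nonzero return: extent already recorded, free our copy */
                if (ret > 0)
                        kfree(record);
                return 0;
        }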

    Cc: Mark Fasheh
    Signed-off-by: Qu Wenruo
    Reviewed-and-Tested-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Qu Wenruo
     

26 Jul, 2016

2 commits

  • When using trace events to debug a problem, it's impossible to determine
    which file system generated a particular event. This patch adds a
    macro to prefix standard information to the head of a trace event.

    The extent_state alloc/free events are all that's left without an
    fs_info available.
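
    A sketch of how such a prefix macro can look; the macro name and the
    fsid plumbing are assumptions, %pU is the kernel's UUID format
    specifier:

        /* each btrfs event stores the fsid and prints it first */
        #define TP_printk_btrfs(fmt, args...) \
                TP_printk("%pU: " fmt, __entry->fsid, args)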

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • BTRFS uses a variety of slab caches to satisfy internal needs.
    Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT flag,
    meaning allocations from the caches are going to be accounted as
    SReclaimable. At the same time btrfs is not registering any shrinkers
    whatsoever, thus preventing memory from the slabs from ever being shrunk.
    This means those caches are not in fact reclaimable.

    To fix this, remove SLAB_RECLAIM_ACCOUNT from all caches apart from the
    inode cache, since that one is freed by the generic VFS super_block
    shrinker. Also mark the transaction-related caches as SLAB_TEMPORARY,
    to better document the lifetime of the objects (it just translates
    to SLAB_RECLAIM_ACCOUNT).
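
    A before/after sketch for one of the caches, using the transaction
    handle cache as the example (flag combinations are illustrative):

        /* before: accounted as SReclaimable although no shrinker exists */
        cache = kmem_cache_create("btrfs_trans_handle",
                                  sizeof(struct btrfs_trans_handle), 0,
                                  SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
                                  NULL);

        /* after: SLAB_TEMPORARY documents the short object lifetime
         * (today it just translates to SLAB_RECLAIM_ACCOUNT) */
        cache = kmem_cache_create("btrfs_trans_handle",
                                  sizeof(struct btrfs_trans_handle), 0,
                                  SLAB_TEMPORARY | SLAB_MEM_SPREAD,
                                  NULL);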

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

07 Jan, 2016

1 commit

  • btrfs_delayed_extent_op can be packed in a better way: it's 40 bytes now
    and has 8 unused bytes. Reducing the level type to u8 makes it possible
    to squeeze it into the padding byte after the key. The bitfields were
    switched to bool as there's space to store the full byte without
    increasing the whole structure; besides that, the generated assembly is
    smaller.

    struct btrfs_delayed_extent_op {
            struct btrfs_disk_key  key;             /*     0    17 */
            u8                     level;           /*    17     1 */
            bool                   update_key;      /*    18     1 */
            bool                   update_flags;    /*    19     1 */
            bool                   is_data;         /*    20     1 */

            /* XXX 3 bytes hole, try to pack */

            u64                    flags_to_set;    /*    24     8 */

            /* size: 32, cachelines: 1, members: 6 */
            /* sum members: 29, holes: 1, sum holes: 3 */
            /* last cacheline: 32 bytes */
    };

    The final size is 32 bytes, which gives +26 objects per slab page.

       text    data     bss     dec     hex filename
     938811   43670   23144 1005625   f5839 fs/btrfs/btrfs.ko.before
     938747   43670   23144 1005561   f57f9 fs/btrfs/btrfs.ko.after

    Signed-off-by: David Sterba

    David Sterba
     

27 Oct, 2015

1 commit

  • Between btrfs_alloc_reserved_file_extent() and
    btrfs_add_delayed_qgroup_reserve(), there is a window in which the
    delayed_refs are run and the delayed ref head may be freed before
    btrfs_add_delayed_qgroup_reserve().

    This will cause btrfs_add_delayed_qgroup_reserve() to return -ENOENT,
    and cause the transaction to be aborted.

    This patch records the qgroup reserve space info in the delayed_ref_head
    at btrfs_add_delayed_ref() time, to eliminate the race window.
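
    A minimal sketch of the idea; the two fields on the delayed ref head
    are assumptions based on the description above:

        /* at btrfs_add_delayed_ref() time, stash the reserved space on
         * the head itself, so nobody has to look the head up again later
         * (and race with it being run and freed) */
        head_ref->qgroup_ref_root = ref_root;
        head_ref->qgroup_reserved = reserved;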

    Reported-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     

26 Oct, 2015

2 commits

  • In the kernel 4.2 merge window we had big changes to the implementation
    of delayed references and qgroups which made the no_quota field of delayed
    references unused. More specifically, the no_quota field is not used
    anymore as of:

    commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")

    Leaving the no_quota field in place actually prevents delayed references
    from getting merged, which in turn causes the following BUG_ON(), at
    fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:

    static int run_delayed_tree_ref(...)
    {
            (...)
            BUG_ON(node->ref_mod != 1);
            (...)
    }

    This happens on a scenario like the following:

    1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.

    2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
    It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.

    3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
    It's not merged with the reference at the tail of the list of refs
    for bytenr X because that reference, Ref2, is incompatible
    due to Ref2->no_quota != Ref3->no_quota.

    4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
    It's not merged with the reference at the tail of the list of refs
    for bytenr X because that reference, Ref3, is incompatible
    due to Ref3->no_quota != Ref4->no_quota.

    5) We run delayed references, trigger merging of delayed references,
    through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().

    6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
    all other conditions are satisfied too. So Ref1 gets a ref_mod
    value of 2.

    7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
    all other conditions are satisfied too. So Ref2 gets a ref_mod
    value of 2.

    8) Ref1 and Ref2 aren't merged, because they have different values
    for their no_quota field.

    9) Delayed reference Ref1 is picked for running (select_delayed_ref()
    always prefers references with an action == BTRFS_ADD_DELAYED_REF).
    So run_delayed_tree_ref() is called for Ref1, which triggers the
    BUG_ON because Ref1->ref_mod != 1 (it equals 2).

    So fix this by removing the no_quota field, as it's not used anymore as
    of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
    qgroup mechanism.").

    The use of no_quota was also buggy in at least two places:

    1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
    no_quota to 0 instead of 1 when the following condition was true:
    is_fstree(ref_root) || !fs_info->quota_enabled

    2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
    reset a node's no_quota when the condition "!is_fstree(root_objectid)
    || !root->fs_info->quota_enabled" was true but we did it only in
    an unused local stack variable, that is, we never reset the no_quota
    value in the node itself.
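
    A sketch of the second bug; the surrounding code is approximate, but it
    shows how the result only ever lived in a local variable:

        int no_quota = node->no_quota;

        if (!is_fstree(root_objectid) || !root->fs_info->quota_enabled)
                no_quota = 1;   /* bug: node->no_quota is never updated */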

    This fixes the remainder of the problems several people have been having
    when running delayed references, mostly while a balance is running in
    parallel, on a 4.2+ kernel.

    Very special thanks to Stéphane Lesimple for helping debugging this issue
    and testing this fix on his multi terabyte filesystem (which took more
    than one day to balance alone, plus fsck, etc).

    Also, this fixes a deadlock issue when using the clone ioctl with qgroups
    enabled, as reported by Elias Probst on the mailing list. The deadlock
    happens because after calling btrfs_insert_empty_item() we have our path
    holding a write lock on a leaf of the fs/subvol tree, and then, before
    releasing the path, we called check_ref(), which did backref walking when
    qgroups are enabled, and tried to read lock the same leaf. The trace for
    this case is the following:

    INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
    (...)
    Call Trace:
    [] schedule+0x74/0x83
    [] btrfs_tree_read_lock+0xc0/0xea
    [] ? wait_woken+0x74/0x74
    [] btrfs_search_old_slot+0x51a/0x810
    [] btrfs_next_old_leaf+0xdf/0x3ce
    [] ? ulist_add_merge+0x1b/0x127
    [] __resolve_indirect_refs+0x62a/0x667
    [] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
    [] find_parent_nodes+0xaf3/0xfc6
    [] __btrfs_find_all_roots+0x92/0xf0
    [] btrfs_find_all_roots+0x45/0x65
    [] ? btrfs_get_tree_mod_seq+0x2b/0x88
    [] check_ref+0x64/0xc4
    [] btrfs_clone+0x66e/0xb5d
    [] btrfs_ioctl_clone+0x48f/0x5bb
    [] ? native_sched_clock+0x28/0x77
    [] btrfs_ioctl+0xabc/0x25cb
    (...)

    The problem goes away by eliminating check_ref(), which is no longer
    needed as its purpose was to get a value for the no_quota field of
    a delayed reference (this patch removes the no_quota field as mentioned
    earlier).

    Reported-by: Stéphane Lesimple
    Tested-by: Stéphane Lesimple
    Reported-by: Elias Probst
    Reported-by: Peter Becker
    Reported-by: Malte Schröder
    Reported-by: Derek Dongray
    Reported-by: Erkki Seppala
    Cc: stable@vger.kernel.org # 4.2+
    Signed-off-by: Filipe Manana
    Reviewed-by: Qu Wenruo

    Filipe Manana
     
  • In the kernel 4.2 merge window we had a refactoring/rework of the delayed
    references implementation in order to fix certain problems with qgroups.
    However that rework introduced one more regression that leads to the
    following trace when running delayed references for metadata:

    [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
    [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
    [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
    [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
    [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
    [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
    [35908.065201] RIP: 0010:[] [] insert_inline_extent_backref+0x52/0xb1 [btrfs]
    [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293
    [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
    [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
    [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
    [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
    [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
    [35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
    [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
    [35908.065201] Stack:
    [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
    [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
    [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
    [35908.065201] Call Trace:
    [35908.065201] [] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
    [35908.065201] [] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
    [35908.065201] [] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
    [35908.065201] [] ? join_transaction.isra.10+0x25/0x41f [btrfs]
    [35908.065201] [] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
    [35908.065201] [] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
    [35908.065201] [] delayed_ref_async_start+0x3c/0x7b [btrfs]
    [35908.065201] [] normal_work_helper+0x14c/0x32a [btrfs]
    [35908.065201] [] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
    [35908.065201] [] process_one_work+0x24a/0x4ac
    [35908.065201] [] worker_thread+0x206/0x2c2
    [35908.065201] [] ? rescuer_thread+0x2cb/0x2cb
    [35908.065201] [] ? rescuer_thread+0x2cb/0x2cb
    [35908.065201] [] kthread+0xef/0xf7
    [35908.065201] [] ? kthread_parkme+0x24/0x24
    [35908.065201] [] ret_from_fork+0x3f/0x70
    [35908.065201] [] ? kthread_parkme+0x24/0x24
    [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 0b 4c 8b 45 30 8b 4d 28 45 31
    [35908.065201] RIP [] insert_inline_extent_backref+0x52/0xb1 [btrfs]
    [35908.065201] RSP
    [35908.310885] ---[ end trace fe4299baf0666457 ]---

    This happens because the new delayed references code no longer merges
    delayed references that have different sequence values. The following
    steps are an example sequence leading to this issue:

    1) Transaction N starts, fs_info->tree_mod_seq has value 0;

    2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
    bytenr A is created, with a value of 1 and a seq value of 0;

    3) fs_info->tree_mod_seq is incremented to 1;

    4) Extent buffer A is deleted through btrfs_del_items(), which calls
    btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
    latter returns the metadata extent associated with extent buffer A to
    the free space cache (the range is not pinned), because the extent
    buffer was created in the current transaction (N) and writeback never
    happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
    in the extent buffer).
    This creates the delayed reference Ref2 for bytenr A, with a value
    of -1 and a seq value of 1;

    5) Delayed reference Ref2 is not merged with Ref1 when we create it,
    because they have different sequence numbers (decided at
    add_delayed_ref_tail_merge());

    6) fs_info->tree_mod_seq is incremented to 2;

    7) Some task attempts to allocate a new extent buffer (done at
    extent-tree.c:find_free_extent()), but due to heavy fragmentation
    and running low on metadata space the clustered allocation fails
    and we fall back to unclustered allocation, which finds the
    extent at offset A, so a new extent buffer at offset A is allocated.
    This creates delayed reference Ref3 for bytenr A, with a value of 1
    and a seq value of 2;

    8) Ref3 is merged with neither Ref2 nor Ref1, again because they
    all have different seq values;

    9) We start running the delayed references (__btrfs_run_delayed_refs());

    10) The delayed Ref1 is the first one being applied, which ends up
    creating an inline extent backref in the extent tree;

    11) Next the delayed reference Ref3 is selected for execution, and not
    Ref2, because select_delayed_ref() always gives preference to
    positive references (those having an action of BTRFS_ADD_DELAYED_REF);

    12) When running Ref3 we encounter the inline extent backref already
    in the extent tree at insert_inline_extent_backref(), which makes
    us hit the following BUG_ON:

    BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);

    This is always true because owner corresponds to the level of the
    extent buffer/btree node in the btree.

    For the scenario described above we hit the BUG_ON because we never merge
    references that have different seq values.
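
    A much simplified sketch of what the merging must achieve for the
    scenario above; btrfs_check_delayed_seq() is the existing hold-back
    check, the rest of the names are approximate:

        /* refs queued for bytenr A before merging:
         *   Ref1: ADD,  ref_mod 1, seq 0
         *   Ref2: DROP, ref_mod 1, seq 1
         *   Ref3: ADD,  ref_mod 1, seq 2
         * If no backref walker still pins a tree_mod_seq covering these
         * values, Ref1 and Ref2 cancel out and only Ref3 is run, so
         * insert_inline_extent_backref() is reached exactly once. */
        if (!btrfs_check_delayed_seq(fs_info, delayed_refs, ref->seq))
                merge_ref(trans, delayed_refs, head, ref, 0);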

    We used to do the merging before the 4.2 kernel; more specifically,
    before the commits:

    c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
    c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")

    This issue became more exposed after the following change that was added
    to 4.2 as well:

    cffc3374e567 ("Btrfs: fix order by which delayed references are run")

    That change in turn fixed another regression introduced by the two
    commits previously mentioned.

    So fix this by bringing back the delayed reference merge code, with the
    proper adaptations so that it operates against the new data structure
    (linked list vs old red black tree implementation).

    This issue was hit running fstest btrfs/063 in a loop. Several people
    have reported this issue on the mailing list when running on kernels 4.2+.

    Very special thanks to Stéphane Lesimple for helping debugging this issue
    and testing this fix on his multi terabyte filesystem (which took more
    than one day to balance alone, plus fsck, etc).

    Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
    Reported-by: Peter Becker
    Reported-by: Stéphane Lesimple
    Tested-by: Stéphane Lesimple
    Reported-by: Malte Schröder
    Reported-by: Derek Dongray
    Reported-by: Erkki Seppala
    Cc: stable@vger.kernel.org # 4.2+
    Signed-off-by: Filipe Manana
    Reviewed-by: Liu Bo

    Filipe Manana
     

11 Apr, 2015

1 commit

  • As we delete large extents, we end up doing huge amounts of COW in order
    to delete the corresponding crcs. This adds accounting so that we keep
    track of that space, and flushing of delayed refs so that we don't build
    up too much delayed crc work.

    This helps limit the delayed work that must be done at commit time and
    tries to avoid ENOSPC aborts because the crcs eat all the global
    reserves.

    Signed-off-by: Chris Mason

    Josef Bacik
     

10 Jun, 2014

1 commit

  • Currently qgroups account for space by intercepting delayed ref updates to
    fs trees. It does this by adding sequence numbers to delayed ref updates so
    that it can figure out how the tree looked before the update, so we can
    adjust the counters properly. The problem with this is that it does not
    allow delayed refs to be merged, so if you are, say, defragging an extent
    with 5k snapshots pointing to it, we will thrash the delayed ref lock
    because we need to go back and manually merge these things together.
    Instead we want to process quota changes when we know they are going to
    happen, like when we first allocate an extent, when we free a reference
    for an extent, when we add new references, etc. This patch accomplishes
    this by only adding qgroup operations for real ref changes. We only modify
    the sequence number when we need to look up roots for bytenrs; this
    reduces the amount of churn on the sequence number and allows us to merge
    delayed refs as we add them most of the time. This patch encompasses a
    bunch of architectural changes:

    1) qgroup ref operations: instead of tracking qgroup operations through
    the delayed refs, we simply add new ref operations whenever we notice
    that we need to, i.e. when we've modified the refs themselves (see the
    sketch below).

    2) tree mod seq: we no longer have this separation of major/minor
    counters. This makes the sequence number handling much more sane and we
    can remove some locking that was needed to protect the counter.

    3) delayed ref seq: we now read the tree mod seq number and use that as
    our sequence. This means each new delayed ref doesn't have its own unique
    sequence number; rather, whenever we go to look up backrefs we increment
    the sequence number so we can make sure to keep any new operations from
    screwing up our world view at that given point. This allows us to merge
    delayed refs during runtime.

    With all of these changes the delayed ref stuff is a little saner and the qgroup
    accounting stuff no longer goes negative in some cases like it was before.
    Thanks,
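
    A hedged sketch of change 1: a qgroup operation is recorded explicitly
    at the point the real reference set changes. The function name and the
    operation enum are assumptions modeled on this description:

        /* called where we actually add or remove a reference, instead of
         * tagging every delayed ref update with a sequence number */
        ret = btrfs_qgroup_record_ref(trans, root->fs_info,
                                      root->root_key.objectid,
                                      bytenr, num_bytes,
                                      BTRFS_QGROUP_OPER_ADD_EXCL, 0);
        if (ret)
                return ret;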

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

21 Mar, 2014

1 commit

  • While we update an existing ref head's extent_op, we're not holding
    its spinlock, so while we're updating its extent_op contents (key,
    flags) we can have a task running __btrfs_run_delayed_refs() that
    holds the ref head's lock and sets its extent_op to NULL right after
    the updating task checked that the extent_op was not NULL.
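
    A sketch of the fix under the assumption that the update simply has to
    happen while holding the head's spinlock (names approximate; "update"
    stands for the incoming extent_op):

        spin_lock(&existing_ref->lock);
        /* now __btrfs_run_delayed_refs() cannot set extent_op to NULL
         * between our check and our update */
        if (existing_ref->extent_op) {
                if (update->update_key)
                        existing_ref->extent_op->key = update->key;
                existing_ref->extent_op->flags_to_set |=
                        update->flags_to_set;
        }
        spin_unlock(&existing_ref->lock);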

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

11 Mar, 2014

2 commits

  • The argument 'last' wasn't used; all callers supplied a NULL value
    for it. Also removed unnecessary intermediate storage of the result
    of key comparisons.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik

    Filipe Manana
     
  • When we didn't find the exact ref head we were looking for, if
    return_bigger != 0 we set a new search key to match either the
    next node after the last one we found or the first one in the
    ref heads rb tree, and then did another full tree search. In both
    cases this ended up being pointless, as we would end up returning
    an entry we already had before repeating the search.
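
    A sketch of the simplification: step to the next rbtree node (or wrap
    around to the first) instead of repeating the whole search. The field
    names are assumptions matching this file at the time:

        if (entry && return_bigger) {
                n = rb_next(&entry->href_node);
                if (!n)
                        n = rb_first(&delayed_refs->href_root);
                entry = rb_entry(n, struct btrfs_delayed_ref_head,
                                 href_node);
                return entry;
        }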

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik

    Filipe Manana
     

29 Jan, 2014

3 commits

  • Currently we have two rb-trees, one for delayed ref heads and one for all of the
    delayed refs, including the delayed ref heads. When we process the delayed refs
    we have to hold onto the delayed ref lock for all of the selecting and merging
    and such, which results in quite a bit of lock contention. This was solved
    by having a waitqueue and allowing only one flusher at a time; however,
    this hurts if we get a lot of delayed refs queued up.

    So instead just have an rb tree for the delayed ref heads, and then attach the
    delayed ref updates to an rb tree that is per delayed ref head. Then we only
    need to take the delayed ref lock when adding new delayed refs and when
    selecting a delayed ref head to process, all the rest of the time we deal with a
    per delayed ref head lock which will be much less contentious.

    The locking rules for this get a little more complicated since we have to lock
    up to 3 things to properly process delayed refs, but I will address that problem
    later. For now this passes all of xfstests and my overnight stress tests.
    Thanks,
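
    A structural sketch of the change, with members trimmed to the ones
    discussed (not the full upstream definitions):

        struct btrfs_delayed_ref_root {
                struct rb_root href_root;   /* ref heads only */
                spinlock_t lock;            /* taken to add refs and to
                                               select a head to process */
                /* other members omitted */
        };

        struct btrfs_delayed_ref_head {
                struct rb_node href_node;   /* linked into href_root */
                struct rb_root ref_root;    /* this head's own updates */
                spinlock_t lock;            /* per head, far less
                                               contended */
                /* other members omitted */
        };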

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • When we have data deduplication on, we'll hang in the merge part
    because it needs to verify every queued delayed data ref related to
    this disk offset, but we may have millions of refs.

    And in the case of delayed data refs, we don't usually have too many
    data refs to merge.

    So it's safe to shut merging down for data refs.
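
    A minimal sketch of the change, assuming the merge entry point can see
    the head's is_data flag:

        /* data refs: potentially millions of entries per disk offset when
         * dedup is in use, and little to gain, so don't try to merge */
        if (head->is_data)
                return;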

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     
  • The way we process delayed refs is:
    1) get a bunch of head refs,
    2) pick up one head ref,
    3) go one node back for any delayed ref updates.

    The head refs are also linked in the same rbtree as the delayed refs, so
    in stage 1) we have to walk them one by one, including not only head refs
    but also delayed refs.

    When we have a great number of delayed refs pending to process,
    this will cost a lot of time.

    Here we introduce a head-ref-specific rbtree: it only holds head refs,
    so those troubles go away.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     

01 Sep, 2013

2 commits

  • make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__

    I tried to filter out the warnings for which patches have already
    been sent to the mailing list and are pending inclusion in btrfs-next.

    All these changes should be obviously safe.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • This shows exactly how btrfs processes the delayed refs onto disk,
    which is very helpful for understanding the delayed ref mechanism and
    debugging related bugs.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     

07 May, 2013

2 commits

  • Sequence numbers for delayed refs have been introduced in the first version
    of the qgroup patch set. To solve the problem of find_all_roots on a busy
    file system, the tree mod log was introduced. The sequence numbers for that
    were simply shared between those two users.

    However, at one point in qgroup's quota accounting, there's a statement
    accessing the previous sequence number that's still just doing (seq - 1),
    just as it had to in the very first version.

    To satisfy that requirement, this patch makes the sequence number counter 64
    bit and splits it into a major part (used for qgroup sequence number
    counting) and a minor part (incremented for each tree modification in the
    log). This enables us to go exactly one major step backwards, as required
    for qgroups, while still incrementing the sequence counter for tree mod log
    insertions to keep track of their order. Keeping them in a single variable
    means there's no need to change all the code dealing with comparisons of two
    sequence numbers.

    The sequence number is reset to 0 on commit (not new in this patch), which
    ensures we won't overflow the two 32 bit counters.

    Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
    from the tree mod log code may happen.
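
    A hedged sketch of the split, with illustrative helpers rather than the
    exact upstream ones:

        /* upper 32 bits: major (qgroup) part; lower 32 bits: minor part,
         * bumped for every tree mod log insertion */
        static inline u64 seq_major(u64 seq)
        {
                return seq >> 32;
        }

        static inline u64 seq_inc_minor(u64 seq)
        {
                return seq + 1;
        }

        /* "exactly one major step backwards", as the qgroup accounting
         * requires */
        static inline u64 seq_prev_major(u64 seq)
        {
                return (seq_major(seq) - 1) << 32;
        }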

    Signed-off-by: Jan Schmidt
    Signed-off-by: Josef Bacik

    Jan Schmidt
     
  • A user reported a panic while running a balance. What was happening was he
    was relocating a block, which added the reference to the relocation tree.
    Then relocation would walk through the relocation tree, drop that
    reference and free that block, and then it would walk down a snapshot
    which referenced the same block and add another ref to the block. The
    problem is this was all happening in the same transaction, so the parent
    block was freed up when we dropped our reference, making it immediately
    available for allocation, and then it was used _again_ to add a reference
    for the same block from a different snapshot. This resulted in something
    like this in the delayed ref tree

    add ref to 90234880, parent=2067398656, ref_root 1766, level 1
    del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1
    add ref to 90234880, parent=2067398656, ref_root 1767, level 1

    As you can see, the ref_roots don't match, because when we inc the ref we
    use the header owner, which is the original tree the block belonged to,
    instead of the data reloc tree. Then when we remove the extent we use the
    reloc tree objectid. But none of this matters, since it is a shared
    reference, which means only the parent matters. When the delayed ref stuff
    runs it adds all the increments first, and then does all the drops, to
    make sure that we don't delete the ref if we net a positive ref count. But
    tree blocks aren't allowed to have multiple refs from the same block, so
    this panics when it tries to add the second ref. We need the add and the
    drop to cancel each other out in memory so we only do the final add.

    So to fix this we need to adjust how the delayed refs are added to the tree.
    Only the ref_root matters when it is a normal backref, and only the parent
    matters when it is a shared backref. So make our decision based on what ref
    type we have. This allows us to keep the ref_root in memory in case anybody
    wants to use it for something else, and it allows the delayed refs to be merged
    properly so we don't end up with this panic.

    With this patch the user's image no longer panics on mount, and it has a
    clean fsck after a normal mount/umount cycle. Thanks,
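
    A simplified sketch of the decision rule; the helper and field names
    are approximate, not the exact upstream code:

        static int comp_tree_refs(struct btrfs_delayed_tree_ref *a,
                                  struct btrfs_delayed_tree_ref *b)
        {
                /* shared backref: only the parent identifies the ref */
                if (a->node.type == BTRFS_SHARED_BLOCK_REF_KEY)
                        return a->parent != b->parent;
                /* normal backref: only the ref_root identifies the ref */
                return a->root != b->root;
        }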

    Cc: stable@vger.kernel.org
    Reported-by: Roman Mamedov
    Signed-off-by: Josef Bacik

    Josef Bacik
     

29 Aug, 2012

2 commits

  • Daniel Blueman reported a bug with fio+balance on a ramdisk setup.
    Basically what happens is the balance relocates a tree block, which will
    drop the implicit refs for all of its children and add a full backref.
    Once the block is relocated we have to add the implicit refs back, so when
    we cow the block again we add the implicit refs for its children back.
    The problem comes when the original drop ref doesn't get run before we add
    the implicit refs back. The delayed ref stuff will specifically prefer ADD
    operations over DROP to keep us from freeing up an extent that will have
    references to it, so we try to add the implicit ref before it is actually
    removed and we panic. This worked fine before because the add would have
    just canceled the drop out and we would have been fine. But the backref
    walking work needs to be able to freeze the delayed ref stuff in time, so
    we have this ever increasing sequence number that gets attached to all new
    delayed ref updates, which keeps us from merging refs, and we run into
    this issue.

    So to fix this we need to merge delayed refs. So every time we run a
    clustered ref we need to try and merge all of its delayed refs. The
    backref walking stuff locks the delayed ref head before processing, so if
    we have it locked we are safe to merge any refs inside of the sequence
    number. If there is no sequence number we can merge all refs. Doing this
    not only fixes our bug but keeps the delayed ref code from adding and
    removing useless refs, and batches together multiple refs into one search
    instead of one search per delayed ref, which will really help our commit
    times. I ran this with Daniel's test and 276 and I haven't seen any
    problems. Thanks,

    Reported-by: Daniel J Blueman
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Commit a168650c introduced a waiting mechanism to prevent busy waiting in
    btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
    where a tree_mod_seq is held while waiting for the io to complete, while
    the end_io calls btrfs_run_delayed_refs.

    This whole mechanism is unnecessary. If not enough runnable refs are
    available to satisfy count, just return, as count is more like a guideline
    than a strict requirement.

    In case we have to run all refs, the transaction commit makes sure that no
    other threads are working in the transaction anymore, so we just assert
    here that no refs are blocked.

    Signed-off-by: Arne Jansen
    Signed-off-by: Chris Mason

    Arne Jansen
     

10 Jul, 2012

1 commit

  • We've got two mechanisms, both required for reliable backref resolving
    (tree mod log and holding back delayed refs). You cannot make use of one
    without the other. So instead of requiring the user of this mechanism to
    set up both correctly, we join them into a single interface.

    Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list
    as we did before, which was of no value.
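
    A sketch of the joined interface; btrfs_get_tree_mod_seq() and
    btrfs_put_tree_mod_seq() are the entry points described, the usage
    around them is illustrative:

        struct seq_list elem = {};

        /* one call now both logs tree modifications and holds back
         * delayed refs at this sequence point */
        btrfs_get_tree_mod_seq(fs_info, &elem);
        /* ... resolve backrefs against a consistent view ... */
        btrfs_put_tree_mod_seq(fs_info, &elem);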

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     

31 May, 2012

1 commit

  • The sequence number for delayed refs is needed to postpone certain delayed
    refs for a very short period while walking backrefs. Before the tree
    modification log, we thought we'd only have to hold back those references
    that don't have a counter operation.

    Now that we have the tree mod log, we're rewinding fs tree blocks to a
    defined consistent state. We cannot know in advance for which tree block
    we'll be doing rewind operations later. Therefore, we must postpone all
    the delayed refs for fs-tree blocks, even those having a counter operation.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     

04 Jan, 2012

2 commits

  • Now that we may be holding back delayed refs for a limited period, we
    might end up having no runnable delayed refs. Without this commit, we'd
    do busy waiting in that thread until another (runnable) ref arrives.
    Instead, we detect this situation and use a waitqueue, such that
    we only try to run more refs after
    a) another runnable ref was added, or
    b) delayed refs are no longer held back
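
    A hedged sketch of the mechanism; the waitqueue field and the
    num_runnable_refs() helper are assumptions standing in for the real
    names:

        /* instead of spinning, sleep until woken by either a) or b) */
        wait_event(delayed_refs->seq_wait,
                   num_runnable_refs(delayed_refs) > 0);

        /* both a) and b) then do: wake_up(&delayed_refs->seq_wait); */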

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     
  • When processing a delayed ref, first check if there are still old refs in
    the process of being added. If so, put this ref back into the tree. To
    avoid looping on this ref, choose a newer one in the next loop;
    btrfs_find_ref_cluster has to take care of that.

    Signed-off-by: Arne Jansen
    Signed-off-by: Jan Schmidt

    Arne Jansen