27 Jul, 2020

6 commits

  • This patch will add the following sysfs interface:

    /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
    /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
    /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
    /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
    /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags

    These are also available in the output of "btrfs qgroup show".

    /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
    /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
    /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc

    The last 3 rsv related members are not visible to users, but can be very
    useful to debug qgroup limit related bugs.

    Also, to avoid '/' being used in <qgroup_id>, the separator between the
    qgroup level and the qgroup id is changed to '_'.

    The interface is not hidden behind 'debug' as we want this interface to
    be included into production build and to provide another way to read the
    qgroup information besides the ioctls.
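
    As a quick sketch of how this can be read back (the mount point /mnt
    and the qgroup 0/256 here are hypothetical; note the '_' separator in
    the directory name):

    $ UUID=$(findmnt -no UUID /mnt)
    $ cat /sys/fs/btrfs/$UUID/qgroups/0_256/referenced
    $ cat /sys/fs/btrfs/$UUID/qgroups/0_256/rsv_data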

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • vfs_inode is used only for the inode number; everything else requires
    btrfs_inode.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    [ use btrfs_ino ]
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • There's only a single use of vfs_inode in a tracepoint so let's take
    btrfs_inode directly.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • It just forwards its argument to __btrfs_qgroup_release_data.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • It passes btrfs_inode to its callee, so change the interface accordingly.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Before this patch, qgroup completely relies on the per-inode extent io
    tree to detect reserved data space leaks.

    However, a previous bug has already shown how releasing a page before
    btrfs_finish_ordered_io() could lead to a leak, and since that clears
    the QGROUP_RESERVED bit without updating the qgroup rsv, such a leak
    can't be detected by the per-inode extent io tree.

    So this patch adds another (and hopefully the final) safety net to catch
    qgroup data reserved space leak. At least the new safety net catches
    all the leaks during development, so it should be pretty useful in the
    real world.

    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     

19 Feb, 2020

1 commit

  • We clean up the delayed references when we abort a transaction, but we
    leave the pending qgroup extent records behind, leaking memory.

    This patch destroys the extent records when we destroy the delayed refs
    and makes sure they're gone before releasing the transaction.

    Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Jeff Mahoney
    [ Rebased to latest upstream, remove to_qgroup() helper, use
    rbtree_postorder_for_each_entry_safe() wrapper ]
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
     

19 Nov, 2019

1 commit

  • The type name is misleading: a single entry is named 'cache' while this
    normally means a collection of objects. Rename that everywhere. Also,
    the identifier was quite long, making function prototypes harder to
    format.

    Suggested-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     

25 Feb, 2019

4 commits

  • Move reserved data accounting from btrfs_delayed_ref_head to
    btrfs_qgroup_extent_record

    [BUG]
    Btrfs/139 will fail with a high probability if the testing machine (VM)
    has only 2G RAM.

    The result is that the final write succeeds while it should fail with
    EDQUOT, and the fs ends up exceeding the quota limit by 16K.

    The simplified reproducer will be (needs a VM with 2G of RAM):

    $ mkfs.btrfs -f $dev
    $ mount $dev $mnt

    $ btrfs subv create $mnt/subv
    $ btrfs quota enable $mnt
    $ btrfs quota rescan -w $mnt
    $ btrfs qgroup limit -e 1G $mnt/subv

    $ for i in $(seq -w 1 8); do
          xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
          echo "file $i written" > /dev/kmsg
      done
    $ sync
    $ btrfs qgroup show -pcre --raw $mnt

    The last pwrite will not trigger EDQUOT and final 'qgroup show' will
    show something like:

    qgroupid         rfer         excl     max_rfer     max_excl parent  child
    --------         ----         ----     --------     -------- ------  -----
    0/5             16384        16384         none         none ---     ---
    0/256      1073758208   1073758208         none   1073741824 ---     ---

    And 1073758208 is larger than 1073741824.

    [CAUSE]
    It's a bug in btrfs qgroup data reserved space management.

    For the quota limit, we must ensure that:
    reserved (data + metadata) + rfer/excl <= limit

    Since rfer/excl are only updated at transaction commit time, special
    care must be taken with the reserved space.

    One important part of reserved space is data, and for a new data extent
    written to disk, we still need to take the reserved space until
    rfer/excl numbers get updated.

    Originally when an ordered extent finishes, we migrate the reserved
    qgroup data space from extent_io tree to delayed ref head of the data
    extent, expecting delayed ref will only be cleaned up at commit
    transaction time.

    However, on a machine with little RAM, memory pressure can cause dirty
    pages to be flushed back to disk without committing a transaction.

    The related events will be something like:

    file 1 written
    btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
    btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
    btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
    btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
    btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
    cleanup_ref_head: num_bytes=54947840
    cleanup_ref_head: num_bytes=5636096
    cleanup_ref_head: num_bytes=569344
    cleanup_ref_head: num_bytes=57344
    cleanup_ref_head: num_bytes=8192
    ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
    file 2 written
    ...
    file 8 written
    cleanup_ref_head: num_bytes=8192
    ...
    btrfs_commit_transaction <<< the only transaction committed during
    the test

    When file 2 is written, we have already freed the 128M of qgroup data
    space reserved for ino 258. Thus later writes won't trigger EDQUOT.

    This allows us to write more data beyond qgroup limit.

    In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.

    [FIX]
    By moving reserved qgroup data space from btrfs_delayed_ref_head to
    btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
    space won't be freed half way before commit transaction, thus fix the
    problem.
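
    With the fix applied, re-running the reproducer above should make the
    later writes fail instead, roughly like this (hypothetical output;
    "Disk quota exceeded" is the strerror text for EDQUOT):

    $ xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_8
    pwrite: Disk quota exceeded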

    Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>

    Qu Wenruo
     
  • Since it's replaced by the new delayed subtree swap code, remove the
    original code.

    The cleanup is small since most of its core functionality is still used
    by the delayed subtree swap tracing.

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Before this patch, qgroup code traces the whole subtree of subvolume and
    reloc trees unconditionally.

    This makes qgroup numbers consistent, but it could cause tons of
    unnecessary extent tracing, which causes a lot of overhead.

    However, for the subtree swap done by balance, we can just swap both
    subtrees, because they contain the same contents and tree structure,
    so qgroup numbers won't change.

    It's only the race window between subtree swap and transaction commit
    that could cause qgroup numbers to change.

    This patch will delay the qgroup subtree scan until COW happens for the
    subtree root.

    So if there are no other operations on the fs, balance won't cause any
    extra qgroup overhead (the best case scenario).
    Depending on the workload, most of the subtree scan can still be
    avoided.

    Only in the worst case scenario does it fall back to the old subtree
    swap overhead (scanning all swapped subtrees).

    [[Benchmark]]
    Hardware:
    VM 4G vRAM, 8 vCPUs,
    disk is using 'unsafe' cache mode,
    backing device is SAMSUNG 850 evo SSD.
    Host has 16G ram.

    Mkfs parameter:
    --nodesize 4K (To bump up tree size)

    Initial subvolume contents:
    4G data copied from /usr and /lib.
    (With enough regular small files)

    Snapshots:
    16 snapshots of the original subvolume.
    each snapshot has 3 random files modified.

    balance parameter:
    -m

    So the content should be pretty similar to a real world root fs layout.

    And after file system population, there is no other activity, so it
    should be the best case scenario.

                          | v4.20-rc1 | w/ patchset | diff
    -----------------------------------------------------------
    relocated extents     | 22615     | 22457       |  -0.1%
    qgroup dirty extents  | 163457    | 121606      | -25.6%
    time (sys)            | 22.884s   | 18.842s     | -17.6%
    time (real)           | 27.724s   | 22.884s     | -17.5%

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • To allow delayed subtree swap rescan, btrfs needs to record per-root
    information about which tree blocks get swapped. This patch introduces
    the required infrastructure.

    The designed workflow will be:

    1) Record the subtree root block that gets swapped.

    During subtree swap:
    O = Old tree blocks
    N = New tree blocks
             reloc tree                     subvolume tree X
                Root                              Root
               /    \                            /    \
             NA      OB                        OA      OB
            / |      | \                      / |      | \
          NC  ND    OE  OF                  OC  OD    OE  OF

    In this case, NA and OA are going to be swapped, record (NA, OA) into
    subvolume tree X.

    2) After subtree swap.
             reloc tree                     subvolume tree X
                Root                              Root
               /    \                            /    \
             OA      OB                        NA      OB
            / |      | \                      / |      | \
          OC  OD    OE  OF                  NC  ND    OE  OF

    3a) COW happens for OB
    If we are going to COW tree block OB, we check OB's bytenr against
    tree X's swapped_blocks structure.
    If it doesn't match any record, nothing will happen.

    3b) COW happens for NA
    Check NA's bytenr against tree X's swapped_blocks, and get a hit.
    Then we do subtree scan on both subtrees OA and NA.
    This results in 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).

    Then no matter what we do to subvolume tree X, qgroup numbers will
    still be correct.
    Then NA's record gets removed from X's swapped_blocks.

    4) Transaction commit
    Any record in X's swapped_blocks gets removed; since there was no
    modification to the swapped subtrees, there is no need to trigger a
    heavy qgroup subtree rescan for them.

    This will introduce 128 bytes of overhead for each btrfs_root even if
    qgroup is not enabled. This is to reduce memory allocations and
    potential failures.

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     

15 Oct, 2018

3 commits

  • Some qgroup trace events like btrfs_qgroup_release_data() and
    btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
    not enabled.

    This is caused by the lack of a qgroup status check before calling some
    qgroup functions. Thankfully the functions handle the quota-disabled
    case well and just do nothing in that case.

    This patch does the check earlier, before triggering the related trace
    events.

    And for the enable/disable race cases:

    1) For the enabled -> disabled case
    Disabling will wipe out all qgroup data, including reservations and
    excl/rfer numbers. Even if we leak some reservation or numbers, they
    will still be cleared, so nothing will go wrong.

    2) For the disabled -> enabled case
    The current btrfs_qgroup_release_data() uses the extent_io tree to
    ensure we won't underflow the reservation. And for delayed_ref we use
    head->qgroup_reserved to record the reserved space, so in that case
    head->qgroup_reserved should be 0 and we won't underflow.
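
    For example, to watch these trace events from userspace (a sketch;
    assumes tracefs is mounted at /sys/kernel/tracing):

    $ cd /sys/kernel/tracing
    $ echo 1 > events/btrfs/btrfs_qgroup_release_data/enable
    $ echo 1 > events/btrfs/btrfs_qgroup_free_delayed_ref/enable
    $ cat trace_pipe

    With this patch, these events should no longer fire while quotas are
    disabled.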

    CC: stable@vger.kernel.org # 4.14+
    Reported-by: Chris Murphy
    Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • For qgroup_trace_extent_swap(), if we find one leaf that needs to be
    traced, we will also iterate all file extents and trace them.

    This is OK if we're relocating data block groups, but if we're
    relocating metadata block groups, the balance code itself has ensured
    that both subtrees of the file tree and reloc tree contain the same
    contents.

    That is to say, if we're relocating metadata block groups, all file
    extents in the reloc and file trees should match, so there is no need
    to trace them. This should reduce the total number of dirty extents
    processed in metadata block group balance.

    [[Benchmark]] (with all previous enhancement)
    Hardware:
    VM 4G vRAM, 8 vCPUs,
    disk is using 'unsafe' cache mode,
    backing device is SAMSUNG 850 evo SSD.
    Host has 16G ram.

    Mkfs parameter:
    --nodesize 4K (To bump up tree size)

    Initial subvolume contents:
    4G data copied from /usr and /lib.
    (With enough regular small files)

    Snapshots:
    16 snapshots of the original subvolume.
    each snapshot has 3 random files modified.

    balance parameter:
    -m

    So the content should be pretty similar to a real world root fs layout.

                          | v4.19-rc1 | w/ patchset | diff (*)
    -----------------------------------------------------------
    relocated extents     | 22929     | 22851       |  -0.3%
    qgroup dirty extents  | 227757    | 140886      | -38.1%
    time (sys)            | 65.253s   | 37.464s     | -42.6%
    time (real)           | 74.032s   | 44.722s     | -39.6%

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Before this patch, with quota enabled during balance, we need to mark
    the whole subtree dirty for quota.

    E.g.
    OO = Old tree blocks (from file tree)
    NN = New tree blocks (from reloc tree)

         File tree (src)                  Reloc tree (dst)
             OO (a)                           NN (a)
            /    \                           /    \
       (b) OO    OO (c)                 (b) NN    NN (c)
          / \    / \                       / \    / \
        OO  OO  OO  OO (d)               OO  OO  OO  NN (d)

    For old balance + quota case, quota will mark the whole src and dst tree
    dirty, including all the 3 old tree blocks in reloc tree.

    This is doable for a small file tree, or when the new tree blocks are
    all located at a lower level.

    But for a large file tree, or when the new tree blocks are located at
    a higher level, this leads to marking the whole tree dirty, which is
    unbelievably slow.

    This patch will change how we handle such balance with quota enabled
    case.

    Now we will search from (b) and (c) for any new tree blocks whose
    generation is equal to @last_snapshot, and only mark them dirty.

    In the above case, we only need to trace tree blocks NN(b), NN(c) and
    NN(d) (NN(a) will be traced when COW happens for nodeptr modification),
    and also tree blocks OO(b), OO(c), OO(d) (OO(a) will be traced when COW
    happens for nodeptr modification).

    For the above case we only skip 3 tree blocks, but for a larger tree we
    can skip tons of unmodified tree blocks, hugely speeding up balance.

    This patch will introduce a new function,
    btrfs_qgroup_trace_subtree_swap(), which will do the following main
    work:

    1) Read out real root eb
    And setup basic dst_path for later calls
    2) Call qgroup_trace_new_subtree_blocks()
    To trace all new tree blocks in the reloc tree and their counterparts
    in the file tree.

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     

31 Mar, 2018

5 commits

  • For meta_prealloc reservation users, after btrfs_join_transaction() the
    caller will modify trees, so part (or even all) of the meta_prealloc
    reservation should be converted to meta_pertrans and kept until
    transaction commit time.

    This patch introduces a new function,
    btrfs_qgroup_convert_reserved_meta() to do this for META_PREALLOC
    reservation user.

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Btrfs uses 2 different methods to reserve metadata qgroup space.

    1) Reserve at btrfs_start_transaction() time
    This is quite straightforward: the caller will use the allocated trans
    handle to modify b-trees.

    In this case, reserved metadata should be kept until qgroup numbers
    are updated.

    2) Reserve by using block_rsv first, and later btrfs_join_transaction()
    This is more complicated: the caller will reserve space using block_rsv
    first, and then later call btrfs_join_transaction() to get a trans
    handle.

    In this case, before we modify trees, the reserved space can be
    modified on demand, and after btrfs_join_transaction(), such reserved
    space should also be kept until qgroup numbers are updated.

    Since these two types behave differently, split the original "META"
    reservation type into 2 sub-types:

    META_PERTRANS:
    For above case 1)

    META_PREALLOC:
    For reservations that happened before btrfs_join_transaction() of
    case 2)

    NOTE: This patch only converts existing qgroup meta reservation callers
    according to their situation; it does not ensure that all callers
    reserve at the correct time.
    Such fixes will be added in later patches.

    Signed-off-by: Qu Wenruo
    [ update comments ]
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • With this, qgroup is switched to the new separate-type reservation
    system.

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Instead of a single qgroup->reserved, use a new structure,
    btrfs_qgroup_rsv, to store the different types of reservations.

    This patch only updates the header, which is all that is needed to
    compile.

    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • It's provided by the transaction handle.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David Sterba

    Nikolay Borisov
     

30 Jun, 2017

3 commits

  • [BUG]
    For the following case, btrfs can underflow qgroup reserved space
    at an error path:
    (Page size 4K, function names without the "btrfs_" prefix)

    Task A                         | Task B
    ---------------------------------------------------------------------
    Buffered_write [0, 2K)         |
    |- check_data_free_space()     |
    |  |- qgroup_reserve_data()    |
    |     Range aligned to page    |
    |     range [0, 4K)      <<<   |
    |     4K bytes reserved  <<<   |
    |- copy pages to page cache    |
                                   | Buffered_write [2K, 4K)
                                   | |- check_data_free_space()
                                   | |  |- qgroup_reserve_data()
                                   | |     Range aligned to page
                                   | |     range [0, 4K)
                                   | |     Already reserved by A <<<
                                   | |     0 bytes reserved      <<<
                                   | |- delalloc_reserve_metadata()
                                   | |  And it *FAILED* (maybe EDQUOT)
                                   | |- free_reserved_data_space()
                                   |    |- qgroup_free_data()
                                   |       Range aligned to page range
                                   |       [0, 4K)
                                   |       Freeing 4K
    (Special thanks to Chandan for the detailed report and analysis)

    [CAUSE]
    Above, Task B is freeing the reserved data range [0, 4K) which was
    actually reserved by Task A.

    And at writeback time, the page dirtied by Task A will go through the
    writeback routine, which will free the 4K reserved data space at file
    extent insert time, causing the qgroup underflow.

    [FIX]
    For btrfs_qgroup_free_data(), add a @reserved parameter to only free
    data ranges reserved by a previous btrfs_qgroup_reserve_data().
    So in the above case, Task B will try to free 0 bytes, and there is no
    underflow.

    Reported-by: Chandan Rajendra
    Signed-off-by: Qu Wenruo
    Reviewed-by: Chandan Rajendra
    Tested-by: Chandan Rajendra
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Introduce a new parameter, struct extent_changeset, for
    btrfs_qgroup_reserve_data() and its callers.

    Such an extent_changeset is used in btrfs_qgroup_reserve_data() to
    record which ranges it reserved in the current reserve, so it can free
    them in error paths.

    The reason we need to expose it to callers is that, in the buffered
    write error path, without knowing exactly which ranges we reserved in
    the current allocation, we could free space that was not reserved by
    us.

    This will lead to qgroup reserved space underflow.

    Reviewed-by: Chandan Rajendra
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Quite a lot of qgroup corruption happens due to
    btrfs_qgroup_prepare_account_extents() being called at the wrong time.

    Since the safest time is to call it just before
    btrfs_qgroup_account_extents(), there is no need to keep these 2
    functions separate.

    Merging them makes the code cleaner and less bug prone.

    Signed-off-by: Qu Wenruo
    [ changelog and comment adjustments ]
    Signed-off-by: David Sterba

    Qu Wenruo
     

18 Apr, 2017

2 commits

  • The newly introduced qgroup reserved space trace points are normally
    nested inside several common qgroup operations.

    However, some other trace points are not well placed to cooperate with
    them, causing confusing output.

    This patch rearranges the trace_btrfs_qgroup_release_data() and
    trace_btrfs_qgroup_free_delayed_ref() trace points so they are
    triggered before the reserved space ones.

    Signed-off-by: Qu Wenruo
    Reviewed-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Introduce the following trace points:
    qgroup_update_reserve
    qgroup_meta_reserve

    These trace points are handy for tracing qgroup reserve space related
    problems.

    Also export the btrfs_qgroup structure; since we now pass the structure
    directly to the trace points, its definition needs to be visible to
    them.
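
    A sketch of how one might enable these from userspace (assumes tracefs
    is mounted at /sys/kernel/tracing and that the events land in the btrfs
    event group, as btrfs trace events normally do):

    $ cd /sys/kernel/tracing
    $ echo 1 > events/btrfs/qgroup_update_reserve/enable
    $ echo 1 > events/btrfs/qgroup_meta_reserve/enable
    $ cat trace_pipe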

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo