06 Dec, 2018

1 commit

  • commit 42a657f57628402c73237547f0134e083e2f6764 upstream.

    The function relocate_block_group calls btrfs_end_transaction to release
    trans when update_backref_cache returns 1, and then continues the loop
    body. If btrfs_block_rsv_refill fails this time, it will jump out of the
    loop and the freed trans will be accessed. This may result in a
    use-after-free bug. The patch assigns NULL to trans after trans is
    released so that it will not be accessed.

    Fixes: 0647bf564f1 ("Btrfs: improve forever loop when doing balance relocation")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Pan Bian
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Pan Bian
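
The fix described in this commit can be illustrated with a small user-space model. This is only a sketch: `struct handle`, `end_transaction()` and `run_relocate_loop()` are hypothetical stand-ins, not the kernel's types or functions. It shows why clearing the pointer right after the release keeps the cleanup code after the loop from touching a handle that is already gone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a transaction handle; it only counts releases. */
struct handle {
    int releases;
};

static void end_transaction(struct handle *h)
{
    h->releases++;
}

/*
 * Model of the loop: iteration 0 starts a transaction, is asked to restart
 * and releases the handle; iteration 1 fails the refill and breaks out.
 * Returns how often the handle was released (1 is correct, 2 would mean the
 * released handle was used again).
 */
int run_relocate_loop(void)
{
    struct handle h = { 0 };
    struct handle *trans = NULL;
    int iter;

    for (iter = 0; ; iter++) {
        int refill_fails = (iter == 1);
        int restart = (iter == 0);

        if (refill_fails)
            break;              /* leaves the loop with whatever trans holds */

        trans = &h;             /* "start" a transaction */
        if (restart) {
            end_transaction(trans);
            trans = NULL;       /* the fix: forget the released handle */
            continue;
        }
    }

    if (trans)                  /* cleanup after the loop */
        end_transaction(trans);

    assert(h.releases <= 1);
    return h.releases;
}
```

Without the `trans = NULL;` line, the cleanup after the loop would see a stale non-NULL pointer and release the handle a second time.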
     

14 Nov, 2018

1 commit

  • commit 65c6e82becec33731f48786e5a30f98662c86b16 upstream.

    [BUG]
    When mounting a certain crafted image, btrfs will trigger a kernel BUG_ON()
    when trying to recover balance:

    kernel BUG at fs/btrfs/extent-tree.c:8956!
    invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
    RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
    RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
    Call Trace:
    walk_up_tree+0x172/0x1f0 [btrfs]
    btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
    merge_reloc_roots+0xe1/0x1d0 [btrfs]
    btrfs_recover_relocation+0x3ea/0x420 [btrfs]
    open_ctree+0x1af3/0x1dd0 [btrfs]
    btrfs_mount_root+0x66b/0x740 [btrfs]
    mount_fs+0x3b/0x16a
    vfs_kern_mount.part.9+0x54/0x140
    btrfs_mount+0x16d/0x890 [btrfs]
    mount_fs+0x3b/0x16a
    vfs_kern_mount.part.9+0x54/0x140
    do_mount+0x1fd/0xda0
    ksys_mount+0xba/0xd0
    __x64_sys_mount+0x21/0x30
    do_syscall_64+0x60/0x210
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    [CAUSE]
    Extent tree corruption. In this particular case, reloc tree root's
    owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
    corrupted and we failed the owner check in walk_up_tree().

    [FIX]
    It's pretty hard to take care of every extent tree corruption, but at
    least we can remove such BUG_ON() and exit more gracefully.

    And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
    the same root (which is obviously invalid), we need to make
    __del_reloc_root() more robust to detect such invalid sharing to avoid
    possible NULL dereference as root->node can be NULL in this case.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
    Reported-by: Xu Wen
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

15 Sep, 2018

1 commit

  • [ Upstream commit 389305b2aa68723c754f88d9dbd268a400e10664 ]

    Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
    does some cleanup of the reloc roots.

    It turns out that fs_info::reloc_ctl can be NULL in
    btrfs_recover_relocation() as we allocate relocation control after all
    reloc roots have been verified.
    So when we hit an invalid reloc root there, note that we haven't called
    set_reloc_control() yet, thus fs_info::reloc_ctl is still NULL.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
    Reported-by: Xu Wen
    Signed-off-by: Qu Wenruo
    Tested-by: Gu Jinxiang
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

26 Sep, 2017

1 commit

  • __del_reloc_root should be called before freeing up reloc_root->node.
    If not, calling __del_reloc_root() dereferences reloc_root->node, causing
    the system to hit a BUG.

    Fixes: 6bdf131fac23 ("Btrfs: don't leak reloc root nodes on error")
    Cc: # 4.9
    Signed-off-by: Naohiro Aota
    Reviewed-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Naohiro Aota
     

21 Aug, 2017

3 commits

  • The BUG_ON() can be triggered when the caller is processing an invalid
    extent inline ref, e.g.

    a shared data ref is offered instead of an extent data ref, such that
    it tries to find a non-existent tree block and then btrfs_search_slot
    returns 1 for no such item.

    This replaces the BUG_ON() with a WARN() followed by calling
    btrfs_print_leaf() to show more details about what's going on and
    returning -EINVAL to upper callers.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     
  • Now that we have a helper to report invalid value of extent inline ref
    type, we need to quit gracefully instead of throwing out a kernel panic.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     
  • Since we have a helper which can do sanity check, this converts all
    btrfs_extent_inline_ref_type to it.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
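
The first commit in this group replaces a BUG_ON() with a warning and an -EINVAL return. A minimal user-space sketch of that pattern follows. `check_search_result()` is a hypothetical name, and `fprintf()` stands in for the kernel's WARN() and btrfs_print_leaf(), so this is an illustration rather than the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * search_ret mimics the search result described above: 0 means the item
 * was found, 1 means no such item, negative values are errors.
 */
int check_search_result(int search_ret)
{
    assert(search_ret <= 1);    /* the only positive result modelled is 1 */

    if (search_ret > 0) {
        /* Previously a BUG_ON(); now warn and report the bad ref upward. */
        fprintf(stderr, "warning: tree block not found, invalid inline ref?\n");
        return -EINVAL;
    }
    return search_ret;          /* 0 or an existing error, passed through */
}
```

Callers then see an ordinary error code and can abort the operation instead of taking the whole system down.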
     

16 Aug, 2017

1 commit


30 Jun, 2017

3 commits

  • Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
    v4.12-rc6. This was because commit 70e7af244 made it possible for
    calc_reclaim_items_nr() to return a negative number. It's not really a
    bug in that commit, it just didn't go far enough down the stack to find
    all the possible 64->32 bit overflows.

    This switches calc_reclaim_items_nr() to return a u64 and changes everyone
    that uses the results of that math to u64 as well.

    Reported-by: Dave Jones
    Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Chris Mason
     
  • [BUG]
    For the following case, btrfs can underflow qgroup reserved space
    at an error path:
    (Page size 4K, function name without "btrfs_" prefix)

    Task A                                  | Task B
    ----------------------------------------------------------------------
    Buffered_write [0, 2K)                  |
    |- check_data_free_space()              |
    |  |- qgroup_reserve_data()             |
    |     Range aligned to page             |
    |     range [0, 4K)          <<<        |
    |     4K bytes reserved      <<<        |
    |- copy pages to page cache             |
                                            | Buffered_write [2K, 4K)
                                            | |- check_data_free_space()
                                            | |  |- qgroup_reserve_data()
                                            | |     Range aligned to page
                                            | |     range [0, 4K)
                                            | |     Already reserved by A <<<
                                            | |     0 bytes reserved      <<<
                                            | |- delalloc_reserve_metadata()
                                            | |  And it *FAILED* (Maybe EQUOTA)
                                            | |- free_reserved_data_space()
                                                 |- qgroup_free_data()
                                                    Range aligned to page range
                                                    [0, 4K)
                                                    Freeing 4K
    (Special thanks to Chandan for the detailed report and analysis)

    [CAUSE]
    Above, Task B is freeing the reserved data range [0, 4K), which was actually
    reserved by Task A.

    And at writeback time, the page dirtied by Task A will go through the
    writeback routine, which will free 4K of reserved data space at file extent
    insert time, causing the qgroup underflow.

    [FIX]
    For btrfs_qgroup_free_data(), add a @reserved parameter to only free
    data ranges reserved by the previous btrfs_qgroup_reserve_data().
    So in the above case, Task B will try to free 0 bytes, so no underflow.

    Reported-by: Chandan Rajendra
    Signed-off-by: Qu Wenruo
    Reviewed-by: Chandan Rajendra
    Tested-by: Chandan Rajendra
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Introduce a new parameter, struct extent_changeset, for
    btrfs_qgroup_reserve_data() and its callers.

    Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
    which range it reserved in the current reserve, so it can free it in error
    paths.

    The reason we need to export it to callers is that, at the buffered write
    error path, without knowing exactly which range we reserved in the current
    allocation, we can free space which was not reserved by us.

    This will lead to qgroup reserved space underflow.

    Reviewed-by: Chandan Rajendra
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
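
The two qgroup commits in this group rely on the same idea: a caller may only free what its own reserve call actually added. The following user-space sketch models that with a page bitmask. `struct qgroup_model`, `reserve_data()`, `free_data()` and `run_scenario()` are hypothetical names, and the bitmask is a simplification standing in for struct extent_changeset.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096u
#define NPAGES  32u

/* Hypothetical model: bit i set means page i has qgroup space reserved. */
struct qgroup_model {
    uint32_t reserved_mask;
};

uint64_t reserved_bytes(const struct qgroup_model *q)
{
    uint64_t n = 0;

    for (unsigned i = 0; i < NPAGES; i++)
        if (q->reserved_mask & (UINT32_C(1) << i))
            n += PAGE_SZ;
    return n;
}

/*
 * Reserve the pages covering [start, start + len). Returns the "changeset":
 * only the pages that this call newly reserved.
 */
uint32_t reserve_data(struct qgroup_model *q, uint64_t start, uint64_t len)
{
    uint64_t first = start / PAGE_SZ;
    uint64_t last;
    uint32_t changeset = 0;

    assert(len > 0);
    last = (start + len - 1) / PAGE_SZ;
    assert(last < NPAGES);

    for (uint64_t i = first; i <= last; i++) {
        uint32_t bit = UINT32_C(1) << i;

        if (!(q->reserved_mask & bit)) {
            q->reserved_mask |= bit;
            changeset |= bit;
        }
    }
    return changeset;
}

/* Error path: free only the pages recorded in the caller's changeset. */
void free_data(struct qgroup_model *q, uint32_t changeset)
{
    assert((q->reserved_mask & changeset) == changeset);
    q->reserved_mask &= ~changeset;
}

/* Task A writes [0, 2K), Task B writes [2K, 4K) and then fails. */
uint64_t run_scenario(void)
{
    struct qgroup_model q = { 0 };
    uint32_t b;

    reserve_data(&q, 0, 2048);          /* Task A reserves page 0 */
    b = reserve_data(&q, 2048, 2048);   /* Task B: page 0 already reserved */
    free_data(&q, b);                   /* Task B's error path frees nothing */
    return reserved_bytes(&q);
}
```

In this model Task A's 4K stays reserved after Task B's failure, so the later free at writeback time has something to subtract from.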
     
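
For the calc_reclaim_items_nr() commit in this group, the 64->32 bit problem can be shown with plain arithmetic. This is a sketch under stated assumptions: `items_to_reclaim()` and `negative_as_s32()` are hypothetical helpers, and the byte counts used with them are made up for illustration, not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: how many items must be reclaimed to free to_reclaim bytes. */
uint64_t items_to_reclaim(uint64_t to_reclaim, uint64_t bytes_per_item)
{
    uint64_t nr;

    assert(bytes_per_item != 0);
    nr = to_reclaim / bytes_per_item;
    return nr ? nr : 1;         /* always reclaim at least one item */
}

/*
 * True if nr would read as a negative number once only its low 32 bits are
 * kept and interpreted as a two's complement signed value.
 */
int negative_as_s32(uint64_t nr)
{
    return (nr & UINT64_C(0x80000000)) != 0;
}
```

With a count such as `3ULL << 31` the u64 result is correct, while a 32-bit signed variable holding the same bits would be negative, which is the kind of value a check like WARN_ON(nr < 0) catches.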

20 Jun, 2017

1 commit

  • For extent_io trees we have carried the address_mapping of the inode
    around in the io tree in order to pull the inode back out for calling
    into various tree ops hooks. This works fine when everything that has
    an extent_io_tree has an inode. But we are going to remove the
    btree_inode, so we need to change this. Instead just have a generic
    void * for private data that we can initialize with, and have all the
    tree ops use that instead. This had a lot of cascading changes but
    should be relatively straightforward.

    Signed-off-by: Josef Bacik
    Reviewed-by: Chandan Rajendra
    Reviewed-by: David Sterba
    [ minor reordering of the callback prototypes ]
    Signed-off-by: David Sterba

    Josef Bacik
     

28 Feb, 2017

4 commits


17 Feb, 2017

2 commits


14 Feb, 2017

2 commits

  • This goes as a separate patch because fixing that inside the patches
    caused too many conflicts.

    Signed-off-by: David Sterba

    David Sterba
     
  • Currently btrfs_ino takes a struct inode and this causes a lot of
    internal btrfs functions which consume this ino to take a VFS inode,
    rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
    of VFS structs into the internals of btrfs first it's necessary to
    eliminate all uses of struct inode for the purpose of inode. This patch
    does that by using BTRFS_I to convert an inode to btrfs_inode. With
    this problem eliminated subsequent patches will start eliminating the
    passing of struct inode altogether, eventually resulting in a lot cleaner
    code.

    Signed-off-by: Nikolay Borisov
    [ fix btrfs_get_extent tracepoint prototype ]
    Signed-off-by: David Sterba

    Nikolay Borisov
     

14 Dec, 2016

1 commit

  • …dmanana/linux into for-linus-4.10

    Patches queued up by Filipe:

    The most important change is still the fix for the extent tree
    corruption that happens due to balance when qgroups are enabled (a
    regression introduced in 4.7 by a fix for a regression from the last
    qgroups rework). This has been hitting SLE and openSUSE users and QA
    very badly, where transactions keep getting aborted when running
    delayed references leaving the root filesystem in RO mode and nearly
    unusable. There are fixes here that allow us to run xfstests again
    with the integrity checker enabled, which has been impossible since 4.8
    (apparently I'm the only one running xfstests with the integrity
    checker enabled, which is useful to validate dirtied leafs, like
    checking if there are keys out of order, etc). The rest are just some
    trivial fixes, most of them tagged for stable, and two cleanups.

    Signed-off-by: Chris Mason <clm@fb.com>

    Chris Mason
     

06 Dec, 2016

5 commits


30 Nov, 2016

4 commits


19 Nov, 2016

2 commits

  • In commit 5bc7247ac47c (Btrfs: fix broken nocow after balance) we started
    abusing the rtransid and otransid fields of root items from relocation
    trees to fix some issues with nodatacow mode. However later in commit
    ba8b0289333a (Btrfs: do not reset last_snapshot after relocation) we
    dropped the code that made use of those fields but did not remove
    the code that sets those fields.

    So just remove them to avoid confusion.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana
     
  • During relocation of a data block group we create a relocation tree
    for each fs/subvol tree by making a snapshot of each tree using
    btrfs_copy_root() and the tree's commit root, and then setting the last
    snapshot field for the fs/subvol tree's root to the value of the current
    transaction id minus 1. However this can lead to relocation later
    dropping references that it did not create if we have qgroups enabled,
    leaving the filesystem in an inconsistent state that keeps aborting
    transactions.

    Let's consider the following example to explain the problem, which requires
    qgroups to be enabled.

    We are relocating data block group Y, we have a subvolume with id 258 that
    has a root at level 1, that subvolume is used to store directory entries
    for snapshots and we are currently at transaction 3404.

    When committing transaction 3404, we have a pending snapshot and therefore
    we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot()
    in order to create its dentry at subvolume 258. This results in COWing
    leaf A from root 258 in order to add the dentry. Note that leaf A
    also contains file extent items referring to extents from some other
    block group X (we are currently relocating block group Y). Later on, still
    at create_pending_snapshot() we call qgroup_account_snapshot(), which
    switches the commit root for root 258 when it calls switch_commit_roots(),
    so now the COWed version of leaf A, let's call it leaf A', is accessible
    from the commit root of tree 258. At the end of qgroup_account_snapshot(),
    we call record_root_in_trans() with 258 as its argument, which results
    in btrfs_init_reloc_root() being called, which in turn calls
    relocation.c:create_reloc_root() in order to create a relocation tree
    associated to root 258, which results in assigning the value of 3403
    (which is the current transaction id minus 1 = 3404 - 1) to the
    last_snapshot field of root 258. When creating the relocation tree root
    at ctree.c:btrfs_copy_root() we add a shared reference for leaf A',
    corresponding to the relocation tree's root, when we call btrfs_inc_ref()
    against the COWed root (a copy of the commit root from tree 258), which
    is at level 1. So at this point leaf A' has 2 references, one normal
    reference corresponding to root 258 and one shared reference corresponding
    to the root of the relocation tree.

    Transaction 3404 finishes its commit and transaction 3405 is started by
    relocation when calling merge_reloc_root() for the relocation tree
    associated to root 258. In the meanwhile leaf A' is COWed again, in
    response to some filesystem operation, when we are still at transaction
    3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we
    call btrfs_block_can_be_shared() in order to figure out if other trees
    refer to the leaf and if any such trees exist, add a full back reference
    to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false
    because the following condition is false:

    btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item)

    which evaluates to 3404 <= 3403. [...] wc->refs[0] is 1, it does call
    btrfs_dec_ref() against leaf A', which results in removing the single
    references that the extents from block group X have which are associated
    to root 258 - the expectation was to have each of these extents with 2
    references - one reference for root 258 and one shared reference related
    to the root of the relocation tree, and so we would drop only the shared
    reference (because leaf A' was supposed to have the flag
    BTRFS_BLOCK_FLAG_FULL_BACKREF set).

    This leaves the filesystem in an inconsistent state as we now have file
    extent items in a subvolume tree that point to extents from block group X
    without references in the extent tree. So later on when we try to decrement
    the references for these extents, for example due to a file unlink operation,
    truncate operation or overwriting ranges of a file, we fail because the
    expected references do not exist in the extent tree.

    This leads to warnings and transaction aborts like the following:

    [ 588.965795] ------------[ cut here ]------------
    [ 588.965815] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
    [ 588.965816] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
    parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
    sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
    [ 588.965831] CPU: 2 PID: 2479 Comm: kworker/u8:7 Not tainted 4.7.3-3-default-fdm+ #1
    [ 588.965832] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
    [ 588.965844] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
    [ 588.965845] 0000000000000000 ffff8802263bfa28 ffffffff813af542 0000000000000000
    [ 588.965847] 0000000000000000 ffff8802263bfa68 ffffffff81081e8b 0000065900000000
    [ 588.965848] ffff8801db2af000 000000012bbe2000 0000000000000000 ffff880215703b48
    [ 588.965849] Call Trace:
    [ 588.965852] dump_stack+0x63/0x81
    [ 588.965854] __warn+0xcb/0xf0
    [ 588.965855] warn_slowpath_null+0x1d/0x20
    [ 588.965863] lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
    [ 588.965865] ? trace_clock_local+0x10/0x30
    [ 588.965867] ? rb_reserve_next_event+0x6f/0x460
    [ 588.965875] insert_inline_extent_backref+0x55/0xd0 [btrfs]
    [ 588.965882] __btrfs_inc_extent_ref.isra.55+0x8f/0x240 [btrfs]
    [ 588.965890] __btrfs_run_delayed_refs+0x74a/0x1260 [btrfs]
    [ 588.965892] ? cpuacct_charge+0x86/0xa0
    [ 588.965900] btrfs_run_delayed_refs+0x9f/0x2c0 [btrfs]
    [ 588.965908] delayed_ref_async_start+0x94/0xb0 [btrfs]
    [ 588.965918] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
    [ 588.965928] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
    [ 588.965930] process_one_work+0x1f3/0x4e0
    [ 588.965931] worker_thread+0x48/0x4e0
    [ 588.965932] ? process_one_work+0x4e0/0x4e0
    [ 588.965934] kthread+0xc9/0xe0
    [ 588.965936] ret_from_fork+0x1f/0x40
    [ 588.965937] ? kthread_worker_fn+0x170/0x170
    [ 588.965938] ---[ end trace 34e5232c933a1749 ]---
    [ 588.966187] ------------[ cut here ]------------
    [ 588.966196] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:2966 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
    [ 588.966196] BTRFS: Transaction aborted (error -5)
    [ 588.966197] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
    parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
    sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
    [ 588.966206] CPU: 2 PID: 2479 Comm: kworker/u8:7 Tainted: G W 4.7.3-3-default-fdm+ #1
    [ 588.966207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
    [ 588.966217] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
    [ 588.966217] 0000000000000000 ffff8802263bfc98 ffffffff813af542 ffff8802263bfce8
    [ 588.966219] 0000000000000000 ffff8802263bfcd8 ffffffff81081e8b 00000b96345ee000
    [ 588.966220] ffffffffa021ae1c ffff880215703b48 00000000000005fe ffff8802345ee000
    [ 588.966221] Call Trace:
    [ 588.966223] dump_stack+0x63/0x81
    [ 588.966224] __warn+0xcb/0xf0
    [ 588.966225] warn_slowpath_fmt+0x4f/0x60
    [ 588.966233] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
    [ 588.966241] delayed_ref_async_start+0x94/0xb0 [btrfs]
    [ 588.966250] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
    [ 588.966259] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
    [ 588.966260] process_one_work+0x1f3/0x4e0
    [ 588.966261] worker_thread+0x48/0x4e0
    [ 588.966263] ? process_one_work+0x4e0/0x4e0
    [ 588.966264] kthread+0xc9/0xe0
    [ 588.966265] ret_from_fork+0x1f/0x40
    [ 588.966267] ? kthread_worker_fn+0x170/0x170
    [ 588.966268] ---[ end trace 34e5232c933a174a ]---
    [ 588.966269] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2966: errno=-5 IO failure
    [ 588.966270] BTRFS info (device sda2): forced readonly

    This was happening often on openSUSE and SLE systems using btrfs as the
    root filesystem (with its default layout where multiple subvolumes are
    used) where balance happens in the background triggered by a cron job and
    snapshots are automatically created before/after package installations,
    upgrades and removals. The issue could be triggered simply by running the
    following loop on the first system boot post installation:

    while true; do
        zypper -n in nfs-kernel-server
        zypper -n rm nfs-kernel-server
    done

    (If we were fast enough and made that loop before the cron job triggered
    a balance operation and the balance finished)

    So fix by setting the last_snapshot field of the root to the value of the
    generation of its commit root. Like this btrfs_block_can_be_shared()
    behaves correctly for the case where the relocation root is created during
    a transaction commit and for the case where it's created before a
    transaction commit.

    Fixes: 6426c7ad697d (btrfs: qgroup: Fix qgroup accounting when creating snapshot)
    Cc: stable@vger.kernel.org # 4.7+
    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana
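
The generation comparison at the centre of this commit can be checked with the numbers from the example. This is a reduced sketch: `block_can_be_shared()` here keeps only the generation test and omits every other condition of the real function, the two `last_snapshot_*()` helpers are hypothetical, and the commit root generation of 3404 is an assumption drawn from the scenario (the commit root was switched during transaction 3404).

```c
#include <assert.h>
#include <stdint.h>

/* Generation test only; the other conditions of the real check are omitted. */
int block_can_be_shared(uint64_t block_generation, uint64_t last_snapshot)
{
    return block_generation <= last_snapshot;
}

/* Old behaviour: last_snapshot is the current transaction id minus 1. */
uint64_t last_snapshot_old(uint64_t cur_transid)
{
    assert(cur_transid > 0);
    return cur_transid - 1;
}

/* New behaviour: last_snapshot is the generation of the commit root. */
uint64_t last_snapshot_new(uint64_t commit_root_generation)
{
    return commit_root_generation;
}
```

For leaf A', created in transaction 3404, the old value 3403 makes the test fail, while the commit root generation 3404 makes it succeed, so the leaf is treated as shared.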
     

17 Oct, 2016

1 commit

  • While updating btree, we try to push items between sibling
    nodes/leaves in order to keep height as low as possible.
    But we don't memset the original places with zero when
    pushing items so that we could end up leaving stale content
    in nodes/leaves. One may read the above stale content by
    increasing btree blocks' @nritems.

    One case I've come across is that in fs tree, a leaf has two
    parent nodes, hence running balance ends up with processing
    this leaf with two parent nodes, but it can only reach the
    valid parent node through btrfs_search_slot, so it'd be like,

    do_relocation
        for P in all parent nodes of block A:
            if !P->eb:
                btrfs_search_slot(key);   --> get path from P to A.
            if lowest:
                BUG_ON(A->bytenr != bytenr of A recorded in P);
            btrfs_cow_block(P, A);   --> change A's bytenr in P.

    After btrfs_cow_block, P has the new bytenr of A, but with the
    same @key, we get the same path again, and get panic by BUG_ON.

    Note that this only happens in a corrupted fs; in a regular fs we
    have the correct @nritems, so we won't read stale content in
    any case.

    Reviewed-by: Josef Bacik
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
     

27 Sep, 2016

2 commits

  • CodingStyle chapter 2:
    "[...] never break user-visible strings such as printk messages,
    because that breaks the ability to grep for them."

    This patch unsplits user-visible strings.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • We don't track the reloc roots in any sort of normal way, so the only way the
    root/commit_root nodes get free'd is if the relocation finishes successfully and
    the reloc root is deleted. Fix this by free'ing them in free_reloc_roots.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

26 Sep, 2016

3 commits

  • When relocating tree blocks, we first get block information from
    back references in the extent tree, we then search fs tree to try to
    find all parents of a block.

    However, if fs tree is corrupted, e.g. if there're some missing
    items, we could come across these WARN_ONs and BUG_ONs.

    This makes us print some error messages and return gracefully
    from balance.

    Signed-off-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Liu Bo
     
  • We have a lot of random ints in btrfs_fs_info that can be put into flags. This
    is mostly equivalent with the exception of how we deal with quota going on or
    off, now instead we set a flag when we are turning it on or off and deal with
    that appropriately, rather than just having a pending state that the current
    quota_enabled gets set to. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • …e and subpage size patchset

    Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
    parameters for both in-band dedupe and subpage sector size patchset.

    This should reduce conflicts between the two patchsets and the effort to
    rebase them.

    Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
    Cc: David Sterba <dsterba@suse.cz>
    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Signed-off-by: David Sterba <dsterba@suse.com>

    Qu Wenruo
     

01 Sep, 2016

1 commit

  • Qgroup function may overwrite the saved error 'err' with 0
    in case quota is not enabled, and this ends up with an
    endless loop in balance because we keep going back to balance
    the same block group.

    It really should use 'ret' instead.

    Signed-off-by: Liu Bo
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    Liu Bo
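
The error handling problem described here is a general C pitfall: reusing the variable that holds a saved error for a later call's result. One possible shape of the corrected pattern is sketched below. `qgroup_step()` and `finish_relocation()` are hypothetical names, and the rule for merging `ret` into `err` is an assumption rather than the exact code of the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical qgroup step: with quota disabled there is nothing to do. */
int qgroup_step(int quota_enabled, int result_if_enabled)
{
    if (!quota_enabled)
        return 0;
    return result_if_enabled;
}

/*
 * err is the error saved from the earlier relocation work (0 or negative).
 * The qgroup result goes into its own variable, so a 0 from the qgroup
 * step can never erase the saved error.
 */
int finish_relocation(int err, int quota_enabled, int qgroup_result)
{
    int ret;

    assert(err <= 0);
    ret = qgroup_step(quota_enabled, qgroup_result);
    if (ret < 0 && !err)
        err = ret;      /* record a new failure, never erase an old one */
    return err;
}
```

Had the qgroup result been assigned to `err` directly, a saved -ENOSPC would turn into 0 and the caller would retry the same block group again.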
     

25 Aug, 2016

1 commit

  • This patch fixes some false ENOSPC errors. The test script below
    reproduces one such false ENOSPC error:
    #!/bin/bash
    dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
    dev=$(losetup --show -f fs.img)
    mkfs.btrfs -f -M $dev
    mkdir /tmp/mntpoint
    mount $dev /tmp/mntpoint
    cd /tmp/mntpoint
    xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

    The above script fails with ENOSPC, but the fs actually still has free
    space to satisfy this request. Please see the call graph:
    btrfs_fallocate()
    |-> btrfs_alloc_data_chunk_ondemand()
    |   bytes_may_use += 64M
    |-> btrfs_prealloc_file_range()
        |-> btrfs_reserve_extent()
            |-> btrfs_add_reserved_bytes()
            |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
            |   change bytes_may_use, and bytes_reserved += 64M. Now
            |   bytes_may_use + bytes_reserved == 128M, which is greater
            |   than btrfs_space_info's total_bytes, false enospc occurs.
            |   Note, the bytes_may_use decrease operation will be done in
            |   end of btrfs_fallocate(), which is too late.

    Here is another simple case for buffered write:
    CPU 1                                   | CPU 2
                                            |
    |-> cow_file_range()                    |-> __btrfs_buffered_write()
        |-> btrfs_reserve_extent()          |   |
        |                                   |   |
        |                                   |   |
        |    .....                          |   |-> btrfs_check_data_free_space()
        |                                   |
        |                                   |
        |-> extent_clear_unlock_delalloc()  |

    On CPU 1, btrfs_reserve_extent()->find_free_extent()->
    btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
    is delayed until extent_clear_unlock_delalloc().
    Assume that in this case btrfs_reserve_extent() reserved 128MB of data and
    CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space.
    If
    100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
    data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
    data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
    btrfs_check_data_free_space() will try to allocate a new data chunk, call
    btrfs_start_delalloc_roots(), or commit the current transaction in order to
    reserve some free space, which is obviously a lot of work. But this is not
    necessary: if bytes_may_use had been decreased in time (by the 128M
    reserved on CPU 1), we would still have free space.

    To fix this issue, this patch chooses to update bytes_may_use for both
    data and metadata in btrfs_add_reserved_bytes(). For the compress path, the
    real extent length may not be equal to the file content length, so a
    ram_bytes argument is introduced for btrfs_reserve_extent(),
    find_free_extent() and btrfs_add_reserved_bytes(), because bytes_may_use
    is increased by the file content length. Then the compress path can update
    bytes_may_use correctly. Also, we can now discard RESERVE_ALLOC_NO_ACCOUNT,
    RESERVE_ALLOC and RESERVE_FREE.

    As we know, EXTENT_DO_ACCOUNTING is usually used for error paths. In
    run_delalloc_nocow(), for an inode marked as NODATACOW or an extent marked as
    PREALLOC, we also need to update bytes_may_use, but cannot pass
    EXTENT_DO_ACCOUNTING, because it also clears the metadata reservation, so
    here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell
    btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

    Meanwhile, __btrfs_prealloc_file_range() will call
    btrfs_free_reserved_data_space() internally for both the successful and the
    failed path, so btrfs_prealloc_file_range()'s callers do not need to call
    btrfs_free_reserved_data_space() any more.

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Wang Xiaoguang
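
The accounting change described in this commit can be modelled in a few lines. This is a user-space sketch: `struct space_info_model` is a hypothetical, reduced version of btrfs_space_info, and `add_reserved_bytes()` only mirrors the behaviour the message describes (move the file content length out of bytes_may_use at the moment the extent is reserved), not the kernel function itself.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (UINT64_C(1) << 20)

/* Hypothetical, reduced model of the space counters named in the text. */
struct space_info_model {
    uint64_t bytes_may_use;     /* space a pending write may need */
    uint64_t bytes_reserved;    /* space backed by a reserved extent */
};

/*
 * ram_bytes is the file content length that was added to bytes_may_use,
 * num_bytes is the length of the extent actually reserved. The two can
 * differ on the compress path.
 */
void add_reserved_bytes(struct space_info_model *s,
                        uint64_t ram_bytes, uint64_t num_bytes)
{
    assert(s->bytes_may_use >= ram_bytes);
    s->bytes_reserved += num_bytes;
    s->bytes_may_use -= ram_bytes;
}

/* The 64M fallocate from the reproducer, as seen by the counters. */
uint64_t accounted_after_fallocate(void)
{
    struct space_info_model s = { 0, 0 };

    s.bytes_may_use += 64 * MiB;                    /* up-front data reservation */
    add_reserved_bytes(&s, 64 * MiB, 64 * MiB);     /* extent gets reserved */
    return s.bytes_may_use + s.bytes_reserved;
}
```

The request is counted once (64M) instead of twice (128M), which is what removes the false ENOSPC in the fallocate case.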