06 Dec, 2018

1 commit

  • We want to release the unused reservation we have since it refills the
    delayed refs reserve, which will make everything go smoother when
    running the delayed refs if we're short on our reservation.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Omar Sandoval
    Reviewed-by: Liu Bo
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     

14 Nov, 2018

1 commit

  • commit 30928e9baac238a7330085a1c5747f0b5df444b4 upstream.

    This could result in a really bad case where we do something like

    evict
    evict_refill_and_join
    btrfs_commit_transaction
    btrfs_run_delayed_iputs
    evict
    evict_refill_and_join
    btrfs_commit_transaction
    ... forever

    We have plenty of other places where we run delayed iputs that are much
    safer, let those do the work.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     

30 May, 2018

1 commit

  • …created with quota enabled

    [ Upstream commit 4d31778aa2fa342f5f92ca4025b293a1729161d1 ]

    When multiple pending snapshots referring to the same source subvolume
    are executed, enabled quota will cause root item corruption, where root
    items are using old bytenr (no backref in extent tree).

    This can be triggered by fstests btrfs/152.

    The cause is that when the source subvolume is still dirty, the extra
    commit (a simplified transaction commit) in qgroup_account_snapshot() can
    skip dirty roots not recorded in the current transaction, leaving the root
    item of the source subvolume not updated.

    Fix it by forcing recording source subvolume in current transaction
    before qgroup sub-transaction commit.

    Reported-by: Justin Maggard <jmaggard@netgear.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Qu Wenruo
     

19 Mar, 2018

1 commit


09 Mar, 2018

1 commit

  • commit 3c181c12c431fe33b669410d663beb9cceefcd1b upstream.

    The fs_info::super_copy is a byte copy of the on-disk structure and all
    members must use the accessor macros/functions to obtain the right
    value. This was missing in update_super_roots and in sysfs readers.

    Moving between hosts of opposite endianness will report bogus numbers in
    sysfs, and mount may fail as the root will not be restored correctly. If
    the filesystem is always used on hosts of the same endianness, this will
    not be a problem.

    Fix this by using the btrfs_set_super...() functions to set
    fs_info::super_copy values, and for the sysfs, use the cached
    fs_info::nodesize/sectorsize values.

    CC: stable@vger.kernel.org
    Fixes: df93589a17378 ("btrfs: export more from FS_INFO to sysfs")
    Signed-off-by: Anand Jain
    Reviewed-by: Liu Bo
    Reviewed-by: David Sterba
    [ update changelog ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
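
The accessor idea in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions: the struct layout and the demo_* names are hypothetical and are not the kernel's btrfs_super_*()/btrfs_set_super_*() helpers. The point it shows is that a byte copy of an on-disk little-endian field reads back correctly on any host only when it goes through an accessor that fixes the byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced on-disk layout: one little-endian u64. */
struct demo_disk_super {
	uint8_t generation[8];
};

/* Read the field byte by byte, so the result does not depend on host endianness. */
static uint64_t demo_super_generation(const struct demo_disk_super *s)
{
	uint64_t v = 0;

	assert(s != NULL);
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | s->generation[i];
	return v;
}

/* Store the value least significant byte first (little-endian on disk). */
static void demo_set_super_generation(struct demo_disk_super *s, uint64_t v)
{
	assert(s != NULL);
	for (int i = 0; i < 8; i++) {
		s->generation[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}
```

Assigning a host-order integer straight into such a field would produce the right bytes only on a little-endian host, which matches the failure mode described for update_super_roots and the sysfs readers.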
     

30 Jun, 2017

2 commits

  • Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
    v4.12-rc6. This was because commit 70e7af244 made it possible for
    calc_reclaim_items_nr() to return a negative number. It's not really a
    bug in that commit, it just didn't go far enough down the stack to find
    all the possible 64->32 bit overflows.

    This switches calc_reclaim_items_nr() to return a u64 and changes everyone
    that uses the results of that math to u64 as well.

    Reported-by: Dave Jones
    Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Chris Mason
     
  • Quite a lot of qgroup corruption happens because
    btrfs_qgroup_prepare_account_extents() is called at the wrong time.

    Since the safest time to call it is just before
    btrfs_qgroup_account_extents(), there is no need to keep these two
    functions separate.

    Merging them will make the code cleaner and less bug-prone.

    Signed-off-by: Qu Wenruo
    [ changelog and comment adjustments ]
    Signed-off-by: David Sterba

    Qu Wenruo
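
The 64->32 bit overflow behind the WARN_ON(nr < 0) in the first commit of this group can be sketched in userspace C. The function names and the division below are assumptions made for illustration; this is not the kernel's calc_reclaim_items_nr(). It only shows how a large 64-bit count turns negative once it is narrowed to int, and why keeping the value as u64 through the whole call chain avoids that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical item-count math, kept in 64 bits from input to result. */
static uint64_t demo_items_nr_u64(uint64_t to_reclaim, uint64_t bytes_per_item)
{
	assert(bytes_per_item != 0);
	return to_reclaim / bytes_per_item;
}

/*
 * Same math, but the result is narrowed to a 32-bit signed int.
 * The out-of-range conversion is implementation-defined in C; with
 * GCC and Clang it wraps, so large counts come back negative.
 */
static int demo_items_nr_int(uint64_t to_reclaim, uint64_t bytes_per_item)
{
	assert(bytes_per_item != 0);
	return (int)(uint32_t)(to_reclaim / bytes_per_item);
}
```

A count of 0x180000000 keeps its value in the u64 variant, while the int variant sees only the low 32 bits (0x80000000) and reports a negative number.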
     

20 Jun, 2017

3 commits

  • We can keep the state among the other fs_info flags, there's no reason
    why fs_frozen would need to be separate.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
     
  • Observing the number of slab objects of btrfs_transaction, there's just
    one active on an almost quiescent filesystem, and the number of objects
    goes to about ten when sync is in progress. Then the number goes down to
    1. This matches the expectations of the transaction lifetime.

    For such use the separate slab cache is not justified, as we do not
    reuse objects frequently. For the shortlived transaction, the generic
    slab (size 512) should be ok. We can optimistically expect that the 512
    slabs are not all used (fragmentation) and there are free slots to take
    when we do the allocation, compared to potentially allocating a whole new
    page for the separate slab.

    We'll lose the stats about the object use, which could be added later if
    we really need them.

    Signed-off-by: David Sterba

    David Sterba
     
  • For extent_io trees we have carried the address_mapping of the inode
    around in the io tree in order to pull the inode back out for calling
    into various tree ops hooks. This works fine when everything that has
    an extent_io_tree has an inode. But we are going to remove the
    btree_inode, so we need to change this. Instead just have a generic
    void * for private data that we can initialize with, and have all the
    tree ops use that instead. This had a lot of cascading changes but
    should be relatively straightforward.

    Signed-off-by: Josef Bacik
    Reviewed-by: Chandan Rajendra
    Reviewed-by: David Sterba
    [ minor reordering of the callback prototypes ]
    Signed-off-by: David Sterba

    Josef Bacik
     

18 Apr, 2017

3 commits

  • [BUG]
    The easiest way to reproduce the bug is:
    ------
    # mkfs.btrfs -f $dev -n 16K
    # mount $dev $mnt -o inode_cache
    # btrfs quota enable $mnt
    # btrfs quota rescan -w $mnt
    # btrfs qgroup show $mnt
    qgroupid rfer excl
    -------- ---- ----
    0/5 32.00KiB 32.00KiB
    ^^ Twice the correct value
    ------

    The fstests btrfs qgroup test group can easily detect this with the
    inode_cache mount option, although some of the failures are false alerts,
    since old test cases use fixed golden output while new test cases use
    "btrfs check" to detect qgroup mismatches.

    [CAUSE]
    The inode_cache mount option makes commit_fs_roots() call
    btrfs_save_ino_cache() to update fs/subvol trees, which generates new
    delayed refs.

    However, we call btrfs_qgroup_prepare_account_extents() too early, before
    commit_fs_roots().
    This makes the "old_roots" for newly generated extents always NULL.
    In the extent-freeing case, this makes both new_roots and old_roots
    empty, while the correct old_roots should not be empty.
    This causes qgroup numbers not to be decreased correctly.

    [FIX]
    Move the call to btrfs_qgroup_prepare_account_extents() to just before
    btrfs_qgroup_account_extents(), and add the needed delayed_refs
    handler.
    This lets qgroup handle the inode_cache mount option correctly.

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • The members have been effectively unused since "Btrfs: rework qgroup
    accounting" (fcebe4562dec83b3), there's no substitute for
    assert_qgroups_uptodate so it's removed as well.

    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
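
The overflow protection motivating the refcount_t conversion above can be sketched as a saturating counter. This is a single-threaded userspace illustration with hypothetical demo_* names; the kernel's refcount_t is atomic and its implementation differs. It only shows the principle: a counter that sticks at a maximum instead of wrapping to zero can leak an object, but cannot free one that is still referenced.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define DEMO_REFCOUNT_SATURATED UINT_MAX

/* Hypothetical, non-atomic reference counter. */
struct demo_refcount {
	unsigned int refs;
};

/* Take a reference; once saturated, the counter never moves again. */
static void demo_refcount_inc(struct demo_refcount *r)
{
	if (r->refs != DEMO_REFCOUNT_SATURATED)
		r->refs++;
}

/* Drop a reference; returns true when the caller should free the object. */
static bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	assert(r->refs != 0); /* dropping a reference nobody holds is a bug */
	if (r->refs == DEMO_REFCOUNT_SATURATED)
		return false; /* saturated: leak rather than risk use-after-free */
	return --r->refs == 0;
}
```

With a plain wrapping counter, enough increments would bring the value back to a small number and a later decrement could free an object that still has users.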
     

28 Feb, 2017

3 commits


17 Feb, 2017

3 commits


14 Feb, 2017

3 commits

  • Once a qgroup limit is exceeded, it's impossible to restore normal
    operation to the subvolume without modifying the limit or removing
    the subvolume. This is a surprising situation for many users used
    to the typical workflow with quotas on other file systems where it's
    possible to remove files until the used space is back under the limit.

    When we go to unlink a file and start the transaction, we'll hit
    the qgroup limit while trying to reserve space for the items we'll
    modify while removing the file. We discussed last month how best
    to handle this situation and agreed that there is no perfect solution.
    The best principle-of-least-surprise solution is to handle it similarly
    to how we already handle ENOSPC when unlinking, which is to allow
    the operation to succeed with the expectation that it will ultimately
    release space under most circumstances.

    This patch modifies the transaction start path to select whether to
    honor the qgroups limits. btrfs_start_transaction_fallback_global_rsv
    is the only caller that skips enforcement. The reservation and tracking
    still happens normally -- it just skips the enforcement step.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • Currently btrfs_ino takes a struct inode and this causes a lot of
    internal btrfs functions which consume this ino to take a VFS inode,
    rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
    of VFS structs into the internals of btrfs first it's necessary to
    eliminate all uses of struct inode for the purpose of inode. This patch
    does that by using BTRFS_I to convert an inode to btrfs_inode. With
    this problem eliminated subsequent patches will start eliminating the
    passing of struct inode altogether, eventually resulting in a lot cleaner
    code.

    Signed-off-by: Nikolay Borisov
    [ fix btrfs_get_extent tracepoint prototype ]
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • This replaces the ACCESS_ONCE macro with the corresponding
    READ_ONCE/WRITE_ONCE macros.

    Signed-off-by: Seraphime Kirkovski
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Seraphime Kirkovski
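
As a rough userspace analogue of the ACCESS_ONCE replacement above, the sketch below splits a single access macro into separate read and write forms built on a volatile cast. The DEMO_* macros and struct demo_worker are hypothetical, rely on the GCC/Clang __typeof__ extension, and are much simpler than the kernel's READ_ONCE()/WRITE_ONCE(); they provide no memory ordering, only an access the compiler may not merge or elide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the read and write forms. */
#define DEMO_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct demo_worker {
	int stop;
};

/* Writer side: publish the stop request with a single store. */
static void demo_request_stop(struct demo_worker *w)
{
	assert(w != NULL);
	DEMO_WRITE_ONCE(w->stop, 1);
}

/* Reader side: reload the flag from memory on every call. */
static int demo_should_stop(struct demo_worker *w)
{
	assert(w != NULL);
	return DEMO_READ_ONCE(w->stop);
}
```

Separate read and write macros make the direction of each access explicit at the call site, which a single combined macro does not.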
     

06 Dec, 2016

9 commits


12 Oct, 2016

1 commit

  • Pull btrfs updates from Chris Mason:
    "This is a big variety of fixes and cleanups.

    Liu Bo continues to fixup fuzzer related problems, and some of Josef's
    cleanups are prep for his bigger extent buffer changes (slated for
    v4.10)"

    * 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits)
    Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
    Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
    Btrfs: don't BUG() during drop snapshot
    btrfs: fix btrfs_no_printk stub helper
    Btrfs: memset to avoid stale content in btree leaf
    btrfs: parent_start initialization cleanup
    btrfs: Remove already completed TODO comment
    btrfs: Do not reassign count in btrfs_run_delayed_refs
    btrfs: fix a possible umount deadlock
    Btrfs: fix memory leak in do_walk_down
    btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
    btrfs: convert send's verbose_printk to btrfs_debug
    btrfs: convert pr_* to btrfs_* where possible
    btrfs: convert printk(KERN_* to use pr_* calls
    btrfs: unsplit printed strings
    btrfs: clean the old superblocks before freeing the device
    Btrfs: kill BUG_ON in run_delayed_tree_ref
    Btrfs: don't leak reloc root nodes on error
    btrfs: squash lines for simple wrapper functions
    Btrfs: improve check_node to avoid reading corrupted nodes
    ...

    Linus Torvalds
     

28 Sep, 2016

1 commit

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
     

27 Sep, 2016

4 commits

  • For many printks, we want to know which file system issued the message.

    This patch converts most pr_* calls to use the btrfs_* versions instead.
    In some cases, this means adding plumbing to allow call sites access to
    an fs_info pointer.

    fs/btrfs/check-integrity.c is left alone for another day.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • This patch converts printk(KERN_* style messages to use the pr_* versions.

    One side effect is that anything that was KERN_DEBUG is now automatically
    a dynamic debug message.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • CodingStyle chapter 2:
    "[...] never break user-visible strings such as printk messages,
    because that breaks the ability to grep for them."

    This patch unsplits user-visible strings.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • Since we could get errors from the concurrent aborted transaction,
    the condition checked by this BUG_ON in start_transaction no longer holds.

    Say, while flushing free space cache inode's dirty pages,
    btrfs_finish_ordered_io
    -> btrfs_join_transaction_nolock
    (the transaction has been aborted.)
    -> BUG_ON(type == TRANS_JOIN_NOLOCK);

    Signed-off-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    Liu Bo
     

26 Sep, 2016

1 commit

  • We have a lot of random ints in btrfs_fs_info that can be put into flags. This
    is mostly equivalent with the exception of how we deal with quota going on or
    off, now instead we set a flag when we are turning it on or off and deal with
    that appropriately, rather than just having a pending state that the current
    quota_enabled gets set to. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
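
The change described above, folding separate ints into one flags word and tracking quota transitions with "enabling" and "disabling" bits, can be sketched as follows. All names are hypothetical, and the plain bit operations stand in for the kernel's atomic set_bit()/clear_bit()/test_bit(); this is a minimal single-threaded sketch, not the btrfs code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, one per former int member. */
enum demo_fs_flag {
	DEMO_FS_QUOTA_ENABLED,
	DEMO_FS_QUOTA_ENABLING,
	DEMO_FS_QUOTA_DISABLING,
	DEMO_FS_NR_FLAGS
};

struct demo_fs_info {
	unsigned long flags;
};

static void demo_set_bit(struct demo_fs_info *fs, enum demo_fs_flag bit)
{
	assert(bit < DEMO_FS_NR_FLAGS);
	fs->flags |= 1UL << bit;
}

static void demo_clear_bit(struct demo_fs_info *fs, enum demo_fs_flag bit)
{
	assert(bit < DEMO_FS_NR_FLAGS);
	fs->flags &= ~(1UL << bit);
}

static bool demo_test_bit(const struct demo_fs_info *fs, enum demo_fs_flag bit)
{
	assert(bit < DEMO_FS_NR_FLAGS);
	return (fs->flags & (1UL << bit)) != 0;
}

/* At a commit point, turn a pending transition into the steady state. */
static void demo_apply_pending_quota(struct demo_fs_info *fs)
{
	if (demo_test_bit(fs, DEMO_FS_QUOTA_ENABLING)) {
		demo_clear_bit(fs, DEMO_FS_QUOTA_ENABLING);
		demo_set_bit(fs, DEMO_FS_QUOTA_ENABLED);
	}
	if (demo_test_bit(fs, DEMO_FS_QUOTA_DISABLING)) {
		demo_clear_bit(fs, DEMO_FS_QUOTA_DISABLING);
		demo_clear_bit(fs, DEMO_FS_QUOTA_ENABLED);
	}
}
```

Compared with a pending value that is copied into quota_enabled, the transition bits record which direction the change is going until the commit point applies it.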
     

25 Aug, 2016

1 commit

  • When running fstests generic/068, we sometimes hit the deadlock below:
    xfs_io D ffff8800331dbb20 0 6697 6693 0x00000080
    ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000
    ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001
    ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8
    Call Trace:
    schedule+0x35/0x80
    rwsem_down_read_failed+0xf2/0x140
    ? __filemap_fdatawrite_range+0xd1/0x100
    call_rwsem_down_read_failed+0x18/0x30
    ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
    percpu_down_read+0x35/0x50
    __sb_start_write+0x2c/0x40
    start_transaction+0x2a5/0x4d0 [btrfs]
    btrfs_join_transaction+0x17/0x20 [btrfs]
    btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
    evict+0xba/0x1a0
    iput+0x196/0x200
    btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
    btrfs_commit_transaction+0x928/0xa80 [btrfs]
    btrfs_freeze+0x30/0x40 [btrfs]
    freeze_super+0xf0/0x190
    do_vfs_ioctl+0x4a5/0x5c0
    ? do_audit_syscall_entry+0x66/0x70
    ? syscall_trace_enter_phase1+0x11f/0x140
    SyS_ioctl+0x79/0x90
    do_syscall_64+0x62/0x110
    entry_SYSCALL64_slow_path+0x25/0x25

    From this warning, freeze_super() already holds SB_FREEZE_FS, but
    btrfs_freeze() will call btrfs_commit_transaction() again. If
    btrfs_commit_transaction() finds that it has delayed iputs to handle,
    it will call start_transaction(), which tries to get the SB_FREEZE_FS
    lock again, and a deadlock occurs.

    The root cause is that in btrfs, sync_filesystem(sb) does not make
    sure all metadata is updated. There may still be some code adding
    delayed iputs; see the sample race window below:

    CPU1                                            | CPU2
    |-> freeze_super()                              |
        |-> sync_filesystem(sb);                    |
        |                                           |-> cleaner_kthread()
        |                                           |   |-> btrfs_delete_unused_bgs()
        |                                           |       |-> btrfs_remove_chunk()
        |                                           |           |-> btrfs_remove_block_group()
        |                                           |               |-> btrfs_add_delayed_iput()
        |                                           |
        |-> sb->s_writers.frozen = SB_FREEZE_FS;    |
        |-> sb_wait_write(sb, SB_FREEZE_FS);        |
        |   acquire SB_FREEZE_FS lock.              |
        |                                           |
        |-> btrfs_freeze()                          |
            |-> btrfs_commit_transaction()          |
                |-> btrfs_run_delayed_iputs()       |
                |   will handle delayed iputs,      |
                |   that means start_transaction()  |
                |   will be called, which will try  |
                |   to get SB_FREEZE_FS lock.       |

    To fix this issue, introduce an "int fs_frozen" to record internally whether
    the fs has been frozen. If the fs has been frozen, we cannot handle delayed iputs.

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: David Sterba
    [ add comment to btrfs_freeze ]
    Signed-off-by: David Sterba

    Signed-off-by: Chris Mason

    Wang Xiaoguang
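
The fix described above amounts to a guard around the delayed-iput step of the commit path. The sketch below is a userspace illustration under assumptions: struct demo_fs, its fields, and demo_commit_transaction() are hypothetical, and running an iput is reduced to counting it. It only shows the control flow, namely that a commit issued while the filesystem is frozen leaves delayed iputs queued instead of running them and re-entering start_transaction().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filesystem state: a frozen flag and a queue length. */
struct demo_fs {
	int fs_frozen;      /* set while the filesystem is frozen */
	int delayed_iputs;  /* number of queued delayed iputs */
};

/*
 * Model of the tail of a transaction commit. Returns how many delayed
 * iputs were run. While frozen, none are run, because each one could
 * start a new transaction and block on the freeze lock already held.
 */
static int demo_commit_transaction(struct demo_fs *fs)
{
	int ran = 0;

	assert(fs != NULL);
	if (!fs->fs_frozen) {
		ran = fs->delayed_iputs;
		fs->delayed_iputs = 0;
	}
	return ran;
}
```

The queued iputs are not lost in this model: the first commit after the flag is cleared runs them.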
     

26 Jul, 2016

1 commit