17 Jan, 2021

1 commit

  • [ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]

    When cloning an inline extent there are cases where we can not just copy
    the inline extent from the source range to the target range (e.g. when the
    target range starts at an offset greater than zero). In such cases we copy
    the inline extent's data into a page of the destination inode and then
    dirty that page. However, after that we will need to start a transaction
    for each processed extent and, if we are ever low on available metadata
    space, we may need to flush existing delalloc for all dirty inodes in an
    attempt to release metadata space - if that happens we may deadlock:

    * the async reclaim task queued a delalloc work to flush delalloc for
    the destination inode of the clone operation;

    * the task executing that delalloc work gets blocked waiting for the
    range with the dirty page to be unlocked, which is currently locked
    by the task doing the clone operation;

    * the async reclaim task blocks waiting for the delalloc work to complete;

    * the cloning task is waiting on the waitqueue of its reservation ticket
    while holding the range with the dirty page locked in the inode's
    io_tree;

    * if metadata space is not released by some other task (like delalloc for
    some other inode completing for example), the clone task waits forever
    and as a consequence the delalloc work and async reclaim tasks will hang
    forever as well. Releasing more space on the other hand may require
    starting a transaction, which will hang as well when trying to reserve
    metadata space, resulting in a deadlock between all these tasks.

    When this happens, traces like the following show up in dmesg/syslog:

    [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
    [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
    [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
    [87452.326136] Call Trace:
    [87452.326737] __schedule+0x5d1/0xcf0
    [87452.327390] schedule+0x45/0xe0
    [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
    [87452.328894] ? finish_wait+0x90/0x90
    [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
    [87452.330133] ? __mod_memcg_state+0x8e/0x160
    [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
    [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
    [87452.332007] ? lock_release+0x20e/0x4c0
    [87452.332557] ? trace_hardirqs_on+0x1b/0xf0
    [87452.333127] extent_writepages+0x43/0x90 [btrfs]
    [87452.333653] ? lock_acquire+0x1a3/0x490
    [87452.334177] do_writepages+0x43/0xe0
    [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
    [87452.335720] __filemap_fdatawrite_range+0xc5/0x100
    [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
    [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
    [87452.337838] process_one_work+0x24e/0x5e0
    [87452.338437] worker_thread+0x50/0x3b0
    [87452.339137] ? process_one_work+0x5e0/0x5e0
    [87452.339884] kthread+0x153/0x170
    [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.341153] ret_from_fork+0x22/0x30
    [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
    [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
    [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
    [87452.345655] Call Trace:
    [87452.346305] __schedule+0x5d1/0xcf0
    [87452.346947] ? kvm_clock_read+0x14/0x30
    [87452.347676] ? wait_for_completion+0x81/0x110
    [87452.348389] schedule+0x45/0xe0
    [87452.349077] schedule_timeout+0x30c/0x580
    [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [87452.350340] ? lock_acquire+0x1a3/0x490
    [87452.351006] ? try_to_wake_up+0x7a/0xa20
    [87452.351541] ? lock_release+0x20e/0x4c0
    [87452.352040] ? lock_acquired+0x199/0x490
    [87452.352517] ? wait_for_completion+0x81/0x110
    [87452.353000] wait_for_completion+0xab/0x110
    [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
    [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
    [87452.354455] flush_space+0x24f/0x660 [btrfs]
    [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
    [87452.355565] process_one_work+0x24e/0x5e0
    [87452.356024] worker_thread+0x20f/0x3b0
    [87452.356487] ? process_one_work+0x5e0/0x5e0
    [87452.356973] kthread+0x153/0x170
    [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.357880] ret_from_fork+0x22/0x30
    (...)
    < stack traces of several tasks waiting for the locks of the inodes of the
    clone operation >
    (...)
    [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
    [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
    [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
    [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
    [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
    [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
    [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
    [92867.447920] Call Trace:
    [92867.448435] __schedule+0x5d1/0xcf0
    [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [92867.449423] schedule+0x45/0xe0
    [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
    [92867.450576] ? finish_wait+0x90/0x90
    [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
    [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
    [92867.452412] start_transaction+0x2d1/0x760 [btrfs]
    [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
    [92867.453848] ? lock_release+0x20e/0x4c0
    [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
    [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
    [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
    [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
    [92867.457213] do_clone_file_range+0xd4/0x1f0
    [92867.457828] vfs_clone_file_range+0x4d/0x230
    [92867.458355] ? lock_release+0x20e/0x4c0
    [92867.458890] ioctl_file_clone+0x8f/0xc0
    [92867.459377] do_vfs_ioctl+0x342/0x750
    [92867.459913] __x64_sys_ioctl+0x62/0xb0
    [92867.460377] do_syscall_64+0x33/0x80
    [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    (...)
    < stack traces of more tasks blocked on metadata reservation like the clone
    task above, because the async reclaim task has deadlocked >
    (...)

    Another thing to notice is that the worker task that is deadlocked when
    trying to flush the destination inode of the clone operation is at
    btrfs_invalidatepage(). This is simply because the clone operation has a
    destination offset greater than the i_size and we only update the i_size
    of the destination file after cloning an extent (just like we do in the
    buffered write path).

    Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
    the flushing of delalloc for all inodes that have delalloc, add a runtime
    flag to an inode to signal it should not be flushed, and for inodes with
    that flag set, start_delalloc_inodes() will simply skip them. When the
    cloning code needs to dirty a page to copy an inline extent, set that flag
    on the inode and then clear it when the clone operation finishes.
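
    A minimal userspace sketch of the skip logic described above. This is
    not the kernel code: the struct, the flag field and the demo_* helpers
    are hypothetical names used only to illustrate how a per-inode runtime
    flag lets the flusher avoid the inode that the clone task has locked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a btrfs inode, not the kernel structure. */
struct demo_inode {
    bool has_delalloc;      /* dirty pages are waiting to be flushed */
    bool no_delalloc_flush; /* runtime flag: flushers must skip this inode */
    bool flushed;           /* set once a flush was started for the inode */
};

/* Called by the clone path before it dirties a page for an inline extent. */
void demo_clone_begin(struct demo_inode *dst)
{
    assert(dst != NULL);
    dst->no_delalloc_flush = true;
}

/* Called when the clone operation finishes. */
void demo_clone_end(struct demo_inode *dst)
{
    assert(dst != NULL);
    dst->no_delalloc_flush = false;
}

/*
 * Model of the flusher loop: start a flush for every inode that has
 * delalloc, except those flagged as "do not flush". Returns the number
 * of inodes for which a flush was started.
 */
size_t demo_start_delalloc_inodes(struct demo_inode *inodes, size_t count)
{
    size_t started = 0;

    for (size_t i = 0; i < count; i++) {
        if (!inodes[i].has_delalloc)
            continue;
        if (inodes[i].no_delalloc_flush)
            continue; /* the clone task holds this range locked */
        inodes[i].flushed = true;
        inodes[i].has_delalloc = false;
        started++;
    }
    return started;
}
```

    In this model the flagged inode is simply picked up by a later flush,
    after demo_clone_end() has cleared the flag.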

    This could be sporadically triggered with test case generic/269 from
    fstests, which exercises many fsstress processes running in parallel with
    several dd processes filling up the entire filesystem.

    CC: stable@vger.kernel.org # 5.9+
    Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     

30 Dec, 2020

1 commit

  • commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 upstream.

    When defragmenting we skip ranges that have holes or inline extents, so that
    we don't do unnecessary IO and waste space. We do this check when calling
    should_defrag_range() at btrfs_defrag_file(). However we do it without
    holding the inode's lock. The reason we do it like this is to avoid
    blocking other tasks for too long, that possibly want to operate on other
    file ranges, since after the call to should_defrag_range() and before
    locking the inode, we trigger a synchronous page cache readahead. However
    before we were able to lock the inode, some other task might have punched
    a hole in our range, or we may now have an inline extent there, in which
    case we should not set the range for defrag anymore since that would cause
    unnecessary IO and make us waste space (i.e. allocating extents to contain
    zeros for a hole).

    So after we locked the inode and the range in the iotree, check again if
    we have holes or an inline extent, and if we do, just skip the range.
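
    The check/lock/recheck pattern can be sketched as follows. This is a
    single-threaded model with hypothetical demo_* names; the "racer"
    callback stands in for another task changing the range in the window
    between the unlocked check and taking the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a file range considered for defrag. */
struct demo_range {
    bool has_hole_or_inline; /* range contains a hole or an inline extent */
    bool locked;             /* models the inode and io tree range lock */
    bool marked_for_defrag;  /* range was set up for defrag */
};

typedef void (*demo_race_fn)(struct demo_range *);

/* Example racer: another task punches a hole in the range. */
void demo_punch_hole(struct demo_range *r)
{
    r->has_hole_or_inline = true;
}

/* Returns true if the range was marked for defrag. */
bool demo_defrag_range(struct demo_range *r, demo_race_fn racer)
{
    assert(r != NULL);

    /* First check, done without holding the lock. */
    if (r->has_hole_or_inline)
        return false;

    /* Window (readahead happens here): other tasks may change the range. */
    if (racer != NULL)
        racer(r);

    /* Take the lock, then check again before committing to the defrag. */
    r->locked = true;
    if (r->has_hole_or_inline) {
        r->locked = false;
        return false; /* skip: avoids unnecessary IO and wasted space */
    }
    r->marked_for_defrag = true;
    r->locked = false;
    return true;
}
```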

    I hit this while testing my next patch that fixes races when updating an
    inode's number of bytes (subject "btrfs: update the number of bytes used
    by an inode atomically"), and it depends on this change in order to work
    correctly. Alternatively I could rework that other patch to detect holes
    and flag their range with the 'new delalloc' bit, but this itself fixes
    an efficiency problem due to a race that from a functional point of view is
    not harmful (it could be triggered with btrfs/062 from fstests).

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

05 Nov, 2020

1 commit

  • On 32-bit systems, this shift will overflow for files larger than 4GB as
    start_index is unsigned long while the calls to btrfs_delalloc_*_space
    expect u64.
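
    A small illustration of the overflow, as a sketch only. uint32_t stands
    in for 'unsigned long' on a 32-bit system, and DEMO_PAGE_SHIFT assumes
    4K pages; both helper names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumes 4K pages */

/* Shift evaluated in 32 bits: bits above bit 31 are lost before widening. */
uint64_t demo_offset_shift_then_widen(uint32_t start_index)
{
    uint32_t truncated = start_index << DEMO_PAGE_SHIFT;

    return truncated;
}

/* Widen to 64 bits first, then shift: the full byte offset is kept. */
uint64_t demo_offset_widen_then_shift(uint32_t start_index)
{
    return (uint64_t)start_index << DEMO_PAGE_SHIFT;
}
```

    For page index 0x100000 (the page starting at the 4GB boundary under
    the 4K page assumption) the first helper returns 0, the second returns
    4294967296.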

    CC: stable@vger.kernel.org # 4.4+
    Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release")
    Reviewed-by: Josef Bacik
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: David Sterba
    [ define the variable instead of repeating the shift ]
    Signed-off-by: David Sterba

    Matthew Wilcox (Oracle)
     

07 Oct, 2020

7 commits

  • The passed in ordered_extent struct is always well-formed and contains
    the inode making the explicit argument redundant.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • We have this thing wrapped in an RCU lock, but it's really not needed.
    We create all the space_info's on mount, and we destroy them on unmount.
    The list never changes and we're protected from messing with it by the
    normal mount/umount path, so kill the RCU stuff around it.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • /sys/fs//exclusive_operation contains the currently executing
    exclusive operation. Add a sysfs_notify() when the operation ends, so
    userspace can be notified when the exclusive operation has finished.

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Josef Bacik
    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     
  • Instead of using a flag bit for exclusive operation, use a variable to
    store which exclusive operation is being performed. Introduce an API
    to start and finish an exclusive operation.

    This would enable another way for tools to check which operation is
    running or why starting an exclusive operation failed. The followup
    patch adds a sysfs_notify() to alert userspace when the state changes, so
    userspace can perform select() on it to get notified of the change.

    This would enable us to enqueue a command which will wait for current
    exclusive operation to complete before issuing the next exclusive
    operation. This has been done synchronously as opposed to a background
    process, or else error collection (if any) will become difficult.
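
    A userspace sketch of the "variable instead of a flag bit" idea. It uses
    C11 atomics purely for illustration; the kernel implementation may use
    different locking, and the enum values, struct and demo_* function names
    below are assumptions, not the actual API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list of exclusive operations. */
enum demo_exclop {
    DEMO_EXCLOP_NONE = 0,
    DEMO_EXCLOP_BALANCE,
    DEMO_EXCLOP_RESIZE,
    DEMO_EXCLOP_DEV_REPLACE,
};

struct demo_fs_info {
    _Atomic int exclusive_operation; /* which operation is running, if any */
};

void demo_fs_init(struct demo_fs_info *fs)
{
    assert(fs != NULL);
    atomic_init(&fs->exclusive_operation, DEMO_EXCLOP_NONE);
}

/* Try to become the running exclusive operation; false if one is running. */
bool demo_exclop_start(struct demo_fs_info *fs, enum demo_exclop op)
{
    int expected = DEMO_EXCLOP_NONE;

    assert(op != DEMO_EXCLOP_NONE);
    return atomic_compare_exchange_strong(&fs->exclusive_operation,
                                          &expected, (int)op);
}

/* Mark the exclusive operation as finished. */
void demo_exclop_finish(struct demo_fs_info *fs)
{
    atomic_store(&fs->exclusive_operation, DEMO_EXCLOP_NONE);
}

/* Unlike a single flag bit, the variable says which operation is running. */
enum demo_exclop demo_exclop_running(struct demo_fs_info *fs)
{
    return (enum demo_exclop)atomic_load(&fs->exclusive_operation);
}
```

    A caller whose demo_exclop_start() fails can ask demo_exclop_running()
    which operation is in the way, which a plain flag bit cannot express.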

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: David Sterba
    [ update comments ]
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     
  • When we COW a block we are holding a lock on the original block, and
    then we lock the new COW block. Because our lockdep maps are based on
    root + level, this will make lockdep complain. We need a way to
    indicate a subclass for locking the COW'ed block, so plumb through our
    btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
    and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.

    The reason I've added all this extra infrastructure is because there
    will be need of different nesting classes in follow up patches.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • [BUG]
    When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:

    generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)

    And with the following metadata leak:

    BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
    Call Trace:
    btrfs_put_super+0x15/0x17 [btrfs]
    generic_shutdown_super+0x72/0x110
    kill_anon_super+0x18/0x30
    btrfs_kill_super+0x17/0x30 [btrfs]
    deactivate_locked_super+0x3b/0xa0
    deactivate_super+0x40/0x50
    cleanup_mnt+0x135/0x190
    __cleanup_mnt+0x12/0x20
    task_work_run+0x64/0xb0
    __prepare_exit_to_usermode+0x1bc/0x1c0
    __syscall_return_slowpath+0x47/0x230
    do_syscall_64+0x64/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ---[ end trace a6cfd45ba80e4e06 ]---
    BTRFS error (device dm-3): qgroup reserved space leaked
    BTRFS info (device dm-3): disk space caching is enabled
    BTRFS info (device dm-3): has skinny extents

    [CAUSE]
    The qgroup preallocated meta rsv operations of that offending root are:

    btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
    btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
    btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
    btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
    btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072

    It's pretty obvious that we reserve qgroup meta rsv in
    btrfs_subvolume_reserve_metadata(), but don't have corresponding
    release/convert calls in btrfs_subvolume_release_metadata().

    This leads to the leakage.

    [FIX]
    To fix this bug, we should follow what we're doing in
    btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
    add it to block_rsv->qgroup_rsv_reserved.

    And free the qgroup reserved metadata space when releasing the
    block_rsv.

    To do this, we need to change the btrfs_subvolume_release_metadata() to
    accept btrfs_root, and record the qgroup_to_release number, and call
    btrfs_qgroup_convert_reserved_meta() for it.
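
    The reserve/release pairing can be modelled roughly like this. The
    structs and demo_* functions are hypothetical, and the model assumes
    that "convert" moves the bytes from the prealloc counter to the
    per-transaction counter; it only shows why the release side needs the
    root and the recorded qgroup_to_release amount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified accounting structures. */
struct demo_block_rsv {
    uint64_t qgroup_rsv_reserved; /* qgroup bytes tied to this block_rsv */
};

struct demo_root {
    uint64_t qgroup_meta_prealloc; /* reserved, not yet converted */
    uint64_t qgroup_meta_pertrans; /* converted, freed at transaction end */
};

/* Reserve qgroup space and remember it in the block_rsv. */
void demo_subvolume_reserve_metadata(struct demo_root *root,
                                     struct demo_block_rsv *rsv,
                                     uint64_t num_bytes)
{
    assert(root != NULL && rsv != NULL);
    root->qgroup_meta_prealloc += num_bytes;
    rsv->qgroup_rsv_reserved += num_bytes;
}

/* Release the block_rsv and convert what it had reserved in the qgroup. */
void demo_subvolume_release_metadata(struct demo_root *root,
                                     struct demo_block_rsv *rsv)
{
    uint64_t qgroup_to_release;

    assert(root != NULL && rsv != NULL);
    qgroup_to_release = rsv->qgroup_rsv_reserved;
    rsv->qgroup_rsv_reserved = 0;

    /* Without this step the prealloc bytes would leak, as in the report. */
    assert(root->qgroup_meta_prealloc >= qgroup_to_release);
    root->qgroup_meta_prealloc -= qgroup_to_release;
    root->qgroup_meta_pertrans += qgroup_to_release;
}
```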

    Fixes: 733e03a0b26a ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
    btrfs_start_delalloc_roots() that takes an int for nr, which makes using
    them in conjunction, especially for something like (u64)-1, annoying and
    inconsistent. Fix btrfs_start_delalloc_roots() to take a u64 for nr and
    adjust start_delalloc_inodes() and its callers appropriately.

    This means we've adjusted start_delalloc_inodes() to take a pointer to
    nr since we want to preserve the ability for start_delalloc_inodes() to
    return an error, so simply make it do the nr adjusting as necessary.

    Part of adjusting the callers to this means changing
    btrfs_writeback_inodes_sb_nr() to take a u64 for items. This may be
    confusing because it seems unrelated, but the caller of
    btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
    function variable that needs to be changed.
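
    A sketch of the "count passed by pointer, error in the return value"
    shape described above. The function below is a userspace model with
    hypothetical names; the array of per-inode results and the
    DEMO_FLUSH_ALL constant (standing in for (u64)-1) are assumptions made
    for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FLUSH_ALL UINT64_MAX /* "no limit", i.e. (u64)-1 */

/*
 * Model: queue_result[i] is the result of starting writeback for inode i
 * (0 on success, negative errno on failure). *nr is the remaining budget
 * and is decremented here, so the return value stays free for errors.
 */
int demo_start_delalloc_inodes(const int *queue_result, size_t count,
                               uint64_t *nr)
{
    assert(nr != NULL);

    for (size_t i = 0; i < count; i++) {
        if (*nr == 0)
            break;                  /* budget used up */
        if (queue_result[i] != 0)
            return queue_result[i]; /* error is still reportable */
        if (*nr != DEMO_FLUSH_ALL)
            (*nr)--;                /* "all" is never decremented */
    }
    return 0;
}
```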

    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

14 Sep, 2020

1 commit

  • When faulting in the pages for the user supplied buffer for the search
    ioctl, we are passing only the base address of the buffer to the function
    fault_in_pages_writeable(). This means that after the first iteration of
    the while loop that searches for leaves, when we have a non-zero offset,
    stored in 'sk_offset', we try to fault in a wrong page range.

    So fix this by adding the offset in 'sk_offset' to the base address of the
    user supplied buffer when calling fault_in_pages_writeable().
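
    In other words, the range to fault in has to follow the write position.
    A minimal sketch of that arithmetic (userspace model; the struct and
    the demo_fault_in_range() helper are hypothetical, only 'sk_offset' is
    a name taken from the description above):

```c
#include <assert.h>
#include <stddef.h>

/* Range that should be faulted in before the next copy to userspace. */
struct demo_fault_range {
    char *start;
    size_t len;
};

/*
 * ubuf is the base of the user supplied buffer, buf_size its total size
 * and sk_offset the number of bytes already filled by earlier iterations.
 */
struct demo_fault_range demo_fault_in_range(char *ubuf, size_t buf_size,
                                            size_t sk_offset)
{
    struct demo_fault_range r;

    assert(sk_offset <= buf_size);
    r.start = ubuf + sk_offset;   /* not just ubuf, which was the bug */
    r.len = buf_size - sk_offset; /* only the part still to be written */
    return r;
}
```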

    Several users have reported that the applications compsize and bees have
    started to operate incorrectly since commit a48b73eca4ceb9 ("btrfs: fix
    potential deadlock in the search ioctl") was added to stable trees, and
    these applications make heavy use of the search ioctls. This fixes their
    issues.

    Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/
    Link: https://github.com/kilobyte/compsize/issues/34
    Fixes: a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl")
    CC: stable@vger.kernel.org # 4.4+
    Tested-by: A L
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     

27 Aug, 2020

1 commit

  • With the conversion of the tree locks to rwsem I got the following
    lockdep splat:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
    ------------------------------------------------------
    compsize/11122 is trying to acquire lock:
    ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90

    but task is already holding lock:
    ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (btrfs-fs-00){++++}-{3:3}:
    down_write_nested+0x3b/0x70
    __btrfs_tree_lock+0x24/0x120
    btrfs_search_slot+0x756/0x990
    btrfs_lookup_inode+0x3a/0xb4
    __btrfs_update_delayed_inode+0x93/0x270
    btrfs_async_run_delayed_root+0x168/0x230
    btrfs_work_helper+0xd4/0x570
    process_one_work+0x2ad/0x5f0
    worker_thread+0x3a/0x3d0
    kthread+0x133/0x150
    ret_from_fork+0x1f/0x30

    -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    btrfs_delayed_update_inode+0x50/0x440
    btrfs_update_inode+0x8a/0xf0
    btrfs_dirty_inode+0x5b/0xd0
    touch_atime+0xa1/0xd0
    btrfs_file_mmap+0x3f/0x60
    mmap_region+0x3a4/0x640
    do_mmap+0x376/0x580
    vm_mmap_pgoff+0xd5/0x120
    ksys_mmap_pgoff+0x193/0x230
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
    __lock_acquire+0x1272/0x2310
    lock_acquire+0x9e/0x360
    __might_fault+0x68/0x90
    _copy_to_user+0x1e/0x80
    copy_to_sk.isra.32+0x121/0x300
    search_ioctl+0x106/0x200
    btrfs_ioctl_tree_search_v2+0x7b/0xf0
    btrfs_ioctl+0x106f/0x30a0
    ksys_ioctl+0x83/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    other info that might help us debug this:

    Chain exists of:
    &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(btrfs-fs-00);
    lock(&delayed_node->mutex);
    lock(btrfs-fs-00);
    lock(&mm->mmap_lock#2);

    *** DEADLOCK ***

    1 lock held by compsize/11122:
    #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

    stack backtrace:
    CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
    Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
    Call Trace:
    dump_stack+0x78/0xa0
    check_noncircular+0x165/0x180
    __lock_acquire+0x1272/0x2310
    lock_acquire+0x9e/0x360
    ? __might_fault+0x3e/0x90
    ? find_held_lock+0x72/0x90
    __might_fault+0x68/0x90
    ? __might_fault+0x3e/0x90
    _copy_to_user+0x1e/0x80
    copy_to_sk.isra.32+0x121/0x300
    ? btrfs_search_forward+0x2a6/0x360
    search_ioctl+0x106/0x200
    btrfs_ioctl_tree_search_v2+0x7b/0xf0
    btrfs_ioctl+0x106f/0x30a0
    ? __do_sys_newfstat+0x5a/0x70
    ? ksys_ioctl+0x83/0xc0
    ksys_ioctl+0x83/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The problem is we're doing a copy_to_user() while holding tree locks,
    which can deadlock if we have to do a page fault for the copy_to_user().
    This exists even without my locking changes, so it needs to be fixed.
    Rework the search ioctl to do the pre-fault and then
    copy_to_user_nofault for the copying.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

27 Jul, 2020

8 commits

  • User Forza reported on IRC that some invalid combinations of file
    attributes are accepted by chattr.

    The NODATACOW and compression file flags/attributes are mutually
    exclusive, but they could be set by 'chattr +c +C' on an empty file. The
    nodatacow will be in effect because it's checked first in
    btrfs_run_delalloc_range.

    Extend the flag validation to catch the following cases:

    - input flags are conflicting
    - old and new flags are conflicting
    - initialize the local variable with inode flags after inode is locked

    Inode attributes take precedence over mount options and are an
    independent setting.

    Nocompress would be a no-op with nodatacow, but we don't want to mix
    any compression-related options with nodatacow.
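
    A sketch of what such a validation can look like. The DEMO_* bit values
    and the demo_check_fsflags() helper are made up for this example and do
    not match the real flag definitions; the checks follow the cases listed
    above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits, for illustration only. */
#define DEMO_COMPR_FL  0x1u /* compress (chattr +c) */
#define DEMO_NOCOMP_FL 0x2u /* do not compress */
#define DEMO_NOCOW_FL  0x4u /* nodatacow (chattr +C) */

/* Returns 0 if the new flags are acceptable, -EINVAL on a conflict. */
int demo_check_fsflags(unsigned int old_flags, unsigned int new_flags)
{
    const unsigned int compr_any = DEMO_COMPR_FL | DEMO_NOCOMP_FL;

    /* Input flags conflict with each other. */
    if ((new_flags & DEMO_NOCOW_FL) && (new_flags & compr_any))
        return -EINVAL;

    /* Old and new flags conflict, in either direction. */
    if ((old_flags & DEMO_NOCOW_FL) && (new_flags & compr_any))
        return -EINVAL;
    if ((new_flags & DEMO_NOCOW_FL) && (old_flags & compr_any))
        return -EINVAL;

    return 0;
}
```

    In this model old_flags must be read only after the inode is locked,
    otherwise the comparison could run against stale flags.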

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: David Sterba

    David Sterba
     
  • Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl.
    This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in
    btrfs_ioctl_fs_info_args::flags.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • Add retrieval of the filesystem's generation to the fsinfo ioctl. This is
    driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in
    btrfs_ioctl_fs_info_args::flags.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • With the recent addition of filesystem checksum types other than CRC32c,
    it is no longer hard-coded which checksum type a btrfs filesystem uses.

    Up to now there is no good way to read the filesystem checksum, apart from
    reading the filesystem UUID and then querying sysfs for the checksum type.

    Add new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
    command which usually is used to query filesystem features. Also add a
    flags member indicating that the kernel responded with a set csum_type and
    csum_size field.

    For compatibility reasons, only return the csum_type and csum_size if
    the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
    clear any unknown flags so we don't pass false positives to user-space
    newer than the kernel.

    To simplify further additions to the ioctl, also switch the padding to a
    u8 array. Pahole was used to verify the result of this switch:

    The csum members are added before flags, which might look odd, but this
    is to keep the alignment requirements and not to introduce holes in the
    structure.

    $ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
    struct btrfs_ioctl_fs_info_args {
    __u64 max_id; /* 0 8 */
    __u64 num_devices; /* 8 8 */
    __u8 fsid[16]; /* 16 16 */
    __u32 nodesize; /* 32 4 */
    __u32 sectorsize; /* 36 4 */
    __u32 clone_alignment; /* 40 4 */
    __u16 csum_type; /* 44 2 */
    __u16 csum_size; /* 46 2 */
    __u64 flags; /* 48 8 */
    __u8 reserved[968]; /* 56 968 */

    /* size: 1024, cachelines: 16, members: 10 */
    };
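
    The request/response use of the flags member can be sketched as below.
    This is a userspace model, not the ioctl handler: the reduced struct,
    the DEMO_FLAG_CSUM_INFO bit value and demo_fill_fs_info() are
    assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FLAG_CSUM_INFO (1ULL << 0) /* hypothetical bit value */

/* Reduced, hypothetical version of the argument structure. */
struct demo_fs_info_args {
    uint16_t csum_type;
    uint16_t csum_size;
    uint64_t flags; /* in: what the caller asks for, out: what was set */
};

/* Fill the reply according to the flags the caller passed in. */
void demo_fill_fs_info(struct demo_fs_info_args *args,
                       uint16_t csum_type, uint16_t csum_size)
{
    uint64_t requested;

    assert(args != NULL);
    requested = args->flags;

    /* Start from zero so unknown request bits are never echoed back. */
    args->flags = 0;

    if (requested & DEMO_FLAG_CSUM_INFO) {
        args->csum_type = csum_type;
        args->csum_size = csum_size;
        args->flags |= DEMO_FLAG_CSUM_INFO;
    }
}
```

    A caller that sets a bit this model does not know gets it back cleared,
    so it can tell that the corresponding fields were not filled in.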

    Fixes: 3951e7f050ac ("btrfs: add xxhash64 to checksumming algorithms")
    Fixes: 3831bf0094ab ("btrfs: add sha256 to checksumming algorithm")
    CC: stable@vger.kernel.org # 5.5+
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • [BUG]
    When the anonymous block device pool is exhausted, subvolume/snapshot
    creation fails with EMFILE (Too many open files). This has been reported
    by a user. The allocation happens in the second phase during transaction
    commit where its only way out is to abort the transaction:

    BTRFS: Transaction aborted (error -24)
    WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
    RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
    Call Trace:
    create_pending_snapshots+0x82/0xa0 [btrfs]
    btrfs_commit_transaction+0x275/0x8c0 [btrfs]
    btrfs_mksubvol+0x4b9/0x500 [btrfs]
    btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
    btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
    btrfs_ioctl+0x11a4/0x2da0 [btrfs]
    do_vfs_ioctl+0xa9/0x640
    ksys_ioctl+0x67/0x90
    __x64_sys_ioctl+0x1a/0x20
    do_syscall_64+0x5a/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ---[ end trace 33f2f83f3d5250e9 ]---
    BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
    BTRFS info (device sda1): forced readonly
    BTRFS warning (device sda1): Skipping commit of aborted transaction.
    BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown

    [CAUSE]
    When the global anonymous block device pool is exhausted, the following
    call chain will fail, and lead to transaction abort:

    btrfs_ioctl_snap_create_v2()
    |- btrfs_ioctl_snap_create_transid()
    |- btrfs_mksubvol()
    |- btrfs_commit_transaction()
    |- create_pending_snapshot()
    |- btrfs_get_fs_root()
    |- btrfs_init_fs_root()
    |- get_anon_bdev()

    [FIX]
    Although we can't enlarge the anonymous block device pool, at least we
    can preallocate anon_dev for subvolume/snapshot in the first phase,
    outside of transaction context and exactly at the moment the user calls
    the creation ioctl.
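
    The effect of moving the allocation into the first phase can be
    modelled like this. The pool structure and the demo_* functions are
    hypothetical; the point is only that an exhausted pool now produces a
    plain error before any transaction work has started.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the global anonymous block device pool. */
struct demo_anon_pool {
    unsigned int free_ids;
};

/* Take one id from the pool, or fail with -EMFILE if it is exhausted. */
int demo_get_anon_dev(struct demo_anon_pool *pool)
{
    assert(pool != NULL);
    if (pool->free_ids == 0)
        return -EMFILE;
    pool->free_ids--;
    return 0;
}

/*
 * Phase 1 (outside the transaction): preallocate the anon_dev.
 * Phase 2 (transaction commit): only reached if phase 1 succeeded, so a
 * failed allocation can no longer force a transaction abort.
 */
int demo_create_snapshot(struct demo_anon_pool *pool,
                         bool *transaction_started)
{
    int ret = demo_get_anon_dev(pool);

    if (ret)
        return ret; /* plain error return to the caller of the ioctl */

    *transaction_started = true;
    return 0;
}
```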

    Reported-by: Greed Rong
    Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • All of its children take btrfs_inode so bubble up this requirement to
    btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I
    internally.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • It needs btrfs_inode so take it as a parameter directly.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • It doesn't use the generic vfs inode for anything, so use btrfs_inode
    directly.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

17 Jun, 2020

1 commit

  • In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc()
    was incorrectly paired with kzfree(). According to David Sterba, there
    isn't any sensitive information in the subvol_info that needs to be
    cleared before freeing. So kzfree() isn't really needed, use kfree()
    instead.

    Signed-off-by: Waiman Long
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Waiman Long
     

25 May, 2020

4 commits

  • The inode lookup starting at btrfs_iget takes the full location key,
    while only the objectid is used to match the inode, because the lookup
    happens inside the given root thus the inode number is unique.
    The entire location key is properly set up in btrfs_init_locked_inode.

    Simplify the helpers and pass only the inode number, renaming it to 'ino'
    instead of 'objectid'. This allows removing the temporary key variables,
    saving some stack space.

    Signed-off-by: David Sterba

    David Sterba
     
  • The main function to lookup a root by its id btrfs_get_fs_root takes the
    whole key, while only using the objectid. The value of offset is preset
    to (u64)-1 but not actually used until btrfs_find_root that does the
    actual search.

    Switch btrfs_get_fs_root to use only objectid and remove all local
    variables that existed just for the lookup. The actual key for search is
    set up in btrfs_get_fs_root, reusing another key variable.

    Signed-off-by: David Sterba

    David Sterba
     
  • When creating a snapshot, ordered extents need to be flushed and this
    can take a long time.

    In create_snapshot there are two locks held when this happens:

    1. Destination directory inode lock
    2. Global subvolume semaphore

    This will unnecessarily block other operations like subvolume destroy,
    create, or setflag until the snapshot is created.

    We can fix that by moving the flush outside the locked section as this
    does not depend on the aforementioned locks. The code factors out the
    snapshot related work from create_snapshot to btrfs_mksnapshot.

    __btrfs_ioctl_snap_create
    btrfs_mksubvol
    create_subvol
    btrfs_mksnapshot

    btrfs_mksubvol
    create_snapshot

    Reviewed-by: Filipe Manana
    Signed-off-by: Robbie Ko
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Robbie Ko
     
  • The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.

    In fact, that bit can only be set for these trees:

    - Subvolume roots
    - Data reloc root
    - Reloc roots for above roots

    All other trees won't get this bit set. So just by the result, it is
    obvious that roots with this bit set can have tree blocks shared with
    other trees, either shared by snapshots or by reloc roots (a special
    snapshot created by relocation).

    This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
    make it easier to understand, and update all comments mentioning
    "reference counted" to follow the rename.

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     

24 Mar, 2020

15 commits

  • With BTRFS_SUBVOL_CREATE_ASYNC support removed, it's no longer required
    to pass the async_transid parameter, so remove it and any code using it.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • btrfs_ioctl_snap_create_transid no longer takes a transid argument, so
    remove it and rename the function to __btrfs_ioctl_snap_create to
    reflect it's an internal, worker function.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • This functionality was deprecated in kernel 5.4. Since no one has
    complained about the impending removal, it's time to remove it.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    [ add comment ]
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The reflink code is quite large and has lived in ioctl.c from the start.
    It has grown over the years after many bug fixes and improvements, and
    since I'm planning on making some further improvements on it, it's time
    to get it better organized by moving into its own file, reflink.c
    (similar to what xfs does for example).

    This change only moves the code out of ioctl.c into the new file; it
    makes no other change.

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • There are new types and helpers that are supposed to be used in new code.

    As a preparation to get rid of legacy types and API functions do
    the conversion here.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Andy Shevchenko
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Andy Shevchenko
     
  • There is no point in informing the user about a size change if there's none.
    Update the message to conform to a commonly used format where the path
    and devid are printed and also print old and new sizes.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Marcos Paulo de Souza
    Reviewed-by: David Sterba
    [ enhance message ]
    Signed-off-by: David Sterba

    Marcos Paulo de Souza
     
  • This patch removes all haphazard code implementing nocow writers
    exclusion from pending snapshot creation and switches to using the drew
    lock to ensure this invariant still holds.

    'Readers' are snapshot creators from create_snapshot and 'writers' are
    nocow writers from buffered write path or btrfs_setsize. This locking
    scheme allows for multiple snapshots to happen while any nocow writers
    are blocked, since writes to page cache in the nocow path will make
    snapshots inconsistent.

    For performance reasons we'd like to be able to run multiple concurrent
    snapshots, so the lock favors readers in this case. When there are no
    pending snapshots (which will be the majority of cases) we rely on the
    percpu writers counter to avoid cacheline contention.

    The main gain from using the drew lock is it's now a lot easier to
    reason about the guarantees of the locking scheme and whether there is
    some silent breakage lurking.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
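
    The exclusion rule described above (any number of holders from one
    class, never both classes at once) can be sketched in userspace C.
    This is a simplified illustration and not the kernel drew lock: it
    offers only non-blocking try operations, guards its state with a plain
    pthread mutex instead of percpu counters, and does not implement the
    reader preference. All demo_* names are hypothetical.

    ```c
    #include <pthread.h>
    #include <stdbool.h>

    struct demo_drew_lock {
        pthread_mutex_t mu; /* protects the two counters */
        int readers;        /* e.g. snapshot creators */
        int writers;        /* e.g. nocow writers */
    };

    #define DEMO_DREW_LOCK_INIT { PTHREAD_MUTEX_INITIALIZER, 0, 0 }

    /* Succeeds only while no writer holds the lock. */
    bool demo_drew_try_read_lock(struct demo_drew_lock *l)
    {
        bool ok;

        pthread_mutex_lock(&l->mu);
        ok = (l->writers == 0);
        if (ok)
            l->readers++;
        pthread_mutex_unlock(&l->mu);
        return ok;
    }

    void demo_drew_read_unlock(struct demo_drew_lock *l)
    {
        pthread_mutex_lock(&l->mu);
        l->readers--;
        pthread_mutex_unlock(&l->mu);
    }

    /* Succeeds only while no reader holds the lock. */
    bool demo_drew_try_write_lock(struct demo_drew_lock *l)
    {
        bool ok;

        pthread_mutex_lock(&l->mu);
        ok = (l->readers == 0);
        if (ok)
            l->writers++;
        pthread_mutex_unlock(&l->mu);
        return ok;
    }

    void demo_drew_write_unlock(struct demo_drew_lock *l)
    {
        pthread_mutex_lock(&l->mu);
        l->writers--;
        pthread_mutex_unlock(&l->mu);
    }
    ```

    With this sketch, two readers can hold the lock together and a writer
    attempt fails until both have unlocked; the same holds with the roles
    swapped.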

    Nikolay Borisov
     
  • This ioctl will be responsible for deleting a subvolume using its id.
    This can be used when a system has a file system mounted from a
    subvolume, rather than the root file system, like below:

    /
    @subvol1/
    @subvol2/
    @subvol_default/

    If only @subvol_default is mounted, we have no path to reach @subvol1
    and @subvol2, thus no way to delete them. The current subvolume delete
    ioctl takes a file handle as argument, and if @subvol_default is
    mounted, we can't reach @subvol1 and @subvol2 from the same mount point.

    This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
    the extended structure with flags, which allows deleting a subvolume by
    its subvolid.

    Now we can use this new ioctl, specifying the subvolume id and referring
    to the same mount point. It doesn't matter which subvolume was mounted,
    since we can reach the desired one using the subvolume id and then
    delete it.

    The full path to the subvolume id is resolved internally and access is
    verified as if the subvolume was accessed by path.

    The volume args v2 structure is extended to use the existing union for
    the subvolume id specification, which is valid when the
    BTRFS_SUBVOL_SPEC_BY_ID flag is set.

    Signed-off-by: Marcos Paulo de Souza
    Reviewed-by: David Sterba
    [ update changelog ]
    Signed-off-by: David Sterba

    Marcos Paulo de Souza
     
  • When the device remove v2 ioctl was added, the full support mask was
    added to sanity check the flags. However, this allows the subvolume
    related flags to be accepted, which is not supposed to happen.

    Use the correct support mask, which means that now any of
    BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or
    BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is
    a user-visible change, specifying subvolume flags for device deletion
    does not make sense and there are hopefully no applications doing that.

    Reviewed-by: Marcos Paulo de Souza
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba
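
    The per-ioctl mask check can be sketched as below. This is an
    illustration only: the DEMO_* names, their bit positions, and the
    by-id device flag are assumptions made for the sketch, and
    -EOPNOTSUPP is used as the userspace-visible error value.

    ```c
    #include <errno.h>
    #include <stdint.h>

    /* Illustrative flag bits; the real ones live in the btrfs UAPI header. */
    #define DEMO_SUBVOL_CREATE_ASYNC   (1ULL << 0)
    #define DEMO_SUBVOL_RDONLY         (1ULL << 1)
    #define DEMO_SUBVOL_QGROUP_INHERIT (1ULL << 2)
    #define DEMO_DEVICE_SPEC_BY_ID     (1ULL << 3)

    /* Mask of flags that make sense for device removal in this sketch. */
    #define DEMO_DEVICE_REMOVE_ARGS_MASK (DEMO_DEVICE_SPEC_BY_ID)

    /* Reject any flag outside the mask that belongs to this ioctl. */
    int demo_check_rm_dev_flags(uint64_t flags)
    {
        if (flags & ~DEMO_DEVICE_REMOVE_ARGS_MASK)
            return -EOPNOTSUPP;
        return 0;
    }
    ```

    Checking against a narrow mask rather than the full one means a
    subvolume flag passed by mistake is refused instead of being silently
    ignored.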

    David Sterba
     
  • Using the defined mask instead of flag enumeration in the ioctl handler
    is preferred. No functional changes.

    Reviewed-by: Marcos Paulo de Souza
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
     
  • We are now using these for all roots, so rename them to btrfs_put_root()
    and btrfs_grab_root().

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Now that all callers of btrfs_get_fs_root are subsequently calling
    btrfs_grab_fs_root and handling dropping the ref when they are done
    appropriately, go ahead and push btrfs_grab_fs_root up into
    btrfs_get_fs_root.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
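
    The pattern of a lookup that returns an already referenced object can
    be sketched as below. This is not the btrfs implementation: the
    demo_root type, the static table, and the plain (non-atomic) counter
    are all hypothetical simplifications.

    ```c
    #include <assert.h>
    #include <stddef.h>

    struct demo_root {
        unsigned long objectid;
        int refs; /* the table itself owns one reference */
    };

    static struct demo_root demo_roots[] = {
        { 5, 1 }, { 256, 1 }, { 257, 1 },
    };

    /* Look up a root and take a reference before returning it. */
    struct demo_root *demo_get_root(unsigned long objectid)
    {
        size_t n = sizeof(demo_roots) / sizeof(demo_roots[0]);

        for (size_t i = 0; i < n; i++) {
            if (demo_roots[i].objectid == objectid) {
                demo_roots[i].refs++;
                return &demo_roots[i];
            }
        }
        return NULL;
    }

    /* Drop the reference taken by demo_get_root(). */
    void demo_put_root(struct demo_root *root)
    {
        if (!root)
            return;
        assert(root->refs > 0);
        root->refs--;
    }
    ```

    Because the lookup itself takes the reference, a caller cannot forget
    the grab step; it only has to pair each successful lookup with one put.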

    Josef Bacik
     
  • We create the snapshot and then use it for a bunch of things, so we need
    to hold a ref on it while we're messing with it.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We look up an arbitrary fs root here, so we need to hold a ref on the
    root for the duration.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We look up whatever root userspace has given us, so we need to hold a
    ref throughout this operation. Use 'root' only for the on fs root and
    not as a temporary variable elsewhere.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik