04 Jul, 2019

4 commits


02 Jul, 2019

36 commits

  • These helpers belong in block-rsv.c

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • This moves everything out of extent-tree.c to block-rsv.c.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • block_rsv_release_bytes() is the internal to the block_rsv code, and
    shouldn't be called directly by anything else. Switch all users to the
    exported helpers.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • This works for all callers already, but if we wanted to use the helper
    for the global_block_rsv it would end up trying to refill itself. Fix
    the logic to be able to be used no matter which block rsv is passed in
    to this helper.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • The delalloc reserve stuff calls this directly because it cares about
    the qgroup accounting stuff, so export it so we can move it around. Fix
    btrfs_block_rsv_release() to just be a static inline since it just calls
    __btrfs_block_rsv_release() with NULL for the qgroup stuff.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • This is used in a few places, we need to make sure it's exported so we
    can move it around.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Prep work for separating out all of the block_rsv related code into its
    own file.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We don't need an if-else-if chain where we can use a simple OR since
    both conditions are performing the same action. The short-circuit for OR
    will ensure that if the first condition is true, can_overcommit() is not
    called.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     
  • Now that we've moved all of the users to space-info.c, unexport it and
    name it back to can_overcommit.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • This moves all of the metadata reservation code into space-info.c.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We'll need this exported so we can use it in all the various was we need
    to use it. This is prep work to move reserve_metadata_bytes.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We are going to need this to move the metadata reservation stuff to
    space_info.c.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Now that we've moved all the pre-requisite stuff, move these two
    functions.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Also rename it to btrfs_space_info_update_* so it's clear what we're
    updating.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • This is the first piece of moving the space reservation code to
    space-info.c

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • These are the basic init and lookup functions and some helper functions,
    fairly straightforward before the bad stuff starts.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Prep work for consolidating all of the space_info code into one file.
    We need to export these so multiple files can use them.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Really we just need the enum, but as we break more things up it'll help
    to have this external to extent-tree.c.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Migrate the struct definition and the one helper that's in ctree.h into
    space-info.h

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • The block device is passed around for the only purpose to set it in new
    bios. Move the assignment one level up. This is a preparatory patch for
    further bdev cleanups.

    Signed-off-by: David Sterba

    David Sterba
     
  • Minimum stripe count matches the minimum devices required for a given
    profile. The open coded assignments match the raid_attr table.

    What's changed here is the meaning for RAID5/6. Previously their
    min_stripes would be 1, while newly it's devs_min. This however shold be
    the same as before because it's not possible to create filesystem on
    fewer devices than the raid_attr table allows.

    There's no adjustment regarding the parity stripes (like
    calc_data_stripes does), because we're interested in overall space that
    would fit on the devices.

    Missing devices make no difference for the whole calculation, we have
    the size stored in the structures.

    Signed-off-by: David Sterba

    David Sterba
     
  • Special case for DUP can be replaced by lookup to the attribute table,
    where the dev_stripes is the right coefficient.

    Signed-off-by: David Sterba

    David Sterba
     
  • A few more instances whre we don't need to specify the values as long as
    they are the same that enum assigns automatically. All of the enums are
    in-memory only and nothing relies on the exact values.

    Signed-off-by: David Sterba

    David Sterba
     
  • Print the error messages using the helpers that also print the
    filesystem identification.

    Signed-off-by: David Sterba

    David Sterba
     
  • We have been seeing issues in production where a cleaner script will end
    up unlinking a bunch of files that have pending iputs. This means they
    will get their final iput's run at btrfs-cleaner time and thus are not
    throttled, which impacts the workload.

    Since we are unlinking these files we can just drop the delayed iput at
    unlink time. We are already holding a reference to the inode so this
    will not be the final iput and thus is completely safe to do at this
    point. Doing this means we are more likely to be doing the final iput
    at unlink time, and thus will get the IO charged to the caller and get
    throttled appropriately without affecting the main workload.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • If the range for which we are punching a hole covers only part of a page,
    we end up updating the inode item but we skip the update of the inode's
    iversion, mtime and ctime. Fix that by ensuring we update those properties
    of the inode.

    A patch for fstests test case generic/059 that tests this as been sent
    along with this fix.

    Fixes: 2aaa66558172b0 ("Btrfs: add hole punching")
    Fixes: e8c1c76e804b18 ("Btrfs: add missing inode update when punching hole")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • In order to avoid searches on a log tree when unlinking an inode, we check
    if the inode being unlinked was logged in the current transaction, as well
    as the inode of its parent directory. When any of the inodes are logged,
    we proceed to delete directory items and inode reference items from the
    log, to ensure that if a subsequent fsync of only the inode being unlinked
    or only of the parent directory when the other is not fsync'ed as well,
    does not result in the entry still existing after a power failure.

    That check however is not reliable when one of the inodes involved (the
    one being unlinked or its parent directory's inode) is evicted, since the
    logged_trans field is transient, that is, it is not stored on disk, so it
    is lost when the inode is evicted and loaded into memory again (which is
    set to zero on load). As a consequence the checks currently being done by
    btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
    return true if the inode was evicted before, regardless of the inode
    having been logged or not before (and in the current transaction), this
    results in the dentry being unlinked still existing after a log replay
    if after the unlink operation only one of the inodes involved is fsync'ed.

    Example:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ mkdir /mnt/dir
    $ touch /mnt/dir/foo
    $ xfs_io -c fsync /mnt/dir/foo

    # Keep an open file descriptor on our directory while we evict inodes.
    # We just want to evict the file's inode, the directory's inode must not
    # be evicted.
    $ ( cd /mnt/dir; while true; do :; done ) &
    $ pid=$!

    # Wait a bit to give time to background process to chdir to our test
    # directory.
    $ sleep 0.5

    # Trigger eviction of the file's inode.
    $ echo 2 > /proc/sys/vm/drop_caches

    # Unlink our file and fsync the parent directory. After a power failure
    # we don't expect to see the file anymore, since we fsync'ed the parent
    # directory.
    $ rm -f $SCRATCH_MNT/dir/foo
    $ xfs_io -c fsync /mnt/dir

    $ mount /dev/sdb /mnt
    $ ls /mnt/dir
    foo
    $
    --> file still there, unlink not persisted despite explicit fsync on dir

    Fix this by checking if the inode has the full_sync bit set in its runtime
    flags as well, since that bit is set everytime an inode is loaded from
    disk, or for other less common cases such as after a shrinking truncate
    or failure to allocate extent maps for holes, and gets cleared after the
    first fsync. Also consider the inode as possibly logged only if it was
    last modified in the current transaction (besides having the full_fsync
    flag set).

    Fixes: 3a5f1d458ad161 ("Btrfs: Optimize btree walking while logging inodes")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • Presently btrfs_map_block is used not only to do everything necessary to
    map a bio to the underlying allocation profile but it's also used to
    identify how much data could be written based on btrfs' stripe logic
    without actually submitting anything. This is achieved by passing NULL
    for 'bbio_ret' parameter.

    This patch refactors all callers that require just the mapping length
    by switching them to using btrfs_io_geometry instead of calling
    btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
    change.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Add a structure that holds various parameters for IO calculations and a
    helper that fills the values. This will help further refactoring and
    reduction of functions that in some way open-coded the calculations.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Currently the messages printed after setting an incompat feature are
    cryptis, we can easily make it better as the textual description is
    passed to the helpers. Old:

    setting 128 feature flag

    updated:

    setting incompat feature flag for RAID56 (0x80)

    Signed-off-by: David Sterba

    David Sterba
     
  • gcc sometimes can't determine whether a variable has been initialized
    when both the initialization and the use are conditional:

    fs/btrfs/props.c: In function 'inherit_props':
    fs/btrfs/props.c:389:4: error: 'num_bytes' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    btrfs_block_rsv_release(fs_info, trans->block_rsv,

    This code is fine. Unfortunately, I cannot think of a good way to
    rephrase it in a way that makes gcc understand this, so I add a bogus
    initialization the way one should not.

    Signed-off-by: Arnd Bergmann
    Reviewed-by: David Sterba
    [ gcc 8 and 9 don't emit the warning ]
    Signed-off-by: David Sterba

    Arnd Bergmann
     
  • Send always operates on read-only trees and always expected that while it
    is in progress, nothing changes in those trees. Due to that expectation
    and the fact that send is a read-only operation, it operates on commit
    roots and does not hold transaction handles. However relocation can COW
    nodes and leafs from read-only trees, which can cause unexpected failures
    and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
    COWed, the transaction used to COW it is committed, a new transaction
    starts, the extent previously used for that node/leaf gets allocated,
    possibly for another tree, and the respective extent buffer' content
    changes while send is still using it. When this happens send normally
    fails with EIO being returned to user space and messages like the
    following are found in dmesg/syslog:

    [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
    [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984

    Other times, less often, we hit a BUG_ON() because an extent buffer that
    send is using used to be a node, and while send is still using it, it
    got COWed and got reused as a leaf while send is still using, producing
    the following trace:

    [ 3478.466280] ------------[ cut here ]------------
    [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
    [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
    [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
    [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
    [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
    (...)
    [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
    [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
    [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
    [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
    [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
    [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
    [ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
    [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 3478.479003] Call Trace:
    [ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
    [ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
    [ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
    [ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
    [ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
    [ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
    [ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
    [ 3478.483581] ? rq_clock_task+0x2e/0x60
    [ 3478.484086] ? wake_up_new_task+0x1f3/0x370
    [ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
    [ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
    [ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
    [ 3478.486016] ? __fget+0x113/0x200
    [ 3478.486467] ksys_ioctl+0x70/0x80
    [ 3478.486911] __x64_sys_ioctl+0x16/0x20
    [ 3478.487337] do_syscall_64+0x60/0x1b0
    [ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
    (...)
    [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
    [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
    [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
    [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
    [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
    [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
    (...)
    [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---

    Another possibility, much less likely to happen, is that send will not
    fail but the contents of the stream it produces may not be correct.

    To avoid this, do not allow send and relocation (balance) to run in
    parallel. In the long term the goal is to allow for both to be able to
    run concurrently without any problems, but that will take a significant
    effort in development and testing.

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • The real meaning of that constant is not clear from the context due to
    the target device inclusion.

    Signed-off-by: David Sterba

    David Sterba
     
  • We don't need to enumerate the profiles, use the mask for consistency.

    Signed-off-by: David Sterba

    David Sterba
     
  • Preparatory patch for additional RAID1 profiles with more copies. The
    mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
    work the same for the other profiles.

    Signed-off-by: David Sterba

    David Sterba
     
  • [BUG]
    Lockdep will report the following circular locking dependency:

    WARNING: possible circular locking dependency detected
    5.2.0-rc2-custom #24 Tainted: G O
    ------------------------------------------------------
    btrfs/8631 is trying to acquire lock:
    000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]

    but task is already holding lock:
    000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (&fs_info->tree_log_mutex){+.+.}:
    __mutex_lock+0x76/0x940
    mutex_lock_nested+0x1b/0x20
    btrfs_commit_transaction+0x475/0xa00 [btrfs]
    btrfs_commit_super+0x71/0x80 [btrfs]
    close_ctree+0x2bd/0x320 [btrfs]
    btrfs_put_super+0x15/0x20 [btrfs]
    generic_shutdown_super+0x72/0x110
    kill_anon_super+0x18/0x30
    btrfs_kill_super+0x16/0xa0 [btrfs]
    deactivate_locked_super+0x3a/0x80
    deactivate_super+0x51/0x60
    cleanup_mnt+0x3f/0x80
    __cleanup_mnt+0x12/0x20
    task_work_run+0x94/0xb0
    exit_to_usermode_loop+0xd8/0xe0
    do_syscall_64+0x210/0x240
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #1 (&fs_info->reloc_mutex){+.+.}:
    __mutex_lock+0x76/0x940
    mutex_lock_nested+0x1b/0x20
    btrfs_commit_transaction+0x40d/0xa00 [btrfs]
    btrfs_quota_enable+0x2da/0x730 [btrfs]
    btrfs_ioctl+0x2691/0x2b40 [btrfs]
    do_vfs_ioctl+0xa9/0x6d0
    ksys_ioctl+0x67/0x90
    __x64_sys_ioctl+0x1a/0x20
    do_syscall_64+0x65/0x240
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
    lock_acquire+0xa7/0x190
    __mutex_lock+0x76/0x940
    mutex_lock_nested+0x1b/0x20
    btrfs_qgroup_inherit+0x40/0x620 [btrfs]
    create_pending_snapshot+0x9d7/0xe60 [btrfs]
    create_pending_snapshots+0x94/0xb0 [btrfs]
    btrfs_commit_transaction+0x415/0xa00 [btrfs]
    btrfs_mksubvol+0x496/0x4e0 [btrfs]
    btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
    btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
    btrfs_ioctl+0xa90/0x2b40 [btrfs]
    do_vfs_ioctl+0xa9/0x6d0
    ksys_ioctl+0x67/0x90
    __x64_sys_ioctl+0x1a/0x20
    do_syscall_64+0x65/0x240
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
    &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&fs_info->tree_log_mutex);
    lock(&fs_info->reloc_mutex);
    lock(&fs_info->tree_log_mutex);
    lock(&fs_info->qgroup_ioctl_lock#2);

    *** DEADLOCK ***

    6 locks held by btrfs/8631:
    #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
    #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
    #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
    #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
    #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
    #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]

    [CAUSE]
    Due to the delayed subvolume creation, we need to call
    btrfs_qgroup_inherit() inside commit transaction code, with a lot of
    other mutex hold.
    This hell of lock chain can lead to above problem.

    [FIX]
    On the other hand, we don't really need to hold qgroup_ioctl_lock if
    we're in the context of create_pending_snapshot().
    As in that context, we're the only one being able to modify qgroup.

    All other qgroup functions which needs qgroup_ioctl_lock are either
    holding a transaction handle, or will start a new transaction:
    Functions will start a new transaction():
    * btrfs_quota_enable()
    * btrfs_quota_disable()
    Functions hold a transaction handler:
    * btrfs_add_qgroup_relation()
    * btrfs_del_qgroup_relation()
    * btrfs_create_qgroup()
    * btrfs_remove_qgroup()
    * btrfs_limit_qgroup()
    * btrfs_qgroup_inherit() call inside create_subvol()

    So we have a higher level protection provided by transaction, thus we
    don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().

    Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
    qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
    create_pending_snapshot() is already protected by transaction.

    So the fix is to detect the context by checking
    trans->transaction->state.
    If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction
    context and no need to get the mutex.

    Reported-by: Nikolay Borisov
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo