25 Mar, 2020

1 commit

  • Zygo reported the following lockdep splat while testing the balance
    patches

    ======================================================
    WARNING: possible circular locking dependency detected
    5.6.0-c6f0579d496a+ #53 Not tainted
    ------------------------------------------------------
    kswapd0/1133 is trying to acquire lock:
    ffff888092f622c0 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node+0x7c/0x5b0

    but task is already holding lock:
    ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (fs_reclaim){+.+.}:
    fs_reclaim_acquire.part.91+0x29/0x30
    fs_reclaim_acquire+0x19/0x20
    kmem_cache_alloc_trace+0x32/0x740
    add_block_entry+0x45/0x260
    btrfs_ref_tree_mod+0x6e2/0x8b0
    btrfs_alloc_tree_block+0x789/0x880
    alloc_tree_block_no_bg_flush+0xc6/0xf0
    __btrfs_cow_block+0x270/0x940
    btrfs_cow_block+0x1ba/0x3a0
    btrfs_search_slot+0x999/0x1030
    btrfs_insert_empty_items+0x81/0xe0
    btrfs_insert_delayed_items+0x128/0x7d0
    __btrfs_run_delayed_items+0xf4/0x2a0
    btrfs_run_delayed_items+0x13/0x20
    btrfs_commit_transaction+0x5cc/0x1390
    insert_balance_item.isra.39+0x6b2/0x6e0
    btrfs_balance+0x72d/0x18d0
    btrfs_ioctl_balance+0x3de/0x4c0
    btrfs_ioctl+0x30ab/0x44a0
    ksys_ioctl+0xa1/0xe0
    __x64_sys_ioctl+0x43/0x50
    do_syscall_64+0x77/0x2c0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (&delayed_node->mutex){+.+.}:
    __lock_acquire+0x197e/0x2550
    lock_acquire+0x103/0x220
    __mutex_lock+0x13d/0xce0
    mutex_lock_nested+0x1b/0x20
    __btrfs_release_delayed_node+0x7c/0x5b0
    btrfs_remove_delayed_node+0x49/0x50
    btrfs_evict_inode+0x6fc/0x900
    evict+0x19a/0x2c0
    dispose_list+0xa0/0xe0
    prune_icache_sb+0xbd/0xf0
    super_cache_scan+0x1b5/0x250
    do_shrink_slab+0x1f6/0x530
    shrink_slab+0x32e/0x410
    shrink_node+0x2a5/0xba0
    balance_pgdat+0x4bd/0x8a0
    kswapd+0x35a/0x800
    kthread+0x1e9/0x210
    ret_from_fork+0x3a/0x50

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0                                CPU1
    ----                                ----
    lock(fs_reclaim);
                                        lock(&delayed_node->mutex);
                                        lock(fs_reclaim);
    lock(&delayed_node->mutex);

    *** DEADLOCK ***

    3 locks held by kswapd0/1133:
    #0: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
    #1: ffffffff8fc380d8 (shrinker_rwsem){++++}, at: shrink_slab+0x1e8/0x410
    #2: ffff8881e0e6c0e8 (&type->s_umount_key#42){++++}, at: trylock_super+0x1b/0x70

    stack backtrace:
    CPU: 2 PID: 1133 Comm: kswapd0 Not tainted 5.6.0-c6f0579d496a+ #53
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    Call Trace:
    dump_stack+0xc1/0x11a
    print_circular_bug.isra.38.cold.57+0x145/0x14a
    check_noncircular+0x2a9/0x2f0
    ? print_circular_bug.isra.38+0x130/0x130
    ? stack_trace_consume_entry+0x90/0x90
    ? save_trace+0x3cc/0x420
    __lock_acquire+0x197e/0x2550
    ? btrfs_inode_clear_file_extent_range+0x9b/0xb0
    ? register_lock_class+0x960/0x960
    lock_acquire+0x103/0x220
    ? __btrfs_release_delayed_node+0x7c/0x5b0
    __mutex_lock+0x13d/0xce0
    ? __btrfs_release_delayed_node+0x7c/0x5b0
    ? __asan_loadN+0xf/0x20
    ? pvclock_clocksource_read+0xeb/0x190
    ? __btrfs_release_delayed_node+0x7c/0x5b0
    ? mutex_lock_io_nested+0xc20/0xc20
    ? __kasan_check_read+0x11/0x20
    ? check_chain_key+0x1e6/0x2e0
    mutex_lock_nested+0x1b/0x20
    ? mutex_lock_nested+0x1b/0x20
    __btrfs_release_delayed_node+0x7c/0x5b0
    btrfs_remove_delayed_node+0x49/0x50
    btrfs_evict_inode+0x6fc/0x900
    ? btrfs_setattr+0x840/0x840
    ? do_raw_spin_unlock+0xa8/0x140
    evict+0x19a/0x2c0
    dispose_list+0xa0/0xe0
    prune_icache_sb+0xbd/0xf0
    ? invalidate_inodes+0x310/0x310
    super_cache_scan+0x1b5/0x250
    do_shrink_slab+0x1f6/0x530
    shrink_slab+0x32e/0x410
    ? do_shrink_slab+0x530/0x530
    ? do_shrink_slab+0x530/0x530
    ? __kasan_check_read+0x11/0x20
    ? mem_cgroup_protected+0x13d/0x260
    shrink_node+0x2a5/0xba0
    balance_pgdat+0x4bd/0x8a0
    ? mem_cgroup_shrink_node+0x490/0x490
    ? _raw_spin_unlock_irq+0x27/0x40
    ? finish_task_switch+0xce/0x390
    ? rcu_read_lock_bh_held+0xb0/0xb0
    kswapd+0x35a/0x800
    ? _raw_spin_unlock_irqrestore+0x4c/0x60
    ? balance_pgdat+0x8a0/0x8a0
    ? finish_wait+0x110/0x110
    ? __kasan_check_read+0x11/0x20
    ? __kthread_parkme+0xc6/0xe0
    ? balance_pgdat+0x8a0/0x8a0
    kthread+0x1e9/0x210
    ? kthread_create_worker_on_cpu+0xc0/0xc0
    ret_from_fork+0x3a/0x50

    This is because we hold that delayed node's mutex while doing tree
    operations. Fix this by just wrapping the searches in nofs.
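
    As a rough illustration, the fix amounts to bracketing the btree operations
    done under the delayed node's mutex with the kernel's NOFS scope API
    (memalloc_nofs_save()/memalloc_nofs_restore() from linux/sched/mm.h). The
    sketch below is illustrative only; the wrapper name is hypothetical and the
    real patch wraps the existing delayed-item search/insert paths:

    static int insert_delayed_item_nofs(struct btrfs_trans_handle *trans,
                                        struct btrfs_root *root,
                                        struct btrfs_path *path,
                                        struct btrfs_delayed_item *item)
    {
            unsigned int nofs_flag;
            int ret;

            /*
             * Any allocation made below (e.g. inside btrfs_search_slot() or
             * the ref verification code seen in the splat) is now implicitly
             * GFP_NOFS, so reclaim cannot recurse back into the filesystem
             * while our caller holds the delayed node's mutex.
             */
            nofs_flag = memalloc_nofs_save();
            ret = btrfs_insert_empty_item(trans, root, path, &item->key,
                                          item->data_len);
            memalloc_nofs_restore(nofs_flag);

            return ret;
    }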

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

24 Mar, 2020

3 commits

  • Currently the non-prefixed version is a simple wrapper used to hide
    the 4th argument of the prefixed version. This doesn't bring much value
    in practice and only makes the code harder to follow by adding another
    level of indirection. Rectify this by removing the __ prefix and
    having only one public function to release bytes from a block reservation.
    No semantic changes.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The status of an aborted transaction can change between calls and it needs
    to be accessed by READ_ONCE. Add a helper that also wraps the unlikely
    hint.
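
    A sketch of what such a helper boils down to (the exact name used in the
    tree may differ):

    /* Read the aborted status exactly once and fold in the unlikely() hint. */
    #define TRANS_ABORTED(trans)    (unlikely(READ_ONCE((trans)->aborted)))

    Callers then test TRANS_ABORTED(trans) instead of peeking at
    trans->aborted directly.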

    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    David Sterba
     
  • We want to use this everywhere we modify the file extent items
    permanently. These include:

    1) Inserting new file extents for writes and prealloc extents.
    2) Truncating inode items.
    3) btrfs_cont_expand().
    4) Insert inline extents.
    5) Insert new extents from log replay.
    6) Insert a new extent for clone, as it could be past i_size.
    7) Hole punching

    For hole punching in particular it might seem unnecessary, because anybody
    extending would use btrfs_cont_expand(); however there is a corner case
    that can still give us trouble. Start with an empty file and

    fallocate KEEP_SIZE 1M-2M

    We now have a 0 length file, and a hole file extent from 0-1M, and a
    prealloc extent from 1M-2M. Now

    punch 1M-1.5M

    Because this is past i_size we have

    [HOLE EXTENT][ NOTHING ][PREALLOC]
    [0        1M][1M   1.5M][1.5M  2M]

    with an i_size of 0. Now if we pwrite 0-1.5M we'll increase our i_size
    to 1.5M, but our disk_i_size is still 0 until the ordered extent
    completes.

    However if we now immediately truncate the file to 2M we'll just call
    btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we
    commit the transaction here and crash we'll expose the gap.

    To fix this we need to clear the file extent mapping for the range that
    we punched but didn't insert a corresponding file extent for. This means
    the truncate will only get a disk_i_size set to 1M if we crash before the
    ordered io finishes.

    I've written an xfstest to reproduce the problem and validate this fix.
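
    The xfstest itself is not quoted here, but a rough reproducer in plain
    syscalls might look like the sketch below (the mount point and the "crash
    before the ordered IO finishes" step are illustrative; the actual test is
    authoritative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/mnt/btrfs/testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
            char *buf = calloc(1, 1536 * 1024);

            if (fd < 0 || !buf)
                    return 1;
            /* prealloc 1M-2M past i_size, i_size stays 0 */
            fallocate(fd, FALLOC_FL_KEEP_SIZE, 1 << 20, 1 << 20);
            /* punch 1M-1.5M, leaving nothing between the hole extent and prealloc */
            fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      1 << 20, 512 << 10);
            /* pwrite 0-1.5M: i_size becomes 1.5M, disk_i_size is still 0 */
            pwrite(fd, buf, 1536 * 1024, 0);
            /* truncate up to 2M, i.e. btrfs_cont_expand(inode, 1.5M, 2M) */
            ftruncate(fd, 2 << 20);
            /* a transaction commit followed by a crash here exposes the gap */
            close(fd);
            free(buf);
            return 0;
    }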

    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

18 Nov, 2019

3 commits

  • We hit the following warning while running down a different problem

    [ 6197.175850] ------------[ cut here ]------------
    [ 6197.185082] refcount_t: underflow; use-after-free.
    [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
    [ 6197.521792] Call Trace:
    [ 6197.526687] __btrfs_release_delayed_node+0x76/0x1c0
    [ 6197.536615] btrfs_kill_all_delayed_nodes+0xec/0x130
    [ 6197.546532] ? __btrfs_btree_balance_dirty+0x60/0x60
    [ 6197.556482] btrfs_clean_one_deleted_snapshot+0x71/0xd0
    [ 6197.566910] cleaner_kthread+0xfa/0x120
    [ 6197.574573] kthread+0x111/0x130
    [ 6197.581022] ? kthread_create_on_node+0x60/0x60
    [ 6197.590086] ret_from_fork+0x1f/0x30
    [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---

    This is because the free side drops the ref without the lock, and then
    takes the lock if our refcount is 0. So you can have nodes on the tree
    that have a refcount of 0. Fix this by zeroing out that element in our
    temporary array so we don't try to kill it again.
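
    A condensed sketch of the resulting loop (the radix tree gang lookup that
    fills delayed_nodes[] under root->inode_lock is omitted):

            for (i = 0; i < n; i++) {
                    /*
                     * Don't take a reference if the node is dead and about to
                     * be removed from the tree by __btrfs_release_delayed_node();
                     * clearing the slot makes sure we never touch it again.
                     */
                    if (!refcount_inc_not_zero(&delayed_nodes[i]->refs))
                            delayed_nodes[i] = NULL;
            }
            spin_unlock(&root->inode_lock);

            for (i = 0; i < n; i++) {
                    if (!delayed_nodes[i])
                            continue;
                    __btrfs_kill_delayed_node(delayed_nodes[i]);
                    btrfs_release_delayed_node(delayed_nodes[i]);
            }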

    CC: stable@vger.kernel.org # 4.14+
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add comment ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • The function belongs to the family of locking functions, so move it
    there. The 'noinline' keyword is dropped as it's now an exported
    function that does not need it.

    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba

    David Sterba
     
  • Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed
    write") worked around the issue that a recycled work item could get a
    false dependency on the original work item due to how the workqueue code
    guarantees non-reentrancy. It did so by giving different work functions
    to different types of work.

    However, the fixes in the previous few patches are more complete, as
    they prevent a work item from being recycled at all (except for a tiny
    window that the kernel workqueue code handles for us). This obsoletes
    the previous fix, so we don't need the unique helpers for correctness.
    The only other reason to keep them would be so they show up in stack
    traces, but they always seem to be optimized to a tail call, so they
    don't show up anyways. So, let's just get rid of the extra indirection.

    While we're here, rename normal_work_helper() to the more informative
    btrfs_work_helper().

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Filipe Manana
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Omar Sandoval
     

09 Sep, 2019

4 commits

  • The file ctree.h serves as a header for everything and has become quite
    bloated. Split some helpers that are generic and create a new file that
    should be the catch-all for code that's not btrfs-specific.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • Historically we reserved the worst case for every btree operation, and
    generally speaking we want to do that in cases where it could be the
    worst case. However for updating inodes we know the inode items are
    already in the tree, so it will only be an update operation and never an
    insert operation. This allows us to always reserve only the
    metadata_size amount for inode updates rather than the
    insert_metadata_size amount.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
    it doesn't take into account any splitting at the levels, because
    truncate will never split nodes. However both truncating and changing
    existing items never split nodes, so rename btrfs_calc_trunc_metadata_size to
    btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely
    for inserting items, so rename this to btrfs_calc_insert_metadata_size.
    Making these clearer will help when I start using them differently in
    upcoming patches.
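
    A sketch of what the renamed helpers amount to, assuming the worst case is
    one full tree path of nodesize blocks per item, doubled for inserts because
    every level may split:

    static inline u64 btrfs_calc_metadata_size(const struct btrfs_fs_info *fs_info,
                                               unsigned int num_items)
    {
            /* update/delete: one path of BTRFS_MAX_LEVEL blocks per item */
            return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * num_items;
    }

    static inline u64 btrfs_calc_insert_metadata_size(const struct btrfs_fs_info *fs_info,
                                                      unsigned int num_items)
    {
            /* insert: every level may split, so reserve twice as much */
            return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
    }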

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • There is one report of a fuzzed image which leads to a BUG_ON() in
    btrfs_delete_delayed_dir_index().

    Although that fuzzed image can already be addressed by the enhanced
    extent-tree error handler, it's still better to hunt down more BUG_ON()s.

    This patch will hunt down two BUG_ON()s in
    btrfs_delete_delayed_dir_index():
    - One for an error from btrfs_delayed_item_reserve_metadata()
      Instead of BUG_ON(), we output an error message, free the item and
      return the error. All callers of this function handle the error by
      aborting the current transaction.

    - One for a possible EEXIST from __btrfs_add_delayed_deletion_item()
      That function can return -EEXIST. We already have a good enough error
      message for that, we only need to clean up the reserved metadata space
      and the allocated item.

    To help the above cleanup, also modify __btrfs_remove_delayed_item(),
    called in btrfs_release_delayed_item(), to skip unassociated items.
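
    Schematically, the two BUG_ON()s turn into error handling along these
    lines (a condensed sketch, not the literal patch; labels and locking
    context are simplified):

            ret = btrfs_delayed_item_reserve_metadata(trans, dir->root, item);
            if (ret < 0) {
                    /* report the error, free the item and let the caller abort */
                    btrfs_release_delayed_item(item);
                    goto out;
            }

            mutex_lock(&node->mutex);
            ret = __btrfs_add_delayed_deletion_item(node, item);
            if (unlikely(ret)) {
                    /* -EEXIST: undo the reservation and free the item */
                    btrfs_delayed_item_release_metadata(dir->root, item);
                    btrfs_release_delayed_item(item);
            }
            mutex_unlock(&node->mutex);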

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203253
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     

30 Apr, 2019

2 commits

  • We can read fs_info from extent buffer and can drop it from the
    parameters.

    Signed-off-by: David Sterba

    David Sterba
     
  • Deduplicate the btrfs file type conversion implementation - file systems
    that use the same file types as defined by POSIX do not need to define
    their own versions and can use the common helper functions declared in
    fs_types.h and implemented in fs_types.c

    Common implementation can be found via commit:
    bbe7449e2599 "fs: common implementation of file type"

    Reviewed-by: Jan Kara
    Signed-off-by: Amir Goldstein
    Signed-off-by: Phillip Potter
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Phillip Potter
     

15 Oct, 2018

4 commits

  • Btrfs's btree locking has two modes, spinning mode and blocking mode.
    While searching the btree, locking is always acquired in spinning mode and
    then converted to blocking mode if necessary; in some hot paths we may
    switch the locking back to spinning mode by btrfs_clear_path_blocking().

    When acquiring locks, both readers and writers need to wait for blocking
    readers and writers to complete before doing read_lock()/write_lock().

    The problem is that btrfs_clear_path_blocking() needs to first switch the
    nodes in the path to blocking mode (by btrfs_set_path_blocking()) to make
    lockdep happy before doing its actual job of clearing the blocking state.

    Switching from spinning mode to blocking mode consists of

    step 1) bumping up the blocking readers counter and
    step 2) read_unlock()/write_unlock(),

    and this has caused a serious ping-pong effect if there are a great number
    of concurrent readers/writers, as waiters are woken up only to go back to
    sleep immediately.
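
    Roughly, the write-lock side of that two-step conversion looks like the
    sketch below (the real helpers walk all locked nodes in a path and handle
    the read side as well):

    static void set_lock_blocking_write(struct extent_buffer *eb)
    {
            /* step 1: publish a blocking writer */
            atomic_inc(&eb->blocking_writers);
            /* step 2: drop the spinning rwlock, which wakes spinning waiters */
            write_unlock(&eb->lock);
    }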

    1) Killing this kind of ping-pong results in a big improvement in my 1600k
    files creation script,

    MNT=/mnt/btrfs
    mkfs.btrfs -f /dev/sdf
    mount /dev/sdf $MNT
    time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \
    -d $MNT/0 -d $MNT/1 \
    -d $MNT/2 -d $MNT/3 \
    -d $MNT/4 -d $MNT/5 \
    -d $MNT/6 -d $MNT/7 \
    -d $MNT/8 -d $MNT/9 \
    -d $MNT/10 -d $MNT/11 \
    -d $MNT/12 -d $MNT/13 \
    -d $MNT/14 -d $MNT/15

    w/o patch:
    real 2m27.307s
    user 0m12.839s
    sys 13m42.831s

    w/ patch:
    real 1m2.273s
    user 0m15.802s
    sys 8m16.495s

    1.1) latency histogram from funclatency[1]

    Overall with the patch, there are ~50% fewer write lock acquisitions, and
    the 95% max latency of taking the write lock also drops to ~100ms from
    >500ms.

    --------------------------------------------
    w/o patch:
    --------------------------------------------
    Function = btrfs_tree_lock
    msecs : count distribution
    0 -> 1 : 2385222 |****************************************|
    2 -> 3 : 37147 | |
    4 -> 7 : 20452 | |
    8 -> 15 : 13131 | |
    16 -> 31 : 3877 | |
    32 -> 63 : 3900 | |
    64 -> 127 : 2612 | |
    128 -> 255 : 974 | |
    256 -> 511 : 165 | |
    512 -> 1023 : 13 | |

    Function = btrfs_tree_read_lock
    msecs : count distribution
    0 -> 1 : 6743860 |****************************************|
    2 -> 3 : 2146 | |
    4 -> 7 : 190 | |
    8 -> 15 : 38 | |
    16 -> 31 : 4 | |

    --------------------------------------------
    w/ patch:
    --------------------------------------------
    Function = btrfs_tree_lock
    msecs : count distribution
    0 -> 1 : 1318454 |****************************************|
    2 -> 3 : 6800 | |
    4 -> 7 : 3664 | |
    8 -> 15 : 2145 | |
    16 -> 31 : 809 | |
    32 -> 63 : 219 | |
    64 -> 127 : 10 | |

    Function = btrfs_tree_read_lock
    msecs : count distribution
    0 -> 1 : 6854317 |****************************************|
    2 -> 3 : 2383 | |
    4 -> 7 : 601 | |
    8 -> 15 : 92 | |

    2) dbench also proves the improvement,
    dbench -t 120 -D /mnt/btrfs 16

    w/o patch:
    Throughput 158.363 MB/sec

    w/ patch:
    Throughput 449.52 MB/sec

    3) xfstests didn't show any additional failures.

    One thing to note is that callers may set path->leave_spinning to have
    all nodes in the path stay in spinning mode, which means callers are
    ready to not sleep before releasing the path, but it won't cause
    problems if they don't want to sleep in blocking mode.

    [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
     
  • rb_first_cached() trades an extra pointer "leftmost" for doing the same job as
    rb_first() but in O(1).

    Functions manipulating delayed_item need to get the first entry; this
    converts them to use rb_first_cached().

    For more details about the optimization see patch "Btrfs: delayed-refs:
    use rb_first_cached for href_root".
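
    For the delayed items the conversion is essentially the sketch below,
    assuming the item tree root becomes an rb_root_cached; insertions and
    removals then go through rb_insert_color_cached()/rb_erase_cached() so
    the cached leftmost pointer stays valid:

    static struct btrfs_delayed_item *first_item(struct rb_root_cached *root)
    {
            /* the cached leftmost pointer makes this O(1) instead of a walk */
            struct rb_node *p = rb_first_cached(root);

            return p ? rb_entry(p, struct btrfs_delayed_item, rb_node) : NULL;
    }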

    Tested-by: Holger Hoffstätte
    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     
  • There are two members in struct btrfs_root which indicate root's
    objectid: objectid and root_key.objectid.

    They are both set to the same value in __setup_root():

    static void __setup_root(struct btrfs_root *root,
                             struct btrfs_fs_info *fs_info,
                             u64 objectid)
    {
            ...
            root->objectid = objectid;
            ...
            root->root_key.objectid = objectid;
            ...
    }

    and are not changed to any other value after initialization.

    grep in btrfs directory shows both are used in many places:
    $ grep -rI "root->root_key.objectid" | wc -l
    133
    $ grep -rI "root->objectid" | wc -l
    55
    (4.17, inc. some noise)

    It is confusing to have two similar variable names and it seems
    that there is no rule about which should be used in a certain case.

    Since ->root_key itself is needed for the tree reloc tree, let's remove the
    'objectid' member and unify the code to use ->root_key.objectid in all places.

    Signed-off-by: Misono Tomohiro
    Reviewed-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Misono Tomohiro
     
  • Using true and false here is closer to the expected semantic than using
    0 and 1. No functional change.

    Signed-off-by: Lu Fengqi
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Lu Fengqi
     

06 Aug, 2018

3 commits


29 May, 2018

1 commit


18 Apr, 2018

1 commit

  • Commit 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for
    delayed inode and item") merged into mainline was not the latest version
    submitted to the mailing list in Dec 2017.

    It lacks the following fixes:

    1) Remove btrfs_qgroup_convert_reserved_meta() call in
    btrfs_delayed_item_release_metadata()
    2) Remove btrfs_qgroup_reserve_meta_prealloc() call in
    btrfs_delayed_inode_reserve_metadata()

    Those fixes will resolve unexpected EDQUOT problems.

    Fixes: 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     

12 Apr, 2018

1 commit


31 Mar, 2018

1 commit


26 Mar, 2018

3 commits


30 Jan, 2018

2 commits

  • Pull btrfs updates from David Sterba:
    "Features or user visible changes:

    - fallocate: implement zero range mode

    - avoid losing data raid profile when deleting a device

    - tree item checker: more checks for directory items and xattrs

    Notable fixes:

    - raid56 recovery: don't use cached stripes, that could be
    potentially changed and a later RMW or recovery would lead to
    corruptions or failures

    - let raid56 try harder to rebuild damaged data, reading from all
    stripes if necessary

    - fix scrub to repair raid56 in a similar way as in the case above

    Other:

    - cleanups: device freeing, removed some call indirections, redundant
    bio_put/_get, unused parameters, refactorings and renames

    - RCU list traversal fixups

    - simplify mount callchain, remove recursing back when mounting a
    subvolume

    - plug for fsync, may improve bio merging on multiple devices

    - compression heuristic: replace heap sort with radix sort, gains some
    performance

    - add extent map selftests, buffered write vs dio"

    * tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
    btrfs: drop devid as device_list_add() arg
    btrfs: get device pointer from device_list_add()
    btrfs: set the total_devices in device_list_add()
    btrfs: move pr_info into device_list_add
    btrfs: make btrfs_free_stale_devices() to match the path
    btrfs: rename btrfs_free_stale_devices() arg to skip_dev
    btrfs: make btrfs_free_stale_devices() argument optional
    btrfs: make btrfs_free_stale_device() to iterate all stales
    btrfs: no need to check for btrfs_fs_devices::seeding
    btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
    Btrfs: noinline merge_extent_mapping
    Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
    Btrfs: extent map selftest: dio write vs dio read
    Btrfs: extent map selftest: buffered write vs dio read
    Btrfs: add extent map selftests
    Btrfs: move extent map specific code to extent_map.c
    Btrfs: add helper for em merge logic
    Btrfs: fix unexpected EEXIST from btrfs_get_extent
    Btrfs: fix incorrect block_len in merge_extent_mapping
    btrfs: Remove unused readahead spinlock
    ...

    Linus Torvalds
     
  • Pull inode->i_version rework from Jeff Layton:
    "This pile of patches is a rework of the inode->i_version field. We
    have traditionally incremented that field on every inode data or
    metadata change. Typically this increment needs to be logged on disk
    even when nothing else has changed, which is rather expensive.

    It turns out though that none of the consumers of that field actually
    require this behavior. The only real requirement for all of them is
    that it be different iff the inode has changed since the last time the
    field was checked.

    Given that, we can optimize away most of the i_version increments and
    avoid dirtying inode metadata when the only change is to the i_version
    and no one is querying it. Queries of the i_version field are rather
    rare, so we can help write performance under many common workloads.

    This patch series converts existing accesses of the i_version field to
    a new API, and then converts all of the in-kernel filesystems to use
    it. The last patch in the series then converts the backend
    implementation to a scheme that optimizes away a large portion of the
    metadata updates when no one is looking at it.

    In my own testing this series significantly helps performance with
    small I/O sizes. I also got this email for Christmas this year from
    the kernel test robot (a 244% r/w bandwidth improvement with XFS over
    DAX, with 4k writes):

    https://lkml.org/lkml/2017/12/25/8

    A few of the earlier patches in this pile are also flowing to you via
    other trees (mm, integrity, and nfsd trees in particular)".

    * tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits)
    fs: handle inode->i_version more efficiently
    btrfs: only dirty the inode in btrfs_update_time if something was changed
    xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
    fs: only set S_VERSION when updating times if necessary
    IMA: switch IMA over to new i_version API
    xfs: convert to new i_version API
    ufs: use new i_version API
    ocfs2: convert to new i_version API
    nfsd: convert to new i_version API
    nfs: convert to new i_version API
    ext4: convert to new i_version API
    ext2: convert to new i_version API
    exofs: switch to new i_version API
    btrfs: convert to new i_version API
    afs: convert to new i_version API
    affs: convert to new i_version API
    fat: convert to new i_version API
    fs: don't take the i_lock in inode_inc_iversion
    fs: new API for handling inode->i_version
    ntfs: remove i_version handling
    ...

    Linus Torvalds
     

29 Jan, 2018

1 commit


25 Jan, 2018

1 commit

  • In fixing the readdir+pagefault deadlock I accidentally introduced a
    stale entry regression in readdir. If we get close to full for the
    temporary buffer, and then skip a few delayed deletions, and then try to
    add another entry that won't fit, we will emit the entries we found and
    retry. Unfortunately we delete entries from our del_list as we find
    them, assuming we won't need them. However our pos will be with
    whatever our last entry was, which could be before the delayed deletions
    we skipped, so the next search will add the deleted entries back into
    our readdir buffer. So instead don't delete entries we find in our
    del_list so we can make sure we always find our delayed deletions. This
    is a slight perf hit for readdir with lots of pending deletions, but
    hopefully this isn't a common occurrence. If it is we can revisit this
    and optimize it.

    cc: stable@vger.kernel.org
    Fixes: 23b5ec74943f ("btrfs: fix readdir deadlock with pagefault")
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

22 Jan, 2018

2 commits

  • btrfs_balance_delayed_items is the sole caller of
    btrfs_wq_run_delayed_node and already includes one of the checks whether
    the delayed inodes should be run. On the other hand
    btrfs_wq_run_delayed_node duplicates that check and performs an
    additional one for wq congestion.

    Let's remove the duplicate check and move the congestion one in
    btrfs_balance_delayed_items, leaving btrfs_wq_run_delayed_node to only
    care about setting up the wq run. No functional changes.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Currently btrfs_async_run_delayed_root's implementation uses 3 goto
    labels to mimic the functionality of a simple do {} while loop. Refactor
    the function to use a do {} while construct, making the intention clear and
    the code easier to follow. No functional changes.
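
    Schematically the change looks like this (do_one_batch(), need_more_work()
    and finish() are stand-ins, not real btrfs functions):

    /* before: a loop spelled with labels and gotos */
    again:
            do_one_batch();
            if (need_more_work())
                    goto again;
            finish();

    /* after: the same control flow as an explicit loop */
            do {
                    do_one_batch();
            } while (need_more_work());
            finish();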

    Signed-off-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    Nikolay Borisov
     

03 Jan, 2018

1 commit

  • refcounts have a generic implementation and an asm optimized one. The
    generic version has extra debugging to make sure that once a refcount
    goes to zero, refcount_inc won't increase it.

    The btrfs delayed inode code wasn't expecting this, and we're tripping
    over the warnings when the generic refcounts are used. We ended up with
    this race:

    Process A                                     Process B
    btrfs_get_delayed_node()
      spin_lock(root->inode_lock)
      radix_tree_lookup()
                                                  __btrfs_release_delayed_node()
                                                  refcount_dec_and_test(&delayed_node->refs)
                                                  our refcount is now zero
      refcount_add(2)
                                                  spin_lock(root->inode_lock)
                                                  radix_tree_delete()

    With the generic refcounts, we actually warn again when process B above
    tries to release its refcount because refcount_add() turned into a
    no-op.

    We saw this in production on older kernels without the asm optimized
    refcounts.

    The fix used here is to use refcount_inc_not_zero() to detect when the
    object is in the middle of being freed and return NULL. This is almost
    always the right answer anyway, since we usually end up pitching the
    delayed_node if it didn't have fresh data in it.

    This also changes __btrfs_release_delayed_node() to remove the extra
    check for zero refcounts before radix tree deletion.
    btrfs_get_delayed_node() was the only path that was allowing refcounts
    to go from zero to one.
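
    Condensed, the lookup path behaves like this sketch after the fix (the
    cached btrfs_inode->delayed_node fast path and error handling are omitted):

            spin_lock(&root->inode_lock);
            node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
            if (node && !refcount_inc_not_zero(&node->refs)) {
                    /* racing with __btrfs_release_delayed_node(), treat as absent */
                    node = NULL;
            }
            spin_unlock(&root->inode_lock);
            return node;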

    Fixes: 6de5f18e7b0da ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
    CC: stable@vger.kernel.org # 4.12+
    Signed-off-by: Chris Mason
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Chris Mason
     

02 Nov, 2017

1 commit

  • The way we handle delalloc metadata reservations has gotten
    progressively more complicated over the years. There is so much cruft
    and weirdness around keeping the reserved count and outstanding counters
    consistent and handling the error cases that it's impossible to
    understand.

    Fix this by making the delalloc block rsv per-inode. This way we can
    calculate the actual size of the outstanding metadata reservations every
    time we make a change, and then reserve the delta based on that amount.
    This greatly simplifies the code everywhere, and makes the error
    handling in btrfs_delalloc_reserve_metadata far less terrifying.

    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

18 Aug, 2017

1 commit

  • Our dir_context->pos is supposed to hold the next position we're
    supposed to look at. If we successfully insert a delayed dir index we
    could end up with a duplicate entry because we don't increase ctx->pos
    after doing the dir_emit.
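
    After the fix the emit path follows the usual readdir pattern; roughly:

            if (!dir_emit(ctx, name, name_len, location.objectid, d_type))
                    return 1;       /* buffer full, stop here */
            /* advance so a later call cannot emit this index again */
            ctx->pos++;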

    Signed-off-by: Josef Bacik
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Josef Bacik
     

18 Apr, 2017

2 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
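
    The conversion itself is mechanical; schematically, for the delayed node's
    reference count (delayed_node_cache stands for the slab cache these nodes
    are allocated from):

    /* before: plain atomic counter, no overflow/use-after-free protection */
    atomic_set(&delayed_node->refs, 1);
    atomic_inc(&delayed_node->refs);
    if (atomic_dec_and_test(&delayed_node->refs))
            kmem_cache_free(delayed_node_cache, delayed_node);

    /* after: refcount_t saturates and warns on suspicious transitions */
    refcount_set(&delayed_node->refs, 1);
    refcount_inc(&delayed_node->refs);
    if (refcount_dec_and_test(&delayed_node->refs))
            kmem_cache_free(delayed_node_cache, delayed_node);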

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
     

28 Feb, 2017

1 commit


14 Feb, 2017

1 commit