02 Dec, 2019

1 commit

  • Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
    "As part of the cleanup of some remaining y2038 issues, I came to
    fs/compat_ioctl.c, which still has a couple of commands that need
    support for time64_t.

    In completely unrelated work, I spent time on cleaning up parts of
    this file in the past, moving things out into drivers instead.

    After Al Viro reviewed an earlier version of this series and did a lot
    more of that cleanup, I decided to try to completely eliminate the
    rest of it and move it all into drivers.

    This series incorporates some of Al's work and many patches of my own,
    but in the end stops short of actually removing the last part, which
    is the scsi ioctl handlers. I have patches for those as well, but they
    need more testing or possibly a rewrite"

    * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
    scsi: sd: enable compat ioctls for sed-opal
    pktcdvd: add compat_ioctl handler
    compat_ioctl: move SG_GET_REQUEST_TABLE handling
    compat_ioctl: ppp: move simple commands into ppp_generic.c
    compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
    compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
    compat_ioctl: unify copy-in of ppp filters
    tty: handle compat PPP ioctls
    compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
    compat_ioctl: handle SIOCOUTQNSD
    af_unix: add compat_ioctl support
    compat_ioctl: reimplement SG_IO handling
    compat_ioctl: move WDIOC handling into wdt drivers
    fs: compat_ioctl: move FITRIM emulation into file systems
    gfs2: add compat_ioctl support
    compat_ioctl: remove unused convert_in_user macro
    compat_ioctl: remove last RAID handling code
    compat_ioctl: remove /dev/raw ioctl translation
    compat_ioctl: remove PCI ioctl translation
    compat_ioctl: remove joystick ioctl translation
    ...

    Linus Torvalds
     

19 Nov, 2019

39 commits

  • After previous patches removing bdev being passed around to set it to
    bio, it has become unused in submit_extent_page. So it now has "only" 13
    parameters.

    Signed-off-by: David Sterba

    David Sterba
     
  • We can now remove the bdev from extent_map. Previous patches made sure
    that bio_set_dev is correctly called in all places and that we don't need
    to grab it from latest_bdev or pass it around inside the extent map.

    Signed-off-by: David Sterba

    David Sterba
     
  • bio_set_dev sets a bdev to a bio and is not only setting a pointer but
    also changing some state bits if there was a different bdev set before.
    This is one thing that's not needed.

    Another thing is that setting a bdev at bio allocation time is too early
    and actually does not work with plain redundancy profiles, where each
    time we submit a bio to a device, the bdev is set correctly.

    In many places the bio bdev is set to latest_bdev that seems to serve as
    a stub pointer "just to put something to bio". But we don't have to do
    that.

    Where do we know which bdev to set:

    * for regular IO: submit_stripe_bio that's called by btrfs_map_bio

    * repair IO: repair_io_failure, read or write from specific device

    * super block write (using buffer_heads but uses raw bdev) and barriers

    * scrub: this does not use all regular IO paths as it needs to reach all
    copies, verify and fixup eventually, and for that all bdev management
    is independent

    * raid56: rbio_add_io_page, for the RMW write

    * integrity-checker: does its own low-level block tracking

    Signed-off-by: David Sterba

    David Sterba
     
  • This is a preparatory patch to remove the @bdev parameter from
    submit_extent_page. It can't be removed completely, because the cgroups
    need it for wbc when initializing the bio:

    wbc_init_bio
    bio_associate_blkg_from_css
    dereference bdev->bi_disk->queue

    The bdev pointer is the same as latest_bdev, thus no functional change.
    We can retrieve it from fs_devices that's reachable through several
    dereferences. The local variable shadows the parameter, but that's only
    temporary.

    Signed-off-by: David Sterba

    David Sterba
     
  • Testing with the new fsstress support for subvolumes uncovered a pretty
    bad problem with rename exchange on subvolumes. We're modifying two
    different subvolumes, but we only start the transaction on one of them,
    so the other one is not added to the dirty root list. This is caught by
    btrfs_cow_block() with a warning because the root has not been updated,
    however if we do not modify this root again we'll end up pointing at an
    invalid root because the root item is never updated.

    Fix this by making sure we add the destination root to the trans list,
    the same as we do with normal renames. This fixes the corruption.

    Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
    CC: stable@vger.kernel.org # 4.9+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     
  • When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
    set the block group to RO mode and then wait for any ongoing writes into
    extents of the block group to complete. While doing that wait we overwrite
    the value of the variable 'ret' and can break out of the loop if an error
    happens without turning the block group back into RW mode. So what happens
    is the following:

    1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
    to RO mode (its ->ro field set to 1 or incremented to some value > 1);

    2) Then btrfs_wait_ordered_roots() returns a value > 0;

    3) Then if either joining or committing the transaction fails, we break
    out of the loop without calling btrfs_dec_block_group_ro(), leaving
    the block group in RO mode forever.

    To fix this, just remove the code that waits for ongoing writes to extents
    of the block group, since it's not needed because in the initial setup
    phase of a device replace operation, before starting to find all chunks
    and their extents, we set the target device for replace while holding
    fs_info->dev_replace->rwsem, which ensures that after releasing that
    semaphore, any writes into the source device are made to the target device
    as well (__btrfs_map_block() guarantees that). So while at
    scrub_enumerate_chunks() we only need to worry about finding and copying
    extents (from the source device to the target device) that were written
    before we started the device replace operation.

    Fixes: f0e9b7d6401959 ("Btrfs: fix race setting block group readonly during device replace")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • [BUG]
    When running btrfs/072 with only one online CPU, it has a pretty high
    chance to fail:

    btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
    - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
    --- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
    +++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
    @@ -1,2 +1,3 @@
    QA output created by 072
    Silence is golden
    +Scrub find errors in "-m dup -d single" test
    ...

    And with the following call trace:

    BTRFS info (device dm-5): scrub: started on devid 1
    ------------[ cut here ]------------
    BTRFS: Transaction aborted (error -27)
    WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
    CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
    Call Trace:
    __btrfs_end_transaction+0xdb/0x310 [btrfs]
    btrfs_end_transaction+0x10/0x20 [btrfs]
    btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
    scrub_enumerate_chunks+0x264/0x940 [btrfs]
    btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
    btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
    do_vfs_ioctl+0x636/0xaa0
    ksys_ioctl+0x67/0x90
    __x64_sys_ioctl+0x43/0x50
    do_syscall_64+0x79/0xe0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    ---[ end trace 166c865cec7688e7 ]---

    [CAUSE]
    The error number -27 is -EFBIG, returned from the following call chain:
    btrfs_end_transaction()
    |- __btrfs_end_transaction()
       |- btrfs_create_pending_block_groups()
          |- btrfs_finish_chunk_alloc()
             |- btrfs_add_system_chunk()

    This happens because we have used up all space of
    btrfs_super_block::sys_chunk_array.

    The root cause is, we have the following bad loop of creating tons of
    system chunks:

    1. The only SYSTEM chunk is being scrubbed
       It's very common to have only one SYSTEM chunk.
    2. New SYSTEM bg will be allocated
       btrfs_inc_block_group_ro() checks whether we have enough space
       after marking the current bg RO. If not, it allocates a new chunk.
    3. New SYSTEM bg is still empty, will be reclaimed
       During the reclaim, we will mark it RO again.
    4. That newly allocated empty SYSTEM bg gets scrubbed
       We go back to step 2, as the bg is already marked RO but still not
       cleaned up yet.

    If the cleaner kthread doesn't get executed fast enough (e.g. only one
    CPU), then we will get more and more empty SYSTEM chunks, using up all
    the space of btrfs_super_block::sys_chunk_array.

    [FIX]
    Since scrub/dev-replace doesn't always need to allocate new extents,
    especially chunk tree extents, we don't really need to do chunk
    pre-allocation.

    To break the above spiral, introduce a new parameter to
    btrfs_inc_block_group_ro(), @do_chunk_alloc, which indicates whether we
    need extra chunk pre-allocation.

    For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
    @do_chunk_alloc=false.
    This should keep unnecessary empty chunks from popping up for scrub.

    Also, since there are now two parameters for btrfs_inc_block_group_ro(),
    add more comments for it.
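
    A toy model of the spiral, for illustration only: every toy_* name and
    the slot count are assumptions, and this is not the kernel
    implementation. The cleaner is assumed never to run, so each chunk
    allocated during scrub is itself scrubbed later.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity; the real sys_chunk_array is a fixed-size byte array. */
#define TOY_SYS_ARRAY_SLOTS 8

struct toy_fs {
    int nr_system_chunks;   /* chunks recorded in the toy sys_chunk_array */
};

/* Toy stand-in for btrfs_inc_block_group_ro(); not the kernel code. */
int toy_inc_block_group_ro(struct toy_fs *fs, bool do_chunk_alloc)
{
    assert(fs != NULL);
    if (!do_chunk_alloc)
        return 0;               /* scrub: mark RO only, no pre-allocation */
    if (fs->nr_system_chunks >= TOY_SYS_ARRAY_SLOTS)
        return -EFBIG;          /* toy array exhausted */
    fs->nr_system_chunks++;     /* pre-allocate a new, empty chunk */
    return 0;
}

/* Scrub every SYSTEM chunk, including ones allocated along the way. */
int toy_scrub_system_chunks(struct toy_fs *fs, bool do_chunk_alloc)
{
    assert(fs != NULL);
    for (int i = 0; i < fs->nr_system_chunks; i++) {
        int ret = toy_inc_block_group_ro(fs, do_chunk_alloc);

        if (ret)
            return ret;
    }
    return 0;
}
```

    Starting from one chunk, the toy scrub with do_chunk_alloc=true keeps
    allocating until the array is full and then fails with -EFBIG; with
    do_chunk_alloc=false it finishes without allocating anything.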

    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • struct btrfs_fs_devices::rotating currently is declared as an integer
    variable but only used as a boolean.

    Change the variable definition to bool and update the code touching it to
    set 'true' and 'false'.

    Reviewed-by: Qu Wenruo
    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • struct btrfs_fs_devices::seeding currently is declared as an integer
    variable but only used as a boolean.

    Change the variable definition to bool and update the code touching it to
    set 'true' and 'false'.

    Reviewed-by: Qu Wenruo
    Reviewed-by: Anand Jain
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • The type name is misleading, a single entry is named 'cache' while this
    normally means a collection of objects. Rename that everywhere. Also the
    identifier was quite long, making function prototypes harder to format.

    Suggested-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     
  • For read_one_block_group(), its only caller has already got the item key
    to search next block group item.

    So we can use that key directly without doing our own conversion on
    the stack.

    Also, since that key used in btrfs_read_block_groups() is vital for
    block group item search, add the 'const' keyword for that parameter to
    prevent read_one_block_group() from modifying it.

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Refactor the work inside the loop of btrfs_read_block_groups() into one
    separate function, read_one_block_group().

    This allows read_one_block_group() to be reused for the later BG_TREE feature.

    The refactor does the following extra fix:
    - Use btrfs_fs_incompat() to replace open-coded feature check

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Anand Jain
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Signed-off-by: David Sterba

    David Sterba
     
  • A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
    once policies can be found here
    https://lwn.net/Articles/799218/#Access-Marking%20Policies .

    The locked and unlocked access to eb::blocking_writers should be
    annotated accordingly, following this:

    Writes:

    - locked write must use ONCE, may use plain read
    - unlocked write must use ONCE

    Reads:

    - unlocked read must use ONCE
    - locked read may use plain read iff not mixed with unlocked read
    - unlocked read then locked must use ONCE

    There's one difference on the assembly level, where
    btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
    value and did not reevaluate it after taking the lock. This could have
    missed some opportunities to take the lock in case blocking writers
    changed between the calls, but the window is just a few instructions
    long. As this is in try-lock, the callers handle that.
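
    A minimal userspace sketch of the access rules above, assuming GCC or
    Clang for __typeof__. The macros only approximate the kernel's
    READ_ONCE/WRITE_ONCE, and struct toy_eb and the toy_* helpers are
    hypothetical stand-ins, not the btrfs locking code. The model is
    single-threaded and only shows where the marked accesses go.

```c
#include <assert.h>

/* Userspace approximations of the kernel macros (GCC/Clang __typeof__). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct toy_eb {
    int blocking_writers;   /* only ever 0 or 1 */
    int lock_held;          /* toy stand-in for the real lock */
};

/* Unlocked read: must use READ_ONCE. */
int toy_has_blocking_writers(struct toy_eb *eb)
{
    return READ_ONCE(eb->blocking_writers);
}

/* Locked write: still uses WRITE_ONCE because unlocked readers exist. */
void toy_set_blocking_write(struct toy_eb *eb)
{
    assert(eb->lock_held);
    WRITE_ONCE(eb->blocking_writers, 1);
}

/* Try-lock style reader: check unlocked, then check again under the lock. */
int toy_try_read_lock(struct toy_eb *eb)
{
    if (READ_ONCE(eb->blocking_writers))
        return 0;
    eb->lock_held = 1;
    /* Re-read instead of reusing the value cached before taking the lock. */
    if (READ_ONCE(eb->blocking_writers)) {
        eb->lock_held = 0;
        return 0;
    }
    return 1;
}
```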

    Signed-off-by: David Sterba

    David Sterba
     
  • The increment and decrement were inherited from the previous version that
    used atomics, switched in commit 06297d8cefca ("btrfs: switch
    extent_buffer blocking_writers from atomic to int"). The only possible
    values are 0 and 1 so we can set them directly.

    The generated assembly (gcc 9.x) did the direct value assignment in
    btrfs_set_lock_blocking_write (asm diff after change in 06297d8cefca):

    5d: test %eax,%eax
    5f: je 62
    61: retq

    - 62: lock incl 0x44(%rdi)
    - 66: add $0x50,%rdi
    - 6a: jmpq 6f

    + 62: movl $0x1,0x44(%rdi)
    + 69: add $0x50,%rdi
    + 6d: jmpq 72

    The part in btrfs_tree_unlock did a decrement because
    BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
    otherwise the output looks safe:

    - lock decl 0x44(%rdi)

    + sub $0x1,%eax
    + mov %eax,0x44(%rdi)

    Signed-off-by: David Sterba

    David Sterba
     
  • There are two ifs that use eb::blocking_writers. As this is a variable
    modified inside and outside of locks, we could minimize number of
    accesses to avoid problems with getting different results at different
    times.

    The access here is locked so this can only race with btrfs_tree_unlock
    that sets blocking_writers to 0 without lock and unsets the lock owner.

    The first branch is taken only if the same thread already holds the
    lock, the second if checks for blocking writers. Here we'd either unlock
    and wait, or proceed. Both are valid states of the locking protocol.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
     
  • When there are no raid1c3 or raid1c4 block groups left after balance
    (either convert or with other filters applied), remove the incompat bit.
    This is already done for RAID56, do the same for RAID1C34.
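
    The idea can be sketched as a flag update, assuming hypothetical bit
    values and a hypothetical helper name; the real flag definitions and
    the balance code differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define TOY_BG_RAID1C3          (1ULL << 0)
#define TOY_BG_RAID1C4          (1ULL << 1)
#define TOY_INCOMPAT_RAID1C34   (1ULL << 0)

/* Clear the incompat bit once no block group uses raid1c3 or raid1c4. */
void toy_clear_raid1c34_if_unused(uint64_t *incompat_flags,
                                  uint64_t profiles_in_use)
{
    assert(incompat_flags != NULL);
    if (!(profiles_in_use & (TOY_BG_RAID1C3 | TOY_BG_RAID1C4)))
        *incompat_flags &= ~TOY_INCOMPAT_RAID1C34;
}
```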

    Signed-off-by: David Sterba

    David Sterba
     
  • The new raid1c3 and raid1c4 profiles are backward incompatible and the
    name shall be 'raid1c34', the status can be found in the global
    supported features in /sys/fs/btrfs/features or in the per-filesystem
    directory.

    Signed-off-by: David Sterba

    David Sterba
     
  • Add a new block group profile to store 4 copies, similar to how the
    current RAID1 stores 2. The profile attributes and constraints are defined
    in the raid table and used by the same code that already handles the 2-
    and 3-copy RAID1.

    The minimum number of devices is 4, the maximum number of devices/chunks
    that can be lost/damaged is 3. There is no comparable traditional RAID
    level, the profile is added for future needs to accompany triple-parity
    and beyond.

    Signed-off-by: David Sterba

    David Sterba
     
  • Add a new block group profile to store 3 copies, similar to how the
    current RAID1 stores 2. The profile attributes and constraints are defined
    in the raid table and used by the same code that already handles the
    2-copy RAID1.

    The minimum number of devices is 3, the maximum number of devices/chunks
    that can be lost/damaged is 2. Like RAID6 but with 33% space
    utilization.
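
    The constraints of the RAID1 family can be summarized in a small table.
    This is a toy subset with hypothetical names, not the kernel raid
    table; the values come from the descriptions of the 2-, 3- and 4-copy
    profiles.

```c
#include <assert.h>
#include <stddef.h>

/* Toy subset of the profile attributes; not the kernel raid table. */
struct toy_raid_attr {
    const char *name;
    int ncopies;            /* copies of each block */
    int devs_min;           /* minimum number of devices */
    int tolerated_failures; /* devices/chunks that may be lost */
};

static const struct toy_raid_attr toy_raid1_family[] = {
    { "raid1",   2, 2, 1 },
    { "raid1c3", 3, 3, 2 },
    { "raid1c4", 4, 4, 3 },
};

/* Usable share of raw space in percent: one copy out of ncopies. */
int toy_space_utilization_pct(const struct toy_raid_attr *attr)
{
    assert(attr != NULL && attr->ncopies > 0);
    return 100 / attr->ncopies;
}
```

    For raid1c3 the helper yields 33, matching the 33% space utilization
    mentioned above.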

    Signed-off-by: David Sterba

    David Sterba
     
  • In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
    cow_file_range_async gained wbc as a parameter and this makes passing
    write flags redundant. Set it inside the function and remove the
    parameter.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
     
  • __extent_writepage reads write flags from wbc and passes both to
    __extent_writepage_io. This makes write_flags redundant and we can
    remove it.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
     
  • Backreference walking, which is used by send to figure if it can issue
    clone operations instead of write operations, can be very slow and use
    too much memory when extents have many references. This change simply
    skips backreference walking when an extent has more than 64 references,
    in which case we fall back to a write operation instead of a clone
    operation. This limit is conservative and in practice I observed no
    significant slowdown with up to 100 references and still low memory usage
    up to that limit.

    This is a temporary workaround until there are speedups in the backref
    walking code, and as such it does not attempt to add extra interfaces or
    knobs to tweak the threshold.
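
    The decision can be sketched as follows. Only the threshold of 64 is
    taken from the description; the macro, enum, struct and function names
    are hypothetical, and the backref walk is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Threshold taken from the text above; the macro name is hypothetical. */
#define TOY_SEND_MAX_EXTENT_REFS 64

enum toy_send_op { TOY_SEND_WRITE, TOY_SEND_CLONE };

struct toy_extent {
    unsigned long refs;         /* number of references to the extent */
    bool clone_source_exists;   /* what a backref walk would find */
};

/* Decide between clone and write without walking heavily shared extents. */
enum toy_send_op toy_pick_send_op(const struct toy_extent *ext)
{
    assert(ext != NULL);
    if (ext->refs > TOY_SEND_MAX_EXTENT_REFS)
        return TOY_SEND_WRITE;  /* skip the walk, fall back to a write */
    /* Stand-in for the backref walk itself. */
    return ext->clone_source_exists ? TOY_SEND_CLONE : TOY_SEND_WRITE;
}
```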

    Reported-by: Atemu
    Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • For send we currently skip clone operations when the source and
    destination files are the same. This is so because clone didn't support
    this case in its early days, but support for it was added back in May
    2013 by commit a96fbc72884fcb ("Btrfs: allow file data clone within a
    file"). This change adds support for it.

    Example:

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt/sdd

    $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
    $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar

    $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap

    $ mkfs.btrfs -f /dev/sde
    $ mount /dev/sde /mnt/sde

    $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde

    Without this change file foobar at the destination has a single 128Kb
    extent:

    $ filefrag -v /mnt/sde/snap/foobar
    Filesystem type is: 9123683e
    File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
    ext: logical_offset: physical_offset: length: expected: flags:
    0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof
    /mnt/sde/snap/foobar: 1 extent found

    With this we get a single 64Kb extent that is shared at file offsets 0
    and 64K, just like in the source filesystem:

    $ filefrag -v /mnt/sde/snap/foobar
    Filesystem type is: 9123683e
    File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
    ext: logical_offset: physical_offset: length: expected: flags:
    0: 0.. 15: 3328.. 3343: 16: shared
    1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof
    /mnt/sde/snap/foobar: 2 extents found

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • [BUG]
    When deleting large files (which cross block group boundary) with
    discard mount option, we find some btrfs_discard_extent() calls only
    trimmed part of its space, not the whole range:

    btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%

    type: bbio->map_type, in above case, it's SINGLE DATA.
    start: Logical address of this trim
    len: Logical length of this trim
    trimmed: Physically trimmed bytes
    ratio: trimmed / len

    Thus leaving some unused space not discarded.

    [CAUSE]
    When discard mount option is specified, after a transaction is fully
    committed (super block written to disk), we begin to cleanup pinned
    extents in the following call chain:

    btrfs_commit_transaction()
    |- btrfs_finish_extent_commit()
    |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
    |- btrfs_discard_extent()

    However, pinned extents are recorded in an extent_io_tree, which can
    merge adjacent extent states.

    When a large file gets deleted and it has adjacent file extents across
    block group boundary, we will get a large merged range like this:

    |<--- BG1 --->|<--- BG2 --->|
    |//////|<-- Range --->|/////|

    To discard that range, we have the following calls:

    btrfs_discard_extent()
    |- btrfs_map_block()
    |  Returned bbio will end at BG1's end, as btrfs_map_block()
    |  never returns a result across a block group boundary.
    |- btrfs_issue_discard()
       Issue discard for each stripe.

    So we will only discard the range in BG1, not the remaining part in BG2.

    Furthermore, this bug is not that reliably observed, for above case, if
    there is no other extent in BG2, BG2 will be empty and btrfs will trim
    all space of BG2, covering up the bug.

    [FIX]
    - Allow __btrfs_map_block_for_discard() to modify @length parameter
      btrfs_map_block() uses its @length parameter to notify the caller how
      many bytes are mapped in the current call.
      With __btrfs_map_block_for_discard() also modifying @length,
      btrfs_discard_extent() now understands when to do extra trim.

    - Call btrfs_map_block() in a loop until we hit the range end
      Since we now know how many bytes are mapped each time, we can iterate
      through each block group boundary and issue correct trim for each range.
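
    The loop described in the fix can be sketched in userspace like this.
    It assumes a hypothetical fixed block group size and toy_* helpers;
    the real mapping and discard code is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed block group size; real block groups vary in size. */
#define TOY_BG_SIZE (1024ULL * 1024 * 1024)

/*
 * Toy stand-in for the mapping step: shrink *length so that the mapped
 * range never crosses a block group boundary.
 */
void toy_map_block_for_discard(uint64_t start, uint64_t *length)
{
    uint64_t bg_end = (start / TOY_BG_SIZE + 1) * TOY_BG_SIZE;

    assert(length != NULL && *length > 0);
    if (start + *length > bg_end)
        *length = bg_end - start;
}

/* Discard [start, start + len) by mapping one block group at a time. */
uint64_t toy_discard_extent(uint64_t start, uint64_t len)
{
    uint64_t end = start + len;
    uint64_t trimmed = 0;

    while (start < end) {
        uint64_t mapped = end - start;

        toy_map_block_for_discard(start, &mapped);
        trimmed += mapped;      /* real code would issue the discard here */
        start += mapped;
    }
    return trimmed;
}
```

    A range that starts in the middle of one block group and spans into the
    next is now trimmed in two steps, so the trimmed total equals the
    requested length instead of stopping at the first boundary.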

    Reviewed-by: Filipe Manana
    Reviewed-by: Nikolay Borisov
    Tested-by: Nikolay Borisov
    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • The old code goes:

    offset = logical - em->start;
    length = min_t(u64, em->len - offset, length);

    Here the @length calculation depends on offset, so it can take a reader
    several more seconds to see that it is just the same code as:

    offset = logical - em->start;
    length = min_t(u64, em->start + em->len - logical, length);

    Use the above code to make the length calculation independent of other
    variables, thus slightly increasing readability.
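
    Both forms give the same result because
    em->len - (logical - em->start) equals em->start + em->len - logical.
    A small self-contained check, using a hypothetical struct
    toy_extent_map and a plain helper in place of min_t():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_extent_map {
    uint64_t start;
    uint64_t len;
};

static uint64_t toy_min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/* Old form: the length depends on the previously computed offset. */
uint64_t toy_length_old(const struct toy_extent_map *em, uint64_t logical,
                        uint64_t length)
{
    uint64_t offset;

    assert(em != NULL && logical >= em->start);
    offset = logical - em->start;
    return toy_min_u64(em->len - offset, length);
}

/* New form: the length is computed directly from em and logical. */
uint64_t toy_length_new(const struct toy_extent_map *em, uint64_t logical,
                        uint64_t length)
{
    assert(em != NULL && logical >= em->start);
    return toy_min_u64(em->start + em->len - logical, length);
}
```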

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • In check_extent_data_item(), we read file extent type without verifying
    if the item size is valid.

    Add such check to ensure the file extent type we read is correct.

    The check is not exact, since it needs to cover both inline and regular
    extents, so it only checks that the item size is larger than or equal
    to the inline header size.
    So the existing size checks on inline/regular extents are still needed.
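
    A minimal sketch of the pattern, validating the size before reading a
    field. The header size, the offset of the type byte and the function
    name are assumptions made for this example, not the tree-checker code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the type byte is the last byte of the inline header. */
#define TOY_INLINE_HEADER_SIZE 21
#define TOY_TYPE_OFFSET        20

/* Store the extent type and return true only if the item is large enough. */
bool toy_read_extent_type(const uint8_t *item, size_t item_size, uint8_t *type)
{
    assert(item != NULL && type != NULL);
    if (item_size < TOY_INLINE_HEADER_SIZE)
        return false;   /* too small: the type byte is not inside the item */
    *type = item[TOY_TYPE_OFFSET];
    return true;
}
```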

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
    the same lock but Smatch is not clever enough to figure that out so it
    leads to static checker warnings. It's better to use it consistently
    anyway.

    Signed-off-by: Dan Carpenter
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dan Carpenter
     
  • The backup_root_index member stores the index at which the backup root
    should be saved upon next transaction commit. However, there is a
    small deviation from this behavior in the form of a check in
    backup_super_roots which checks if the current root generation equals the
    generation of the previous root. This can trigger in the following
    scenario:

    slot0: gen-2
    slot1: gen-1
    slot2: gen
    slot3: unused

    Now suppose slot3 (which is also the root specified in the super block)
    is corrupted hence init_tree_roots chooses to use the backup root at
    slot2, meaning read_backup_root will read slot2 and assign the
    superblock generation to gen-1. Despite this backup_root_index will
    point at slot3 because its init happens in init_backup_root_slot, long
    before any parsing of the backup roots occurs. Then on next transaction
    start, gen-1 will be incremented by 1 making the root's generation
    equal gen. Subsequently, on transaction commit the following check
    triggers:

    if (btrfs_backup_tree_root_gen(root_backup) ==
    btrfs_header_generation(info->tree_root->node))

    This causes 'next_backup', which is the index at which the backup is
    going to be written, to be set to last_backup, which will be slot2.

    All of this is a very confusing way of expressing the following
    invariant:

    Always write a backup root at the index following the last used backup
    root.

    This commit streamlines this logic by setting backup_root_index to the
    next index after the one used for mount.
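
    The invariant amounts to a circular increment. A sketch with a
    hypothetical helper name, assuming the four slots shown in the example
    above:

```c
#include <assert.h>

#define TOY_NUM_BACKUP_ROOTS 4  /* the four slots from the example above */

/* Index where the next backup root is written: the slot after the one used. */
int toy_next_backup_root_index(int used_slot)
{
    assert(used_slot >= 0 && used_slot < TOY_NUM_BACKUP_ROOTS);
    return (used_slot + 1) % TOY_NUM_BACKUP_ROOTS;
}
```

    Mounting from slot2 gives slot3 as the next write index, and mounting
    from slot3 wraps around to slot0.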

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The old name was an awful misnomer because it didn't really find
    the oldest super backup per se but rather its slot. For example if we
    have:

    slot0: gen - 2
    slot1: gen - 1
    slot2: gen
    slot3: empty

    init_backup_root_slot will return slot3 and not slot0.

    The new name is more appropriate since the function doesn't care whether
    there is a valid backup in the returned slot or not.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • This function has been superseded by previous commits and is no longer
    used so just remove it.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Since the filesystem is not well formed and no trees are loaded, it's
    pointless to hold the objectid_mutex. Just remove its usage.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The code responsible for reading and initializing tree roots is
    scattered in open_ctree among 2 labels, emulating a loop. This is rather
    confusing to reason about. Instead, factor the code into a new function,
    init_tree_roots, which implements the same logical flow.

    There are a couple of notable differences, namely:

    * Instead of using next_backup_root it's using the newly introduced
    read_backup_root.

    * If read_backup_root returns an error, init_tree_roots propagates the
    error and there is no special handling of that case, e.g. the code jumps
    straight to the 'fail_tree_roots' label. The old code, however, was
    (erroneously) jumping to the 'fail_block_groups' label if
    next_backup_root failed. This was unnecessary since the tree roots init
    logic doesn't modify the state of block groups.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • This function will replace next_root_backup with a much saner/cleaner
    interface.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • It's no longer needed following cleanups around find_newest_backup_root

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • Backup roots are always written in a circular manner. By definition we
    can only ever have 1 backup root whose generation equals that of the
    superblock. Hence, the 'if' in the for loop will trigger at most once.
    This is sufficient to return the newest backup root.

    Furthermore the newest_gen parameter is always set to the generation of
    the superblock. This value can be obtained from the fs_info.

    This patch removes the unnecessary code dealing with the wraparound
    case and makes 'newest_gen' a local variable.
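
    The simplified search can be sketched as a single pass over the slots.
    The function name, the array-of-generations representation and the -1
    return value are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_BACKUP_ROOTS 4  /* assumed number of backup slots */

/* Return the slot whose generation equals the superblock's, or -1 if none. */
int toy_find_newest_backup_root(const uint64_t *gens, uint64_t super_gen)
{
    assert(gens != NULL);
    for (int i = 0; i < TOY_NUM_BACKUP_ROOTS; i++) {
        if (gens[i] == super_gen)
            return i;   /* at most one slot can match */
    }
    return -1;
}
```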

    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     
  • The inode delalloc mutex was added a long time ago by commit f248679e86fea
    ("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
    the reason for its introduction is not very clear from the change log. It
    claims it solves bogus warnings from lockdep, however it lacks an example
    report/warning from lockdep, or any explanation.

    Since we have enough concurrency protection from the locks of the space
    info and block reserve objects, and such lockdep warnings don't seem to
    exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
    ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
    the size of the btrfs_inode structure. With some quick fio tests doing
    direct IO and mmap writes I couldn't observe any significant performance
    increase either (direct IO writes that don't increase the file's size
    don't hold the inode's lock for their entire duration and mmap writes
    don't hold the inode's lock at all), which are the only type of writes
    that could see any performance gain due to less serialization.

    Review feedback from Josef:

    The problem was taking the i_mutex in mmap, which is how I was
    protecting delalloc reservations originally. The delalloc mutex didn't
    come with all of the other dependencies. That's what the lockdep
    messages were about, removing the lock isn't going to make them appear
    again.

    We _had_ to lock around this because we used to do tricks to keep from
    over-reserving, and if we didn't serialize delalloc reservations we'd
    end up with ugly accounting problems when we tried to clean things up.

    However with my recentish changes this isn't the case anymore. Every
    operation is responsible for reserving its space, and then adding it to
    the inode. Then cleaning up is straightforward and can't be mucked up
    by other users. So we no longer need the delalloc mutex to save us from
    ourselves.

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • It is not used anymore since commit 957780eb2788d8 ("Btrfs: introduce
    ticketed enospc infrastructure"), so just remove it.

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • In btrfs_statfs() we cache fs_info::space_info in a local variable only
    to use it once in a list_for_each_rcu() statement.

    Not only is the local variable unnecessary, it even makes the code harder
    to follow as it's not clear which list it is iterating.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn