29 Oct, 2018

1 commit

  • Fix bug of commit 74d46992e0d9 ("block: replace bi_bdev with a gendisk
    pointer and partitions index").

    bio_dev(bio) is used to find the dev state in function
    __btrfsic_submit_bio. But when dev_state is added to the hashtable, it
    is using dev_t of block_device.

    bio_dev(bio) returns a dev_t of part0 which is different from dev_t in
    block_device(bd_dev). bd_dev in block_device represents the exact
    partition.

    block_device.bd_dev =
    bio->bi_partno (same as block_device.bd_partno) + bio_dev(bio).

    When adding a dev_state into hashtable, we use the exact partition dev_t.
    So when looking it up, it should also use the exact partition dev_t.
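
    The mismatch can be illustrated with a toy model, which is not the kernel
    code. It assumes the Linux dev_t encoding (major << 20 | minor) and a
    hypothetical disk with major 8, minor 0 and partition 7; the dictionary
    stands in for the dev_state hashtable.

```python
# Toy model of the dev_state hashtable lookup; not the kernel implementation.
MINORBITS = 20  # Linux encodes dev_t as (major << 20) | minor


def mkdev(major: int, minor: int) -> int:
    """Build a dev_t-like integer from major and minor numbers."""
    return (major << MINORBITS) | minor


# Hypothetical whole disk (major 8, minor 0) and partition number 7.
part0_dev = mkdev(8, 0)          # dev_t of part0, as returned by bio_dev(bio)
bi_partno = 7                    # bio->bi_partno
bd_dev = part0_dev + bi_partno   # block_device.bd_dev of the exact partition

# Entries are inserted under the exact partition dev_t.
dev_states = {bd_dev: "dev_state for partition 7"}

# Buggy lookup: keyed by the part0 dev_t, so the entry is missed.
buggy = dev_states.get(part0_dev)
# Fixed lookup: keyed by the exact partition dev_t, as at insertion time.
fixed = dev_states.get(part0_dev + bi_partno)

print(buggy)
print(fixed)
```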

    Reproducer of this bug:

    Use MOUNT_OPTIONS="-o check_int" and run btrfs/001 in fstests.
    Then there will be WARNING like below.

    WARNING:
    btrfs: attempt to write superblock which references block M @29523968 (sda7 /1111654400/2) which is never written!

    Signed-off-by: Gu JinXiang
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    (cherry picked from commit d28e649a5c58b779b303c252c66ee84a0f2c3b32)

    Gu JinXiang
     

10 Oct, 2018

1 commit

  • [ Upstream commit 801660b040d132f67fac6a95910ad307c5929b49 ]

    Test case btrfs/164 reports use-after-free:

    [ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
    ..
    [ 6712.195423] btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
    [ 6712.201424] btrfs_commit_transaction+0x57d/0xa90 [btrfs]
    [ 6712.206999] btrfs_rm_device+0x627/0x850 [btrfs]
    [ 6712.211800] btrfs_ioctl+0x2b03/0x3120 [btrfs]

    Reason for this is that btrfs_shrink_device adds the resized device to
    the fs_devices::resized_devices after it has called the last commit
    transaction.

    So the list fs_devices::resized_devices is not empty when
    btrfs_shrink_device returns. Now the parent function
    btrfs_rm_device calls:

    btrfs_close_bdev(device);
    call_rcu(&device->rcu, free_device_rcu);

    and then does the transaction commit. It goes through the
    fs_devices::resized_devices in btrfs_update_commit_device_size and
    leads to use-after-free.

    Fix this by making sure btrfs_shrink_device calls the last needed
    btrfs_commit_transaction before the return. This is consistent with what
    the grow counterpart does and this makes sure the on-disk state is
    persistent when the function returns.

    Reported-by: Lu Fengqi
    Tested-by: Lu Fengqi
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    [ update changelog ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     

20 Sep, 2018

1 commit

  • commit de02b9f6bb65a6a1848f346f7a3617b7a9b930c0 upstream.

    If we deduplicate extents between two different files we can end up
    corrupting data if the source range ends at the size of the source file,
    the source file's size is not aligned to the filesystem's block size
    and the destination range does not go past the size of the destination
    file.

    Example:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
    # The first byte with a value of 0xae starts at an offset (2518890)
    # which is not a multiple of the sector size.
    $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo

    # Confirm the file content is full of bytes with values 0x6b and 0xae.
    $ od -t x1 /mnt/foo
    0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
    *
    11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
    11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
    *
    11777540 ae ae ae ae ae ae ae ae
    11777550

    # Create a second file with a length not aligned to the sector size,
    # whose bytes all have the value 0x6b, so that its extent(s) can be
    # deduplicated with the first file.
    $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar

    # Now deduplicate the entire second file into a range of the first file
    # that also has all bytes with the value 0x6b. The destination range's
    # end offset must not be aligned to the sector size and must be less
    # than the offset of the first byte with the value 0xae (byte at offset
    # 2518890).
    $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo

    # The bytes in the range starting at offset 2515659 (end of the
    # deduplication range) and ending at offset 2519040 (start offset
    # rounded up to the block size) must all have the value 0xae (and not
    # replaced with 0x00 values). In other words, we should have exactly
    # the same data we had before we asked for deduplication.
    $ od -t x1 /mnt/foo
    0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
    *
    11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
    11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
    *
    11777540 ae ae ae ae ae ae ae ae
    11777550

    # Unmount the filesystem and mount it again. This guarantees any file
    # data in the page cache is dropped.
    $ umount /dev/sdb
    $ mount /dev/sdb /mnt

    $ od -t x1 /mnt/foo
    0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
    *
    11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
    11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
    *
    11777540 ae ae ae ae ae ae ae ae
    11777550

    # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
    # value of 0xae, data corruption happened due to the deduplication
    # operation.

    So fix this by rounding down, to the sector size, the length used for the
    deduplication when the following conditions are met:

    1) Source file's range ends at its i_size;
    2) Source file's i_size is not aligned to the sector size;
    3) Destination range does not cross the i_size of the destination file.
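
    The three conditions can be sketched as a small function. This is an
    illustration under assumptions, not the kernel patch: the sector size of
    4096 bytes, the function name and the strict "<" used for condition 3 are
    all assumed here. The values come from the reproducer above.

```python
SECTOR_SIZE = 4096  # assumed filesystem block size


def dedupe_length(src_off, length, src_size, dst_off, dst_size,
                  sector_size=SECTOR_SIZE):
    """Return the length to deduplicate after applying the three conditions.

    Assumes src_off is already sector aligned, so rounding the length
    down also rounds the end of the source range down.
    """
    src_ends_at_isize = src_off + length == src_size      # condition 1
    src_isize_unaligned = src_size % sector_size != 0     # condition 2
    dst_within_isize = dst_off + length < dst_size        # condition 3 (assumed strict)
    if src_ends_at_isize and src_isize_unaligned and dst_within_isize:
        return length - (length % sector_size)
    return length


# Reproducer values: /mnt/bar is the source, /mnt/foo the destination.
bar_size = 557771
foo_size = 2518890 + 102398
rounded = dedupe_length(0, bar_size, bar_size, 1957888, foo_size)
print(rounded, bar_size - rounded)
```

    With these numbers the unaligned 715-byte tail of the source file is left
    out of the deduplication, so the destination block that holds the first
    0xae bytes is not touched.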

    Fixes: e1d227a42ea2 ("btrfs: Handle unaligned length in extent_same")
    CC: stable@vger.kernel.org # 4.2+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

15 Sep, 2018

4 commits

  • [ Upstream commit 43794446548730ac8461be30bbe47d5d027d1d16 ]

    [BUG]
    Under certain KVM load and LTP tests, it is possible to hit the
    following calltrace if quota is enabled:

    BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
    BTRFS critical (device vda2): unable to find logical 8820195328 length 4096

    WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
    CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
    task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
    RIP: 0010:blk_status_to_errno+0x1a/0x30
    Call Trace:
    submit_extent_page+0x191/0x270 [btrfs]
    ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
    __do_readpage+0x2d2/0x810 [btrfs]
    ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
    ? run_one_async_done+0xc0/0xc0 [btrfs]
    __extent_read_full_page+0xe7/0x100 [btrfs]
    ? run_one_async_done+0xc0/0xc0 [btrfs]
    read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
    ? run_one_async_done+0xc0/0xc0 [btrfs]
    btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
    read_tree_block+0x31/0x60 [btrfs]
    read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
    btrfs_search_slot+0x46b/0xa00 [btrfs]
    ? kmem_cache_alloc+0x1a8/0x510
    ? btrfs_get_token_32+0x5b/0x120 [btrfs]
    find_parent_nodes+0x11d/0xeb0 [btrfs]
    ? leaf_space_used+0xb8/0xd0 [btrfs]
    ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
    ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
    btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
    btrfs_find_all_roots+0x45/0x60 [btrfs]
    btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
    btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
    btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
    insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
    btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
    ? pick_next_task_fair+0x2cd/0x530
    ? __switch_to+0x92/0x4b0
    btrfs_worker_helper+0x81/0x300 [btrfs]
    process_one_work+0x1da/0x3f0
    worker_thread+0x2b/0x3f0
    ? process_one_work+0x3f0/0x3f0
    kthread+0x11a/0x130
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x35/0x40

    BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
    BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
    BTRFS info (device vda2): forced readonly
    BTRFS error (device vda2): pending csums is 2887680

    [CAUSE]
    It's caused by race with block group auto removal:

    - There is a meta block group X, which has only one tree block
      The tree block belongs to fs tree 257.
    - In current transaction, some operation modified fs tree 257
      The tree block gets COWed, so the block group X is empty, and marked
      as unused, queued to be deleted.
    - Some workload (like fsync) wakes up cleaner_kthread()
      Which will call btrfs_delete_unused_bgs() to remove unused block
      groups.
      So block group X along its chunk map get removed.
    - Some delalloc work finished for fs tree 257
      Quota needs to get the original reference of the extent, which will
      read tree blocks of commit root of 257.
      Then since the chunk map gets removed, the above warning gets
      triggered.

    [FIX]
    Just let btrfs_delete_unused_bgs() skip block group which still has
    pinned bytes.

    However, there is a minor side effect: currently we only queue empty
    block groups at update_block_group(). An empty block group with pinned
    bytes won't go through update_block_group() again, so it won't be
    removed until it gets a new extent allocated and removed.
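
    A minimal sketch of the idea, with hypothetical names (BlockGroup,
    deletable) and simplified fields; the real check in
    btrfs_delete_unused_bgs() looks at more state than this:

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    """Simplified stand-in for a block group on the unused list."""
    name: str
    used: int     # bytes still referenced by the current tree
    pinned: int   # bytes freed in this transaction, still needed by the commit root


def deletable(unused_bgs):
    """Keep only block groups that are empty and have no pinned bytes."""
    return [bg for bg in unused_bgs if bg.used == 0 and bg.pinned == 0]


# Block group X: its only tree block was COWed, so it is empty but pinned.
bg_x = BlockGroup("X", used=0, pinned=16384)
bg_y = BlockGroup("Y", used=0, pinned=0)
to_delete = [bg.name for bg in deletable([bg_x, bg_y])]
print(to_delete)
```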

    Signed-off-by: Qu Wenruo
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • [ Upstream commit 389305b2aa68723c754f88d9dbd268a400e10664 ]

    Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
    does some cleanup of the reloc roots.

    It turns out that fs_info::reloc_ctl can be NULL in
    btrfs_recover_relocation() as we allocate relocation control after all
    reloc roots have been verified.
    So when we hit: note, we haven't called set_reloc_control() thus
    fs_info::reloc_ctl is still NULL.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
    Reported-by: Xu Wen
    Signed-off-by: Qu Wenruo
    Tested-by: Gu Jinxiang
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • [ Upstream commit 1e7e1f9e3aba00c9b9c323bfeeddafe69ff21ff6 ]

    on-disk devs stats value is updated in btrfs_run_dev_stats(),
    which is called during commit transaction, if device->dev_stats_ccnt
    is not zero.

    Since current replace operation does not touch dev_stats_ccnt,
    on-disk dev stats value is not updated. Therefore "btrfs device stats"
    may return old device's value after umount/mount
    (Example: See "btrfs ins dump-t -t DEV $DEV" after btrfs/100 finish).

    Fix this by just incrementing dev_stats_ccnt in
    btrfs_dev_replace_finishing() when the replace succeeds, so that the
    on-disk values get updated.
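
    The effect of the counter can be modelled as follows. This is a toy
    model: the Device class, the devid key and the "write_errs" field are
    hypothetical, and only the "write when dev_stats_ccnt is not zero" rule is
    taken from the description above.

```python
class Device:
    """Toy device with in-memory stats and a pending-change counter."""

    def __init__(self, devid, stats):
        self.devid = devid
        self.stats = dict(stats)
        self.dev_stats_ccnt = 0


def run_dev_stats(devices, on_disk):
    """Model of the commit-time update: skip devices with no pending change."""
    for dev in devices:
        if dev.dev_stats_ccnt == 0:
            continue
        on_disk[dev.devid] = dict(dev.stats)
        dev.dev_stats_ccnt = 0


def simulate(bump_counter):
    """Replace a device and commit; return the on-disk write_errs value."""
    on_disk = {1: {"write_errs": 3}}           # left behind by the old device
    target = Device(devid=1, stats={"write_errs": 0})
    if bump_counter:                            # the fix: mark stats as changed
        target.dev_stats_ccnt += 1
    run_dev_stats([target], on_disk)
    return on_disk[1]["write_errs"]


stale = simulate(bump_counter=False)
fresh = simulate(bump_counter=True)
print(stale, fresh)
```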

    Signed-off-by: Misono Tomohiro
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Misono Tomohiro
     
  • [ Upstream commit 64f64f43c89aca1782aa672e0586f6903c5d8979 ]

    It's entirely possible that a crafted btrfs image contains overlapping
    chunks.

    Although tree-checker can't detect such a problem, it's not
    catastrophic: the current extent map code already detects the overlap
    and returns -EEXIST.

    We only need to exit gracefully and fail the mount.

    Reported-by: Xu Wen
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=200409
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

05 Sep, 2018

3 commits

  • commit 3c4276936f6fbe52884b4ea4e6cc120b890a0f9f upstream.

    We recently ran into the following deadlock involving
    btrfs_write_inode():

    [ +0.005066] __schedule+0x38e/0x8c0
    [ +0.007144] schedule+0x36/0x80
    [ +0.006447] bit_wait+0x11/0x60
    [ +0.006446] __wait_on_bit+0xbe/0x110
    [ +0.007487] ? bit_wait_io+0x60/0x60
    [ +0.007319] __inode_wait_for_writeback+0x96/0xc0
    [ +0.009568] ? autoremove_wake_function+0x40/0x40
    [ +0.009565] inode_wait_for_writeback+0x21/0x30
    [ +0.009224] evict+0xb0/0x190
    [ +0.006099] iput+0x1a8/0x210
    [ +0.006103] btrfs_run_delayed_iputs+0x73/0xc0
    [ +0.009047] btrfs_commit_transaction+0x799/0x8c0
    [ +0.009567] btrfs_write_inode+0x81/0xb0
    [ +0.008008] __writeback_single_inode+0x267/0x320
    [ +0.009569] writeback_sb_inodes+0x25b/0x4e0
    [ +0.008702] wb_writeback+0x102/0x2d0
    [ +0.007487] wb_workfn+0xa4/0x310
    [ +0.006794] ? wb_workfn+0xa4/0x310
    [ +0.007143] process_one_work+0x150/0x410
    [ +0.008179] worker_thread+0x6d/0x520
    [ +0.007490] kthread+0x12c/0x160
    [ +0.006620] ? put_pwq_unlocked+0x80/0x80
    [ +0.008185] ? kthread_park+0xa0/0xa0
    [ +0.007484] ? do_syscall_64+0x53/0x150
    [ +0.007837] ret_from_fork+0x29/0x40

    Writeback calls:

    btrfs_write_inode
    btrfs_commit_transaction
    btrfs_run_delayed_iputs

    If iput() is called on that same inode, evict() will wait for writeback
    forever.

    btrfs_write_inode() was originally added way back in 4730a4bc5bf3
    ("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode()
    hasn't been used for O_SYNC since 148f948ba877 ("vfs: Introduce new
    helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so
    btrfs_write_inode() is actually unnecessary (and leads to a bunch of
    unnecessary commits). Get rid of it, which also gets rid of the
    deadlock.

    CC: stable@vger.kernel.org # 3.2+
    Signed-off-by: Josef Bacik
    [Omar: new commit message]
    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 4559b0a71749c442d34f7cfb9e72c9e58db83948 upstream.

    If we're trying to make a data reservation and we have to allocate a
    data chunk we could leak ret == 1, as do_chunk_alloc() will return 1 if
    it allocated a chunk. Since the end of the function is the success path
    just return 0.

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit d814a49198eafa6163698bdd93961302f3a877a4 upstream.

    We use a customized, nodesize-based batch value to update
    dirty_metadata_bytes. We should also use the batch version of the compare
    function, or we will easily take the fast path and get a false result
    from percpu_counter_compare().
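
    The false result can be reproduced with a small model of a per-CPU
    counter. This is a simplified sketch, not the kernel implementation: the
    class, the batch values (32 and 65536) and the threshold are assumptions
    chosen for illustration. The compare logic follows the usual pattern of
    trusting the rough global count when it differs from the threshold by more
    than batch * number of CPUs.

```python
class PercpuCounter:
    """Toy per-CPU counter: local deltas are folded in once they reach batch."""

    def __init__(self, ncpus):
        self.count = 0               # global, possibly stale value
        self.local = [0] * ncpus     # per-CPU deltas not yet folded in

    def add_batch(self, cpu, amount, batch):
        new = self.local[cpu] + amount
        if abs(new) >= batch:
            self.count += new
            self.local[cpu] = 0
        else:
            self.local[cpu] = new

    def precise_sum(self):
        return self.count + sum(self.local)

    def compare(self, rhs, batch):
        """Return 1, 0 or -1; use the rough count if it is far enough from rhs."""
        rough = self.count
        if abs(rough - rhs) > batch * len(self.local):
            return 1 if rough > rhs else -1      # fast path
        total = self.precise_sum()               # slow, exact path
        return (total > rhs) - (total < rhs)


DEFAULT_BATCH = 32       # assumed default batch
CUSTOM_BATCH = 65536     # assumed nodesize-based batch
THRESHOLD = 100000       # hypothetical dirty metadata threshold

counter = PercpuCounter(ncpus=4)
for cpu in range(4):
    counter.add_batch(cpu, 49152, CUSTOM_BATCH)  # stays below the custom batch

wrong = counter.compare(THRESHOLD, DEFAULT_BATCH)
right = counter.compare(THRESHOLD, CUSTOM_BATCH)
print(counter.count, counter.precise_sum(), wrong, right)
```

    The global count is still 0 while the true sum is above the threshold, so
    comparing with the small default batch takes the fast path and reports
    "below threshold".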

    Fixes: e2d845211eda ("Btrfs: use percpu counter for dirty metadata count")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Ethan Lien
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Ethan Lien
     

24 Aug, 2018

1 commit

  • [ Upstream commit 665d4953cde6d9e75c62a07ec8f4f8fd7d396ade ]

    In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device
    replace") we removed the branch of copy_nocow_pages() to avoid
    corruption for compressed nodatasum extents.

    However above commit only solves the problem in scrub_extent(), if
    during scrub_pages() we failed to read some pages,
    sctx->no_io_error_seen will be non-zero and we go to fixup function
    scrub_handle_errored_block().

    In scrub_handle_errored_block(), for sctx without csum (no matter if
    we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine,
    which does the similar thing with copy_nocow_pages(), but does it
    without the extra check in copy_nocow_pages() routine.

    So for test cases like btrfs/100, where we emulate read errors during
    replace/scrub, we could corrupt compressed extent data again.

    This patch fixes it by avoiding any "optimization" for nodatasum and
    falling back to the normal fixup routine, which tries to read from any
    good copy.

    This also solves WARN_ON() or dead lock caused by lame backref iteration
    in scrub_fixup_nodatasum() routine.

    The deadlock or WARN_ON() won't be triggered before commit ac0b4145d662
    ("btrfs: scrub: Don't use inode pages for device replace") since
    copy_nocow_pages() has better locking and an extra check for data extents,
    and it already does the fixup work by trying to read data from any good
    copy, so it won't go to scrub_fixup_nodatasum() anyway.

    This patch disables the faulty code, which will be removed completely in a
    followup patch.

    Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

09 Aug, 2018

1 commit

  • commit bd3599a0e142cd73edd3b6801068ac3f48ac771a upstream.

    When we clone a range into a file we can end up dropping existing
    extent maps (or trimming them) and replacing them with new ones if the
    range to be cloned overlaps with a range in the destination inode.
    When that happens we add the new extent maps to the list of modified
    extents in the inode's extent map tree, so that a "fast" fsync (the flag
    BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
    and log corresponding extent items. However, at the end of range cloning
    operation we do truncate all the pages in the affected range (in order to
    ensure future reads will not get stale data). Sometimes this truncation
    will release the corresponding extent maps besides the pages from the page
    cache. If this happens, then a "fast" fsync operation will miss logging
    some extent items, because it relies exclusively on the extent maps being
    present in the inode's extent tree, leading to data loss/corruption if
    the fsync ends up using the same transaction used by the clone operation
    (that transaction was not committed in the meanwhile). An extent map is
    released through the callback btrfs_invalidatepage(), which gets called by
    truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
    latter ends up calling try_release_extent_mapping(), which will release
    the extent map if some conditions are met: the file size being greater
    than 16Mb, the gfp flags allowing blocking, the range not being locked
    (which is the case during the clone operation) and the extent map not
    being flagged as pinned (also the case for cloning).

    The following example, turned into a test for fstests, reproduces the
    issue:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
    $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

    $ xfs_io -c "fsync" /mnt/bar
    # reflink destination offset corresponds to the size of file bar,
    # 2728Kb minus 4Kb.
    $ xfs_io -c "reflink /mnt/foo 0 2724K 15908K" /mnt/bar
    $ xfs_io -c "fsync" /mnt/bar

    $ md5sum /mnt/bar
    95a95813a8c2abc9aa75a6c2914a077e /mnt/bar

    <power fail>

    $ mount /dev/sdb /mnt
    $ md5sum /mnt/bar
    207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
    # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
    # power failure

    In the above example, the destination offset of the clone operation
    corresponds to the size of the "bar" file minus 4Kb. So during the clone
    operation, the extent map covering the range from 2572Kb to 2728Kb gets
    trimmed so that it ends at offset 2724Kb, and a new extent map covering
    the range from 2724Kb to 11724Kb is created. So at the end of the clone
    operation when we ask to truncate the pages in the range from 2724Kb to
    2724Kb + 15908Kb, the page invalidation callback ends up removing the new
    extent map (through try_release_extent_mapping()) when the page at offset
    2724Kb is passed to that callback.
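
    The offsets in this example can be checked with a few lines of
    arithmetic. This is only a worked check of the numbers quoted above; the
    remark that the new extent map corresponds to the unwritten first 9000Kb
    of "foo" is an interpretation of the reproducer, not something the commit
    message states.

```python
K = 1024

# File sizes implied by the pwrite commands in the reproducer.
bar_size = 2572 * K + 156 * K      # end of the only write to /mnt/bar
foo_size = 9000 * K + 6908 * K     # end of the only write to /mnt/foo

clone_dst = bar_size - 4 * K       # destination offset of the reflink
clone_len = 15908 * K              # length passed to reflink

# The existing extent map of bar (2572Kb..2728Kb) is trimmed at clone_dst.
trimmed_em_end = clone_dst

# New extent map quoted in the text: 2724Kb..11724Kb. Its length matches the
# first 9000Kb of foo, which the reproducer never wrote (interpretation).
new_em_end = clone_dst + 9000 * K

# Range whose pages are truncated at the end of the clone operation.
truncate_end = clone_dst + clone_len

print(bar_size // K, clone_dst // K, new_em_end // K, truncate_end // K)
```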

    Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
    map is removed at try_release_extent_mapping(), forcing the next fsync to
    search for modified extents in the fs/subvolume tree instead of relying on
    the presence of extent maps in memory. This way we can continue doing a
    "fast" fsync if the destination range of a clone operation does not
    overlap with an existing range or if any of the criteria necessary to
    remove an extent map at try_release_extent_mapping() is not met (file
    size not bigger than 16Mb or gfp flags do not allow blocking).

    CC: stable@vger.kernel.org # 3.16+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

03 Aug, 2018

5 commits

  • [ Upstream commit ff3d27a048d926b3920ccdb75d98788c567cae0d ]

    Under the following case, qgroup rescan can double account cowed tree
    blocks:

    In this case, extent tree only has one tree block.

    -
    | transid=5 last committed=4
    | btrfs_qgroup_rescan_worker()
    | |- btrfs_start_transaction()
    | |  transid = 5
    | |- qgroup_rescan_leaf()
    |    |- btrfs_search_slot_for_read() on extent tree
    |       Get the only extent tree block from commit root (transid = 4).
    |       Scan it, set qgroup_rescan_progress to the last
    |       EXTENT/META_ITEM + 1
    |       now qgroup_rescan_progress = A + 1.
    |
    | fs tree get CoWed, new tree block is at A + 16K
    | transid 5 get committed
    -
    | transid=6 last committed=5
    | btrfs_qgroup_rescan_worker()
    | |- btrfs_start_transaction()
    | |  transid = 6
    | |- qgroup_rescan_leaf()
    |    |- btrfs_search_slot_for_read() on extent tree
    |       Get the only extent tree block from commit root (transid = 5).
    |       scan it using qgroup_rescan_progress (A + 1).
    |       found new tree block beyond A, and it's fs tree block,
    |       account it to increase qgroup numbers.
    -

    In the above case, tree block A and tree block A + 16K get accounted twice,
    while qgroup rescan should stop once it has already reached the last leaf,
    rather than continue using its qgroup_rescan_progress.

    Such case could happen by just looping btrfs/017 and with some
    possibility it can hit such double qgroup accounting problem.

    Fix it by checking the path to determine if we should finish qgroup
    rescan, rather than relying on the next loop to exit.

    Reported-by: Nikolay Borisov
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • [ Upstream commit 3d3a2e610ea5e7c6d4f9481ecce5d8e2d8317843 ]

    Currently the code assumes that there's an implied barrier by the
    sequence of code preceding the wakeup, namely the mutex unlock.

    As Nikolay pointed out:

    I think this is wrong (not your code) but the original assumption that
    the RELEASE semantics provided by mutex_unlock is sufficient.
    According to memory-barriers.txt:

    Section 'LOCK ACQUISITION FUNCTIONS' states:

    (2) RELEASE operation implication:

    Memory operations issued before the RELEASE will be completed before the
    RELEASE operation has completed.

    Memory operations issued after the RELEASE *may* be completed before the
    RELEASE operation has completed.

    (I've bolded the may portion)

    The example given there:

    As an example, consider the following:

    *A = a;
    *B = b;
    ACQUIRE
    *C = c;
    *D = d;
    RELEASE
    *E = e;
    *F = f;

    The following sequence of events is acceptable:

    ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE

    So if we assume that *C is modifying the flag which the waitqueue is checking,
    and *E is the actual wakeup, then those accesses can be re-ordered...

    IMHO this code should be considered broken...
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     
  • [ Upstream commit 0552210997badb6a60740a26ff9d976a416510f0 ]

    btrfs_free_extent() can fail because of ENOMEM. There's no reason to
    panic here, we can just abort the transaction.

    Fixes: f4b9aa8d3b87 ("btrfs_truncate")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • [ Upstream commit c08db7d8d295a4f3a10faaca376de011afff7950 ]

    In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
    item will still be in the tree but we still return the ino to the ino
    cache. That will blow up later when someone tries to allocate that ino,
    so don't return it to the cache.

    Fixes: 581bb050941b ("Btrfs: Cache free inode numbers in memory")
    Reviewed-by: Josef Bacik
    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • [ Upstream commit e73e81b6d0114d4a303205a952ab2e87c44bd279 ]

    [Problem description and how we fix it]
    We should balance dirty metadata pages at the end of
    btrfs_finish_ordered_io, since a small, unmergeable random write can
    potentially produce dirty metadata which is multiple times larger than
    the data itself. For example, a small, unmergeable 4KiB write may
    produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
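
    As a rough worked example of the amplification, using only the sizes
    listed above (one 16KiB leaf, and possibly one 16KiB node, in each of
    three trees); the variable names are illustrative:

```python
KiB = 1024

data_write = 4 * KiB
nodesize = 16 * KiB
trees = ["subvolume", "checksum", "extent"]

# One dirty leaf per tree.
dirty_leaves_only = len(trees) * nodesize
# One dirty leaf plus one dirty node per tree.
dirty_leaves_and_nodes = len(trees) * 2 * nodesize

print(dirty_leaves_only // KiB, dirty_leaves_only // data_write)
print(dirty_leaves_and_nodes // KiB, dirty_leaves_and_nodes // data_write)
```

    Under these assumptions a single 4KiB write can dirty 48KiB to 96KiB of
    metadata, that is 12 to 24 times the size of the data.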

    Although we do call balance dirty pages on the write side, in the
    buffered write path most metadata is dirtied only after we reach the
    dirty background limit (which so far only counts dirty data pages) and
    wake up the flusher thread. If there are many small, unmergeable random
    writes spread across a large btree, we'll find a burst of dirty pages
    that exceeds the dirty_bytes limit after we wake up the flusher thread,
    which is not what we expect. On our machine, it caused an out-of-memory
    problem since a page cannot be dropped if it is marked dirty.

    Someone may worry that we may sleep in btrfs_btree_balance_dirty_nodelay,
    but since we do btrfs_finish_ordered_io in a separate worker, it will not
    stop the flusher from consuming dirty pages. Also, we use a different
    worker for metadata writeback endio, so sleeping in
    btrfs_finish_ordered_io helps us throttle the size of dirty metadata pages.

    [Reproduce steps]
    To reproduce the problem, we need to do 4KiB writes randomly spread across
    a large btree. On our 2GiB RAM machine:

    1) Create 4 subvolumes.
    2) Run fio on each subvolume:

    [global]
    direct=0
    rw=randwrite
    ioengine=libaio
    bs=4k
    iodepth=16
    numjobs=1
    group_reporting
    size=128G
    runtime=1800
    norandommap
    time_based
    randrepeat=0

    3) Take snapshot on each subvolume and repeat fio on existing files.
    4) Repeat step (3) until we get large btrees.
    In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
    metadata in each subvolume tree and 12GiB of metadata in extent tree.
    5) Stop all fio, take snapshot again, and wait until all delayed work is
    completed.
    6) Start all fio. A few seconds later we hit OOM when the flusher starts
    to work.

    It can be reproduced even when using nocow write.

    Signed-off-by: Ethan Lien
    Reviewed-by: David Sterba
    [ add comment ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ethan Lien
     

22 Jul, 2018

1 commit

  • commit 31d11b83b96faaee4bb514d375a09489117c3e8d upstream.

    In commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size
    after fsync log replay"), on fsync, we started to always log all prealloc
    extents beyond an inode's i_size in order to avoid losing them after a
    power failure. However, in some cases this can lead the log replay
    code to create duplicate extent items, with different lengths, in the
    extent tree. That happens because, as of that commit, we can now log
    extent items based on extent maps that are not on the "modified" list
    of extent maps of the inode's extent map tree. Logging extent items based
    on extent maps is used during the fast fsync path to save time and for
    this to work reliably it requires that the extent maps are not merged
    with other adjacent extent maps - having the extent maps in the list
    of modified extents gives such guarantee.

    Consider the following example, captured during a long run of fsstress,
    which illustrates this problem.

    We have inode 271, in the filesystem tree (root 5), to which all of the
    following operations and discussion apply.

    A buffered write starts at offset 312391 with a length of 933471 bytes
    (end offset at 1245862). At this point we have, for this inode, the
    following extent maps with their field values:

    em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
    block_len 0, orig_block_len 0
    em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
    block_len 376832, orig_block_len 376832
    em C, start 417792, orig_start 417792, len 782336, block_start
    18446744073709551613, block_len 0, orig_block_len 0
    em D, start 1200128, orig_start 1200128, len 835584, block_start
    1106776064, block_len 835584, orig_block_len 835584
    em E, start 2035712, orig_start 2035712, len 245760, block_start
    1107611648, block_len 245760, orig_block_len 245760

    Extent map A corresponds to a hole and extent maps D and E correspond to
    preallocated extents.

    Extent map D ends where extent map E begins (1106776064 + 835584 =
    1107611648), but these extent maps were not merged because they are in
    the inode's list of modified extent maps.

    An fsync against this inode is made, which triggers the fast path
    (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
    of the data previously written using buffered IO, and when the respective
    ordered extent finishes, btrfs_drop_extents() is called against the
    (aligned) range 311296..1249279. This causes a split of extent map D at
    btrfs_drop_extent_cache(), replacing extent map D with a new extent map
    D', also added to the list of modified extents, with the following
    values:

    em D', start 1249280, orig_start of 1200128,
    block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
    orig_block_len 835584,
    block_len 786432 (835584 - (1249280 - 1200128))
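
    The values of D' follow from the write range and from extent map D. The
    short check below assumes a 4096-byte sector size, which is consistent
    with the aligned range quoted above; the helper names are illustrative.

```python
SECTOR = 4096  # assumed sector size


def round_down(x, a=SECTOR):
    return x - (x % a)


def round_up(x, a=SECTOR):
    return round_down(x + a - 1, a)


# Buffered write from the text.
write_start = 312391
write_end = write_start + 933471

# Aligned range passed to btrfs_drop_extents() (inclusive end).
drop_start = round_down(write_start)
drop_end = round_up(write_end) - 1

# Extent map D before the split.
d_start, d_block_start, d_block_len = 1200128, 1106776064, 835584

# The part of D covered by the dropped range is cut off its front.
cut = (drop_end + 1) - d_start
d2_start = d_start + cut
d2_block_start = d_block_start + cut
d2_block_len = d_block_len - cut

print(write_end, drop_start, drop_end)
print(d2_start, d2_block_start, d2_block_len)
```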

    Then, during the fast fsync, btrfs_log_changed_extents() is called and
    extent maps D' and E are removed from the list of modified extents. The
    flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged
    clear_em_logging() is called on each of them, and that causes extent map E
    to be merged with extent map D' (try_merge_map()), resulting in D' being
    deleted and E adjusted to:

    em E, start 1249280, orig_start 1200128, len 1032192,
    block_start 1106825216, block_len 1032192,
    orig_block_len 245760
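
    The merged values can be checked in the same way. The sketch below only
    verifies the arithmetic of the merge (contiguous file offsets, contiguous
    disk blocks, summed lengths); it does not model try_merge_map() itself.

```python
# Extent map D' (after the split) and extent map E (before the merge).
d2 = {"start": 1249280, "len": 786432, "block_start": 1106825216}
e = {"start": 2035712, "len": 245760, "block_start": 1107611648}

# The two maps are adjacent both in the file and on disk.
file_contiguous = d2["start"] + d2["len"] == e["start"]
disk_contiguous = d2["block_start"] + d2["len"] == e["block_start"]

# Merged map: starts where D' started and covers both lengths.
merged = {
    "start": d2["start"],
    "len": d2["len"] + e["len"],
    "block_start": d2["block_start"],
}

print(file_contiguous, disk_contiguous, merged)
```

    Note that orig_block_len stays at 245760 in the adjusted E, so it no
    longer matches the merged block_len of 1032192.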

    A direct IO write at offset 1847296 and length of 360448 bytes (end offset
    at 2207744) starts, and at that moment the following extent maps exist for
    our inode:

    em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
    block_len 0, orig_block_len 0
    em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
    block_len 270336, orig_block_len 376832
    em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
    block_len 937984, orig_block_len 937984
    em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
    block_start 1106825216, block_len 1032192, orig_block_len 245760

    The dio write results in drop_extent_cache() being called twice. The first
    time for a range that starts at offset 1847296 and ends at offset 2035711
    (length of 188416), which results in a double split of extent map E,
    replacing it with two new extent maps:

    em F, start 1249280, orig_start 1200128, block_start 1106825216,
    block_len 598016, orig_block_len 598016
    em G, start 2035712, orig_start 1200128, block_start 1107611648,
    block_len 245760, orig_block_len 1032192

    It also creates a new extent map that represents a part of the requested
    IO (through create_io_em()):

    em H, start 1847296, len 188416, block_start 1107423232, block_len 188416

    The second call to drop_extent_cache() has a range with a start offset of
    2035712 and end offset of 2207743 (length of 172032). This leads to
    replacing extent map G with a new extent map I with the following values:

    em I, start 2207744, orig_start 1200128, block_start 1107783680,
    block_len 73728, orig_block_len 1032192

    It also creates a new extent map that represents the second part of the
    requested IO (through create_io_em()):

    em J, start 2035712, len 172032, block_start 1107611648, block_len 172032

    The dio write set the inode's i_size to 2207744 bytes.

    After the dio write the inode has the following extent maps:

    em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
    block_len 0, orig_block_len 0
    em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
    block_len 270336, orig_block_len 376832
    em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
    block_len 937984, orig_block_len 937984
    em F, start 1249280, orig_start 1200128, len 598016,
    block_start 1106825216, block_len 598016, orig_block_len 598016
    em H, start 1847296, orig_start 1200128, len 188416,
    block_start 1107423232, block_len 188416, orig_block_len 835584
    em J, start 2035712, orig_start 2035712, len 172032,
    block_start 1107611648, block_len 172032, orig_block_len 245760
    em I, start 2207744, orig_start 1200128, len 73728,
    block_start 1107783680, block_len 73728, orig_block_len 1032192

    Now make some change to the file, such as adding an xattr, and then
    fsync it again. This triggers the fast fsync path, and as of commit
    471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync
    log replay"), we use the extent map I to log a file extent item because
    it's a prealloc extent and it starts at an offset matching the inode's
    i_size. However when we log it, we create a file extent item with a value
    for the disk byte location that is wrong, as can be seen from the
    following output of "btrfs inspect-internal dump-tree":

    item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
    generation 22 type 2 (prealloc)
    prealloc data disk byte 1106776064 nr 1032192
    prealloc data offset 1007616 nr 73728

    Here the disk byte value corresponds to calculation based on some fields
    from the extent map I:

    1106776064 = block_start (1107783680) - 1007616 (extent_offset)
    extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
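
    The calculation above can be written as a small C sketch. The struct and function names are hypothetical and only model the three fields involved; this is not the kernel's logging code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the extent map fields used in the calculation. */
struct em_fields {
	uint64_t start;       /* file offset of the extent map */
	uint64_t orig_start;  /* file offset recorded as the extent's start */
	uint64_t block_start; /* disk byte matching 'start' */
};

/* Offset of the extent map into the on-disk extent. */
static inline uint64_t em_extent_offset(const struct em_fields *em)
{
	assert(em->start >= em->orig_start);
	return em->start - em->orig_start;
}

/* Disk byte of the beginning of the on-disk extent, as derived above. */
static inline uint64_t em_disk_bytenr(const struct em_fields *em)
{
	return em->block_start - em_extent_offset(em);
}
```

    With the fields of extent map I this yields offset 1007616 and disk byte 1106776064. For comparison, an orig_start of 2035712 (the file offset of the extent at disk byte 1107611648) would give offset 172032 and disk byte 1107611648.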

    The disk byte value of 1106776064 clashes with disk byte values of the
    file extent items at offsets 1249280 and 1847296 in the fs tree:

    item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
    generation 20 type 2 (prealloc)
    prealloc data disk byte 1106776064 nr 835584
    prealloc data offset 49152 nr 598016
    item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
    generation 20 type 1 (regular)
    extent data disk byte 1106776064 nr 835584
    extent data offset 647168 nr 188416 ram 835584
    extent compression 0 (none)
    item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
    generation 20 type 1 (regular)
    extent data disk byte 1107611648 nr 245760
    extent data offset 0 nr 172032 ram 245760
    extent compression 0 (none)
    item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
    generation 20 type 2 (prealloc)
    prealloc data disk byte 1107611648 nr 245760
    prealloc data offset 172032 nr 73728

    Instead of the disk byte value of 1106776064, the value of 1107611648
    should have been logged. Also the data offset value should have been
    172032 and not 1007616.
    After a log replay we end up getting two extent items in the extent tree
    with different lengths, one of 835584, which is correct and existed
    before the log replay, and another one of 1032192 which is wrong and is
    based on the logged file extent item:

    item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
    refs 2 gen 15 flags DATA
    extent data backref root 5 objectid 271 offset 1200128 count 2
    item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
    refs 1 gen 22 flags DATA
    extent data backref root 5 objectid 271 offset 1200128 count 1

    Obviously this leads to many problems and a filesystem check reports many
    errors:

    (...)
    checking extents
    Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
    extent item 1106776064 has multiple extent items
    ref mismatch on [1106776064 835584] extent item 2, found 3
    Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
    Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
    Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
    Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
    backpointer mismatch on [1106776064 835584]
    checking free space cache
    block group 1103101952 has wrong amount of free space
    failed to load free space cache for block group 1103101952
    checking fs roots
    (...)

    So fix this by logging the prealloc extents beyond the inode's i_size
    based on searches in the subvolume tree instead of the extent maps.

    Fixes: 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
    CC: stable@vger.kernel.org # 4.14+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

03 Jul, 2018

1 commit

  • commit c5b4a50b74018b3677098151ec5f4fce07d5e6a0 upstream.

    If we failed during a rename exchange operation after starting/joining a
    transaction, we would end up replacing the return value, stored in the
    local 'ret' variable, with the return value from btrfs_end_transaction().
    So this could end up returning 0 (success) to user space despite the
    operation having failed and aborted the transaction, because if there are
    multiple tasks having a reference on the transaction at the time
    btrfs_end_transaction() is called by the rename exchange, that function
    returns 0 (otherwise it returns -EIO and not the original error value).
    So fix this by not overwriting the return value on error after getting
    a transaction handle.

    Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
    CC: stable@vger.kernel.org # 4.9+
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

26 Jun, 2018

4 commits

  • commit ac0b4145d662a3b9e34085dea460fb06ede9b69b upstream.

    [BUG]
    Btrfs can create a compressed extent without a checksum (even though it
    shouldn't), and if we then try to replace the device containing such an
    extent, the resulting device will contain the uncompressed data instead
    of the compressed data.

    Test case already submitted to fstests:
    https://patchwork.kernel.org/patch/10442353/

    [CAUSE]
    When handling a compressed extent without a checksum, device replace
    goes into the copy_nocow_pages() function.

    In that function, btrfs gets all inodes referring to this data extent
    and then uses find_or_create_page() to get pages directly from each
    inode.

    The problem is that pages taken directly from the inode are always
    uncompressed, so for a compressed data extent they do not match the
    on-disk data. This leads to a corrupted compressed data extent being
    written to the replacement device.

    [FIX]
    Remove the "optimization" branch and let the unified scrub_pages()
    handle this case.

    scrub_pages() does not reuse the page cache, so it will be a little
    slower, but it does the correct csum checking and avoids the data
    corruption caused by the "optimization".

    Note about the fix: this is the minimal fix that can be backported to
    older stable trees without conflicts. The whole callchain from
    copy_nocow_pages() can be deleted, and will be in followup patches.

    Fixes: ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk")
    CC: stable@vger.kernel.org # 4.4+
    Reported-by: James Harvey
    Reviewed-by: James Harvey
    Signed-off-by: Qu Wenruo
    [ remove code removal, add note why ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • commit 090a127afa8f73e9618d4058d6755f7ec7453dd6 upstream.

    In cow_file_range(), create_io_em() may fail, but its return value is
    not recorded. The function may then return 0 even though it failed,
    which is wrong.

    Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.
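
    The error-pointer convention involved can be sketched in user space as below. The helpers are simplified re-implementations written for this illustration (the kernel's own ERR_PTR/IS_ERR/PTR_ERR live in its headers), and create_em_sketch() and cow_range_sketch() are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO_SKETCH);
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the range reserved for error codes. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

struct io_em_sketch { int unused; };

/* Hypothetical creator: returns an error pointer when 'fail' is set. */
static inline struct io_em_sketch *create_em_sketch(struct io_em_sketch *storage,
						    int fail)
{
	return fail ? err_ptr(-ENOMEM) : storage;
}

/* Propagate the creator's error instead of falling through with 0. */
static inline int cow_range_sketch(int fail)
{
	struct io_em_sketch storage = { 0 };
	struct io_em_sketch *em = create_em_sketch(&storage, fail);

	if (is_err(em))
		return (int)ptr_err(em);
	return 0;
}
```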

    Fixes: 6f9994dbabe5 ("Btrfs: create a helper to create em for IO")
    CC: stable@vger.kernel.org # 4.11+
    Signed-off-by: Su Yue
    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Su Yue
     
  • commit fd4e994bd1f9dc9628e168a7f619bf69f6984635 upstream.

    If we have invalid flags set, when we error out we must drop our writer
    counter and free the buffer we allocated for the arguments. This bug is
    trivially reproduced with the following program on 4.7+:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    int main(int argc, char **argv)
    {
        /* Set every flag bit so that unsupported flags are present. */
        struct btrfs_ioctl_vol_args_v2 vol_args = {
            .flags = UINT64_MAX,
        };
        int ret;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s PATH\n", argv[0]);
            return EXIT_FAILURE;
        }

        fd = open(argv[1], O_WRONLY);
        if (fd == -1) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* The ioctl is expected to fail because of the invalid flags. */
        ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args);
        if (ret == -1)
            perror("ioctl");

        close(fd);
        return EXIT_SUCCESS;
    }

    When unmounting the filesystem, we'll hit the
    WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt(), and the leak may also
    prevent the filesystem from being remounted read-only, as the writer
    count stays elevated.
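
    The shape of the fix can be illustrated with a user-space sketch. Everything here is hypothetical: struct mnt_sketch models the writer counter with a plain integer, malloc() stands in for the argument buffer, and SUPPORTED_FLAGS_SKETCH is an invented mask. The point is only that the invalid-flags branch jumps to the shared cleanup instead of returning directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a mount with a writer counter. */
struct mnt_sketch {
	int writers;
};

#define SUPPORTED_FLAGS_SKETCH 0x1ULL /* invented mask for illustration */

static inline int rm_dev_sketch(struct mnt_sketch *mnt, uint64_t flags)
{
	void *args;
	int ret = 0;

	mnt->writers++;     /* stands in for taking write access */
	args = malloc(64);  /* stands in for the copied ioctl arguments */
	if (!args) {
		ret = -ENOMEM;
		goto out_drop_write;
	}

	if (flags & ~SUPPORTED_FLAGS_SKETCH) {
		ret = -EOPNOTSUPP;
		goto out_free; /* not a bare return: release what we took */
	}

	/* ... the actual device removal would happen here ... */

out_free:
	free(args);
out_drop_write:
	mnt->writers--;
	assert(mnt->writers >= 0);
	return ret;
}
```

    After a call with invalid flags the counter is back at its previous value, which is the property the WARN_ON checks at unmount time.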

    Fixes: 6b526ed70cf1 ("btrfs: introduce device delete by devid")
    CC: stable@vger.kernel.org # 4.9+
    Signed-off-by: Omar Sandoval
    Reviewed-by: Su Yue
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • commit b5c40d598f5408bd0ca22dfffa82f03cd9433f23 upstream.

    In btrfs_clone_files(), we must check the NODATASUM flag while the
    inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
    will change the flags after we check and we can end up with a partly
    checksummed file.

    The race window is only a few instructions in size, between the if and
    the locks, which is:

    if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
            return -EISDIR;

    where the setflags must be run and toggle the NODATASUM flag (provided
    the file size is 0). The clone will block on the inode lock, setflags
    takes the inode lock, changes flags, releases the lock and clone
    continues.

    Not impossible but still needs a lot of bad luck to hit unintentionally.

    Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through file clone")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Omar Sandoval
    Reviewed-by: Nikolay Borisov
    Reviewed-by: David Sterba
    [ update changelog ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     

21 Jun, 2018

3 commits

  • [ Upstream commit 8810f7517a3bc4ca2d41d022446d3f5fd6b77c09 ]

    There is a scenario in which the rebuild process fails to return good
    content: suppose that all disks can be read without problems but the
    content that was read doesn't match its checksum. Currently, for raid6,
    btrfs retries at most twice:

    - the 1st retry is to rebuild with all other stripes, it'll eventually
    be a raid5 xor rebuild,
    - if the 1st fails, the 2nd retry will deliberately fail parity p so
    that it will do raid6 style rebuild,

    However, chances are that the content of another non-parity stripe is
    also corrupted, so the above retries cannot return correct content, and
    users will perceive this as data loss. More seriously, if the loss
    affects some important internal btree roots, the filesystem could
    refuse to mount.

    This extends btrfs to do more retries and each retry fails only one
    stripe. Since raid6 can tolerate 2 disk failures, if there is one
    more failure besides the failure on which we're recovering, this can
    always work.

    The worst case is to retry as many times as the number of raid6 disks,
    but given the fact that such a scenario is really rare in practice,
    it's still acceptable.
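
    The retry strategy can be sketched as a loop that excludes one stripe per attempt. This is a hedged illustration only: rebuild_fn, rebuild_with_retries() and the example callback are hypothetical and do not reflect how the btrfs raid56 code is structured.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical callback: rebuild stripe 'target' while treating stripe
 * 'failed' as unreadable, and report whether the result matched its checksum.
 */
typedef bool (*rebuild_fn)(int target, int failed, void *ctx);

/*
 * Try each other stripe in turn as the additional failure. Returns the
 * index of the stripe whose exclusion produced good content, or -1.
 */
static inline int rebuild_with_retries(int nr_stripes, int target,
				       rebuild_fn try_rebuild, void *ctx)
{
	assert(nr_stripes > 0 && target >= 0 && target < nr_stripes);

	for (int failed = 0; failed < nr_stripes; failed++) {
		if (failed == target)
			continue;
		if (try_rebuild(target, failed, ctx))
			return failed;
	}
	return -1;
}

/* Example callback: succeeds only when the stripe stored in ctx is excluded. */
static inline bool rebuild_ok_if_excluded(int target, int failed, void *ctx)
{
	(void)target;
	return failed == *(const int *)ctx;
}
```

    The number of attempts is bounded by the number of stripes, which matches the worst case mentioned above.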

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 762221f095e3932669093466aaf4b85ed9ad2ac1 ]

    The raid6 corruption is as follows: suppose that all disks can be read
    without problems but the content that was read doesn't match its
    checksum. Currently, for raid6, btrfs retries at most twice:

    - the 1st retry is to rebuild with all other stripes, it'll eventually
    be a raid5 xor rebuild,
    - if the 1st fails, the 2nd retry will deliberately fail parity p so
    that it will do raid6 style rebuild,

    however, the chances are that another non-parity stripe content also
    has something corrupted, so that the above retries are not able to
    return correct content.

    We've fixed normal reads to rebuild raid6 correctly with more retries
    in the patch "Btrfs: make raid6 rebuild retry more"[1]; this patch makes
    scrub follow exactly the same rebuild process.

    [1]: https://patchwork.kernel.org/patch/10091755/

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • This reverts commit d91bb7c6988bd6450284c762b33f2e1ea3fe7c97.

    This commit used an incorrect log message.

    Signed-off-by: Sasha Levin
    Reported-by: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Sasha Levin
     

12 Jun, 2018

1 commit

  • commit e2731e55884f2138a252b0a3d7b24d57e49c3c59 upstream.

    btrfs-progs uses the super flag bit BTRFS_SUPER_FLAG_METADUMP_V2
    (1ULL << 34). So just define it in the kernel so that we know it's been
    used.
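
    A minimal sketch of the definition and a test of the bit is shown below. The macro name and value come from the message above; super_is_metadump_v2() is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTRFS_SUPER_FLAG_METADUMP_V2	(1ULL << 34)

/* The bit lies above bit 31, so the flags field must be 64 bits wide. */
static_assert(BTRFS_SUPER_FLAG_METADUMP_V2 > UINT32_MAX,
	      "flag does not fit in 32 bits");

/* Hypothetical helper: report whether the flag is set in a flags word. */
static inline bool super_is_metadump_v2(uint64_t super_flags)
{
	return (super_flags & BTRFS_SUPER_FLAG_METADUMP_V2) != 0;
}
```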

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     

30 May, 2018

13 commits

  • …created with quota enabled

    [ Upstream commit 4d31778aa2fa342f5f92ca4025b293a1729161d1 ]

    When multiple pending snapshots referring to the same source subvolume
    are executed, enabled quota will cause root item corruption, where root
    items are using old bytenr (no backref in extent tree).

    This can be triggered by fstests btrfs/152.

    The cause is that when the source subvolume is still dirty, the extra
    commit (simplified transaction commit) of qgroup_account_snapshot() can
    skip dirty roots not recorded in the current transaction, leaving the
    root item of the source subvolume not updated.

    Fix it by forcing recording source subvolume in current transaction
    before qgroup sub-transaction commit.

    Reported-by: Justin Maggard <jmaggard@netgear.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
    Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Qu Wenruo
     
  • [ Upstream commit 8a5a916d9a35e13576d79cc16e24611821b13e34 ]

    While running btrfs/011, I hit the following lockdep splat.

    This is the important bit:
    pcpu_alloc+0x1ac/0x5e0
    __percpu_counter_init+0x4e/0xb0
    btrfs_init_fs_root+0x99/0x1c0 [btrfs]
    btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
    resolve_indirect_refs+0x130/0x830 [btrfs]
    find_parent_nodes+0x69e/0xff0 [btrfs]
    btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
    btrfs_find_all_roots+0x50/0x70 [btrfs]
    btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
    btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]

    The percpu_counter_init call in btrfs_alloc_subvolume_writers
    uses GFP_KERNEL, which we can't do during transaction commit.

    This switches it to GFP_NOFS.

    ========================================================
    WARNING: possible irq lock inversion dependency detected
    4.12.14-kvmsmall #8 Tainted: G W
    --------------------------------------------------------
    kswapd0/50 just changed the state of lock:
    (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    but this lock took another, RECLAIM_FS-unsafe lock in the past:
    (pcpu_alloc_mutex){+.+.+.}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Chain exists of:
    &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex

    Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(pcpu_alloc_mutex);
                                   local_irq_disable();
                                   lock(&delayed_node->mutex);
                                   lock(&found->groups_sem);
      <Interrupt>
        lock(&delayed_node->mutex);

    *** DEADLOCK ***

    2 locks held by kswapd0/50:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x7f/0x5b0
    #1: (&type->s_umount_key#30){+++++.}, at: [] trylock_super+0x16/0x50

    the shortest dependencies between 2nd lock and 1st lock:
    -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
    HARDIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    kmem_cache_init_late+0x42/0x75
    start_kernel+0x343/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    SOFTIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    kmem_cache_init_late+0x42/0x75
    start_kernel+0x343/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    RECLAIM_FS-ON-W at:
    __kmalloc+0x47/0x310
    pcpu_extend_area_map+0x2b/0xc0
    pcpu_alloc+0x3ec/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    __do_tune_cpucache+0x2c/0x220
    do_tune_cpucache+0x26/0xc0
    enable_cpucache+0x6d/0xf0
    __kmem_cache_create+0x1bf/0x390
    create_cache+0xba/0x1b0
    kmem_cache_create+0x1f8/0x2b0
    ksm_init+0x6f/0x19d
    do_one_initcall+0x50/0x1b0
    kernel_init_freeable+0x201/0x289
    kernel_init+0xa/0x100
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    __mutex_lock+0x4e/0x8c0
    pcpu_alloc+0x1ac/0x5e0
    alloc_kmem_cache_cpus.isra.70+0x25/0xa0
    setup_cpu_cache+0x2f/0x1f0
    __kmem_cache_create+0x1bf/0x390
    create_boot_cache+0x8b/0xb1
    kmem_cache_init+0xa1/0x19e
    start_kernel+0x270/0x4cb
    x86_64_start_kernel+0x127/0x134
    secondary_startup_64+0xa5/0xb0
    }
    ... key at: [] pcpu_alloc_mutex+0x70/0xa0
    ... acquired at:
    pcpu_alloc+0x1ac/0x5e0
    __percpu_counter_init+0x4e/0xb0
    btrfs_init_fs_root+0x99/0x1c0 [btrfs]
    btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
    resolve_indirect_refs+0x130/0x830 [btrfs]
    find_parent_nodes+0x69e/0xff0 [btrfs]
    btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
    btrfs_find_all_roots+0x50/0x70 [btrfs]
    btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
    btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
    transaction_kthread+0x176/0x1b0 [btrfs]
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50

    -> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
    HARDIRQ-ON-W at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    HARDIRQ-ON-R at:
    down_read+0x35/0x90
    caching_thread+0x57/0x560 [btrfs]
    normal_work_helper+0x1c0/0x5e0 [btrfs]
    process_one_work+0x1e0/0x5c0
    worker_thread+0x44/0x390
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    SOFTIRQ-ON-W at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-R at:
    down_read+0x35/0x90
    caching_thread+0x57/0x560 [btrfs]
    normal_work_helper+0x1c0/0x5e0 [btrfs]
    process_one_work+0x1e0/0x5c0
    worker_thread+0x44/0x390
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    down_write+0x3e/0xa0
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    cow_file_range.isra.66+0x133/0x470 [btrfs]
    run_delalloc_range+0x121/0x410 [btrfs]
    writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
    __extent_writepage+0x19a/0x360 [btrfs]
    extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
    extent_writepages+0x4d/0x60 [btrfs]
    do_writepages+0x1a/0x70
    __filemap_fdatawrite_range+0xa7/0xe0
    btrfs_rename+0x5ee/0xdb0 [btrfs]
    vfs_rename+0x52a/0x7e0
    SyS_rename+0x351/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
    ... acquired at:
    cache_block_group+0x287/0x420 [btrfs]
    find_free_extent+0x106c/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
    btrfs_create_tree+0xbb/0x2a0 [btrfs]
    btrfs_create_uuid_tree+0x37/0x140 [btrfs]
    open_ctree+0x23c0/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    -> (&found->groups_sem){++++..} ops: 2134587 {
    HARDIRQ-ON-W at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    HARDIRQ-ON-R at:
    down_read+0x35/0x90
    btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
    open_ctree+0x207b/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-W at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-R at:
    down_read+0x35/0x90
    btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
    open_ctree+0x207b/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    INITIAL USE at:
    down_write+0x3e/0xa0
    __link_block_group+0x34/0x130 [btrfs]
    btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
    open_ctree+0x2054/0x2660 [btrfs]
    btrfs_mount+0xd36/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    btrfs_mount+0x18c/0xf90 [btrfs]
    mount_fs+0x3a/0x160
    vfs_kern_mount+0x66/0x150
    do_mount+0x1c1/0xcc0
    SyS_mount+0x7e/0xd0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
    ... acquired at:
    find_free_extent+0xcb4/0x12d0 [btrfs]
    btrfs_reserve_extent+0xd8/0x170 [btrfs]
    btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
    __btrfs_cow_block+0x110/0x5b0 [btrfs]
    btrfs_cow_block+0xd7/0x290 [btrfs]
    btrfs_search_slot+0x1f6/0x960 [btrfs]
    btrfs_lookup_inode+0x2a/0x90 [btrfs]
    __btrfs_update_delayed_inode+0x65/0x210 [btrfs]
    btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
    btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
    evict+0xc4/0x190
    __dentry_kill+0xbf/0x170
    dput+0x2ae/0x2f0
    SyS_rename+0x2a6/0x3b0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    -> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
    HARDIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    SOFTIRQ-ON-W at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    IN-RECLAIM_FS-W at:
    __mutex_lock+0x4e/0x8c0
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50
    INITIAL USE at:
    __mutex_lock+0x4e/0x8c0
    btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
    btrfs_update_inode+0x83/0x110 [btrfs]
    btrfs_dirty_inode+0x62/0xe0 [btrfs]
    touch_atime+0x8c/0xb0
    do_generic_file_read+0x818/0xb10
    __vfs_read+0xdc/0x150
    vfs_read+0x8a/0x130
    SyS_read+0x45/0xa0
    do_syscall_64+0x79/0x1e0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7
    }
    ... key at: [] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
    ... acquired at:
    __lock_acquire+0x264/0x11c0
    lock_acquire+0xbd/0x1e0
    __mutex_lock+0x4e/0x8c0
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ret_from_fork+0x3a/0x50

    stack backtrace:
    CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased)
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
    dump_stack+0x78/0xb7
    print_irq_inversion_bug.part.38+0x19f/0x1aa
    check_usage_forwards+0x102/0x120
    ? ret_from_fork+0x3a/0x50
    ? check_usage_backwards+0x110/0x110
    mark_lock+0x16c/0x270
    __lock_acquire+0x264/0x11c0
    ? pagevec_lookup_entries+0x1a/0x30
    ? truncate_inode_pages_range+0x2b3/0x7f0
    lock_acquire+0xbd/0x1e0
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    __mutex_lock+0x4e/0x8c0
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
    __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
    btrfs_evict_inode+0x22c/0x6a0 [btrfs]
    evict+0xc4/0x190
    dispose_list+0x35/0x50
    prune_icache_sb+0x42/0x50
    super_cache_scan+0x139/0x190
    shrink_slab+0x262/0x5b0
    shrink_node+0x2eb/0x2f0
    kswapd+0x2eb/0x890
    kthread+0x102/0x140
    ? mem_cgroup_shrink_node+0x2c0/0x2c0
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x3a/0x50

    Signed-off-by: Jeff Mahoney
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • [ Upstream commit 8434ec46c6e3232cebc25a910363b29f5c617820 ]

    When logging an inode, at tree-log.c:copy_items(), if we call
    btrfs_next_leaf() at the loop which checks for the need to log holes, we
    need to make sure copy_items() returns the value 1 to its caller and
    not 0 (on success). This is because the path the caller passed was
    released and is now different from what it was before, and the caller
    expects a return value of 0 to mean both success and that the path
    has not changed, while a return value of 1 means both success and
    signals the caller that it can not reuse the path, it has to perform
    another tree search.
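
    The return-value contract described above can be sketched in a small
    userspace model. This is only an illustration: path_model,
    copy_items_model() and log_one_batch() are hypothetical names, not the
    kernel's structures or functions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a btree path; not the kernel's struct btrfs_path. */
struct path_model {
    int leaf;   /* leaf the path currently points at */
    int slot;   /* slot inside that leaf */
};

/*
 * Model of the callee: returns 0 if the path is untouched, 1 if it had to
 * move to the next leaf (so the caller's path is stale), <0 on error.
 */
static int copy_items_model(struct path_model *path, bool needs_next_leaf)
{
    if (needs_next_leaf) {
        path->leaf += 1;   /* path was released and now points elsewhere */
        path->slot = 0;
        return 1;          /* success, but the path changed */
    }
    return 0;              /* success, path unchanged */
}

/* Model of the caller: counts how many fresh tree searches it performs. */
static int log_one_batch(struct path_model *path, bool needs_next_leaf,
                         int *searches)
{
    int ret = copy_items_model(path, needs_next_leaf);

    if (ret < 0)
        return ret;
    if (ret > 0)
        (*searches)++;     /* path is stale: search the tree again */
    return 0;
}
```

    If the callee returned 0 after moving the path, the caller in this model
    would keep using a leaf and slot that no longer match what it expects.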

    Even though this case should not be triggered under normal
    circumstances, or only very rarely, its consequences can be very
    unpredictable (especially when replaying a log tree).

    Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a ]

    The extent tree of the test fs is like the following:

    BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
    item 0 key (4096 168 4096) itemoff 3944 itemsize 51
    extent refs 1 gen 1 flags 2
    tree block key (68719476736 0 0) level 1
                                     ^^^^^^^
    ref#0: tree block backref root 5

    And it's using an empty tree for fs tree, so there is no way that its
    level can be 1.

    For REAL (created by mkfs) fs tree backref with no skinny metadata, the
    result should look like:

    item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
    refs 1 gen 4 flags TREE_BLOCK
    tree block key (256 INODE_ITEM 0) level 0
                                      ^^^^^^^
    tree block backref root 5

    Set the level to 0 so that it won't trip the tree level checker later.

    Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • [ Upstream commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 ]

    do_chunk_alloc implements a loop checking whether there is a pending
    chunk allocation and if so causes the caller to loop. Generally this
    loop is executed only once, however testing with btrfs/072 on a
    single-core VM uncovered an extreme case where the system could loop
    indefinitely. This is due to a missing cond_resched in the loop, which
    doesn't give the previous chunk allocator a chance to finish its job.

    The fix is to simply add the missing cond_resched.
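
    A single-threaded userspace model of why the yield matters might look
    like the following. alloc_state, yield_model() and
    wait_for_pending_alloc() are assumptions made for this sketch;
    yield_model() merely stands in for cond_resched() by letting the other
    allocator make one unit of progress.

```c
#include <stdbool.h>

/* Hypothetical state: true while another task is still allocating a chunk. */
struct alloc_state {
    bool chunk_alloc_pending;
    int  work_left;   /* units of work the other allocator still has to do */
};

/*
 * Stand-in for cond_resched(): in this single-threaded model, "yielding"
 * simply lets the other allocator make one unit of progress.
 */
static void yield_model(struct alloc_state *st)
{
    if (st->work_left > 0 && --st->work_left == 0)
        st->chunk_alloc_pending = false;
}

/* Returns the number of loop iterations spent waiting. */
static int wait_for_pending_alloc(struct alloc_state *st, int max_spins)
{
    int spins = 0;

    while (st->chunk_alloc_pending && spins < max_spins) {
        spins++;
        yield_model(st);   /* without this call the flag can never clear */
    }
    return spins;
}
```

    The max_spins bound exists only to keep the model from hanging; the
    kernel loop has no such bound, which is why the missing yield could spin
    forever on a single core.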

    Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
     
  • [ Upstream commit 80c0b4210a963e31529e15bf90519708ec947596 ]

    0, 1 and <0 can be returned, and when <0 is returned nodes[0] could be
    NULL; log_dir_items lacks such a check for <0, which can lead to a NULL
    pointer dereference.

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 ]

    If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
    to bail out, otherwise @ret would be forced to be 0 after 'break;' and
    the caller won't be aware of it.
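
    The control-flow problem can be reproduced in a few lines of plain C.
    This is a sketch of the pattern only: next_leaf_model(), walk_buggy()
    and walk_fixed() are hypothetical and do not mirror the actual body of
    replay_dir_deletes.

```c
/*
 * Hypothetical model of a leaf walk; next_leaf_model() mimics a helper that
 * returns 0 (moved on), 1 (no more leaves) or <0 (error).
 */
static int next_leaf_model(const int *results, int *pos)
{
    return results[(*pos)++];
}

/* Buggy shape: every non-zero return breaks out and the error is lost. */
static int walk_buggy(const int *results)
{
    int pos = 0, ret;

    while (1) {
        ret = next_leaf_model(results, &pos);
        if (ret)
            break;
    }
    ret = 0;            /* overwrites a possible negative error */
    return ret;
}

/* Fixed shape: a negative return skips the "ret = 0" reset. */
static int walk_fixed(const int *results)
{
    int pos = 0, ret;

    while (1) {
        ret = next_leaf_model(results, &pos);
        if (ret == 1)
            break;      /* normal end of iteration */
        else if (ret < 0)
            goto out;   /* bail out, keep the error */
    }
    ret = 0;
out:
    return ret;
}
```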

    Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 471d557afed155b85da237ec46c549f443eeb5de ]

    Currently if we allocate extents beyond an inode's i_size (through the
    fallocate system call) and then fsync the file, we log the extents but
    after a power failure we replay them and then immediately drop them.
    This behaviour has existed since about 2009, commit c71bf099abdd ("Btrfs:
    Avoid orphan inodes cleanup while replaying log"), because it marks
    the inode as an orphan instead of dropping any extents beyond i_size
    before replaying logged extents, so after the log replay, and while
    the mount operation is still ongoing, we find the inode marked as an
    orphan and then perform a truncation (drop extents beyond the inode's
    i_size). Because the processing of orphan inodes is still done
    right after replaying the log and before the mount operation finishes,
    the intention of that commit does not make any sense (at least as
    of today). However reverting that behaviour is not enough, because
    we can not simply discard all extents beyond i_size and then replay
    logged extents, because we risk dropping extents beyond i_size created
    in past transactions, for example:

    add prealloc extent beyond i_size
    fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
    transaction commit
    add another prealloc extent beyond i_size
    fsync - triggers the fast fsync path
    power failure

    In that scenario, we would drop the first extent and then replay the
    second one. To fix this just make sure that all prealloc extents
    beyond i_size are logged, and if we find too many (which is far from
    a common case), fall back to a full transaction commit (like we do when
    logging regular extents in the fast fsync path).
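
    The scenario can be modelled in userspace to see which extents survive
    a replay. Everything here is an assumption made for illustration:
    file_model, add_extent() and replay_model() are hypothetical, extents
    are reduced to their start offsets, and the replay rule is the naive
    "drop everything beyond i_size, then replay the log" approach discussed
    above.

```c
#include <stddef.h>

#define MAX_EXTENTS 8

/* Hypothetical model of a file: i_size plus the start offsets of its extents. */
struct file_model {
    unsigned long i_size;
    unsigned long extent_start[MAX_EXTENTS];
    size_t nr_extents;
};

static void add_extent(struct file_model *f, unsigned long start)
{
    if (f->nr_extents < MAX_EXTENTS)
        f->extent_start[f->nr_extents++] = start;
}

/*
 * Naive replay: drop every extent starting at or beyond i_size, then
 * re-add the extents found in the log. Returns the resulting extent count.
 */
static size_t replay_model(struct file_model *f,
                           const unsigned long *logged, size_t nr_logged)
{
    size_t i, kept = 0;

    for (i = 0; i < f->nr_extents; i++)
        if (f->extent_start[i] < f->i_size)
            f->extent_start[kept++] = f->extent_start[i];
    f->nr_extents = kept;

    for (i = 0; i < nr_logged; i++)
        add_extent(f, logged[i]);
    return f->nr_extents;
}
```

    With a log that holds only the second prealloc extent, the first one
    is gone after replay; with a log that holds every prealloc extent
    beyond i_size, both survive.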

    Trivial reproducer:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt
    $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
    $ sync
    $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
    $ xfs_io -c "fsync" /mnt/foo
    <power failure>

    # mount to replay log
    $ mount /dev/sdb /mnt
    # at this point the file only has one extent, at offset 0, size 256K

    A test case for fstests follows soon, covering multiple scenarios that
    involve adding prealloc extents with previous shrinking truncates and
    without such truncates.

    Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit af7227338135d2f1b1552bf9a6d43e02dcba10b9 ]

    Currently, if a fatal error occurs, such as all I/O returning -EIO, resources
    would be cleaned up when
    a) transaction is being committed or
    b) BTRFS_FS_STATE_ERROR is set

    However, in some rare cases, resources may be left alone after transaction
    gets aborted and umount may run into some ASSERT(), e.g.
    ASSERT(list_empty(&block_group->dirty_list));

    For case a), in btrfs_commit_transaction(), there are several places at the
    beginning where we just call btrfs_end_transaction() without cleaning up
    resources. For case b), it is possible that the trans handle doesn't have
    any dirty stuff, then only the trans handle is marked as aborted while
    BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.

    This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
    all resources won't stay in memory after umount.
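
    As a rough sketch, the difference between the old and new condition
    comes down to testing one more state bit. The bit numbers and helper
    names below are placeholders for this example, not the kernel's actual
    values, and the sketch does not show where the kernel performs the
    check.

```c
#include <stdbool.h>

/* Placeholder bit numbers for this sketch, not the kernel's definitions. */
enum {
    FS_STATE_ERROR_BIT_MODEL = 0,
    FS_STATE_TRANS_ABORTED_BIT_MODEL = 1,
};

static bool test_bit_model(int bit, unsigned long flags)
{
    return (flags >> bit) & 1UL;
}

/* Old condition: cleanup is triggered only by the error bit. */
static bool needs_cleanup_old(unsigned long fs_state)
{
    return test_bit_model(FS_STATE_ERROR_BIT_MODEL, fs_state);
}

/* New condition: an aborted transaction triggers cleanup as well. */
static bool needs_cleanup_new(unsigned long fs_state)
{
    return test_bit_model(FS_STATE_ERROR_BIT_MODEL, fs_state) ||
           test_bit_model(FS_STATE_TRANS_ABORTED_BIT_MODEL, fs_state);
}
```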

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 9a6509c4daa91400b52a5fd541a5521c649a8fea ]

    If in the same transaction we rename a special file (fifo, character/block
    device or symbolic link), create a hard link for it having its old name
    then sync the log, we will end up with a log that can not be replayed and
    when attempting to replay it, an EEXIST error is returned and mounting
    the filesystem fails. Example scenario:

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt
    $ mkdir /mnt/testdir
    $ mkfifo /mnt/testdir/foo
    # Make sure everything done so far is durably persisted.
    $ sync

    # Create some unrelated file and fsync it, this is just to create a log
    # tree. The file must be in the same directory as our special file.
    $ touch /mnt/testdir/f1
    $ xfs_io -c "fsync" /mnt/testdir/f1

    # Rename our special file and then create a hard link with its old name.
    $ mv /mnt/testdir/foo /mnt/testdir/bar
    $ ln /mnt/testdir/bar /mnt/testdir/foo

    # Create some other unrelated file and fsync it, this is just to persist
    # the log tree which was modified by the previous rename and link
    # operations. Alternatively we could have modified file f1 and fsync it.
    $ touch /mnt/f2
    $ xfs_io -c "fsync" /mnt/f2

    <power failure>

    $ mount /dev/sdc /mnt
    mount: mount /dev/sdc on /mnt failed: File exists

    This happens because when both the log tree and the subvolume's tree have
    an entry in the directory "testdir" with the same name, that is, there
    is one key (258 INODE_REF 257) in the subvolume tree and another one in
    the log tree (where 258 is the inode number of our special file and 257
    is the inode for directory "testdir"). Only the data of those two keys
    differs, in the subvolume tree the index field for inode reference has
    a value of 3 while in the log tree it has a value of 5. Because the same key
    exists in both trees, but with a different index, the log replay fails with
    an -EEXIST error when attempting to replay the inode reference from the
    log tree.

    Fix this by setting the last_unlink_trans field of the inode (our special
    file) to the current transaction id when a hard link is created, as this
    forces logging the parent directory inode, solving the conflict at log
    replay time.
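
    A heavily simplified model of this fix is shown below. It assumes that
    the parent directory gets logged whenever last_unlink_trans is newer
    than the last committed transaction; inode_model, link_model() and
    must_log_parent() are hypothetical names, and the real logging decision
    in the kernel involves more conditions than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced inode; only the field named in the text. */
struct inode_model {
    uint64_t last_unlink_trans;
};

/* Model of link creation after the fix: remember the current transaction. */
static void link_model(struct inode_model *inode, uint64_t cur_transid)
{
    inode->last_unlink_trans = cur_transid;
}

/*
 * Assumed decision rule: the parent directory is logged too when the inode
 * saw a link/unlink in a transaction that is not yet committed.
 */
static bool must_log_parent(const struct inode_model *inode,
                            uint64_t last_committed_transid)
{
    return inode->last_unlink_trans > last_committed_transid;
}
```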

    A new generic test case for fstests was also submitted.

    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit d4dfc0f4d39475ccbbac947880b5464a74c30b99 ]

    When doing an incremental send of a filesystem with the no-holes feature
    enabled, we end up issuing a write operation when using the no data mode
    send flag, instead of issuing an update extent operation. Fix this by
    issuing the update extent operation instead.

    Trivial reproducer:

    $ mkfs.btrfs -f -O no-holes /dev/sdc
    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdc /mnt/sdc
    $ mount /dev/sdd /mnt/sdd

    $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
    $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1

    $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
    $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2

    $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
    $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
    | btrfs receive -vv /mnt/sdd

    Before this change the output of the second receive command is:

    receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
    utimes
    write foobar, offset 8192, len 8192
    utimes foobar
    BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...

    After this change it is:

    receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
    utimes
    update_extent foobar: offset=8192, len=8192
    utimes foobar
    BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
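
    The intended behaviour can be sketched as a single branch on the send
    flags. SEND_FLAG_NO_FILE_DATA_MODEL and describe_extent_cmd() are
    hypothetical names for this example; the function only formats lines
    shaped like the receive output quoted above and does not reflect how
    the send code is structured.

```c
#include <stdio.h>

/* Hypothetical flag value for this sketch (models the no data mode flag). */
#define SEND_FLAG_NO_FILE_DATA_MODEL 0x1u

/*
 * Formats the command a receiver would report for one extent: an
 * update_extent when no file data is sent, a write otherwise.
 */
static int describe_extent_cmd(char *buf, size_t len, unsigned int flags,
                               const char *path,
                               unsigned long long offset,
                               unsigned long long length)
{
    if (flags & SEND_FLAG_NO_FILE_DATA_MODEL)
        return snprintf(buf, len, "update_extent %s: offset=%llu, len=%llu",
                        path, offset, length);
    return snprintf(buf, len, "write %s, offset %llu, len %llu",
                    path, offset, length);
}
```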

    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • [ Upstream commit a8fd1f71749387c9a1053a83ff1c16287499a4e7 ]

    The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
    kernels built with NR_CPUS=8192, this can result in kmalloc failures
    that prevent mounting.

    There is work in progress to try to resolve this for every user of
    srcu_struct but using kvzalloc will work around the failures until
    that is complete.

    As an example with NR_CPUS=512 on x86_64: the overall size of
    subvol_srcu is 3460 bytes, fs_info is 6496.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream.

    For anything NFS-exported we do _not_ want to unlock new inode
    before it has grown an alias; original set of fixes got the
    ordering right, but missed the nasty complication in case of
    lockdep being enabled - unlock_new_inode() does
    lockdep_annotate_inode_mutex_key(inode)
    which can only be done before anyone gets a chance to touch
    ->i_mutex. Unfortunately, flipping the order and doing
    unlock_new_inode() before d_instantiate() opens a window when
    mkdir can race with open-by-fhandle on a guessed fhandle, leading
    to multiple aliases for a directory inode and all the breakage
    that follows from that.

    Correct solution: a new primitive (d_instantiate_new())
    combining these two in the right order - lockdep annotate, then
    d_instantiate(), then the rest of unlock_new_inode(). All
    combinations of d_instantiate() with unlock_new_inode() should
    be converted to that.

    Cc: stable@kernel.org # 2.6.29 and later
    Tested-by: Mike Marshall
    Reviewed-by: Andreas Dilger
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro