05 Nov, 2020

1 commit


07 Oct, 2020

1 commit


27 Aug, 2020

1 commit

  • I got the following lockdep splat while testing:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.8.0-rc7-00172-g021118712e59 #932 Not tainted
    ------------------------------------------------------
    btrfs/229626 is trying to acquire lock:
    ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450

    but task is already holding lock:
    ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    btrfs_scrub_dev+0x11c/0x630
    btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
    btrfs_ioctl+0x2799/0x30a0
    ksys_ioctl+0x83/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    btrfs_run_dev_stats+0x49/0x480
    commit_cowonly_roots+0xb5/0x2a0
    btrfs_commit_transaction+0x516/0xa60
    sync_filesystem+0x6b/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0xe/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x29/0x60
    cleanup_mnt+0xb8/0x140
    task_work_run+0x6d/0xb0
    __prepare_exit_to_usermode+0x1cc/0x1e0
    do_syscall_64+0x5c/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    btrfs_commit_transaction+0x4bb/0xa60
    sync_filesystem+0x6b/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0xe/0x30
    btrfs_kill_super+0x12/0x20
    deactivate_locked_super+0x29/0x60
    cleanup_mnt+0xb8/0x140
    task_work_run+0x6d/0xb0
    __prepare_exit_to_usermode+0x1cc/0x1e0
    do_syscall_64+0x5c/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    btrfs_record_root_in_trans+0x43/0x70
    start_transaction+0xd1/0x5d0
    btrfs_dirty_inode+0x42/0xd0
    touch_atime+0xa1/0xd0
    btrfs_file_mmap+0x3f/0x60
    mmap_region+0x3a4/0x640
    do_mmap+0x376/0x580
    vm_mmap_pgoff+0xd5/0x120
    ksys_mmap_pgoff+0x193/0x230
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
    __might_fault+0x68/0x90
    _copy_to_user+0x1e/0x80
    perf_read+0x141/0x2c0
    vfs_read+0xad/0x1b0
    ksys_read+0x5f/0xe0
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    perf_event_init_cpu+0x88/0x150
    perf_event_init+0x1db/0x20b
    start_kernel+0x3ae/0x53c
    secondary_startup_64+0xa4/0xb0

    -> #1 (pmus_lock){+.+.}-{3:3}:
    __mutex_lock+0x9f/0x930
    perf_event_init_cpu+0x4f/0x150
    cpuhp_invoke_callback+0xb1/0x900
    _cpu_up.constprop.26+0x9f/0x130
    cpu_up+0x7b/0xc0
    bringup_nonboot_cpus+0x4f/0x60
    smp_init+0x26/0x71
    kernel_init_freeable+0x110/0x258
    kernel_init+0xa/0x103
    ret_from_fork+0x1f/0x30

    -> #0 (cpu_hotplug_lock){++++}-{0:0}:
    __lock_acquire+0x1272/0x2310
    lock_acquire+0x9e/0x360
    cpus_read_lock+0x39/0xb0
    alloc_workqueue+0x378/0x450
    __btrfs_alloc_workqueue+0x15d/0x200
    btrfs_alloc_workqueue+0x51/0x160
    scrub_workers_get+0x5a/0x170
    btrfs_scrub_dev+0x18c/0x630
    btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
    btrfs_ioctl+0x2799/0x30a0
    ksys_ioctl+0x83/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    other info that might help us debug this:

    Chain exists of:
    cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock

    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&fs_info->scrub_lock);
    lock(&fs_devs->device_list_mutex);
    lock(&fs_info->scrub_lock);
    lock(cpu_hotplug_lock);

    *** DEADLOCK ***

    2 locks held by btrfs/229626:
    #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
    #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

    stack backtrace:
    CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
    Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
    Call Trace:
    dump_stack+0x78/0xa0
    check_noncircular+0x165/0x180
    __lock_acquire+0x1272/0x2310
    lock_acquire+0x9e/0x360
    ? alloc_workqueue+0x378/0x450
    cpus_read_lock+0x39/0xb0
    ? alloc_workqueue+0x378/0x450
    alloc_workqueue+0x378/0x450
    ? rcu_read_lock_sched_held+0x52/0x80
    __btrfs_alloc_workqueue+0x15d/0x200
    btrfs_alloc_workqueue+0x51/0x160
    scrub_workers_get+0x5a/0x170
    btrfs_scrub_dev+0x18c/0x630
    ? start_transaction+0xd1/0x5d0
    btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
    btrfs_ioctl+0x2799/0x30a0
    ? do_sigaction+0x102/0x250
    ? lockdep_hardirqs_on_prepare+0xca/0x160
    ? _raw_spin_unlock_irq+0x24/0x30
    ? trace_hardirqs_on+0x1c/0xe0
    ? _raw_spin_unlock_irq+0x24/0x30
    ? do_sigaction+0x102/0x250
    ? ksys_ioctl+0x83/0xc0
    ksys_ioctl+0x83/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x50/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This happens because we're allocating the scrub workqueues under the
    scrub and device list mutex, which brings in a whole host of other
    dependencies.

    Because the work queue allocation is done with GFP_KERNEL, it can
    trigger reclaim, which can lead to a transaction commit, which in turns
    needs the device_list_mutex, it can lead to a deadlock. A different
    problem for which this fix is a solution.

    Fix this by moving the actual allocation outside of the
    scrub lock, and then only take the lock once we're ready to actually
    assign them to the fs_info. We'll now have to cleanup the workqueues in
    a few more places, so I've added a helper to do the refcount dance to
    safely free the workqueues.

    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

27 Jul, 2020

10 commits

  • Eric reported seeing this message while running generic/475

    BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted

    Full stack trace:

    BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
    BTRFS info (device dm-0): forced readonly
    BTRFS warning (device dm-0): Skipping commit of aborted transaction.
    ------------[ cut here ]------------
    BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
    BTRFS: Transaction aborted (error -117)
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
    WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
    BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
    CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
    RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
    RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
    RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
    RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
    R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
    FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
    Call Trace:
    ? finish_wait+0x90/0x90
    ? __mutex_unlock_slowpath+0x45/0x2a0
    ? lock_acquire+0xa3/0x440
    ? lockref_put_or_lock+0x9/0x30
    ? dput+0x20/0x4a0
    ? dput+0x20/0x4a0
    ? do_raw_spin_unlock+0x4b/0xc0
    ? _raw_spin_unlock+0x1f/0x30
    btrfs_sync_file+0x335/0x490 [btrfs]
    do_fsync+0x38/0x70
    __x64_sys_fsync+0x10/0x20
    do_syscall_64+0x50/0xe0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f6a6ef1b6e3
    Code: Bad RIP value.
    RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
    RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
    RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x67b/0x1b00
    softirqs last enabled at (0): [] copy_process+0x67b/0x1b00
    softirqs last disabled at (0): [] 0x0
    ---[ end trace af146e0e38433456 ]---
    BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted

    This ret came from btrfs_write_marked_extents(). If we get an aborted
    transaction via EIO before, we'll see it in btree_write_cache_pages()
    and return EUCLEAN, which gets printed as "Filesystem corrupted".

    Except we shouldn't be returning EUCLEAN here, we need to be returning
    EROFS because EUCLEAN is reserved for actual corruption, not IO errors.

    We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
    elsewhere, but we want to use EROFS for this particular case. The
    original transaction abort has the real error code for why we ended up
    with an aborted transaction, all subsequent actions just need to return
    EROFS because they may not have a trans handle and have no idea about
    the original cause of the abort.

    After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
    stacktrace will not be dumped either.

    Reported-by: Eric Sandeen
    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ add full test stacktrace ]
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Add proper variable for the scrub page and use it instead of repeatedly
    dereferencing the other structures.

    Signed-off-by: David Sterba

    David Sterba
     
  • Use a simpler iteration over tree block pages, same what csum_tree_block
    does: first page always exists, loop over the rest.

    Signed-off-by: David Sterba

    David Sterba
     
  • Add proper variable for the scrub page and use it instead of repeatedly
    dereferencing the other structures.

    Signed-off-by: David Sterba

    David Sterba
     
  • We have sectorsize same as PAGE_SIZE, the checksum can be calculated in
    one go.

    Signed-off-by: David Sterba

    David Sterba
     
  • Add proper variable for the scrub page and use it instead of repeatedly
    dereferencing the other structures.

    Signed-off-by: David Sterba

    David Sterba
     
  • The page contents with the checksum is available during the entire
    function so we don't need to make a copy.

    Signed-off-by: David Sterba

    David Sterba
     
  • BTRFS_SUPER_INFO_SIZE is 4096, and fits to a page on all supported
    architectures, so we can calculate the checksum in one go.

    Signed-off-by: David Sterba

    David Sterba
     
  • As the page mapping has been removed, rename the variables to 'kaddr'
    that we use everywhere else. The type is changed to 'char *' so pointer
    arithmetic works without casts.

    Signed-off-by: David Sterba

    David Sterba
     
  • All pages that scrub uses in the scrub_block::pagev array are allocated
    with GFP_KERNEL and never part of any mapping, so kmap is not necessary,
    we only need to know the page address.

    In scrub_write_page_to_dev_replace we don't even need to call
    flush_dcache_page because of the same reason as above.

    Signed-off-by: David Sterba

    David Sterba
     

25 May, 2020

4 commits

  • The main function to lookup a root by its id btrfs_get_fs_root takes the
    whole key, while only using the objectid. The value of offset is preset
    to (u64)-1 but not actually used until btrfs_find_root that does the
    actual search.

    Switch btrfs_get_fs_root to use only objectid and remove all local
    variables that existed just for the lookup. The actual key for search is
    set up in btrfs_get_fs_root, reusing another key variable.

    Signed-off-by: David Sterba

    David Sterba
     
  • When scrubbing a stripe, whenever we find an extent we lookup for its
    checksums in the checksum tree. However we do it even for metadata extents
    which don't have checksum items stored in the checksum tree, that is
    only for data extents.

    So make the lookup for checksums only if we are processing with a data
    extent.

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • Back in 2014, commit 04216820fe83d5 ("Btrfs: fix race between fs trimming
    and block group remove/allocation"), I added the 'trimming' member to the
    block group structure. Its purpose was to prevent races between trimming
    and block group deletion/allocation by pinning the block group in a way
    that prevents its logical address and device extents from being reused
    while trimming is in progress for a block group, so that if another task
    deletes the block group and then another task allocates a new block group
    that gets the same logical address and device extents while the trimming
    task is still in progress.

    After the previous fix for scrub (patch "btrfs: fix a race between scrub
    and block group removal/allocation"), scrub now also has the same needs that
    trimming has, so the member name 'trimming' no longer makes sense.
    Since there is already a 'pinned' member in the block group that refers
    to space reservations (pinned bytes), rename the member to 'frozen',
    add a comment on top of it to describe its general purpose and rename
    the helpers to increment and decrement the counter as well, to match
    the new member name.

    The next patch in the series will move the helpers into a more suitable
    file (from free-space-cache.c to block-group.c).

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     
  • When scrub is verifying the extents of a block group for a device, it is
    possible that the corresponding block group gets removed and its logical
    address and device extents get used for a new block group allocation.
    When this happens scrub incorrectly reports that errors were detected
    and, if the the new block group has a different profile then the old one,
    deleted block group, we can crash due to a null pointer dereference.
    Possibly other unexpected and weird consequences can happen as well.

    Consider the following sequence of actions that leads to the null pointer
    dereference crash when scrub is running in parallel with balance:

    1) Balance sets block group X to read-only mode and starts relocating it.
    Block group X is a metadata block group, has a raid1 profile (two
    device extents, each one in a different device) and a logical address
    of 19424870400;

    2) Scrub is running and finds device extent E, which belongs to block
    group X. It enters scrub_stripe() to find all extents allocated to
    block group X, the search is done using the extent tree;

    3) Balance finishes relocating block group X and removes block group X;

    4) Balance starts relocating another block group and when trying to
    commit the current transaction as part of the preparation step
    (prepare_to_relocate()), it blocks because scrub is running;

    5) The scrub task finds the metadata extent at the logical address
    19425001472 and marks the pages of the extent to be read by a bio
    (struct scrub_bio). The extent item's flags, which have the bit
    BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
    scrub_page). It is these flags in the scrub pages that tells the
    bio's end io function (scrub_bio_end_io_worker) which type of extent
    it is dealing with. At this point we end up with 4 pages in a bio
    which is ready for submission (the metadata extent has a size of
    16Kb, so that gives 4 pages on x86);

    6) At the next iteration of scrub_stripe(), scrub checks that there is a
    pause request from the relocation task trying to commit a transaction,
    therefore it submits the pending bio and pauses, waiting for the
    transaction commit to complete before resuming;

    7) The relocation task commits the transaction. The device extent E, that
    was used by our block group X, is now available for allocation, since
    the commit root for the device tree was swapped by the transaction
    commit;

    8) Another task doing a direct IO write allocates a new data block group Y
    which ends using device extent E. This new block group Y also ends up
    getting the same logical address that block group X had: 19424870400.
    This happens because block group X was the block group with the highest
    logical address and, when allocating Y, find_next_chunk() returns the
    end offset of the current last block group to be used as the logical
    address for the new block group, which is

    18351128576 + 1073741824 = 19424870400

    So our new block group Y has the same logical address and device extent
    that block group X had. However Y is a data block group, while X was
    a metadata one, and Y has a raid0 profile, while X had a raid1 profile;

    9) After allocating block group Y, the direct IO submits a bio to write
    to device extent E;

    10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
    extent E, which now correspond to the data written by the task that
    did a direct IO write. Then at the end io function associated with
    the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
    which calls scrub_checksum(). This later function checks the flags
    of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
    is set in the flags, so it assumes it has a metadata extent and
    then calls scrub_checksum_tree_block(). That functions returns an
    error, since interpreting data as a metadata extent causes the
    checksum verification to fail.

    So this makes scrub_checksum() call scrub_handle_errored_block(),
    which determines 'failed_mirror_index' to be 1, since the device
    extent E was allocated as the second mirror of block group X.

    It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
    'sblocks_for_recheck', and all the memory is initialized to zeroes by
    kcalloc().

    After that it calls scrub_setup_recheck_block(), which is responsible
    for filling each of those structures. However, when that function
    calls btrfs_map_sblock() against the logical address of the metadata
    extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
    the current block group Y. However block group Y has a raid0 profile
    and not a raid1 profile like X had, so the following call returns 1:

    scrub_nr_raid_mirrors(bbio)

    And as a result scrub_setup_recheck_block() only initializes the
    first (index 0) scrub_block structure in 'sblocks_for_recheck'.

    Then scrub_recheck_block() is called by scrub_handle_errored_block()
    with the second (index 1) scrub_block structure as the argument,
    because 'failed_mirror_index' was previously set to 1.
    This scrub_block was not initialized by scrub_setup_recheck_block(),
    so it has zero pages, its 'page_count' member is 0 and its 'pagev'
    page array has all members pointing to NULL.

    Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
    we have a NULL pointer dereference when accessing the flags of the first
    page, as pavev[0] is NULL:

    static void scrub_recheck_block_checksum(struct scrub_block *sblock)
    {
    (...)
    if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
    scrub_checksum_data(sblock);
    (...)
    }

    Producing a stack trace like the following:

    [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
    [542998.010238] #PF: supervisor read access in kernel mode
    [542998.010878] #PF: error_code(0x0000) - not-present page
    [542998.011516] PGD 0 P4D 0
    [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
    [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
    [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
    [542998.018474] Code: 4c 89 e6 ...
    [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
    [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
    [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
    [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
    [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
    [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
    [542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
    [542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
    [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [542998.032380] Call Trace:
    [542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
    [542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
    [542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
    [542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
    [542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
    [542998.036735] process_one_work+0x26d/0x6a0
    [542998.037275] worker_thread+0x4f/0x3e0
    [542998.037740] ? process_one_work+0x6a0/0x6a0
    [542998.038378] kthread+0x103/0x140
    [542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
    [542998.039419] ret_from_fork+0x3a/0x50
    [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
    [542998.047288] CR2: 0000000000000028
    [542998.047724] ---[ end trace bde186e176c7f96a ]---

    This issue has been around for a long time, possibly since scrub exists.
    The last time I ran into it was over 2 years ago. After recently fixing
    fstests to pass the "--full-balance" command line option to btrfs-progs
    when doing balance, several tests started to more heavily exercise balance
    with fsstress, scrub and other operations in parallel, and therefore
    started to hit this issue again (with btrfs/061 for example).

    Fix this by having scrub increment the 'trimming' counter of the block
    group, which pins the block group in such a way that it guarantees neither
    its logical address nor device extents can be reused by future block group
    allocations until we decrement the 'trimming' counter. Also make sure that
    on each iteration of scrub_stripe() we stop scrubbing the block group if
    it was removed already.

    A later patch in the series will rename the block group's 'trimming'
    counter and its helpers to a more generic name, since now it is not used
    exclusively for pinning while trimming anymore.

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     

24 Mar, 2020

5 commits

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array
    member[1][2], introduced in C99:

    struct foo {
    int stuff;
    struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning in
    case the flexible array does not occur last in the structure, which will
    help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that, dynamic memory allocations won't be affected by this
    change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero." [1]

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Gustavo A. R. Silva
     
  • We are now using these for all roots, rename them to btrfs_put_root()
    and btrfs_grab_root();

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • Now that all callers of btrfs_get_fs_root are subsequently calling
    btrfs_grab_fs_root and handling dropping the ref when they are done
    appropriately, go ahead and push btrfs_grab_fs_root up into
    btrfs_get_fs_root.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • We look up the root for the bytenr that is failing, so we need to hold a
    ref on the root for that operation.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     
  • All this does is call btrfs_get_fs_root() with check_ref == true. Just
    use btrfs_get_fs_root() so we don't have a bunch of different helpers
    that do the same thing.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Josef Bacik
     

29 Jan, 2020

1 commit

  • Pull btrfs updates from David Sterba:
    "Features, highlights:

    - async discard
    - "mount -o discard=async" to enable it
    - freed extents are not discarded immediatelly, but grouped
    together and trimmed later, with IO rate limiting
    - the "sync" mode submits short extents that could have been
    ignored completely by the device, for SATA prior to 3.1 the
    requests are unqueued and have a big impact on performance
    - the actual discard IO requests have been moved out of
    transaction commit to a worker thread, improving commit latency
    - IO rate and request size can be tuned by sysfs files, for now
    enabled only with CONFIG_BTRFS_DEBUG as we might need to
    add/delete the files and don't have a stable-ish ABI for
    general use, defaults are conservative

    - export device state info in sysfs, eg. missing, writeable

    - no discard of extents known to be untouched on disk (eg. after
    reservation)

    - device stats reset is logged with process name and PID that called
    the ioctl

    Fixes:

    - fix missing hole after hole punching and fsync when using NO_HOLES

    - writeback: range cyclic mode could miss some dirty pages and lead
    to OOM

    - two more corner cases for metadata_uuid change after power loss
    during the change

    - fix infinite loop during fsync after mix of rename operations

    Core changes:

    - qgroup assign returns ENOTCONN when quotas not enabled, used to
    return EINVAL that was confusing

    - device closing does not need to allocate memory anymore

    - snapshot aware code got removed, disabled for years due to
    performance problems, reimplmentation will allow to select wheter
    defrag breaks or does not break COW on shared extents

    - tree-checker:
    - check leaf chunk item size, cross check against number of
    stripes
    - verify location keys for DIR_ITEM, DIR_INDEX and XATTR items

    - new self test for physical -> logical mapping code, used for super
    block range exclusion

    - assertion helpers/macros updated to avoid objtool "unreachable
    code" reports on older compilers or config option combinations"

    * tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
    btrfs: free block groups after free'ing fs trees
    btrfs: Fix split-brain handling when changing FSID to metadata uuid
    btrfs: Handle another split brain scenario with metadata uuid feature
    btrfs: Factor out metadata_uuid code from find_fsid.
    btrfs: Call find_fsid from find_fsid_inprogress
    Btrfs: fix infinite loop during fsync after rename operations
    btrfs: set trans->drity in btrfs_commit_transaction
    btrfs: drop log root for dropped roots
    btrfs: sysfs, add devid/dev_state kobject and device attributes
    btrfs: Refactor btrfs_rmap_block to improve readability
    btrfs: Add self-tests for btrfs_rmap_block
    btrfs: selftests: Add support for dummy devices
    btrfs: Move and unexport btrfs_rmap_block
    btrfs: separate definition of assertion failure handlers
    btrfs: device stats, log when stats are zeroed
    btrfs: fix improper setting of scanned for range cyclic write cache pages
    btrfs: safely advance counter when looking up bio csums
    btrfs: remove unused member btrfs_device::work
    btrfs: remove unnecessary wrapper get_alloc_profile
    btrfs: add correction to handle -1 edge case in async discard
    ...

    Linus Torvalds
     

24 Jan, 2020

1 commit

  • [BUG]
    For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
    looped runs can lead to random failure, where scrub finds csum error.

    The possibility is not high, around 1/20 to 1/100, but it's causing data
    corruption.

    The bug is observable after commit b12de52896c0 ("btrfs: scrub: Don't
    check free space before marking a block group RO")

    [CAUSE]
    Dev-replace has two source of writes:

    - Write duplication
    All writes to source device will also be duplicated to target device.

    Content: Not yet persisted data/meta

    - Scrub copy
    Dev-replace reused scrub code to iterate through existing extents, and
    copy the verified data to target device.

    Content: Previously persisted data and metadata

    The difference in contents makes the following race possible:
    Regular Writer | Dev-replace
    -----------------------------------------------------------------
    ^ |
    | Preallocate one data extent |
    | at bytenr X, len 1M |
    v |
    ^ Commit transaction |
    | Now extent [X, X+1M) is in |
    v commit root |
    ================== Dev replace starts =========================
    | ^
    | | Scrub extent [X, X+1M)
    | | Read [X, X+1M)
    | | (The content are mostly garbage
    | | since it's preallocated)
    ^ | v
    | Write back happens for |
    | extent [X, X+512K) |
    | New data writes to both |
    | source and target dev. |
    v |
    | ^
    | | Scrub writes back extent [X, X+1M)
    | | to target device.
    | | This will over write the new data in
    | | [X, X+512K)
    | v

    This race can only happen for nocow writes. Thus metadata and data cow
    writes are safe, as COW will never overwrite extents of previous
    transaction (in commit root).

    This behavior can be confirmed by disabling all fallocate related calls
    in fsstress (*), then all related tests can pass a 2000 run loop.

    *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
    -f collapse=0 -f punch=0 -f resvsp=0"
    I didn't expect resvsp ioctl will fallback to fallocate in VFS...

    [FIX]
    Make dev-replace to require mandatory block group RO, and wait for current
    nocow writes before calling scrub_chunk().

    This patch will mostly revert commit 76a8efa171bf ("btrfs: Continue replace
    when set_block_ro failed") for dev-replace path.

    The side effect is, dev-replace can be more strict on avaialble space, but
    definitely worth to avoid data corruption.

    Reported-by: Filipe Manana
    Fixes: 76a8efa171bf ("btrfs: Continue replace when set_block_ro failed")
    Fixes: b12de52896c0 ("btrfs: scrub: Don't check free space before marking a block group RO")
    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     

20 Jan, 2020

1 commit

  • block_group removal is a little tricky. It can race with the extent
    allocator, the cleaner thread, and balancing. The current path is for a
    block_group to be added to the unused_bgs list. Then, when the cleaner
    thread comes around, it starts a transaction and then proceeds with
    removing the block_group. Extents that are pinned are subsequently
    removed from the pinned trees and then eventually a discard is issued
    for the entire block_group.

    Async discard introduces another player into the game, the discard
    workqueue. While it has none of the racing issues, the new problem is
    ensuring we don't leave free space untrimmed prior to forgetting the
    block_group. This is handled by placing fully free block_groups on a
    separate discard queue. This is necessary to maintain discarding order
    as in the future we will slowly trim even fully free block_groups. The
    ordering helps us make progress on the same block_group rather than say
    the last fully freed block_group or needing to search through the fully
    freed block groups at the beginning of a list and insert after.

    The new order of events is a fully freed block group gets placed on the
    unused discard queue first. Once it's processed, it will be placed on
    the unusued_bgs list and then the original sequence of events will
    happen, just without the final whole block_group discard.

    The mount flags can change when processing unused_bgs, so when flipping
    from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
    discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
    free block groups on the discard_list to the unused_bg queue which will
    do the final discard for us.

    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dennis Zhou
     

19 Nov, 2019

6 commits

  • When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
    set the block group to RO mode and then wait for any ongoing writes into
    extents of the block group to complete. While doing that wait we overwrite
    the value of the variable 'ret' and can break out of the loop if an error
    happens without turning the block group back into RW mode. So what happens
    is the following:

    1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
    to RO mode (its ->ro field set to 1 or incremented to some value > 1);

    2) Then btrfs_wait_ordered_roots() returns a value > 0;

    3) Then if either joining or committing the transaction fails, we break
    out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
    the block group in RO mode forever.

    To fix this, just remove the code that waits for ongoing writes to extents
    of the block group, since it's not needed because in the initial setup
    phase of a device replace operation, before starting to find all chunks
    and their extents, we set the target device for replace while holding
    fs_info->dev_replace->rwsem, which ensures that after releasing that
    semaphore, any writes into the source device are made to the target device
    as well (__btrfs_map_block() guarantees that). So while at
    scrub_enumerate_chunks() we only need to worry about finding and copying
    extents (from the source device to the target device) that were written
    before we started the device replace operation.

    Fixes: f0e9b7d6401959 ("Btrfs: fix race setting block group readonly during device replace")
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • [BUG]
    When running btrfs/072 with only one online CPU, it has a pretty high
    chance to fail:

    btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
    - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
    --- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
    +++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
    @@ -1,2 +1,3 @@
    QA output created by 072
    Silence is golden
    +Scrub find errors in "-m dup -d single" test
    ...

    And with the following call trace:

    BTRFS info (device dm-5): scrub: started on devid 1
    ------------[ cut here ]------------
    BTRFS: Transaction aborted (error -27)
    WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
    CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
    Call Trace:
    __btrfs_end_transaction+0xdb/0x310 [btrfs]
    btrfs_end_transaction+0x10/0x20 [btrfs]
    btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
    scrub_enumerate_chunks+0x264/0x940 [btrfs]
    btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
    btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
    do_vfs_ioctl+0x636/0xaa0
    ksys_ioctl+0x67/0x90
    __x64_sys_ioctl+0x43/0x50
    do_syscall_64+0x79/0xe0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    ---[ end trace 166c865cec7688e7 ]---

    [CAUSE]
    The error number -27 is -EFBIG, returned from the following call chain:
    btrfs_end_transaction()
    |- __btrfs_end_transaction()
    |- btrfs_create_pending_block_groups()
    |- btrfs_finish_chunk_alloc()
    |- btrfs_add_system_chunk()

    This happens because we have used up all space of
    btrfs_super_block::sys_chunk_array.

    The root cause is, we have the following bad loop of creating tons of
    system chunks:

    1. The only SYSTEM chunk is being scrubbed
    It's very common to have only one SYSTEM chunk.
    2. New SYSTEM bg will be allocated
    As btrfs_inc_block_group_ro() will check if we have enough space
    after marking current bg RO. If not, then allocate a new chunk.
    3. New SYSTEM bg is still empty, will be reclaimed
    During the reclaim, we will mark it RO again.
    4. That newly allocated empty SYSTEM bg get scrubbed
    We go back to step 2, as the bg is already mark RO but still not
    cleaned up yet.

    If the cleaner kthread doesn't get executed fast enough (e.g. only one
    CPU), then we will get more and more empty SYSTEM chunks, using up all
    the space of btrfs_super_block::sys_chunk_array.

    [FIX]
    Since scrub/dev-replace doesn't always need to allocate new extent,
    especially chunk tree extent, so we don't really need to do chunk
    pre-allocation.

    To break above spiral, here we introduce a new parameter to
    btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
    need extra chunk pre-allocation.

    For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
    @do_chunk_alloc=false.
    This should keep unnecessary empty chunks from popping up for scrub.

    Also, since there are two parameters for btrfs_inc_block_group_ro(),
    add more comment for it.

    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • The type name is misleading, a single entry is named 'cache' while this
    normally means a collection of objects. Rename that everywhere. Also the
    identifier was quite long, making function prototypes harder to format.

    Suggested-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     
  • The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
    the same lock but Smatch is not clever enough to figure that out so it
    leads to static checker warnings. It's better to use it consistently
    anyway.

    Signed-off-by: Dan Carpenter
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dan Carpenter
     
  • The on-disk format of block group item makes use of the key that stores
    the offset and length. This is further used in the code, although this
    makes thing harder to understand. The key is also packed so the
    offset/length is not properly aligned as u64.

    Add start (key.objectid) and length (key.offset) members to block group
    and remove the embedded key. When the item is searched or written, a
    local variable for key is used.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     
  • For unknown reasons, the member 'used' in the block group struct is
    stored in the b-tree item and accessed everywhere using the special
    accessor helper. Let's unify it and make it a regular member and only
    update the item before writing it to the tree.

    The item is still being used for flags and chunk_objectid, there's some
    duplication until the item is removed in following patches.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     

18 Nov, 2019

2 commits

  • Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed
    write") worked around the issue that a recycled work item could get a
    false dependency on the original work item due to how the workqueue code
    guarantees non-reentrancy. It did so by giving different work functions
    to different types of work.

    However, the fixes in the previous few patches are more complete, as
    they prevent a work item from being recycled at all (except for a tiny
    window that the kernel workqueue code handles for us). This obsoletes
    the previous fix, so we don't need the unique helpers for correctness.
    The only other reason to keep them would be so they show up in stack
    traces, but they always seem to be optimized to a tail call, so they
    don't show up anyways. So, let's just get rid of the extra indirection.

    While we're here, rename normal_work_helper() to the more informative
    btrfs_work_helper().

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Filipe Manana
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Omar Sandoval
     
  • Currently, scrub_missing_raid56_worker() puts and potentially frees
    sblock (which embeds the work item) and then submits a bio through
    scrub_wr_submit(). This is another potential instance of the bug in
    "btrfs: don't prematurely free work in run_ordered_work()". Fix it by
    dropping the reference after we submit the bio.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Omar Sandoval
     

09 Sep, 2019

1 commit


02 Jul, 2019

1 commit


01 Jul, 2019

4 commits

  • Currently btrfs_csum_data() relied on the crc32c() wrapper around the
    crypto framework for calculating the CRCs.

    As we have our own crypto_shash structure in the fs_info now, we can
    directly call into the crypto framework without going trough the wrapper.

    This way we can even remove the btrfs_csum_data() and btrfs_csum_final()
    wrappers.

    The module dependency on crc32c is preserved via MODULE_SOFTDEP("pre:
    crc32c"), which was previously provided by LIBCRC32C config option doing
    the same.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • BTRFS has the implicit assumption that a checksum in btrfs_orderd_sums
    is 4 bytes. While this is true for CRC32C, it is not for any other
    checksum.

    Change the data type to be a byte array and adjust loop index
    calculation accordingly.

    This includes moving the adjustment of 'index' by 'ins_size' in
    btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
    size, because before this patch the 'sums' member of 'struct
    btrfs_ordered_sum' was 4 Bytes in size and afterwards it is only one
    byte.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
     
  • There are several places that call nr_data_stripes, but this value does
    not change.

    Signed-off-by: David Sterba

    David Sterba
     
  • fs_info::mapping_tree is the physicallogical mapping tree and uses
    the same underlying structure as extents, but is embedded to another
    structure. There are no other members and this indirection is useless.
    No functional change.

    Signed-off-by: David Sterba

    David Sterba
     

30 Apr, 2019

1 commit