24 Mar, 2020

2 commits

  • The struct_size macro does the same calculation and is safe regarding
    overflows. Though we're not expecting them to happen, use the helper for
    clarity.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
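
    For illustration only, a minimal sketch of the struct_size() pattern
    (the struct and field names here are made up, not the ones touched by
    the patch):

        #include <linux/overflow.h>
        #include <linux/slab.h>
        #include <linux/types.h>

        struct item_table {
                unsigned int nr_items;
                u64 items[];            /* flexible array member */
        };

        static struct item_table *alloc_table(unsigned int nr_items)
        {
                struct item_table *table;

                /*
                 * struct_size() computes sizeof(*table) +
                 * nr_items * sizeof(table->items[0]) and saturates on
                 * overflow instead of silently wrapping around.
                 */
                table = kzalloc(struct_size(table, items, nr_items), GFP_NOFS);
                if (!table)
                        return NULL;
                table->nr_items = nr_items;
                return table;
        }

    On overflow struct_size() evaluates to SIZE_MAX, so the allocation
    fails instead of being undersized.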
     
  • There are temporary variables tracking the index of P and Q stripes, but
    none of them is really used as such, merely for determining if the Q
    stripe is present. This leads to compiler warnings with
    -Wunused-but-set-variable and has been reported several times.

    fs/btrfs/raid56.c: In function ‘finish_rmw’:
    fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
     1199 |  int p_stripe = -1;
          |      ^~~~~~~~
    fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
    fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
     2356 |  int p_stripe = -1;
          |      ^~~~~~~~

    Replace the two variables with one that has a clear meaning and also get
    rid of the warnings. The logic that verifies that there are only 2
    valid cases is unchanged.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: David Sterba

    David Sterba
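
    A rough sketch of the shape of this cleanup (simplified; the flag name
    reflects "a clear meaning" rather than the patch verbatim):

        /* Before: two stripe-index variables used only as a presence check. */
        int p_stripe = -1;
        int q_stripe = -1;

        if (rbio->real_stripes - rbio->nr_data == 1) {
                p_stripe = rbio->real_stripes - 1;
        } else if (rbio->real_stripes - rbio->nr_data == 2) {
                p_stripe = rbio->real_stripes - 2;
                q_stripe = rbio->real_stripes - 1;
        } else {
                BUG();
        }

        /* After: one boolean that says exactly what the code needs to know. */
        bool has_qstripe;

        if (rbio->real_stripes - rbio->nr_data == 1)
                has_qstripe = false;
        else if (rbio->real_stripes - rbio->nr_data == 2)
                has_qstripe = true;
        else
                BUG();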
     

18 Nov, 2019

3 commits

  • In lock_stripe_add() we're caching the bucket for the stripe hash table
    just for a single call to dereference the stripe hash.

    If we call rbio_bucket() directly we can drop the pointless local
    variable.

    Also move the dereferencing of the stripe hash outside of the
    variable declaration block so the line does not exceed the
    80-character limit.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
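
    Roughly, the before/after shape (the surrounding code in
    lock_stripe_add() is elided):

        /* Before: a local variable holding the bucket, used exactly once. */
        int bucket = rbio_bucket(rbio);
        struct btrfs_stripe_hash *h =
                rbio->fs_info->stripe_hash_table->table + bucket;

        /* After: call rbio_bucket() directly at the single use site. */
        struct btrfs_stripe_hash *h;

        h = rbio->fs_info->stripe_hash_table->table + rbio_bucket(rbio);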
     
  • In lock_stripe_add() we're traversing the stripe hash list and
    checking if the current list element's raid_map is equal to the raid
    bio's raid_map. If both are equal we continue processing.

    If we check for inequality instead of equality, we can reduce one
    level of indentation.

    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Johannes Thumshirn
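
    The general pattern, sketched on an approximation of that loop (field
    names may not match the code exactly):

        /* Before: everything of interest sits one level deeper. */
        list_for_each_entry(cur, &h->hash_list, hash_list) {
                if (cur->bbio->raid_map[0] == rbio->bbio->raid_map[0]) {
                        /* ... the actual merge/locking work ... */
                }
        }

        /* After: bail out early on inequality and save one indent level. */
        list_for_each_entry(cur, &h->hash_list, hash_list) {
                if (cur->bbio->raid_map[0] != rbio->bbio->raid_map[0])
                        continue;

                /* ... the same work, one level shallower ... */
        }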
     
  • Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed
    write") worked around the issue that a recycled work item could get a
    false dependency on the original work item due to how the workqueue code
    guarantees non-reentrancy. It did so by giving different work functions
    to different types of work.

    However, the fixes in the previous few patches are more complete, as
    they prevent a work item from being recycled at all (except for a tiny
    window that the kernel workqueue code handles for us). This obsoletes
    the previous fix, so we don't need the unique helpers for correctness.
    The only other reason to keep them would be so they show up in stack
    traces, but they always seem to be optimized to a tail call, so they
    don't show up anyways. So, let's just get rid of the extra indirection.

    While we're here, rename normal_work_helper() to the more informative
    btrfs_work_helper().

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Filipe Manana
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Omar Sandoval
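
    Conceptually (a simplified sketch, not the exact btrfs structures or
    helper names), the indirection that gets removed looks like this:

        /* Before: one trampoline per work type, all doing the same thing. */
        static void btrfs_endio_helper(struct work_struct *arg)
        {
                normal_work_helper(container_of(arg, struct btrfs_work,
                                                normal_work));
        }
        /* ... one such helper per queue type ... */

        /* After: a single, descriptively named work function is used for
         * every btrfs_work item, so the per-type trampolines can go away. */
        static void btrfs_work_helper(struct work_struct *arg)
        {
                struct btrfs_work *work = container_of(arg, struct btrfs_work,
                                                       normal_work);

                work->func(work);       /* plus ordered-work processing */
        }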
     

27 Mar, 2019

1 commit

  • Pull btrfs fixes from David Sterba:

    - fsync fixes: i_size for truncate vs fsync, dio vs buffered during
    snapshotting, remove complicated but incomplete assertion

    - removed excessive warnings, misreported device stats updates

    - fix raid56 page mapping for 32bit arch

    - fixes reported by static analyzer

    * tag 'for-5.1-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    Btrfs: fix assertion failure on fsync with NO_HOLES enabled
    btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size
    btrfs: Fix bound checking in qgroup_trace_new_subtree_blocks
    btrfs: raid56: properly unmap parity page in finish_parity_scrub()
    btrfs: don't report readahead errors and don't update statistics
    Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes
    btrfs: remove WARN_ON in log_dir_items
    Btrfs: fix incorrect file size after shrinking truncate and fsync

    Linus Torvalds
     

19 Mar, 2019

1 commit

  • Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
    a reference counter bug on i386, i.e.:

    [ 157.662401] kernel BUG at mm/highmem.c:349!
    [ 157.666725] invalid opcode: 0000 [#1] SMP PTI

    The reason is that kunmap(p_page) was completely left out, so we never
    did an unmap for the p_page and the loop unmapping the rbio page was
    iterating over the wrong number of stripes: unmapping should be done
    with nr_data instead of rbio->real_stripes.

    Test case to reproduce the bug:

    - create a raid5 btrfs filesystem:
    # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde

    - mount it:
    # mount /dev/sdb /mnt

    - run btrfs scrub in a loop:
    # while :; do btrfs scrub start -BR /mnt; done

    BugLink: https://bugs.launchpad.net/bugs/1812845
    Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Andrea Righi
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Andrea Righi
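
    Schematically, the pairing the fix restores in finish_parity_scrub()
    (heavily trimmed; declarations and the surrounding loops are omitted):

        parity = kmap(p_page);                  /* P page is mapped ...       */
        /* ... generate/compare parity ... */
        kunmap(p_page);                         /* ... so it must be unmapped */

        /* Only the data stripes were mapped above, so unmap nr_data of
         * them, not rbio->real_stripes. */
        for (stripe = 0; stripe < nr_data; stripe++)
                kunmap(page_in_rbio(rbio, stripe, pagenr, 0));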
     

15 Feb, 2019

1 commit

  • This patch introduces one extra iterator variable to
    bio_for_each_segment_all(), then we can allow
    bio_for_each_segment_all() to iterate over multi-page bvecs.

    Given it is just a mechanical & simple change on all
    bio_for_each_segment_all() users, this patch does the tree-wide
    change in one single patch, so that we can avoid using a temporary
    helper for this conversion.

    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
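
    A hedged sketch of the resulting call pattern (shown in the later
    three-argument form of the macro; the tree-wide conversion at the time
    also kept the integer index argument):

        struct bio_vec *bvec;
        struct bvec_iter_all iter_all;

        /* The extra iterator lets the macro hand back one page at a time
         * even when a single bvec covers multiple pages. */
        bio_for_each_segment_all(bvec, bio, iter_all) {
                /* ... per-page work on bvec->bv_page ... */
        }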
     

30 May, 2018

1 commit

  • In the quest to remove all stack VLA usage from the kernel[1], this
    allocates the working buffers during regular init, instead of using stack
    space. This refactors the allocation code a bit to make it easier
    to review.

    [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

    Signed-off-by: Kees Cook
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Kees Cook
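
    In general terms (an illustrative pattern with made-up names, not the
    patch itself), the change has this shape:

        /* Before: a variable-length array on the kernel stack. */
        void *pointers[num_stripes];    /* size only known at run time */

        /* After: space for the working pointers is reserved once, when the
         * object is allocated during init (field name is illustrative). */
        obj = kzalloc(sizeof(*obj) + num_stripes * sizeof(void *), GFP_NOFS);
        if (!obj)
                return -ENOMEM;
        obj->work_pointers = (void **)(obj + 1);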
     

15 Mar, 2018

1 commit

  • When detaching a disk that is part of a RAID6 filesystem, the
    following kernel OOPS may happen:

    [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
    [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
    [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
    [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
    [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
    [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo
    [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
    [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs]
    [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0
    [63122.971202] Oops: 0000 [#1] SMP
    [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8
    [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs]
    [63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000
    [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs]
    [63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287
    [63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690
    [63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000
    [63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600
    [63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500
    [63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004
    [63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000
    [63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0
    [63123.009969] Call Trace:
    [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs]
    [63123.010251] bio_endio+0xa1/0x120
    [63123.010378] generic_make_request+0x218/0x270
    [63123.010921] submit_bio+0x66/0x130
    [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs]
    [63123.011245] full_stripe_write+0x96/0xc0 [btrfs]
    [63123.011428] raid56_parity_write+0x117/0x170 [btrfs]
    [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs]
    [63123.011759] ? ___cache_free+0x1c5/0x300
    [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs]
    [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs]
    [63123.012257] normal_work_helper+0x19e/0x300 [btrfs]
    [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs]
    [63123.012656] process_one_work+0x14d/0x350
    [63123.012888] worker_thread+0x4d/0x3a0
    [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20
    [63123.013192] kthread+0x109/0x140
    [63123.013315] ? process_scheduled_works+0x40/0x40
    [63123.013472] ? kthread_stop+0x110/0x110
    [63123.013610] ret_from_fork+0x25/0x30
    [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8
    [63123.014678] CR2: 0000000000000080
    [63123.016590] ---[ end trace a295ea7259c17880 ]---

    This is reproducible in a cycle, where a series of writes is
    followed by a SCSI device delete command. The test may take up to a
    few minutes.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    [ no signed-off-by provided ]
    Author: Dmitriy Gorokh
    Reviewed-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dmitriy Gorokh
     

22 Jan, 2018

8 commits

  • Before rbio_orig_end_io() frees the rbio, the rbio may get merged
    with more bios from other rbios and rbio->bio_list becomes non-empty.
    In that case, these newly merged bios don't get ended properly.

    Once unlock_stripe() is done, rbio->bio_list will not be updated any
    more and we can call bio_endio() on all queued bios.

    It should only happen in error-out cases, the normal path of recover
    and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable
    merge before doing IO, so rbio_orig_end_io() called by them doesn't
    have the above issue.

    Reported-by: Jérôme Carretero
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
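
    The drain described above follows the usual kernel pattern, roughly:

        struct bio *cur;

        /* After unlock_stripe(), rbio->bio_list can no longer grow, so
         * every bio still queued on it can be ended now. */
        while ((cur = bio_list_pop(&rbio->bio_list)) != NULL) {
                cur->bi_status = err;
                bio_endio(cur);
        }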
     
  • Since raid6 recover tries all possible combinations of failed stripes:

    - when the raid6 rebuild algorithm is used, i.e. raid6_datap_recov()
    and raid6_2data_recov(), it may change the in-memory content of
    failed stripes. If such a raid bio is cached, a later raid write rmw
    or recover can steal @stripe_pages from it instead of reading from
    disks, so it carries the wrong content into the write rmw or
    recovery and ends up with corruption or recovery failures.

    - when the raid5 rebuild algorithm is used, i.e. xor, the raid bio
    can be cached because the only failed stripe, which contains
    @rbio->bio_pages, gets modified; the others remain the same, so
    their in-memory content is consistent with their on-disk content.

    This adds a check to skip caching the rbio if the raid6 recover
    algorithm was used.

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
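
    Sketch of the idea (the exact condition and its placement in the
    recovery completion path may differ):

        if (rbio->failb < 0) {
                /* Single failure: the xor (raid5) path was used and only
                 * the failed stripe's pages changed, so caching is safe. */
                cache_rbio_pages(rbio);
        } else {
                /* Two failures: the raid6 recover math may have rewritten
                 * pages of both failed stripes, so don't cache this rbio. */
                clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
        }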
     
  • The bio iterated by set_bio_pages_uptodate() is a raid56-internal
    one, so it will never be a BIO_CLONED bio. And since this is called
    by end_io functions, bio->bi_iter.bi_size is zero, so we must not
    use bio_for_each_segment() as that is a no-op when bi_size is zero.

    Fixes: 6592e58c6b68e61f003a01ba29a3716e7e2e9484 ("Btrfs: fix write corruption due to bio cloning on raid5/6")
    Cc: # v4.12-rc6+
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
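
    Roughly, the safe form in set_bio_pages_uptodate() then becomes
    (using the three-argument macro of that kernel generation):

        struct bio_vec *bvec;
        int i;

        /* bio_for_each_segment() would see bi_iter.bi_size == 0 here and
         * iterate nothing; the _all variant walks bi_io_vec directly and
         * is fine for our internal, never-cloned bios. */
        bio_for_each_segment_all(bvec, bio, i)
                SetPageUptodate(bvec->bv_page);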
     
  • Since the failed stripe index in an rbio is used to decide which
    reconstruction algorithm to run, we cannot merge rbios whose failed
    stripe indexes are different; otherwise, one of the two
    reconstructions would fail.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
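
    Sketch of the extra merge check (faila/failb hold the failed stripe
    indexes, -1 when unset; the real check lives in the rbio merge test):

        /* rbios that would need different reconstruction paths must not
         * be merged. */
        if (last->faila != cur->faila || last->failb != cur->failb)
                return 0;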
     
  • Given the above check

        if (last->operation != cur->operation)
                return 0;

    it's guaranteed that the two operations are the same.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     
  • There is a scenario that can end up with the rebuild process failing
    to return good content: suppose that all disks can be read without
    problems, but the content that was read out doesn't match its
    checksum. Currently raid6 btrfs retries at most twice,

    - the 1st retry is to rebuild with all other stripes, which
    eventually becomes a raid5 xor rebuild,
    - if the 1st fails, the 2nd retry deliberately fails parity p so
    that it does a raid6 style rebuild,

    however, chances are that another non-parity stripe's content is
    also corrupted, so the above retries are not able to return correct
    content, and users will see this as data loss. More seriously, if
    the loss happens on some important internal btree roots, the
    filesystem could refuse to mount.

    This extends btrfs to do more retries, with each retry failing only
    one stripe. Since raid6 can tolerate 2 disk failures, if there is
    one more failure besides the failure we're recovering from, this
    can always work.

    The worst case is to retry as many times as the number of raid6
    disks, but given that such a scenario is really rare in practice,
    it's still acceptable.

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
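
    One way the extra retries can be expressed is by mapping higher
    mirror numbers to "one more stripe to treat as failed"; an
    illustrative sketch (the exact mapping in the patch may differ):

        /* mirror 1: plain read
         * mirror 2: rebuild from all other stripes (raid5 xor case)
         * mirror 3+: additionally mark one more stripe as failed so the
         *            raid6 math reconstructs around it as well. */
        if (mirror_num > 2) {
                rbio->failb = rbio->real_stripes - (mirror_num - 1);
                ASSERT(rbio->failb > 0);
                if (rbio->failb <= rbio->faila)
                        rbio->failb--;
        }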
     
  • In fact nobody is waiting on @wait's waitqueue, so it can be safely
    removed.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     
  • The defined wait is not used anywhere.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     

30 Oct, 2017

1 commit

  • The local bio_list may have pending bios when doing cleanup; this
    can end up as a memory leak if they don't get freed.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
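
    The cleanup is the same drain pattern as above, except the bios are
    dropped with bio_put() rather than ended, roughly:

        struct bio *cur;

        /* On the error/cleanup path, release every bio still sitting on
         * the local list so it isn't leaked. */
        while ((cur = bio_list_pop(&bio_list)) != NULL)
                bio_put(cur);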