24 Aug, 2018

1 commit

  • [ Upstream commit 665d4953cde6d9e75c62a07ec8f4f8fd7d396ade ]

    In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device
    replace") we removed the branch of copy_nocow_pages() to avoid
    corruption for compressed nodatasum extents.

    However, the above commit only solves the problem in scrub_extent(): if
    we fail to read some pages during scrub_pages(),
    sblock->no_io_error_seen will be zero and we go to the fixup function
    scrub_handle_errored_block().

    In scrub_handle_errored_block(), for a block without csum (no matter
    whether we're doing replace or scrub) we go to the
    scrub_fixup_nodatasum() routine, which does a similar thing to
    copy_nocow_pages() but without the extra check that copy_nocow_pages()
    performs.

    So for test cases like btrfs/100, where we emulate read errors during
    replace/scrub, we could corrupt compressed extent data again.

    This patch fixes it by avoiding any "optimization" for nodatasum and
    falling back to the normal fixup routine, which tries to read from any
    good copy.

    This also solves the WARN_ON() and deadlock caused by the lame backref
    iteration in the scrub_fixup_nodatasum() routine.

    The deadlock and WARN_ON() weren't triggered before commit ac0b4145d662
    ("btrfs: scrub: Don't use inode pages for device replace") since
    copy_nocow_pages() has better locking and an extra check for data
    extents, and it already does the fixup work by trying to read data from
    any good copy, so it never reaches scrub_fixup_nodatasum() anyway.

    This patch only disables the faulty code; the code will be removed
    completely in a followup patch. A sketch of the disabled path follows
    this entry.

    Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
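
    A minimal sketch of the disabled path described above (illustrative
    names; the real scrub_handle_errored_block() differs in detail):

        #include <stdbool.h>

        enum fixup_path { FIXUP_NODATASUM, FIXUP_FROM_GOOD_COPY };

        /* Model of the dispatch: before the fix, data blocks without a
         * csum took the nodatasum shortcut, which writes inode page cache
         * back to disk and corrupts compressed nodatasum extents. The fix
         * disables that branch so everything falls back to the normal
         * fixup that reads from any good copy. */
        static enum fixup_path choose_fixup(bool is_metadata, bool have_csum)
        {
                if (0 && !is_metadata && !have_csum)
                        return FIXUP_NODATASUM; /* disabled faulty path */
                return FIXUP_FROM_GOOD_COPY;
        }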
     

26 Jun, 2018

1 commit

  • commit ac0b4145d662a3b9e34085dea460fb06ede9b69b upstream.

    [BUG]
    Btrfs can create a compressed extent without checksum (even though it
    shouldn't), and if we then try to replace a device containing such an
    extent, the resulting device will contain the uncompressed data instead
    of the compressed data.

    Test case already submitted to fstests:
    https://patchwork.kernel.org/patch/10442353/

    [CAUSE]
    When handling a compressed extent without checksum, device replace goes
    into the copy_nocow_pages() function.

    In that function, btrfs gets all inodes referring to this data extent
    and then uses find_or_create_page() to get pages directly from those
    inodes.

    The problem is that pages taken directly from the inode are always
    uncompressed, and for a compressed data extent they do not match the
    on-disk data. This leads to a corrupted compressed data extent being
    written to the replace target device (see the sketch after this entry).

    [FIX]
    In this attempt, we just remove the "optimization" branch and let the
    unified scrub_pages() handle it.

    Since scrub_pages() won't bother reusing the page cache, it will be a
    little slower, but it does the correct csum checking and won't cause
    the data corruption that the "optimization" did.

    Note about the fix: this is the minimal fix that can be backported to
    older stable trees without conflicts. The whole callchain from
    copy_nocow_pages() can be deleted, and will be in followup patches.

    Fixes: ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk")
    CC: stable@vger.kernel.org # 4.4+
    Reported-by: James Harvey
    Reviewed-by: James Harvey
    Signed-off-by: Qu Wenruo
    [ remove code removal, add note why ]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
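
    A hedged sketch of the flawed pattern removed here (simplified;
    write_page_to_target() is a hypothetical stand-in for the real
    dev-replace write path):

        #include <linux/pagemap.h>

        static void copy_via_page_cache(struct inode *inode, pgoff_t index)
        {
                struct page *page;

                page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
                if (!page)
                        return;
                /* BUG for compressed extents: the page cache holds the
                 * uncompressed file data, which does not match the raw
                 * on-disk bytes the replace target must receive. */
                write_page_to_target(page);     /* hypothetical helper */
                unlock_page(page);
                put_page(page);
        }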
     

21 Jun, 2018

1 commit

  • [ Upstream commit 762221f095e3932669093466aaf4b85ed9ad2ac1 ]

    The raid6 corruption case is this: suppose that all disks can be read
    without problems but the content that was read out doesn't match its
    checksum. Currently, raid6 btrfs retries at most twice:

    - the 1st retry rebuilds from all other stripes, which eventually
    amounts to a raid5 xor rebuild;
    - if the 1st retry fails, the 2nd retry deliberately fails parity P so
    that a raid6 style rebuild is done;

    however, chances are that another non-parity stripe is also corrupted,
    in which case the above retries cannot return the correct content.

    We've fixed normal reads to rebuild raid6 correctly with more retries
    in the patch "Btrfs: make raid6 rebuild retry more" [1]; this fixes
    scrub to do exactly the same rebuild process (sketched after this
    entry).

    [1]: https://patchwork.kernel.org/patch/10091755/

    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
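
    A rough model of the retry policy (not kernel code; the strategy
    numbering and try_rebuild() helper are illustrative):

        #include <stdbool.h>

        /* Assumed helper: attempts one rebuild combination and returns
         * true only if the result matches its checksum. */
        bool try_rebuild(int strategy);

        /* Strategy 0: plain read; 1: raid5-style xor rebuild; 2..n:
         * raid6 rebuild with a chosen data stripe assumed bad, so that
         * two-failure combinations are eventually covered. */
        static bool read_with_retries(int num_strategies)
        {
                for (int s = 0; s < num_strategies; s++)
                        if (try_rebuild(s))
                                return true;
                return false;   /* truly unrecoverable */
        }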
     

10 Sep, 2017

1 commit

  • Pull btrfs updates from David Sterba:
    "The changes range through all types: cleanups, core chagnes, sanity
    checks, fixes, other user visible changes, detailed list below:

    - deprecated: user transaction ioctl

    - mount option ssd does not change allocation alignments

    - degraded read-write mount is allowed if all the raid profile
    constraints are met, now based on more accurate check

    - defrag: do not reset compression afterwards; the NOCOMPRESS flag
    can now be overridden by defrag

    - prep work for better extent reference tracking (related to the
    qgroup slowness with balance)

    - prep work for compression heuristics

    - memory allocation reductions (may help latencies on a loaded
    system)

    - better accounting for io waiting states

    - error handling improvements (removed BUGs)

    - added more sanity checks for shared refs

    - fix readdir vs pagefault deadlock under some circumstances

    - fix for 'no-hole' mode, certain combination of compressed and
    inline extents

    - send: fix emission of invalid clone operations

    - fixup file mode if setting acls fail

    - more fixes from fuzzing

    - oher cleanups"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
    btrfs: submit superblock io with REQ_META and REQ_PRIO
    btrfs: remove unnecessary memory barrier in btrfs_direct_IO
    btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
    btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
    btrfs: pass fs_info to btrfs_del_root instead of tree_root
    Btrfs: add one more sanity check for shared ref type
    Btrfs: remove BUG_ON in __add_tree_block
    Btrfs: remove BUG() in add_data_reference
    Btrfs: remove BUG() in print_extent_item
    Btrfs: remove BUG() in btrfs_extent_inline_ref_size
    Btrfs: convert to use btrfs_get_extent_inline_ref_type
    Btrfs: add a helper to retrive extent inline ref type
    btrfs: scrub: simplify scrub worker initialization
    btrfs: scrub: clean up division in scrub_find_csum
    btrfs: scrub: clean up division in __scrub_mark_bitmap
    btrfs: scrub: use bool for flush_all_writes
    btrfs: preserve i_mode if __btrfs_set_acl() fails
    btrfs: Remove extraneous chunk_objectid variable
    btrfs: Remove chunk_objectid argument from btrfs_make_block_group
    btrfs: Remove extra parentheses from condition in copy_items()
    ...

    Linus Torvalds
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue, and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
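
    A sketch of the resulting submission pattern, assuming the post-change
    bio fields (bi_disk, bi_partno) and the bio_set_dev() helper from this
    series:

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        static void submit_to_partition(struct bio *bio,
                                        struct block_device *bdev)
        {
                /* Fills bio->bi_disk and bio->bi_partno; the partition
                 * remapping happens later in generic_make_request(). */
                bio_set_dev(bio, bdev);
                generic_make_request(bio);
        }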
     

21 Aug, 2017

4 commits


18 Aug, 2017

1 commit


16 Aug, 2017

2 commits


06 Jul, 2017

1 commit

  • Pull btrfs updates from David Sterba:
    "The core updates improve error handling (mostly related to bios), with
    the usual incremental work on the GFP_NOFS (mis)use removal, and
    refactoring or cleanups. Except for the two top patches, all have been
    in for-next for an extensive amount of time.

    User visible changes:

    - statx support

    - quota override tunable

    - improved compression thresholds

    - obsoleted mount option alloc_start

    Core updates:

    - bio-related updates:
    - faster bio cloning
    - no allocation failures
    - preallocated flush bios

    - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates

    - prep work for btree_inode removal

    - dir-item validation

    - qgroup fixes and updates

    - cleanups:
    - removed unused struct members, unused code, refactoring
    - argument refactoring (fs_info/root, caller -> callee sink)
    - SEARCH_TREE ioctl docs"

    * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
    btrfs: Remove false alert when fiemap range is smaller than on-disk extent
    btrfs: Don't clear SGID when inheriting ACLs
    btrfs: fix integer overflow in calc_reclaim_items_nr
    btrfs: scrub: fix target device intialization while setting up scrub context
    btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
    btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
    btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
    btrfs: qgroup: Return actually freed bytes for qgroup release or free data
    btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
    btrfs: qgroup: Add quick exit for non-fs extents
    Btrfs: rework delayed ref total_bytes_pinned accounting
    Btrfs: return old and new total ref mods when adding delayed refs
    Btrfs: always account pinned bytes when dropping a tree block ref
    Btrfs: update total_bytes_pinned when pinning down extents
    Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
    Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
    btrfs: fix validation of XATTR_ITEM dir items
    btrfs: Verify dir_item in iterate_object_props
    btrfs: Check name_len before in btrfs_del_root_ref
    btrfs: Check name_len before reading btrfs_get_name
    ...

    Linus Torvalds
     

30 Jun, 2017

2 commits

  • Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
    v4.12-rc6. This was because commit 70e7af244 made it possible for
    calc_reclaim_items_nr() to return a negative number. It's not really a
    bug in that commit, it just didn't go far enough down the stack to find
    all the possible 64->32 bit overflows.

    This switches calc_reclaim_items_nr() to return a u64 and changes
    everyone that uses the result of that math to u64 as well (see the
    truncation model after this entry).

    Reported-by: Dave Jones
    Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
    Signed-off-by: Chris Mason
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Chris Mason
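
    A runnable model of the 64->32 bit truncation (numbers are
    illustrative, not from the report):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t to_reclaim = 8ULL << 40;       /* 8 TiB in bytes */
                int nr32 = (int)(to_reclaim >> 12);     /* 4K items: truncates */
                uint64_t nr64 = to_reclaim >> 12;       /* widened: stays sane */

                /* on common two's-complement ABIs nr32 wraps to
                 * -2147483648, tripping a WARN_ON(nr < 0) style check */
                printf("int nr = %d, u64 nr = %llu\n",
                       nr32, (unsigned long long)nr64);
                return 0;
        }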
     
  • The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
    helper but wrongly set up the target device. Incidentally, there's a
    local variable with the same name as a parameter of the previous
    function, so this got caught at runtime as a crash in test btrfs/027
    (modeled in the sketch after this entry).

    Reported-by: Chris Mason
    Signed-off-by: David Sterba

    David Sterba
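
    A minimal model of the shadowing hazard (hypothetical names, not the
    actual scrub code):

        #include <stdio.h>

        struct ctx { int wr_tgtdev; };

        static void setup_ctx(struct ctx *sctx, int replace_tgtdev)
        {
                int tgtdev = 0; /* leftover local shadowing the intent */

                /* BUG-style: picks up the local instead of the argument;
                 * correct would be: sctx->wr_tgtdev = replace_tgtdev; */
                sctx->wr_tgtdev = tgtdev;
                (void)replace_tgtdev;
        }

        int main(void)
        {
                struct ctx c;

                setup_ctx(&c, 42);
                printf("tgtdev = %d (expected 42)\n", c.wr_tgtdev);
                return 0;
        }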
     

20 Jun, 2017

9 commits


09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
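
    A sketch of the conversion pattern, assuming the blk_status_t API this
    commit introduces (BLK_STS_* codes and blk_status_to_errno()):

        #include <linux/bio.h>
        #include <linux/blk_types.h>
        #include <linux/printk.h>

        static void example_end_io(struct bio *bio)
        {
                /* old: bio->bi_error held a negative errno;
                 * new: bi_status has its own dedicated value space. */
                if (bio->bi_status != BLK_STS_OK)
                        pr_err("I/O failed: %d\n",
                               blk_status_to_errno(bio->bi_status));
                bio_put(bio);
        }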
     

18 Apr, 2017

14 commits

  • When scrubbing a RAID5 which has recoverable data corruption (only one
    data stripe is corrupted), sometimes scrub will report more csum errors
    than expected, and sometimes even an unrecoverable error is reported.

    The problem can be easily reproduced by the following steps:
    1) Create a btrfs with RAID5 data profile with 3 devs
    2) Mount it with nospace_cache or space_cache=v2, to avoid extra data
       space usage
    3) Create a 128K file and sync the fs, then unmount it
       (now the 128K file lies at the beginning of the data chunk)
    4) Locate the physical bytenr of the data chunk on dev3
       (dev3 is the 1st data stripe)
    5) Corrupt the first 64K of the data chunk stripe on dev3
    6) Mount the fs and scrub it

    The correct csum error count should be 16 (assuming x86_64's 4K page
    size: 64K / 4K = 16 pages). A larger csum error count is reported with
    roughly 1/3 probability, and an unrecoverable error with roughly 1/10
    probability.

    The root cause of the problem is that the RAID5/6 recovery code has a
    race condition, due to the fact that a full scrub is initiated per
    device. For the other, mirror-based profiles, each mirror is
    independent of the others, so the race doesn't cause any big problem.

    For example:

    Scrub dev3 (D1)      | Scrub dev2 (D2)      | Scrub dev1 (P)
    (corrupted)          | (correct)            | (correct)
    ---------------------+----------------------+---------------------
    Read out D1          | Read out D2          | Read full stripe
    Check csum           | Check csum           | Check parity
    Csum mismatch        | Csum match, continue | Parity mismatch
    handle_errored_block |                      | handle_errored_block
    Read out full stripe |                      | Read out full stripe
    D1 csum error(err++) |                      | D1 csum error(err++)
    Recover D1           |                      | Recover D1

    So D1's csum error is accounted twice, just because
    handle_errored_block() doesn't have enough protection, and the race can
    happen.

    In an even worse case, for example when D1's recovery code is
    re-writing D1/D2/P while P's recovery code is just reading out the full
    stripe, we can end up with an unrecoverable error.

    This patch uses the previously introduced lock_full_stripe() and
    unlock_full_stripe() to protect the whole scrub_handle_errored_block()
    function for RAID56 recovery, so there are no extra csum errors and no
    unrecoverable errors.

    Reported-by: Goffredo Baroncelli
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
    whole full stripe, and without proper protection this can easily cause
    a race condition.

    Introduce 2 new functions for RAID5/6: lock_full_stripe() and
    unlock_full_stripe(). They store an rb_tree of mutexes for full
    stripes, so scrub callers can use them to lock a full stripe and avoid
    races (a sketch of such a lock table follows this entry).

    Signed-off-by: Qu Wenruo
    Reviewed-by: Liu Bo
    Reviewed-by: David Sterba
    [ minor comment adjustments ]
    Signed-off-by: David Sterba

    Qu Wenruo
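
    The lock table such a pair of functions needs might look like this (a
    sketch based on the description above; the in-kernel structures may
    differ in detail):

        #include <linux/mutex.h>
        #include <linux/rbtree.h>
        #include <linux/types.h>

        /* One lock per full stripe, keyed by the stripe's start logical
         * address; looked up or created by lock_full_stripe(). */
        struct full_stripe_lock {
                struct rb_node node;
                u64 logical;        /* rb_tree key: full stripe start */
                u64 refs;           /* concurrent holders/waiters     */
                struct mutex mutex; /* serializes recovery on stripe  */
        };

        /* The tree itself needs a lock too. */
        struct full_stripe_locks_tree {
                struct rb_root root;
                struct mutex lock;
        };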
     
  • This fixes code where div_u64 was passed a u64 divisor; div_u64 takes
    only a u32 divisor, so the value was silently truncated.

    Cc: David Sterba
    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     
  • Commit 3d8da6781760 ("Btrfs: fix divide error upon chunk's stripe_len")
    changed stripe_len in struct map_lookup to u64, but didn't update
    stripe_len in struct scrub_parity.

    This updates the type and switches to div64_u64_rem to match u64 divisor.

    Cc: David Sterba
    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
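
    The hazard behind both of the division fixes above, sketched (the
    function is illustrative; the helpers are from linux/math64.h):

        #include <linux/math64.h>
        #include <linux/types.h>

        static u64 offset_in_stripe(u64 offset, u64 stripe_len)
        {
                u64 rem;

                /* div_u64(offset, stripe_len) takes a u32 divisor and
                 * would silently truncate a 64-bit stripe_len; the
                 * div64_u64* variants keep the full width. */
                div64_u64_rem(offset, stripe_len, &rem);
                return rem;
        }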
     
  • There's a helper to clear a whole page, with arch-specific optimized
    code. The replaced cases do not seem to be in performance-critical
    code, but we still might get some percent gain.

    Signed-off-by: David Sterba

    David Sterba
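
    The replacement pattern, sketched (assuming the clear_highpage()
    helper from linux/highmem.h; the commit may use a different variant):

        #include <linux/highmem.h>

        static void zero_page_contents(struct page *page)
        {
                /* before: memset(page_address(page), 0, PAGE_SIZE); */
                clear_highpage(page);   /* arch-optimized page clear */
        }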
     
  • scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
    bbio without the protection of bio_counter.

    This can lead to a use-after-free if racing with dev-replace cancel.

    Fix it by increasing bio_counter before calling btrfs_map_sblock() and
    decreasing it when the corresponding recovery is finished (pattern
    sketched after this entry).

    Cc: Liu Bo
    Reported-by: Liu Bo
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
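
    The protection pattern described above, sketched; the btrfs types and
    the btrfs_map_sblock() signature are approximated for this kernel era:

        static int map_with_bio_counter(struct btrfs_fs_info *fs_info,
                                        u64 logical, u64 *mapped_length,
                                        struct btrfs_bio **bbio)
        {
                int ret;

                /* blocks while a replace is pausing new bios */
                btrfs_bio_counter_inc_blocked(fs_info);
                ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
                                       logical, mapped_length, bbio);
                if (ret || !*bbio) {
                        btrfs_bio_counter_dec(fs_info);
                        return -EIO;
                }
                /* Caller uses *bbio and calls btrfs_bio_counter_dec()
                 * only once the corresponding recovery has finished. */
                return 0;
        }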
     
  • When raid56 dev-replace is cancelled by a running scrub, we will free
    the target device without waiting for in-flight bios, causing the
    following NULL pointer dereference or general protection failure.

    BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
    IP: generic_make_request_checks+0x4d/0x610
    CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G O 4.11.0-rc2 #72
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
    task: ffff88002875b4c0 task.stack: ffffc90001334000
    RIP: 0010:generic_make_request_checks+0x4d/0x610
    Call Trace:
    ? generic_make_request+0xc7/0x360
    generic_make_request+0x24/0x360
    ? generic_make_request+0xc7/0x360
    submit_bio+0x64/0x120
    ? page_in_rbio+0x4d/0x80 [btrfs]
    ? rbio_orig_end_io+0x80/0x80 [btrfs]
    finish_rmw+0x3f4/0x540 [btrfs]
    validate_rbio_for_rmw+0x36/0x40 [btrfs]
    raid_rmw_end_io+0x7a/0x90 [btrfs]
    bio_endio+0x56/0x60
    end_workqueue_fn+0x3c/0x40 [btrfs]
    btrfs_scrubparity_helper+0xef/0x620 [btrfs]
    btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
    process_one_work+0x2af/0x720
    ? process_one_work+0x22b/0x720
    worker_thread+0x4b/0x4f0
    kthread+0x10f/0x150
    ? process_one_work+0x720/0x720
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x2e/0x40
    RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8

    In btrfs_dev_replace_finishing(), we call
    btrfs_rm_dev_replace_blocked() to wait for bios before destroying the
    target device when scrub finishes normally.

    However, when dev-replace is aborted, either due to an error or because
    it was cancelled by scrub, we don't wait for bios, and this can lead to
    a use-after-free if there are bios still holding the target device.

    Furthermore, for raid56 scrub, at least 2 places call
    btrfs_map_sblock() without the protection of bio_counter, leading to
    the same problem.

    This patch fixes the problem by:
    1) Waiting for bio_counter before freeing the target device when
    cancelling the replace
    2) Using bio_counter to protect calls to btrfs_map_sblock() for raid56
    (the same pattern sketched after the previous entry)

    Cc: Liu Bo
    Signed-off-by: Qu Wenruo
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Qu Wenruo
     
  • In the following situation, scrub will calculate wrong parity to
    overwrite the correct one:

    RAID5 full stripe:

    Before:

    | Dev 1         | Dev 2         | Dev 3         |
    | Data stripe 1 | Data stripe 2 | Parity stripe |
    ------------------------------------------------- 0
    | 0x0000 (Bad)  | 0xcdcd        | 0x0000        |
    ------------------------------------------------- 4K
    | 0xcdcd        | 0xcdcd        | 0x0000        |
    ...
    | 0xcdcd        | 0xcdcd        | 0x0000        |
    ------------------------------------------------- 64K

    After scrubbing dev3 only:

    | Dev 1         | Dev 2         | Dev 3         |
    | Data stripe 1 | Data stripe 2 | Parity stripe |
    ------------------------------------------------- 0
    | 0xcdcd (Good) | 0xcdcd        | 0xcdcd (Bad)  |
    ------------------------------------------------- 4K
    | 0xcdcd        | 0xcdcd        | 0x0000        |
    ...
    | 0xcdcd        | 0xcdcd        | 0x0000        |
    ------------------------------------------------- 64K

    The reason is that after the raid56 read-rebuild, rbio->stripe_pages
    are all correctly recovered (0xcd for the data stripes).

    However, when we check and repair parity in
    scrub_parity_check_and_repair(), we append the pages in the
    sparity->spages list to rbio->bio_pages[], which still contain the old
    on-disk data.

    And when we submit the parity data to disk, we calculate parity using
    rbio->bio_pages[] first; only if a page is not found there do we fall
    back to rbio->stripe_pages[] (this preference is modeled after this
    entry).

    The patch fixes it by not appending the pages from sparity->spages, so
    finish_parity_scrub() will use rbio->stripe_pages[], which is correct.

    Signed-off-by: Qu Wenruo
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Qu Wenruo
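
    A model of the page-preference rule described above (illustrative
    types, not the kernel code):

        #include <stddef.h>

        /* When computing parity, a page staged in bio_pages[] wins over
         * the rebuilt stripe_pages[]. The fix stops staging stale pages
         * from sparity->spages, so the rebuilt data is used. */
        static void *page_for_parity(void **bio_pages,
                                     void **stripe_pages, int i)
        {
                if (bio_pages[i])
                        return bio_pages[i];    /* may be stale on-disk data */
                return stripe_pages[i];         /* recovered, known-good */
        }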
     
  • All callers pass 0 for mirror_num and 1 for need_raid_map.

    Signed-off-by: David Sterba

    David Sterba
     
  • Scrub repairs data in units called scrub_block, which may contain
    several pages. Scrub always tries to look up a good copy of a whole
    block, but if there's no such copy, it tries to repair page by page.

    If we don't set the page's io_error flag when checking this bad copy,
    then in the last step we may skip this page when repairing the bad copy
    from the good copy.

    Cc: David Sterba
    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
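
    A minimal model of the flag logic the fix adds (illustrative struct,
    not the kernel's scrub_page):

        #include <stdbool.h>

        struct spage { bool io_error; };

        static void check_page(struct spage *p, bool csum_ok)
        {
                if (!csum_ok)
                        p->io_error = true; /* the step that was missing */
        }

        static bool page_needs_repair(const struct spage *p)
        {
                return p->io_error; /* repair pass copies marked pages */
        }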
     
  • refcount_t type and corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
     
  • refcount_t type and corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
     
  • refcount_t type and corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
     
  • refcount_t type and corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David Sterba

    Elena Reshetova
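
    The conversion pattern all four commits apply, sketched with a
    hypothetical object (refcount_t saturates instead of wrapping on
    overflow):

        #include <linux/refcount.h>
        #include <linux/slab.h>

        struct my_obj {
                refcount_t refs;        /* was: atomic_t refs */
        };

        static void my_obj_get(struct my_obj *o)
        {
                refcount_inc(&o->refs);         /* was: atomic_inc() */
        }

        static void my_obj_put(struct my_obj *o)
        {
                /* was: if (atomic_dec_and_test(&o->refs)) */
                if (refcount_dec_and_test(&o->refs))
                        kfree(o);
        }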
     

28 Feb, 2017

1 commit