03 Apr, 2019

1 commit

  • commit 0cc068e6ee59c1fffbfa977d8bf868b7551d80ac upstream.

    As readahead is an optimization, all errors are usually filtered out,
    but still properly handled when the real read call is done. The commit
    5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead") added
    REQ_RAHEAD to readpages() because that's only used for readahead
    (despite what one would expect from the callback name).

    This causes a flood of messages and inflated read error stats, so skip
    reporting in case it's readahead.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403
    Reported-by: LimeTech
    Fixes: 5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead")
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     

24 Mar, 2019

1 commit

  • commit 349ae63f40638a28c6fce52e8447c2d14b84cc0c upstream.

    We recently had a customer issue with a corrupted filesystem. When
    trying to mount this image btrfs panicked with a division by zero in
    calc_stripe_length().

    The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
    takes this value and divides it by the number of copies the RAID profile
    is expected to have to calculate the amount of data stripes. As a DUP
    profile is expected to have 2 copies this division resulted in 1/2 = 0.
    Later then the 'data_stripes' variable is used as a divisor in the
    stripe length calculation which results in a division by 0 and thus a
    kernel panic.

    When encountering a filesystem with a DUP block group and a
    'num_stripes' value unequal to 2, refuse mounting as the image is
    corrupted and will lead to unexpected behaviour.

    Code inspection showed a RAID1 block group has the same issues.

    Fixes: e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Qu Wenruo
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Johannes Thumshirn
     

13 Feb, 2019

1 commit

  • [ Upstream commit a9261d4125c97ce8624e9941b75dee1b43ad5df9 ]

    It's not that impossible to imagine that a device OR a btrfs image is
    copied just by using the dd or the cp command. Which in case both the
    copies of the btrfs will have the same fsid. If on the system with
    automount enabled, the copied FS gets scanned.

    We have a known bug in btrfs, that we let the device path be changed
    after the device has been mounted. So using this loop hole the new
    copied device would appears as if its mounted immediately after it's
    been copied.

    For example:

    Initially.. /dev/mmcblk0p4 is mounted as /

    $ lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    mmcblk0 179:0 0 29.2G 0 disk
    |-mmcblk0p4 179:4 0 4G 0 part /
    |-mmcblk0p2 179:2 0 500M 0 part /boot
    |-mmcblk0p3 179:3 0 256M 0 part [SWAP]
    `-mmcblk0p1 179:1 0 256M 0 part /boot/efi

    $ btrfs fi show
    Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba
    Total devices 1 FS bytes used 1.40GiB
    devid 1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4

    Copy mmcblk0 to sda

    $ dd if=/dev/mmcblk0 of=/dev/sda

    And immediately after the copy completes the change in the device
    superblock is notified which the automount scans using btrfs device scan
    and the new device sda becomes the mounted root device.

    $ lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sda 8:0 1 14.9G 0 disk
    |-sda4 8:4 1 4G 0 part /
    |-sda2 8:2 1 500M 0 part
    |-sda3 8:3 1 256M 0 part
    `-sda1 8:1 1 256M 0 part
    mmcblk0 179:0 0 29.2G 0 disk
    |-mmcblk0p4 179:4 0 4G 0 part
    |-mmcblk0p2 179:2 0 500M 0 part /boot
    |-mmcblk0p3 179:3 0 256M 0 part [SWAP]
    `-mmcblk0p1 179:1 0 256M 0 part /boot/efi

    $ btrfs fi show /
    Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba
    Total devices 1 FS bytes used 1.40GiB
    devid 1 size 4.00GiB used 3.00GiB path /dev/sda4

    The bug is quite nasty that you can't either unmount /dev/sda4 or
    /dev/mmcblk0p4. And the problem does not get solved until you take sda
    out of the system on to another system to change its fsid using the
    'btrfstune -u' command.

    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Anand Jain
     

26 Jan, 2019

2 commits

  • [ Upstream commit baf92114c7e6dd6124aa3d506e4bc4b694da3bc3 ]

    Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling"
    fixed calculating the stripe_size for a new DUP chunk.

    However, the same calculation reappears a bit later, and that one was
    not changed yet. The resulting bug that is exposed is that the newly
    allocated device extents ('stripes') can have a few MiB overlap with the
    next thing stored after them, which is another device extent or the end
    of the disk.

    The scenario in which this can happen is:
    * The block device for the filesystem is less than 10GiB in size.
    * The amount of contiguous free unallocated disk space chosen to use for
    chunk allocation is 20% of the total device size, or a few MiB more or
    less.

    An example:
    - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
    - There's 1578MiB unallocated raw disk space left in one contiguous
    piece.

    In this case stripe_size is first calculated as 789MiB, (half of
    1578MiB).

    Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
    enter the if block. Now stripe_size value is immediately overwritten
    while calculating an adjusted value based on max_chunk_size, which ends
    up as 788MiB.

    Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
    actually more than the value we had before. However, the last comparison
    fails to detect this, because it's comparing the value with the total
    amount of free space, which is about twice the size of stripe_size.

    In the example above, this means that the resulting raw disk space being
    allocated is 1600MiB, while only a gap of 1578MiB has been found. The
    second device extent object for this DUP chunk will overlap for 22MiB
    with whatever comes next.

    The underlying problem here is that the stripe_size is reused all the
    time for different things. So, when entering the code in the if block,
    stripe_size is immediately overwritten with something else. If later we
    decide we want to have the previous value back, then the logic to
    compute it was copy pasted in again.

    With this change, the value in stripe_size is not unnecessarily
    destroyed, so the duplicated calculation is not needed any more.

    Signed-off-by: Hans van Kranenburg
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Hans van Kranenburg
     
  • [ Upstream commit 5eb193812a42dc49331f25137a38dfef9612d3e4 ]

    Enhance btrfs_verify_dev_extents() to remember previous checked dev
    extents, so it can verify no dev extents can overlap.

    Analysis from Hans:

    "Imagine allocating a DATA|DUP chunk.

    In the chunk allocator, we first set...
    max_stripe_size = SZ_1G;
    max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
    ... which is 10GiB.

    Then...
    /* we don't want a chunk larger than 10% of writeable space */
    max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
    max_chunk_size);

    Imagine we only have one 7880MiB block device in this filesystem. Now
    max_chunk_size is down to 788MiB.

    The next step in the code is to search for max_stripe_size * dev_stripes
    amount of free space on the device, which is in our example 1GiB * 2 =
    2GiB. Imagine the device has exactly 1578MiB free in one contiguous
    piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail

    Next we recalculate the stripe_size (which is actually the device extent
    length), based on the actual maximum amount of available raw disk space:
    stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);

    stripe_size is now 789MiB

    Next we do...
    data_stripes = num_stripes / ncopies
    ...where data_stripes ends up as 1, because num_stripes is 2 (the amount
    of device extents we're going to have), and DUP has ncopies 2.

    Next there's a check...
    if (stripe_size * data_stripes > max_chunk_size)
    ...which matches because 789MiB * 1 > 788MiB.

    We go into the if code, and next is...
    stripe_size = div_u64(max_chunk_size, data_stripes);
    ...which resets stripe_size to max_chunk_size: 788MiB

    Next is a fun one...
    /* bump the answer up to a 16MB boundary */
    stripe_size = round_up(stripe_size, SZ_16M);
    ...which changes stripe_size from 788MiB to 800MiB.

    We're not done changing stripe_size yet...
    /* But don't go higher than the limits we found while searching
    * for free extents
    */
    stripe_size = min(devices_info[ndevs - 1].max_avail,
    stripe_size);

    This is bad. max_avail is twice the stripe_size (we need to fit 2 device
    extents on the same device for DUP).

    The result here is that 800MiB < 1578MiB, so it's unchanged. However,
    the resulting DUP chunk will need 1600MiB disk space, which isn't there,
    and the second dev_extent might extend into the next thing (next
    dev_extent? end of device?) for 22MiB.

    The last shown line of code relies on a situation where there's twice
    the value of stripe_size present as value for the variable stripe_size
    when it's DUP. This was actually the case before commit 92e222df7b
    "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
    "[...] in the meantime there's a check to see if the stripe_size does
    not exceed max_chunk_size. Since during this check stripe_size is twice
    the amount as intended, the check will reduce the stripe_size to
    max_chunk_size if the actual correct to be used stripe_size is more than
    half the amount of max_chunk_size."

    In the previous version of the code, the 16MiB alignment (why is this
    done, by the way?) would result in a 50% chance that it would actually
    do an 8MiB alignment for the individual dev_extents, since it was
    operating on double the size. Does this matter?

    Does it matter that stripe_size can be set to anything which is not
    16MiB aligned because of the amount of remaining available disk space
    which is just taken?

    What is the main purpose of this round_up?

    The most straightforward thing to do seems something like...
    stripe_size = min(
    div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
    stripe_size
    )
    ..just putting half of the max_avail into stripe_size."

    Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
    Reported-by: Hans van Kranenburg
    Signed-off-by: Qu Wenruo
    [ add analysis from report ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Qu Wenruo
     

17 Jan, 2019

1 commit

  • commit 5a8067c0d17feb7579db0476191417b441a8996e upstream.

    The available allocation bits members from struct btrfs_fs_info are
    protected by a sequence lock, and when starting balance we access them
    incorrectly in two different ways:

    1) In the read sequence lock loop at btrfs_balance() we use the values we
    read from fs_info->avail_*_alloc_bits and we can immediately do actions
    that have side effects and can not be undone (printing a message and
    jumping to a label). This is wrong because a retry might be needed, so
    our actions must not have side effects and must be repeatable as long
    as read_seqretry() returns a non-zero value. In other words, we were
    essentially ignoring the sequence lock;

    2) Right below the read sequence lock loop, we were reading the values
    from avail_metadata_alloc_bits and avail_data_alloc_bits without any
    protection from concurrent writers, that is, reading them outside of
    the read sequence lock critical section.

    So fix this by making sure we only read the available allocation bits
    while in a read sequence lock critical section and that what we do in the
    critical section is repeatable (has nothing that can not be undone) so
    that any eventual retry that is needed is handled properly.

    Fixes: de98ced9e743 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
    Fixes: 14506127979a ("btrfs: fix a bogus warning when converting only data or metadata")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

23 Aug, 2018

1 commit

  • Test case btrfs/164 reports use-after-free:

    [ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
    ..
    [ 6712.195423] btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
    [ 6712.201424] btrfs_commit_transaction+0x57d/0xa90 [btrfs]
    [ 6712.206999] btrfs_rm_device+0x627/0x850 [btrfs]
    [ 6712.211800] btrfs_ioctl+0x2b03/0x3120 [btrfs]

    Reason for this is that btrfs_shrink_device adds the resized device to
    the fs_devices::resized_devices after it has called the last commit
    transaction.

    So the list fs_devices::resized_devices is not empty when
    btrfs_shrink_device returns. Now the parent function
    btrfs_rm_device calls:

    btrfs_close_bdev(device);
    call_rcu(&device->rcu, free_device_rcu);

    and then does the transactio ncommit. It goes through the
    fs_devices::resized_devices in btrfs_update_commit_device_size and
    leads to use-after-free.

    Fix this by making sure btrfs_shrink_device calls the last needed
    btrfs_commit_transaction before the return. This is consistent with what
    the grow counterpart does and this makes sure the on-disk state is
    persistent when the function returns.

    Reported-by: Lu Fengqi
    Tested-by: Lu Fengqi
    Signed-off-by: Anand Jain
    Reviewed-by: David Sterba
    [ update changelog ]
    Signed-off-by: David Sterba

    Anand Jain
     

06 Aug, 2018

33 commits