25 Jun, 2016

3 commits

  • …git/mason/linux-btrfs

    Pull btrfs fixes part 2 from Chris Mason:
    "This has one patch from Omar to bring iterate_shared back to btrfs.

    We have a tree of work we queue up for directory items and it doesn't
    lend itself well to shared access. While we're cleaning it up, Omar
    has changed things to use an exclusive lock when there are delayed
    items"

    * 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

    Linus Torvalds
     
  • Pull btrfs fixes from Chris Mason:
    "I have a two part pull this time because one of the patches Dave
    Sterba collected needed to be against v4.7-rc2 or higher (we used
    rc4). I try to make my for-linus-xx branch testable on top of the
    last major so we can hand fixes to people on the list more easily, so
    I've split this pull in two.

    This first part has some fixes and two performance improvements that
    we've been testing for some time.

    Josef's two performance fixes are most notable. The transid tracking
    patch makes a big improvement on pretty much every workload"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: Force stripesize to the value of sectorsize
    btrfs: fix disk_i_size update bug when fallocate() fails
    Btrfs: fix error handling in map_private_extent_buffer
    Btrfs: fix error return code in btrfs_init_test_fs()
    Btrfs: don't do nocow check unless we have to
    btrfs: fix deadlock in delayed_ref_async_start
    Btrfs: track transid for delayed ref flushing

    Linus Torvalds
     
  • Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"")
    backed out the conversion to ->iterate_shared() for Btrfs because the
    delayed inode handling in btrfs_real_readdir() is racy. However, we can
    still do readdir in parallel if there are no delayed nodes.

    This is a temporary fix which upgrades the shared inode lock to an
    exclusive lock only when we have delayed items until we come up with a
    more complete solution. While we're here, rename the
    btrfs_{get,put}_delayed_items functions to make it very clear that
    they're just for readdir.

    Tested with xfstests and by doing a parallel kernel build:

    while make tinyconfig && make -j4 && git clean dqfx; do
    :
    done

    along with a bunch of parallel finds in another shell:

    while true; do
    for ((i=0; i/dev/null &
    done
    wait
    done

    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Omar Sandoval
     

24 Jun, 2016

4 commits

  • Btrfs code currently assumes stripesize to be same as
    sectorsize. However Btrfs-progs (until commit
    df05c7ed455f519e6e15e46196392e4757257305) has been setting
    btrfs_super_block->stripesize to a value of 4096.

    This commit makes sure that the value of btrfs_super_block->stripesize
    is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
    to sectorsize.

    Signed-off-by: Chandan Rajendra
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Chandan Rajendra
     
  • When doing truncate operation, btrfs_setsize() will first call
    truncate_setsize() to set new inode->i_size, but if later
    btrfs_truncate() fails, btrfs_setsize() will call
    "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
    inmemory inode size, now bug occurs. It's because for truncate
    case btrfs_ordered_update_i_size() directly uses inode->i_size
    to update BTRFS_I(inode)->disk_i_size, indeed we should use the
    "offset" argument to update disk_i_size. Here is the call graph:
    ==>btrfs_truncate()
    ====>btrfs_truncate_inode_items()
    ======>btrfs_ordered_update_i_size(inode, last_size, NULL);
    Here btrfs_ordered_update_i_size()'s offset argument is last_size.

    And below test case can reveal this bug:

    dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
    dev=$(losetup --show -f fs.img)
    mkdir -p /mnt/mntpoint
    mkfs.btrfs -f $dev
    mount $dev /mnt/mntpoint
    cd /mnt/mntpoint

    echo "workdir is: /mnt/mntpoint"
    blocksize=$((128 * 1024))
    dd if=/dev/zero of=testfile bs=$blocksize count=1
    sync
    count=$((17*1024*1024*1024/blocksize))
    echo "file size is:" $((count*blocksize))
    for ((i = 1; i /dev/null
    done
    sync

    truncate --size 0 testfile
    ls -l testfile
    du -sh testfile
    exit

    In this case, truncate operation will fail for enospc reason and
    "du -sh testfile" returns value greater than 0, but testfile's
    size is 0, we need to reflect correct inode->i_size.

    Signed-off-by: Wang Xiaoguang
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Wang Xiaoguang
     
  • map_private_extent_buffer() can return -EINVAL in two different cases,
    1. when the requested contents span two pages if nodesize is larger
    than pagesize,
    2. when it detects something insane.

    The 2nd one used to be only a WARN_ON(1), and we decided to return a error
    to callers, but we didn't fix up all its callers, which will be
    addressed by this patch.

    Without this, btrfs may end up with 'general protection', ie.
    reading invalid memory.

    Reported-by: Vegard Nossum
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Fix to return a negative error code from the kern_mount() error handling
    case instead of 0(ret is set to 0 by register_filesystem), as done
    elsewhere in this function.

    Signed-off-by: Wei Yongjun
    Reviewed-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Wei Yongjun
     

23 Jun, 2016

3 commits

  • Before we write into prealloc/nocow space we have to make sure that there are no
    references to the extents we are writing into, which means checking the extent
    tree and csum tree in the case of nocow. So we don't want to do the nocow dance
    unless we can't reserve data space, since it's a serious drag on performance.
    With the following sequence

    fallocate -l10737418240 /mnt/btrfs-test/file
    cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
    fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
    --end_fsync=1

    we get the worst case scenario where we have to fall back on to doing the check
    anyway.

    Without this patch
    lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
    write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

    With this patch
    lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
    write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

    We get twice the throughput, half of the runtime, and half of the average
    latency. Thanks,

    Signed-off-by: Josef Bacik
    [ PAGE_CACHE_ removal related fixups ]
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • "Btrfs: track transid for delayed ref flushing" was deadlocking on
    btrfs_attach_transaction because its not safe to call from the async
    delayed ref start code. This commit brings back btrfs_join_transaction
    instead and checks for a blocked commit.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Chris Mason
     
  • Using the offwakecputime bpf script I noticed most of our time was spent waiting
    on the delayed ref throttling. This is what is supposed to happen, but
    sometimes the transaction can commit and then we're waiting for throttling that
    doesn't matter anymore. So change this stuff to be a little smarter by tracking
    the transid we were in when we initiated the throttling. If the transaction we
    get is different then we can just bail out. This resulted in a 50% speedup in
    my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
    over the entire run (which is about 30 minutes). Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

18 Jun, 2016

9 commits

  • Pull btrfs fixes from Chris Mason:
    "The most user visible change here is a fix for our recent superblock
    validation checks that were causing problems on non-4k pagesized
    systems"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize
    btrfs: remove build fixup for qgroup_account_snapshot
    btrfs: use new error message helper in qgroup_account_snapshot
    btrfs: avoid blocking open_ctree from cleaner_kthread
    Btrfs: don't BUG_ON() in btrfs_orphan_add
    btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
    Btrfs: check if extent buffer is aligned to sectorsize
    btrfs: Use correct format specifier

    Linus Torvalds
     
  • Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence
    restricting stripesize to be equal to sectorsize would cause super block
    validation to return an error on architectures where PAGE_SIZE is not
    equal to 4096.

    Hence as a workaround, this commit allows stripesize to be set to 4096
    bytes.

    Signed-off-by: Chandan Rajendra
    Signed-off-by: David Sterba

    Chandan Rajendra
     
  • Introduced in 2c1984f244838477aab ("btrfs: build fixup for
    qgroup_account_snapshot") as temporary bisectability build fixup.

    Signed-off-by: David Sterba

    David Sterba
     
  • We've renamed btrfs_std_error, this one is left from last merge.

    Signed-off-by: David Sterba

    David Sterba
     
  • This fixes a problem introduced in commit 2f3165ecf103599f82bf0ea254039db335fb5005
    "btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".

    open_ctree eventually calls btrfs_replay_log which in turn calls
    btrfs_commit_super which tries to lock the cleaner_mutex, causing a
    recursive mutex deadlock during mount.

    Instead of playing whack-a-mole trying to keep up with all the
    functions that may want to lock cleaner_mutex, put all the cleaner_mutex
    lockers back where they were, and attack the problem more directly:
    keep cleaner_kthread asleep until the filesystem is mounted.

    When filesystems are mounted read-only and later remounted read-write,
    open_ctree did not set fs_info->open and neither does anything else.
    Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
    nor cleaner_kthread get confused by the common case of "/" filesystem
    read-only mount followed by read-write remount.

    Signed-off-by: Zygo Blaxell
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Zygo Blaxell
     
  • This is just a screwup for developers, so change it to an ASSERT() so developers
    notice when things go wrong and deal with the error appropriately if ASSERT()
    isn't enabled. Thanks,

    Signed-off-by: Josef Bacik
    Reviewed-by: Mark Fasheh
    Signed-off-by: David Sterba

    Josef Bacik
     
  • The test for !trans->blocks_used in btrfs_abort_transaction is
    insufficient to determine whether it's safe to drop the transaction
    handle on the floor. btrfs_cow_block, informed by should_cow_block,
    can return blocks that have already been CoW'd in the current
    transaction. trans->blocks_used is only incremented for new block
    allocations. If an operation overlaps the blocks in the current
    transaction entirely and must abort the transaction, we'll happily
    let it clean up the trans handle even though it may have modified
    the blocks and will commit an incomplete operation.

    In the long-term, I'd like to do closer tracking of when the fs
    is actually modified so we can still recover as gracefully as possible,
    but that approach will need some discussion. In the short term,
    since this is the only code using trans->blocks_used, let's just
    switch it to a bool indicating whether any blocks were used and set
    it when should_cow_block returns false.

    Cc: stable@vger.kernel.org # 3.4+
    Signed-off-by: Jeff Mahoney
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
    via alloc_extent_buffer(). An unaligned eb can have more pages than it
    should have, which ends up extent buffer's leak or some corrupted content
    in extent buffer.

    This adds a warning to let us quickly know what was happening.

    Now that alloc_extent_buffer() no more returns NULL, this changes its
    caller and callers of its caller to match with the new error
    handling.

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Liu Bo
     
  • Component mirror_num of struct btrfsic_block is defined
    as unsigned int. Use %u as format specifier.

    Signed-off-by: Heinrich Schuchardt
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Heinrich Schuchardt
     

11 Jun, 2016

1 commit

  • Pull btrfs fixes from Chris Mason:
    "Has some fixes and some new self tests for btrfs. The self tests are
    usually disabled in the .config file (unless you're doing btrfs dev
    work), and this bunch is meant to find problems with the 64K page size
    patches.

    Jeff has a patch to help people see if they are using the hardware
    assist crc32c module, which really helps us nail down problems when
    people ask why crcs are using so much CPU.

    Otherwise, it's small fixes"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system
    Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
    Btrfs: self-tests: Use macros instead of constants and add missing newline
    Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
    Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE
    btrfs: advertise which crc32c implementation is being used at module load
    Btrfs: add validadtion checks for chunk loading
    Btrfs: add more validation checks for superblock
    Btrfs: clear uptodate flags of pages in sys_array eb
    Btrfs: self-tests: Support non-4k page size
    Btrfs: Fix integer overflow when calculating bytes_per_bitmap
    Btrfs: test_check_exists: Fix infinite loop when searching for free space entries
    Btrfs: end transaction if we abort when creating uuid root
    btrfs: Use __u64 in exported linux/btrfs.h.

    Linus Torvalds
     

09 Jun, 2016

2 commits


06 Jun, 2016

9 commits

  • In __test_eb_bitmaps(), we write random data to a bitmap. Then copy
    the bitmap to another bitmap that resides inside an extent buffer.
    Later we verify the values of corresponding bits in the bitmap and the
    bitmap inside the extent buffer. However, extent_buffer_test_bit()
    reads in byte granularity while test_bit() reads in unsigned long
    granularity. Hence we end up comparing wrong bits on big-endian
    systems such as ppc64. This commit fixes the issue by reading the
    bitmap in byte granularity.

    Reviewed-by: Josef Bacik
    Reviewed-by: Chandan Rajendra
    Signed-off-by: Feifei Xu
    Signed-off-by: David Sterba

    Feifei Xu
     
  • With 64K sectorsize, 1G sized block group cannot span across bitmaps.
    To execute test_bitmaps() function, this commit allocates
    "BITS_PER_BITMAP * sectorsize + PAGE_SIZE" sized block group.

    Reviewed-by: Josef Bacik
    Reviewed-by: Chandan Rajendra
    Signed-off-by: Feifei Xu
    Signed-off-by: David Sterba

    Feifei Xu
     
  • This commit replaces numerical constants with appropriate
    preprocessor macros.

    Reviewed-by: Josef Bacik
    Signed-off-by: Chandan Rajendra
    Signed-off-by: Feifei Xu
    Signed-off-by: David Sterba

    Feifei Xu
     
  • To test all possible sectorsizes, this commit adds a sectorsize
    array. This commit executes the tests for all possible sectorsizes and
    nodesizes.

    Reviewed-by: Josef Bacik
    Signed-off-by: Chandan Rajendra
    Signed-off-by: Feifei Xu
    Signed-off-by: David Sterba

    Feifei Xu
     
  • On ppc64, PAGE_SIZE is 64k which is same as BTRFS_MAX_METADATA_BLOCKSIZE.
    In such a scenario, we will never be able to have an extent buffer
    containing more than one page. Hence in such cases this commit does not
    execute the page straddling tests.

    Reviewed-by: Josef Bacik
    Signed-off-by: Feifei Xu
    Signed-off-by: Chandan Rajendra
    Signed-off-by: David Sterba

    Feifei Xu
     
  • Since several architectures support hardware-accelerated crc32c
    calculation, it would be nice to confirm that btrfs is actually using it.

    We can see an elevated use count for the module, but it doesn't actually
    show who the users are. This patch simply prints the name of the driver
    after successfully initializing the shash.

    Signed-off-by: Jeff Mahoney
    [ added a helper and used in module load-time message ]
    Signed-off-by: David Sterba

    Jeff Mahoney
     
  • To prevent fuzzed filesystem images from panic the whole system,
    we need various validation checks to refuse to mount such an image
    if btrfs finds any invalid value during loading chunks, including
    both sys_array and regular chunks.

    Note that these checks may not be sufficient to cover all corner cases,
    feel free to add more checks.

    Reported-by: Vegard Nossum
    Reported-by: Quentin Casasnovas
    Reviewed-by: David Sterba
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
     
  • This adds validation checks for super_total_bytes, super_bytes_used and
    super_stripesize, super_num_devices.

    Reported-by: Vegard Nossum
    Reported-by: Quentin Casasnovas
    Reviewed-by: David Sterba
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
     
  • We set uptodate flag to pages in the temporary sys_array eb,
    but do not clear the flag after free eb. As the special
    btree inode may still hold a reference on those pages, the
    uptodate flag can remain alive in them.

    If btrfs_super_chunk_root has been intentionally changed to the
    offset of this sys_array eb, reading chunk_root will read content
    of sys_array and it will skip our beautiful checks in
    btree_readpage_end_io_hook() because of
    "pages of eb are uptodate => eb is uptodate"

    This adds the 'clear uptodate' part to force it to read from disk.

    Reviewed-by: Josef Bacik
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba

    Liu Bo
     

05 Jun, 2016

1 commit

  • Pull btrfs fixes from Chris Mason:
    "The important part of this pull is Filipe's set of fixes for btrfs
    device replacement. Filipe fixed a few issues seen on the list and a
    number he found on his own"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
    Btrfs: fix race between device replace and read repair
    Btrfs: fix race between device replace and discard
    Btrfs: fix race between device replace and chunk allocation
    Btrfs: fix race setting block group back to RW mode during device replace
    Btrfs: fix unprotected assignment of the left cursor for device replace
    Btrfs: fix race setting block group readonly during device replace
    Btrfs: fix race between device replace and block group removal
    Btrfs: fix race between readahead and device replace/removal

    Linus Torvalds
     

04 Jun, 2016

1 commit

  • When dealing with inline extents, btrfs_get_extent will incorrectly try
    to insert a duplicate extent_map. The dup hits -EEXIST from
    add_extent_map, but then we try to merge with the existing one and end
    up trying to insert a zero length extent_map.

    This actually works most of the time, except when there are extent maps
    past the end of the inline extent. rocksdb will trigger this sometimes
    because it preallocates an extent and then truncates down.

    Josef made a script to trigger with xfs_io:

    #!/bin/bash

    xfs_io -f -c "pwrite 0 1000" inline
    xfs_io -c "falloc -k 4k 1M" inline
    xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
    xfs_io -c "fadvise -d 0 1000" inline
    cat inline

    You'll get EIOs trying to read inline after this because add_extent_map
    is returning EEXIST

    Signed-off-by: Chris Mason

    Chris Mason
     

03 Jun, 2016

3 commits

  • self-tests code assumes 4k as the sectorsize and nodesize. This commit
    fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
    page sized systems (e.g. ppc64).

    Reviewed-by: Josef Bacik
    Signed-off-by: Feifei Xu
    Signed-off-by: Chandan Rajendra
    Signed-off-by: David Sterba

    Feifei Xu
     
  • On ppc64, bytes_per_bitmap will be (65536*8*65536). Hence append UL to
    fix integer overflow.

    Reviewed-by: Josef Bacik
    Reviewed-by: Chandan Rajendra
    Signed-off-by: Feifei Xu
    Signed-off-by: David Sterba

    Feifei Xu
     
  • On a ppc64 machine using 64K as the block size, assume that the RB
    tree at btrfs_free_space_ctl->free_space_offset contains following
    two entries:

    1. A bitmap entry having an offset value of 0 and having the bits
    corresponding to the address range [128M+512K, 128M+768K] set.
    2. An extent entry corresponding to the address range
    [128M-256K, 128M-128K]

    In such a scenario, test_check_exists() invoked for checking the
    existence of address range [128M+768K, 256M] can lead to an
    infinite loop as explained below:

    - Checking for the extent entry fails.
    - Checking for a bitmap entry results in the free space info in
    range [128M+512K, 128M+768K] beng returned.
    - rb_prev(info) returns NULL because the bitmap entry starting from
    offset 0 comes first in the RB tree.
    - current_node = bitmap node.
    - while (current_node)
    tmp = rb_next(bitmap_node);/*tmp is extent based free space entry*/
    Since extent based free space entry's last address is smaller
    than the address being searched for (i.e. 128M+768K) we
    incorrectly again obtain the extent node as the "next right node"
    of the RB tree and thus end up looping infinitely.

    This patch fixes the issue by checking the "tmp" variable which point
    to the most recently searched free space node.

    Reviewed-by: Josef Bacik
    Reviewed-by: Chandan Rajendra
    Signed-off-by: Feifei Xu
    Signed-off-by: David Sterba

    Feifei Xu
     

01 Jun, 2016

1 commit


31 May, 2016

2 commits

  • While we are finishing a device replace operation we can have a concurrent
    task trying to do a read repair operation, in which case it will call
    btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
    points to the source device of the device replace operation. This allows
    for the read repair task to dereference the stripe's device pointer after
    the device replace operation has freed the source device, resulting in
    an invalid memory access. This is similar to the problem solved by my
    previous patch in the same series and named "Btrfs: fix race between
    device replace and discard".

    So fix this by surrounding the call to btrfs_map_block() and the code
    that uses the returned struct btrfs_bio with calls to
    btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
    proper serialization with the finishing phase of the device replace
    operation.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana
     
  • While we are finishing a device replace operation, we can make a discard
    operation (fs mounted with -o discard) do an invalid memory access like
    the one reported by the following trace:

    [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
    [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
    processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
    virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
    [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
    [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
    [ 3206.388595] RIP: 0010:[] [] blkdev_issue_discard+0x5c/0x2a7
    [ 3206.388595] RSP: 0018:ffff880171b9bb80 EFLAGS: 00010246
    [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
    [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
    [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
    [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
    [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
    [ 3206.388595] FS: 00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
    [ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
    [ 3206.388595] Stack:
    [ 3206.388595] 0000000000004878 0000000000000000 0000000002400040 0000000000000000
    [ 3206.388595] 0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
    [ 3206.388595] ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
    [ 3206.388595] Call Trace:
    [ 3206.388595] [] btrfs_issue_discard+0x12f/0x143 [btrfs]
    [ 3206.388595] [] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
    [ 3206.388595] [] btrfs_discard_extent+0x87/0xde [btrfs]
    [ 3206.388595] [] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
    [ 3206.388595] [] ? __mutex_unlock_slowpath+0x150/0x15b
    [ 3206.388595] [] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
    [ 3206.388595] [] ? __mutex_unlock_slowpath+0x150/0x15b
    [ 3206.388595] [] btrfs_sync_file+0x38f/0x428 [btrfs]
    [ 3206.388595] [] vfs_fsync_range+0x8c/0x9e
    [ 3206.388595] [] vfs_fsync+0x1c/0x1e
    [ 3206.388595] [] do_fsync+0x31/0x4a
    [ 3206.388595] [] SyS_fsync+0x10/0x14
    [ 3206.388595] [] entry_SYSCALL_64_fastpath+0x18/0xa8
    [ 3206.388595] [] ? time_hardirqs_off+0x9/0x14
    [ 3206.388595] [] ? trace_hardirqs_off_caller+0x1f/0xaa

    This happens because when we call btrfs_map_block() from
    btrfs_discard_extent() to get a btrfs_bio structure, the device replace
    operation has not finished yet, but before we use the device of one of the
    stripes from the returned btrfs_bio structure, the device object is freed.

    This is illustrated by the following diagram.

    CPU 1 CPU 2

    btrfs_dev_replace_start()

    (...)

    btrfs_dev_replace_finishing()

    btrfs_start_transaction()
    btrfs_commit_transaction()

    (...)

    btrfs_sync_file()
    btrfs_start_transaction()

    (...)

    btrfs_commit_transaction()
    btrfs_finish_extent_commit()
    btrfs_discard_extent()
    btrfs_map_block()
    --> returns a struct btrfs_bio
    with a stripe that has a
    device field pointing to
    source device of the replace
    operation (the device that
    is being replaced)

    mutex_lock(&uuid_mutex)
    mutex_lock(&fs_info->fs_devices->device_list_mutex)
    mutex_lock(&fs_info->chunk_mutex)

    btrfs_dev_replace_update_device_in_mapping_tree()
    --> iterates the mapping tree and for each
    extent map that has a stripe pointing to
    the source device, it updates the stripe
    to point to the target device instead

    btrfs_rm_dev_replace_blocked()
    --> waits for fs_info->bio_counter to go down to 0

    btrfs_rm_dev_replace_remove_srcdev()
    --> removes source device from the list of devices

    mutex_unlock(&fs_info->chunk_mutex)
    mutex_unlock(&fs_info->fs_devices->device_list_mutex)
    mutex_unlock(&uuid_mutex)

    btrfs_rm_dev_replace_free_srcdev()
    --> frees the source device

    --> iterates over all stripes
    of the returned struct
    btrfs_bio
    --> for each stripe it
    dereferences its device
    pointer
    --> it ends up finding a
    pointer to the device
    used as the source
    device for the replace
    operation and that was
    already freed

    So fix this by surrounding the call to btrfs_map_block(), and the code
    that uses the returned struct btrfs_bio, with calls to
    btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
    the finishing phase of the device replace operation blocks until the
    the bio counter decreases to zero before it frees the source device.
    This is the same approach we do at btrfs_map_bio() for example.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana
     

30 May, 2016

1 commit

  • While iterating and copying extents from the source device, the device
    replace code keeps adjusting a left cursor that is used to make sure that
    once we finish processing a device extent, any future writes to extents
    from the corresponding block group will get into both the source and
    target devices. This left cursor is also used for resuming the device
    replace operation at mount time.

    However using this left cursor to decide whether writes go into both
    devices or only the source device is not enough to guarantee we don't
    miss copying extents into the target device. There are two cases where
    the current approach fails. The first one is related to when there are
    holes in the device and they get allocated for new block groups while
    the device replace operation is iterating the device extents (more on
    this explained below). The second one is that when that loop over the
    device extents finishes, we start dellaloc, wait for all ordered extents
    and then commit the current transaction, we might have got new block
    groups allocated that are now using a device extent that has an offset
    greater then or equals to the value of the left cursor, in which case
    writes to extents belonging to these new block groups will get issued
    only to the source device.

    For the first case where the current approach of using a left cursor
    fails, consider the source device currently has the following layout:

    [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
    3Gb 4Gb 5Gb

    While we are iterating the device extents from the source device using
    the commit root of the device tree, the following happens:

    CPU 1 CPU 2

    scrub_enumerate_chunks()
    --> searches the device tree for
    extents belonging to the source
    device using the device tree's
    commit root
    --> 1st iteration finds extent belonging to
    block group A

    --> sets block group A to RO mode
    (btrfs_inc_block_group_ro)

    --> sets cursor left to found_key.offset
    which is 3Gb

    --> scrub_chunk() starts
    copies all allocated extents from
    block group's A stripe at source
    device into target device

    btrfs_alloc_chunk()
    --> allocates device extent
    in the range [4Gb, 5Gb[
    from the source device for
    a new block group C

    extent allocated from block
    group C for a direct IO,
    buffered write or btree node/leaf

    extent is written to, perhaps
    in response to a writepages()
    call from the VM or directly
    through direct IO

    the write is made only against
    the source device and not against
    the target device because the
    extent's offset is in the interval
    [4Gb, 5Gb[ which is larger then
    the value of cursor_left (3Gb)

    --> scrub_chunks() finishes

    --> updates left cursor from 3Gb to
    4Gb

    --> btrfs_dec_block_group_ro() sets
    block group A back to RW mode

    --> 2nd iteration finds extent belonging to
    block group B - it did not find the new
    extent in the range [4Gb, 5Gb[ for block
    group C because we are using the device
    tree's commit root or even because the
    block group's items are not all yet
    inserted in the respective btrees, that is,
    the block group is still attached to some
    transaction handle's new_bgs list and
    btrfs_create_pending_block_groups() was
    not called yet against that transaction
    handle, so the device extent items were
    not yet inserted into the devices tree

    --> so we end not copying anything from the newly
    allocated device extent from the source device
    to the target device

    So fix this by making __btrfs_map_block() always redirect writes to the
    target device as well, independently of the left cursor's value. With
    this change the left cursor is now used only for the purpose of tracking
    progress and allow a mount operation to resume a device replace.

    Signed-off-by: Filipe Manana
    Reviewed-by: Josef Bacik

    Filipe Manana