11 Oct, 2013

3 commits

  • Now we don't drop all the deleted snapshots/subvolumes before the space
    balance, which means we have to relocate the space held by the dead
    snapshots/subvolumes. So we must insert them into the fs radix tree, or we
    would forget to commit their changes when doing a transaction commit, and
    that would corrupt the metadata.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Liu fixed part of this problem and unfortunately I steered him in slightly the
    wrong direction and so didn't completely fix the problem. The problem is we
    limit the size of the delalloc range we are looking for to max bytes and then we
    try to lock that range. If we fail to lock the pages in that range we will
    shrink the max bytes to a single page and re-loop. However if our first page is
    inside of the delalloc range then we will end up limiting the end of the range
    to a point before our first page. This is illustrated below:

    [0 -------- delalloc range --------- 256mb]
                                  [page]

    So find_delalloc_range will return with delalloc_start as 0 and end as 128mb,
    and then we will notice that delalloc_start < *start and adjust it up, but not
    adjust delalloc_end up, so things go sideways. To fix this we need to not limit
    the max bytes in find_delalloc_range, but in find_lock_delalloc_range instead;
    that way we don't end up with this confusion. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • btrfs_rename was using the root of the old dir instead of the root of the new
    dir when checking for a hash collision, so if you tried to move a file into a
    subvol it would freak out because it would see the file you are trying to move
    in its current root. This fixes the bug where the following sequence would fail:

    btrfs subvol create test1
    btrfs subvol create test2
    mv test1 test2

    Thanks to Chris Murphy for catching this,

    Cc: stable@vger.kernel.org
    Reported-by: Chris Murphy
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

05 Oct, 2013

4 commits

  • free_device rcu callback, scheduled from btrfs_rm_dev_replace_srcdev,
    can be processed before btrfs_scratch_superblock is called, which would
    result in a use-after-free on btrfs_device contents. Fix this by
    zeroing the superblock before the rcu callback is registered.

    Cc: Stefan Behrens
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Josef Bacik

    Ilya Dryomov
     
  • The current implementation of worker threads in Btrfs has races in
    worker stopping code, which cause all kinds of panics and lockups when
    running btrfs/011 xfstest in a loop. The problem is that
    btrfs_stop_workers is unsynchronized with respect to check_idle_worker,
    check_busy_worker and __btrfs_start_workers.

    E.g., check_idle_worker race flow:

    btrfs_stop_workers():                    check_idle_worker(aworker):
    - grabs the lock
    - splices the idle list into the
      working list
    - removes the first worker from the
      working list
    - releases the lock to wait for
      its kthread's completion
                                             - grabs the lock
                                             - if aworker is on the working list,
                                               moves aworker from the working
                                               list to the idle list
                                             - releases the lock
    - grabs the lock
    - puts the worker
    - removes the second worker from the
      working list
    ......
    btrfs_stop_workers returns, aworker is on the idle list
    FS is umounted, memory is freed
    ......
    aworker is woken up, fireworks ensue

    With this applied, I wasn't able to trigger the problem in 48 hours,
    whereas previously I could reliably reproduce at least one of these
    races within an hour.

    Reported-by: David Sterba
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Josef Bacik

    Ilya Dryomov
     
  • The crash[1] is found by xfstests/generic/208 with "-o compress";
    it's not reproduced every time, but it does panic.

    The bug is quite interesting, it's actually introduced by a recent commit
    (573aecafca1cf7a974231b759197a1aebcf39c2a,
    Btrfs: actually limit the size of delalloc range).

    Btrfs implements delay allocation, so during writeback, we
    (1) get a page A and lock it
    (2) search the state tree for delalloc bytes and lock all pages within the range
    (3) process the delalloc range, including find disk space and create
    ordered extent and so on.
    (4) submit the page A.

    It runs well in normal cases, but if we're in a racy case, e.g.
    buffered compressed writes and aio-dio writes,
    sometimes we may fail to lock all pages in the 'delalloc' range,
    in which case we need to fall back to searching the state tree again with
    a smaller range limit (max_bytes = PAGE_CACHE_SIZE - offset).

    The mentioned commit has a side effect: in the fallback case,
    we can find delalloc bytes before the index of the page we already have locked,
    so we end up in the case of (delalloc_end <= *start).

    This ends with not locking delalloc pages but making ->writepage still
    process them, and the crash happens.

    This fixes it by treating the search as having found nothing and returning
    to the caller, as the caller knows how to deal with it properly.

    [1]:
    ------------[ cut here ]------------
    kernel BUG at mm/page-writeback.c:2170!
    [...]
    CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8
    [...]
    RIP: 0010:[] [] clear_page_dirty_for_io+0x1e/0x83
    [...]
    [ 4934.248731] Stack:
    [ 4934.248731] ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
    [ 4934.248731] 0000000000000000 0000000000000000 0000000000000fff 0000000000000620
    [ 4934.248731] ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
    [ 4934.248731] Call Trace:
    [ 4934.248731] [] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
    [ 4934.248731] [] compress_file_range+0x1dc/0x4cb [btrfs]
    [ 4934.248731] [] ? detach_if_pending+0x22/0x4b
    [ 4934.248731] [] async_cow_start+0x35/0x53 [btrfs]
    [ 4934.248731] [] worker_loop+0x14b/0x48c [btrfs]
    [ 4934.248731] [] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
    [ 4934.248731] [] kthread+0x8d/0x95
    [ 4934.248731] [] ? kthread_freezable_should_stop+0x43/0x43
    [ 4934.248731] [] ret_from_fork+0x7c/0xb0
    [ 4934.248731] [] ? kthread_freezable_should_stop+0x43/0x43
    [ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
    [ 4934.248731] RIP [] clear_page_dirty_for_io+0x1e/0x83
    [ 4934.248731] RSP
    [ 4934.280307] ---[ end trace 36f06d3f8750236a ]---

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • If we crash with a log, remount and recover that log, and then crash before we
    can commit another transaction, we will get transid verify errors on the next
    mount. This is because we were not zeroing out the log when we committed the
    transaction after recovery. This is ok as long as we commit another transaction
    at some point in the future, but if you abort or something else goes wrong you
    can end up in this weird state because the recovery stuff says that the tree log
    should have a generation+1 of the super generation, which won't be the case for
    the transaction that was started for recovery. Fix this by removing the check
    and _always_ zeroing out the log portion of the super when we commit a transaction.
    This fixes the transid verify issues I was seeing with my force errors tests.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

21 Sep, 2013

26 commits

  • Users have been complaining of the uuid tree stuff warning that there is no uuid
    root when trying to do snapshot operations. This is because if you mount -o ro
    we will not create the uuid tree. But then if you mount -o rw,remount we will
    still not create it and then any subsequent snapshot/subvol operations you try
    to do will fail gloriously. Fix this by creating the uuid_root on remount rw if
    it was not already there. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data
    back to its argument struct. Unfortunately, not all architectures provide
    __put_user_unaligned(), so compiles break on them if btrfs is selected.

    Instead, just copy the whole struct in / out at the start and end of
    operations, respectively.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Mark Fasheh
     
  • Commit 2bc5565286121d2a77ccd728eb3484dff2035b58 (Btrfs: don't update atime on
    RO subvolumes) ensures that the access time of an inode is not updated when
    the inode lives in a read-only subvolume.
    However, if a directory on a read-only subvolume is accessed, the atime is
    updated. This results in a write operation to a read-only subvolume. I
    believe that access times should never be updated on read-only subvolumes.

    To reproduce:

    # mkfs.btrfs -f /dev/dm-3
    (...)
    # mount /dev/dm-3 /mnt
    # btrfs subvol create /mnt/sub
    Create subvolume '/mnt/sub'
    # mkdir /mnt/sub/dir
    # echo "abc" > /mnt/sub/dir/file
    # btrfs subvol snapshot -r /mnt/sub /mnt/rosnap
    Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap'
    # stat /mnt/rosnap/dir
    File: `/mnt/rosnap/dir'
    Size: 8 Blocks: 0 IO Block: 4096 directory
    Device: 16h/22d Inode: 257 Links: 1
    Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2013-09-11 07:21:49.389157126 -0400
    Modify: 2013-09-11 07:22:02.330156079 -0400
    Change: 2013-09-11 07:22:02.330156079 -0400
    # ls /mnt/rosnap/dir
    file
    # stat /mnt/rosnap/dir
    File: `/mnt/rosnap/dir'
    Size: 8 Blocks: 0 IO Block: 4096 directory
    Device: 16h/22d Inode: 257 Links: 1
    Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2013-09-11 07:22:56.797151670 -0400
    Modify: 2013-09-11 07:22:02.330156079 -0400
    Change: 2013-09-11 07:22:02.330156079 -0400

    Reported-by: Koen De Wit
    Signed-off-by: Guangyu Sun
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Guangyu Sun
     
  • The kernel log entries for device label %s and device fsid %pU
    are missing the btrfs: prefix. Add those here.

    Signed-off-by: Frank Holton
    Reviewed-by: David Sterba
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Frank Holton
     
  • It's still possible to flip the filesystem into RW mode after it's
    remounted RO due to an abort. There are lots of places that check for
    the superblock error bit and will not write data, but we should not let
    the filesystem appear read-write.

    Signed-off-by: David Sterba
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    David Sterba
     
  • This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default
    subvolume by passing a subvolume id of 0.

    Signed-off-by: chandan
    Reviewed-by: David Sterba
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    chandan
     
  • In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns
    a negative error (e.g. -ENOMEM via btrfs_log_inode()), we would
    return without ending/freeing the transaction.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • The BUG() was replaced by btrfs_error() and return -EIO with the
    patch "get rid of one BUG() in write_all_supers()", but the missing
    mutex_unlock() was overlooked.

    The 0-DAY kernel build service from Intel reported the missing
    unlock which was found by the coccinelle tool:

    fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • We don't do the iput when we fail to allocate our delayed delalloc work in
    __start_delalloc_inodes, fix this.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This isn't used for anything anymore, just remove it.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This is a leftover from how we used to wait for ordered extents, which was to
    grab the inode and then run filemap flush on it. However if we have an ordered
    extent then we already are holding a ref on the inode, and we just use
    btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on
    the inode to start work on the ordered extent. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Forever ago I made the worst case calculator say that we could potentially split
    into 3 blocks for every level on the way down, which isn't right. If we split
    we're only going to get two new blocks, the one we originally cow'ed and the new
    one we're going to split. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This reverts commit 70afa3998c9baed4186df38988246de1abdab56d. It is causing
    performance issues and wasn't actually correct. There were problems with the
    way we flushed delalloc and that was the real cause of the early enospc.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Various people have hit a deadlock when running btrfs/011. This is because when
    replacing nocow extents we will take the i_mutex to make sure nobody messes with
    the file while we are replacing the extent. The problem is we are already
    holding a transaction open, which is a locking inversion, so instead we need to
    save these inodes we find and then process them outside of the transaction.

    Further we can't just lock the inode and assume we are good to go. We need to
    lock the extent range and then read back the extent cache for the inode to make
    sure the extent really still points at the physical block we want. If it
    doesn't we don't have to copy it. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • So if we have dir_index items in the log that means we also have the inode item
    as well, which means that the inode's i_size is correct. However when we
    process dir_index'es we call btrfs_add_link() which will increase the
    directory's i_size for the new entry. To fix this we need to just set the dir
    item's i_size to 0, and then as we find dir_index items we adjust the i_size.
    btrfs_add_link() will do it for new entries, and if the entry already exists we
    can just add the name_len to the i_size ourselves. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • A user reported a bug where his log would not replay because he was getting
    -EEXIST back. This was because he had a file moved into a directory that was
    logged. What happens is the file had a lower inode number, and so it is
    processed first when replaying the log, and so we add the inode ref in for the
    directory it was moved to. But then we process the directory's DIR_INDEX item
    and try to add the inode ref for that inode and it fails because we already
    added it when we replayed the inode. To solve this problem we need to just
    process any DIR_INDEX items we have in the log first so this all is taken care
    of, and then we can replay the rest of the items. With this patch my reproducer
    can remount the file system properly instead of erroring out. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Liu introduced a local copy of the last log commit for an inode to make sure we
    actually log an inode even if a log commit has already taken place. In order to
    make sure we didn't relog the same inode multiple times he set this local copy
    to the current trans when we log the inode, because usually we log the inode and
    then sync the log. The exception to this is during rename, we will relog an
    inode if the name changed and it is already in the log. The problem with this
    is then we go to sync the inode, and our check to see if the inode has already
    been logged is tripped and we don't sync the log. To fix this we need to _also_
    check against the roots last log commit, because it could be less than what is
    in our local copy of the log commit. This fixes a bug where we rename a file
    into a directory and then fsync the directory and then on remount the directory
    is no longer there. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If you just create a directory and then fsync that directory and then pull the
    power plug you will come back up and the directory will not be there. That is
    because we won't actually create directories if we've logged files inside of
    them since they will be created on replay, but in this check we will set our
    logged_trans of our current directory if it happens to be a directory, making us
    think it doesn't need to be logged. Fix the logic to only do this to parent
    directories. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • So forever we have had this thing to limit the amount of delalloc pages we'll
    setup to be written out to 128mb. This is because we have to lock all the pages
    in this range, so anything above this gets a bit unwieldy, and also without a
    limit we'll happily allocate gigantic chunks of disk space. Turns out our check
    for this wasn't quite right, we wouldn't actually limit the chunk we wanted to
    write out, we'd just stop looking for more space after we went over the limit.
    So if you do a giant 20gb dd on my box with lots of ram I could get 2gig
    extents. This is fine normally, except when you go to relocate these extents
    and we can't find enough space to relocate these monster extents, since we have
    to be able to allocate exactly the same sized extent to move it around. So fix
    this by actually enforcing the limit. With this patch I'm no longer seeing
    giant 1.5gb extents. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • With the current code, if the requested size is very large and all the extents
    in the free space cache are small, we will waste lots of CPU time cutting
    the requested size in half and searching the cache again and again until it gets
    down to the size the allocator can return. In fact, we can know the max extent
    size in the cache after the first search, so we needn't cut the size in half
    repeatedly; we can just use the max extent size directly. This saves lots of
    CPU time and improves performance when there are only fragments
    in the free space cache.

    According to my test, if there are only 4KB free space extents in the fs,
    and the total size of those extents is 256MB, we can reduce the execution
    time of the following test from 5.4s to 1.4s.
    dd if=/dev/zero of= bs=1MB count=1 oflag=sync

    Changelog v2 -> v3:
    - fix the problem that we skip block groups whose free space is
    less than what we need.

    Changelog v1 -> v2:
    - address the problem that we return a wrong start position when searching
    the free space in a bitmap.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Signed-off-by: David Sterba
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    David Sterba
     
  • We want to know if there are debugging features compiled in, this may
    affect performance. The message is printed before the sanity checks.

    (This commit message is a copy of David Sterba's commit message when
    he introduced btrfs_print_info()).

    Signed-off-by: Stefan Behrens
    Reviewed-by: David Sterba
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • Instead of removing the current inode from the red-black tree
    and then adding the new one, just use the red-black tree replace
    operation, which is more efficient.

    Signed-off-by: Filipe David Borba Manana
    Reviewed-by: Zach Brown
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • If a replace was suspended by umount, the replace target device is added
    to the fs_devices->alloc_list during a later mount. This is obviously
    wrong. ->is_tgtdev_for_dev_replace is supposed to guard against that,
    but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized
    *after* everything is opened and fs_devices lists are populated. Fix
    this by checking the devid instead: for replace targets it's always
    equal to BTRFS_DEV_REPLACE_DEVID.

    Cc: Stefan Behrens
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Ilya Dryomov
     
  • If we failed to actually allocate the correct size of the extent to relocate we
    will end up in an infinite loop because we won't return an error, we'll just
    move on to the next extent. So fix this up by returning an error, and then fix
    all the callers to return an error up the stack rather than BUG_ON()'ing.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Linux 3.11

    Chris Mason
     

03 Sep, 2013

4 commits


01 Sep, 2013

3 commits

  • When the binary search returns 0 (exact match), the target key
    will necessarily be at slot 0 of all nodes below the current one,
    so in this case the binary search is not needed because it will
    always return 0, and we waste time doing it, holding node locks
    for longer than necessary, etc.

    Below follow histograms with the times spent on the current approach of
    doing a binary search when the previous binary search returned 0, and
    times for the new approach, which directly picks the first item/child
    node in the leaf/node.

    Current approach:

    Count: 6682
    Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429
    Percentiles: 90th: 124.000; 95th: 145.000; 99th: 206.000
    35.000 - 61.080: 1235 ################
    61.080 - 106.053: 4207 #####################################################
    106.053 - 183.606: 1122 ##############
    183.606 - 317.341: 111 #
    317.341 - 547.959: 6 |
    547.959 - 8370.000: 1 |

    Approach proposed by this patch:

    Count: 6682
    Range: 6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev: 7.160
    Percentiles: 90th: 23.000; 95th: 27.000; 99th: 40.000
    6.000 - 8.418: 58 #
    8.418 - 11.670: 1149 #########################
    11.670 - 16.046: 2418 #####################################################
    16.046 - 21.934: 2098 ##############################################
    21.934 - 29.854: 744 ################
    29.854 - 40.511: 154 ###
    40.511 - 54.848: 41 #
    54.848 - 74.136: 5 |
    74.136 - 100.087: 9 |
    100.087 - 135.000: 6 |

    These samples were captured during a run of the btrfs tests 001, 002 and
    004 in the xfstests, with a leaf/node size of 4Kb.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • We only need an async starter if we can't make a GFP_NOFS allocation in our
    current path. This is the case for the endio stuff since it happens in IRQ
    context, but things like the caching thread workers and the delalloc flushers we
    can easily make this allocation and start threads right away. Also change the
    worker count for the caching thread pool. Traditionally we limited this to 2
    since we took read locks while caching, but nowadays we do this lockless so
    there's no reason to limit the number of caching threads. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This fixes a problem where if we fail a truncate we will leave the i_size set
    where we wanted to truncate to instead of where we were able to truncate to.
    Fix this by making btrfs_truncate_inode_items do the disk_i_size update as it
    removes extents, that way it will always be consistent with where its extents
    are. Then if the truncate fails at all we can update the in-ram i_size with
    what we have on disk and delete the orphan item. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik