16 Dec, 2011

1 commit


15 Dec, 2011

7 commits

  • The Smack LSM hook for security_d_instantiate checks
    the inode's i_op->getxattr value to determine if the
    containing filesystem supports extended attributes.
    The BTRFS filesystem sets the inode's i_op value only
    after it has instantiated the inode. This results in
    Smack incorrectly giving new BTRFS inodes attributes
    from the filesystem defaults on the assumption that
    values can't be stored on the filesystem. This patch
    moves the assignment of inode operation vectors ahead
    of the calls to d_instantiate, letting Smack know that
    the filesystem supports extended attributes. There
    should be no impact on the performance or behavior of
    BTRFS.

    Signed-off-by: Casey Schaufler
    Signed-off-by: Chris Mason

    Casey Schaufler
     
  • If we have a constant stream of end_io completions or crc work,
    we can hit softlockup messages from the async helper threads. This
    adds a cond_resched() into the loop to avoid them.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • To reproduce the bug:

    # touch /mnt/tmp
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:23.412105981 +0800
    # chattr +i /mnt/tmp
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:43.198105295 +0800
    # umount /mnt
    # mount /dev/loop1 /mnt
    # stat /mnt/tmp | grep Change
    Change: 2011-12-09 09:32:23.412105981 +0800

    We should update ctime of in-memory inode before calling
    btrfs_update_inode().

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • Since we have the free space caches, btrfs_orphan_cleanup also runs for
    the tree_root. Unfortunately this also cleans up the orphans used to mark
    subvol deletions in progress.

    Currently if a subvol deletion gets interrupted twice by umount/mount, the
    deletion will not be continued and the space is permanently lost, though it
    would be possible to write a tool to recover those lost subvol deletions.
    This patch checks if the orphan belongs to a subvol (dead root) and skips
    the deletion.

    Signed-off-by: Arne Jansen
    Signed-off-by: Chris Mason

    Arne Jansen
     
  • When we use raid0 as the data profile, the df command may show a very
    inaccurate value for the available space, which may be much less than the
    real one. This can confuse users. Fix it by changing the calculation
    of the available space so that it more closely resembles a fake chunk
    allocation.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
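
The idea of making the df estimate resemble a fake chunk allocation can be illustrated with a small user-space model. This is only a sketch, not the kernel's btrfs code: the function name `estimate_raid0_available`, the `min_stripes` parameter, and the simplification that every round stripes across all devices that still have free space are assumptions made for illustration.

```python
def estimate_raid0_available(free_per_device, min_stripes=2):
    """Estimate usable data space by simulating striped allocations.

    free_per_device: free bytes on each device (hypothetical input).
    min_stripes: fewest devices a stripe may span (assumed to be 2).
    """
    free = sorted(free_per_device, reverse=True)  # largest device first
    total = 0
    # Each round stripes across every remaining device; the smallest
    # device limits how much each stripe member can contribute.
    while len(free) >= min_stripes:
        smallest = free[-1]
        if smallest > 0:
            total += smallest * len(free)
            free = [f - smallest for f in free]
        free.pop()  # the smallest device is now exhausted
    return total
```

With devices of unequal size, the model reports less than the plain sum of free space, because the leftover space on the last device cannot be striped.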
     
  • Btrfsck reports errors after the 83rd case of xfstests is run. The error
    number is 400, which means the used disk space of the file is wrong.

    The reason for this bug is as follows:
    File truncation may fail when the file system does not have enough space,
    leaving some file extents whose offsets are beyond the end of the file.
    When we want to expand those files, we drop those file extents and
    put in dummy file extents, and then we should update the i-node. But btrfs
    forgets to do so.

    This patch adds the forgotten i-node update.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Btrfsck reports error 100 after the 83rd case of xfstests is run, which means
    the i_size of the file is wrong.

    The reason for this bug is as follows:
    Btrfs increased the i_size of the file at the beginning, but it failed to expand
    the file, and then failed to restore the i_size to the old size because there was
    not enough space in the file system, so we ended up with a wrong i_size.

    This patch fixes this bug by updating the i_size only after the file
    expansion succeeds and we have enough space to update the i-node.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     

10 Dec, 2011

1 commit

  • btrfs_end_bio checks the number of errors on a bio against the max
    number of errors allowed before sending any EIOs up to the higher
    levels.

    If we got enough copies of the bio done for a given raid level, it is
    supposed to clear the bio error flag and return success.

    We have pointers to the original bio sent down by the higher layers and
    pointers to any cloned bios we made for raid purposes. If the original
    bio happens to be the one that got an io error, but not the last one to
    finish, it might not have the BIO_UPTODATE bit set.

    Then, when the last bio does finish, we'll call bio_end_io on the
    original bio. It won't have the uptodate bit set and we'll end up
    sending EIO to the higher layers.

    We already had a check for this, it just was conditional on getting the
    IO error on the very last bio. Make the check unconditional so we eat
    the EIOs properly.

    Signed-off-by: Chris Mason

    Chris Mason
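
The error accounting described in this commit can be modeled in a few lines. This is a hedged user-space sketch, not the btrfs_end_bio code: the function name `complete_mirrored_io`, the `(is_original, ok)` tuples, and the `unconditional` switch are hypothetical names used to contrast the old conditional check with the unconditional one.

```python
import errno


def complete_mirrored_io(completions, max_errors, unconditional=True):
    """Return 0 or -EIO once every bio in a mirrored group has finished.

    completions: list of (is_original, ok) tuples in completion order.
    max_errors: number of failed copies the raid level can tolerate.
    unconditional: True models the fixed behavior, False the old one.
    """
    original_uptodate = True
    errors = 0
    last_failed = False
    for is_original, ok in completions:
        last_failed = not ok
        if not ok:
            errors += 1
            if is_original:
                # The original bio loses its "up to date" state.
                original_uptodate = False
    if errors > max_errors:
        return -errno.EIO
    # Enough copies succeeded, so the error must be hidden from the
    # upper layers. The old check ran only if the last bio had failed.
    if unconditional or last_failed:
        original_uptodate = True
    return 0 if original_uptodate else -errno.EIO
```

In this model, a failure on the original bio followed by a success on a clone returns -EIO with `unconditional=False` and 0 with `unconditional=True`.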
     

08 Dec, 2011

4 commits

  • Drop the spin lock in convert_extent_bit() when memory allocation fails;
    otherwise, it will deadlock.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • If we call ioctl(BTRFS_IOC_ADD_DEV) directly, we'll succeed in adding
    a readonly device to a btrfs filesystem, and btrfs will write to
    that device, emitting kernel errors:

    [ 3109.833692] lost page write due to I/O error on loop2
    [ 3109.833720] lost page write due to I/O error on loop2
    ...

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • When we find an existing cluster, we switch to its block group as the
    current block group, possibly skipping multiple blocks in the process.
    Furthermore, under heavy contention, multiple threads may fail to
    allocate from a cluster and then release just-created clusters just to
    proceed to create new ones in a different block group.

    This patch tries to allocate from an existing cluster regardless of its
    block group, and doesn't switch to that group, instead proceeding to
    try to allocate a cluster from the group it was iterating before the
    attempt.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • If we reach LOOP_NO_EMPTY_SIZE, we won't even try to use a cluster that
    others might have set up. Odds are that there won't be one, but if
    someone else succeeded in setting it up, we might as well use it, even
    if we don't try to set up a cluster again.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     

01 Dec, 2011

11 commits

  • Commit 4a54c8c16 introduced raid-repair, killing the individual
    readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
    4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
    disk-io.c.

    The raid-repair commit had logic to disable raid-repair, if
    readpage_io_failed_hook is set. Thus, the readahead commit effectively
    disabled raid-repair for metadata.

    This commit changes the logic to always attempt raid-repair when needed and
    call the readpage_io_failed_hook in case raid-repair fails. This is much
    more straightforward and should have been like that from the beginning.

    Signed-off-by: Jan Schmidt
    Reported-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Jan Schmidt
     
  • If we don't have a cluster, don't bother trying to allocate from it,
    jumping right away to the attempt to allocate a new cluster.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • We test whether a block group has enough free space to hold the
    requested block, but when we're doing clustered allocation, we can
    save some cycles by testing whether it has enough room for the cluster
    upfront, otherwise we end up attempting to set up a cluster and
    failing. Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
    allocation, and by then we'll have zeroed the cluster size, so this
    patch won't stop us from using the block group as a last resort.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • Instead of starting at zero (offset is always zero), request a cluster
    starting at search_start, which denotes the beginning of the current
    block group.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
     
  • The field that indicates the size of the largest contiguous chunk of
    free space in the cluster is not initialized when setting up bitmaps,
    it's only increased when we find a larger contiguous chunk. We end up
    retaining a larger value than appropriate for highly-fragmented
    clusters, which may cause pointless searches for large contiguous
    groups, and even cause clusters that do not meet the density
    requirements to be set up.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Chris Mason

    Alexandre Oliva
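
The effect of the uninitialized field can be shown with a toy model. This is a sketch under stated assumptions, not the free space cache code: `Cluster`, `max_size`, `setup_from_bitmap`, and the `reset_first` switch are hypothetical names, and the bitmap is represented as a plain list of 0/1 values.

```python
class Cluster:
    """Toy stand-in for a cluster that remembers its largest free run."""

    def __init__(self):
        self.max_size = 0


def setup_from_bitmap(cluster, bits, reset_first=True):
    """Record the longest run of set bits (free units) found in `bits`.

    reset_first=False models the bug: a stale, larger value from an
    earlier setup survives because the field is only ever increased.
    """
    if reset_first:
        cluster.max_size = 0
    run = 0
    for bit in bits:
        run = run + 1 if bit else 0
        if run > cluster.max_size:
            cluster.max_size = run
    return cluster.max_size
```

For a cluster whose field still holds 10 from an earlier use, scanning the fragmented bitmap `[1, 0, 1, 0, 1, 1]` yields 2 with the reset and keeps the stale 10 without it.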
     
  • We're failing to create clusters with bitmaps because
    setup_cluster_no_bitmap checks that the list is empty before inserting
    the bitmap entry in the list for setup_cluster_bitmap, but the list
    field is only initialized when it is restored from the on-disk free
    space cache, or when it is written out to disk.

    Besides a potential race condition due to the multiple use of the list
    field, filesystem performance severely degrades over time: as we use
    up all non-bitmap free extents, the try-to-set-up-cluster dance is
    done at every metadata block allocation. For every block group, we
    fail to set up a cluster, and after failing on them all up to twice,
    we fall back to the much slower unclustered allocation.

    To make matters worse, before the unclustered allocation, we try to
    create new block groups until we reach the 1% threshold, which
    introduces additional bitmaps and thus block groups that we'll iterate
    over at each metadata block request.

    Alexandre Oliva
     
  • To reproduce this bug:

    # dd if=/dev/zero of=img bs=1M count=256
    # mkfs.btrfs img
    # losetup -r /dev/loop1 img
    # mount /dev/loop1 /mnt
    OOPS!!

    It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().

    To fix this, instead of checking write-only devices, we check all open
    devices:

    # df -h /dev/loop1
    Filesystem Size Used Avail Use% Mounted on
    /dev/loop1 250M 28K 238M 1% /mnt

    Signed-off-by: Li Zefan

    Li Zefan
     
  • It seems overly harsh to fail a resize of a btrfs file system to the
    same size when a shrink or grow would succeed. User app GParted trips
    over this error. Allow it by bypassing the shrink or grow operation.

    Signed-off-by: Mike Fleetwood

    Mike Fleetwood
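
The behavior described here amounts to treating a same-size request as a successful no-op. The following is a minimal sketch, assuming hypothetical `grow` and `shrink` callbacks; it does not reproduce the actual btrfs resize ioctl.

```python
def resize_device(current_size, new_size, grow, shrink):
    """Dispatch a resize request; return 0 on success.

    grow and shrink are hypothetical callbacks standing in for the
    real operations. A request for the current size skips both.
    """
    if new_size > current_size:
        grow(new_size)
    elif new_size < current_size:
        shrink(new_size)
    # new_size == current_size: nothing to do, but still a success.
    return 0
```

A caller such as a partition editor can then issue a resize to the current size without special-casing it.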
     
  • When I ran xfstests, I found that the test tasks were blocked on meta-data
    reservation.

    By debugging, I found the reason for this bug:
    start transaction
        |
        v
    reserve meta-data space
        |
        v
    flush delay allocation -> iput inode -> evict inode
        ^                                       |
        |                                       v
    wait for delay allocation flush

    Miao Xie
     
  • The location of the btrfs-progs repository has been changed.
    This patch updates the documentation accordingly.

    Signed-off-by: Arnd Hannemann

    Arnd Hannemann
     
  • init_ipath() can return an ERR_PTR(-ENOMEM).

    Signed-off-by: Dan Carpenter

    Dan Carpenter
     

22 Nov, 2011

1 commit

  • The log replay code only partially loads block groups, since
    the block group caching code is able to detect and deal with
    extents the logging code has pinned down.

    While the logging code is pinning down block groups, there is
    a bogus WARN_ON we're hitting if the code wasn't able to find
    an extent in the cache. This commit removes the warning because
    it can happen any time there isn't a valid free space cache
    for that block group.

    Signed-off-by: Chris Mason

    Chris Mason
     

20 Nov, 2011

10 commits

  • We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
    else that calls btrfs_get_extent while running xfstests 13 in a loop. This is
    because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
    which will end up adding mappings that are not sectorsize aligned, which will
    cause problems in some cases for subsequent calls to btrfs_get_extent for
    similar areas that are sectorsize aligned. With this patch I ran xfstests 13 in
    a loop for a couple of hours and didn't hit the problem that I could previously
    hit in at most 20 minutes. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
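
The commit message does not say how the offsets are aligned, so the following only illustrates what a sectorsize-aligned range looks like. The function name `align_range` and the 4096-byte default are assumptions for the sketch.

```python
def align_range(start, length, sectorsize=4096):
    """Expand [start, start + length) to sectorsize boundaries.

    Returns (aligned_start, aligned_length). The start is rounded
    down and the end is rounded up, so the original range is covered.
    """
    aligned_start = start - (start % sectorsize)
    end = start + length
    aligned_end = -(-end // sectorsize) * sectorsize  # ceiling division
    return aligned_start, aligned_end - aligned_start
```

A range that is already aligned comes back unchanged, while a request such as offset 4000 and length 200 is widened to cover two whole sectors.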
     
  • When doing the io_ctl helpers to clean up the free space cache stuff I stopped
    using our normal prepare_pages stuff, which means I of course forgot to do
    things like set the pages extent mapped, which will cause us all sorts of
    wonderful problems. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We've been hitting panics when running xfstest 13 in a loop for long periods of
    time. And actually this problem has always existed so we've been hitting these
    things randomly for a while. Basically what happens is we get a thread coming
    into the allocator and reading the space cache off of disk and adding the
    entries to the free space cache as we go. Then we get another thread that comes
    in and tries to allocate from that block group. Since block_group->cached !=
    BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because
    if we're doing the old slow way of caching we don't want to hold people up and
    wait for everything to finish. The problem with this is we could end up
    discarding the space cache at some arbitrary point in the future, which means we
    could very well end up allocating space that is either bad, or when the real
    caching happens it could end up thinking the space isn't in use when it really
    is and cause all sorts of other problems.

    The solution is to add a new flag to indicate we are loading the free space
    cache from disk, and always try to cache the block group if cache->cached !=
    BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else
    who tries to allocate from the block group will have to wait until it's finished
    to make sure it completes successfully. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • For the user it is confusing to find something like:
    [10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
    in kernel log, because it doesn't point directly to btrfs.

    This patch prefixes those messages with "btrfs:" like other btrfs
    related printks.

    Signed-off-by: Arnd Hannemann
    Signed-off-by: Chris Mason

    Arnd Hannemann
     
  • Round inode bytes and delalloc bytes up to real blocksize before
    converting to sector size. Otherwise, e.g., files smaller than 512
    bytes are reported with zero blocks due to incorrect rounding.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
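
The rounding problem is plain arithmetic and can be checked directly. In this sketch the function names and the 4096-byte block size are assumptions; only the 512-byte sector size comes from the commit message.

```python
def blocks_without_rounding(nbytes, sector=512):
    """Convert bytes straight to 512-byte sectors (the old behavior)."""
    return nbytes // sector


def blocks_with_rounding(nbytes, blocksize=4096, sector=512):
    """Round up to the filesystem block size first, then convert."""
    rounded = -(-nbytes // blocksize) * blocksize  # ceiling to blocksize
    return rounded // sector
```

A 100-byte file gives 0 sectors without rounding and 8 sectors once it is rounded up to one 4096-byte block.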
     
  • setup_cluster_no_bitmap() searches all the extents and bitmaps starting
    from offset. Therefore if it returns -ENOSPC, all the bitmaps starting
    from offset are in the bitmaps list, so it's sufficient to search from
    this list in setup_cluster_bitmap().

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • Suppose there are two bitmaps [0, 256], [256, 512] and one extent
    [100, 120] in the free space cache, and we want to set up a cluster
    with offset=100, bytes=50.

    In this case, there will be only one bitmap [256, 512] in the temporary
    bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].

    The cause is that the list is constructed in setup_cluster_no_bitmap(),
    and only bitmaps with bitmap_entry->offset >= offset will be added
    into the list, and the very bitmap that covers offset has
    bitmap_entry->offset < offset.

    Signed-off-by: Chris Mason

    Li Zefan
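
The example with bitmaps [0, 256] and [256, 512] and offset=100 can be replayed in a few lines. This is an illustrative model only: the `Bitmap` tuple and both helper names are hypothetical, and the second helper shows one possible selection rule rather than the change the patch actually makes.

```python
from collections import namedtuple

# Hypothetical stand-in for a bitmap entry covering [start, end).
Bitmap = namedtuple("Bitmap", "start end")


def bitmaps_starting_at_or_after(bitmaps, offset):
    """Selection rule described above: keep entries with start >= offset."""
    return [b for b in bitmaps if b.start >= offset]


def bitmaps_reaching_past(bitmaps, offset):
    """Alternative rule: keep every entry that extends beyond offset."""
    return [b for b in bitmaps if b.end > offset]
```

With offset=100, the first rule returns only the bitmap [256, 512], while the second also returns [0, 256], which is the bitmap that covers the requested offset.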
     
  • My previous patch introduced some u64 for failed_mirror variables, this one
    makes it consistent again.

    Signed-off-by: Jan Schmidt
    Signed-off-by: Chris Mason

    Jan Schmidt
     
  • This patch casts to unsigned long before casting to a pointer and fixes
    the following warnings:
    fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
    fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
    fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Chris Mason

    Jeff Mahoney
     
  • When btrfs is writing the super blocks, it sends barrier flushes to make
    sure writeback caching drives get all the metadata on disk in the
    right order.

    But, we have two bugs in the way these are sent down. When doing
    full commits (not via the tree log), we are sending the barrier down
    before the last super when it should be going down before the first.

    In multi-device setups, we should be waiting for the barriers to
    complete on all devices before writing any of the supers.

    Both of these bugs can cause corruptions on power failures. We fix it
    with some new code to send down empty barriers to all devices before
    writing the first super.

    Alexandre Oliva found the multi-device bug. Arne Jansen did the async
    barrier loop.

    Signed-off-by: Chris Mason
    Reported-by: Alexandre Oliva

    Chris Mason
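
The ordering requirement (flush every device, wait for all flushes, and only then write the first super) can be sketched as three phases. The function name `commit_supers` and the event tuples are assumptions for illustration; real barrier submission is asynchronous and is not modeled here.

```python
def commit_supers(devices, log):
    """Append the ordered steps of a multi-device super block commit.

    devices: list of device names (hypothetical).
    log: list that receives (operation, device) tuples.
    """
    # Phase 1: send an empty flush to every device.
    for dev in devices:
        log.append(("flush_submit", dev))
    # Phase 2: wait until every flush has completed.
    for dev in devices:
        log.append(("flush_wait", dev))
    # Phase 3: only now is it safe to write any super block.
    for dev in devices:
        log.append(("write_super", dev))
    return log
```

The property to preserve is that no `write_super` event appears before the last `flush_wait` event, regardless of the number of devices.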
     

15 Nov, 2011

1 commit

  • The btrfs snapshotting code requires that once a root has been
    snapshotted, we don't change it during a commit.

    But there are two cases that lead to tree corruption:

    1) multi-threaded snapshots can commit several snapshots in a transaction,
    and this may change the src root when processing the following pending
    snapshots, which leads to corruption of the former snapshots;

    2) the free inode cache was changing the roots when it wrote the cache,
    which leads to corruption.

    This fixes things by making sure we force COW the block after we create a
    snapshot while committing a transaction, so any changes to the roots
    will result in COW, and we get all the fs roots and snapshot roots to be
    consistent.

    Signed-off-by: Liu Bo
    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Liu Bo
     

11 Nov, 2011

4 commits

  • Rename no_space_cache option to nospace_cache to be more consistent with
    the rest, where the simple prefix 'no' is used to negate an option.

    The option was introduced during the -rc1 cycle and has not been
    widely used, so the rename is safe.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately
    dm based targets accept only one page per bio, thus making scrub always
    fail. This patch just submits the current bio when an error is encountered
    and starts a new one.

    Signed-off-by: Arne Jansen
    Signed-off-by: Chris Mason

    Arne Jansen
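
The submit-and-restart strategy can be modeled with a small batching loop. This is a sketch, not the scrub code: `Bio`, `add_page`, `queue_pages`, and the `capacity` limit are hypothetical stand-ins, with `capacity=1` playing the role of a target that accepts one page per bio.

```python
class Bio:
    """Toy bio that accepts at most `capacity` pages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = []

    def add_page(self, page):
        """Return False instead of adding when the bio is full."""
        if len(self.pages) >= self.capacity:
            return False
        self.pages.append(page)
        return True


def queue_pages(pages, capacity, submit):
    """Add pages to a bio; on failure submit it and start a new one."""
    bio = Bio(capacity)
    for page in pages:
        if bio.add_page(page):
            continue
        submit(bio)              # send what we have so far
        bio = Bio(capacity)      # and retry the page on a fresh bio
        if not bio.add_page(page):
            # An empty bio rejecting a page is a genuine failure.
            raise MemoryError("empty bio rejected a page")
    if bio.pages:
        submit(bio)
```

With `capacity=1`, three pages result in three single-page submissions instead of an error.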
     
  • We cannot do a flushable reservation for the relocation when we create a snapshot,
    because it may make the transaction commit task and the flush task wait for
    each other, causing a deadlock.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • People have been running into a warning when loading space cache because the
    page is already mapped when trying to read in a bitmap. The way we read in
    entries and pages is kind of convoluted, so fix it so that io_ctl_read_entry
    maps the entries if it needs to, and if it hits the end of the page it simply
    unmaps the page. That way we can unconditionally unmap the io_ctl before
    reading in the bitmap and we should stop hitting these warnings. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik