23 Nov, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: remove free-space-cache.c WARN during log replay
    Btrfs: sectorsize align offsets in fiemap
    Btrfs: clear pages dirty for io and set them extent mapped
    Btrfs: wait on caching if we're loading the free space cache
    Btrfs: prefix resize related printks with btrfs:
    btrfs: fix stat blocks accounting
    Btrfs: avoid unnecessary bitmap search for cluster setup
    Btrfs: fix to search one more bitmap for cluster setup
    btrfs: mirror_num should be int, not u64
    btrfs: Fix up 32/64-bit compatibility for new ioctls
    Btrfs: fix barrier flushes
    Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush

    Linus Torvalds
     

22 Nov, 2011

1 commit

  • The log replay code only partially loads block groups, since
    the block group caching code is able to detect and deal with
    extents the logging code has pinned down.

    While the logging code is pinning down block groups, there is
    a bogus WARN_ON we're hitting if the code wasn't able to find
    an extent in the cache. This commit removes the warning because
    it can happen any time there isn't a valid free space cache
    for that block group.

    Signed-off-by: Chris Mason

    Chris Mason
     

20 Nov, 2011

10 commits

  • We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
    else that calls btrfs_get_extent while running xfstests 13 in a loop. This is
    because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
    which will end up adding mappings that are not sectorsize aligned, which will
    cause problems in some cases for subsequent calls to btrfs_get_extent for
    similar areas that are sectorsize aligned. With this patch I ran xfstests 13 in
    a loop for a couple of hours and didn't hit the problem that I could previously
    hit in at most 20 minutes. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • When doing the io_ctl helpers to clean up the free space cache stuff I stopped
    using our normal prepare_pages stuff, which means I of course forgot to do
    things like set the pages extent mapped, which will cause us all sorts of
    wonderful problems. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We've been hitting panics when running xfstest 13 in a loop for long periods of
    time. And actually this problem has always existed so we've been hitting these
    things randomly for a while. Basically what happens is we get a thread coming
    into the allocator and reading the space cache off of disk and adding the
    entries to the free space cache as we go. Then we get another thread that comes
    in and tries to allocate from that block group. Since block_group->cached !=
    BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because
    if we're doing the old slow way of caching we don't want to hold people up and
    wait for everything to finish. The problem with this is we could end up
    discarding the space cache at some arbitrary point in the future, which means we
    could very well end up allocating space that is either bad, or when the real
    caching happens it could end up thinking the space isn't in use when it really
    is and cause all sorts of other problems.

    The solution is to add a new flag to indicate we are loading the free space
    cache from disk, and always try to cache the block group if cache->cached !=
    BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else
    who tries to allocate from the block group will have to wait until it's finished
    to make sure it completes successfully. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • For the user it is confusing to find something like:
    [10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
    in kernel log, because it doesn't point directly to btrfs.

    This patch prefixes those messages with "btrfs:" like other btrfs
    related printks.

    Signed-off-by: Arnd Hannemann
    Signed-off-by: Chris Mason

    Arnd Hannemann
     
  • Round inode bytes and delalloc bytes up to real blocksize before
    converting to sector size. Otherwise, e.g., files smaller than 512 bytes
    are reported with zero blocks due to incorrect rounding.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • setup_cluster_no_bitmap() searches all the extents and bitmaps starting
    from offset. Therefore if it returns -ENOSPC, all the bitmaps starting
    from offset are in the bitmaps list, so it's sufficient to search from
    this list in setup_cluster_bitmap().

    Signed-off-by: Li Zefan
    Signed-off-by: Chris Mason

    Li Zefan
     
  • Suppose there are two bitmaps [0, 256], [256, 512] and one extent
    [100, 120] in the free space cache, and we want to setup a cluster
    with offset=100, bytes=50.

    In this case, there will be only one bitmap [256, 512] in the temporary
    bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].

    The cause is, the list is constructed in setup_cluster_no_bitmap(),
    and only bitmaps with bitmap_entry->offset >= offset will be added
    into the list, and the very bitmap that covers offset has
    bitmap_entry->offset < offset.

    Signed-off-by: Chris Mason

    Li Zefan
     
  • My previous patch introduced some u64 for failed_mirror variables, this one
    makes it consistent again.

    Signed-off-by: Jan Schmidt
    Signed-off-by: Chris Mason

    Jan Schmidt
     
  • This patch casts to unsigned long before casting to a pointer and fixes
    the following warnings:
    fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
    fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
    fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Chris Mason

    Jeff Mahoney
     
  • When btrfs is writing the super blocks, it sends barrier flushes to make
    sure writeback caching drives get all the metadata on disk in the
    right order.

    But, we have two bugs in the way these are sent down. When doing
    full commits (not via the tree log), we are sending the barrier down
    before the last super when it should be going down before the first.

    In multi-device setups, we should be waiting for the barriers to
    complete on all devices before writing any of the supers.

    Both of these bugs can cause corruptions on power failures. We fix it
    with some new code to send down empty barriers to all devices before
    writing the first super.

    Alexandre Oliva found the multi-device bug. Arne Jansen did the async
    barrier loop.

    Signed-off-by: Chris Mason
    Reported-by: Alexandre Oliva

    Chris Mason
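
The cluster setup example in this batch (two bitmaps [0, 256] and [256, 512], one extent [100, 120], offset=100) can be walked through with a small userspace model. This is only a sketch: `struct fs_entry`, `demo_entries`, `list_bitmaps_from()` and `list_bitmaps_covering()` are hypothetical names invented for illustration, not the btrfs structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a free space cache entry (not the kernel struct). */
struct fs_entry {
    uint64_t offset;
    uint64_t bytes;
    bool is_bitmap;
};

/* The scenario from the commit message, sorted by offset. */
static const struct fs_entry demo_entries[] = {
    { 0,   256, true  },  /* bitmap starting at 0   */
    { 100, 20,  false },  /* extent starting at 100 */
    { 256, 256, true  },  /* bitmap starting at 256 */
};
#define DEMO_COUNT (sizeof(demo_entries) / sizeof(demo_entries[0]))

/* Models the old list: only bitmaps starting at or after 'offset'. */
static size_t list_bitmaps_from(const struct fs_entry *e, size_t n,
                                uint64_t offset, const struct fs_entry **out)
{
    size_t found = 0;

    assert(e != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        if (e[i].is_bitmap && e[i].offset >= offset)
            out[found++] = &e[i];
    return found;
}

/* Models the fix: also include the bitmap whose range covers 'offset'. */
static size_t list_bitmaps_covering(const struct fs_entry *e, size_t n,
                                    uint64_t offset, const struct fs_entry **out)
{
    size_t found = 0;

    assert(e != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        bool covers = e[i].offset < offset &&
                      offset < e[i].offset + e[i].bytes;

        if (e[i].is_bitmap && (covers || e[i].offset >= offset))
            out[found++] = &e[i];
    }
    return found;
}
```

With offset=100 the first helper lists only the bitmap at 256, while the second also lists the bitmap at 0, which is the one that actually covers the requested offset.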
     
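The stat blocks fix in this batch (round up to the real block size before converting to sectors) comes down to a small piece of arithmetic. A minimal sketch, assuming a 4096-byte block size in the examples; `blocks_truncating()` and `blocks_rounded()` are hypothetical helpers, not functions from the btrfs source.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

/* Models the bug: convert bytes straight to 512-byte sectors. */
static uint64_t blocks_truncating(uint64_t bytes)
{
    return bytes >> 9;
}

/* Models the fix: round up to the block size first, then convert. */
static uint64_t blocks_rounded(uint64_t bytes, uint32_t blocksize)
{
    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
    return ALIGN_UP(bytes, blocksize) >> 9;
}
```

For a 100-byte file, `blocks_truncating(100)` gives 0 sectors, while `blocks_rounded(100, 4096)` gives 8 sectors, i.e. one full 4096-byte block.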

17 Nov, 2011

3 commits


15 Nov, 2011

1 commit

  • The btrfs snapshotting code requires that once a root has been
    snapshotted, we don't change it during a commit.

    But there are two cases that lead to tree corruption:

    1) multi-thread snapshots can commit several snapshots in a transaction,
    and this may change the src root when processing the following pending
    snapshots, which corrupts the former snapshots;

    2) the free inode cache was changing the roots when it wrote the cache,
    which led to corruption.

    This fixes things by making sure we force COW the block after we create a
    snapshot while committing a transaction, then any changes to the roots
    will result in COW, and we get all the fs roots and snapshot roots to be
    consistent.

    Signed-off-by: Liu Bo
    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Liu Bo
     

12 Nov, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: rename the option to nospace_cache
    Btrfs: handle bio_add_page failure gracefully in scrub
    Btrfs: fix deadlock caused by the race between relocation
    Btrfs: only map pages if we know we need them when reading the space cache
    Btrfs: fix orphan backref nodes
    Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
    Btrfs: fix unreleased path in btrfs_orphan_cleanup()
    Btrfs: fix no reserved space for writing out inode cache
    Btrfs: fix nocow when deleting the item
    Btrfs: tweak the delayed inode reservations again
    Btrfs: rework error handling in btrfs_mount()
    Btrfs: close devices on all error paths in open_ctree()
    Btrfs: avoid null dereference and leaks when bailing from open_ctree()
    Btrfs: fix subvol_name leak on error in btrfs_mount()
    Btrfs: fix memory leak in btrfs_parse_early_options()
    Btrfs: fix our reservations for updating an inode when completing io
    Btrfs: fix oops on NULL trans handle in btrfs_truncate
    btrfs: fix double-free 'tree_root' in 'btrfs_mount()'

    Linus Torvalds
     

11 Nov, 2011

11 commits

  • Rename no_space_cache option to nospace_cache to be more consistent with
    the rest, where the simple prefix 'no' is used to negate an option.

    The option was introduced during the -rc1 cycle and has not been
    widely used, so the rename is safe.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately,
    dm-based targets accept only one page per bio, thus making scrub always
    fail. This patch just submits the current bio when an error is encountered
    and starts a new one.

    Signed-off-by: Arne Jansen
    Signed-off-by: Chris Mason

    Arne Jansen
     
  • We cannot do a flushable reservation for the relocation when we create a
    snapshot, because it may make the transaction commit task and the flush task
    wait for each other, causing a deadlock.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • People have been running into a warning when loading space cache because the
    page is already mapped when trying to read in a bitmap. The way we read in
    entries and pages is kind of convoluted, so fix it so that io_ctl_read_entry
    maps the entries if it needs to, and if it hits the end of the page it simply
    unmaps the page. That way we can unconditionally unmap the io_ctl before
    reading in the bitmap and we should stop hitting these warnings. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If the root node of a fs/file tree is in the block group that is
    being relocated while the other nodes are in other block groups, then
    when we create a snapshot for this tree between the end of relocation tree
    creation and ->create_reloc_tree being set to 0, Btrfs will create
    some backref nodes that are the lowest nodes of the backref cache.
    But we forget to add them to the ->leaves list of the backref cache
    and deal with them, and eventually they trigger BUG_ON().

    kernel BUG at fs/btrfs/relocation.c:239!

    This patch fixes it by adding them into ->leaves list of backref cache.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • btrfs_block_rsv_add{, _noflush}() have similar code, so abstract that code.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • When we stress tested space relocation, a deadlock happened.
    While debugging, we found it was caused by forgetting to release
    the read lock on the extent buffers in btrfs_orphan_cleanup()
    before ending the transaction handle. The transaction commit task waited
    for the task that called btrfs_orphan_cleanup() to unlock the extent buffer,
    while that task waited for the commit task to finish the transaction commit,
    and the deadlock happened. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • The inode cache forgets to reserve space when it is written out. When
    we run stress tests such as synctest, this triggers WARN_ON() in
    use_block_rsv().

    WARNING: at fs/btrfs/extent-tree.c:5718 btrfs_alloc_free_block+0xbf/0x281 [btrfs]()
    ...
    Call Trace:
    [] warn_slowpath_common+0x80/0x98
    [] warn_slowpath_null+0x15/0x17
    [] btrfs_alloc_free_block+0xbf/0x281 [btrfs]
    [] ? __set_page_dirty_nobuffers+0xfe/0x108
    [] __btrfs_cow_block+0x118/0x3b5 [btrfs]
    [] btrfs_cow_block+0x103/0x14e [btrfs]
    [] btrfs_search_slot+0x249/0x6a4 [btrfs]
    [] btrfs_lookup_inode+0x2a/0x8a [btrfs]
    [] btrfs_update_inode+0xaa/0x141 [btrfs]
    [] btrfs_save_ino_cache+0xea/0x202 [btrfs]
    [] ? btrfs_update_reloc_root+0x17e/0x197 [btrfs]
    [] commit_fs_roots+0xaa/0x158 [btrfs]
    [] btrfs_commit_transaction+0x405/0x731 [btrfs]
    [] ? wake_up_bit+0x25/0x25
    [] ? btrfs_log_dentry_safe+0x43/0x51 [btrfs]
    [] btrfs_sync_file+0x16a/0x198 [btrfs]
    [] ? mntput+0x21/0x23
    [] vfs_fsync_range+0x18/0x21
    [] vfs_fsync+0x17/0x19
    [] do_fsync+0x29/0x3e
    [] sys_fsync+0xb/0xf
    [] system_call_fastpath+0x16/0x1b

    Sometimes it also triggers BUG_ON() in the reservation code of the
    delayed inode.

    So we must reserve enough space for the inode cache.

    Note: If we cannot reserve enough space for the inode cache, we
    give up writing it out.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • btrfs_previous_item() only searches the b+ tree and does not COW the nodes
    or leaves, so if we modify its result, the metadata will be corrupted. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     
  • Chris Mason
     
  • Josef sent along an incremental to the inode reservation
    code to make sure we try and fall back to directly updating
    the inode item if things go horribly wrong.

    This reworks that patch slightly, adding a fallback function
    that will always try to update the inode item directly without
    going through the delayed_inode code.

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Nov, 2011

5 commits

  • Commits 6c41761f and 45ea6095 introduced the possibility of NULL pointer
    dereference on error paths, also we would leave all devices busy and
    leak fs_info with all sub-structures on error when trying to mount an
    already mounted fs to a different directory.

    Fix this by doing all allocations before trying to open any of the
    devices, adjust error path for mount-already-mounted-fs case.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Fix a bug introduced by 7e662854 where we would leave devices busy on
    certain error paths in open_ctree(). fs_info is guaranteed to be
    non-NULL now so it's safe to dereference it on all error paths.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Fix bugs introduced by 6c41761f. Firstly, after failing to allocate any
    of the tree roots (first 'goto fail' in open_ctree()) we would
    dereference a NULL fs_info pointer in free_fs_info(). Secondly, after
    failures from init_srcu_struct(), setup_bdi() and new_inode() we would
    leak all earlier allocated roots: fs_info fields haven't been
    initialized yet so free_fs_info() is rendered useless.

    Fix this by initializing fs_info pointer and fs_info fields before any
    allocations happen.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • btrfs_parse_early_options() can fail due to error while scanning devices
    (-o device= option), but still strdup() subvol_name string:

    mount -o subvol=SUBV,device=BAD_DEVICE

    So free subvol_name string on error.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Don't leak subvol_name string in case multiple subvol= options are
    given. "The latest option is effective" behavior (consistent with
    subvolid= and subvolrootid= options) is preserved.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
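
The "latest option is effective" behavior without a leak can be sketched in userspace C as follows. This is an illustration only: `set_subvol_name()` is a hypothetical helper, and it uses malloc()/free() where kernel code would use its own allocation routines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: store a private copy of 'value' in *subvol_name. */
static int set_subvol_name(char **subvol_name, const char *value)
{
    size_t len;
    char *copy;

    assert(subvol_name != NULL && value != NULL);

    len = strlen(value) + 1;
    copy = malloc(len);
    if (!copy)
        return -1;          /* previous value is left untouched on failure */
    memcpy(copy, value, len);

    free(*subvol_name);     /* release the string from an earlier subvol= */
    *subvol_name = copy;    /* the latest option wins */
    return 0;
}
```

Calling the helper twice on the same pointer (initialised to NULL) leaves only the second string allocated, so the caller has exactly one free() to do on both the success and error paths.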
     

09 Nov, 2011

2 commits

  • People have been reporting ENOSPC crashes in finish_ordered_io. This is because
    we try to steal from the delalloc block rsv to satisfy a reservation to update
    the inode. The problem with this is we don't explicitly save space for updating
    the inode when doing delalloc. This is kind of a problem and we've gotten away
    with this because way back when we just stole from the delalloc reserve without
    any questions, and this worked out fine because generally speaking the leaf had
    been modified either by the mtime update when we did the original write or
    because we just updated the leaf when we inserted the file extent item, only on
    rare occasions had the leaf not actually been modified, and that was still ok
    because we'd just use a block or two out of the over-reservation that is
    delalloc.

    Then came the delayed inode stuff. This is amazing, except it wants a full
    reservation for updating the inode since it may do it at some point down the
    road after we've written the blocks and we have to recow everything again. This
    worked out because the delayed inode stuff just stole from the global reserve,
    that is until recently when I changed that because it caused other problems.

    So here we are, we're doing everything right and being screwed for it. So take
    an extra reservation for the inode at delalloc reservation time and carry it
    through the life of the delalloc reservation. If we need it we can steal it in
    the delayed inode stuff. If we have already stolen it try and do a normal
    metadata reservation. If that fails try to steal from the delalloc reservation.
    If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
    solve this and in the meantime we'll steal from the global reserve.

    With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
    any problems.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If we fail to reserve space in the transaction during truncate, we can
    error out with a NULL trans handle. The cleanup code needs an extra
    check to make sure we aren't trying to use the bad handle.

    Signed-off-by: Chris Mason

    Chris Mason
     

08 Nov, 2011

1 commit

  • On error path 'tree_root' is freed in 'free_fs_info()'.
    No need to free it explicitly. Noticed by SLUB in debug mode:

    Complete reproducer under usermode linux (discovered on real
    machine):

    bdev=/dev/ubda
    btr_root=/btr
    /mkfs.btrfs $bdev
    mount $bdev $btr_root
    mkdir $btr_root/subvols/
    cd $btr_root/subvols/
    /btrfs su cr foo
    /btrfs su cr bar
    mount $bdev -osubvol=subvols/foo $btr_root/subvols/bar
    umount $btr_root/subvols/bar

    which gives

    device fsid 4d55aa28-45b1-474b-b4ec-da912322195e devid 1 transid 7 /dev/ubda
    =============================================================================
    BUG kmalloc-2048: Object already free
    -----------------------------------------------------------------------------

    INFO: Allocated in btrfs_mount+0x389/0x7f0 age=0 cpu=0 pid=277
    INFO: Freed in btrfs_mount+0x51c/0x7f0 age=0 cpu=0 pid=277
    INFO: Slab 0x0000000062886200 objects=15 used=9 fp=0x0000000070b4d2d0 flags=0x4081
    INFO: Object 0x0000000070b4d2d0 @offset=21200 fp=0x0000000070b4a968
    ...
    Call Trace:
    70b31948: [] print_trailer+0xe2/0x130
    70b31978: [] object_err+0x3a/0x50
    70b319a8: [] free_debug_processing+0x142/0x2a0
    70b319e0: [] btrfs_mount+0x55f/0x7f0
    70b319f8: [] __slab_free+0x221/0x2d0

    Signed-off-by: Sergei Trofimovich
    Cc: Arne Jansen
    Cc: Chris Mason
    Cc: David Sterba
    Signed-off-by: Chris Mason

    slyich@gmail.com
     

07 Nov, 2011

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
    Btrfs: check for a null fs root when writing to the backup root log
    Btrfs: fix race during transaction joins
    Btrfs: fix a potential btrfs_bio leak on scrub fixups
    Btrfs: rename btrfs_bio multi -> bbio for consistency
    Btrfs: stop leaking btrfs_bios on readahead
    Btrfs: stop the readahead threads on failed mount
    Btrfs: fix extent_buffer leak in the metadata IO error handling
    Btrfs: fix the new inspection ioctls for 32 bit compat
    Btrfs: fix delayed insertion reservation
    Btrfs: ClearPageError during writepage and clean_tree_block
    Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
    Btrfs: make a delayed_block_rsv for the delayed item insertion
    Btrfs: add a log of past tree roots
    btrfs: separate superblock items out of fs_info
    Btrfs: use the global reserve when truncating the free space cache inode
    Btrfs: release metadata from global reserve if we have to fallback for unlink
    Btrfs: make sure to flush queued bios if write_cache_pages waits
    Btrfs: fix extent pinning bugs in the tree log
    Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
    Btrfs: don't wait as long for more batches during SSD log commit
    ...

    Linus Torvalds
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     
  • During log replay, we can commit the transaction before the fs_root
    pointers are setup, so we have to make sure they are not null before
    trying to use them.

    Signed-off-by: Chris Mason

    Chris Mason
     

06 Nov, 2011

1 commit

  • While we're allocating ram for a new transaction, we drop our spinlock.
    When we get the lock back, we do check to see if a transaction started
    while we slept, but we don't check to make sure it isn't blocked
    because a commit has already started.

    Signed-off-by: Chris Mason

    Chris Mason
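
The recheck described above (after retaking the lock, look at both "did a transaction start?" and "is it blocked by a commit?") can be modelled in a few lines. This is a simplified single-threaded sketch with hypothetical names (`struct txn`, `struct fs_state`, `recheck_after_alloc()`); it omits the locking and does not reproduce the actual btrfs code or what the caller does once it finds the transaction blocked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified state: not the btrfs structures. */
struct txn {
    bool blocked;            /* a commit has started on this transaction */
};

struct fs_state {
    struct txn *running;     /* NULL when no transaction is running */
};

enum join_result {
    JOIN_INSTALL_NEW,        /* nobody raced us: install our new transaction */
    JOIN_EXISTING,           /* someone started one and it is joinable */
    JOIN_BLOCKED,            /* someone started one, but a commit is under way */
};

/* Decision to make after the lock has been dropped for allocation and retaken. */
static enum join_result recheck_after_alloc(const struct fs_state *fs)
{
    assert(fs != NULL);

    if (fs->running == NULL)
        return JOIN_INSTALL_NEW;
    if (fs->running->blocked)
        return JOIN_BLOCKED;  /* the case the missing check overlooked */
    return JOIN_EXISTING;
}
```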