18 May, 2013

6 commits

  • The quota_tree was previously set up to use the empty_block_rsv,
    which would be problematic when the filesystem fills up and
    ENOSPC happens during internal operations while the quota tree
    is updated and COWed (when the btrfs_qgroup_info_item items are
    written). In fact, use_block_rsv(), which is used in
    btrfs_cow_block(), falls back to the global_block_rsv in this
    case. But just to make it clearer what is happening, change it
    to explicitly use the global_block_rsv.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik

    Stefan Behrens
     
  • Before this patch, when we found the global reserve empty we refilled it
    by only the minimum unit. That was unreasonable and inefficient: if the
    global reserve was depleted, its size was too small. In that case we
    should update the size of the global reserve and then fill it.

    Cc: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • If the type of the space we need is different from the global reserve, we
    cannot steal the space from the global reserve, because we cannot allocate
    the space from the free space cache that the global reserve points to.

    Cc: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • Cc: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • It is very likely that there are lots of subvolumes/snapshots in the filesystem,
    so if we use global block reservation to do inode cache truncation, we may hog
    all the free space that is reserved in global rsv. So it is better that we do
    the free space reservation for inode cache truncation by ourselves.

    Cc: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • Chris hit a bug where we weren't finding extent records when running extent ops.
    This is because we use the delayed_ref_head when running the extent op, which
    means we can't use the ->type checks to see if we are metadata. We also lose
    the level of the metadata we are working on. So to fix this we can just check
    the ->is_data section of the extent_op, and we can store the level of the buffer
    we were modifying in the extent_op. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

07 May, 2013

17 commits

  • The variable was named 'data' in btrfs_reserve_extent and that's the
    only function that actually uses it to let btrfs_get_alloc_profile know
    what profile we want. Then it's passed down as u64 flags.

    Signed-off-by: David Sterba
    Signed-off-by: Josef Bacik

    David Sterba
     
  • Big patch, but all it does is add statics to functions which
    are in fact static, then remove the associated dead-code fallout.

    removed functions:

    btrfs_iref_to_path()
    __btrfs_lookup_delayed_deletion_item()
    __btrfs_search_delayed_insertion_item()
    __btrfs_search_delayed_deletion_item()
    find_eb_for_page()
    btrfs_find_block_group()
    range_straddles_pages()
    extent_range_uptodate()
    btrfs_file_extent_length()
    btrfs_scrub_cancel_devid()
    btrfs_start_transaction_lflush()

    btrfs_print_tree() is left because it is used for debugging.
    btrfs_start_transaction_lflush() and btrfs_reada_detach() are
    left for symmetry.

    ulist.c functions are left, another patch will take care of those.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Josef Bacik

    Eric Sandeen
     
  • So everybody who got hit by my fsync bug will still continue to hit this
    BUG_ON() in the free space cache, which is pretty heavy-handed. So I took a
    file system that had this bug and fixed up all the BUG_ON()'s and leaks that
    popped up when I tried to mount a broken file system like this. With this patch
    we just fail to mount instead of panicking. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • When running xfstests case 208, the fs returned the ENOSPC
    error even though there was lots of free space on the disk.

    By bisecting, we found it was introduced by commit 96f1bb5777.
    That commit made the space check for the global reservation in
    can_overcommit() inconsistent with should_alloc_chunk().
    can_overcommit() requires that the free space is 2 times the size
    of the global reservation, or we can't overcommit. Instead, we
    need to reclaim some reserved space, and if we still don't have
    enough free space, we need to allocate a new chunk. Unfortunately,
    should_alloc_chunk() only requires that the free space is 1 time
    the size of the global reservation. That is, we would not try to
    allocate a new chunk if the free space size fell between these
    two requirements, and would just return the ENOSPC error. Fix it.

    Cc: Jim Schutt
    Cc: Josef Bacik
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
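
The gap described in this commit message can be modelled with two threshold checks. This is a simplified sketch, not the kernel code: the function names, the `factor` parameter, and the reduction of each check to a single comparison are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: overcommit is allowed only when the free space is
 * at least twice the global reservation. */
bool can_overcommit_model(uint64_t free_bytes, uint64_t global_rsv)
{
	return free_bytes >= 2 * global_rsv;
}

/* Hypothetical model of the chunk-allocation check. factor == 1 stands for
 * the inconsistent behaviour, factor == 2 for a check that matches
 * can_overcommit_model(). */
bool should_alloc_chunk_model(uint64_t free_bytes, uint64_t global_rsv,
			      unsigned int factor)
{
	assert(factor == 1 || factor == 2);
	return free_bytes < factor * global_rsv;
}

/* True when we can neither overcommit nor allocate a chunk, which is the
 * case that surfaces as ENOSPC. */
bool hits_enospc_model(uint64_t free_bytes, uint64_t global_rsv,
		       unsigned int factor)
{
	return !can_overcommit_model(free_bytes, global_rsv) &&
	       !should_alloc_chunk_model(free_bytes, global_rsv, factor);
}
```

With a reservation of 100 units and 150 units free, the model reports ENOSPC for factor 1 and chunk allocation for factor 2, which is the window the message describes.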
     
  • Sequence numbers for delayed refs have been introduced in the first version
    of the qgroup patch set. To solve the problem of find_all_roots on a busy
    file system, the tree mod log was introduced. The sequence numbers for that
    were simply shared between those two users.

    However, at one point in qgroup's quota accounting, there's a statement
    accessing the previous sequence number, that's still just doing (seq - 1)
    just as it would have to in the very first version.

    To satisfy that requirement, this patch makes the sequence number counter 64
    bit and splits it into a major part (used for qgroup sequence number
    counting) and a minor part (incremented for each tree modification in the
    log). This enables us to go exactly one major step backwards, as required
    for qgroups, while still incrementing the sequence counter for tree mod log
    insertions to keep track of their order. Keeping them in a single variable
    means there's no need to change all the code dealing with comparisons of two
    sequence numbers.

    The sequence number is reset to 0 on commit (not new in this patch), which
    ensures we won't overflow the two 32 bit counters.

    Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
    from the tree mod log code may happen.

    Signed-off-by: Jan Schmidt
    Signed-off-by: Josef Bacik

    Jan Schmidt
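
The major/minor split of a single 64-bit counter can be sketched as follows. The helper names are hypothetical, the sketch is not atomic, and the exact way the kernel steps backwards may differ; it only illustrates that plain 64-bit comparisons keep working while the major part can be stepped independently.

```c
#include <assert.h>
#include <stdint.h>

#define SEQ_MINOR_BITS 32
#define SEQ_MAJOR_STEP (UINT64_C(1) << SEQ_MINOR_BITS)
#define SEQ_MAJOR_MASK (~(SEQ_MAJOR_STEP - 1))

/* Minor step: one increment per tree modification put into the log. */
uint64_t seq_inc_minor(uint64_t seq)
{
	return seq + 1;
}

/* Major step: advance the high 32 bits and reset the minor part. */
uint64_t seq_inc_major(uint64_t seq)
{
	return (seq & SEQ_MAJOR_MASK) + SEQ_MAJOR_STEP;
}

/* Go back exactly one major step, ignoring any minor increments. */
uint64_t seq_prev_major(uint64_t seq)
{
	assert((seq >> SEQ_MINOR_BITS) > 0); /* a previous major step must exist */
	return (seq & SEQ_MAJOR_MASK) - SEQ_MAJOR_STEP;
}
```

Because both parts live in one variable, any later value still compares greater than an earlier one, regardless of which part was incremented.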
     
  • This is just obnoxious. Just print a message, abort the transaction, and return
    an error. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We kept leaking extent buffers when mounting a broken file system and it turns
    out it's because not everybody uses read_tree_block properly. You need to check
    and make sure the extent_buffer is uptodate before you use it. This patch fixes
    everybody who calls read_tree_block directly to make sure they check that it is
    uptodate and free it and return an error if it is not. With this we no longer
    leak EB's when things go horribly wrong. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • If we fail to load block groups halfway through we can leave extent_state's on
    the excluded tree. This is because we just lookup the supers and add them to
    the excluded tree regardless of which block group we are looking at currently.
    This is a problem because we remove the excluded extents for the range of the
    block group only, so if we don't ever load a block group for one of the excluded
    extents we won't ever free it. This fixes the problem by only adding excluded
    extents if it falls in the block group range we care about. With this patch
    we're no longer leaking space when we fail to read all of the block groups.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
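
Restricting an excluded extent to the block group being loaded amounts to a range intersection. The helper below is a hypothetical sketch of that step only; its name and signature are not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Clamp [start, start + len) to the block group
 * [bg_start, bg_start + bg_len). Returns false when the two ranges do not
 * overlap, in which case nothing should be added to the excluded tree. */
bool clamp_to_block_group(uint64_t start, uint64_t len,
			  uint64_t bg_start, uint64_t bg_len,
			  uint64_t *out_start, uint64_t *out_len)
{
	uint64_t end = start + len;
	uint64_t bg_end = bg_start + bg_len;

	assert(out_start && out_len);

	if (end <= bg_start || start >= bg_end)
		return false;
	if (start < bg_start)
		start = bg_start;
	if (end > bg_end)
		end = bg_end;

	*out_start = start;
	*out_len = end - start;
	return true;
}
```

Only the clamped range is recorded, so removing the excluded extents for that block group's range later frees everything that was added.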
     
  • So I noticed there is an infinite loop in the slow caching code. We return 1
    when we hit the end of the tree, so we could end up caching the last block group
    the slow way and suddenly be looping forever because we just keep
    re-searching and trying again. Fix this by only doing btrfs_next_leaf() if we
    don't need_resched(). Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Argument 'trans' became unnecessary in setup_inline_extent_backref(),
    which calls btrfs_extend_item().

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Josef Bacik

    Tsutomu Itoh
     
  • Argument 'trans' is not used in btrfs_extend_item().

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Josef Bacik

    Tsutomu Itoh
     
  • If argument 'trans' is unnecessary in the function where
    fixup_low_keys() is called, 'trans' is deleted.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Josef Bacik

    Tsutomu Itoh
     
  • Dave was hitting a lockdep warning because we're now properly taking the ordered
    operations mutex in the ordered wait stuff. This is because in some cases we will
    have a trans handle when we are flushing delalloc space, and then we can't wait on
    ordered extents because we could potentially deadlock. Fix this by not doing
    the wait if we have a trans handle. Thanks

    Reported-and-tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I noticed that we will add a block group to the space info before we add it to
    the block group cache rb tree, so we could potentially allocate from the block
    group before it's able to be searched for. I don't think this is too much of
    a problem, the race window is microscopic, but just in case move the tree
    insertion to above the space info linking. This makes it easier to adjust the
    error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real
    error handling setup for these scenarios. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • With more than one btrfs volume mounted, it can be very difficult to find
    out which volume is hitting an error. btrfs_error() will print this, but
    it is currently rigged as more of a fatal error handler, while many of
    the printk()s are currently for debugging and yet-unhandled cases.

    This patch just changes the functions where the device information is
    already available. Some cases remain where the root or fs_info is not
    passed to the function emitting the error.

    This may introduce some confusion with volumes backed by multiple devices
    emitting errors referring to the primary device in the set instead of the
    one on which the error occurred.

    Use btrfs_printk(fs_info, format, ...) rather than writing the device
    string every time, and introduce macro wrappers ala XFS for brevity.
    Since the function already cannot be used for continuations, print a
    newline as part of the btrfs_printk() message rather than at each caller.

    Signed-off-by: Simon Kirby
    Reviewed-by: David Sterba
    Signed-off-by: Josef Bacik

    Simon Kirby
     
  • Each time pick one dead root from the list and let the caller know if
    it's needed to continue. This should improve responsiveness during
    umount and balance which at some point waits for cleaning all currently
    queued dead roots.

    A new dead root is added to the end of the list, so the snapshots
    disappear in the order of deletion.

    The snapshot cleaning work is now done only from the cleaner thread and the
    others wake it if needed.

    Signed-off-by: David Sterba
    Signed-off-by: Josef Bacik

    David Sterba
     
  • We currently store the first key of the tree block inside the reference for the
    tree block in the extent tree. This takes up quite a bit of space. Make a new
    key type for metadata which holds the level as the offset and completely removes
    storing the btrfs_tree_block_info inside the extent ref. This reduces the size
    from 51 bytes to 33 bytes per extent reference for each tree block. In practice
    this results in a 30-35% decrease in the size of our extent tree, which means we
    COW less and can keep more of the extent tree in memory which makes our heavy
    metadata operations go much faster. This is not an automatic format change, you
    must enable it at mkfs time or with btrfstune. This patch deals with having
    metadata stored as either the old format or the new format so it is easy to
    convert. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
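
The 51-byte and 33-byte figures can be reproduced from the sizes of the pieces involved. The structs below are local, packed mirrors written for this sketch (names are hypothetical, and the packed attribute is a GCC/Clang extension); the sketch also assumes one inline reference per tree block item.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACKED __attribute__((packed)) /* GCC/Clang extension */

/* Local mirrors of the packed on-disk layouts; names are hypothetical. */
struct disk_key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} PACKED;

struct extent_item_sketch {
	uint64_t refs;
	uint64_t generation;
	uint64_t flags;
} PACKED;

struct tree_block_info_sketch {
	struct disk_key_sketch key; /* first key of the tree block */
	uint8_t level;
} PACKED;

struct inline_ref_sketch {
	uint8_t type;
	uint64_t offset;
} PACKED;

/* Old format: extent item + tree block info + one inline ref. */
size_t old_ref_size(void)
{
	assert(sizeof(struct tree_block_info_sketch) == 18);
	return sizeof(struct extent_item_sketch) +
	       sizeof(struct tree_block_info_sketch) +
	       sizeof(struct inline_ref_sketch);
}

/* New format: the tree block info is gone; the level moves into the key. */
size_t skinny_ref_size(void)
{
	return sizeof(struct extent_item_sketch) +
	       sizeof(struct inline_ref_sketch);
}
```

Under these assumptions the saving per tree block reference is the 18 bytes of the tree block info, roughly 35% of the old 51-byte item.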
     

30 Mar, 2013

1 commit

  • Pull btrfs fixes from Chris Mason:
    "We've had a busy two weeks of bug fixing. The biggest patches in here
    are some long standing early-enospc problems (Josef) and a very old
    race where compression and mmap combine forces to lose writes (me).
    I'm fairly sure the mmap bug goes all the way back to the introduction
    of the compression code, which is proof that fsx doesn't trigger every
    possible mmap corner after all.

    I'm sure you'll notice one of these is from this morning, it's a small
    and isolated use-after-free fix in our scrub error reporting. I
    double checked it here."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: don't drop path when printing out tree errors in scrub
    Btrfs: fix wrong return value of btrfs_lookup_csum()
    Btrfs: fix wrong reservation of csums
    Btrfs: fix double free in the btrfs_qgroup_account_ref()
    Btrfs: limit the global reserve to 512mb
    Btrfs: hold the ordered operations mutex when waiting on ordered extents
    Btrfs: fix space accounting for unlink and rename
    Btrfs: fix space leak when we fail to reserve metadata space
    Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
    Btrfs: fix race between mmap writes and compression
    Btrfs: fix memory leak in btrfs_create_tree()
    Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
    Btrfs: fix missing qgroup reservation before fallocating
    Btrfs: handle a bogus chunk tree nicely
    Btrfs: update to use fs_state bit

    Linus Torvalds
     

28 Mar, 2013

2 commits

  • A user reported a problem where he was getting early ENOSPC with hundreds of
    gigs of free data space and 6 gigs of free metadata space. This is because the
    global block reserve was taking up the entire free metadata space. This is
    ridiculous, we have infrastructure in place to throttle if we start using too
    much of the global reserve, so instead of letting it get this huge just limit it
    to 512mb so that users can still get work done. This allowed the user to
    complete his rsync without issues. Thanks

    Cc: stable@vger.kernel.org
    Reported-and-tested-by: Stefan Priebe
    Signed-off-by: Josef Bacik

    Josef Bacik
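
The limit itself is a simple clamp. A minimal sketch, assuming the reserve size has already been computed elsewhere; the function and macro names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define GLOBAL_RSV_CAP (UINT64_C(512) * 1024 * 1024) /* 512 MiB */

/* Return the wanted global reserve size, limited to the cap. */
uint64_t cap_global_reserve(uint64_t wanted)
{
	uint64_t capped = wanted < GLOBAL_RSV_CAP ? wanted : GLOBAL_RSV_CAP;

	assert(capped <= GLOBAL_RSV_CAP);
	return capped;
}
```

A computed reserve of 6 GiB would be cut down to 512 MiB, while smaller values pass through unchanged.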
     
  • Dave reported a warning when running xfstest 275. We have been leaking delalloc
    metadata space when our reservations fail. This is because we were improperly
    calculating how much space to free for our checksum reservations. The problem
    is we would sometimes free up space that had already been freed in another
    thread and we would end up with negative usage for the delalloc space. This
    patch fixes the problem by calculating how much space the other threads would
    have already freed, then calculating how much space we would need to free had we
    not done the reservation at all, and then freeing any excess space. With this,
    xfstests 275 no longer leaks space. Thanks

    Cc: stable@vger.kernel.org
    Reported-by: David Sterba
    Signed-off-by: Josef Bacik

    Josef Bacik
     

22 Mar, 2013

1 commit

  • If you restore a btrfs-image file system and try to mount that file system we'll
    panic. That's because btrfs-image restores and just makes one big chunk to
    envelop the whole disk, since these images are really only meant to be messed with by
    our btrfs-progs. So fix up btrfs_rmap_block and the callers of it for mount so
    that we no longer panic but instead just return an error and fail to mount.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

18 Mar, 2013

1 commit

  • Pull btrfs fixes from Chris Mason:
    "Eric's rcu barrier patch fixes a long standing problem with our
    unmount code hanging on to devices in workqueue helpers. Liu Bo
    nailed down a difficult assertion for in-memory extent mappings."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix warning of free_extent_map
    Btrfs: fix warning when creating snapshots
    Btrfs: return as soon as possible when edquot happens
    Btrfs: return EIO if we have extent tree corruption
    btrfs: use rcu_barrier() to wait for bdev puts at unmount
    Btrfs: remove btrfs_try_spin_lock
    Btrfs: get better concurrency for snapshot-aware defrag work

    Linus Torvalds
     

15 Mar, 2013

1 commit

  • The callers of lookup_inline_extent_info all handle getting an error back
    properly, so return an error if we have corruption instead of being a jerk and
    panicking. Still WARN_ON() since this is kind of crucial and I've been seeing it
    a bit too much recently for my taste, I think we're doing something wrong
    somewhere. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

03 Mar, 2013

1 commit

  • Pull btrfs update from Chris Mason:
    "The biggest feature in the pull is the new (and still experimental)
    raid56 code that David Woodhouse started long ago. I'm still working
    on the parity logging setup that will avoid inconsistent parity after
    a crash, so this is only for testing right now. But, I'd really like
    to get it out to a broader audience to hammer out any performance
    issues or other problems.

    scrub does not yet correct errors on raid5/6 either.

    Josef has another pass at fsync performance. The big change here is
    to combine waiting for metadata with waiting for data, which is a big
    latency win. It is also step one toward using atomics from the
    hardware during a commit.

    Mark Fasheh has a new way to use btrfs send/receive to send only the
    metadata changes. SUSE is using this to make snapper more efficient
    at finding changes between snapshots.

    Snapshot-aware defrag is also included.

    Otherwise we have a large number of fixes and cleanups. Eric Sandeen
    wins the award for removing the most lines, and I'm hoping we steal
    this idea from XFS over and over again."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
    btrfs: fixup/remove module.h usage as required
    Btrfs: delete inline extents when we find them during logging
    btrfs: try harder to allocate raid56 stripe cache
    Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic
    Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails
    Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree
    Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails
    Btrfs: fix missing deleted items in btrfs_clean_quota_tree
    btrfs: use only inline_pages from extent buffer
    Btrfs: fix wrong reserved space when deleting a snapshot/subvolume
    Btrfs: fix wrong reserved space in qgroup during snap/subv creation
    Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot
    btrfs: remove a printk from scan_one_device
    Btrfs: fix NULL pointer after aborting a transaction
    Btrfs: fix memory leak of log roots
    Btrfs: copy everything if we've created an inline extent
    btrfs: cleanup for open-coded alignment
    Btrfs: do not change inode flags in rename
    Btrfs: use reserved space for creating a snapshot
    clear chunk_alloc flag on retryable failure
    ...

    Linus Torvalds
     

01 Mar, 2013

4 commits


27 Feb, 2013

2 commits

  • Though most of the btrfs code uses the ALIGN macro for page alignment,
    some code still uses open-coded alignment like the following:
    ------
    u64 mask = ((u64)root->stripesize - 1);
    u64 ret = (val + mask) & ~mask;
    ------
    Or an even more hidden one:
    ------
    num_bytes = (end - start + blocksize) & ~(blocksize - 1);
    ------

    Sometimes this open-coded alignment is not so easy to understand for
    a newbie like me.

    This commit changes the open-coded alignment to the ALIGN macro for
    better readability.

    There is also a previous patch from David Sterba with similar changes,
    but that patch is for the 3.2 kernel and seems not to have been merged.
    http://www.spinics.net/lists/linux-btrfs/msg12747.html

    Cc: David Sterba
    Signed-off-by: Qu Wenruo
    Signed-off-by: Josef Bacik

    Qu Wenruo
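
To see that the open-coded forms quoted in this message and a macro-based form agree, here is a small sketch. `ALIGN_SKETCH` is a local stand-in written for this example, not the kernel's `ALIGN` definition, and the function names are hypothetical; the alignment must be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for an ALIGN-style macro; `a` must be a power of two. */
#define ALIGN_SKETCH(x, a) \
	(((uint64_t)(x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

/* First open-coded form from the message: round val up to stripesize. */
uint64_t align_open_coded(uint64_t val, uint32_t stripesize)
{
	uint64_t mask = (uint64_t)stripesize - 1;

	return (val + mask) & ~mask;
}

/* The "hidden" form: length of [start, end] rounded up to blocksize. */
uint64_t range_len_open_coded(uint64_t start, uint64_t end,
			      uint64_t blocksize)
{
	return (end - start + blocksize) & ~(blocksize - 1);
}

/* Same computation with the intent spelled out. */
uint64_t range_len_aligned(uint64_t start, uint64_t end, uint64_t blocksize)
{
	assert(blocksize && (blocksize & (blocksize - 1)) == 0);
	return ALIGN_SKETCH(end - start + 1, blocksize);
}
```

The hidden form is equivalent because `end - start + blocksize` equals `(end - start + 1) + (blocksize - 1)`, which is exactly what the macro adds before masking.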
     
  • I've experienced filesystem freezes with permanent spikes in the active
    process count for quite a while, particularly on filesystems whose
    available raw space has already been fully allocated to chunks.

    While looking into this, I found a pretty obvious error in
    do_chunk_alloc: it sets space_info->chunk_alloc, but if
    btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
    that flag set, which causes any other threads waiting for
    space_info->chunk_alloc to become zero to spin indefinitely.

    I haven't double-checked that this patch fixes the failure I've observed
    fully (it's not exactly trivial to trigger), but it surely is a bug and
    the fix is trivial, so... Please put it in :-)

    What I saw in that function also happens to explain why in some cases I
    see filesystems allocate a huge number of chunks that remain unused
    (leading to the scenario above, of not having more chunks to allocate).
    It happens for data and metadata, but not necessarily both. I'm
    guessing some thread sets the force_alloc flag on the corresponding
    space_info, and then several threads trying to get disk space end up
    attempting to allocate a new chunk concurrently. All of them will see
    the force_alloc flag and bump their local copy of force up to the level
    they see first, and they won't clear it even if another thread succeeds
    in allocating a chunk, thus clearing the force flag. Then each thread
    that observed the force flag will, on its turn, force the allocation of
    a new chunk. And any threads that come in while it does that will see
    the force flag still set and pick it up, and so on. This sounds like a
    problem to me, but... what should the correct behavior be? Clear
    force_flag once we copy it to a local force? Reset force to the
    incoming value on every loop? Set the flag to our incoming force if we
    have it at first, clear our local flag, and move it from the space_info
    when we determined that we are the thread that's going to perform the
    allocation?

    btrfs: clear chunk_alloc flag on retryable failure

    From: Alexandre Oliva

    If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
    without clearing chunk_alloc in space_info. As a result, any further
    calls to do_chunk_alloc on that filesystem will start busy-waiting for
    chunk_alloc to be cleared, but it never will be. This patch adjusts
    do_chunk_alloc so that it clears this flag in case of an error.

    Signed-off-by: Alexandre Oliva
    Signed-off-by: Josef Bacik

    Alexandre Oliva
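
The shape of the fix, clearing the in-progress flag on every exit path, can be sketched like this. The struct, the callback type, and the function names are hypothetical stand-ins; locking and the real allocation logic are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-space-info state. */
struct space_info_sketch {
	bool chunk_alloc; /* true while one thread is allocating a chunk */
};

/* Stand-in allocator callback: returns 0 or a negative errno. */
typedef int (*alloc_fn)(void);

int do_chunk_alloc_sketch(struct space_info_sketch *info, alloc_fn alloc)
{
	int ret;

	assert(!info->chunk_alloc);
	info->chunk_alloc = true;  /* other threads now wait for this flag */

	ret = alloc();

	/* Clear the flag on success and on any error, so that waiters are
	 * never left spinning after a failure such as -ENOMEM. */
	info->chunk_alloc = false;
	return ret;
}

int alloc_ok(void) { return 0; }
int alloc_enomem(void) { return -ENOMEM; }
```

The bug described in the message corresponds to returning right after a failed `alloc()` without the final assignment.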
     

22 Feb, 2013

1 commit

  • Pull trivial tree from Jiri Kosina:
    "Assorted tiny fixes queued in trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits)
    DocBook: update EXPORT_SYMBOL entry to point at export.h
    Documentation: update top level 00-INDEX file with new additions
    ARM: at91/ide: remove unsused at91-ide Kconfig entry
    percpu_counter.h: comment code for better readability
    x86, efi: fix comment typo in head_32.S
    IB: cxgb3: delay freeing mem untill entirely done with it
    net: mvneta: remove unneeded version.h include
    time: x86: report_lost_ticks doesn't exist any more
    pcmcia: avoid static analysis complaint about use-after-free
    fs/jfs: Fix typo in comment : 'how may' -> 'how many'
    of: add missing documentation for of_platform_populate()
    btrfs: remove unnecessary cur_trans set before goto loop in join_transaction
    sound: soc: Fix typo in sound/codecs
    treewide: Fix typo in various drivers
    btrfs: fix comment typos
    Update ibmvscsi module name in Kconfig.
    powerpc: fix typo (utilties -> utilities)
    of: fix spelling mistake in comment
    h8300: Fix home page URL in h8300/README
    xtensa: Fix home page URL in Kconfig
    ...

    Linus Torvalds
     

21 Feb, 2013

3 commits

  • Very large fallocate requests are cpu bound and result in extents with a
    repeating pattern of ever decreasing size:

    $ time fallocate -l 1T file
    real 0m13.039s

    ( an excerpt of the extents from btrfs-debug-tree: )
    prealloc data disk byte 1536292564992 nr 397312
    prealloc data disk byte 1536292962304 nr 196608
    prealloc data disk byte 1536293158912 nr 98304
    prealloc data disk byte 1536293257216 nr 49152
    prealloc data disk byte 1536293306368 nr 24576
    prealloc data disk byte 1536293330944 nr 12288
    prealloc data disk byte 1536293343232 nr 8192
    prealloc data disk byte 1536293351424 nr 4096
    prealloc data disk byte 1536293355520 nr 4096
    prealloc data disk byte 1536293359616 nr 4096

    The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
    allocate the entire remaining size after each extent is allocated.
    btrfs_reserve_extent() repeatedly cuts this requested size in half until
    it gets down to the size that the allocators can return. We limit the
    problem for now by capping each reservation at 256 meg.

    The small extents come from a masking bug when decreasing the requested
    reservation size. The high 32bits are cleared and the remaining low
    bits might happen to reserve a small size. Fix this by using
    round_down() which properly casts the mask.

    After these fixes huge fallocate requests are fast and result in nice
    large extents:

    $ time fallocate -l 1T file
    real 0m0.082s

    prealloc data disk byte 1112425889792 nr 268435456
    prealloc data disk byte 1112694325248 nr 268435456
    prealloc data disk byte 1112962760704 nr 268435456

    Reported-by: Eric Sandeen
    Signed-off-by: Zach Brown
    Signed-off-by: Chris Mason

    Zach Brown
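
The masking bug described in this message is easy to reproduce in isolation. This sketch assumes a 32-bit `unsigned int` and uses hypothetical function names; it only demonstrates the integer-width effect, not the kernel's `round_down()` macro.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy rounding: ~(sectorsize - 1) is computed in 32 bits and then
 * zero-extended, so the AND clears the high 32 bits of num_bytes. */
uint64_t shrink_buggy(uint64_t num_bytes, uint32_t sectorsize)
{
	return num_bytes & ~(sectorsize - 1);
}

/* Fixed rounding: widen to 64 bits before building the mask. */
uint64_t shrink_fixed(uint64_t num_bytes, uint32_t sectorsize)
{
	assert(sectorsize && (sectorsize & (sectorsize - 1)) == 0);
	return num_bytes & ~((uint64_t)sectorsize - 1);
}
```

For a request of 1 TiB plus 4219 bytes and a 4096-byte sector size, the buggy version yields 4096 while the fixed version yields 1 TiB plus 4096, which matches the tiny extents seen in the first dump.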
     
  • Signed-off-by: Chris Mason

    Conflicts:
    fs/btrfs/ctree.h
    fs/btrfs/extent-tree.c
    fs/btrfs/inode.c
    fs/btrfs/volumes.c

    Chris Mason
     
  • …fs-next into for-linus-3.9

    Signed-off-by: Chris Mason <chris.mason@fusionio.com>

    Conflicts:
    fs/btrfs/disk-io.c

    Chris Mason