28 Mar, 2013

5 commits

  • A user reported a problem where he was getting early ENOSPC with hundreds of
    gigs of free data space and 6 gigs of free metadata space. This is because the
    global block reserve was taking up the entire free metadata space. This is
    ridiculous, we have infrastructure in place to throttle if we start using too
    much of the global reserve, so instead of letting it get this huge just limit it
    to 512mb so that users can still get work done. This allowed the user to
    complete his rsync without issues. Thanks

    Cc: stable@vger.kernel.org
    Reported-and-tested-by: Stefan Priebe
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We need to hold the ordered_operations mutex while waiting on ordered extents
    since we splice and run the ordered extents list. We need to make sure anybody
    else who wants to wait on ordered extents does actually wait for them to be
    completed. This will keep us from bailing out of flushing in case somebody is
    already waiting on ordered extents to complete. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We are way over-reserving for unlink and rename. Rename is just some random
    huge number and unlink accounts for tree log operations that don't actually
    happen during unlink, not to mention the tree log doesn't take from the trans
    block rsv anyway so it's completely useless. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Dave reported a warning when running xfstest 275. We have been leaking delalloc
    metadata space when our reservations fail. This is because we were improperly
    calculating how much space to free for our checksum reservations. The problem
    is we would sometimes free up space that had already been freed in another
    thread and we would end up with negative usage for the delalloc space. This
    patch fixes the problem by calculating how much space the other threads would
    have already freed, and then calculate how much space we need to free had we not
    done the reservation at all, and then freeing any excess space. This makes
    xfstests 275 no longer have leaked space. Thanks

    Cc: stable@vger.kernel.org
    Reported-by: David Sterba
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • When you take a snapshot, punch a hole where there has been data, then take
    another snapshot and try to send an incremental stream, btrfs send would
    give you EIO. That is because is_extent_unchanged had no support for holes
    being punched. With this patch, instead of returning EIO we just return
    0 (== the extent is not unchanged) and we're good.

    Signed-off-by: Jan Schmidt
    Cc: Alexander Block
    Signed-off-by: Josef Bacik

    Jan Schmidt
     

27 Mar, 2013

1 commit

  • Btrfs uses page_mkwrite to ensure stable pages during
    crc calculations and mmap workloads. We call clear_page_dirty_for_io
    before we do any crcs, and this forces any application with the file
    mapped to wait for the crc to finish before it is allowed to change
    the file.

    With compression on, the clear_page_dirty_for_io step is happening after
    we've compressed the pages. This means the applications might be
    changing the pages while we are compressing them, and some of those
    modifications might not hit the disk.

    This commit adds the clear_page_dirty_for_io before compression starts
    and makes sure to redirty the page if we have to fallback to
    uncompressed IO as well.

    Signed-off-by: Chris Mason
    Reported-by: Alexandre Oliva
    cc: stable@vger.kernel.org

    Chris Mason
     

22 Mar, 2013

5 commits

  • We should free leaf and root before returning from the error
    handling code.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Tsutomu Itoh
     
  • To resolve backrefs, ROOT_REPLACE operations in the tree mod log are
    required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation.
    Therefore, those operations must be enclosed by tree_mod_log_write_lock()
    and tree_mod_log_write_unlock() calls.

    Those calls are private to the tree_mod_log_* functions, which means that
    removal of the elements of an old root node must be logged from
    tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5
    (Btrfs: fix a tree mod logging issue for root replacement operations).

    This fixes the brand-new version of xfstest 276 as of commit cfe73f71.

    Cc: stable@vger.kernel.org
    Signed-off-by: Jan Schmidt
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Jan Schmidt
     
  • Steps to reproduce:
    mkfs.btrfs
    mount
    btrfs quota enable
    btrfs sub create /subv
    btrfs qgroup limit 10M /subv
    fallocate --length 20M /subv/data

    For the above example, fallocating will return successfully which
    is not expected, we try to fix it by doing qgroup reservation before
    fallocating.

    Signed-off-by: Wang Shilong
    Reviewed-by: Miao Xie
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • If you restore a btrfs-image file system and try to mount that file system we'll
    panic. That's because btrfs-image restores and just makes one big chunk to
    envelope the whole disk, since they are really only meant to be messed with by
    our btrfs-progs. So fix up btrfs_rmap_block and the callers of it for mount so
    that we no longer panic but instead just return an error and fail to mount.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Now that we use bit operation to check fs_state, update
    btrfs_free_fs_root()'s checker, otherwise we get back to
    memory leak case.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     

16 Mar, 2013

1 commit

  • Users report that an extent map's list is still linked when it's actually
    going to be freed from cache.

    The story is that

    a) when we're going to drop an extent map and may split this large one into
    smaller ems, and if this large one is flagged as EXTENT_FLAG_LOGGING which means
    that it's on the list to be logged, then the smaller ems split from it will also
    be flagged as EXTENT_FLAG_LOGGING, and this is _not_ expected.

    b) we'll keep ems from unlinking the list and freeing when they are flagged with
    EXTENT_FLAG_LOGGING, because the log code holds one reference.

    The end result is the warning, but the truth is that we set the flag
    EXTENT_FLAG_LOGGING only during fsync.

    So clear flag EXTENT_FLAG_LOGGING for extent maps split from a large one.

    Reported-by: Johannes Hirte
    Reported-by: Darrick J. Wong
    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     

15 Mar, 2013

6 commits

  • Creating snapshot passes extent_root to commit its transaction,
    but it can lead to the warning of checking root for quota in
    the __btrfs_end_transaction() when someone else is committing
    the current transaction. Since we've recorded the needed root
    in trans_handle, just use it to get rid of the warning.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • If one of qgroup fails to reserve firstly, we should return immediately,
    it is unnecessary to continue check.

    Signed-off-by: Wang Shilong
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • The callers of lookup_inline_extent_info all handle getting an error back
    properly, so return an error if we have corruption instead of being a jerk and
    panicing. Still WARN_ON() since this is kind of crucial and I've been seeing it
    a bit too much recently for my taste, I think we're doing something wrong
    somewhere. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Doing this would reliably fail with -EBUSY for me:

    # mount /dev/sdb2 /mnt/scratch; umount /mnt/scratch; mkfs.btrfs -f /dev/sdb2
    ...
    unable to open /dev/sdb2: Device or resource busy

    because mkfs.btrfs tries to open the device O_EXCL, and somebody still has it.

    Using systemtap to track bdev gets & puts shows a kworker thread doing a
    blkdev put after mkfs attempts a get; this is left over from the unmount
    path:

    btrfs_close_devices
    __btrfs_close_devices
    call_rcu(&device->rcu, free_device);
    free_device
    INIT_WORK(&device->rcu_work, __free_device);
    schedule_work(&device->rcu_work);

    so unmount might complete before __free_device fires & does its blkdev_put.

    Adding an rcu_barrier() to btrfs_close_devices() causes unmount to wait
    until all blkdev_put()s are done, and the device is truly free once
    unmount completes.

    Cc: stable@vger.kernel.org
    Signed-off-by: Eric Sandeen
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Eric Sandeen
     
  • Remove a useless function declaration

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     
  • Using spinning case instead of blocking will result in better concurrency
    overall.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Liu Bo
     

07 Mar, 2013

3 commits

  • The delayed inode code batches up changes to the btree in hopes of doing
    them in bulk. As the changes build up, processes kick off worker
    threads and wait for them to make progress.

    The current code kicks off an async work queue item for each delayed
    node, which creates a lot of churn. It also uses a fixed 1 HZ waiting
    period for the throttle, which allows us to build a lot of pending
    work and can slow down the commit.

    This changes us to watch a sequence counter as it is bumped during the
    operations. We kick off fewer work items and have each work item do
    more work.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Raid56 merge (merge commit e942f88) had mistakenly removed a call to
    __cancel_balance(), which resulted in balance not cleaning up after itself
    after a successful finish. (Cleanup includes switching the state, removing
    the balance item and releasing mut_ex_op testnset lock.) Bring it back.

    Reported-by: David Sterba
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Chris Mason

    Ilya Dryomov
     
  • …fs-next into for-linus-3.9

    Chris Mason
     

06 Mar, 2013

1 commit

  • Commit 24542bf7ea5e4fdfdb5157ff544c093fa4dcb536 changed preallocation of
    extents to cap the max size we try to allocate. It's a valid change,
    but the extent reservation code is also used by balance, and that
    can't tolerate a smaller extent being allocated.

    __btrfs_prealloc_file_range already has a min_size parameter, which is
    used by relocation to request a specific extent size. This commit
    adds an extra check to enforce that minimum extent size.

    Signed-off-by: Chris Mason
    Reported-by: Stefan Behrens

    Chris Mason
     

05 Mar, 2013

10 commits

  • Commit 5ac00add added a testnset mutex and code that disallows
    running administrative tasks in parallel. It is prevented that
    the device add/delete/balance/replace/resize operations are
    started in parallel. By mistake, the defragmentation operation
    was included in the check for mutually exclusiveness as well.
    This is fixed with this commit.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik

    Stefan Behrens
     
  • Only let one trans handle to wait for other handles, otherwise we
    will get ABBA issues.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • Btrfs balance can easily hit BUG_ON in these places, but we want
    to it bail out gracefully after we force the whole filesystem to
    readonly. So we use btrfs_std_error hook in place of BUG_ON.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • We can bail out from here gracefully instead of a cold BUG_ON.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • We've missed the 'free blocks' part on ENOMEM error.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • We first use btrfs_std_error hook to replace with BUG_ON, and we
    also need to cleanup what is left, including reloc roots rbtree
    and reloc roots list.
    Here we use a helper function to cleanup both rbtree and list, and
    since this function can also be used in the balance recover path,
    we also make the change as well to keep code simple.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • Add a check for NULL pointer to avoid invalid reference.

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • If the async transaction commitment failed, we need close the
    current transaction handler, or the current transaction will be
    blocked to commit because of this orphan handler.

    We fix the problem by doing sync transaction commitment, that is
    to invoke btrfs_commit_transaction().

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • There are several bugs at error path of create_snapshot() when the
    transaction commitment failed.
    - access the freed transaction handler. At the end of the
    transaction commitment, the transaction handler was freed, so we
    should not access it after the transaction commitment.
    - we were not aware of the error which happened during the snapshot
    creation if we submitted a async transaction commitment.
    - pending snapshot access vs pending snapshot free. when something
    wrong happened after we submitted a async transaction commitment,
    the transaction committer would cleanup the pending snapshots and
    free them. But the snapshot creators were not aware of it, they
    would access the freed pending snapshots.

    This patch fixes the above problems by:
    - remove the dangerous code that accessed the freed handler
    - assign ->error if the error happens during the snapshot creation
    - the transaction committer doesn't free the pending snapshots,
    just assigns the error number and evicts them before we unblock
    the transaction.

    Reported-by: Dan Carpenter
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • We need to inc the nlink of deleted entries when running replay so we can do the
    unlink on the fs_root and get everything cleaned up and then have the orphan
    cleanup do the right thing. The problem is inc_nlink complains about this, even
    thought it still does the right thing. So use set_nlink() if our i_nlink is 0
    to keep users from seeing the warnings during log replay. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

03 Mar, 2013

1 commit

  • tilegx_defconfig:

    fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
    fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
    fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer without a cast [enabled by default]
    fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Chris Mason

    Geert Uytterhoeven
     

02 Mar, 2013

2 commits

  • We want to avoid module.h where posible, since it in turn includes
    nearly all of header space. This means removing it where it is not
    required, and using export.h where we are only exporting symbols via
    EXPORT_SYMBOL and friends.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Chris Mason

    Paul Gortmaker
     
  • Apparently when we do inline extents we allow the data to overlap the last chunk
    of the btrfs_file_extent_item, which means that we can possibly have a
    btrfs_file_extent_item that isn't actually as large as a btrfs_file_extent_item.
    This messes with us when we try to overwrite the extent when logging new extents
    since we expect for it to be the right size. To fix this just delete the item
    and try to do the insert again which will give us the proper sized
    btrfs_file_extent_item. This fixes a panic where map_private_extent_buffer
    would blow up because we're trying to write past the end of the leaf. Thanks,

    Cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik

    Josef Bacik
     

01 Mar, 2013

5 commits

  • The stripe hash table is large, starting with allocation order 4 and can go as
    high as order 7 in case lock debugging is turned on and structure padding
    happens.

    Observed mount failure:

    mount: page allocation failure: order:7, mode:0x200050
    Pid: 8234, comm: mount Tainted: G W 3.8.0-default+ #267
    Call Trace:
    [] warn_alloc_failed+0xf3/0x140
    [] ? __alloc_pages_direct_compact+0x92/0x250
    [] __alloc_pages_nodemask+0x733/0x9d0
    [] ? cache_alloc_refill+0x3f8/0x840
    [] cache_alloc_refill+0x43c/0x840
    [] ? is_kernel_percpu_address+0x4b/0x90
    [] ? btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
    [] kmem_cache_alloc_trace+0x247/0x270
    [] btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
    [] open_ctree+0xb2f/0x1f90 [btrfs]
    [] ? string+0x49/0xe0
    [] ? vsnprintf+0x443/0x5d0
    [] btrfs_mount+0x526/0x600 [btrfs]
    [] ? cache_alloc_debugcheck_after+0x4c/0x200
    [] mount_fs+0x20/0xe0
    [] vfs_kern_mount+0x76/0x120
    [] do_mount+0x386/0x980
    [] ? strndup_user+0x5b/0x80
    [] sys_mount+0x90/0xe0
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: David Sterba
    Signed-off-by: Josef Bacik

    David Sterba
     
  • The original code is a little confusing and not clear, The right
    way to deal with the kernel code like this:
    [...]
    if (ret)
    goto out;
    [...]

    So i move the common clean_up code to the place labeled with
    out_fail, this will be easier to maintain.

    Signed-off-by: Wang Shilong
    Signed-off-by: Josef Bacik

    Wang Shilong
     
  • commit eb6b88d92c6df083dd09a8c471011e3788dfd7c6 leads into another bug.
    If it is just because qgroup_reserve fails, the function btrfs_qgroup_free
    should not be called, otherwise, it will cause the wrong quota accounting.

    Signed-off-by: Wang Shilong
    Signed-off-by: Josef Bacik

    Wang Shilong
     
  • The check work has been done just before the function btrfs_clean_quota_tree
    is called, it is not necessary to check it again, remove it.

    Signed-off-by: Wang Shilong
    Signed-off-by: Josef Bacik

    Wang Shilong
     
  • Return ENOMEM rather trigger BUG_ON, fix it.

    Signed-off-by: Wang Shilong
    Reviewed-by: Miao Xie
    Reviewed-by: Zach Brown
    Signed-off-by: Josef Bacik

    Wang Shilong