31 Mar, 2011

1 commit


28 Mar, 2011

1 commit


18 Mar, 2011

1 commit

  • If we cannot truncate an inode for some reason we will never delete the orphan
    item associated with that inode, which means that we will loop forever in
    btrfs_orphan_cleanup. Instead of doing this just return error so we fail to
    mount. It sucks, but hey it's better than hanging. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
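
A minimal sketch of the behaviour described in the commit above, written in Python rather than kernel C. The names `orphan_cleanup` and `truncate` are hypothetical stand-ins, not the btrfs functions; the point is only that a failed truncate ends the loop with an error instead of retrying the same orphan forever.

```python
import errno


def orphan_cleanup(orphans, truncate):
    """Process orphan inodes in order; stop with an error if one cannot be truncated.

    `truncate` is a hypothetical callback returning 0 on success or a
    negative errno on failure.
    """
    pending = list(orphans)
    while pending:
        inode = pending[0]
        ret = truncate(inode)
        if ret:
            # The orphan item is still there, so looping again would never
            # make progress. Return the error and let the caller fail the mount.
            return ret
        # The orphan item is only dropped after a successful truncate.
        pending.pop(0)
    return 0
```

A caller that models the mount path would treat any non-zero return as a failed mount.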
     

17 Feb, 2011

1 commit

  • Btrfs device shrinking and balancing ends up reallocating all the blocks
    in order to allow COW to move them to new destinations. It is somewhat
    awkward in terms of ENOSPC because most of the enospc code is built
    around the idea that some operation on a reference counted tree triggers
    allocations in the non-reference counted trees.

    This commit changes the balancing code to deal with enospc by trying to
    allocate a new chunk. If that allocation succeeds, we go ahead and
    retry whatever failed due to enospc.

    Signed-off-by: Chris Mason

    Chris Mason
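
The retry described in the commit above can be sketched roughly as follows. This is a Python model under stated assumptions, not the kernel code: `Space`, `reserve`, `alloc_chunk` and `relocate_step` are hypothetical names, and looping until chunk allocation fails is one possible reading of "retry whatever failed".

```python
import errno


class Space:
    """Toy model of free space plus a pool of unallocated chunks (hypothetical)."""

    def __init__(self, free, spare_chunks, chunk_size):
        self.free = free
        self.spare_chunks = spare_chunks
        self.chunk_size = chunk_size

    def reserve(self, num_bytes):
        if num_bytes > self.free:
            return -errno.ENOSPC
        self.free -= num_bytes
        return 0

    def alloc_chunk(self):
        if self.spare_chunks == 0:
            return -errno.ENOSPC
        self.spare_chunks -= 1
        self.free += self.chunk_size
        return 0


def relocate_step(space, needed):
    """Try the operation; on ENOSPC allocate a chunk and retry."""
    while True:
        ret = space.reserve(needed)
        if ret != -errno.ENOSPC:
            return ret
        # The operation failed for lack of space: try to add a new chunk.
        if space.alloc_chunk() != 0:
            return -errno.ENOSPC
```

With `Space(free=1, spare_chunks=2, chunk_size=4)` and a request for 8 bytes, the model allocates both spare chunks before the reservation succeeds.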
     

15 Feb, 2011

1 commit


01 Feb, 2011

1 commit


29 Jan, 2011

1 commit

  • The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock()
    is added, and the mistake of the error check in several places is
    corrected.

    For more stable Btrfs, I think that we should reduce BUG_ON().
    But, I think that long time is necessary for this.
    So, I propose this patch as a short-term solution.

    With this patch:
    - To more stable Btrfs, the part that should be corrected is clarified.
    - The panic isn't done by the NULL pointer reference etc. (even if
    BUG_ON() is increased temporarily)
    - The error code is returned in the place where the error can be easily
    returned.

    As a long-term plan:
    - BUG_ON() is reduced by using the forced-readonly framework, etc.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Tsutomu Itoh
     

30 Oct, 2010

1 commit

  • These are all the cases where a variable is set but not
    read, and which are really bugs.

    - A couple of incorrect error-handling cases fixed.
    - One incorrect use of an allocation policy
    - Some other things

    Still needs more review.

    Found by gcc 4.6's new warnings.

    [akpm@linux-foundation.org: fix build. Might have been bitrot]
    Signed-off-by: Andi Kleen
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Mason

    Andi Kleen
     

29 Oct, 2010

2 commits

  • Conflicts:
    fs/btrfs/extent-tree.c

    Signed-off-by: Chris Mason

    Chris Mason
     
  • In order to save the free space cache, we need an inode to hold the data, and we
    need a special item to point at the right inode for the right block group. So
    first, create a special item that will point to the right inode, and record the
    number of extent entries and the number of bitmaps we will have. We truncate and
    pre-allocate space every time to make sure it's up to date.

    This feature will be turned on as soon as you mount with -o space_cache; however,
    it is safe to boot into old kernels, which will just generate the cache the
    old-fashioned way. When you boot back into a newer kernel we will notice that the
    filesystem was modified without the cache being updated, and automatically
    discard the cache.

    Signed-off-by: Josef Bacik

    Josef Bacik
     

23 Oct, 2010

1 commit

  • With multi-threaded writes we were getting ENOSPC early because somebody would
    come in, start flushing delalloc because they couldn't make their reservation,
    and in the meantime other threads would come in and use the space that was
    getting freed up, so when the original thread went to check to see if they had
    space they didn't and they'd return ENOSPC. So instead if we have some free
    space but not enough for our reservation, take the reservation and then start
    doing the flushing. The only time we don't take reservations is when we've
    already overcommitted our space, that way we don't have people who come late to
    the party way overcommitting ourselves. This also moves all of the retrying and
    flushing code into reserve_metadata_bytes so it's all uniform. This keeps my
    fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
    fill up the disk. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
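
A rough Python model of the ordering described in the commit above: claim the reservation first when some space is free, then flush. `SpaceInfo`, `reclaimable` and `reserve_bytes` are hypothetical names introduced for illustration, and the exact conditions are an assumption rather than the logic of the kernel function.

```python
import errno


class SpaceInfo:
    """Toy accounting: `reclaimable` stands in for space a delalloc flush frees."""

    def __init__(self, total, reserved=0, reclaimable=0):
        self.total = total
        self.reserved = reserved
        self.reclaimable = reclaimable

    def flush(self):
        self.reserved -= self.reclaimable
        self.reclaimable = 0


def reserve_bytes(space, num_bytes):
    if space.reserved + num_bytes <= space.total:
        space.reserved += num_bytes
        return 0

    # Some space is free but not enough: take the reservation before
    # flushing, so other threads cannot grab what the flush releases.
    took_first = space.reserved < space.total
    if took_first:
        space.reserved += num_bytes

    space.flush()

    if took_first:
        if space.reserved <= space.total:
            return 0
        space.reserved -= num_bytes  # flushing was not enough: give it back
        return -errno.ENOSPC

    # Already overcommitted: do not pile on; only reserve if the flush
    # made enough room.
    if space.reserved + num_bytes <= space.total:
        space.reserved += num_bytes
        return 0
    return -errno.ENOSPC
```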
     

12 Jun, 2010

1 commit


25 May, 2010

4 commits

  • This patch adds metadata ENOSPC handling for the balance code.
    It consists of the following major changes:

    1. Avoid COWing tree leaves in the phase of merging trees.

    2. Handle interaction with snapshot creation.

    3. Make the backref cache able to live across transactions.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • Pre-allocate space for data relocation. This can detect an ENOSPC
    condition caused by fragmentation of free space.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • Besides simplifying the code, this change makes sure all metadata
    reservations for normal metadata operations are released after
    committing the transaction.

    Changes since V1:

    Add code that checks if unlink and rmdir will free space.

    Add ENOSPC handling for clone ioctl.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • Introducing metadata reservation contexts has two major advantages.
    First, it makes metadata reservation more traceable. Second, it can
    reclaim freed space and re-add it to itself after the transaction is
    committed.

    Besides adding the btrfs_block_rsv structure and related helper functions,
    this patch contains the following changes:

    Move code that decides if freed tree block should be pinned into
    btrfs_free_tree_block().

    Make space accounting more accurate, mainly for handling read only
    block groups.

    Signed-off-by: Chris Mason

    Yan, Zheng
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
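
    The first rule above (gfp.h if only gfp is used, slab.h if slab is used)
    can be illustrated with a small Python sketch. This is not the
    slabh-sweep.py script; the regular expressions and the name
    `needed_include` are assumptions made for illustration only.

```python
import re

# Hypothetical patterns: a few well-known slab calls and GFP_* flags.
SLAB_RE = re.compile(r"\b(kmalloc|kzalloc|kfree|kmem_cache_\w+)\s*\(")
GFP_RE = re.compile(r"\bGFP_[A-Z]+\b")


def needed_include(source):
    """Return the header a C source string needs, or None.

    slab usage wins over gfp usage, mirroring the rule in the text.
    """
    if SLAB_RE.search(source):
        return "linux/slab.h"
    if GFP_RE.search(source):
        return "linux/gfp.h"
    return None
```

    For example, a file that only calls alloc_page(GFP_KERNEL) would get
    gfp.h, while one that calls kmalloc() would get slab.h.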

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

15 Mar, 2010

2 commits

  • This patch just goes through and fixes everybody that does

    lock_extent()
    blah
    unlock_extent()

    to use

    lock_extent_bits()
    blah
    unlock_extent_cached()

    and pass around an extent_state so we only have to do the searches once per
    function. This gives me about a 3 MB/s boost on my random write test. I have
    not converted some things, like the relocation and ioctl's, since they aren't
    heavily used and the relocation stuff is in the middle of being re-written. I
    also changed the clear_extent_bit() to only unset the cached state if we are
    clearing EXTENT_LOCKED and related stuff, so we can do things like this

    lock_extent_bits()
    clear delalloc bits
    unlock_extent_cached()

    without losing our cached state. I tested this thoroughly and turned on
    LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
    fine.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
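
The saving from passing a cached state between lock and unlock can be illustrated with a toy Python model that counts tree searches. `ExtentTree` and its methods are hypothetical and much simpler than the kernel's extent_io code; the cached value here is just the start offset of the state that was found.

```python
import bisect


class ExtentTree:
    """Toy extent-state index that counts how many searches it performs."""

    def __init__(self, starts):
        self.starts = sorted(starts)
        self.locked = set()
        self.searches = 0

    def _search(self, start):
        self.searches += 1
        i = bisect.bisect_left(self.starts, start)
        if i == len(self.starts) or self.starts[i] != start:
            raise KeyError(start)
        return start

    def lock(self, start, cached=None):
        # Reuse the cached state when it matches, otherwise search the tree.
        state = cached if cached == start else self._search(start)
        self.locked.add(state)
        return state

    def unlock(self, start, cached=None):
        state = cached if cached == start else self._search(start)
        self.locked.discard(state)
```

Locking and unlocking without the cached state costs two searches; handing the value returned by `lock` to `unlock` costs one.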
     
  • This work is in preparation for being able to set a different root as the
    default mounting root.

    There is currently a problem with how we mount subvolumes. We cannot currently
    mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the
    default subvolume. So say you take a snapshot of the default subvolume and call
    it snap1, and then take a snapshot of snap1 and call it snap2, so now you have

    /
    /snap1
    /snap1/snap2

    as your available volumes. Currently you can only mount / and /snap1,
    you cannot mount /snap1/snap2. To fix this problem, instead of passing
    subvolid= you must pass in subvolid=<treeid>, where <treeid> is
    the tree id that gets spit out via the subvolume listing you get from
    the subvolume listing patches (btrfs filesystem list). This allows us
    to mount /, /snap1 and /snap1/snap2 as the root volume.

    In addition to the above, we also now read the default dir item in the
    tree root to get the root key that it points to. For now this just
    points at what has always been the default subvolume, but later on I plan
    to change it to point at whatever root you want to be the new default
    root, so you can just set the default mount and not have to mount with
    -o subvolid=. I tested this out with the above scenario and it
    worked perfectly. Thanks,

    mount -o subvol operates inside the selected subvolid. For example:

    mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt

    /mnt will have the snap1 directory for the subvolume with id
    256.

    mount -o subvol=snap /dev/xxx /mnt

    /mnt will be the snap directory of whatever the default subvolume
    is.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

09 Mar, 2010

1 commit

  • btrfs initializes rb trees in quite a number of places by setting rb_node =
    NULL; The problem with this is that 17d9ddc72fb8bba0d4f678 in the
    linux-next tree adds a new field to that struct which needs to be NULL for
    the new rbtree library code to work properly. This patch uses RB_ROOT as
    the initializer so all of the relevant fields will be NULL'd. Without the
    patch I get a panic.

    Signed-off-by: Eric Paris
    Acked-by: Venkatesh Pallipadi
    Signed-off-by: Chris Mason

    Eric Paris
     

05 Feb, 2010

1 commit

  • Mounting a bad filesystem caused a BUG_ON(). The following steps
    reproduce it.
    # mkfs.btrfs /dev/sda2
    # mount /dev/sda2 /mnt
    # mkfs.btrfs /dev/sda1 /dev/sda2
    (the program says that /dev/sda2 was mounted, and then exits. )
    # umount /mnt
    # mount /dev/sda1 /mnt

    At the third step, mkfs.btrfs exited partway through making the filesystem, so
    the initialization of the filesystem didn't finish. The filesystem was therefore
    bad, and it caused BUG_ON() when mounting it. But BUG_ON() should be triggered by
    incorrect code, not by a user's operation, so I think it is a bug in btrfs.

    This patch fixes it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     

18 Jan, 2010

1 commit


18 Dec, 2009

3 commits

  • iput() can trigger new transactions if we are dropping the
    final reference, so calling it in btrfs_commit_transaction
    may end up in a deadlock. This patch adds delayed iput to avoid
    the issue.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
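
The idea of deferring iput() until the commit has finished can be sketched as a simple queue. This Python model is an illustration only: the class `Fs` and its methods are hypothetical, and the `RuntimeError` merely stands in for the deadlock the commit message describes.

```python
class Fs:
    """Toy filesystem state with a queue of delayed inode releases."""

    def __init__(self):
        self.in_commit = False
        self.delayed_iputs = []
        self.released = []

    def iput(self, inode):
        if self.in_commit:
            # Stand-in for the deadlock: a final iput may start a transaction.
            raise RuntimeError("iput inside commit would deadlock")
        self.released.append(inode)

    def add_delayed_iput(self, inode):
        self.delayed_iputs.append(inode)

    def run_delayed_iputs(self):
        pending, self.delayed_iputs = self.delayed_iputs, []
        for inode in pending:
            self.iput(inode)

    def commit_transaction(self, inodes_to_drop):
        self.in_commit = True
        try:
            for inode in inodes_to_drop:
                # Queue the release instead of dropping the reference now.
                self.add_delayed_iput(inode)
        finally:
            self.in_commit = False
        # Safe point: the commit is over, so the releases can run.
        self.run_delayed_iputs()
```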
     
  • truncating and deleting regular files are unbound operations,
    so it's not good to do them in a single transaction. This
    patch makes btrfs_truncate and btrfs_delete_inode start a
    new transaction after all items in a tree leaf are deleted.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
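
Splitting an unbounded delete into one transaction per leaf can be modelled as simple batching. In this Python sketch `delete_in_batches` and `leaf_capacity` are hypothetical; each returned batch stands for the items removed inside a single transaction.

```python
def delete_in_batches(items, leaf_capacity):
    """Group items into per-transaction batches of at most `leaf_capacity`."""
    transactions = []
    remaining = list(items)
    while remaining:
        # One leaf's worth of items is deleted, then a new transaction starts.
        batch, remaining = remaining[:leaf_capacity], remaining[leaf_capacity:]
        transactions.append(batch)
    return transactions
```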
     
  • We do log replay in a single transaction, so it's not good to do unbound
    operations. This patch performs orphan inode cleanup after replaying
    the log. It also avoids doing other unbound operations such as truncating
    a file while replaying the log. These unbound operations are postponed to
    the orphan inode cleanup stage.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

05 Oct, 2009

1 commit

  • The btrfs async worker threads are used for a wide variety of things,
    including processing bio end_io functions. This means that when
    the endio threads aren't running, the rest of the FS isn't
    able to do the final processing required to clear PageWriteback.

    The endio threads also try to exit as they become idle and
    start more as the work piles up. The problem is that starting more
    threads means kthreadd may need to allocate ram, and that allocation
    may wait until the global number of writeback pages on the system is
    below a certain limit.

    The result of that throttling is that end IO threads wait on
    kthreadd, who is waiting on IO to end, which will never happen.

    This commit fixes the deadlock by handing off thread startup to a
    dedicated thread. It also fixes a bug where the on-demand thread
    creation was creating far too many threads because it didn't take into
    account threads being started by other procs.

    Signed-off-by: Chris Mason

    Chris Mason
     

24 Sep, 2009

1 commit

  • The extent relocation code copies file extents one by one when
    relocating a data block group. This is inefficient if file
    extents are small. This patch makes the relocation code copy
    file extents in clusters, so we can make better use of
    read-ahead.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
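
A minimal sketch of grouping extents into clusters, assuming that only contiguous extents are clustered and that a cluster has a size limit. Both assumptions, and the names `cluster_extents` and `max_cluster`, are illustrative and not taken from the relocation code.

```python
def cluster_extents(extents, max_cluster):
    """Group (start, length) extents, sorted by start, into contiguous clusters."""
    clusters = []
    for start, length in extents:
        if clusters:
            last = clusters[-1]
            last_end = last[-1][0] + last[-1][1]
            # Extend the current cluster only if this extent follows it
            # directly and the cluster is not full yet.
            if start == last_end and len(last) < max_cluster:
                last.append((start, length))
                continue
        clusters.append([(start, length)])
    return clusters
```

Each cluster can then be read in one pass, which is where read-ahead helps.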
     

22 Sep, 2009

1 commit


12 Sep, 2009

2 commits

  • This changes the btrfs code to find delalloc ranges in the extent state
    tree to use the new state caching code from set/test bit. It reduces
    one of the biggest causes of rbtree searches in the writeback path.

    test_range_bit is also modified to take the cached state as a starting
    point while searching.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • There are two main users of the extent_map tree. The
    first is regular file inodes, where it is evenly spread
    between readers and writers.

    The second is the chunk allocation tree, which maps blocks from
    logical addresses to physical ones, and it is 99.99% reads.

    The mapping tree is a point of lock contention during heavy IO
    workloads, so this commit switches things to a rw lock.

    Signed-off-by: Chris Mason

    Chris Mason
     

08 Aug, 2009

1 commit

  • invalidate_inode_pages2_range may occasionally return -EBUSY,
    which results in an Oops. This patch fixes the issue by moving
    invalidate_inode_pages2_range into a loop and calling it
    repeatedly until the return value is not -EBUSY.

    The EBUSY return is temporary, and can happen when the btrfs release page
    function is unable to release a page because the EXTENT_LOCK
    bit is set.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
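
The retry loop described in the commit above has a simple shape, sketched here in Python. `invalidate_range` and the `invalidate` callback are hypothetical stand-ins; like the description, the loop has no upper bound and relies on -EBUSY being temporary.

```python
import errno


def invalidate_range(invalidate):
    """Call `invalidate` until it returns something other than -EBUSY."""
    while True:
        ret = invalidate()
        if ret != -errno.EBUSY:
            return ret
```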
     

22 Jul, 2009

1 commit

  • When walking up the tree, btrfs_find_next_key assumes the upper level tree
    block is properly locked. This isn't always true even when path->keep_locks is 1.
    This is because btrfs_find_next_key may advance path->slots[] several times
    instead of only once.

    When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
    we can't guarantee the original value of 'path->slots[level]' is
    'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
    'level + 1' isn't locked.

    This patch fixes the issue by explicitly checking the locking state,
    re-searching the tree if it's not locked.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

03 Jul, 2009

1 commit

  • The new backref format has a restriction on the type of backref item. If a tree
    block isn't referenced by its owner tree, full backrefs must be used for the
    pointers in it. When a tree block loses its owner tree's reference, backrefs
    for the pointers in it should be updated to full backrefs. Current
    btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
    general use.

    This patch adds backrefs update code to btrfs_drop_snapshot. It isn't a
    problem in the restricted form btrfs_drop_snapshot is used today, but for
    general snapshot deletion this update is required.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

10 Jun, 2009

1 commit

  • This commit introduces a new kind of back reference for btrfs metadata.
    Once a filesystem has been mounted with this commit, IT WILL NO LONGER
    BE MOUNTABLE BY OLDER KERNELS.

    When a tree block in subvolume tree is cow'd, the reference counts of all
    extents it points to are increased by one. At transaction commit time,
    the old root of the subvolume is recorded in a "dead root" data structure,
    and the btree it points to is later walked, dropping reference counts
    and freeing any blocks where the reference count goes to 0.

    The increments done during cow and decrements done after commit cancel out,
    and the walk is a very expensive way to go about freeing the blocks that
    are no longer referenced by the new btree root. This commit reduces the
    transaction overhead by avoiding the need for dead root records.

    When a non-shared tree block is cow'd, we free the old block at once, and the
    new block inherits old block's references. When a tree block with reference
    count > 1 is cow'd, we increase the reference counts of all extents
    the new block points to by one, and decrease the old block's reference count by
    one.

    This dead tree avoidance code removes the need to modify the reference
    counts of lower level extents when a non-shared tree block is cow'd.
    But we still need to update back ref for all pointers in the block.
    This is because the location of the block is recorded in the back ref
    item.

    We can solve this by introducing a new type of back ref. The new
    back ref provides information about pointer's key, level and in which
    tree the pointer lives. This information allows us to find the pointer
    by searching the tree. The shortcoming of the new back ref is that it
    only works for pointers in tree blocks referenced by their owner trees.

    This is mostly a problem for snapshots, where resolving one of these
    fuzzy back references would be O(number_of_snapshots) and quite slow.
    The solution used here is to use the fuzzy back references in the common
    case where a given tree block is only referenced by one root,
    and use the full back references when multiple roots have a reference
    on a given block.

    This commit adds a per-subvolume red-black tree to keep track of cached
    inodes. The red-black tree helps the balancing code to find cached
    inodes whose inode numbers are within a given range.

    This commit improves the balancing code by introducing several data
    structures to keep the state of balancing. The most important one
    is the back ref cache. It caches how the upper level tree blocks are
    referenced. This greatly reduces the overhead of checking back ref.

    The improved balancing code scales significantly better with a large
    number of snapshots.

    This is a very large commit and was written in a number of
    pieces. But, they depend heavily on the disk format change and were
    squashed together to make sure git bisect didn't end up in a
    bad state wrt space balancing or the format change.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
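
The two COW cases described in the commit above (non-shared block versus block with reference count > 1) can be modelled in a few lines of Python. This is a simplified illustration: `Block` and `cow` are hypothetical, and back references, transactions and on-disk format are left out entirely.

```python
class Block:
    """Toy tree block with a reference count and pointers to child blocks."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.refs = 1


def cow(block, freed):
    """Copy `block`; append its name to `freed` if it can be freed at once."""
    new = Block(block.name + "'", block.children)
    if block.refs == 1:
        # Non-shared: free the old block immediately. The new block inherits
        # its references, so child reference counts are left untouched.
        freed.append(block.name)
    else:
        # Shared: the children are now reachable from one more block, and the
        # old block loses the reference held by the tree being modified.
        for child in new.children:
            child.refs += 1
        block.refs -= 1
    return new
```

In the non-shared case nothing below the copied block is modified, which is the saving the commit message describes.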