18 Jan, 2011

1 commit

  • This patch comes from the "Forced readonly mounts on errors" idea.

    As we know, this is the first step in being more fault tolerant of disk
    corruptions instead of just using BUG() statements.

    The major content:
    - add a framework for generating errors that should result in filesystems
    going readonly.
    - keep the FS state in the on-disk super block.
    - make sure that all resources will be freed and released at umount time.
    - make sure that after the FS is forced readonly on error, there will be no
    more disk changes before the FS is corrected. For this, we must stop write
    operations.

    After this patch is applied, the conversion from BUG() to such a framework can
    happen incrementally.
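
    As a rough illustration of the kind of helper such a framework provides
    (a sketch following the description above, not a verbatim copy of the
    patch):

        /*
         * Sketch: instead of BUG(), record the error in the FS state kept in
         * the super block and force the filesystem readonly.
         */
        static void btrfs_handle_error(struct btrfs_fs_info *fs_info)
        {
                struct super_block *sb = fs_info->sb;

                if (sb->s_flags & MS_RDONLY)
                        return;                 /* already readonly, nothing to do */

                fs_info->fs_state = BTRFS_SUPER_FLAG_ERROR;  /* keep FS state in the super block */
                sb->s_flags |= MS_RDONLY;                    /* stop further write operations */
                printk(KERN_INFO "btrfs is forced readonly\n");
        }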

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
     

23 Dec, 2010

1 commit

  • Usage:

    Set BTRFS_SUBVOL_RDONLY in btrfs_ioctl_vol_args_v2->flags, and call
    ioctl(BTRFS_IOC_SNAP_CREATE_V2).
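
    For example, a minimal userspace sketch of that usage (error handling
    omitted; assumes a kernel header such as linux/btrfs.h provides the
    struct and constants used below):

        #include <fcntl.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/btrfs.h>  /* btrfs_ioctl_vol_args_v2, BTRFS_IOC_SNAP_CREATE_V2 */

        int make_ro_snapshot(const char *src_subvol, const char *dst_dir, const char *name)
        {
                struct btrfs_ioctl_vol_args_v2 args;
                int src = open(src_subvol, O_RDONLY);  /* subvolume to snapshot */
                int dst = open(dst_dir, O_RDONLY);     /* directory the snapshot goes into */

                memset(&args, 0, sizeof(args));
                args.fd = src;                         /* source subvolume */
                args.flags = BTRFS_SUBVOL_RDONLY;      /* request a readonly snapshot */
                strncpy(args.name, name, BTRFS_SUBVOL_NAME_MAX);

                return ioctl(dst, BTRFS_IOC_SNAP_CREATE_V2, &args);
        }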

    Implementation:

    - Set readonly bit of btrfs_root_item->flags.
    - Add readonly checks in btrfs_permission (inode_permission),
    btrfs_setattr, btrfs_set/remove_xattr and some ioctls.

    Changelog for v3:

    - Eliminate btrfs_root->readonly and check btrfs_root->root_item.flags instead.
    - Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.

    Signed-off-by: Li Zefan

    Li Zefan
     

22 Nov, 2010

1 commit

  • There are lots of places where we do dentry->d_parent->d_inode without holding
    the dentry->d_lock. This could cause problems with rename. So instead we need
    to use dget_parent() and hold the reference to the parent as long as we are
    going to use its inode, and then dput it at the end.
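
    The resulting pattern looks roughly like this (sketch):

        struct dentry *parent = dget_parent(dentry); /* takes a stable reference to the parent */
        struct inode *dir = parent->d_inode;         /* safe to use while we hold the reference */

        /* ... work with dir ... */

        dput(parent);                                /* drop the reference when we are done */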

    Signed-off-by: Josef Bacik
    Cc: raven@themaw.net
    Signed-off-by: Chris Mason

    Josef Bacik
     

30 Oct, 2010

3 commits

  • START_SYNC will start a sync/commit, but not wait for it to
    complete. Any modification started after the ioctl returns is
    guaranteed not to be included in the commit. If a non-NULL
    pointer is passed, the transaction id will be returned to
    userspace.

    WAIT_SYNC will wait for any in-progress commit to complete. If a
    transaction id is specified, the ioctl will block and then
    return (success) when the specified transaction has committed.
    If it has already committed when we call the ioctl, it returns
    immediately. If the specified transaction doesn't exist, it
    returns EINVAL.

    If no transaction id is specified, WAIT_SYNC will wait for the
    currently committing transaction to finish its commit to disk.
    If there is no currently committing transaction, it returns
    success.

    These ioctls are useful for applications which want to impose an
    ordering on when fs modifications reach disk, but do not want to
    wait for the full (slow) commit process to do so.

    Picky callers can take the transid returned by START_SYNC and
    feed it to WAIT_SYNC, and be certain to wait only as long as
    necessary for the transaction _they_ started to reach disk.
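
    A minimal userspace sketch of that picky-caller pattern (error handling
    omitted; assumes the ioctl numbers come from a kernel header such as
    linux/btrfs.h):

        #include <sys/ioctl.h>
        #include <linux/btrfs.h>   /* BTRFS_IOC_START_SYNC, BTRFS_IOC_WAIT_SYNC */

        /* fd is any open file or directory on the btrfs filesystem */
        void sync_my_changes(int fd)
        {
                __u64 transid = 0;

                ioctl(fd, BTRFS_IOC_START_SYNC, &transid); /* start a commit, get its transid */
                /* the commit proceeds in the background */
                ioctl(fd, BTRFS_IOC_WAIT_SYNC, &transid);  /* block until that transid is on disk */
        }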

    Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
    and provided they didn't wait too long between the calls, they
    will get the same result. However, if a second commit starts
    before they call WAIT_SYNC, they may end up waiting longer for
    it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
    guarantees that any operation completed before the START_SYNC
    reaches disk.

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     
  • Add support for an async transaction commit that is ordered such that any
    subsequent operations will join the following transaction, but does not
    wait until the current commit is fully on disk. This avoids much of the
    latency associated with the btrfs_commit_transaction for callers concerned
    with serialization and not safety.

    The wait_for_unblock flag controls whether we wait for the 'middle' portion
    of commit_transaction to complete, which is necessary if the caller expects
    some of the modifications contained in the commit to be available (this is
    the case for subvol/snapshot creation).

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     
  • We calculate the timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
    num_writers > 1 or should_grow at the top of the loop. Then, much much
    later, we wait for that timeout if either num_writers or should_grow is
    true. However, it's possible for a racing process (calling
    btrfs_end_transaction()) to decrement num_writers such that we wait
    forever instead of for 1 jiffy.

    Fix this by deciding how long to wait when we wait. Include a smp_mb()
    before checking if the waitqueue is active to ensure the num_writers
    is visible.
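
    Roughly, the fixed wait looks like this (an illustrative sketch, not the
    literal btrfs code):

        prepare_to_wait(&cur_trans->writer_wait, &wait, TASK_UNINTERRUPTIBLE);

        smp_mb();                       /* make the latest num_writers value visible */
        if (cur_trans->num_writers > 1)
                schedule_timeout(MAX_SCHEDULE_TIMEOUT); /* others still writing: wait to be woken */
        else if (should_grow)
                schedule_timeout(1);                    /* give the transaction one jiffy to grow */

        finish_wait(&cur_trans->writer_wait, &wait);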

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     

29 Oct, 2010

2 commits

  • Conflicts:
    fs/btrfs/extent-tree.c

    Signed-off-by: Chris Mason

    Chris Mason
     
  • In order to save free space cache, we need an inode to hold the data, and we
    need a special item to point at the right inode for the right block group. So
    first, create a special item that will point to the right inode, and the number
    of extent entries we will have and the number of bitmaps we will have. We
    truncate and pre-allocate space every time to make sure it's up to date.

    This feature will be turned on as soon as you mount with -o space_cache; however,
    it is safe to boot into old kernels, which will just generate the cache the
    old-fashioned way. When you boot back into a newer kernel we will notice that the
    filesystem was modified without updating the cache, and automatically discard the cache.

    Signed-off-by: Josef Bacik

    Josef Bacik
     

23 Oct, 2010

1 commit

  • With multi-threaded writes we were getting ENOSPC early because somebody would
    come in, start flushing delalloc because they couldn't make their reservation,
    and in the meantime other threads would come in and use the space that was
    getting freed up; so when the original thread went to check whether they had
    space, they didn't, and they'd return ENOSPC. So instead, if we have some free
    space but not enough for our reservation, take the reservation and then start
    doing the flushing. The only time we don't take reservations is when we've
    already overcommitted our space; that way we don't have people who come late to
    the party overcommitting us even further. This also moves all of the retrying and
    flushing code into reserve_metadata_bytes so it's all uniform. This keeps my
    fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
    fill up the disk. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

25 May, 2010

6 commits


06 Apr, 2010

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: add check for changed leaves in setup_leaf_for_split
    Btrfs: create snapshot references in same commit as snapshot
    Btrfs: fix small race with delalloc flushing waitqueue's
    Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
    Btrfs: fix chunk allocate size calculation
    Btrfs: kill max_extent mount option
    Btrfs: fail to mount if we have problems reading the block groups
    Btrfs: check btrfs_get_extent return for IS_ERR()
    Btrfs: handle kmalloc() failure in inode lookup ioctl
    Btrfs: dereferencing freed memory
    Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
    Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
    Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
    Btrfs: add NULL check for do_walk_down()
    Btrfs: remove duplicate include in ioctl.c

    Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
    cleanups.

    Linus Torvalds
     
  • This creates the reference to a new snapshot in the same commit as the
    snapshot itself. This avoids the need for a second commit in order for a
    snapshot to be persistent, and also avoids the problem of "leaking" a
    new snapshot tree root if the host crashes before the second commit takes
    place.

    It is not at all clear to me why it wasn't always done this way. If there
    is still a reason for the two-stage {create,finish}_pending_snapshots()
    approach I'm missing something! :)

    I've been running this for a couple weeks under pretty heavy usage (a few
    snapshots per minute) without obvious problems.

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     

31 Mar, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming their availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others it was more appropriate to
    add it to an implementation .h or the embedding .c file. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h is usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on the arch to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

15 Mar, 2010

1 commit

  • Flush any delalloc extents when we create a snapshot, so that recently
    written file data is always included in the snapshot.

    A later commit will add the ability to snapshot without the flush, but
    most people expect flushing.

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     

09 Mar, 2010

1 commit

  • btrfs initializes rb trees in quite a number of places by setting rb_node =
    NULL; the problem with this is that 17d9ddc72fb8bba0d4f678 in the
    linux-next tree adds a new field to that struct which needs to be NULL for
    the new rbtree library code to work properly. This patch uses RB_ROOT as
    the initializer so all of the relevant fields will be NULL'd. Without the
    patch I get a panic.
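
    For reference, the difference between the two initializers (sketch):

        #include <linux/rbtree.h>

        /* old pattern: only rb_node is cleared, any new fields stay uninitialized */
        /*     tree.rb_node = NULL;                                                */

        /* new pattern: RB_ROOT initializes the whole struct */
        struct rb_root tree = RB_ROOT;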

    Signed-off-by: Eric Paris
    Acked-by: Venkatesh Pallipadi
    Signed-off-by: Chris Mason

    Eric Paris
     

18 Dec, 2009

3 commits


16 Dec, 2009

1 commit

  • We allow two log transactions at a time, but use the same flag
    to mark dirty tree-log btree blocks. So we may flush dirty
    blocks belonging to newer log transaction when committing a
    log transaction. This patch fixes the issue by using two
    flags to mark dirty tree-log btree blocks.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

12 Nov, 2009

1 commit

  • We use journal_info to tell if we're in a nested transaction to make sure we
    don't commit the transaction within a nested transaction. We use another
    method to see if there are any outstanding ioctl trans handles, so if we're
    starting one do not set current->journal_info, since it will screw with other
    filesystems. This patch also cleans up the starting stuff so there aren't any
    magic numbers.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

14 Oct, 2009

1 commit

  • Syncing the tree log is a 3 phase operation.

    1) write and wait for all the tree log blocks for a given root.

    2) write and wait for all the tree log blocks for the
    tree of tree log roots.

    3) write and wait for the super blocks (barriers here)

    This isn't as efficient as it could be because there is
    no requirement to wait for the blocks from step one to hit the disk
    before we start writing the blocks from step two. This commit
    changes the sequence so that we don't start waiting until
    all the tree blocks from both steps one and two have been sent
    to disk.

    We do this by breaking up btrfs_write_wait_marked_extents into
    two functions, which is trivial because it was already broken
    up into two parts.

    Signed-off-by: Chris Mason

    Chris Mason
     

29 Sep, 2009

1 commit

  • At the start of a transaction we do a btrfs_reserve_metadata_space() and
    specify how many items we plan on modifying. Then once we've done our
    modifications and such, just call btrfs_unreserve_metadata_space() for
    the same number of items we reserved.
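
    As a sketch of that pattern (the function names are taken from the
    description above; the exact signatures are assumed):

        int ret;

        /* tell the allocator we intend to modify up to 3 items */
        ret = btrfs_reserve_metadata_space(root, 3);
        if (ret)
                return ret;

        /* ... do the modifications ... */

        /* release the reservation for the same number of items */
        btrfs_unreserve_metadata_space(root, 3);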

    For keeping track of metadata needed for data I've had to add an extent_io op
    for when we merge extents. This lets us track space properly when we are doing
    sequential writes, so we don't end up reserving way more metadata space than
    what we need.

    The only place where the metadata space accounting is not done is in the
    relocation code. This is because Yan is going to be reworking that code in the
    near future, so running btrfs-vol -b could still possibly result in an
    ENOSPC-related panic. This patch also turns off the metadata_ratio stuff in
    order to allow users to more efficiently use their disk space.

    This patch makes it so we track how much metadata we need for an inode's
    delayed allocation extents by tracking how many extents are currently
    waiting for allocation. It introduces two new callbacks for the
    extent_io trees, merge_extent_hook and split_extent_hook. These help
    us keep track of when we merge delalloc extents together and split them
    up. Reservations are handled prior to any actual dirtying, and then we
    unreserve after we dirty.

    btrfs_unreserve_metadata_for_delalloc() will make the appropriate
    unreservations as needed based on the number of reservations we
    currently have and the number of extents we currently have. Doing the
    reservation outside of doing any of the actual dirty'ing lets us do
    things like filemap_flush() the inode to try and force delalloc to
    happen, or as a last resort actually start allocation on all delalloc
    inodes in the fs. This has survived dbench, fs_mark and an fsx torture
    test.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

22 Sep, 2009

3 commits

  • This patch adds a snapshot/subvolume destroy ioctl. A subvolume that isn't being
    used and doesn't contain links to other subvolumes can be destroyed.
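
    A minimal userspace sketch of the destroy ioctl (error handling omitted;
    assumes definitions from a kernel header such as linux/btrfs.h and the
    needed privileges):

        #include <fcntl.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/btrfs.h>   /* btrfs_ioctl_vol_args, BTRFS_IOC_SNAP_DESTROY */

        int destroy_subvol(const char *parent_dir, const char *name)
        {
                struct btrfs_ioctl_vol_args args;
                int dir = open(parent_dir, O_RDONLY);   /* directory containing the subvolume */

                memset(&args, 0, sizeof(args));
                strncpy(args.name, name, BTRFS_PATH_NAME_MAX);

                return ioctl(dir, BTRFS_IOC_SNAP_DESTROY, &args);
        }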

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • btrfs allows subvolumes and snapshots anywhere in the directory tree.
    If we snapshot a subvolume that contains a link to another subvolume
    called subvolA, subvolA can be accessed through both the original
    subvolume and the snapshot. This is similar to creating a hard link to a
    directory, and has very similar problems.

    The aim of this patch is to enforce that there is only one access point to
    each subvolume. Only the first directory entry (the one added when
    the subvolume/snapshot was created) is treated as a valid access point.
    The first directory entry is distinguished by checking the root forward
    reference. If the corresponding root forward reference is missing,
    we know the entry is not the first one.

    This patch also adds snapshot/subvolume rename support; the code
    allows renaming a subvolume link across subvolumes.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • This patch contains two changes to avoid unnecessary tree block reads during
    snapshot dropping.

    First, check the tree block's reference count and flags before reading the tree
    block. If the reference count > 1 and there is no need to update backrefs, we can
    avoid reading the tree block.

    Second, save when the snapshot was created in root_key.offset. We can compare a
    block pointer's generation with the snapshot's creation generation while updating
    backrefs. If a given block was created before the snapshot was created, the
    snapshot can't be the tree block's owner, so we can avoid reading the block.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

18 Sep, 2009

1 commit

  • This patch gets rid of two limitations of async block group caching.
    The old code delays handling pinned extents while a block group is being
    cached, and to allocate logged file extents it has to wait
    until the block group is fully cached. To get rid of these limitations,
    this patch introduces a data structure to track the progress of
    caching. Based on the caching progress, we know which extents should
    be added to the free space cache when handling the pinned extents.
    The logged file extents are also handled in a similar way.

    This patch also changes how pinned extents are tracked. The old
    code uses one tree to track pinned extents, and copy the pinned
    extents tree at transaction commit time. This patch makes it use
    two trees to track pinned extents. One tree for extents that are
    pinned in the running transaction, one tree for extents that can
    be unpinned. At transaction commit time, we swap the two trees.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

30 Jul, 2009

2 commits

  • The semaphore used by the async caching threads can prevent a
    transaction commit, which can make the FS appear to stall. This
    releases the semaphore more often when a transaction commit is
    in progress.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The async block group caching code uses the commit_root pointer
    to get a stable version of the extent allocation tree for scanning.
    This copy of the tree root isn't going to change and it significantly
    reduces the complexity of the scanning code.

    During a commit, we have a loop where we update the extent allocation
    tree root. We need to loop because updating the root pointer in
    the tree of tree roots may allocate blocks which may change the
    extent allocation tree.

    Right now the commit_root pointer is changed inside this loop. It
    is more correct to change the commit_root pointer only after all the
    looping is done.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

25 Jul, 2009

1 commit

  • The commit_transaction call to wait_ordered_extents when snap_pending
    passes nocow_only=1 to process only NOCOW or PREALLOC extents. This isn't
    correct for the 'flushoncommit' mode, as it skips extents we just started
    IO on in start_delalloc_inodes.

    So, in the flushoncommit case, wait on all ordered extents. Otherwise,
    only pass the nocow_only flag to wait_ordered_extents if snap_pending.

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     

24 Jul, 2009

1 commit

  • This patch moves the caching of the block group off to a kthread in order to
    allow people to allocate sooner. Instead of blocking up behind the caching
    mutex, we instead kick off the caching kthread, and then attempt to make an
    allocation. If we cannot, we wait on the block group's caching waitqueue; the
    caching kthread will wake the waiting threads up every time it finds 2 meg
    worth of space, and then again when it has finished caching. This is how I tested
    the speedup from this:

    mkfs the disk
    mount the disk
    fill the disk up with fs_mark
    unmount the disk
    mount the disk
    time touch /mnt/foo

    Without my changes this took 11 seconds on my box, with these changes it now
    takes 1 second.

    Another change that's been put in place is that we lock the super mirrors in the
    pinned extent map in order to keep us from adding that stuff as free space when
    caching the block group. This doesn't really change anything else as far as the
    pinned extent map is concerned, since for actual pinned extents we use
    EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
    those extents to keep from leaking memory.

    I've also added a check where, when we are reading block groups from disk, if the
    amount of space used == the size of the block group, we go ahead and mark the
    block group as cached. This drastically reduces the amount of time it takes to
    cache the block groups. Using the same test as above, except doing a dd to a
    file and then unmounting, it used to take 33 seconds to umount; now it takes 3
    seconds.

    This version uses the commit_root in the caching kthread, and then keeps track
    of how many async caching threads are running at any given time, so if one of the
    async threads is still running as we cross transactions we can wait until it has
    finished before handling the pinned extents. Thank you,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

22 Jul, 2009

1 commit

  • Writing dirty block groups may allocate new blocks, and so may add new delayed
    back refs. btrfs_run_delayed_refs may make some block groups dirty.

    commit_cowonly_roots does not handle the recursion properly, and some dirty
    blocks can be left unwritten at commit time. This patch moves
    btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
    the code not break out of the loop until there are no dirty block groups or
    delayed back refs.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

03 Jul, 2009

1 commit

  • The new backref format has a restriction on the type of backref item. If a tree
    block isn't referenced by its owner tree, full backrefs must be used for the
    pointers in it. When a tree block loses its owner tree's reference, backrefs
    for the pointers in it should be updated to full backrefs. The current
    btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
    general use.

    This patch adds backrefs update code to btrfs_drop_snapshot. It isn't a
    problem in the restricted form btrfs_drop_snapshot is used today, but for
    general snapshot deletion this update is required.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

16 Jun, 2009

1 commit


10 Jun, 2009

1 commit

  • This commit introduces a new kind of back reference for btrfs metadata.
    Once a filesystem has been mounted with this commit, IT WILL NO LONGER
    BE MOUNTABLE BY OLDER KERNELS.

    When a tree block in a subvolume tree is cow'd, the reference counts of all
    extents it points to are increased by one. At transaction commit time,
    the old root of the subvolume is recorded in a "dead root" data structure,
    and the btree it points to is later walked, dropping reference counts
    and freeing any blocks where the reference count goes to 0.

    The increments done during cow and decrements done after commit cancel out,
    and the walk is a very expensive way to go about freeing the blocks that
    are no longer referenced by the new btree root. This commit reduces the
    transaction overhead by avoiding the need for dead root records.

    When a non-shared tree block is cow'd, we free the old block at once, and the
    new block inherits old block's references. When a tree block with reference
    count > 1 is cow'd, we increase the reference counts of all extents
    the new block points to by one, and decrease the old block's reference count by
    one.

    This dead tree avoidance code removes the need to modify the reference
    counts of lower level extents when a non-shared tree block is cow'd.
    But we still need to update back ref for all pointers in the block.
    This is because the location of the block is recorded in the back ref
    item.

    We can solve this by introducing a new type of back ref. The new
    back ref provides information about the pointer's key, level and in which
    tree the pointer lives. This information allows us to find the pointer
    by searching the tree. The shortcoming of the new back ref is that it
    only works for pointers in tree blocks referenced by their owner trees.

    This is mostly a problem for snapshots, where resolving one of these
    fuzzy back references would be O(number_of_snapshots) and quite slow.
    The solution used here is to use the fuzzy back references in the common
    case where a given tree block is only referenced by one root,
    and use the full back references when multiple roots have a reference
    on a given block.

    This commit adds a per-subvolume red-black tree to keep track of cached
    inodes. The red-black tree helps the balancing code find cached
    inodes whose inode numbers are within a given range.

    This commit improves the balancing code by introducing several data
    structures to keep the state of balancing. The most important one
    is the back ref cache. It caches how the upper level tree blocks are
    referenced. This greatly reduces the overhead of checking back refs.

    The improved balancing code scales significantly better with a large
    number of snapshots.

    This is a very large commit and was written in a number of
    pieces. But, they depend heavily on the disk format change and were
    squashed together to make sure git bisect didn't end up in a
    bad state wrt space balancing or the format change.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng