30 Oct, 2010

2 commits

  • These are all the cases where a variable is set but not read, which
    are not bugs as far as I can see, but simply leftovers.

    Still needs more review.

    Found by gcc 4.6's new warnings.

    Signed-off-by: Andi Kleen
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Mason

    Andi Kleen
     
  • These are all the cases where a variable is set but not
    read, which are really bugs.

    - A couple of incorrect error-handling cases fixed.
    - One incorrect use of an allocation policy.
    - Some other things.

    Still needs more review.

    Found by gcc 4.6's new warnings.

    [akpm@linux-foundation.org: fix build. Might have been bitrot]
    Signed-off-by: Andi Kleen
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Mason

    Andi Kleen
     

25 May, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
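
    For illustration, this is the kind of edit the sweep produces in a
    .c file that uses slab APIs but previously relied on percpu.h
    pulling in slab.h (a hypothetical file, not from the actual patch):

    /* before: kmalloc() compiled only because slab.h arrived via
     * sched.h -> percpu.h -> slab.h */
    #include <linux/sched.h>

    /* after: the dependency is made explicit */
    #include <linux/sched.h>
    #include <linux/slab.h>         /* file uses kmalloc()/kfree() */

    static void *buf_alloc(size_t n)
    {
            return kmalloc(n, GFP_KERNEL);
    }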

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

15 Mar, 2010

1 commit

  • This work is in preparation for being able to set a different root as
    the default mounting root.

    There is currently a problem with how we mount subvolumes. We cannot
    currently mount a subvolume of a subvolume; you can only mount
    subvolumes/snapshots of the default subvolume. So say you take a
    snapshot of the default subvolume and call it snap1, and then take a
    snapshot of snap1 and call it snap2, so now you have

    /
    /snap1
    /snap1/snap2

    as your available volumes. Currently you can only mount / and /snap1;
    you cannot mount /snap1/snap2. To fix this problem, instead of passing
    subvol=<name> you must pass in subvolid=<treeid>, where <treeid> is
    the tree id that gets spit out via the subvolume listing you get from
    the subvolume listing patches (btrfs filesystem list). This allows us
    to mount /, /snap1 and /snap1/snap2 as the root volume.

    In addition to the above, we also now read the default dir item in the
    tree root to get the root key that it points to. For now this just
    points at what has always been the default subvolume, but later on I
    plan to change it to point at whatever root you want to be the new
    default root, so you can just set the default mount and not have to
    mount with -o subvolid=<treeid>. I tested this out with the above
    scenario and it worked perfectly. Thanks,

    mount -o subvol operates inside the selected subvolid. For example:

    mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt

    /mnt will have the snap1 directory for the subvolume with id
    256.

    mount -o subvol=snap /dev/xxx /mnt

    /mnt will be the snap directory of whatever the default subvolume
    is.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

18 Dec, 2009

1 commit

  • We do log replay in a single transaction, so it's not good to do
    unbound operations. This patch runs the orphan inode cleanup after
    replaying the log. It also avoids doing other unbound operations,
    such as truncating a file, while replaying the log; these unbound
    operations are postponed to the orphan inode cleanup stage.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

16 Dec, 2009

2 commits

  • Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can
    avoid calling lock_extent within a transaction.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • We allow two log transactions at a time, but use the same flag
    to mark dirty tree-log btree blocks. So we may flush dirty
    blocks belonging to a newer log transaction when committing a
    log transaction. This patch fixes the issue by using two
    flags to mark dirty tree-log btree blocks.
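
    A sketch of the idea, assuming the two marks are the existing
    EXTENT_DIRTY and EXTENT_NEW bits chosen by log-transaction parity
    (the bit choice here is an assumption, not quoted from the patch):

    /* pick the dirty mark for this log transaction; blocks of the
     * previous log transaction carry the other mark, so committing
     * one log no longer flushes the other's dirty blocks */
    mark = (log->log_transid % 2 == 0) ? EXTENT_DIRTY : EXTENT_NEW;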

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

16 Oct, 2009

1 commit

  • * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: always pin metadata in discard mode
    Btrfs: enable discard support
    Btrfs: add -o discard option
    Btrfs: properly wait log writers during log sync
    Btrfs: fix possible ENOSPC problems with truncate
    Btrfs: fix btrfs acl #ifdef checks
    Btrfs: streamline tree-log btree block writeout
    Btrfs: avoid tree log commit when there are no changes
    Btrfs: only write one super copy during fsync

    Linus Torvalds
     

14 Oct, 2009

4 commits

  • A recent fsync optimization made btrfs_sync_log skip calling
    wait_for_writer in the single log writer case. This is incorrect
    since the writer count can also be increased by btrfs_pin_log.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • Syncing the tree log is a 3 phase operation.

    1) write and wait for all the tree log blocks for a given root.

    2) write and wait for all the tree log blocks for the
    tree of tree log roots.

    3) write and wait for the super blocks (barriers here)

    This isn't as efficient as it could be because there is
    no requirement to wait for the blocks from step one to hit the disk
    before we start writing the blocks from step two. This commit
    changes the sequence so that we don't start waiting until
    all the tree blocks from both steps one and two have been sent
    to disk.

    We do this by breaking up btrfs_write_wait_marked_extents into
    two functions, which is trivial because it was already broken
    up into two parts.
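
    Schematically, assuming the two halves keep names like
    btrfs_write_marked_extents and btrfs_wait_marked_extents (the names
    are an assumption), the sync path goes from write+wait per tree to:

    /* start writeback for both trees first ... */
    btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
    btrfs_write_marked_extents(log_root_tree,
                               &log_root_tree->dirty_log_pages, mark);
    /* ... and only then wait for both */
    btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
    btrfs_wait_marked_extents(log_root_tree,
                              &log_root_tree->dirty_log_pages, mark);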

    Signed-off-by: Chris Mason

    Chris Mason
     
  • rpm has a habit of running fdatasync when the file hasn't
    changed. We already detect if a file hasn't been changed
    in the current transaction, but it might have been sent to
    the tree-log in this transaction and not changed since
    the last call to fsync.

    In this case, we want to avoid a tree log sync, which includes
    a number of synchronous writes and barriers. This commit
    extends the existing tracking of the last transaction to change
    a file to also track the last sub-transaction.

    The end result is that rpm -ivh and -Uvh are roughly twice as fast,
    and on par with ext3.
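
    A rough schematic of the fsync fast path this enables (field names
    are approximate, not the exact kernel code):

    /* skip the tree log sync entirely when nothing has changed since
     * the last log commit for this sub-transaction */
    if (BTRFS_I(inode)->last_trans <= fs_info->last_trans_committed &&
        BTRFS_I(inode)->last_sub_trans <= root->last_log_commit)
            return 0;       /* already safely on disk */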

    Signed-off-by: Chris Mason

    Chris Mason
     
  • During a tree-log commit for fsync, we've been writing at least
    two copies of the super block and forcing them to disk.

    The other filesystems write only one, and this change brings us on
    par with them. A full transaction commit will write all the super
    copies, so we still have redundant info written on a regular
    basis.

    Signed-off-by: Chris Mason

    Chris Mason
     

12 Oct, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: fix file clone ioctl for bookend extents
    Btrfs: fix uninit compiler warning in cow_file_range_nocow
    Btrfs: constify dentry_operations
    Btrfs: optimize back reference update during btrfs_drop_snapshot
    Btrfs: remove negative dentry when deleting subvolumne
    Btrfs: optimize fsync for the single writer case
    Btrfs: async delalloc flushing under space pressure
    Btrfs: release delalloc reservations on extent item insertion
    Btrfs: delay clearing EXTENT_DELALLOC for compressed extents
    Btrfs: cleanup extent_clear_unlock_delalloc flags
    Btrfs: fix possible softlockup in the allocator
    Btrfs: fix deadlock on async thread startup

    Linus Torvalds
     

09 Oct, 2009

1 commit

  • This patch optimizes the tree logging stuff so it doesn't always wait
    1 jiffy for new people to join the logging transaction if there is
    only ever 1 writer. This helps a little bit with latency where we
    have something like rpm, which will fdatasync every file it writes,
    so waiting the 1 jiffy for every fdatasync really starts to add up.
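
    The batching check this adds looks roughly like the following (a
    sketch; the real logic lives in btrfs_sync_log and is more involved):

    /* only sleep to let more writers batch into this log commit if
     * someone else is actually writing to the log */
    if (atomic_read(&root->log_writers) != 1)
            schedule_timeout_uninterruptible(1);    /* one jiffy */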

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

24 Sep, 2009

1 commit


22 Sep, 2009

2 commits

  • This patch adds a snapshot/subvolume destroy ioctl. A subvolume that
    isn't being used and doesn't contain links to other subvolumes can be
    destroyed.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • The new back reference format does not allow reusing the objectid of
    a deleted snapshot/subvol. So we use ++highest_objectid to allocate
    objectids for new snapshots/subvols.

    Now we use ++highest_objectid to allocate objectids for both new
    inodes and new snapshots/subvolumes, so this patch removes the 'find
    hole' code in btrfs_find_free_objectid.
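
    With the hole-finding gone, allocation reduces to roughly the
    following (a simplified sketch; the real function takes more
    arguments and does bounds checking):

    int btrfs_find_free_objectid(struct btrfs_root *root, u64 *objectid)
    {
            mutex_lock(&root->objectid_mutex);
            *objectid = ++root->highest_objectid;   /* never reused */
            mutex_unlock(&root->objectid_mutex);
            return 0;
    }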

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

21 Sep, 2009

1 commit


18 Sep, 2009

1 commit

  • This patch gets rid of two limitations of async block group caching.
    The old code delays handling pinned extents while a block group is
    being cached, and to allocate logged file extents, the old code needs
    to wait until the block group is fully cached. To get rid of these
    limitations, this patch introduces a data structure to track the
    progress of caching. Based on the caching progress, we know which
    extents should be added to the free space cache when handling the
    pinned extents. The logged file extents are also handled in a
    similar way.

    This patch also changes how pinned extents are tracked. The old
    code uses one tree to track pinned extents and copies the pinned
    extents tree at transaction commit time. This patch makes it use
    two trees to track pinned extents: one tree for extents that are
    pinned in the running transaction, and one tree for extents that
    can be unpinned. At transaction commit time, we swap the two trees.
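
    The swap itself can be as simple as flipping a pointer between the
    two trees at commit time (a sketch modeled on the description above;
    the array name is an assumption):

    /* extents pin into *pinned_extents; the other tree holds extents
     * that are now safe to unpin.  Swap the roles at commit. */
    if (fs_info->pinned_extents == &fs_info->freed_extents[0])
            fs_info->pinned_extents = &fs_info->freed_extents[1];
    else
            fs_info->pinned_extents = &fs_info->freed_extents[0];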

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

12 Sep, 2009

1 commit

  • Data COW means that whenever we write to a file, we replace any old
    extent pointers with new ones. There was a window where a readpage
    might find the old extent pointers on disk and cache them in the
    extent_map tree in ram in the middle of a given write replacing them.

    Even though both the readpage and the write had their respective bytes
    in the file locked, the extent readpage inserts may cover more bytes than
    it had locked down.

    This commit closes the race by keeping the new extent pinned in the extent
    map tree until after the on-disk btree is properly setup with the new
    extent pointers.

    Signed-off-by: Chris Mason

    Chris Mason
     

28 Jul, 2009

2 commits

  • We are racy with async block caching and unpinning extents. This
    patch makes things much less complicated by only unpinning the extent
    if the block group is cached. We check the block_group->cached var
    under the block_group->lock spin lock. If it is set to
    BTRFS_CACHE_FINISHED then we update the pinned counters, unpin the
    extent and add the free space back. If it is not, we start the
    caching of the block group so the next time we try to unpin extents
    we can actually unpin them. This keeps us from racing with the async
    caching threads, lets us kill the fs-wide async thread counter, and
    keeps us from having to set DELALLOC bits for every extent we hit if
    there are caching kthreads going.
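
    In outline, the unpin path described above looks like this (a
    sketch; helper names are assumptions):

    spin_lock(&block_group->lock);
    if (block_group->cached == BTRFS_CACHE_FINISHED) {
            spin_unlock(&block_group->lock);
            /* the free space cache is complete, so account the
             * unpinned range and return it as free space */
            btrfs_add_free_space(block_group, start, len);
    } else {
            spin_unlock(&block_group->lock);
            /* not cached yet: kick off caching so a later unpin
             * pass can do the work without racing the cacher */
            cache_block_group(block_group);
    }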

    One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
    instead of just looking for LOCKED extents, we also look for DIRTY extents,
    since we could have left some extents pinned in the previous transaction that
    will never get freed now that we are unmounting, which would cause us to leak
    memory. So btrfs_free_super_mirror_extents has been changed to
    btrfs_free_pinned_extents, and it will clear the extents locked for the super
    mirror, and any remaining pinned extents that may be present. Thank you,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • dir has already been tested. It seems that this test should be on the
    recently returned value inode.

    A simplified version of the semantic match that finds this problem is as
    follows: (http://www.emn.fr/x-info/coccinelle/)

    Signed-off-by: Julia Lawall
    Signed-off-by: Chris Mason

    Julia Lawall
     

24 Jul, 2009

1 commit

  • This patch moves the caching of the block group off to a kthread in
    order to allow people to allocate sooner. Instead of blocking up
    behind the caching mutex, we instead kick off the caching kthread and
    then attempt to make an allocation. If we cannot, we wait on the
    block group's caching waitqueue; the caching kthread wakes the
    waiting threads every time it finds 2 meg worth of space, and then
    again when it's finished caching. This is how I tested the speedup
    from this:

    mkfs the disk
    mount the disk
    fill the disk up with fs_mark
    unmount the disk
    mount the disk
    time touch /mnt/foo

    Without my changes this took 11 seconds on my box, with these changes it now
    takes 1 second.
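
    The allocator's wait on the caching kthread might look roughly like
    this (a sketch; field and helper names are assumptions):

    wait_event(block_group->caching_q,
               block_group_cache_done(block_group) ||
               enough_free_space(block_group, num_bytes));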

    Another change that's been put in place is that we lock the super
    mirrors in the pinned extent map in order to keep us from adding that
    stuff as free space when caching the block group. This doesn't really
    change anything else as far as the pinned extent map is concerned,
    since for actual pinned extents we use EXTENT_DIRTY, but it does mean
    that when we unmount we have to go in and unlock those extents to
    keep from leaking memory.

    I've also added a check for when we are reading block groups from
    disk: if the amount of space used == the size of the block group, we
    go ahead and mark the block group as cached. This drastically reduces
    the amount of time it takes to cache the block groups. Using the same
    test as above, except doing a dd to a file and then unmounting, it
    used to take 33 seconds to umount; now it takes 3 seconds.

    This version uses the commit_root in the caching kthread, and keeps
    track of how many async caching threads are running at any given
    time, so if one of the async threads is still running as we cross
    transactions we can wait until it's finished before handling the
    pinned extents. Thank you,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

11 Jun, 2009

1 commit

  • During tree log replay, we read in the tree log roots,
    process them and then free them. A recent change
    takes an extra reference on the root node of the tree
    when the root is read in, and stores that reference
    in root->commit_root.

    This reference was not being freed, leaving us with
    one buffer pinned in ram for each subvol with
    a tree log root after a crash.

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Jun, 2009

1 commit

  • This commit introduces a new kind of back reference for btrfs metadata.
    Once a filesystem has been mounted with this commit, IT WILL NO LONGER
    BE MOUNTABLE BY OLDER KERNELS.

    When a tree block in a subvolume tree is cow'd, the reference counts
    of all
    extents it points to are increased by one. At transaction commit time,
    the old root of the subvolume is recorded in a "dead root" data structure,
    and the btree it points to is later walked, dropping reference counts
    and freeing any blocks where the reference count goes to 0.

    The increments done during cow and decrements done after commit cancel out,
    and the walk is a very expensive way to go about freeing the blocks that
    are no longer referenced by the new btree root. This commit reduces the
    transaction overhead by avoiding the need for dead root records.

    When a non-shared tree block is cow'd, we free the old block at once,
    and the new block inherits the old block's references. When a tree
    block with reference count > 1 is cow'd, we increase the reference
    counts of all extents the new block points to by one, and decrease
    the old block's reference count by one.
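
    In pseudocode, the cow-time bookkeeping just described is
    (illustrative only; the helper names are invented):

    if (refs(old_block) == 1) {
            /* non-shared: the new block inherits the old block's
             * references, and the old block is freed at once */
            free_tree_block(old_block);
    } else {
            /* shared: every extent the new block points to gains a
             * reference, and the old block drops one */
            for_each_pointer(new_block, ptr)
                    increase_ref(ptr);
            decrease_ref(old_block);
    }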

    This dead tree avoidance code removes the need to modify the
    reference counts of lower level extents when a non-shared tree block
    is cow'd. But we still need to update the back ref for all pointers
    in the block, because the location of the block is recorded in the
    back ref item.

    We can solve this by introducing a new type of back ref. The new
    back ref provides information about the pointer's key, level, and
    the tree the pointer lives in. This information allows us to find
    the pointer by searching the tree. The shortcoming of the new back
    ref is that it only works for pointers in tree blocks referenced by
    their owner trees.

    This is mostly a problem for snapshots, where resolving one of these
    fuzzy back references would be O(number_of_snapshots) and quite slow.
    The solution used here is to use the fuzzy back references in the common
    case where a given tree block is only referenced by one root,
    and use the full back references when multiple roots have a reference
    on a given block.

    This commit adds a per-subvolume red-black tree to keep track of
    cached inodes. The red-black tree helps the balancing code find
    cached inodes whose inode numbers are within a given range.

    This commit improves the balancing code by introducing several data
    structures to keep the state of balancing. The most important one
    is the back ref cache. It caches how the upper level tree blocks are
    referenced. This greatly reduces the overhead of checking back refs.

    The improved balancing code scales significantly better with a large
    number of snapshots.

    This is a very large commit and was written in a number of
    pieces. But, they depend heavily on the disk format change and were
    squashed together to make sure git bisect didn't end up in a
    bad state wrt space balancing or the format change.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng
     

25 Apr, 2009

1 commit

  • The btrfs fallocate call takes an extent lock on the entire range
    being fallocated, and then runs through insert_reserved_extent on each
    extent as they are allocated.

    The problem with this is that btrfs_drop_extents may decide to try
    and take the same extent lock fallocate was already holding. The solution
    used here is to push down knowledge of the range that is already locked
    going into btrfs_drop_extents.

    It turns out that at least one other caller had the same bug.

    Signed-off-by: Chris Mason

    Chris Mason
     

03 Apr, 2009

3 commits

  • Signed-off-by: Chris Mason

    Stoyan Gaydarov
     
  • Add a 'notreelog' mount option to disable the tree log (used by fsync,
    O_SYNC writes). This is much slower, but the tree logging produces
    inconsistent views into the FS for ceph.
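
    Usage mirrors the other mount options shown earlier in this log:

    mount -o notreelog /dev/xxx /mnt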

    Signed-off-by: Sage Weil
    Signed-off-by: Chris Mason

    Sage Weil
     
  • This patch removes the pinned_mutex. The extent io map has an internal tree
    lock that protects the tree itself, and since we only copy the extent io map
    when we are committing the transaction we don't need it there. We also don't
    need it when caching the block group since searching through the tree is also
    protected by the internal map spin lock.

    Signed-off-by: Josef Bacik

    Josef Bacik
     

25 Mar, 2009

4 commits

  • The fsync log has code to make sure all of the parents of a file are in the
    log along with the file. It uses a minimal log of the parent directory
    inodes, just enough to get the parent directory on disk.

    If the transaction that originally created a file is fully on disk,
    and the file hasn't been renamed or linked into other directories, we
    can safely skip the parent directory walk. We know the file is on disk
    somewhere and we can go ahead and just log that single file.

    This is more important now because unrelated unlinks in the parent directory
    might make us force a commit if we try to log the parent.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The tree logging code allows individual files or directories to be logged
    without including operations on other files and directories in the FS.
    It tries to commit the minimal set of changes to disk in order to
    fsync the single file or directory that was sent to fsync or O_SYNC.

    The tree logging code was allowing files and directories to be unlinked
    if they were part of a rename operation where only one directory
    in the rename was in the fsync log. This patch adds a few new rules
    to the tree logging.

    1) on rename or unlink, if the inode being unlinked isn't in the fsync
    log, we must force a full commit before doing an fsync of the directory
    where the unlink was done. The commit isn't done during the unlink,
    but it is forced the next time we try to log the parent directory.

    Solution: record transid of last unlink/rename per directory when the
    directory wasn't already logged. For renames this is only done when
    renaming to a different directory.

    mkdir foo/some_dir
    normal commit
    rename foo/some_dir foo2/some_dir
    mkdir foo/some_dir
    fsync foo/some_dir/some_file

    The fsync above will unlink the original some_dir without recording
    it in its new location (foo2). After a crash, some_dir will be gone
    unless the fsync of some_file forces a full commit.

    2) we must log any new names for any file or dir that is in the fsync
    log. This way we make sure not to lose files that are unlinked during
    the same transaction.

    2a) we must log any new names for any file or dir during rename
    when the directory they are being removed from was logged.

    2a is actually the more important variant. Without the extra logging,
    a crash might unlink the old name without recreating the new one.

    3) after a crash, we must go through any directories with a link count
    of zero and redo the rm -rf

    mkdir f1/foo
    normal commit
    rm -rf f1/foo
    fsync(f1)

    The directory f1 was fully removed from the FS, but fsync was never
    called on f1, only its parent dir. After a crash the rm -rf must
    be replayed. This must be able to recurse down the entire
    directory tree. The inode link count fixup code takes care of the
    ugly details.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • During log replay, inodes are copied from the log to the main filesystem
    btrees. Sometimes they have a zero link count in the log but they actually
    gain links during the replay or have some in the main btree.

    This patch updates the link count to be at least one after copying
    the inode out of the log. This makes sure the inode is not deleted
    during an iput while the rest of the replay code is still working on
    it.

    The log replay has fixup code to make sure that link counts are correct
    at the end of the replay, so we could use any non-zero number here and
    it would work fine.
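
    The fix amounts to a small guard after the copy (a sketch; the
    replay fixup pass later settles the true count):

    /* never hand replay an inode with i_nlink == 0: an iput while
     * replay still references it would delete the inode */
    if (inode->i_nlink == 0)
            inc_nlink(inode);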

    Signed-off-by: Chris Mason

    Chris Mason
     
  • btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree
    for the buffers it was dirtying. This may require a kmalloc and it
    was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
    set any btree locks they were holding to blocking first.

    This commit changes dirty tracking for extent buffers to just use a flag
    in the extent buffer. Now that we have one and only one extent buffer
    per page, this can be safely done without losing dirty bits along the way.

    This also introduces a path->leave_spinning flag that callers of
    btrfs_search_slot can use to indicate they will properly deal with a
    path returned where all the locks are spinning instead of blocking.

    Many of the btree search callers now expect spinning paths,
    resulting in better btree concurrency overall.
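
    A caller opting in looks roughly like this (a sketch; error
    handling omitted):

    path = btrfs_alloc_path();
    path->leave_spinning = 1;
    ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
    /* locks in the returned path are spinning; the caller must switch
     * them to blocking before doing anything that can schedule */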

    Signed-off-by: Chris Mason

    Chris Mason
     

13 Feb, 2009

1 commit


04 Feb, 2009

1 commit

  • Most of the btrfs metadata operations can be protected by a spinlock,
    but some operations still need to schedule.

    So far, btrfs has been using a mutex along with a trylock loop;
    most of the time it is able to avoid going for the full mutex, so
    the trylock loop is a big performance gain.

    This commit is step one for getting rid of the blocking locks entirely.
    btrfs_tree_lock takes a spinlock, and the code explicitly switches
    to a blocking lock when it starts an operation that can schedule.

    We'll be able to get rid of the blocking locks in smaller pieces over
    time.
    Tracing allows us to find the most common cause of blocking, so we
    can start with the hot spots first.

    The basic idea is:

    btrfs_tree_lock() returns with the spin lock held

    btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
    the extent buffer flags, and then drops the spin lock. The buffer is
    still considered locked by all of the btrfs code.

    If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
    the spin lock and waits on a wait queue for the blocking bit to go away.

    Much of the code that needs to set the blocking bit finishes without actually
    blocking a good percentage of the time. So, an adaptive spin is still
    used against the blocking bit to avoid very high context switch rates.

    btrfs_clear_lock_blocking() clears the blocking bit and returns
    with the spinlock held again.

    btrfs_tree_unlock() can be called on either blocking or spinning locks,
    it does the right thing based on the blocking bit.

    ctree.c has a helper function to set/clear all the locked buffers in a
    path as blocking.
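
    The resulting usage pattern (the blocking work here is a
    hypothetical placeholder):

    btrfs_tree_lock(eb);            /* returns with the spin lock held */
    btrfs_set_lock_blocking(eb);    /* drops spin lock; may now schedule */
    do_blocking_work(eb);           /* hypothetical work that sleeps */
    btrfs_clear_lock_blocking(eb);  /* back to a held spin lock */
    btrfs_tree_unlock(eb);          /* handles either state */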

    Signed-off-by: Chris Mason

    Chris Mason
     

22 Jan, 2009

1 commit

  • To improve performance, btrfs_sync_log merges tree log sync
    requests. But it wrongly merges sync requests for different
    tree logs: if multiple tree logs are synced at the same time,
    only one of them actually gets synced.

    This patch makes the following changes to fix the bug:

    Move most tree log related fields in btrfs_fs_info to
    btrfs_root. This allows merging sync requests separately
    for each tree log.

    Don't insert the root item into the log root tree immediately
    after the log tree is allocated. The root item for a log tree is
    inserted when the log tree gets synced for the first time. This
    allows syncing the log root tree without first syncing all
    log trees.

    At tree-log sync, btrfs_sync_log first syncs the log tree, then
    updates the corresponding root item in the log root tree, then
    syncs the log root tree, and finally updates the super block.

    Signed-off-by: Yan Zheng

    Yan Zheng
     

10 Jan, 2009

1 commit

  • Each subvolume has an extent_state_tree used to mark metadata
    that needs to be sent to disk while syncing the tree. This is
    used in addition to the dirty bits on the pages themselves so that
    a single subvolume can be sent to disk efficiently in disk order.

    Normally this marking happens in btrfs_alloc_free_block, which also does
    special recording of dirty tree blocks for the tree log roots.

    Yan Zheng noticed that when the root of the log tree is allocated, it is added
    to the wrong writeback list. The fix used here is to explicitly set
    it dirty as part of tree log creation.

    Signed-off-by: Chris Mason

    Chris Mason
     

07 Jan, 2009

1 commit

  • This patch contains the following changes:

    1) Limit the max size of the btrfs_ordered_sum structure to
    PAGE_SIZE. This struct is kmalloced, so we want to keep it
    reasonable.

    2) Replace copy_extent_csums with btrfs_lookup_csums_range,
    removing duplicated code in tree-log.c.

    3) Remove replay_one_csum. Csum items are replayed at the same time
    as file extents. This guarantees we only replay useful csums.

    4) nbytes accounting fix.

    Signed-off-by: Yan Zheng

    Yan Zheng
     

06 Jan, 2009

1 commit