28 Jul, 2011

1 commit

  • The btrfs metadata btree is the source of significant
    lock contention, especially in the root node. This
    commit changes our locking to use a reader/writer
    lock.

    The lock is built on top of rw spinlocks, and it
    extends the lock tracking to remember if we have a
    read lock or a write lock when we go to blocking. Atomics
    count the number of blocking readers or writers at any
    given time.

    It removes all of the adaptive spinning from the old code
    and uses only the spinning/blocking hints inside of btrfs
    to decide when it should continue spinning.

    In read heavy workloads this is dramatically faster. In write
    heavy workloads we're still faster because of less contention
    on the root node lock.

    We suffer slightly in dbench because we schedule more often
    during write locks, but all other benchmarks so far are improved.

    Signed-off-by: Chris Mason

    Chris Mason
     

06 May, 2011

1 commit

  • Remove static and global declarations and/or definitions. Reduces size
    of btrfs.ko by ~3.4kB.

    text data bss dec hex filename
    402081 7464 200 409745 64091 btrfs.ko.base
    398620 7144 200 405964 631cc btrfs.ko.remove-all

    Signed-off-by: David Sterba

    David Sterba
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

03 Apr, 2009

1 commit


25 Mar, 2009

2 commits

  • btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
    for the buffers it was dirtying. This may require a kmalloc and it
    was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
    set any btree locks they were holding to blocking first.

    This commit changes dirty tracking for extent buffers to just use a flag
    in the extent buffer. Now that we have one and only one extent buffer
    per page, this can be safely done without losing dirty bits along the way.

    This also introduces a path->leave_spinning flag that callers of
    btrfs_search_slot can use to indicate they will properly deal with a
    path returned where all the locks are spinning instead of blocking.

    Many of the btree search callers now expect spinning paths,
    resulting in better btree concurrency overall.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This reduces contention on the extent buffer spin locks by testing for a
    blocking lock before trying to take the spinlock.

    Signed-off-by: Chris Mason

    Chris Mason
     

09 Mar, 2009

1 commit

  • btrfs_tree_locked was being used to make sure a given extent_buffer was
    properly locked in a few places. But, it wasn't correct for UP compiled
    kernels.

    This switches it to using assert_spin_locked instead, and renames it to
    btrfs_assert_tree_locked to better reflect how it was really being used.

    Signed-off-by: Chris Mason

    Chris Mason
     

13 Feb, 2009

1 commit

  • Btrfs is currently using spin_lock_nested with a nested value based
    on the tree depth of the block. But, this doesn't quite work because
    the max tree depth is bigger than what spin_lock_nested can deal with,
    and because locks are sometimes taken before the level field is filled in.

    The solution here is to use lockdep_set_class_and_name instead, and to
    set the class before unlocking the pages when the block is read from the
    disk and just after init of a freshly allocated tree block.

    btrfs_clear_path_blocking is also changed to take the locks in the proper
    order, and it also makes sure all the locks currently held are properly
    set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
    gets upset about bad lock orderin.

    The lockdep magic cam from Peter Zijlstra

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Feb, 2009

1 commit

  • Btrfs was using spin_is_contended to see if it should drop locks before
    doing extent allocations during btrfs_search_slot. The idea was to avoid
    expensive searches in the tree unless the lock was actually contended.

    But, spin_is_contended is specific to the ticket spinlocks on x86, so this
    is causing compile errors everywhere else.

    In practice, the contention could easily appear some time after we started
    doing the extent allocation, and it makes more sense to always drop the lock
    instead.

    Signed-off-by: Chris Mason

    Chris Mason
     

04 Feb, 2009

1 commit

  • Most of the btrfs metadata operations can be protected by a spinlock,
    but some operations still need to schedule.

    So far, btrfs has been using a mutex along with a trylock loop,
    most of the time it is able to avoid going for the full mutex, so
    the trylock loop is a big performance gain.

    This commit is step one for getting rid of the blocking locks entirely.
    btrfs_tree_lock takes a spinlock, and the code explicitly switches
    to a blocking lock when it starts an operation that can schedule.

    We'll be able get rid of the blocking locks in smaller pieces over time.
    Tracing allows us to find the most common cause of blocking, so we
    can start with the hot spots first.

    The basic idea is:

    btrfs_tree_lock() returns with the spin lock held

    btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
    the extent buffer flags, and then drops the spin lock. The buffer is
    still considered locked by all of the btrfs code.

    If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
    the spin lock and waits on a wait queue for the blocking bit to go away.

    Much of the code that needs to set the blocking bit finishes without actually
    blocking a good percentage of the time. So, an adaptive spin is still
    used against the blocking bit to avoid very high context switch rates.

    btrfs_clear_lock_blocking() clears the blocking bit and returns
    with the spinlock held again.

    btrfs_tree_unlock() can be called on either blocking or spinning locks,
    it does the right thing based on the blocking bit.

    ctree.c has a helper function to set/clear all the locked buffers in a
    path as blocking.

    Signed-off-by: Chris Mason

    Chris Mason
     

06 Jan, 2009

1 commit


30 Sep, 2008

1 commit

  • This improves the comments at the top of many functions. It didn't
    dive into the guts of functions because I was trying to
    avoid merging problems with the new allocator and back reference work.

    extent-tree.c and volumes.c were both skipped, and there is definitely
    more work todo in cleaning and commenting the code.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 Sep, 2008

7 commits

  • A btree block cow has two parts, the first is to allocate a destination
    block and the second is to copy the old bock over.

    The first part needs locks in the extent allocation tree, and may need to
    do IO. This changeset splits that into a separate function that can be
    called without any tree locks held.

    btrfs_search_slot is changed to drop its path and start over if it has
    to COW a contended block. This often means that many writers will
    pre-alloc a new destination for a the same contended block, but they
    cache their prealloc for later use on lower levels in the tree.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The memory reclaiming issue happens when snapshot exists. In that
    case, some cache entries may not be used during old snapshot dropping,
    so they will remain in the cache until umount.

    The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
    the patch makes all dead roots of a given snapshot linked together in order of
    create time. After a old snapshot was completely dropped, we check the dead
    root list and remove all cache entries created before the oldest dead root in
    the list.

    Signed-off-by: Chris Mason

    Yan
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • Lockdep has the notion of locking subclasses so that you can identify
    locks you expect to be taken after other locks of the same class. This
    changes the per-extent buffer btree locking routines to use a subclass based
    on the level in the tree.

    Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs
    max level is also 8. So when lockdep is on, use a lower max level.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This replaces the use of the page cache lock bit for locking, which wasn't
    suitable for block size < page size and couldn't be used recursively.

    The mutexes alone don't fix either problem, but they are the first step.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This calls unlock_up sooner in btrfs_search_slot in order to decrease the
    amount of work done with the higher level tree locks held.

    Also, it changes btrfs_tree_lock to spin for a big against the page lock
    before scheduling. This makes a big difference in context switch rate under
    highly contended workloads.

    Longer term, a better locking structure is needed than the page lock.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The allocation trees and the chunk trees are serialized via their own
    dedicated mutexes. This means allocation location is still not very
    fine grained.

    The main FS btree is protected by locks on each block in the btree. Locks
    are taken top / down, and as processing finishes on a given level of the
    tree, the lock is released after locking the lower level.

    The end result of a search is now a path where only the lowest level
    is locked. Releasing or freeing the path drops any locks held.

    Signed-off-by: Chris Mason

    Chris Mason