15 Nov, 2011

1 commit

  • The btrfs snapshotting code requires that once a root has been
    snapshotted, we don't change it during a commit.

    But there are two cases to lead to tree corruptions:

    1) multi-thread snapshots can commit serveral snapshots in a transaction,
    and this may change the src root when processing the following pending
    snapshots, which lead to the former snapshots corruptions;

    2) the free inode cache was changing the roots when it root the cache,
    which lead to corruptions.

    This fixes things by making sure we force COW the block after we create a
    snapshot during commiting a transaction, then any changes to the roots
    will result in COW, and we get all the fs roots and snapshot roots to be
    consistent.

    Signed-off-by: Liu Bo
    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Liu Bo
     

21 Oct, 2011

1 commit


28 Jul, 2011

3 commits

  • Before the reader/writer locks, btrfs_next_leaf needed to keep
    the path blocking to avoid making lockdep upset.

    Now that btrfs_next_leaf only takes read locks, this isn't required.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The btrfs metadata btree is the source of significant
    lock contention, especially in the root node. This
    commit changes our locking to use a reader/writer
    lock.

    The lock is built on top of rw spinlocks, and it
    extends the lock tracking to remember if we have a
    read lock or a write lock when we go to blocking. Atomics
    count the number of blocking readers or writers at any
    given time.

    It removes all of the adaptive spinning from the old code
    and uses only the spinning/blocking hints inside of btrfs
    to decide when it should continue spinning.

    In read heavy workloads this is dramatically faster. In write
    heavy workloads we're still faster because of less contention
    on the root node lock.

    We suffer slightly in dbench because we schedule more often
    during write locks, but all other benchmarks so far are improved.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The extent_buffers have a very complex interface where
    we use HIGHMEM for metadata and try to cache a kmap mapping
    to access the memory.

    The next commit adds reader/writer locks, and concurrent use
    of this kmap cache would make it even more complex.

    This commit drops the ability to use HIGHMEM with extent buffers,
    and rips out all of the related code.

    Signed-off-by: Chris Mason

    Chris Mason
     

09 Jun, 2011

1 commit

  • Arne's scrub stuff exposed a problem with mapping the extent buffer in
    reada_for_search. He searches the commit root with multiple threads and with
    skip_locking set, so we can race and overwrite node->map_token since node isn't
    locked. So fix this so that we only map the extent buffer if we don't already
    have a map_token and skip_locking isn't set. Without this patch scrub would
    panic almost immediately, with the patch it doesn't panic anymore. Thanks,

    Reported-by: Arne Jansen
    Signed-off-by: Josef Bacik

    Josef Bacik
     

28 May, 2011

1 commit

  • git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

    Conflicts:
    fs/btrfs/disk-io.c
    fs/btrfs/extent-tree.c
    fs/btrfs/free-space-cache.c
    fs/btrfs/inode.c
    fs/btrfs/transaction.c

    Signed-off-by: Chris Mason

    Chris Mason
     

24 May, 2011

5 commits


23 May, 2011

1 commit


21 May, 2011

1 commit

  • Changelog V5 -> V6:
    - Fix oom when the memory load is high, by storing the delayed nodes into the
    root's radix tree, and letting btrfs inodes go.

    Changelog V4 -> V5:
    - Fix the race on adding the delayed node to the inode, which is spotted by
    Chris Mason.
    - Merge Chris Mason's incremental patch into this patch.
    - Fix deadlock between readdir() and memory fault, which is reported by
    Itaru Kitayama.

    Changelog V3 -> V4:
    - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
    inode in time.

    Changelog V2 -> V3:
    - Fix the race between the delayed worker and the task which does delayed items
    balance, which is reported by Tsutomu Itoh.
    - Modify the patch address David Sterba's comment.
    - Fix the bug of the cpu recursion spinlock, reported by Chris Mason

    Changelog V1 -> V2:
    - break up the global rb-tree, use a list to manage the delayed nodes,
    which is created for every directory and file, and used to manage the
    delayed directory name index items and the delayed inode item.
    - introduce a worker to deal with the delayed nodes.

    Compare with Ext3/4, the performance of file creation and deletion on btrfs
    is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
    such as inode item, directory name item, directory name index and so on.

    If we can do some delayed b+ tree insertion or deletion, we can improve the
    performance, so we made this patch which implemented delayed directory name
    index insertion/deletion and delayed inode update.

    Implementation:
    - introduce a delayed root object into the filesystem, that use two lists to
    manage the delayed nodes which are created for every file/directory.
    One is used to manage all the delayed nodes that have delayed items. And the
    other is used to manage the delayed nodes which is waiting to be dealt with
    by the work thread.
    - Every delayed node has two rb-tree, one is used to manage the directory name
    index which is going to be inserted into b+ tree, and the other is used to
    manage the directory name index which is going to be deleted from b+ tree.
    - introduce a worker to deal with the delayed operation. This worker is used
    to deal with the works of the delayed directory name index items insertion
    and deletion and the delayed inode update.
    When the delayed items is beyond the lower limit, we create works for some
    delayed nodes and insert them into the work queue of the worker, and then
    go back.
    When the delayed items is beyond the upper bound, we create works for all
    the delayed nodes that haven't been dealt with, and insert them into the work
    queue of the worker, and then wait for that the untreated items is below some
    threshold value.
    - When we want to insert a directory name index into b+ tree, we just add the
    information into the delayed inserting rb-tree.
    And then we check the number of the delayed items and do delayed items
    balance. (The balance policy is above.)
    - When we want to delete a directory name index from the b+ tree, we search it
    in the inserting rb-tree at first. If we look it up, just drop it. If not,
    add the key of it into the delayed deleting rb-tree.
    Similar to the delayed inserting rb-tree, we also check the number of the
    delayed items and do delayed items balance.
    (The same to inserting manipulation)
    - When we want to update the metadata of some inode, we cached the data of the
    inode into the delayed node. the worker will flush it into the b+ tree after
    dealing with the delayed insertion and deletion.
    - We will move the delayed node to the tail of the list after we access the
    delayed node, By this way, we can cache more delayed items and merge more
    inode updates.
    - If we want to commit transaction, we will deal with all the delayed node.
    - the delayed node will be freed when we free the btrfs inode.
    - Before we log the inode items, we commit all the directory name index items
    and the delayed inode update.

    I did a quick test by the benchmark tool[1] and found we can improve the
    performance of file creation by ~15%, and file deletion by ~20%.

    Before applying this patch:
    Create files:
    Total files: 50000
    Total time: 1.096108
    Average time: 0.000022
    Delete files:
    Total files: 50000
    Total time: 1.510403
    Average time: 0.000030

    After applying this patch:
    Create files:
    Total files: 50000
    Total time: 0.932899
    Average time: 0.000019
    Delete files:
    Total files: 50000
    Total time: 1.215732
    Average time: 0.000024

    [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

    Many thanks for Kitayama-san's help!

    Signed-off-by: Miao Xie
    Reviewed-by: David Sterba
    Tested-by: Tsutomu Itoh
    Tested-by: Itaru Kitayama
    Signed-off-by: Chris Mason

    Miao Xie
     

02 May, 2011

3 commits


28 Mar, 2011

4 commits

  • This patch is checking return value of read_tree_block(),
    and if it is NULL, error processing.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Tsutomu Itoh
     
  • This patch changes some BUG_ON() to the error return.
    (but, most callers still use BUG_ON())

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Tsutomu Itoh
     
  • Tracepoints can provide insight into why btrfs hits bugs and be greatly
    helpful for debugging, e.g
    dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
    dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
    btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
    btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
    btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
    flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
    flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
    flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
    flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
    btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
    btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

    Here is what I have added:

    1) ordere_extent:
    btrfs_ordered_extent_add
    btrfs_ordered_extent_remove
    btrfs_ordered_extent_start
    btrfs_ordered_extent_put

    These provide critical information to understand how ordered_extents are
    updated.

    2) extent_map:
    btrfs_get_extent

    extent_map is used in both read and write cases, and it is useful for tracking
    how btrfs specific IO is running.

    3) writepage:
    __extent_writepage
    btrfs_writepage_end_io_hook

    Pages are cirtical resourses and produce a lot of corner cases during writeback,
    so it is valuable to know how page is written to disk.

    4) inode:
    btrfs_inode_new
    btrfs_inode_request
    btrfs_inode_evict

    These can show where and when a inode is created, when a inode is evicted.

    5) sync:
    btrfs_sync_file
    btrfs_sync_fs

    These show sync arguments.

    6) transaction:
    btrfs_transaction_commit

    In transaction based filesystem, it will be useful to know the generation and
    who does commit.

    7) back reference and cow:
    btrfs_delayed_tree_ref
    btrfs_delayed_data_ref
    btrfs_delayed_ref_head
    btrfs_cow_block

    Btrfs natively supports back references, these tracepoints are helpful on
    understanding btrfs's COW mechanism.

    8) chunk:
    btrfs_chunk_alloc
    btrfs_chunk_free

    Chunk is a link between physical offset and logical offset, and stands for space
    infomation in btrfs, and these are helpful on tracing space things.

    9) reserved_extent:
    btrfs_reserved_extent_alloc
    btrfs_reserved_extent_free

    These can show how btrfs uses its space.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
     
  • The pointer to the extent buffer for the root of each tree
    is protected by a spinlock so that we can safely read the pointer
    and take a reference on the extent buffer.

    But now that the extent buffers are freed via RCU, we can safely
    use rcu_read_lock instead.

    Signed-off-by: Chris Mason

    Chris Mason
     

18 Mar, 2011

1 commit

  • Currently if we have corrupted items things will blow up in spectacular ways.
    So as we read in blocks and they are leaves, check the entire leaf to make sure
    all of the items are correct and point to valid parts in the leaf for the item
    data the are responsible for. If the item is corrupt we will kick back EIO and
    not read any of the copies since they are likely to not be correct either. This
    will catch generic corruptions, it will be up to the individual callers of
    btrfs_search_slot to make sure their items are right. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

17 Jan, 2011

2 commits

  • Should check if functions returns NULL or not.

    Signed-off-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Tsutomu Itoh
     
  • Hi,

    In fs/btrfs/inode.c::fixup_tree_root_location() we have this code:

    ...
    if (!path) {
    err = -ENOMEM;
    goto out;
    }
    ...
    out:
    btrfs_free_path(path);
    return err;

    btrfs_free_path() passes its argument on to other functions and some of
    them end up dereferencing the pointer.
    In the code above that pointer is clearly NULL, so btrfs_free_path() will
    eventually cause a NULL dereference.

    There are many ways to cut this cake (fix the bug). The one I chose was to
    make btrfs_free_path() deal gracefully with NULL pointers. If you
    disagree, feel free to come up with an alternative patch.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Chris Mason

    Jesper Juhl
     

30 Oct, 2010

1 commit

  • These are all the cases where a variable is set, but not read which are
    not bugs as far as I can see, but simply leftovers.

    Still needs more review.

    Found by gcc 4.6's new warnings

    Signed-off-by: Andi Kleen
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Mason

    Andi Kleen
     

29 Oct, 2010

1 commit


20 Jul, 2010

1 commit

  • split_leaf was not properly balancing leaves when it was forced to
    split a leaf twice. This commit adds an extra push left and right
    before forcing the double split in hopes of getting the slot where
    we want to insert at either the start or end of the leaf.

    If the extra pushes do work, then we are able to avoid splitting twice
    and we keep the tree properly balanced.

    Signed-off-by: Chris Mason

    Chris Mason
     

27 May, 2010

1 commit

  • After the path is released, the generation number got from block
    pointer is no long valid. The race may cause disk corruption, because
    verify_parent_transid() calls clear_extent_buffer_uptodate() when
    generation numbers mismatch.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

25 May, 2010

2 commits

  • This patch adds metadata ENOSPC handling for the balance code.
    It is consisted by following major changes:

    1. Avoid COW tree leave in the phrase of merging tree.

    2. Handle interaction with snapshot creation.

    3. make the backref cache can live across transactions.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     
  • Introducing metadata reseravtion contexts has two major advantages.
    First, it makes metadata reseravtion more traceable. Second, it can
    reclaim freed space and re-add them to the itself after transaction
    committed.

    Besides add btrfs_block_rsv structure and related helper functions,
    This patch contains following changes:

    Move code that decides if freed tree block should be pinned into
    btrfs_free_tree_block().

    Make space accounting more accurate, mainly for handling read only
    block groups.

    Signed-off-by: Chris Mason

    Yan, Zheng
     

06 Apr, 2010

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: add check for changed leaves in setup_leaf_for_split
    Btrfs: create snapshot references in same commit as snapshot
    Btrfs: fix small race with delalloc flushing waitqueue's
    Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
    Btrfs: fix chunk allocate size calculation
    Btrfs: kill max_extent mount option
    Btrfs: fail to mount if we have problems reading the block groups
    Btrfs: check btrfs_get_extent return for IS_ERR()
    Btrfs: handle kmalloc() failure in inode lookup ioctl
    Btrfs: dereferencing freed memory
    Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
    Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
    Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
    Btrfs: add NULL check for do_walk_down()
    Btrfs: remove duplicate include in ioctl.c

    Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
    cleanups.

    Linus Torvalds
     
  • setup_leaf_for_split needs to drop the path and search again, and has
    checks to see if the item we want to split changed size. But, it misses
    the case where the leaf changed and now has enough room for the item
    we want to insert.

    This adds an extra check to make sure the leaf really needs splitting
    before we call btrfs_split_leaf(), which keeps us from trying to split
    a leaf with a single item.

    btrfs_split_leaf() will blindly split the single item leaf, leaving us
    with one good leaf and one empty leaf and then a crash.

    Signed-off-by: Chris Mason

    Chris Mason
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

18 Dec, 2009

1 commit

  • The bytes_used field in root item was originally planned to
    trace the amount of used data and tree blocks. But it never
    worked right since we can't trace freeing of data accurately.
    This patch changes it to only trace the amount of tree blocks.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

16 Dec, 2009

1 commit

  • btrfs_duplicate_item duplicates item with new key, guaranteeing
    the source item and the new items are in the same tree leaf and
    contiguous. It allows us to split file extent in place, without
    using lock_extent to prevent bookend extent race.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

24 Sep, 2009

1 commit

  • For every hardlink in btrfs, there is a corresponding inode back
    reference. All inode back references for hardlinks in a given
    directory are stored in single b-tree item. The size of b-tree item
    is limited by the size of b-tree leaf, so we can only create limited
    number of hardlinks to a given file in a directory.

    The original code lacks of the check, it oops if the number of
    hardlinks goes over the limit. This patch fixes the issue by adding
    check to btrfs_link and btrfs_rename.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

25 Jul, 2009

1 commit

  • btrfs_split_leaf and btrfs_del_items can end up in a loop
    where one is constantly spliting a given leaf and the other
    is constantly merging it back with the adjacent nodes.

    There is a better fix for this, but in the interest of something
    small, this patch just changes btrfs_del_items back to balancing less
    often.

    Signed-off-by: Chris Mason

    Yan Zheng
     

24 Jul, 2009

2 commits


22 Jul, 2009

1 commit

  • When walking up the tree, btrfs_find_next_key assumes the upper level tree
    block is properly locked. This isn't always true even path->keep_locks is 1.
    This is because btrfs_find_next_key may advance path->slots[] several times
    instead of only once.

    When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
    we can't guarantee the original value of 'path->slots[level]' is
    'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
    'level + 1' isn't locked.

    This patch fixes the issue by explicitly checking the locking state,
    re-searching the tree if it's not locked.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan Zheng