24 May, 2011

1 commit


08 Jan, 2011

1 commit


19 May, 2010

3 commits

  • Joel Becker
     
  • ocfs2_find_cpos_for_left_leaf() was originally pulled out of
    alloc.c for the benefit of the hole-punching optimization patch.
    However, other functions that need to do the same job can also
    use it in the future.

    Signed-off-by: Tristan Ye
    Acked-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Tristan Ye
     
  • Truncate is just a special case of punching holes (from the new i_size to
    the end), so we can take advantage of the existing
    ocfs2_remove_btree_range() to reduce the complexity and redundancy in
    alloc.c. The goal here is to make truncate more generic and
    straightforward.

    Several functions used only by ocfs2_commit_truncate() will simply be
    removed.

    ocfs2_remove_btree_range() was originally used by the hole punching
    code, which didn't take refcount trees into account (definitely a bug).
    We therefore need to change that func a bit to handle refcount trees.
    It must take the refcount lock, calculate and reserve blocks for
    refcount tree changes, and decrease refcounts at the end. We replace
    ocfs2_lock_allocators() here by adding a new func
    ocfs2_reserve_blocks_for_rec_trunc() which accepts some extra blocks to
    reserve. This will not hurt any other code using
    ocfs2_remove_btree_range() (such as dir truncate and hole punching).

    I merged the following steps into one patch since they are
    logically doing one thing, though I know that makes it a little
    fat to review.

    1). Remove redundant code used by ocfs2_commit_truncate(), since we're
    moving to ocfs2_remove_btree_range anyway.

    2). Add a new func ocfs2_reserve_blocks_for_rec_trunc() for purpose of
    accepting some extra blocks to reserve.

    3). Change ocfs2_prepare_refcount_change_for_del() a bit to fit our
    needs. It's safe to do this since it's only being called by
    truncate.

    4). Change ocfs2_remove_btree_range() a bit to take refcount case into
    account.

    5). Finally, we change ocfs2_commit_truncate() to call
    ocfs2_remove_btree_range() in a proper way.

    The patch has passed basic sanity tests; stress tests with
    heavier workloads are still to come.

    Based on this patch, fixing the punching holes bug will be fairly easy.
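
The "truncate is a hole punch from the new i_size to the end" idea can be sketched as follows. This is a minimal illustration, not the ocfs2 code: struct cluster_range, clusters_for_bytes() and truncate_as_punch() are hypothetical names, and refcount handling, locking and journaling are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cluster range; not an ocfs2 structure. */
struct cluster_range {
    uint32_t start; /* first cluster to remove */
    uint32_t len;   /* number of clusters to remove */
};

/* Number of clusters needed to hold `bytes` bytes (rounds up). */
uint32_t clusters_for_bytes(uint64_t bytes, unsigned int cluster_bits)
{
    assert(cluster_bits < 32);
    uint64_t cluster_size = 1ULL << cluster_bits;
    return (uint32_t)((bytes + cluster_size - 1) >> cluster_bits);
}

/* Truncate expressed as "remove the range from new_size to the end". */
struct cluster_range truncate_as_punch(uint64_t new_size,
                                       uint32_t allocated_clusters,
                                       unsigned int cluster_bits)
{
    struct cluster_range r = { 0, 0 };
    uint32_t keep = clusters_for_bytes(new_size, cluster_bits);

    /* Everything past the clusters we keep is one hole to punch. */
    if (keep < allocated_clusters) {
        r.start = keep;
        r.len = allocated_clusters - keep;
    }
    return r;
}
```

With 1MB clusters (cluster_bits = 20) and 10 allocated clusters, truncating to 4096 bytes keeps one cluster and yields a removal range starting at cluster 1 with length 9, which is the kind of range a generic range-removal helper would then be asked to drop.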

    Signed-off-by: Tristan Ye
    Acked-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Tristan Ye
     

22 Mar, 2010

1 commit


03 Dec, 2009

1 commit

  • ocfs2 refcount tree is stored as an extent tree while
    the leaf ocfs2_refcount_rec points to a refcount block.

    The following steps can trigger a kernel panic:
    mkfs.ocfs2 -b 512 -C 1M --fs-features=refcount $DEVICE
    mount -t ocfs2 $DEVICE $MNT_DIR
    FILE_NAME=$RANDOM
    FILE_NAME_1=$RANDOM
    FILE_REF="${FILE_NAME}_ref"
    FILE_REF_1="${FILE_NAME}_ref_1"
    for((i=0;i> $MNT_DIR/$FILE_NAME
    cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME_1
    done
    for((i=0;i> $MNT_DIR/$FILE_NAME
    done

    for((i=0;i> $MNT_DIR/$FILE_NAME
    cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME_1
    done

    cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME

    for((i=0;i> $MNT_DIR/$FILE_NAME
    cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME_1
    done
    reflink $MNT_DIR/$FILE_NAME $MNT_DIR/$FILE_REF
    # write_f is a program which will write some bytes to a file at offset.
    # write_f -f file_name -l offset -w write_bytes.
    ./write_f -f $MNT_DIR/$FILE_REF -l $[310*1048576] -w 4096
    ./write_f -f $MNT_DIR/$FILE_REF -l $[306*1048576] -w 4096
    ./write_f -f $MNT_DIR/$FILE_REF -l $[311*1048576] -w 4096
    ./write_f -f $MNT_DIR/$FILE_NAME -l $[310*1048576] -w 4096
    ./write_f -f $MNT_DIR/$FILE_NAME -l $[311*1048576] -w 4096
    reflink $MNT_DIR/$FILE_NAME $MNT_DIR/$FILE_REF_1
    ./write_f -f $MNT_DIR/$FILE_NAME -l $[311*1048576] -w 4096
    #kernel panic here.

    The reason is that if the ocfs2_extent_rec is the last record
    in a leaf extent block, the old solution fails to find a
    suitable end cpos. So this patch walks through the b-tree,
    finds the next sub-root and gets the cpos at which the next
    sub-tree starts.

    By the way, I have run Tristan's test case against the patched kernel
    for several days and this type of kernel panic has not happened again.
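
The walk described above can be sketched on a toy model of a b-tree path. Everything here is an assumption made for illustration: struct path_level and next_start_cpos() are hypothetical, and the real code works on ocfs2_path and on-disk extent lists.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RECS 8

/* Hypothetical, simplified stand-in for one level of a b-tree path. */
struct path_level {
    uint32_t cpos[MAX_RECS]; /* starting cpos of each record in this node */
    int count;               /* number of records in use */
    int index;               /* record the path follows at this level */
};

/*
 * levels[0] is the root, levels[depth - 1] is the leaf.
 * Returns the cpos at which the next record or sub-tree starts,
 * or UINT32_MAX if the path is the rightmost one.
 */
uint32_t next_start_cpos(const struct path_level *levels, int depth)
{
    assert(depth > 0);

    /* Climb from the leaf until some node has a record to the right. */
    for (int i = depth - 1; i >= 0; i--) {
        const struct path_level *l = &levels[i];

        if (l->index + 1 < l->count)
            return l->cpos[l->index + 1];
    }
    return UINT32_MAX;
}
```

When the record is the last one in its leaf, the loop skips the leaf and returns the starting cpos of the next sub-tree found higher up, which is the case the old code missed.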

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
     

23 Sep, 2009

5 commits

  • This patch adds CoW support for a refcounted record.

    The whole process is:
    1. Calculate how many clusters we need to CoW and where we start.
    Extents that are not completely encompassed by the write will
    be broken on 1MB boundaries.
    2. Do CoW for the clusters with the help of the page cache.
    3. Change the b-tree structure with the newly allocated clusters.
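
Step 1 can be illustrated with a simplified range calculation. This sketch assumes the write lies inside a single refcounted extent and only models the 1MB rounding; struct cow_range and calc_cow_range() are hypothetical, and the real calculation may take further constraints into account.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the cluster range that gets copied. */
struct cow_range {
    uint32_t start; /* first cluster to CoW */
    uint32_t len;   /* number of clusters to CoW */
};

/*
 * ext_start/ext_len: the refcounted extent, in clusters.
 * wr_start/wr_len:   the write, in clusters, lying inside the extent.
 * contig:            clusters per 1MB (e.g. 256 for 4KB clusters).
 */
struct cow_range calc_cow_range(uint32_t ext_start, uint32_t ext_len,
                                uint32_t wr_start, uint32_t wr_len,
                                uint32_t contig)
{
    struct cow_range r;
    uint32_t ext_end = ext_start + ext_len;
    uint32_t wr_end = wr_start + wr_len;
    uint32_t start, end;

    assert(contig > 0 && wr_len > 0);
    assert(wr_start >= ext_start && wr_end <= ext_end);

    /* Round the write outwards to 1MB boundaries... */
    start = wr_start / contig * contig;
    end = (wr_end + contig - 1) / contig * contig;

    /* ...but never copy anything outside the refcounted extent. */
    if (start < ext_start)
        start = ext_start;
    if (end > ext_end)
        end = ext_end;

    r.start = start;
    r.len = end - start;
    return r;
}
```

For a 10-cluster write at cluster 300 inside an extent covering clusters 0 to 999, with 256 clusters per 1MB, the sketch returns the range starting at cluster 256 with length 256.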

    Signed-off-by: Tao Ma

    Tao Ma
     
  • Add function ocfs2_mark_extent_refcounted which can mark
    an extent refcounted.

    Signed-off-by: Tao Ma

    Tao Ma
     
  • Given a physical cpos and length, decrement the refcount
    in the tree. If the refcount for any portion of the extent goes
    to zero, that portion is queued for freeing.
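
A toy version of this behaviour, assuming one refcount per cluster held in a plain array rather than refcount records in a tree. struct freed_range and decrease_refcount() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical queue entry for clusters that are ready to be freed. */
struct freed_range {
    uint32_t start;
    uint32_t len;
};

/*
 * Decrement the refcount of clusters [cpos, cpos + len).
 * Clusters whose count reaches zero are queued in `freed`, with
 * neighbouring clusters merged into one range.
 * Returns the number of ranges queued.
 */
int decrease_refcount(uint32_t *refcount, uint32_t nclusters,
                      uint32_t cpos, uint32_t len,
                      struct freed_range *freed, int max_freed)
{
    int n = 0;

    assert(cpos <= nclusters && len <= nclusters - cpos);

    for (uint32_t c = cpos; c < cpos + len; c++) {
        assert(refcount[c] > 0);
        if (--refcount[c] != 0)
            continue; /* still shared, nothing to free */

        if (n > 0 && freed[n - 1].start + freed[n - 1].len == c) {
            freed[n - 1].len++; /* extend the previous range */
        } else {
            assert(n < max_freed);
            freed[n].start = c;
            freed[n].len = 1;
            n++;
        }
    }
    return n;
}
```

Only the portions that drop to zero end up in the queue; clusters that are still referenced elsewhere keep their reduced count and stay allocated.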

    Signed-off-by: Tao Ma

    Tao Ma
     
  • fs/ocfs2/alloc.c now has more than 7000 lines. It contains our
    basic b-tree operations. Although we have already made our b-tree
    operations generic, the basic structure ocfs2_path, which is used
    to iterate over one b-tree branch, is still static and can only be
    used in alloc.c. As the refcount tree needs them and I don't want to
    add any more b-tree-unrelated code to alloc.c, export them.

    Signed-off-by: Tao Ma

    Tao Ma
     
  • Add the refcount b-tree as a new extent tree so that it can
    use the b-tree to store and manipulate ocfs2_refcount_rec.

    Signed-off-by: Tao Ma

    Tao Ma
     

05 Sep, 2009

7 commits


04 Apr, 2009

1 commit

  • This patch makes use of Ocfs2's flexible btree code to add an additional
    tree to directory inodes. The new tree stores an array of small,
    fixed-length records in each leaf block. Each record stores a hash value
    and a pointer to a block in the traditional (unindexed) directory tree where a
    dirent with the given name hash resides. Lookup exclusively uses this tree
    to find dirents, thus providing us with constant time name lookups.
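
The record layout described above can be pictured with a small sketch. It assumes a leaf is a flat array that is scanned linearly; struct dx_record and dx_leaf_lookup() are hypothetical and do not reflect the on-disk format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-length index record: name hash plus block pointer. */
struct dx_record {
    uint32_t name_hash;  /* hash of the dirent name */
    uint64_t dirent_blk; /* block in the unindexed tree holding the dirent */
};

/*
 * Scan one leaf for records matching name_hash and collect the blocks
 * that may hold the dirent. Returns the number of candidates written.
 */
int dx_leaf_lookup(const struct dx_record *recs, int count,
                   uint32_t name_hash, uint64_t *out, int max_out)
{
    int found = 0;

    assert(count >= 0 && max_out >= 0);

    for (int i = 0; i < count && found < max_out; i++) {
        if (recs[i].name_hash == name_hash)
            out[found++] = recs[i].dirent_blk;
    }
    return found;
}
```

Several names can share a hash, so a lookup yields candidate blocks; the caller still has to compare the actual name in each block.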

    Some of the hashing code was copied from ext3. Unfortunately, it has lots of
    unfixed checkpatch errors. I left that as-is so that tracking changes would
    be easier.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
     

06 Jan, 2009

6 commits

  • When an ocfs2 extended attribute is large enough to require its own
    allocation tree, we root it with an ocfs2_xattr_value_root. However,
    these roots can be a part of inodes, xattr blocks, or xattr buckets.
    Thus, they need a different journal access function for each container.

    We wrap the bh, its journal access function, and the value root (xv) in
    a structure called ocfs2_xattr_value_buf. This is a package that can
    be passed around. In this first pass, we simply pass it to the
    extent tree code.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • The per-metadata-type ocfs2_journal_access_*() functions hook up jbd2
    commit triggers and allow us to compute metadata ecc right before the
    buffers are written out. This commit provides ecc for inodes, extent
    blocks, group descriptors, and quota blocks. It is not safe to use
    extended attributes and metaecc at the same time yet.

    The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide
    the type of block at their root. Before, it didn't matter, but now the
    root block must use the appropriate ocfs2_journal_access_*() function.
    To keep this abstract, the structures now have a pointer to the matching
    journal_access function and a wrapper call to call it.

    A few places use naked ocfs2_write_block() calls instead of adding the
    blocks to the journal. We make sure to calculate their checksum and ecc
    before the write.

    Since we pass around the journal_access functions, let's typedef them
    in ocfs2.h.
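
The "pointer to the matching journal_access function plus a wrapper" arrangement might look roughly like this. It is a sketch under assumptions: struct buffer, journal_access_func, struct tree_root and the two access functions are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

enum block_kind { KIND_NONE, KIND_INODE, KIND_EXTENT_BLOCK };

/* Hypothetical stand-in for a buffer_head; records how it was accessed. */
struct buffer {
    enum block_kind accessed_as;
};

/* One shared signature for all per-metadata-type access functions. */
typedef int (*journal_access_func)(struct buffer *bh, int type);

int journal_access_inode(struct buffer *bh, int type)
{
    (void)type;
    bh->accessed_as = KIND_INODE;
    return 0;
}

int journal_access_extent_block(struct buffer *bh, int type)
{
    (void)type;
    bh->accessed_as = KIND_EXTENT_BLOCK;
    return 0;
}

/* The tree hides what kind of block sits at its root... */
struct tree_root {
    struct buffer *root_bh;
    journal_access_func root_journal_access;
};

/* ...and this wrapper calls whichever access function was stored. */
int tree_root_journal_access(struct tree_root *root, int type)
{
    assert(root->root_journal_access != NULL);
    return root->root_journal_access(root->root_bh, type);
}
```

Generic tree code only ever calls the wrapper, so adding a new root type means supplying one more function with the typedef'd signature.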

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • We weren't consistently checking extent blocks after we read them.
    Most places checked the signature, but none checked h_blkno or
    h_fs_signature. Create a toplevel ocfs2_read_extent_block() that does
    the read and the validation.
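
A rough sketch of what validating a block after the read could involve. The header layout, the signature string and the field names are assumptions for illustration; the third field stands in for the filesystem-identifying value mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EB_SIGNATURE "EXBLK01" /* assumed signature string */

/* Hypothetical, cut-down extent block header. */
struct eb_header {
    char     signature[8];
    uint64_t blkno;         /* block number the block claims to be */
    uint32_t fs_generation; /* value identifying this filesystem */
};

/* Returns 0 if the block looks valid, a negative code otherwise. */
int validate_extent_block(const struct eb_header *eb,
                          uint64_t expected_blkno,
                          uint32_t fs_generation)
{
    assert(eb != NULL);

    if (memcmp(eb->signature, EB_SIGNATURE, sizeof(eb->signature)) != 0)
        return -1; /* not an extent block at all */
    if (eb->blkno != expected_blkno)
        return -2; /* block does not describe itself correctly */
    if (eb->fs_generation != fs_generation)
        return -3; /* block belongs to another filesystem instance */
    return 0;
}
```

Putting all three checks behind one read helper means no caller can forget one of them.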

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • This patch genericizes the high level handling of extent removal.
    ocfs2_remove_btree_range() is nearly identical to
    __ocfs2_remove_inode_range(), except that extent tree operations have been
    used where necessary. We update ocfs2_remove_inode_range() to use the
    generic helper. Now extent tree based structures have an easy way to
    truncate ranges.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
     
  • In ocfs2 xattr set, we reserve metadata and clusters wherever
    they are needed. This is time-consuming and inefficient, so this
    patch reserves metadata and clusters at the beginning of
    ocfs2_xattr_set.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • Currently the ocfs2 xattr set process is divided into many small
    parts, each wrapped in a different transaction, so the set does not
    behave like a real transaction. We want to integrate it into a
    single real one.

    In some cases we will allocate some clusters and free others in a single
    transaction. For example, one xattr is larger than the inline size, so it
    and its value root are stored within the inode while the value is outside
    in a cluster. If we then update it with a smaller value (larger than the
    size of the root but smaller than the inline size), we may need to free the
    outside cluster while allocating a new bucket (one cluster), since the
    inode may now be full. The old solution locks the global_bitmap (if the
    local alloc failed in a stress test) and then the truncate log. This
    causes an ABBA deadlock with the truncate log flush.

    This patch adds the cluster frees to dealloc_ctxt, so that we can record
    the freed clusters during the transaction and then free them after we
    release the global_bitmap in xattr set.
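
The record-now, free-later pattern can be sketched as below. This is a toy model under assumptions: struct dealloc_ctxt here is a simplified stand-in, struct allocator and both functions are hypothetical, and a plain flag plays the role of the global_bitmap lock.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PENDING 16

/* Hypothetical, simplified stand-in for a deallocation context. */
struct dealloc_ctxt {
    struct {
        uint32_t start;
        uint32_t len;
    } pending[MAX_PENDING];
    int count;
};

/* Toy allocator state: one lock flag and a count of free clusters. */
struct allocator {
    int bitmap_locked;
    uint32_t free_clusters;
};

/* Inside the transaction: only remember what must be freed. */
void dealloc_record(struct dealloc_ctxt *ctxt, uint32_t start, uint32_t len)
{
    assert(ctxt->count < MAX_PENDING);
    ctxt->pending[ctxt->count].start = start;
    ctxt->pending[ctxt->count].len = len;
    ctxt->count++;
}

/* After the bitmap lock is dropped: actually free the recorded clusters. */
uint32_t dealloc_run(struct dealloc_ctxt *ctxt, struct allocator *alloc)
{
    uint32_t freed = 0;

    assert(!alloc->bitmap_locked); /* must not hold the allocation lock */

    for (int i = 0; i < ctxt->count; i++)
        freed += ctxt->pending[i].len;
    alloc->free_clusters += freed;
    ctxt->count = 0;
    return freed;
}
```

Because the second lock is only taken after the first is released, the two are never held in the conflicting order that produced the deadlock.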

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     

14 Oct, 2008

10 commits

  • The original get/put_extent_tree() functions held a reference on
    et_root_bh. However, every single caller already has a safe reference,
    making the get/put cycle irrelevant.

    We change ocfs2_get_*_extent_tree() to ocfs2_init_*_extent_tree(). It
    no longer gets a reference on et_root_bh. ocfs2_put_extent_tree() is
    removed. Callers now have a simpler init+use pattern.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • We now have three different kinds of extent trees in ocfs2: inode data
    (dinode), extended attributes (xattr_tree), and extended attribute
    values (xattr_value). There is a nice abstraction for them,
    ocfs2_extent_tree, but it is hidden in alloc.c. All the calling
    functions have to pick amongst a varied API and pass in type bits and
    often extraneous pointers.

    A better way is to make ocfs2_extent_tree a first-class object.
    Everyone converts their object to an ocfs2_extent_tree via the
    ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all
    tree calls to alloc.c.

    This simplifies a lot of callers, improving readability. It also
    provides an easy way to add additional extent tree types, as they only
    need to be defined in alloc.c with an ocfs2_get_*_extent_tree()
    function.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • A caller knows what kind of extent tree they have. There's no reason
    they have to call ocfs2_get_extent_tree() with a NULL when they could
    just as easily call a specific function to their type of extent tree.

    Introduce ocfs2_dinode_get_extent_tree(),
    ocfs2_xattr_tree_get_extent_tree(), and
    ocfs2_xattr_value_get_extent_tree(). They only take the necessary
    arguments, calling into the underlying __ocfs2_get_extent_tree() to do
    the real work.

    __ocfs2_get_extent_tree() is the old ocfs2_get_extent_tree(), but
    without needing any switch-by-type logic.

    ocfs2_get_extent_tree() is now a wrapper around the specific calls. It
    exists because a couple alloc.c functions can take et_type. This will
    go later.

    Another benefit is that ocfs2_xattr_value_get_extent_tree() can take a
    struct ocfs2_xattr_value_root* instead of void*. This gives us
    typechecking where we didn't have it before.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • In xattr bucket, we want to limit the maximum size of a btree leaf,
    otherwise we'll lose the benefits of hashing because we'll have to search
    large leaves.

    So add a new field in ocfs2_extent_tree which indicates the maximum leaf cluster
    size we want so that we can prevent ocfs2_insert_extent() from merging the leaf
    record even if it is contiguous with an adjacent record.

    Other btree types are not affected by this change.
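
The effect of the new field on merging can be sketched with a small predicate. struct leaf_rec and can_merge() are hypothetical, physical positions are expressed in clusters for simplicity, and a limit of 0 is taken to mean "no limit".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified leaf record. */
struct leaf_rec {
    uint32_t cpos;     /* logical start, in clusters */
    uint32_t phys;     /* physical start, in clusters */
    uint32_t clusters; /* length, in clusters */
};

/*
 * May `ins` be merged onto the right end of `left`?
 * max_leaf_clusters == 0 means the tree has no leaf size limit.
 */
int can_merge(const struct leaf_rec *left, const struct leaf_rec *ins,
              uint32_t max_leaf_clusters)
{
    assert(left->clusters > 0 && ins->clusters > 0);

    if (left->cpos + left->clusters != ins->cpos)
        return 0; /* not logically contiguous */
    if (left->phys + left->clusters != ins->phys)
        return 0; /* not physically contiguous */
    if (max_leaf_clusters &&
        left->clusters + ins->clusters > max_leaf_clusters)
        return 0; /* merged record would exceed the limit */
    return 1;
}
```

Trees that leave the limit at 0 merge exactly as before, which matches the note that other btree types are unaffected.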

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • When necessary, an ocfs2_xattr_block will embed an ocfs2_extent_list to
    store large numbers of EAs. This patch adds a new type in
    ocfs2_extent_tree_type and adds the implementation so that we can re-use the
    b-tree code to handle the storage of many EAs.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • Add some thin wrappers around ocfs2_insert_extent() for each of the 3
    different btree types: ocfs2_dinode_insert_extent(),
    ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The
    last is for the xattr index btree, which will be used in a followup patch.

    All the old callers in file.c etc. will call ocfs2_dinode_insert_extent(),
    while the other two handle the xattr cases. The extent tree
    initialization is handled by these functions.

    When storing an xattr value that is too large, we allocate some clusters
    for it, and ocfs2_extent_list and ocfs2_extent_rec are used here as well. To
    re-use the b-tree operation code, a new parameter named "private"
    is added to ocfs2_extent_tree; it indicates the root of the
    ocfs2_extent_list. The reason is that we can no longer deduce the root from
    the buffer_head: it may be in an inode, an ocfs2_xattr_block or, even worse,
    anywhere in an ocfs2_xattr_bucket.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic
    function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls
    ocfs2_do_cluster_allocation() now, but the latter can be used for other
    btree types as well.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • The old extent tree operations assume that the ocfs2_extent_list in
    ocfs2_dinode is the tree root. Since xattrs will also use
    ocfs2_extent_list to store a large value for an xattr entry, we
    refactor the tree operations so that xattrs can use them directly.

    The refactoring includes 4 steps:
    1. Abstract set/get of last_eb_blk and update_clusters, since they may
    be stored in different locations for dinode and xattr.
    2. Add a new structure named ocfs2_extent_tree to indicate the
    extent tree the operation will work on.
    3. Remove all uses of fe_bh and di, and use root_bh and root_el in the
    extent tree instead. So now every fe_bh is replaced with
    et->root_bh, and el with root_el accordingly.
    4. Make ocfs2_lock_allocators generic. Until now it has been limited to
    file extend allocation, but the whole function is useful when we want
    to store large EAs.

    Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
    for anything other than truncate inode data btrees.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
    ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
    they are all limited to an inode btree because they use a struct
    ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
    (the part of an ocfs2_dinode they actually use) so that the xattr btree code
    can use these functions.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • ocfs2_num_free_extents() is used to find the number of free extent records
    in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to
    use this for extended attribute trees in the future, so genericize the
    interface to take a buffer head. A future patch will allow that buffer_head
    to contain any structure rooting an ocfs2 btree.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     

04 Oct, 2008

1 commit

  • Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be
    refactored so that the extent cache can be skipped in favor of going
    directly to the on-disk records. This makes it easier for us to determine
    which extent is the last one in the btree. Also, I'm not sure we want to be
    caching fiemap lookups anyway as they're not directly related to data
    read/write.

    Signed-off-by: Mark Fasheh
    Signed-off-by: "Theodore Ts'o"
    Cc: ocfs2-devel@oss.oracle.com
    Cc: linux-fsdevel@vger.kernel.org

    Mark Fasheh
     

13 Oct, 2007

2 commits

  • Create all new directories with OCFS2_INLINE_DATA_FL and the inline data
    bytes formatted as an empty directory. Inode size field reflects the actual
    amount of inline data available, which makes searching for dirent space
    very similar to the regular directory search.

    Inline-data directories are automatically pushed out to extents on any
    insert request which is too large for the available space.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     
  • This fixes up write, truncate, mmap, and RESVSP/UNRESVSP to understand inline
    inode data.

    For the most part, the changes to the core write code can be relied on to do
    the heavy lifting. Any code calling ocfs2_write_begin (including shared
    writeable mmap) can count on it doing the right thing with respect to
    growing inline data to an extent tree.

    Size-reducing truncates, including UNRESVSP, can simply zero the portion of
    the inode block being removed. Size-increasing truncates, including RESVSP,
    have to be a little bit smarter and grow the inode to an extent tree if
    necessary.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     

11 Jul, 2007

1 commit

  • Provide an internal interface for the removal of arbitrary file regions.

    ocfs2_remove_inode_range() takes a byte range within a file and will remove
    existing extents within that range. Partial clusters will be zeroed so that
    any read from within the region will return zeros.
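
How a byte range splits into whole clusters to remove and partial clusters to zero can be sketched as follows. struct remove_plan and plan_remove_range() are hypothetical; the sketch only does the arithmetic and ignores the extent tree, holes and i_size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical plan for removing one byte range from a file. */
struct remove_plan {
    uint64_t zero_head_start, zero_head_len; /* partial cluster at the start */
    uint32_t remove_cpos, remove_clusters;   /* whole clusters to remove */
    uint64_t zero_tail_start, zero_tail_len; /* partial cluster at the end */
};

struct remove_plan plan_remove_range(uint64_t byte_start, uint64_t byte_len,
                                     unsigned int cluster_bits)
{
    struct remove_plan p = { 0, 0, 0, 0, 0, 0 };

    assert(cluster_bits < 32 && byte_len > 0);

    uint64_t csize = 1ULL << cluster_bits;
    uint64_t byte_end = byte_start + byte_len;
    /* First cluster lying fully inside the range (round start up). */
    uint64_t first_whole = (byte_start + csize - 1) >> cluster_bits;
    /* One past the last cluster lying fully inside (round end down). */
    uint64_t whole_end = byte_end >> cluster_bits;

    if (first_whole >= whole_end) {
        /* No whole cluster inside the range: zero all of it. */
        p.zero_head_start = byte_start;
        p.zero_head_len = byte_len;
        return p;
    }

    p.zero_head_start = byte_start;
    p.zero_head_len = (first_whole << cluster_bits) - byte_start;
    p.remove_cpos = (uint32_t)first_whole;
    p.remove_clusters = (uint32_t)(whole_end - first_whole);
    p.zero_tail_start = whole_end << cluster_bits;
    p.zero_tail_len = byte_end - p.zero_tail_start;
    return p;
}
```

With 4KB clusters, removing 10000 bytes at offset 1000 zeroes bytes 1000 to 4095, removes cluster 1, and zeroes bytes 8192 to 10999.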

    Signed-off-by: Mark Fasheh

    Mark Fasheh