13 Mar, 2009

1 commit


06 Jan, 2009

4 commits

  • The per-metadata-type ocfs2_journal_access_*() functions hook up jbd2
    commit triggers and allow us to compute metadata ecc right before the
    buffers are written out. This commit provides ecc for inodes, extent
    blocks, group descriptors, and quota blocks. It is not safe to use
    extened attributes and metaecc at the same time yet.

    The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide
    the type of block at their root. Before, it didn't matter, but now the
    root block must use the appropriate ocfs2_journal_access_*() function.
    To keep this abstract, the structures now have a pointer to the matching
    journal_access function and a wrapper call to call it.

    A few places use naked ocfs2_write_block() calls instead of adding the
    blocks to the journal. We make sure to calculate their checksum and ecc
    before the write.

    Since we pass around the journal_access functions. Let's typedef them
    in ocfs2.h.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • Add quota calls for allocation and freeing of inodes and space, also update
    estimates on number of needed credits for a transaction. Move out inode
    allocation from ocfs2_mknod_locked() because vfs_dq_init() must be called
    outside of a transaction.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • JBD2 is fully backwards compatible with JBD and it's been tested enough with
    Ocfs2 that we can clean this code up now.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • The ocfs2 code currently reads inodes off disk with a simple
    ocfs2_read_block() call. Each place that does this has a different set
    of sanity checks it performs. Some check only the signature. A couple
    validate the block number (the block read vs di->i_blkno). A couple
    others check for VALID_FL. Only one place validates i_fs_generation. A
    couple check nothing. Even when an error is found, they don't all do
    the same thing.

    We wrap inode reading into ocfs2_read_inode_block(). This will validate
    all the above fields, going readonly if they are invalid (they never
    should be). ocfs2_read_inode_block_full() is provided for the places
    that want to pass read_block flags. Every caller is passing a struct
    inode with a valid ip_blkno, so we don't need a separate blkno argument
    either.

    We will remove the validation checks from the rest of the code in a
    later commit, as they are no longer necessary.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     

15 Oct, 2008

2 commits

  • More than 30 callers of ocfs2_read_block() pass exactly OCFS2_BH_CACHED.
    Only six pass a different flag set. Rather than have every caller care,
    let's make ocfs2_read_block() take no flags and always do a cached read.
    The remaining six places can call ocfs2_read_blocks() directly.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • Now that synchronous readers are using ocfs2_read_blocks_sync(), all
    callers of ocfs2_read_blocks() are passing an inode. Use it
    unconditionally. Since it's there, we don't need to pass the
    ocfs2_super either.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     

14 Oct, 2008

9 commits

  • This is pointless as brelse() already does the check.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
    limiting our maximum filesystem size.

    It's a pretty trivial change. Most functions are just renamed. The
    only functional change is moving to Jan's inode-based ordered data mode.
    It's better, too.

    Because JBD2 reads and writes JBD journals, this is compatible with any
    existing filesystem. It can even interact with JBD-based ocfs2 as long
    as the journal is formated for JBD.

    We provide a compatibility option so that paranoid people can still use
    JBD for the time being. This will go away shortly.

    [ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
    ocfs2_truncate_for_delete(). --Mark ]

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • The original get/put_extent_tree() functions held a reference on
    et_root_bh. However, every single caller already has a safe reference,
    making the get/put cycle irrelevant.

    We change ocfs2_get_*_extent_tree() to ocfs2_init_*_extent_tree(). It
    no longer gets a reference on et_root_bh. ocfs2_put_extent_tree() is
    removed. Callers now have a simpler init+use pattern.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • We now have three different kinds of extent trees in ocfs2: inode data
    (dinode), extended attributes (xattr_tree), and extended attribute
    values (xattr_value). There is a nice abstraction for them,
    ocfs2_extent_tree, but it is hidden in alloc.c. All the calling
    functions have to pick amongst a varied API and pass in type bits and
    often extraneous pointers.

    A better way is to make ocfs2_extent_tree a first-class object.
    Everyone converts their object to an ocfs2_extent_tree() via the
    ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all
    tree calls to alloc.c.

    This simplifies a lot of callers, making for readability. It also
    provides an easy way to add additional extent tree types, as they only
    need to be defined in alloc.c with a ocfs2_get__extent_tree()
    function.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • Add some thin wrappers around ocfs2_insert_extent() for each of the 3
    different btree types, ocfs2_inode_insert_extent(),
    ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The
    last is for the xattr index btree, which will be used in a followup patch.

    All the old callers in file.c etc will call ocfs2_dinode_insert_extent(),
    while the other two handle the xattr issue. And the init of extent tree are
    handled by these functions.

    When storing xattr value which is too large, we will allocate some clusters
    for it and here ocfs2_extent_list and ocfs2_extent_rec will also be used. In
    order to re-use the b-tree operation code, a new parameter named "private"
    is added into ocfs2_extent_tree and it is used to indicate the root of
    ocfs2_exent_list. The reason is that we can't deduce the root from the
    buffer_head now. It may be in an inode, an ocfs2_xattr_block or even worse,
    in any place in an ocfs2_xattr_bucket.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic
    function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls
    ocfs2_do_cluster_allocation() now, but the latter can be used for other
    btree types as well.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • In the old extent tree operation, we take the hypothesis that we
    are using the ocfs2_extent_list in ocfs2_dinode as the tree root.
    As xattr will also use ocfs2_extent_list to store large value
    for a xattr entry, we refactor the tree operation so that xattr
    can use it directly.

    The refactoring includes 4 steps:
    1. Abstract set/get of last_eb_blk and update_clusters since they may
    be stored in different location for dinode and xattr.
    2. Add a new structure named ocfs2_extent_tree to indicate the
    extent tree the operation will work on.
    3. Remove all the use of fe_bh and di, use root_bh and root_el in
    extent tree instead. So now all the fe_bh is replaced with
    et->root_bh, el with root_el accordingly.
    4. Make ocfs2_lock_allocators generic. Now it is limited to be only used
    in file extend allocation. But the whole function is useful when we want
    to store large EAs.

    Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
    for anything other than truncate inode data btrees.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
    ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
    they are all limited to an inode btree because they use a struct
    ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
    (the part of an ocfs2_dinode they actually use) so that the xattr btree code
    can use these functions.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • ocfs2_num_free_extents() is used to find the number of free extent records
    in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to
    use this for extended attribute trees in the future, so genericize the
    interface the take a buffer head. A future patch will allow that buffer_head
    to contain any structure rooting an ocfs2 btree.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     

10 Sep, 2008

1 commit

  • ocfs2 will become read-only if we try to read the bytes which pass
    the end of i_size. This can be easily reproduced by following steps:
    1. mkfs a ocfs2 volume with bs=4k cs=4k and nosparse.
    2. create a small file(say less than 100 bytes) and we will create the file
    which is allocated 1 cluster.
    3. read 8196 bytes from the kernel using O_DIRECT which exceeds the limit.
    4. The ocfs2 volume becomes read-only and dmesg shows:
    OCFS2: ERROR (device sda13): ocfs2_direct_IO_get_blocks:
    Inode 66010 has a hole at block 1
    File system is now read-only due to the potential of on-disk corruption.
    Please run fsck.ocfs2 once the file system is unmounted.

    So suppress the ERROR message.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     

01 Aug, 2008

1 commit


17 Jul, 2008

1 commit

  • This patch fixes a mmap_truncate bug which was found by ocfs2 test suite.

    In an ocfs2 cluster more than 1 node, run program mmap_truncate, which races
    mmap writes and truncates from multiple processes. While the test is
    running, a stat from another node forces writeout, causing an oops in
    ocfs2_get_block() because it sees a buffer to write which isn't allocated.

    This patch fixed the bug by clear dirty and uptodate bits in buffer, leave
    the buffer unmapped and return.

    Fix is suggested by Mark Fasheh, and I code up the patch.

    Signed-off-by: Coly Li
    Signed-off-by: Mark Fasheh

    Coly Li
     

18 Apr, 2008

1 commit


04 Mar, 2008

1 commit

  • In commit e6bafba5b4765a5a252f1b8d31cbf6d2459da337, a bug was fixed that
    involved converting !x & y to !(x & y). The code below shows the same
    pattern, and thus should perhaps be fixed in the same way.

    This is not tested and clearly changes the semantics, so it is only
    something to consider.

    Signed-off-by: Julia Lawall
    Signed-off-by: Mark Fasheh

    Julia Lawall
     

06 Feb, 2008

1 commit

  • Simplify page cache zeroing of segments of pages through 3 functions

    zero_user_segments(page, start1, end1, start2, end2)

    Zeros two segments of the page. It takes the position where to
    start and end the zeroing which avoids length calculations and
    makes code clearer.

    zero_user_segment(page, start, end)

    Same for a single segment.

    zero_user(page, start, length)

    Length variant for the case where we know the length.

    We remove the zero_user_page macro. Issues:

    1. Its a macro. Inline functions are preferable.

    2. The KM_USER0 macro is only defined for HIGHMEM.

    Having to treat this special case everywhere makes the
    code needlessly complex. The parameter for zeroing is always
    KM_USER0 except in one single case that we open code.

    Avoiding KM_USER0 makes a lot of code not having to be dealing
    with the special casing for HIGHMEM anymore. Dealing with
    kmap is only necessary for HIGHMEM configurations. In those
    configurations we use KM_USER0 like we do for a series of other
    functions defined in highmem.h.

    Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
    function could not be a macro. zero_user_* functions introduced
    here can be be inline because that constant is not used when these
    functions are called.

    Also extract the flushing of the caches to be outside of the kmap.

    [akpm@linux-foundation.org: fix nfs and ntfs build]
    [akpm@linux-foundation.org: fix ntfs build some more]
    Signed-off-by: Christoph Lameter
    Cc: Steven French
    Cc: Michael Halcrow
    Cc:
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: David Chinner
    Cc: Michael Halcrow
    Cc: Steven French
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

26 Jan, 2008

4 commits

  • In ocfs2_read_inline_data() we should store file size in loff_t. Although
    the file size should fit in 32 bits we cannot be sure in case filesystem is
    corrupted.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • Add ->readpages support to Ocfs2. This is rather trivial - all it required
    is a small update to ocfs2_get_block (for mapping full extents via b_size)
    and an ocfs2_readpages() function which partially mirrors ocfs2_readpage().

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Call this the "inode_lock" now, since it covers both data and meta data.
    This patch makes no functional changes.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • The meta lock now covers both meta data and data, so this just removes the
    now-redundant data lock.

    Combining locks saves us a round of lock mastery per inode and one less lock
    to ping between nodes during read/write.

    We don't lose much - since meta locks were always held before a data lock
    (and at the same level) ordered writeout mode (the default) ensured that
    flushing for the meta data lock also pushed out data anyways.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

28 Nov, 2007

1 commit


07 Nov, 2007

1 commit

  • On file systems which don't support sparse files, Ocfs2_map_page_blocks()
    was reading blocks on appending writes. This caused write performance to
    suffer dramatically. Fix this by detecting an appending write on a nonsparse
    fs and skipping the read.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

17 Oct, 2007

1 commit

  • Plug ocfs2 into the ->write_begin and ->write_end aops.

    A bunch of custom code is now gone - the iovec iteration stuff during write
    and the ocfs2 splice write actor.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

13 Oct, 2007

4 commits

  • This fixes up write, truncate, mmap, and RESVSP/UNRESVP to understand inline
    inode data.

    For the most part, the changes to the core write code can be relied on to do
    the heavy lifting. Any code calling ocfs2_write_begin (including shared
    writeable mmap) can count on it doing the right thing with respect to
    growing inline data to an extent tree.

    Size reducing truncates, including UNRESVP can simply zero that portion of
    the inode block being removed. Size increasing truncatesm, including RESVP
    have to be a little bit smarter and grow the inode to an extent tree if
    necessary.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     
  • This hooks up ocfs2_readpage() to populate a page with data from an inode
    block. Direct IO reads from inline data are modified to fall back to
    buffered I/O. Appropriate checks are also placed in the extent map code to
    avoid reading an extent list when inline data might be stored.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     
  • We'll want to reuse most of this when pushing inline data back out to an
    extent. Keeping this part as a seperate patch helps to keep the upcoming
    changes for write support uncluttered.

    The core portion of ocfs2_zero_cluster_pages() responsible for making sure a
    page is mapped and properly dirtied is abstracted out into it's own
    function, ocfs2_map_and_dirty_page(). Actual functionality doesn't change,
    though zeroing becomes optional.

    We also turn part of ocfs2_free_write_ctxt() into a common function for
    unlocking and freeing a page array. This operation is very common (and
    uniform) for Ocfs2 cluster sizes greater than page size, so it makes sense
    to keep the code in one place.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     
  • By doing this, we can remove any higher level logic which has to have
    knowledge of btree functionality - any callers of ocfs2_write_begin() can
    now expect it to do anything necessary to prepare the inode for new data.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     

21 Sep, 2007

2 commits

  • The target page offsets were being incorrectly set a second time in
    ocfs2_prepare_page_for_write(), which was causing problems on a 16k page
    size kernel. Additionally, ocfs2_write_failure() was incorrectly using those
    parameters instead of the parameters for the individual page being cleaned
    up.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • This was broken for file systems whose cluster size is greater than page
    size. Pos needs to be incremented as we loop through the descriptors, and
    len needs to be capped to the size of a single cluster.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

12 Sep, 2007

1 commit

  • In ocfs2_alloc_write_write_ctxt, the written clusters length is calculated
    by the byte length only. This may cause some problems if we start to write
    at some position in the end of one cluster and last to a second cluster
    while the "len" is smaller than a cluster size. In that case, we have to
    write 2 clusters actually.
    So we have to take the start position into consideration also.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    tao.ma@oracle.com
     

20 Jul, 2007

1 commit

  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
    the virtual address -> file offset differently from linear mappings.

    ->populate is a layering violation because the filesystem/pagecache code
    should need to know anything about the virtual memory mapping. The hitch here
    is that the ->nopage handler didn't pass down enough information (ie. pgoff).
    But it is more logical to pass pgoff rather than have the ->nopage function
    calculate it itself anyway (because that's a similar layering violation).

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn. Most of the old mechanism is still in place
    so there is a lot of duplication and nice cleanups that can be removed if
    everyone switches over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

11 Jul, 2007

3 commits