17 Nov, 2011

1 commit

  • With indexed_dir enabled, ocfs2 maintains a list of dirblocks having
    space.

    The credit calculation in ocfs2_link_credits() did not correctly account
    for adding an entry that exactly fills a dirblock. Filling the dirblock
    triggers its removal from the list, which is done by changing the pointer
    in the previous block in the list. The credit calculation did not account
    for that previous block.

    To expose, do:

    mkfs.ocfs2 -b 512 -M local /dev/sdX
    mount /dev/sdX /ocfs2
    mkdir /ocfs2/linkdir
    touch /ocfs2/linkdir/file1
    for i in `seq 1 29` ; do link /ocfs2/linkdir/file1 \
    /ocfs2/linkdir/linklinklinklinklinklink$i; done
    rm -f /ocfs2/linkdir/linklinklinklinklinklink10
    sleep 8
    link /ocfs2/linkdir/file1 \
    /ocfs2/linkdir/linklinklinklinklinklinkaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

    Note:
    The link names have been crafted for a 512 byte blocksize. Reproducing
    with a larger blocksize will require longer (or more) links. The sleep
    is important. We want jbd2 to commit the transaction so that the missing
    block does not piggy back on account of the previous transaction.

    Signed-off-by: XiaoweiHu
    Reviewed-by: WengangWang
    Reviewed-by: Sunil.Mushran
    Signed-off-by: Joel Becker

    Xiaowei.Hu
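
The free-list behavior described in this commit can be pictured with a small model. The following is a minimal Python sketch, not the ocfs2 implementation: the `FreeList` class, its method names, and the `"root"` label for whatever structure holds the list head are all hypothetical. It shows why unlinking a block from the middle of a singly linked list modifies two blocks (the block itself and its predecessor), so a transaction would need a credit for each.

```python
class FreeList:
    """Toy singly linked list of directory blocks that still have free space."""

    def __init__(self):
        self.head = None   # block number at the head of the list
        self.next = {}     # blkno -> next blkno in the list (or None)

    def push_front(self, blkno):
        """Put a block at the head (assumed behavior when space is freed)."""
        self.next[blkno] = self.head
        self.head = blkno

    def remove(self, blkno):
        """Unlink blkno and return the structures that had to be modified."""
        dirtied = []
        if self.head == blkno:
            # Only the holder of the head pointer changes.
            self.head = self.next[blkno]
            dirtied.append("root")
        else:
            # Walk to the predecessor; its next pointer must be rewritten.
            prev = self.head
            while prev is not None and self.next[prev] != blkno:
                prev = self.next[prev]
            if prev is None:
                raise KeyError(blkno)
            self.next[prev] = self.next[blkno]
            dirtied.append(prev)
        del self.next[blkno]
        dirtied.append(blkno)
        return dirtied


fl = FreeList()
fl.push_front(3)            # block 3 has free space
fl.push_front(2)            # an entry was deleted from block 2
touched = fl.remove(3)      # block 3 becomes full and leaves the list
```

In this model `touched` contains both block 2 and block 3, which mirrors the "previous block" the commit message says was missing from the credit estimate.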
     

31 Mar, 2011

1 commit


20 Feb, 2011

1 commit

  • In the rare case that the INLINE_DATA, INDEX_DIR, QUOTA, and XATTR features
    are disabled and both the allocation of the directory inode and the
    allocation of the first directory block need to relink an allocation group,
    there may not be enough credits reserved in a transaction. Fix the estimate.

    CC: Mark Fasheh
    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
     

10 Sep, 2010

1 commit

  • Thanks for the comments. I have incorporated them all.

    CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
    Statistics now look like -
    ocfs2_write_ctxt: 2144 - 2136 = 8
    ocfs2_inode_info: 1960 - 1848 = 112
    ocfs2_journal: 168 - 160 = 8
    ocfs2_lock_res: 336 - 304 = 32
    ocfs2_refcount_tree: 512 - 472 = 40

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Joel Becker

    Goldwyn Rodrigues
     

06 May, 2010

1 commit

  • jbd[2]_journal_dirty_metadata() only returns 0. It's been returning 0
    since before the kernel moved to git. There is no point in checking
    this error.

    ocfs2_journal_dirty() has been faithfully returning the status since the
    beginning. All over ocfs2, we have blocks of code checking this can't
    fail status. In the past few years, we've tried to avoid adding these
    checks, because they are pointless. But anyone who looks at our code
    assumes they are needed.

    Finally, ocfs2_journal_dirty() is made a void function. All error
    checking is removed from other files. We'll BUG_ON() the status of
    jbd2_journal_dirty_metadata() just in case they change it someday. They
    won't.

    Signed-off-by: Joel Becker

    Joel Becker
     

26 Mar, 2010

1 commit


23 Sep, 2009

3 commits


05 Sep, 2009

3 commits

  • The next step in divorcing metadata I/O management from struct inode is
    to pass struct ocfs2_caching_info to the journal functions. Thus the
    journal locks a metadata cache with the cache io_lock function. It also
    can compare ci_last_trans and ci_created_trans directly.

    This is a large patch because of all the places we change
    ocfs2_journal_access..(handle, inode, ...) to
    ocfs2_journal_access..(handle, INODE_CACHE(inode), ...).

    Signed-off-by: Joel Becker

    Joel Becker
     
  • Similar to ip_last_trans, ip_created_trans tracks the creation of a journal
    managed inode. This specifically tracks what transaction created the
    inode. This is so the code can know if the inode has ever been written
    to disk.

    This behavior is desirable for any journal managed object. We move it
    to struct ocfs2_caching_info as ci_created_trans so that any object
    using ocfs2_caching_info can rely on this behavior.

    Signed-off-by: Joel Becker

    Joel Becker
     
  • We have the read side of metadata caching isolated to struct
    ocfs2_caching_info, now we need the write side. This means the journal
    functions. The journal only does a couple of things with struct inode.

    This change moves the ip_last_trans field onto struct
    ocfs2_caching_info as ci_last_trans. This field tells the journal
    whether a pending journal flush is required.

    Signed-off-by: Joel Becker

    Joel Becker
     

11 Aug, 2009

1 commit

  • In OCFS2, allocator locks rank above transaction start. Thus we
    cannot extend a quota file from inside a transaction, lest we
    deadlock.

    We solve the problem by starting the transaction not in
    ocfs2_acquire_dquot() but only in ocfs2_local_read_dquot() and
    ocfs2_global_read_dquot(), and by allocating blocks to quota files before
    starting the transaction. If we crash, quota files will just have a few
    extra blocks, but that's no problem since we simply use them the next time
    we extend the quota file.

    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
     

24 Jul, 2009

1 commit


09 Jul, 2009

1 commit

  • If the mount fails for any reason, ocfs2_dismount_volume calls
    ocfs2_orphan_scan_stop. It requires that ocfs2_orphan_scan_init
    be called to set up the mutex and work queues, but that doesn't
    happen if the mount has failed, and we oops accessing an uninitialized
    work queue.

    This patch splits the init and startup of the orphan scan, eliminating
    the oops.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Joel Becker

    Jeff Mahoney
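
The init/start split described in this commit follows a common pattern: do the cheap, cannot-fail setup unconditionally, and start the periodic work only once the mount has succeeded. Below is a minimal Python sketch of that pattern; the `OrphanScan` class and the `mount()` helper are hypothetical stand-ins and do not reflect the kernel's actual structures.

```python
class OrphanScan:
    def __init__(self):
        # "init" step: always performed, so stop() is safe at any later point.
        self.work_queue = []
        self.running = False

    def start(self):
        # "start" step: only reached when the mount has succeeded.
        self.running = True

    def stop(self):
        # Safe even if start() was never called.
        self.running = False
        self.work_queue.clear()


def mount(fail):
    scan = OrphanScan()          # init happens before anything can fail
    try:
        if fail:
            raise RuntimeError("mount failed")
        scan.start()
    except RuntimeError:
        scan.stop()              # dismount path touches initialized state only
        return None
    return scan


failed = mount(fail=True)
mounted = mount(fail=False)
```

Because `stop()` only touches fields that `__init__` has already created, the failure path cannot trip over uninitialized state.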
     

23 Jun, 2009

1 commit


04 Jun, 2009

1 commit

  • When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
    before moving the dentry to the orphan directory. Other nodes that have
    this dentry in cache have a PR on the same dentry lock. When the EX is
    requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
    during downconvert. The inode is finally deleted when the last node to iput
    the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.

    A problem arises if a node is forced to free dentry locks because of memory
    pressure. If this happens, the node will no longer get downconvert
    notifications for the dentries that have been unlinked on another node.
    If it also happens that node is actively using the corresponding inode and
    happens to be the one performing the last iput on that inode, it will fail
    to delete the inode as it will not have the MAYBE_ORPHANED flag set.

    This patch fixes this shortcoming by introducing a periodic scan of the
    orphan directories to delete such inodes. Care has been taken to distribute
    the workload across the cluster so that no one node has to perform the task
    all the time.

    Signed-off-by: Srinivas Eeda
    Signed-off-by: Joel Becker

    Srinivas Eeda
     

01 May, 2009

1 commit

  • The ocfs2 directory index updates two blocks when we remove an entry -
    the dx root and the dx leaf. OCFS2_DELETE_INODE_CREDITS was only
    accounting for the dx leaf. This shows up when ocfs2_delete_inode()
    runs out of credits in jbd2_journal_dirty_metadata() at
    "J_ASSERT_JH(jh, handle->h_buffer_credits > 0);".

    The test that caught this was running dirop_file_racer from the
    ocfs2-test suite with a 250-character filename PREFIX. Run on a 512B
    blocksize, it forces the orphan dir index to grow large enough to
    trigger.

    Signed-off-by: Joel Becker

    Joel Becker
     

04 Apr, 2009

5 commits

  • During recovery, a node recovers orphans in its slot and the dead node(s). But
    if the dead nodes were holding orphans in offline slots, they will be left
    unrecovered.

    If the dead node is the last one to die and is holding orphans in other slots
    and is the first one to mount, then it only recovers its own slot, which
    leaves orphans in offline slots.

    This patch queues complete_recovery to clean orphans for all offline slots
    during mount and node recovery.

    Signed-off-by: Srinivas Eeda
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Srinivas Eeda
     
  • The only operation which doesn't get faster with directory indexing is
    insert, which still has to walk the entire unindexed directory portion to
    find a free block. This patch provides an improvement in directory insert
    performance by maintaining a singly linked list of directory leaf blocks
    which have space for additional dirents.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
     
  • Allow us to store a small number of directory index records in the
    ocfs2_dx_root_block. This saves us a disk read on small to medium sized
    directories (less than about 250 entries). The inline root is automatically
    turned into a root block with extents if the directory size increases beyond
    its capacity.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
     
  • This patch makes use of Ocfs2's flexible btree code to add an additional
    tree to directory inodes. The new tree stores an array of small,
    fixed-length records in each leaf block. Each record stores a hash value,
    and pointer to a block in the traditional (unindexed) directory tree where a
    dirent with the given name hash resides. Lookup exclusively uses this tree
    to find dirents, thus providing us with constant time name lookups.

    Some of the hashing code was copied from ext3. Unfortunately, it has lots of
    unfixed checkpatch errors. I left that as-is so that tracking changes would
    be easier.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
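
The lookup scheme in this commit (hash record pointing at the unindexed block that holds the dirent) can be illustrated with a small model. This is a minimal Python sketch under stated assumptions: `IndexedDir` and `name_hash` are hypothetical names, a dictionary stands in for the on-disk btree, and `zlib.crc32` is only a placeholder for the real hashing code.

```python
import zlib
from collections import defaultdict


def name_hash(name):
    # Placeholder hash; the real index uses different hashing code.
    return zlib.crc32(name.encode())


class IndexedDir:
    def __init__(self):
        self.blocks = defaultdict(dict)  # blkno -> {name: inode number}
        self.index = defaultdict(set)    # name hash -> block numbers to check

    def add(self, name, inode, blkno):
        """Store the dirent in a block and record hash -> block in the index."""
        self.blocks[blkno][name] = inode
        self.index[name_hash(name)].add(blkno)

    def lookup(self, name):
        """Consult the index first, then read only the candidate blocks."""
        for blkno in self.index.get(name_hash(name), ()):
            inode = self.blocks[blkno].get(name)
            if inode is not None:
                return inode, blkno
        return None


d = IndexedDir()
d.add("file1", 100, 0)
d.add("link1", 100, 1)
hit = d.lookup("link1")
miss = d.lookup("missing")
```

The point of the sketch is that a lookup never scans blocks the index does not name, regardless of how many blocks the directory has.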
     
  • Move the definition of struct recovery_map from journal.c to journal.h. This
    is preparation for the next patch.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     

11 Feb, 2009

1 commit

  • If we race with commit code setting i_transaction to NULL, we could
    possibly dereference it. Proper locking requires the journal pointer
    (to access journal->j_list_lock), which we don't have. So we have to
    change the prototype of the function so that filesystem passes us the
    journal pointer. Also add a more detailed comment about why the
    function jbd2_journal_begin_ordered_truncate() does what it does and
    how it should be used.

    Thanks to Dan Carpenter for pointing to the suspicious code.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Acked-by: Joel Becker
    CC: linux-ext4@vger.kernel.org
    CC: ocfs2-devel@oss.oracle.com
    CC: mfasheh@suse.de
    CC: Dan Carpenter

    Jan Kara
     

06 Jan, 2009

5 commits

  • The per-metadata-type ocfs2_journal_access_*() functions hook up jbd2
    commit triggers and allow us to compute metadata ecc right before the
    buffers are written out. This commit provides ecc for inodes, extent
    blocks, group descriptors, and quota blocks. It is not safe to use
    extended attributes and metaecc at the same time yet.

    The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide
    the type of block at their root. Before, it didn't matter, but now the
    root block must use the appropriate ocfs2_journal_access_*() function.
    To keep this abstract, the structures now have a pointer to the matching
    journal_access function and a wrapper call to call it.

    A few places use naked ocfs2_write_block() calls instead of adding the
    blocks to the journal. We make sure to calculate their checksum and ecc
    before the write.

    Since we pass around the journal_access functions, let's typedef them
    in ocfs2.h.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • We create wrappers for ocfs2_journal_access() that are specific to the
    type of metadata block. This allows us to associate jbd2 commit
    triggers with the block. The triggers will compute metadata ecc in a
    future commit.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • Implement functions for recovery after a crash. Functions just
    read local quota file and sync info to global quota file.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • Add quota calls for allocation and freeing of inodes and space, also update
    estimates on number of needed credits for a transaction. Move out inode
    allocation from ocfs2_mknod_locked() because vfs_dq_init() must be called
    outside of a transaction.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • JBD2 is fully backwards compatible with JBD and it's been tested enough with
    Ocfs2 that we can clean this code up now.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

14 Oct, 2008

3 commits

  • ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
    limiting our maximum filesystem size.

    It's a pretty trivial change. Most functions are just renamed. The
    only functional change is moving to Jan's inode-based ordered data mode.
    It's better, too.

    Because JBD2 reads and writes JBD journals, this is compatible with any
    existing filesystem. It can even interact with JBD-based ocfs2 as long
    as the journal is formatted for JBD.

    We provide a compatibility option so that paranoid people can still use
    JBD for the time being. This will go away shortly.

    [ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
    ocfs2_truncate_for_delete(). --Mark ]

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • This patch implements storing extended attributes either in the inode or in
    a single external block. We only store EAs in-inode when blocksize > 512 or
    the inode block has free space for them. When an EA's value is larger than
    80 bytes, we store the value in a b-tree outside the inode or block.

    Signed-off-by: Tiger Yang
    Signed-off-by: Mark Fasheh

    Tiger Yang
     
  • ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
    ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
    they are all limited to an inode btree because they use a struct
    ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
    (the part of an ocfs2_dinode they actually use) so that the xattr btree code
    can use these functions.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     

01 Aug, 2008

1 commit

  • As the fs recovery is asynchronous, there is a small chance that another
    node can mount (and thus recover) the slot before the recovery thread
    gets to it.

    If this happens, the recovery thread will block indefinitely on the
    journal/slot lock as that lock will be held for the duration of the mount
    (by design) by the node assigned to that slot.

    The solution implemented is to keep track of the journal replays using
    a recovery generation in the journal inode, which will be incremented by the
    thread replaying that journal. The recovery thread, before attempting the
    blocking lock on the journal/slot lock, will compare the generation on disk
    with what it has cached and skip recovery if it does not match.

    This bug appears to have been inadvertently introduced during the mount/umount
    vote removal by mainline commit 34d024f84345807bf44163fac84e921513dde323. In the
    mount voting scheme, the messaging would indirectly indicate that the slot
    was being recovered.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
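
The generation check described in this commit can be sketched as follows. This is a minimal Python model, assuming a single counter per journal; the `JournalInode` class and the helper names are hypothetical and only mirror the prose above.

```python
class JournalInode:
    def __init__(self):
        self.recovery_generation = 0   # stands in for the on-disk field


def replay_journal(journal):
    """Whoever replays the journal bumps the generation."""
    journal.recovery_generation += 1


def recovery_thread_should_replay(cached_generation, journal):
    """Skip recovery if someone else replayed the journal in the meantime."""
    return cached_generation == journal.recovery_generation


journal = JournalInode()
cached = journal.recovery_generation      # recovery thread notes the value
replay_journal(journal)                   # another node mounts and recovers the slot
stale = recovery_thread_should_replay(cached, journal)
```

Here `stale` is false, so the recovery thread in this model would skip the slot instead of blocking on the journal/slot lock.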
     

18 Apr, 2008

1 commit

  • The old recovery map was a bitmap of node numbers. This was sufficient
    for the maximum node number of 254. Going forward, we want node numbers
    to be UINT32. Thus, we need a new recovery map.

    Note that we can't keep track of slots here. We must write down the
    node number to recover *before* we get the locks needed to convert a
    node number into a slot number.

    The recovery map is now an array of unsigned ints, max_slots in size.
    It moves to journal.c with the rest of recovery.

    Because it needs to be initialized, we move all of recovery initialization
    into a new function, ocfs2_recovery_init(). This actually cleans up
    ocfs2_initialize_super() a little as well. Following on, recovery cleanup
    becomes part of ocfs2_recovery_exit().

    A number of node map functions are rendered obsolete and are removed.

    Finally, waiting on recovery is wrapped in a function rather than naked
    checks on the recovery_event. This is a cleanup from Mark.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
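
To make the data-structure change concrete, here is a minimal Python sketch of a recovery map kept as a bounded array of node numbers rather than a bitmap indexed by node number. The `RecoveryMap` class and its methods are hypothetical; the only details taken from the commit message are that entries are unsigned node numbers and that the array is max_slots in size.

```python
class RecoveryMap:
    def __init__(self, max_slots):
        self.max_slots = max_slots
        self.entries = []            # node numbers awaiting recovery

    def add(self, node_num):
        """Record node_num; return False if it was already recorded."""
        if node_num in self.entries:
            return False
        if len(self.entries) >= self.max_slots:
            raise OverflowError("more dead nodes than slots")
        self.entries.append(node_num)
        return True

    def clear(self, node_num):
        """Forget node_num once its recovery is complete."""
        if node_num in self.entries:
            self.entries.remove(node_num)

    def is_empty(self):
        return not self.entries


rmap = RecoveryMap(max_slots=4)
first = rmap.add(100000)     # a node number far beyond a 254-entry bitmap
second = rmap.add(100000)    # duplicate request is ignored
```

An array sized by slot count works in this model because there can be no more nodes needing recovery than there are slots, while the node numbers themselves can be arbitrarily large.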
     

26 Jan, 2008

2 commits

  • This patch adds the ability for a userspace program to request that a
    properly formatted cluster group be added to the main allocation bitmap for
    an Ocfs2 file system. The request is made via an ioctl, OCFS2_IOC_GROUP_ADD.
    On a high level, this is similar to ext3, but we use a different ioctl as
    the structure which has to be passed through is different.

    During an online resize, tunefs.ocfs2 will format any new cluster groups
    which must be added to complete the resize, and call OCFS2_IOC_GROUP_ADD on
    each one. Kernel verifies that the core cluster group information is valid
    and then does the work of linking it into the global allocation bitmap.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • This patch adds the ability for a userspace program to request an extension
    of the last cluster group on an Ocfs2 file system. The request is made via ioctl,
    OCFS2_IOC_GROUP_EXTEND. This is derived from EXT3_IOC_GROUP_EXTEND, but is
    obviously Ocfs2 specific.

    tunefs.ocfs2 would call this for an online-resize operation if the last
    cluster group isn't full.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     

13 Oct, 2007

1 commit

  • This fixes up write, truncate, mmap, and RESVSP/UNRESVSP to understand inline
    inode data.

    For the most part, the changes to the core write code can be relied on to do
    the heavy lifting. Any code calling ocfs2_write_begin (including shared
    writeable mmap) can count on it doing the right thing with respect to
    growing inline data to an extent tree.

    Size-reducing truncates, including UNRESVSP, can simply zero that portion of
    the inode block being removed. Size-increasing truncates, including RESVSP,
    have to be a little bit smarter and grow the inode to an extent tree if
    necessary.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     

11 Jul, 2007

1 commit

  • Provide an internal interface for the removal of arbitrary file regions.

    ocfs2_remove_inode_range() takes a byte range within a file and will remove
    existing extents within that range. Partial clusters will be zeroed so that
    any read from within the region will return zeros.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
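
The split between removed extents and zeroed partial clusters can be worked through with simple arithmetic. The sketch below is a hypothetical Python helper, `plan_range_removal`, and is not the kernel function; it only computes which whole clusters fall inside a byte range and which leftover byte spans would need zeroing.

```python
def plan_range_removal(start, length, cluster_size):
    """Return (whole_clusters, zero_spans) for the byte range.

    whole_clusters is a (first, last_exclusive) pair of cluster numbers,
    or None if no cluster is fully covered. zero_spans is a list of
    (byte_start, byte_end) pairs covering the partial clusters.
    """
    end = start + length
    first_full = -(-start // cluster_size)   # round start up to a cluster
    last_full = end // cluster_size          # round end down to a cluster

    if first_full >= last_full:
        # The range never covers a whole cluster: zero it all.
        spans = [(start, end)] if start < end else []
        return None, spans

    spans = []
    if start < first_full * cluster_size:
        spans.append((start, first_full * cluster_size))
    if last_full * cluster_size < end:
        spans.append((last_full * cluster_size, end))
    return (first_full, last_full), spans


plan = plan_range_removal(start=1000, length=10000, cluster_size=4096)
```

For this example the range covers bytes 1000 to 11000, so only cluster 1 is fully inside it; the head of the range in cluster 0 and the tail in cluster 2 are zeroed.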
     

27 Apr, 2007

1 commit

  • Due to the size of our group bitmaps, we'll never have a leaf node extent
    record with more than 16 bits worth of clusters. Split e_clusters up so that
    leaf nodes can get a flags field where we can mark unwritten extents.
    Interior nodes whose length references all the child nodes beneath it can't
    split their e_clusters field, so we use a union to preserve sizing there.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
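
The layout idea in this commit, one 32-bit slot read either as a single cluster count or as a 16-bit count plus flags, can be shown with Python's `struct` module. This is a sketch under assumptions: the field order (16-bit clusters, one reserved byte, one flags byte), the little-endian encoding, and the `EXT_UNWRITTEN` value are illustrative choices, not taken from the commit message.

```python
import struct

EXT_UNWRITTEN = 0x01   # illustrative flag value for an unwritten extent


def pack_leaf(clusters, flags):
    """Leaf view: 16-bit cluster count, reserved byte, flags byte."""
    return struct.pack("<HBB", clusters, 0, flags)


def unpack_leaf(raw):
    clusters, _reserved, flags = struct.unpack("<HBB", raw)
    return clusters, flags


def pack_interior(clusters):
    """Interior view: the same four bytes hold one 32-bit cluster count."""
    return struct.pack("<I", clusters)


leaf = pack_leaf(1000, EXT_UNWRITTEN)
interior = pack_interior(70000)
```

Both encodings occupy four bytes, which is the sizing property the union preserves; a leaf count above 65535 would not fit in the 16-bit field.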
     

02 Feb, 2007

1 commit

  • Commit 592282cf2eaa33409c6511ddd3f3ecaa57daeaaa fixed some missing directory
    c/mtime updates in part by introducing a dinode update in ocfs2_add_entry().
    Unfortunately, ocfs2_link() (which didn't update the directory inode before)
    is now missing a single journal credit. Fix this by doubling the number of
    inode updates expected during hard link creation.

    Signed-off-by: Mark Fasheh

    Mark Fasheh