28 Jul, 2011

1 commit

  • Fix a corruption that can happen when we have (two or more) outstanding
    aio's to an overlapping unaligned region. Ext4
    (e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d) and xfs recently had to fix
    similar issues.

    In our case what happens is that we can have an outstanding aio on a region
    and if a write comes in with some bytes overlapping the original aio we may
    decide to read that region into a page before continuing (typically because
    of buffered-io fallback). Since we have no ordering guarantees with the
    aio, we can read stale or bad data into the page and then write it back out.

    If the i/o is page and block aligned, then we avoid this issue as there
    won't be any need to read data from disk.

    I took the same approach as Eric in the ext4 patch and introduced some
    serialization of unaligned async direct i/o. I don't expect this to have an
    effect on the most common cases of AIO. Unaligned aio will be slower
    though, but that's far more acceptable than data corruption.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
     

11 Sep, 2010

1 commit

  • Track negative dentries by recording the generation number of the parent
    directory in d_fsdata. The generation number for the parent directory is
    recorded in the inode_info, which increments every time the lock on the
    directory is dropped.

    If the generation number of the parent directory and the negative dentry
    matches, there is no need to perform the revalidate, else a revalidate
    is forced. This improves performance in situations where nodes look for
    the same non-existent file multiple times.

    Thanks Mark for explaining the DLM sequence.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Joel Becker

    Goldwyn Rodrigues
     

10 Sep, 2010

1 commit

  • Thanks for the comments. I have incorportated them all.

    CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
    Statistics now look like -
    ocfs2_write_ctxt: 2144 - 2136 = 8
    ocfs2_inode_info: 1960 - 1848 = 112
    ocfs2_journal: 168 - 160 = 8
    ocfs2_lock_res: 336 - 304 = 32
    ocfs2_refcount_tree: 512 - 472 = 40

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Joel Becker

    Goldwyn Rodrigues
     

10 Aug, 2010

2 commits


21 May, 2010

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits)
    ocfs2: Silence a gcc warning.
    ocfs2: Don't retry xattr set in case value extension fails.
    ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
    ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
    fs/ocfs2/dlm: Use kstrdup
    fs/ocfs2/dlm: Drop memory allocation cast
    Ocfs2: Optimize punching-hole code.
    Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
    Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
    Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
    ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
    ocfs2: Wrap signal blocking in void functions.
    ocfs2/dlm: Increase o2dlm lockres hash size
    ocfs2: Make ocfs2_extend_trans() really extend.
    ocfs2/trivial: Code cleanup for allocation reservation.
    ocfs2: make ocfs2_adjust_resv_from_alloc simple.
    ocfs2: Make nointr a default mount option
    ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
    o2net: log socket state changes
    ocfs2: print node # when tcp fails
    ...

    Linus Torvalds
     

06 May, 2010

1 commit


24 Apr, 2010

1 commit

  • Currently in the error path of ocfs2_symlink and ocfs2_mknod, we just call
    iput with the inode we failed with, but the inode wipe code will complain
    because we don't add the inode to orphan dir. One solution would be to lock
    the orphan dir during the entire transaction, but that's too heavy for a
    rare error path. Instead, we add a flag, OCFS2_INODE_SKIP_ORPHAN_DIR which
    tells the inode wipe code that it won't find this inode in the orphan dir.

    [ Merge fixes and comment style cleanups -Mark ]

    Signed-off-by: Li Dongyang
    Signed-off-by: Mark Fasheh

    Li Dongyang
     

05 Sep, 2009

6 commits


04 Apr, 2009

2 commits

  • For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
    ocfs2_get_dentry() may read from disk when the inode is not in memory,
    without any cross cluster lock. this leads to the file system loading a
    stale inode.

    This patch fixes above problem.

    Solution is that in case of inode is not in memory, we get the cluster
    lock(PR) of alloc inode where the inode in question is allocated from (this
    causes node on which deletion is done sync the alloc inode) before reading
    out the inode itsself. then we check the bitmap in the group (the inode in
    question allcated from) to see if the bit is clear. if it's clear then it's
    stale. if the bit is set, we then check generation as the existing code
    does.

    We have to read out the inode in question from disk first to know its alloc
    slot and allot bit. And if its not stale we read it out using ocfs2_iget().
    The second read should then be from cache.

    And also we have to add a per superblock nfs_sync_lock to cover the lock for
    alloc inode and that for inode in question. this is because ocfs2_get_dentry()
    and ocfs2_delete_inode() lock on them in reverse order. nfs_sync_lock is locked
    in EX mode in ocfs2_get_dentry() and in PR mode in ocfs2_delete_inode(). so
    that mutliple ocfs2_delete_inode() can run concurrently in normal case.

    [mfasheh@suse.com: build warning fixes and comment cleanups]
    Signed-off-by: Wengang Wang
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    wengang wang
     
  • In ocfs2, the inode block search looks for the "emptiest" inode
    group to allocate from. So if an inode alloc file has many equally
    (or almost equally) empty groups, new inodes will tend to get
    spread out amongst them, which in turn can put them all over the
    disk. This is undesirable because directory operations on conceptually
    "nearby" inodes force a large number of seeks.

    So we add ip_last_used_group in core directory inodes which records
    the last used allocation group. Another field named ip_last_used_slot
    is also added in case inode stealing happens. When claiming new inode,
    we passed in directory's inode so that the allocation can use this
    information.
    For more details, please see
    http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     

06 Jan, 2009

2 commits

  • For each quota type each node has local quota file. In this file it stores
    changes users have made to disk usage via this node. Once in a while this
    information is synced to global file (and thus with other nodes) so that
    limits enforcement at least aproximately works.

    Global quota files contain all the information about usage and limits. It's
    mostly handled by the generic VFS code (which implements a trie of structures
    inside a quota file). We only have to provide functions to convert structures
    from on-disk format to in-memory one. We also have to provide wrappers for
    various quota functions starting transactions and acquiring necessary cluster
    locks before the actual IO is really started.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • The ocfs2 code currently reads inodes off disk with a simple
    ocfs2_read_block() call. Each place that does this has a different set
    of sanity checks it performs. Some check only the signature. A couple
    validate the block number (the block read vs di->i_blkno). A couple
    others check for VALID_FL. Only one place validates i_fs_generation. A
    couple check nothing. Even when an error is found, they don't all do
    the same thing.

    We wrap inode reading into ocfs2_read_inode_block(). This will validate
    all the above fields, going readonly if they are invalid (they never
    should be). ocfs2_read_inode_block_full() is provided for the places
    that want to pass read_block flags. Every caller is passing a struct
    inode with a valid ip_blkno, so we don't need a separate blkno argument
    either.

    We will remove the validation checks from the rest of the code in a
    later commit, as they are no longer necessary.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     

15 Oct, 2008

1 commit


14 Oct, 2008

2 commits

  • ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
    limiting our maximum filesystem size.

    It's a pretty trivial change. Most functions are just renamed. The
    only functional change is moving to Jan's inode-based ordered data mode.
    It's better, too.

    Because JBD2 reads and writes JBD journals, this is compatible with any
    existing filesystem. It can even interact with JBD-based ocfs2 as long
    as the journal is formated for JBD.

    We provide a compatibility option so that paranoid people can still use
    JBD for the time being. This will go away shortly.

    [ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
    ocfs2_truncate_for_delete(). --Mark ]

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • This patch implements storing extended attributes both in inode or a single
    external block. We only store EA's in-inode when blocksize > 512 or that
    inode block has free space for it. When an EA's value is larger than 80
    bytes, we will store the value via b-tree outside inode or block.

    Signed-off-by: Tiger Yang
    Signed-off-by: Mark Fasheh

    Tiger Yang
     

26 Jan, 2008

3 commits

  • Create separate lockdep lock classes for system file's i_mutexes. They are
    used to guard allocations and similar things and thus rank differently
    than i_mutex of a regular file or directory.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • Call this the "inode_lock" now, since it covers both data and meta data.
    This patch makes no functional changes.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • The meta lock now covers both meta data and data, so this just removes the
    now-redundant data lock.

    Combining locks saves us a round of lock mastery per inode and one less lock
    to ping between nodes during read/write.

    We don't lose much - since meta locks were always held before a data lock
    (and at the same level) ordered writeout mode (the default) ensured that
    flushing for the meta data lock also pushed out data anyways.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

13 Oct, 2007

1 commit

  • Add the disk, network and memory structures needed to support data in inode.

    Struct ocfs2_inline_data is defined and embedded in ocfs2_dinode for storing
    inline data.

    A new inode field, i_dyn_features, is added to facilitate tracking of
    dynamic inode state. Since it will be used often, we want to mirror it on
    ocfs2_inode_info, and transfer it via the meta data lvb.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     

03 May, 2007

1 commit


27 Apr, 2007

5 commits

  • The extent map code was ripped out earlier because of an inability to deal
    with holes. This patch adds back a simpler caching scheme requiring far less
    code.

    Our old extent map caching was designed back when meta data block caching in
    Ocfs2 didn't work very well, resulting in many disk reads. These days our
    metadata caching is much better, resulting in no un-necessary disk reads. As
    a result, extent caching doesn't have to be as fancy, nor does it have to
    cache as many extents. Keeping the last 3 extents seen should be sufficient
    to give us a small performance boost on some streaming workloads.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Older file systems which didn't support holes did a dumb calculation of
    i_blocks based on i_size. This is no longer accurate, so fix things up to
    take actual allocation into account.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • The code in extent_map.c is not prepared to deal with a subtree being
    rotated between lookups. This can happen when filling holes in sparse files.
    Instead of a lengthy patch to update the code (which would likely lose the
    benefit of caching subtree roots), we remove most of the algorithms and
    implement a simple path based lookup. A less ambitious extent caching scheme
    will be added in a later patch.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Remove node messaging code that becomes unused with the delete inode vote
    removal.

    [Removed even more cruft which I spotted during review --Mark]

    Signed-off-by: Tiger Yang
    Signed-off-by: Mark Fasheh

    Tiger Yang
     
  • Ocfs2 currently does cluster-wide node messaging to check the open state of
    an inode during delete. This patch removes that mechanism in favor of an
    inode cluster lock which is taken at shared read when an inode is first read
    and dropped in clear_inode(). This allows a deleting node to test the
    liveness of an inode by attempting to take an exclusive lock.

    Signed-off-by: Tiger Yang
    Signed-off-by: Mark Fasheh

    Tiger Yang
     

08 Dec, 2006

1 commit

  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
    quilt add $file
    sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
    mv /tmp/$$ $file
    quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

02 Dec, 2006

2 commits


25 Sep, 2006

1 commit

  • OCFS2 puts inode meta data in the "lock value block" provided by the DLM.
    Typically, i_generation is encoded in the lock name so that a deleted inode
    on and a new one in the same block don't share the same lvb.

    Unfortunately, that scheme means that the read in ocfs2_read_locked_inode()
    is potentially thrown away as soon as the meta data lock is taken - we
    cannot encode the lock name without first knowing i_generation, which
    requires a disk read.

    This patch encodes i_generation in the inode meta data lvb, and removes the
    value from the inode meta data lock name. This way, the read can be covered
    by a lock, and at the same time we can distinguish between an up to date and
    a stale LVB.

    This will help cold-cache stat(2) performance in particular.

    Since this patch changes the protocol version, we take the opportunity to do
    a minor re-organization of two of the LVB fields.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

21 Sep, 2006

1 commit


29 Jun, 2006

1 commit


04 Feb, 2006

1 commit


04 Jan, 2006

1 commit