15 Apr, 2009

1 commit


07 Apr, 2009

1 commit

  • There's a possible deadlock in generic_file_splice_write(),
    splice_from_pipe() and ocfs2_file_splice_write():

    - task A calls generic_file_splice_write()
    - this calls inode_double_lock(), which locks i_mutex on both
    pipe->inode and target inode
    - ordering depends on inode pointers, can happen that pipe->inode is
    locked first
    - __splice_from_pipe() needs more data, calls pipe_wait()
    - this releases lock on pipe->inode, goes to interruptible sleep
    - task B calls generic_file_splice_write(), similarly to the first
    - this locks pipe->inode, then tries to lock inode, but that is
    already held by task A
    - task A is interrupted, it tries to lock pipe->inode, but fails, as
    it is already held by task B
    - ABBA deadlock

    Fix this by explicitly ordering locks: the outer lock must be on the
    target inode and the inner lock (which is later unlocked and relocked)
    must be on pipe->inode. This is OK because pipe inodes and target inodes
    form two nonoverlapping sets: generic_file_splice_write() and friends
    are not called with a target which is a pipe.

    Signed-off-by: Miklos Szeredi
    Acked-by: Mark Fasheh
    Acked-by: Jens Axboe
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
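
The fixed ordering described in this commit can be sketched in user space. This is a minimal model, not the kernel code: `struct toy_inode`, `toy_lock()`, `toy_unlock()` and `toy_splice_write()` are hypothetical names, and the "mutex" is a plain flag that asserts instead of blocking.

```c
#include <assert.h>

/* Toy stand-in for an inode with an i_mutex; not the kernel structure. */
struct toy_inode {
    int locked;   /* 1 while the toy mutex is held */
};

static void toy_lock(struct toy_inode *inode)
{
    assert(!inode->locked);   /* a real mutex would block here instead */
    inode->locked = 1;
}

static void toy_unlock(struct toy_inode *inode)
{
    assert(inode->locked);
    inode->locked = 0;
}

/*
 * Fixed ordering: the target inode is the outer lock, the pipe inode the
 * inner one. Only the inner lock is dropped while waiting for pipe data.
 * Returns 1 if the outer lock stayed held across the simulated wait.
 */
static int toy_splice_write(struct toy_inode *target,
                            struct toy_inode *pipe_inode)
{
    int outer_held_during_wait;

    toy_lock(target);        /* outer: never released while waiting */
    toy_lock(pipe_inode);    /* inner */

    toy_unlock(pipe_inode);  /* simulated wait: drop the inner lock only */
    outer_held_during_wait = target->locked;
    toy_lock(pipe_inode);    /* retake the inner lock after the wait */

    toy_unlock(pipe_inode);
    toy_unlock(target);
    return outer_held_during_wait;
}
```

Because every writer takes the target lock first, a second task blocks on the outer lock and can never hold the pipe lock while waiting for the target, which is the ABBA pattern listed above.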
     

04 Apr, 2009

32 commits

  • During recovery, a node recovers orphans in its own slot and in the slots
    of the dead node(s). But if the dead nodes were holding orphans in offline
    slots, they will be left unrecovered.

    If the dead node is the last one to die and is holding orphans in other slots
    and is the first one to mount, then it only recovers its own slot, which
    leaves orphans in offline slots.

    This patch queues complete_recovery to clean orphans for all offline slots
    during mount and node recovery.

    Signed-off-by: Srinivas Eeda
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Srinivas Eeda
     
  • A page can have multiple buffers, and even if a page is not uptodate, some
    buffers can be uptodate in a pagesize != blocksize environment.
    This aop checks that all buffers which correspond to the part of a file
    that we want to read are uptodate. If so, we do not have to issue actual
    read IO to the HDD even if the page is not uptodate, because the portion we
    want to read is uptodate.
    The "block_is_partially_uptodate" function is already used by ext2/3/4.
    With the following patch, random read/write mixed workloads or random read
    after random write workloads can be optimized and we can get a performance
    improvement.

    Signed-off-by: Hisashi Hifumi
    Signed-off-by: Mark Fasheh

    Hisashi Hifumi
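
The check described in this commit can be illustrated with a small user-space model. This is a sketch under assumptions: `toy_range_is_uptodate()` and its parameters are hypothetical, and the per-block state is a plain `bool` array rather than kernel buffer heads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true when every block that overlaps the byte range
 * [from, from + count) within one page is marked uptodate.
 * uptodate[i] models the state of the i-th block of the page.
 */
static bool toy_range_is_uptodate(const bool *uptodate, size_t nr_blocks,
                                  size_t blocksize, size_t from, size_t count)
{
    size_t first, last, i;

    assert(uptodate != NULL && blocksize != 0);
    if (count == 0)
        return false;

    first = from / blocksize;               /* first block touched */
    last = (from + count - 1) / blocksize;  /* last block touched */
    if (last >= nr_blocks)
        return false;                       /* range leaves the page */

    for (i = first; i <= last; i++)
        if (!uptodate[i])
            return false;                   /* real read IO is needed */
    return true;
}
```

With a 4K page of four 1K blocks where only the third block is stale, a read of bytes 0 to 2047 can be served without IO, while a read of bytes 1024 to 3071 cannot.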
     
  • For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
    ocfs2_get_dentry() may read from disk when the inode is not in memory,
    without any cross cluster lock. This leads to the file system loading a
    stale inode.

    This patch fixes the above problem.

    The solution is that, when the inode is not in memory, we get the cluster
    lock (PR) of the alloc inode from which the inode in question is allocated
    (this causes the node on which the deletion is done to sync the alloc inode)
    before reading out the inode itself. Then we check the bitmap in the group
    (from which the inode in question is allocated) to see if the bit is clear.
    If it's clear then the inode is stale. If the bit is set, we then check the
    generation as the existing code does.

    We have to read out the inode in question from disk first to know its alloc
    slot and alloc bit. And if it's not stale we read it out using ocfs2_iget().
    The second read should then be from cache.

    We also have to add a per superblock nfs_sync_lock to cover the lock for the
    alloc inode and that for the inode in question. This is because
    ocfs2_get_dentry() and ocfs2_delete_inode() lock them in reverse order.
    nfs_sync_lock is locked in EX mode in ocfs2_get_dentry() and in PR mode in
    ocfs2_delete_inode(), so that multiple ocfs2_delete_inode() calls can run
    concurrently in the normal case.

    [mfasheh@suse.com: build warning fixes and comment cleanups]
    Signed-off-by: Wengang Wang
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    wengang wang
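
The two-step staleness test from this commit (allocation bit first, generation second) can be sketched as follows. The names `toy_bit_is_set()` and `toy_handle_is_stale()` are hypothetical, the bitmap layout is an assumption, and all cluster locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Test one bit in a byte-array bitmap (bit 0 is the lowest bit of byte 0). */
static bool toy_bit_is_set(const uint8_t *bitmap, unsigned int bit)
{
    return (bitmap[bit / 8] >> (bit % 8)) & 1u;
}

/*
 * A file handle is stale when the inode's allocation bit is clear (the
 * inode was freed), or when the bit is set but the generation stored in
 * the inode no longer matches the generation recorded in the handle.
 */
static bool toy_handle_is_stale(const uint8_t *group_bitmap,
                                unsigned int alloc_bit,
                                uint32_t inode_generation,
                                uint32_t handle_generation)
{
    assert(group_bitmap != NULL);
    if (!toy_bit_is_set(group_bitmap, alloc_bit))
        return true;
    return inode_generation != handle_generation;
}
```

The bitmap test comes first because it catches a freed inode even when the stale on-disk copy still carries a matching generation.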
     
  • The debugfs file, mle_state, now prints the largest number of mles
    in one hash link.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch attempts to fix a fine race between purging and migration.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch removes struct dlm_lock_name and adds the entries directly
    to struct dlm_master_list_entry. Under the new scheme, all mles, whether
    or not they are backed by a lockres, will have the name populated in
    mle->mname. This allows us to get rid of code that was figuring out the
    location of the mle name.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch shows the number of lockres' and mles in the debugfs file, dlm_state.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch inlines dlm_set_lockres_owner() and dlm_change_lockres_owner().

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch replaces the lockres counts that tracked the number of
    locally and remotely mastered lockres' with a current and total count. The
    total count is the number of lockres' that have been created since the dlm
    domain was created.

    The number of locally and remotely mastered counts can be computed using
    the locking_state output.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • The lifetime of a mle is limited to the duration of the lockres mastery
    process. While typically this lifetime is fairly short, we have noticed
    the number of mles explode under certain circumstances. This patch tracks
    the number of mles of each type and should help us determine
    how best to speed up the mastery process.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • The previous patch explicitly did not indent dlm_cleanup_master_list()
    so as to make the patch readable. This patch properly indents the
    function.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • With this patch, the mles are stored in a hash and not a simple list.
    This should improve the mle lookup time when the number of outstanding
    masteries is large.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
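
To illustrate why moving the mles from a list to a hash helps lookup time, here is a minimal chained hash table keyed by lock name. It is a sketch only: `struct toy_mle`, `struct toy_master_hash`, the bucket count and the string hash are assumptions and do not reflect the dlm's actual structures or hash function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_HASH_BUCKETS 8

struct toy_mle {
    const char *name;        /* lock name used as the key */
    struct toy_mle *next;    /* next entry in the same bucket */
};

struct toy_master_hash {
    struct toy_mle *buckets[TOY_HASH_BUCKETS];
};

/* Simple multiplicative string hash reduced to a bucket index. */
static unsigned int toy_hash(const char *name)
{
    unsigned int h = 5381;

    while (*name)
        h = h * 33u + (unsigned char)*name++;
    return h % TOY_HASH_BUCKETS;
}

static void toy_mle_insert(struct toy_master_hash *hash, struct toy_mle *mle)
{
    unsigned int b;

    assert(hash != NULL && mle != NULL && mle->name != NULL);
    b = toy_hash(mle->name);
    mle->next = hash->buckets[b];
    hash->buckets[b] = mle;
}

/* Only the entries in one bucket are compared, not every outstanding mle. */
static struct toy_mle *toy_mle_find(const struct toy_master_hash *hash,
                                    const char *name)
{
    struct toy_mle *mle;

    for (mle = hash->buckets[toy_hash(name)]; mle; mle = mle->next)
        if (strcmp(mle->name, name) == 0)
            return mle;
    return NULL;
}

/* Length of the longest bucket chain, a rough measure of hash quality. */
static size_t toy_longest_chain(const struct toy_master_hash *hash)
{
    size_t b, len, longest = 0;
    const struct toy_mle *mle;

    for (b = 0; b < TOY_HASH_BUCKETS; b++) {
        len = 0;
        for (mle = hash->buckets[b]; mle; mle = mle->next)
            len++;
        if (len > longest)
            longest = len;
    }
    return longest;
}
```

A lookup walks a single chain, so its cost depends on the chain length rather than on the total number of outstanding masteries.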
     
  • This patch adds code to create and destroy the dlm->master_hash.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch refactors dlm_clean_master_list() so as to make it
    easier to convert the mle list to a hash.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • For master mle, the name is stored in the attached lockres in struct qstr.
    For block and migration mle, the name is stored inline in struct dlm_lock_name.
    This patch attempts to make struct dlm_lock_name look like a struct qstr. While
    we could use struct qstr, we don't because we want to avoid having to malloc
    and free the lockname string as the mle's lifetime is fairly short.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch encapsulates adding and removing of the mle from the
    dlm->master_list. This patch is part of the series of patches that
    converts the mle list to a mle hash.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • In ocfs2, the block group search looks for the "emptiest" group
    to allocate from. So if the allocator has many equally (or almost
    equally) empty groups, new block groups will tend to get spread
    out amongst them.

    So we add osb_inode_alloc_group in ocfs2_super to record the last
    used inode allocation group.
    For more details, please see
    http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

    I have done some basic tests and the results show a tenfold improvement on
    some cold-cache stat workloads.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • Inode groups used to be allocated from local alloc file,
    but since we want all inodes to be contiguous enough, we
    will try to allocate them directly from global_bitmap.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
     
  • In ocfs2, the inode block search looks for the "emptiest" inode
    group to allocate from. So if an inode alloc file has many equally
    (or almost equally) empty groups, new inodes will tend to get
    spread out amongst them, which in turn can put them all over the
    disk. This is undesirable because directory operations on conceptually
    "nearby" inodes force a large number of seeks.

    So we add ip_last_used_group in core directory inodes, which records
    the last used allocation group. Another field named ip_last_used_slot
    is also added in case inode stealing happens. When claiming a new inode,
    we pass in the directory's inode so that the allocation can use this
    information.
    For more details, please see
    http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
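
The difference between the "emptiest group" search and a "last used group" hint can be sketched as below. This is an illustration under assumptions: `struct toy_group`, `toy_emptiest_group()` and `toy_pick_group()` are hypothetical, and the fallback rule (use the emptiest group when the hinted group is full) is a guess at a reasonable policy, not a description of the ocfs2 allocator.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_GROUP ((size_t)-1)

struct toy_group {
    unsigned int free_bits;   /* free inode slots left in this group */
};

/* Old-style search: the group with the most free bits wins. */
static size_t toy_emptiest_group(const struct toy_group *groups, size_t nr)
{
    size_t i, best = TOY_NO_GROUP;

    for (i = 0; i < nr; i++)
        if (groups[i].free_bits > 0 &&
            (best == TOY_NO_GROUP ||
             groups[i].free_bits > groups[best].free_bits))
            best = i;
    return best;
}

/*
 * Hinted search: keep allocating from the group the directory used last,
 * so inodes of one directory stay close together; fall back to the
 * emptiest group only when the hint is missing or that group is full.
 */
static size_t toy_pick_group(const struct toy_group *groups, size_t nr,
                             size_t last_used)
{
    assert(groups != NULL);
    if (last_used < nr && groups[last_used].free_bits > 0)
        return last_used;
    return toy_emptiest_group(groups, nr);
}
```

With groups holding 10, 3 and 0 free bits, a directory whose hint is group 1 keeps using group 1 even though group 0 is emptier.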
     
  • ocfs2_dx_dir_rebalance() is passed the block offset of a dx leaf which needs
    rebalancing. Since we rebalance an entire cluster at a time however, this
    function needs to calculate the beginning of that cluster, in blocks. The
    calculation was wrong, which would result in a read of non-leaf blocks. Fix
    the calculation by adding ocfs2_block_to_cluster_start() which is a more
    straight-forward way of determining this.

    Reported-by: Tristan Ye
    Signed-off-by: Mark Fasheh

    Mark Fasheh
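
Rounding a block offset down to the first block of its cluster can be sketched with two shifts. The function below is a user-space illustration, not the kernel helper: `toy_block_to_cluster_start()` takes the size exponents as plain arguments, whereas the commit's ocfs2_block_to_cluster_start() presumably derives them from the superblock.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the block offset at which the cluster containing `block` begins.
 * clustersize_bits and blocksize_bits are log2 of the sizes in bytes,
 * so a cluster spans 2^(clustersize_bits - blocksize_bits) blocks.
 */
static uint64_t toy_block_to_cluster_start(uint64_t block,
                                           unsigned int clustersize_bits,
                                           unsigned int blocksize_bits)
{
    unsigned int bits;

    assert(clustersize_bits >= blocksize_bits);
    bits = clustersize_bits - blocksize_bits;
    assert(bits < 64);

    /* Drop the block-within-cluster bits, then scale back to blocks. */
    return (block >> bits) << bits;
}
```

For example, with 4K blocks (12 bits) and 32K clusters (15 bits) each cluster holds 8 blocks, so block 13 belongs to the cluster that starts at block 8.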
     
  • ocfs2_empty_dir() is far more expensive than checking link count. Since both
    need to be checked at the same time, we can improve performance by checking
    link count first.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
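
The reordering in this commit relies on short-circuit evaluation: the cheap test runs first and the expensive scan only runs when it is still needed. A minimal sketch follows, assuming the conventional rule that a directory without subdirectories has a link count of 2; `struct toy_dir` and both functions are hypothetical, and the `scans` counter exists only to make the effect visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dir {
    unsigned int nlink;        /* 2 for a directory with no subdirectories */
    unsigned int nr_entries;   /* entries other than "." and ".." */
    unsigned int scans;        /* how often the expensive scan ran */
};

/* Stand-in for the expensive full directory scan. */
static bool toy_dir_scan_is_empty(struct toy_dir *dir)
{
    dir->scans++;
    return dir->nr_entries == 0;
}

/* Cheap link count test first; && skips the scan when it already fails. */
static bool toy_dir_can_be_removed(struct toy_dir *dir)
{
    assert(dir != NULL);
    return dir->nlink == 2 && toy_dir_scan_is_empty(dir);
}
```

For a directory with a subdirectory the answer is known from the link count alone, and the scan counter stays at zero.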
     
  • Since the disk format is finalized, we can set this feature bit in the
    supported mask.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
     
  • This little bit of extra accounting speeds up ocfs2_empty_dir()
    dramatically by allowing us to short-circuit the full directory scan.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Since we've now got a directory format capable of handling a large number of
    entries, we can increase the maximum link count supported. This only gets
    increased if the directory indexing feature is turned on.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
     
  • The only operation which doesn't get faster with directory indexing is
    insert, which still has to walk the entire unindexed directory portion to
    find a free block. This patch provides an improvement in directory insert
    performance by maintaining a singly linked list of directory leaf blocks
    which have space for additional dirents.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
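
The free list idea from this commit can be sketched with an array of leaf descriptors chained by index. This is a toy model: `struct toy_leaf`, `toy_find_leaf()` and the index-based links are assumptions, and the on-disk layout used by ocfs2 is not shown here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_END_OF_LIST ((size_t)-1)

struct toy_leaf {
    size_t free_bytes;   /* space left for additional dirents */
    size_t next_free;    /* next leaf on the free list, or TOY_END_OF_LIST */
};

/*
 * Walk only the leaves that are on the free list, starting at free_head,
 * and return the first one with at least `needed` free bytes.
 * Full leaves are never visited because they are not linked in.
 */
static size_t toy_find_leaf(const struct toy_leaf *leaves, size_t nr,
                            size_t free_head, size_t needed)
{
    size_t i;

    assert(leaves != NULL);
    for (i = free_head; i != TOY_END_OF_LIST; i = leaves[i].next_free) {
        assert(i < nr);
        if (leaves[i].free_bytes >= needed)
            return i;
    }
    return TOY_END_OF_LIST;
}
```

In a directory of four leaves where only leaves 3 and 1 have room, an insert inspects at most those two instead of walking all four.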
     
  • Allow us to store a small number of directory index records in the
    ocfs2_dx_root_block. This saves us a disk read on small to medium sized
    directories (less than about 250 entries). The inline root is automatically
    turned into a root block with extents if the directory size increases beyond
    its capacity.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
     
  • This patch makes use of Ocfs2's flexible btree code to add an additional
    tree to directory inodes. The new tree stores an array of small,
    fixed-length records in each leaf block. Each record stores a hash value,
    and pointer to a block in the traditional (unindexed) directory tree where a
    dirent with the given name hash resides. Lookup exclusively uses this tree
    to find dirents, thus providing us with constant time name lookups.

    Some of the hashing code was copied from ext3. Unfortunately, it has lots of
    unfixed checkpatch errors. I left that as-is so that tracking changes would
    be easier.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
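
The record layout described in this commit (a hash value plus a pointer to the unindexed block holding the dirent) can be sketched as follows. Everything here is illustrative: `struct toy_dx_entry`, `toy_name_hash()` and `toy_dx_lookup()` are hypothetical, and the hash is FNV-1a chosen for brevity, not the ext3-derived hash the commit mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One small fixed-length index record. */
struct toy_dx_entry {
    uint32_t name_hash;    /* hash of the dirent name */
    uint64_t dirent_blk;   /* unindexed directory block holding the dirent */
};

/* FNV-1a, used here only as a placeholder name hash. */
static uint32_t toy_name_hash(const char *name)
{
    uint32_t h = 2166136261u;

    while (*name) {
        h ^= (unsigned char)*name++;
        h *= 16777619u;
    }
    return h;
}

/*
 * Collect the blocks that may contain `name`: every record in the leaf
 * whose hash matches. Distinct names can collide, so the caller still
 * has to compare names inside each returned block.
 * Returns the number of candidate blocks written to `blocks`.
 */
static size_t toy_dx_lookup(const struct toy_dx_entry *leaf, size_t nr,
                            const char *name, uint64_t *blocks, size_t max)
{
    uint32_t hash;
    size_t i, found = 0;

    assert(leaf != NULL && name != NULL && blocks != NULL);
    hash = toy_name_hash(name);
    for (i = 0; i < nr && found < max; i++)
        if (leaf[i].name_hash == hash)
            blocks[found++] = leaf[i].dirent_blk;
    return found;
}
```

The lookup reads only the index leaf and the candidate blocks it names, never the whole unindexed directory.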
     
  • Many directory manipulation calls pass around a tuple of a dirent and its
    containing buffer_head. Dir indexing has a bit more state, but instead of
    adding yet more arguments to functions, we introduce 'struct
    ocfs2_dir_lookup_result'. In this patch, it simply holds the same tuple, but
    future patches will add more state.

    Signed-off-by: Mark Fasheh
    Acked-by: Joel Becker

    Mark Fasheh
     
  • This patch removes the debugfs file local_alloc_stats as that information
    is now included in the fs_state debugfs file.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch creates a per mount debugfs file, fs_state, which exposes
    information like, cluster stack in use, states of the downconvert, recovery
    and commit threads, number of journal txns, some allocation stats, list of
    all slots, etc.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • Move the definition of struct recovery_map from journal.c to journal.h. This
    is preparation for the next patch.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     
  • This patch creates a debugfs file, o2hb/livenodes, which exposes the
    aggregate list of heartbeating nodes across all heartbeat regions.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Mark Fasheh

    Sunil Mushran
     

03 Apr, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Remove two unneeded exports and make two symbols static in fs/mpage.c
    Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
    Trim includes of fdtable.h
    Don't crap into descriptor table in binfmt_som
    Trim includes in binfmt_elf
    Don't mess with descriptor table in load_elf_binary()
    Get rid of indirect include of fs_struct.h
    New helper - current_umask()
    check_unsafe_exec() doesn't care about signal handlers sharing
    New locking/refcounting for fs_struct
    Take fs_struct handling to new file (fs/fs_struct.c)
    Get rid of bumping fs_struct refcount in pivot_root(2)
    Kill unsharing fs_struct in __set_personality()

    Linus Torvalds
     

01 Apr, 2009

2 commits

  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and can also provide more information, e.g. virtual_address, to the
    driver, which might be important in some special cases).

    This is required for a subsequent fix, and will also make it easier to
    merge page_mkwrite() with fault() in future.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Reading current->fs->umask is what most fs_struct users do.
    Put that into a helper function.

    Signed-off-by: Al Viro

    Al Viro
     

28 Mar, 2009

1 commit


13 Mar, 2009

2 commits

  • A long time ago, xs->base was allocated with a 4K size and all the contents
    of the bucket were copied to it. Now we use ocfs2_xattr_bucket to
    abstract the xattr bucket and xs->base is initialized to the start of
    bu_bhs[0]. So xs->base + offset will overflow when the value root is
    stored outside the first block.

    Then why have we survived the xattr tests until now? It is because we always
    read the bucket contiguously now and the kernel mm allocates contiguous
    memory for us. We are lucky, but we should fix it. So just get the
    right value root as other callers do.

    Signed-off-by: Tao Ma
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Tao Ma
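
The overflow described in this commit comes from treating a bucket made of several block buffers as one flat array. A safer address calculation picks the block first and then the offset inside it. The sketch below is a user-space model: `struct toy_bucket` and `toy_bucket_ptr()` are hypothetical and only mirror the idea of a bucket backed by one buffer per block.

```c
#include <assert.h>
#include <stddef.h>

/* A bucket backed by one separately allocated buffer per block. */
struct toy_bucket {
    unsigned char **blocks;   /* blocks[i] is the buffer of the i-th block */
    size_t nr_blocks;
    size_t blocksize;
};

/*
 * Translate a byte offset within the bucket into a pointer. Using
 * blocks[0] + offset is only correct while the buffers happen to be
 * contiguous in memory; selecting the block first is always correct.
 */
static unsigned char *toy_bucket_ptr(const struct toy_bucket *bucket,
                                     size_t offset)
{
    size_t blk;

    assert(bucket != NULL && bucket->blocksize != 0);
    blk = offset / bucket->blocksize;
    assert(blk < bucket->nr_blocks);
    return bucket->blocks[blk] + offset % bucket->blocksize;
}
```

With two 8-byte blocks, offset 11 resolves to byte 3 of the second buffer regardless of where that buffer sits in memory.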
     
  • We need to use le32_to_cpu to test rec->e_cpos in
    ocfs2_dinode_insert_check.

    Signed-off-by: Tao Ma
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Tao Ma
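
Why the conversion matters can be shown with a portable decoder for a little-endian 32-bit on-disk field. This is a user-space sketch: `toy_le32_to_cpu()` and `toy_cpos_equals()` are hypothetical stand-ins, and the comparison shown is only an example, not the actual test in ocfs2_dinode_insert_check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a little-endian 32-bit field, independent of host byte order. */
static uint32_t toy_le32_to_cpu(const unsigned char bytes[4])
{
    assert(bytes != NULL);
    return (uint32_t)bytes[0] |
           ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) |
           ((uint32_t)bytes[3] << 24);
}

/* Compare an on-disk cluster position against a CPU-order value. */
static int toy_cpos_equals(const unsigned char e_cpos_on_disk[4],
                           uint32_t expected)
{
    return toy_le32_to_cpu(e_cpos_on_disk) == expected;
}
```

Comparing the raw field without conversion happens to work on little-endian hosts and fails on big-endian ones, which is why this class of bug can go unnoticed.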