11 Sep, 2010

1 commit

  • This patch changes mutex_lock to a new subclass and
    adds a new inode lock subclass for the target inode;
    the previous locking caused the lockdep warning below.
    (A sketch of the subclass technique follows this entry.)

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.35+ #5
    ---------------------------------------------
    reflink/11086 is trying to acquire lock:
    (Meta){+++++.}, at: [] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]

    but task is already holding lock:
    (Meta){+++++.}, at: [] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2]

    other info that might help us debug this:
    6 locks held by reflink/11086:
    #0: (&sb->s_type->i_mutex_key#15/1){+.+.+.}, at: [] lookup_create+0x26/0x97
    #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [] ocfs2_reflink_ioctl+0x4d3/0x1229 [ocfs2]
    #2: (Meta){+++++.}, at: [] ocfs2_reflink_ioctl+0x5d3/0x1229 [ocfs2]
    #3: (&oi->ip_xattr_sem){+.+.+.}, at: [] ocfs2_reflink_ioctl+0x68b/0x1229 [ocfs2]
    #4: (&oi->ip_alloc_sem){+.+.+.}, at: [] ocfs2_reflink_ioctl+0x69a/0x1229 [ocfs2]
    #5: (&sb->s_type->i_mutex_key#15/2){+.+...}, at: [] ocfs2_reflink_ioctl+0x882/0x1229 [ocfs2]

    stack backtrace:
    Pid: 11086, comm: reflink Not tainted 2.6.35+ #5
    Call Trace:
    [] validate_chain+0x56e/0xd68
    [] ? mark_held_locks+0x49/0x69
    [] __lock_acquire+0x79a/0x7f1
    [] lock_acquire+0xc6/0xed
    [] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
    [] __ocfs2_cluster_lock+0x975/0xa0d [ocfs2]
    [] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
    [] ? ocfs2_wait_for_recovery+0x15/0x8a [ocfs2]
    [] ocfs2_inode_lock_full_nested+0x1ac/0xdc5 [ocfs2]
    [] ? ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
    [] ? trace_hardirqs_on_caller+0x10b/0x12f
    [] ? debug_mutex_free_waiter+0x4f/0x53
    [] ocfs2_reflink_ioctl+0x898/0x1229 [ocfs2]
    [] ? ocfs2_file_lock_res_init+0x66/0x78 [ocfs2]
    [] ? might_fault+0x40/0x8d
    [] ocfs2_ioctl+0x61a/0x656 [ocfs2]
    [] ? mntput_no_expire+0x1d/0xb0
    [] ? path_put+0x2c/0x31
    [] vfs_ioctl+0x2a/0x9d
    [] do_vfs_ioctl+0x45d/0x4ae
    [] ? _raw_spin_unlock+0x26/0x2a
    [] ? sysret_check+0x27/0x62
    [] sys_ioctl+0x57/0x7a
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Tao Ma
    Signed-off-by: Joel Becker

    Tao Ma
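    The fix relies on lockdep's lock-nesting annotations. Below is a minimal,
    illustrative sketch of that technique using the 2.6.35-era i_mutex (the
    subclass names and helpers are hypothetical, not the identifiers the patch
    adds): when two inodes of the same lock class must be held at once, each
    acquisition is annotated with a distinct subclass so lockdep does not
    report false recursion.

    #include <linux/fs.h>
    #include <linux/mutex.h>

    /* Hypothetical subclasses for a source/target inode pair. */
    enum demo_reflink_subclass {
            DEMO_LS_SOURCE = 0,     /* first (source) inode  */
            DEMO_LS_TARGET,         /* second (target) inode */
    };

    static void demo_lock_reflink_pair(struct inode *src, struct inode *tgt)
    {
            /* Same lock class, distinct subclasses: no lockdep splat. */
            mutex_lock_nested(&src->i_mutex, DEMO_LS_SOURCE);
            mutex_lock_nested(&tgt->i_mutex, DEMO_LS_TARGET);
    }

    static void demo_unlock_reflink_pair(struct inode *src, struct inode *tgt)
    {
            mutex_unlock(&tgt->i_mutex);
            mutex_unlock(&src->i_mutex);
    }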
     

23 Sep, 2009

1 commit


23 Jun, 2009

2 commits


04 Jun, 2009

1 commit

  • When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
    before moving the dentry to the orphan directory. Other nodes that have
    this dentry in cache have a PR on the same dentry lock. When the EX is
    requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
    during downconvert. The inode is finally deleted when the last node to iput
    the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.

    A problem arises if a node is forced to free dentry locks because of memory
    pressure. If this happens, the node will no longer get downconvert
    notifications for the dentries that have been unlinked on another node.
    If that node is also actively using the corresponding inode and happens to
    be the one performing the last iput on it, it will fail to delete the inode
    because the MAYBE_ORPHANED flag is not set.

    This patch fixes this shortcoming by introducing a periodic scan of the
    orphan directories to delete such inodes (see the sketch after this entry).
    Care has been taken to distribute the workload across the cluster so that
    no one node has to perform the task all the time.

    Signed-off-by: Srinivas Eeda
    Signed-off-by: Joel Becker

    Srinivas Eeda
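    A minimal sketch of the periodic-scan technique, assuming a re-arming
    delayed work item (all names, the interval, and the omitted scan body are
    hypothetical; in particular, how the work is spread across nodes is not
    shown):

    #include <linux/kernel.h>
    #include <linux/workqueue.h>
    #include <linux/jiffies.h>
    #include <linux/fs.h>

    #define DEMO_ORPHAN_SCAN_INTERVAL   (300 * HZ)  /* every 5 minutes */

    struct demo_orphan_scan {
            struct delayed_work     os_work;
            struct super_block      *os_sb;
    };

    static void demo_queue_orphan_scan(struct demo_orphan_scan *os)
    {
            schedule_delayed_work(&os->os_work, DEMO_ORPHAN_SCAN_INTERVAL);
    }

    static void demo_orphan_scan_worker(struct work_struct *work)
    {
            struct demo_orphan_scan *os =
                    container_of(work, struct demo_orphan_scan, os_work.work);

            /*
             * Walk each orphan directory and let the normal iput()/delete
             * path reclaim inodes that were unlinked remotely.  (Omitted.)
             */

            demo_queue_orphan_scan(os);     /* re-arm for the next pass */
    }

    static void demo_orphan_scan_init(struct demo_orphan_scan *os,
                                      struct super_block *sb)
    {
            os->os_sb = sb;
            INIT_DELAYED_WORK(&os->os_work, demo_orphan_scan_worker);
            demo_queue_orphan_scan(os);
    }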
     

04 Apr, 2009

1 commit

  • For NFS exporting, ocfs2_get_dentry() returns the dentry for a file
    handle. When the inode is not in memory, ocfs2_get_dentry() may read it
    from disk without taking any cross-cluster lock, which can leave the
    file system holding a stale inode.

    This patch fixes the above problem.

    The solution is that when the inode is not in memory, we take the cluster
    lock (PR) on the alloc inode from which the inode in question was
    allocated (this forces the node performing the deletion to sync the alloc
    inode) before reading the inode itself. We then check the bitmap of the
    group the inode was allocated from: if the bit is clear, the inode is
    stale; if the bit is set, we check the generation as the existing code
    does.

    We have to read the inode in question from disk first to learn its alloc
    slot and alloc bit. If it is not stale, we then read it via ocfs2_iget();
    that second read should come from cache.

    We also have to add a per-superblock nfs_sync_lock covering the lock on
    the alloc inode and the lock on the inode in question, because
    ocfs2_get_dentry() and ocfs2_delete_inode() take them in reverse order.
    nfs_sync_lock is taken in EX mode in ocfs2_get_dentry() and in PR mode in
    ocfs2_delete_inode(), so multiple ocfs2_delete_inode() calls can still
    run concurrently in the normal case. (A sketch of the lookup ordering
    follows this entry.)

    [mfasheh@suse.com: build warning fixes and comment cleanups]
    Signed-off-by: Wengang Wang
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Wengang Wang
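    A minimal sketch of the lookup ordering described above; every helper
    below is a hypothetical stand-in, not the real ocfs2 function:

    #include <linux/fs.h>
    #include <linux/err.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    /* Hypothetical stand-ins; declarations only. */
    void demo_nfs_sync_lock(struct super_block *sb, int ex);
    void demo_nfs_sync_unlock(struct super_block *sb, int ex);
    int demo_test_inode_bit(struct super_block *sb, u64 blkno);
    struct inode *demo_iget(struct super_block *sb, u64 blkno, u32 generation);

    static struct inode *demo_nfs_get_inode(struct super_block *sb,
                                            u64 blkno, u32 generation)
    {
            struct inode *inode = ERR_PTR(-ESTALE);

            demo_nfs_sync_lock(sb, 1);               /* EX: serialize vs delete */

            if (!demo_test_inode_bit(sb, blkno))     /* allocator PR + bitmap   */
                    goto out;                        /* bit clear => stale fh   */

            inode = demo_iget(sb, blkno, generation); /* cached read + gen check */
    out:
            demo_nfs_sync_unlock(sb, 1);
            return inode;
    }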
     

06 Jan, 2009

1 commit

  • For each quota type, each node has a local quota file. In this file it
    stores the changes users have made to disk usage via this node. Once in a
    while this information is synced to the global file (and thus with other
    nodes) so that limit enforcement at least approximately works.

    Global quota files contain all the information about usage and limits. They
    are mostly handled by the generic VFS code (which implements a trie of
    structures inside a quota file). We only have to provide functions to
    convert structures from the on-disk format to the in-memory one (see the
    sketch after this entry), plus wrappers around the various quota operations
    that start transactions and acquire the necessary cluster locks before the
    actual IO is started.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
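    A minimal sketch of the on-disk to in-memory conversion wrapper mentioned
    above. The field layout is hypothetical, not the real ocfs2 global-quota
    record; the point is the fixed-endian on-disk struct and its conversion:

    #include <linux/types.h>
    #include <asm/byteorder.h>

    struct demo_disk_dqblk {                /* on-disk, little-endian */
            __le64  dqb_spacelimit;         /* block hard limit, bytes */
            __le64  dqb_spaceused;          /* current usage, bytes    */
            __le64  dqb_inodelimit;         /* inode hard limit        */
            __le64  dqb_inodeused;          /* inodes in use           */
    };

    struct demo_mem_dqblk {                 /* in-memory, CPU-endian */
            u64     spacelimit;
            u64     spaceused;
            u64     inodelimit;
            u64     inodeused;
    };

    static void demo_disk2mem_dqblk(struct demo_mem_dqblk *m,
                                    const struct demo_disk_dqblk *d)
    {
            m->spacelimit = le64_to_cpu(d->dqb_spacelimit);
            m->spaceused  = le64_to_cpu(d->dqb_spaceused);
            m->inodelimit = le64_to_cpu(d->dqb_inodelimit);
            m->inodeused  = le64_to_cpu(d->dqb_inodeused);
    }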
     

18 Apr, 2008

4 commits

  • We define the ocfs2_stack_plugin structure to represent a stack driver.
    The o2cb stack code is split into stack_o2cb.c. This becomes the
    ocfs2_stack_o2cb.ko module.

    The stackglue generic functions are similarly split into the
    ocfs2_stackglue.ko module. This module now provides an interface to
    register drivers. The ocfs2_stack_o2cb driver registers itself. As
    part of this interface, ocfs2_stackglue can load drivers on demand.
    This is accomplished in ocfs2_cluster_connect() (see the registration
    sketch after this entry).

    ocfs2_cluster_disconnect() is now notified when a _hangup() is pending.
    If a hangup is pending, it will not release the driver module and will
    let _hangup() do that.

    Signed-off-by: Joel Becker

    Joel Becker
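    A minimal, illustrative sketch of the driver-registration and on-demand
    loading pattern described above (the struct layout and function names are
    simplified and hypothetical, not the real stackglue interface):

    #include <linux/module.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/string.h>
    #include <linux/kmod.h>

    struct demo_stack_plugin {
            const char              *sp_name;
            struct module           *sp_owner;
            struct list_head        sp_list;
    };

    static LIST_HEAD(demo_stack_drivers);
    static DEFINE_SPINLOCK(demo_stack_lock);

    /* Called by a stack driver from its module init. */
    int demo_stack_glue_register(struct demo_stack_plugin *plugin)
    {
            spin_lock(&demo_stack_lock);
            list_add(&plugin->sp_list, &demo_stack_drivers);
            spin_unlock(&demo_stack_lock);
            return 0;
    }
    EXPORT_SYMBOL_GPL(demo_stack_glue_register);

    /*
     * Connect-time lookup: if the requested driver has not registered yet,
     * ask userspace to load its module and then search the list.
     */
    struct demo_stack_plugin *demo_stack_glue_find(const char *name)
    {
            struct demo_stack_plugin *p;

            request_module("demo_stack_%s", name);  /* load on demand */

            spin_lock(&demo_stack_lock);
            list_for_each_entry(p, &demo_stack_drivers, sp_list) {
                    if (!strcmp(p->sp_name, name) &&
                        try_module_get(p->sp_owner)) {
                            spin_unlock(&demo_stack_lock);
                            return p;
                    }
            }
            spin_unlock(&demo_stack_lock);
            return NULL;
    }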
     
  • The stack glue initialization function needs a better name so that it can be
    used cleanly when stackglue becomes a module.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • This step introduces a cluster stack agnostic API for initializing and
    exiting. fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to
    connect to the stack. It is all handled in stackglue.c.

    heartbeat.c no longer needs to know how it gets called.
    ocfs2_do_node_down() is now a clean recovery trigger.

    The big gotcha is the ordering of initializations and de-initializations done
    underneath ocfs2_cluster_connect(). ocfs2_dlm_init() used to do all
    o2dlm initialization in one block. Thus, the o2dlm functionality of
    ocfs2_cluster_connect() is very straightforward. ocfs2_dlm_shutdown(),
    however, did a few things between de-registration of the eviction
    callback and actually shutting down the domain. Now de-registration and
    shutdown of the domain are wrapped within the single
    ocfs2_cluster_disconnect() call. I've checked the code paths to make
    sure we can safely tear down things in ocfs2_dlm_shutdown() before
    calling ocfs2_cluster_disconnect(). The filesystem has already set
    itself to ignore the callback.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • This is the first in a series of patches to isolate ocfs2 from the
    underlying cluster stack. Here we wrap the dlm locking functions with
    ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status
    callbacks, we can eliminate the callbacks from the filesystem-visible
    functions (see the wrapper sketch after this entry).

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
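    A minimal sketch of that wrapping, assuming hypothetical names throughout
    (this is not the real o2dlm or dlmglue API). Because the callback pair is
    fixed, callers of the wrapper never see stack-specific callback types:

    #include <linux/types.h>

    struct demo_lksb;                       /* opaque lock status block */

    typedef void (*demo_ast_t)(void *astarg);
    typedef void (*demo_bast_t)(void *astarg, int level);

    /* The one callback pair the filesystem ever uses. */
    void demo_locking_ast(void *astarg);
    void demo_blocking_ast(void *astarg, int level);

    /* Hypothetical stand-in for the underlying DLM's lock call. */
    int demo_dlmlock(int mode, struct demo_lksb *lksb, u32 flags,
                     const char *name, unsigned int namelen,
                     demo_ast_t ast, void *astarg, demo_bast_t bast);

    /* The stack-agnostic call the rest of the filesystem uses. */
    static int demo_cluster_lock(int mode, struct demo_lksb *lksb, u32 flags,
                                 const char *name, unsigned int namelen,
                                 void *astarg)
    {
            return demo_dlmlock(mode, lksb, flags, name, namelen,
                                demo_locking_ast, astarg, demo_blocking_ast);
    }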
     

04 Mar, 2008

1 commit

  • This patch contains the following cleanups that are now possible:
    - make the following needlessly global functions static:
    - dlmglue.c:ocfs2_process_blocked_lock()
    - heartbeat.c:ocfs2_node_map_init()
    - #if 0 the following unused global function plus support functions:
    - heartbeat.c:ocfs2_node_map_is_only()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Mark Fasheh

    Adrian Bunk
     

07 Feb, 2008

1 commit

  • Currently, when ocfs2 nodes connect via TCP, they advertise their
    compatibility level. If the versions do not match, two nodes cannot speak
    to each other and they disconnect. As a result, this provides no forward or
    backwards compatibility.

    This patch implements a simple protocol negotiation at the dlm level by
    introducing a major/minor version number scheme for entities that
    communicate. Specifically, o2dlm has a major/minor version for interaction
    with o2dlm on other nodes, and ocfs2 itself has a major/minor version for
    interacting with the filesystem on other nodes.

    This will allow rolling upgrades of ocfs2 clusters when changes to the
    locking or network protocols can be done in a backwards compatible manner.
    In those cases, only the minor number is changed and the negotiated protocol
    minor is returned from dlm join (see the sketch after this entry). In the
    far less likely event that a required protocol change makes backwards
    compatibility impossible, we simply bump the major number.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
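    A minimal sketch of the major/minor negotiation described above (struct
    and function names are hypothetical, not the structures the patch adds):

    #include <linux/kernel.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    struct demo_protocol_version {
            u8 pv_major;
            u8 pv_minor;
    };

    /*
     * Returns 0 and writes the agreed version into *result, or -EINVAL when
     * the majors differ (incompatible).  Minors negotiate down to the lower
     * value, which is what permits rolling upgrades for backwards-compatible
     * changes.
     */
    static int demo_negotiate_version(const struct demo_protocol_version *ours,
                                      const struct demo_protocol_version *theirs,
                                      struct demo_protocol_version *result)
    {
            if (ours->pv_major != theirs->pv_major)
                    return -EINVAL;

            result->pv_major = ours->pv_major;
            result->pv_minor = min(ours->pv_minor, theirs->pv_minor);
            return 0;
    }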
     

26 Jan, 2008

4 commits

  • This adds a new dlmglue lock type which is intended to back flock()
    requests.

    Since these locks are driven from userspace, usage rules are much more
    liberal than for the typical ocfs2 internal cluster lock. As a result, we
    can't make use of most dlmglue features - lock caching and lock level
    optimizations in particular. Additionally, userspace is free to deadlock
    itself, so we have to deal with that in the same way as the rest of the
    kernel - by allowing a signal to abort a lock request (see the sketch
    after this entry).

    In order to keep ocfs2_cluster_lock() complexity down, ocfs2_file_lock()
    does its own dlm coordination. We still use the same helper functions,
    though, so duplicated code is kept to a minimum.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
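    A minimal sketch of the signal-abort handling, under the assumption that
    the names and the cancel helper are hypothetical (this is not the real
    ocfs2_file_lock() implementation):

    #include <linux/completion.h>
    #include <linux/errno.h>

    struct demo_file_lock {
            struct completion fl_granted;   /* completed by the locking AST */
    };

    int demo_cancel_lock_request(struct demo_file_lock *fl);  /* stand-in */

    static int demo_file_lock_wait(struct demo_file_lock *fl)
    {
            int ret = wait_for_completion_interruptible(&fl->fl_granted);

            if (ret == -ERESTARTSYS) {
                    /* A signal arrived: withdraw the in-flight request. */
                    demo_cancel_lock_request(fl);
                    return -ERESTARTSYS;
            }
            return 0;
    }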
     
  • Call this the "inode_lock" now, since it covers both data and meta data.
    This patch makes no functional changes.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • The meta lock now covers both meta data and data, so this just removes the
    now-redundant data lock.

    Combining locks saves us a round of lock mastery per inode and one less lock
    to ping between nodes during read/write.

    We don't lose much: since meta locks were always held before a data lock
    (and at the same level), ordered writeout mode (the default) ensured that
    flushing for the meta data lock also pushed out data anyway.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • The node maps that are set/unset by these votes are no longer relevant, thus
    we can remove the mount and umount votes. Since those are the last two
    remaining votes, we can also remove the entire vote infrastructure.

    The vote thread has been renamed to the downconvert thread, and the small
    amount of functionality related to managing it has been moved into
    fs/ocfs2/dlmglue.c. All references to votes have been removed or updated.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

13 Oct, 2007

1 commit

  • Add the disk, network and memory structures needed to support data in inode.

    Struct ocfs2_inline_data is defined and embedded in ocfs2_dinode for storing
    inline data.

    A new inode field, i_dyn_features, is added to facilitate tracking of
    dynamic inode state. Since it will be used often, we want to mirror it in
    ocfs2_inode_info and transfer it via the meta data lvb (an approximate
    layout sketch follows this entry).

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
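    An approximate, illustrative layout in the spirit of what the commit
    describes (not the exact on-disk definition; names and the feature bit
    are hypothetical):

    #include <linux/types.h>

    #define DEMO_INLINE_DATA_FL     0x0001  /* hypothetical i_dyn_features bit */

    struct demo_inline_data {
            __le16  id_count;       /* bytes of inline data that fit here */
            __le16  id_reserved;
            __u8    id_data[];      /* the file contents themselves */
    };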
     

03 May, 2007

1 commit

  • This patch makes the following needlessly global functions static:
    - aops.c: ocfs2_write_data_page()
    - dlmglue.c: ocfs2_dump_meta_lvb_info()
    - file.c: ocfs2_set_inode_size()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Mark Fasheh

    Adrian Bunk
     

27 Apr, 2007

1 commit

  • Ocfs2 currently does cluster-wide node messaging to check the open state of
    an inode during delete. This patch removes that mechanism in favor of an
    inode cluster lock which is taken at shared read when an inode is first read
    and dropped in clear_inode(). This allows a deleting node to test the
    liveness of an inode by attempting to take an exclusive lock (see the
    sketch after this entry).

    Signed-off-by: Tiger Yang
    Signed-off-by: Mark Fasheh

    Tiger Yang
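    A minimal sketch of that liveness test, assuming a hypothetical
    non-blocking "try exclusive" helper (this is not the real ocfs2 open-lock
    API):

    #include <linux/fs.h>

    int demo_try_open_lock_ex(struct inode *inode);  /* stand-in: 0 on success */

    static int demo_inode_in_use_elsewhere(struct inode *inode)
    {
            /*
             * Every node holds the open lock at a shared level while the
             * inode is in memory, so an immediately granted EX means no
             * other node still has it open.
             */
            return demo_try_open_lock_ex(inode) != 0;
    }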
     

02 Dec, 2006

3 commits


25 Sep, 2006

4 commits

  • OCFS2 puts inode meta data in the "lock value block" provided by the DLM.
    Typically, i_generation is encoded in the lock name so that a deleted inode
    and a new one in the same block don't share the same lvb.

    Unfortunately, that scheme means that the read in ocfs2_read_locked_inode()
    is potentially thrown away as soon as the meta data lock is taken - we
    cannot encode the lock name without first knowing i_generation, which
    requires a disk read.

    This patch encodes i_generation in the inode meta data lvb and removes the
    value from the inode meta data lock name. This way, the read can be covered
    by a lock, and at the same time we can distinguish between an up-to-date and
    a stale LVB (see the sketch after this entry).

    This will help cold-cache stat(2) performance in particular.

    Since this patch changes the protocol version, we take the opportunity to do
    a minor re-organization of two of the LVB fields.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
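    An approximate, illustrative sketch of the idea (the layout and names are
    hypothetical, not the real ocfs2 meta lvb):

    #include <linux/types.h>
    #include <asm/byteorder.h>

    struct demo_meta_lvb {
            __u8    lvb_version;            /* bumped when the layout changes */
            __u8    lvb_reserved[3];
            __be32  lvb_igeneration;        /* generation of the cached inode */
            __be64  lvb_isize;
            /* ... remaining cached inode fields ... */
    };

    /*
     * Usable only if the layout version matches what this kernel speaks and
     * the generation matches the inode actually read from disk; otherwise
     * the LVB is stale.
     */
    static int demo_lvb_is_usable(const struct demo_meta_lvb *lvb,
                                  __u8 expected_version, u32 i_generation)
    {
            return lvb->lvb_version == expected_version &&
                   be32_to_cpu(lvb->lvb_igeneration) == i_generation;
    }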
     
  • When i_generation is removed from the lockname, this will help us determine
    whether a meta data lvb has information that is in sync with the local
    struct inode.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • lvb_version doesn't need to be a whole 32 bits. Make it an 8 bit field to
    free up some space. This should be backwards compatible until we use one of
    the fields, in which case we'd bump the lvb version anyway.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Replace the dentry vote mechanism with a cluster lock which covers a set
    of dentries. This allows us to force d_delete() only on nodes which actually
    care about an unlink.

    Every node that does a ->lookup() gets a read-only lock on the dentry.
    During an unlink, the unlinking node requests an exclusive lock, forcing
    the other nodes that care about that dentry to d_delete() it (see the
    sketch after this entry). The effect is that we retain a very lightweight
    ->d_revalidate() and at the same time make large improvements to the
    average-case performance of the ocfs2 unlink and rename operations.

    This patch adds the cluster lock type which OCFS2 can attach to
    dentries. A small number of fs/ocfs2/dcache.c functions are stubbed
    out so that this change can compile.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
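    A minimal sketch of the downconvert side of this scheme, with hypothetical
    names (only the d_delete() call is a real kernel API):

    #include <linux/dcache.h>

    /*
     * Called when another node requests the exclusive level of the dentry
     * lock, i.e. it is unlinking the name.  Only nodes that actually have
     * the dentry cached reach this path, which is what keeps the common
     * case cheap.
     */
    static void demo_dentry_downconvert(struct dentry *dentry)
    {
            /* Unhash the now-stale name; a later lookup hits the cluster. */
            d_delete(dentry);
    }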
     

21 Sep, 2006

1 commit


04 Jan, 2006

1 commit