08 May, 2013

1 commit

  • This removes the retry-based AIO infrastructure now that nothing in tree
    is using it.

    We want to remove retry-based AIO because it is fundemantally unsafe.
    It retries IO submission from a kernel thread that has only assumed the
    mm of the submitting task. All other task_struct references in the IO
    submission path will see the kernel thread, not the submitting task.
    This design flaw means that nothing of any meaningful complexity can use
    retry-based AIO.

    This removes all the code and data associated with the retry machinery.
    The most significant benefit of this is the removal of the locking
    around the unused run list in the submission path.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kent Overstreet
    Signed-off-by: Zach Brown
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zach Brown
     

26 Feb, 2013

1 commit

  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhnacements to the user
    namespace. reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before it they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
     

22 Feb, 2013

1 commit


13 Feb, 2013

1 commit


04 Jul, 2012

2 commits

  • When ocfs2dc thread holds dc_task_lock spinlock and receives soft IRQ it
    deadlock itself trying to get same spinlock in ocfs2_wake_downconvert_thread.
    Below is the stack snippet.

    The patch disables interrupts when acquiring dc_task_lock spinlock.

    ocfs2_wake_downconvert_thread
    ocfs2_rw_unlock
    ocfs2_dio_end_io
    dio_complete
    .....
    bio_endio
    req_bio_endio
    ....
    scsi_io_completion
    blk_done_softirq
    __do_softirq
    do_softirq
    irq_exit
    do_IRQ
    ocfs2_downconvert_thread
    [kthread]

    Signed-off-by: Srinivas Eeda
    Signed-off-by: Joel Becker

    Srinivas Eeda
     
  • Fix misplaced parentheses

    Signed-off-by: Roel Kluin
    Signed-off-by: Joel Becker

    roel
     

02 Dec, 2011

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (31 commits)
    ocfs2: avoid unaligned access to dqc_bitmap
    ocfs2: Use filemap_write_and_wait() instead of write_inode_now()
    ocfs2: honor O_(D)SYNC flag in fallocate
    ocfs2: Add a missing journal credit in ocfs2_link_credits() -v2
    ocfs2: send correct UUID to cleancache initialization
    ocfs2: Commit transactions in error cases -v2
    ocfs2: make direntry invalid when deleting it
    fs/ocfs2/dlm/dlmlock.c: free kmem_cache_zalloc'd data using kmem_cache_free
    ocfs2: Avoid livelock in ocfs2_readpage()
    ocfs2: serialize unaligned aio
    ocfs2: Implement llseek()
    ocfs2: Fix ocfs2_page_mkwrite()
    ocfs2: Add comment about orphan scanning
    ocfs2: Clean up messages in the fs
    ocfs2/cluster: Cluster up now includes network connections too
    ocfs2/cluster: Add new function o2net_fill_node_map()
    ocfs2/cluster: Fix output in file elapsed_time_in_ms
    ocfs2/dlm: dlmlock_remote() needs to account for remastery
    ocfs2/dlm: Take inflight reference count for remotely mastered resources too
    ocfs2/dlm: Cleanup dlm_wait_for_node_death() and dlm_wait_for_node_recovery()
    ...

    Linus Torvalds
     

02 Nov, 2011

1 commit


01 Jun, 2011

1 commit

  • ocfs2 cannot currently mount a device that is readonly at the media
    ("hard readonly"). Fix the broken places.
    see detail: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1322

    [ Description edited -- Joel ]

    Signed-off-by: Tiger Yang
    Reviewed-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Tiger Yang
     

29 Mar, 2011

1 commit


07 Mar, 2011

1 commit

  • mlog_exit is used to record the exit status of a function.
    But because it is added in so many functions, if we enable it,
    the system logs get filled up quickly and cause too much I/O.
    So actually no one can open it for a production system or even
    for a test.

    This patch just try to remove it or change it. So:
    1. if all the error paths already use mlog_errno, it is just removed.
    Otherwise, it will be replaced by mlog_errno.
    2. if it is used to print some return value, it is replaced with
    mlog(0,...).
    mlog_exit_ptr is changed to mlog(0.
    All those mlog(0,...) will be replaced with trace events later.

    Signed-off-by: Tao Ma

    Tao Ma
     

21 Feb, 2011

1 commit

  • ENTRY is used to record the entry of a function.
    But because it is added in so many functions, if we enable it,
    the system logs get filled up quickly and cause too much I/O.
    So actually no one can open it for a production system or even
    for a test.

    So for mlog_entry_void, we just remove it.
    for mlog_entry(...), we replace it with mlog(0,...), and they
    will be replace by trace event later.

    Signed-off-by: Tao Ma

    Tao Ma
     

20 Feb, 2011

1 commit

  • Patch makes use of the hrtimer to track times in ocfs2 lock stats.

    The patch is a bit involved to ensure no additional impact on the memory
    footprint. The size of ocfs2_inode_cache remains 1280 bytes on 32-bit systems.

    A related change was to modify the unit of the max wait time from nanosec to
    microsec allowing us to track max time larger than 4 secs. This change
    necessitated the bumping of the output version in the debugfs file,
    locking_state, from 2 to 3.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     

11 Sep, 2010

1 commit

  • Track negative dentries by recording the generation number of the parent
    directory in d_fsdata. The generation number for the parent directory is
    recorded in the inode_info, which increments every time the lock on the
    directory is dropped.

    If the generation number of the parent directory and the negative dentry
    matches, there is no need to perform the revalidate, else a revalidate
    is forced. This improves performance in situations where nodes look for
    the same non-existent file multiple times.

    Thanks Mark for explaining the DLM sequence.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Joel Becker

    Goldwyn Rodrigues
     

20 Jul, 2010

1 commit


22 May, 2010

1 commit


08 Mar, 2010

1 commit


28 Feb, 2010

1 commit


27 Feb, 2010

3 commits

  • Inside the stackglue, the locking protocol structure is hanging off of
    the ocfs2_cluster_connection. This takes it one further; the locking
    protocol is passed into ocfs2_cluster_connect(). Now different cluster
    connections can have different locking protocols with distinct asts.
    Note that all locking protocols have to keep their maximum protocol
    version in lock-step.

    With the protocol structure set in ocfs2_cluster_connect(), there is no
    need for the stackglue to have a static pointer to a specific protocol
    structure. We can change initialization to only pass in the maximum
    protocol version.

    Signed-off-by: Joel Becker

    Joel Becker
     
  • We're going to want it in the ast functions, so we convert union
    ocfs2_dlm_lksb to struct ocfs2_dlm_lksb and let it carry the connection.

    Signed-off-by: Joel Becker

    Joel Becker
     
  • The stackglue ast and bast functions tried to maintain the fiction that
    their arguments were void pointers. In reality, stack_user.c had to
    know that the argument was an ocfs2_lock_res in order to get the status
    off of the lksb. That's ugly.

    This changes stackglue to always pass the lksb as the argument to ast
    and bast functions. The caller can always use container_of() to get the
    ocfs2_lock_res or user_dlm_lock_res. The net effect to the caller is
    zero. They still get back the lockres in their ast. stackglue gets
    cleaner, and now can use the lksb itself.

    Signed-off-by: Joel Becker

    Joel Becker
     

09 Feb, 2010

1 commit

  • In particular, several occurances of funny versions of 'success',
    'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
    'beginning', 'desirable', 'separate' and 'necessary' are fixed.

    Signed-off-by: Daniel Mack
    Cc: Joe Perches
    Cc: Junio C Hamano
    Signed-off-by: Jiri Kosina

    Daniel Mack
     

04 Feb, 2010

1 commit


03 Feb, 2010

4 commits

  • During blocked lock processing, we should consider the possibility that the
    lock is no longer blocking.

    Joel Becker assisted in fixing this issue.

    Reported-by: David Teigland
    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     
  • During upconvert, if the master were to send a BAST, dlmglue will detect the
    upconversion in process and send a cancel convert to the master. Upon receiving
    the AST for the cancel convert, it will re-process the lock resource to determine
    whether it needs downconverting. Say, the up was from PR to EX and the BAST was
    for EX. After the cancel convert, it will need to downconvert to NL.

    However, if the node was originally upconverting from NL to EX, then there would
    be no reason to downconvert (assuming the same message sequence).

    This patch makes dlmglue consider the possibility that the current lock level
    is already compatible and that downconverting is not required.

    Joel Becker assisted in fixing this issue.

    Fixes ossbz#1178
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1178

    Reported-by: Coly Li
    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     
  • There is possibility of a livelock in __ocfs2_cluster_lock(). If a node were
    to get an ast for an upconvert request, followed immediately by a bast,
    there is a small window where the fs may downconvert the lock before the
    process requesting the upconvert is able to take the lock.

    This patch adds a new flag to indicate that the upconvert is still in
    progress and that the dc thread should not downconvert it right now.

    Wengang Wang and Joel Becker
    contributed heavily to this patch.

    Reported-by: David Teigland
    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     
  • During bast, set the OCFS2_LOCK_BLOCKED flag only if the lock needs to
    downconverted.

    Signed-off-by: Wengang Wang
    Acked-by: Sunil Mushran
    Acked-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Wengang Wang
     

26 Jan, 2010

1 commit


04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping"
    , "access", "default", "reasonable", "[con]currently", "temperature"
    , "channel", "[un]used", "application", "example","hierarchy", "therefore"
    , "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa
     

23 Sep, 2009

3 commits


05 Sep, 2009

2 commits

  • The next step in divorcing metadata I/O management from struct inode is
    to pass struct ocfs2_caching_info to the journal functions. Thus the
    journal locks a metadata cache with the cache io_lock function. It also
    can compare ci_last_trans and ci_created_trans directly.

    This is a large patch because of all the places we change
    ocfs2_journal_access..(handle, inode, ...) to
    ocfs2_journal_access..(handle, INODE_CACHE(inode), ...).

    Signed-off-by: Joel Becker

    Joel Becker
     
  • We are really passing the inode into the ocfs2_read/write_blocks()
    functions to get at the metadata cache. This commit passes the cache
    directly into the metadata block functions, divorcing them from the
    inode.

    Signed-off-by: Joel Becker

    Joel Becker
     

23 Jun, 2009

4 commits

  • Add lockdep support to OCFS2. The support also covers all of the cluster
    locks except for open locks, journal locks, and local quotafile locks. These
    are special because they are acquired for a node, not for a particular process
    and lockdep cannot deal with such type of locking.

    Signed-off-by: Jan Kara
    Signed-off-by: Joel Becker

    Jan Kara
     
  • Local and Hard-RO mounts do not need orphan scanning.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     
  • We don't access the LVB in our ocfs2_*_lock_res_init() functions.

    Since the LVB can become invalid during some cluster recovery
    operations, the dlmglue must be able to handle an uninitialized
    LVB.

    For the orphan scan lock, we initialized an uninitialzed LVB with our
    scan sequence number plus one. This starts a normal orphan scan
    cycle.

    Signed-off-by: Sunil Mushran
    Signed-off-by: Joel Becker

    Sunil Mushran
     
  • The Lock Value Block (LVB) of a DLM lock can be lost when nodes die and
    the DLM cannot reconstruct its state. Clients of the DLM need to know
    this.

    ocfs2's internal DLM, o2dlm, explicitly zeroes out the LVB when it loses
    track of the state. This is not a standard behavior, but ocfs2 has
    always relied on it. Thus, an o2dlm LVB is always "valid".

    ocfs2 now supports both o2dlm and fs/dlm via the stack glue. When
    fs/dlm loses track of an LVBs state, it sets a flag
    (DLM_SBF_VALNOTVALID) on the Lock Status Block (LKSB). The contents of
    the LVB may be garbage or merely stale.

    ocfs2 doesn't want to try to guess at the validity of the stale LVB.
    Instead, it should be checking the VALNOTVALID flag. As this is the
    'standard' way of treating LVBs, we will promote this behavior.

    We add a stack glue API ocfs2_dlm_lvb_valid(). It returns non-zero when
    the LVB is valid. o2dlm will always return valid, while fs/dlm will
    check VALNOTVALID.

    Signed-off-by: Joel Becker
    Acked-by: Mark Fasheh

    Joel Becker
     

04 Jun, 2009

1 commit

  • When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
    before moving the dentry to the orphan directory. Other nodes that have
    this dentry in cache have a PR on the same dentry lock. When the EX is
    requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
    during downconvert. The inode is finally deleted when the last node to iput
    the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.

    A problem arises if a node is forced to free dentry locks because of memory
    pressure. If this happens, the node will no longer get downconvert
    notifications for the dentries that have been unlinked on another node.
    If it also happens that node is actively using the corresponding inode and
    happens to be the one performing the last iput on that inode, it will fail
    to delete the inode as it will not have the MAYBE_ORPHANED flag set.

    This patch fixes this shortcoming by introducing a periodic scan of the
    orphan directories to delete such inodes. Care has been taken to distribute
    the workload across the cluster so that no one node has to perform the task
    all the time.

    Signed-off-by: Srinivas Eeda
    Signed-off-by: Joel Becker

    Srinivas Eeda
     

04 Apr, 2009

1 commit

  • For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
    ocfs2_get_dentry() may read from disk when the inode is not in memory,
    without any cross cluster lock. this leads to the file system loading a
    stale inode.

    This patch fixes above problem.

    Solution is that in case of inode is not in memory, we get the cluster
    lock(PR) of alloc inode where the inode in question is allocated from (this
    causes node on which deletion is done sync the alloc inode) before reading
    out the inode itsself. then we check the bitmap in the group (the inode in
    question allcated from) to see if the bit is clear. if it's clear then it's
    stale. if the bit is set, we then check generation as the existing code
    does.

    We have to read out the inode in question from disk first to know its alloc
    slot and allot bit. And if its not stale we read it out using ocfs2_iget().
    The second read should then be from cache.

    And also we have to add a per superblock nfs_sync_lock to cover the lock for
    alloc inode and that for inode in question. this is because ocfs2_get_dentry()
    and ocfs2_delete_inode() lock on them in reverse order. nfs_sync_lock is locked
    in EX mode in ocfs2_get_dentry() and in PR mode in ocfs2_delete_inode(). so
    that mutliple ocfs2_delete_inode() can run concurrently in normal case.

    [mfasheh@suse.com: build warning fixes and comment cleanups]
    Signed-off-by: Wengang Wang
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    wengang wang