13 Dec, 2016

1 commit

  • CURRENT_TIME is not y2038 safe.

    Use y2038 safe ktime_get_real_seconds() here for timestamps. struct
    heartbeat_block's hb_seq and deletion time are already 64 bits wide
    and accommodate times beyond y2038.

    Also use y2038 safe ktime_get_real_ts64() for on disk inode timestamps.
    These are also wide enough to accommodate time64_t.

    Link: http://lkml.kernel.org/r/1475365298-29236-1-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
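
The limit that motivates this change can be illustrated in user space. The sketch below is not kernel code: fits_in_time32() and store_time32() are hypothetical helpers, used only to show that a second count past 2038-01-19 03:14:07 UTC no longer fits in a signed 32-bit field, while a 64-bit field still holds it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if 'secs' fits in a signed 32-bit time field.
 * INT32_MAX seconds after the epoch is 2038-01-19 03:14:07 UTC. */
bool fits_in_time32(int64_t secs)
{
    return secs >= INT32_MIN && secs <= INT32_MAX;
}

/* Hypothetical helper: narrow to 32 bits; the caller must have
 * checked the range first, which the assert enforces. */
int32_t store_time32(int64_t secs)
{
    assert(fits_in_time32(secs));
    return (int32_t)secs;
}
```

A 64-bit field needs no such range check for any realistic wall-clock time, which is why the commit can switch the time source without changing the on-disk layout.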
     

11 Oct, 2016

2 commits

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     

08 Oct, 2016

1 commit


28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

27 Sep, 2016

2 commits

  • Generated patch:

    sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
    sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • This is trivial to do:

    - add flags argument to foo_rename()
    - check if flags is zero
    - assign foo_rename() to .rename2 instead of .rename

    This doesn't mean it's impossible to support RENAME_NOREPLACE for these
    filesystems, but it is not trivial, like for local filesystems.
    RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
    for a file to be created on one host while it is overwritten by rename on
    another host).

    Filesystems converted:

    9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.

    After this, we can get rid of the duplicate interfaces for rename.

    Signed-off-by: Miklos Szeredi
    Acked-by: Greg Kroah-Hartman
    Acked-by: David Howells [AFS]
    Acked-by: Mike Marshall
    Cc: Eric Van Hensbergen
    Cc: Ilya Dryomov
    Cc: Jan Harkes
    Cc: Tyler Hicks
    Cc: Oleg Drokin
    Cc: Trond Myklebust
    Cc: Mark Fasheh

    Miklos Szeredi
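
The three conversion steps listed in this commit can be sketched in user space as follows. This is a model, not the kernel interface: the string arguments, foo_rename_noflags(), and MODEL_RENAME_NOREPLACE are stand-ins invented for illustration, and returning -EINVAL for unsupported flags is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the kernel's RENAME_NOREPLACE flag (value assumed). */
#define MODEL_RENAME_NOREPLACE (1u << 0)

/* Hypothetical pre-existing rename body that knows nothing about flags. */
static int foo_rename_noflags(const char *old_name, const char *new_name)
{
    (void)old_name;
    (void)new_name;
    return 0;
}

/* Converted entry point: gains a flags argument and accepts only flags == 0. */
int foo_rename(const char *old_name, const char *new_name, unsigned int flags)
{
    assert(old_name != NULL && new_name != NULL);
    if (flags)
        return -EINVAL; /* no atomic NOREPLACE guarantee available */
    return foo_rename_noflags(old_name, new_name);
}
```

Rejecting every non-zero flag keeps the old behaviour for plain renames while making it explicit that the filesystem does not claim the atomicity RENAME_NOREPLACE requires.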
     

13 May, 2016

1 commit

  • Commit 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
    refactored code to use posix_acl_create. The problem with this function
    is that it is not mindful of the cluster wide inode lock making it
    unsuitable for use with ocfs2 inode creation with ACLs. For example,
    when used in ocfs2_mknod, this function can cause deadlock as follows.
    The parent dir inode lock is taken when calling posix_acl_create ->
    get_acl -> ocfs2_iop_get_acl which takes the inode lock again. This can
    cause deadlock if there is a blocked remote lock request waiting for the
    lock to be downconverted. The same deadlock happens in ocfs2_reflink.
    The fix is to revert to using ocfs2_init_acl.

    Fixes: 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
    Signed-off-by: Tariq Saeed
    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
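
The wrapper pattern described in this commit can be sketched with a toy lock. Everything here is a user-space model: struct model_mutex is a single-threaded stand-in for a real mutex, and only the shape of the wrappers (inode_foo(inode) forwarding to the lock embedded in the inode) follows the commit text.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for a mutex (assumption of this sketch). */
struct model_mutex { bool held; };

struct model_inode { struct model_mutex i_mutex; };

/* Wrappers: callers name the inode, not the lock field inside it. */
void inode_lock(struct model_inode *inode)
{
    assert(!inode->i_mutex.held); /* a real mutex would block here */
    inode->i_mutex.held = true;
}

void inode_unlock(struct model_inode *inode)
{
    assert(inode->i_mutex.held);
    inode->i_mutex.held = false;
}

/* Returns true if the lock was acquired, false if it was already held. */
bool inode_trylock(struct model_inode *inode)
{
    if (inode->i_mutex.held)
        return false;
    inode->i_mutex.held = true;
    return true;
}

bool inode_is_locked(const struct model_inode *inode)
{
    return inode->i_mutex.held;
}
```

Because callers never touch the field directly, the lock type behind the wrappers can later change (for example to an rwsem, as the commit anticipates) without editing every call site.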
     

15 Jan, 2016

2 commits

  • In ocfs2_orphan_del, the code currently finds and deletes the entry first,
    and then accesses the orphan dir dinode. This is a problem if
    ocfs2_journal_access_di fails: the entry is removed from the orphan dir,
    but the inode has not actually been deleted. In other words, the file is
    missing but not actually deleted. So we should access the orphan dinode
    first, as unlink and rename do.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Reviewed-by: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Since iput() takes care of the NULL check itself, a NULL check before
    calling it is redundant. So clean them up.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
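
The cleanup relies on the release function tolerating NULL. A minimal user-space sketch of that contract, with struct refcounted_inode and model_iput() as hypothetical stand-ins for the kernel types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode with a reference count. */
struct refcounted_inode { int i_count; };

/* Drops one reference; a NULL argument is a no-op, so callers
 * do not need their own "if (inode)" guard. */
void model_iput(struct refcounted_inode *inode)
{
    if (!inode)
        return;
    assert(inode->i_count > 0);
    inode->i_count--;
}
```

With this contract, an error path can call model_iput(inode) unconditionally, whether or not the inode was ever obtained.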
     

12 Jan, 2016

1 commit

  • Pull vfs RCU symlink updates from Al Viro:
    "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
    even if the symlink is not an embedded one.

    No changes since the mailbomb on Jan 1"

    * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->get_link() to delayed_call, kill ->put_link()
    kill free_page_put_link()
    teach nfs_get_link() to work in RCU mode
    teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
    teach shmem_get_link() to work in RCU mode
    teach page_get_link() to work in RCU mode
    replace ->follow_link() with new method that could stay in RCU mode
    don't put symlink bodies in pagecache into highmem
    namei: page_getlink() and page_follow_link_light() are the same thing
    ufs: get rid of ->setattr() for symlinks
    udf: don't duplicate page_symlink_inode_operations
    logfs: don't duplicate page_symlink_inode_operations
    switch befs long symlinks to page_symlink_operations

    Linus Torvalds
     

13 Dec, 2015

1 commit

  • Commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an
    issue: the SGID bit of a subdirectory was not inherited from its parent
    directory. This is because SGID is set in "inode->i_mode" in
    ocfs2_get_init_inode(), but is later overwritten by "mode", which does
    not have SGID set.

    Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue")
    Signed-off-by: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Acked-by: Srinivas Eeda
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     

09 Dec, 2015

1 commit

  • kmap() in page_follow_link_light() needed to go - allowing an
    arbitrary number of kmaps to be held for a long time is a great way
    to deadlock the system.

    A new helper (inode_nohighmem(inode)) needs to be used for pagecache
    symlink inodes; done for all in-tree cases. page_follow_link_light()
    is instrumented to yell about anything missed.

    Signed-off-by: Al Viro

    Al Viro
     

21 Nov, 2015

1 commit

  • A newly created file's mode is not masked with umask, which makes umask
    ineffective on ocfs2 volumes.

    Fixes: 702e5bc ("ocfs2: use generic posix ACL infrastructure")
    Signed-off-by: Junxiao Bi
    Cc: Gang He
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     

06 Nov, 2015

2 commits

  • In ocfs2_mknod_locked, if '__ocfs2_mknod_locked' returns an error, we
    should reclaim the inode successfully claimed above; otherwise, the
    inode will never be reused. The case is described below:

    ocfs2_mknod
      ocfs2_mknod_locked
        ocfs2_claim_new_inode
          Successfully claim the inode
        __ocfs2_mknod_locked
          ocfs2_journal_access_di
            Failed because of -ENOMEM or other reasons, the inode
            lockres has not been initialized yet.

        iput(inode)
          ocfs2_evict_inode
            ocfs2_delete_inode
              ocfs2_inode_lock
                ocfs2_inode_lock_full_nested
                  __ocfs2_cluster_lock
                    Return -EINVAL because the inode
                    lockres has not been initialized.

              So the following operations are not performed
              ocfs2_wipe_inode
                ocfs2_remove_inode
                  ocfs2_free_dinode
                    ocfs2_free_suballoc_bits

    Signed-off-by: Alex Chen
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    alex chen
     
  • dio entry will only do truncate in case of ORPHAN_NEED_TRUNCATE. So do
    not include it when doing normal orphan scan to reduce contention.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

05 Sep, 2015

4 commits

  • When running dirop_fileop_racer we found a case in which an inode
    cannot be removed.

    Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
    two dirs /race/1/ and /race/2/ in the filesystem.

    Node A                          Node B
    rm -r /race/2/
                                    mv /race/1/ /race/2/
    call ocfs2_unlink(), get
    the EX mode of /race/2/
                                    wait for B unlock /race/2/
    decrease i_nlink of /race/2/ to 0,
    and add inode of /race/2/ into
    orphan dir, unlock /race/2/
                                    got EX mode of /race/2/. because
                                    /race/1/ is dir, so inc i_nlink
                                    of /race/2/ and update into disk,
                                    unlock /race/2/
    because i_nlink of /race/2/
    is not zero, this inode will
    always remain in orphan dir

    This patch fixes this case by testing whether i_nlink of the new dir is zero.

    Signed-off-by: Yiwen Jiang
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Cc: Xue jiufei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang
     
  • In ocfs2_rename, a failure of ocfs2_delete_entry(old) leaves an inode with
    two entries (old and new). The filesystem is then inconsistent.

    The case is described below:

    ocfs2_rename
    -> ocfs2_start_trans
    -> ocfs2_add_entry(new)
    -> ocfs2_delete_entry(old)
    -> __ocfs2_journal_access *failed* because of -ENOMEM
    -> ocfs2_commit_trans

    So the filesystem should be set read-only at that point.

    Signed-off-by: Yiwen Jiang
    Cc: Joseph Qi
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • The unlocking order in ocfs2_unlink and ocfs2_rename does not match the
    corresponding locking order. Although this won't cause issues, adjust the
    code so that it looks more reasonable.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • During direct io the inode will be added to orphan first and then
    deleted from orphan. There is a race window that the orphan entry will
    be deleted twice and thus trigger the BUG when validating
    OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.

    ocfs2_direct_IO_write
     ...
     ocfs2_add_inode_to_orphan
     >>>>>>>> race window.
              1) another node may rm the file and then down, this node
              take care of orphan recovery and clear flag
              OCFS2_DIO_ORPHANED_FL.
              2) since rw lock is unlocked, it may race with another
              orphan recovery and append dio.
     ocfs2_del_inode_from_orphan

    So take inode mutex lock when recovering orphans and make rw unlock at the
    end of aio write in case of append dio.

    Signed-off-by: Joseph Qi
    Reported-by: Yiwen Jiang
    Cc: Weiwei Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

24 Jul, 2015

1 commit


25 Jun, 2015

2 commits

  • Use kernel.h macro definition.

    Thanks to Julia Lawall for Coccinelle scripting support.

    Signed-off-by: Fabian Frederick
    Cc: Julia Lawall
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • If dio crashes, it leaves an entry in the orphan dir, and orphan scan
    takes care of the cleanup. There is a tiny race in which the same
    entry is truncated twice, which then triggers the BUG in
    ocfs2_del_inode_from_orphan.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

16 Apr, 2015

1 commit


15 Apr, 2015

2 commits


17 Feb, 2015

2 commits

  • If one node has crashed leaving an orphan entry behind, another node that
    does an append O_DIRECT write to the same file will overwrite
    i_dio_orphaned_slot, and the old entry will then never be cleaned up. If
    this case happens, we let it wait for orphan recovery first.

    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • Add functions to add an inode to the orphan dir and to remove an inode
    from the orphan dir. Here we do not call ocfs2_prepare_orphan_dir and
    ocfs2_orphan_add directly, because append O_DIRECT would add the inode to
    the orphan dir twice and could result in more than one orphan entry for
    the same inode.

    [akpm@linux-foundation.org: avoid dynamic stack allocation]
    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Junxiao Bi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

09 Jan, 2015

1 commit

  • In ocfs2_link(), the parent directory inode passed to function
    ocfs2_lookup_ino_from_name() is wrong. Parameter dir is the parent of
    new_dentry, not old_dentry. We should get old_dir from old_dentry and
    look up old_dentry in old_dir in case another node removes the old dentry.

    With this change, hard linking works again when paths are relative with
    at least one subdirectory. This is how the problem was reproducible:

    # mkdir a
    # mkdir b
    # touch a/test
    # ln a/test b/test
    ln: failed to create hard link `b/test' => `a/test': No such file or directory

    However when creating links in the same dir, it worked well.

    Now the link gets created.

    Fixes: 0e048316ff57 ("ocfs2: check existence of old dentry in ocfs2_link()")
    Signed-off-by: joyce.xue
    Reported-by: Szabo Aron - UBIT
    Cc: Mark Fasheh
    Cc: Joel Becker
    Tested-by: Aron Szabo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     

30 Oct, 2014

1 commit

  • d_splice_alias() can return a valid dentry, NULL or an ERR_PTR.
    Currently the code does not check for ERR_PTR and will cause an oops in
    ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL().

    Signed-off-by: Richard Weinberger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
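
The three possible return values can be told apart with the error-pointer convention. The sketch below re-creates that convention in user space for illustration; the model_* names are hypothetical, and the 4095 bound mirrors the kernel's MAX_ERRNO as an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Error codes are encoded in the top 4095 values of the address space. */
#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
void *model_err_ptr(long err)
{
    assert(err < 0 && err >= -MODEL_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* True only for encoded errors; NULL is not an error pointer. */
bool model_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* True for both "nothing returned" and "error returned". */
bool model_is_err_or_null(const void *ptr)
{
    return ptr == NULL || model_is_err(ptr);
}
```

A check that tests only for NULL lets an encoded error through as if it were a valid object, which is the oops this commit describes; the combined check rejects both cases before the pointer is dereferenced.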
     

24 Jun, 2014

3 commits

  • When the call to ocfs2_add_entry() fails in ocfs2_symlink() or
    ocfs2_mknod(), iput() is not called during dput(dentry) because
    d_instantiate() was never called, and this leads to umount hanging.

    Signed-off-by: jiangyiwen
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    jiangyiwen
     
  • When running dirop_fileop_racer we found a deadlock case.

    Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
    /race/16/1 in the filesystem, and let the inode number of dir 16 be less
    than the inode number of dir race.

    Node A                          Node B
    mv /race/16/1 /race/
                                    right after Node A has got the
                                    EX mode of /race/16/, and tries to
                                    get EX mode of /race
                                    ls /race/16/

    In this case, Node A has got the EX mode of /race/16/, and wants to get EX
    mode of /race/. Node B has got the PR mode of /race/, and wants to get
    the PR mode of /race/16/. Since EX and PR are mutually exclusive, a
    deadlock occurs.

    This patch fixes this case by locking in ancestor order before trying
    inode number order.

    Signed-off-by: Yiwen Jiang
    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Reviewed-by: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang
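
The ordering rule from this commit ("ancestor order before inode number order") can be sketched as a small decision function. This is a user-space model with hypothetical names; the tie-breaker of locking the lower inode number first is an assumption chosen to match the scenario above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical directory node: inode number plus a link to its parent. */
struct model_dir {
    uint64_t ino;
    const struct model_dir *parent;
};

/* True if 'anc' is a proper ancestor of 'dir'. */
bool model_is_ancestor(const struct model_dir *anc, const struct model_dir *dir)
{
    for (dir = dir->parent; dir != NULL; dir = dir->parent)
        if (dir == anc)
            return true;
    return false;
}

/* Pick which of two directories to lock first. */
const struct model_dir *model_lock_first(const struct model_dir *a,
                                         const struct model_dir *b)
{
    if (model_is_ancestor(a, b))
        return a;                       /* ancestor wins */
    if (model_is_ancestor(b, a))
        return b;
    return a->ino <= b->ino ? a : b;    /* unrelated: lower inode number */
}
```

In the scenario above, /race is locked before /race/16 even though its inode number is higher, so the rename takes its locks in the same top-down order as the path walk done by ls.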
     
  • There are two files a and b in dir /mnt/ocfs2.

    node A                                  node B

    mv a b
    In ocfs2_rename(), after calling
    ocfs2_orphan_add(), the inode of
    file b will be added into orphan
    dir.

    If ocfs2_update_entry() fails,
    ocfs2_rename return error and mv
    operation fails. But file b still
    exists in the parent dir.

    ocfs2_queue_orphan_scan
     -> ocfs2_queue_recovery_completion
     -> ocfs2_complete_recovery
     -> ocfs2_recover_orphans
    The inode of the file b will be
    put with iput().

    ocfs2_evict_inode
     -> ocfs2_delete_inode
     -> ocfs2_wipe_inode
     -> ocfs2_remove_inode
    OCFS2_VALID_FL in the inode
    i_flags will be cleared.

                                            The file b still can be accessed
                                            on node B.
                                            ls /mnt/ocfs2
                                            When first read the file b with
                                            ocfs2_read_inode_block(). It will
                                            validate the inode using
                                            ocfs2_validate_inode_block().
                                            Because OCFS2_VALID_FL not set in
                                            the inode i_flags, so the file
                                            system will be readonly.

    So we should add inode into orphan dir after updating entry in
    ocfs2_rename().

    Signed-off-by: alex.chen
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    alex chen
     

04 Apr, 2014

4 commits

  • Ensure that ocfs2_update_inode_fsync_trans() is called any time we touch
    an inode in a given transaction. This is a follow-on to the previous
    patch to reduce lock contention and deadlocking during an fsync
    operation.

    Signed-off-by: Darrick J. Wong
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Wengang
    Cc: Greg Marsden
    Cc: Srinivas Eeda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • Commit 9548906b2bb7 ('xattr: Constify ->name member of "struct xattr"')
    missed that ocfs2 is calling kfree(xattr->name). As a result, a kernel
    panic occurs upon calling kfree(xattr->name) because xattr->name refers
    to static constant names. This patch removes kfree(xattr->name) from
    ocfs2_mknod() and ocfs2_symlink().

    Signed-off-by: Tetsuo Handa
    Reported-by: Tariq Saeed
    Tested-by: Tariq Saeed
    Reviewed-by: Srinivas Eeda
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • When ocfs2_create_new_inode_locks() returns an error, the inode open lock
    may not have been obtained for this inode. Other nodes can then remove the
    file and free the dinode while the inode still remains in memory on this
    node, which is not correct and may trigger a BUG. So
    __ocfs2_mknod_locked should return an error when
    ocfs2_create_new_inode_locks() fails.

    Node_1                                  Node_2
    create fileA, call ocfs2_mknod()
      -> ocfs2_get_init_inode(), allocate inodeA
      -> ocfs2_claim_new_inode(), claim dinode(dinodeA)
      -> call ocfs2_create_new_inode_locks(),
         create open lock failed, return error
      -> __ocfs2_mknod_locked return success

                                            unlink fileA
                                            try open lock succeed,
                                            and free dinodeA

    create another file, call ocfs2_mknod()
      -> ocfs2_get_init_inode(), allocate inodeB
      -> ocfs2_claim_new_inode(), as Node_2 had freed dinodeA,
         so claim dinodeA and update generation for dinodeA

    call __ocfs2_drop_dl_inodes()->ocfs2_delete_inode()
    to free inodeA, and finally triggers BUG
    on(inode->i_generation != le32_to_cpu(fe->i_generation))
    in function ocfs2_inode_lock_update().

    Signed-off-by: joyce.xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
    transaction to complete. This isn't terribly efficient, since sync_file
    really only needs to wait for the last transaction involving that inode
    to complete, and this doesn't require i_mutex.

    Therefore, implement the necessary bits to track the newest tid
    associated with an inode, and teach sync_file to wait for that instead
    of waiting for everything in the journal to commit. Furthermore, only
    issue the flush request to the drive if jbd2 hasn't already done so.

    This also eliminates the deadlock between ocfs2_file_aio_write() and
    ocfs2_sync_file(). aio_write takes i_mutex then calls
    ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
    However, if that dio completion involves calling fsync, then we can get
    into trouble when some ocfs2_sync_file tries to take i_mutex.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
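
The idea of waiting only for the newest transaction that touched an inode can be sketched as follows. This is a user-space model with hypothetical names (model_journal, tracked_inode, model_fsync_must_wait); it assumes 32-bit transaction ids that may wrap around, and it does not reproduce the jbd2 or ocfs2 interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t model_tid_t;

/* Journal state: the newest transaction id already committed to disk. */
struct model_journal { model_tid_t committed_tid; };

/* Inode state: the newest transaction id that modified this inode. */
struct tracked_inode { model_tid_t sync_tid; };

/* Wraparound-aware "x is newer than y" for 32-bit ids. */
bool model_tid_gt(model_tid_t x, model_tid_t y)
{
    return x != y && (model_tid_t)(x - y) < UINT32_C(0x80000000);
}

/* Called whenever a transaction touches the inode. */
void model_update_inode_tid(struct tracked_inode *inode, model_tid_t running_tid)
{
    inode->sync_tid = running_tid;
}

/* fsync has to wait only if the inode's newest transaction
 * has not been committed yet. */
bool model_fsync_must_wait(const struct model_journal *journal,
                           const struct tracked_inode *inode)
{
    return model_tid_gt(inode->sync_tid, journal->committed_tid);
}
```

An inode whose last change is already on disk returns immediately, instead of forcing every pending transaction in the journal to commit.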
     

11 Feb, 2014

1 commit

  • The linkat system call first calls user_path_at() to check the existence
    of the old dentry, and then calls vfs_link()->ocfs2_link() to do the
    actual work. A race may occur when Node A creates a hard link for a file
    while Node B removes it.

    Node A                          Node B
    user_path_at()
      ->ocfs2_lookup(),
      find old dentry exist
                                    rm file, add inode say inodeA
                                    to orphan_dir

    call ocfs2_link(),create a
    hard link for inodeA.

                                    rm the link, add inodeA to orphan_dir
                                    again

    When the orphan_scan work starts, it calls ocfs2_queue_orphans() to do the
    main work. It first traverses the entries in orphan_dir, linking all inodes
    in this orphan_dir into a list that looks like this:

    inodeA->inodeB->...->inodeA

    When traversing this list, it falls into a loop, calling iput() again
    and again, and finally triggers BUG_ON(inode->i_state & I_CLEAR).

    Signed-off-by: joyce
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei