26 Mar, 2016

2 commits

  • In the current implementation of unaligned aio+dio, the lock order is as
    follows:

    in user process context:
    -> call io_submit()
    -> get i_mutex
    <== window1
    -> get ip_unaligned_aio
    -> submit direct io to block device
    -> release i_mutex
    -> io_submit() return

    in dio work queue context (the work queue is created in __blockdev_direct_IO):
    -> release ip_unaligned_aio
    <== window2
    -> get i_mutex
    -> clear unwritten flag & change i_size
    -> release i_mutex

    The dio work queue has a limited number of threads, 256 by default. If
    all 256 threads are in the 'window2' stage above and a user process is
    in the 'window1' stage, the system will deadlock: the user process
    holds i_mutex while waiting for the ip_unaligned_aio lock, and the
    direct bio that holds the ip_unaligned_aio mutex is waiting for a dio
    work queue thread to be scheduled. But all the dio work queue threads
    are waiting for i_mutex in 'window2'.

    This case has only been seen in a test that sends a large number (more
    than 256) of aios in one io_submit() call.

    My design is to remove the ip_unaligned_aio lock and make unaligned aio
    dio a sync io instead. Like the ip_unaligned_aio lock, this serializes
    unaligned aio dio.

    [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
    Signed-off-by: Ryan Ding
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Ding
     
  • To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.

    There is still one issue in the direct write procedure.

    phase 1: alloc extent with UNWRITTEN flag
    phase 2: submit direct data to disk, add zero page to page cache
    phase 3: clear UNWRITTEN flag when data has been written to disk

    Suppose there are 2 direct writes, A (0~3KB) and B (4~7KB), to the same
    cluster 0~7KB (cluster size 8KB). Write request A reaches phase 2 first
    and zeroes the region (4~7KB). Before request A enters phase 3,
    request B reaches phase 2 and zeroes the region (0~3KB). In effect,
    request B steps on request A.

    To resolve this issue, request B must know that this cluster is already
    being zeroed, so that it does not step on the previous write request.

    This patch adds the function ocfs2_unwritten_check() to do this job. It
    records all clusters that are under direct write (in the
    'ip_unwritten_list' member of inode info) and prevents a later direct
    write to the same cluster from doing the zero work again.

    Signed-off-by: Ryan Ding
    Reviewed-by: Junxiao Bi
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryan Ding
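
The bookkeeping described in the ocfs2_unwritten_check() commit above can be sketched in user-space C. This is only an illustration under assumptions: the struct `unwritten_list`, the fixed-size array, and the helper names `unwritten_check()` and `unwritten_done()` are hypothetical and are not taken from the kernel's ocfs2_unwritten_check() or from the real layout of ip_unwritten_list. Locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 16  /* hypothetical bound, for the sketch only */

/* Clusters that have a direct write between phase 2 and phase 3. */
struct unwritten_list {
    uint32_t cpos[MAX_INFLIGHT];
    size_t count;
};

/*
 * Returns true if the caller is the first direct write to this cluster
 * and must zero the rest of it; returns false if an earlier write has
 * already taken care of the zeroing.
 */
bool unwritten_check(struct unwritten_list *ul, uint32_t cpos)
{
    for (size_t i = 0; i < ul->count; i++)
        if (ul->cpos[i] == cpos)
            return false;          /* cluster is already being zeroed */

    assert(ul->count < MAX_INFLIGHT);
    ul->cpos[ul->count++] = cpos;  /* record it for later writers */
    return true;
}

/* Phase 3: the UNWRITTEN flag has been cleared, so forget the cluster. */
void unwritten_done(struct unwritten_list *ul, uint32_t cpos)
{
    for (size_t i = 0; i < ul->count; i++) {
        if (ul->cpos[i] == cpos) {
            ul->cpos[i] = ul->cpos[--ul->count];  /* swap-remove */
            return;
        }
    }
}
```

With requests A and B from the example, A's call returns true (A zeroes 4~7KB) and B's call for the same cluster returns false, so B writes its data without zeroing 0~3KB.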
     

23 Mar, 2016

1 commit

  • Implement the online file check sysfile interfaces, e.g. how to create
    the related sysfile according to the device name, and how to
    display/handle file check requests from the sysfile.

    Signed-off-by: Gang He
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     

06 Nov, 2015

1 commit

  • There is no need to take the inode mutex, rw lock and inode lock when
    recovering orphans if the entry is not a dio entry. Optimize this by
    adding a flag OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to reduce
    contention.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

05 Sep, 2015

1 commit

  • During direct io the inode is first added to the orphan dir and then
    deleted from it. There is a race window in which the orphan entry can
    be deleted twice, which triggers the BUG when validating
    OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.

    ocfs2_direct_IO_write
    ...
    ocfs2_add_inode_to_orphan
    >>>>>>>> race window.
    1) another node may rm the file and then go down; this node
    takes care of orphan recovery and clears the flag
    OCFS2_DIO_ORPHANED_FL.
    2) since the rw lock is unlocked, it may race with another
    orphan recovery and append dio.
    ocfs2_del_inode_from_orphan

    So take the inode mutex when recovering orphans and, in the append dio
    case, do the rw unlock at the end of the aio write.

    Signed-off-by: Joseph Qi
    Reported-by: Yiwen Jiang
    Cc: Weiwei Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

17 Feb, 2015

1 commit

  • If one node has crashed leaving an orphan entry behind, another node that
    does an append O_DIRECT write to the same file will overwrite
    i_dio_orphaned_slot. The old entry will then never be cleaned up. If
    this happens, we let the write wait for orphan recovery first.

    Signed-off-by: Joseph Qi
    Cc: Weiwei Wang
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Mark Fasheh
    Cc: Xuejiufei
    Cc: alex chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

10 Nov, 2014

1 commit


10 Oct, 2014

1 commit

  • ocfs2_inode_info->ip_clusters and ocfs2_dinode->id1.bitmap1.i_total are
    defined as type u32, so left-shift operations on them may overflow if the
    volume is large, for example a 2TB volume with a 1MB cluster size.

    Signed-off-by: Joseph Qi
    Reviewed-by: Alex Chen
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
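
The overflow can be reproduced in user space. A 2TB volume with 1MB clusters has 2^21 clusters, and shifting that count left by 20 bits should give 2^41 bytes, which does not fit in 32 bits. The sketch below is illustrative only: the function names are hypothetical, and widening before the shift is shown as one common remedy, not necessarily the exact change made by the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy pattern: the shift is evaluated in 32 bits and wraps around. */
uint64_t clusters_to_bytes_narrow(uint32_t clusters, unsigned int cluster_bits)
{
    assert(cluster_bits < 32);
    uint32_t bytes = clusters << cluster_bits;  /* high bits are lost here */
    return bytes;
}

/* Safer pattern: widen to 64 bits first, then shift. */
uint64_t clusters_to_bytes_wide(uint32_t clusters, unsigned int cluster_bits)
{
    assert(cluster_bits < 32);
    return (uint64_t)clusters << cluster_bits;
}
```

For 2^21 clusters and 20 cluster bits, the narrow version yields 0 while the wide version yields 2^41.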
     

04 Apr, 2014

3 commits

  • The flag was never set, delete it.

    Signed-off-by: Jan Kara
    Reviewed-by: Mark Fasheh
    Reviewed-by: Srinivas Eeda
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
    transaction to complete. This isn't terribly efficient, since sync_file
    really only needs to wait for the last transaction involving that inode
    to complete, and this doesn't require i_mutex.

    Therefore, implement the necessary bits to track the newest tid
    associated with an inode, and teach sync_file to wait for that instead
    of waiting for everything in the journal to commit. Furthermore, only
    issue the flush request to the drive if jbd2 hasn't already done so.

    This also eliminates the deadlock between ocfs2_file_aio_write() and
    ocfs2_sync_file(). aio_write takes i_mutex then calls
    ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
    However, if that dio completion involves calling fsync, then we can get
    into trouble when some ocfs2_sync_file tries to take i_mutex.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • There is a problem that waitqueue_active() may check stale data and thus
    miss a wakeup of threads waiting on ip_unaligned_aio.

    The only valid values of ip_unaligned_aio are 0 and 1, so we can change
    it to a mutex, which avoids the above problem. Another benefit is that a
    mutex, which works as a FIFO, is fairer than wake_up_all().

    Signed-off-by: Wengang Wang
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wengang Wang
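
The idea in the ocfs2_sync_file commit above (wait only for the newest transaction that touched the inode) can be modelled in a few lines of user-space C. This is a sketch under assumptions: `struct inode_sync_state` and the helper names are hypothetical, and the wrap-safe comparison is written in the style of jbd2's tid_gt() rather than copied from it. No real journal is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tid_t;   /* transaction ID; may wrap around */

/* Wrap-safe "x is newer than y". */
bool tid_newer(tid_t x, tid_t y)
{
    return (int32_t)(x - y) > 0;
}

struct inode_sync_state {
    tid_t sync_tid;   /* newest transaction that touched this inode */
};

/* Record that transaction tid modified the inode. */
void inode_note_tid(struct inode_sync_state *st, tid_t tid)
{
    assert(st != 0);
    if (tid_newer(tid, st->sync_tid))
        st->sync_tid = tid;
}

/* fsync only has to wait if the inode's newest tid is not committed yet. */
bool sync_must_wait(const struct inode_sync_state *st, tid_t last_committed)
{
    assert(st != 0);
    return tid_newer(st->sync_tid, last_committed);
}
```

An inode last touched by transaction 7 has to wait while the journal has only committed up to 6, and not at all once 7 is committed, regardless of what other inodes are doing.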
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

28 Jul, 2011

1 commit

  • Fix a corruption that can happen when we have (two or more) outstanding
    aio's to an overlapping unaligned region. Ext4
    (e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d) and xfs recently had to fix
    similar issues.

    In our case what happens is that we can have an outstanding aio on a region
    and if a write comes in with some bytes overlapping the original aio we may
    decide to read that region into a page before continuing (typically because
    of buffered-io fallback). Since we have no ordering guarantees with the
    aio, we can read stale or bad data into the page and then write it back out.

    If the i/o is page and block aligned, then we avoid this issue as there
    won't be any need to read data from disk.

    I took the same approach as Eric in the ext4 patch and introduced some
    serialization of unaligned async direct i/o. I don't expect this to have an
    effect on the most common cases of AIO. Unaligned aio will be slower
    though, but that's far more acceptable than data corruption.

    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Mark Fasheh
     

11 Sep, 2010

1 commit

  • Track negative dentries by recording the generation number of the parent
    directory in d_fsdata. The generation number for the parent directory is
    recorded in the inode_info, which increments every time the lock on the
    directory is dropped.

    If the generation number of the parent directory and the negative dentry
    matches, there is no need to perform the revalidate, else a revalidate
    is forced. This improves performance in situations where nodes look for
    the same non-existent file multiple times.

    Thanks Mark for explaining the DLM sequence.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Joel Becker

    Goldwyn Rodrigues
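
The generation check described above can be sketched as follows. The types and function names are hypothetical stand-ins (the real code keeps the value in d_fsdata and in the ocfs2 inode_info); only the comparison logic is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

struct dir_info {
    unsigned long lock_gen;    /* bumped whenever the directory lock is dropped */
};

struct neg_dentry {
    unsigned long parent_gen;  /* stand-in for the value kept in d_fsdata */
};

/* Called when the lock on the directory is dropped. */
void dir_lock_dropped(struct dir_info *dir)
{
    assert(dir != 0);
    dir->lock_gen++;
}

/* Called when a lookup finds nothing and a negative dentry is created. */
void neg_dentry_record(struct neg_dentry *d, const struct dir_info *dir)
{
    assert(d != 0 && dir != 0);
    d->parent_gen = dir->lock_gen;
}

/* Matching generations mean the cached "no such file" answer still holds. */
bool neg_dentry_needs_revalidate(const struct neg_dentry *d,
                                 const struct dir_info *dir)
{
    assert(d != 0 && dir != 0);
    return d->parent_gen != dir->lock_gen;
}
```

Repeated lookups of the same missing name skip the revalidate until the directory lock is dropped once, after which the next lookup is forced to revalidate.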
     

10 Sep, 2010

1 commit

  • Thanks for the comments. I have incorporated them all.

    CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
    Statistics now look like -
    ocfs2_write_ctxt: 2144 - 2136 = 8
    ocfs2_inode_info: 1960 - 1848 = 112
    ocfs2_journal: 168 - 160 = 8
    ocfs2_lock_res: 336 - 304 = 32
    ocfs2_refcount_tree: 512 - 472 = 40

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Joel Becker

    Goldwyn Rodrigues
     

10 Aug, 2010

2 commits


21 May, 2010

1 commit

  • * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits)
    ocfs2: Silence a gcc warning.
    ocfs2: Don't retry xattr set in case value extension fails.
    ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
    ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
    fs/ocfs2/dlm: Use kstrdup
    fs/ocfs2/dlm: Drop memory allocation cast
    Ocfs2: Optimize punching-hole code.
    Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
    Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
    Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
    ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
    ocfs2: Wrap signal blocking in void functions.
    ocfs2/dlm: Increase o2dlm lockres hash size
    ocfs2: Make ocfs2_extend_trans() really extend.
    ocfs2/trivial: Code cleanup for allocation reservation.
    ocfs2: make ocfs2_adjust_resv_from_alloc simple.
    ocfs2: Make nointr a default mount option
    ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
    o2net: log socket state changes
    ocfs2: print node # when tcp fails
    ...

    Linus Torvalds
     

06 May, 2010

1 commit


24 Apr, 2010

1 commit

  • Currently in the error path of ocfs2_symlink and ocfs2_mknod, we just call
    iput with the inode we failed with, but the inode wipe code will complain
    because we don't add the inode to orphan dir. One solution would be to lock
    the orphan dir during the entire transaction, but that's too heavy for a
    rare error path. Instead, we add a flag, OCFS2_INODE_SKIP_ORPHAN_DIR which
    tells the inode wipe code that it won't find this inode in the orphan dir.

    [ Merge fixes and comment style cleanups -Mark ]

    Signed-off-by: Li Dongyang
    Signed-off-by: Mark Fasheh

    Li Dongyang
     

05 Sep, 2009

6 commits


04 Apr, 2009

2 commits

  • For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
    ocfs2_get_dentry() may read from disk when the inode is not in memory,
    without any cross-cluster lock. This leads to the file system loading a
    stale inode.

    This patch fixes the above problem.

    The solution is that when the inode is not in memory, we get the cluster
    lock (PR) of the alloc inode from which the inode in question was allocated
    (this makes the node on which the deletion is done sync the alloc inode)
    before reading out the inode itself. Then we check the bitmap in the group
    (from which the inode in question was allocated) to see if the bit is clear.
    If it is clear, the inode is stale. If the bit is set, we then check the
    generation as the existing code does.

    We have to read out the inode in question from disk first to know its alloc
    slot and alloc bit. If it is not stale, we read it out using ocfs2_iget().
    The second read should then be from cache.

    We also have to add a per-superblock nfs_sync_lock to cover the lock for
    the alloc inode and that for the inode in question. This is because
    ocfs2_get_dentry() and ocfs2_delete_inode() lock them in reverse order.
    nfs_sync_lock is taken in EX mode in ocfs2_get_dentry() and in PR mode in
    ocfs2_delete_inode(), so that multiple ocfs2_delete_inode() calls can run
    concurrently in the normal case.

    [mfasheh@suse.com: build warning fixes and comment cleanups]
    Signed-off-by: Wengang Wang
    Acked-by: Joel Becker
    Signed-off-by: Mark Fasheh

    wengang wang
     
  • In ocfs2, the inode block search looks for the "emptiest" inode
    group to allocate from. So if an inode alloc file has many equally
    (or almost equally) empty groups, new inodes will tend to get
    spread out amongst them, which in turn can put them all over the
    disk. This is undesirable because directory operations on conceptually
    "nearby" inodes force a large number of seeks.

    So we add ip_last_used_group to in-core directory inodes, which records
    the last used allocation group. Another field named ip_last_used_slot
    is also added in case inode stealing happens. When claiming a new inode,
    we pass in the directory's inode so that the allocation can use this
    information.
    For more details, please see
    http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

    Signed-off-by: Tao Ma
    Signed-off-by: Mark Fasheh

    Tao Ma
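
The staleness test from the ocfs2_get_dentry() commit above (the first commit in this group) comes down to two checks once the locks are held: the allocation bit, then the generation. The sketch below shows only that decision. The enum, the function names and the flat byte-array bitmap are assumptions made for illustration, and all cluster locking (the alloc inode PR lock and nfs_sync_lock) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum fh_status { FH_OK, FH_STALE };

/* Test one bit in an allocation-group bitmap (bit set = inode allocated). */
bool bitmap_test(const uint8_t *bitmap, unsigned int bit)
{
    return (bitmap[bit / 8] >> (bit % 8)) & 1u;
}

/* Decide whether a file handle still refers to a live inode. */
enum fh_status check_handle(const uint8_t *group_bitmap, unsigned int alloc_bit,
                            uint32_t disk_generation, uint32_t fh_generation)
{
    assert(group_bitmap != NULL);

    if (!bitmap_test(group_bitmap, alloc_bit))
        return FH_STALE;   /* bit is clear: the inode has been freed */
    if (disk_generation != fh_generation)
        return FH_STALE;   /* bit is set, but the slot holds a newer inode */
    return FH_OK;
}
```

The bitmap check catches an inode that was deleted on another node, and the generation check catches a slot that was freed and then reused.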
     

06 Jan, 2009

2 commits

  • For each quota type, each node has a local quota file. In this file it
    stores changes users have made to disk usage via this node. Once in a
    while this information is synced to the global file (and thus with other
    nodes) so that limits enforcement works at least approximately.

    Global quota files contain all the information about usage and limits. It's
    mostly handled by the generic VFS code (which implements a trie of structures
    inside a quota file). We only have to provide functions to convert structures
    from on-disk format to in-memory one. We also have to provide wrappers for
    various quota functions starting transactions and acquiring necessary cluster
    locks before the actual IO is really started.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • The ocfs2 code currently reads inodes off disk with a simple
    ocfs2_read_block() call. Each place that does this has a different set
    of sanity checks it performs. Some check only the signature. A couple
    validate the block number (the block read vs di->i_blkno). A couple
    others check for VALID_FL. Only one place validates i_fs_generation. A
    couple check nothing. Even when an error is found, they don't all do
    the same thing.

    We wrap inode reading into ocfs2_read_inode_block(). This will validate
    all the above fields, going readonly if they are invalid (they never
    should be). ocfs2_read_inode_block_full() is provided for the places
    that want to pass read_block flags. Every caller is passing a struct
    inode with a valid ip_blkno, so we don't need a separate blkno argument
    either.

    We will remove the validation checks from the rest of the code in a
    later commit, as they are no longer necessary.

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
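
The set of checks that ocfs2_read_inode_block() is described as centralising can be sketched like this. Everything here is a simplified assumption: `struct sketch_dinode` is not the on-disk ocfs2_dinode layout, the constants are placeholders, and the real function goes read-only on failure instead of returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SIGNATURE "INODE01"  /* assumed 8-byte signature, NUL included */
#define SKETCH_VALID_FL  0x1u       /* assumed "inode is valid" flag bit */

/* Hypothetical, cut-down inode block with only the validated fields. */
struct sketch_dinode {
    char     signature[8];
    uint64_t blkno;          /* block the inode claims to live in */
    uint32_t flags;
    uint32_t fs_generation;
};

/* Run every check in one place so all callers behave the same way. */
bool validate_inode_block(const struct sketch_dinode *di,
                          uint64_t read_blkno, uint32_t fs_generation)
{
    assert(di != NULL);

    if (memcmp(di->signature, SKETCH_SIGNATURE, sizeof(di->signature)) != 0)
        return false;                      /* bad signature */
    if (di->blkno != read_blkno)
        return false;                      /* block read vs. recorded blkno */
    if (!(di->flags & SKETCH_VALID_FL))
        return false;                      /* VALID_FL not set */
    if (di->fs_generation != fs_generation)
        return false;                      /* belongs to another mkfs run */
    return true;
}
```

Having one validator means that a caller which previously checked only the signature now also rejects a block with the wrong block number, flag or generation.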
     

15 Oct, 2008

1 commit


14 Oct, 2008

2 commits

  • ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
    limiting our maximum filesystem size.

    It's a pretty trivial change. Most functions are just renamed. The
    only functional change is moving to Jan's inode-based ordered data mode.
    It's better, too.

    Because JBD2 reads and writes JBD journals, this is compatible with any
    existing filesystem. It can even interact with JBD-based ocfs2 as long
    as the journal is formatted for JBD.

    We provide a compatibility option so that paranoid people can still use
    JBD for the time being. This will go away shortly.

    [ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
    ocfs2_truncate_for_delete(). --Mark ]

    Signed-off-by: Joel Becker
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • This patch implements storing extended attributes either in the inode or
    in a single external block. We only store EAs in the inode when
    blocksize > 512 or the inode block has free space for them. When an EA's
    value is larger than 80 bytes, we store the value in a b-tree outside the
    inode or block.

    Signed-off-by: Tiger Yang
    Signed-off-by: Mark Fasheh

    Tiger Yang
     

26 Jan, 2008

3 commits

  • Create separate lockdep lock classes for system files' i_mutexes. They are
    used to guard allocations and similar things and thus rank differently
    than the i_mutex of a regular file or directory.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • Call this the "inode_lock" now, since it covers both data and meta data.
    This patch makes no functional changes.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • The meta lock now covers both meta data and data, so this just removes the
    now-redundant data lock.

    Combining locks saves us a round of lock mastery per inode and one less lock
    to ping between nodes during read/write.

    We don't lose much - since meta locks were always held before a data lock
    (and at the same level) ordered writeout mode (the default) ensured that
    flushing for the meta data lock also pushed out data anyways.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

13 Oct, 2007

1 commit

  • Add the disk, network and memory structures needed to support data in inode.

    Struct ocfs2_inline_data is defined and embedded in ocfs2_dinode for storing
    inline data.

    A new inode field, i_dyn_features, is added to facilitate tracking of
    dynamic inode state. Since it will be used often, we want to mirror it on
    ocfs2_inode_info, and transfer it via the meta data lvb.

    Signed-off-by: Mark Fasheh
    Reviewed-by: Joel Becker

    Mark Fasheh
     

03 May, 2007

1 commit


27 Apr, 2007

2 commits

  • The extent map code was ripped out earlier because of an inability to deal
    with holes. This patch adds back a simpler caching scheme requiring far less
    code.

    Our old extent map caching was designed back when metadata block caching in
    Ocfs2 didn't work very well, resulting in many disk reads. These days our
    metadata caching is much better, resulting in no unnecessary disk reads. As
    a result, extent caching doesn't have to be as fancy, nor does it have to
    cache as many extents. Keeping the last 3 extents seen should be sufficient
    to give us a small performance boost on some streaming workloads.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Older file systems which didn't support holes did a dumb calculation of
    i_blocks based on i_size. This is no longer accurate, so fix things up to
    take actual allocation into account.

    Signed-off-by: Mark Fasheh

    Mark Fasheh
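
The "last 3 extents seen" scheme from the extent map commit above (the first commit in this group) can be sketched as a tiny most-recent-first array. The struct and function names are hypothetical, and details of the real cache such as merging adjacent extents, locking and truncation are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 3   /* "the last 3 extents seen" */

struct extent {
    uint32_t cpos;      /* first logical cluster */
    uint32_t clusters;  /* length in clusters */
    uint64_t phys;      /* first physical block */
};

struct extent_cache {
    struct extent slot[CACHE_SLOTS];  /* slot[0] is the most recent */
    unsigned int used;
};

/* Remember an extent, evicting the oldest one when the cache is full. */
void cache_insert(struct extent_cache *c, struct extent e)
{
    assert(c->used <= CACHE_SLOTS);
    unsigned int n = c->used < CACHE_SLOTS ? c->used : CACHE_SLOTS - 1;

    memmove(&c->slot[1], &c->slot[0], n * sizeof(c->slot[0]));
    c->slot[0] = e;
    if (c->used < CACHE_SLOTS)
        c->used++;
}

/* Look up the extent covering logical cluster cpos, if it is cached. */
bool cache_lookup(const struct extent_cache *c, uint32_t cpos,
                  struct extent *out)
{
    for (unsigned int i = 0; i < c->used; i++) {
        const struct extent *e = &c->slot[i];

        if (cpos >= e->cpos && cpos - e->cpos < e->clusters) {
            *out = *e;
            return true;
        }
    }
    return false;
}
```

A streaming read that walks the file in order keeps hitting the most recent slots, which is the workload the commit message expects to benefit; a miss simply falls back to the normal metadata lookup.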