26 Mar, 2016
2 commits
-
In the current implementation of unaligned aio+dio, lock order behave as
follow:in user process context:
-> call io_submit()
-> get i_mutex
get ip_unaligned_aio
-> submit direct io to block device
-> release i_mutex
-> io_submit() returnin dio work queue context(the work queue is created in __blockdev_direct_IO):
-> release ip_unaligned_aio
get i_mutex
-> clear unwritten flag & change i_size
-> release i_mutexThere is a limitation to the thread number of dio work queue. 256 at
default. If all 256 thread are in the above 'window2' stage, and there
is a user process in the 'window1' stage, the system will became
deadlock. Since the user process hold i_mutex to wait ip_unaligned_aio
lock, while there is a direct bio hold ip_unaligned_aio mutex who is
waiting for a dio work queue thread to be schedule. But all the dio
work queue thread is waiting for i_mutex lock in 'window2'.This case only happened in a test which send a large number(more than
256) of aio at one io_submit() call.My design is to remove ip_unaligned_aio lock. Change it to a sync io
instead. Just like ip_unaligned_aio lock, serialize the unaligned aio
dio.[akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is still one issue in the direct write procedure.
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to diskWhen there are 2 direct write A(0~3KB),B(4~7KB) writing to the same
cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first,
it will zero the region (4~7KB). Before request A enter to phase 3,
request B arrive phase 2, it will zero region (0~3KB). This is just like
request B steps request A.To resolve this issue, we should let request B knows this cluster is already
under zero, to prevent it from steps the previous write request.This patch will add function ocfs2_unwritten_check() to do this job. It
will record all clusters that are under direct write(it will be recorded
in the 'ip_unwritten_list' member of inode info), and prevent the later
direct write writing to the same cluster to do the zero work again.Signed-off-by: Ryan Ding
Reviewed-by: Junxiao Bi
Cc: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Mar, 2016
1 commit
-
Implement online file check sysfile interfaces, e.g. how to create the
related sysfile according to device name, how to display/handle file
check request from the sysfile.Signed-off-by: Gang He
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc: Joseph Qi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Nov, 2015
1 commit
-
We have no need to take inode mutex, rw and inode lock if it is not dio
entry when recover orphans. Optimize it by adding a flag
OCFS2_INODE_DIO_ORPHAN_ENTRY to ocfs2_inode_info to reduce contention.Signed-off-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Sep, 2015
1 commit
-
During direct io the inode will be added to orphan first and then
deleted from orphan. There is a race window that the orphan entry will
be deleted twice and thus trigger the BUG when validating
OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.ocfs2_direct_IO_write
...
ocfs2_add_inode_to_orphan
>>>>>>>> race window.
1) another node may rm the file and then down, this node
take care of orphan recovery and clear flag
OCFS2_DIO_ORPHANED_FL.
2) since rw lock is unlocked, it may race with another
orphan recovery and append dio.
ocfs2_del_inode_from_orphanSo take inode mutex lock when recovering orphans and make rw unlock at the
end of aio write in case of append dio.Signed-off-by: Joseph Qi
Reported-by: Yiwen Jiang
Cc: Weiwei Wang
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Feb, 2015
1 commit
-
If one node has crashed with orphan entry leftover, another node which do
append O_DIRECT write to the same file will override the
i_dio_orphaned_slot. Then the old entry won't be cleaned forever. If
this case happens, we let it wait for orphan recovery first.Signed-off-by: Joseph Qi
Cc: Weiwei Wang
Cc: Joel Becker
Cc: Junxiao Bi
Cc: Mark Fasheh
Cc: Xuejiufei
Cc: alex chen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Nov, 2014
1 commit
-
CC: Mark Fasheh
CC: Joel Becker
CC: ocfs2-devel@oss.oracle.com
Acked-by: Christoph Hellwig
Signed-off-by: Jan Kara
10 Oct, 2014
1 commit
-
ocfs2_inode_info->ip_clusters and ocfs2_dinode->id1.bitmap1.i_total are
defined as type u32, so the shift left operations may overflow if volume
size is large, for example, 2TB and cluster size is 1MB.Signed-off-by: Joseph Qi
Reviewed-by: Alex Chen
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Apr, 2014
3 commits
-
The flag was never set, delete it.
Signed-off-by: Jan Kara
Reviewed-by: Mark Fasheh
Reviewed-by: Srinivas Eeda
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
transaction to complete. This isn't terribly efficient, since sync_file
really only needs to wait for the last transaction involving that inode
to complete, and this doesn't require i_mutex.Therefore, implement the necessary bits to track the newest tid
associated with an inode, and teach sync_file to wait for that instead
of waiting for everything in the journal to commit. Furthermore, only
issue the flush request to the drive if jbd2 hasn't already done so.This also eliminates the deadlock between ocfs2_file_aio_write() and
ocfs2_sync_file(). aio_write takes i_mutex then calls
ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
However, if that dio completion involves calling fsync, then we can get
into trouble when some ocfs2_sync_file tries to take i_mutex.Signed-off-by: Darrick J. Wong
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There is a problem that waitqueue_active() may check stale data thus miss
a wakeup of threads waiting on ip_unaligned_aio.The valid value of ip_unaligned_aio is only 0 and 1 so we can change it to
be of type mutex thus the above prolem is avoid. Another benifit is that
mutex which works as FIFO is fairer than wake_up_all().Signed-off-by: Wengang Wang
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 May, 2013
1 commit
-
Faster kernel compiles by way of fewer unnecessary includes.
[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet
Cc: Zach Brown
Cc: Felipe Balbi
Cc: Greg Kroah-Hartman
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Rusty Russell
Cc: Jens Axboe
Cc: Asai Thambi S P
Cc: Selvan Mani
Cc: Sam Bradshaw
Cc: Jeff Moyer
Cc: Al Viro
Cc: Benjamin LaHaise
Reviewed-by: "Theodore Ts'o"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Jul, 2011
1 commit
-
Fix a corruption that can happen when we have (two or more) outstanding
aio's to an overlapping unaligned region. Ext4
(e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d) and xfs recently had to fix
similar issues.In our case what happens is that we can have an outstanding aio on a region
and if a write comes in with some bytes overlapping the original aio we may
decide to read that region into a page before continuing (typically because
of buffered-io fallback). Since we have no ordering guarantees with the
aio, we can read stale or bad data into the page and then write it back out.If the i/o is page and block aligned, then we avoid this issue as there
won't be any need to read data from disk.I took the same approach as Eric in the ext4 patch and introduced some
serialization of unaligned async direct i/o. I don't expect this to have an
effect on the most common cases of AIO. Unaligned aio will be slower
though, but that's far more acceptable than data corruption.Signed-off-by: Mark Fasheh
Signed-off-by: Joel Becker
11 Sep, 2010
1 commit
-
Track negative dentries by recording the generation number of the parent
directory in d_fsdata. The generation number for the parent directory is
recorded in the inode_info, which increments every time the lock on the
directory is dropped.If the generation number of the parent directory and the negative dentry
matches, there is no need to perform the revalidate, else a revalidate
is forced. This improves performance in situations where nodes look for
the same non-existent file multiple times.Thanks Mark for explaining the DLM sequence.
Signed-off-by: Goldwyn Rodrigues
Signed-off-by: Joel Becker
10 Sep, 2010
1 commit
-
Thanks for the comments. I have incorportated them all.
CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
Statistics now look like -
ocfs2_write_ctxt: 2144 - 2136 = 8
ocfs2_inode_info: 1960 - 1848 = 112
ocfs2_journal: 168 - 160 = 8
ocfs2_lock_res: 336 - 304 = 32
ocfs2_refcount_tree: 512 - 472 = 40Signed-off-by: Goldwyn Rodrigues
Signed-off-by: Joel Becker
10 Aug, 2010
2 commits
-
... and let iput_final() do the actual eviction or retention
Signed-off-by: Al Viro
-
Signed-off-by: Al Viro
21 May, 2010
1 commit
-
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits)
ocfs2: Silence a gcc warning.
ocfs2: Don't retry xattr set in case value extension fails.
ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
fs/ocfs2/dlm: Use kstrdup
fs/ocfs2/dlm: Drop memory allocation cast
Ocfs2: Optimize punching-hole code.
Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
ocfs2: Wrap signal blocking in void functions.
ocfs2/dlm: Increase o2dlm lockres hash size
ocfs2: Make ocfs2_extend_trans() really extend.
ocfs2/trivial: Code cleanup for allocation reservation.
ocfs2: make ocfs2_adjust_resv_from_alloc simple.
ocfs2: Make nointr a default mount option
ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
o2net: log socket state changes
ocfs2: print node # when tcp fails
...
06 May, 2010
1 commit
-
Add a per-inode reservations structure and pass it through to the
reservations code.Signed-off-by: Mark Fasheh
24 Apr, 2010
1 commit
-
Currently in the error path of ocfs2_symlink and ocfs2_mknod, we just call
iput with the inode we failed with, but the inode wipe code will complain
because we don't add the inode to orphan dir. One solution would be to lock
the orphan dir during the entire transaction, but that's too heavy for a
rare error path. Instead, we add a flag, OCFS2_INODE_SKIP_ORPHAN_DIR which
tells the inode wipe code that it won't find this inode in the orphan dir.[ Merge fixes and comment style cleanups -Mark ]
Signed-off-by: Li Dongyang
Signed-off-by: Mark Fasheh
05 Sep, 2009
6 commits
-
We can get to the inode from the caching information. Other parent
types don't need it.Signed-off-by: Joel Becker
-
Similar ip_last_trans, ip_created_trans tracks the creation of a journal
managed inode. This specifically tracks what transaction created the
inode. This is so the code can know if the inode has ever been written
to disk.This behavior is desirable for any journal managed object. We move it
to struct ocfs2_caching_info as ci_created_trans so that any object
using ocfs2_caching_info can rely on this behavior.Signed-off-by: Joel Becker
-
We have the read side of metadata caching isolated to struct
ocfs2_caching_info, now we need the write side. This means the journal
functions. The journal only does a couple of things with struct inode.This change moves the ip_last_trans field onto struct
ocfs2_caching_info as ci_last_trans. This field tells the journal
whether a pending journal flush is required.Signed-off-by: Joel Becker
-
We are really passing the inode into the ocfs2_read/write_blocks()
functions to get at the metadata cache. This commit passes the cache
directly into the metadata block functions, divorcing them from the
inode.Signed-off-by: Joel Becker
-
We don't really want to cart around too many new fields on the
ocfs2_caching_info structure. So let's wrap all our access of the
parent object in a set of operations. One pointer on caching_info, and
more flexibility to boot.Signed-off-by: Joel Becker
-
We want to use the ocfs2_caching_info structure in places that are not
inodes. To do that, it can no longer rely on referencing the inode
directly.This patch moves the flags to ocfs2_caching_info->ci_flags, stores
pointers to the parent's locks on the ocfs2_caching_info, and renames
the constants and flags to reflect its independant state.Signed-off-by: Joel Becker
04 Apr, 2009
2 commits
-
For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
ocfs2_get_dentry() may read from disk when the inode is not in memory,
without any cross cluster lock. this leads to the file system loading a
stale inode.This patch fixes above problem.
Solution is that in case of inode is not in memory, we get the cluster
lock(PR) of alloc inode where the inode in question is allocated from (this
causes node on which deletion is done sync the alloc inode) before reading
out the inode itsself. then we check the bitmap in the group (the inode in
question allcated from) to see if the bit is clear. if it's clear then it's
stale. if the bit is set, we then check generation as the existing code
does.We have to read out the inode in question from disk first to know its alloc
slot and allot bit. And if its not stale we read it out using ocfs2_iget().
The second read should then be from cache.And also we have to add a per superblock nfs_sync_lock to cover the lock for
alloc inode and that for inode in question. this is because ocfs2_get_dentry()
and ocfs2_delete_inode() lock on them in reverse order. nfs_sync_lock is locked
in EX mode in ocfs2_get_dentry() and in PR mode in ocfs2_delete_inode(). so
that mutliple ocfs2_delete_inode() can run concurrently in normal case.[mfasheh@suse.com: build warning fixes and comment cleanups]
Signed-off-by: Wengang Wang
Acked-by: Joel Becker
Signed-off-by: Mark Fasheh -
In ocfs2, the inode block search looks for the "emptiest" inode
group to allocate from. So if an inode alloc file has many equally
(or almost equally) empty groups, new inodes will tend to get
spread out amongst them, which in turn can put them all over the
disk. This is undesirable because directory operations on conceptually
"nearby" inodes force a large number of seeks.So we add ip_last_used_group in core directory inodes which records
the last used allocation group. Another field named ip_last_used_slot
is also added in case inode stealing happens. When claiming new inode,
we passed in directory's inode so that the allocation can use this
information.
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh
06 Jan, 2009
2 commits
-
For each quota type each node has local quota file. In this file it stores
changes users have made to disk usage via this node. Once in a while this
information is synced to global file (and thus with other nodes) so that
limits enforcement at least aproximately works.Global quota files contain all the information about usage and limits. It's
mostly handled by the generic VFS code (which implements a trie of structures
inside a quota file). We only have to provide functions to convert structures
from on-disk format to in-memory one. We also have to provide wrappers for
various quota functions starting transactions and acquiring necessary cluster
locks before the actual IO is really started.Signed-off-by: Jan Kara
Signed-off-by: Mark Fasheh -
The ocfs2 code currently reads inodes off disk with a simple
ocfs2_read_block() call. Each place that does this has a different set
of sanity checks it performs. Some check only the signature. A couple
validate the block number (the block read vs di->i_blkno). A couple
others check for VALID_FL. Only one place validates i_fs_generation. A
couple check nothing. Even when an error is found, they don't all do
the same thing.We wrap inode reading into ocfs2_read_inode_block(). This will validate
all the above fields, going readonly if they are invalid (they never
should be). ocfs2_read_inode_block_full() is provided for the places
that want to pass read_block flags. Every caller is passing a struct
inode with a valid ip_blkno, so we don't need a separate blkno argument
either.We will remove the validation checks from the rest of the code in a
later commit, as they are no longer necessary.Signed-off-by: Joel Becker
Signed-off-by: Mark Fasheh
15 Oct, 2008
1 commit
-
dir.c is the only place using ocfs2_bread(), so let's make it static to
that file.Signed-off-by: Joel Becker
Signed-off-by: Mark Fasheh
14 Oct, 2008
2 commits
-
ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
limiting our maximum filesystem size.It's a pretty trivial change. Most functions are just renamed. The
only functional change is moving to Jan's inode-based ordered data mode.
It's better, too.Because JBD2 reads and writes JBD journals, this is compatible with any
existing filesystem. It can even interact with JBD-based ocfs2 as long
as the journal is formated for JBD.We provide a compatibility option so that paranoid people can still use
JBD for the time being. This will go away shortly.[ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
ocfs2_truncate_for_delete(). --Mark ]Signed-off-by: Joel Becker
Signed-off-by: Mark Fasheh -
This patch implements storing extended attributes both in inode or a single
external block. We only store EA's in-inode when blocksize > 512 or that
inode block has free space for it. When an EA's value is larger than 80
bytes, we will store the value via b-tree outside inode or block.Signed-off-by: Tiger Yang
Signed-off-by: Mark Fasheh
26 Jan, 2008
3 commits
-
Create separate lockdep lock classes for system file's i_mutexes. They are
used to guard allocations and similar things and thus rank differently
than i_mutex of a regular file or directory.Signed-off-by: Jan Kara
Signed-off-by: Mark Fasheh -
Call this the "inode_lock" now, since it covers both data and meta data.
This patch makes no functional changes.Signed-off-by: Mark Fasheh
-
The meta lock now covers both meta data and data, so this just removes the
now-redundant data lock.Combining locks saves us a round of lock mastery per inode and one less lock
to ping between nodes during read/write.We don't lose much - since meta locks were always held before a data lock
(and at the same level) ordered writeout mode (the default) ensured that
flushing for the meta data lock also pushed out data anyways.Signed-off-by: Mark Fasheh
13 Oct, 2007
1 commit
-
Add the disk, network and memory structures needed to support data in inode.
Struct ocfs2_inline_data is defined and embedded in ocfs2_dinode for storing
inline data.A new inode field, i_dyn_features, is added to facilitate tracking of
dynamic inode state. Since it will be used often, we want to mirror it on
ocfs2_inode_info, and transfer it via the meta data lvb.Signed-off-by: Mark Fasheh
Reviewed-by: Joel Becker
03 May, 2007
1 commit
-
Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
ocfs2-specific ip_attr. Hence, when someone sets these flags via a different
interface than ioctl, they are stored correctly.Signed-off-by: Jan Kara
Signed-off-by: Mark Fasheh
27 Apr, 2007
2 commits
-
The extent map code was ripped out earlier because of an inability to deal
with holes. This patch adds back a simpler caching scheme requiring far less
code.Our old extent map caching was designed back when meta data block caching in
Ocfs2 didn't work very well, resulting in many disk reads. These days our
metadata caching is much better, resulting in no un-necessary disk reads. As
a result, extent caching doesn't have to be as fancy, nor does it have to
cache as many extents. Keeping the last 3 extents seen should be sufficient
to give us a small performance boost on some streaming workloads.Signed-off-by: Mark Fasheh
-
Older file systems which didn't support holes did a dumb calculation of
i_blocks based on i_size. This is no longer accurate, so fix things up to
take actual allocation into account.Signed-off-by: Mark Fasheh