Doug / smarc-fsl-linux-kernel | Embedian Git Server

15 Apr, 2009

1 commit

328eaaba4 ocfs2: fix i_mutex locking in ocfs2_splice_to_file()

Rearrange locking of i_mutex on destination and call to
ocfs2_rw_lock() so locks are only held while buffers are copied with
the pipe_to_file() actor, and not while waiting for more data on the
pipe.

Signed-off-by: Miklos Szeredi
Signed-off-by: Jens Axboe

Miklos Szeredi
2009-04-15 18:10:12 +0800

07 Apr, 2009

1 commit

7bfac9ecf splice: fix deadlock in splicing to file

There's a possible deadlock in generic_file_splice_write(),
splice_from_pipe() and ocfs2_file_splice_write():

- task A calls generic_file_splice_write()
- this calls inode_double_lock(), which locks i_mutex on both
  pipe->inode and target inode
- ordering depends on inode pointers, can happen that pipe->inode is
  locked first
- __splice_from_pipe() needs more data, calls pipe_wait()
- this releases lock on pipe->inode, goes to interruptible sleep
- task B calls generic_file_splice_write(), similarly to the first
- this locks pipe->inode, then tries to lock inode, but that is
  already held by task A
- task A is interrupted, it tries to lock pipe->inode, but fails, as
  it is already held by task B
- ABBA deadlock

Fix this by explicitly ordering locks: the outer lock must be on
target inode and the inner lock (which is later unlocked and relocked)
must be on pipe->inode. This is OK: pipe inodes and target inodes
form two nonoverlapping sets, since generic_file_splice_write() and
friends are not called with a target which is a pipe.

Signed-off-by: Miklos Szeredi
Acked-by: Mark Fasheh
Acked-by: Jens Axboe
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Miklos Szeredi
2009-04-07 23:34:46 +0800
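The lock ordering described in 7bfac9ecf can be sketched in user space. This is only a minimal illustration, assuming pthread mutexes in place of i_mutex; the `sketch_*` names and the `refill` field are hypothetical and are not taken from the kernel source. The point it shows is that the target lock is always the outer lock and the pipe lock is always the inner one, and only the inner lock is dropped and retaken while waiting for data.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real structures. */
struct sketch_inode {
    pthread_mutex_t i_mutex;
};

struct sketch_pipe {
    struct sketch_inode inode; /* plays the role of pipe->inode */
    size_t available;          /* bytes currently buffered in the pipe */
    size_t refill;             /* bytes that "arrive" on each wait */
};

/* Drop only the inner (pipe) lock while waiting, then retake it. */
static void sketch_pipe_wait(struct sketch_pipe *src)
{
    pthread_mutex_unlock(&src->inode.i_mutex);
    /* A real implementation would sleep here until data arrives. */
    pthread_mutex_lock(&src->inode.i_mutex);
    src->available += src->refill;
}

/*
 * Move up to `want` bytes from the pipe to the target. Every caller takes
 * the target lock first and the pipe lock second, so two tasks can never
 * acquire the pair in opposite (ABBA) order.
 */
static size_t sketch_splice_write(struct sketch_inode *target,
                                  struct sketch_pipe *src, size_t want)
{
    size_t copied = 0;

    pthread_mutex_lock(&target->i_mutex);    /* outer: target inode */
    pthread_mutex_lock(&src->inode.i_mutex); /* inner: pipe->inode */

    while (copied < want) {
        if (src->available == 0) {
            if (src->refill == 0)
                break; /* no more data will ever arrive */
            sketch_pipe_wait(src);
            continue;
        }
        size_t left = want - copied;
        size_t n = left < src->available ? left : src->available;
        src->available -= n;
        copied += n;
    }

    pthread_mutex_unlock(&src->inode.i_mutex);
    pthread_mutex_unlock(&target->i_mutex);
    return copied;
}
```

In this sketch the outer lock is still held across the wait; the later commit 328eaaba4 describes narrowing the locked region to the copy itself.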

04 Apr, 2009

32 commits

9140db04e ocfs2: recover orphans in offline slots during recovery and mount

During recovery, a node recovers orphans in its own slot and in the slots of
the dead node(s). But if the dead nodes were holding orphans in offline slots,
they will be left unrecovered.

If the dead node is the last one to die and is holding orphans in other slots
and is the first one to mount, then it only recovers its own slot, which
leaves orphans in offline slots.

This patch queues complete_recovery to clean orphans for all offline slots
during mount and node recovery.

Signed-off-by: Srinivas Eeda
Acked-by: Joel Becker
Signed-off-by: Mark Fasheh

Srinivas Eeda
2009-04-04 02:39:26 +0800
1fca3a05e ocfs2: Pagecache usage optimization on ocfs2

A page can have multiple buffers, and even if the page is not uptodate, some
of its buffers can be uptodate in a pagesize != blocksize environment.
This aop checks that all buffers which correspond to the part of the file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to the HDD even if the page is not uptodate, because the portion we
want to read is uptodate.
The "block_is_partially_uptodate" function is already used by ext2/3/4.
With the following patch, random read/write mixed workloads or random read
after random write workloads can be optimized and we can get a performance
improvement.

Signed-off-by: Hisashi Hifumi
Signed-off-by: Mark Fasheh

Hisashi Hifumi
2009-04-04 02:39:26 +0800
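The check described in 1fca3a05e can be illustrated with a small user-space sketch. It assumes a page is modeled as an array of per-block uptodate flags; the function name and parameters are hypothetical and do not mirror the kernel's block_is_partially_uptodate() signature.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when every block that overlaps the byte range
 * [from, from + count) within the page is marked uptodate.
 * uptodate[i] describes block i of the page. Illustrative names only.
 */
static bool sketch_range_is_uptodate(const bool *uptodate, size_t nblocks,
                                     size_t blocksize, size_t from,
                                     size_t count)
{
    if (count == 0 || blocksize == 0)
        return false;

    size_t first = from / blocksize;
    size_t last = (from + count - 1) / blocksize;

    if (last >= nblocks)
        return false; /* the range runs past the end of the page */

    for (size_t i = first; i <= last; i++) {
        if (!uptodate[i])
            return false;
    }
    return true;
}
```

When this returns true for the requested range, a read can be served from the page without issuing IO, even though the page as a whole is not uptodate.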
6ca497a83 ocfs2: fix rare stale inode errors when exporting via nfs

For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
ocfs2_get_dentry() may read from disk when the inode is not in memory,
without any cross cluster lock. This leads to the file system loading a
stale inode.

This patch fixes the above problem.

The solution is that when the inode is not in memory, we take the cluster
lock (PR) of the alloc inode from which the inode in question was allocated
(this causes the node on which the deletion is done to sync the alloc inode)
before reading out the inode itself. Then we check the bitmap in the group
(from which the inode in question was allocated) to see if the bit is clear.
If it's clear then the inode is stale. If the bit is set, we then check the
generation as the existing code does.

We have to read out the inode in question from disk first to know its alloc
slot and alloc bit. And if it's not stale we read it out using ocfs2_iget().
The second read should then be from cache.

We also have to add a per-superblock nfs_sync_lock to cover the lock for the
alloc inode and the lock for the inode in question. This is because
ocfs2_get_dentry() and ocfs2_delete_inode() lock them in reverse order.
nfs_sync_lock is locked in EX mode in ocfs2_get_dentry() and in PR mode in
ocfs2_delete_inode(), so that multiple ocfs2_delete_inode() calls can run
concurrently in the normal case.

[mfasheh@suse.com: build warning fixes and comment cleanups]
Signed-off-by: Wengang Wang
Acked-by: Joel Becker
Signed-off-by: Mark Fasheh

wengang wang
2009-04-04 02:39:25 +0800
9405dccfd ocfs2/dlm: Tweak mle_state output

The debugfs file, mle_state, now prints the largest number of mles
in one hash link.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:25 +0800
516b7e52a ocfs2/dlm: Do not purge lockres that is being migrated dlm_purge_lockres()

This patch attempts to fix a fine race between purging and migration.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:24 +0800
7141514b8 ocfs2/dlm: Remove struct dlm_lock_name in struct dlm_master_list_entry

This patch removes struct dlm_lock_name and adds the entries directly
to struct dlm_master_list_entry. Under the new scheme, all mles, whether
or not they are backed by a lockres, will have the name populated in
mle->mname. This allows us to get rid of the code that was figuring out
the location of the mle name.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:23 +0800
e64ff1460 ocfs2/dlm: Show the number of lockres/mles in dlm_state

This patch shows the number of lockres' and mles in the debugfs file, dlm_state.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:22 +0800
7d62a978a ocfs2/dlm: dlm_set_lockres_owner() and dlm_change_lockres_owner() inlined

This patch inlines dlm_set_lockres_owner() and dlm_change_lockres_owner().

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:21 +0800
6800791ab ocfs2/dlm: Improve lockres counts

This patch replaces the lockres counts that tracked the number of
locally and remotely mastered lockres' with a current and total count. The
total count is the number of lockres' that have been created since the dlm
domain was created.

The number of locally and remotely mastered counts can be computed using
the locking_state output.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:21 +0800
2041d8fdc ocfs2/dlm: Track number of mles

The lifetime of a mle is limited to the duration of the lockres mastery
process. While typically this lifetime is fairly short, we have noticed
the number of mles explode under certain circumstances. This patch tracks
the number of mles of each type and should help us determine
how best to speed up the mastery process.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:21 +0800
67ae1f060 ocfs2/dlm: Indent dlm_cleanup_master_list()

The previous patch explicitly did not indent dlm_cleanup_master_list()
so as to make the patch readable. This patch properly indents the
function.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:21 +0800
2ed6c750d ocfs2/dlm: Activate dlm->master_hash for master list entries

With this patch, the mles are stored in a hash and not a simple list.
This should improve the mle lookup time when the number of outstanding
masteries is large.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:19 +0800
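To illustrate why 2ed6c750d expects faster lookups, here is a minimal user-space sketch of a chained hash keyed by lock name. The bucket count, the hash function, and all `sketch_*` names are assumptions made for the example; the real dlm->master_hash uses the kernel's own hashing and list primitives.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_HASH_BUCKETS 8

/* Hypothetical, simplified master list entry: just a name and a chain link. */
struct sketch_mle {
    const char *name;
    struct sketch_mle *next;
};

struct sketch_master_hash {
    struct sketch_mle *buckets[SKETCH_HASH_BUCKETS];
};

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned int sketch_hash(const char *name)
{
    unsigned int h = 0;

    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % SKETCH_HASH_BUCKETS;
}

/* Push the entry onto the front of its bucket's chain. */
static void sketch_insert(struct sketch_master_hash *t, struct sketch_mle *mle)
{
    unsigned int b = sketch_hash(mle->name);

    mle->next = t->buckets[b];
    t->buckets[b] = mle;
}

/* Walk only the one bucket the name hashes to, not every entry. */
static struct sketch_mle *sketch_find(const struct sketch_master_hash *t,
                                      const char *name)
{
    struct sketch_mle *m;

    for (m = t->buckets[sketch_hash(name)]; m != NULL; m = m->next) {
        if (strcmp(m->name, name) == 0)
            return m;
    }
    return NULL;
}
```

With a single list, every lookup walks all outstanding entries; with buckets, a lookup walks only the entries that share a hash bucket, which matters when the number of outstanding masteries is large.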
e2b66ddcc ocfs2/dlm: Create and destroy the dlm->master_hash

This patch adds code to create and destroy the dlm->master_hash.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:18 +0800
c2cd4a443 ocfs2/dlm: Refactor dlm_clean_master_list()

This patch refactors dlm_clean_master_list() so as to make it
easier to convert the mle list to a hash.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:18 +0800
f77a9a78c ocfs2/dlm: Clean up struct dlm_lock_name

For master mle, the name is stored in the attached lockres in struct qstr.
For block and migration mle, the name is stored inline in struct dlm_lock_name.
This patch attempts to make struct dlm_lock_name look like a struct qstr. While
we could use struct qstr, we don't because we want to avoid having to malloc
and free the lockname string as the mle's lifetime is fairly short.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:18 +0800
1c0845773 ocfs2/dlm: Encapsulate adding and removing of mle from dlm->master_list

This patch encapsulates adding and removing of the mle from the
dlm->master_list. This patch is part of the series of patches that
converts the mle list to a mle hash.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:18 +0800
feb473a6e ocfs2: Optimize inode group allocation by recording last used group.

In ocfs2, the block group search looks for the "emptiest" group
to allocate from. So if the allocator has many equally (or almost
equally) empty groups, new block groups will tend to get spread
out amongst them.

So we add osb_inode_alloc_group in ocfs2_super to record the last
used inode allocation group.
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

I have done some basic tests and the results show a tenfold improvement on
some cold-cache stat workloads.

Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh

Tao Ma
2009-04-04 02:39:18 +0800
60ca81e82 ocfs2: Allocate inode groups from global_bitmap.

Inode groups used to be allocated from the local alloc file,
but since we want all inodes to be contiguous enough, we
will try to allocate them directly from global_bitmap.

Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh

Tao Ma
2009-04-04 02:39:17 +0800
138211515 ocfs2: Optimize inode allocation by remembering last group

In ocfs2, the inode block search looks for the "emptiest" inode
group to allocate from. So if an inode alloc file has many equally
(or almost equally) empty groups, new inodes will tend to get
spread out amongst them, which in turn can put them all over the
disk. This is undesirable because directory operations on conceptually
"nearby" inodes force a large number of seeks.

So we add ip_last_used_group to in-core directory inodes, which records
the last used allocation group. Another field named ip_last_used_slot
is also added in case inode stealing happens. When claiming a new inode,
we pass in the directory's inode so that the allocation can use this
information.
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh

Tao Ma
2009-04-04 02:39:17 +0800
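The difference between "emptiest group" allocation and "remember the last group" allocation, as described in feb473a6e and 138211515, can be sketched as follows. This is a toy model under stated assumptions: groups are reduced to a free-slot counter, and `sketch_group`, `sketch_emptiest_group()` and `sketch_alloc_inode()` are hypothetical names, not ocfs2 functions.

```c
#include <stddef.h>

/* Toy allocation group: only the number of free inode slots is modeled. */
struct sketch_group {
    unsigned int free_bits;
};

/* Pick the group with the most free slots; returns -1 if all are full. */
static int sketch_emptiest_group(const struct sketch_group *groups, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (groups[i].free_bits == 0)
            continue;
        if (best < 0 || groups[i].free_bits > groups[best].free_bits)
            best = (int)i;
    }
    return best;
}

/*
 * Allocate one inode. The group recorded in *last_used is preferred as
 * long as it still has room (a negative value means nothing is recorded);
 * otherwise fall back to the emptiest group and remember that choice.
 */
static int sketch_alloc_inode(struct sketch_group *groups, size_t n,
                              int *last_used)
{
    int g = *last_used;

    if (g < 0 || (size_t)g >= n || groups[g].free_bits == 0)
        g = sketch_emptiest_group(groups, n);
    if (g < 0)
        return -1; /* no space anywhere */

    groups[g].free_bits--;
    *last_used = g;
    return g;
}
```

Without the hint, the second allocation in a directory would move to whichever group is now emptiest; with the hint, consecutive allocations stay in one group, which is what keeps "nearby" inodes close on disk.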
1d46dc08d ocfs2: fix leaf start calculation in ocfs2_dx_dir_rebalance()

ocfs2_dx_dir_rebalance() is passed the block offset of a dx leaf which needs
rebalancing. Since we rebalance an entire cluster at a time however, this
function needs to calculate the beginning of that cluster, in blocks. The
calculation was wrong, which would result in a read of non-leaf blocks. Fix
the calculation by adding ocfs2_block_to_cluster_start() which is a more
straightforward way of determining this.

Reported-by: Tristan Ye
Signed-off-by: Mark Fasheh

Mark Fasheh
2009-04-04 02:39:17 +0800
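The calculation that 1d46dc08d refers to, finding the first block of the cluster that contains a given block, can be sketched in user space. This is an assumption-labeled illustration: the function takes the block size and cluster size as bit shifts directly, whereas the kernel helper reads them from the superblock, and it assumes the cluster size is at least the block size.

```c
#include <stdint.h>

/*
 * Round a block number down to the first block of its cluster.
 * blocksize_bits and clustersize_bits are log2 of the sizes in bytes;
 * clustersize_bits must be >= blocksize_bits. Hypothetical helper name.
 */
static uint64_t sketch_block_to_cluster_start(uint64_t block,
                                              unsigned int blocksize_bits,
                                              unsigned int clustersize_bits)
{
    unsigned int bits = clustersize_bits - blocksize_bits;
    uint64_t cluster = block >> bits; /* which cluster the block is in */

    return cluster << bits;           /* first block of that cluster */
}
```

For example, with 4K blocks (12 bits) and 32K clusters (15 bits) there are 8 blocks per cluster, so block 13 belongs to the cluster that starts at block 8.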
b80b549c3 ocfs2: re-order ocfs2_empty_dir checks

ocfs2_empty_dir() is far more expensive than checking link count. Since both
need to be checked at the same time, we can improve performance by checking
link count first.

Signed-off-by: Mark Fasheh

Mark Fasheh
2009-04-04 02:39:17 +0800
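The reordering in b80b549c3 relies on short-circuit evaluation: when the cheap test fails, the expensive one never runs. A minimal sketch, assuming a toy directory structure with a scan counter so the effect is observable; all `sketch_*` names are hypothetical.

```c
#include <stdbool.h>

/* Toy directory: a cheap link count plus what a full scan would find. */
struct sketch_dir {
    unsigned int nlink;   /* link count, cheap to read */
    unsigned int entries; /* entries other than "." and ".." */
    unsigned int scans;   /* how many full scans were performed */
};

/* Stand-in for the expensive check: pretends to walk the directory. */
static bool sketch_empty_dir(struct sketch_dir *dir)
{
    dir->scans++;
    return dir->entries == 0;
}

/*
 * A directory is removable only if both conditions hold. Testing the
 * link count first means && skips the scan whenever nlink != 2.
 */
static bool sketch_can_remove(struct sketch_dir *dir)
{
    return dir->nlink == 2 && sketch_empty_dir(dir);
}
```

The result is the same in either order; only the amount of work done for directories that fail the link count test changes.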
3a8df2b9c ocfs2: Enable indexed directories

Since the disk format is finalized, we can set this feature bit in the
supported mask.

Signed-off-by: Mark Fasheh
Acked-by: Joel Becker

Mark Fasheh
2009-04-04 02:39:16 +0800
e3a93c2db ocfs2: Add total entry count to dx_root_block

This little bit of extra accounting speeds up ocfs2_empty_dir()
dramatically by allowing us to short-circuit the full directory scan.

Signed-off-by: Mark Fasheh

Mark Fasheh
2009-04-04 02:39:16 +0800
198a1ca3b ocfs2: Increase max links count

Since we've now got a directory format capable of handling a large number of
entries, we can increase the maximum link count supported. This only gets
increased if the directory indexing feature is turned on.

Signed-off-by: Mark Fasheh
Acked-by: Joel Becker

Mark Fasheh
2009-04-04 02:39:16 +0800
e7c17e430 ocfs2: Introduce dir free space list

The only operation which doesn't get faster with directory indexing is
insert, which still has to walk the entire unindexed directory portion to
find a free block. This patch provides an improvement in directory insert
performance by maintaining a singly linked list of directory leaf blocks
which have space for additional dirents.

Signed-off-by: Mark Fasheh
Acked-by: Joel Becker

Mark Fasheh
2009-04-04 02:39:16 +0800
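The free space list from e7c17e430 can be pictured as a singly linked chain that contains only the leaf blocks with room left, so an insert does not have to visit full blocks. The sketch below is a user-space model with in-memory pointers; the structure and function names are hypothetical, and the on-disk list naturally links blocks by block number rather than by pointer.

```c
#include <stddef.h>

/* Toy directory leaf block that still has some free space. */
struct sketch_leaf {
    unsigned int free_bytes;
    struct sketch_leaf *next_free; /* next leaf with free space, or NULL */
};

/*
 * Walk the free list and return the first leaf that can hold a dirent
 * of `need` bytes, or NULL if no listed leaf has enough room.
 */
static struct sketch_leaf *sketch_find_space(struct sketch_leaf *head,
                                             unsigned int need)
{
    struct sketch_leaf *l;

    for (l = head; l != NULL; l = l->next_free) {
        if (l->free_bytes >= need)
            return l;
    }
    return NULL;
}
```

A NULL result corresponds to the case where the directory has to be extended with a new block.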
4ed8a6bb0 ocfs2: Store dir index records inline

Allow us to store a small number of directory index records in the
ocfs2_dx_root_block. This saves us a disk read on small to medium sized
directories (less than about 250 entries). The inline root is automatically
turned into a root block with extents if the directory size increases beyond
its capacity.

Signed-off-by: Mark Fasheh
Acked-by: Joel Becker

Mark Fasheh
2009-04-04 02:39:16 +0800
9b7895efa ocfs2: Add a name indexed b-tree to directory inodes

This patch makes use of Ocfs2's flexible btree code to add an additional
tree to directory inodes. The new tree stores an array of small,
fixed-length records in each leaf block. Each record stores a hash value
and a pointer to a block in the traditional (unindexed) directory tree where a
dirent with the given name hash resides. Lookup exclusively uses this tree
to find dirents, thus providing us with constant time name lookups.

Some of the hashing code was copied from ext3. Unfortunately, it has lots of
unfixed checkpatch errors. I left that as-is so that tracking changes would
be easier.

Signed-off-by: Mark Fasheh
Acked-by: Joel Becker

Mark Fasheh
2009-04-04 02:39:15 +0800
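The leaf records described in 9b7895efa pair a name hash with the block that holds the matching dirent. The following is a simplified sketch that assumes a single 32-bit hash per record and an unsorted leaf scanned linearly; the record layout and the `sketch_*` names are illustrative, not the on-disk ocfs2 format.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified fixed-length index record: hash of the name, block of the dirent. */
struct sketch_dx_entry {
    uint32_t name_hash;
    uint64_t dirent_blk;
};

/*
 * Collect the directory blocks whose records match `hash`. Different
 * names can share a hash, so more than one block may be returned and
 * the caller still has to compare names inside those blocks.
 * Returns the number of block numbers written to `out`.
 */
static size_t sketch_dx_lookup(const struct sketch_dx_entry *leaf, size_t n,
                               uint32_t hash, uint64_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < n && found < max_out; i++) {
        if (leaf[i].name_hash == hash)
            out[found++] = leaf[i].dirent_blk;
    }
    return found;
}
```

The lookup reads one index leaf and then only the candidate blocks, instead of scanning every block of the unindexed directory.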
4a12ca3a0 ocfs2: Introduce dir lookup helper struct

Many directory manipulation calls pass around a tuple of a dirent and its
containing buffer_head. Dir indexing has a bit more state, but instead of
adding yet more arguments to functions, we introduce 'struct
ocfs2_dir_lookup_result'. In this patch, it simply holds the same tuple, but
future patches will add more state.

Signed-off-by: Mark Fasheh
Acked-by: Joel Becker

Mark Fasheh
2009-04-04 02:39:15 +0800
59b526a30 ocfs2: Remove debugfs file local_alloc_stats

This patch removes the debugfs file local_alloc_stats as that information
is now included in the fs_state debugfs file.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:15 +0800
50397507e ocfs2: Expose the file system state via debugfs

This patch creates a per-mount debugfs file, fs_state, which exposes
information such as the cluster stack in use, the states of the downconvert,
recovery and commit threads, the number of journal txns, some allocation
stats, the list of all slots, etc.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:15 +0800
96a6c64b5 ocfs2: Move struct recovery_map to a header file

Move the definition of struct recovery_map from journal.c to journal.h. This
is preparation for the next patch.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:14 +0800
87d3d3f39 ocfs2/hb: Expose the list of heartbeating nodes via debugfs

This patch creates a debugfs file, o2hb/livenodes, which exposes the
aggregate list of heartbeating nodes across all heartbeat regions.

Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Sunil Mushran
2009-04-04 02:39:14 +0800

03 Apr, 2009

1 commit

8fe74cf05 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Remove two unneeded exports and make two symbols static in fs/mpage.c
Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
Trim includes of fdtable.h
Don't crap into descriptor table in binfmt_som
Trim includes in binfmt_elf
Don't mess with descriptor table in load_elf_binary()
Get rid of indirect include of fs_struct.h
New helper - current_umask()
check_unsafe_exec() doesn't care about signal handlers sharing
New locking/refcounting for fs_struct
Take fs_struct handling to new file (fs/fs_struct.c)
Get rid of bumping fs_struct refcount in pivot_root(2)
Kill unsharing fs_struct in __set_personality()

Linus Torvalds
2009-04-03 12:09:10 +0800

01 Apr, 2009

2 commits

c2ec175c3 mm: page_mkwrite change prototype to match fault

Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.

This makes it possible to return much more detailed error information to
the VM (and can also provide more information, e.g. virtual_address, to the
driver, which might be important in some special cases).

This is required for a subsequent fix, and will also make it easier to
merge page_mkwrite() with fault() in future.

Signed-off-by: Nick Piggin
Cc: Chris Mason
Cc: Trond Myklebust
Cc: Miklos Szeredi
Cc: Steven Whitehouse
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Artem Bityutskiy
Cc: Felix Blyakher
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2009-04-01 23:59:14 +0800
ce3b0f8d5 New helper - current_umask()

current->fs->umask is what most of fs_struct users are doing.
Put that into a helper function.

Signed-off-by: Al Viro

Al Viro
2009-04-01 11:00:26 +0800
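The helper introduced by ce3b0f8d5 wraps the current->fs->umask access in one function. Below is a user-space sketch of the idea, assuming mock task and fs structures; `sketch_current` stands in for the kernel's `current`, and `sketch_apply_umask()` is a hypothetical caller added only to show a typical use.

```c
/* Mock of the pieces of task state the helper touches. */
struct sketch_fs_struct {
    int umask;
};

struct sketch_task {
    struct sketch_fs_struct *fs;
};

static struct sketch_fs_struct sketch_fs = { 022 };
static struct sketch_task sketch_task0 = { &sketch_fs };

/* Stand-in for the kernel's `current` task pointer. */
static struct sketch_task *sketch_current = &sketch_task0;

/* One place that knows how to reach the umask of the running task. */
static int sketch_current_umask(void)
{
    return sketch_current->fs->umask;
}

/* Typical caller: clear the bits masked out by the umask. */
static unsigned int sketch_apply_umask(unsigned int mode)
{
    return mode & ~(unsigned int)sketch_current_umask();
}
```

Callers no longer need to know the layout of fs_struct, which is what allows its locking and refcounting to change later in the same series.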

28 Mar, 2009

1 commit

d8fba0ffe constify dentry_operations: OCFS2

Signed-off-by: Al Viro

Al Viro
2009-03-28 02:44:02 +0800

13 Mar, 2009

2 commits

712e53e46 ocfs2: Use xs->bucket to set xattr value outside

A long time ago, xs->base was allocated with a 4K size and all the contents
of the bucket were copied to it. Now we use ocfs2_xattr_bucket to
abstract the xattr bucket and xs->base is initialized to the start of
bu_bhs[0]. So xs->base + offset will overflow when the value root is
stored outside the first block.

So why have we survived the xattr tests until now? It is because we always
read the bucket contiguously now and the kernel mm allocates contiguous
memory for us. We are lucky, but we should fix it. So just get the
right value root as other callers do.

Signed-off-by: Tao Ma
Acked-by: Joel Becker
Signed-off-by: Mark Fasheh

Tao Ma
2009-03-13 07:46:09 +0800
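The overflow described in 712e53e46 comes from treating a bucket made of separate block buffers as one flat array. A minimal user-space sketch of block-aware addressing follows, assuming a bucket is modeled as an array of independently allocated buffers; `sketch_bucket` and `sketch_bucket_ptr()` are hypothetical names, not the ocfs2 xattr helpers.

```c
#include <stddef.h>

/* Toy bucket: several blocks that are NOT guaranteed to be contiguous. */
struct sketch_bucket {
    unsigned char **blocks; /* one separately allocated buffer per block */
    size_t nblocks;
    size_t blocksize;
};

/*
 * Translate a byte offset within the bucket into a pointer by first
 * selecting the block and then indexing inside it. Computing
 * blocks[0] + offset instead would run past the first buffer as soon
 * as offset >= blocksize.
 */
static unsigned char *sketch_bucket_ptr(const struct sketch_bucket *b,
                                        size_t offset)
{
    if (b->blocksize == 0)
        return NULL;

    size_t idx = offset / b->blocksize;

    if (idx >= b->nblocks)
        return NULL; /* offset lies outside the bucket */
    return b->blocks[idx] + offset % b->blocksize;
}
```

The flat calculation only appears to work when the buffers happen to be adjacent in memory, which is the "lucky" case the commit message mentions.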
74e77eb30 ocfs2: Fix a bug found by sparse check.

We need to use le32_to_cpu to test rec->e_cpos in
ocfs2_dinode_insert_check.

Signed-off-by: Tao Ma
Acked-by: Joel Becker
Signed-off-by: Mark Fasheh

Tao Ma
2009-03-13 07:46:01 +0800
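The class of bug fixed in 74e77eb30 is comparing an on-disk little-endian field without converting it to CPU byte order first. The sketch below shows the conversion in portable user-space C; it takes the raw bytes as input, which is an assumption made for the example and differs from the kernel's le32_to_cpu(), which takes a __le32 value.

```c
#include <stdint.h>

/*
 * Assemble a 32-bit value from four bytes stored least significant
 * byte first. The result is correct on both little-endian and
 * big-endian hosts because it never reinterprets memory directly.
 */
static uint32_t sketch_le32_to_cpu(const uint8_t bytes[4])
{
    return (uint32_t)bytes[0] |
           ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) |
           ((uint32_t)bytes[3] << 24);
}
```

On a little-endian machine the unconverted comparison happens to give the right answer, which is why such bugs tend to be found by sparse rather than by testing.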