Eric Lee / smarc-fsl-linux-kernel

08 Sep, 2010

1 commit

e49e27674 ocfs2: allow return of new inode block location before allocation of the inode

This helps code which needs to know the eventual block number of an inode
but can't allocate it yet due to transaction or lock ordering. For example,
ocfs2_create_inode_in_orphan() currently gives a junk blkno for preparation
of the orphan dir because it can't yet know where the actual inode is placed
- that code is actually in ocfs2_mknod_locked. This is a problem when the
orphan dirs are indexed as the junk inode number will create an index entry
which goes unused (and fails the later removal from the orphan dir). Now
with these interfaces, ocfs2_create_inode_in_orphan() can run the block
group search (and get back the inode block number) *before* any actual
allocation occurs.

Signed-off-by: Mark Fasheh
Signed-off-by: Tao Ma

Mark Fasheh
2010-09-08 14:25:59 +0800
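
A minimal sketch of the search-then-claim split described in this commit, assuming a toy allocation group with a 64-bit bitmap. The names toy_group, toy_find_new_inode_loc() and toy_claim_inode_at_loc() are hypothetical and are not the kernel interfaces; the point is only that the search reports a block number without modifying the bitmap, and the claim happens later.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GROUP_BITS 64u

/* Hypothetical stand-in for an allocation group: one bit per inode block. */
struct toy_group {
    uint64_t start_blkno;   /* block number that bit 0 maps to */
    uint64_t bitmap;        /* a set bit means the block is in use */
};

/*
 * Phase 1: search only. Reports which bit (and block number) a later
 * claim would use, without touching the bitmap.
 * Returns 0 on success, -1 if the group is full.
 */
static int toy_find_new_inode_loc(const struct toy_group *grp,
                                  unsigned int *bit, uint64_t *blkno)
{
    for (unsigned int i = 0; i < TOY_GROUP_BITS; i++) {
        if (!(grp->bitmap & (UINT64_C(1) << i))) {
            *bit = i;
            *blkno = grp->start_blkno + i;
            return 0;
        }
    }
    return -1;
}

/* Phase 2: claim the bit found earlier, e.g. once a transaction is open. */
static void toy_claim_inode_at_loc(struct toy_group *grp, unsigned int bit)
{
    assert(bit < TOY_GROUP_BITS);
    assert(!(grp->bitmap & (UINT64_C(1) << bit)));
    grp->bitmap |= UINT64_C(1) << bit;
}
```

In this sketch a caller would use the block number from phase 1 to prepare the orphan directory entry, then run phase 2 once the transaction and locks allow it.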

06 May, 2010

2 commits

1ed9b777f ocfs2: ocfs2_claim_*() don't need an ocfs2_super argument.

They all take an ocfs2_alloc_context, which has the allocation inode.

Signed-off-by: Joel Becker
Signed-off-by: Tao Ma

Joel Becker
2010-05-06 13:59:06 +0800
d02f00cc0 ocfs2: allocation reservations

This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.

Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inode's window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.

Signed-off-by: Mark Fasheh

Mark Fasheh
2010-05-06 09:17:30 +0800
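
As a rough illustration of placing a new reservation window "as close to the previous one as possible", the sketch below searches forward from a goal bit for a gap between existing windows. It assumes a sorted array in place of the red-black tree, searches forward only, and leaves out the LRU and the stealing path; toy_resv and toy_find_window() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical reservation window: bits [start, start + len). */
struct toy_resv {
    unsigned int start;
    unsigned int len;
};

/*
 * Find the first position at or after `goal` where a window of `len`
 * bits fits without overlapping any existing reservation.
 * `resv` must be sorted by start and non-overlapping.
 * Returns the start bit, or -1 if nothing fits before `bitmap_bits`.
 */
static int toy_find_window(const struct toy_resv *resv, unsigned int n,
                           unsigned int bitmap_bits,
                           unsigned int goal, unsigned int len)
{
    unsigned int pos = goal;

    assert(len > 0);
    for (unsigned int i = 0; i < n; i++) {
        unsigned int end = resv[i].start + resv[i].len;

        if (end <= pos)
            continue;               /* window lies entirely before pos */
        if (resv[i].start >= pos + len)
            break;                  /* the gap before this window fits */
        pos = end;                  /* overlap: skip past this window */
    }
    return pos + len <= bitmap_bits ? (int)pos : -1;
}
```

Passing the end of the exhausted window as `goal` keeps the new window near the old one in this model.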

13 Apr, 2010

1 commit

7d1fe093b ocfs2: Pass suballocation results back via a structure.

We're going to be adding more info to a suballocator allocation. Rather
than growing every function in the chain, let's pass a result structure
around.

Signed-off-by: Joel Becker
Signed-off-by: Tao Ma

Joel Becker
2010-04-13 14:30:19 +0800
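
The idea of returning a result structure instead of a growing list of out-parameters can be sketched as follows. This is an illustration only: toy_suballoc_result, its fields, and toy_claim_bits() are hypothetical and operate on a 32-bit toy bitmap.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result record; new fields can be added without
 * changing the signature of every function in the call chain. */
struct toy_suballoc_result {
    uint64_t group_blkno;     /* group the bits came from */
    unsigned int bit_offset;  /* first bit claimed in that group */
    unsigned int num_bits;    /* how many bits were claimed */
};

/*
 * Claim up to `wanted` contiguous bits starting at the first free bit.
 * Returns 0 and fills `res` on success, -1 if the bitmap is full.
 */
static int toy_claim_bits(uint32_t *bitmap, uint64_t group_blkno,
                          unsigned int wanted,
                          struct toy_suballoc_result *res)
{
    assert(wanted > 0);
    for (unsigned int i = 0; i < 32; i++) {
        unsigned int n = 0;

        if (*bitmap & (UINT32_C(1) << i))
            continue;
        /* Extend the run while bits stay free and we still want more. */
        while (i + n < 32 && n < wanted &&
               !(*bitmap & (UINT32_C(1) << (i + n)))) {
            *bitmap |= UINT32_C(1) << (i + n);
            n++;
        }
        res->group_blkno = group_blkno;
        res->bit_offset = i;
        res->num_bits = n;
        return 0;
    }
    return -1;
}
```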

26 Mar, 2010

1 commit

2b6cb576a ocfs2: Set suballoc_loc on allocated metadata.

Get the suballoc_loc from ocfs2_claim_new_inode() or
ocfs2_claim_metadata(). Store it on the appropriate field of the block
we just allocated.

Signed-off-by: Joel Becker

Joel Becker
2010-03-26 10:09:15 +0800

24 Mar, 2010

1 commit

b4414eea0 ocfs2: Clear undo bits when local alloc is freed

When the local alloc file changes windows, unused bits are freed back to the
global bitmap. By definition, those bits cannot be in use by any file. Also,
the local alloc will never have been able to allocate those bits if they
were part of a previous truncate. Therefore it makes sense that we should
clear unused local alloc bits in the undo buffer so that they can be used
immediately.

[ Modified to call it ocfs2_release_clusters() -- Joel ]

Signed-off-by: Mark Fasheh
Signed-off-by: Joel Becker

Mark Fasheh
2010-03-24 09:22:40 +0800

27 Feb, 2010

1 commit

b89c54282 ocfs2: add extent block stealing for ocfs2 v5

This patch adds an extent block (metadata) stealing mechanism for
extent allocation. The mechanism is the same as inode stealing:
if there is no room in the slot-specific extent_alloc, we try to
allocate an extent block from the next slot.

Signed-off-by: Tiger Yang
Acked-by: Tao Ma
Signed-off-by: Joel Becker

Tiger Yang
2010-02-27 07:41:07 +0800
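
A minimal sketch of the stealing order, under the assumption that slots are tried in order starting from the local one and wrapping around. The commit message only says "the next slot", so the wrap-around and the function name toy_claim_extent_block() are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Try the caller's own slot first, then each following slot in turn.
 * `free_per_slot[i]` is a toy count of free extent blocks in slot i.
 * Returns the slot the block was taken from, or -1 if all are empty.
 */
static int toy_claim_extent_block(unsigned int *free_per_slot,
                                  unsigned int num_slots,
                                  unsigned int my_slot)
{
    assert(my_slot < num_slots);
    for (unsigned int i = 0; i < num_slots; i++) {
        unsigned int slot = (my_slot + i) % num_slots;

        if (free_per_slot[slot] > 0) {
            free_per_slot[slot]--;   /* i > 0 means we stole the block */
            return (int)slot;
        }
    }
    return -1;
}
```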

04 Apr, 2009

2 commits

6ca497a83 ocfs2: fix rare stale inode errors when exporting via nfs

For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
ocfs2_get_dentry() may read from disk when the inode is not in memory,
without any cross cluster lock. This leads to the file system loading a
stale inode.

This patch fixes the above problem.

The solution: when the inode is not in memory, we take the cluster
lock (PR) on the alloc inode from which the inode in question was allocated
(this makes the node on which the deletion was done sync the alloc inode)
before reading the inode itself. Then we check the bitmap in the group
(the one the inode in question was allocated from) to see if the bit is
clear. If it's clear, the inode is stale. If the bit is set, we then check
the generation as the existing code does.

We have to read the inode in question from disk first to learn its alloc
slot and alloc bit. If it's not stale, we read it using ocfs2_iget().
The second read should then come from cache.

We also have to add a per-superblock nfs_sync_lock to cover the lock on the
alloc inode and the lock on the inode in question. This is because
ocfs2_get_dentry() and ocfs2_delete_inode() take them in reverse order.
nfs_sync_lock is taken in EX mode in ocfs2_get_dentry() and in PR mode in
ocfs2_delete_inode(), so that multiple ocfs2_delete_inode() calls can run
concurrently in the normal case.

[mfasheh@suse.com: build warning fixes and comment cleanups]
Signed-off-by: Wengang Wang
Acked-by: Joel Becker
Signed-off-by: Mark Fasheh

wengang wang
2009-04-04 02:39:25 +0800
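
The two-step staleness test described above (allocation bit first, then generation) might look like the sketch below. All cluster locking is left out, and toy_fh_status, toy_test_bit() and toy_check_handle() are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_fh_status { TOY_FH_OK, TOY_FH_STALE };

/* Test one bit in a byte-array bitmap. */
static bool toy_test_bit(const uint8_t *bitmap, unsigned int bit)
{
    return (bitmap[bit / 8] >> (bit % 8)) & 1u;
}

/*
 * A file handle is stale if the inode's allocation bit is clear
 * (the inode was freed), or if the bit is set but the on-disk
 * generation no longer matches the one stored in the handle.
 */
static enum toy_fh_status toy_check_handle(const uint8_t *group_bitmap,
                                           unsigned int alloc_bit,
                                           uint32_t disk_generation,
                                           uint32_t fh_generation)
{
    assert(group_bitmap);
    if (!toy_test_bit(group_bitmap, alloc_bit))
        return TOY_FH_STALE;
    if (disk_generation != fh_generation)
        return TOY_FH_STALE;
    return TOY_FH_OK;
}
```
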
138211515 ocfs2: Optimize inode allocation by remembering last group

In ocfs2, the inode block search looks for the "emptiest" inode
group to allocate from. So if an inode alloc file has many equally
(or almost equally) empty groups, new inodes will tend to get
spread out amongst them, which in turn can put them all over the
disk. This is undesirable because directory operations on conceptually
"nearby" inodes force a large number of seeks.

So we add ip_last_used_group to in-core directory inodes, which records
the last used allocation group. Another field named ip_last_used_slot
is also added in case inode stealing happens. When claiming a new inode,
we pass in the directory's inode so that the allocation can use this
information.
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh

Tao Ma
2009-04-04 02:39:17 +0800
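
A small sketch of the policy change, assuming each group is reduced to a free-bit count: prefer the group the parent directory used last, and fall back to the "emptiest" group only when that one is full or unknown. toy_pick_inode_group() is a hypothetical helper, and a negative last_used_group stands for "no hint recorded".

```c
#include <assert.h>

/*
 * Returns the index of the group to allocate from, or -1 if every
 * group is full. `free_bits[i]` is the number of free inodes in group i.
 */
static int toy_pick_inode_group(const unsigned int *free_bits,
                                unsigned int ngroups,
                                int last_used_group)
{
    int best = -1;

    assert(free_bits);
    /* Keep nearby inodes together when the remembered group has room. */
    if (last_used_group >= 0 &&
        (unsigned int)last_used_group < ngroups &&
        free_bits[last_used_group] > 0)
        return last_used_group;

    /* Otherwise fall back to the emptiest group, as before. */
    for (unsigned int i = 0; i < ngroups; i++) {
        if (free_bits[i] > 0 &&
            (best < 0 || free_bits[i] > free_bits[best]))
            best = (int)i;
    }
    return best;
}
```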

06 Jan, 2009

3 commits

970e4936d ocfs2: Validate metadata only when it's read from disk.

Add an optional validation hook to ocfs2_read_blocks(). Now the
validation function is only called when a block was actually read off of
disk. It is not called when the buffer was in cache.

We add a buffer state bit BH_NeedsValidate to flag these buffers. It
must always be one higher than the last JBD2 buffer state bit.

The dinode, dirblock, extent_block, and xattr_block validators are
lifted to this scheme directly. The group_descriptor validator needs to
be split into two pieces. The first part only needs the gd buffer and
is passed to ocfs2_read_block(). The second part requires the dinode as
well, and is called every time. It's only 3 compares, so it's tiny.
This also allows us to clean up the non-fatal gd check used by resize.c.
It now has no magic argument.

Signed-off-by: Joel Becker
Signed-off-by: Mark Fasheh

Joel Becker
2009-01-06 00:36:53 +0800
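
The optional validation hook can be sketched as below. This is a simplified model: a single `uptodate` flag stands in for the buffer cache and for the BH_NeedsValidate state bit, and toy_buffer, toy_read_block() and toy_check_signature() are hypothetical names. The validator runs only when the block was actually read, never on a cache hit.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

struct toy_buffer {
    bool uptodate;               /* true once the block is cached */
    char data[TOY_BLOCK_SIZE];
};

typedef int (*toy_validate_fn)(const struct toy_buffer *bh);

/* Example validator: counts its calls and checks a 4-byte signature. */
static int toy_validate_calls;

static int toy_check_signature(const struct toy_buffer *bh)
{
    toy_validate_calls++;
    return memcmp(bh->data, "TOY1", 4) == 0 ? 0 : -1;
}

/*
 * Read a block, validating it only if it came from "disk".
 * `disk` must point to at least TOY_BLOCK_SIZE bytes.
 */
static int toy_read_block(struct toy_buffer *bh, const char *disk,
                          toy_validate_fn validate)
{
    assert(bh && disk);
    if (bh->uptodate)
        return 0;                /* cache hit: no read, no validation */

    memcpy(bh->data, disk, TOY_BLOCK_SIZE);   /* stand-in for disk I/O */
    if (validate) {
        int rc = validate(bh);

        if (rc)
            return rc;           /* buffer stays not-uptodate */
    }
    bh->uptodate = true;
    return 0;
}
```
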
68f64d471 ocfs2: Wrap group descriptor reads in a dedicated function.

We have a clean call for validating group descriptors, but every place
that wants one always does a read_block()+validate() call pair. Create
a toplevel ocfs2_read_group_descriptor() that does the right
thing. This allows us to leverage the single call point later for
fancier handling. We also add validation of gd->bg_generation against
the superblock and gd->bg_blkno against the block we thought we read.

Signed-off-by: Joel Becker
Signed-off-by: Mark Fasheh

Joel Becker
2009-01-06 00:36:53 +0800
57e3e7971 ocfs2: Consolidate validation of group descriptors.

Currently the validation of group descriptors is directly duplicated so
that one version can error the filesystem and the other (resize) can
just report the problem. Consolidate to one function that takes a
boolean. Wrap that function with the old call for the old users.

This is in preparation for lifting the read+validate step into a
single function.

Signed-off-by: Joel Becker
Signed-off-by: Mark Fasheh

Joel Becker
2009-01-06 00:36:53 +0800

14 Oct, 2008

7 commits

1187c9688 ocfs2: Limit inode allocation to 32bits.

ocfs2 inode numbers are block numbers. For any filesystem with less
than 2^32 blocks, this is not a problem. However, when ocfs2 starts
using JBD2, it will be able to support filesystems with more than 2^32
blocks. This would result in inode numbers higher than 2^32.

The problem is that stat(2) can't handle those numbers on 32bit
machines. The simple solution is to have ocfs2 allocate all inodes
below that boundary.

The suballoc code is changed to honor an optional block limit. Only the
inode suballocator sets that limit - all other allocations stay unlimited.

The biggest trick is to grow the inode suballocator beneath that limit.
There's no point in allocating block groups that are above the limit,
then rejecting their elements later on. We want to prevent the inode
allocator from ever having block groups above the limit. This involves
a little gyration with the local alloc code. If the local alloc window
is above the limit, it signals the caller to try the global bitmap but
does not disable the local alloc file (which can be used for other
allocations).

[ Minor cleanup - removed an ML_NOTICE comment. --Mark ]

Signed-off-by: Joel Becker
Signed-off-by: Mark Fasheh

Joel Becker
2008-10-14 07:57:07 +0800
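
A sketch of an optional block limit, assuming that a limit of 0 means "unlimited" and that a group is acceptable only if its last block fits under the limit. Both conventions and the names toy_block_within_limit() and toy_pick_group() are assumptions for illustration; the commit itself does not spell them out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* max_block == 0 means "no limit" in this sketch. */
static bool toy_block_within_limit(uint64_t blkno, uint64_t max_block)
{
    return max_block == 0 || blkno <= max_block;
}

/*
 * Pick the first group whose blocks all sit at or below the limit,
 * so that no element of the group has to be rejected later.
 * Returns the group index, or -1 if no group qualifies.
 */
static int toy_pick_group(const uint64_t *group_start, unsigned int ngroups,
                          uint64_t blocks_per_group, uint64_t max_block)
{
    assert(blocks_per_group > 0);
    for (unsigned int i = 0; i < ngroups; i++) {
        uint64_t last = group_start[i] + blocks_per_group - 1;

        if (toy_block_within_limit(last, max_block))
            return (int)i;
    }
    return -1;
}
```

An inode allocator in this model would pass UINT32_MAX as the limit, while other allocators would pass 0.
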
f99b9b7cc ocfs2: Make ocfs2_extent_tree the first-class representation of a tree.

We now have three different kinds of extent trees in ocfs2: inode data
(dinode), extended attributes (xattr_tree), and extended attribute
values (xattr_value). There is a nice abstraction for them,
ocfs2_extent_tree, but it is hidden in alloc.c. All the calling
functions have to pick amongst a varied API and pass in type bits and
often extraneous pointers.

A better way is to make ocfs2_extent_tree a first-class object.
Everyone converts their object to an ocfs2_extent_tree via the
ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all
tree calls to alloc.c.

This simplifies a lot of callers, improving readability. It also
provides an easy way to add additional extent tree types, as they only
need to be defined in alloc.c with an ocfs2_get_*_extent_tree()
function.

Signed-off-by: Joel Becker
Signed-off-by: Mark Fasheh

Joel Becker
2008-10-14 07:57:05 +0800
cf1d6c763 ocfs2: Add extended attribute support

This patch implements storing extended attributes either in the inode or in
a single external block. We only store EAs in-inode when blocksize > 512 or
when the inode block has free space for them. When an EA's value is larger
than 80 bytes, we store the value via a b-tree outside the inode or block.

Signed-off-by: Tiger Yang
Signed-off-by: Mark Fasheh

Tiger Yang
2008-10-14 07:57:02 +0800
f56654c43 ocfs2: Add extent tree operation for xattr value btrees

Add some thin wrappers around ocfs2_insert_extent() for each of the 3
different btree types, ocfs2_dinode_insert_extent(),
ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The
last is for the xattr index btree, which will be used in a followup patch.

All the old callers in file.c etc. will call ocfs2_dinode_insert_extent(),
while the other two handle the xattr cases. Initialization of the extent
tree is also handled by these functions.

When storing an xattr value which is too large, we will allocate some clusters
for it, and here ocfs2_extent_list and ocfs2_extent_rec will also be used. In
order to re-use the b-tree operation code, a new parameter named "private"
is added to ocfs2_extent_tree, and it is used to indicate the root of the
ocfs2_extent_list. The reason is that we can't deduce the root from the
buffer_head now. It may be in an inode, an ocfs2_xattr_block or, even worse,
anywhere in an ocfs2_xattr_bucket.

Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh

Tao Ma
2008-10-14 07:57:01 +0800
e7d4cb6bc ocfs2: Abstract ocfs2_extent_tree in b-tree operations.

In the old extent tree operations, we assume that we
are using the ocfs2_extent_list in ocfs2_dinode as the tree root.
As xattrs will also use ocfs2_extent_list to store a large value
for an xattr entry, we refactor the tree operations so that the xattr
code can use them directly.

The refactoring includes 4 steps:
1. Abstract set/get of last_eb_blk and update_clusters since they may
be stored in different location for dinode and xattr.
2. Add a new structure named ocfs2_extent_tree to indicate the
extent tree the operation will work on.
3. Remove all uses of fe_bh and di; use root_bh and root_el in the
extent tree instead. So now every fe_bh is replaced with
et->root_bh, and el with root_el accordingly.
4. Make ocfs2_lock_allocators generic. Currently it can only be used
for file extend allocation, but the whole function is useful when we want
to store large EAs.

Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
for anything other than truncate inode data btrees.

Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh

Tao Ma
2008-10-14 04:57:58 +0800
811f933df ocfs2: Use ocfs2_extent_list instead of ocfs2_dinode.

ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
they are all limited to an inode btree because they use a struct
ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
(the part of an ocfs2_dinode they actually use) so that the xattr btree code
can use these functions.

Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh

Tao Ma
2008-10-14 04:57:58 +0800
9c7af40b2 ocfs2: throttle back local alloc when low on disk space

Ocfs2's local allocator disables itself for the duration of a mount point
when it has trouble allocating a large enough area from the primary bitmap.
That can cause performance problems, especially for disks which were only
temporarily full or fragmented. This patch allows for the allocator to
shrink its window first, before being disabled. Later, it can also be
re-enabled so that any performance drop is minimized.

To do this, we allow the value of osb->local_alloc_bits to be shrunk when
needed. The default value is recorded in a mostly read-only variable so that
we can re-initialize when required.

Locking had to be updated so that we could protect changes to
local_alloc_bits. Mostly this involves protecting various local alloc values
with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
is used when the local allocator has shrunk, but is not disabled. If the
available space dips below 1 megabyte, the local alloc file is disabled. In
either case, local alloc is re-enabled 30 seconds after the event, or when
an appropriate amount of bits is seen in the primary bitmap.

Signed-off-by: Mark Fasheh

Mark Fasheh
2008-10-14 04:57:57 +0800
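
The throttling behaviour can be modelled as a small state machine. The sketch below makes several assumptions that the commit message does not state: the window is halved on each failure, the 1 megabyte threshold is applied to the shrunken window size, and re-enabling simply restores the default. The names toy_local_alloc, toy_la_alloc_failed() and toy_la_reenable() are hypothetical, and the 30 second timer and locking are left out.

```c
#include <assert.h>

enum toy_la_state { TOY_LA_ENABLED, TOY_LA_THROTTLED, TOY_LA_DISABLED };

struct toy_local_alloc {
    enum toy_la_state state;
    unsigned int bits;          /* current window size, in clusters */
    unsigned int default_bits;  /* default recorded at mount time */
    unsigned int cluster_size;  /* bytes per cluster */
};

/* Called when a window of `la->bits` clusters could not be allocated. */
static void toy_la_alloc_failed(struct toy_local_alloc *la)
{
    unsigned int min_bits, shrunk;

    assert(la->cluster_size > 0);
    min_bits = (1024u * 1024u) / la->cluster_size;  /* 1 MB in clusters */
    shrunk = la->bits / 2;                          /* assumed policy */

    if (shrunk < min_bits) {
        la->state = TOY_LA_DISABLED;
    } else {
        la->bits = shrunk;
        la->state = TOY_LA_THROTTLED;
    }
}

/* Called later, e.g. from a timer, to restore the default window. */
static void toy_la_reenable(struct toy_local_alloc *la)
{
    la->bits = la->default_bits;
    la->state = TOY_LA_ENABLED;
}
```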

18 Apr, 2008

1 commit

a4a489116 ocfs2: Add ac_alloc_slot in ocfs2_alloc_context

In inode stealing, we no longer restrict the allocation to
happen in the local node. So it is necessary for us to add
a new member in ocfs2_alloc_context to indicate which slot
we are using for allocation. We also modify the process of
local alloc so that this member can be used there also.

Signed-off-by: Tao Ma
Signed-off-by: Sunil Mushran
Signed-off-by: Mark Fasheh

Tao Ma
2008-04-18 23:56:10 +0800

26 Jan, 2008

1 commit

d659072f7 [PATCH 1/2] ocfs2: Add group extend for online resize

This patch adds the ability for a userspace program to request an extend of
last cluster group on an Ocfs2 file system. The request is made via ioctl,
OCFS2_IOC_GROUP_EXTEND. This is derived from EXT3_IOC_GROUP_EXTEND, but is
obviously Ocfs2 specific.

tunefs.ocfs2 would call this for an online-resize operation if the last
cluster group isn't full.

Signed-off-by: Tao Ma
Signed-off-by: Mark Fasheh

Tao Ma
2008-01-26 06:53:35 +0800

21 Sep, 2007

1 commit

415cb8003 ocfs2: Allow smaller allocations during large writes

The ocfs2 write code loops through a page much like the block code, except
that ocfs2 allocation units can be any size, including larger than page
size. Typically it's equal to or larger than page size - most kernels run 4k
pages, the minimum ocfs2 allocation (cluster) size.

Some changes introduced during 2.6.23 changed the way writes to pages are
handled, and inadvertently broke support for > 4k page size. Instead of just
writing one cluster at a time, we now handle the whole page in one pass.

This means that multiple (small) separate allocations might happen in the
same pass. The allocation code however typically optimizes by getting the
maximum which was reserved. This triggered a BUG_ON in the extend code where
it'd ask for a single bit (for one part of a > 4k page) and get back more
than it asked for.

Fix this by providing a variant of the high level allocation function which
allows the caller to specify a maximum. The traditional function remains and
just calls the new one with a maximum determined from the initial
reservation.

Signed-off-by: Mark Fasheh

Mark Fasheh
2007-09-21 06:06:09 +0800
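
The relationship between the capped variant and the traditional function can be sketched like this. It is a toy model in which the allocator always hands out as much as it is allowed to; toy_alloc_context, toy_claim_clusters_max() and toy_claim_clusters() are hypothetical names rather than the functions touched by the patch.

```c
#include <assert.h>

/* Hypothetical allocation context: how much was reserved and used. */
struct toy_alloc_context {
    unsigned int bits_wanted;   /* size of the initial reservation */
    unsigned int bits_given;    /* bits handed out so far */
};

/*
 * New variant: the caller states a maximum, so a request for one
 * cluster never comes back with more than one.
 * Returns the number of bits claimed, or 0 if the request cannot be met.
 */
static unsigned int toy_claim_clusters_max(struct toy_alloc_context *ac,
                                           unsigned int min_bits,
                                           unsigned int max_bits)
{
    unsigned int left, n;

    assert(ac->bits_given <= ac->bits_wanted);
    left = ac->bits_wanted - ac->bits_given;
    if (min_bits == 0 || min_bits > max_bits || left < min_bits)
        return 0;

    n = left < max_bits ? left : max_bits;
    ac->bits_given += n;
    return n;
}

/* Traditional entry point: the maximum is whatever is still reserved. */
static unsigned int toy_claim_clusters(struct toy_alloc_context *ac,
                                       unsigned int min_bits)
{
    unsigned int max_bits = ac->bits_wanted - ac->bits_given;

    return toy_claim_clusters_max(ac, min_bits, max_bits);
}
```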

11 Jul, 2007

2 commits

59a5e416d ocfs2: plug truncate into cached dealloc routines

Signed-off-by: Mark Fasheh

Mark Fasheh
2007-07-11 08:31:55 +0800
2b604351b ocfs2: simplify deallocation locking

Deallocation of suballocator blocks, most notably extent blocks, might
involve multiple suballocator inodes.

The locking for this can get extremely complicated, especially when the
suballocator inodes to delete from aren't known until deep within an
unrelated codepath.

Implement a simple scheme for recording the blocks to be unlinked so that
the actual deallocation can be done in a context which won't deadlock.

Signed-off-by: Mark Fasheh

Mark Fasheh
2007-07-11 08:31:54 +0800
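
A minimal sketch of "record now, free later", assuming a singly linked list of block numbers. The names toy_dealloc_ctxt, toy_cache_block_dealloc() and toy_run_deallocs() are hypothetical, and the callback stands in for whatever actually releases the block once suballocator locks can be taken safely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_dealloc_item {
    uint64_t blkno;
    struct toy_dealloc_item *next;
};

struct toy_dealloc_ctxt {
    struct toy_dealloc_item *head;
};

/* Deep in an unrelated code path: only remember the block. */
static int toy_cache_block_dealloc(struct toy_dealloc_ctxt *ctxt,
                                   uint64_t blkno)
{
    struct toy_dealloc_item *item;

    assert(ctxt);
    item = malloc(sizeof(*item));
    if (!item)
        return -1;
    item->blkno = blkno;
    item->next = ctxt->head;
    ctxt->head = item;
    return 0;
}

/*
 * Later, from a context that cannot deadlock: release everything.
 * Returns the number of blocks processed.
 */
static unsigned int toy_run_deallocs(struct toy_dealloc_ctxt *ctxt,
                                     void (*free_block)(uint64_t blkno))
{
    unsigned int n = 0;

    assert(ctxt);
    while (ctxt->head) {
        struct toy_dealloc_item *item = ctxt->head;

        ctxt->head = item->next;
        if (free_block)
            free_block(item->blkno);
        free(item);
        n++;
    }
    return n;
}
```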

02 Dec, 2006

2 commits

1fabe1481 ocfs2: Remove struct ocfs2_journal_handle in favor of handle_t

This is mostly a search and replace as ocfs2_journal_handle is now no more
than a container for a handle_t pointer.

ocfs2_commit_trans() becomes very straightforward, and we remove some
out-of-date comments / code.

Signed-off-by: Mark Fasheh

Mark Fasheh
2006-12-02 10:28:28 +0800
da5cbf2f9 ocfs2: don't use handle for locking in allocation functions

Instead we record our state on the allocation context structure which all
callers already know about and lifetime correctly. This means the
reservation functions don't need a handle passed in any more, and we can
also take it off the alloc context.

Signed-off-by: Mark Fasheh

Mark Fasheh
2006-12-02 10:27:49 +0800

08 Aug, 2006

1 commit

883d4cae4 ocfs2: allocation hints

Record the most recently used allocation group on the allocation context, so
that subsequent allocations can attempt to optimize for contiguousness.
Local alloc especially should benefit from this as the current chain search
tends to let it spew across the disk.

Signed-off-by: Mark Fasheh

Mark Fasheh
2006-08-08 02:07:01 +0800

04 Jan, 2006

1 commit

ccd979bdb [PATCH] OCFS2: The Second Oracle Cluster Filesystem

The OCFS2 file system module.

Signed-off-by: Mark Fasheh
Signed-off-by: Kurt Hackel

Mark Fasheh
2006-01-04 03:45:47 +0800