07 Jan, 2011
1 commit
-
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.Signed-off-by: Nick Piggin
29 Oct, 2010
1 commit
-
Signed-off-by: Al Viro
26 Oct, 2010
4 commits
-
Clones an existing reference to inode; caller must already hold one.
Signed-off-by: Al Viro
-
Add a new helper to write out the inode using the writeback code,
that is including the correct dirty bit and list manipulation. A few
of filesystems already opencode this, and a lot of others should be
using it instead of using write_inode_now which also writes out the
data.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro -
exofs_new_inode() was incrementing the inode->i_count and
decrementing it in create_done(), in a bad attempt to make sure
the inode will still be there when the asynchronous create_done()
finally arrives. This was very stupid because iput() was not called,
and if it was actually needed, it would leak the inode.However all this is not needed, because at exofs_evict_inode()
we already wait for create_done() by waiting for the
object_created event. Therefore remove the superfluous ref counting
and just Thicken the comment at exofs_evict_inode() a bit.While at it change places that open coded wait_obj_created()
to call the already available wrapper.CC: Dave Chinner
CC: Christoph Hellwig
CC: Nick Piggin
Signed-off-by: Boaz Harrosh -
Signed-off-by: Joe Perches
Signed-off-by: Boaz Harrosh
19 Oct, 2010
2 commits
-
Though it has been promised that inode->i_mapping->backing_dev_info
is not used and the supporting code is fine. Until the pointer
will default to NULL, I'd rather it points to the correct thing
regardless.At least for future infrastructure coder it is a clear indication
of where are the key points that inodes are initialized.
I know because it took me time to find this out.Signed-off-by: Boaz Harrosh
-
Last BUG fix added a flag to the the page_collect structure
to communicate with readpage_strip. This calls for a clean up
removing that flag's reincarnations in the read functions
parameters.Signed-off-by: Boaz Harrosh
08 Oct, 2010
1 commit
-
This BUG is there since the first submit of the code, but only triggered
in last Kernel. It's timing related do to the asynchronous object-creation
behaviour of exofs. (Which should be investigated farther)The bug is obvious hence the fixed.
Signed-off-by: Boaz Harrosh
12 Aug, 2010
1 commit
-
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: Fix groups code when num_devices is not divisible by group_width
exofs: Remove useless optimization
exofs: exofs_file_fsync and exofs_file_flush correctness
exofs: Remove superfluous dependency on buffer_head and writeback
11 Aug, 2010
1 commit
-
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
10 Aug, 2010
3 commits
-
Signed-off-by: Al Viro
-
These changes are crafted based on the similar
conversion done to ext2 by Nick Piggin.* Remove the deprecated ->truncate vector. Let exofs_setattr
take care of on-disk size updates.
* Call truncate_pagecache on the unused pages if
write_begin/end fails.
* Cleanup exofs_delete_inode that did stupid inode
writes and updates on an inode that will be
removed.
* And finally get rid of exofs_get_block. We never
had any blocks it was all for calling nobh_truncate_page.
nobh_truncate_page is not actually needed in exofs since
the last page is complete and gone, just like all the other
pages. There is no partial blocks in exofs.I've tested with this patch, and there are no apparent
failures, so far.CC: Nick Piggin
CC: Christoph Hellwig
Signed-off-by: Boaz Harrosh
Signed-off-by: Al Viro -
Replace inode_setattr with opencoded variants of it in all callers. This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:spufs: explicitly checks for ATTR_SIZE earlier
btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
ufs: contains an opencoded simple_seattr + truncate that sets the filesize just aboveIn addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro
08 Aug, 2010
1 commit
-
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
04 Aug, 2010
4 commits
-
There is a bug when num_devices is not divisible by group_width * mirrors.
We would not return to the proper device and offset when looping on to the
next group.The fix makes code simpler actually.
Signed-off-by: Boaz Harrosh
-
We used to compact all used devices in an IO to the beginning
of the device array in an io_state. And keep a last device used
so in later loops we don't iterate on all device slots. This
does not prevent us from checking if slots are empty since in
reads we only read from a single mirror and jump to the next
mirror-set.This optimization is marginal, and needlessly complicates the
code. Specially when we will later want to support raid/456
with same abstract code. So remove the distinction between
"dev" and "comp". Only "dev" is used both as the device used
and as the index (component) in the device array.[Note that now the io_state->dev member is redundant but I
keep it because I might want to optimize by only IOing a
single group, though keeping a group_width*mirrors devices
in io_state, we now keep num-devices in each io_state]Signed-off-by: Boaz Harrosh
-
As per Christoph advise: no need to call filemap_write_and_wait().
In exofs all metadata is at the inode so just writing the inode is
all is needed. ->fsync implies this must be done synchronously.But now exofs_file_fsync can not be used by exofs_file_flush.
vfs_fsync() should do that job correctly.FIXME: remove the sb_sync and fix that sb_update better.
Signed-off-by: Boaz Harrosh
-
exofs_releasepage && exofs_invalidatepage are never called.
Leave the WARN_ONs but remove any code. Remove thecleanup other stale #includes.
Signed-off-by: Boaz Harrosh
28 May, 2010
1 commit
-
Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro
24 May, 2010
1 commit
-
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: confusion between kmap() and kmap_atomic() api
exofs: Add default address_space_operations
22 May, 2010
1 commit
-
Ack-by: Boaz Harrosh
Signed-off-by: Dmitry Monakhov
Signed-off-by: Al Viro
17 May, 2010
2 commits
-
For kmap_atomic() we call kunmap_atomic() on the returned pointer.
That's different from kmap() and kunmap() and so it's easy to get them
backwards.Cc: Stable
Signed-off-by: Dan Carpenter
Signed-off-by: Boaz Harrosh -
All vectors of address_space_operations should be initialized
by the filesystem. Add the missing parts.This is actually an optimization, by using
__set_page_dirty_nobuffers. The default, in case of NULL,
would be __set_page_dirty_buffers which has these extar if(s)..releasepage && .invalidatepage should both not be called
because page_private() is NULL in exofs. Put a WARN_ON if
they are called, to indicate the Kernel has changed in this
regard, if when it does.Signed-off-by: Boaz Harrosh
29 Apr, 2010
1 commit
-
Commit b3d0ab7e60d1865bb6f6a79a77aaba22f2543236 ("exofs: add bdi backing
to mount session") has a bug in the placement of the bdi member at
struct exofs_sb_info. The layout member must be kept last.Signed-off-by: Boaz Harrosh
Acked-by: Jens Axboe
Signed-off-by: Linus Torvalds
22 Apr, 2010
1 commit
-
This ensures that dirty data gets flushed properly.
Signed-off-by: Jens Axboe
30 Mar, 2010
1 commit
-
…it slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
06 Mar, 2010
1 commit
-
This gives the filesystem more information about the writeback that
is happening. Trond requested this for the NFS unstable write handling,
and other filesystems might benefit from this too by beeing able to
distinguish between the different callers in more detail.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro
28 Feb, 2010
11 commits
-
* _calc_stripe_info() changes to accommodate for grouping
calculations. Returns additional information* old _prepare_pages() becomes _prepare_one_group()
which stores pages belonging to one device group.* New _prepare_for_striping iterates on all groups calling
_prepare_one_group().* Enable mounting of groups data_maps (group_width != 0)
[QUESTION]
what is faster A or B;
A. x += stride;
x = x % width + first_x;B x += stride
if (x < last_x)
x = first_x;Signed-off-by: Boaz Harrosh
-
* Rename _offset_dev_unit_off() to _calc_stripe_info()
and recieve a struct for the output params* In _prepare_for_striping we only need to call
_calc_stripe_info() once. The other componets
are easy to calculate from that. This code
was inspired by what's done in truncate.* Some code shifts that make sense now but will make
more sense when group support is added.Signed-off-by: Boaz Harrosh
-
If an object is referenced by a directory but does not
exist on a target, it is a very serious corruption that
means:
1. Either a power failure with very slim chance of it
happening. Because the directory update is always submitted
much after object creation, but if a directory is written
to one device and the object creation to another it might
theoretically happen.
2. It only ever happened to me while developing with BUGs
causing file corruption. Crashes could also cause it but
they are more like case 1.In any way the object does not exist, so data is surely lost.
If there is a mix-up in the obj-id or data-map, then lost objects
can be salvaged by off-line fsck. The only recoverable information
is the directory name. By letting it appear as a regular empty file,
with date==0 (1970 Jan 1st) ownership to root, we enable recovery
of the only useful information. And also enable deletion or over-write.
I can see how this can hurt.Signed-off-by: Boaz Harrosh
-
* inode.c operations are full-pages based, and not actually
true scatter-gather
* Lets us use more pages at once upto 512 (from 249) in 64 bit
* Brings us much much closer to be able to use exofs's io_state engine
from objlayout driver. (Once I decide where to put the common code)After RAID0 patch the outer (input) bio was never used as a bio, but
was simply a page carrier into the raid engine. Even in the simple
mirror/single-dev arrangement pages info was copied into a second bio.
It is now easer to just pass a pages array into the io_state and prepare
bio(s) once.Signed-off-by: Boaz Harrosh
-
We now support striping over mirror devices. Including variable sized
stripe_unit.Some limits:
* stripe_unit must be a multiple of PAGE_SIZE
* stripe_unit * stripe_count is maximum upto 32-bit (4Gb)Tested RAID0 over mirrors, RAID0 only, mirrors only. All check.
Design notes:
* I'm not using a vectored raid-engine mechanism yet. Following the
pnfs-objects-layout data-map structure, "Mirror" is just a private
case of "group_width" == 1, and RAID0 is a private case of
"Mirrors" == 1. The performance lose of the general case over the
particular special case optimization is totally negligible, also
considering the extra code size.* In general I added a prepare_stripes() stage that divides the
to-be-io pages to the participating devices, the previous
exofs_ios_write/read, now becomes _write/read_mirrors and a new
write/read upper layer loops on all devices calling
_write/read_mirrors. Effectively the prepare_stripes stage is the all
secret.
Also truncate need fixing to accommodate for striping.* In a RAID0 arrangement, in a regular usage scenario, if all inode
layouts will start at the same device, the small files fill up the
first device and the later devices stay empty, the farther the device
the emptier it is.To fix that, each inode will start at a different stripe_unit,
according to it's obj_id modulus number-of-stripe-units. And
will then span all stripe-units in the same incrementing order
wrapping back to the beginning of the device table. We call it
a stripe-units moving window.Special consideration was taken to keep all devices in a mirror
arrangement identical. So a broken osd-device could just be cloned
from one of the mirrors and no FS scrubbing is needed. (We do that
by rotating stripe-unit at a time and not a single device at a time.)TODO:
We no longer verify object_length == inode->i_size in exofs_iget.
(since i_size is stripped on multiple objects now).
I should introduce a multiple-device attribute reading, and use
it in exofs_iget.Signed-off-by: Boaz Harrosh
-
* Layouts describe the way a file is spread on multiple devices.
The layout information is stored in the objects attribute introduced
in this patch.* There can be multiple generating function for the layout.
Currently defined:
- No attribute present - use below moving-window on global
device table, all devices.
(This is the only one currently used in exofs)
- an obj_id generated moving window - the obj_id is a randomizing
factor in the otherwise global map layout.
- An explicit layout stored, including a data_map and a device
index list.
- More might be defined in future ...* There are two attributes defined of the same structure:
A-data-files-layout - This layout is used by data-files. If present
at a directory, all files of that directory will
be created with this layout.
A-meta-data-layout - This layout is used by a directory and other
meta-data information. Also inherited at creation
of subdirectories.* At creation time inodes are created with the layout specified above.
A usermode utility may change the creation layout on a give directory
or file. Which in the case of directories, will also apply to newly
created files/subdirectories, children of that directory.
In the simple unaltered case of a newly created exofs, no layout
attributes are present, and all layouts adhere to the layout specified
at the device-table.* In case of a future file system loaded in an old exofs-driver.
At iget(), the generating_function is inspected and if not supported
will return an IO error to the application and the inode will not
be loaded. So not to damage any data.
Note: After this patch we do not yet support any type of layout
only the RAID0 patch that enables striping at the super-block
level will add support for RAID0 layouts above. This way we
are past and future compatible and fully bisectable.* Access to the device table is done by an accessor since
it will change according to above information.Signed-off-by: Boaz Harrosh
-
The original idea was that a mirror read can be sub-divided
to multiple devices. But this has very little gain and only
at very large IOes so it's not going to be implemented soon.Signed-off-by: Boaz Harrosh
-
* Abstract away those members in exofs_sb_info that are related/needed
by a layout into a new exofs_layout structure. Embed it in exofs_sb_info.* At exofs_io_state receive/keep a pointer to an exofs_layout. No need for
an exofs_sb_info pointer, all we need is at exofs_layout.* Change any usage of above exofs_sb_info members to their new name.
Signed-off-by: Boaz Harrosh
-
In check_io, implement the case of reading passed end of
file, by clearing the pages and recover with no error. In
a raid arrangement this can become a legitimate situation
in case of holes in the file.Signed-off-by: Boaz Harrosh
-
optimize the exofs_i_info struct usage by moving the embedded
vfs_inode to be first. A compiler might optimize away an "add"
operation with constant zero. (Which it cannot with other constants)Signed-off-by: Boaz Harrosh
-
* Last debug trimming left in some stupid print, remove them.
Fixup some other prints
* Shift printing from inode.c to ios.c
* Add couple of prints when memory allocation fails.Signed-off-by: Boaz Harrosh
05 Jan, 2010
1 commit
-
exofs uses simple_write_end() for it's .write_end handler. But
it is not enough because simple_write_end() does not call
mark_inode_dirty() when it extends i_size. So even if we do
call mark_inode_dirty at beginning of write out, with a very
long IO and a saturated system we might get the .write_inode()
called while still extend-writing to file and miss out on the last
i_size updates.So override .write_end, call simple_write_end(), and afterwords if
i_size was changed call mark_inode_dirty().It stands to logic that since simple_write_end() was the one extending
i_size it should also call mark_inode_dirty(). But it looks like all
users of simple_write_end() are memory-bound pseudo filesystems, who
could careless about mark_inode_dirty(). I might submit a
warning-comment patch to simple_write_end() in future.CC: Stable
Signed-off-by: Boaz Harrosh