31 Oct, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
    Btrfs: deal with errors from updating the tree log
    Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
    Btrfs: make SNAP_DESTROY async
    Btrfs: add SNAP_CREATE_ASYNC ioctl
    Btrfs: add START_SYNC, WAIT_SYNC ioctls
    Btrfs: async transaction commit
    Btrfs: fix deadlock in btrfs_commit_transaction
    Btrfs: fix lockdep warning on clone ioctl
    Btrfs: fix clone ioctl where range is adjacent to extent
    Btrfs: fix delalloc checks in clone ioctl
    Btrfs: drop unused variable in block_alloc_rsv
    Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
    Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
    Btrfs: Use ERR_CAST helpers
    Btrfs: use memdup_user helpers
    Btrfs: fix raid code for removing missing drives
    Btrfs: Switch the extent buffer rbtree into a radix tree
    Btrfs: restructure try_release_extent_buffer()
    Btrfs: use the flusher threads for delalloc throttling
    Btrfs: tune the chunk allocation to 5% of the FS as metadata
    ...

    Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
    remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
    useless and removed in commit 5e8067adfdba: "rcu head remove init")

    Linus Torvalds
     

30 Oct, 2010

2 commits

These are all the cases where a variable is set but not read, which are
    not bugs as far as I can see, but simply leftovers.

    Still needs more review.

    Found by gcc 4.6's new warnings

    Signed-off-by: Andi Kleen
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Mason

    Andi Kleen
     
These are all the cases where a variable is set but not
    read, which are real bugs.

    - A couple of cases of incorrect error handling fixed.
    - One incorrect use of an allocation policy
    - Some other things

    Still needs more review.

    Found by gcc 4.6's new warnings.

    [akpm@linux-foundation.org: fix build. Might have been bitrot]
    Signed-off-by: Andi Kleen
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Chris Mason

    Andi Kleen
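
    The two patches above are driven by gcc 4.6's new
    -Wunused-but-set-variable warning. As a minimal standalone illustration of
    the warning class (the function and variable names here are made up; this
    is not code from either patch):

    /* build with: gcc -c -Wunused-but-set-variable example.c */
    static int count_positive(const int *vals, int n)
    {
            int i;
            int last;          /* assigned every iteration, never read: gcc 4.6 warns */
            int positive = 0;

            for (i = 0; i < n; i++) {
                    last = vals[i];
                    if (vals[i] > 0)
                            positive++;
            }
            return positive;   /* dropping 'last' entirely is the "nonbug" kind of fix */
    }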
     

29 Oct, 2010

2 commits

This patch reduces the CPU time spent in the extent buffer search by using a
    radix tree instead of the rbtree, and the rcu lock instead of the spin
    lock.

    I did a quick test with the benchmark tool[1] and found that the patch improves
    the file creation/deletion performance problem that I reported[2].

    Before applying this patch:
    Create files:
    Total files: 50000
    Total time: 0.971531
    Average time: 0.000019
    Delete files:
    Total files: 50000
    Total time: 1.366761
    Average time: 0.000027

    After applying this patch:
    Create files:
    Total files: 50000
    Total time: 0.927455
    Average time: 0.000019
    Delete files:
    Total files: 50000
    Total time: 1.292280
    Average time: 0.000026

    [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
    [2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
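
    As a rough sketch of the lookup pattern this commit moves to (kernel-style
    fragment, not the actual btrfs code; the 'buffer' field and the shift by
    PAGE_CACHE_SHIFT are assumptions for illustration):

    static struct extent_buffer *eb_lookup(struct extent_io_tree *tree, u64 start)
    {
            struct extent_buffer *eb;

            /* read side: RCU instead of an rbtree walk under a spinlock */
            rcu_read_lock();
            eb = radix_tree_lookup(&tree->buffer, start >> PAGE_CACHE_SHIFT);
            rcu_read_unlock();
            return eb;
    }

    Writers (insert/remove) would still serialize with a lock; only the hot
    lookup path becomes lock-free for readers.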
     
Restructure try_release_extent_buffer() and add a function to release the
    extent buffer. It will be used later.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     

26 May, 2010

1 commit

  • This changes O_DIRECT write code to mark extents as delalloc
    while it is processing them. Yan Zheng has reworked the
    enospc accounting based on tracking delalloc extents and
    this makes it much easier to track enospc in the O_DIRECT code.

There are a few special cases in the O_DIRECT code, though:
    it only sets the EXTENT_DELALLOC bits, instead of doing
    EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
    we don't want to mess with clearing the dirty and uptodate
    bits when things go wrong. This is important because there
    are no pages in the page cache, so any extent state structs
    that we put in the tree won't get freed by releasepage. We have
    to clear them ourselves as the DIO ends.

With this commit, we reserve space in btrfs_file_aio_write,
    and then as each btrfs_direct_IO call progresses it sets
    EXTENT_DELALLOC on the range.

    btrfs_get_blocks_direct is responsible for clearing the delalloc
    at the same time it drops the extent lock.

    Signed-off-by: Chris Mason

    Chris Mason
     

25 May, 2010

3 commits

  • The async helper threads offload crc work onto all the
    CPUs, and make streaming writes much faster. This
    changes the O_DIRECT write code to use them. The only
    small complication was that we need to pass in the
    logical offset in the file for each bio, because we can't
    find it in the bio's pages.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • In order for AIO to work, we need to implement aio_write. This patch converts
    our btrfs_file_write to btrfs_aio_write. I've tested this with xfstests and
    nothing broke, and the AIO stuff magically started working. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Introduce metadata reservation context for delayed allocation
    and update various related functions.

    This patch also introduces EXTENT_FIRST_DELALLOC control bit for
    set/clear_extent_bit. It tells set/clear_bit_hook whether they
    are processing the first extent_state with EXTENT_DELALLOC bit
    set. This change is important if set/clear_extent_bit involves
multiple extent_states.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

06 Apr, 2010

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: add check for changed leaves in setup_leaf_for_split
    Btrfs: create snapshot references in same commit as snapshot
    Btrfs: fix small race with delalloc flushing waitqueue's
    Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
    Btrfs: fix chunk allocate size calculation
    Btrfs: kill max_extent mount option
    Btrfs: fail to mount if we have problems reading the block groups
    Btrfs: check btrfs_get_extent return for IS_ERR()
    Btrfs: handle kmalloc() failure in inode lookup ioctl
    Btrfs: dereferencing freed memory
    Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
    Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
    Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
    Btrfs: add NULL check for do_walk_down()
    Btrfs: remove duplicate include in ioctl.c

    Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
    cleanups.

    Linus Torvalds
     
  • Pagecache pages should be allocated with __page_cache_alloc, so they
    obey pagecache memory policies.

    add_to_page_cache_lru is exported, so it should be used. Benefits over
using a private pagevec: neater code, 128 bytes less stack usage, percpu
    lru ordering is preserved, and finally there is no need to flush the pagevec
    before returning, so batching may be shared with other LRU insertions.

Signed-off-by: Nick Piggin
    Signed-off-by: Chris Mason

    Nick Piggin
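
    A minimal sketch of the allocation pattern the commit switches to (generic
    pagecache API usage, not the exact btrfs hunk; the gfp flags are illustrative):

    struct page *page = __page_cache_alloc(GFP_NOFS);   /* obeys pagecache mempolicy */
    if (!page)
            return -ENOMEM;

    if (add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
            /* already present at this index, or the radix slot allocation failed */
            page_cache_release(page);
            return -EEXIST;
    }
    /* page is now locked, in the page cache, and batched onto the per-cpu LRU pagevec */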
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following:

    * Scan files for gfp and slab usages and update includes such that
only the necessary includes are there, i.e. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
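
    For an individual file the conversion amounts to spelling out the includes
    it actually relies on instead of inheriting them through percpu.h, for
    example:

    /* before: kmalloc()/kfree() compiled only because percpu.h dragged in slab.h */
    #include <linux/percpu.h>

    /* after: include what is used directly */
    #include <linux/percpu.h>
    #include <linux/gfp.h>     /* only if gfp flags/helpers are used on their own */
    #include <linux/slab.h>    /* for kmalloc(), kzalloc(), kfree(), ... */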
     

15 Mar, 2010

3 commits

  • This patch just goes through and fixes everybody that does

    lock_extent()
    blah
    unlock_extent()

    to use

    lock_extent_bits()
    blah
    unlock_extent_cached()

and pass around an extent_state so we only have to do the searches once per
    function. This gives me about a 3 MB/s boost on my random write test. I have
    not converted some things, like the relocation and ioctl's, since they aren't
    heavily used and the relocation stuff is in the middle of being re-written. I
    also changed the clear_extent_bit() to only unset the cached state if we are
    clearing EXTENT_LOCKED and related stuff, so we can do things like this

    lock_extent_bits()
    clear delalloc bits
    unlock_extent_cached()

    without losing our cached state. I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states; everything worked out
    fine.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
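
    A sketch of the cached-state pattern described above; the argument lists are
    approximations of the btrfs helpers of this era, so treat the exact
    signatures as assumptions:

    struct extent_state *cached_state = NULL;

    lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end,
                     0, &cached_state, GFP_NOFS);

    /* ... clear delalloc bits, set up ordered extents, etc ... */

    unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end,
                         &cached_state, GFP_NOFS);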
     
  • This patch makes us cache the extent state we find in find_delalloc_range since
    we'll have to lock the extent later on in the function. This will keep us from
re-searching for the range when we try to lock the extent.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
The endio is done in reverse order of the bio vectors.

    That means for a sequential read, the page submitted first will finish
    last in a bio. Considering we do a checksum (making the cache hot) for
    every page, this does introduce delay (and a chance to push out cache that
    will be used soon) for pages submitted at the beginning.

    I don't observe an obvious performance difference with the patch below in my
    simple test, but it seems more natural to finish reads in the order they are
    submitted.

    Signed-off-by: Shaohua Li
    Signed-off-by: Chris Mason

    Chris Mason
     

09 Mar, 2010

1 commit

Btrfs initializes rb trees in quite a number of places by setting rb_node =
    NULL; the problem with this is that 17d9ddc72fb8bba0d4f678 in the
    linux-next tree adds a new field to that struct which needs to be NULL for
    the new rbtree library code to work properly. This patch uses RB_ROOT as
    the initializer so all of the relevant fields will be NULL'd. Without the
    patch I get a panic.

    Signed-off-by: Eric Paris
    Acked-by: Venkatesh Pallipadi
    Signed-off-by: Chris Mason

    Eric Paris
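
    The change itself is mechanical; a sketch (the 'map' field name is just for
    illustration):

    #include <linux/rbtree.h>

    /* before: only rb_node is cleared, any extra fields in struct rb_root stay stale */
    tree->map.rb_node = NULL;

    /* after: RB_ROOT initializes the whole struct rb_root */
    tree->map = RB_ROOT;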
     

09 Oct, 2009

2 commits

  • This patch fixes an issue with the delalloc metadata space reservation
    code. The problem is we used to free the reservation as soon as we
    allocated the delalloc region. The problem with this is if we are not
    inserting an inline extent, we don't actually insert the extent item until
after the ordered extent is written out. This patch does 3 things:

    1) It moves the reservation clearing stuff into the ordered code, so when
    we remove the ordered extent we remove the reservation.
    2) It adds an EXTENT_DO_ACCOUNTING flag that gets passed when we clear
    delalloc bits in the cases where we want to clear the metadata reservation
    when we clear the delalloc extent, in the case that we do an inline extent
    or we invalidate the page.
    3) It adds another waitqueue to the space info so that when we start a fs
    wide delalloc flush, anybody else who also hits that area will simply wait
    for the flush to finish and then try to make their allocation.

    This has been tested thoroughly to make sure we did not regress on
    performance.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
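
    Point 3 is a standard waitqueue pattern; roughly, with invented names and
    the locking around the flag elided for brevity (this is not the actual
    btrfs code):

    /* in the per-space_info state */
    wait_queue_head_t flush_wait;
    int flushing;

    if (!space_info->flushing) {
            /* first caller performs the fs-wide delalloc flush */
            space_info->flushing = 1;
            flush_all_delalloc(root);               /* hypothetical helper */
            space_info->flushing = 0;
            wake_up_all(&space_info->flush_wait);
    } else {
            /* later callers in the same window just wait and then retry */
            wait_event(space_info->flush_wait, !space_info->flushing);
    }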
     
  • extent_clear_unlock_delalloc has a growing set of ugly parameters
    that is very difficult to read and maintain.

This switches to a flags field and well-named flag defines.

    Signed-off-by: Chris Mason

    Chris Mason
     

29 Sep, 2009

1 commit

  • At the start of a transaction we do a btrfs_reserve_metadata_space() and
    specify how many items we plan on modifying. Then once we've done our
    modifications and such, just call btrfs_unreserve_metadata_space() for
    the same number of items we reserved.

    For keeping track of metadata needed for data I've had to add an extent_io op
    for when we merge extents. This lets us track space properly when we are doing
    sequential writes, so we don't end up reserving way more metadata space than
    what we need.

    The only place where the metadata space accounting is not done is in the
    relocation code. This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in an
    ENOSPC-related panic. This patch also turns off the metadata_ratio stuff in order to
    allow users to more efficiently use their disk space.

    This patch makes it so we track how much metadata we need for an inode's
    delayed allocation extents by tracking how many extents are currently
    waiting for allocation. It introduces two new callbacks for the
extent_io tree: merge_extent_hook and split_extent_hook. These help
    us keep track of when we merge delalloc extents together and split them
    up. Reservations are handled before any actual dirtying occurs,
    and then we unreserve after we dirty.

    btrfs_unreserve_metadata_for_delalloc() will make the appropriate
    unreservations as needed based on the number of reservations we
    currently have and the number of extents we currently have. Doing the
    reservation outside of doing any of the actual dirty'ing lets us do
    things like filemap_flush() the inode to try and force delalloc to
    happen, or as a last resort actually start allocation on all delalloc
    inodes in the fs. This has survived dbench, fs_mark and an fsx torture
    test.

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
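
    The usage pattern is symmetric reserve/unreserve calls around a
    transaction's modifications; a sketch (the exact arguments are assumptions,
    based only on the function names given above):

    /* plan on modifying, say, 3 tree items */
    ret = btrfs_reserve_metadata_space(root, 3);
    if (ret)
            return ret;

    trans = btrfs_start_transaction(root, 1);
    /* ... insert/update the items ... */
    btrfs_end_transaction(trans, root);

    /* give the reservation back for the same number of items */
    btrfs_unreserve_metadata_space(root, 3);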
     

24 Sep, 2009

3 commits

  • During releasepage, we try to drop any extent_state structs for the
byte offsets of the page we're releasing. But the code was incorrectly
    telling clear_extent_bit to delete the state struct unconditionally.

    Normally this would be fine because we have the page locked, but other
    parts of btrfs will lock down an entire extent, the most common place
    being IO completion.

    releasepage was deleting the extent state without first locking the extent,
    which may result in removing a state struct that another process had
    locked down. The fix here is to leave the NODATASUM and EXTENT_LOCKED
    bits alone in releasepage.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • If test_range_bit finds an extent that goes all the way to (u64)-1, it
can incorrectly wrap the u64 instead of treating it like the end of
    the address space.

    This just adds a check for the highest possible offset so we don't wrap.

    Signed-off-by: Chris Mason

    Chris Mason
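
    The wrap being guarded against is the classic "start = end + 1" overflow
    when the end is already the largest possible u64; a tiny sketch of the
    check inside the search loop:

    while (start <= search_end) {
            /* ... examine the extent_state covering 'start' ... */

            if (state->end == (u64)-1)      /* highest possible offset: stop here, */
                    break;                  /* otherwise end + 1 wraps around to 0 */
            start = state->end + 1;
    }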
     
  • Both set and clear_extent_bit allow passing a cached
    state struct to reduce rbtree search times. clear_extent_bit
    was improperly bypassing some of the checks around making sure
    the extent state fields were correct for a given operation.

    The fix used here (from Yan Zheng) is to use the hit_next
    goto target instead of jumping all the way down to start clearing
    bits without making sure the cached state was exactly correct
    for the operation we were doing.

    This also fixes up the setting of the start variable for both
    ops in the case where we find an overlapping extent that
    begins before the range we want to change. In both cases
    we were incorrectly going backwards from the original
    requested change.

    Signed-off-by: Chris Mason

    Chris Mason
     

19 Sep, 2009

1 commit

  • When btrfs fills a delayed allocation, it tries to increase
    the wbc nr_to_write to cover a big part of allocation. The
    theory is that we're doing contiguous IO and writing a few
    more blocks will save seeks overall at a very low cost.

    The problem is that extent_write_cache_pages could ignore
    the new higher nr_to_write if nr_to_write had already gone
    down to zero. We fix that by rechecking the nr_to_write
    for every page that is processed in the pagevec.

    This updates the math around bumping the nr_to_write value
    to make sure we don't leave a tiny amount of IO hanging
    around for the very end of a new extent.

    Signed-off-by: Chris Mason

    Chris Mason
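
    A rough sketch of the per-page recheck inside the pagevec loop (simplified
    write_cache_pages-style logic, not the exact btrfs code):

    for (i = 0; i < nr_pages; i++) {
            struct page *page = pvec.pages[i];

            ret = __extent_writepage(page, wbc, data);  /* may bump wbc->nr_to_write */

            /*
             * Recheck after every page: the writepage path may have raised
             * nr_to_write to cover a big delalloc extent, so only stop when
             * it is still <= 0 here, not because it hit 0 earlier.
             */
            if (ret || wbc->nr_to_write <= 0)
                    done = 1;
    }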
     

12 Sep, 2009

9 commits

  • Btrfs writes go through delalloc to the data=ordered code. This
    makes sure that all of the data is on disk before the metadata
    that references it. The tracking means that we have to make sure
    each page in an extent is fully written before we add that extent into
    the on-disk btree.

    This was done in the past by setting the EXTENT_ORDERED bit for the
    range of an extent when it was added to the data=ordered code, and then
    clearing the EXTENT_ORDERED bit in the extent state tree as each page
    finished IO.

    One of the reasons we had to do this was because sometimes pages are
    magically dirtied without page_mkwrite being called. The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page became
    dirty without going through the proper path.

    These bit operations make for a number of rbtree searches for each page,
    and can cause considerable lock contention.

    This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
    As pages go into the ordered code, PagePrivate2 is set on each one.
    This is a cheap operation because we already have all the pages locked
    and ready to go.

    As IO finishes, the PagePrivate2 bit is cleared and the ordered
accounting is updated for each page.

    At writepage time, if the PagePrivate2 bit is missing, we go into the
    writepage fixup code to handle improperly dirtied pages.

    Signed-off-by: Chris Mason

    Chris Mason
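
    A minimal sketch of the page-flag protocol described above (the
    PagePrivate2 accessors are the real page-flag API; the fixup helper name is
    hypothetical):

    /* as locked pages enter the data=ordered code: */
    SetPagePrivate2(page);

    /* at writepage time: a dirty page without the flag took an improper path */
    if (!PagePrivate2(page))
            return writepage_fixup(page);           /* hypothetical name */

    /* as IO finishes for each page, alongside the ordered accounting update: */
    ClearPagePrivate2(page);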
     
  • This changes the btrfs code to find delalloc ranges in the extent state
    tree to use the new state caching code from set/test bit. It reduces
    one of the biggest causes of rbtree searches in the writeback path.

    test_range_bit is also modified to take the cached state as a starting
    point while searching.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • At writepage time, we have the page locked and we have the
    extent_map entry for this extent pinned in the extent_map tree.
    So, the page can't go away and its mapping can't change.

    There is no need for the extra extent_state lock bits during writepage.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Many of the btrfs extent state tree users follow the same pattern.
    They lock an extent range in the tree, do some operation and then
    unlock.

    This translates to at least 2 rbtree searches, and maybe more if they
    are doing operations on the extent state tree. A locked extent
    in the tree isn't going to be merged or changed, and so we can
    safely return the extent state structure as a cached handle.

    This changes set_extent_bit to give back a cached handle, and also
    changes both set_extent_bit and clear_extent_bit to use the cached
    handle if it is available.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Btrfs is currently mirroring some of the page state bits into
    its extent state tree. The goal behind this was to use it in supporting
    blocksizes other than the page size.

    But, we don't currently support that, and we're using quite a lot of CPU
    on the rb tree and its spin lock. This commit starts a series of
    cleanups to reduce the amount of work done in the extent state tree as
    part of each IO.

    This commit:

    * Adds the ability to lock an extent in the state tree and also set
    other bits. The idea is to do locking and delalloc in one call

    * Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits. Btrfs is using
    a combination of the page bits and the ordered write code for this
    instead.

    Signed-off-by: Chris Mason

    Chris Mason
     
As the extent state tree is manipulated, there are callbacks
    that are used to take extra actions when different state bits are set
    or cleared. One example of this is a counter for the total number
    of delayed allocation bytes in a single inode and in the whole FS.

    When new states are inserted, this callback is being done before we
    properly setup the new state. This hasn't caused problems before
because the lock bit was always done first, and the existing callbacks
    don't care about the lock bit.

    This patch makes sure the state is properly setup before using the
    callback, which is important for later optimizations that do more work
    without using the lock bit.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • There are two main users of the extent_map tree. The
    first is regular file inodes, where it is evenly spread
    between readers and writers.

    The second is the chunk allocation tree, which maps blocks from
logical addresses to physical ones, and it is 99.99% reads.

    The mapping tree is a point of lock contention during heavy IO
    workloads, so this commit switches things to a rw lock.

    Signed-off-by: Chris Mason

    Chris Mason
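
    Since the chunk mapping tree is almost entirely reads, a reader/writer lock
    lets lookups run in parallel; a sketch of the pattern (the 'lock' field
    name is illustrative):

    /* the 99.99% case: lookups only take the read side */
    read_lock(&em_tree->lock);
    em = lookup_extent_mapping(em_tree, start, len);
    read_unlock(&em_tree->lock);

    /* insertions and removals still need exclusive access */
    write_lock(&em_tree->lock);
    ret = add_extent_mapping(em_tree, em);
    write_unlock(&em_tree->lock);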
     
  • When btrfs fills a large delayed allocation extent, it is a good idea
    to try and convince the write_cache_pages caller to go ahead and
    write a good chunk of that extent. The extra IO is basically free
    because we know it is contiguous.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The Btrfs set_extent_bit call currently searches the rbtree
    every time it needs to find more extent_state objects to fill
    the requested operation.

    This adds a simple test with rb_next to see if the next object
    in the tree was adjacent to the one we just found. If so,
    we skip the search and just use the next object.

    Signed-off-by: Chris Mason

    Chris Mason
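
    A sketch of the adjacency shortcut (extent_state field names as used
    elsewhere in this log; details approximate):

    /*
     * After filling 'state', peek at the next rbtree node before searching
     * from the root again; if it starts exactly where this one ends, use it.
     */
    struct rb_node *next = rb_next(&state->rb_node);

    if (next) {
            struct extent_state *next_state =
                    rb_entry(next, struct extent_state, rb_node);

            if (next_state->start == state->end + 1) {
                    state = next_state;
                    goto process_state;             /* skip the full tree search */
            }
    }
    /* otherwise fall back to searching from the root */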
     

21 Apr, 2009

3 commits

  • The extent_io writepage call updates the writepage index in the inode
    as it makes progress. But, it was doing the update after unlocking the page,
    which isn't legal because page->mapping can't be trusted once the page
    is unlocked.

This led to an oops, especially common with compression turned on. The
    fix here is to update the writeback index before unlocking the page.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
    higher priority. But, the checksumming helper threads prevent it
    from being fully effective.

    There are two problems. First, a big queue of pending checksumming
    will delay the synchronous IO behind other lower priority writes. Second,
    the checksumming uses an ordered async work queue. The ordering makes sure
    that IOs are sent to the block layer in the same order they are sent
    to the checksumming threads. Usually this gives us less seeky IO.

    But, when we start mixing IO priorities, the lower priority IO can delay
    the higher priority IO.

    This patch solves both problems by adding a high priority list to the async
    helper threads, and a new btrfs_set_work_high_prio(), which is used
to put a new async work item onto the higher priority list.

    The ordering is still done on high priority IO, but all of the high
    priority bios are ordered separately from the low priority bios. This
    ordering is purely an IO optimization, it is not involved in data
    or metadata integrity.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
    writes we plan on waiting on in the near future. This patch
    mirrors recent changes in other filesystems and the generic code to
    use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
    other latency critical writes.

    Btrfs uses async worker threads for checksumming before the write is done,
    and then again to actually submit the bios. The bio submission code just
    runs a per-device list of bios that need to be sent down the pipe.

    This list is split into low priority and high priority lists so the
    WRITE_SYNC IO happens first.

    Signed-off-by: Chris Mason

    Chris Mason