Doug / smarc-fsl-linux-kernel | Embedian Git Server

18 May, 2013

6 commits

3a6cad900 Btrfs: explicitly use global_block_rsv for quota_tree ... Browse Code »

The quota_tree was set up to use the empty_block_rsv before
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items) are written. In fact, use_block_rsv() which is used
in btrfs_cow_block() falls back to the global_block_rsv in
this case. But just in order to make it more clear what is
happening, change it to explicitly use the global_block_rsv.

Signed-off-by: Stefan Behrens
Signed-off-by: Josef Bacik

Stefan Behrens
2013-05-18 09:40:36 +0800
d88033dbf Btrfs: update the global reserve if it is empty ... Browse Code »

Before applying this patch, we reserved the space for the global reserve
by the minimum unit if we found it is empty, it was unreasonable and
inefficient, because if the global reserve space was depleted, it implied
that the size of the global reserve was too small. In this case, we shoud
update the global reserve and fill it.

Cc: Tsutomu Itoh
Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik

Miao Xie
2013-05-18 09:40:26 +0800
5881cfc92 Btrfs: don't steal the reserved space from the global reserve if their space type is different ... Browse Code »

If the type of the space we need is different with the global reserve, we
can not steal the space from the global reserve, because we can not allocate
the space from the free space cache that the global reserve points to.

Cc: Tsutomu Itoh
Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik

Miao Xie
2013-05-18 09:40:25 +0800
b586b3237 Btrfs: optimize the error handle of use_block_rsv() ... Browse Code »

cc: Tsutomu Itoh
Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik

Miao Xie
2013-05-18 09:40:24 +0800
7b61cd922 Btrfs: don't use global block reservation for inode cache truncation ... Browse Code »

It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.

Cc: Tsutomu Itoh
Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik

Miao Xie
2013-05-18 09:40:22 +0800
b1c79e094 Btrfs: handle running extent ops with skinny metadata ... Browse Code »

Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata. We also lose
the level of the metadata we are working on. So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op. Thanks,

Signed-off-by: Josef Bacik

Josef Bacik
2013-05-18 09:40:15 +0800

07 May, 2013

17 commits

b6919a58f btrfs: fix misleading variable name for flags ... Browse Code »

The variable was named 'data' in btrfs_reserve_extent and that's the
only function that actually uses it to let btrfs_get_alloc_profile know
what profile we want. Then it's passed down as u64 flags.

Signed-off-by: David Sterba
Signed-off-by: Josef Bacik

David Sterba
2013-05-07 03:55:27 +0800
48a3b6366 btrfs: make static code static & remove dead code ... Browse Code »

Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen
Signed-off-by: Josef Bacik

Eric Sandeen
2013-05-07 03:55:23 +0800
b50c6e250 Btrfs: deal with free space cache errors while replaying log ... Browse Code »

So everybody who got hit by my fsync bug will still continue to hit this
BUG_ON() in the free space cache, which is pretty heavy handed. So I took a
file system that had this bug and fixed up all the BUG_ON()'s and leaks that
popped up when I tried to mount a broken file system like this. With this patch
we just fail to mount instead of panicing. Thanks,

Signed-off-by: Josef Bacik

Josef Bacik
2013-05-07 03:55:20 +0800
3c76cd84e Btrfs: allocate new chunks if the space is not enough for global rsv ... Browse Code »

When running the 208th of xfstests, the fs returned the enospc
error when there was lots of free space in the disk.

By bisect debug, we found it was introduced by commit 96f1bb5777.
This commit makes the space check for the global reservation in
can_overcommit() be inconsistent with should_alloc_chunk().
can_overcommit() requires that the free space is 2 times the size
of the global reservation, or we can't do overcommit. And instead,
we need reclaim some reserved space, and if we still don't have
enough free space, we need allocate a new chunk. But unfortunately,
should_alloc_chunk() just requires that the free space is 1 time
the size of the global reservation, that is we would not try to
allocate a new chunk if the free space size is in the middle of
these two requires, and just return the enospc error. Fix it.

Cc: Jim Schutt
Cc: Josef Bacik
Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik

Miao Xie
2013-05-07 03:55:17 +0800
fc36ed7e0 Btrfs: separate sequence numbers for delayed ref tracking and tree mod log ... Browse Code »

Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.

However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.

To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.

The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.

Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.

Signed-off-by: Jan Schmidt
Signed-off-by: Josef Bacik

Jan Schmidt
2013-05-07 03:55:17 +0800
32b025380 Btrfs: don't panic if we're trying to drop too many refs ... Browse Code »

This is just obnoxious. Just print a message, abort the transaction, and return
an error. Thanks,

Signed-off-by: Josef Bacik

Josef Bacik
2013-05-07 03:55:09 +0800
416bc6580 Btrfs: fix all callers of read_tree_block ... Browse Code »

We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly. You need to check
and make sure the extent_buffer is uptodate before you use it. This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not. With this we no longer
leak EB's when things go horribly wrong. Thanks,

Signed-off-by: Josef Bacik

Josef Bacik
2013-05-07 03:55:07 +0800
51bf5f0bc Btrfs: only exclude supers in the range of our block group ... Browse Code »

If we fail to load block groups halfway through we can leave extent_state's on
the excluded tree. This is because we just lookup the supers and add them to
the excluded tree regardless of which block group we are looking at currently.
This is a problem because we remove the excluded extents for the range of the
block group only, so if we don't ever load a block group for one of the excluded
extents we won't ever free it. This fixes the problem by only adding excluded
extents if it falls in the block group range we care about. With this patch
we're no longer leaking space when we fail to read all of the block groups.
Thanks,

Signed-off-by: Josef Bacik

Josef Bacik
2013-05-07 03:55:06 +0800
0a3896d0f Btrfs: fix possible infinite loop in slow caching ... Browse Code »

So I noticed there is an infinite loop in the slow caching code. If we return 1
when we hit the end of the tree, so we could end up caching the last block group
the slow way and suddenly we're looping forever because we just keep
re-searching and trying again. Fix this by only doing btrfs_next_leaf() if we
don't need_resched(). Thanks,

Signed-off-by: Josef Bacik

Josef Bacik
2013-05-07 03:55:01 +0800
fd279faef Btrfs: cleanup of function where btrfs_extend_item() is called ... Browse Code »

Argument 'trans' became unnecessary from setup_inline_extent_backref()
that called btrfs_extend_item().

Signed-off-by: Tsutomu Itoh
Signed-off-by: Josef Bacik

Tsutomu Itoh
2013-05-07 03:54:54 +0800
4b90c6801 Btrfs: remove unused argument of btrfs_extend_item() ... Browse Code »

Argument 'trans' is not used in btrfs_extend_item().

Signed-off-by: Tsutomu Itoh
Signed-off-by: Josef Bacik

Tsutomu Itoh
2013-05-07 03:54:53 +0800
afe5fea72 Btrfs: cleanup of function where fixup_low_keys() is called ... Browse Code »

If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.

Signed-off-by: Tsutomu Itoh
Signed-off-by: Josef Bacik

Tsutomu Itoh
2013-05-07 03:54:52 +0800
98ad69cfd Btrfs: don't wait on ordered extents if we have a trans open ... Browse Code »

Dave was hitting a lockdep warning because we're now properly taking the ordered
operations mutex in the ordered wait stuff. This is because some cases we will
have a trans handle when we are flushing delalloc space, but we can't wait on
ordered extents because we could potentially deadlock, so fix this by not doing
the wait if we have a trans handle. Thanks

Reported-and-tested-by: David Sterba
Signed-off-by: Josef Bacik

Josef Bacik
2013-05-07 03:54:32 +0800
8c579fe74 Btrfs: fix error handling in make/read block group ... Browse Code »

I noticed that we will add a block group to the space info before we add it to
the block group cache rb tree, so we could potentially allocate from the block
group before it's able to be searched for. I don't think this is too much of
a problem, the race window is microscopic, but just in case move the tree
insertion to above the space info linking. This makes it easier to adjust the
error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real
error handling setup for these scenarios. Thanks,

Signed-off-by: Josef Bacik

Josef Bacik
2013-05-07 03:54:32 +0800
c2cf52eb7 Btrfs: Include the device in most error printk()s ... Browse Code »

With more than one btrfs volume mounted, it can be very difficult to find
out which volume is hitting an error. btrfs_error() will print this, but
it is currently rigged as more of a fatal error handler, while many of
the printk()s are currently for debugging and yet-unhandled cases.

This patch just changes the functions where the device information is
already available. Some cases remain where the root or fs_info is not
passed to the function emitting the error.

This may introduce some confusion with volumes backed by multiple devices
emitting errors referring to the primary device in the set instead of the
one on which the error occurred.

Use btrfs_printk(fs_info, format, ...) rather than writing the device
string every time, and introduce macro wrappers ala XFS for brevity.
Since the function already cannot be used for continuations, print a
newline as part of the btrfs_printk() message rather than at each caller.

Signed-off-by: Simon Kirby
Reviewed-by: David Sterba
Signed-off-by: Josef Bacik

Simon Kirby
2013-05-07 03:54:23 +0800
9d1a2a3ad btrfs: clean snapshots one by one ... Browse Code »

Each time pick one dead root from the list and let the caller know if
it's needed to continue. This should improve responsiveness during
umount and balance which at some point waits for cleaning all currently
queued dead roots.

A new dead root is added to the end of the list, so the snapshots
disappear in the order of deletion.

The snapshot cleaning work is now done only from the cleaner thread and the
others wake it if needed.

Signed-off-by: David Sterba
Signed-off-by: Josef Bacik

David Sterba
2013-05-07 03:54:21 +0800
3173a18f7 Btrfs: add a incompatible format change for smaller metadata extent refs ... Browse Code »

We currently store the first key of the tree block inside the reference for the
tree block in the extent tree. This takes up quite a bit of space. Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref. This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block. In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster. This is not an automatic format change, you
must enable it at mkfs time or with btrfstune. This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert. Thanks,

Signed-off-by: Josef Bacik

Josef Bacik
2013-05-07 03:54:18 +0800

30 Mar, 2013

1 commit

3615db41c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"We've had a busy two weeks of bug fixing. The biggest patches in here
are some long standing early-enospc problems (Josef) and a very old
race where compression and mmap combine forces to lose writes (me).
I'm fairly sure the mmap bug goes all the way back to the introduction
of the compression code, which is proof that fsx doesn't trigger every
possible mmap corner after all.

I'm sure you'll notice one of these is from this morning, it's a small
and isolated use-after-free fix in our scrub error reporting. I
double checked it here."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: don't drop path when printing out tree errors in scrub
Btrfs: fix wrong return value of btrfs_lookup_csum()
Btrfs: fix wrong reservation of csums
Btrfs: fix double free in the btrfs_qgroup_account_ref()
Btrfs: limit the global reserve to 512mb
Btrfs: hold the ordered operations mutex when waiting on ordered extents
Btrfs: fix space accounting for unlink and rename
Btrfs: fix space leak when we fail to reserve metadata space
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
Btrfs: fix race between mmap writes and compression
Btrfs: fix memory leak in btrfs_create_tree()
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
Btrfs: fix missing qgroup reservation before fallocating
Btrfs: handle a bogus chunk tree nicely
Btrfs: update to use fs_state bit

Linus Torvalds
2013-03-30 02:13:25 +0800

28 Mar, 2013

2 commits

fdf30d1c1 Btrfs: limit the global reserve to 512mb ... Browse Code »

A user reported a problem where he was getting early ENOSPC with hundreds of
gigs of free data space and 6 gigs of free metadata space. This is because the
global block reserve was taking up the entire free metadata space. This is
ridiculous, we have infrastructure in place to throttle if we start using too
much of the global reserve, so instead of letting it get this huge just limit it
to 512mb so that users can still get work done. This allowed the user to
complete his rsync without issues. Thanks

Cc: stable@vger.kernel.org
Reported-and-tested-by: Stefan Priebe
Signed-off-by: Josef Bacik

Josef Bacik
2013-03-28 21:51:29 +0800
f4881bc7a Btrfs: fix space leak when we fail to reserve metadata space ... Browse Code »

Dave reported a warning when running xfstest 275. We have been leaking delalloc
metadata space when our reservations fail. This is because we were improperly
calculating how much space to free for our checksum reservations. The problem
is we would sometimes free up space that had already been freed in another
thread and we would end up with negative usage for the delalloc space. This
patch fixes the problem by calculating how much space the other threads would
have already freed, and then calculate how much space we need to free had we not
done the reservation at all, and then freeing any excess space. This makes
xfstests 275 no longer have leaked space. Thanks

Cc: stable@vger.kernel.org
Reported-by: David Sterba
Signed-off-by: Josef Bacik

Josef Bacik
2013-03-28 21:51:26 +0800

22 Mar, 2013

1 commit

835d974fa Btrfs: handle a bogus chunk tree nicely ... Browse Code »

If you restore a btrfs-image file system and try to mount that file system we'll
panic. That's because btrfs-image restores and just makes one big chunk to
envelope the whole disk, since they are really only meant to be messed with by
our btrfs-progs. So fix up btrfs_rmap_block and the callers of it for mount so
that we no longer panic but instead just return an error and fail to mount.
Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-03-22 07:24:31 +0800

18 Mar, 2013

1 commit

08637024a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"Eric's rcu barrier patch fixes a long standing problem with our
unmount code hanging on to devices in workqueue helpers. Liu Bo
nailed down a difficult assertion for in-memory extent mappings."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix warning of free_extent_map
Btrfs: fix warning when creating snapshots
Btrfs: return as soon as possible when edquot happens
Btrfs: return EIO if we have extent tree corruption
btrfs: use rcu_barrier() to wait for bdev puts at unmount
Btrfs: remove btrfs_try_spin_lock
Btrfs: get better concurrency for snapshot-aware defrag work

Linus Torvalds
2013-03-18 02:04:14 +0800

15 Mar, 2013

1 commit

492104c86 Btrfs: return EIO if we have extent tree corruption ... Browse Code »

The callers of lookup_inline_extent_info all handle getting an error back
properly, so return an error if we have corruption instead of being a jerk and
panicing. Still WARN_ON() since this is kind of crucial and I've been seeing it
a bit too much recently for my taste, I think we're doing something wrong
somewhere. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason

Josef Bacik
2013-03-15 02:57:29 +0800

03 Mar, 2013

1 commit

b695188dd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs update from Chris Mason:
"The biggest feature in the pull is the new (and still experimental)
raid56 code that David Woodhouse started long ago. I'm still working
on the parity logging setup that will avoid inconsistent parity after
a crash, so this is only for testing right now. But, I'd really like
to get it out to a broader audience to hammer out any performance
issues or other problems.

scrub does not yet correct errors on raid5/6 either.

Josef has another pass at fsync performance. The big change here is
to combine waiting for metadata with waiting for data, which is a big
latency win. It is also step one toward using atomics from the
hardware during a commit.

Mark Fasheh has a new way to use btrfs send/receive to send only the
metadata changes. SUSE is using this to make snapper more efficient
at finding changes between snapshosts.

Snapshot-aware defrag is also included.

Otherwise we have a large number of fixes and cleanups. Eric Sandeen
wins the award for removing the most lines, and I'm hoping we steal
this idea from XFS over and over again."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
btrfs: fixup/remove module.h usage as required
Btrfs: delete inline extents when we find them during logging
btrfs: try harder to allocate raid56 stripe cache
Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic
Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails
Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree
Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails
Btrfs: fix missing deleted items in btrfs_clean_quota_tree
btrfs: use only inline_pages from extent buffer
Btrfs: fix wrong reserved space when deleting a snapshot/subvolume
Btrfs: fix wrong reserved space in qgroup during snap/subv creation
Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot
btrfs: remove a printk from scan_one_device
Btrfs: fix NULL pointer after aborting a transaction
Btrfs: fix memory leak of log roots
Btrfs: copy everything if we've created an inline extent
btrfs: cleanup for open-coded alignment
Btrfs: do not change inode flags in rename
Btrfs: use reserved space for creating a snapshot
clear chunk_alloc flag on retryable failure
...

Linus Torvalds
2013-03-03 08:41:54 +0800

01 Mar, 2013

4 commits

88e081bf8 Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic ... Browse Code »

The original code is a little confusing and not clear, The right
way to deal with the kernel code like this:
[...]
if (ret)
goto out;
[...]

So i move the common clean_up code to the place labeled with
out_fail, this will be easier to maintain.

Signed-off-by: Wang Shilong
Signed-off-by: Josef Bacik

Wang Shilong
2013-03-01 23:13:04 +0800
a9870c0e0 Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails ... Browse Code »

commit eb6b88d92c6df083dd09a8c471011e3788dfd7c6 leads into another bug.
If it is just because qgroup_reserve fails, the function btrfs_qgroup_free
should not be called, otherwise, it will cause the wrong quota accounting.

Signed-off-by: Wang Shilong
Signed-off-by: Josef Bacik

Wang Shilong
2013-03-01 23:13:04 +0800
de1a2262b Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux ... Browse Code »

Pull writeback fixes from Wu Fengguang:
"Two writeback fixes

- fix negative (setpoint - dirty) in 32bit archs

- use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()"

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
Negative (setpoint-dirty) in bdi_position_ratio()
vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them

Linus Torvalds
2013-03-01 05:21:44 +0800
d5c120701 Btrfs: fix wrong reserved space in qgroup during snap/subv creation ... Browse Code »

There are two problems in the space reservation of the snapshot/
subvolume creation.
- don't reserve the space for the root item insertion
- the space which is reserved in the qgroup is different with
the free space reservation. we need reserve free space for
7 items, but in qgroup reservation, we need reserve space only
for 3 items.

So we implement new metadata reservation functions for the
snapshot/subvolume creation.

Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik

Miao Xie
2013-03-01 02:33:54 +0800

27 Feb, 2013

2 commits

fda2832fe btrfs: cleanup for open-coded alignment ... Browse Code »

Though most of the btrfs codes are using ALIGN macro for page alignment,
there are still some codes using open-coded alignment like the
following:
------
u64 mask = ((u64)root->stripesize - 1);
u64 ret = (val + mask) & ~mask;
------
Or even hidden one:
------
num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------

Sometimes these open-coded alignment is not so easy to understand for
newbie like me.

This commit changes the open-coded alignment to the ALIGN macro for a
better readability.

Also there is a previous patch from David Sterba with similar changes,
but the patch is for 3.2 kernel and seems not merged.
http://www.spinics.net/lists/linux-btrfs/msg12747.html

Cc: David Sterba
Signed-off-by: Qu Wenruo
Signed-off-by: Josef Bacik

Qu Wenruo
2013-02-27 00:04:13 +0800
a81cb9a2d clear chunk_alloc flag on retryable failure ... Browse Code »

I've experienced filesystem freezes with permanent spikes in the active
process count for quite a while, particularly on filesystems whose
available raw space has already been fully allocated to chunks.

While looking into this, I found a pretty obvious error in
do_chunk_alloc: it sets space_info->chunk_alloc, but if
btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
that flag set, which causes any other threads waiting for
space_info->chunk_alloc to become zero to spin indefinitely.

I haven't double-checked that this patch fixes the failure I've observed
fully (it's not exactly trivial to trigger), but it surely is a bug and
the fix is trivial, so... Please put it in :-)

What I saw in that function also happens to explain why in some cases I
see filesystems allocate a huge number of chunks that remain unused
(leading to the scenario above, of not having more chunks to allocate).
It happens for data and metadata, but not necessarily both. I'm
guessing some thread sets the force_alloc flag on the corresponding
space_info, and then several threads trying to get disk space end up
attempting to allocate a new chunk concurrently. All of them will see
the force_alloc flag and bump their local copy of force up to the level
they see first, and they won't clear it even if another thread succeeds
in allocating a chunk, thus clearing the force flag. Then each thread
that observed the force flag will, on its turn, force the allocation of
a new chunk. And any threads that come in while it does that will see
the force flag still set and pick it up, and so on. This sounds like a
problem to me, but... what should the correct behavior be? Clear
force_flag once we copy it to a local force? Reset force to the
incoming value on every loop? Set the flag to our incoming force if we
have it at first, clear our local flag, and move it from the space_info
when we determined that we are the thread that's going to perform the
allocation?

btrfs: clear chunk_alloc flag on retryable failure

From: Alexandre Oliva

If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
without clearing chunk_alloc in space_info. As a result, any further
calls to do_chunk_alloc on that filesystem will start busy-waiting for
chunk_alloc to be cleared, but it never will be. This patch adjusts
do_chunk_alloc so that it clears this flag in case of an error.

Signed-off-by: Alexandre Oliva
Signed-off-by: Josef Bacik

Alexandre Oliva
2013-02-27 00:00:51 +0800

22 Feb, 2013

1 commit

9afa3195b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial ... Browse Code »

Pull trivial tree from Jiri Kosina:
"Assorted tiny fixes queued in trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits)
DocBook: update EXPORT_SYMBOL entry to point at export.h
Documentation: update top level 00-INDEX file with new additions
ARM: at91/ide: remove unsused at91-ide Kconfig entry
percpu_counter.h: comment code for better readability
x86, efi: fix comment typo in head_32.S
IB: cxgb3: delay freeing mem untill entirely done with it
net: mvneta: remove unneeded version.h include
time: x86: report_lost_ticks doesn't exist any more
pcmcia: avoid static analysis complaint about use-after-free
fs/jfs: Fix typo in comment : 'how may' -> 'how many'
of: add missing documentation for of_platform_populate()
btrfs: remove unnecessary cur_trans set before goto loop in join_transaction
sound: soc: Fix typo in sound/codecs
treewide: Fix typo in various drivers
btrfs: fix comment typos
Update ibmvscsi module name in Kconfig.
powerpc: fix typo (utilties -> utilities)
of: fix spelling mistake in comment
h8300: Fix home page URL in h8300/README
xtensa: Fix home page URL in Kconfig
...

Linus Torvalds
2013-02-22 09:40:58 +0800

21 Feb, 2013

3 commits

24542bf7e btrfs: limit fallocate extent reservation to 256MB ... Browse Code »

Very large fallocate requests are cpu bound and result in extents with a
repeating pattern of ever decreasing size:

$ time fallocate -l 1T file
real 0m13.039s

( an excerpt of the extents from btrfs-debug-tree: )
prealloc data disk byte 1536292564992 nr 397312
prealloc data disk byte 1536292962304 nr 196608
prealloc data disk byte 1536293158912 nr 98304
prealloc data disk byte 1536293257216 nr 49152
prealloc data disk byte 1536293306368 nr 24576
prealloc data disk byte 1536293330944 nr 12288
prealloc data disk byte 1536293343232 nr 8192
prealloc data disk byte 1536293351424 nr 4096
prealloc data disk byte 1536293355520 nr 4096
prealloc data disk byte 1536293359616 nr 4096

The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
allocate the entire remaining size after each extent is allocated.
btrfs_reserve_extent() repeatedly cuts this requested size in half until
it gets down to the size that the allocators can return. We limit the
problem for now by capping each reservation at 256 meg.

The small extents come from a masking bug when decreasing the requested
reservation size. The high 32bits are cleared and the remaining low
bits might happen to reserve a small size. Fix this by using
round_down() which properly casts the mask.

After these fixes huge fallocate requests are fast and result in nice
large extents:

$ time fallocate -l 1T file
real 0m0.082s

prealloc data disk byte 1112425889792 nr 268435456
prealloc data disk byte 1112694325248 nr 268435456
prealloc data disk byte 1112962760704 nr 268435456

Reported-by: Eric Sandeen
Signed-off-by: Zach Brown
Signed-off-by: Chris Mason

Zach Brown
2013-02-21 03:06:25 +0800
e942f883b Merge branch 'raid56-experimental' into for-linus-3.9 ... Browse Code »

Signed-off-by: Chris Mason

Conflicts:
fs/btrfs/ctree.h
fs/btrfs/extent-tree.c
fs/btrfs/inode.c
fs/btrfs/volumes.c

Chris Mason
2013-02-21 03:06:05 +0800
b2c6b3e06 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btr… ... Browse Code »

…fs-next into for-linus-3.9

Signed-off-by: Chris Mason <chris.mason@fusionio.com>

Conflicts:
fs/btrfs/disk-io.c

Chris Mason
2013-02-21 03:05:45 +0800