02 Oct, 2011

1 commit

  • Add a READAHEAD extent buffer flag.
    Add a function to trigger a read with this flag set.

    Changes v2:
    - use extent buffer flags instead of extent state flags

    Changes v5:
    - adapt to changed read_extent_buffer_pages interface
    - don't return eb from reada_tree_block_flagged if it has CORRUPT flag set
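
    A minimal sketch of such a flagged readahead helper (the flag and
    function names follow the commit text; the body is illustrative, not
    the exact btrfs code):

    struct extent_buffer *reada_tree_block_flagged(struct btrfs_root *root,
                                                   u64 bytenr, u32 blocksize)
    {
            struct extent_buffer *eb;

            eb = btrfs_find_create_tree_block(root, bytenr, blocksize);
            if (!eb)
                    return NULL;

            /* tag the buffer so the endio code knows this was readahead */
            set_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags);
            /* start the read, but do not wait for it to finish */
            read_extent_buffer_pages(eb, WAIT_NONE);

            /* per changelog v5: never hand out a buffer marked corrupt */
            if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags)) {
                    free_extent_buffer(eb);
                    return NULL;
            }
            return eb;
    }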

    Signed-off-by: Arne Jansen


28 Jul, 2011

1 commit

  • This patch was originally from Tejun Heo. lockdep complains about the btrfs
    locking because we sometimes take btree locks from two different trees at the
    same time. The current classes are based only on level in the btree, which
    isn't enough information for lockdep to figure out if the lock is safe.

    This patch makes a class for each type of tree, and lumps all the FS trees that
    actually have files and directories into the same class.
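
    A hedged sketch of the idea (the table layout and helper name are
    illustrative; the patch defines its own keyset table):

    enum btrfs_lock_tree {
            LOCK_ROOT, LOCK_EXTENT, LOCK_CHUNK, LOCK_LOG, LOCK_FS, LOCK_MAX
    };

    /* one lockdep class per (tree type, btree level) pair */
    static struct lock_class_key btrfs_lock_keys[LOCK_MAX][BTRFS_MAX_LEVEL];

    static void set_eb_lock_class(u64 objectid, struct extent_buffer *eb,
                                  int level)
    {
            enum btrfs_lock_tree t;

            switch (objectid) {
            case BTRFS_ROOT_TREE_OBJECTID:   t = LOCK_ROOT;   break;
            case BTRFS_EXTENT_TREE_OBJECTID: t = LOCK_EXTENT; break;
            case BTRFS_CHUNK_TREE_OBJECTID:  t = LOCK_CHUNK;  break;
            case BTRFS_TREE_LOG_OBJECTID:    t = LOCK_LOG;    break;
            default:
                    /* all FS trees with files and directories share a class */
                    t = LOCK_FS;
                    break;
            }
            lockdep_set_class(&eb->lock, &btrfs_lock_keys[t][level]);
    }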

    Signed-off-by: Chris Mason


21 May, 2011

1 commit

  • Changelog V5 -> V6:
    - Fix OOM when the memory load is high, by storing the delayed nodes in
    the root's radix tree and letting btrfs inodes go.

    Changelog V4 -> V5:
    - Fix the race on adding the delayed node to the inode, spotted by
    Chris Mason.
    - Merge Chris Mason's incremental patch into this patch.
    - Fix the deadlock between readdir() and memory fault, reported by
    Itaru Kitayama.

    Changelog V3 -> V4:
    - Fix the nested lock reported by Itaru Kitayama, by updating the space
    cache inode in time.

    Changelog V2 -> V3:
    - Fix the race between the delayed worker and the task that balances
    delayed items, reported by Tsutomu Itoh.
    - Modify the patch to address David Sterba's comments.
    - Fix the CPU-recursive spinlock bug reported by Chris Mason.

    Changelog V1 -> V2:
    - Break up the global rb-tree: use a list to manage the delayed nodes,
    which are created for every directory and file and used to manage the
    delayed directory name index items and the delayed inode item.
    - Introduce a worker to deal with the delayed nodes.

    Compared with Ext3/4, the performance of file creation and deletion on
    btrfs is very poor. The reason is that btrfs must do a lot of b+ tree
    insertions, such as the inode item, the directory name item, the
    directory name index and so on.

    If we can delay some of the b+ tree insertions or deletions, we can
    improve the performance, so we made this patch, which implements delayed
    directory name index insertion/deletion and delayed inode update.

    Implementation:
    - Introduce a delayed root object into the filesystem, which uses two
    lists to manage the delayed nodes that are created for every
    file/directory (the structures are sketched after this list). One list
    manages the delayed nodes that have delayed items, and the other manages
    the delayed nodes that are waiting to be dealt with by the work thread.
    - Every delayed node has two rb-trees: one manages the directory name
    indexes that are going to be inserted into the b+ tree, and the other
    manages the directory name indexes that are going to be deleted from the
    b+ tree.
    - Introduce a worker to deal with the delayed operations: the delayed
    directory name index insertions and deletions and the delayed inode
    updates.
    When the number of delayed items goes beyond the lower limit, we create
    works for some delayed nodes, insert them into the worker's work queue,
    and go back.
    When the number of delayed items goes beyond the upper bound, we create
    works for all the delayed nodes that haven't been dealt with, insert
    them into the worker's work queue, and then wait until the number of
    untreated items drops below some threshold value.
    - When we want to insert a directory name index into the b+ tree, we
    just add the information to the delayed inserting rb-tree.
    Then we check the number of delayed items and do delayed item balance.
    (The balance policy is described above.)
    - When we want to delete a directory name index from the b+ tree, we
    first search for it in the inserting rb-tree. If we find it, we just
    drop it. If not, we add its key to the delayed deleting rb-tree.
    As with the inserting rb-tree, we then check the number of delayed items
    and do delayed item balance.
    - When we want to update the metadata of some inode, we cache the data
    of the inode in the delayed node. The worker will flush it into the
    b+ tree after dealing with the delayed insertions and deletions.
    - We move a delayed node to the tail of the list after we access it.
    This way we can cache more delayed items and merge more inode updates.
    - When we commit a transaction, we deal with all the delayed nodes.
    - The delayed node is freed when we free the btrfs inode.
    - Before we log the inode items, we commit all the directory name index
    items and the delayed inode update.
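
    The bookkeeping described above can be sketched roughly as follows. This
    is an illustrative approximation: the field and helper names and the
    threshold constants are stand-ins, not the patch's exact definitions.

    struct btrfs_delayed_root {
            spinlock_t lock;
            struct list_head node_list;     /* nodes with pending delayed items */
            struct list_head prepare_list;  /* nodes queued for the work thread */
            atomic_t items;                 /* total pending delayed items */
            wait_queue_head_t wait;         /* throttled tasks wait here */
    };

    struct btrfs_delayed_node {
            u64 inode_id;
            struct list_head n_list;        /* linkage on the delayed root's lists */
            struct rb_root ins_root;        /* name indexes to insert into the b+ tree */
            struct rb_root del_root;        /* keys of name indexes to delete */
            struct btrfs_inode_item inode_item; /* cached inode update, flushed last */
            bool inode_dirty;               /* cached inode update pending? */
    };

    /* Balance policy: kick the worker at the lower limit, block at the
     * upper bound until the backlog drains. Thresholds are made up. */
    static void delayed_item_balance(struct btrfs_delayed_root *dr)
    {
            int nr = atomic_read(&dr->items);

            if (nr < DELAYED_LOWER_LIMIT)
                    return;
            if (nr < DELAYED_UPPER_BOUND) {
                    queue_some_delayed_nodes(dr);   /* hypothetical helper */
                    return;
            }
            queue_all_delayed_nodes(dr);            /* hypothetical helper */
            wait_event(dr->wait,
                       atomic_read(&dr->items) < DELAYED_LOWER_LIMIT);
    }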

    I did a quick test with the benchmark tool[1] and found that we can
    improve the performance of file creation by ~15% and of file deletion
    by ~20%.

    Before applying this patch:
    Create files:
    Total files: 50000
    Total time: 1.096108
    Average time: 0.000022
    Delete files:
    Total files: 50000
    Total time: 1.510403
    Average time: 0.000030

    After applying this patch:
    Create files:
    Total files: 50000
    Total time: 0.932899
    Average time: 0.000019
    Delete files:
    Total files: 50000
    Total time: 1.215732
    Average time: 0.000024

    [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

    Many thanks to Kitayama-san for his help!

    Signed-off-by: Miao Xie
    Reviewed-by: David Sterba
    Tested-by: Tsutomu Itoh
    Tested-by: Itaru Kitayama
    Signed-off-by: Chris Mason


06 May, 2011

1 commit

  • Remove static and global declarations and/or definitions. Reduces size
    of btrfs.ko by ~3.4kB.

      text  data  bss     dec    hex  filename
    402081  7464  200  409745  64091  btrfs.ko.base
    398620  7144  200  405964  631cc  btrfs.ko.remove-all

    Signed-off-by: David Sterba


18 Jan, 2011

1 commit

  • This patch comes from "Forced readonly mounts on errors" ideas.

    As we know, this is the first step in being more fault tolerant of disk
    corruptions instead of just using BUG() statements.

    The major content:
    - add a framework for generating errors that should result in
    filesystems going readonly.
    - keep the FS state in the on-disk super block.
    - make sure that all resources are freed and released at umount time.
    - make sure that after the FS is forced readonly on error, there will be
    no more disk changes before the FS is corrected. For this, we must stop
    write operations.

    After this patch is applied, the conversion from BUG() to such a framework can
    happen incrementally.
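
    The core pattern of such a framework can be sketched like this (a
    simplified approximation; the message text and field handling here are
    illustrative rather than the patch's exact code):

    #define btrfs_std_error(fs_info, errno)                                \
    do {                                                                   \
            if (errno)                                                     \
                    __btrfs_std_error((fs_info), __func__, __LINE__,       \
                                      (errno));                            \
    } while (0)

    static void __btrfs_std_error(struct btrfs_fs_info *fs_info,
                                  const char *function, unsigned int line,
                                  int err)
    {
            struct super_block *sb = fs_info->sb;

            printk(KERN_CRIT "btrfs: error %d in %s:%u, forcing readonly\n",
                   err, function, line);
            /* remembered here and written into the super block on commit */
            fs_info->fs_state = BTRFS_SUPER_FLAG_ERROR;
            /* stop write operations until the FS is corrected */
            sb->s_flags |= MS_RDONLY;
    }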

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason


25 May, 2010

2 commits

  • The async helper threads offload crc work onto all the
    CPUs, and make streaming writes much faster. This
    changes the O_DIRECT write code to use them. The only
    small complication is that we need to pass in the
    logical offset in the file for each bio, because we can't
    find it in the bio's pages.

    Signed-off-by: Chris Mason

  • Previous patches make the allocator return -ENOSPC if there is no
    unreserved free metadata space. This patch updates the tree log code
    and various other places to propagate/handle the ENOSPC error.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason


25 Mar, 2009

1 commit

  • btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree
    for the buffers it was dirtying. This may require a kmalloc and it
    was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
    set any btree locks they were holding to blocking first.

    This commit changes dirty tracking for extent buffers to just use a flag
    in the extent buffer. Now that we have one and only one extent buffer
    per page, this can be safely done without losing dirty bits along the way.

    This also introduces a path->leave_spinning flag that callers of
    btrfs_search_slot can use to indicate they will properly deal with a
    path returned where all the locks are spinning instead of blocking.

    Many of the btree search callers now expect spinning paths,
    resulting in better btree concurrency overall.
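
    For callers, opting in is a one-line change. A hedged usage sketch,
    assuming the btrfs_search_slot signature of this era:

    struct btrfs_path *path = btrfs_alloc_path();
    int ret;

    if (!path)
            return -ENOMEM;

    /* we promise to handle a path whose locks are still spinning */
    path->leave_spinning = 1;
    ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
    if (ret == 0)
            /* dirtying is now an atomic flag set, safe under a spinlock */
            btrfs_mark_buffer_dirty(path->nodes[0]);
    btrfs_free_path(path);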

    Signed-off-by: Chris Mason


13 Feb, 2009

1 commit

  • Btrfs is currently using spin_lock_nested with a nested value based
    on the tree depth of the block. But, this doesn't quite work because
    the max tree depth is bigger than what spin_lock_nested can deal with,
    and because locks are sometimes taken before the level field is filled in.

    The solution here is to use lockdep_set_class_and_name instead, and to
    set the class before unlocking the pages when the block is read from the
    disk and just after init of a freshly allocated tree block.

    btrfs_clear_path_blocking is also changed to take the locks in the proper
    order, and it also makes sure all the locks currently held are properly
    set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
    gets upset about bad lock ordering.

    The lockdep magic came from Peter Zijlstra.
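
    The per-level class setup can be sketched roughly as follows (the array
    and helper names are illustrative):

    /* one class (and name) per btree level, applied right after a block is
     * read from disk or freshly allocated */
    static struct lock_class_key btrfs_eb_keys[BTRFS_MAX_LEVEL + 1];
    static const char *btrfs_eb_names[BTRFS_MAX_LEVEL + 1] = {
            "btrfs-level-0", "btrfs-level-1", "btrfs-level-2", /* ... */
    };

    static void set_buffer_lockdep_class(struct extent_buffer *eb, int level)
    {
            lockdep_set_class_and_name(&eb->lock, &btrfs_eb_keys[level],
                                       btrfs_eb_names[level]);
    }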

    Signed-off-by: Chris Mason


22 Jan, 2009

1 commit

  • To improve performance, btrfs_sync_log merges tree log sync
    requests. But it wrongly merges sync requests for different
    tree logs. If multiple tree logs are synced at the same time,
    only one of them actually gets synced.

    This patch makes the following changes to fix the bug:

    Move most tree log related fields from btrfs_fs_info to
    btrfs_root. This allows merging sync requests separately
    for each tree log.

    Don't insert the root item into the log root tree immediately
    after the log tree is allocated. The root item for a log tree is
    inserted when the log tree gets synced for the first time. This
    allows syncing the log root tree without first syncing all the
    log trees.

    At tree-log sync, btrfs_sync_log first syncs the log tree, then
    updates the corresponding root item in the log root tree, then syncs
    the log root tree, and finally updates the super block.
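
    In outline, the new sync order could look like this sketch (the helper
    functions are hypothetical stand-ins for the steps above):

    int btrfs_sync_log(struct btrfs_trans_handle *trans,
                       struct btrfs_root *root)
    {
            struct btrfs_root *log = root->log_root;
            struct btrfs_root *log_root_tree = root->fs_info->log_root_tree;
            int ret;

            ret = write_and_wait_log(log);      /* 1. sync this log tree */
            if (ret)
                    return ret;
            /* 2. record/update this log's root item in the log root tree */
            update_log_root_item(log_root_tree, log);
            ret = write_and_wait_log(log_root_tree); /* 3. sync log root tree */
            if (ret)
                    return ret;
            return write_super_for_log(root->fs_info); /* 4. update super */
    }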

    Signed-off-by: Yan Zheng


09 Dec, 2008

1 commit

  • This patch implements superblock duplication. Superblocks
    are stored at offsets 16K, 64M and 256G on every device.
    Space used by superblocks is preserved by the allocator,
    which uses a reverse mapping function to find the logical
    addresses that correspond to superblocks.
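
    The mirror offsets follow a simple shift pattern. A sketch consistent
    with the offsets listed above (treat the macro names as illustrative):

    #define BTRFS_SUPER_MIRROR_MAX          3
    #define BTRFS_SUPER_MIRROR_SHIFT        12

    /* 16K << 0 = 16K, 16K << 12 = 64M, 16K << 24 = 256G */
    static inline u64 btrfs_sb_offset(int mirror)
    {
            u64 start = 16 * 1024;

            return start << (BTRFS_SUPER_MIRROR_SHIFT * mirror);
    }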

    Signed-off-by: Yan Zheng


13 Nov, 2008

1 commit

  • This patch adds mount ro and remount support. The main
    changes in the patch are: adding btrfs_remount and a related
    helper function; splitting the transaction related code
    out of close_ctree into btrfs_commit_super; and updating the
    allocator to properly handle read only block groups.

    Signed-off-by: Yan Zheng


07 Nov, 2008

1 commit

  • Btrfs uses kernel threads to create async work queues for cpu intensive
    operations such as checksumming and decompression. These work well,
    but they make it difficult to keep IO order intact.

    A single writepages call from pdflush or fsync will turn into a number
    of bios, and each bio is checksummed in parallel. Once the checksum is
    computed, the bio is sent down to the disk, and since we don't control
    the order in which the parallel operations happen, they might go down to
    the disk in almost any order.

    The code deals with this somewhat by having deep work queues for a single
    kernel thread, making it very likely that a single thread will process all
    the bios for a single inode.

    This patch introduces an explicitly ordered work queue. As work structs
    are placed into the queue they are put onto the tail of a list. They have
    three callbacks:

    ->func (cpu intensive processing here)
    ->ordered_func (order sensitive processing here)
    ->ordered_free (free the work struct, all processing is done)

    The func callback does the cpu intensive
    work, and when it completes the work struct is marked as done.

    Every time a work struct completes, the list is checked to see if the head
    is marked as done. If so the ordered_func callback is used to do the
    order sensitive processing and the ordered_free callback is used to do
    any cleanup. Then we loop back and check the head of the list again.
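
    A simplified sketch of the completion side (the struct layout and
    locking are approximations of what the patch introduces):

    struct btrfs_work {
            void (*func)(struct btrfs_work *work);          /* cpu intensive */
            void (*ordered_func)(struct btrfs_work *work);  /* order sensitive */
            void (*ordered_free)(struct btrfs_work *work);  /* final cleanup */
            unsigned long flags;       /* WORK_DONE_BIT set when ->func ends */
            struct list_head list;     /* position in the ordered list */
    };

    static void run_ordered_work(struct btrfs_workers *wq)
    {
            struct btrfs_work *work;

            spin_lock(&wq->order_lock);
            while (!list_empty(&wq->order_list)) {
                    work = list_first_entry(&wq->order_list,
                                            struct btrfs_work, list);
                    /* stop at the first unfinished entry; a later
                     * completion will run this loop again */
                    if (!test_bit(WORK_DONE_BIT, &work->flags))
                            break;

                    spin_unlock(&wq->order_lock);
                    work->ordered_func(work);       /* in queue order */
                    spin_lock(&wq->order_lock);

                    list_del(&work->list);
                    spin_unlock(&wq->order_lock);
                    work->ordered_free(work);       /* all processing done */
                    spin_lock(&wq->order_lock);
            }
            spin_unlock(&wq->order_lock);
    }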

    This patch also changes the checksumming code to use the ordered
    workqueues. On a 4 drive array, it increases streaming writes from
    280MB/s to 350MB/s.

    Signed-off-by: Chris Mason


30 Oct, 2008

1 commit

  • This is a large change for adding compression on reading and writing,
    both for inline and regular extents. It does some fairly large
    surgery to the writeback paths.

    Compression is off by default and enabled by mount -o compress. Even
    when the -o compress mount option is not used, it is possible to read
    compressed extents off the disk.

    If compression for a given set of pages fails to make them smaller, the
    file is flagged to avoid future compression attempts.

    * While finding delalloc extents, the pages are locked before being sent down
    to the delalloc handler. This allows the delalloc handler to do complex things
    such as cleaning the pages, marking them writeback and starting IO on their
    behalf.

    * Inline extents are inserted at delalloc time now. This allows us to compress
    the data before inserting the inline extent, and it allows us to insert
    an inline extent that spans multiple pages.

    * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
    are changed to record both an in-memory size and an on disk size, as well
    as a flag for compression.

    From a disk format point of view, the extent pointers in the file are changed
    to record the on disk size of a given extent and some encoding flags.
    Space in the disk format is allocated for compression encoding, as well
    as encryption and a generic 'other' field. Neither the encryption nor
    the 'other' field is currently used.

    In order to limit the amount of data read for a single random read in the
    file, the size of a compressed extent is limited to 128k. This is a
    software only limit, the disk format supports u64 sized compressed extents.

    In order to limit the ram consumed while processing extents, the uncompressed
    size of a compressed extent is limited to 256k. This is a software only limit
    and will be subject to tuning later.
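
    Expressed as constants and a chunking loop, the two limits might look
    like this (names and helper are illustrative, not the literal macros):

    #define MAX_COMPRESSED          (128 * 1024) /* software cap; format allows u64 */
    #define MAX_UNCOMPRESSED        (256 * 1024) /* bounds ram used per extent */

    /* walk a dirty range in windows no larger than the uncompressed cap */
    static void compress_range(struct inode *inode, u64 start, u64 end)
    {
            u64 cur = start;

            while (cur <= end) {
                    u64 len = min_t(u64, end - cur + 1, MAX_UNCOMPRESSED);

                    compress_one_extent(inode, cur, cur + len - 1); /* hypothetical */
                    cur += len;
            }
    }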

    Checksumming is still done on compressed extents, and it is done on the
    uncompressed version of the data. This way additional encodings can be
    layered on without having to figure out which encoding to checksum.

    Compression happens at delalloc time, which is basically single threaded
    because it is usually done by a single pdflush thread. This makes it tricky to
    spread the compression load across all the cpus on the box. We'll have to
    look at parallel pdflush walks of dirty inodes at a later time.

    Decompression is hooked into readpages and it does spread across CPUs nicely.

    Signed-off-by: Chris Mason

