Eric Lee / smarc-fsl-linux-kernel

03 Jan, 2010

1 commit

4b6764fa9 writeback: add missing kernel-doc notation

Fix the following htmldocs warning:

Warning(fs/fs-writeback.c:255): No description found for parameter 'sb'

Signed-off-by: Jaswinder Singh Rajput
Signed-off-by: Randy Dunlap
Acked-by: Wu Fengguang
Cc: Peter Zijlstra
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Linus Torvalds

Jaswinder Singh Rajput
2010-01-03 02:09:44 +0800

23 Dec, 2009

1 commit

17bd55d03 fs-writeback: Add helper function to start writeback if idle

ext4, at least, would like to start pushing on writeback if it starts
to get close to ENOSPC when reserving worst-case blocks for delalloc
writes. Writing out delalloc data will convert those worst-case
predictions into usually smaller actual usage, freeing up space
before we hit ENOSPC based on this speculation.

Thanks to Jens for the suggestion for the helper function,
& the naming help.

I've made the helper return status on whether writeback was
started even though I don't plan to use it in the ext4 patch;
it seems like it would be potentially useful to test this
in some cases.
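
The "start writeback only if idle, and report whether we did" idea can be pictured with a small user-space model. This is a sketch under stated assumptions, not the kernel code: `struct bdi_model`, `struct sb_model`, and `start_writeback_if_idle()` are hypothetical names invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a backing device; not the kernel's struct. */
struct bdi_model {
    bool writeback_running;  /* true while a flush is already in flight */
    int  flushes_started;    /* how many flushes this model has kicked off */
};

/* Hypothetical stand-in for a super_block that points at its device. */
struct sb_model {
    struct bdi_model *bdi;
};

/*
 * Start writeback only if the device is idle.
 * Returns 1 if this call started writeback, 0 if one was already running.
 */
static int start_writeback_if_idle(struct sb_model *sb)
{
    if (sb->bdi->writeback_running)
        return 0;
    sb->bdi->writeback_running = true;
    sb->bdi->flushes_started++;
    return 1;
}
```

A caller that is getting close to ENOSPC could call the helper and ignore the result, while a test could check the return value to see whether a flush was actually started.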

Signed-off-by: Eric Sandeen
Acked-by: Jan Kara

Eric Sandeen
2009-12-23 20:57:07 +0800

03 Dec, 2009

3 commits

0d99519ef writeback: remove unused nonblocking and congestion checks

- no one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more
- lumpy pageout will want to do nonblocking writeback without the
congestion wait

So remove the congestion checks as suggested by Chris.

Signed-off-by: Wu Fengguang
Cc: Chris Mason
Cc: Jens Axboe
Cc: Trond Myklebust
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Evgeniy Polyakov
Cc: Alex Elder
Signed-off-by: Jens Axboe

Wu Fengguang
2009-12-03 20:54:25 +0800
b17621fed writeback: introduce wbc.for_background

It will lower the flush priority for NFS, and maybe more in future.

Signed-off-by: Wu Fengguang
Cc: Trond Myklebust
Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe

Wu Fengguang
2009-12-03 20:54:25 +0800
951c30d13 writeback: remove the always false bdi_cap_writeback_dirty() test

This is dead code because no bdi flush thread will be started for
!bdi_cap_writeback_dirty bdi.

Signed-off-by: Wu Fengguang
Cc: Jens Axboe
Cc: Trond Myklebust
Cc: Christoph Hellwig
Signed-off-by: Jens Axboe

Wu Fengguang
2009-12-03 20:54:25 +0800

26 Sep, 2009

12 commits

a72bfd4de writeback: pass in super_block to bdi_start_writeback()

Sometimes we only want to write pages from a specific super_block,
so allow that to be passed in.

This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
causing writeback on all super_blocks on a bdi, where we only really
want to sync a specific sb from writeback_inodes_sb().

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-26 06:10:40 +0800
56a131dcf writeback: writeback_inodes_sb() should use bdi_start_writeback()

Pointless to iterate other devices looking for a super, when
we have a bdi mapping.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-26 00:08:26 +0800
b3af9468a writeback: don't delay inodes redirtied by a fast dirtier

Debug traces show that in per-bdi writeback, the inode under writeback
almost always gets redirtied by a busy dirtier. We used to call
redirty_tail() in this case, which could delay the inode for up to 30s.

This is unacceptable because it now happens so frequently for plain cp/dd,
that the accumulated delays could make writeback of big files very slow.

So let's distinguish between data redirty and metadata only redirty.
The first one is caused by a busy dirtier, while the latter one could
happen in XFS, NFS, etc. when they are doing delalloc or updating isize.

The inode being busy dirtied will now be requeued for next io, while
the inode being redirtied by fs will continue to be delayed to avoid
repeated IO.
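
The distinction drawn above can be sketched as a small decision function. This is a simplified user-space model, not the patch itself: the flag names, their bit values, and `classify_redirty()` are assumptions made up for illustration.

```c
/* Illustrative flag bits; the values are assumptions, not the kernel's. */
enum {
    DIRTY_META  = 1 << 0,  /* only inode metadata was redirtied */
    DIRTY_PAGES = 1 << 1   /* data pages were redirtied */
};

enum next_queue {
    QUEUE_DONE,     /* inode is clean, nothing left to do */
    QUEUE_MORE_IO,  /* requeue for the next round of IO */
    QUEUE_DELAYED   /* put back on the dirty list and wait */
};

/*
 * Decide where an inode goes after a writeback pass, based on
 * which dirty bits were set again while it was being written.
 */
static enum next_queue classify_redirty(unsigned int state_after_write)
{
    if (state_after_write & DIRTY_PAGES)
        return QUEUE_MORE_IO;   /* a busy dirtier keeps adding data */
    if (state_after_write & DIRTY_META)
        return QUEUE_DELAYED;   /* the fs touched metadata; avoid repeated IO */
    return QUEUE_DONE;
}
```

Checking the data bit first means an inode with both kinds of redirty is treated as busy-dirtied, which is the case the commit wants to stop delaying.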

CC: Jan Kara
CC: Theodore Ts'o
CC: Dave Chinner
CC: Chris Mason
CC: Christoph Hellwig
Signed-off-by: Wu Fengguang
Signed-off-by: Jens Axboe

Wu Fengguang
2009-09-26 00:08:26 +0800
9ecc2738a writeback: make the super_block pinning more efficient

Currently we pin the inode->i_sb for every single inode. This
increases cache traffic on the sb->s_umount sem. Let's instead
cache the inode sb pin state and keep the super_block pinned
for as long as we keep writing out inodes from the same
super_block.
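
A minimal sketch of the caching idea, assuming a plain counter in place of the real pinning mechanism. `struct sb_model`, `struct inode_model`, and `write_inodes_cached_pin()` are hypothetical names, and the model only counts pin operations rather than taking any lock.

```c
#include <stddef.h>

/* Hypothetical super_block model: pin_count stands in for the real pin. */
struct sb_model {
    int pin_count;
};

struct inode_model {
    struct sb_model *sb;
};

/*
 * Walk n inodes and pin a super_block only when it differs from the
 * one currently pinned. Returns the number of pin operations performed.
 */
static int write_inodes_cached_pin(struct inode_model *inodes, size_t n)
{
    struct sb_model *pinned = NULL;
    int pins = 0;

    for (size_t i = 0; i < n; i++) {
        if (inodes[i].sb != pinned) {
            if (pinned)
                pinned->pin_count--;   /* drop the previous pin */
            pinned = inodes[i].sb;
            pinned->pin_count++;       /* take the new pin */
            pins++;
        }
        /* ... the inode would be written out here ... */
    }
    if (pinned)
        pinned->pin_count--;           /* drop the last pin */
    return pins;
}
```

With inodes that arrive grouped by super_block, the number of pin operations drops from one per inode to one per super_block.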

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-26 00:08:26 +0800
cf137307c writeback: don't resort for a single super_block in move_expired_inodes()

If we only moved inodes from a single super_block to the temporary
list, there's no point in doing a resort for multiple super_blocks.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-26 00:08:26 +0800
5c03449d3 writeback: move inodes from one super_block together

__mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk has
several partitions, writeback might keep the spindle moving between partitions.
To reduce the movement, it is better to write a big chunk of one partition and
then move to another. Inodes from one fs are usually in one partition, so ideally
moving inodes from one fs together should reduce spindle movement. This patch
tries to address this. Before per-bdi writeback was added, the behavior was to
write inodes from one fs first and then another, so the patch restores the
previous behavior. The loop in the patch is a bit ugly; should we add a dirty
list for each superblock in bdi_writeback?
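
The regrouping described above can be illustrated with an array-based model. This is only a sketch of the idea: the kernel works on linked lists, while `struct inode_model`, the integer `sb_id`, and `group_by_sb()` are assumptions chosen to keep the example self-contained.

```c
#include <stddef.h>

/* Hypothetical inode model: sb_id identifies the owning filesystem. */
struct inode_model {
    int sb_id;
    int ino;
};

/*
 * Copy n inodes from in[] to out[] so that inodes sharing an sb_id are
 * adjacent. Groups appear in order of first occurrence and the order
 * inside each group is preserved. Quadratic, which is fine for a sketch.
 */
static void group_by_sb(const struct inode_model *in,
                        struct inode_model *out, size_t n)
{
    size_t filled = 0;

    for (size_t i = 0; i < n; i++) {
        int seen = 0;

        /* Skip filesystems that an earlier pass already emitted. */
        for (size_t k = 0; k < i; k++) {
            if (in[k].sb_id == in[i].sb_id) {
                seen = 1;
                break;
            }
        }
        if (seen)
            continue;

        /* Emit every inode that belongs to this filesystem. */
        for (size_t j = i; j < n; j++) {
            if (in[j].sb_id == in[i].sb_id)
                out[filled++] = in[j];
        }
    }
}
```

An interleaved input such as sb 1, sb 2, sb 1, sb 2, sb 1 comes out as three sb 1 inodes followed by two sb 2 inodes.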

A test on a two-partition disk with the attached fio script shows about a
3% to 6% improvement.

Signed-off-by: Shaohua Li
Reviewed-by: Wu Fengguang
Signed-off-by: Jens Axboe

Shaohua Li
2009-09-26 00:08:25 +0800
5b0830cb9 writeback: get rid to incorrect references to pdflush in comments

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-26 00:08:25 +0800
71fd05a88 writeback: improve readability of the wb_writeback() continue/break logic

And throw some comments in there, too.

Reviewed-by: Wu Fengguang
Signed-off-by: Jens Axboe

Jens Axboe
2009-09-26 00:08:25 +0800
ae1b7f7d4 writeback: cleanup writeback_single_inode()

Make the if-else straight in writeback_single_inode().
No behavior change.

Cc: Jan Kara
Cc: Michael Rubin
Cc: Peter Zijlstra
Signed-off-by: Fengguang Wu
Signed-off-by: Jens Axboe

Wu Fengguang
2009-09-26 00:08:25 +0800
7fbdea323 writeback: kupdate writeback shall not stop when more io is possible

Fix the kupdate case, which disregards wbc.more_io and stops writeback
prematurely even when there are more inodes to be synced.

wbc.more_io should always be respected.

Also remove the pages_skipped check. It will be set when some page(s) of some
inode(s) cannot be written for now. Such inodes will be delayed for a while.
This variable has nothing to do with whether there are other writeable inodes.

CC: Jan Kara
CC: Dave Chinner
CC: Peter Zijlstra
Signed-off-by: Wu Fengguang
Signed-off-by: Jens Axboe

Wu Fengguang
2009-09-26 00:08:25 +0800
d3ddec763 writeback: stop background writeback when below background threshold

Treat bdi_start_writeback(0) as a special request to do background write,
and stop such work when we are below the background dirty threshold.
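
The stop condition can be sketched with a toy model. Everything here is an assumption made for illustration: `struct vm_model`, `writeback_model()`, the chunked loop, and the use of `LONG_MAX` as an unlimited budget are modelling choices, not a copy of the kernel logic.

```c
#include <limits.h>

/* Hypothetical view of the dirty state, counted in pages. */
struct vm_model {
    long dirty_pages;
    long background_thresh;
};

/*
 * Write dirty pages in chunks. nr_pages == 0 is treated as a background
 * request, which stops as soon as the dirty count is no longer above
 * the background threshold. Returns the number of pages written.
 */
static long writeback_model(struct vm_model *vm, long nr_pages, long chunk)
{
    int for_background = (nr_pages == 0);
    long budget = for_background ? LONG_MAX : nr_pages;
    long written = 0;

    if (chunk <= 0)
        return 0;  /* guard against a loop that makes no progress */

    while (budget > 0 && vm->dirty_pages > 0) {
        long n = chunk;

        if (for_background && vm->dirty_pages <= vm->background_thresh)
            break;  /* below the threshold: background work is done */
        if (n > budget)
            n = budget;
        if (n > vm->dirty_pages)
            n = vm->dirty_pages;

        vm->dirty_pages -= n;
        budget -= n;
        written += n;
    }
    return written;
}
```

An explicit request still writes up to the number of pages asked for, regardless of the threshold; only the background case stops early.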

Also simplify the (nr_pages
CC: Jan Kara
Acked-by: Peter Zijlstra
Signed-off-by: Wu Fengguang
Signed-off-by: Jens Axboe

Wu Fengguang
2009-09-26 00:08:24 +0800
a5989bdc9 fs: Fix busyloop in wb_writeback()

If all inodes are under writeback (e.g. when there's only one inode
with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades
to busylooping until the I_SYNC flag of the inode is cleared. Fix the problem by
waiting on the I_SYNC flag of an inode on the b_more_io list in case we failed to
write anything.

Tested-by: Wu Fengguang
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2009-09-26 00:08:24 +0800

16 Sep, 2009

12 commits

1ef7d9aa3 writeback: fix possible bdi writeback refcounting problem

wb_clear_pending AFAIKS should not be called after the item has been
put on the list, except by the worker threads. It could lead to the
situation where the refcount is decremented below 0 and cause lots of
problems.

Presumably the !wb_has_dirty_io case is not a common one, so it can
be discovered when the thread wakes up to check?

Also add a comment in bdi_work_clear.

Signed-off-by: Nick Piggin
Signed-off-by: Jens Axboe

Nick Piggin
2009-09-16 21:18:53 +0800
77b9d059c writeback: Fix bdi use after free in wb_work_complete()

By the time bdi_work_on_stack gets evaluated again in bdi_work_free, it
can already have been deallocated and used for something else in the
!on stack case, giving a false positive in this test and causing
corruption.

Signed-off-by: Nick Piggin
Signed-off-by: Jens Axboe

Nick Piggin
2009-09-16 21:18:52 +0800
77fad5e62 writeback: improve scalability of bdi writeback work queues

If you're going to do an atomic RMW on each list entry, there's not much
point in all the RCU complexities of the list walking. This is only going
to help the multi-thread case I guess, but it doesn't hurt to do now.

Signed-off-by: Nick Piggin
Signed-off-by: Jens Axboe

Nick Piggin
2009-09-16 21:18:52 +0800
deed62edf writeback: remove smp_mb(), it's not needed with list_add_tail_rcu()

list_add_tail_rcu contains required barriers.

Signed-off-by: Nick Piggin
Signed-off-by: Jens Axboe

Nick Piggin
2009-09-16 21:18:52 +0800
49db04143 writeback: use schedule_timeout_interruptible()

Gets rid of a manual set_current_state().

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 21:18:52 +0800
8010c3b63 writeback: add comments to bdi_work structure

And document its retriever, get_next_work_item().

Acked-by: Jan Kara
Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 21:18:52 +0800
b6e51316d writeback: separate starting of sync vs opportunistic writeback

bdi_start_writeback() is currently split into two paths, one for
WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
only WB_SYNC_NONE.

Push down the writeback_control allocation and only accept the
parameters that make sense for each function. This cleans up
the API considerably.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 21:18:52 +0800
bcddc3f01 writeback: inline allocation failure handling in bdi_alloc_queue_work()

This gets rid of work == NULL in bdi_queue_work() and puts the
OOM handling where it belongs.

Acked-by: Jan Kara
Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 21:18:52 +0800
cfc4ba536 writeback: use RCU to protect bdi_list

Now that bdi_writeback_all() no longer handles integrity writeback,
it doesn't have to block anymore. This means that we can switch
bdi_list reader side protection to RCU.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 21:18:51 +0800
f11fcae84 writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout

Data integrity writeback must use bdi_start_writeback() and ensure
that wbc->sb and wbc->bdi are set.

Acked-by: Jan Kara
Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 21:18:51 +0800
c4a77a6c7 writeback: make wb_writeback() take an argument structure

We need to be able to pass in range_cyclic as well, so instead
of growing yet another argument, split the arguments into a
struct wb_writeback_args structure that we can use internally.
Also makes it easier to just copy all members to an on-stack
struct, since we can't access work after clearing the pending
bit.
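
The pattern of bundling arguments into a struct and copying them to the stack before releasing the work item might look roughly like this. It is a user-space sketch: the `_model` types, their fields, and `process_work_model()` are assumptions for illustration and do not reproduce the kernel's definitions.

```c
#include <stdbool.h>

enum sync_mode_model { SYNC_NONE_MODEL, SYNC_ALL_MODEL };

/* All writeback parameters travel together in one struct. */
struct wb_args_model {
    long nr_pages;
    enum sync_mode_model sync_mode;
    bool for_kupdate;
    bool range_cyclic;
};

/* A queued work item; the submitter may reuse it once pending is cleared. */
struct work_model {
    struct wb_args_model args;
    bool pending;
};

/* Placeholder for the real writeback: reports the pages it was asked for. */
static long do_writeback_model(const struct wb_args_model *args)
{
    return args->nr_pages;
}

static long process_work_model(struct work_model *work)
{
    /* Copy every member by value while the work item is still ours. */
    struct wb_args_model args = work->args;

    /* From here on the work item must not be touched again. */
    work->pending = false;

    return do_writeback_model(&args);
}
```

Adding a new parameter then means adding one struct member instead of changing every function signature along the call chain.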

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 21:18:25 +0800
f0fad8a53 writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE

Since it's an opportunistic writeback and not a data integrity action,
don't punt to blocking writeback. Just wakeup the thread and it will
flush old data.

Acked-by: Jan Kara
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2009-09-16 21:16:18 +0800

14 Sep, 2009

1 commit

18f2ee705 vfs: Remove generic_osync_inode() and sync_page_range{_nolock}()

Remove these three functions since nobody uses them anymore.

Signed-off-by: Jan Kara

Jan Kara
2009-09-14 23:08:17 +0800

11 Sep, 2009

5 commits

500b067c5 writeback: check for registered bdi in flusher add and inode dirty

Also a debugging aid. We want to catch dirty inodes being added to
backing devices that don't do writeback.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-11 15:20:26 +0800
d0bceac74 writeback: get rid of pdflush completely

It is now unused, so kill it off.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-11 15:20:25 +0800
03ba3782e writeback: switch to per-bdi threads for flushing data

This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:

r  b   swpd    free   buff   cache   si   so    bi      bo   in   cs us sy id wa
0  1      0  608848   2652  375372    0    0     0   71024  604   24  1 10 48 42
0  1      0  549644   2712  433736    0    0     0   60692  505   27  1  8 48 44
1  0      0  476928   2784  505192    0    0     4   29540  553   24  0  9 53 37
0  1      0  457972   2808  524008    0    0     0   54876  331   16  0  4 38 58
0  1      0  366128   2928  614284    0    0     4   92168  710   58  0 13 53 34
0  1      0  295092   3000  684140    0    0     0   62924  572   23  0  9 53 37
0  1      0  236592   3064  741704    0    0     4   58256  523   17  0  8 48 44
0  1      0  165608   3132  811464    0    0     0   57460  560   21  0  8 54 38
0  1      0  102952   3200  873164    0    0     4   74748  540   29  1 10 48 41
0  1      0   48604   3252  926472    0    0     0   53248  469   29  0  7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

r  b   swpd    free   buff   cache   si   so    bi      bo   in   cs us sy id wa
1  1      0  678716   5792  303380    0    0     0   74064  565   50  1 11 52 36
1  0      0  662488   5864  319396    0    0     4     352  302  329  0  2 47 51
0  1      0  599312   5924  381468    0    0     0   78164  516   55  0  9 51 40
0  1      0  519952   6008  459516    0    0     4   78156  622   56  1 11 52 37
1  1      0  436640   6092  541632    0    0     0   82244  622   54  0 11 48 41
0  1      0  436640   6092  541660    0    0     0       8  152   39  0  0 51 49
0  1      0  332224   6200  644252    0    0     4  102800  728   46  1 13 49 36
1  0      0  274492   6260  701056    0    0     4   12328  459   49  0  7 50 43
0  1      0  211220   6324  763356    0    0     0  106940  515   37  1 10 51 39
1  0      0  160412   6376  813468    0    0     0    8224  415   43  0  6 49 45
1  1      0   85980   6452  886556    0    0     4  113516  575   39  1 11 54 34
0  2      0   85968   6452  886620    0    0     0    1640  158  211  0  0 46 54

A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as does random mmap'ed
writes.

A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-11 15:20:25 +0800
66f3b8e2e writeback: move dirty inodes from super_block to backing_dev_info

This is a first step at introducing per-bdi flusher threads. We should
have no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, as there's no easy way to answer that question.
Not a huge problem, since it'll be deleted in subsequent patches.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-11 15:20:25 +0800
d8a8559cd writeback: get rid of generic_sync_sb_inodes() export

This adds two new exported functions:

- writeback_inodes_sb(), which only attempts to writeback dirty inodes on
this super_block, for WB_SYNC_NONE writeout.
- sync_inodes_sb(), which writes out all dirty inodes on this super_block
and also waits for the IO to complete.

Acked-by: Jan Kara
Signed-off-by: Jens Axboe

Jens Axboe
2009-09-11 15:20:25 +0800

24 Jun, 2009

1 commit

01c031945 cleanup __writeback_single_inode

There is no reason for the split between __writeback_single_inode and
__sync_single_inode; the former just does a couple of checks before
tail-calling the latter. So merge the two, and while we're at it split
out the I_SYNC waiting case for data integrity writers, as it's
a logically separate function. Finally rename __writeback_single_inode to
writeback_single_inode.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2009-06-24 20:15:26 +0800

17 Jun, 2009

1 commit

84a892456 writeback: skip new or to-be-freed inodes

1) I_FREEING tests should be coupled with I_CLEAR

The two I_FREEING tests are racy because clear_inode() can set i_state to
I_CLEAR between the clear of I_SYNC and the test of I_FREEING.

2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
races with generic_forget_inode()

generic_forget_inode() sets I_WILL_FREE and calls writeback on its own, so
generic_sync_sb_inodes() shall not try to step in and create possible races:

generic_forget_inode
  inode->i_state |= I_WILL_FREE;
  spin_unlock(&inode_lock);
                                        generic_sync_sb_inodes()
                                          spin_lock(&inode_lock);
                                          __iget(inode);
                                          __writeback_single_inode
                                            // see non zero i_count
may WARN here ==>                           WARN_ON(inode->i_state & I_WILL_FREE);
                                          spin_unlock(&inode_lock);
may call generic_forget_inode again ==>   iput(inode);

The above race and warning didn't turn up because writeback_inodes() holds
the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
early. But we are not sure the UBIFS calls and future callers will
guarantee that. So skip I_WILL_FREE inodes for the sake of safety.
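
The two checks discussed above can be written as small predicates over a state word. This is a sketch, assuming made-up flag names and bit values (`ST_NEW`, `ST_WILL_FREE`, `ST_FREEING`, `ST_CLEAR`); the helper names are hypothetical and the skip rule is inferred from the commit title rather than copied from the patch.

```c
#include <stdbool.h>

/* Illustrative state bits; names and values are assumptions. */
enum {
    ST_NEW       = 1 << 0,
    ST_WILL_FREE = 1 << 1,
    ST_FREEING   = 1 << 2,
    ST_CLEAR     = 1 << 3
};

/*
 * Test "freeing" and "clear" together, so a state that moves from
 * one to the other between two checks is still recognised.
 */
static bool is_being_freed(unsigned int state)
{
    return (state & (ST_FREEING | ST_CLEAR)) != 0;
}

/* A sync pass leaves new and to-be-freed inodes alone. */
static bool should_skip_for_sync(unsigned int state)
{
    return (state & (ST_NEW | ST_WILL_FREE)) != 0 || is_being_freed(state);
}
```

Testing both bits with a single mask is what closes the window described in point 1: a check for the freeing bit alone would miss an inode whose state has already advanced to clear.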

Cc: Eric Sandeen
Acked-by: Jeff Layton
Cc: Masayoshi MIZUMA
Signed-off-by: Wu Fengguang
Cc: Artem Bityutskiy
Cc: Christoph Hellwig
Acked-by: Jan Kara
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-06-17 10:47:45 +0800

12 Jun, 2009

3 commits

4195f73d1 fs: block_dump missing dentry locking

I think the block_dump output in __mark_inode_dirty is missing dentry locking.
Surely the i_dentry list can change any time, so we may not even *get* a
dentry there. If we do get one by chance, then it would appear to be able to
go away or get renamed at any time...

Signed-off-by: Al Viro

Nick Piggin
2009-06-12 09:36:10 +0800
545b9fd3d fs: remove incorrect I_NEW warnings

Some filesystems can call in to sync an inode that is still in the
I_NEW state (e.g. the ext family, when mounted with -osync). This is OK
because the filesystem has sole access to the new inode, so it can
modify i_state without races (because no other thread should be
modifying it, by definition of I_NEW). I.e. the warnings are false
positives, so remove them.

The races are described here 7ef0d7377cb287e08f3ae94cebc919448e1f5dff,
which is also where the warnings were introduced.

Reported-by: Stephen Hemminger
Signed-off-by: Nick Piggin
Signed-off-by: Al Viro

Nick Piggin
2009-06-12 09:36:10 +0800
5cee5815d vfs: Make sys_sync() use fsync_super() (version 4)

It is unnecessarily fragile to have two places (fsync_super() and do_sync())
doing data integrity sync of the filesystem. Alter __fsync_super() to
accommodate needs of both callers and use it. So after this patch
__fsync_super() is the only place where we gather all the calls needed to
properly send all data on a filesystem to disk.

A nice bonus is that we get complete livelock avoidance, and write_supers()
is now only used for periodic writeback of superblocks.

sync_blockdevs() introduced a couple of patches ago is gone now.

[build fixes folded]

Signed-off-by: Jan Kara
Signed-off-by: Al Viro

Jan Kara
2009-06-12 09:36:03 +0800