04 Apr, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
    trivial: Update my email address
    trivial: NULL noise: drivers/mtd/tests/mtd_*test.c
    trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h
    trivial: Fix misspelling of "Celsius".
    trivial: remove unused variable 'path' in alloc_file()
    trivial: fix a pdlfush -> pdflush typo in comment
    trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL
    trivial: wusb: Storage class should be before const qualifier
    trivial: drivers/char/bsr.c: Storage class should be before const qualifier
    trivial: h8300: Storage class should be before const qualifier
    trivial: fix where cgroup documentation is not correctly referred to
    trivial: Give the right path in Documentation example
    trivial: MTD: remove EOL from MODULE_DESCRIPTION
    trivial: Fix typo in bio_split()'s documentation
    trivial: PWM: fix of #endif comment
    trivial: fix typos/grammar errors in Kconfig texts
    trivial: Fix misspelling of firmware
    trivial: cgroups: documentation typo and spelling corrections
    trivial: Update contact info for Jochen Hein
    trivial: fix typo "resgister" -> "register"
    ...

    Linus Torvalds
     

03 Apr, 2009

2 commits

  • The dirtied_when value on an inode is supposed to represent the first time
    that an inode has one of its pages dirtied. This value is in units of
    jiffies. It's used in several places in the writeback code to determine
    when to write out an inode.

    The problem is that these checks assume that dirtied_when is updated
    periodically. If an inode is continuously being used for I/O it can be
    persistently marked as dirty and will continue to age. Once the difference
    between the current time and dirtied_when is greater than or equal to half
    the maximum of the jiffies type, the logic of the time_*() macros inverts
    and the opposite of what is needed is returned. On 32-bit architectures
    that's just under 25 days (assuming HZ == 1000).

    As the least-recently dirtied inode, it'll end up being the first one that
    pdflush will try to write out. sync_sb_inodes does this check:

        /* Was this inode dirtied after sync_sb_inodes was called? */
        if (time_after(inode->dirtied_when, start))
                break;

    ...but now dirtied_when appears to be in the future. sync_sb_inodes bails
    out without attempting to write any dirty inodes. When this occurs,
    pdflush will stop writing out inodes for this superblock. Nothing can
    unwedge it until jiffies moves out of the problematic window.

    This patch fixes this problem by changing the checks against dirtied_when
    to also check whether it appears to be in the future. If it does, then we
    consider the value to be far in the past.

    This should shrink the problematic window of time to such a small period
    (30s) as not to matter.
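
    As an illustration of the approach (a sketch only; the helper name and the
    exact shape of the kernel's check are assumptions based on the description
    above), the comparison can be wrapped so that a dirtied_when value that
    appears to lie in the future is treated as being far in the past:

        static int inode_dirtied_after(struct inode *inode, unsigned long t)
        {
                int ret = time_after(inode->dirtied_when, t);
        #ifndef CONFIG_64BIT
                /* A dirtied_when that appears to be in the future can only be
                 * a stale, wrapped value; treat it as already expired. */
                ret = ret && time_before_eq(inode->dirtied_when, jiffies);
        #endif
                return ret;
        }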

    Signed-off-by: Jeff Layton
    Signed-off-by: Wu Fengguang
    Acked-by: Ian Kent
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     
    clear_inode() will switch the inode state from I_FREEING to I_CLEAR, and does
    so _outside_ of inode_lock. So any test for I_FREEING is incomplete without a
    coupled test for I_CLEAR.

    So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
    add_dquot_ref().
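
    For illustration, a minimal sketch of the widened state test (the exact call
    sites differ; this is not the literal patch):

        /* Skip inodes being torn down: I_FREEING may already have become
         * I_CLEAR outside inode_lock, so both bits must be tested. */
        if (inode->i_state & (I_FREEING | I_CLEAR | I_WILL_FREE))
                continue;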

    Masayoshi MIZUMA discovered the bug in drop_pagecache_sb(), and Jan Kara
    reminded us to fix the other two cases.

    Masayoshi MIZUMA has a nice panic flow:

    =====================================================================
        [process A]                        [process B]

     prune_icache()                        drop_pagecache()
       spin_lock(&inode_lock)                drop_pagecache_sb()
         inode->i_state |= I_FREEING;             |
       spin_unlock(&inode_lock)                   V
             |                               spin_lock(&inode_lock)
             V                                    |
       dispose_list()                             |
         list_del()                               |
         clear_inode()                            |
           inode->i_state = I_CLEAR               |
             |                                    V
             |                               if (inode->i_state & (I_FREEING|I_WILL_FREE))
             |                                       continue;
    Reviewed-by: Jan Kara
    Signed-off-by: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

30 Mar, 2009

1 commit


13 Mar, 2009

1 commit

  • There was a report of a data corruption
    http://lkml.org/lkml/2008/11/14/121. There is a script included to
    reproduce the problem.

    During testing, I encountered a number of strange things with ext3, so I
    tried ext2 to attempt to reduce complexity of the problem. I found that
    fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
    cleared, even though instrumentation showed that unlock_new_inode had
    already been called for that inode. This points to a memory scribble or a
    synchronisation problem.

    i_state of I_NEW inodes is not protected by inode_lock because other
    processes are not supposed to touch them until I_LOCK (and I_NEW) is
    cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
    i_state revealed that generic_sync_sb_inodes is picking up new inodes from
    the inode lists and passing them to __writeback_single_inode without
    waiting for I_NEW. Subsequently modifying i_state causes corruption. In
    my case it would look like this:

    CPU0                                    CPU1
    unlock_new_inode()                      __sync_single_inode()
     reg <- inode->i_state
     reg -> reg & ~(I_LOCK|I_NEW)           reg <- inode->i_state
     reg -> inode->i_state                  reg -> reg | I_SYNC
                                            reg -> inode->i_state

    Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.

    The fix is, rather than waiting for I_NEW inodes, to just skip over them:
    inodes concurrently being created are not subject to data integrity
    operations, and should not significantly contribute to dirty memory
    either.
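
    A rough sketch of that skip, as it might sit in the generic_sync_sb_inodes()
    list walk (requeue_io() and the exact placement are assumptions based on the
    code of that era):

        /* Inodes still being set up are not writeback candidates; leave
         * their i_state untouched and revisit them later. */
        if (inode->i_state & I_NEW) {
                requeue_io(inode);
                continue;
        }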

    After this change, I'm unable to reproduce any of the added warnings or
    hangs after ~1hour of running. Previously, the new warnings would start
    immediately and hang would happen in under 5 minutes.

    I'm also testing on ext3 now, and so far no problems there either. I
    don't know whether this fixes the problem reported above, but it fixes a
    real problem for me.

    Cc: "Jorge Boncompte [DTI2]"
    Reported-by: Adrian Hunter
    Cc: Jan Kara
    Cc:
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 Jan, 2009

3 commits

    s_syncing livelock avoidance was breaking the data integrity guarantee of
    sys_sync, by allowing sys_sync to skip writing or waiting for superblocks
    if there is a concurrent sys_sync happening.

    This livelock avoidance is much less important now that we don't have the
    get_super_to_sync() call after every sb that we sync. This was replaced
    by __put_super_and_need_restart.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Fix data integrity semantics required by sys_sync, by iterating over all
    inodes and waiting for any writeback pages after the initial writeout.
    Comments explain the exact problem.
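
    A hedged sketch of such a wait pass over all of a superblock's inodes (the
    locking and helpers shown are assumptions; the real patch is more careful
    about umount races and inode lifetimes):

        /* After the initial writeout, wait on every inode that still has
         * pages, so sys_sync's integrity guarantee holds. */
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                struct address_space *mapping = inode->i_mapping;

                if (inode->i_state & (I_FREEING | I_WILL_FREE))
                        continue;
                if (mapping->nrpages == 0)
                        continue;
                __iget(inode);
                spin_unlock(&inode_lock);
                filemap_fdatawait(mapping);
                iput(inode);
                spin_lock(&inode_lock);
        }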

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
    Remove WB_SYNC_HOLD. The primary motivation is the design of my
    anti-starvation code for fsync. It requires taking an inode lock over the
    sync operation, so we could run into lock ordering problems with multiple
    inodes. It is possible to take a single global lock to solve the ordering
    problem, but then that would prevent a future nice implementation of "sync
    multiple inodes" based on lock order via inode address.

    Seems like a backward step to remove this, but actually it is busted
    anyway: we can't use the inode lists for data integrity wait: an inode can
    be taken off the dirty lists but still be under writeback. In order to
    satisfy data integrity semantics, we should wait for it to finish
    writeback, but if we only search the dirty lists, we'll miss it.

    It would be possible to have a "writeback" list, for sys_sync, I suppose.
    But why complicate things by optimising prematurely? For unmounting, we
    could avoid the "livelock avoidance" code, which would be easier, but
    again that seems premature IMO.

    Fixing the existing data integrity problem will come next.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

17 Oct, 2008

1 commit


15 Jul, 2008

2 commits

    This patch exports 'sync_sb_inodes()', which is needed by UBIFS
    because it has to force write-back from time to time. Namely,
    the UBIFS budgeting subsystem forces write-back when its
    pessimistic calculations show that there is no free space
    on the media.

    Signed-off-by: Artem Bityutskiy

    Artem Bityutskiy
     
  • This patch makes 'sync_sb_inodes()' lock 'inode_lock', rather
    than expect that the caller will do this.

    This change was previously done by Hans Reiser
    and sat in the -mm tree.

    Signed-off-by: Artem Bityutskiy

    Hans Reiser
     

29 Apr, 2008

1 commit


20 Mar, 2008

1 commit

  • Fix kernel-doc notation warnings in fs/.

    Warning(mmotm-2008-0314-1449//fs/super.c:560): missing initial short description on line:
    * mark_files_ro
    Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
    * lease_get_mtime
    Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
    * lease_get_mtime
    Warning(mmotm-2008-0314-1449//fs/namei.c:1368): missing initial short description on line:
    * lookup_one_len: filesystem helper to lookup single pathname component
    Warning(mmotm-2008-0314-1449//fs/buffer.c:3221): missing initial short description on line:
    * bh_uptodate_or_lock: Test whether the buffer is uptodate
    Warning(mmotm-2008-0314-1449//fs/buffer.c:3240): missing initial short description on line:
    * bh_submit_read: Submit a locked buffer for reading
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:30): missing initial short description on line:
    * writeback_acquire: attempt to get exclusive writeback access to a device
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:47): missing initial short description on line:
    * writeback_in_progress: determine whether there is writeback in progress
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:58): missing initial short description on line:
    * writeback_release: relinquish exclusive writeback access against a device.
    Warning(mmotm-2008-0314-1449//include/linux/jbd.h:351): contents before sections
    Warning(mmotm-2008-0314-1449//include/linux/jbd.h:561): contents before sections
    Warning(mmotm-2008-0314-1449//fs/jbd/transaction.c:1935): missing initial short description on line:
    * void journal_invalidatepage()
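
    For reference, the kernel-doc form these warnings ask for puts a one-line
    short description right after the symbol name; an illustrative example
    (not a hunk from the patch):

        /**
         * writeback_in_progress - determine whether there is writeback in progress
         * @bdi: the device's backing_dev_info structure
         *
         * Determine whether there is writeback in progress against a backing
         * device.
         */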

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

09 Feb, 2008

1 commit


07 Feb, 2008

1 commit


06 Feb, 2008

2 commits

    After dirtying a 100M file, the normal behavior is to start writeback for
    all the data after a 30s delay. But sometimes the following happens
    instead:

    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: all remaining 92M

    Some analysis shows that the internal io dispatch queues go like this:

        s_io            s_more_io
        -------------------------
    1)  100M,1K         0
    2)  1K              96M
    3)  0               96M

    1) initial state with a 100M file and a 1K file

    2) 4M written, nr_to_write <= 0, so write more

    3) 1K written, nr_to_write > 0, no more writes (BUG)

    nr_to_write > 0 in (3) fools the upper layer into thinking that the data has
    all been written out. The big dirty file is actually still sitting in
    s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
    becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
    may starve newly expired inodes in s_dirty. It is also not an option to
    draw inodes from both s_more_io and s_dirty, and let the loop go on: this
    might lead to livelocks, and might also starve other superblocks in sync
    time (well, kupdate may still starve some superblocks, but that's another bug).

    We have to return when a full scan of s_io completes. So nr_to_write > 0
    does not necessarily mean that "all data are written". This patch
    introduces a flag writeback_control.more_io to indicate that more io should
    be done. With it the big dirty file no longer has to wait for the next
    kupdate invocation 5s later.

    In sync_sb_inodes() we only set more_io on super_blocks we actually
    visited. This avoids the interaction between two pdflush daemons.

    Also in __sync_single_inode() we don't blindly keep requeuing the io if the
    filesystem cannot progress. Failing to do so may lead to 100% iowait.
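
    A minimal sketch of the flag's use, consistent with the description above
    (the exact placement inside sync_sb_inodes() is an assumption):

        /* Tell the caller that this superblock still has io pending, so it
         * can retry promptly instead of waiting for the next kupdate pass. */
        if (!list_empty(&sb->s_more_io))
                wbc->more_io = 1;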

    Tested-by: Mike Snitzer
    Signed-off-by: Fengguang Wu
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
    Since I_SYNC was split out from I_LOCK, the concern in commit
    4b89eed93e0fa40a63e3d7b1796ec1337ea7a3aa ("Write back inode data pages
    even when the inode itself is locked") is no longer valid.

    We should revert to the original behavior: in __writeback_single_inode(),
    when we find an I_SYNC-ed inode and we're not doing a data-integrity sync,
    skip writing entirely. Otherwise, we end up calling do_writepages() twice.

    Signed-off-by: Qi Yong
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Joern Engel
    Cc: WU Fengguang
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qi Yong
     

15 Jan, 2008

1 commit

  • This reverts commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, as
    requested by Fengguang Wu. It's not quite fully baked yet, and while
    there are patches around to fix the problems it caused, they should get
    more testing. Says Fengguang: "I'll resend them both for -mm later on,
    in a more complete patchset".

    See

    http://bugzilla.kernel.org/show_bug.cgi?id=9738

    for some of this discussion.

    Requested-by: Fengguang Wu
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

20 Oct, 2007

1 commit

  • The task_struct->pid member is going to be deprecated, so start
    using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
    the kernel.

    The first thing to start with is the pid, printed to dmesg - in
    this case we may safely use task_pid_nr(). Besides, printks account
    for more (much more) than half of all the explicit pid usage.
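
    The typical replacement pattern for such printks looks like this
    (illustrative only, not a specific hunk from the patch):

        /* before */
        printk(KERN_DEBUG "%s(%d): dirtied inode\n", current->comm, current->pid);

        /* after */
        printk(KERN_DEBUG "%s(%d): dirtied inode\n", current->comm,
               task_pid_nr(current));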

    [akpm@linux-foundation.org: git-drm went and changed lots of stuff]
    Signed-off-by: Pavel Emelyanov
    Cc: Dave Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

17 Oct, 2007

13 commits

  • I_LOCK was used for several unrelated purposes, which caused deadlock
    situations in certain filesystems as a side effect. One of the purposes
    now uses the new I_SYNC bit.

    Also document the various bits and change their order from historical to
    logical.
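
    Roughly, the split means writeback now serialises on a dedicated bit instead
    of overloading I_LOCK; a sketch of the guard (the shape is assumed, not the
    literal code):

        /* Only one thread writes an inode back at a time: writeback takes
         * I_SYNC, while inode setup/teardown keeps using I_LOCK. */
        BUG_ON(inode->i_state & I_SYNC);
        inode->i_state |= I_SYNC;
        /* ... do_writepages(), write_inode(), filemap_fdatawait() ... */
        inode->i_state &= ~I_SYNC;
        wake_up_inode(inode);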

    [bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
    Signed-off-by: Joern Engel
    Cc: Dave Kleikamp
    Cc: David Chinner
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joern Engel
     
    After dirtying a 100M file, the normal behavior is to start writeback for
    all the data after a 30s delay. But sometimes the following happens instead:

    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: all remaining 92M

    Some analysis shows that the internal io dispatch queues go like this:

        s_io            s_more_io
        -------------------------
    1)  100M,1K         0
    2)  1K              96M
    3)  0               96M

    1) initial state with a 100M file and a 1K file
    2) 4M written, nr_to_write <= 0, so write more
    3) 1K written, nr_to_write > 0, no more writes (BUG)

    nr_to_write > 0 in (3) fools the upper layer into thinking that the data has all
    been written out. The big dirty file is actually still sitting in s_more_io. We
    cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
    let the loop in generic_sync_sb_inodes() continue: this may starve newly
    expired inodes in s_dirty. It is also not an option to draw inodes from both
    s_more_io and s_dirty, and let the loop go on: this might lead to livelocks,
    and might also starve other superblocks in sync time (well, kupdate may still
    starve some superblocks, but that's another bug).

    We have to return when a full scan of s_io completes. So nr_to_write > 0 does
    not necessarily mean that "all data are written". This patch introduces a
    flag writeback_control.more_io to indicate this situation. With it the big
    dirty file no longer has to wait for the next kupdate invocation 5s later.

    Cc: David Chinner
    Cc: Ken Chen
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • NTFS's if-condition on dirty inodes is not complete. Fix it with
    sb_has_dirty_inodes().

    Cc: Anton Altaparmakov
    Cc: Ken Chen
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Streamline the management of dirty inode lists and fix time ordering bugs.

    The writeback logic used to move not-yet-expired dirty inodes from s_dirty to
    s_io, *only to* move them back. The move-inodes-back-and-forth thing is a
    mess, which is eliminated by this patch.

    The new scheme is:
    - s_dirty acts as a time ordered io delaying queue;
    - s_io/s_more_io together acts as an io dispatching queue.

    On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
    every full scan of s_io. Otherwise (i.e. for sync/throttle/background
    writeback), we always pull from s_dirty on each run (a partial scan).

    Note that the line

        list_splice_init(&sb->s_more_io, &sb->s_io);

    is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
    sit in s_io for a long time, preventing newly expired inodes from getting in.
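
    A sketch of the resulting dispatch-queue refill, consistent with the scheme
    described above (move_expired_inodes() and the exact body are assumptions):

        static void queue_io(struct super_block *sb, unsigned long *older_than_this)
        {
                /* Requeue leftovers first, then pull newly expired inodes
                 * from the time-ordered s_dirty queue onto s_io. */
                list_splice_init(&sb->s_more_io, &sb->s_io);
                move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
        }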

    Cc: Ken Chen
    Cc: Andrew Morton
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
    The current -mm tree has a bucketful of bug fixes in the periodic writeback
    path. However, we still hit a glitch where dirty pages on a given inode
    aren't completely flushed to the disk, and the system will accumulate a
    large amount of dirty pages beyond what dirty_expire_interval is designed
    for.

    The problem is that __sync_single_inode() will move an inode to the
    sb->s_dirty list even when there are more pending dirty pages on that inode.
    If there is another inode with a small number of dirty pages, we hit a case
    where the loop iteration in wb_kupdate() terminates prematurely because
    wbc.nr_to_write > 0. This leaves the inode that has a large amount of dirty
    pages behind, and it has to wait for another dirty_writeback_interval before
    we flush it again. We effectively only write out MAX_WRITEBACK_PAGES every
    dirty_writeback_interval. If the rate of dirtying is sufficiently high, the
    system will start to accumulate a large number of dirty pages.

    So fix it by having another sb->s_more_io list on which to park the inode
    while we iterate through sb->s_io and to allow each dirty inode which resides
    on that sb to have an equal chance of flushing some amount of dirty pages.
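
    A sketch of the parking helper this describes (the requeue_io() name matches
    later kernels and is an assumption here):

        /* Park a partially-written inode on s_more_io so the remaining
         * inodes on s_io still get a fair share of nr_to_write. */
        static void requeue_io(struct inode *inode)
        {
                list_move(&inode->i_list, &inode->i_sb->s_more_io);
        }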

    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • This one fixes four bugs.

    There are a few situations in there where writeback decides it is going to
    skip over a blockdev inode on the kernel-internal blockdev superblock. It
    presently does this by moving the blockdev inode onto the tail of the
    blockdev superblock's s_dirty. But

    a) this screws up s_dirty's reverse-time-orderedness and

    b) refiling the blockdev for writeback in another 30 seconds is rude. We
    should try again sooner than that.

    Fix all this up by using redirty_head(): move the blockdev inode onto the head
    of the blockdev superblock's s_dirty list for prompt writeback.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Recycling the previous changelog:

    When the writeback function is operating in writeback-for-flushing mode
    (as opposed to writeback-for-integrity) and it encounters an I_LOCKed inode,
    it will skip writing that inode. This is done for throughput and latency:
    move on to another inode rather than blocking for this one.

    Writeback skips this inode by moving it off s_io and onto s_dirty, so that
    writeback can proceed with the other inodes on s_io.

    However that inode movement can corrupt s_dirty's
    reverse-time-orderedness. Fix that by using the new redirty_tail(), which
    will update the refiled inode's dirtied_when field.

    Note: the behaviour in here is a bit rude: if kupdate happens to come
    across a locked inode then it will defer writeback of that inode for another
    30 seconds. We'll address that in the next patch.

    Address that here. What we do is to move the skipped inode to the _head_ of
    s_dirty, so it is immediately eligible for writeout again, instead of
    deferring that writeout for another 30 seconds.

    One would think that this might cause a livelock: we keep on trying to write
    the same locked inode. But it won't because:

    a) if that was the case, it would _already_ be happening on the
    balance_dirty_pages codepath. Because balance_dirty_pages() doesn't care
    about inode timestamps.

    b) if we skipped this inode then we won't have done any writeback. The
    higher-level writeback paths will see that wbc.nr_to_write didn't change
    and they'll then back off and take a nap.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When the writeback function is operating in writeback-for-flushing mode (as
    opposed to writeback-for-integrity) and it encounters an I_LOCKed inode, it
    will skip writing that inode. This is done for throughput and latency: move
    on to another inode rather than blocking for this one.

    Writeback skips this inode by moving it off s_io and onto s_dirty, so that
    writeback can proceed with the other inodes on s_io.

    However that inode movement can corrupt s_dirty's reverse-time-orderedness.
    Fix that by using the new redirty_tail(), which will update the refiled
    inode's dirtied_when field.

    Note: the behaviour in here is a bit rude: if kupdate happens to come across a
    locked inode then it will defer writeback of that inode for another 30
    seconds. We'll address that in the next patch.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • There's a comment in there which claims that the inode is left on s_io
    if nfs chickened out of writing some data.

    But that's not been true for three years.
    9290280ced13c85689adeffa587e9a53bd3a5873 fixed a livelock by moving these
    inodes back onto s_dirty. Fix the comment.

    In the second leg of the `if', use redirty_tail() rather than open-coding it.

    Add weaselly comment indicating lack of confidence in the code and lack of the
    fortitude which would be needed to fiddle with it.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When the kupdate function has tried to write back an expired inode it will
    then check to see whether some of the inode's pages are still dirty.

    This can happen when the filesystem decided to not write a page for some
    reason. But it does _not_ occur due to redirtyings: a redirtying will set
    I_DIRTY_PAGES.

    What we need to do here is to set I_DIRTY_PAGES to reflect reality and to then
    put the inode onto the _head_ of s_dirty for consideration on the next kupdate
    pass, in five seconds time.

    Problem is, the code failed to modify the inode's timestamp when pushing the
    inode onto the head of s_dirty.

    The patch:

    If there are no other inodes on s_dirty then we leave the inode's timestamp
    alone: it is already expired.

    If there _are_ other inodes on s_dirty then we arrange for this inode to get
    the same timestamp as the inode which is at the head of s_dirty, thus
    preserving the s_dirty ordering. But we only need to do this if this inode
    purports to have been dirtied before the one at head-of-list.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • While writeback is working against a dirty inode it does a check after trying
    to write some of the inode's pages:

    "did the lower layers skip some of the inode's dirty pages because they were
    locked (or under writeback, or whatever)"

    If this turns out to be true, we must move the inode back onto s_dirty and
    redirty it. The reason for doing this is that fsync() and friends only check
    the s_dirty list, and those functions want to know about those pages which
    were locked, so they can be waited upon and, if necessary, rewritten.

    Problem is, that redirtying was putting the inode onto the tail of s_dirty
    without updating its timestamp. This causes a violation of s_dirty ordering.

    Fix this by updating inode->dirtied_when when moving the inode onto s_dirty.

    But the code is still a bit buggy? If the inode was _already_ dirty then we
    don't need to move it at all. Oh well, hopefully it doesn't matter too much,
    as that was a redirtying, which was very recent anyway.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • For reasons which escape me, inodes which are dirty against a ram-backed
    filesystem are managed in the same way as inodes which are backed by real
    devices.

    Probably we could optimise things here. But given that we skip the entire
    superblock as soon as we hit the first dirty inode, there's not a lot to be
    gained.

    And the code does need to handle one particular non-backed superblock: the
    kernel's fake internal superblock which holds all the blockdevs.

    Still. At present when the code encounters an inode which is dirty against a
    memory-backed filesystem it will skip that inode by refiling it back onto
    s_dirty. But it fails to update the inode's timestamp when doing so, which
    at least makes the debugging code upset.

    Fix.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When writeback has finished writing back an inode it looks to see if that
    inode is still dirty. If it is, that means that a process redirtied the inode
    while its writeback was in progress.

    What we need to do here is to refile the redirtied inode onto the s_dirty
    list.

    But we're doing that wrongly: it could be that this inode was redirtied
    _before_ the last inode on s_dirty. We're blindly appending this inode to the
    list, after an inode which might be less-recently-dirtied, thus violating the
    list's ordering.

    So we must either insertion-sort this inode into the correct place, or we must
    update this inode's dirtied_when field when appending it to the reverse-sorted
    s_dirty list, to preserve the reverse-time-ordering.

    This patch does the latter: if this inode was dirtied less recently than the
    tail inode then copy the tail inode's timestamp into this inode.

    This means that in rare circumstances, some inodes will be written back later
    than they should have been. But the time slip will be small.
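
    A sketch of that refiling rule (list and member names are assumptions based
    on that era's fs/fs-writeback.c):

        /* Keep s_dirty reverse-time-ordered: if this inode claims to have
         * been dirtied before the most-recently-dirtied inode at the head
         * of the list, give it that inode's timestamp before refiling. */
        if (!list_empty(&sb->s_dirty)) {
                struct inode *head_inode = list_entry(sb->s_dirty.next,
                                                      struct inode, i_list);
                if (time_before(inode->dirtied_when, head_inode->dirtied_when))
                        inode->dirtied_when = head_inode->dirtied_when;
        }
        list_move(&inode->i_list, &sb->s_dirty);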

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

10 Oct, 2007

1 commit

    Hide everything in blkdev.h when CONFIG_BLOCK isn't set, and fix up
    the (few) files that fail to build because they were relying on blkdev.h
    pulling in extra includes for them.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Jan, 2007

1 commit

  • In __writeback_single_inode(), when we find a locked inode and we're not
    doing a data-integrity sync, we used to just skip writing entirely,
    since we didn't want to wait for the inode to unlock.

    However, there's really no reason to skip writing the data pages, which
    are likely to be the bulk of the dirty state anyway (and the main
    reason why writeback was started for the non-data-integrity case, of
    course!)

    Acked-by: Nick Piggin
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Oct, 2006

2 commits


01 Jul, 2006

2 commits

  • Conversion of nr_unstable to a per zone counter

    We need to do some special modifications to the nfs code since there are
    multiple cases of disposition and we need to have a page ref for proper
    accounting.

    This converts the last critical page state of the VM and therefore we need to
    remove several functions that were depending on GET_PAGE_STATE_LAST in order
    to make the kernel compile again. We are only left with event type counters
    in page state.

    [akpm@osdl.org: bugfixes]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This makes nr_dirty a per zone counter. Looping over all processors is
    avoided during writeback state determination.

    The counter aggregation for nr_dirty had to be undone in the NFS layer since
    we summed up the page counts from multiple zones. Someone more familiar with
    NFS should probably review what I have done.

    [akpm@osdl.org: bugfix]
    Signed-off-by: Christoph Lameter
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

2 commits

  • A process flag to indicate whether we are doing sync io is incredibly
    ugly. It also causes performance problems when one does a lot of async
    io and then proceeds to sync it. Part of the io will go out as async,
    and the other part as sync. This causes a disconnect between the
    previously submitted io and the synced io. For io schedulers such as CFQ,
    this will cause lost merges and suboptimal behaviour in scheduling.

    Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
    the O_DIRECT path just directly indicate that the writes are sync
    by using WRITE_SYNC instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
    When a writeback_control's `start' and `end' fields are used to
    indicate a one-byte range starting at file offset zero, the required
    values of .start=0,.end=0 mean that the ->writepages() implementation
    has no way of telling that it is being asked to perform a range
    request, because we're currently overloading (start == 0 && end == 0)
    to mean "this is not a write-a-range request".

    To make all this sane, the patch changes range of writeback_control.

    So the caller does this: if it is calling ->writepages() to write pages,
    it always sets the range (range_start/end or range_cyclic).

    And if range_cyclic is true, ->writepages() treats the range as
    cyclic; otherwise it just uses range_start and range_end.

    This patch does,

    - Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h.
      -1 is usually ok for range_end (its type is long long). But if someone did

          range_end += val;                range_end is "val - 1"
          u64val = range_end >> bits;      u64val is "~(0ULL)"

      or something like that, they would be wrong. So this adds LLONG_MAX to
      avoid nasty things, and uses LLONG_MAX for range_end.

    - All callers of ->writepages() set range_start/end or range_cyclic.

    - Fix updates of ->writeback_index. It already seemed a bit strange:
      if it starts at 0 and is ended by the nr_to_write check, this last
      index may reduce the chance of scanning the end of the file. So this
      updates ->writeback_index only if range_cyclic is true or the whole
      file was scanned.
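
    A hedged illustration of the resulting convention for callers (field names
    match the writeback_control of that era; the values are just examples):

        /* Explicit whole-file range for a data-integrity writeout ... */
        struct writeback_control wbc = {
                .sync_mode   = WB_SYNC_ALL,
                .nr_to_write = LONG_MAX,
                .range_start = 0,
                .range_end   = LLONG_MAX,
        };

        /* ... or a cyclic walk for background writeback. */
        struct writeback_control wbc_bg = {
                .sync_mode    = WB_SYNC_NONE,
                .nr_to_write  = 1024,
                .range_cyclic = 1,
        };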

    Signed-off-by: OGAWA Hirofumi
    Cc: Nathan Scott
    Cc: Anton Altaparmakov
    Cc: Steven French
    Cc: "Vladimir V. Saveliev"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi