31 Mar, 2011

1 commit


25 Mar, 2011

4 commits

  • First thing we do in writeback_single_inode() is take the i_lock and
    the last thing we do is drop it. A caller already holds the i_lock,
    so pull the i_lock out of writeback_single_inode() to reduce the
    round trips on this lock during inode writeback.
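
    A minimal sketch of the resulting calling convention (illustrative only,
    not the actual diff; the writeback_single_inode() signature is assumed):

    spin_lock(&inode->i_lock);
    /*
     * writeback_single_inode() is now entered and left with i_lock held,
     * instead of taking and dropping the lock itself.
     */
    writeback_single_inode(inode, &wbc);
    spin_unlock(&inode->i_lock);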

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.
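
    Roughly, a writeback-list walk then nests the new global lock around
    per-inode i_lock checks, e.g. (a sketch based on the description above,
    not the actual code):

    spin_lock(&inode_wb_list_lock);
    list_for_each_entry(inode, &wb->b_io, i_wb_list) {
            spin_lock(&inode->i_lock);
            /* validity check under i_lock; inode_lock not needed */
            if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                    spin_unlock(&inode->i_lock);
                    continue;
            }
            /* ... act on the inode ... */
            spin_unlock(&inode->i_lock);
    }
    spin_unlock(&inode_wb_list_lock);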

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the per-sb inode list with a new global lock
    inode_sb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.
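
    A sketch of the traversal pattern this requires (illustrative): the
    i_state check and __iget() sit inside the same inode->i_lock section,
    so the reference cannot race with I_FREEING.

    spin_lock(&inode->i_lock);
    if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
            spin_unlock(&inode->i_lock);
            continue;       /* never take a reference on a dying inode */
    }
    __iget(inode);          /* reference taken while still holding i_lock */
    spin_unlock(&inode->i_lock);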

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

14 Jan, 2011

6 commits

  • The sync_inodes_sb() function does not have a return value. Remove the
    outdated documentation comment.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Hajnoczi
     
  • Use correct function name, remove incorrect apostrophe

    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
    usually set to LONG_MAX. The logic in wb_writeback() then calls
    __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
    easily end up with non-positive nr_to_write after the function returns, if
    the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.

    When nr_to_write is
    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Background writeback is easily livelockable in a loop in wb_writeback() by
    a process continuously re-dirtying pages (or continuously appending to a
    file). This is in fact intended, as the goal of background writeback is
    to write out the dirty pages it can find as long as we are over
    dirty_background_threshold.

    But the above behavior gets inconvenient at times because no other work
    queued in the flusher thread's queue gets processed. In particular, since
    e.g. sync(1) relies on the flusher thread to do all the IO for it, sync(1)
    can hang forever waiting for the flusher thread to do the work.

    Generally, when a flusher thread has some work queued, someone submitted
    the work to achieve a goal more specific than what background writeback
    does. Moreover, by working on the specific work, we also reduce the amount
    of dirty pages, which is exactly the target of background writeout. So it
    makes sense to give specific work priority over generic page cleaning.

    Thus we interrupt background writeback if there is some other work to do.
    We return to the background writeback after completing all the queued
    work.

    This may delay the writeback of expired inodes for a while; however, the
    expired inodes will eventually be flushed to disk as long as the other
    work does not livelock.
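
    The gist of the change, sketched against the wb_writeback() loop (not
    the literal patch; the 'work' and 'bdi' names are assumptions):

    for (;;) {
            /*
             * Background writeout is best-effort: if someone queued
             * specific work, stop and let it run; background writeback
             * resumes once the queue is empty again.
             */
            if (work->for_background && !list_empty(&bdi->work_list))
                    break;

            /* ... otherwise write out another chunk of dirty pages ... */
    }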

    [fengguang.wu@intel.com: update comment]
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • This tracks when balance_dirty_pages() tries to wake up the flusher thread
    for background writeback (if it was not started already).

    Suggested-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Check whether background writeback is needed after finishing each work.

    When the bdi flusher thread finishes doing some work, check whether any
    kind of background writeback needs to be done (either because
    dirty_background_ratio is exceeded or because we need to start flushing
    old inodes). If so, just do the background writeback.

    This way, bdi_start_background_writeback() just needs to wake up the
    flusher thread. It will do background writeback as soon as there is no
    other work.

    This is a preparatory patch for the next patch which stops background
    writeback as soon as there is other work to do.
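
    Schematically, the flusher's top-level loop then looks something like
    this (a sketch; helper names such as wb_check_background_flush() are
    assumptions based on this description):

    long wb_do_writeback(struct bdi_writeback *wb)
    {
            struct wb_writeback_work *work;
            long wrote = 0;

            /* explicitly queued work always runs first */
            while ((work = get_next_work_item(wb->bdi)) != NULL)
                    wrote += wb_writeback(wb, work);

            /* then see whether periodic or background writeback is due */
            wrote += wb_check_old_data_flush(wb);
            wrote += wb_check_background_flush(wb);

            return wrote;
    }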

    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

31 Oct, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
    Btrfs: deal with errors from updating the tree log
    Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
    Btrfs: make SNAP_DESTROY async
    Btrfs: add SNAP_CREATE_ASYNC ioctl
    Btrfs: add START_SYNC, WAIT_SYNC ioctls
    Btrfs: async transaction commit
    Btrfs: fix deadlock in btrfs_commit_transaction
    Btrfs: fix lockdep warning on clone ioctl
    Btrfs: fix clone ioctl where range is adjacent to extent
    Btrfs: fix delalloc checks in clone ioctl
    Btrfs: drop unused variable in block_alloc_rsv
    Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
    Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
    Btrfs: Use ERR_CAST helpers
    Btrfs: use memdup_user helpers
    Btrfs: fix raid code for removing missing drives
    Btrfs: Switch the extent buffer rbtree into a radix tree
    Btrfs: restructure try_release_extent_buffer()
    Btrfs: use the flusher threads for delalloc throttling
    Btrfs: tune the chunk allocation to 5% of the FS as metadata
    ...

    Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
    remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
    useless and removed in commit 5e8067adfdba: "rcu head remove init")

    Linus Torvalds
     

30 Oct, 2010

1 commit

  • The btrfs merge looks like hell, because it changes fs-writeback.c, and
    the crazy code has this repeated "estimate number of dirty pages"
    counting that involves three different helper functions. And it's done
    in two different places.

    Just unify that whole calculation as a "get_nr_dirty_pages()" helper
    function, and the merge result will look half-way decent.
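
    The unified helper then reduces to a small sum, roughly (a sketch; the
    exact counters used are assumptions):

    static long get_nr_dirty_pages(void)
    {
            return global_page_state(NR_FILE_DIRTY) +
                   global_page_state(NR_UNSTABLE_NFS) +
                   get_nr_dirty_inodes();
    }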

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Oct, 2010

1 commit

  • When btrfs is running low on metadata space, it needs to force delayed
    allocation pages to disk. It currently does this with a suboptimal walk
    of a private list of inodes with delayed allocation, and it would be
    much better if we used the generic flusher threads.

    writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
    thread to start IO on all the dirty pages in the FS before it returns.
    This adds variants of writeback_inodes_sb* that allow the caller to
    control how many pages get sent down.
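
    The new interface boils down to letting the caller pass a page count,
    along the lines of (a hedged sketch of the prototypes and usage, not
    the exact signatures):

    void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr);
    int  writeback_inodes_sb_nr_if_idle(struct super_block *sb,
                                        unsigned long nr);

    /* btrfs-style caller: push out only as much delalloc as it needs */
    writeback_inodes_sb_nr_if_idle(fs_info->sb, nr_pages);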

    Signed-off-by: Chris Mason

    Chris Mason
     

27 Oct, 2010

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • PF_FLUSHER is only ever set, not tested, remove it.

    Signed-off-by: Peter Zijlstra
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • I had to go back to a 2.6.20 tree to work out why we're adding a
    number-of-inodes into a number-of-pages count. Restore the lost comment.

    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
    This is not the behavior a user expects, and it's inconsistent with
    calc_period_shift(), which uses the plain vm_dirty_ratio value.

    Let's remove the internal bound.

    At the same time, fix balance_dirty_pages() to work with the
    dirty_thresh=0 case. This allows applications to proceed when
    dirty+writeback pages are all cleaned.

    And ">" fits with the name "exceeded" better than ">=" does. Neil thinks
    it is an aesthetic improvement as well as a functional one :)
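
    With the floor gone and the strict comparison, the "exceeded" test in
    balance_dirty_pages() reads roughly as follows (a sketch using assumed
    local variable names):

    /*
     * ">" rather than ">=": dirty_thresh == 0 no longer counts as
     * exceeded once all dirty+writeback pages have been cleaned.
     */
    dirty_exceeded =
            (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh) ||
            (nr_reclaimable + nr_writeback > dirty_thresh);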

    Signed-off-by: Wu Fengguang
    Cc: Jan Kara
    Proposed-by: Con Kolivas
    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Neil Brown
    Reviewed-by: KOSAKI Motohiro
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

26 Oct, 2010

6 commits

  • Skip I_FREEING inodes just like I_WILL_FREE and I_NEW when walking the
    writeback lists. Currently this can't happen, but once we move from
    inode_lock to more fine grained locking we can have an inode that's
    still on the writeback lists but has I_FREEING set, and we absolutely
    need to skip it here, just like we do for all other inode list walks.
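
    The check itself is the usual state test in the list walk, roughly
    (sketch):

    if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
            /* leave dying or not-yet-initialised inodes alone */
            requeue_io(inode);
            continue;
    }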

    Based on a patch from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The use of the same inode list structure (inode->i_list) for two
    different list constructs with different lifecycles and purposes
    makes it impossible to separate the locking of the different
    operations. Therefore, to enable the separation of the locking of
    the writeback and reclaim lists, split the inode->i_list into two
    separate lists dedicated to their specific tracking functions.
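
    In terms of struct inode, the split amounts to something like this
    (a sketch showing only the relevant fields):

    struct inode {
            /* ... */
            struct list_head  i_wb_list;  /* writeback: b_dirty/b_io/b_more_io */
            struct list_head  i_lru;      /* inode LRU, used by reclaim */
            /* ... */
    };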

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
     
  • Convert the inode LRU to use lazy updates to reduce lock and
    cacheline traffic. We avoid moving inodes around in the LRU list
    during iget/iput operations so these frequent operations don't need
    to access the LRUs. Instead, we defer the refcount checks to
    reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
    reclaim that iget has touched the inode in the past. This means that
    only reclaim should be touching the LRU with any frequency, hence
    significantly reducing lock acquisitions and the amount of contention
    on LRU updates.

    This also removes the inode_in_use list, which means we now only
    have one list for tracking the inode LRU status. This makes it much
    simpler to split out the LRU list operations under its own lock.
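
    On the release side the lazy scheme looks roughly like this (a sketch
    with locking elided and helper details assumed): the final iput() only
    tags the inode, and reclaim acts on the tag later.

    /* final iput(): don't reorder the LRU, just record recent use */
    inode->i_state |= I_REFERENCED;

    /* prune_icache(): a referenced inode gets one more trip around the LRU */
    if (inode->i_state & I_REFERENCED) {
            inode->i_state &= ~I_REFERENCED;
            list_move(&inode->i_lru, &inode_lru);
            continue;
    }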

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • note: for race-free uses you need inode_lock held

    Signed-off-by: Al Viro

    Al Viro
     
  • Add a new helper to write out the inode using the writeback code,
    that is, including the correct dirty bit and list manipulation. A few
    filesystems already open-code this, and a lot of others should be
    using it instead of write_inode_now(), which also writes out the
    data.
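
    Such a helper is essentially a thin wrapper around sync_inode() with a
    zero page count, e.g. (a sketch; the helper name here is an assumption,
    not taken from this log):

    int sync_inode_metadata(struct inode *inode, int wait)
    {
            struct writeback_control wbc = {
                    .sync_mode   = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
                    .nr_to_write = 0,       /* metadata only, no data pages */
            };

            return sync_inode(inode, &wbc);
    }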

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

04 Oct, 2010

1 commit

  • We currently use struct backing_dev_info for various different purposes.
    Originally it was introduced to describe a backing device which includes
    an unplug and congestion function and various bits of readahead information
    and VM-relevant flags. We're also using it for tracking dirty inodes for
    writeback.

    To make writeback properly find all inodes we need to only access the
    per-filesystem backing_device pointed to by the superblock in ->s_bdi
    inside the writeback code, and not the instances pointed to by
    inode->i_mapping->backing_dev_info, which can be overridden by special
    devices or might not be set at all by some filesystems.

    Long term we should split out the writeback-relevant bits of struct
    backing_dev_info (which includes more than the current bdi_writeback)
    and only point to it from the superblock while leaving the traditional
    backing device as a separate structure that can be overridden by devices.

    The one exception for now is the block device filesystem which really
    wants different writeback contexts for its different (internal) inodes
    to handle the writeout more efficiently. For now we do this with
    a hack in fs-writeback.c because we're so late in the cycle, but in
    the future I plan to replace this with a superblock method that allows
    for multiple writeback contexts per filesystem.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

22 Sep, 2010

1 commit

  • Inodes of devices such as /dev/zero can get dirty, for example via the
    utime(2) syscall or due to an atime update. The backing device of such
    inodes (zero_bdi, etc.) is, however, unable to handle dirty inodes, and
    thus __mark_inode_dirty complains. In fact, the inode should rather be
    dirtied against the backing device of the filesystem holding it. This is
    generally a good rule except for filesystems such as 'bdev' or
    'mtd_inodefs'. Inodes
    in these pseudofilesystems are referenced from ordinary filesystem
    inodes and carry mapping with real data of the device. Thus for these
    inodes we have to use inode->i_mapping->backing_dev_info as we did so
    far. We distinguish these filesystems by checking whether sb->s_bdi
    points to a non-trivial backing device or not.

    Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /.
    There's a device inode A described by a path "/dev/sdb" on this
    filesystem. This inode will be dirtied against backing device "8:0"
    after this patch. bdev filesystem contains block device inode B coupled
    with our inode A. When someone modifies a page of /dev/sdb, it's B that
    gets dirtied and the dirtying happens against the backing device "8:16".
    Thus both inodes get filed to a correct bdi list.

    Cc: stable@kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

28 Aug, 2010

1 commit

  • Setting the task state here may cause us to miss the wake up from
    kthread_stop(), so we need to recheck kthread_should_stop() or risk
    sleeping forever in the following schedule().

    Symptom was an indefinite hang on an NFSv4 mount. (NFSv4 may create
    multiple mounts in a temporary namespace while traversing the mount
    path, and since the temporary namespace is immediately destroyed, it may
    end up destroying a mount very soon after it was created, possibly
    making this race more likely.)
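
    The standard pattern for this kind of sleep, sketched (illustrative,
    not the literal diff): re-check kthread_should_stop() after setting
    the task state, so a concurrent kthread_stop() cannot be missed.

    set_current_state(TASK_INTERRUPTIBLE);
    if (kthread_should_stop()) {
            /* a stop request raced with us; don't go to sleep */
            __set_current_state(TASK_RUNNING);
            break;
    }
    schedule();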

    INFO: task mount.nfs4:4314 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    mount.nfs4 D 0000000000000000 2880 4314 4313 0x00000000
    ffff88001ed6da28 0000000000000046 ffff88001ed6dfd8 ffff88001ed6dfd8
    ffff88001ed6c000 ffff88001ed6c000 ffff88001ed6c000 ffff88001e5003a0
    ffff88001ed6dfd8 ffff88001e5003a8 ffff88001ed6c000 ffff88001ed6dfd8
    Call Trace:
    [] schedule_timeout+0x1cd/0x2e0
    [] ? mark_held_locks+0x6c/0xa0
    [] ? _raw_spin_unlock_irq+0x30/0x60
    [] ? trace_hardirqs_on_caller+0x14d/0x190
    [] ? sub_preempt_count+0xe/0xd0
    [] wait_for_common+0x120/0x190
    [] ? default_wake_function+0x0/0x20
    [] wait_for_completion+0x1d/0x20
    [] kthread_stop+0x4a/0x150
    [] ? thaw_process+0x70/0x80
    [] bdi_unregister+0x10a/0x1a0
    [] nfs_put_super+0x19/0x20
    [] generic_shutdown_super+0x54/0xe0
    [] kill_anon_super+0x16/0x60
    [] nfs4_kill_super+0x39/0x90
    [] deactivate_locked_super+0x45/0x60
    [] deactivate_super+0x49/0x70
    [] mntput_no_expire+0x84/0xe0
    [] release_mounts+0x9f/0xc0
    [] put_mnt_ns+0x65/0x80
    [] nfs_follow_remote_path+0x1e6/0x420
    [] nfs4_try_mount+0x6f/0xd0
    [] nfs4_get_sb+0xa2/0x360
    [] vfs_kern_mount+0x88/0x1f0
    [] do_kern_mount+0x52/0x130
    [] ? _lock_kernel+0x6a/0x170
    [] do_mount+0x26e/0x7f0
    [] ? copy_mount_options+0xea/0x190
    [] sys_mount+0x98/0xf0
    [] system_call_fastpath+0x16/0x1b
    1 lock held by mount.nfs4/4314:
    #0: (&type->s_umount_key#24){+.+...}, at: [] deactivate_super+0x41/0x70

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Jens Axboe
    Acked-by: Artem Bityutskiy

    J. Bruce Fields
     

12 Aug, 2010

5 commits

  • Commit 83ba7b071f3 ("writeback: simplify the write back thread queue")
    broke writeback_in_progress() as in that commit we started to remove work
    items from the list at the moment we start working on them and not at the
    moment they are finished. Thus if the flusher thread was doing some work
    but there was no other work queued, writeback_in_progress() returned
    false. This could in particular cause unnecessary queueing of background
    writeback from balance_dirty_pages() or writeout work from
    writeback_inodes_sb_if_idle().

    This patch fixes the problem by introducing a bit in the bdi state which
    indicates that the flusher thread is processing some work and uses this
    bit for writeback_in_progress() test.

    NOTE: Both callsites of writeback_in_progress() (namely,
    writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
    need different information from what writeback_in_progress() provides.
    They would need to know whether *the kind of writeback they are going to
    submit* is already queued. But this information isn't that simple to
    provide so let's fix writeback_in_progress() for the time being.
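
    With the new state bit, the check becomes a plain bit test, roughly
    (a sketch; the BDI_writeback_running name is an assumption):

    int writeback_in_progress(struct backing_dev_info *bdi)
    {
            return test_bit(BDI_writeback_running, &bdi->state);
    }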

    Signed-off-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Unify the logic for kupdate and non-kupdate cases. There won't be
    starvation because the inodes requeued into b_more_io will later be
    spliced _after_ the remaining inodes in b_io, hence won't stand in the way
    of other inodes in the next run.

    It avoids unnecessary redirty_tail() calls, hence the update of
    i_dirtied_when. The timestamp update is undesirable because it could
    later delay the inode's periodic writeback, or may exclude the inode from
    the data integrity sync operation (which checks timestamp to avoid extra
    work and livelock).

    ===
    How the redirty_tail() comes about:

    It was a long story... This redirty_tail() was introduced with
    wbc.more_io. The initial patch for more_io actually did not have the
    redirty_tail(), and when it was merged, several 100% iowait bug reports
    arose:

    reiserfs:
    http://lkml.org/lkml/2007/10/23/93

    jfs:
    commit 29a424f28390752a4ca2349633aaacc6be494db5
    JFS: clear PAGECACHE_TAG_DIRTY for no-write pages

    ext2:
    http://www.spinics.net/linux/lists/linux-ext4/msg04762.html

    They are all old bugs hidden in various filesystems that became "visible"
    with the more_io patch. At the time, the ext2 bug was thought to be
    "trivial", so it was not fixed. Instead, the following updated more_io
    patch with redirty_tail() was merged:

    http://www.spinics.net/linux/lists/linux-ext4/msg04507.html

    This will in general prevent 100% iowait on ext2 and possibly other unknown FS bugs.

    Signed-off-by: Wu Fengguang
    Cc: Dave Chinner
    Cc: Martin Bligh
    Cc: Michael Rubin
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This was not a bug, since b_io is empty for kupdate writeback. The next
    patch will do requeue_io() for non-kupdate writeback, so let's fix it.

    Signed-off-by: Wu Fengguang
    Cc: Dave Chinner
    Cc: Martin Bligh
    Cc: Michael Rubin
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Avoid delaying writeback for an expired inode with lots of dirty pages, but
    no active dirtier at the moment. Previously we only did that for the
    kupdate case.

    Any filesystem that does delayed allocation or unwritten extent conversion
    after IO completion will cause this - for example, XFS.

    Signed-off-by: Wu Fengguang
    Acked-by: Jan Kara
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
    that the latter can be avoided when under global dirty background
    threshold (which is the normal state for most systems).

    Signed-off-by: Wu Fengguang
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

11 Aug, 2010

2 commits

  • * 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
    block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
    xen-blkfront: fix missing out label
    blkdev: fix blkdev_issue_zeroout return value
    block: update request stacking methods to support discards
    block: fix missing export of blk_types.h
    writeback: fix bad _bh spinlock nesting
    drbd: revert "delay probes", feature is being re-implemented differently
    drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
    drbd: Disable delay probes for the upcomming release
    writeback: cleanup bdi_register
    writeback: add new tracepoints
    writeback: remove unnecessary init_timer call
    writeback: optimize periodic bdi thread wakeups
    writeback: prevent unnecessary bdi threads wakeups
    writeback: move bdi threads exiting logic to the forker thread
    writeback: restructure bdi forker loop a little
    writeback: move last_active to bdi
    writeback: do not remove bdi from bdi_list
    writeback: simplify bdi code a little
    writeback: do not lose wake-ups in bdi threads
    ...

    Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
    drivers/scsi/scsi_error.c as per Jens.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

2 commits

  • WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't
    write out some huge inode for too long while starving writeout of other
    inodes. To avoid livelocks, we record the time we started writeback in
    wbc->wb_start and do not write out inodes which were dirtied after this
    time. But currently, writeback_inodes_wb() resets wb_start each time it
    is called thus effectively invalidating this logic and making any
    WB_SYNC_NONE writeback prone to livelocks.

    This patch makes sure wb_start is set only once when we start writeback.

    Signed-off-by: Jan Kara
    Reviewed-by: Wu Fengguang
    Cc: Christoph Hellwig
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is
    equivalent to I_FREEING for almost all code looking at either;
    it's there to keep track of having called clear_inode() exactly
    once per inode lifetime, at some point after having set I_FREEING.
    I_CLEAR and I_FREEING never get set at the same time with the
    current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
    instead of I_CLEAR without loss of information. As a result of
    this change, checks become simpler and the amount of code that needs
    to know about I_CLEAR shrinks a lot.

    Signed-off-by: Al Viro

    Al Viro
     

08 Aug, 2010

4 commits

  • When the first inode for a bdi is marked dirty, we wake up the bdi thread, which
    should take care of the periodic background write-out. However, the write-out
    will actually start only 'dirty_writeback_interval' centisecs later, so we can
    delay the wake-up.

    This change was requested by Nick Piggin who pointed out that if we delay the
    wake-up, we weed out 2 unnecessary context switches, which matters because
    '__mark_inode_dirty()' is a hot-path function.

    This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
    sets up a timer to wake up the bdi thread and returns. So the wake-up is
    delayed.

    We also delete the timer in bdi threads just before writing-back. And
    synchronously delete it when unregistering bdi. At the unregister point the bdi
    does not have any users, so no one can arm it again.

    Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
    context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
    this change as well.

    This patch also moves the 'bdi_wb_init()' function down in the file to avoid
    forward-declaration of 'bdi_wakeup_thread_delayed()'.
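
    Conceptually the helper just arms the per-bdi timer instead of waking
    the thread immediately, e.g. (a sketch; field names such as
    wb.wakeup_timer are assumptions):

    void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
    {
            unsigned long timeout;

            /* dirty_writeback_interval is in centisecs */
            timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
            mod_timer(&bdi->wb.wakeup_timer, jiffies + timeout);
    }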

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
    bad for battery-driven devices.

    There are two types of activities bdi threads do:
    1. process bdi works from the 'bdi->work_list'
    2. periodic write-back

    So there are 2 sources of wake-up events for bdi threads:

    1. 'bdi_queue_work()' - submits bdi works
    2. '__mark_inode_dirty()' - adds dirty I/O to bdi's

    The former already has bdi wake-up code. The latter does not, and this patch
    adds it.

    '__mark_inode_dirty()' is a hot-path function, but this patch adds another
    'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
    the bdi has no dirty inodes. So adding this spinlock should be fine and should
    not affect performance.

    This patch makes sure bdi threads and the forker thread do not wake up if there
    is nothing to do. The forker thread will nevertheless wake up at least every
    5 min. to check whether it has to kill a bdi thread. This can also be optimized,
    but is not worth it.

    This patch also tidies up the warning about an unregistered bdi, and turns it from
    an ugly crocodile into a simple 'WARN()' statement.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently, bdi threads can decide to exit if there were no useful activities
    for 5 minutes. However, this causes nasty races: we can easily oops in the
    'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.

    And even if we do not oops, if the bdi thread exits immediately after we wake
    it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
    in the bdi work processing.

    This patch makes the forker thread the central place which not only
    creates bdi threads, but also kills them if they were inactive long enough.
    This is better design-wise.

    Another reason why this change was done is to prepare for the further changes
    which will prevent the bdi threads from waking up every 5 sec and wasting
    power. Indeed, when the task does not wake up periodically anymore, it won't be
    able to exit either.

    This patch also moves the 'wake_up_bit()' call from the bdi thread to the
    forker thread as well. So now the forker thread sets the BDI_pending bit, then
    forks the task or kills it, then clears the bit and wakes up the waiting
    process.

    The only process which may wait on the bit is 'bdi_wb_shutdown()'. This
    function was changed as well - now it first removes the bdi from the
    'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
    guaranteed that the forker thread won't race with it, because the bdi is not
    visible. Note, the forker thread sets the 'BDI_pending' bit under the
    'bdi->wb_lock' which is essential for proper serialization.

    And additionally, when we change 'bdi->wb.task', we now take the
    'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
    would when raced with, say, 'bdi_queue_work()'.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently bdi threads use a local variable 'last_active' which stores the last
    time the bdi thread did some useful work. Move this local variable to 'struct
    bdi_writeback'. This is just a preparation for the further patches which will
    make the forker thread decide when bdi threads should be killed.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy