Eric Lee / linux-smarc-t335x-v3.2

03 Sep, 2011

2 commits

09f40f98b mm: Add comment explaining task state setting in bdi_forker_thread()

CC: Wu Fengguang
CC: Andrew Morton
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2011-09-03 07:17:02 +0800
5a042aa4b mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()

bdi_forker_thread() clears the BDI_pending bit at the end of the main loop.
However, the bit must not be cleared in some cases, which is currently
handled by calling 'continue' from the switch statement. That is an
unusual construct with no good reason behind it, so restructure the
function into a more intuitive code flow.

CC: Wu Fengguang
CC: Andrew Morton
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2011-09-03 07:17:02 +0800

27 Jul, 2011

1 commit

f01ef569c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
mm: properly reflect task dirty limits in dirty_exceeded logic
writeback: don't busy retry writeback on new/freeing inodes
writeback: scale IO chunk size up to half device bandwidth
writeback: trace global_dirty_state
writeback: introduce max-pause and pass-good dirty limits
writeback: introduce smoothed global dirty limit
writeback: consolidate variable names in balance_dirty_pages()
writeback: show bdi write bandwidth in debugfs
writeback: bdi write bandwidth estimation
writeback: account per-bdi accumulated written pages
writeback: make writeback_control.nr_to_write straight
writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
writeback: trace event writeback_queue_io
writeback: trace event writeback_single_inode
writeback: remove .nonblocking and .encountered_congestion
writeback: remove writeback_control.more_io
writeback: skip balance_dirty_pages() for in-memory fs
writeback: add bdi_dirty_limit() kernel-doc
writeback: avoid extra sync work at enqueue time
writeback: elevate queue_io() into wb_writeback()
...

Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

Linus Torvalds
2011-07-27 01:39:54 +0800

26 Jul, 2011

2 commits

45b583b10 Merge 'akpm' patch series

* Merge akpm patch series: (122 commits)
drivers/connector/cn_proc.c: remove unused local
Documentation/SubmitChecklist: add RCU debug config options
reiserfs: use hweight_long()
reiserfs: use proper little-endian bitops
pnpacpi: register disabled resources
drivers/rtc/rtc-tegra.c: properly initialize spinlock
drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time()
drivers/rtc: add support for Qualcomm PMIC8xxx RTC
drivers/rtc/rtc-s3c.c: support clock gating
drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200
init: skip calibration delay if previously done
misc/eeprom: add eeprom access driver for digsy_mtc board
misc/eeprom: add driver for microwire 93xx46 EEPROMs
checkpatch.pl: update $logFunctions
checkpatch: make utf-8 test --strict
checkpatch.pl: add ability to ignore various messages
checkpatch: add a "prefer __aligned" check
checkpatch: validate signature styles and To: and Cc: lines
checkpatch: add __rcu as a sparse modifier
checkpatch: suggest using min_t or max_t
...

Did this as a merge because of (trivial) conflicts in
- Documentation/feature-removal-schedule.txt
- arch/xtensa/include/asm/uaccess.h
that were just easier to fix up in the merge than in the patch series.

Linus Torvalds
2011-07-26 12:00:19 +0800
ccb6108f5 mm/backing-dev.c: reset bdi min_ratio in bdi_unregister()

Vito said:

: The system has many usb disks coming and going day to day, with their
: respective bdi's having min_ratio set to 1 when inserted. It works for
: some time until eventually min_ratio can no longer be set, even when the
: active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to
: anywhere near 100.
:
: This then leads to an unrelated starvation problem caused by write-heavy
: fuse mounts being used atop the usb disks, a problem the min_ratio setting
: at the underlying devices bdi effectively prevents.

Fix this leakage by resetting the bdi min_ratio when unregistering the
BDI.
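
The leak described above can be reproduced with a small userspace model. This is only a sketch, not the kernel code: the names `Bdi`, `BdiRegistry`, `reset_on_unregister` and `hotplug_cycles` are hypothetical, and the model assumes the only rule is that the sum of all min_ratio values must stay below 100.

```python
class Bdi:
    """Hypothetical stand-in for a backing device; only tracks min_ratio."""

    def __init__(self, name):
        self.name = name
        self.min_ratio = 0


class BdiRegistry:
    """Tracks the global sum of min_ratio values (assumed cap: below 100)."""

    def __init__(self, reset_on_unregister):
        self.total_min_ratio = 0
        self.reset_on_unregister = reset_on_unregister

    def set_min_ratio(self, bdi, ratio):
        # Reject the request if the global sum would reach 100 percent.
        delta = ratio - bdi.min_ratio
        if self.total_min_ratio + delta >= 100:
            return False
        self.total_min_ratio += delta
        bdi.min_ratio = ratio
        return True

    def unregister(self, bdi):
        # The fix: hand the device's share back before it disappears.
        if self.reset_on_unregister:
            self.set_min_ratio(bdi, 0)


def hotplug_cycles(registry, cycles):
    """Plug in a disk, set min_ratio to 1, unplug it; count successful sets."""
    succeeded = 0
    for i in range(cycles):
        disk = Bdi(f"usb{i}")
        if registry.set_min_ratio(disk, 1):
            succeeded += 1
        registry.unregister(disk)
    return succeeded


leaky = hotplug_cycles(BdiRegistry(reset_on_unregister=False), 200)
fixed = hotplug_cycles(BdiRegistry(reset_on_unregister=True), 200)
```

Without the reset, every unplugged disk keeps its 1 percent in the global sum, so the model stops accepting new settings once 99 disks have come and gone, even though none of them is still present. With the reset, all 200 cycles succeed.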

Signed-off-by: Peter Zijlstra
Reported-by: Vito Caputo
Cc: Wu Fengguang
Cc: Miklos Szeredi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2011-07-26 11:57:07 +0800

24 Jul, 2011

1 commit

ef3230880 backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu

synchronize_rcu sleeps several timer ticks. synchronize_rcu_expedited is
much faster.

With 100Hz timer frequency, when we remove 10000 block devices with
"dmsetup remove_all" command, it takes 27 minutes. With this patch,
removing 10000 block devices takes only 15 seconds.
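
To put the quoted numbers in per-device terms, the arithmetic can be worked through as below. Only the figures from the message are used (10000 devices, 27 minutes, 15 seconds, 100Hz); the variable names are illustrative, and the conclusion that each removal waits roughly 16 ticks is a derived estimate, not a measurement.

```python
devices = 10_000
before_s = 27 * 60      # total time with synchronize_rcu, in seconds
after_s = 15            # total time with synchronize_rcu_expedited
tick_s = 1 / 100        # one timer tick at 100Hz

# Average cost of removing one device, before and after the patch.
per_device_before_ms = before_s / devices * 1000   # about 162 ms
per_device_after_ms = after_s / devices * 1000     # about 1.5 ms

# How many timer ticks one removal took on average before the patch.
ticks_per_device_before = (before_s / devices) / tick_s   # about 16 ticks

# Overall speedup factor.
speedup = before_s / after_s
```

This is consistent with the statement that synchronize_rcu sleeps for several timer ticks: the old per-device cost is on the order of sixteen 10 ms ticks, while the expedited variant finishes in well under one tick.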

Signed-off-by: Mikulas Patocka
Signed-off-by: Jens Axboe

Mikulas Patocka
2011-07-24 02:44:24 +0800

10 Jul, 2011

4 commits

00821b002 writeback: show bdi write bandwidth in debugfs

Add a "BdiWriteBandwidth" entry and indent others in /debug/bdi/*/stats.

Also increase the numeric field width to 10, to keep the possibly
huge BdiWritten number aligned, at least on desktop systems.

Impact: this could break user space tools if they are dumb enough to
depend on the number of white spaces.

CC: Theodore Ts'o
CC: Jan Kara
CC: Peter Zijlstra
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:02 +0800
e98be2d59 writeback: bdi write bandwidth estimation

The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.

It tries to update the bandwidth only when disk is fully utilized.
Any inactive period of more than one second will be skipped.

The estimated bandwidth reflects how fast the device can write out
when _fully utilized_, and won't drop to 0 when it goes idle.
The value remains constant while the disk is idle. During busy writes,
fluctuations aside, it also remains high unless it is knocked down by
concurrent reads that compete with async writes for disk time and
bandwidth.

The estimation is not done purely in the flusher because there is no
guarantee for write_cache_pages() to return timely to update bandwidth.

The bdi->avg_write_bandwidth smoothing is very effective at filtering
out sudden spikes, but may be a little biased in the long term.

The overheads are low because the bdi bandwidth update only occurs at
200ms intervals.

The 200ms update interval is suitable, because it's not possible to get
the real instantaneous bandwidth at all, due to large fluctuations.

The NFS commits can be as large as seconds worth of data. One XFS
completion may be as large as half a second worth of data if we are going
to increase the write chunk to half a second worth of data. In ext4,
fluctuations with a period of around 5 seconds are observed. And there
is another pattern of irregular periods of up to 20 seconds on SSD tests.

That's why we are not only doing the estimation at 200ms intervals, but
also averaging them over a period of 3 seconds and then going further to do
another level of smoothing in avg_write_bandwidth.
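
The two-level scheme described above can be sketched in userspace as follows. This is a simplified model and not the kernel implementation: the class name `WriteBandwidthEstimator`, the floating-point arithmetic, and the 1/8 step used for the second smoothing level are assumptions. Only the 100MB/s starting value, the 200ms sampling, the 3 second averaging period, and the skipping of inactive periods longer than one second come from the message.

```python
class WriteBandwidthEstimator:
    PERIOD_S = 3.0    # averaging period from the commit message
    MAX_GAP_S = 1.0   # longer inactive periods are skipped

    def __init__(self, initial_mb_per_s=100.0):
        # Both estimates start from 100 MB/s, as the message states.
        self.write_bandwidth = initial_mb_per_s
        self.avg_write_bandwidth = initial_mb_per_s

    def update(self, written_mb, elapsed_s):
        # Skip bogus intervals and idle gaps, so idle time cannot drag
        # the estimate towards zero.
        if elapsed_s <= 0 or elapsed_s > self.MAX_GAP_S:
            return
        sample = written_mb / elapsed_s
        # First level: weight the new sample by its share of the period.
        weight = elapsed_s / self.PERIOD_S
        self.write_bandwidth = (weight * sample
                                + (1.0 - weight) * self.write_bandwidth)
        # Second level (assumed form): move 1/8 of the way to the new value.
        self.avg_write_bandwidth += (
            self.write_bandwidth - self.avg_write_bandwidth) / 8.0


est = WriteBandwidthEstimator()
for _ in range(300):                 # 60 s of steady 50 MB/s, every 200 ms
    est.update(written_mb=10.0, elapsed_s=0.2)
steady = est.avg_write_bandwidth

est.update(written_mb=100.0, elapsed_s=0.2)   # a single 500 MB/s spike
after_spike = est.avg_write_bandwidth

est.update(written_mb=0.0, elapsed_s=5.0)     # idle gap, skipped
after_idle = est.avg_write_bandwidth
```

In this model the estimate settles at the real 50 MB/s, a single 500 MB/s sample moves the smoothed value by only a few MB/s, and an idle gap leaves it unchanged.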

CC: Li Shaohua
CC: Peter Zijlstra
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:01 +0800
f7d2b1ecd writeback: account per-bdi accumulated written pages

Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.

Peter Zijlstra :
Move BDI_WRITTEN accounting into __bdi_writeout_inc().
This will cover and fix fuse, which only calls bdi_writeout_inc().

CC: Michael Rubin
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Jan Kara
Signed-off-by: Wu Fengguang

Jan Kara
2011-07-10 13:09:01 +0800
d46db3d58 writeback: make writeback_control.nr_to_write straight

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
quota remained in wbc->nr_to_write.

Acked-by: Jan Kara
Proposed-by: Christoph Hellwig
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:01 +0800

08 Jun, 2011

1 commit

f758eeabe writeback: split inode_wb_list_lock into bdi_writeback.list_lock

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads. It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contention to about 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
inode_wb_list_lock: 42590 44433 0.12 147.74 144127.35 252274 886792 0.08 121.34 917211.23
------------------
inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 34 [] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 12893 [] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 10702 [] writeback_single_inode+0x16d/0x20a
------------------
inode_wb_list_lock 2 [] bdev_inode_switch_bdi+0x29/0x85
inode_wb_list_lock 19 [] inode_wb_list_del+0x22/0x49
inode_wb_list_lock 5550 [] __mark_inode_dirty+0x170/0x1d0
inode_wb_list_lock 8511 [] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
&(&wb->list_lock)->rlock: 11383 11657 0.14 151.69 40429.51 90825 527918 0.11 145.90 556843.37
------------------------
&(&wb->list_lock)->rlock 10 [] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 1493 [] writeback_inodes_wb+0x3d/0x150
&(&wb->list_lock)->rlock 3652 [] writeback_sb_inodes+0x123/0x16f
&(&wb->list_lock)->rlock 1412 [] writeback_single_inode+0x17f/0x223
------------------------
&(&wb->list_lock)->rlock 3 [] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock 6 [] inode_wb_list_del+0x5f/0x86
&(&wb->list_lock)->rlock 2061 [] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock 2629 [] writeback_sb_inodes+0x123/0x16f
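
The "1/4" figure can be checked against the two summary rows of the lock_stat output above. The snippet below only copies the contentions, waittime-total and acquisitions columns from those rows; the dictionary names are illustrative.

```python
# Summary rows copied from the lock_stat output above.
vanilla = {"contentions": 44433, "waittime_total": 144127.35,
           "acquisitions": 886792}
patched = {"contentions": 11657, "waittime_total": 40429.51,
           "acquisitions": 527918}

# Ratio of patched to vanilla for each column.
ratios = {key: patched[key] / vanilla[key] for key in vanilla}
```

Contentions drop to roughly 26 percent and total wait time to roughly 28 percent of the vanilla values, while acquisitions only drop to about 60 percent, so the improvement is not simply a result of the lock being taken less often.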

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Wu Fengguang

Christoph Hellwig
2011-06-08 08:25:21 +0800

21 May, 2011

1 commit

345227d70 backing-dev: Kill set but not used var in bdi_debug_stats_show()

Signed-off-by: Gustavo F. Padovan
Signed-off-by: Jens Axboe

Gustavo F. Padovan
2011-05-21 03:23:37 +0800

31 Mar, 2011

1 commit

25985edce Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi

Lucas De Marchi
2011-03-31 22:26:23 +0800

25 Mar, 2011

2 commits

d39dd11c3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fs: simplify iget & friends
fs: pull inode->i_lock up out of writeback_single_inode
fs: rename inode_lock to inode_hash_lock
fs: move i_wb_list out from under inode_lock
fs: move i_sb_list out from under inode_lock
fs: remove inode_lock from iput_final and prune_icache
fs: Lock the inode LRU list separately
fs: factor inode disposal
fs: protect inode->i_state with inode->i_lock
autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
autofs4 - remove autofs4_lock
autofs4 - fix d_manage() return on rcu-walk
autofs4 - fix autofs4_expire_indirect() traversal
autofs4 - fix dentry leak in autofs4_expire_direct()
autofs4 - reinstate last used update on access
vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

Linus Torvalds
2011-03-25 10:01:30 +0800
a66979aba fs: move i_wb_list out from under inode_lock

Protect the inode writeback list with a new global lock
inode_wb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.

Signed-off-by: Dave Chinner
Signed-off-by: Al Viro

Dave Chinner
2011-03-25 09:17:51 +0800

17 Mar, 2011

1 commit

95f28604a fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away

We don't have proper reference counting for this yet, so we run into
cases where the device is pulled and we OOPS on flushing the fs data.
This happens even though the dirty inodes have already been
migrated to the default_backing_dev_info.

Reported-by: Torsten Hilbrich
Tested-by: Torsten Hilbrich
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Jens Axboe
2011-03-17 18:13:12 +0800

10 Mar, 2011

1 commit

7eaceacca block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe

Jens Axboe
2011-03-10 15:52:07 +0800

27 Oct, 2010

4 commits

426e1f5ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...

Linus Torvalds
2010-10-27 08:58:44 +0800
766f91641 kernel: remove PF_FLUSHER

PF_FLUSHER is only ever set, not tested, remove it.

Signed-off-by: Peter Zijlstra
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2010-10-27 07:52:15 +0800
0e093d997 writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone

If congestion_wait() is called with no BDI congested, the caller will
sleep for the full timeout and this may be an unnecessary sleep. This
patch adds a wait_iff_congested() that checks congestion and only sleeps
if a BDI is congested; otherwise, it calls cond_resched() to ensure the
caller is not hogging the CPU longer than its quota but will not sleep.

This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
but it could just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout but
after this patch, it'll just call schedule() if it has been on the CPU too
long. Similar logic applies to direct reclaimers that are not making
enough progress.
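
The decision described above can be modelled in userspace as follows. This is a sketch of the logic only, not the kernel function: the parameter names, the string return values, and the injectable `sleep` and `yield_cpu` callables are assumptions made so the behaviour can be observed, and `yield_cpu` merely stands in for cond_resched().

```python
import time


def wait_iff_congested(nr_congested_bdis, zone_congested, timeout_s,
                       sleep=time.sleep, yield_cpu=lambda: None):
    """Sleep only when congestion is actually being encountered.

    Returns "slept" or "yielded" so the decision is easy to observe.
    """
    if nr_congested_bdis == 0 or not zone_congested:
        # Nothing worth waiting for: give up the CPU briefly, do not sleep.
        yield_cpu()
        return "yielded"
    sleep(timeout_s)
    return "slept"


def congestion_wait(timeout_s, sleep=time.sleep):
    """Modelled older behaviour with no BDI congested: sleep the timeout."""
    sleep(timeout_s)
    return "slept"
```

A caller such as a reclaim loop would replace an unconditional wait with the conditional one, so that reclaiming clean page cache pages no longer costs a full timeout per call.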

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Mel Gorman
2010-10-27 07:52:07 +0800
52bb91986 writeback: account for time spent congestion_waited

There is strong evidence to indicate a lot of time is being spent in
congestion_wait(), some of it unnecessarily. This patch adds a tracepoint
for congestion_wait to record when congestion_wait() was called, how long
the timeout was for and how long it actually slept.

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Reviewed-by: Johannes Weiner
Cc: Wu Fengguang
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2010-10-27 07:52:07 +0800

26 Oct, 2010

1 commit

7ccf19a80 fs: inode split IO and LRU lists

The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.

Signed-off-by: Nick Piggin
Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Al Viro

Nick Piggin
2010-10-26 09:26:15 +0800

22 Sep, 2010

1 commit

976e48f8a bdi: Initialize noop_backing_dev_info properly

Properly initialize this backing dev info so that writeback code does not
barf when getting to it e.g. via sb->s_bdi.

Cc: stable@kernel.org
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2010-09-22 15:48:47 +0800

27 Aug, 2010

1 commit

6628bc74f writeback: do not lose wakeup events when forking bdi threads

This patch fixes the following issue:

INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.nfs4 D 00000000fffc6a21 0 1120 1119 0x00000000
ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
Call Trace:
[] schedule_timeout+0x34/0xf1
[] ? wait_for_common+0x3f/0x130
[] ? trace_hardirqs_on+0xd/0xf
[] wait_for_common+0xd2/0x130
[] ? default_wake_function+0x0/0xf
[] ? _raw_spin_unlock+0x26/0x2a
[] wait_for_completion+0x18/0x1a
[] sync_inodes_sb+0xca/0x1bc
[] __sync_filesystem+0x47/0x7e
[] sync_filesystem+0x47/0x4b
[] generic_shutdown_super+0x22/0xd2
[] kill_anon_super+0x11/0x4f
[] nfs4_kill_super+0x3f/0x72 [nfs]
[] deactivate_locked_super+0x21/0x41
[] deactivate_super+0x40/0x45
[] mntput_no_expire+0xb8/0xed
[] release_mounts+0x9a/0xb0
[] put_mnt_ns+0x6a/0x7b
[] nfs_follow_remote_path+0x19a/0x296 [nfs]
[] nfs4_try_mount+0x75/0xaf [nfs]
[] nfs4_get_sb+0x276/0x2ff [nfs]
[] vfs_kern_mount+0xb8/0x196
[] do_kern_mount+0x48/0xe8
[] do_mount+0x771/0x7e8
[] sys_mount+0x83/0xbd
[] system_call_fastpath+0x16/0x1b

The reason for this hang was a race condition: when the flusher thread is
forking a bdi thread, we use 'kthread_run()', so we run it _before_ we make it
visible in 'bdi->wb.task'. The bdi thread runs, does all its work, and goes to
sleep. 'bdi->wb.task' is still NULL. And this is a dangerous time window.

If at this time someone queues a work for this bdi, he does not see the bdi
thread and wakes up the forker thread instead! But the forker has already
forked this bdi thread, but just did not make it visible yet!

The result is that we lose the wake up event for this bdi thread and the NFS4
code waits forever.

To fix the problem, we should use 'kthread_create()' for creating bdi threads,
then make them visible in 'bdi->wb.task', and only after this wake them up.
This is exactly what this patch does.
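
The ordering problem can be replayed deterministically with a small model. This is a sketch under simplifying assumptions, not kernel code: `Bdi`, `FakeThread`, `queue_work`, `fork_racy` and `fork_fixed` are hypothetical names, and the "thread" runs synchronously when woken so that the interleaving is fixed. The work item arrives at the same point in both variants, right after the thread has been created.

```python
class Bdi:
    def __init__(self):
        self.task = None        # models bdi->wb.task
        self.work_list = []     # queued work items
        self.done = []          # work items the bdi thread processed


class FakeThread:
    """Hypothetical bdi thread: when woken, drain the work list, then sleep."""

    def __init__(self, bdi):
        self.bdi = bdi

    def wake(self):
        while self.bdi.work_list:
            self.bdi.done.append(self.bdi.work_list.pop(0))


def queue_work(bdi, work):
    """Queue work and wake whoever is visible for this bdi."""
    bdi.work_list.append(work)
    if bdi.task is not None:
        bdi.task.wake()
        return "woke bdi thread"
    # No visible thread: the forker is woken instead, but in the racy case
    # it has already forked the thread and will not do so again.
    return "woke forker"


def fork_racy(bdi, arriving_work):
    thread = FakeThread(bdi)
    thread.wake()                    # like kthread_run(): runs immediately
    queue_work(bdi, arriving_work)   # lands in the window, task is None
    bdi.task = thread                # published too late
    return list(bdi.work_list)       # work left stranded


def fork_fixed(bdi, arriving_work):
    thread = FakeThread(bdi)         # like kthread_create(): not running yet
    queue_work(bdi, arriving_work)   # same window, work is simply queued
    bdi.task = thread                # make the thread visible first
    thread.wake()                    # only then wake it up
    return list(bdi.work_list)


stranded = fork_racy(Bdi(), "sync")
fixed_bdi = Bdi()
remaining = fork_fixed(fixed_bdi, "sync")
```

In the racy variant the work item stays on the list with nobody left to process it, which corresponds to the hang in sync_inodes_sb() shown in the trace. In the fixed variant the first wake-up happens after the thread is visible, so the queued item is picked up.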

Signed-off-by: Artem Bityutskiy
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-27 15:16:18 +0800

12 Aug, 2010

1 commit

16c4042f0 writeback: avoid unnecessary calculation of bdi dirty thresholds

Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
that the latter can be avoided when under global dirty background
threshold (which is the normal state for most systems).

Signed-off-by: Wu Fengguang
Cc: Peter Zijlstra
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-08-12 23:43:29 +0800

08 Aug, 2010

15 commits

6bf05d03e writeback: fix bad _bh spinlock nesting

Fix a bug where a lock is _bh nested within another _bh lock,
but forgets to use the _bh variant for unlock.

Furthermore, it's not necessary to nest _bh locks, the inner lock
can just use spin_lock(). So fix up the bug by making that change.

Signed-off-by: Jens Axboe

Jens Axboe
2010-08-08 00:53:57 +0800
c284de61d writeback: cleanup bdi_register

This patch makes sure we first initialize everything and set the BDI_registered
flag, and only after this add the bdi to 'bdi_list'. Current code adds the
bdi to the list too early, and as a result the

WARN(!test_bit(BDI_registered, &bdi->state))

in the bdi forker is triggered. Also, it is in general good practice to make
things visible only when they are fully initialized.

Also, this patch does a few micro clean-ups:
1. Removes the 'exit' label which does not do anything, just returns. This
allows us to get rid of a few braces and the 'ret' variable and makes the code
smaller.
2. If 'kthread_run()' fails, return the error code it returns, not a hard-coded
'-ENOMEM'. Theoretically, some day 'kthread_run()' can return something
else. Also, in case of failure it is not necessary to set 'bdi->wb.task' to
NULL.

Signed-off-by: Artem Bityutskiy
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:57 +0800
603320239 writeback: add new tracepoints

Add 2 new trace points to the periodic write-back wake up case, just like we do
in the 'bdi_queue_work()' function. Namely, introduce:

1. trace_writeback_wake_thread(bdi)
2. trace_writeback_wake_forker_thread(bdi)

The first event is triggered every time we wake up a bdi thread to start
periodic background write-out. The second event is triggered only when the bdi
thread does not exist and should be created by the forker thread.

This patch was suggested by Dave Chinner and Christoph Hellwig.

Signed-off-by: Artem Bityutskiy
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:56 +0800
b5048a6cb writeback: remove unnecessary init_timer call

The 'setup_timer()' function also calls 'init_timer()', so the extra
'init_timer()' call is not needed. Indeed, 'setup_timer()' is basically
'init_timer()' plus callback function and data pointers initialization.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:56 +0800
6467716a3 writeback: optimize periodic bdi thread wakeups

When the first inode for a bdi is marked dirty, we wake up the bdi thread which
should take care of the periodic background write-out. However, the write-out
will actually start only 'dirty_writeback_interval' centisecs later, so we can
delay the wake-up.

This change was requested by Nick Piggin who pointed out that if we delay the
wake-up, we weed out 2 unnecessary context switches, which matters because
'__mark_inode_dirty()' is a hot-path function.

This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
sets up a timer to wake-up the bdi thread and returns. So the wake-up is
delayed.

We also delete the timer in bdi threads just before writing-back. And
synchronously delete it when unregistering bdi. At the unregister point the bdi
does not have any users, so no one can arm it again.

Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
this change as well.

This patch also moves the 'bdi_wb_init()' function down in the file to avoid
forward-declaration of 'bdi_wakeup_thread_delayed()'.

Signed-off-by: Artem Bityutskiy
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:56 +0800
253c34e9b writeback: prevent unnecessary bdi threads wakeups

Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
bad for battery-driven devices.

There are two types of activities bdi threads do:
1. process bdi works from the 'bdi->work_list'
2. periodic write-back

So there are 2 sources of wake-up events for bdi threads:

1. 'bdi_queue_work()' - submits bdi works
2. '__mark_inode_dirty()' - adds dirty I/O to bdi's

The former already has bdi wake-up code. The latter does not, and this patch
adds it.

'__mark_inode_dirty()' is hot-path function, but this patch adds another
'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
the bdi has no dirty inodes. So adding this spinlock should be fine and should
not affect performance.

This patch makes sure bdi threads and the forker thread do not wake-up if there
is nothing to do. The forker thread will nevertheless wake up at least every
5 min. to check whether it has to kill a bdi thread. This can also be optimized,
but is not worth it.

This patch also tidies up the warning about unregistered bdi, and turns it from
an ugly crocodile to a simple 'WARN()' statement.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:56 +0800
fff5b85aa writeback: move bdi threads exiting logic to the forker thread

Currently, bdi threads can decide to exit if there were no useful activities
for 5 minutes. However, this causes nasty races: we can easily oops in the
'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.

And even if we do not oops, but the bdi thread exits immediately after we wake
it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
in the bdi work processing.

This patch makes the forker thread the central place which not only
creates bdi threads, but also kills them if they were inactive long enough.
This is better design-wise.

Another reason why this change was done is to prepare for the further changes
which will prevent the bdi threads from waking up every 5 sec and wasting
power. Indeed, when the task does not wake up periodically anymore, it won't be
able to exit either.

This patch also moves the 'wake_up_bit()' call from the bdi thread to the
forker thread as well. So now the forker thread sets the BDI_pending bit, then
forks the task or kills it, then clears the bit and wakes up the waiting
process.

The only process which may wait on the bit is 'bdi_wb_shutdown()'. This
function was changed as well - now it first removes the bdi from the
'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
guaranteed that the forker thread won't race with it, because the bdi is not
visible. Note, the forker thread sets the 'BDI_pending' bit under the
'bdi->wb_lock' which is essential for proper serialization.

And additionally, when we change 'bdi->wb.task', we now take the
'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
would when raced with, say, 'bdi_queue_work()'.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:56 +0800
adf392407 writeback: restructure bdi forker loop a little

This patch re-structures the bdi forker a little:
1. Add 'bdi_cap_flush_forker(bdi)' condition check to the bdi loop. The reason
for this is that the forker thread can start _before_ the 'BDI_registered'
flag is set (see 'bdi_register()'), so the WARN() statement will fire for
the default bdi. I observed this warning at boot-up.

2. Introduce an enum 'action' and use "switch" statement in the outer loop.
This is a preparation to the further patch which will teach the forker
thread to kill bdi threads, so we'll have another case in the "switch"
statement. This change was suggested by Christoph Hellwig.

This patch is just a small step towards the coming change where the forker
thread will kill the bdi threads. It should simplify reviewing the following
changes, which would otherwise be larger.

This patch also amends comments a little.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:56 +0800
78c40cb65 writeback: do not remove bdi from bdi_list

The forker thread removes bdis from 'bdi_list' before forking the bdi thread.
But this is wrong for at least 2 reasons.

Reason #1: if we temporarily remove a bdi from the list, we may miss works which
would otherwise be given to us.

Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are
always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when
it races with the forker thread, it can shut down the bdi thread
at the same time as the forker creates it.

This patch makes sure the forker thread never removes bdis from 'bdi_list'
(which was suggested by Christoph Hellwig).

In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to
hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending'
flag.

NOTE! The error path is interesting. Currently, when we fail to create a bdi
thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the
bdi from the list, we cannot move it to the tail either, because then we can
mess up the RCU readers which walk the list. And also, we'll have the race
described above in "Reason #2".

But I do not think that adding to the tail is important, so I just do not do
that.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:56 +0800
080dcec41 writeback: simplify bdi code a little

This patch simplifies bdi code a little by removing the 'pending_list' which is
redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is
working like this:

1. In a loop, fetch all bdi's which have works but have no writeback thread and
move them to the 'pending_list'.
2. If the list is empty, sleep for 5 sec.
3. Otherwise, take one bdi from the list, fork the writeback thread for this
bdi, and repeat the loop.

IOW, it first moves everything to the 'pending_list', then processes only one
element, and so on. This patch simplifies the algorithm, which is now as
follows.

1. Find the first bdi which has a work and remove it from the global list of
bdi's (bdi_list).
2. If there is no such bdi, sleep 5 sec.
3. Fork the writeback thread for this bdi and repeat the loop.

IOW, now we find the first bdi to process, process it, and so on. This is
simpler and involves fewer lists.

The bonus now is that we can get rid of a couple of functions, as well as
remove complications which involve 'call_rcu()' and 'bdi->rcu_head'.

This patch also makes sure we use 'list_add_tail_rcu()', instead of plain
'list_add_tail()', but this piece of code is going to be removed in the next
patch anyway.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:56 +0800
c4ec7908c writeback: do not lose wake-ups in the forker thread - 2

Currently, if someone submits jobs for the default bdi, we can lose wake-up
events. E.g., this can happen if 'bdi_queue_work()' is called when
'bdi_forker_thread()' is executing code after 'wb_do_writeback(me, 0)', but
before 'set_current_state(TASK_INTERRUPTIBLE)'.

This situation is unlikely, and the result is not very severe - we'll just
delay the execution of the work, but this is still not very nice.

This patch fixes the issue by checking whether the default bdi has work before
the forker thread goes to sleep.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:55 +0800
c5f7ad233 writeback: do not lose wake-ups in the forker thread - 1

Currently the forker thread can lose wake-ups which may lead to unnecessary
delays in processing bdi works. E.g., consider the following scenario.

1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there is nothing to
do, and is about to finish the loop.
2. A bdi thread decides to exit because it was inactive for a long time.
3. 'bdi_queue_work()' adds a work to the bdi which just exited, so it wakes up
the forker thread.
4. But 'bdi_forker_thread()' executes 'set_current_state(TASK_INTERRUPTIBLE)'
and goes to sleep. We lose a wake-up.

Losing the wake-up is not fatal, but this means that the bdi work processing
will be delayed by up to 5 sec. This race is theoretical and I never hit it,
but it is worth fixing.

The fix is to execute 'set_current_state(TASK_INTERRUPTIBLE)' _before_ walking
'bdi_list', not after.
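
The effect of the ordering can be shown with a small single-threaded model.
It is a sketch under stated assumptions, not kernel code: the 'demo_*' names
are hypothetical, 'demo_queue_work()' models the wake-up simply by marking
the task runnable, and 'demo_schedule()' models the rule that a task only
really sleeps if its state is still interruptible when it schedules.

```c
#include <stdbool.h>

enum demo_state { DEMO_RUNNING, DEMO_INTERRUPTIBLE };

/* Hypothetical model of the forker thread and the work it watches. */
struct demo_forker {
    enum demo_state state;
    bool work_pending;  /* set by the racing waker */
};

/* Models bdi_queue_work() + wake-up: queue work, mark the task runnable. */
static void demo_queue_work(struct demo_forker *f)
{
    f->work_pending = true;
    f->state = DEMO_RUNNING;
}

/* Models schedule(): returns true only if the task actually goes to sleep. */
static bool demo_schedule(const struct demo_forker *f)
{
    return f->state == DEMO_INTERRUPTIBLE;
}

/* Old order: walk the list first, set the state afterwards. */
static bool demo_old_order_loses_wakeup(void)
{
    struct demo_forker f = { DEMO_RUNNING, false };
    bool found = f.work_pending;   /* walk the list: nothing to do */

    demo_queue_work(&f);           /* waker races in right here */
    f.state = DEMO_INTERRUPTIBLE;  /* state set too late, wake-up erased */
    return !found && demo_schedule(&f) && f.work_pending;
}

/* New order: set the state first, then walk the list. */
static bool demo_new_order_loses_wakeup(void)
{
    struct demo_forker f = { DEMO_RUNNING, false };
    bool found;

    f.state = DEMO_INTERRUPTIBLE;  /* state set before the check */
    found = f.work_pending;        /* walk the list: nothing to do */
    demo_queue_work(&f);           /* waker races in at the same point */
    return !found && demo_schedule(&f) && f.work_pending;
}
```

In this model the old order ends up asleep with work pending, while in the
new order the wake-up resets the state and the sleep is skipped.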

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:55 +0800
94eac5e62 writeback: fix possible race when creating bdi threads

This patch fixes a very unlikely race condition on the bdi forker thread error
path: when bdi thread creation fails, 'bdi->wb.task' may contain the error code
for a short period of time. If at the same time someone submits a work to this
bdi, we can end up with an oops in 'bdi_queue_work()' while executing
'wake_up_process(wb->task)'.

This patch fixes the issue by introducing a temporary variable 'task' and
storing the possible error code there, so that 'wb->task' would never take
erroneous values.
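
A minimal userspace sketch of the temporary-variable pattern follows. The
'demo_*' helpers are hypothetical stand-ins (including the imitation of the
kernel's error-pointer encoding) and only illustrate the idea that the shared
field is written solely on success.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace imitation of the kernel's ERR_PTR()/IS_ERR(). */
#define DEMO_MAX_ERRNO 4095

static void *demo_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static int demo_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_task { int pid; };
struct demo_wb { struct demo_task *task; };  /* read by concurrent wakers */

/* Hypothetical thread creation: returns an error pointer when 'fail' is set. */
static struct demo_task *demo_thread_create(struct demo_task *slot, int fail)
{
    if (fail)
        return demo_err_ptr(-12);  /* an ENOMEM-like error code */
    slot->pid = 1;
    return slot;
}

/*
 * The result is kept in the local 'task' until it is known to be valid, so
 * 'wb->task' only ever holds NULL or a usable pointer.
 */
static int demo_fork_for(struct demo_wb *wb, struct demo_task *slot, int fail)
{
    struct demo_task *task = demo_thread_create(slot, fail);

    if (demo_is_err(task))
        return (int)(intptr_t)task;  /* wb->task is left untouched */
    wb->task = task;
    return 0;
}
```

A waker that dereferences 'wb->task' therefore never sees an encoded error
value, which is the condition that could lead to the oops described above.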

Note, this race is very unlikely and I never hit it, so it is theoretical, but
nevertheless worth fixing.

This patch also merges 2 comments which were previously separate.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:19 +0800
6f904ff0e writeback: harmonize writeback threads naming

The write-back code mixes words "thread" and "task" for the same things. This
is not a big deal, but still an inconsistency.

hch: a convention I tend to use and I've seen in various places
is to always use _task for the storage of the task_struct pointer,
and thread everywhere else. This especially helps with having
foo_thread for the actual thread and foo_task for a global
variable keeping the task_struct pointer

This patch renames:
* 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
* 'bdi_forker_task()' -> 'bdi_forker_thread()'

because bdi threads are 'bdi_writeback_thread()', so these names are more
consistent.

This patch also amends comments and makes them refer to the forker and bdi
threads as "thread", not "task".

Also, while at it, make 'bdi_add_default_flusher_thread()' declaration use
'static void' instead of 'void static' and make checkpatch.pl happy.

Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Artem Bityutskiy
2010-08-08 00:53:16 +0800
455b28646 writeback: Initial tracing support

Trace queue/sched/exec parts of the writeback loop. This provides
insight into when and why flusher threads are scheduled to run. E.g.,
a sync invocation leaves traces like:

sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0

This also lays the foundation for adding more writeback tracing to
provide deeper insight into the whole writeback path.

The original tracing code is from Jens Axboe, though this version is
a rewrite as a result of the code being traced changing
significantly.

Signed-off-by: Dave Chinner
Signed-off-by: Jens Axboe

Dave Chinner
2010-08-08 00:24:23 +0800