Eric Lee / smarc-fsl-linux-kernel

27 Jul, 2011

1 commit

f01ef569c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
mm: properly reflect task dirty limits in dirty_exceeded logic
writeback: don't busy retry writeback on new/freeing inodes
writeback: scale IO chunk size up to half device bandwidth
writeback: trace global_dirty_state
writeback: introduce max-pause and pass-good dirty limits
writeback: introduce smoothed global dirty limit
writeback: consolidate variable names in balance_dirty_pages()
writeback: show bdi write bandwidth in debugfs
writeback: bdi write bandwidth estimation
writeback: account per-bdi accumulated written pages
writeback: make writeback_control.nr_to_write straight
writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
writeback: trace event writeback_queue_io
writeback: trace event writeback_single_inode
writeback: remove .nonblocking and .encountered_congestion
writeback: remove writeback_control.more_io
writeback: skip balance_dirty_pages() for in-memory fs
writeback: add bdi_dirty_limit() kernel-doc
writeback: avoid extra sync work at enqueue time
writeback: elevate queue_io() into wb_writeback()
...

Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

Linus Torvalds
2011-07-27 01:39:54 +0800

26 Jul, 2011

2 commits

99b12e3d8 writeback: account NR_WRITTEN at IO completion time ... Browse Code »

NR_WRITTEN is now accounted at block IO enqueue time, which is not very
accurate as to common understanding. This moves NR_WRITTEN accounting to
the IO completion time and makes it more consistent with BDI_WRITTEN,
which is used for bandwidth estimation.

Signed-off-by: Wu Fengguang
Cc: Michael Rubin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2011-07-26 11:57:11 +0800
72c478321 mm: remove useless rcu lock-unlock from mapping_tagged() ... Browse Code »

radix_tree_tagged() is lockless - it reads from a member of the raid-tree
root node. It does not require any protection.

Signed-off-by: Konstantin Khlebnikov
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2011-07-26 11:57:11 +0800

24 Jul, 2011

1 commit

bcff25fc8 mm: properly reflect task dirty limits in dirty_exceeded logic ... Browse Code »

We set bdi->dirty_exceeded (and thus ratelimiting code starts to
call balance_dirty_pages() every 8 pages) when a per-bdi limit is
exceeded or global limit is exceeded. But per-bdi limit also depends
on the task. Thus different tasks reach the limit on that bdi at
different levels of dirty pages. The result is that with current code
bdi->dirty_exceeded ping-ponged between 1 and 0 depending on which task
just got into balance_dirty_pages().

We fix the issue by clearing bdi->dirty_exceeded only when per-bdi amount
of dirty pages drops below the threshold (7/8 * bdi_dirty_limit) where task
limits already do not have any influence.

Impact: The end result is, the dirty pages are kept more tightly under
control, with the average number slightly lowered than before. This
reduces the risk to throttle light dirtiers and hence more responsive.
However it may add overheads by enforcing balance_dirty_pages() calls
on every 8 pages when there are 2+ heavy dirtiers.

CC: Andrew Morton
CC: Christoph Hellwig
CC: Dave Chinner
CC: Peter Zijlstra
Signed-off-by: Jan Kara
Signed-off-by: Wu Fengguang

Jan Kara
2011-07-24 10:51:52 +0800

10 Jul, 2011

7 commits

e1cbe2360 writeback: trace global_dirty_state ... Browse Code »

Add trace event balance_dirty_state for showing the global dirty page
counts and thresholds at each global_dirty_limits() invocation. This
will cover the callers throttle_vm_writeout(), over_bground_thresh()
and each balance_dirty_pages() loop.

Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:03 +0800
ffd1f609a writeback: introduce max-pause and pass-good dirty limits ... Browse Code »

The max-pause limit helps to keep the sleep time inside
balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
per task rate limit of 8pages/200ms=160KB/s when dirty exceeded, which
normally is enough to stop dirtiers from continue pushing the dirty
pages high, unless there are a sufficient large number of slow dirtiers
(eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the
write bandwidth of a slow disk and hence accumulating more and more dirty
pages).

The pass-good limit helps to let go of the good bdi's in the presence of
a blocked bdi (ie. NFS server not responding) or slow USB disk which for
some reason build up a large number of initial dirty pages that refuse
to go away anytime soon.

For example, given two bdi's A and B and the initial state

bdi_thresh_A = dirty_thresh / 2
bdi_thresh_B = dirty_thresh / 2
bdi_dirty_A = dirty_thresh / 2
bdi_dirty_B = dirty_thresh / 2

Then A get blocked, after a dozen seconds

bdi_thresh_A = 0
bdi_thresh_B = dirty_thresh
bdi_dirty_A = dirty_thresh / 2
bdi_dirty_B = dirty_thresh / 2

The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
will be effectively throttled by condition (nr_dirty < dirty_thresh).
This has two problems:
(1) we lose the protections for light dirtiers
(2) balance_dirty_pages() effectively becomes IO-less because the
(bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
for IO, but balance_dirty_pages() loses an important way to break
out of the loop which leads to more spread out throttle delays.

DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is,
DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above
example while this patch uses the more conservative value 8 so as not to
surprise people with too many dirty pages than expected.

The max-pause limit won't noticeably impact the speed dirty pages are
knocked down when there is a sudden drop of global/bdi dirty thresholds.
Because the heavy dirties will be throttled below 160KB/s which is slow
enough. It does help to avoid long dirty throttle delays and especially
will make light dirtiers more responsive.

Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:02 +0800
c42843f2f writeback: introduce smoothed global dirty limit ... Browse Code »

The start of a heavy weight application (ie. KVM) may instantly knock
down determine_dirtyable_memory() if the swap is not enabled or full.
global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
dirty thresholds that are _much_ lower than the global/bdi dirty pages.

balance_dirty_pages() will then heavily throttle all dirtiers including
the light ones, until the dirty pages drop below the new dirty thresholds.
During this _deep_ dirty-exceeded state, the system may appear rather
unresponsive to the users.

About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
threshold to heavy dirtiers than light ones, and the dirty pages will
be throttled around the heavy dirtiers' dirty threshold and reasonably
below the light dirtiers' dirty threshold. In this state, only the heavy
dirtiers will be throttled and the dirty pages are carefully controlled
to not exceed the light dirtiers' dirty threshold. However if the
threshold itself suddenly drops below the number of dirty pages, the
light dirtiers will get heavily throttled.

So introduce global_dirty_limit for tracking the global dirty threshold
with policies

- follow downwards slowly
- follow up in one shot

global_dirty_limit can effectively mask out the impact of sudden drop of
dirtyable memory. It will be used in the next patch for two new type of
dirty limits. Note that the new dirty limits are not going to avoid
throttling the light dirtiers, but could limit their sleep time to 200ms.

Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:02 +0800
7762741e3 writeback: consolidate variable names in balance_dirty_pages() ... Browse Code »

Introduce

nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS

in order to simplify many tests in the following patches.

balance_dirty_pages() will eventually care only about the dirty sums
besides nr_writeback.

Acked-by: Jan Kara
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:02 +0800
e98be2d59 writeback: bdi write bandwidth estimation ... Browse Code »

The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.

It tries to update the bandwidth only when disk is fully utilized.
Any inactive period of more than one second will be skipped.

The estimated bandwidth will be reflecting how fast the device can
writeout when _fully utilized_, and won't drop to 0 when it goes idle.
The value will remain constant at disk idle time. At busy write time, if
not considering fluctuations, it will also remain high unless be knocked
down by possible concurrent reads that compete for the disk time and
bandwidth with async writes.

The estimation is not done purely in the flusher because there is no
guarantee for write_cache_pages() to return timely to update bandwidth.

The bdi->avg_write_bandwidth smoothing is very effective for filtering
out sudden spikes, however may be a little biased in long term.

The overheads are low because the bdi bandwidth update only occurs at
200ms intervals.

The 200ms update interval is suitable, because it's not possible to get
the real bandwidth for the instance at all, due to large fluctuations.

The NFS commits can be as large as seconds worth of data. One XFS
completion may be as large as half second worth of data if we are going
to increase the write chunk to half second worth of data. In ext4,
fluctuations with time period of around 5 seconds is observed. And there
is another pattern of irregular periods of up to 20 seconds on SSD tests.

That's why we are not only doing the estimation at 200ms intervals, but
also averaging them over a period of 3 seconds and then go further to do
another level of smoothing in avg_write_bandwidth.

CC: Li Shaohua
CC: Peter Zijlstra
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:01 +0800
f7d2b1ecd writeback: account per-bdi accumulated written pages ... Browse Code »

Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.

Peter Zijlstra :
Move BDI_WRITTEN accounting into __bdi_writeout_inc().
This will cover and fix fuse, which only calls bdi_writeout_inc().

CC: Michael Rubin
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Jan Kara
Signed-off-by: Wu Fengguang

Jan Kara
2011-07-10 13:09:01 +0800
d46db3d58 writeback: make writeback_control.nr_to_write straight ... Browse Code »

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.

Acked-by: Jan Kara
Proposed-by: Christoph Hellwig
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-07-10 13:09:01 +0800

20 Jun, 2011

1 commit

36715cef0 writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr() ... Browse Code »

This helps prevent tmpfs dirtiers from skewing the per-cpu bdp_ratelimits.

Acked-by: Jan Kara
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-06-20 00:25:46 +0800

08 Jun, 2011

3 commits

3efaf0fab writeback: skip balance_dirty_pages() for in-memory fs ... Browse Code »

This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.

Notes about the tmpfs/ramfs behavior changes:

As for 2.6.36 and older kernels, the tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold. This is because both the dirty pages and
threshold will be 0 for tmpfs/ramfs. Hence this test will always
evaluate to TRUE:

dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
|| (nr_reclaimable + nr_writeback >= dirty_thresh);

For 2.6.37, someone complained that the current logic does not allow the
users to set vm.dirty_ratio=0. So commit 4cbec4c8b9 changed the test to

dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
|| (nr_reclaimable + nr_writeback > dirty_thresh);

So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen; once happen, will block many tasks).

I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
for a busy writing server, tmpfs write()s may get livelocked! The
"inadvertent" throttling can hardly bring help to any workload because
of its "either no throttling, or get throttled to death" property.

So based on 2.6.37, this patch won't bring more noticeable changes.

CC: Hugh Dickins
Acked-by: Rik van Riel
Acked-by: Peter Zijlstra
Reviewed-by: Minchan Kim
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-06-08 08:25:22 +0800
6f7186562 writeback: add bdi_dirty_limit() kernel-doc ... Browse Code »

Clarify the bdi_dirty_limit() comment.

Acked-by: Peter Zijlstra
Acked-by: Jan Kara
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-06-08 08:25:22 +0800
6e6938b6d writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage ... Browse Code »
1

sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.

Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.

Although ext4 is tested to no longer livelock with commit f446daaea9,
it may due to some "redirty_tail() after pages_skipped" effect which
is by no means a guarantee for _all_ the file systems.

Note that writeback_inodes_sb() is called by not only sync(), they are
treated the same because the other callers also need livelock prevention.

Impact: It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.

Acked-by: Jan Kara
CC: Dave Chinner
Signed-off-by: Wu Fengguang

Wu Fengguang
2011-06-08 08:25:20 +0800

25 Mar, 2011

1 commit

6c5103890 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block ... Browse Code »

* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...

Fix up conflicts in fs/{aio.c,super.c}

Linus Torvalds
2011-03-25 01:16:26 +0800

23 Mar, 2011

2 commits

cf15b07cf writeback: make mapping->writeback_index to point to the last written page ... Browse Code »

For range-cyclic writeback (e.g. kupdate), the writeback code sets a
continuation point of the next writeback to mapping->writeback_index which
is set the page after the last written page. This happens so that we
evenly write the whole file even if pages in it get continuously
redirtied.

However, in some cases, sequential writer is writing in the middle of the
page and it just redirties the last written page by continuing from that.
For example with an application which uses a file as a big ring buffer we
see:

[1st writeback session]
...
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898514 + 8
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898522 + 8
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898530 + 8
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898538 + 8
flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898546 + 8
kworker/0:1-11 4571: block_rq_issue: 8,0 W 0 () 94898514 + 40
>> flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898554 + 8
>> flush-8:0-2743 4571: block_rq_issue: 8,0 W 0 () 94898554 + 8

[2nd writeback session after 35sec]
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898562 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898570 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898578 + 8
...
kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94898562 + 640
kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94899202 + 72
...
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899962 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899970 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899978 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899986 + 8
flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899994 + 8
kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94899962 + 40
>> flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898554 + 8
>> flush-8:0-2743 4606: block_rq_issue: 8,0 W 0 () 94898554 + 8

So we seeked back to 94898554 after we wrote all the pages at the end of
the file.

This extra seek seems unnecessary. If we continue writeback from the last
written page, we can avoid it and do not cause harm to other cases. The
original intent of even writeout over the whole file is preserved and if
the page does not get redirtied pagevec_lookup_tag() just skips it.

As an exceptional case, when I/O error happens, set done_index to the next
page as the comment in the code suggests.

Tested-by: Wu Fengguang
Signed-off-by: Jun'ichi Nomura
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jun'ichi Nomura
2011-03-23 08:44:09 +0800
278df9f45 mm: reclaim invalidated page ASAP ... Browse Code »

invalidate_mapping_pages is very big hint to reclaimer. It means user
doesn't want to use the page any more. So in order to prevent working set
page eviction, this patch move the page into tail of inactive list by
PG_reclaim.

Please, remember that pages in inactive list are working set as well as
active list. If we don't move pages into inactive list's tail, pages near
by tail of inactive list can be evicted although we have a big clue about
useless pages. It's totally bad.

Now PG_readahead/PG_reclaim is shared. fe3cba17 added ClearPageReclaim
into clear_page_dirty_for_io for preventing fast reclaiming readahead
marker page.

In this series, PG_reclaim is used by invalidated page, too. If VM find
the page is invalidated and it's dirty, it sets PG_reclaim to reclaim
asap. Then, when the dirty page will be writeback,
clear_page_dirty_for_io will clear PG_reclaim unconditionally. It
disturbs this serie's goal.

I think it's okay to clear PG_readahead when the page is dirty, not
writeback time. So this patch moves ClearPageReadahead. In v4,
ClearPageReadahead in set_page_dirty has a problem which is reported by
Steven Barrett. It's due to compound page. Some driver(ex, audio) calls
set_page_dirty with compound page which isn't on LRU. but my patch does
ClearPageRelcaim on compound page. In non-CONFIG_PAGEFLAGS_EXTENDED, it
breaks PageTail flag.

I think it doesn't affect THP and pass my test with THP enabling but Cced
Andrea for double check.

Signed-off-by: Minchan Kim
Reported-by: Steven Barrett
Reviewed-by: Johannes Weiner
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Cc: Wu Fengguang
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-03-23 08:44:04 +0800

17 Mar, 2011

1 commit

9b6096a65 mm: make generic_writepages() use plugging ... Browse Code »

This recovers a performance regression caused by the removal
of the per-device plugging.

Signed-off-by: Jens Axboe

Shaohua Li
2011-03-17 17:47:06 +0800

10 Mar, 2011

1 commit

7eaceacca block: remove per-queue plugging ... Browse Code »
86

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe

Jens Axboe
2011-03-10 15:52:07 +0800

14 Jan, 2011

3 commits

240c879f2 writeback: avoid unnecessary determine_dirtyable_memory call ... Browse Code »

I think determine_dirtyable_memory() is a rather costly function since it
need many atomic reads for gathering zone/global page state. But when we
use vm_dirty_bytes && dirty_background_bytes, we don't need that costly
calculation.

This patch eliminates such unnecessary overhead.

NOTE : newly added if condition might add overhead in normal path.
But it should be _really_ small because anyway we need the
access both vm_dirty_bytes and dirty_background_bytes so it is
likely to hit the cache.

[akpm@linux-foundation.org: fix used-uninitialised warning]
Signed-off-by: Minchan Kim
Cc: Wu Fengguang
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-01-14 09:32:38 +0800
c3f0da631 mm/page-writeback.c: fix __set_page_dirty_no_writeback() return value ... Browse Code »

__set_page_dirty_no_writeback() should return true if it actually
transitioned the page from a clean to dirty state although it seems nobody
uses its return value at present.

Signed-off-by: Bob Liu
Acked-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bob Liu
2011-01-14 09:32:32 +0800
008d23e48 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial ... Browse Code »

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
Documentation/trace/events.txt: Remove obsolete sched_signal_send.
writeback: fix global_dirty_limits comment runtime -> real-time
ppc: fix comment typo singal -> signal
drivers: fix comment typo diable -> disable.
m68k: fix comment typo diable -> disable.
wireless: comment typo fix diable -> disable.
media: comment typo fix diable -> disable.
remove doc for obsolete dynamic-printk kernel-parameter
remove extraneous 'is' from Documentation/iostats.txt
Fix spelling milisec -> ms in snd_ps3 module parameter description
Fix spelling mistakes in comments
Revert conflicting V4L changes
i7core_edac: fix typos in comments
mm/rmap.c: fix comment
sound, ca0106: Fix assignment to 'channel'.
hrtimer: fix a typo in comment
init/Kconfig: fix typo
anon_inodes: fix wrong function name in comment
fix comment typos concerning "consistent"
poll: fix a typo in comment
...

Fix up trivial conflicts in:
- drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
- fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

Linus Torvalds
2011-01-14 02:05:56 +0800

04 Jan, 2011

1 commit

ebd1373d4 writeback: fix global_dirty_limits comment runtime -> real-time ... Browse Code »

Change runtime with real-time

Cc: Wu Fengguang
Signed-off-by: Minchan Kim
Signed-off-by: Jiri Kosina

Minchan Kim
2011-01-04 18:09:29 +0800

23 Dec, 2010

1 commit

d153ba644 writeback: do uninterruptible sleep in balance_dirty_pages() ... Browse Code »

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong. If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop. Plus we
*do* want these processes to appear in D state and to contribute to load
average.

So it should be TASK_UNINTERRUPTIBLE. -- Andrew Morton

Signed-off-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-12-23 11:43:33 +0800

27 Oct, 2010

3 commits

4cbec4c8b writeback: remove the internal 5% low bound on dirty_ratio ... Browse Code »

The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
This is not a user expected behavior. And it's inconsistent with
calc_period_shift(), which uses the plain vm_dirty_ratio value.

Let's remove the internal bound.

At the same time, fix balance_dirty_pages() to work with the
dirty_thresh=0 case. This allows applications to proceed when
dirty+writeback pages are all cleaned.

And ">" fits with the name "exceeded" better than ">=" does. Neil thinks
it is an aesthetic improvement as well as a functional one :)

Signed-off-by: Wu Fengguang
Cc: Jan Kara
Proposed-by: Con Kolivas
Acked-by: Peter Zijlstra
Reviewed-by: Rik van Riel
Reviewed-by: Neil Brown
Reviewed-by: KOSAKI Motohiro
Cc: Michael Rubin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-10-27 07:52:08 +0800
ea941f0e2 writeback: add nr_dirtied and nr_written to /proc/vmstat ... Browse Code »

To help developers and applications gain visibility into writeback
behaviour adding two entries to vm_stat_items and /proc/vmstat. This will
allow us to track the "written" and "dirtied" counts.

# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_written /proc/vmstat
nr_written 3618

Signed-off-by: Michael Rubin
Reviewed-by: Wu Fengguang
Cc: Dave Chinner
Cc: Jens Axboe
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Rubin
2010-10-27 07:52:06 +0800
f629d1c9b mm: add account_page_writeback() ... Browse Code »

To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.

# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_written /proc/vmstat
nr_written 3618

These entries allow user apps to understand writeback behaviour over time
and learn how it is impacting their performance. Currently there is no
way to inspect dirty and writeback speed over time. It's not possible for
nr_dirty/nr_writeback.

These entries are necessary to give visibility into writeback behaviour.
We have /proc/diskstats which lets us understand the io in the block
layer. We have blktrace for more in depth understanding. We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing. There is no way to know how active it is over the whole system, if
it's falling behind or to quantify it's efforts. With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data. Comparing the rates of change between the two allow developers to
see when writeback is not able to keep up with incoming traffic and the
rate of dirty memory being sent to the IO back end. This allows folks to
understand their io workloads and track kernel issues. Non kernel
engineers at Google often use these counters to solve puzzling performance
problems.

Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written

Patch #5 add writeback thresholds to /proc/vmstat

Currently these values are in debugfs. But they should be promoted to
/proc since they are useful for developers who are writing databases
and file servers and are not debugging the kernel.

The output is as below:

# grep threshold /proc/vmstat
nr_pages_dirty_threshold 409111
nr_pages_dirty_background_threshold 818223

This patch:

This allows code outside of the mm core to safely manipulate page
writeback state and not worry about the other accounting. Not using these
routines means that some code will lose track of the accounting and we get
bugs.

Modify nilfs2 to use interface.

Signed-off-by: Michael Rubin
Reviewed-by: KOSAKI Motohiro
Reviewed-by: Wu Fengguang
Cc: KONISHI Ryusuke
Cc: Jiro SEKIBA
Cc: Dave Chinner
Cc: Jens Axboe
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Rubin
2010-10-27 07:52:06 +0800

29 Aug, 2010

1 commit

997396a73 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix get_ticket_handler() error handling
ceph: don't BUG on ENOMEM during mds reconnect
ceph: ceph_mdsc_build_path() returns an ERR_PTR
ceph: Fix warnings
ceph: ceph_get_inode() returns an ERR_PTR
ceph: initialize fields on new dentry_infos
ceph: maintain i_head_snapc when any caps are dirty, not just for data
ceph: fix osd request lru adjustment when sending request
ceph: don't improperly set dir complete when holding EXCL cap
mm: exporting account_page_dirty
ceph: direct requests in snapped namespace based on nonsnap parent
ceph: queue cap snap writeback for realm children on snap update
ceph: include dirty xattrs state in snapped caps
ceph: fix xattr cap writeback
ceph: fix multiple mds session shutdown

Linus Torvalds
2010-08-29 05:07:20 +0800

24 Aug, 2010

1 commit

546a19242 writeback: write_cache_pages doesn't terminate at nr_to_write <= 0 ... Browse Code »

I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
been. Enabling writeback tracing showed:

flush-253:16-8516 [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
flush-253:16-8516 [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
flush-253:16-8516 [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
flush-253:16-8516 [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
flush-253:16-8516 [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0

Writeback is not terminating in background writeback if ->writepage is
returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
writeback on XFS.

Fix the write_cache_pages loop to terminate correctly when this situation
occurs and so prevent this sub-optimal background writeback pattern. This
improves sustained sequential buffered write performance from around
250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.

Cc:
Signed-off-by: Dave Chinner
Reviewed-by: Wu Fengguang
Reviewed-by: Christoph Hellwig

Dave Chinner
2010-08-24 09:44:34 +0800

23 Aug, 2010

1 commit

679ceace8 mm: exporting account_page_dirty ... Browse Code »

This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.

Signed-off-by: Michael Rubin
Signed-off-by: Sage Weil

Michael Rubin
2010-08-23 06:16:51 +0800

21 Aug, 2010

1 commit

d5ed3a4af lib/radix-tree.c: fix overflow in radix_tree_range_tag_if_tagged() ... Browse Code »

When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows
index and tree traversal code goes astray reading memory until it hits
unreadable memory. Check for overflow and exit in that case.

Signed-off-by: Jan Kara
Cc: Christoph Hellwig
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2010-08-21 00:34:55 +0800

15 Aug, 2010

1 commit

03ab450f0 mm/page-writeback: fix non-kernel-doc function comments ... Browse Code »

Remove leading /** from non-kernel-doc function comments to prevent
kernel-doc warnings.

Signed-off-by: Randy Dunlap
Signed-off-by: Linus Torvalds

Randy Dunlap
2010-08-15 07:20:59 +0800

12 Aug, 2010

4 commits

1babe1838 writeback: add comment to the dirty limit functions ... Browse Code »

Document global_dirty_limits() and bdi_dirty_limit().

Signed-off-by: Wu Fengguang
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Jens Axboe
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-08-12 23:43:30 +0800
16c4042f0 writeback: avoid unnecessary calculation of bdi dirty thresholds ... Browse Code »

Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
that the latter can be avoided when under global dirty background
threshold (which is the normal state for most systems).

Signed-off-by: Wu Fengguang
Cc: Peter Zijlstra
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-08-12 23:43:29 +0800
e50e37201 writeback: balance_dirty_pages(): reduce calls to global_page_state ... Browse Code »

Reducing the number of times balance_dirty_pages calls global_page_state
reduces the cache references and so improves write performance on a
variety of workloads.

'perf stats' of simple fio write tests shows the reduction in cache
access. Where the test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2
with 3Gb memory (dirty_threshold approx 600 Mb) running each test 10
times, dropping the fasted & slowest values then taking the average &
standard deviation

average (s.d.) in millions (10^6)
2.6.31-rc8 648.6 (14.6)
+patch 620.1 (16.5)

Achieving this reduction is by dropping clip_bdi_dirty_limit as it rereads
the counters to apply the dirty_threshold and moving this check up into
balance_dirty_pages where it has already read the counters.

Also by rearrange the for loop to only contain one copy of the limit tests
allows the pdflush test after the loop to use the local copies of the
counters rather than rereading them.

In the common case with no throttling it now calls global_page_state 5
fewer times and bdi_stat 2 fewer.

Fengguang:

This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to
avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
accurate we don't need to do routinely clip. A simple dirty limit check
would be enough.

The check is necessary because, in principle we should throttle everything
calling balance_dirty_pages() when we're over the total limit, as said by
Peter.

We now set and clear dirty_exceeded not only based on bdi dirty limits,
but also on the global dirty limit. The global limit check is added in
place of clip_bdi_dirty_limit() for safety and not intended as a behavior
change. The bdi limits should be tight enough to keep all dirty pages
under the global limit at most time; occasional small exceeding should be
OK though. The change makes the logic more obvious: the global limit is
the ultimate goal and shall be always imposed.

We may now start background writeback work based on outdated conditions.
That's safe because the bdi flush thread will (and have to) double check
the states. It reduces overall overheads because the test based on old
states still have good chance to be right.

[akpm@linux-foundation.org] fix uninitialized dirty_exceeded
Signed-off-by: Richard Kennedy
Signed-off-by: Wu Fengguang
Cc: Jan Kara
Acked-by: Peter Zijlstra
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-08-12 23:43:29 +0800
3c111a071 mm: fix fatal kernel-doc error ... Browse Code »

Fix a fatal kernel-doc error due to a #define coming between a function's
kernel-doc notation and the function signature. (kernel-doc cannot handle
this)

Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2010-08-12 23:43:29 +0800

11 Aug, 2010

1 commit

2f9e825d3 Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block ... Browse Code »

* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...

Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.

Linus Torvalds
2010-08-11 06:22:42 +0800

10 Aug, 2010

1 commit

f446daaea mm: implement writeback livelock avoidance using page tagging ... Browse Code »
2

We try to avoid livelocks of writeback when some steadily creates dirty
pages in a mapping we are writing out. For memory-cleaning writeback,
using nr_to_write works reasonably well but we cannot really use it for
data integrity writeback. This patch tries to solve the problem.

The idea is simple: Tag all pages that should be written back with a
special tag (TOWRITE) in the radix tree. This can be done rather quickly
and thus livelocks should not happen in practice. Then we start doing the
hard work of locking pages and sending them to disk only for those pages
that have TOWRITE tag set.

Note: Adding new radix tree tag grows radix tree node from 288 to 296
bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same (13 and 7
respectively).

Signed-off-by: Jan Kara
Cc: Dave Chinner
Cc: Nick Piggin
Cc: Chris Mason
Cc: Theodore Ts'o
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2010-08-10 11:44:59 +0800

08 Aug, 2010

1 commit

9e094383b writeback: Add tracing to write_cache_pages ... Browse Code »

Add a trace event to the ->writepage loop in write_cache_pages to give
visibility into how the ->writepage call is changing variables within the
writeback control structure. Of most interest is how wbc->nr_to_write changes
from call to call, especially with filesystems that write multiple pages
in ->writepage.

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Dave Chinner
2010-08-08 00:24:26 +0800