07 Feb, 2014

1 commit

  • During an aio stress test, we observed the following lockdep warning. This
    means AIO+numa_balancing is currently deadlockable.

    The problem is that aio_migratepage disables interrupts, but
    __set_page_dirty_nobuffers unintentionally enables them again.

    Generally, all helper functions should use spin_lock_irqsave() instead of
    spin_lock_irq() because they don't know their callers at all.
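
    To illustrate the point (a generic sketch, not the actual aio/page-dirty
    patch): a helper that may run with interrupts already disabled must save
    and restore the caller's IRQ state instead of unconditionally re-enabling
    interrupts on unlock.

    #include <linux/spinlock.h>

    static void helper_unsafe(spinlock_t *lock)
    {
            spin_lock_irq(lock);            /* disables interrupts... */
            /* ... critical section ... */
            spin_unlock_irq(lock);          /* ...and re-enables them even if the
                                             * caller had them disabled */
    }

    static void helper_safe(spinlock_t *lock)
    {
            unsigned long flags;

            spin_lock_irqsave(lock, flags);        /* remembers caller's IRQ state */
            /* ... critical section ... */
            spin_unlock_irqrestore(lock, flags);   /* restores it unchanged */
    }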

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&ctx->completion_lock)->rlock);
    <Interrupt>
      lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    dump_stack+0x19/0x1b
    print_usage_bug+0x1f7/0x208
    mark_lock+0x21d/0x2a0
    mark_held_locks+0xb9/0x140
    trace_hardirqs_on_caller+0x105/0x1d0
    trace_hardirqs_on+0xd/0x10
    _raw_spin_unlock_irq+0x2c/0x50
    __set_page_dirty_nobuffers+0x8c/0xf0
    migrate_page_copy+0x434/0x540
    aio_migratepage+0xb1/0x140
    move_to_new_page+0x7d/0x230
    migrate_pages+0x5e5/0x700
    migrate_misplaced_page+0xbc/0xf0
    do_numa_page+0x102/0x190
    handle_pte_fault+0x241/0x970
    handle_mm_fault+0x265/0x370
    __do_page_fault+0x172/0x5a0
    do_page_fault+0x1a/0x70
    page_fault+0x28/0x30

    Signed-off-by: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

30 Jan, 2014

2 commits

  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.
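
    A simplified sketch of the resulting baseline (the real
    global_dirtyable_memory() additionally subtracts the dirty balance
    reserve and special-cases highmem):

    x  = global_page_state(NR_FREE_PAGES);
    x += global_page_state(NR_INACTIVE_FILE);
    x += global_page_state(NR_ACTIVE_FILE);
    /* anon pages are no longer part of the dirtyable baseline */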

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Tejun reported stuttering and latency spikes on a system where random
    tasks would enter direct reclaim and get stuck on dirty pages. Around
    50% of memory was occupied by tmpfs backed by an SSD, and another disk
    (rotating) was reading and writing at max speed to shrink a partition.

    : The problem was pretty ridiculous. It's an 8gig machine w/ one ssd and 10k
    : rpm harddrive and I could reliably reproduce constant stuttering every
    : several seconds for as long as buffered IO was going on on the hard drive
    : either with tmpfs occupying somewhere above 4gig or a test program which
    : allocates about the same amount of anon memory. Although swap usage was
    : zero, turning off swap made the problem go away too.
    :
    : The trigger conditions seem quite plausible - high anon memory usage w/
    : heavy buffered IO and swap configured - and it's highly likely that this
    : is happening in the wild too. (this can happen with copying large files
    : to usb sticks too, right?)

    This patch (of 2):

    The dirty_balance_reserve is an approximation of the fraction of free
    pages that the page allocator does not make available for page cache
    allocations. As a result, it has to be taken into account when
    calculating the amount of "dirtyable memory", the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    However, currently the reserve is subtracted from the sum of free and
    reclaimable pages, which is nonsensical and leads to erroneous results
    when the system is dominated by unreclaimable pages and the
    dirty_balance_reserve is bigger than free+reclaimable. In that case, at
    least the already allocated cache should be considered dirtyable.

    Fix the calculation by subtracting the reserve from the amount of free
    pages, then adding the reclaimable pages on top.
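
    The described ordering, roughly (the min() clamp keeps the free-page term
    from wrapping below zero; at this point in the series "reclaimable" still
    includes anon pages, which the commit above then drops):

    x  = global_page_state(NR_FREE_PAGES);
    x -= min(x, dirty_balance_reserve);       /* reserve only eats into free pages */
    x += global_page_state(NR_INACTIVE_FILE)  /* then reclaimable pages on top */
       + global_page_state(NR_ACTIVE_FILE)
       + global_page_state(NR_INACTIVE_ANON)
       + global_page_state(NR_ACTIVE_ANON);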

    [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build]
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

17 Oct, 2013

1 commit

  • Toralf runs trinity on UML/i386. After some time it hangs and the last
    message line is

    BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]

    It's found that pages_dirtied becomes very large. More than 1000000000
    pages in this case:

    period = HZ * pages_dirtied / task_ratelimit;
    BUG_ON(pages_dirtied > 2000000000);
    BUG_ON(pages_dirtied > 1000000000);

    A local debugging check added just before trace_balance_dirty_pages()
    shows where things go wrong:

    + if (pause < 0) {
    +         extern int printf(char *, ...);
    +         printf("ick : pause : %li\n", pause);
    +         printf("ick: pages_dirtied : %lu\n", pages_dirtied);
    +         printf("ick: task_ratelimit: %lu\n", task_ratelimit);
    +         BUG_ON(1);
    + }
      trace_balance_dirty_pages(bdi,

    Since pause is bounded by [min_pause, max_pause], where min_pause is in
    turn bounded by max_pause, it's suspected and demonstrated that the
    max_pause calculation goes wrong:

    ick: pause : -717
    ick: min_pause : -177
    ick: max_pause : -717
    ick: pages_dirtied : 14
    ick: task_ratelimit: 0

    The problem lies in the two "long = unsigned long" assignments in
    bdi_max_pause(), which may go negative if the highest bit is 1, and the
    min_t(long, ...) check fails to protect against the value falling below 0.
    Fix all of them by using "unsigned long" throughout the function.
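
    A small user-space illustration of the bug class (not the kernel code):
    an unsigned long with the top bit set turns negative when assigned to a
    plain long, and a min_t(long, ...)-style comparison then happily keeps
    the negative value.

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long bw = ULONG_MAX - 716;  /* top bit set */
            long t = bw;                         /* typically becomes -717 */
            long pause = t < 200 ? t : 200;      /* "min" keeps the negative value */

            printf("t=%ld pause=%ld\n", t, pause);
            return 0;
    }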

    Signed-off-by: Fengguang Wu
    Reported-by: Toralf Förster
    Tested-by: Toralf Förster
    Reviewed-by: Jan Kara
    Cc: Richard Weinberger
    Cc: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

13 Sep, 2013

1 commit

  • Add memcg routines to count writeback pages; dirty pages will also be
    accounted later.

    After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat
    accounting"), we can use the 'struct page' flag to test page state instead
    of the per-page_cgroup flag. But memcg has a feature to move a page from
    one cgroup to another, so there may be a race between "move" and "page stat
    accounting". In order to avoid the race we have designed a new lock:

    mem_cgroup_begin_update_page_stat()
    modify page information -->(a)
    mem_cgroup_update_page_stat() -->(b)
    mem_cgroup_end_update_page_stat()

    It requires both (a) and (b) (the writeback page accounting) to be
    protected by mem_cgroup_{begin/end}_update_page_stat(). It's a full no-op
    for !CONFIG_MEMCG, an almost no-op if memcg is disabled (but compiled in),
    an rcu read lock in most cases (no task is moving), and spin_lock_irqsave
    on top of that in the slow path.

    There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
    And the lock order is:
      --> memcg->move_lock
        --> mapping->tree_lock
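
    A rough sketch of how the protocol above is meant to wrap the writeback
    accounting (the stat index name and the exact helper used for step (b)
    are assumptions based on the description, not the verbatim patch):

    bool locked;
    unsigned long flags;
    int ret;

    mem_cgroup_begin_update_page_stat(page, &locked, &flags);
    ret = test_set_page_writeback(page);           /* (a) modify page information */
    if (!ret)                                      /* page newly under writeback */
            mem_cgroup_update_page_stat(page,      /* (b) account the state change */
                                        MEM_CGROUP_STAT_WRITEBACK, 1);
    mem_cgroup_end_update_page_stat(page, &locked, &flags);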

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     

12 Sep, 2013

3 commits

  • The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
    unprivileged users) from growing a large number of dirty pages before
    throttling. For such filesystems balance_dirty_pages always checks bdi
    counters against bdi limits. I.e. even if the global "nr_dirty" is under
    "freerun", it's not allowed to skip the bdi checks. The only use case for
    now is fuse: it sets bdi max_ratio to 1% by default and system
    administrators are supposed to expect that this limit won't be exceeded.

    The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
    filesystem may set the flag when it initializes its BDI.
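
    Opting in is meant to be a one-liner in the filesystem's BDI setup,
    along these lines (illustrative; fuse's actual init code may differ):

    bdi->capabilities |= BDI_CAP_STRICTLIMIT;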

    The problematic scenario comes from the fact that nobody pays attention to
    the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
    writeback). The implementation of fuse writeback releases the original page
    (by calling end_page_writeback) almost immediately. A fuse request queued
    for real processing bears a copy of the original page. Hence, if the
    userspace fuse daemon doesn't finalize write requests in a timely manner, an
    aggressive mmap writer can pollute virtually all memory with those temporary
    fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but
    nobody cares.

    To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
    problem" as a shortcut for "a possibility of uncontrolled grow of amount
    of RAM consumed by temporary pages allocated by kernel fuse to process
    writeback".

    The problem was very easy to reproduce. There is a trivial example
    filesystem implementation in the fuse userspace distribution: fusexmp_fh.c.
    I added "sleep(1);" to the write methods, then recompiled and mounted it.
    Then I created a huge file on the mount point and ran a simple program which
    mmap-ed the file to a memory region and wrote data to the region. An
    hour later I observed almost all RAM consumed by fuse writeback. Since
    then some unrelated changes in kernel fuse made it more difficult to
    reproduce, but it is still possible now.

    Putting this theoretical happens-in-the-lab thing aside, there is another
    thing that really hurts real-world (FUSE) users. This is the write-through
    page cache policy FUSE currently uses. I.e. when handling write(2), kernel
    fuse populates page cache and flushes user data to the server
    synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
    ("writeback cache policy") solve the problem, but they also make resolving
    NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying
    a huge file to a fuse mount would result in memory starvation. Miklos,
    the maintainer of FUSE, believes the strictlimit feature is the way to go.

    And finally, putting FUSE topics aside, there is one more use-case for the
    strictlimit feature. Using a slow USB stick (mass storage) in a machine
    with a huge amount of RAM installed is a well-known pain. Let's do some
    simple computations. Assuming 64GB of RAM installed, the existing
    implementation of balance_dirty_pages will start throttling only after
    9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So the command
    "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but a
    subsequent "umount /media/my-usb-storage/" will take more than two hours
    if the effective throughput of the storage is, say, 1MB/sec.

    After inclusion of the strictlimit feature, it will be trivial to add a knob
    (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand,
    manually or via a udev rule. Maybe I'm wrong, but it seems quite natural
    to want to limit the amount of dirty memory for devices we do not fully
    trust (in the sense of sustainable throughput).

    [akpm@linux-foundation.org: fix warning in page-writeback.c]
    Signed-off-by: Maxim Patlasov
    Cc: Jan Kara
    Cc: Miklos Szeredi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxim Patlasov
     
  • This patch is based on KOSAKI's work and I added a little more description;
    please refer to https://lkml.org/lkml/2012/6/14/74.

    I found that the system can enter a state where there are lots of free
    pages in a zone but only order-0 and order-1 pages, which means the zone is
    heavily fragmented. A high-order allocation could then cause a long stall
    (e.g. 60 seconds) in the direct reclaim path, especially in a no-swap and
    no-compaction environment. This problem happened on v3.4, but it seems the
    issue still lives in the current tree. The reason is that
    do_try_to_free_pages enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are still
    not balanced, as kswapd thinks there's little point in trying all over
    again and wants to avoid an infinite loop. Instead it changes the order
    from high-order to order-0 because kswapd thinks order-0 is the most
    important. Look at 73ce02e9 for details. If the watermarks are ok, kswapd
    will go back to sleep and may leave zone->all_unreclaimable = 0. It assumes
    high-order users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it
    means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages forever
    while kswapd sleeps forever, until someone like a watchdog detects this
    and finally kills the process, as described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so that way is racy.
    Thus this patch removes the zone->all_unreclaimable field completely and
    recalculates the zone's reclaimable state every time.
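
    The recalculated state boils down to a heuristic along these lines
    (simplified; the helpers are uninlined per the akpm note below):

    static bool zone_reclaimable(struct zone *zone)
    {
            /* give up once we have scanned the zone's reclaimable
             * pages six times over without making progress */
            return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
    }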

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because it is racy. Commit 929bea7c71 ("vmscan:
    all_unreclaimable() use zone->all_unreclaimable as a name") describes the
    details.

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • This reverts commit 75f7ad8e043d. It was the result of a problem
    observed with a 3.2 kernel and merged in 3.9, while the issue had been
    resolved upstream in 3.3 (commit ab8fabd46f81: "mm: exclude reserved
    pages from dirtyable memory").

    The "reserved pages" are a superset of min_free_kbytes, thus this change
    is redundant and confusing. Revert it.

    Signed-off-by: Johannes Weiner
    Cc: Paul Szabo
    Cc: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. The fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

30 Apr, 2013

1 commit

  • Walking a bio's page mappings has proved problematic, so create a new
    bio flag to indicate that a bio's data needs to be snapshotted in order
    to guarantee stable pages during writeback. Next, for the one user
    (ext3/jbd) of snapshotting, hook all the places where writes can be
    initiated without PG_writeback set, and set BIO_SNAP_STABLE there.

    We must also flag journal "metadata" bios for stable writeout, since
    file data can be written through the journal. Finally, the
    MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
    rid of it.
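
    For the jbd call sites this amounts to passing the flag down through the
    buffer layer, roughly (a hedged sketch; see the _submit_bh() note below):

    _submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);  /* ask the block layer to
                                                      * snapshot the data before
                                                      * writeout */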

    [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
    [akpm@linux-foundation.org: teeny cleanup]
    Signed-off-by: Darrick J. Wong
    Cc: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

01 Mar, 2013

2 commits

  • Pull writeback fixes from Wu Fengguang:
    "Two writeback fixes

    - fix negative (setpoint - dirty) in 32bit archs

    - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()"

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    Negative (setpoint-dirty) in bdi_position_ratio()
    vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them

    Linus Torvalds
     
  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

24 Feb, 2013

1 commit

  • When calculating the amount of dirtyable memory, min_free_kbytes should be
    subtracted because it is not intended for dirty pages.
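
    For reference, the unit conversion involved (min_free_kbytes is in
    kilobytes while the dirtyable total is in pages; this is the usual kernel
    idiom, not necessarily the verbatim hunk):

    x -= min_free_kbytes >> (PAGE_SHIFT - 10);   /* kilobytes -> pages */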

    Addresses http://bugs.debian.org/695182

    [akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
    [akpm@linux-foundation.org: fix min() warning]
    Signed-off-by: Paul Szabo
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Szabo
     

22 Feb, 2013

2 commits

  • This provides a band-aid to provide stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize their page snapshotting thing to one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • Create a helper function to check if a backing device requires stable
    page writes and, if so, performs the necessary wait. Then, make it so
    that all points in the memory manager that handle making pages writable
    use the helper function. This should provide stable page write support
    to most filesystems, while eliminating unnecessary waiting for devices
    that don't require the feature.
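
    The helper ends up looking roughly like this (a simplified rendering of
    the idea; the exact field accesses and naming may differ):

    void wait_for_stable_page(struct page *page)
    {
            struct backing_dev_info *bdi = page_mapping(page)->backing_dev_info;

            if (!bdi_cap_stable_pages_required(bdi))
                    return;                      /* device keeps writes stable itself */
            wait_on_page_writeback(page);        /* otherwise wait out the writeback */
    }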

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own stable page guarantees or they don't block at all.
    The blocking behavior is back to what it was before 3.0 if you don't
    have a disk requiring stable page writes.

    Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
    Operation Count AvgLat MaxLat
    ----------------------------------------
    WriteX 109347 0.028 59.817
    ReadX 347180 0.004 3.391
    Flush 15514 29.828 287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX 105556 0.029 4.273
    ReadX 335004 0.005 4.112
    Flush 14982 30.540 298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, the maximum write latency drops considerably with this
    patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
    similarly, but see the cover letter for those results.

    Signed-off-by: Darrick J. Wong
    Acked-by: Steven Whitehouse
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

24 Jan, 2013

1 commit

  • In bdi_position_ratio(), get difference (setpoint-dirty) right even when
    negative. Both setpoint and dirty are unsigned long, so a negative
    difference was zero-extended rather than sign-extended to s64. This issue
    affects all 32-bit architectures, but does not affect 64-bit architectures,
    where long and s64 are equivalent.

    In this function, dirty is between freerun and limit, and the pseudo-float
    x is in [-1,1], expected to be negative about half the time. With
    zero-extension, instead of a small negative x we obtained a large positive
    one, so bdi_position_ratio() returned garbage.

    Casting the difference to s64 also prevents overflow with left-shift;
    though normally these numbers are small and I never observed a 32-bit
    overflow there.
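
    The fix therefore does the subtraction in the signed 64-bit domain before
    shifting, along these lines (shown for illustration):

    x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
                limit - setpoint + 1);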

    (This patch does not solve the PAE OOM issue.)

    Paul Szabo psz@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/
    School of Mathematics and Statistics University of Sydney Australia

    Reviewed-by: Jan Kara
    Reported-by: Paul Szabo
    Reference: http://bugs.debian.org/695182
    Signed-off-by: Paul Szabo
    Signed-off-by: Fengguang Wu

    paul.szabo@sydney.edu.au
     

14 Jan, 2013

1 commit

  • Add tracepoints for page dirtying, writeback_single_inode start, inode
    dirtying and writeback. For the latter two inode events, a pair of
    events are defined to denote start and end of the operations (the
    starting one has _start suffix and the one w/o suffix happens after
    the operation is complete). These inode ops are FS specific and can
    be non-trivial and having enclosing tracepoints is useful for external
    tracers.

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for the io tracer and userland.

    v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL
    before dereferencing.

    v3: buffer dirtying moved to a block TP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Dec, 2012

1 commit

  • The system uses global_dirtyable_memory() to calculate the number of
    dirtyable pages, i.e. pages that can be allocated to the page cache. A bug
    causes an underflow, making the page count look like a big unsigned
    number. This in turn confuses the dirty writeback throttling into
    aggressively writing back pages as they become dirty (usually 1 page at a
    time). This generally only affects systems with highmem because the
    underflowed count gets subtracted from the global count of dirtyable
    memory.

    The problem was introduced with v3.2-4896-gab8fabd

    Fix is to ensure we don't get an underflowed total of either highmem or
    global dirtyable memory.
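
    The clamping described above, in spirit (illustrative fragments, not the
    verbatim hunks):

    /* per highmem zone: subtract the zone's reserve without wrapping */
    nr_pages -= min(nr_pages, z->dirty_balance_reserve);

    /* global total: same idea for the global reserve */
    x -= min(x, dirty_balance_reserve);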

    Signed-off-by: Sonny Rao
    Signed-off-by: Puneet Kumar
    Acked-by: Johannes Weiner
    Tested-by: Damien Wyart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sonny Rao
     

04 Aug, 2012

1 commit

  • Finally we can kill the 'sync_supers' kernel thread along with the
    '->write_super()' superblock operation because all the users are gone.
    Now every file-system is supposed to self-manage its own superblock and
    its dirty state.

    The nice thing about killing this thread is that it improves power management.
    Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
    every 5 seconds no matter what - even if there were no dirty superblocks and
    even if there were no file-systems using this service (e.g., btrfs and
    journalled ext4 do not need it). So it was wasting power most of the time. And
    because the thread was in the core of the kernel, all systems had to have it.
    So I am quite happy to make it go away.

    Interestingly, this thread is a left-over from the pdflush kernel thread which
    was a self-forking kernel thread responsible for all the write-back in old
    Linux kernels. It was turned into per-block device BDI threads, and
    'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Al Viro

    Artem Bityutskiy
     

06 May, 2012

1 commit

  • This prevents global_dirty_limit from remaining 0 (the initial value)
    for long time, since it's only updated in update_dirty_limit() when
    above the dirty freerun area.

    It will avoid unexpected consequences when some random code use it as a
    convenient approximation of the global dirty threshold.

    Signed-off-by: Fengguang Wu

    Fengguang Wu
     

29 Mar, 2012

1 commit

  • Pull ext4 updates for 3.4 from Ted Ts'o:
    "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

    The changes to export dirty_writeback_interval are from Artem's s_dirt
    cleanup patch series. The same is true of the change to remove the
    s_dirt helper functions which never got used by anyone in-tree. I've
    run these changes by Al Viro, and am carrying them so that Artem can
    more easily fix up the rest of the file systems during the next merge
    window. (Originally we had hoped to remove the use of s_dirt from
    ext4 during this merge window, but his patches had some bugs, so I
    ultimately ended up dropping them from the ext4 tree.)"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
    vfs: remove unused superblock helpers
    mm: export dirty_writeback_interval
    ext4: remove useless s_dirt assignment
    ext4: write superblock only once on unmount
    ext4: do not mark superblock as dirty unnecessarily
    ext4: correct ext4_punch_hole return codes
    ext4: remove restrictive checks for EOFBLOCKS_FL
    ext4: always set then trimmed blocks count into len
    ext4: fix trimmed block count accunting
    ext4: fix start and len arguments handling in ext4_trim_fs()
    ext4: update s_free_{inodes,blocks}_count during online resize
    ext4: change some printk() calls to use ext4_msg() instead
    ext4: avoid output message interleaving in ext4_error_()
    ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
    ext4: add no_printk argument validation, fix fallout
    ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
    ext4: give more helpful error message in ext4_ext_rm_leaf()
    ext4: remove unused code from ext4_ext_map_blocks()
    ext4: rewrite punch hole to use ext4_ext_remove_space()
    jbd2: cleanup journal tail after transaction commit
    ...

    Linus Torvalds
     

22 Mar, 2012

2 commits

  • Export 'dirty_writeback_interval' to make it visible to
    file-systems. We are going to push superblock management down to
    file-systems and get rid of the 'sync_supers' kernel thread completely.

    Signed-off-by: Artem Bityutskiy
    Cc: Al Viro
    Signed-off-by: "Theodore Ts'o"

    Artem Bityutskiy
     
  • When starting a memory hog task, a desktop box w/o swap is found to go
    unresponsive for a long time. It's solely caused by lots of congestion
    waits in throttle_vm_writeout():

    gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
    gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
    Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
    Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000

    The root cause is that the dirty threshold is knocked down a lot by the
    memory hog task. Fix this by using global_dirty_limit, which decreases
    gradually on such events and can guarantee we stay above the (also
    decreasing) nr_dirty while following it down to the new dirty threshold.
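
    The change is essentially to clamp the threshold used by
    throttle_vm_writeout() from below with that slowly-tracking limit,
    something like (a hedged sketch):

    global_dirty_limits(&background_thresh, &dirty_thresh);
    dirty_thresh = hard_dirty_limit(dirty_thresh);   /* i.e. never below
                                                      * global_dirty_limit */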

    Signed-off-by: Fengguang Wu
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Greg Thelen
    Cc: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

11 Jan, 2012

5 commits

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.
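
    The proportionality described here, in simplified form (the real
    zone_dirty_limit() also honours vm_dirty_bytes and per-task flags):

    zone_dirty_limit = zone_dirtyable_memory(zone) * vm_dirty_ratio / 100;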

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.
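
    A call site that knows it is about to dirty the page would allocate along
    these lines (an illustrative example, not a specific hunk from the series):

    page = __page_cache_alloc(mapping_gfp_mask(mapping) | __GFP_WRITE);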

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the amount of pages that
    "spill over" are limited themselves by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

    seconds nr_vmscan_write
    (stddev) min| median| max
    xfs
    vanilla: 549.747( 3.492) 0.000| 0.000| 0.000
    patched: 550.996( 3.802) 0.000| 0.000| 0.000

    fuse-ntfs
    vanilla: 1183.094(53.178) 54349.000| 59341.000| 65163.000
    patched: 558.049(17.914) 0.000| 0.000| 43.000

    btrfs
    vanilla: 573.679(14.015) 156657.000| 460178.000| 606926.000
    patched: 563.365(11.368) 0.000| 0.000| 1362.000

    ext4
    vanilla: 561.197(15.782) 0.000|2725438.000|4143837.000
    patched: 568.806(17.496) 0.000| 0.000| 0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The next patch will introduce per-zone dirty limiting functions in
    addition to the traditional global dirty limiting.

    Rename determine_dirtyable_memory() to global_dirtyable_memory() before
    adding the zone-specific version, and fix up its documentation.

    Also, move the functions to determine the dirtyable memory and the
    function to calculate the dirty limit based on that together so that their
    relationship is more apparent and that they can be commented on as a
    group.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                     |
    |          |  |                     |
    |          |  -- dirty limit old    -- flusher old
    |          |  |
    +----------+ --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.
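
    A hedged sketch of the accounting idea (zone_kept_free_pages() is a
    hypothetical stand-in for "lowmem reserve plus high watermark" of a zone;
    the real bookkeeping lives alongside the watermark calculations):

    unsigned long reserve = 0;
    struct zone *zone;

    for_each_zone(zone)
            reserve += zone_kept_free_pages(zone);   /* hypothetical helper */
    dirty_balance_reserve = reserve;                 /* excluded from dirtyable memory */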

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The tracing ring-buffer used this function briefly, but not anymore.
    Make it local to the writeback code again.

    Also, move the function so that no forward declaration needs to be
    reintroduced.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

18 Dec, 2011

4 commits

  • Add an upper limit to balanced_rate according to the inequality below.
    This filters out some rare but huge singular points, which at least
    enables more readable gnuplot figures.

    When there are N dd dirtiers,

    balanced_dirty_ratelimit = write_bw / N

    So it holds that

    balanced_dirty_ratelimit <= write_bw

    Acked-by: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • This helps to reduce dirty throttling polls and hence CPU overheads.

    bdi->dirty_exceeded typically only helps when suddenly starting 100+
    dd's on a disk, in which case the dd's may need to poll
    balance_dirty_pages() earlier than tsk->nr_dirtied_pause.

    CC: Jan Kara
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • The LKP tests see a big 56% regression for the case fio_mmap_randwrite_64k.
    Shaohua root-caused it to the much smaller dirty pause times and hence the
    much more frequent invocations of the IO-less balance_dirty_pages().
    Since fio_mmap_randwrite_64k effectively contains both reads and writes,
    the more frequent pauses triggered more idling in the cfq IO scheduler.

    The solution is to increase pause time all the way up to the max 200ms
    in this case, which is found to restore most performance. This will help
    reduce CPU overheads in other cases, too.

    Note that I don't expect many performance-critical workloads to use this
    access pattern: the mmap read-on-write is rather inefficient and could
    be avoided by doing normal write syscalls.

    CC: Jan Kara
    CC: Peter Zijlstra
    Reported-by: Li Shaohua
    Tested-by: Li Shaohua
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     
  • Control the pause time and the call intervals to balance_dirty_pages()
    with three parameters:

    1) max_pause, limited by bdi_dirty and MAX_PAUSE

    2) the target pause time, grows with the number of dd tasks
    and is normally limited by max_pause/2

    3) the minimal pause, set to half the target pause;
    it is used to skip short sleeps and accumulate them into bigger ones

    The typical behaviors after patch:

    - if ever task_ratelimit is far below dirty_ratelimit, the pause time
    will remain constant at max_pause and nr_dirtied_pause will be
    fluctuating with task_ratelimit

    - in the normal cases, nr_dirtied_pause will remain stable (keep in the
    same pace with dirty_ratelimit) and the pause time will be fluctuating
    with task_ratelimit

    In summary, someone has to fluctuate with task_ratelimit, because

    task_ratelimit = nr_dirtied_pause / pause

    We normally prefer a stable nr_dirtied_pause, until reaching max_pause.
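
    For a concrete feel of the relation above (numbers made up): with
    task_ratelimit at 1000 pages/s and nr_dirtied_pause at 100 pages, each
    trip into balance_dirty_pages() sleeps for 100 / 1000 s = 100 ms; if
    task_ratelimit halves, either the pause doubles or nr_dirtied_pause is
    halved to keep the pause within [min_pause, max_pause].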

    The notable behavior changes are:

    - in stable workloads, there will no longer be the sudden big trajectory
    switching of nr_dirtied_pause that Peter was concerned about. It will be as
    smooth as dirty_ratelimit and change proportionally with it (as
    always, assuming bdi bandwidth does not fluctuate across 2^N lines,
    otherwise nr_dirtied_pause will show up in 2+ parallel trajectories)

    - in the rare cases when something keeps task_ratelimit far below
    dirty_ratelimit, the smoothness can no longer be retained and
    nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a
    (not that destructive, but still not good) bug where
    dirty_ratelimit got brought down undesirably

    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang