30 Apr, 2008
6 commits
-
Fuse will use temporary buffers to write back dirty data from memory mappings
(normal writes are done synchronously). This is needed, because there cannot
be any guarantee about the time in which a write will complete.

By using temporary buffers, from the MM's point of view the page is written
back immediately. If the writeout was due to memory pressure, this
effectively migrates data from a full zone to a less full zone.

This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
as temporary buffers.

[Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
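As a rough illustration of what such accounting looks like (the helper names below are made up for this sketch; only NR_WRITEBACK_TEMP and the zone-state accessors are from the patch):

    #include <linux/mm.h>
    #include <linux/vmstat.h>

    /* Sketch: bump the counter while a temporary page stands in for the
     * dirty mmap page, and drop it once the real write has completed. */
    static void temp_writeback_start(struct page *tmp_page)
    {
            inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
    }

    static void temp_writeback_done(struct page *tmp_page)
    {
            dec_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
    }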
Signed-off-by: Miklos Szeredi
Cc: Christoph Lameter
Signed-off-by: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Fuse needs this for writable mmap support.
Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().

Misc cleanups:
- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almost all users will want to have them together

Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of
the global dirty threshold allocated to this bdi.

[mszeredi@suse.cz]
- fix parsing in max_ratio_store().
- export bdi_set_max_ratio() to modules
- limit bdi_dirty with bdi->max_ratio
- document new sysfs attribute

Signed-off-by: Peter Zijlstra
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Under normal circumstances each device is given a part of the total write-back
cache that relates to its current avg writeout speed in relation to the other
devices.

min_ratio - allows one to assign a minimum portion of the write-back cache to
a particular device. This is useful in situations where you might want to
provide a minimum QoS. (One request for this feature came from flash based
storage people who wanted to avoid writing out at all costs - they of course
needed some pdflush hacks as well)

max_ratio - allows one to assign a maximum portion of the dirty limit to a
particular device. This is useful in situations where you want to avoid one
device taking all or most of the write-back cache. E.g. an NFS mount that is
prone to get stuck, or a FUSE mount which you don't trust to play fair.

Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of
the global dirty threshold allocated to this bdi.

[mszeredi@suse.cz]
- fix parsing in min_ratio_store()
- document new sysfs attribute

Signed-off-by: Peter Zijlstra
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
This allows us to see and set the various BDI specific variables.

In particular this properly exposes the read-ahead window for all relevant
users, and /sys/block//queue/read_ahead_kb should be deprecated.

With patient help from Kay Sievers and Greg KH.
[mszeredi@suse.cz]
- split off NFS and FUSE changes into separate patches
- document new sysfs attributes under Documentation/ABI
- do bdi_class_init as a core_initcall, otherwise the "default" BDI
won't be initialized
- remove bdi_init_fmt macro, it's not used very much

[akpm@linux-foundation.org: fix ia64 warning]
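For illustration, the new attributes read like any other sysfs file; a minimal userspace sketch, assuming the "default" bdi entry and its read_ahead_kb attribute are present on the running kernel:

    #include <stdio.h>

    int main(void)
    {
            /* Read the read-ahead window of the default BDI from sysfs. */
            FILE *f = fopen("/sys/class/bdi/default/read_ahead_kb", "r");
            char buf[64];

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            if (fgets(buf, sizeof(buf), f))
                    printf("default read_ahead_kb: %s", buf);
            fclose(f);
            return 0;
    }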
Signed-off-by: Peter Zijlstra
Cc: Kay Sievers
Acked-by: Greg KH
Cc: Trond Myklebust
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Feb, 2008
4 commits
-
After making dirty a 100M file, the normal behavior is to start the
writeback for all data after a 30s delay. But sometimes the following
happens instead:

- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M

Some analysis shows that the internal io dispatch queues go like this:
   s_io            s_more_io
   -------------------------
1) 100M,1K         0
2) 1K              96M
3) 0               96M

1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write 0, no more writes (BUG)
nr_to_write > 0 in (3) fools the upper layer into thinking that all data have
been written out. The big dirty file is actually still sitting in
s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io
becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
may starve newly expired inodes in s_dirty. It is also not an option to
draw inodes from both s_more_io and s_dirty, and let the loop go on: this
might lead to live locks, and might also starve other superblocks in sync
time (well, kupdate may still starve some superblocks, that's another bug).

We have to return when a full scan of s_io completes. So nr_to_write > 0
does not necessarily mean that "all data are written". This patch
introduces a flag writeback_control.more_io to indicate that more io should
be done. With it the big dirty file no longer has to wait for the next
kupdate invocation 5s later.

In sync_sb_inodes() we only set more_io on super_blocks we actually
visited. This avoids the interaction between two pdflush daemons.

Also in __sync_single_inode() we don't blindly keep requeuing the io if the
filesystem cannot progress. Failing to do so may lead to 100% iowait.

Tested-by: Mike Snitzer
Signed-off-by: Fengguang Wu
Cc: Michael Rubin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
fastcall is always defined to be empty, remove it
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add vm.highmem_is_dirtyable toggle
A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
approximately 2Gb size which contains a hash format that is written
randomly by the dbclean process. On 2.6.16 this process took a few
minutes. With lowmem only accounting of dirty ratios, this takes about 12
hours of 100% disk IO, all random writes.

Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add the highmem back to the total available memory count.

[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
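A hedged sketch of the effect on the dirtyable-memory calculation (the function and counter names match the mm/page-writeback.c of that era; the body here is simplified):

    /* Sketch: with highmem_is_dirtyable == 0 (the default), highmem is
     * subtracted from the pool that the dirty ratios are applied to. */
    static unsigned long determine_dirtyable_memory(void)
    {
            unsigned long x;

            x = global_page_state(NR_FREE_PAGES)
                    + global_page_state(NR_INACTIVE)
                    + global_page_state(NR_ACTIVE);

            if (!vm_highmem_is_dirtyable)
                    x -= highmem_dirtyable_memory(x);

            return x + 1;   /* never return 0 */
    }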
Signed-off-by: Bron Gondwana
Cc: Ethan Solomita
Cc: Peter Zijlstra
Cc: WU Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
task_dirty_limit() can become static.
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Jan, 2008
1 commit
-
This reverts commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, as
requested by Fengguang Wu. It's not quite fully baked yet, and while
there are patches around to fix the problems it caused, they should get
more testing. Says Fengguang: "I'll resend them both for -mm later on,
in a more complete patchset".See
http://bugzilla.kernel.org/show_bug.cgi?id=9738
for some of this discussion.
Requested-by: Fengguang Wu
Cc: Andrew Morton
Cc: Peter Zijlstra
Signed-off-by: Linus Torvalds
16 Nov, 2007
1 commit
-
This code harks back to the days when we didn't count dirty mapped
pages, which led us to try to balance the number of dirty unmapped pages
by how much unmapped memory there was in the system.

That makes no sense any more, since now the dirty counts include the
mapped pages. Not to mention that the math doesn't work with HIGHMEM
machines anyway, and causes the unmapped_ratio to potentially turn
negative (which we do catch thanks to clamping it at a minimum value,
but I mention that as an indication of how broken the code is).

The code also was written at a time when the default dirty ratio was
much larger, and the unmapped_ratio logic effectively capped that large
dirty ratio a bit. Again, we've since lowered the dirty ratio rather
aggressively, further lessening the point of that code.

Acked-by: Peter Zijlstra
Signed-off-by: Linus Torvalds
15 Nov, 2007
1 commit
-
We allow violation of bdi limits if there is a lot of room on the system.
Once we hit half the total limit we start enforcing bdi limits and bdi
ramp-up should happen. Doing it this way avoids many small writeouts on an
otherwise idle system and should also speed up the ramp-up. (A hedged sketch
of the check follows this entry.)

Signed-off-by: Peter Zijlstra
Reviewed-by: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
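A hedged sketch of the check described above, roughly as it sits in balance_dirty_pages() (variable names taken from that function; the surrounding loop is omitted):

    /* Don't enforce the per-bdi limits while the system as a whole is
     * still below half of the combined background + dirty threshold. */
    if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
                    (background_thresh + dirty_thresh) / 2)
            break;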
20 Oct, 2007
1 commit
-
Spelling fixes in mm/.
Signed-off-by: Simon Arlott
Signed-off-by: Adrian Bunk
17 Oct, 2007
10 commits
-
We don't want to introduce pointless delays in throttle_vm_writeout() when
the writeback limits are not yet exceeded, do we?

Cc: Nick Piggin
Cc: OGAWA Hirofumi
Cc: Kumar Gala
Cc: Pete Zaitcev
Cc: Greg KH
Reviewed-by: Rik van Riel
Signed-off-by: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.

Also document the various bits and change their order from historical to
logical.

[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel
Cc: Dave Kleikamp
Cc: David Chinner
Cc: Anton Altaparmakov
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
After making dirty a 100M file, the normal behavior is to start the writeback
for all data after a 30s delay. But sometimes the following happens instead:

- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M

Some analysis shows that the internal io dispatch queues go like this:
   s_io            s_more_io
   -------------------------
1) 100M,1K         0
2) 1K              96M
3) 0               96M

1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write 0, no more writes (BUG)

nr_to_write > 0 in (3) fools the upper layer into thinking that all data have been
written out. The big dirty file is actually still sitting in s_more_io. We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty, and let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time (well, kupdate may still
starve some superblocks, that's another bug).

We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data are written". This patch introduces a
flag writeback_control.more_io to indicate this situation. With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.

Cc: David Chinner
Cc: Ken Chen
Signed-off-by: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This is a writeback-internal marker but we're propagating it all the way back
to userspace!

Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Based on ideas of Andrew:
http://marc.info/?l=linux-kernel&m=102912915020543&w=2

Scale the bdi dirty limit inversely with the task's dirty rate.
This makes heavy writers have a lower dirty limit than the occasional writer.

Andrea proposed something similar:
http://lwn.net/Articles/152277/

The main disadvantage to his patch is that he uses an unrelated quantity to
measure time, which leaves him with a workload dependent tunable. Other than
that the two approaches appear quite similar.

[akpm@linux-foundation.org: fix warning]
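A hedged sketch of the scaling (task_dirties_fraction() is the per-task proportion helper added by the patch; the exact shift and clamp here are illustrative):

    /* Shave up to 1/8th off the dirty limit, in proportion to this task's
     * share of recently dirtied pages; never go below half the limit. */
    static void task_dirty_limit(struct task_struct *tsk, long *pdirty)
    {
            long numerator, denominator;
            long dirty = *pdirty;
            u64 inv = dirty >> 3;

            task_dirties_fraction(tsk, &numerator, &denominator);
            inv *= numerator;
            do_div(inv, denominator);

            dirty -= inv;
            if (dirty < *pdirty / 2)
                    dirty = *pdirty / 2;

            *pdirty = dirty;
    }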
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Scale writeback cache per backing device, proportional to its writeout speed.
By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:

- mutual interference starvation (for any number of BDIs);
- deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the BDIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.

[akpm@linux-foundation.org: fix warnings]
[hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
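A hedged sketch of carving a per-BDI share out of the global threshold (bdi_writeout_fraction() is the floating-proportion helper from the patch; the wrapper function here is illustrative, not the exact kernel code):

    static unsigned long bdi_share_of_dirty_limit(struct backing_dev_info *bdi,
                                                  unsigned long dirty_thresh)
    {
            u64 bdi_dirty = dirty_thresh;
            long numerator, denominator;

            /* fraction of recent writeout completions done by this BDI */
            bdi_writeout_fraction(bdi, &numerator, &denominator);

            bdi_dirty *= numerator;
            do_div(bdi_dirty, denominator);

            return bdi_dirty;
    }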
Signed-off-by: Peter Zijlstra
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Count per BDI writeback pages.
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.
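These two patches add per-BDI counterparts of the global counters; a hedged sketch of one of the hooks (the bdi_stat helpers and item names are from the series, the surrounding function is illustrative and heavily simplified):

    /* Sketch: account a newly dirtied page against its BDI as well as the
     * global zone counter. */
    static void account_page_dirtied_sketch(struct page *page,
                                            struct address_space *mapping)
    {
            __inc_zone_page_state(page, NR_FILE_DIRTY);
            __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    }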
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Here's a cut at fixing up uses of the online node map in generic code.
mm/shmem.c:shmem_parse_mpol()
Ensure nodelist is subset of nodes with memory.
Use node_states[N_HIGH_MEMORY] as default for missing
nodelist for interleave policy.

mm/shmem.c:shmem_fill_super()
initialize policy_nodes to node_states[N_HIGH_MEMORY]
mm/page-writeback.c:highmem_dirtyable_memory()
sum over nodes with memory
mm/page_alloc.c:zlc_setup()
allowednodes - use nodes with memory.
mm/page_alloc.c:default_zonelist_order()
average over nodes with memory.
mm/page_alloc.c:find_next_best_node()
skip nodes w/o memory.
N_HIGH_MEMORY state mask may not be initialized at this time,
unless we want to depend on early_calculate_totalpages() [see
below]. Will ZONE_MOVABLE ever be configurable?

mm/page_alloc.c:find_zone_movable_pfns_for_nodes()
spread kernelcore over nodes with memory.
This required calling early_calculate_totalpages()
unconditionally, and populating N_HIGH_MEMORY node
state therein from nodes in the early_node_map[].
If we can depend on this, we can eliminate the
population of N_HIGH_MEMORY mask from __build_all_zonelists()
and use the N_HIGH_MEMORY mask in find_next_best_node().

mm/mempolicy.c:mpol_check_policy()
Ensure nodes specified for policy are subset of
nodes with memory.

[akpm@linux-foundation.org: fix warnings]
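For the page-writeback part, a hedged sketch of "sum over nodes with memory" in highmem_dirtyable_memory() (counter names are of that era; the real function is only built under CONFIG_HIGHMEM):

    static unsigned long highmem_dirtyable_memory(unsigned long total)
    {
            unsigned long x = 0;
            int node;

            for_each_node_state(node, N_HIGH_MEMORY) {
                    struct zone *z =
                            &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];

                    x += zone_page_state(z, NR_FREE_PAGES)
                            + zone_page_state(z, NR_INACTIVE)
                            + zone_page_state(z, NR_ACTIVE);
            }
            /* the highmem share can never exceed total memory */
            return min(x, total);
    }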
Signed-off-by: Lee Schermerhorn
Acked-by: Christoph Lameter
Cc: Shaohua Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Probing pages and radix_tree_tagged are lockless operations with the lockless
radix-tree. Convert these users to RCU locking rather than using tree_lock.
(A sketch of such a conversion follows this entry.)

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
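A sketch of what such a conversion looks like for mapping_tagged(), assuming it is one of the converted users (the read-side tree_lock is replaced by an RCU read-side critical section):

    int mapping_tagged(struct address_space *mapping, int tag)
    {
            int ret;

            rcu_read_lock();
            ret = radix_tree_tagged(&mapping->page_tree, tag);
            rcu_read_unlock();
            return ret;
    }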
09 Oct, 2007
1 commit
-
All the current page_mkwrite() implementations also set the page dirty, which
results in the set_page_dirty_balance() call _not_ balancing, because the
page is already found dirty.

This allows us to dirty a _lot_ of pages without ever hitting
balance_dirty_pages(). Not good (tm).

Force a balance call if ->page_mkwrite() was successful.
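A hedged sketch of the resulting helper (signature per the description above; the body is simplified):

    /* Balance even if set_page_dirty() found the page already dirty,
     * as long as ->page_mkwrite() succeeded. */
    void set_page_dirty_balance(struct page *page, int page_mkwrite)
    {
            if (set_page_dirty(page) || page_mkwrite) {
                    struct address_space *mapping = page_mapping(page);

                    if (mapping)
                            balance_dirty_pages_ratelimited(mapping);
            }
    }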
Signed-off-by: Peter Zijlstra
Signed-off-by: Linus Torvalds
20 Jul, 2007
3 commits
-
page-writeback accounting is presently performed in the page-flags macros.
This is inconsistent and a bit ugly and makes it awkward to implement
per-backing_dev under-writeback page accounting.

So move this accounting down to the callsite(s).
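A hedged sketch of what the callsite accounting looks like afterwards (locking and radix-tree tag handling omitted):

    int test_clear_page_writeback(struct page *page)
    {
            int ret;

            ret = TestClearPageWriteback(page);
            if (ret)
                    dec_zone_page_state(page, NR_WRITEBACK);
            return ret;
    }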
Acked-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Share the same page flag bit for PG_readahead and PG_reclaim.
One is used only on file reads, another is only for emergency writes. One
is used mostly for fresh/young pages, another is for old pages.

Combinations of possible interactions are:
a) clear PG_reclaim => implicit clear of PG_readahead
it will delay an asynchronous readahead into a synchronous one
it actually does _good_ for readahead:
the pages will be reclaimed soon, it's readahead thrashing!
in this case, synchronous readahead makes more sense.

b) clear PG_readahead => implicit clear of PG_reclaim
one (and only one) page will not be reclaimed in time
it can be avoided by checking PageWriteback(page) in readahead first

c) set PG_reclaim => implicit set of PG_readahead
will confuse readahead and make it restart the size rampup process
it's a trivial problem, and can mostly be avoided by checking
PageWriteback(page) first in readahead

d) set PG_readahead => implicit set of PG_reclaim
PG_readahead will never be set on already cached pages.
PG_reclaim will always be cleared on dirtying a page.
so not a problem.

In summary,
a) we get better behavior
b,d) possible interactions can be avoided
c) racy condition exists that might affect readahead, but the chance
is _really_ low, and the hurt on readahead is trivial.

Compound pages also use PG_reclaim, but for now they do not interact with
reclaim/readahead code.

Signed-off-by: Fengguang Wu
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Fix msync data loss and (less importantly) dirty page accounting
inaccuracies due to the race remaining in clear_page_dirty_for_io().

The deleted comment explains what the race was, and the added comments
explain how it is fixed.

Signed-off-by: Nick Piggin
Acked-by: Linus Torvalds
Cc: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Jul, 2007
1 commit
-
It is a bug to set a page dirty if it is not uptodate unless it has
buffers. If the page has buffers, then the page may be dirty (some buffers
dirty) but not uptodate (some buffers not uptodate). The exception to this
rule is if the set_page_dirty caller is racing with truncate or invalidate.

A buffer can not be set dirty if it is not uptodate.
If either of these situations occurs, it indicates there could be some data
loss problem. Some of these warnings could be harmless ones where the
page or buffer is set uptodate immediately after it is dirtied; however, we
should fix those up, and enforce this ordering.

Bring the order of operations for truncate into line with those of
invalidate. This will prevent a page from being able to go !uptodate while
we're holding the tree_lock, which is probably a good thing anyway.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jul, 2007
1 commit
-
Repair indenting bustage.
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 May, 2007
1 commit
-
Clean up massive code duplication between mpage_writepages() and
generic_writepages().

The new generic function, write_cache_pages(), takes a function pointer
argument, which will be called for each page to be written.

Maybe cifs_writepages() too can use this infrastructure, but I'm not
touching that with a ten-foot pole.

The upcoming page writeback support in fuse will also want this.
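The resulting interface, roughly (the writepage_t callback receives each dirty page plus the caller's private data; __writepage is the trivial wrapper that generic_writepages() passes in):

    typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
                               void *data);

    int write_cache_pages(struct address_space *mapping,
                          struct writeback_control *wbc,
                          writepage_t writepage, void *data);

    int generic_writepages(struct address_space *mapping,
                           struct writeback_control *wbc)
    {
            /* deal with chardevs and other special files */
            if (!mapping->a_ops->writepage)
                    return 0;

            return write_cache_pages(mapping, wbc, __writepage, mapping);
    }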
Signed-off-by: Miklos Szeredi
Acked-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 May, 2007
1 commit
-
Cleanup: setting an outstanding error on a mapping was open coded too many
times. Factor it out in mapping_set_error(). (A sketch of the helper follows
this entry.)

Signed-off-by: Guillaume Chazarain
Cc: Steven Whitehouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
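The helper itself is tiny; roughly:

    static inline void mapping_set_error(struct address_space *mapping, int error)
    {
            if (unlikely(error)) {
                    if (error == -ENOSPC)
                            set_bit(AS_ENOSPC, &mapping->flags);
                    else
                            set_bit(AS_EIO, &mapping->flags);
            }
    }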
08 May, 2007
1 commit
-
We can use the global ZVC counters to establish the exact size of the LRU
and the free pages. This allows a more accurate determination of the dirty
ratio.

This patch will fix the broken ratio calculations if large amounts of
memory are allocated to huge pages or other consumers that do not put the
pages on to the LRU.

Notes:
- I did not add NR_SLAB_RECLAIMABLE to the calculation of the
dirtyable pages. Those may be reclaimable but they are at this
point not dirtyable. If NR_SLAB_RECLAIMABLE would be considered
then a huge number of reclaimable pages would stop writeback
from occurring.

- This patch used to be in mm as the last one in a series of patches.
It was removed when Linus updated the treatment of highmem because
there was a conflict. I updated the patch to follow Linus' approach.
This patch is needed to fulfill the claims made in the beginning of the
patchset that is now in Linus' tree.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Apr, 2007
1 commit
-
Do this really early in the 2.6.22-rc series, so that we'll get
feedback. And don't change by half measures. Just cut the default
dirty limit to a quarter of what it was, and see if anybody even
notices.

Signed-off-by: Linus Torvalds
02 Mar, 2007
1 commit
-
throttle_vm_writeout() is designed to wait for the dirty levels to subside.
But if the caller holds IO or FS locks, we might be holding up that writeout.

So change it to take a single nap to give other devices a chance to clean some
memory, then return.

Cc: Nick Piggin
Cc: OGAWA Hirofumi
Cc: Kumar Gala
Cc: Pete Zaitcev
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Feb, 2007
3 commits
-
Change a hard-coded constant 0 to the symbolic equivalent NOTIFY_DONE in
the ratelimit_handler() CPU notifier handler function.

Signed-off-by: Paul E. McKenney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
source files, including:

* make multi-line initial descriptions single line
* denote some function names, constants and structs as such
* change erroneous opening '/*' to '/**' in a few places
* reword some text for clarity

Signed-off-by: Robert P. J. Day
Cc: "Randy.Dunlap"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
shmem backed file does not have page writeback, nor does it participate in
backing device's dirty or writeback accounting. So using generic
__set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit
overkill. It unnecessarily prolongs shm unmap latency.

For example, on a densely populated large shm segment (several GBs), the
unmapping operation becomes painfully long. Because at unmap, kernel
transfers dirty bit in PTE into page struct and to the radix tree tag. The
operation of tagging the radix tree is particularly expensive because it
has to traverse the tree from the root to the leaf node on every dirty
page. What's bothersome is that the radix tree tag is used for page write back.
However, shmem is memory backed and there is no page write back for such
file system. And in the end, we spend all that time tagging radix tree and
none of that fancy tagging will be used. So let's simplify it by introducing
a new aops, __set_page_dirty_no_writeback, and this will speed up shm unmap.
(A hedged sketch of the new method follows this entry.)

Signed-off-by: Ken Chen
Cc: Peter Zijlstra
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
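A hedged sketch of the new method (the point is that it only sets the page flag, doing neither radix-tree tagging nor dirty accounting):

    static int __set_page_dirty_no_writeback(struct page *page)
    {
            if (!PageDirty(page))
                    return !TestSetPageDirty(page);
            return 0;
    }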
30 Jan, 2007
1 commit
-
This makes balance_dirty_pages() always base its calculations on the
amount of non-highmem memory in the machine, rather than try to base it
on total memory and then falling back on non-highmem memory if the
mapping it was writing wasn't highmem capable.

This not only fixes a situation where two different writers can have
wildly different notions about what is a "balanced" dirty state, but it
also means that people with highmem machines don't run into an OOM
situation when regular memory fills up with dirty pages.

We used to try to handle the latter case by scaling down the dirty_ratio
if the machine had a lot of highmem pages in page_writeback_init(), but
it wasn't aggressive enough for some situations, and since basing the
dirty ratio on highmem memory was broken in the first place, let's just
stop doing so.

(A variation of this theme fixed Justin Piszcz's OOM problem when
copying an 18GB file on a RAID setup).

Acked-by: Nick Piggin
Cc: Justin Piszcz
Cc: Andrew Morton
Cc: Neil Brown
Cc: Ingo Molnar
Cc: Randy Dunlap
Cc: Christoph Lameter
Cc: Jens Axboe
Cc: Peter Zijlstra
Cc: Adrian Bunk
Signed-off-by: Linus Torvalds
30 Dec, 2006
1 commit
-
The VM layer (on the face of it, fairly reasonably) expected that when
it does a ->writepage() call to the filesystem, it would write out the
full page at that point in time. Especially since it had earlier marked
the whole page dirty with "set_page_dirty()".

But that isn't actually the case: ->writepage() does not actually write
a page, it writes the parts of the page that have been explicitly marked
dirty before, *and* that had not got written out for other reasons since
the last time we told it they were dirty.

That last caveat is the important one.

Which _most_ of the time ends up being the whole page (since we had
called "set_page_dirty()" on the page earlier), but if the filesystem
had done any dirty flushing of its own (for example, to honor some
internal write ordering guarantees), it might end up doing only a
partial page IO (or none at all) when ->writepage() is actually called.

That is the correct thing in general (since we actually often _want_
only the known-dirty parts of the page to be written out), but the
shared dirty page handling had implicitly forgotten about these details,
and had a number of cases where it was doing just the "->writepage()"
part, without telling the low-level filesystem that the whole page might
have been re-dirtied as part of being mapped writably into user space.

Since most of the time the FS did actually write out the full page, we
didn't notice this for a loong time, and this needed some really odd
patterns to trigger. But it caused occasional corruption with rtorrent
and with the Debian "apt" database, because both use shared mmaps to
update the end result.

This fixes it. Finally. After way too much hair-pulling.
Acked-by: Nick Piggin
Acked-by: Martin J. Bligh
Acked-by: Martin Michlmayr
Acked-by: Martin Johansson
Acked-by: Ingo Molnar
Acked-by: Andrei Popa
Cc: Hugh Dickins
Cc: Andrew Morton
Cc: Peter Zijlstra
Cc: Segher Boessenkool
Cc: David Miller
Cc: Arjan van de Ven
Cc: Gordon Farquharson
Cc: Guillaume Chazarain
Cc: Theodore Tso
Cc: Kenneth Cheng
Cc: Tobias Diedrich
Signed-off-by: Linus Torvalds