12 Aug, 2010
1 commit
-
Commit 83ba7b071f3 ("writeback: simplify the write back thread queue")
broke writeback_in_progress() as in that commit we started to remove work
items from the list at the moment we start working on them and not at the
moment they are finished. Thus if the flusher thread was doing some work
but there was no other work queued, writeback_in_progress() returned
false. This could in particular cause unnecessary queueing of background
writeback from balance_dirty_pages() or writeout work from
writeback_sb_if_idle().This patch fixes the problem by introducing a bit in the bdi state which
indicates that the flusher thread is processing some work and uses this
bit for writeback_in_progress() test.NOTE: Both callsites of writeback_in_progress() (namely,
writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
need a different information than what writeback_in_progress() provides.
They would need to know whether *the kind of writeback they are going to
submit* is already queued. But this information isn't that simple to
provide so let's fix writeback_in_progress() for the time being.Signed-off-by: Jan Kara
Cc: Christoph Hellwig
Cc: Wu Fengguang
Acked-by: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Aug, 2010
6 commits
-
Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which
should take care of the periodic background write-out. However, the write-out
will actually start only 'dirty_writeback_interval' centisecs later, so we can
delay the wake-up.This change was requested by Nick Piggin who pointed out that if we delay the
wake-up, we weed out 2 unnecessary contex switches, which matters because
'__mark_inode_dirty()' is a hot-path function.This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
sets up a timer to wake-up the bdi thread and returns. So the wake-up is
delayed.We also delete the timer in bdi threads just before writing-back. And
synchronously delete it when unregistering bdi. At the unregister point the bdi
does not have any users, so no one can arm it again.Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
this change as well.This patch also moves the 'bdi_wb_init()' function down in the file to avoid
forward-declaration of 'bdi_wakeup_thread_delayed()'.Signed-off-by: Artem Bityutskiy
Signed-off-by: Jens Axboe -
Currently bdi threads use local variable 'last_active' which stores last time
when the bdi thread did some useful work. Move this local variable to 'struct
bdi_writeback'. This is just a preparation for the further patches which will
make the forker thread decide when bdi threads should be killed.Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
This patch simplifies bdi code a little by removing the 'pending_list' which is
redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is
working like this:1. In a loop, fetch all bdi's which have works but have no writeback thread and
move them to the 'pending_list'.
2. If the list is empty, sleep for 5 sec.
3. Otherwise, take one bdi from the list, fork the writeback thread for this
bdi, and repeat the loop.IOW, it first moves everything to the 'pending_list', then process only one
element, and so on. This patch simplifies the algorithm, which is now as
follows.1. Find the first bdi which has a work and remove it from the global list of
bdi's (bdi_list).
2. If there was not such bdi, sleep 5 sec.
3. Fork the writeback thread for this bdi and repeat the loop.IOW, now we find the first bdi to process, process it, and so on. This is
simpler and involves less lists.The bonus now is that we can get rid of a couple of functions, as well as
remove complications which involve 'rcu_call()' and 'bdi->rcu_head'.This patch also makes sure we use 'list_add_tail_rcu()', instead of plain
'list_add_tail()', but this piece of code is going to be removed in the next
patch anyway.Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
The write-back code mixes words "thread" and "task" for the same things. This
is not a big deal, but still an inconsistency.hch: a convention I tend to use and I've seen in various places
is to always use _task for the storage of the task_struct pointer,
and thread everywhere else. This especially helps with having
foo_thread for the actual thread and foo_task for a global
variable keeping the task_struct pointerThis patch renames:
* 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
* 'bdi_forker_task()' -> 'bdi_forker_thread()'because bdi threads are 'bdi_writeback_thread()', so these names are more
consistent.This patch also amends commentaries and makes them refer the forker and bdi
threads as "thread", not "task".Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
'static void' instead of 'void static' and make checkpatch.pl happy.Signed-off-by: Artem Bityutskiy
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
Move all code for the writeback thread into fs/fs-writeback.c instead of
splitting it over two functions in two files.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe -
The wb_list member of struct backing_device_info always has exactly one
element. Just use the direct bdi->wb pointer instead and simplify some
code.Also remove bdi_task_init which is now trivial to prepare for the next
patch.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
06 Jul, 2010
1 commit
-
First remove items from work_list as soon as we start working on them. This
means we don't have to track any pending or visited state and can get
rid of all the RCU magic freeing the work items - we can simply free
them once the operation has finished. Second use a real completion for
tracking synchronous requests - if the caller sets the completion pointer
we complete it, otherwise use it as a boolean indicator that we can free
the work item directly. Third unify struct wb_writeback_args and struct
bdi_work into a single data structure, wb_writeback_work. Previous we
set all parameters into a struct wb_writeback_args, copied it into
struct bdi_work, copied it again on the stack to use it there. Instead
of just allocate one structure dynamically or on the stack and use it
all the way through the stack.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
11 Jun, 2010
1 commit
-
bdi_start_writeback now never gets a superblock passed, so we can just remove
that case. And to further untangle the code and flatten the call stack
split it into two trivial helpers for it's two callers.Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
01 Jun, 2010
1 commit
-
This reverts commit e913fc825dc685a444cb4c1d0f9d32f372f59861.
We are investigating a hang associated with the WB_SYNC_NONE changes,
so revert them for now.Conflicts:
fs/fs-writeback.c
mm/page-writeback.cSigned-off-by: Jens Axboe
22 May, 2010
1 commit
-
Commit 69b62d01 fixed up most of the places where we would enter
busy schedule() spins when disabling the periodic background
writeback. This fixes up the sb timer so that it doesn't get
hammered on with the delay disabled, and ensures that it gets
rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs
gets modified.bdi_forker_task() also needs to check for !dirty_writeback_centisecs
and use schedule() appropriately, fix that up too.Signed-off-by: Jens Axboe
17 May, 2010
1 commit
-
When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
writeback to kick off writeback of pending dirty inodes, then follow
that up with a WB_SYNC_ALL to wait for it. Since umount already holds
the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
since WB_SYNC_ALL writeback is a data integrity operation and thus
a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
it's a lot slower.Signed-off-by: Jens Axboe
29 Apr, 2010
1 commit
-
Conflicts:
fs/block_dev.cSigned-off-by: Jens Axboe
25 Apr, 2010
1 commit
-
noop_backing_dev_info is used only as a flag to mark filesystems that
don't have any backing store, like tmpfs, procfs, spufs, etc.Signed-off-by: Joern Engel
Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
to the noop_backing_dev_info is not legal and will not result in
them being flushed, but we already catch this condition in
__mark_inode_dirty() when checking for a registered bdi.Signed-off-by: Jens Axboe
22 Apr, 2010
1 commit
-
Pretty trivial helper, just sets up the bdi and registers it. An atomic
sequence count is used to ensure that the registered sysfs names are
unique.Signed-off-by: Jens Axboe
06 Apr, 2010
1 commit
-
One of the features of laptop-mode is that it forces a writeout of dirty
pages if something else triggers a physical read or write from a device.
The current implementation flushes pages on all devices, rather than only
the one that triggered the flush. This patch alters the behaviour so that
only the recently accessed block device is flushed, preventing other
disks being spun up for no terribly good reason.Signed-off-by: Matthew Garrett
Signed-off-by: Jens Axboe
29 Oct, 2009
1 commit
-
There's nothing block related about them, the backing device
is used by things like NFS etc as well. This gets rid of the
need to protect such calls by CONFIG_BLOCK.Signed-off-by: Jens Axboe
26 Sep, 2009
1 commit
-
Sometimes we only want to write pages from a specific super_block,
so allow that to be passed in.This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
causing writeback on all super_blocks on a bdi, where we only really
want to sync a specific sb from writeback_inodes_sb().Signed-off-by: Jens Axboe
16 Sep, 2009
2 commits
-
bdi_start_writeback() is currently split into two paths, one for
WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
only WB_SYNC_NONE.Push down the writeback_control allocation and only accept the
parameters that make sense for each function. This cleans up
the API considerably.Signed-off-by: Jens Axboe
-
Now that bdi_writeback_all() no longer handles integrity writeback,
it doesn't have to block anymore. This means that we can switch
bdi_list reader side protection to RCU.Signed-off-by: Jens Axboe
11 Sep, 2009
4 commits
-
Also a debugging aid. We want to catch dirty inodes being added to
backing devices that don't do writeback.Signed-off-by: Jens Axboe
-
This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.Signed-off-by: Jens Axboe
-
This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45where vanilla tends to fluctuate a lot in the creation phase:
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as does random mmap'ed
writes.A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.Signed-off-by: Jens Axboe
-
This is a first step at introducing per-bdi flusher threads. We should
have no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, as there's no easy way to answer that question.
Not a huge problem, since it'll be deleted in subsequent patches.Signed-off-by: Jens Axboe
12 Jul, 2009
1 commit
-
Move the definition of BLK_RW_ASYNC/BLK_RW_SYNC into linux/backing-dev.h
so that it is available to all callers of set/clear_bdi_congested().This replaces commit 097041e576ee3a50d92dd643ee8ca65bf6a62e21 ("fuse:
Fix build error"), which will be reverted.Signed-off-by: Trond Myklebust
Acked-by: Larry Finger
Cc: Jens Axboe
Cc: Miklos Szeredi
Signed-off-by: Linus Torvalds
11 Jul, 2009
1 commit
-
Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.Signed-off-by: Jens Axboe
06 Apr, 2009
1 commit
-
This makes sure that we never wait on async IO for sync requests, instead
of doing the split on writes vs reads.Signed-off-by: Jens Axboe
Signed-off-by: Linus Torvalds
20 Oct, 2008
1 commit
-
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel
Signed-off-by: Lee Schermerhorn
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Signed-off-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Apr, 2008
6 commits
-
Fuse needs this for writable mmap support.
Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().Misc cleanups:
- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almst all users will want to have them toghetherSigned-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move BDI statistics to debugfs:
/sys/kernel/debug/bdi//stats
Use postcore_initcall() to initialize the sysfs class and debugfs,
because debugfs is initialized in core_initcall().Update descriptions in ABI documentation.
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of
the global dirty threshold allocated to this bdi.[mszeredi@suse.cz]
- fix parsing in max_ratio_store().
- export bdi_set_max_ratio() to modules
- limit bdi_dirty with bdi->max_ratio
- document new sysfs attributeSigned-off-by: Peter Zijlstra
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Under normal circumstances each device is given a part of the total write-back
cache that relates to its current avg writeout speed in relation to the other
devices.min_ratio - allows one to assign a minimum portion of the write-back cache to
a particular device. This is useful in situations where you might want to
provide a minimum QoS. (One request for this feature came from flash based
storage people who wanted to avoid writing out at all costs - they of course
needed some pdflush hacks as well)max_ratio - allows one to assign a maximum portion of the dirty limit to a
particular device. This is useful in situations where you want to avoid one
device taking all or most of the write-back cache. Eg. an NFS mount that is
prone to get stuck, or a FUSE mount which you don't trust to play fair.Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of
the global dirty threshold allocated to this bdi.[mszeredi@suse.cz]
- fix parsing in min_ratio_store()
- document new sysfs attributeSigned-off-by: Peter Zijlstra
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
This allows us to see and set the various BDI specific variables.In particular this properly exposes the read-ahead window for all relevant
users and /sys/block//queue/read_ahead_kb should be deprecated.With patient help from Kay Sievers and Greg KH
[mszeredi@suse.cz]
- split off NFS and FUSE changes into separate patches
- document new sysfs attributes under Documentation/ABI
- do bdi_class_init as a core_initcall, otherwise the "default" BDI
won't be initialized
- remove bdi_init_fmt macro, it's not used very much[akpm@linux-foundation.org: fix ia64 warning]
Signed-off-by: Peter Zijlstra
Cc: Kay Sievers
Acked-by: Greg KH
Cc: Trond Myklebust
Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Apr, 2008
1 commit
-
Make the following needlessly global functions static:
- writeback_acquire()
- writeback_release()Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Oct, 2007
5 commits
-
Scale writeback cache per backing device, proportional to its writeout speed.
By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:- mutual interference starvation (for any number of BDIs);
- deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A DBI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.What is done is to keep a floating proportion between the DBIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.[akpm@linux-foundation.org: fix warnings]
[hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
Signed-off-by: Peter Zijlstra
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Count per BDI writeback pages.
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Provide scalable per backing_dev_info statistics counters.
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
provide BDI constructor/destructor hooks
[akpm@linux-foundation.org: compile fix]
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds