01 Aug, 2012

1 commit

  • Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For backward compatibility, print a warning via printk and return 2 to
    notify users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
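
    A sketch of what such a compatibility handler could look like (hypothetical
    handler name; only the warn-and-return-2 behaviour is taken from the entry
    above, the rest is illustrative):

        static int nr_pdflush_threads_obsolete(struct ctl_table *table, int write,
                                               void __user *buffer, size_t *lenp,
                                               loff_t *ppos)
        {
                char kbuf[] = "0\n";            /* the knob always reads as 0 */

                if (*ppos) {                    /* nothing left to read */
                        *lenp = 0;
                        return 0;
                }
                if (copy_to_user(buffer, kbuf, sizeof(kbuf)))
                        return -EFAULT;
                printk_once(KERN_WARNING
                            "%s exported in /proc is scheduled for removal\n",
                            table->procname);
                *lenp = 2;
                *ppos += 2;
                return 2;                       /* the "return 2" described above */
        }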
     

31 Jul, 2012

1 commit

  • Pull writeback updates from Wu Fengguang:
    "Use time based periods to age the writeback proportions, which can
    adapt equally well to fast/slow devices."

    Fix up trivial conflict in comment in fs/sync.c

    * tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Fix some comment errors
    block: Convert BDI proportion calculations to flexible proportions
    lib: Fix possible deadlock in flexible proportion code
    lib: Proportions with flexible period

    Linus Torvalds
     

23 Jul, 2012

1 commit

  • In principle, a filesystem may want to have ->sync_fs() called during sync(1)
    even though it does not have a bdi (i.e. s_bdi is set to noop_backing_dev_info).
    Only the writeback code really needs the bdi set to something reasonable, so
    move the checks to where they are more logical.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
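
    A hypothetical sketch of the resulting shape (simplified; the exact helpers
    and call sites in the real patch may differ): the bdi check moves into the
    writeback helper itself, so the sync path can call ->sync_fs() regardless of
    whether the filesystem has a real bdi.

        void sync_inodes_sb(struct super_block *sb)
        {
                /* Writeback needs a real bdi; bail out early if there is none. */
                if (sb->s_bdi == &noop_backing_dev_info)
                        return;
                /* ... queue WB_SYNC_ALL work for the flusher and wait for it ... */
        }

        static int __sync_filesystem(struct super_block *sb, int wait)
        {
                if (wait)
                        sync_inodes_sb(sb);
                else
                        writeback_inodes_sb(sb, WB_REASON_SYNC);

                if (sb->s_op->sync_fs)
                        sb->s_op->sync_fs(sb, wait);    /* runs even without a bdi */
                return __sync_blockdev(sb->s_bdev, wait);
        }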
     

09 Jun, 2012

2 commits

  • Signed-off-by: Wanpeng Li
    Signed-off-by: Fengguang Wu

    Wanpeng Li
     
  • Fix a bug introduced by commit 169ebd90: we have to have wb->list_lock locked
    when restarting the writeback loop after having waited for inode writeback.

    Bug description by Ted Ts'o:

    I can reproduce this fairly easily by using ext4 w/o a journal, running
    under KVM with 1024megs memory, with fsstress (xfstests #13):

    [ 45.153294] =====================================
    [ 45.154784] [ BUG: bad unlock balance detected! ]
    [ 45.155591] 3.5.0-rc1-00002-gb22b1f1 #124 Not tainted
    [ 45.155591] -------------------------------------
    [ 45.155591] flush-254:16/2499 is trying to release lock (&(&wb->list_lock)->rlock) at:
    [ 45.155591] [] writeback_sb_inodes+0x160/0x327
    [ 45.155591] but there are no more locks to release!

    Reported-by: Theodore Ts'o
    Tested-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
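
    The fix boils down to re-taking wb->list_lock before restarting the loop,
    roughly (a sketch of the relevant branch in writeback_sb_inodes(), assuming
    the inode_sleep_on_writeback() helper from that series; not the verbatim
    diff):

        if (inode->i_state & I_SYNC) {
                /* Wait for I_SYNC; this drops i_lock, and wb->list_lock was
                 * already dropped before we got here. */
                inode_sleep_on_writeback(inode);
                /* The inode may be gone by now; re-take the list lock before
                 * restarting the scan. */
                spin_lock(&wb->list_lock);
                continue;
        }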
     

06 May, 2012

7 commits

  • Doing iput() from the flusher thread (writeback_sb_inodes()) can create
    problems because iput() can do a lot of work - for example truncate the inode
    if it's the last iput on an unlinked file. Some filesystems depend on the
    flusher thread making progress (e.g. because they need to flush delayed
    allocated blocks to reduce allocation uncertainty), so the flusher thread
    doing truncates creates interesting dependencies and possibilities for
    deadlocks.

    We get rid of iput() in the flusher thread by using the fact that the I_SYNC
    inode flag effectively pins the inode in memory. So if we take care to either
    hold i_lock or have I_SYNC set, we can get away without taking an inode
    reference in writeback_sb_inodes().

    As a side effect of these changes, we also fix a possible use-after-free in
    wb_writeback(), because the inode_wait_for_writeback() call could try to
    reacquire i_lock on an inode that had already been freed.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • The code in writeback_single_inode() is relatively complex. The list requeuing
    logic makes sense only for the flusher thread, not really for sync_inode() or
    write_inode_now() callers. Also, when we want to get rid of the inode
    references held by the flusher thread, we will need special I_SYNC handling
    there.

    So separate the part of writeback_single_inode() that does the real writeback
    work into __writeback_single_inode(), and make writeback_single_inode() do
    only the things necessary for callers writing a single inode, moving the
    special list handling into writeback_sb_inodes(). As a side effect this fixes
    a possible race where we could skip some inode during sync(2) because another
    writer refiled it from the b_io to the b_dirty list. Also, I_SYNC handling is
    moved into the callers of __writeback_single_inode() to make locking easier.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • writeback_single_inode() doesn't need wb->list_lock for anything on entry now.
    So remove the requirement. This makes locking of writeback_single_inode()
    temporarily awkward (entering with i_lock, returning with i_lock and
    wb->list_lock) but it will be sanitized in the next patch.

    Also, inode_wait_for_writeback() doesn't need wb->list_lock for anything. It
    was just taking it to make usage convenient for callers, but with
    writeback_single_inode() changing, it's not very convenient anymore. So remove
    the lock from that function.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • Move inode requeueing after inode has been written out into a separate
    function.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
    writing them all, clear the bit only when we have succeeded in writing all the
    dirty pages. We also move the clearing of the bit close to the other i_state
    handling to separate it from the writeback list handling. This is desirable
    because list handling will differ for the flusher thread and other
    writeback_single_inode() callers in the future. No filesystem plays any tricks
    with I_DIRTY_PAGES (like checking it in its ->writepages or ->write_inode
    implementation), so this movement is safe.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
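
    The resulting logic is roughly the following fragment (a sketch of the
    relevant lines, assuming the usual dirty-tag check on the mapping; not the
    verbatim patch):

        spin_lock(&inode->i_lock);
        /* Clear I_DIRTY_PAGES only if every dirty page got written out;
         * otherwise the flag stays set and the inode remains dirty. */
        if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
                inode->i_state &= ~I_DIRTY_PAGES;
        dirty = inode->i_state & I_DIRTY;
        inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
        spin_unlock(&inode->i_lock);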
     
  • When writeback_single_inode() is called on an inode which already has I_SYNC
    set while doing WB_SYNC_NONE, the inode is moved to the b_more_io list.
    However, this makes sense only if the caller is the flusher thread. For other
    callers of writeback_single_inode() it doesn't really make sense and may even
    be wrong - the flusher thread may be doing WB_SYNC_ALL writeback in parallel.

    So move the requeueing from writeback_single_inode() to writeback_sb_inodes().

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • Move clearing of I_SYNC into inode_sync_complete(). It is more logical to have
    clearing of I_SYNC bit and waking of waiters in one place. Also later we will
    have two places needing to clear I_SYNC and wake up waiters so this allows them
    to use the common helper. Moving of I_SYNC clearing to a later stage of
    writeback_single_inode() is safe since we hold i_lock all the time.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
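
    The helper then has roughly this shape (a sketch; the ordering of the clear,
    the barrier and the wakeup is the important part):

        static void inode_sync_complete(struct inode *inode)
        {
                inode->i_state &= ~I_SYNC;
                /* Waiters must see I_SYNC cleared before being woken up. */
                smp_mb();
                wake_up_bit(&inode->i_state, __I_SYNC);
        }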
     

25 Mar, 2012

1 commit

  • Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker:
    "Fix up files in fs/ and lib/ dirs to only use module.h if they really
    need it.

    These are trivial in scope vs the work done previously. We now have
    things where any few remaining cleanups can be farmed out to arch or
    subsystem maintainers, and I have done so when possible. What is
    remaining here represents the bits that don't clearly lie within a
    single arch/subsystem boundary, like the fs dir and the lib dir.

    Some duplicate includes arising from overlapping fixes from
    independent subsystem maintainer submissions are also quashed."

    Fix up trivial conflicts due to clashes with other include file cleanups
    (including some due to the previous bug.h cleanup pull).

    * tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
    lib: reduce the use of module.h wherever possible
    fs: reduce the use of module.h wherever possible
    includecheck: delete any duplicate instances of module.h

    Linus Torvalds
     

21 Mar, 2012

3 commits

  • The comment is hopelessly outdated and misplaced. We no longer have a 'bdi'
    part of writeback work, the comment about the blockdev super is outdated, and
    so is the comment about throttling. Information about list handling is covered
    in more detail at queue_io(). So just move the bit about older_than_this close
    to move_expired_inodes() and remove the rest.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     
  • inode_sync_wait() in write_inode_now() is just bogus. That function waits for
    the I_SYNC bit to be cleared, but writeback_single_inode() clears the bit on
    return, so the wait is effectively a nop unless someone else submits the inode
    for writeback again. All the waiting write_inode_now() needs is achieved by
    using the WB_SYNC_ALL writeback mode.

    Signed-off-by: Jan Kara
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Fengguang Wu

    Jan Kara
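
    The waiting that write_inode_now() needs comes from the writeback_control it
    builds, roughly (a simplified fragment; locking and the exact
    writeback_single_inode() signature of that kernel version are elided):

        struct writeback_control wbc = {
                .nr_to_write = LONG_MAX,
                .sync_mode   = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
                .range_start = 0,
                .range_end   = LLONG_MAX,
        };

        /* WB_SYNC_ALL makes writeback_single_inode() itself wait on I_SYNC and
         * on the pages, so the trailing inode_sync_wait() can simply go away. */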
     
  • Pull trivial tree from Jiri Kosina:
    "It's indeed trivial -- mostly documentation updates and a bunch of
    typo fixes from Masanari.

    There are also several linux/version.h include removals from Jesper."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
    kcore: fix spelling in read_kcore() comment
    constify struct pci_dev * in obvious cases
    Revert "char: Fix typo in viotape.c"
    init: fix wording error in mm_init comment
    usb: gadget: Kconfig: fix typo for 'different'
    Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
    writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
    writeback: fix typo in the writeback_control comment
    Documentation: Fix multiple typo in Documentation
    tpm_tis: fix tis_lock with respect to RCU
    Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
    Doc: Update numastat.txt
    qla4xxx: Add missing spaces to error messages
    compiler.h: Fix typo
    security: struct security_operations kerneldoc fix
    Documentation: broken URL in libata.tmpl
    Documentation: broken URL in filesystems.tmpl
    mtd: simplify return logic in do_map_probe()
    mm: fix comment typo of truncate_inode_pages_range
    power: bq27x00: Fix typos in comment
    ...

    Linus Torvalds
     

11 Jan, 2012

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     

09 Jan, 2012

1 commit

  • * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
    PM / Hibernate: Implement compat_ioctl for /dev/snapshot
    PM / Freezer: fix return value of freezable_schedule_timeout_killable()
    PM / shmobile: Allow the A4R domain to be turned off at run time
    PM / input / touchscreen: Make st1232 use device PM QoS constraints
    PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
    PM / shmobile: Remove the stay_on flag from SH7372's PM domains
    PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
    PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    ARM: S3C64XX: Implement basic power domain support
    PM / shmobile: Use common always on power domain governor
    ...

    Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
    XBT_FORCE_SLEEP bit

    Linus Torvalds
     

04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

26 Dec, 2011

1 commit

  • * pm-sleep: (51 commits)
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly
    PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep()
    PM / Sleep: Make [un]lock_system_sleep() generic
    PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs
    PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count()
    PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else'
    Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer
    PM / Sleep: Unify diagnostic messages from device suspend/resume
    ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR
    PM / Hibernate: Remove deprecated hibernation test modes
    PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path
    ...

    Conflicts:
    kernel/kmod.c

    Rafael J. Wysocki
     

22 Dec, 2011

1 commit

  • * master: (848 commits)
    SELinux: Fix RCU deref check warning in sel_netport_insert()
    binary_sysctl(): fix memory leak
    mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
    ipmi_watchdog: restore settings when BMC reset
    oom: fix integer overflow of points in oom_badness
    memcg: keep root group unchanged if creation fails
    nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
    nilfs2: unbreak compat ioctl
    cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
    evm: prevent racing during tfm allocation
    evm: key must be set once during initialization
    mmc: vub300: fix type of firmware_rom_wait_states module parameter
    Revert "mmc: enable runtime PM by default"
    mmc: sdhci: remove "state" argument from sdhci_suspend_host
    x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
    IB/qib: Correct sense on freectxts increment and decrement
    RDMA/cma: Verify private data length
    cgroups: fix a css_set not found bug in cgroup_attach_proc
    oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
    Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
    ...

    Conflicts:
    kernel/cgroup_freezer.c

    Rafael J. Wysocki
     

18 Dec, 2011

2 commits

  • The current livelock avoidance code makes background work include only inodes
    that were dirtied before background writeback started. However, background
    writeback can run for a long time, and excluding newly dirtied inodes can
    eventually exclude a significant portion of the dirty inodes, making
    background writeback inefficient. Since background writeback avoids
    livelocking the flusher thread by yielding to any other work, there is no real
    reason why background work should not include all dirty inodes, so change the
    logic in wb_writeback().

    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang

    Jan Kara
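
    In wb_writeback() this amounts to refreshing the cut-off time on every loop
    iteration for background work, roughly (a sketch, not the verbatim diff):

        if (work->for_kupdate) {
                oldest_jif = jiffies -
                        msecs_to_jiffies(dirty_expire_interval * 10);
        } else if (work->for_background) {
                /* Re-evaluate every iteration so that inodes dirtied after
                 * background writeback started are included as well. */
                oldest_jif = jiffies;
        }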
     
  • This makes the binary trace understandable by trace-cmd.

    CC: Dave Chinner
    CC: Curt Wohlgemuth
    CC: Steven Rostedt
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

22 Nov, 2011

1 commit

  • Writeback and thinkpad_acpi have been using thaw_process() to prevent
    deadlock between the freezer and kthread_stop(); unfortunately, this
    is inherently racy - nothing prevents freezing from happening between
    thaw_process() and kthread_stop().

    This patch implements kthread_freezable_should_stop() which enters
    refrigerator if necessary but is guaranteed to return if
    kthread_stop() is invoked. Both thaw_process() users are converted to
    use the new function.

    Note that this deadlock condition exists for many freezable kthreads. They
    need to be converted to use either the new should_stop helper or a freezable
    workqueue.

    Tested with synthetic test case.

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Cc: Jens Axboe
    Cc: Oleg Nesterov

    Tejun Heo
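
    Typical usage in a freezable kthread looks roughly like this (a sketch of the
    calling convention, not code from the patch; the thread name is made up):

        static int my_flusher_thread(void *data)
        {
                bool was_frozen;

                set_freezable();
                while (!kthread_freezable_should_stop(&was_frozen)) {
                        if (was_frozen)
                                continue;       /* re-check state after a freeze */

                        /* ... do one round of work, then sleep ... */
                }
                return 0;
        }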
     

31 Oct, 2011

2 commits

  • This creates a new 'reason' field in a wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h, to enumerate the possible reasons.

    The 'writeback_work_class' tracepoint event class and the
    'writeback_queue_io' tracepoint are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a wb_stats parameter added to them, so callers can specify
    why writeback is being started.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
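
    The enumeration looks roughly like this (values abridged; see writeback.h for
    the full list), and callers pass the reason when kicking writeback:

        enum wb_reason {
                WB_REASON_BACKGROUND,
                WB_REASON_TRY_TO_FREE_PAGES,
                WB_REASON_SYNC,
                WB_REASON_PERIODIC,
                WB_REASON_LAPTOP_TIMER,
                /* ... */
        };

        /* e.g. the sync path announces itself: */
        writeback_inodes_sb(sb, WB_REASON_SYNC);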
     
  • Instead of sending ->older_than_this to queue_io() and
    move_expired_inodes(), send the entire wb_writeback_work
    structure. There are other fields of a work item that are
    useful in these routines and in tracepoints.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     

03 Oct, 2011

2 commits

  • One thing that puzzled me is that in the JBOD case, the per-disk writeout
    performance is lower than in the corresponding single-disk case even when
    they have comparable bdi_thresh. Tracing shows that in the single-disk case,
    bdi_writeback is always kept high, while in the JBOD case it could drop low
    from time to time, and correspondingly bdi_reclaimable could sometimes rush
    high.

    The fix is to watch bdi_reclaimable and kick background writeback as
    soon as it goes high. This resembles the global background threshold
    but in per-bdi manner. The trick is, as long as bdi_reclaimable does
    not go high, bdi_writeback naturally won't go low because
    bdi_reclaimable+bdi_writeback ~= bdi_thresh.

    With less fluctuated writeback pages, JBOD performance is observed to
    increase noticeably in various cases.

    vmstat:nr_written values before/after patch:

    before (3.1.0-rc4-wo-underrun+)   change    after (3.1.0-rc4-bgthresh3+)   test case
    -------------------------------   ------    ----------------------------   ---------
    125596480 +25.9% 158179363 JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
    61790815 +110.4% 130032231 JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
    58853546 -0.1% 58823828 JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
    110159811 +24.7% 137355377 JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
    69544762 +10.8% 77080047 JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
    50644862 +0.5% 50890006 JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
    42677090 +28.0% 54643527 JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
    47491324 +13.3% 53785605 JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
    52548986 +0.9% 53001031 JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
    26783091 +36.8% 36650248 JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
    35526347 +14.0% 40492312 JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
    44670723 -1.1% 44177606 JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
    127996037 +22.4% 156719990 JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
    57518856 +3.8% 59677625 JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
    51919909 +12.2% 58269894 JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
    86410514 +79.0% 154660433 JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
    40132519 +38.6% 55617893 JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
    48423248 +7.5% 52042927 JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
    206041046 +44.1% 296846536 JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
    72312903 -19.4% 58272885 JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
    50635672 -0.5% 50384787 JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
    68308534 +115.7% 147324758 JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
    57882933 +14.5% 66269621 JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
    52183472 +12.8% 58855181 JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
    53788956 +94.2% 104460352 JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
    44493342 +35.5% 60298210 JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
    42641209 +18.9% 50681038 JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X

    Signed-off-by: Wu Fengguang

    Wu Fengguang
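
    The per-bdi check ends up next to the existing global one, roughly as below
    (a sketch of the background-threshold check described above, using field and
    helper names of that era; details may differ from the real patch):

        static bool over_bground_thresh(struct backing_dev_info *bdi)
        {
                unsigned long background_thresh, dirty_thresh;

                global_dirty_limits(&background_thresh, &dirty_thresh);

                /* Global background threshold, as before. */
                if (global_page_state(NR_FILE_DIRTY) +
                    global_page_state(NR_UNSTABLE_NFS) > background_thresh)
                        return true;

                /* New: kick writeback when this bdi's reclaimable pages alone
                 * exceed its share of the background threshold. */
                if (bdi_stat(bdi, BDI_RECLAIMABLE) >
                    bdi_dirty_limit(bdi, background_thresh))
                        return true;

                return false;
        }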
     
  • No behavior change.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

31 Jul, 2011

1 commit

  • This fixes a soft lockup under the following conditions:

    a) the flusher is working on a work by __bdi_start_writeback(), while

    b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
    grabs sb->s_umount and enqueues a new work item for the flusher to execute

    The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
    Then if the inode is requeued, wb_writeback() will busy-retry on it.
    As a result, wb_writeback() loops forever without releasing
    wb->list_lock, which further blocks other tasks.

    Fix the busy loop by redirtying the inode. This may undesirably delay
    the writeback of the inode, however most likely it will be picked up
    soon by the queued work by writeback_inodes_sb*(), sync_inodes_sb() or
    even writeback_inodes_wb().

    bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.html

    Reported-by: Christoph Hellwig
    Tested-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang

    Wu Fengguang
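
    The fix, roughly, is a redirty_tail() in the branch where the superblock
    cannot be pinned (a sketch of writeback_sb_inodes(), not the verbatim diff):

        if (!grab_super_passive(sb)) {
                /*
                 * grab_super_passive() may fail consistently while s_umount is
                 * held by the sync_inodes_sb()/writeback_inodes_sb*() caller.
                 * Using requeue_io() here would busy-retry the inode forever
                 * with wb->list_lock held, so redirty it and move on instead.
                 */
                redirty_tail(inode, wb);
                continue;
        }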
     

27 Jul, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: (27 commits)
    mm: properly reflect task dirty limits in dirty_exceeded logic
    writeback: don't busy retry writeback on new/freeing inodes
    writeback: scale IO chunk size up to half device bandwidth
    writeback: trace global_dirty_state
    writeback: introduce max-pause and pass-good dirty limits
    writeback: introduce smoothed global dirty limit
    writeback: consolidate variable names in balance_dirty_pages()
    writeback: show bdi write bandwidth in debugfs
    writeback: bdi write bandwidth estimation
    writeback: account per-bdi accumulated written pages
    writeback: make writeback_control.nr_to_write straight
    writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()
    writeback: trace event writeback_queue_io
    writeback: trace event writeback_single_inode
    writeback: remove .nonblocking and .encountered_congestion
    writeback: remove writeback_control.more_io
    writeback: skip balance_dirty_pages() for in-memory fs
    writeback: add bdi_dirty_limit() kernel-doc
    writeback: avoid extra sync work at enqueue time
    writeback: elevate queue_io() into wb_writeback()
    ...

    Fix up trivial conflicts in fs/fs-writeback.c and mm/filemap.c

    Linus Torvalds
     

24 Jul, 2011

1 commit

  • Fix a system hang bug introduced by commit b7a2441f9966 ("writeback:
    remove writeback_control.more_io") and e8dfc3058 ("writeback: elevate
    queue_io() into wb_writeback()") easily reproducible with high memory
    pressure and lots of file creation/deletions, for example, a kernel
    build in limited memory.

    It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE
    state: the flusher gets stuck busy-retrying that inode, never
    releasing wb->list_lock. The lock in turn blocks all kinds of other
    tasks when they try to grab it.

    As put by Jan, it's a safe change regarding data integrity. I_FREEING or
    I_WILL_FREE inodes are written back by iput_final() and it is reclaim
    code that is responsible for eventually removing them. So writeback code
    can safely ignore them. I_NEW inodes should move out of this state when
    they are fully set up and in the writeback round following that, we will
    consider them for writeback. So the change makes sense.

    CC: Jan Kara
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Signed-off-by: Wu Fengguang

    Wu Fengguang
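
    The fix is essentially an early skip for those inode states in
    writeback_sb_inodes(), roughly (a sketch, not the verbatim diff):

        spin_lock(&inode->i_lock);
        if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                spin_unlock(&inode->i_lock);
                /* Don't requeue_io(): that would busy-retry the inode while
                 * holding wb->list_lock. Redirty it and look at it again in
                 * a later writeback round. */
                redirty_tail(inode, wb);
                continue;
        }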
     

20 Jul, 2011

1 commit

  • The per-sb shrinker has the same requirement as the writeback
    threads of ensuring that the superblock is usable and pinned for the
    time it takes to run the work. Both need to take a passive reference
    to the sb, take a read lock on the s_umount lock and then only
    continue if an unmount is not in progress.

    pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
    rename it to grab_super_passive() and export it via fs/internal.h
    so that all the VFS code can use it.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
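
    The helper's shape is roughly the following (a sketch; the exact checks
    differ between kernel versions):

        bool grab_super_passive(struct super_block *sb)
        {
                /* Take a passive reference so the sb cannot go away. */
                spin_lock(&sb_lock);
                if (list_empty(&sb->s_instances)) {
                        spin_unlock(&sb_lock);
                        return false;
                }
                sb->s_count++;
                spin_unlock(&sb_lock);

                /* Only proceed if no unmount is in progress. */
                if (down_read_trylock(&sb->s_umount)) {
                        if (sb->s_root)
                                return true;
                        up_read(&sb->s_umount);
                }

                put_super(sb);
                return false;
        }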
     

10 Jul, 2011

2 commits

  • Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
    concern of not holding I_SYNC for too long. (At least, that was the
    comment previously.) This doesn't make sense now because the only
    time we wait for I_SYNC is if we are calling sync or fsync, and in
    that case we need to write out all of the data anyway. Previously
    there may have been other code paths that waited on I_SYNC, but not
    any more. -- Theodore Ts'o

    So remove the MAX_WRITEBACK_PAGES constraint. The write chunk size will
    adapt to as much as the storage device can write within 500ms.

    XFS is observed to do IO completions in batches, and the batch size is
    equal to the write chunk size. To avoid dirty pages suddenly dropping out
    of balance_dirty_pages()'s dirty control scope and creating large
    fluctuations, the chunk size is also limited to half the control scope.

    The balance_dirty_pages() control scope is

    [(background_thresh + dirty_thresh) / 2, dirty_thresh]

    which is by default [15%, 20%] of global dirty pages, whose range size
    is dirty_thresh / DIRTY_FULL_SCOPE.

    The adaptive write chunk size will be rounded to the nearest 4MB
    boundary.

    http://bugzilla.kernel.org/show_bug.cgi?id=13930

    CC: Theodore Ts'o
    CC: Dave Chinner
    CC: Chris Mason
    CC: Peter Zijlstra
    Signed-off-by: Wu Fengguang

    Wu Fengguang
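
    The calculation described above roughly becomes the following (a sketch,
    assuming a helper like writeback_chunk_size(); constants and field names as
    of that era, details may differ from the real patch):

        static long writeback_chunk_size(struct backing_dev_info *bdi,
                                         struct wb_writeback_work *work)
        {
                long pages;

                if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) {
                        /* Integrity writeback must write out everything. */
                        pages = LONG_MAX;
                } else {
                        /* What the device can write in ~500ms (half the
                         * per-second bandwidth), but no more than half the
                         * dirty-control scope ... */
                        pages = min(bdi->avg_write_bandwidth / 2,
                                    global_dirty_limit / DIRTY_SCOPE);
                        /* ... capped at the work's budget ... */
                        pages = min(pages, work->nr_pages);
                        /* ... and rounded to a 4MB (MIN_WRITEBACK_PAGES) boundary. */
                        pages = round_down(pages + MIN_WRITEBACK_PAGES,
                                           MIN_WRITEBACK_PAGES);
                }

                return pages;
        }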
     
  • The start of a heavyweight application (i.e. KVM) may instantly knock
    down determine_dirtyable_memory() if the swap is not enabled or is full.
    global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
    dirty thresholds that are _much_ lower than the global/bdi dirty pages.

    balance_dirty_pages() will then heavily throttle all dirtiers including
    the light ones, until the dirty pages drop below the new dirty thresholds.
    During this _deep_ dirty-exceeded state, the system may appear rather
    unresponsive to the users.

    About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
    threshold to heavy dirtiers than light ones, and the dirty pages will
    be throttled around the heavy dirtiers' dirty threshold and reasonably
    below the light dirtiers' dirty threshold. In this state, only the heavy
    dirtiers will be throttled and the dirty pages are carefully controlled
    to not exceed the light dirtiers' dirty threshold. However if the
    threshold itself suddenly drops below the number of dirty pages, the
    light dirtiers will get heavily throttled.

    So introduce global_dirty_limit for tracking the global dirty threshold
    with policies

    - follow downwards slowly
    - follow up in one shot

    global_dirty_limit can effectively mask out the impact of a sudden drop in
    dirtyable memory. It will be used in the next patch for two new types of
    dirty limits. Note that the new dirty limits are not going to avoid
    throttling the light dirtiers, but could limit their sleep time to 200ms.

    Signed-off-by: Wu Fengguang

    Wu Fengguang
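
    The two policies translate into an update step of roughly this shape (a
    sketch of the tracking logic, assuming a helper like update_dirty_limit();
    not the verbatim patch):

        /* Called whenever the dirty thresholds are recomputed. */
        static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
        {
                unsigned long limit = global_dirty_limit;

                /* Follow up in one shot. */
                if (limit < thresh) {
                        limit = thresh;
                        goto update;
                }

                /* Follow down slowly, and stay above the number of dirty pages. */
                thresh = max(thresh, dirty);
                if (limit > thresh) {
                        limit -= (limit - thresh) >> 5;
                        goto update;
                }
                return;
        update:
                global_dirty_limit = limit;
        }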