16 Jan, 2015

1 commit

  • commit 9c6ac78eb3521c5937b2dd8a7d1b300f41092f45 upstream.

    After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
    tests inode->i_state locklessly to see whether it already has all the
    necessary I_DIRTY bits set. The comment above the barrier doesn't
    contain any useful information - memory barriers can't ensure "changes
    are seen by all cpus" by themselves.

    And sure enough, it was broken. Please consider the following
    scenario.

    CPU 0                                  CPU 1
    -------------------------------------------------------------------------

                                           enters __writeback_single_inode()
                                           grabs inode->i_lock
                                           tests PAGECACHE_TAG_DIRTY which is clear
    enters __set_page_dirty()
    grabs mapping->tree_lock
    sets PAGECACHE_TAG_DIRTY
    releases mapping->tree_lock
    leaves __set_page_dirty()

    enters __mark_inode_dirty()
    smp_mb()
    sees I_DIRTY_PAGES set
    leaves __mark_inode_dirty()
                                           clears I_DIRTY_PAGES
                                           releases inode->i_lock

    Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
    to lead to an immediately critical problem because requeue_inode()
    later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
    deciding whether the inode needs to be requeued for IO and there are
    enough unintentional memory barriers in between, so while the inode
    ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
    IO list.

    The lack of an explicit barrier may also theoretically affect the
    other I_DIRTY bits which deal with metadata dirtiness. There is no
    guarantee that a strong enough barrier exists between
    I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
    inode. The filesystem inode writeout path likely has enough operations
    which can behave as a full barrier, but it's theoretically possible
    that the writeout may not see all the updates from ->dirty_inode().

    Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
    that I_DIRTY_PAGES needs a special treatment as it always needs to be
    cleared to be interlocked with the lockless test on
    __mark_inode_dirty() side. It's cleared unconditionally and
    reinstated after smp_mb() if the mapping still has dirty pages.

    Also add comments explaining how and why the barriers are paired.
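
    The resulting sequence in __writeback_single_inode() then looks
    roughly like the following sketch (simplified; unrelated state
    handling omitted):

        dirty = inode->i_state & I_DIRTY;
        inode->i_state &= ~I_DIRTY;

        /*
         * Paired with smp_mb() in __mark_inode_dirty().  This allows
         * __mark_inode_dirty() to test i_state without grabbing
         * i_lock - either they see the I_DIRTY bits cleared or we see
         * the dirtied inode.
         */
        smp_mb();

        /* Reinstate I_DIRTY_PAGES if the mapping is still dirty. */
        if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY))
            inode->i_state |= I_DIRTY_PAGES;

        spin_unlock(&inode->i_lock);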

    Lightly tested.

    Signed-off-by: Tejun Heo
    Cc: Jan Kara
    Cc: Mikulas Patocka
    Cc: Jens Axboe
    Cc: Al Viro
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.
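
    As a sketch of the calling-convention change (using an inode state
    bit as an illustrative caller; the exact call sites vary):

        /* before: the caller supplied an action function */
        wait_on_bit(&inode->i_state, __I_NEW, inode_wait,
                    TASK_UNINTERRUPTIBLE);

        /* after: the default action is implied; only 'mode' decides
         * whether a pending signal aborts the wait */
        if (wait_on_bit(&inode->i_state, __I_NEW, TASK_INTERRUPTIBLE))
            return -ERESTARTSYS;  /* caller picks its own error code */

        /* IO-flavoured waits use the io_schedule() based variant */
        wait_on_bit_io(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);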

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible".

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all; instead something higher in the stack will.
    So the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

05 Apr, 2014

1 commit

  • Pull GFS2 updates from Steven Whitehouse:
    "One of the main highlights this time, is not the patches themselves
    but instead the widening contributor base. It is good to see that
    interest is increasing in GFS2, and I'd like to thank all the
    contributors to this patch set.

    In addition to the usual set of bug fixes and clean ups, there are
    patches to improve inode creation performance when xattrs are required
    and some improvements to the transaction code which is intended to
    help improve scalability after further changes in due course.

    Journal extent mapping is also updated to make it more efficient and
    again, this is a foundation for future work in this area.

    The maximum number of ACLs has been increased to 300 (for a 4k block
    size) which means that even with a few additional xattrs from selinux,
    everything should fit within a single fs block.

    There is also a patch to bring GFS2's own copy of the writepages code
    up to the same level as the core VFS. Eventually we may be able to
    merge some of this code, since it is fairly similar.

    The other major change this time, is bringing consistency to the
    printing of messages via fs_, pr_ macros"

    * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (29 commits)
    GFS2: Fix address space from page function
    GFS2: Fix uninitialized VFS inode in gfs2_create_inode
    GFS2: Fix return value in slot_get()
    GFS2: inline function gfs2_set_mode
    GFS2: Remove extraneous function gfs2_security_init
    GFS2: Increase the max number of ACLs
    GFS2: Re-add a call to log_flush_wait when flushing the journal
    GFS2: Ensure workqueue is scheduled after noexp request
    GFS2: check NULL return value in gfs2_ok_to_move
    GFS2: Convert gfs2_lm_withdraw to use fs_err
    GFS2: Use fs_ more often
    GFS2: Use pr_ more consistently
    GFS2: Move recovery variables to journal structure in memory
    GFS2: global conversion to pr_foo()
    GFS2: return -E2BIG if hit the maximum limits of ACLs
    GFS2: Clean up journal extent mapping
    GFS2: replace kmalloc - __vmalloc / memset 0
    GFS2: Remove extra "if" in gfs2_log_flush()
    fs: NULL dereference in posix_acl_to_xattr()
    GFS2: Move log buffer accounting to transaction
    ...

    Linus Torvalds
     

04 Apr, 2014

2 commits

    After commit 839a8e8660b6 ("writeback: replace custom worker pool
    implementation with unbound workqueue"), when a device is removed while
    we are writing to it, we crash in bdi_writeback_workfn() ->
    set_worker_desc() because bdi->dev is NULL.

    This can happen because even though bdi_unregister() cancels all pending
    flushing work, nothing really prevents new ones from being queued from
    balance_dirty_pages() or other places.

    Fix the problem by clearing the BDI_registered bit in bdi_unregister()
    and checking it before scheduling any flushing work.
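
    A minimal sketch of the resulting guard in bdi_wakeup_thread_delayed()
    (lock and field names as used by the backing-dev code of that era):

        spin_lock_bh(&bdi->wb_lock);
        if (test_bit(BDI_registered, &bdi->state))
            queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
        spin_unlock_bh(&bdi->wb_lock);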

    Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977

    Reviewed-by: Tejun Heo
    Signed-off-by: Jan Kara
    Cc: Derek Basehore
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
    schedule work to writeback dirty inodes. The problem with this is that
    it can delay work that is scheduled for immediate execution, such as the
    work from sync_inodes_sb(). This can happen since mod_delayed_work()
    can now steal work from a workqueue. This fixes the problem by using
    queue_delayed_work() instead. This is a regression caused by commit
    839a8e8660b6 ("writeback: replace custom worker pool implementation with
    unbound workqueue").

    The reason that this causes a problem is that laptop-mode will change
    the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
    In the case that bdi_wakeup_thread_delayed() races with
    sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
    task. Even if dirty_writeback_centisecs is not long enough to cause a
    hung task, we still don't want to delay sync for that long.

    We fix the problem by using queue_delayed_work() when we want to
    schedule writeback sometime in future. This function doesn't change the
    timer if it is already armed.

    For the same reason, we also change bdi_writeback_workfn() to
    immediately queue the work again in the case that the work_list is not
    empty. The same problem can happen if the sync work is run on the
    rescue worker.
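
    The semantic difference between the two workqueue calls, as a sketch
    (the delay value is illustrative):

        /* mod_delayed_work() (re)arms the timer even when the work is
         * already pending, so it can push back work that was queued
         * for immediate execution: */
        mod_delayed_work(bdi_wq, &wb->dwork, timeout);

        /* queue_delayed_work() is a no-op when the work is already
         * pending, so an armed "run now" request stays untouched: */
        queue_delayed_work(bdi_wq, &wb->dwork, timeout);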

    [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
    Signed-off-by: Derek Basehore
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Reviewed-by: Tejun Heo
    Cc: Greg Kroah-Hartman
    Cc: "Darrick J. Wong"
    Cc: Derek Basehore
    Cc: Kees Cook
    Cc: Benson Leung
    Cc: Sonny Rao
    Cc: Luigi Semenzato
    Cc: Jens Axboe
    Cc: Dave Chinner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Derek Basehore
     

22 Feb, 2014

1 commit

    This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave
    Chinner has reported that the commit may cause some inodes to be left
    out of sync(2). This is because we can call redirty_tail() for some
    inode (which sets i_dirtied_when to the current time) after sync(2)
    has started, or similarly requeue_inode() can set i_dirtied_when to
    the current time if writeback had to skip some pages. The real problem
    is in the functions clobbering i_dirtied_when, but fixing that isn't
    trivial, so a revert is the safer choice for now.

    CC: stable@vger.kernel.org # >= 3.13
    Signed-off-by: Jan Kara

    Jan Kara
     

06 Feb, 2014

1 commit

  • GFS2 has carried what is more or less a copy of the
    write_cache_pages() for some time. It seems that this
    copy has slipped behind the core code over time. This
    patch brings it back up to date, and in addition adds the
    tracepoint which would otherwise be missing.

    We could go further, and eliminate some or all of the
    code duplication here. The issue is that if we do that,
    then the function we need to split out from the existing
    write_cache_pages(), which will look a lot like
    gfs2_jdata_write_pagevec(), would end up putting quite a
    lot of extra variables on the stack. I know that has been
    a problem in the past in the writeback code path, which
    is why I've hesitated to do it here.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

14 Dec, 2013

1 commit

    Commit 4f8ad655dbc8 "writeback: Refactor writeback_single_inode()" added
    a condition to skip clean inodes. However, this is wrong in WB_SYNC_ALL
    mode because there we also want to wait for outstanding writeback on a
    possibly clean inode. This was causing occasional data corruption issues
    on NFS, which uses sync_inode() to make sure all outstanding writes are
    flushed to the server before truncating the inode; with sync_inode()
    returning prematurely, the file was sometimes extended back by an
    outstanding write after it was truncated.

    So modify the test to also check for pages under writeback in
    WB_SYNC_ALL mode.

    CC: stable@vger.kernel.org # >= 3.5
    Fixes: 4f8ad655dbc82cf05d2edc11e66b78a42d38bf93
    Reported-and-tested-by: Dan Duval
    Signed-off-by: Jan Kara

    Jan Kara
     

13 Nov, 2013

2 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton : (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
    When there are processes heavily creating small files while sync(2) is
    running, it can easily happen that quite a few new files are created
    between the WB_SYNC_NONE and WB_SYNC_ALL passes of sync(2). That can
    happen especially if there are several busy filesystems (remember that
    sync traverses filesystems sequentially and, in the WB_SYNC_ALL phase,
    waits on one fs before starting on another fs). Because the WB_SYNC_ALL
    pass is slow (e.g. it causes a transaction commit and cache flush for
    each inode in ext3), the resulting sync(2) times are rather large.

    The following script reproduces the problem:

    function run_writers
    {
            for (( i = 0; i < 10; i++ )); do
                    mkdir $1/dir$i
                    for (( j = 0; j < 40000; j++ )); do
                            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
                    done &
            done
    }

    for dir in "$@"; do
            run_writers $dir
    done

    sleep 40
    time sync

    Fix the problem by disregarding, in the WB_SYNC_ALL pass, inodes that
    were dirtied after sync(2) was called. To allow for this,
    sync_inodes_sb() now takes a timestamp of when sync started, which is
    used for setting up the work for the flusher threads.
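
    A sketch of the resulting calling convention (simplified; the real
    plumbing threads the stamp through the flusher work item):

        unsigned long start = jiffies;  /* taken when sync(2) starts */
        ...
        sync_inodes_sb(sb, start);      /* WB_SYNC_ALL pass ignores
                                           inodes dirtied after 'start' */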

    To give some numbers, when the above script is run on two ext4
    filesystems on a simple SATA drive, the average sync time from 10 runs
    is 267.549 seconds with a standard deviation of 104.799426. With the
    patched kernel, the average sync time from 10 runs is 2.995 seconds
    with a standard deviation of 0.096.

    Signed-off-by: Jan Kara
    Reviewed-by: Fengguang Wu
    Reviewed-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

12 Sep, 2013

3 commits

    There is a race between marking an inode dirty and the writeback
    thread; see the following scenario. In this case, the writeback thread
    will not run even though there is dirty_io.

    __mark_inode_dirty()                    bdi_writeback_workfn()
    ...                                     ...
    spin_lock(&inode->i_lock);
    ...
    if (bdi_cap_writeback_dirty(bdi)) {
        <<< assume wb has dirty_io, so wakeup_bdi is false.
        <<< the following inode dirtying also has wakeup_bdi false.
        if (!wb_has_dirty_io(&bdi->wb))
            wakeup_bdi = true;
    }
    spin_unlock(&inode->i_lock);
                                            <<< assume the last dirty_io is removed here.
                                            pages_written = wb_do_writeback(wb);
                                            ...
                                            <<< work_list is empty and wb has no dirty_io,
                                            <<< so the delayed_work will not be queued.
                                            if (!list_empty(&bdi->work_list) ||
                                                (wb_has_dirty_io(wb) && dirty_writeback_interval))
                                                queue_delayed_work(bdi_wq, &wb->dwork,
                                                    msecs_to_jiffies(dirty_writeback_interval * 10));
    spin_lock(&bdi->wb.list_lock);
    inode->dirtied_when = jiffies;
    <<< new dirty_io is added.
    list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
    spin_unlock(&bdi->wb.list_lock);

    <<< though there is dirty_io now, wakeup_bdi is false,
    <<< so the writeback thread will not be woken up and
    <<< the new dirty_io will not be flushed.
    if (wakeup_bdi)
        bdi_wakeup_thread_delayed(bdi);

    Writeback will not run until new flush work is queued. This may cause
    a lot of dirty pages to stay in memory for a long time.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Jan Kara
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • It's not used globally and could be static.

    Signed-off-by: Wanpeng Li
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Fengguang Wu
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Jiri Kosina
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    In the case when the system contains no dirty pages,
    wakeup_flusher_threads() will submit WB_SYNC_NONE writeback for 0
    pages, so wb_writeback() exits immediately without doing anything,
    even though there are dirty inodes in the system. Thus sync(1) will
    write all the dirty inodes from the WB_SYNC_ALL writeback pass, which
    is slow.

    Fix the problem by using get_nr_dirty_pages() in wakeup_flusher_threads()
    instead of calculating the number of dirty pages manually. That function
    also takes the number of dirty inodes into account.
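
    For reference, get_nr_dirty_pages() at the time looked roughly like:

        static unsigned long get_nr_dirty_pages(void)
        {
            return global_page_state(NR_FILE_DIRTY) +
                   global_page_state(NR_UNSTABLE_NFS) +
                   get_nr_dirty_inodes();
        }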

    Signed-off-by: Jan Kara
    Reported-by: Paul Taysom
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

10 Jul, 2013

2 commits

    After commit 839a8e8660b6 ("writeback: replace custom worker pool
    implementation with unbound workqueue"), bdi_writeback_workfn() runs
    off bdi_writeback->dwork; on each execution it processes
    bdi->work_list and reschedules itself if there is more work to do,
    rather than flushing any work that races with the worker exiting.
    It is therefore unnecessary to check force_wait in wb_do_writeback()
    since it is always 0 after the mentioned commit. This patch removes
    force_wait from wb_do_writeback().

    Signed-off-by: Wanpeng Li
    Reviewed-by: Tejun Heo
    Reviewed-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • It's not used globally and could be static.

    Signed-off-by: Haicheng Li
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     

09 Jul, 2013

1 commit

    It is very likely that the block device inode will be part of the BDI
    dirty list as well. However it doesn't make sense to sort inodes on the
    b_io list just because of this inode (as it contains buffers from all
    over the device anyway). So save some CPU cycles, which is valuable
    since we hold the relatively contended wb->list_lock.
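
    The change boils down to leaving block device inodes out of the
    "saw multiple superblocks" test in move_expired_inodes(), roughly
    (helper name as in the era's VFS code):

        list_move(&inode->i_wb_list, &tmp);
        moved++;
        if (sb_is_blkdev_sb(inode->i_sb))
            continue;  /* a bdev inode alone never forces a sort */
        if (sb && sb != inode->i_sb)
            do_sb_sort = true;
        sb = inode->i_sb;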

    Signed-off-by: Jan Kara

    Jan Kara
     

03 Jul, 2013

1 commit

    When sync does its WB_SYNC_ALL writeback, it issues data IO and
    then immediately waits for IO completion. This is done in the
    context of the flusher thread, and hence completely ties up the
    flusher thread for the backing device until all the dirty inodes
    have been synced. On filesystems that are dirtying inodes constantly
    and quickly, this means the flusher thread can be tied up for
    minutes per sync call and hence badly affect system level write IO
    performance as the page cache cannot be cleaned quickly.

    We already have a wait loop for IO completion for sync(2), so cut
    this out of the flusher thread and delegate it to wait_sb_inodes().
    Hence we can do rapid IO submission, and then wait for it all to
    complete.
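
    Conceptually, the split of responsibilities afterwards looks like
    this sketch (field names abridged from the era's writeback code;
    treat as illustrative rather than the literal patch):

        void sync_inodes_sb(struct super_block *sb)
        {
            DECLARE_COMPLETION_ONSTACK(done);
            struct wb_writeback_work work = {
                .sb        = sb,
                .sync_mode = WB_SYNC_ALL,
                .nr_pages  = LONG_MAX,
                .done      = &done,
                .for_sync  = 1,  /* flusher submits IO, doesn't wait */
            };

            bdi_queue_work(sb->s_bdi, &work);
            wait_for_completion(&done);  /* submission finished */

            wait_sb_inodes(sb);          /* the one wait loop for sync(2) */
        }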

    Effect of sync on fsmark before the patch:

    FSUse%        Count         Size    Files/sec     App Overhead
    .....
         0       640000         4096      35154.6          1026984
         0       720000         4096      36740.3          1023844
         0       800000         4096      36184.6           916599
         0       880000         4096       1282.7          1054367
         0       960000         4096       3951.3           918773
         0      1040000         4096      40646.2           996448
         0      1120000         4096      43610.1           895647
         0      1200000         4096      40333.1           921048

    And a single sync pass took:

    real 0m52.407s
    user 0m0.000s
    sys 0m0.090s

    After the patch, there is no impact on fsmark results, and each
    individual sync(2) operation run concurrently with the same fsmark
    workload takes roughly 7s:

    real 0m6.930s
    user 0m0.000s
    sys 0m0.039s

    IOWs, sync is 7-8x faster on a busy filesystem and does not have an
    adverse impact on ongoing async data write operations.

    Signed-off-by: Dave Chinner
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

01 May, 2013

1 commit

  • Writeback has been recently converted to use workqueue instead of its
    private thread pool implementation. One negative side effect of this
    conversion is that there's no easy way to tell which backing device a
    writeback work item was working on at the time of a task dump, be it
    sysrq-t, BUG, WARN or whatever, which, according to our writeback
    brethren, is important in tracking down issues with a lot of mounted
    file systems on a lot of different devices.

    This patch restores that information using the new worker description
    facility. bdi_writeback_workfn() calls set_worker_desc() to identify
    which bdi it's working on. The description is printed out together with
    the workqueue name and worker function as in the following example dump.
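
    The hook itself is a one-liner at the top of bdi_writeback_workfn()
    (sketch):

        set_worker_desc("flush-%s", dev_name(bdi->dev));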

    WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
    Modules linked in:
    Pid: 28, comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24 empty empty/S3992
    Workqueue: writeback bdi_writeback_workfn (flush-8:16)
    ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
    ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
    ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x70/0xa0
    [] warn_slowpath_null+0x1a/0x20
    [] bdi_writeback_workfn+0x2b4/0x3c0
    [] process_one_work+0x1d7/0x660
    [] worker_thread+0x122/0x380
    [] kthread+0xea/0xf0
    [] ret_from_fork+0x7c/0xb0

    Signed-off-by: Tejun Heo
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

02 Apr, 2013

1 commit

  • Writeback implements its own worker pool - each bdi can be associated
    with a worker thread which is created and destroyed dynamically. The
    worker thread for the default bdi is always present and serves as the
    "forker" thread which forks off worker threads for other bdis.

    There's no reason for writeback to implement its own worker pool when
    using an unbound workqueue instead is much simpler and more efficient.
    This patch replaces custom worker pool implementation in writeback
    with an unbound workqueue.

    The conversion isn't too complicated, but the following points are
    worth mentioning.

    * bdi_writeback->last_active, task and wakeup_timer are removed.
    delayed_work ->dwork is added instead. Explicit timer handling is
    no longer necessary. Everything works by either queueing / modding
    / flushing / canceling the delayed_work item.

    * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
    bdi_writeback->dwork. On each execution, it processes
    bdi->work_list and reschedules itself if there are more things to
    do.

    The function also handles the low-memory condition, which used to be
    handled by the forker thread. If the function is running off a
    rescuer thread, it only writes out a limited number of pages so that
    the rescuer can serve other bdis too. This preserves the flusher
    creation failure behavior of the forker thread.

    * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
    bdi_writeback_workfn() about on-going bdi unregistration so that it
    always drains work_list even if it's running off the rescuer. Note
    that the original code was broken in this regard. Under memory
    pressure, a bdi could finish unregistration with non-empty
    work_list.

    * The default bdi is no longer special. It now is treated the same as
    any other bdi and bdi_cap_flush_forker() is removed.

    * BDI_pending is no longer used. Removed.

    * Some tracepoints become non-applicable. The following TPs are
    removed - writeback_nothread, writeback_wake_thread,
    writeback_wake_forker_thread, writeback_thread_start,
    writeback_thread_stop.

    Everything, including devices coming and going away and rescuer
    operation under simulated memory pressure, seems to work fine in my
    test setup.
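
    The heart of the conversion, as a sketch (flags per the description
    above; setup details abbreviated):

        /* one unbound workqueue serves every bdi's work item */
        bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM |
                                 WQ_FREEZABLE | WQ_UNBOUND, 0);

        /* per-bdi: the thread + wakeup_timer pair becomes a dwork */
        INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);

        /* waking the flusher is now just (re)queueing the item */
        mod_delayed_work(bdi_wq, &wb->dwork, 0);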

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Fengguang Wu
    Cc: Jeff Moyer

    Tejun Heo
     

01 Mar, 2013

1 commit


14 Jan, 2013

1 commit

  • Add tracepoints for page dirtying, writeback_single_inode start, inode
    dirtying and writeback. For the latter two inode events, a pair of
    events are defined to denote start and end of the operations (the
    starting one has _start suffix and the one w/o suffix happens after
    the operation is complete). These inode ops are FS specific and can
    be non-trivial and having enclosing tracepoints is useful for external
    tracers.

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for the io tracer and userland.
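
    For the inode-dirtying pair, the enclosing looks roughly like this
    inside __mark_inode_dirty():

        if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
            trace_writeback_dirty_inode_start(inode, flags);

            if (sb->s_op->dirty_inode)
                sb->s_op->dirty_inode(inode, flags);

            trace_writeback_dirty_inode(inode, flags);
        }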

    v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL
    before dereferencing.

    v3: buffer dirtying moved to a block TP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

12 Jan, 2013

1 commit

    writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
    with down_read_trylock() because

    - If ->s_umount is write locked, then the sb is not idle. That is,
    writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.

    - writeback_inodes_sb(_nr)_if_idle() grabs the s_umount lock when it wants
    to start writeback, which may lead to a deadlock problem when doing umount.
    In order to fix the problem, ext4 and btrfs implemented their own writeback
    functions instead of writeback_inodes_sb(_nr)_if_idle(), but this introduced
    redundant code; it is better to implement a new
    writeback_inodes_sb(_nr)_if_idle().

    The name of these two functions is cumbersome, so rename them to
    try_to_writeback_inodes_sb(_nr).
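
    A minimal sketch of the trylock-based variant (return convention as
    described; the _nr flavour adds a page count):

        int try_to_writeback_inodes_sb(struct super_block *sb,
                                       enum wb_reason reason)
        {
            if (!down_read_trylock(&sb->s_umount))
                return 0;  /* write-locked: not idle, don't wait */

            writeback_inodes_sb(sb, reason);
            up_read(&sb->s_umount);
            return 1;
        }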

    This idea came from Christoph Hellwig.
    Some code is from the patch of Kamal Mostafa.

    Reviewed-by: Jan Kara
    Signed-off-by: Miao Xie
    Signed-off-by: Fengguang Wu

    Miao Xie
     

27 Nov, 2012

1 commit

  • Commit 169ebd90131b ("writeback: Avoid iput() from flusher thread")
    removed iget-iput pair from inode writeback. As a side effect, inodes
    that are dirty during iput_final() call won't be ever added to inode LRU
    (iput_final() doesn't add dirty inodes to the LRU and later, when the
    inode is cleaned, there's no one to add the inode there). Thus inodes
    are effectively unreclaimable until someone looks them up again.

    The practical effect of this bug is limited by the fact that inodes are
    pinned by a dentry for long enough that the inode gets cleaned. But
    still the bug can have nasty consequences leading up to OOM conditions
    under certain circumstances. The following can easily reproduce the
    problem:

    for (( i = 0; i < 1000; i++ )); do
            mkdir $i
            for (( j = 0; j < 1000; j++ )); do
                    touch $i/$j
                    echo 2 > /proc/sys/vm/drop_caches
            done
    done

    then one needs to run 'sync; ls -lR' to make inodes reclaimable again.

    We fix the issue by inserting unused clean inodes into the LRU after
    writeback finishes in inode_sync_complete().
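
    The fix itself is small - roughly, in inode_sync_complete():

        static void inode_sync_complete(struct inode *inode)
        {
            inode->i_state &= ~I_SYNC;
            /* if the inode is clean and unused, put it on the LRU now */
            inode_add_lru(inode);
            /* waiters must see I_SYNC cleared before being woken up */
            smp_mb();
            wake_up_bit(&inode->i_state, __I_SYNC);
        }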

    Signed-off-by: Jan Kara
    Reported-by: OGAWA Hirofumi
    Cc: Al Viro
    Cc: OGAWA Hirofumi
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

08 Oct, 2012

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The big new feature added this time is supporting online resizing
    using the meta_bg feature. This allows us to resize file systems
    which are greater than 16TB. In addition, the speed of online
    resizing has been improved in general.

    We also fix a number of races, some of which could lead to deadlocks,
    in ext4's Asynchronous I/O and online defrag support, thanks to good
    work by Dmitry Monakhov.

    There are also a large number of more minor bug fixes and cleanups
    from a number of other ext4 contributors, quite a few of whom have
    submitted fixes for the first time."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
    ext4: fix ext4_flush_completed_IO wait semantics
    ext4: fix mtime update in nodelalloc mode
    ext4: fix ext_remove_space for punch_hole case
    ext4: punch_hole should wait for DIO writers
    ext4: serialize truncate with owerwrite DIO workers
    ext4: endless truncate due to nonlocked dio readers
    ext4: serialize unlocked dio reads with truncate
    ext4: serialize dio nonlocked reads with defrag workers
    ext4: completed_io locking cleanup
    ext4: fix unwritten counter leakage
    ext4: give i_aiodio_unwritten a more appropriate name
    ext4: ext4_inode_info diet
    ext4: convert to use leXX_add_cpu()
    ext4: ext4_bread usage audit
    fs: reserve fallocate flag codepoint
    ext4: remove redundant offset check in mext_check_arguments()
    ext4: don't clear orphan list on ro mount with errors
    jbd2: fix assertion failure in commit code due to lacking transaction credits
    ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
    ext4: enable FITRIM ioctl on bigalloc file system
    ...

    Linus Torvalds
     

02 Oct, 2012

1 commit

  • Pull the trivial tree from Jiri Kosina:
    "Tiny usual fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    doc: fix old config name of kprobetrace
    fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
    btrfs: fix the commment for the action flags in delayed-ref.h
    btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
    vfs: fix kerneldoc for generic_fh_to_parent()
    treewide: fix comment/printk/variable typos
    ipr: fix small coding style issues
    doc: fix broken utf8 encoding
    nfs: comment fix
    platform/x86: fix asus_laptop.wled_type module parameter
    mfd: printk/comment fixes
    doc: getdelays.c: remember to close() socket on error in create_nl_socket()
    doc: aliasing-test: close fd on write error
    mmc: fix comment typos
    dma: fix comments
    spi: fix comment/printk typos in spi
    Coccinelle: fix typo in memdup_user.cocci
    tmiofb: missing NULL pointer checks
    tools: perf: Fix typo in tools/perf
    tools/testing: fix comment / output typos
    ...

    Linus Torvalds
     

20 Sep, 2012

1 commit

    In ext4_nonda_switch(), if the file system is getting full we used to
    call writeback_inodes_sb_if_idle(). The problem is that we can
    already be holding i_mutex, and this causes a potential deadlock when
    writeback_inodes_sb_if_idle() tries to take s_umount. (See the
    lockdep output below).

    As it turns out we don't need to hold s_umount; the fact that we
    are in the middle of the write(2) system call will keep the superblock
    pinned. Unfortunately writeback_inodes_sb() checks to make sure
    s_umount is taken, and the VFS uses a different mechanism for making
    sure the file system doesn't get unmounted out from under us. The
    simplest way of dealing with this is to simply grab s_umount
    using a trylock, and skip kicking the writeback flusher thread in the
    very unlikely case that we can't take a read lock on s_umount without
    blocking.

    Also, we now check the criteria for kicking the writeback thread
    before we decide whether to fall back to non-delayed writeback, so
    if there are any outstanding delayed allocation writes, we try to get
    them resolved as soon as possible.
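
    A sketch of the trylock around the flusher kick in ext4_nonda_switch()
    (identifiers abridged from the era's ext4 code):

        /* start pushing delalloc when 1/2 of free blocks are dirty */
        if (free_blocks < 2 * dirty_blocks &&
            down_read_trylock(&sb->s_umount)) {
            writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
            up_read(&sb->s_umount);
        }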

    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc1-00042-gce894ca #367 Not tainted
    -------------------------------------------------------
    dd/8298 is trying to acquire lock:
    (&type->s_umount_key#18){++++..}, at: [] writeback_inodes_sb_if_idle+0x28/0x46

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#8){+.+...}, at: [] generic_file_aio_write+0x5f/0xd3

    which lock already depends on the new lock.

    2 locks held by dd/8298:
    #0: (sb_writers#2){.+.+.+}, at: [] generic_file_aio_write+0x56/0xd3
    #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [] generic_file_aio_write+0x5f/0xd3

    stack backtrace:
    Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
    Call Trace:
    [] ? console_unlock+0x345/0x372
    [] print_circular_bug+0x190/0x19d
    [] __lock_acquire+0x86d/0xb6c
    [] ? mark_held_locks+0x5c/0x7b
    [] lock_acquire+0x66/0xb9
    [] ? writeback_inodes_sb_if_idle+0x28/0x46
    [] down_read+0x28/0x58
    [] ? writeback_inodes_sb_if_idle+0x28/0x46
    [] writeback_inodes_sb_if_idle+0x28/0x46
    [] ext4_nonda_switch+0xe1/0xf4
    [] ext4_da_write_begin+0x27/0x193
    [] generic_file_buffered_write+0xc8/0x1bb
    [] __generic_file_aio_write+0x1dd/0x205
    [] generic_file_aio_write+0x78/0xd3
    [] ext4_file_write+0x480/0x4a6
    [] ? __lock_acquire+0x41e/0xb6c
    [] ? sched_clock_cpu+0x11a/0x13e
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? local_clock+0x37/0x4e
    [] do_sync_write+0x67/0x9d
    [] ? wait_on_retry_sync_kiocb+0x44/0x44
    [] vfs_write+0x7b/0xe6
    [] sys_write+0x3b/0x64
    [] syscall_call+0x7/0xb

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

01 Aug, 2012

1 commit

    Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism has not been used any more. But the old interface exported
    through /proc/sys/vm/nr_pdflush_threads still exists and is obviously
    useless.

    For backwards compatibility, print warning information and return 2
    to notify users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

31 Jul, 2012

1 commit

  • Pull writeback updates from Wu Fengguang:
    "Use time based periods to age the writeback proportions, which can
    adapt equally well to fast/slow devices."

    Fix up trivial conflict in comment in fs/sync.c

    * tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Fix some comment errors
    block: Convert BDI proportion calculations to flexible proportions
    lib: Fix possible deadlock in flexible proportion code
    lib: Proportions with flexible period

    Linus Torvalds
     

23 Jul, 2012

1 commit

    In principle, a filesystem may want to have ->sync_fs() called during sync(1)
    even though it does not have a bdi (i.e. s_bdi is set to noop_backing_dev_info).
    Only the writeback code really needs bdi set to something reasonable. So move
    the checks to where they are more logical.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

09 Jun, 2012

2 commits

  • Signed-off-by: Wanpeng Li
    Signed-off-by: Fengguang Wu

    Wanpeng Li
     
    Fix a bug introduced by commit 169ebd90131b ("writeback: Avoid iput() from
    flusher thread"). We have to have wb->list_lock locked when
    restarting the writeback loop after having waited for inode writeback.

    Bug description by Ted Tso:

    I can reproduce this fairly easily by using ext4 w/o a journal, running
    under KVM with 1024megs memory, with fsstress (xfstests #13):

    [ 45.153294] =====================================
    [ 45.154784] [ BUG: bad unlock balance detected! ]
    [ 45.155591] 3.5.0-rc1-00002-gb22b1f1 #124 Not tainted
    [ 45.155591] -------------------------------------
    [ 45.155591] flush-254:16/2499 is trying to release lock (&(&wb->list_lock)->rlock) at:
    [ 45.155591] [] writeback_sb_inodes+0x160/0x327
    [ 45.155591] but there are no more locks to release!

    Reported-by: Theodore Ts'o
    Tested-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

06 May, 2012

1 commit

    Doing iput() from the flusher thread (writeback_sb_inodes()) can create
    problems because iput() can do a lot of work - for example truncating the
    inode if it's the last iput on an unlinked file. Some filesystems depend
    on the flusher thread progressing (e.g. because they need to flush delay
    allocated blocks to reduce allocation uncertainty), and so the flusher
    thread doing truncates creates interesting dependencies and possibilities
    for deadlocks.

    We get rid of iput() in flusher thread by using the fact that I_SYNC inode
    flag effectively pins the inode in memory. So if we take care to either hold
    i_lock or have I_SYNC set, we can get away without taking inode reference
    in writeback_sb_inodes().

    As a side effect of these changes, we also fix a possible use-after-free
    in wb_writeback() because the inode_wait_for_writeback() call could try
    to reacquire i_lock on an inode that was already freed.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara