16 Jan, 2015

1 commit

  • commit 9c6ac78eb3521c5937b2dd8a7d1b300f41092f45 upstream.

    After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
    tests inode->i_state locklessly to see whether it already has all the
    necessary I_DIRTY bits set. The comment above the barrier doesn't
    contain any useful information - memory barriers can't ensure "changes
    are seen by all cpus" by themselves.

    And sure enough, it was broken. Please consider the following
    scenario.

    CPU 0                                  CPU 1
    -------------------------------------------------------------------------

                                           enters __writeback_single_inode()
                                           grabs inode->i_lock
                                           tests PAGECACHE_TAG_DIRTY which is clear
    enters __set_page_dirty()
    grabs mapping->tree_lock
    sets PAGECACHE_TAG_DIRTY
    releases mapping->tree_lock
    leaves __set_page_dirty()

    enters __mark_inode_dirty()
    smp_mb()
    sees I_DIRTY_PAGES set
    leaves __mark_inode_dirty()
                                           clears I_DIRTY_PAGES
                                           releases inode->i_lock

    Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
    to lead to an immediately critical problem because requeue_inode()
    later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
    deciding whether the inode needs to be requeued for IO and there are
    enough unintentional memory barriers in between, so while the inode
    ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
    IO list.

    The lack of an explicit barrier may also theoretically affect the
    other I_DIRTY bits which deal with metadata dirtiness. There is no
    guarantee that a strong enough barrier exists between
    I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
    inode. The filesystem inode writeout path likely has enough operations
    which can behave as a full barrier, but it's theoretically possible
    that the writeout may not see all the updates from ->dirty_inode().

    Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
    that I_DIRTY_PAGES needs a special treatment as it always needs to be
    cleared to be interlocked with the lockless test on
    __mark_inode_dirty() side. It's cleared unconditionally and
    reinstated after smp_mb() if the mapping still has dirty pages.

    Also add comments explaining how and why the barriers are paired.
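
    The resulting sequence in __writeback_single_inode() then looks
    roughly like the following sketch (simplified; unrelated state
    handling omitted):

        dirty = inode->i_state & I_DIRTY;
        inode->i_state &= ~I_DIRTY;

        /*
         * Paired with smp_mb() in __mark_inode_dirty().  This allows
         * __mark_inode_dirty() to test i_state without grabbing
         * i_lock - either they see the I_DIRTY bits cleared or we see
         * the dirtied inode.
         */
        smp_mb();

        /* Reinstate I_DIRTY_PAGES if the mapping is still dirty. */
        if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY))
            inode->i_state |= I_DIRTY_PAGES;

        spin_unlock(&inode->i_lock);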

    Lightly tested.

    Signed-off-by: Tejun Heo
    Cc: Jan Kara
    Cc: Mikulas Patocka
    Cc: Jens Axboe
    Cc: Al Viro
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.
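
    As a sketch of the calling-convention change (using an inode state
    bit as an illustrative caller; the exact call sites vary):

        /* before: the caller supplied an action function */
        wait_on_bit(&inode->i_state, __I_NEW, inode_wait,
                    TASK_UNINTERRUPTIBLE);

        /* after: the default action is implied; only 'mode' decides
         * whether a pending signal aborts the wait */
        if (wait_on_bit(&inode->i_state, __I_NEW, TASK_INTERRUPTIBLE))
            return -ERESTARTSYS;  /* caller picks its own error code */

        /* IO-flavoured waits use the io_schedule() based variant */
        wait_on_bit_io(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);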

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible".

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all; instead something higher in the stack will.
    So the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

05 Apr, 2014

1 commit

  • Pull GFS2 updates from Steven Whitehouse:
    "One of the main highlights this time, is not the patches themselves
    but instead the widening contributor base. It is good to see that
    interest is increasing in GFS2, and I'd like to thank all the
    contributors to this patch set.

    In addition to the usual set of bug fixes and clean ups, there are
    patches to improve inode creation performance when xattrs are required
    and some improvements to the transaction code which is intended to
    help improve scalability after further changes in due course.

    Journal extent mapping is also updated to make it more efficient and
    again, this is a foundation for future work in this area.

    The maximum number of ACLs has been increased to 300 (for a 4k block
    size) which means that even with a few additional xattrs from selinux,
    everything should fit within a single fs block.

    There is also a patch to bring GFS2's own copy of the writepages code
    up to the same level as the core VFS. Eventually we may be able to
    merge some of this code, since it is fairly similar.

    The other major change this time, is bringing consistency to the
    printing of messages via fs_, pr_ macros"

    * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (29 commits)
    GFS2: Fix address space from page function
    GFS2: Fix uninitialized VFS inode in gfs2_create_inode
    GFS2: Fix return value in slot_get()
    GFS2: inline function gfs2_set_mode
    GFS2: Remove extraneous function gfs2_security_init
    GFS2: Increase the max number of ACLs
    GFS2: Re-add a call to log_flush_wait when flushing the journal
    GFS2: Ensure workqueue is scheduled after noexp request
    GFS2: check NULL return value in gfs2_ok_to_move
    GFS2: Convert gfs2_lm_withdraw to use fs_err
    GFS2: Use fs_ more often
    GFS2: Use pr_ more consistently
    GFS2: Move recovery variables to journal structure in memory
    GFS2: global conversion to pr_foo()
    GFS2: return -E2BIG if hit the maximum limits of ACLs
    GFS2: Clean up journal extent mapping
    GFS2: replace kmalloc - __vmalloc / memset 0
    GFS2: Remove extra "if" in gfs2_log_flush()
    fs: NULL dereference in posix_acl_to_xattr()
    GFS2: Move log buffer accounting to transaction
    ...

    Linus Torvalds
     

04 Apr, 2014

2 commits

    After commit 839a8e8660b6 ("writeback: replace custom worker pool
    implementation with unbound workqueue"), when a device is removed while
    we are writing to it, we crash in bdi_writeback_workfn() ->
    set_worker_desc() because bdi->dev is NULL.

    This can happen because even though bdi_unregister() cancels all pending
    flushing work, nothing really prevents new ones from being queued from
    balance_dirty_pages() or other places.

    Fix the problem by clearing the BDI_registered bit in bdi_unregister()
    and checking it before scheduling any flushing work.
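
    A minimal sketch of the resulting guard in bdi_wakeup_thread_delayed()
    (lock and field names as used by the backing-dev code of that era):

        spin_lock_bh(&bdi->wb_lock);
        if (test_bit(BDI_registered, &bdi->state))
            queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
        spin_unlock_bh(&bdi->wb_lock);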

    Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977

    Reviewed-by: Tejun Heo
    Signed-off-by: Jan Kara
    Cc: Derek Basehore
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
    schedule work to writeback dirty inodes. The problem with this is that
    it can delay work that is scheduled for immediate execution, such as the
    work from sync_inodes_sb(). This can happen since mod_delayed_work()
    can now steal work from a workqueue. This fixes the problem by using
    queue_delayed_work() instead. This is a regression caused by commit
    839a8e8660b6 ("writeback: replace custom worker pool implementation with
    unbound workqueue").

    The reason that this causes a problem is that laptop-mode will change
    the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
    In the case that bdi_wakeup_thread_delayed() races with
    sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
    task. Even if dirty_writeback_centisecs is not long enough to cause a
    hung task, we still don't want to delay sync for that long.

    We fix the problem by using queue_delayed_work() when we want to
    schedule writeback sometime in future. This function doesn't change the
    timer if it is already armed.

    For the same reason, we also change bdi_writeback_workfn() to
    immediately queue the work again in the case that the work_list is not
    empty. The same problem can happen if the sync work is run on the
    rescue worker.
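
    The semantic difference between the two workqueue calls, as a sketch
    (the delay value is illustrative):

        /* mod_delayed_work() (re)arms the timer even when the work is
         * already pending, so it can push back work that was queued
         * for immediate execution: */
        mod_delayed_work(bdi_wq, &wb->dwork, timeout);

        /* queue_delayed_work() is a no-op when the work is already
         * pending, so an armed "run now" request stays untouched: */
        queue_delayed_work(bdi_wq, &wb->dwork, timeout);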

    [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
    Signed-off-by: Derek Basehore
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Reviewed-by: Tejun Heo
    Cc: Greg Kroah-Hartman
    Cc: "Darrick J. Wong"
    Cc: Derek Basehore
    Cc: Kees Cook
    Cc: Benson Leung
    Cc: Sonny Rao
    Cc: Luigi Semenzato
    Cc: Jens Axboe
    Cc: Dave Chinner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Derek Basehore
     

22 Feb, 2014

1 commit

    This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave
    Chinner has reported that the commit may cause some inodes to be left
    out of sync(2). This is because we can call redirty_tail() for some
    inode (which sets i_dirtied_when to the current time) after sync(2)
    has started, or similarly requeue_inode() can set i_dirtied_when to
    the current time if writeback had to skip some pages. The real problem
    is in the functions clobbering i_dirtied_when, but fixing that isn't
    trivial, so a revert is the safer choice for now.

    CC: stable@vger.kernel.org # >= 3.13
    Signed-off-by: Jan Kara

    Jan Kara
     

06 Feb, 2014

1 commit

  • GFS2 has carried what is more or less a copy of the
    write_cache_pages() for some time. It seems that this
    copy has slipped behind the core code over time. This
    patch brings it back up to date, and in addition adds the
    tracepoint which would otherwise be missing.

    We could go further, and eliminate some or all of the
    code duplication here. The issue is that if we do that,
    then the function we need to split out from the existing
    write_cache_pages(), which will look a lot like
    gfs2_jdata_write_pagevec(), would end up putting quite a
    lot of extra variables on the stack. I know that has been
    a problem in the past in the writeback code path, which
    is why I've hesitated to do it here.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     

14 Dec, 2013

1 commit

    Commit 4f8ad655dbc8 "writeback: Refactor writeback_single_inode()" added
    a condition to skip clean inodes. However, this is wrong in WB_SYNC_ALL
    mode because there we also want to wait for outstanding writeback on a
    possibly clean inode. This was causing occasional data corruption issues
    on NFS, which uses sync_inode() to make sure all outstanding writes are
    flushed to the server before truncating the inode; with sync_inode()
    returning prematurely, the file was sometimes extended back by an
    outstanding write after it was truncated.

    So modify the test to also check for pages under writeback in
    WB_SYNC_ALL mode.

    CC: stable@vger.kernel.org # >= 3.5
    Fixes: 4f8ad655dbc82cf05d2edc11e66b78a42d38bf93
    Reported-and-tested-by: Dan Duval
    Signed-off-by: Jan Kara

    Jan Kara
     

13 Nov, 2013

2 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton : (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.c: use dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
    When there are processes heavily creating small files while sync(2) is
    running, it can easily happen that quite a few new files are created
    between the WB_SYNC_NONE and WB_SYNC_ALL passes of sync(2). That can
    happen especially if there are several busy filesystems (remember that
    sync traverses filesystems sequentially and, in the WB_SYNC_ALL phase,
    waits on one fs before starting on another fs). Because the WB_SYNC_ALL
    pass is slow (e.g. it causes a transaction commit and cache flush for
    each inode in ext3), the resulting sync(2) times are rather large.

    The following script reproduces the problem:

    function run_writers
    {
            for (( i = 0; i < 10; i++ )); do
                    mkdir $1/dir$i
                    for (( j = 0; j < 40000; j++ )); do
                            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
                    done &
            done
    }

    for dir in "$@"; do
            run_writers $dir
    done

    sleep 40
    time sync

    Fix the problem by disregarding, in the WB_SYNC_ALL pass, inodes that
    were dirtied after sync(2) was called. To allow for this,
    sync_inodes_sb() now takes a timestamp of when sync started, which is
    used for setting up the work for the flusher threads.
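
    A sketch of the resulting calling convention (simplified; the real
    plumbing threads the stamp through the flusher work item):

        unsigned long start = jiffies;  /* taken when sync(2) starts */
        ...
        sync_inodes_sb(sb, start);      /* WB_SYNC_ALL pass ignores
                                           inodes dirtied after 'start' */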

    To give some numbers, when the above script is run on two ext4
    filesystems on a simple SATA drive, the average sync time from 10 runs
    is 267.549 seconds with a standard deviation of 104.799426. With the
    patched kernel, the average sync time from 10 runs is 2.995 seconds
    with a standard deviation of 0.096.

    Signed-off-by: Jan Kara
    Reviewed-by: Fengguang Wu
    Reviewed-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

12 Sep, 2013

3 commits

    There is a race between marking an inode dirty and the writeback
    thread; see the following scenario. In this case, the writeback thread
    will not run even though there is dirty_io.

    __mark_inode_dirty()                    bdi_writeback_workfn()
    ...                                     ...
    spin_lock(&inode->i_lock);
    ...
    if (bdi_cap_writeback_dirty(bdi)) {
        <<< assume wb has dirty_io, so wakeup_bdi is false.
        <<< the following inode dirtying also has wakeup_bdi false.
        if (!wb_has_dirty_io(&bdi->wb))
            wakeup_bdi = true;
    }
    spin_unlock(&inode->i_lock);
                                            <<< assume the last dirty_io is removed here.
                                            pages_written = wb_do_writeback(wb);
                                            ...
                                            <<< work_list is empty and wb has no dirty_io,
                                            <<< so the delayed_work will not be queued.
                                            if (!list_empty(&bdi->work_list) ||
                                                (wb_has_dirty_io(wb) && dirty_writeback_interval))
                                                queue_delayed_work(bdi_wq, &wb->dwork,
                                                    msecs_to_jiffies(dirty_writeback_interval * 10));
    spin_lock(&bdi->wb.list_lock);
    inode->dirtied_when = jiffies;
    <<< new dirty_io is added.
    list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
    spin_unlock(&bdi->wb.list_lock);

    <<< though there is dirty_io now, wakeup_bdi is false,
    <<< so the writeback thread will not be woken up and
    <<< the new dirty_io will not be flushed.
    if (wakeup_bdi)
        bdi_wakeup_thread_delayed(bdi);

    Writeback will not run until new flush work is queued. This may cause
    a lot of dirty pages to stay in memory for a long time.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Jan Kara
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • It's not used globally and could be static.

    Signed-off-by: Wanpeng Li
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Fengguang Wu
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Jiri Kosina
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    In the case when the system contains no dirty pages,
    wakeup_flusher_threads() will submit WB_SYNC_NONE writeback for 0
    pages, so wb_writeback() exits immediately without doing anything,
    even though there are dirty inodes in the system. Thus sync(1) will
    write all the dirty inodes from the WB_SYNC_ALL writeback pass, which
    is slow.

    Fix the problem by using get_nr_dirty_pages() in wakeup_flusher_threads()
    instead of calculating the number of dirty pages manually. That function
    also takes the number of dirty inodes into account.
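
    For reference, get_nr_dirty_pages() at the time looked roughly like:

        static unsigned long get_nr_dirty_pages(void)
        {
            return global_page_state(NR_FILE_DIRTY) +
                   global_page_state(NR_UNSTABLE_NFS) +
                   get_nr_dirty_inodes();
        }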

    Signed-off-by: Jan Kara
    Reported-by: Paul Taysom
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

10 Jul, 2013

2 commits

    After commit 839a8e8660b6 ("writeback: replace custom worker pool
    implementation with unbound workqueue"), bdi_writeback_workfn() runs
    off bdi_writeback->dwork; on each execution it processes
    bdi->work_list and reschedules itself if there is more work to do,
    rather than flushing any work that races with the worker exiting.
    It is therefore unnecessary to check force_wait in wb_do_writeback()
    since it is always 0 after the mentioned commit. This patch removes
    force_wait from wb_do_writeback().

    Signed-off-by: Wanpeng Li
    Reviewed-by: Tejun Heo
    Reviewed-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • It's not used globally and could be static.

    Signed-off-by: Haicheng Li
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     

09 Jul, 2013

1 commit

    It is very likely that the block device inode will be part of the BDI
    dirty list as well. However it doesn't make sense to sort inodes on the
    b_io list just because of this inode (as it contains buffers from all
    over the device anyway). So save some CPU cycles, which is valuable
    since we hold the relatively contended wb->list_lock.
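
    The change boils down to leaving block device inodes out of the
    "saw multiple superblocks" test in move_expired_inodes(), roughly
    (helper name as in the era's VFS code):

        list_move(&inode->i_wb_list, &tmp);
        moved++;
        if (sb_is_blkdev_sb(inode->i_sb))
            continue;  /* a bdev inode alone never forces a sort */
        if (sb && sb != inode->i_sb)
            do_sb_sort = true;
        sb = inode->i_sb;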

    Signed-off-by: Jan Kara

    Jan Kara
     

03 Jul, 2013

1 commit

    When sync does its WB_SYNC_ALL writeback, it issues data IO and
    then immediately waits for IO completion. This is done in the
    context of the flusher thread, and hence completely ties up the
    flusher thread for the backing device until all the dirty inodes
    have been synced. On filesystems that are dirtying inodes constantly
    and quickly, this means the flusher thread can be tied up for
    minutes per sync call and hence badly affect system level write IO
    performance as the page cache cannot be cleaned quickly.

    We already have a wait loop for IO completion for sync(2), so cut
    this out of the flusher thread and delegate it to wait_sb_inodes().
    Hence we can do rapid IO submission, and then wait for it all to
    complete.
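
    Conceptually, the split of responsibilities afterwards looks like
    this sketch (field names abridged from the era's writeback code;
    treat as illustrative rather than the literal patch):

        void sync_inodes_sb(struct super_block *sb)
        {
            DECLARE_COMPLETION_ONSTACK(done);
            struct wb_writeback_work work = {
                .sb        = sb,
                .sync_mode = WB_SYNC_ALL,
                .nr_pages  = LONG_MAX,
                .done      = &done,
                .for_sync  = 1,  /* flusher submits IO, doesn't wait */
            };

            bdi_queue_work(sb->s_bdi, &work);
            wait_for_completion(&done);  /* submission finished */

            wait_sb_inodes(sb);          /* the one wait loop for sync(2) */
        }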

    Effect of sync on fsmark before the patch:

    FSUse%        Count         Size    Files/sec     App Overhead
    .....
         0       640000         4096      35154.6          1026984
         0       720000         4096      36740.3          1023844
         0       800000         4096      36184.6           916599
         0       880000         4096       1282.7          1054367
         0       960000         4096       3951.3           918773
         0      1040000         4096      40646.2           996448
         0      1120000         4096      43610.1           895647
         0      1200000         4096      40333.1           921048

    And a single sync pass took:

    real 0m52.407s
    user 0m0.000s
    sys 0m0.090s

    After the patch, there is no impact on fsmark results, and each
    individual sync(2) operation run concurrently with the same fsmark
    workload takes roughly 7s:

    real 0m6.930s
    user 0m0.000s
    sys 0m0.039s

    IOWs, sync is 7-8x faster on a busy filesystem and does not have an
    adverse impact on ongoing async data write operations.

    Signed-off-by: Dave Chinner
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

01 May, 2013

1 commit

  • Writeback has been recently converted to use workqueue instead of its
    private thread pool implementation. One negative side effect of this
    conversion is that there's no easy way to tell which backing device a
    writeback work item was working on at the time of a task dump, be it
    sysrq-t, BUG, WARN or whatever, which, according to our writeback
    brethren, is important in tracking down issues with a lot of mounted
    file systems on a lot of different devices.

    This patch restores that information using the new worker description
    facility. bdi_writeback_workfn() calls set_worker_desc() to identify
    which bdi it's working on. The description is printed out together with
    the workqueue name and worker function as in the following example dump.
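
    The hook itself is a one-liner at the top of bdi_writeback_workfn()
    (sketch):

        set_worker_desc("flush-%s", dev_name(bdi->dev));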

    WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
    Modules linked in:
    Pid: 28, comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24 empty empty/S3992
    Workqueue: writeback bdi_writeback_workfn (flush-8:16)
    ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
    ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
    ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x70/0xa0
    [] warn_slowpath_null+0x1a/0x20
    [] bdi_writeback_workfn+0x2b4/0x3c0
    [] process_one_work+0x1d7/0x660
    [] worker_thread+0x122/0x380
    [] kthread+0xea/0xf0
    [] ret_from_fork+0x7c/0xb0

    Signed-off-by: Tejun Heo
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

02 Apr, 2013

1 commit

  • Writeback implements its own worker pool - each bdi can be associated
    with a worker thread which is created and destroyed dynamically. The
    worker thread for the default bdi is always present and serves as the
    "forker" thread which forks off worker threads for other bdis.

    There's no reason for writeback to implement its own worker pool when
    using an unbound workqueue instead is much simpler and more efficient.
    This patch replaces custom worker pool implementation in writeback
    with an unbound workqueue.

    The conversion isn't too complicated, but the following points are
    worth mentioning.

    * bdi_writeback->last_active, task and wakeup_timer are removed.
    delayed_work ->dwork is added instead. Explicit timer handling is
    no longer necessary. Everything works by either queueing / modding
    / flushing / canceling the delayed_work item.

    * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
    bdi_writeback->dwork. On each execution, it processes
    bdi->work_list and reschedules itself if there are more things to
    do.

    The function also handles the low-memory condition, which used to be
    handled by the forker thread. If the function is running off a
    rescuer thread, it only writes out a limited number of pages so that
    the rescuer can serve other bdis too. This preserves the flusher
    creation failure behavior of the forker thread.

    * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
    bdi_writeback_workfn() about on-going bdi unregistration so that it
    always drains work_list even if it's running off the rescuer. Note
    that the original code was broken in this regard. Under memory
    pressure, a bdi could finish unregistration with non-empty
    work_list.

    * The default bdi is no longer special. It now is treated the same as
    any other bdi and bdi_cap_flush_forker() is removed.

    * BDI_pending is no longer used. Removed.

    * Some tracepoints become non-applicable. The following TPs are
    removed - writeback_nothread, writeback_wake_thread,
    writeback_wake_forker_thread, writeback_thread_start,
    writeback_thread_stop.

    Everything, including devices coming and going away and rescuer
    operation under simulated memory pressure, seems to work fine in my
    test setup.
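
    The heart of the conversion, as a sketch (flags per the description
    above; setup details abbreviated):

        /* one unbound workqueue serves every bdi's work item */
        bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM |
                                 WQ_FREEZABLE | WQ_UNBOUND, 0);

        /* per-bdi: the thread + wakeup_timer pair becomes a dwork */
        INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);

        /* waking the flusher is now just (re)queueing the item */
        mod_delayed_work(bdi_wq, &wb->dwork, 0);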

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Fengguang Wu
    Cc: Jeff Moyer

    Tejun Heo
     

01 Mar, 2013

1 commit


14 Jan, 2013

1 commit

  • Add tracepoints for page dirtying, writeback_single_inode start, inode
    dirtying and writeback. For the latter two inode events, a pair of
    events are defined to denote start and end of the operations (the
    starting one has _start suffix and the one w/o suffix happens after
    the operation is complete). These inode ops are FS specific and can
    be non-trivial and having enclosing tracepoints is useful for external
    tracers.

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for the io tracer and userland.
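
    For the inode-dirtying pair, the enclosing looks roughly like this
    inside __mark_inode_dirty():

        if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
            trace_writeback_dirty_inode_start(inode, flags);

            if (sb->s_op->dirty_inode)
                sb->s_op->dirty_inode(inode, flags);

            trace_writeback_dirty_inode(inode, flags);
        }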

    v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL
    before dereferencing.

    v3: buffer dirtying moved to a block TP.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

12 Jan, 2013

1 commit

    writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
    with down_read_trylock() because

    - If ->s_umount is write locked, then the sb is not idle. That is,
    writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.

    - writeback_inodes_sb(_nr)_if_idle() grabs the s_umount lock when it wants
    to start writeback, which may lead to a deadlock problem when doing umount.
    In order to fix the problem, ext4 and btrfs implemented their own writeback
    functions instead of writeback_inodes_sb(_nr)_if_idle(), but this introduced
    redundant code; it is better to implement a new
    writeback_inodes_sb(_nr)_if_idle().

    The name of these two functions is cumbersome, so rename them to
    try_to_writeback_inodes_sb(_nr).
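
    A minimal sketch of the trylock-based variant (return convention as
    described; the _nr flavour adds a page count):

        int try_to_writeback_inodes_sb(struct super_block *sb,
                                       enum wb_reason reason)
        {
            if (!down_read_trylock(&sb->s_umount))
                return 0;  /* write-locked: not idle, don't wait */

            writeback_inodes_sb(sb, reason);
            up_read(&sb->s_umount);
            return 1;
        }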

    This idea came from Christoph Hellwig.
    Some code is from the patch of Kamal Mostafa.

    Reviewed-by: Jan Kara
    Signed-off-by: Miao Xie
    Signed-off-by: Fengguang Wu

    Miao Xie
     

27 Nov, 2012

1 commit

  • Commit 169ebd90131b ("writeback: Avoid iput() from flusher thread")
    removed iget-iput pair from inode writeback. As a side effect, inodes
    that are dirty during iput_final() call won't be ever added to inode LRU
    (iput_final() doesn't add dirty inodes to the LRU and later, when the
    inode is cleaned, there's no one to add the inode there). Thus inodes
    are effectively unreclaimable until someone looks them up again.

    The practical effect of this bug is limited by the fact that inodes are
    pinned by a dentry for long enough that the inode gets cleaned. But
    still the bug can have nasty consequences leading up to OOM conditions
    under certain circumstances. The following can easily reproduce the
    problem:

    for (( i = 0; i < 1000; i++ )); do
            mkdir $i
            for (( j = 0; j < 1000; j++ )); do
                    touch $i/$j
                    echo 2 > /proc/sys/vm/drop_caches
            done
    done

    then one needs to run 'sync; ls -lR' to make inodes reclaimable again.

    We fix the issue by inserting unused clean inodes into the LRU after
    writeback finishes in inode_sync_complete().
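
    The fix itself is small - roughly, in inode_sync_complete():

        static void inode_sync_complete(struct inode *inode)
        {
            inode->i_state &= ~I_SYNC;
            /* if the inode is clean and unused, put it on the LRU now */
            inode_add_lru(inode);
            /* waiters must see I_SYNC cleared before being woken up */
            smp_mb();
            wake_up_bit(&inode->i_state, __I_SYNC);
        }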

    Signed-off-by: Jan Kara
    Reported-by: OGAWA Hirofumi
    Cc: Al Viro
    Cc: OGAWA Hirofumi
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

08 Oct, 2012

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The big new feature added this time is supporting online resizing
    using the meta_bg feature. This allows us to resize file systems
    which are greater than 16TB. In addition, the speed of online
    resizing has been improved in general.

    We also fix a number of races, some of which could lead to deadlocks,
    in ext4's Asynchronous I/O and online defrag support, thanks to good
    work by Dmitry Monakhov.

    There are also a large number of more minor bug fixes and cleanups
    from a number of other ext4 contributors, quite a few of whom have
    submitted fixes for the first time."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
    ext4: fix ext4_flush_completed_IO wait semantics
    ext4: fix mtime update in nodelalloc mode
    ext4: fix ext_remove_space for punch_hole case
    ext4: punch_hole should wait for DIO writers
    ext4: serialize truncate with owerwrite DIO workers
    ext4: endless truncate due to nonlocked dio readers
    ext4: serialize unlocked dio reads with truncate
    ext4: serialize dio nonlocked reads with defrag workers
    ext4: completed_io locking cleanup
    ext4: fix unwritten counter leakage
    ext4: give i_aiodio_unwritten a more appropriate name
    ext4: ext4_inode_info diet
    ext4: convert to use leXX_add_cpu()
    ext4: ext4_bread usage audit
    fs: reserve fallocate flag codepoint
    ext4: remove redundant offset check in mext_check_arguments()
    ext4: don't clear orphan list on ro mount with errors
    jbd2: fix assertion failure in commit code due to lacking transaction credits
    ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
    ext4: enable FITRIM ioctl on bigalloc file system
    ...

    Linus Torvalds
     

02 Oct, 2012

1 commit

  • Pull the trivial tree from Jiri Kosina:
    "Tiny usual fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    doc: fix old config name of kprobetrace
    fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
    btrfs: fix the commment for the action flags in delayed-ref.h
    btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
    vfs: fix kerneldoc for generic_fh_to_parent()
    treewide: fix comment/printk/variable typos
    ipr: fix small coding style issues
    doc: fix broken utf8 encoding
    nfs: comment fix
    platform/x86: fix asus_laptop.wled_type module parameter
    mfd: printk/comment fixes
    doc: getdelays.c: remember to close() socket on error in create_nl_socket()
    doc: aliasing-test: close fd on write error
    mmc: fix comment typos
    dma: fix comments
    spi: fix comment/printk typos in spi
    Coccinelle: fix typo in memdup_user.cocci
    tmiofb: missing NULL pointer checks
    tools: perf: Fix typo in tools/perf
    tools/testing: fix comment / output typos
    ...

    Linus Torvalds
     

20 Sep, 2012

1 commit

    In ext4_nonda_switch(), if the file system is getting full we used to
    call writeback_inodes_sb_if_idle(). The problem is that we can
    already be holding i_mutex, and this causes a potential deadlock when
    writeback_inodes_sb_if_idle() tries to take s_umount. (See the
    lockdep output below).

    As it turns out we don't need to hold s_umount; the fact that we
    are in the middle of the write(2) system call will keep the superblock
    pinned. Unfortunately writeback_inodes_sb() checks to make sure
    s_umount is taken, and the VFS uses a different mechanism for making
    sure the file system doesn't get unmounted out from under us. The
    simplest way of dealing with this is to simply grab s_umount
    using a trylock, and skip kicking the writeback flusher thread in the
    very unlikely case that we can't take a read lock on s_umount without
    blocking.

    Also, we now check the criteria for kicking the writeback thread
    before we decide whether to fall back to non-delayed writeback, so
    if there are any outstanding delayed allocation writes, we try to get
    them resolved as soon as possible.
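
    A sketch of the trylock around the flusher kick in ext4_nonda_switch()
    (identifiers abridged from the era's ext4 code):

        /* start pushing delalloc when 1/2 of free blocks are dirty */
        if (free_blocks < 2 * dirty_blocks &&
            down_read_trylock(&sb->s_umount)) {
            writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
            up_read(&sb->s_umount);
        }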

    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc1-00042-gce894ca #367 Not tainted
    -------------------------------------------------------
    dd/8298 is trying to acquire lock:
    (&type->s_umount_key#18){++++..}, at: [] writeback_inodes_sb_if_idle+0x28/0x46

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#8){+.+...}, at: [] generic_file_aio_write+0x5f/0xd3

    which lock already depends on the new lock.

    2 locks held by dd/8298:
    #0: (sb_writers#2){.+.+.+}, at: [] generic_file_aio_write+0x56/0xd3
    #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [] generic_file_aio_write+0x5f/0xd3

    stack backtrace:
    Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
    Call Trace:
    [] ? console_unlock+0x345/0x372
    [] print_circular_bug+0x190/0x19d
    [] __lock_acquire+0x86d/0xb6c
    [] ? mark_held_locks+0x5c/0x7b
    [] lock_acquire+0x66/0xb9
    [] ? writeback_inodes_sb_if_idle+0x28/0x46
    [] down_read+0x28/0x58
    [] ? writeback_inodes_sb_if_idle+0x28/0x46
    [] writeback_inodes_sb_if_idle+0x28/0x46
    [] ext4_nonda_switch+0xe1/0xf4
    [] ext4_da_write_begin+0x27/0x193
    [] generic_file_buffered_write+0xc8/0x1bb
    [] __generic_file_aio_write+0x1dd/0x205
    [] generic_file_aio_write+0x78/0xd3
    [] ext4_file_write+0x480/0x4a6
    [] ? __lock_acquire+0x41e/0xb6c
    [] ? sched_clock_cpu+0x11a/0x13e
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? local_clock+0x37/0x4e
    [] do_sync_write+0x67/0x9d
    [] ? wait_on_retry_sync_kiocb+0x44/0x44
    [] vfs_write+0x7b/0xe6
    [] sys_write+0x3b/0x64
    [] syscall_call+0x7/0xb

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

01 Aug, 2012

1 commit

    Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism has not been used any more. But the old interface exported
    through /proc/sys/vm/nr_pdflush_threads still exists and is obviously
    useless.

    For backwards compatibility, print warning information and return 2
    to notify users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

31 Jul, 2012

1 commit

  • Pull writeback updates from Wu Fengguang:
    "Use time based periods to age the writeback proportions, which can
    adapt equally well to fast/slow devices."

    Fix up trivial conflict in comment in fs/sync.c

    * tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Fix some comment errors
    block: Convert BDI proportion calculations to flexible proportions
    lib: Fix possible deadlock in flexible proportion code
    lib: Proportions with flexible period

    Linus Torvalds
     

23 Jul, 2012

1 commit

    In principle, a filesystem may want to have ->sync_fs() called during sync(1)
    even though it does not have a bdi (i.e. s_bdi is set to noop_backing_dev_info).
    Only the writeback code really needs bdi set to something reasonable. So move
    the checks to where they are more logical.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

09 Jun, 2012

2 commits

  • Signed-off-by: Wanpeng Li
    Signed-off-by: Fengguang Wu

    Wanpeng Li
     
    Fix a bug introduced by commit 169ebd90131b ("writeback: Avoid iput() from
    flusher thread"). We have to have wb->list_lock locked when
    restarting the writeback loop after having waited for inode writeback.

    Bug description by Ted Tso:

    I can reproduce this fairly easily by using ext4 w/o a journal, running
    under KVM with 1024megs memory, with fsstress (xfstests #13):

    [ 45.153294] =====================================
    [ 45.154784] [ BUG: bad unlock balance detected! ]
    [ 45.155591] 3.5.0-rc1-00002-gb22b1f1 #124 Not tainted
    [ 45.155591] -------------------------------------
    [ 45.155591] flush-254:16/2499 is trying to release lock (&(&wb->list_lock)->rlock) at:
    [ 45.155591] [] writeback_sb_inodes+0x160/0x327
    [ 45.155591] but there are no more locks to release!

    Reported-by: Theodore Ts'o
    Tested-by: Theodore Ts'o
    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara
     

06 May, 2012

1 commit

    Doing iput() from the flusher thread (writeback_sb_inodes()) can create
    problems because iput() can do a lot of work - for example truncating the
    inode if it's the last iput on an unlinked file. Some filesystems depend
    on the flusher thread progressing (e.g. because they need to flush delay
    allocated blocks to reduce allocation uncertainty), and so the flusher
    thread doing truncates creates interesting dependencies and possibilities
    for deadlocks.

    We get rid of iput() in flusher thread by using the fact that I_SYNC inode
    flag effectively pins the inode in memory. So if we take care to either hold
    i_lock or have I_SYNC set, we can get away without taking inode reference
    in writeback_sb_inodes().

    As a side effect of these changes, we also fix a possible use-after-free
    in wb_writeback() because the inode_wait_for_writeback() call could try
    to reacquire i_lock on an inode that was already freed.

    Signed-off-by: Jan Kara
    Signed-off-by: Fengguang Wu

    Jan Kara