31 Mar, 2011

1 commit


25 Mar, 2011

4 commits

  • First thing we do in writeback_single_inode() is take the i_lock and
    the last thing we do is drop it. A caller already holds the i_lock,
    so pull the i_lock out of writeback_single_inode() to reduce the
    round trips on this lock during inode writeback.
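
    A minimal sketch of the resulting calling convention (illustrative only,
    not the actual diff; the writeback_single_inode() signature is assumed):

    spin_lock(&inode->i_lock);
    /*
     * writeback_single_inode() is now entered and left with i_lock held,
     * instead of taking and dropping the lock itself.
     */
    writeback_single_inode(inode, &wbc);
    spin_unlock(&inode->i_lock);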

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the inode writeback list with a new global lock
    inode_wb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.
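
    Roughly, a writeback-list walk then nests the new global lock around
    per-inode i_lock checks, e.g. (a sketch based on the description above,
    not the actual code):

    spin_lock(&inode_wb_list_lock);
    list_for_each_entry(inode, &wb->b_io, i_wb_list) {
            spin_lock(&inode->i_lock);
            /* validity check under i_lock; inode_lock not needed */
            if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                    spin_unlock(&inode->i_lock);
                    continue;
            }
            /* ... act on the inode ... */
            spin_unlock(&inode->i_lock);
    }
    spin_unlock(&inode_wb_list_lock);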

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect the per-sb inode list with a new global lock
    inode_sb_list_lock and use it to protect the list manipulations and
    traversals. This lock replaces the inode_lock as the inodes on the
    list can be validity checked while holding the inode->i_lock and
    hence the inode_lock is no longer needed to protect the list.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.
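
    A sketch of the traversal pattern this requires (illustrative): the
    i_state check and __iget() sit inside the same inode->i_lock section,
    so the reference cannot race with I_FREEING.

    spin_lock(&inode->i_lock);
    if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
            spin_unlock(&inode->i_lock);
            continue;       /* never take a reference on a dying inode */
    }
    __iget(inode);          /* reference taken while still holding i_lock */
    spin_unlock(&inode->i_lock);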

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

14 Jan, 2011

6 commits

  • The sync_inodes_sb() function does not have a return value. Remove the
    outdated documentation comment.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Hajnoczi
     
  • Use correct function name, remove incorrect apostrophe

    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
    usually set to LONG_MAX. The logic in wb_writeback() then calls
    __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
    easily end up with non-positive nr_to_write after the function returns, if
    the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.

    When nr_to_write is
    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Background writeback is easily livelockable in a loop in wb_writeback() by
    a process continuously re-dirtying pages (or continuously appending to a
    file). This is in fact intended, as the goal of background writeback is
    to write out the dirty pages it can find as long as we are over
    dirty_background_threshold.

    But the above behavior gets inconvenient at times because no other work
    queued in the flusher thread's queue gets processed. In particular, since
    e.g. sync(1) relies on the flusher thread to do all the IO for it, sync(1)
    can hang forever waiting for the flusher thread to do the work.

    Generally, when a flusher thread has some work queued, someone submitted
    the work to achieve a goal more specific than what background writeback
    does. Moreover, by working on the specific work, we also reduce the amount
    of dirty pages, which is exactly the target of background writeout. So it
    makes sense to give specific work priority over generic page cleaning.

    Thus we interrupt background writeback if there is some other work to do.
    We return to the background writeback after completing all the queued
    work.

    This may delay the writeback of expired inodes for a while; however, the
    expired inodes will eventually be flushed to disk as long as the other
    work does not livelock.
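
    The gist of the change, sketched against the wb_writeback() loop (not
    the literal patch; the 'work' and 'bdi' names are assumptions):

    for (;;) {
            /*
             * Background writeout is best-effort: if someone queued
             * specific work, stop and let it run; background writeback
             * resumes once the queue is empty again.
             */
            if (work->for_background && !list_empty(&bdi->work_list))
                    break;

            /* ... otherwise write out another chunk of dirty pages ... */
    }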

    [fengguang.wu@intel.com: update comment]
    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • This tracks when balance_dirty_pages() tries to wake up the flusher thread
    for background writeback (if it was not started already).

    Suggested-by: Christoph Hellwig
    Signed-off-by: Wu Fengguang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Check whether background writeback is needed after finishing each work.

    When the bdi flusher thread finishes doing some work, check whether any
    kind of background writeback needs to be done (either because
    dirty_background_ratio is exceeded or because we need to start flushing
    old inodes). If so, just do the background writeback.

    This way, bdi_start_background_writeback() just needs to wake up the
    flusher thread. It will do background writeback as soon as there is no
    other work.

    This is a preparatory patch for the next patch which stops background
    writeback as soon as there is other work to do.
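
    Schematically, the flusher's top-level loop then looks something like
    this (a sketch; helper names such as wb_check_background_flush() are
    assumptions based on this description):

    long wb_do_writeback(struct bdi_writeback *wb)
    {
            struct wb_writeback_work *work;
            long wrote = 0;

            /* explicitly queued work always runs first */
            while ((work = get_next_work_item(wb->bdi)) != NULL)
                    wrote += wb_writeback(wb, work);

            /* then see whether periodic or background writeback is due */
            wrote += wb_check_old_data_flush(wb);
            wrote += wb_check_background_flush(wb);

            return wrote;
    }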

    Signed-off-by: Jan Kara
    Signed-off-by: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Jan Engelhardt
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

31 Oct, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
    Btrfs: deal with errors from updating the tree log
    Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
    Btrfs: make SNAP_DESTROY async
    Btrfs: add SNAP_CREATE_ASYNC ioctl
    Btrfs: add START_SYNC, WAIT_SYNC ioctls
    Btrfs: async transaction commit
    Btrfs: fix deadlock in btrfs_commit_transaction
    Btrfs: fix lockdep warning on clone ioctl
    Btrfs: fix clone ioctl where range is adjacent to extent
    Btrfs: fix delalloc checks in clone ioctl
    Btrfs: drop unused variable in block_alloc_rsv
    Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
    Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
    Btrfs: Use ERR_CAST helpers
    Btrfs: use memdup_user helpers
    Btrfs: fix raid code for removing missing drives
    Btrfs: Switch the extent buffer rbtree into a radix tree
    Btrfs: restructure try_release_extent_buffer()
    Btrfs: use the flusher threads for delalloc throttling
    Btrfs: tune the chunk allocation to 5% of the FS as metadata
    ...

    Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
    remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
    useless and removed in commit 5e8067adfdba: "rcu head remove init")

    Linus Torvalds
     

30 Oct, 2010

1 commit

  • The btrfs merge looks like hell, because it changes fs-writeback.c, and
    the crazy code has this repeated "estimate number of dirty pages"
    counting that involves three different helper functions. And it's done
    in two different places.

    Just unify that whole calculation as a "get_nr_dirty_pages()" helper
    function, and the merge result will look half-way decent.
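
    The unified helper then reduces to a small sum, roughly (a sketch; the
    exact counters used are assumptions):

    static long get_nr_dirty_pages(void)
    {
            return global_page_state(NR_FILE_DIRTY) +
                   global_page_state(NR_UNSTABLE_NFS) +
                   get_nr_dirty_inodes();
    }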

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Oct, 2010

1 commit

  • When btrfs is running low on metadata space, it needs to force delayed
    allocation pages to disk. It currently does this with a suboptimal walk
    of a private list of inodes with delayed allocation, and it would be
    much better if we used the generic flusher threads.

    writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
    thread to start IO on all the dirty pages in the FS before it returns.
    This adds variants of writeback_inodes_sb* that allow the caller to
    control how many pages get sent down.
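
    The new interface boils down to letting the caller pass a page count,
    along the lines of (a hedged sketch of the prototypes and usage, not
    the exact signatures):

    void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr);
    int  writeback_inodes_sb_nr_if_idle(struct super_block *sb,
                                        unsigned long nr);

    /* btrfs-style caller: push out only as much delalloc as it needs */
    writeback_inodes_sb_nr_if_idle(fs_info->sb, nr_pages);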

    Signed-off-by: Chris Mason

    Chris Mason
     

27 Oct, 2010

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • PF_FLUSHER is only ever set, not tested, remove it.

    Signed-off-by: Peter Zijlstra
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • I had to go back to a 2.6.20 tree to work out why we're adding a
    number-of-inodes into a number-of-pages count. Restore the lost comment.

    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
    This is not the behavior a user expects, and it's inconsistent with
    calc_period_shift(), which uses the plain vm_dirty_ratio value.

    Let's remove the internal bound.

    At the same time, fix balance_dirty_pages() to work with the
    dirty_thresh=0 case. This allows applications to proceed when
    dirty+writeback pages are all cleaned.

    And ">" fits with the name "exceeded" better than ">=" does. Neil thinks
    it is an aesthetic improvement as well as a functional one :)
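
    With the floor gone and the strict comparison, the "exceeded" test in
    balance_dirty_pages() reads roughly as follows (a sketch using assumed
    local variable names):

    /*
     * ">" rather than ">=": dirty_thresh == 0 no longer counts as
     * exceeded once all dirty+writeback pages have been cleaned.
     */
    dirty_exceeded =
            (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh) ||
            (nr_reclaimable + nr_writeback > dirty_thresh);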

    Signed-off-by: Wu Fengguang
    Cc: Jan Kara
    Proposed-by: Con Kolivas
    Acked-by: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Reviewed-by: Neil Brown
    Reviewed-by: KOSAKI Motohiro
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

26 Oct, 2010

6 commits

  • Skip I_FREEING inodes just like I_WILL_FREE and I_NEW when walking the
    writeback lists. Currently this can't happen, but once we move from
    inode_lock to more fine grained locking we can have an inode that's
    still on the writeback lists but has I_FREEING set, and we absolutely
    need to skip it here, just like we do for all other inode list walks.
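
    The check itself is the usual state test in the list walk, roughly
    (sketch):

    if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
            /* leave dying or not-yet-initialised inodes alone */
            requeue_io(inode);
            continue;
    }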

    Based on a patch from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • The use of the same inode list structure (inode->i_list) for two
    different list constructs with different lifecycles and purposes
    makes it impossible to separate the locking of the different
    operations. Therefore, to enable the separation of the locking of
    the writeback and reclaim lists, split the inode->i_list into two
    separate lists dedicated to their specific tracking functions.
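
    In terms of struct inode, the split amounts to something like this
    (a sketch showing only the relevant fields):

    struct inode {
            /* ... */
            struct list_head  i_wb_list;  /* writeback: b_dirty/b_io/b_more_io */
            struct list_head  i_lru;      /* inode LRU, used by reclaim */
            /* ... */
    };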

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
     
  • Convert the inode LRU to use lazy updates to reduce lock and
    cacheline traffic. We avoid moving inodes around in the LRU list
    during iget/iput operations so these frequent operations don't need
    to access the LRUs. Instead, we defer the refcount checks to
    reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
    reclaim that iget has touched the inode in the past. This means that
    only reclaim should be touching the LRU with any frequency, hence
    significantly reducing lock acquisitions and the amount of contention
    on LRU updates.

    This also removes the inode_in_use list, which means we now only
    have one list for tracking the inode LRU status. This makes it much
    simpler to split out the LRU list operations under its own lock.
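
    On the release side the lazy scheme looks roughly like this (a sketch
    with locking elided and helper details assumed): the final iput() only
    tags the inode, and reclaim acts on the tag later.

    /* final iput(): don't reorder the LRU, just record recent use */
    inode->i_state |= I_REFERENCED;

    /* prune_icache(): a referenced inode gets one more trip around the LRU */
    if (inode->i_state & I_REFERENCED) {
            inode->i_state &= ~I_REFERENCED;
            list_move(&inode->i_lru, &inode_lru);
            continue;
    }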

    Signed-off-by: Nick Piggin
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Nick Piggin
     
  • The number of inodes allocated does not need to be tied to the
    addition or removal of an inode to/from a list. If we are not tied
    to a list lock, we could update the counters when inodes are
    initialised or destroyed, but to do that we need to convert the
    counters to be per-cpu (i.e. independent of a lock). This means that
    we have the freedom to change the list/locking implementation
    without needing to care about the counters.

    Based on a patch originally from Eric Dumazet.

    [AV: cleaned up a bit, fixed build breakage on weird configs]

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Dave Chinner
     
  • note: for race-free uses you need inode_lock held

    Signed-off-by: Al Viro

    Al Viro
     
  • Add a new helper to write out the inode using the writeback code,
    that is, including the correct dirty bit and list manipulation. A few
    filesystems already open-code this, and a lot of others should be
    using it instead of write_inode_now(), which also writes out the
    data.
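
    Such a helper is essentially a thin wrapper around sync_inode() with a
    zero page count, e.g. (a sketch; the helper name here is an assumption,
    not taken from this log):

    int sync_inode_metadata(struct inode *inode, int wait)
    {
            struct writeback_control wbc = {
                    .sync_mode   = wait ? WB_SYNC_ALL : WB_SYNC_NONE,
                    .nr_to_write = 0,       /* metadata only, no data pages */
            };

            return sync_inode(inode, &wbc);
    }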

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

04 Oct, 2010

1 commit

  • We currently use struct backing_dev_info for various different purposes.
    Originally it was introduced to describe a backing device which includes
    an unplug and congestion function and various bits of readahead information
    and VM-relevant flags. We're also using it for tracking dirty inodes for
    writeback.

    To make writeback properly find all inodes we need to only access the
    per-filesystem backing_device pointed to by the superblock in ->s_bdi
    inside the writeback code, and not the instances pointed to by
    inode->i_mapping->backing_dev_info, which can be overridden by special
    devices or might not be set at all by some filesystems.

    Long term we should split out the writeback-relevant bits of struct
    backing_dev_info (which includes more than the current bdi_writeback)
    and only point to it from the superblock while leaving the traditional
    backing device as a separate structure that can be overridden by devices.

    The one exception for now is the block device filesystem which really
    wants different writeback contexts for its different (internal) inodes
    to handle the writeout more efficiently. For now we do this with
    a hack in fs-writeback.c because we're so late in the cycle, but in
    the future I plan to replace this with a superblock method that allows
    for multiple writeback contexts per filesystem.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

22 Sep, 2010

1 commit

  • Inodes of devices such as /dev/zero can get dirty, for example via the
    utime(2) syscall or due to an atime update. The backing device of such
    inodes (zero_bdi, etc.) is, however, unable to handle dirty inodes, and
    thus __mark_inode_dirty complains. In fact, the inode should rather be
    dirtied against the backing device of the filesystem holding it. This is
    generally a good rule except for filesystems such as 'bdev' or
    'mtd_inodefs'. Inodes
    in these pseudofilesystems are referenced from ordinary filesystem
    inodes and carry mapping with real data of the device. Thus for these
    inodes we have to use inode->i_mapping->backing_dev_info as we did so
    far. We distinguish these filesystems by checking whether sb->s_bdi
    points to a non-trivial backing device or not.

    Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /.
    There's a device inode A described by a path "/dev/sdb" on this
    filesystem. This inode will be dirtied against backing device "8:0"
    after this patch. bdev filesystem contains block device inode B coupled
    with our inode A. When someone modifies a page of /dev/sdb, it's B that
    gets dirtied and the dirtying happens against the backing device "8:16".
    Thus both inodes get filed to a correct bdi list.

    Cc: stable@kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

28 Aug, 2010

1 commit

  • Setting the task state here may cause us to miss the wake up from
    kthread_stop(), so we need to recheck kthread_should_stop() or risk
    sleeping forever in the following schedule().

    Symptom was an indefinite hang on an NFSv4 mount. (NFSv4 may create
    multiple mounts in a temporary namespace while traversing the mount
    path, and since the temporary namespace is immediately destroyed, it may
    end up destroying a mount very soon after it was created, possibly
    making this race more likely.)
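
    The standard pattern for this kind of sleep, sketched (illustrative,
    not the literal diff): re-check kthread_should_stop() after setting
    the task state, so a concurrent kthread_stop() cannot be missed.

    set_current_state(TASK_INTERRUPTIBLE);
    if (kthread_should_stop()) {
            /* a stop request raced with us; don't go to sleep */
            __set_current_state(TASK_RUNNING);
            break;
    }
    schedule();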

    INFO: task mount.nfs4:4314 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    mount.nfs4 D 0000000000000000 2880 4314 4313 0x00000000
    ffff88001ed6da28 0000000000000046 ffff88001ed6dfd8 ffff88001ed6dfd8
    ffff88001ed6c000 ffff88001ed6c000 ffff88001ed6c000 ffff88001e5003a0
    ffff88001ed6dfd8 ffff88001e5003a8 ffff88001ed6c000 ffff88001ed6dfd8
    Call Trace:
    [] schedule_timeout+0x1cd/0x2e0
    [] ? mark_held_locks+0x6c/0xa0
    [] ? _raw_spin_unlock_irq+0x30/0x60
    [] ? trace_hardirqs_on_caller+0x14d/0x190
    [] ? sub_preempt_count+0xe/0xd0
    [] wait_for_common+0x120/0x190
    [] ? default_wake_function+0x0/0x20
    [] wait_for_completion+0x1d/0x20
    [] kthread_stop+0x4a/0x150
    [] ? thaw_process+0x70/0x80
    [] bdi_unregister+0x10a/0x1a0
    [] nfs_put_super+0x19/0x20
    [] generic_shutdown_super+0x54/0xe0
    [] kill_anon_super+0x16/0x60
    [] nfs4_kill_super+0x39/0x90
    [] deactivate_locked_super+0x45/0x60
    [] deactivate_super+0x49/0x70
    [] mntput_no_expire+0x84/0xe0
    [] release_mounts+0x9f/0xc0
    [] put_mnt_ns+0x65/0x80
    [] nfs_follow_remote_path+0x1e6/0x420
    [] nfs4_try_mount+0x6f/0xd0
    [] nfs4_get_sb+0xa2/0x360
    [] vfs_kern_mount+0x88/0x1f0
    [] do_kern_mount+0x52/0x130
    [] ? _lock_kernel+0x6a/0x170
    [] do_mount+0x26e/0x7f0
    [] ? copy_mount_options+0xea/0x190
    [] sys_mount+0x98/0xf0
    [] system_call_fastpath+0x16/0x1b
    1 lock held by mount.nfs4/4314:
    #0: (&type->s_umount_key#24){+.+...}, at: [] deactivate_super+0x41/0x70

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Jens Axboe
    Acked-by: Artem Bityutskiy

    J. Bruce Fields
     

12 Aug, 2010

5 commits

  • Commit 83ba7b071f3 ("writeback: simplify the write back thread queue")
    broke writeback_in_progress() as in that commit we started to remove work
    items from the list at the moment we start working on them and not at the
    moment they are finished. Thus if the flusher thread was doing some work
    but there was no other work queued, writeback_in_progress() returned
    false. This could in particular cause unnecessary queueing of background
    writeback from balance_dirty_pages() or writeout work from
    writeback_inodes_sb_if_idle().

    This patch fixes the problem by introducing a bit in the bdi state which
    indicates that the flusher thread is processing some work and uses this
    bit for writeback_in_progress() test.

    NOTE: Both callsites of writeback_in_progress() (namely,
    writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
    need different information from what writeback_in_progress() provides.
    They would need to know whether *the kind of writeback they are going to
    submit* is already queued. But this information isn't that simple to
    provide so let's fix writeback_in_progress() for the time being.
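
    With the new state bit, the check becomes a plain bit test, roughly
    (a sketch; the BDI_writeback_running name is an assumption):

    int writeback_in_progress(struct backing_dev_info *bdi)
    {
            return test_bit(BDI_writeback_running, &bdi->state);
    }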

    Signed-off-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Unify the logic for kupdate and non-kupdate cases. There won't be
    starvation because the inodes requeued into b_more_io will later be
    spliced _after_ the remaining inodes in b_io, hence won't stand in the way
    of other inodes in the next run.

    It avoids unnecessary redirty_tail() calls, hence the update of
    i_dirtied_when. The timestamp update is undesirable because it could
    later delay the inode's periodic writeback, or may exclude the inode from
    the data integrity sync operation (which checks timestamp to avoid extra
    work and livelock).

    ===
    How the redirty_tail() comes about:

    It was a long story... This redirty_tail() was introduced with
    wbc.more_io. The initial patch for more_io actually did not have the
    redirty_tail(), and when it was merged, several 100% iowait bug reports
    arose:

    reiserfs:
    http://lkml.org/lkml/2007/10/23/93

    jfs:
    commit 29a424f28390752a4ca2349633aaacc6be494db5
    JFS: clear PAGECACHE_TAG_DIRTY for no-write pages

    ext2:
    http://www.spinics.net/linux/lists/linux-ext4/msg04762.html

    They are all old bugs hidden in various filesystems that became "visible"
    with the more_io patch. At the time, the ext2 bug was thought to be
    "trivial", so it was not fixed. Instead, the following updated more_io
    patch with redirty_tail() was merged:

    http://www.spinics.net/linux/lists/linux-ext4/msg04507.html

    This will in general prevent 100% iowait on ext2 and possibly other unknown FS bugs.

    Signed-off-by: Wu Fengguang
    Cc: Dave Chinner
    Cc: Martin Bligh
    Cc: Michael Rubin
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • This was not a bug, since b_io is empty for kupdate writeback. The next
    patch will do requeue_io() for non-kupdate writeback, so let's fix it.

    Signed-off-by: Wu Fengguang
    Cc: Dave Chinner
    Cc: Martin Bligh
    Cc: Michael Rubin
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Avoid delaying writeback for an expired inode with lots of dirty pages, but
    no active dirtier at the moment. Previously we only did that for the
    kupdate case.

    Any filesystem that does delayed allocation or unwritten extent conversion
    after IO completion will cause this - for example, XFS.

    Signed-off-by: Wu Fengguang
    Acked-by: Jan Kara
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
    that the latter can be avoided when under global dirty background
    threshold (which is the normal state for most systems).

    Signed-off-by: Wu Fengguang
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

11 Aug, 2010

2 commits

  • * 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
    block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
    xen-blkfront: fix missing out label
    blkdev: fix blkdev_issue_zeroout return value
    block: update request stacking methods to support discards
    block: fix missing export of blk_types.h
    writeback: fix bad _bh spinlock nesting
    drbd: revert "delay probes", feature is being re-implemented differently
    drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
    drbd: Disable delay probes for the upcomming release
    writeback: cleanup bdi_register
    writeback: add new tracepoints
    writeback: remove unnecessary init_timer call
    writeback: optimize periodic bdi thread wakeups
    writeback: prevent unnecessary bdi threads wakeups
    writeback: move bdi threads exiting logic to the forker thread
    writeback: restructure bdi forker loop a little
    writeback: move last_active to bdi
    writeback: do not remove bdi from bdi_list
    writeback: simplify bdi code a little
    writeback: do not lose wake-ups in bdi threads
    ...

    Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
    drivers/scsi/scsi_error.c as per Jens.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

2 commits

  • WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't
    write out some huge inode for too long while starving writeout of other
    inodes. To avoid livelocks, we record the time we started writeback in
    wbc->wb_start and do not write out inodes which were dirtied after this
    time. But currently, writeback_inodes_wb() resets wb_start each time it
    is called thus effectively invalidating this logic and making any
    WB_SYNC_NONE writeback prone to livelocks.

    This patch makes sure wb_start is set only once when we start writeback.

    Signed-off-by: Jan Kara
    Reviewed-by: Wu Fengguang
    Cc: Christoph Hellwig
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is
    equivalent to I_FREEING for almost all code looking at either;
    it's there to keep track of having called clear_inode() exactly
    once per inode lifetime, at some point after having set I_FREEING.
    I_CLEAR and I_FREEING never get set at the same time with the
    current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
    instead of I_CLEAR without loss of information. As a result of
    this change, checks become simpler and the amount of code that needs
    to know about I_CLEAR shrinks a lot.

    Signed-off-by: Al Viro

    Al Viro
     

08 Aug, 2010

4 commits

  • When the first inode for a bdi is marked dirty, we wake up the bdi thread, which
    should take care of the periodic background write-out. However, the write-out
    will actually start only 'dirty_writeback_interval' centisecs later, so we can
    delay the wake-up.

    This change was requested by Nick Piggin who pointed out that if we delay the
    wake-up, we weed out 2 unnecessary context switches, which matters because
    '__mark_inode_dirty()' is a hot-path function.

    This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
    sets up a timer to wake up the bdi thread and returns. So the wake-up is
    delayed.

    We also delete the timer in bdi threads just before writing-back. And
    synchronously delete it when unregistering bdi. At the unregister point the bdi
    does not have any users, so no one can arm it again.

    Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
    context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
    this change as well.

    This patch also moves the 'bdi_wb_init()' function down in the file to avoid
    forward-declaration of 'bdi_wakeup_thread_delayed()'.
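
    Conceptually the helper just arms the per-bdi timer instead of waking
    the thread immediately, e.g. (a sketch; field names such as
    wb.wakeup_timer are assumptions):

    void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
    {
            unsigned long timeout;

            /* dirty_writeback_interval is in centisecs */
            timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
            mod_timer(&bdi->wb.wakeup_timer, jiffies + timeout);
    }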

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
    bad for battery-driven devices.

    There are two types of activities bdi threads do:
    1. process bdi works from the 'bdi->work_list'
    2. periodic write-back

    So there are 2 sources of wake-up events for bdi threads:

    1. 'bdi_queue_work()' - submits bdi works
    2. '__mark_inode_dirty()' - adds dirty I/O to bdi's

    The former already has bdi wake-up code. The latter does not, and this patch
    adds it.

    '__mark_inode_dirty()' is a hot-path function, but this patch adds another
    'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
    the bdi has no dirty inodes. So adding this spinlock should be fine and should
    not affect performance.

    This patch makes sure bdi threads and the forker thread do not wake up if there
    is nothing to do. The forker thread will nevertheless wake up at least every
    5 min. to check whether it has to kill a bdi thread. This can also be optimized,
    but is not worth it.

    This patch also tidies up the warning about an unregistered bdi, and turns it from
    an ugly crocodile into a simple 'WARN()' statement.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently, bdi threads can decide to exit if there were no useful activities
    for 5 minutes. However, this causes nasty races: we can easily oops in the
    'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.

    And even if we do not oops, if the bdi thread exits immediately after we wake
    it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
    in the bdi work processing.

    This patch makes the forker thread the central place which not only
    creates bdi threads, but also kills them if they were inactive long enough.
    This is better design-wise.

    Another reason why this change was done is to prepare for the further changes
    which will prevent the bdi threads from waking up every 5 sec and wasting
    power. Indeed, when the task does not wake up periodically anymore, it won't be
    able to exit either.

    This patch also moves the 'wake_up_bit()' call from the bdi thread to the
    forker thread as well. So now the forker thread sets the BDI_pending bit, then
    forks the task or kills it, then clears the bit and wakes up the waiting
    process.

    The only process which may wait on the bit is 'bdi_wb_shutdown()'. This
    function was changed as well - now it first removes the bdi from the
    'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
    guaranteed that the forker thread won't race with it, because the bdi is not
    visible. Note, the forker thread sets the 'BDI_pending' bit under the
    'bdi->wb_lock' which is essential for proper serialization.

    And additionally, when we change 'bdi->wb.task', we now take the
    'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
    would when raced with, say, 'bdi_queue_work()'.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
     
  • Currently bdi threads use a local variable 'last_active' which stores the last
    time the bdi thread did some useful work. Move this local variable to 'struct
    bdi_writeback'. This is just a preparation for the further patches which will
    make the forker thread decide when bdi threads should be killed.

    Signed-off-by: Artem Bityutskiy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Artem Bityutskiy