06 Jul, 2010

3 commits

  • First remove items from work_list as soon as we start working on them. This
    means we don't have to track any pending or visited state and can get
    rid of all the RCU magic freeing the work items - we can simply free
    them once the operation has finished. Second use a real completion for
    tracking synchronous requests - if the caller sets the completion pointer
    we complete it, otherwise use it as a boolean indicator that we can free
    the work item directly. Third unify struct wb_writeback_args and struct
    bdi_work into a single data structure, wb_writeback_work. Previously we
    set all parameters into a struct wb_writeback_args, copied it into
    struct bdi_work, and copied it again on the stack to use it there. Instead,
    just allocate one structure dynamically or on the stack and use it
    all the way through the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
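
The completion-or-free rule from the commit above can be sketched in userspace C. This is a minimal model, not the kernel code: `completion_model`, `work_model`, `finish_work` and the `freed_items` counter are hypothetical names, and a plain flag stands in for a real completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct completion. */
struct completion_model {
    bool done;
};

/* Simplified model of the unified work item; fields are illustrative. */
struct work_model {
    long nr_pages;
    struct completion_model *done; /* non-NULL only when a caller waits */
};

/* Counts heap frees so the behaviour can be observed from outside. */
static int freed_items;

/* Called once the writeback operation for this item has finished. */
static void finish_work(struct work_model *work)
{
    assert(work != NULL);
    if (work->done) {
        /* Synchronous caller: it owns the item, so only signal it. */
        work->done->done = true;
    } else {
        /* Nobody waits for this item: free it directly. */
        free(work);
        freed_items++;
    }
}
```

A waiting caller would keep the item and the completion on its own stack, while a fire-and-forget caller would allocate the item and leave `done` as NULL.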
     
  • The case where we have a superblock doesn't require a loop here as we scan
    over all inodes in writeback_sb_inodes. Split it out into a separate helper
    to make the code simpler. This also allows us to get rid of the sb member in
    struct writeback_control, which was rather out of place there.

    Also update the comments in writeback_sb_inodes that explain the handling
    of inodes from wrong superblocks.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This was just an odd wrapper around writeback_inodes_wb. Removing this
    also allows us to get rid of the bdi member of struct writeback_control
    which was rather out of place there.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Jul, 2010

1 commit

  • Fix kernel-doc to match the function's changed args.

    Warning(fs/fs-writeback.c:190): No description found for parameter 'args'
    Warning(fs/fs-writeback.c:190): Excess function parameter 'sb' description in 'bdi_queue_work_onstack'

    Signed-off-by: Randy Dunlap
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Randy Dunlap
     

11 Jun, 2010

8 commits

  • We need to check for s_instances to make sure we don't bother working
    against a filesystem that is being unmounted, and we need to call
    put_super to make sure a superblock is freed when we race against
    umount. Also no need to keep sb_lock after we got a reference on it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • In "writeback: fix writeback_inodes_wb from writeback_inodes_sb" I
    accidentally removed the requeue_io if we need to skip a superblock
    because we can't pin it. Add it back, otherwise we're getting spurious
    lockups after multiple xfstests runs.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bdi_start_writeback now never gets a superblock passed, so we can just remove
    that case. And to further untangle the code and flatten the call stack
    split it into two trivial helpers for its two callers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bdi_writeback_all only has one caller, so fold it to simplify the code and
    flatten the call stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • When we call writeback_inodes_wb from writeback_inodes_sb we always have
    s_umount held, which currently makes the whole operation a no-op.

    But if we are called to write out inodes for a specific superblock we always
    have s_umount held, so replace the incorrect logic checking for WB_SYNC_ALL
    which only worked by coincidence with the proper check for an explicit
    superblock argument.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Make sure that not only sync_filesystem but all callers of writeback_inodes_sb
    have the superblock protected against remount. As-is this disables all
    functionality for these callers, but the next patch relies on this locking to
    fix writeback_inodes_sb for sync_filesystem.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • If we want to rely on s_umount in the caller we need to wait for completion
    of the I/O submission before returning to the caller. Refactor
    bdi_sync_writeback into a bdi_queue_work_onstack helper and use it for this
    case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The code dealing with bdi_work->state and completion of a bdi_work is a
    major mess currently. This patch makes sure we directly use one set of
    flags to deal with it, and use it consistently, which means:

    - always notify about completion from the rcu callback. We only ever
    wait for it from on-stack callers, so this simplification does not
    even cause a theoretical slowdown currently. It also makes sure we
    don't miss out on the notification if we ever add other callers to
    wait for it.
    - make earlier completion notification depend on the on-stack
    allocation, not the sync mode. If we introduce new callers that
    want to do WB_SYNC_NONE writeback from on-stack callers this will
    be necessary.

    Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline
    a few small functions into their only caller to make the code
    understandable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Jun, 2010

3 commits


25 May, 2010

1 commit

  • When wb_writeback() hasn't written anything it will re-acquire the inode
    lock before calling inode_wait_for_writeback.

    This change tests the sync bit first so that it doesn't need to drop &
    re-acquire the lock if the inode became available while wb_writeback() was
    waiting to get the lock.

    Signed-off-by: Richard Kennedy
    Cc: Alexander Viro
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     

22 May, 2010

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
    fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
    get rid of home-grown mutex in cris eeprom.c
    switch ecryptfs_write() to struct inode *, kill on-stack fake files
    switch ecryptfs_get_locked_page() to struct inode *
    simplify access to ecryptfs inodes in ->readpage() and friends
    AFS: Don't put struct file on the stack
    Ban ecryptfs over ecryptfs
    logfs: replace inode uid,gid,mode initialization with helper function
    ufs: replace inode uid,gid,mode initialization with helper function
    udf: replace inode uid,gid,mode init with helper
    ubifs: replace inode uid,gid,mode initialization with helper function
    sysv: replace inode uid,gid,mode initialization with helper function
    reiserfs: replace inode uid,gid,mode initialization with helper function
    ramfs: replace inode uid,gid,mode initialization with helper function
    omfs: replace inode uid,gid,mode initialization with helper function
    bfs: replace inode uid,gid,mode initialization with helper function
    ocfs2: replace inode uid,gid,mode initialization with helper function
    nilfs2: replace inode uid,gid,mode initialization with helper function
    minix: replace inode uid,gid,mode init with helper
    ext4: replace inode uid,gid,mode init with helper
    ...

    Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)

    Linus Torvalds
     
  • This fixes sparse noise:
    error: dubious one-bit signed bitfield

    Signed-off-by: H Hartley Sweeten
    Cc: Alexander Viro
    Signed-off-by: Al Viro

    H Hartley Sweeten
     
  • Calling schedule without setting the task state to non-running will
    return immediately, so ensure that we set it properly and check our
    sleep conditions after doing so.

    This is a fixup for commit 69b62d01.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Even if the writeout itself isn't a data integrity operation, we need
    to ensure that the caller doesn't drop the sb umount sem before we
    have actually done the writeback.

    This is a fixup for commit e913fc82.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 May, 2010

3 commits

  • Filesystems with delalloc support may dirty the inode during writepages.
    As a result the inode will have dirty metadata flags even after write_inode.
    We already have two dedicated functions for proper data and metadata
    writeback, so it is reasonable to split the flag updates into two stages.

    https://bugzilla.kernel.org/show_bug.cgi?id=15906

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
    writeback to kick off writeback of pending dirty inodes, then follow
    that up with a WB_SYNC_ALL to wait for it. Since umount already holds
    the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
    writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
    since WB_SYNC_ALL writeback is a data integrity operation and thus
    a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
    it's a lot slower.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Prior to 2.6.32, setting /proc/sys/vm/dirty_writeback_centisecs disabled
    periodic dirty writeback from kupdate. This got broken and now causes
    excessive sys CPU usage if set to zero, as we'll keep beating on
    schedule().

    Cc: stable@kernel.org
    Reported-by: Justin Maggard
    Signed-off-by: Jens Axboe

    Jens Axboe
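
A rough sketch of the intended behaviour for the commit above: a zero interval should mean "sleep until explicitly woken" rather than "sleep for zero time and spin". The helper name and the sentinel value are assumptions made for this illustration; the units follow the centisecond sysctl named in the message.

```c
#include <assert.h>

/* Hypothetical sentinel: no periodic wakeup, wait for an explicit one. */
#define MODEL_SLEEP_UNTIL_WOKEN (-1L)

/*
 * Translate the dirty_writeback_centisecs setting into a sleep length in
 * milliseconds. Zero disables the periodic wakeup instead of producing a
 * zero-length sleep that would be retried in a tight loop.
 */
static long model_periodic_sleep_ms(unsigned int interval_centisecs)
{
    long ms;

    if (interval_centisecs == 0)
        return MODEL_SLEEP_UNTIL_WOKEN;
    ms = (long)interval_centisecs * 10L;
    assert(ms > 0);
    return ms;
}
```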
     

10 Apr, 2010

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (34 commits)
    cfq-iosched: Fix the incorrect timeslice accounting with forced_dispatch
    loop: Update mtime when writing using aops
    block: expose the statistics in blkio.time and blkio.sectors for the root cgroup
    backing-dev: Handle class_create() failure
    Block: Fix block/elevator.c elevator_get() off-by-one error
    drbd: lc_element_by_index() never returns NULL
    cciss: unlock on error path
    cfq-iosched: Do not merge queues of BE and IDLE classes
    cfq-iosched: Add additional blktrace log messages in CFQ for easier debugging
    i2o: Remove the dangerous kobj_to_i2o_device macro
    block: remove 16 bytes of padding from struct request on 64bits
    cfq-iosched: fix a kbuild regression
    block: make CONFIG_BLK_CGROUP visible
    Remove GENHD_FL_DRIVERFS
    block: Export max number of segments and max segment size in sysfs
    block: Finalize conversion of block limits functions
    block: Fix overrun in lcm() and move it to lib
    vfs: improve writeback_inodes_wb()
    paride: fix off-by-one test
    drbd: fix al-to-on-disk-bitmap for 4k logical_block_size
    ...

    Linus Torvalds
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
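
The include-selection rule from the commit above (gfp.h if only gfp is used, slab.h if slab is used) is small enough to sketch as a decision function. This is not the conversion script, which is a separate Python program; the enum and function names below are hypothetical, and how usage is detected is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type naming the header a file should include. */
enum model_include {
    MODEL_INCLUDE_NONE,
    MODEL_INCLUDE_GFP_H,
    MODEL_INCLUDE_SLAB_H
};

/*
 * slab.h itself includes gfp.h, so a file that uses slab facilities only
 * needs slab.h even when it also uses gfp definitions.
 */
static enum model_include model_pick_include(bool uses_gfp, bool uses_slab)
{
    if (uses_slab)
        return MODEL_INCLUDE_SLAB_H;
    if (uses_gfp)
        return MODEL_INCLUDE_GFP_H;
    return MODEL_INCLUDE_NONE;
}
```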
     

12 Mar, 2010

1 commit

  • Do not pin/unpin superblock for every inode in writeback_inodes_wb(), pin
    it for the whole group of inodes which belong to the same superblock and
    call writeback_sb_inodes() handler for them.

    Signed-off-by: Edward Shishkin
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Edward Shishkin
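
To illustrate the saving described in the commit above, here is a userspace model that counts how many pin operations are needed when the superblock is pinned once per run of consecutive inodes instead of once per inode. The types and the function are hypothetical; a superblock is represented by a plain integer id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inode model: only the owning superblock id matters here. */
struct model_sb_inode {
    int sb_id;
};

/*
 * Walk the list and pin each superblock once for the whole run of
 * consecutive inodes that belong to it. Returns the number of pins taken.
 */
static size_t model_count_pins(const struct model_sb_inode *inodes, size_t n)
{
    size_t pins = 0;
    size_t i = 0;

    assert(inodes != NULL || n == 0);
    while (i < n) {
        int sb = inodes[i].sb_id;

        pins++; /* pin the superblock for this run */
        while (i < n && inodes[i].sb_id == sb)
            i++; /* write out every inode of the pinned superblock */
        /* the superblock would be unpinned here */
    }
    return pins;
}
```

For a list with superblock ids 1, 1, 1, 2, 2, 1 this takes three pins, where pinning per inode would take six.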
     

06 Mar, 2010

2 commits

  • This gives the filesystem more information about the writeback that
    is happening. Trond requested this for the NFS unstable write handling,
    and other filesystems might benefit from this too by being able to
    distinguish between the different callers in more detail.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Similar to the fsync issue fixed a while ago in commit
    2daea67e966dc0c42067ebea015ddac6834cef88 we need to wait for data to
    actually hit the disk before writing out the metadata to guarantee
    data integrity for filesystems that modify the inode in the data I/O
    completion path. Currently XFS and NFS handle this manually, and AFS
    has a write_inode method that does nothing but wait for data, while
    others are possibly missing out on this.

    Fortunately this change has a lot less impact than the fsync change
    as none of the write_inode methods starts data writeout of any form
    by itself.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
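
The ordering requirement in the commit above can be shown with a small trace model: data is submitted, then waited on, and only then is the inode written. All names are hypothetical, and limiting the wait to data-integrity writeback is an assumption of this sketch rather than something the message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_step { STEP_WRITE_DATA, STEP_WAIT_DATA, STEP_WRITE_INODE };

#define MODEL_MAX_STEPS 3

/* Records the order in which the steps were performed. */
struct model_trace {
    enum model_step steps[MODEL_MAX_STEPS];
    size_t count;
};

static void model_record(struct model_trace *tr, enum model_step s)
{
    assert(tr != NULL && tr->count < MODEL_MAX_STEPS);
    tr->steps[tr->count++] = s;
}

/* Data first, wait for it, and only then write the inode metadata. */
static void model_writeback_inode(struct model_trace *tr,
                                  bool data_integrity, bool metadata_dirty)
{
    model_record(tr, STEP_WRITE_DATA);
    if (data_integrity)
        model_record(tr, STEP_WAIT_DATA);
    if (metadata_dirty)
        model_record(tr, STEP_WRITE_INODE);
}
```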
     

03 Jan, 2010

1 commit


23 Dec, 2009

1 commit

  • ext4, at least, would like to start pushing on writeback if it starts
    to get close to ENOSPC when reserving worst-case blocks for delalloc
    writes. Writing out delalloc data will convert those worst-case
    predictions into usually smaller actual usage, freeing up space
    before we hit ENOSPC based on this speculation.

    Thanks to Jens for the suggestion for the helper function,
    & the naming help.

    I've made the helper return status on whether writeback was
    started even though I don't plan to use it in the ext4 patch;
    it seems like it would be potentially useful to test this
    in some cases.

    Signed-off-by: Eric Sandeen
    Acked-by: Jan Kara

    Eric Sandeen
     

03 Dec, 2009

3 commits


26 Sep, 2009

7 commits

  • Sometimes we only want to write pages from a specific super_block,
    so allow that to be passed in.

    This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb
    causing writeback on all super_blocks on a bdi, where we only really
    want to sync a specific sb from writeback_inodes_sb().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pointless to iterate other devices looking for a super, when
    we have a bdi mapping.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Debug traces show that in per-bdi writeback, the inode under writeback
    almost always gets redirtied by a busy dirtier. We used to call
    redirty_tail() in this case, which could delay the inode for up to 30s.

    This is unacceptable because it now happens so frequently for plain cp/dd,
    that the accumulated delays could make writeback of big files very slow.

    So let's distinguish between data redirty and metadata only redirty.
    The first one is caused by a busy dirtier, while the latter one could
    happen in XFS, NFS, etc. when they are doing delalloc or updating isize.

    The inode being busy dirtied will now be requeued for next io, while
    the inode being redirtied by fs will continue to be delayed to avoid
    repeated IO.

    CC: Jan Kara
    CC: Theodore Ts'o
    CC: Dave Chinner
    CC: Chris Mason
    CC: Christoph Hellwig
    Signed-off-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Wu Fengguang
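
The distinction drawn in the commit above can be sketched as a decision function: an inode with pages still dirty is requeued for the next io pass, while an inode redirtied only in its metadata keeps being delayed. This is a simplified userspace model with hypothetical names; the real code handles more cases than these three.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes after one writeback pass over an inode. */
enum model_requeue {
    MODEL_REQUEUE_IO,   /* retry soon: data was redirtied by a busy dirtier */
    MODEL_REDIRTY_TAIL, /* delay: only the filesystem redirtied metadata */
    MODEL_CLEAN         /* nothing left to write */
};

static enum model_requeue model_after_writeback(bool pages_still_dirty,
                                                bool metadata_dirty)
{
    if (pages_still_dirty)
        return MODEL_REQUEUE_IO;
    if (metadata_dirty)
        return MODEL_REDIRTY_TAIL;
    return MODEL_CLEAN;
}
```

Checking the data case first is what keeps a plain cp or dd from being pushed onto the delayed path.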
     
  • Currently we pin the inode->i_sb for every single inode. This
    increases cache traffic on sb->s_umount sem. Let's instead
    cache the inode sb pin state and keep the super_block pinned
    for as long as we keep writing out inodes from the same
    super_block.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If we only moved inodes from a single super_block to the temporary
    list, there's no point in doing a resort for multiple super_blocks.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk has
    several partitions, writeback might keep the spindle moving between partitions.
    To reduce the movement, it is better to write a big chunk of one partition and then
    move to another. Inodes from one fs are usually in one partition, so ideally moving
    inodes from one fs together should reduce spindle movement. This patch tries to
    address this. Before per-bdi writeback was added, the behavior was to write inodes
    from one fs first and then another, so the patch restores the previous behavior.
    The loop in the patch is a bit ugly, should we add a dirty list for each
    superblock in bdi_writeback?

    A test on a two-partition disk with the attached fio script shows about a 3% to 6%
    improvement.

    Signed-off-by: Shaohua Li
    Reviewed-by: Wu Fengguang
    Signed-off-by: Jens Axboe

    Shaohua Li
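
The grouping idea from the commit above can be modelled on plain arrays: inodes are reordered so that all inodes of one superblock are adjacent, keeping the original order within each group. This is only a sketch of the idea, not the loop from the patch; `model_inode` and `model_group_by_sb` are hypothetical, and the quadratic scan is chosen for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inode model: an inode number plus its superblock id. */
struct model_inode {
    int ino;
    int sb_id;
};

/*
 * Copy `in` to `out` so that inodes sharing a superblock are contiguous.
 * Groups appear in order of first occurrence; order inside a group is kept.
 */
static void model_group_by_sb(const struct model_inode *in,
                              struct model_inode *out, size_t n)
{
    size_t filled = 0;

    for (size_t i = 0; i < n; i++) {
        bool seen = false;

        /* Skip superblocks whose group was already emitted. */
        for (size_t j = 0; j < i; j++) {
            if (in[j].sb_id == in[i].sb_id) {
                seen = true;
                break;
            }
        }
        if (seen)
            continue;
        /* Emit every inode of this superblock in one run. */
        for (size_t k = i; k < n; k++) {
            if (in[k].sb_id == in[i].sb_id)
                out[filled++] = in[k];
        }
    }
    assert(filled == n);
}
```

When every inode belongs to the same superblock the output equals the input, which matches the neighbouring commit's point that a resort is pointless in that case.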
     
  • Signed-off-by: Jens Axboe

    Jens Axboe