04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

31 Oct, 2011

1 commit

  • This creates a new 'reason' field in a wb_writeback_work
    structure, which unambiguously identifies who initiates
    writeback activity. A 'wb_reason' enumeration has been
    added to writeback.h, to enumerate the possible reasons.

    The 'writeback_work_class' and tracepoint event class and
    'writeback_queue_io' tracepoints are updated to include the
    symbolic 'reason' in all trace events.

    And the 'writeback_inodes_sbXXX' family of routines has had
    a wb_stats parameter added to them, so callers can specify
    why writeback is being started.

    Acked-by: Jan Kara
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: Wu Fengguang

    Curt Wohlgemuth
     

21 Jul, 2011

1 commit

  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push down taking i_mutex and
    the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems can drop taking the i_mutex altogether it seems, like ext3 and
    ocfs2. For correctness sake I just pushed everything down in all cases to make
    sure that we keep the current behavior the same for everybody, and then each
    individual fs maintainer can make up their mind about what to do from there.
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

21 Mar, 2011

1 commit

  • It is frequently useful to sync a single file system, instead of all
    mounted file systems via sync(2):

    - On machines with many mounts, it is not at all uncommon for some of
    them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
    those and may never get to the one you do care about (e.g., /).
    - Some applications write lots of data to the file system and then
    want to make sure it is flushed to disk. Calling fsync(2) on each
    file introduces unnecessary ordering constraints that result in a large
    amount of sub-optimal writeback/flush/commit behavior by the file
    system.

    There are currently two ways (that I know of) to sync a single super_block:

    - BLKFLSBUF ioctl on the block device: That also invalidates the bdev
    mapping, which isn't usually desirable, and doesn't work for non-block
    file systems.
    - 'mount -o remount,rw' will call sync_filesystem as an artifact of the
    current implemention. Relying on this little-known side effect for
    something like data safety sounds foolish.

    Both of these approaches require root privileges, which some applications
    do not have (nor should they need?) given that sync(2) is an unprivileged
    operation.

    This patch introduces a new system call syncfs(2) that takes an fd and
    syncs only the file system it references. Maybe someday we can

    $ sync /some/path

    and not get

    sync: ignoring all arguments

    The syscall is motivated by comments by Al and Christoph at the last LSF.
    syncfs(2) seems like an appropriate name given statfs(2).

    A similar ioctl was also proposed a while back, see
    http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2

    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     

17 Mar, 2011

1 commit


10 Aug, 2010

1 commit


01 Jun, 2010

2 commits


28 May, 2010

1 commit


22 May, 2010

5 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
    fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
    get rid of home-grown mutex in cris eeprom.c
    switch ecryptfs_write() to struct inode *, kill on-stack fake files
    switch ecryptfs_get_locked_page() to struct inode *
    simplify access to ecryptfs inodes in ->readpage() and friends
    AFS: Don't put struct file on the stack
    Ban ecryptfs over ecryptfs
    logfs: replace inode uid,gid,mode initialization with helper function
    ufs: replace inode uid,gid,mode initialization with helper function
    udf: replace inode uid,gid,mode init with helper
    ubifs: replace inode uid,gid,mode initialization with helper function
    sysv: replace inode uid,gid,mode initialization with helper function
    reiserfs: replace inode uid,gid,mode initialization with helper function
    ramfs: replace inode uid,gid,mode initialization with helper function
    omfs: replace inode uid,gid,mode initialization with helper function
    bfs: replace inode uid,gid,mode initialization with helper function
    ocfs2: replace inode uid,gid,mode initialization with helper function
    nilfs2: replace inode uid,gid,mode initialization with helper function
    minix: replace inode uid,gid,mode init with helper
    ext4: replace inode uid,gid,mode init with helper
    ...

    Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)

    Linus Torvalds
     
  • Now that the last user passing a NULL file pointer is gone we can remove
    the redundant dentry argument and associated hacks inside vfs_fsynmc_range.

    The next step will be removig the dentry argument from ->fsync, but given
    the luck with the last round of method prototype changes I'd rather
    defer this until after the main merge window.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • ... and switch the simple "loop over superblocks and do something"
    loops to it.

    Signed-off-by: Al Viro

    Al Viro
     
  • At the same time we can kill s_need_restart and local mutex in there.
    __put_super() made public for a while; will be gone later.

    Signed-off-by: Al Viro

    Al Viro
     
  • We used to remove from s_list and s_instances at the same
    time. So let's *not* do the former and skip superblocks
    that have empty s_instances in the loops over s_list.

    The next step, of course, will be to get rid of rescan logics
    in those loops.

    Signed-off-by: Al Viro

    Al Viro
     

17 May, 2010

1 commit

  • When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
    writeback to kick off writeback of pending dirty inodes, then follow
    that up with a WB_SYNC_ALL to wait for it. Since umount already holds
    the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
    writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
    since WB_SYNC_ALL writeback is a data integrity operation and thus
    a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
    it's a lot slower.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Apr, 2010

1 commit

  • noop_backing_dev_info is used only as a flag to mark filesystems that
    don't have any backing store, like tmpfs, procfs, spufs, etc.

    Signed-off-by: Joern Engel

    Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
    to the noop_backing_dev_info is not legal and will not result in
    them being flushed, but we already catch this condition in
    __mark_inode_dirty() when checking for a registered bdi.

    Signed-off-by: Jens Axboe

    Jörn Engel
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

05 Mar, 2010

1 commit

  • Currenly sync_quota_sb does a lot of sync and truncate action that only
    applies to "VFS" style quotas and is actively harmful for the sync
    performance in XFS. Move it into vfs_quota_sync and add a wait parameter
    to ->quota_sync to tell if we need it or not.

    My audit of the GFS2 code says it's also not needed given the way GFS2
    implements quotas, but I'd be happy if this can get a detailed review.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

18 Dec, 2009

1 commit

  • We recently go rid of all callers of do_sync_file_range as they're better
    served with vfs_fsync or the filemap_write_and_wait. Now that
    do_sync_file_range is down to a single caller fold it into it so that people
    don't start using it again accidentally. While at it also switch it from
    using __filemap_fdatawrite_range(..., WB_SYNC_ALL) to the more clear
    filemap_fdatawrite_range().

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

10 Dec, 2009

2 commits

  • All callers really want the more logical filemap_fdatawait_range interface,
    so convert them to use it and merge wait_on_page_writeback_range into
    filemap_fdatawait_range.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • While Linux provided an O_SYNC flag basically since day 1, it took until
    Linux 2.4.0-test12pre2 to actually get it implemented for filesystems,
    since that day we had generic_osync_around with only minor changes and the
    great "For now, when the user asks for O_SYNC, we'll actually give
    O_DSYNC" comment. This patch intends to actually give us real O_SYNC
    semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC
    patches which are required before this patch it's actually surprisingly
    simple, we just need to figure out when to set the datasync flag to
    vfs_fsync_range and when not.

    This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's
    numerical value to keep binary compatibility, and adds a new real O_SYNC
    flag. To guarantee backwards compatiblity it is defined as expanding to
    both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
    sure we are backwards-compatible when compiled against the new headers.

    This also means that all places that don't care about the differences can
    just check O_DSYNC and get the right behaviour for O_SYNC, too - only
    places that actuall care need to check __O_SYNC in addition. Drivers and
    network filesystems have been updated in a fail safe way to always do the
    full sync magic if O_DSYNC is set. The few places setting O_SYNC for
    lower layers are kept that way for now to stay failsafe.

    We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
    to make sure we always get these sane options.

    Note that parisc really screwed up their headers as they already define a
    O_DSYNC that has always been a no-op. We try to repair it by using it for
    the new O_DSYNC and redefinining O_SYNC to send both the traditional
    O_SYNC numerical value _and_ the O_DSYNC one.

    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Grant Grundler
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Andreas Dilger
    Acked-by: Trond Myklebust
    Acked-by: Kyle McMartin
    Acked-by: Ulrich Drepper
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

23 Sep, 2009

1 commit

  • According to Documentation/CodingStyle the EXPORT* macro should follow
    immediately after the closing function brace line.

    Also, mark_buffer_async_write_endio() and do_thaw_all() are not used
    elsewhere so they should be marked as static.

    In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to
    that file.

    Signed-off-by: H Hartley Sweeten
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     

16 Sep, 2009

1 commit

  • We do this automatically in get_sb_bdev() from the set_bdev_super()
    callback. Filesystems that have their own private backing_dev_info
    must assign that in ->fill_super().

    Note that ->s_bdi assignment is required for proper writeback!

    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Sep, 2009

2 commits

  • Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
    the data, the calls into ->fsync to write out the metadata and then finally
    calls filemap_fdatawait to wait for the data I/O to complete. What sounds
    like a clever micro-optimization actually is nast trap for many filesystems.

    For many modern filesystems i_size or other inode information is only
    updated on I/O completion and we need to wait for I/O to finish before
    we can write out the metadata. For old fashionen filesystems that
    instanciate blocks during the actual write and also update the metadata
    at that point it opens up a large window were we could expose uninitialized
    blocks after a crash. While a few filesystems that need it already wait
    for the I/O to finish inside their ->fsync methods it is rather suboptimal
    as it is done under the i_mutex and also always for the whole file instead
    of just a part as we could do for O_SYNC handling.

    Here is a small audit of all fsync instances in the tree:

    - spufs_mfc_fsync:
    - ps3flash_fsync:
    - vol_cdev_fsync:
    - printer_fsync:
    - fb_deferred_io_fsync:
    - bad_file_fsync:
    - simple_sync_file:

    don't care - filesystems/drivers do't use the page cache or are
    purely in-memory.

    - simple_fsync:
    - file_fsync:
    - affs_file_fsync:
    - fat_file_fsync:
    - jfs_fsync:
    - ubifs_fsync:
    - reiserfs_dir_fsync:
    - reiserfs_sync_file:

    never touch pagecache themselves. We need to wait before if we do
    not want to expose stale data after an allocation.

    - afs_fsync:
    - fuse_fsync_common:

    do the waiting writeback itself in awkward ways, would benefit from
    proper semantics

    - block_fsync:

    Does a filemap_write_and_wait on the block device inode. Because we
    now have f_mapping that is the same inode we call it on in vfs_fsync.
    So just removing it and letting the VFS do the work in one go would
    be an improvement.

    - btrfs_sync_file:
    - cifs_fsync:
    - xfs_file_fsync:

    need the wait first and currently do it themselves. would benefit from
    doing it outside i_mutex.

    - coda_fsync:
    - ecryptfs_fsync:
    - exofs_file_fsync:
    - shm_fsync:

    only passes the fsync through to the lower layer

    - ext3_sync_file:

    doesn't seem to care, comments are confusing.

    - ext4_sync_file:

    would need the wait to work correctly for delalloc mode with late
    i_size updates. Otherwise the ext3 comment applies.

    currently implemens it's own writeback and wait in an odd way,
    could benefit from doing it properly.

    - gfs2_fsync:

    not needed for journaled data mode, but probably harmless there.
    Currently writes back data asynchronously itself. Needs some
    major audit.

    - hostfs_fsync:

    just calls fsync/datasync on the host FD. Without the wait before
    data might not even be inflight yet if we're unlucky.

    - hpfs_file_fsync:
    - ncp_fsync:

    no-ops. Dangerous before and after.

    - jffs2_fsync:

    just calls jffs2_flush_wbuf_gc, not sure how this relates to data.

    - nfs_fsync_dir:

    just increments stats, claims all directory operations are synchronous

    - nfs_file_fsync:

    only writes out data??? Looks very odd.

    - nilfs_sync_file:

    looks like it expects all data done, but not sure from the code

    - ntfs_dir_fsync:
    - ntfs_file_fsync:

    appear to do their own data writeback. Very convoluted code.

    - ocfs2_sync_file:

    does it's own data writeback, but no wait. probably needs the wait.

    - smb_fsync:

    according to a comment expects all pages written already, probably needs
    the wait before.

    This patch only changes vfs_fsync_range, removal of the wait in the methods
    that have it is left to the filesystem maintainers. Note that most
    filesystems really do need an audit for their fsync methods given the
    gems found in this very brief audit.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Introduce new function for generic inode syncing (vfs_fsync_range) and use
    it from fsync() path. Introduce also new helper for syncing after a sync
    write (generic_write_sync) using the generic function.

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

    CC: Evgeniy Polyakov
    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    CC: Anton Altaparmakov
    CC: linux-ntfs-dev@lists.sourceforge.net
    CC: OGAWA Hirofumi
    CC: linux-ext4@vger.kernel.org
    CC: tytso@mit.edu
    Acked-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

11 Sep, 2009

2 commits

  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT more smooth in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as does random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This adds two new exported functions:

    - writeback_inodes_sb(), which only attempts to writeback dirty inodes on
    this super_block, for WB_SYNC_NONE writeout.
    - sync_inodes_sb(), which writes out all dirty inodes on this super_block
    and also waits for the IO to complete.

    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Jul, 2009

1 commit

  • I run many ffsb test cases on JBODs (typically 13/12 disks). Comparing
    with kernel 2.6.30, 2.6.31-rc1 has about 16% regression with
    ffsb_create_4k. The sub test case creates files continuously for 10
    minitues and every file is 1MB.

    Bisect located below patch.

    5cee5815d1564bbbd505fea86f4550f1efdb5cd0 is first bad commit
    commit 5cee5815d1564bbbd505fea86f4550f1efdb5cd0
    Author: Jan Kara
    Date: Mon Apr 27 16:43:51 2009 +0200

    vfs: Make sys_sync() use fsync_super() (version 4)

    It is unnecessarily fragile to have two places (fsync_super() and do_sync())
    doing data integrity sync of the filesystem. Alter __fsync_super() to
    accommodate needs of both callers and use it. So after this patch
    __fsync_super() is the only place where we gather all the calls needed to
    properly send all data on a filesystem to disk.

    As a matter of fact, ffsb calls sys_sync in the end to make sure all data
    is flushed to disks and the flushing is counted into the result. vmstat
    shows ffsb is blocked when syncing for a long time. With 2.6.30, ffsb is
    blocked for a short time.

    I checked the patch and did experiments to recover the original methods.
    Eventually, the root cause is the patch deletes the calling to
    wakeup_pdflush when syncing, so only ffsb is blocked on disk I/O.
    wakeup_pdflush could ask pdflush to write back pages with ffsb at the
    same time.

    [akpm@linux-foundation.org: restore comment too]
    Signed-off-by: Zhang Yanmin
    Cc: Jan Kara
    Cc: Al Viro
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     

12 Jun, 2009

9 commits

  • Now that all filesystems provide ->sync_fs methods we can change
    __sync_filesystem to only call ->sync_fs.

    This gives us a clear separation between periodic writeouts which
    are driven by ->write_super and data integrity syncs that go
    through ->sync_fs. (modulo file_fsync which is also going away)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Push down lock_super into ->write_super instances and remove it from the
    caller.

    Following filesystem don't need ->s_lock in ->write_super and are skipped:

    * bfs, nilfs2 - no other uses of s_lock and have internal locks in
    ->write_super
    * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
    * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
    ->write_super
    * xfs - no other uses of s_lock and uses internal lock (buffer lock on
    superblock buffer) to serialize ->write_super. Also xfs_fs_write_super
    is superflous and will go away in the next merge window

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Make sure a superblock really is writeable by checking MS_RDONLY
    under s_umount. sync_filesystems needed some re-arragement for
    that, but all but one sync_filesystem caller had the correct locking
    already so that we could add that check there. cachefiles grew
    s_umount locking.

    I've also added a WARN_ON to sync_filesystem to assert this for
    future callers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Introduce this function which just writes all the quota structures but
    avoids all the syncing and cache pruning work to expose quota structures
    to userspace. Use this function from __sync_filesystem when wait == 0.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Currently the VFS calls vfs_dq_sync to sync out disk quotas for a given
    superblock. This is a small wrapper around sync_dquots which for the
    case of a non-NULL superblock is a small wrapper around quota_sync_sb.

    Just make quota_sync_sb global (rename it to sync_quota_sb) and call it
    directly. Also call it directly for those cases in quota.c that have a
    superblock and leave sync_dquots purely an iterator over sync_quota_sb and
    remove it's superblock argument.

    To make this nicer move the check for the lack of a quota_sync method
    from the callers into sync_quota_sb.

    [folded build fix from Alexander Beregalov ]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Rename the function so that it better describe what it really does. Also
    remove the unnecessary include of buffer_head.h.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Move sync_filesystems(), __fsync_super(), fsync_super() from
    super.c to sync.c where it fits better.

    [build fixes folded]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • It is unnecessarily fragile to have two places (fsync_super() and do_sync())
    doing data integrity sync of the filesystem. Alter __fsync_super() to
    accommodate needs of both callers and use it. So after this patch
    __fsync_super() is the only place where we gather all the calls needed to
    properly send all data on a filesystem to disk.

    Nice bonus is that we get a complete livelock avoidance and write_supers()
    is now only used for periodic writeback of superblocks.

    sync_blockdevs() introduced a couple of patches ago is gone now.

    [build fixes folded]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • So far, do_sync() called:
    sync_inodes(0);
    sync_supers();
    sync_filesystems(0);
    sync_filesystems(1);
    sync_inodes(1);

    This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
    submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
    transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
    not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
    and others are hit by this) when racing e.g. with background writeback. A
    similar problem hits also other filesystems (e.g. ext2) because of
    write_supers() being called before the sync_inodes(1).

    Change the ordering of calls in do_sync() - this requires a new function
    sync_blockdevs() to preserve the property that block devices are always synced
    after write_super() / sync_fs() call.

    The same issue is fixed in __fsync_super() function used on umount /
    remount read-only.

    [AV: build fixes]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

28 Mar, 2009

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits)
    ext2: Zero our b_size in ext2_quota_read()
    trivial: fix typos/grammar errors in fs/Kconfig
    quota: Coding style fixes
    quota: Remove superfluous inlines
    quota: Remove uppercase aliases for quota functions.
    nfsd: Use lowercase names of quota functions
    jfs: Use lowercase names of quota functions
    udf: Use lowercase names of quota functions
    ufs: Use lowercase names of quota functions
    reiserfs: Use lowercase names of quota functions
    ext4: Use lowercase names of quota functions
    ext3: Use lowercase names of quota functions
    ext2: Use lowercase names of quota functions
    ramfs: Remove quota call
    vfs: Use lowercase names of quota functions
    quota: Remove dqbuf_t and other cleanups
    quota: Remove NODQUOT macro
    quota: Make global quota locks cacheline aligned
    quota: Move quota files into separate directory
    ext4: quota reservation for delayed allocation
    ...

    Linus Torvalds
     

26 Mar, 2009

1 commit