15 Sep, 2009

10 commits

  • * 'osync_cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
    fsync: wait for data writeout completion before calling ->fsync
    vfs: Remove generic_osync_inode() and sync_page_range{_nolock}()
    fat: Opencode sync_page_range_nolock()
    pohmelfs: Use new syncing helper
    xfs: Convert sync_page_range() to simple filemap_write_and_wait_range()
    ocfs2: Update syncing after splicing to match generic version
    ntfs: Use new syncing helpers and update comments
    ext4: Remove syncing logic from ext4_file_write
    ext3: Remove syncing logic from ext3_file_write
    ext2: Update comment about generic_osync_inode
    vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
    vfs: Rename generic_file_aio_write_nolock
    ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
    pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
    vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
    vfs: Export __generic_file_aio_write() and add some comments
    vfs: Introduce filemap_fdatawait_range

    Linus Torvalds
     
  • * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
    GFS2: Whitespace fixes
    GFS2: Remove unused sysfs file
    GFS2: Be extra careful about deallocating inodes
    GFS2: Remove no_formal_ino generating code
    GFS2: Rename eattr.[ch] as xattr.[ch]
    GFS2: Clean up of extended attribute support
    GFS2: Add explanation of extended attr on-disk format
    GFS2: Add "-o errors=panic|withdraw" mount options
    GFS2: jumping to wrong label?
    GFS2: free disk inode which is deleted by remote node -V2
    GFS2: Add a document explaining GFS2's uevents
    GFS2: Add sysfs link to device
    GFS2: Replace assertion with proper error handling
    GFS2: Improve error handling in inode allocation
    GFS2: Add some more info to uevents
    GFS2: Add online uevent to GFS2

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
    udf: Fix possible corruption when close races with write
    udf: Perform preallocation only for regular files
    udf: Remove wrong assignment in udf_symlink
    udf: Remove dead code

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (21 commits)
    fs/Kconfig: move nilfs2 outside misc filesystems
    nilfs2: convert nilfs_bmap_lookup to an inline function
    nilfs2: allow btree code to directly call dat operations
    nilfs2: add update functions of virtual block address to dat
    nilfs2: remove individual gfp constants for each metadata file
    nilfs2: stop zero-fill of btree path just before free it
    nilfs2: remove unused btree argument from btree functions
    nilfs2: remove nilfs_dat_abort_start and nilfs_dat_abort_free
    nilfs2: shorten freeze period due to GC in write operation v3
    nilfs2: add more check routines in mount process
    nilfs2: An unassigned variable is assigned to a never used structure member
    nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT
    nilfs2: stop using periodic write_super callback
    nilfs2: clean up nilfs_write_super
    nilfs2: fix disorder of nilfs_write_super in nilfs_sync_fs
    nilfs2: remove redundant super block commit
    nilfs2: implement nilfs_show_options to display mount options in /proc/mounts
    nilfs2: always lookup disk block address before reading metadata block
    nilfs2: use semaphore to protect pointer to a writable FS-instance
    nilfs2: fix format string compile warning (ino_t)
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
    cifs: consolidate reconnect logic in smb_init routines
    cifs: Replace wrtPending with a real reference count
    cifs: protect GlobalOplock_Q with its own spinlock
    cifs: use tcon pointer in cifs_show_options
    cifs: send IPv6 addr in upcall with colon delimiters
    [CIFS] Fix checkpatch warnings
    PATCH] cifs: fix broken mounts when a SSH tunnel is used (try #4)
    [CIFS] Memory leak in ntlmv2 hash calculation
    [CIFS] potential NULL dereference in parse_DFS_referrals()

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
    netxen: update copyright
    netxen: fix tx timeout recovery
    netxen: fix file firmware leak
    netxen: improve pci memory access
    netxen: change firmware write size
    tg3: Fix return ring size breakage
    netxen: build fix for INET=n
    cdc-phonet: autoconfigure Phonet address
    Phonet: back-end for autoconfigured addresses
    Phonet: fix netlink address dump error handling
    ipv6: Add IFA_F_DADFAILED flag
    net: Add DEVTYPE support for Ethernet based devices
    mv643xx_eth.c: remove unused txq_set_wrr()
    ucc_geth: Fix hangs after switching from full to half duplex
    ucc_geth: Rearrange some code to avoid forward declarations
    phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
    drivers/net/phy: introduce missing kfree
    drivers/net/wan: introduce missing kfree
    net: force bridge module(s) to be GPL
    Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
    ...

    Fixed up trivial conflicts:

    - arch/x86/include/asm/socket.h

    converted to in the x86 tree. The generic
    header has the same new #define's, so that works out fine.

    - drivers/net/tun.c

    fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
    switched over to using 'tun->socket.sk' instead of the redundantly
    available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
    to the TUN driver") which added a new 'tun->sk' use.

    Noted in 'next' by Stephen Rothwell.

    Linus Torvalds
     
  • When we close a file, we remove preallocated blocks from it. But this
    truncation was not protected by i_mutex and thus it could have raced with a
    write through a different fd and caused crashes or even filesystem corruption.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • So far we preallocated blocks for directories as well, but that raises the
    question of when to get rid of preallocated blocks we don't need. So far
    we removed them in udf_clear_inode() which has a disadvantage that
    1) blocks are unavailable long after writing to a directory finished
    and thus one can get out of space unnecessarily early
    2) releasing blocks from udf_clear_inode is problematic because VFS
    does not expect us to redirty inode there and it also slows down
    memory reclaim.

    So preallocate blocks only for regular files where we can drop preallocation
    in udf_release_file.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Recomputation of the pointer was wrong (it should have been just increment).
    Luckily, we never use the computed value. Remove it.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Remove code that never gets used.

    Signed-off-by: Jan Kara

    Jan Kara
     

14 Sep, 2009

30 commits

  • Currently vfs_fsync(_range) first calls filemap_fdatawrite to write out
    the data, then calls into ->fsync to write out the metadata and then finally
    calls filemap_fdatawait to wait for the data I/O to complete. What sounds
    like a clever micro-optimization actually is a nasty trap for many filesystems.

    For many modern filesystems i_size or other inode information is only
    updated on I/O completion and we need to wait for I/O to finish before
    we can write out the metadata. For old-fashioned filesystems that
    instantiate blocks during the actual write and also update the metadata
    at that point it opens up a large window where we could expose uninitialized
    blocks after a crash. While a few filesystems that need it already wait
    for the I/O to finish inside their ->fsync methods it is rather suboptimal
    as it is done under the i_mutex and also always for the whole file instead
    of just a part as we could do for O_SYNC handling.
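
    The ordering problem can be illustrated with a toy model. This is only a
    sketch, not kernel code: `ToyInode` and its methods are hypothetical
    stand-ins for an inode whose i_size is updated on I/O completion, and the
    two `vfs_fsync_*` functions merely mimic the old and the new call order
    described above.

```python
class ToyInode:
    """Hypothetical inode whose size is only updated when data I/O completes."""

    def __init__(self):
        self.i_size = 0        # in-memory size
        self.pending = 0       # bytes submitted for writeout, not yet completed
        self.on_disk_size = 0  # size recorded by the last metadata write

    def fdatawrite(self, nbytes):
        # Stand-in for filemap_fdatawrite: submit I/O without waiting.
        self.pending += nbytes

    def fdatawait(self):
        # Stand-in for filemap_fdatawait: I/O completion updates i_size.
        self.i_size += self.pending
        self.pending = 0

    def fsync_method(self):
        # Stand-in for ->fsync: write out the metadata as it is right now.
        self.on_disk_size = self.i_size


def vfs_fsync_old(inode, nbytes):
    # Old order: start data writeout, write metadata, wait for data last.
    inode.fdatawrite(nbytes)
    inode.fsync_method()
    inode.fdatawait()


def vfs_fsync_new(inode, nbytes):
    # New order: start data writeout, wait for it, then write metadata.
    inode.fdatawrite(nbytes)
    inode.fdatawait()
    inode.fsync_method()
```

    In this model the old order records the size from before I/O completion,
    while the new order records the size that includes the written data.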

    Here is a small audit of all fsync instances in the tree:

    - spufs_mfc_fsync:
    - ps3flash_fsync:
    - vol_cdev_fsync:
    - printer_fsync:
    - fb_deferred_io_fsync:
    - bad_file_fsync:
    - simple_sync_file:

    don't care - filesystems/drivers don't use the page cache or are
    purely in-memory.

    - simple_fsync:
    - file_fsync:
    - affs_file_fsync:
    - fat_file_fsync:
    - jfs_fsync:
    - ubifs_fsync:
    - reiserfs_dir_fsync:
    - reiserfs_sync_file:

    never touch pagecache themselves. We need to wait before if we do
    not want to expose stale data after an allocation.

    - afs_fsync:
    - fuse_fsync_common:

    do the waiting writeback itself in awkward ways, would benefit from
    proper semantics

    - block_fsync:

    Does a filemap_write_and_wait on the block device inode. Because we
    now have f_mapping that is the same inode we call it on in vfs_fsync.
    So just removing it and letting the VFS do the work in one go would
    be an improvement.

    - btrfs_sync_file:
    - cifs_fsync:
    - xfs_file_fsync:

    need the wait first and currently do it themselves. would benefit from
    doing it outside i_mutex.

    - coda_fsync:
    - ecryptfs_fsync:
    - exofs_file_fsync:
    - shm_fsync:

    only passes the fsync through to the lower layer

    - ext3_sync_file:

    doesn't seem to care, comments are confusing.

    - ext4_sync_file:

    would need the wait to work correctly for delalloc mode with late
    i_size updates. Otherwise the ext3 comment applies.

    currently implements its own writeback and wait in an odd way,
    could benefit from doing it properly.

    - gfs2_fsync:

    not needed for journaled data mode, but probably harmless there.
    Currently writes back data asynchronously itself. Needs some
    major audit.

    - hostfs_fsync:

    just calls fsync/datasync on the host FD. Without the wait before
    data might not even be inflight yet if we're unlucky.

    - hpfs_file_fsync:
    - ncp_fsync:

    no-ops. Dangerous before and after.

    - jffs2_fsync:

    just calls jffs2_flush_wbuf_gc, not sure how this relates to data.

    - nfs_fsync_dir:

    just increments stats, claims all directory operations are synchronous

    - nfs_file_fsync:

    only writes out data??? Looks very odd.

    - nilfs_sync_file:

    looks like it expects all data done, but not sure from the code

    - ntfs_dir_fsync:
    - ntfs_file_fsync:

    appear to do their own data writeback. Very convoluted code.

    - ocfs2_sync_file:

    does its own data writeback, but no wait. probably needs the wait.

    - smb_fsync:

    according to a comment expects all pages written already, probably needs
    the wait before.

    This patch only changes vfs_fsync_range, removal of the wait in the methods
    that have it is left to the filesystem maintainers. Note that most
    filesystems really do need an audit for their fsync methods given the
    gems found in this very brief audit.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Remove these three functions since nobody uses them anymore.

    Signed-off-by: Jan Kara

    Jan Kara
     
  • fat_cont_expand() is the only user of sync_page_range_nolock(). It's also the
    only user of generic_osync_inode() which does not have a file open. So
    opencode needed actions for FAT so that we can convert generic_osync_inode() to
    a standard syncing path.

    Update a comment about generic_osync_inode().

    CC: OGAWA Hirofumi
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Christoph Hellwig says that it is enough for XFS to call
    filemap_write_and_wait_range() instead of sync_page_range() because we do
    all the metadata syncing when forcing the log.

    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    CC: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Update ocfs2 specific splicing code to use generic syncing helper. The sync now
    does not happen under rw_lock because generic_write_sync() acquires i_mutex
    which ranks above rw_lock. That should not matter because standard fsync path
    does not hold it either.

    Acked-by: Joel Becker
    Acked-by: Mark Fasheh
    CC: ocfs2-devel@oss.oracle.com
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Use new syncing helpers in .write and .aio_write functions. Also
    remove superfluous syncing in ntfs_file_buffered_write() and update
    comments about generic_osync_inode().

    CC: Anton Altaparmakov
    CC: linux-ntfs-dev@lists.sourceforge.net
    Signed-off-by: Jan Kara

    Jan Kara
     
  • The syncing is now properly handled by generic_file_aio_write() so
    no special ext4 code is needed.

    CC: linux-ext4@vger.kernel.org
    CC: tytso@mit.edu
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Syncing is now properly done by generic_file_aio_write() so no special logic is
    needed in ext3.

    CC: linux-ext4@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     
  • We rely on generic_write_sync() now.

    CC: linux-ext4@vger.kernel.org
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Introduce new function for generic inode syncing (vfs_fsync_range) and use
    it from fsync() path. Introduce also new helper for syncing after a sync
    write (generic_write_sync) using the generic function.

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.
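
    Seen from userspace, the two paths that end up in the same syncing code
    look roughly as follows. This is a minimal sketch, assuming a Unix system
    where Python's `os.O_SYNC` is available; the function names and the
    temporary files are illustrative only and say nothing about the kernel
    internals.

```python
import os
import tempfile


def write_with_o_sync(path, data):
    """Write through a descriptor opened with O_SYNC (synchronous writes)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o600)
    try:
        # With O_SYNC the kernel syncs the written data before write() returns.
        return os.write(fd, data)
    finally:
        os.close(fd)


def write_then_fsync(path, data):
    """Do a normal buffered write and then request syncing with fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = os.write(fd, data)
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


def demo():
    """Write the same payload both ways and read the files back."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "o_sync.dat")
        second = os.path.join(tmp, "fsync.dat")
        write_with_o_sync(first, b"hello")
        write_then_fsync(second, b"hello")
        with open(first, "rb") as fa, open(second, "rb") as fb:
            return fa.read(), fb.read()
```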

    CC: Evgeniy Polyakov
    CC: ocfs2-devel@oss.oracle.com
    CC: Joel Becker
    CC: Felix Blyakher
    CC: xfs@oss.sgi.com
    CC: Anton Altaparmakov
    CC: linux-ntfs-dev@lists.sourceforge.net
    CC: OGAWA Hirofumi
    CC: linux-ext4@vger.kernel.org
    CC: tytso@mit.edu
    Acked-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     
  • generic_file_aio_write_nolock() is now used only by block devices and raw
    character device. Filesystems should use __generic_file_aio_write() in case
    generic_file_aio_write() doesn't suit them. So rename the function to
    blkdev_aio_write() and move it to fs/block_dev.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Use the new helper. We have to submit data pages ourselves in case of O_SYNC
    write because __generic_file_aio_write does not do it for us. OCFS2 developers
    might think about moving the sync out of i_mutex which seems to be easily
    possible but that's out of scope of this patch.

    CC: ocfs2-devel@oss.oracle.com
    Acked-by: Joel Becker
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Some people asked me questions like the following:

    On Wed, 15 Jul 2009 13:11:21 +0200, Leon Woestenberg wrote:
    > just wondering, any reasons why NILFS2 is one of the miscellaneous
    > filesystems and, for example, btrfs, is not in Kconfig?

    Actually, nilfs is NOT a filesystem that came from other operating systems,
    but a filesystem created purely for Linux. Nor is it a flash
    filesystem; it is one for generic block devices.

    So, this moves nilfs outside the misc category as I responded in LKML
    "Re: Why does NILFS2 hide under Miscellaneous filesystems?"
    (Message-Id: ).

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • The nilfs_bmap_lookup() is now a wrapper function of
    nilfs_bmap_lookup_at_level().

    This moves the nilfs_bmap_lookup() to a header file converting it to
    an inline function and gives an opportunity for optimization.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • The current btree code is written so that btree functions call dat
    operations via wrapper functions in bmap.c when they allocate, free,
    or modify virtual block addresses.

    This abstraction requires additional function calls and causes
    frequent calls to the nilfs_bmap_get_dat() function since it is used in
    every wrapper function.

    This removes the wrapper functions and makes them available from
    btree.c and direct.c, which will increase the opportunity of
    compiler optimization.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This is a preparation for the successive cleanup ("nilfs2: allow btree
    to directly call dat operations").

    This adds functions bundling a few operations to change an entry of
    virtual block address on the dat file.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This gets rid of NILFS_CPFILE_GFP, NILFS_SUFILE_GFP, NILFS_DAT_GFP,
    and NILFS_IFILE_GFP. All of these constants refer to NILFS_MDT_GFP,
    and can be removed.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • The btree path object is cleared just before it is freed.

    This will remove the code doing the unnecessary clear operation.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • Even though many btree functions take a btree object as their first
    argument, most of them do not use it.

    This habitual btree argument hurts code readability and
    may lead to inefficient code generation.

    So, this removes the unnecessary btree arguments.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • These functions are not called from anywhere.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This is a re-revised patch to shorten the freeze period.
    This version includes a fix for the bug Konishi-san mentioned last time.

    When GC is running, it moves live blocks to different segments.
    Copying live blocks into memory is done in a transaction,
    however it does not need to be done in the transaction.
    This patch moves nilfs_ioctl_move_blocks() out of the
    transaction lock and puts it before the transaction.

    I ran sysbench fileio test against nilfs partition.
    I copied some DVD/CD images and created snapshot to create live blocks
    before starting the benchmark.

    The following is a summary of per-request statistics (min, max and avg)
    for rc8 and rc8 w/ the patch. I ran each test three times and
    below is the average of those numbers.

    According to this benchmark result, average time is slightly degraded.
    However, the worst-case (max) result is significantly improved.
    This can address a write freeze of a few seconds.

    - random write per-request performance of rc8
    min 0.843ms
    max 680.406ms
    avg 3.050ms
    - random write per-request performance of rc8 w/ this patch
    min 0.843ms -> 100.00%
    max 380.490ms -> 55.90%
    avg 3.233ms -> 106.00%

    - sequential write per-request performance of rc8
    min 0.736ms
    max 774.343ms
    avg 2.883ms
    - sequential write per-request performance of rc8 w/ this patch
    min 0.720ms -> 97.80%
    max 644.280ms -> 83.20%
    avg 3.130ms -> 108.50%
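
    The percentages above read as the patched latency relative to the rc8
    baseline. A small sketch of that arithmetic, using the figures from the
    tables; the helper names are made up for illustration, and the exact
    rounding used in the original report is not known:

```python
def percent_of_baseline(baseline_ms, patched_ms):
    """Return the patched latency as a percentage of the baseline latency."""
    return 100.0 * patched_ms / baseline_ms


# (baseline, patched) pairs in milliseconds, copied from the tables above.
RANDOM_WRITE = {
    "min": (0.843, 0.843),
    "max": (680.406, 380.490),
    "avg": (3.050, 3.233),
}
SEQUENTIAL_WRITE = {
    "min": (0.736, 0.720),
    "max": (774.343, 644.280),
    "avg": (2.883, 3.130),
}


def summarize(table):
    """Compute the percentage for every statistic in one result table."""
    return {
        name: percent_of_baseline(baseline, patched)
        for name, (baseline, patched) in table.items()
    }
```

    Values below 100% mean the patched kernel was faster for that statistic;
    values above 100% mean it was slower.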

    Signed-off-by: Ryusuke Konishi

    Jiro SEKIBA
     
  • nilfs2: Add more safeguard routines and protections in mount process,
    which also makes nilfs2 report consistency error messages when
    checkpoint number is invalid.

    Signed-off-by: Zhu Yanhai
    Signed-off-by: Ryusuke Konishi

    Zhu Yanhai
     
  • nilfs2: In procedure 'nilfs_get_sb()', when a nilfs filesystem is
    mounted for the first time, local variable 'nilfs->ns_last_cno' is
    used before loading the latest checkpoint number from disk (in
    'nilfs_fill_super'). 'nilfs->ns_last_cno' is assigned to 'sd.cno', but
    'sd.cno' has never been used in the procedure.

    Signed-off-by: Zhang Qiang
    Signed-off-by: Ryusuke Konishi

    Zhang Qiang
     
  • Alberto Bertogli advised me about bio_alloc() use in nilfs:
    On Sat, 13 Jun 2009 22:52:40 -0300, Alberto Bertogli wrote:
    > By the way, those bio_alloc()s are using GFP_NOWAIT but it looks
    > like they could use at least GFP_NOIO or GFP_NOFS, since the caller
    > can (and sometimes do) sleep. The only caller is nilfs_submit_bh(),
    > which calls nilfs_submit_seg_bio() which can sleep calling
    > wait_for_completion().

    This follows the advice and replaces the use of the GFP_NOWAIT flag with
    GFP_NOIO.

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi
     
  • This removes nilfs_write_super and commits the super block in the nilfs
    internal thread, instead of in the periodic write_super callback.

    The VFS layer calls the ->write_super callback periodically. However,
    it looks like the callback is omitted when disk I/O is busy.
    And when cleanerd (nilfs GC) is running, disk I/O tends to be busy, thus
    the nilfs superblock is not synchronized as nilfs intends.

    To avoid this, sync the superblock from the nilfs thread instead of pdflush.

    Signed-off-by: Jiro SEKIBA
    Signed-off-by: Ryusuke Konishi

    Jiro SEKIBA
     
  • Separate the conditions that check whether syncing the super block and the
    alternative super block is required into inline functions so that the
    conditions can be reused.

    Signed-off-by: Jiro SEKIBA
    Signed-off-by: Ryusuke Konishi

    Jiro SEKIBA
     
  • This fixes the misordering of nilfs_write_super in nilfs_sync_fs. Committing
    the super block must come at the end of the function so that every change
    is reflected.

    ->sync_fs() is not called frequently so this makes nilfs_sync_fs call
    nilfs_commit_super instead of nilfs_write_super.

    Signed-off-by: Jiro SEKIBA
    Signed-off-by: Ryusuke Konishi

    Jiro SEKIBA
     
  • This removes redundant super block commit.

    nilfs_write_super will call nilfs_commit_super to store super block
    into block device. However, nilfs_put_super will call
    nilfs_commit_super right after calling nilfs_write_super. So calling
    nilfs_write_super in nilfs_put_super would be redundant.

    Signed-off-by: Jiro SEKIBA
    Signed-off-by: Ryusuke Konishi

    Jiro SEKIBA
     
  • This is a patch to display mount options in procfs.
    Mount options will show up in /proc/mounts as they do for other filesystems.

    ...
    /dev/sda6 /mnt nilfs2 ro,relatime,barrier=off,cp=3,order=strict 0 0
    ...
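
    A line like the one above can be picked apart as follows. This is a
    minimal sketch, assuming the usual six whitespace-separated fields of
    /proc/mounts; `parse_mounts_line` is a name chosen here for illustration
    and not part of any tool.

```python
def parse_mounts_line(line):
    """Split one /proc/mounts line into its fields and an options dict."""
    device, mount_point, fs_type, options, dump, passno = line.split()
    parsed = {}
    for opt in options.split(","):
        key, sep, value = opt.partition("=")
        # Flags such as "ro" have no value; store True for them.
        parsed[key] = value if sep else True
    return {
        "device": device,
        "mount_point": mount_point,
        "fs_type": fs_type,
        "options": parsed,
    }


# The sample line from the commit message above.
SAMPLE = "/dev/sda6 /mnt nilfs2 ro,relatime,barrier=off,cp=3,order=strict 0 0"
```

    For the sample line, the nilfs2-specific entries end up under the keys
    "barrier", "cp" and "order" of the options dict.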

    Signed-off-by: Jiro SEKIBA
    Signed-off-by: Ryusuke Konishi

    Jiro SEKIBA
     
  • The current metadata file code skips disk address lookup for its data
    block if the buffer has a mapped flag.

    This risks a read request being performed against a stale block
    address that GC moved, which may lead to metadata
    corruption. The mapped flag is safe if the buffer has an
    uptodate flag, otherwise it may prevent necessary update of disk
    address in the next read.

    This will avoid the potential problem by ensuring disk address lookup
    before reading metadata block even for buffers with the mapped flag.
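
    The rule can be sketched with a toy model. Everything here is
    hypothetical (`ToyBuffer`, `read_metadata_block` and the lookup callback
    are illustration names, not nilfs2 code); it only shows the decision of
    when the disk address is looked up again.

```python
class ToyBuffer:
    """Hypothetical buffer with the two flags discussed above."""

    def __init__(self, blocknr=None, mapped=False, uptodate=False):
        self.blocknr = blocknr    # disk address the buffer is mapped to
        self.mapped = mapped      # an address has been assigned earlier
        self.uptodate = uptodate  # buffer contents are valid


def read_metadata_block(buf, lookup_disk_address):
    """Return the disk address that would be read, or None if no read is needed."""
    if buf.uptodate:
        # Contents are already valid, so no read request is issued.
        return None
    # Always resolve the address before reading, even if buf.mapped is set,
    # because a previously mapped address may be stale.
    buf.blocknr = lookup_disk_address()
    buf.mapped = True
    return buf.blocknr
```

    In this model a mapped but not uptodate buffer gets a fresh address from
    the lookup, while an uptodate buffer is left untouched.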

    Signed-off-by: Ryusuke Konishi

    Ryusuke Konishi