17 Dec, 2010

10 commits

  • Open-code the xfs_iomap code in its two callers. The overlap of
    passed flags was already minimal and will be further reduced in the
    next patch.

    As a side effect the BMAPI_* flags for xfs_bmapi and the IO_* flags
    for I/O end processing are merged into a single set of flags, which
    should be a bit more descriptive of the operation we perform.

    Also improve the tracing by giving each caller its own set of
    tracepoints.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Stop passing the BMAPI_* flags to these helpers: in
    xfs_iomap_write_direct the BMAPI_DIRECT check was always true, and
    in the xfs_iomap_write_delay path it was never checked at all.
    Also remove the nmap return value, as we never make use of it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Don't trylock the buffer. We are the only one ever locking it for a
    regular file address space, and the trylock was only copied from the
    generic code, which did it because of the old buffer-based writeout
    in jbd. Also make sure to only write out the buffer if the iomap
    actually is valid, because we wouldn't have a proper mapping
    otherwise. In practice we will never get an invalid mapping here, as
    the page lock guarantees truncate doesn't race with us, but better
    safe than sorry. Also make sure we allocate a new ioend when
    crossing boundaries between mappings, just like we do for delalloc
    and unwritten extents. Again, this currently doesn't matter, as the
    I/O end handler only cares about boundaries for unwritten extents,
    but it makes the code fully correct and the same as for
    delalloc/unwritten extents.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • We'll never have BIO_EOPNOTSUPP set after calling submit_bio, as
    this can only happen for discards, and used to happen for barriers,
    neither of which is ever submitted by xfs_submit_ioend_bio. Also
    remove the loop around bio_alloc, as it will never fail due to its
    mempool backing (see the sketch below).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
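
    A hedged sketch of the resulting submission path; the ioend plumbing
    and bvec setup are elided, and this is illustrative rather than the
    exact function body:

        static void submit_write_bio(struct buffer_head *bh, int nvecs)
        {
                /* mempool-backed, so with GFP_NOIO this never returns
                 * NULL -- no retry loop required */
                struct bio *bio = bio_alloc(GFP_NOIO, nvecs);

                bio->bi_bdev = bh->b_bdev;
                bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);

                /* a plain WRITE can't come back with BIO_EOPNOTSUPP --
                 * that was only ever set for discards and, historically,
                 * barriers -- so no check after submission either */
                submit_bio(WRITE, bio);
        }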
     
  • Currently we only refuse a "read-only" mapping for writing out
    unwritten and delayed buffers, and refuse any other for overwrites.
    Improve the checks to require delalloc mappings for delayed buffers,
    and unwritten extent mappings for unwritten extents.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Dispatch to a different helper for phase1 vs phase2 in
    xlog_recover_commit_trans instead of doing it in all the
    low-level functions.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Merge the call to xlog_recover_reorder_trans and the loop over the
    recovery items from xlog_recover_do_trans into
    xlog_recover_commit_trans, and keep the switch statement over the
    log item types as a separate helper (the combined result is sketched
    below).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
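
    A rough sketch of the shape the two patches above give the recovery
    code; the pass1/pass2 helper names follow the commit text, but take
    the details as assumptions:

        static int
        xlog_recover_commit_trans(struct log *log, struct xlog_recover *trans,
                                  int pass)
        {
                struct xlog_recover_item *item;
                int error;

                error = xlog_recover_reorder_trans(log, trans, pass);
                if (error)
                        return error;

                list_for_each_entry(item, &trans->r_itemq, ri_list) {
                        /* dispatch once here instead of in every
                         * low-level helper */
                        if (pass == XLOG_RECOVER_PASS1)
                                error = xlog_recover_commit_pass1(log, trans, item);
                        else
                                error = xlog_recover_commit_pass2(log, trans, item);
                        if (error)
                                return error;
                }
                return 0;
        }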
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • XFS used to support different types of buffer log items a long time
    ago. Remove the switch statements checking the log item type in the
    various buffer recovery helpers that were left over from those days,
    as well as the rather useless xlog_recover_do_buffer_pass2 wrapper.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • We now support mounting and using filesystems with 64-bit inodes
    even when not mounted with the inode64 option (which now only
    controls if we allocate new inodes in that space or not). Make sure
    we always use large NFS file handles when exporting a filesystem
    that may contain 64-bit inodes. Note that this only affects newly
    generated file handles; any outstanding 32-bit file handles are
    still accepted.

    [hch: the comment and commit log are mine, the rest is from a patch
    snippet from Samuel]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Samuel Kvasnica
     

10 Dec, 2010

1 commit

  • Now that we no longer mark VFS inodes dirty for internal timestamp
    changes, but rely on the transaction subsystem to push them out, we
    need to explicitly log the source inode in rename after updating its
    timestamps to make sure the changes actually get forced out by
    sync/fsync or an AIL push.

    We already account for the fourth inode in the log reservation, as a
    rename of directories needs to update the nlink field, so just
    adding the xfs_trans_log_inode call is enough (see the sketch
    below).

    This fixes the xfsqa 065 regression introduced by:

    "xfs: don't use vfs writeback for pure metadata modifications"

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
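
    The remedy is essentially one added call in xfs_rename() right after
    the timestamp update (helper names as in XFS of this era; treat the
    exact placement as a sketch):

        /* bump the source inode's change time ... */
        xfs_trans_ichgtime(tp, src_ip, XFS_ICHGTIME_CHG);
        /* ... and log its core so sync/fsync or an AIL push writes it */
        xfs_trans_log_inode(tp, src_ip, XFS_ILOG_CORE);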
     

01 Dec, 2010

5 commits

  • Recent tests writing lots of small files showed the flusher thread
    being CPU bound and taking a long time to do allocations on a debug
    kernel. perf showed this as the prime reason:

      samples  pcnt  function                 DSO
      _______ _____  _______________________  _________________

    224648.00 36.8%  xfs_error_test           [kernel.kallsyms]
     86045.00 14.1%  xfs_btree_check_sblock   [kernel.kallsyms]
     39778.00  6.5%  prandom32                [kernel.kallsyms]
     37436.00  6.1%  xfs_btree_increment      [kernel.kallsyms]
     29278.00  4.8%  xfs_btree_get_rec        [kernel.kallsyms]
     27717.00  4.5%  random32                 [kernel.kallsyms]

    Walking the btree blocks to check them during allocation requires
    each block (a cache hit, so no I/O) to call xfs_error_test(), which
    then does a random32() call as its first operation. IOWs, ~50% of
    the CPU is being consumed just testing whether we need to inject an
    error, even though error injection is not active.

    Kill this overhead when error injection is not active by adding a
    global counter of active error traps and only calling into
    xfs_error_test when fault injection is active (see the sketch
    below).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
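
    A minimal sketch of the guard, with assumed symbol names: a global
    count of armed traps short-circuits the macro, so the random32()
    draw never happens while error injection is inactive:

        extern int xfs_error_test_active;      /* bumped when a trap arms */

        #define XFS_TEST_ERROR(expr, mp, tag, rf)                        \
                ((expr) || (xfs_error_test_active &&                     \
                            xfs_error_test((tag), (mp)->m_fixedfsid,     \
                                           "expr", __LINE__, __FILE__,   \
                                           (rf))))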
     
  • When an inode has been marked stale because the cluster is being
    freed, we don't want to (re-)insert this inode into the AIL. There
    is a race condition where the cluster buffer may be unpinned before
    the inode is inserted into the AIL during transaction committed
    processing. If the buffer is unpinned before the inode item has been
    committed and inserted, then it is possible for the buffer to be
    released, and hence the stale inode callbacks to be processed,
    before the inode is inserted into the AIL.

    In this case, we then insert a clean, stale inode into the AIL which
    will never get removed by an IO completion. It will, however, get
    reclaimed and that triggers an assert in xfs_inode_free()
    complaining about freeing an inode still in the AIL.

    This race can be avoided by not moving stale inodes forward in the AIL
    during transaction commit completion processing. This closes the
    race condition by ensuring we never insert clean stale inodes into
    the AIL. It is safe to do this because a dirty stale inode, by
    definition, must already be in the AIL.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
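
    Sketch of the avoidance in the inode item's transaction-committed
    callback (flag and callback names follow the XFS inode item code,
    but this is illustrative, not the exact diff):

        /* in xfs_inode_item_committed() or its equivalent: */
        if (xfs_iflags_test(ip, XFS_ISTALE)) {
                /* a dirty stale inode is already in the AIL by
                 * definition; a clean one must never be inserted,
                 * so don't move the item forward at all */
                return lsn;
        }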
     
  • There is an assumption in parts of XFS that flushing a dirty file
    will make all the delayed allocation blocks disappear from an inode.
    That is, after calling xfs_flush_pages(), ip->i_delayed_blks will be
    zero.

    This is an invalid assumption, as we may have speculative
    preallocation beyond EOF, and those blocks are recorded in
    ip->i_delayed_blks. A flush of the dirty pages of an inode will not
    change the state of these blocks beyond EOF, so a non-zero delalloc
    block count after a flush is valid.

    The bmap code has an invalid ASSERT() that needs to be removed, and
    the swapext code has a bug: while it swaps the data forks around, it
    fails to swap the i_delayed_blks counter associated with the fork
    and hence can get the block accounting wrong (see the sketch below).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
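
    The swapext half of the fix is a plain swap of the counter alongside
    the fork swap; a sketch (the fork-swap details and the exact field
    type are elided):

        /* xfs_swap_extents(): the data forks of ip and tip have just
         * been exchanged, so exchange the delalloc accounting too */
        tmp                 = ip->i_delayed_blks;
        ip->i_delayed_blks  = tip->i_delayed_blks;
        tip->i_delayed_blks = tmp;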
     
  • As reported by Nick Piggin, XFS is suffering from long pauses under
    highly concurrent workloads when hosted on ramdisks. The problem is
    that an inode buffer is stuck in the pinned state in memory and as a
    result either the inode buffer or one of the inodes within the
    buffer is stopping the tail of the log from being moved forward.

    The system remains in this state until a periodic log force issued
    by xfssyncd causes the buffer to be unpinned. The main problem is
    that these are stale buffers, and are hence held locked until the
    transaction/checkpoint that marked them stale has been committed to
    disk. When the filesystem gets into this state, only xfssyncd can
    cause the async transactions to be committed to disk and hence unpin
    the inode buffer.

    This problem was encountered when scaling the busy extent list, but
    only the blocking lock interface was fixed to solve the problem.
    Extend the same fix to the buffer trylock operations: if we fail to
    lock a pinned, stale buffer, force the log immediately so that by
    the time the next attempt to lock it comes around, it will have been
    unpinned (see the sketch below).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
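
    Sketch of the extended trylock path (names are close to the XFS
    buffer API of the time, but treat them as assumptions):

        static int xfs_buf_cond_lock(struct xfs_buf *bp)
        {
                if (down_trylock(&bp->b_sema) == 0)
                        return 0;       /* got the lock */

                /* failed on a pinned, stale buffer: force the log
                 * asynchronously so the buffer is unpinned before the
                 * next lock attempt */
                if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
                        xfs_log_force(bp->b_target->bt_mount, 0);
                return -EBUSY;
        }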
     
  • Since the move to the new truncate sequence we call xfs_setattr to
    truncate down excessively instantiated blocks. As shown by the
    testcase in kernel.org BZ #22452, that doesn't work too well. Due to
    confusion between the internal inode size and the VFS inode i_size,
    it zeroes data that it shouldn't.

    But a full-blown truncate seems like overkill here. We only
    instantiate delayed allocations in the write path, and given that we
    never released the iolock we can't have converted them to real
    allocations yet either.

    The only nasty case is pre-existing preallocation which we need to skip.
    We already do this for page discard during writeback, so make the delayed
    allocation block punching a generic function and call it from the failed
    write path as well as xfs_aops_discard_page. The callers are
    responsible for ensuring that partial blocks are not truncated away
    and that they hold the ilock (see the sketch below).

    Based on a fix originally from Christoph Hellwig. This version used
    filesystem blocks as the range unit.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
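
    The generic helper this series produced is
    xfs_bmap_punch_delalloc_range(); a sketch of the caller contract
    described above (the byte-to-block rounding macros are XFS's usual
    ones, but the surrounding code is assumed):

        /* caller holds the ilock; round inward so no partial fs block
         * is truncated away */
        start_fsb = XFS_B_TO_FSB(mp, start_byte);       /* round up */
        end_fsb = XFS_B_TO_FSBT(mp, end_byte);          /* round down */
        if (end_fsb > start_fsb)
                xfs_bmap_punch_delalloc_range(ip, start_fsb,
                                              end_fsb - start_fsb);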
     

11 Nov, 2010

8 commits

  • In commit 20cb52ebd1b5ca6fa8a5d9b6b1392292f5ca8a45, titled
    "xfs: simplify xfs_vm_writepage", I added an assert that any !mapped
    and uptodate buffers are not dirty. That assert turns out to trigger
    a lot when running fsx on filesystems with small block sizes. The
    reason is that the assert is simply incorrect: !mapped and uptodate
    just mean this buffer covers a hole, and whenever we do a
    set_page_dirty we mark all blocks in the page dirty, no matter
    whether they have data or not. So remove the assert, and update the
    comment above the condition to match reality.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • XFS does not need its inodes to actually be hashed in the VFS inode
    cache, but we require the inode to be marked hashed for the
    writeback code to work.

    Instead of using insert_inode_hash, which would require a second
    inode_lock round trip after the partial merge of the inode
    scalability patches in 2.6.37-rc, simply use the new hlist_add_fake
    helper to mark it hashed without requiring a lock or touching a
    global cache line (see the sketch below).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
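
    The helper itself is a two-line pointer trick: it makes
    hlist_unhashed() report the node as hashed without touching any
    shared hash bucket or lock. A self-contained sketch of the
    mechanism, mirroring the list.h helpers:

        struct hlist_node {
                struct hlist_node *next, **pprev;
        };

        /* point pprev at our own next field: the node is on no list,
         * yet no longer looks unhashed */
        static inline void hlist_add_fake(struct hlist_node *n)
        {
                n->pprev = &n->next;
        }

        static inline int hlist_unhashed(const struct hlist_node *h)
        {
                return !h->pprev;       /* false after hlist_add_fake() */
        }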
     
  • Andi Kleen reported that gcc-4.5 gives lots of warnings for him
    inside the XFS code. It turned out most of them are due to the
    quota stubs being macros, and gcc now complains about macros
    evaluating to 0 whose result is not assigned to a variable (see the
    sketch below).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
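
    The usual cure, and the shape of this patch, is to turn each
    0-valued macro stub into a typed static inline; the stub name below
    is illustrative rather than a specific one from the patch:

        /* before (warns with gcc-4.5 when the result is unused):
         *      #define xfs_qm_sync(mp, flags)  (0)
         * after: a real expression with a type, so an ignored result
         * no longer triggers the warning */
        static inline int xfs_qm_sync(struct xfs_mount *mp, int flags)
        {
                return 0;
        }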
     
  • The filestreams code may take the iolock on the parent inode while
    holding it on a child. This is the only place in XFS where we take
    both the child and parent iolock, so just telling lockdep about it
    is enough. The lock flag required for this was already added as
    part of the ilock lockdep annotations and has been unused so far
    (see the sketch below).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
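
    The annotation amounts to taking the parent's iolock in its own
    lockdep subclass, roughly as follows (the flag name comes from XFS's
    ilock annotations; the exact call site is assumed):

        /* child iolock already held; the PARENT subclass tells lockdep
         * this is an ordered parent->child pair, not a recursion */
        xfs_ilock(pip, XFS_IOLOCK_EXCL | XFS_IOLOCK_PARENT);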
     
  • The delayed write buffer split trace currently issues a trace for
    every buffer it scans, but these buffers are not necessarily queued
    for delayed write. Indeed, when buffers are pinned, there can be
    thousands of traces of buffers that aren't actually queued for
    delayed write, and the ones that are queued get lost in the noise.
    Move the tracepoint to record only buffers that are split out to
    have I/O issued on them.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • The AG walk fails to decrement the per-ag reference count when the
    non-blocking variant fails to obtain the per-ag reclaim lock,
    leading to an assert failure on debug kernels when unmounting a
    filesystem (fix sketched below).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
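
    A sketch of the one-line fix in the non-blocking walk (lock and
    helper names follow the per-ag reclaim code, but take them as
    assumptions):

        if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
                xfs_perag_put(pag);     /* the previously missing put */
                continue;               /* skip this AG without leaking
                                           the reference */
        }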
     
  • al_hreq is copied from userland. If al_hreq.buflen is not properly
    aligned, then xfs_attr_list will ignore the last bytes of kbuf.
    These bytes are uninitialized, which leads to leaking the contents
    of kernel stack memory (see the sketch below).

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: Alex Elder

    Kulikov Vasiliy
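
    One hedged way to plug the leak is to zero the kernel buffer up
    front so any tail bytes xfs_attr_list never writes cannot reach
    userspace (a sketch, not necessarily the exact merged patch):

        kbuf = kzalloc(al_hreq.buflen, GFP_KERNEL);
        if (!kbuf)
                return -ENOMEM;
        /* the unwritten tail is now zeroes, not stale kernel memory */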
     
  • We promised to do this for 2.6.37, and the code looks stable enough to
    keep that promise.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

29 Oct, 2010

1 commit


28 Oct, 2010

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits)
    quota: Fix possible oops in __dquot_initialize()
    ext3: Update kernel-doc comments
    jbd/2: fixed typos
    ext2: fixed typo.
    ext3: Fix debug messages in ext3_group_extend()
    jbd: Convert atomic_inc() to get_bh()
    ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate()
    jbd: Fix debug message in do_get_write_access()
    jbd: Check return value of __getblk()
    ext3: Use DIV_ROUND_UP() on group desc block counting
    ext3: Return proper error code on ext3_fill_super()
    ext3: Remove unnecessary casts on bh->b_data
    ext3: Cleanup ext3_setup_super()
    quota: Fix issuing of warnings from dquot_transfer
    quota: fix dquot_disable vs dquot_transfer race v2
    jbd: Convert bitops to buffer fns
    ext3/jbd: Avoid WARN() messages when failing to write the superblock
    jbd: Use offset_in_page() instead of manual calculation
    jbd: Remove unnecessary goto statement
    jbd: Use printk_ratelimited() in journal_alloc_journal_head()
    ...

    Linus Torvalds
     

27 Oct, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • This removes more dead code that was somehow missed by commit
    0d99519efef (writeback: remove unused nonblocking and congestion
    checks). There is no behavior change except for the removal of two
    entries from one of the ext4 tracing interfaces.

    The nonblocking checks in ->writepages are no longer used because
    the flusher now prefers to block on get_request_wait() rather than
    skip inodes on I/O congestion; the latter would lead to more seeky
    I/O.

    The nonblocking checks in ->writepage are no longer used because it's
    redundant with the WB_SYNC_NONE check.

    We no longer set ->nonblocking in VM page-out and page migration,
    because a) it is effectively redundant with WB_SYNC_NONE in the
    current code, and b) its old semantic of "don't get stuck on request
    queues" is a misbehavior: it would skip some dirty inodes on
    congestion and page out others, which is unfair in terms of LRU age.

    Inspired by Christoph Hellwig. Thanks!

    Signed-off-by: Wu Fengguang
    Cc: Theodore Ts'o
    Cc: David Howells
    Cc: Sage Weil
    Cc: Steve French
    Cc: Chris Mason
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

26 Oct, 2010

4 commits

  • Instead of always assigning an increasing inode number in new_inode,
    move the call to assign one into those callers that actually need
    it. For now the set of callers that need it is estimated
    conservatively; that is, the call is added to all filesystems that
    do not assign an i_ino by themselves. For a few more filesystems we
    can avoid assigning any inode number given that they aren't user
    visible, and for others it could be done lazily when an inode number
    is actually needed, but that's left for later patches (the new
    caller-side pattern is sketched below).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Christoph Hellwig
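
    After this change, the caller-side pattern for a filesystem that
    wants the default numbering looks roughly like this (get_next_ino()
    is the per-cpu allocator introduced alongside this series):

        struct inode *inode = new_inode(sb);

        if (!inode)
                return -ENOMEM;
        /* new_inode() no longer assigns i_ino; opt in explicitly */
        inode->i_ino = get_next_ino();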
     
  • Clones an existing reference to an inode; the caller must already
    hold one (see the sketch below).

    Signed-off-by: Al Viro

    Al Viro
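
    The helper is tiny; it essentially bumps i_count while asserting
    that the caller really did hold a reference already:

        void ihold(struct inode *inode)
        {
                WARN_ON(atomic_inc_return(&inode->i_count) < 2);
        }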
     
  • Split up inode_add_to_list/__inode_add_to_list. Locking for the two
    lists will be split soon, so these helpers really don't buy us much
    anymore.

    The __ prefixes for the sb list helpers will go away soon, but until
    inode_lock is gone we'll need them to distinguish between the locked
    and unlocked variants.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • __block_write_begin and block_prepare_write are identical except for
    slightly different calling conventions. Convert all callers to the
    __block_write_begin calling convention and drop block_prepare_write
    (the mechanical conversion is sketched below).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
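
    The conversion is mechanical: block_prepare_write() took
    page-relative (from, to) offsets, while __block_write_begin() takes
    (pos, len), so each call site changes shape like this (sketch):

        /* before:
         *      err = block_prepare_write(page, from, to, get_block);
         * after -- same start, a length instead of an end offset: */
        err = __block_write_begin(page, from, to - from, get_block);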
     

23 Oct, 2010

2 commits

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (36 commits)
    xfs: semaphore cleanup
    xfs: Extend project quotas to support 32bit project ids
    xfs: remove xfs_buf wrappers
    xfs: remove xfs_cred.h
    xfs: remove xfs_globals.h
    xfs: remove xfs_version.h
    xfs: remove xfs_refcache.h
    xfs: fix the xfs_trans_committed
    xfs: remove unused t_callback field in struct xfs_trans
    xfs: fix bogus m_maxagi check in xfs_iget
    xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters
    xfs: do not use xfs_mod_incore_sb for per-cpu counters
    xfs: remove XFS_MOUNT_NO_PERCPU_SB
    xfs: pack xfs_buf structure more tightly
    xfs: convert buffer cache hash to rbtree
    xfs: serialise inode reclaim within an AG
    xfs: batch inode reclaim lookup
    xfs: implement batched inode lookups for AG walking
    xfs: split out inode walk inode grabbing
    xfs: split inode AG walking into separate code for reclaim
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: remove in_workqueue_context()
    workqueue: Clarify that schedule_on_each_cpu is synchronous
    memory_hotplug: drop spurious calls to flush_scheduled_work()
    shpchp: update workqueue usage
    pciehp: update workqueue usage
    isdn/eicon: don't call flush_scheduled_work() from diva_os_remove_soft_isr()
    workqueue: add and use WQ_MEM_RECLAIM flag
    workqueue: fix HIGHPRI handling in keep_working()
    workqueue: add queue_work and activate_work trace points
    workqueue: prepare for more tracepoints
    workqueue: implement flush[_delayed]_work_sync()
    workqueue: factor out start_flush_work()
    workqueue: cleanup flush/cancel functions
    workqueue: implement alloc_ordered_workqueue()

    Fix up trivial conflict in fs/gfs2/main.c as per Tejun

    Linus Torvalds
     

19 Oct, 2010

6 commits