27 Feb, 2010

2 commits

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (52 commits)
    fs/xfs: Correct NULL test
    xfs: optimize log flushing in xfs_fsync
    xfs: only clear the suid bit once in xfs_write
    xfs: kill xfs_bawrite
    xfs: log changed inodes instead of writing them synchronously
    xfs: remove invalid barrier optimization from xfs_fsync
    xfs: kill the unused XFS_QMOPT_* flush flags V2
    xfs: Use delay write promotion for dquot flushing
    xfs: Sort delayed write buffers before dispatch
    xfs: Don't issue buffer IO direct from AIL push V2
    xfs: Use delayed write for inodes rather than async V2
    xfs: Make inode reclaim states explicit
    xfs: more reserved blocks fixups
    xfs: turn off sign warnings
    xfs: don't hold onto reserved blocks on remount,ro
    xfs: quota limit statvfs available blocks
    xfs: replace KM_LARGE with explicit vmalloc use
    xfs: cleanup up xfs_log_force calling conventions
    xfs: kill XLOG_VEC_SET_TYPE
    xfs: remove duplicate buffer flags
    ...

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt:
    xfs: fix xfs to work with Virtually Indexed architectures
    sh: add mm API for DMA to vmalloc/vmap areas
    arm: add mm API for DMA to vmalloc/vmap areas
    parisc: add mm API for DMA to vmalloc/vmap areas
    mm: add coherence API for DMA to vmalloc/vmap areas

    Linus Torvalds
     

13 Feb, 2010

1 commit

  • file_remove_suid already calls into ->setattr to clear the suid and
    sgid bits if needed, no need to start a second transaction to do it
    ourselves.

    Note that xfs_write_clear_setuid issues a sync transaction while the
    path through ->setattr doesn't, but that is consistent with the
    other filesystems.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

09 Feb, 2010

2 commits

  • When an inode has already been flushed delayed write,
    xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
    return on a synchronous inode write without having written the
    inode. Currently these synchronous writes only come from sync(1),
    unmount, a synchronous NFS export and cachefiles, so they should be
    relatively rare and out of common performance paths.

    Realistically, a synchronous inode write is not necessary here; we
    can avoid writing the inode by logging any non-transactional changes
    that are pending. This needs to be done with synchronous
    transactions, but it avoids seeking between the log and inode
    clusters as we do now. We don't force the log if the inode is
    pinned, though, so this differs from the fsync case. For normal
    sys_sync and unmount behaviour this is fine because we do a
    synchronous log force in xfs_sync_data which is called from the
    ->sync_fs code.

    It does however break the NFS synchronous export guarantees for now,
    but work is under way to fix this at a higher level or for the
    higher level to provide an additional flag in the writeback control
    to tell us that a log force is needed.

    Portions of this patch are based on work from Dave Chinner.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Alex Elder

    Christoph Hellwig
     
  • This mangles the reserved blocks counts a little more.

    1) add a helper function for the default reserved count
    2) add helper functions to save/restore counts on ro/rw
    3) save/restore reserved blocks on freeze/thaw
    4) disallow changing reserved count while readonly

    V2: changed field name to match Dave's changes

    Signed-off-by: Eric Sandeen
    Signed-off-by: Alex Elder

    Eric Sandeen
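The save/restore helpers above can be sketched in plain C. This is a minimal userspace model of the idea, not the real code: the struct layout, field names, and helper names are illustrative assumptions, not the actual xfs_mount interface.

```c
#include <assert.h>

/* Illustrative stand-in for the mount structure's counters. */
struct mount_sketch {
    long long free_blocks;    /* blocks reported free */
    long long resblks;        /* blocks currently held in reserve */
    long long saved_resblks;  /* reservation stashed across ro periods */
    int readonly;
};

/* remount,ro / freeze: return the reservation to the free pool and
 * remember it, so the on-disk counters stay consistent. */
static void save_resvblks(struct mount_sketch *mp)
{
    mp->saved_resblks = mp->resblks;
    mp->free_blocks += mp->resblks;
    mp->resblks = 0;
    mp->readonly = 1;
}

/* remount,rw / thaw: take the remembered reservation back. */
static void restore_resvblks(struct mount_sketch *mp)
{
    mp->resblks = mp->saved_resblks;
    mp->free_blocks -= mp->resblks;
    mp->readonly = 0;
}

/* Changing the reserved count is disallowed while readonly. */
static int set_resvblks(struct mount_sketch *mp, long long n)
{
    if (mp->readonly)
        return -1;   /* stands in for an -EROFS style error */
    mp->free_blocks += mp->resblks - n;
    mp->resblks = n;
    return 0;
}
```

The invariant the patch cares about is visible here: after a save/restore round trip the counters return to their original values, and no change is possible in between.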
     

06 Feb, 2010

3 commits

  • xfs_buf.c includes what is essentially a hand rolled version of
    blk_rq_map_kern(). In order to work properly with the vmalloc buffers
    that xfs uses, this hand rolled routine must also implement the flushing
    API for vmap/vmalloc areas.

    [style updates from hch@lst.de]
    Acked-by: Christoph Hellwig
    Signed-off-by: James Bottomley

    James Bottomley
     
  • We currently do background inode flush asynchronously, resulting in
    inodes being written in whatever order the background writeback
    issues them. Not only that, there are also blocking and non-blocking
    asynchronous inode flushes, depending on where the flush comes from.

    This patch completely removes asynchronous inode writeback. It
    removes all the strange writeback modes and replaces them with
    either a synchronous flush or a non-blocking delayed write flush.
    That is, inode flushes will only issue IO directly if they are
    synchronous, and background flushing may do nothing if the operation
    would block (e.g. on a pinned inode or buffer lock).

    Delayed write flushes will now result in the inode buffer sitting in
    the delwri queue of the buffer cache to be flushed by either an AIL
    push or by the xfsbufd timing out the buffer. This will allow
    accumulation of dirty inode buffers in memory and allow optimisation
    of inode cluster writeback at the xfsbufd level where we have much
    greater queue depths than the block layer elevators. We will also
    get adjacent inode cluster buffer IO merging for free when a later
    patch in the series allows sorting of the delayed write buffers
    before dispatch.

    This effectively means that any inode that is written back by
    background writeback will be seen as flush locked during AIL
    pushing, and will result in the buffers being pushed from there.
    This writeback path is currently non-optimal, but the next patch
    in the series will fix that problem.

    A side effect of this delayed write mechanism is that background
    inode reclaim will no longer directly flush inodes, nor can it wait
    on the flush lock. The result is that inode reclaim must leave the
    inode in the reclaimable state until it is clean. Hence attempts to
    reclaim a dirty inode in the background will simply skip the inode
    until it is clean and this allows other mechanisms (i.e. xfsbufd) to
    do more optimal writeback of the dirty buffers. As a result, the
    inode reclaim code has been rewritten so that it no longer relies on
    the ambiguous return values of xfs_iflush() to determine whether it
    is safe to reclaim an inode.

    Portions of this patch are derived from patches by Christoph
    Hellwig.

    Version 2:
    - cleanup reclaim code as suggested by Christoph
    - log background reclaim inode flush errors
    - just pass sync flags to xfs_iflush

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • A.K.A.: don't rely on xfs_iflush() return value in reclaim

    We have gradually been moving checks out of the reclaim code because
    they are duplicated in xfs_iflush(). We've had a history of problems
    in this area, and many of them stem from the overloading of the
    return values from xfs_iflush() and interaction with inode flush
    locking to determine if the inode is safe to reclaim.

    With the desire to move to delayed write flushing of inodes and
    non-blocking inode tree reclaim walks, the overloading of the
    return value of xfs_iflush makes it very difficult to determine
    the correct thing to do next.

    This patch explicitly re-adds the checks to the inode reclaim code,
    removing the reliance on the return value of xfs_iflush() to
    determine what to do next. It also means that we can clearly
    document all the inode states that reclaim must handle and hence
    we can easily see that we handled all the necessary cases.

    This also removes the need for the xfs_inode_clean() check in
    xfs_iflush() as all callers now check this first (safely).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
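A toy decision table gives the flavour of making the reclaim states explicit. The state names and the policy shown are assumptions for illustration only, not the actual xfs_reclaim_inode logic:

```c
#include <assert.h>

enum reclaim_action { RECLAIM_NOW, RECLAIM_SKIP, RECLAIM_FLUSH };

/* Illustrative inode state bits; the real code derives these from the
 * inode flags, the flush lock and the log pin count. */
struct inode_state {
    int clean;        /* no dirty metadata pending */
    int flush_locked; /* writeback IO still in progress */
    int pinned;       /* dirty in the in-core log, not yet forced */
};

/* Each state is handled explicitly, instead of inferring safety from
 * the overloaded return value of a flush call. */
static enum reclaim_action reclaim_decide(const struct inode_state *ip,
                                          int sync_mode)
{
    if (ip->flush_locked)      /* under IO: only a sync walk waits */
        return sync_mode ? RECLAIM_FLUSH : RECLAIM_SKIP;
    if (ip->clean)
        return RECLAIM_NOW;    /* safe to reclaim immediately */
    if (ip->pinned && !sync_mode)
        return RECLAIM_SKIP;   /* background: revisit when unpinned */
    return RECLAIM_FLUSH;      /* flush (or force the log), then retry */
}
```

Because every case is enumerated, it is easy to audit that no inode state falls through unhandled, which is the point the commit message makes.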
     

04 Feb, 2010

1 commit

  • There are no more users of this function left in the XFS code
    now that we've switched everything to delayed write flushing.
    Remove it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

02 Feb, 2010

1 commit

  • All buffers logged into the AIL are marked as delayed write.
    When the AIL needs to push the buffer out, it issues an async write of the
    buffer. This means that IO patterns are dependent on the order of
    buffers in the AIL.

    Instead of flushing the buffer, promote the buffer in the delayed
    write list so that the next time the xfsbufd is run the buffer will
    be flushed by the xfsbufd. Return the state to the xfsaild that the
    buffer was promoted so that the xfsaild knows that it needs to cause
    the xfsbufd to run to flush the buffers that were promoted.

    Using the xfsbufd for issuing the IO allows us to dispatch all
    buffer IO from the one queue. This means that we can make much more
    enlightened decisions on what order to flush buffers to disk as
    we don't have multiple places issuing IO. Optimisations to xfsbufd
    will be in a future patch.

    Version 2
    - kill XFS_ITEM_FLUSHING as it is now unused.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

26 Jan, 2010

2 commits

  • Currently when the xfsbufd writes delayed write buffers, it pushes
    them to disk in the order they come off the delayed write list. If
    there are lots of buffers spread widely over the disk, this results
    in overwhelming the elevator sort queues in the block layer and we
    end up losing the possibility of merging adjacent buffers to minimise
    the number of IOs.

    Use the new generic list_sort function to sort the delwri dispatch
    queue before issue to ensure that the buffers are pushed in the most
    friendly order possible to the lower layers.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
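The dispatch-order fix can be sketched in userspace C. The kernel patch sorts a linked list in place with the new generic list_sort(); an array plus qsort() shows the same idea. The struct and field names here are illustrative stand-ins, not the real xfs_buf:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a delayed-write buffer: only the starting disk block
 * matters for dispatch ordering. */
struct delwri_buf {
    long long bn;   /* starting disk block number */
};

static int delwri_cmp(const void *a, const void *b)
{
    const struct delwri_buf *ba = a, *bb = b;
    return (ba->bn > bb->bn) - (ba->bn < bb->bn);
}

/* Sort the dispatch queue by ascending block number so the elevator
 * sees a mostly-sequential stream and can merge adjacent buffers. */
static void delwri_sort(struct delwri_buf *q, size_t n)
{
    qsort(q, n, sizeof(*q), delwri_cmp);
}
```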
     
  • If we hold onto reserved blocks when doing a remount,ro we end
    up writing the blocks used count to disk that includes the reserved
    blocks. Reserved blocks are not actually used, so this results in
    the values in the superblock being incorrect.

    Hence if we run xfs_check or xfs_repair -n while the filesystem is
    mounted remount,ro we end up with an inconsistent filesystem being
    reported. Also, running xfs_copy on the remount,ro filesystem will
    result in an inconsistent image being generated.

    To fix this, unreserve the blocks when doing the remount,ro, and
    reserved them again on remount,rw. This way a remount,ro filesystem
    will appear consistent on disk to all utilities.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

22 Jan, 2010

3 commits

  • We use the KM_LARGE flag to make kmem_alloc and friends use vmalloc
    if necessary. As we only need this for a few boot/mount time
    allocations just switch to explicit vmalloc calls there.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Remove the XFS_LOG_FORCE argument which was always set, and the
    XFS_LOG_URGE define, which was never used.

    Split xfs_log_force into two helpers - xfs_log_force which forces
    the whole log, and xfs_log_force_lsn which forces up to the
    specified LSN. The underlying implementations already were entirely
    separate, as were the users.

    Also re-indent the new _xfs_log_force/_xfs_log_force which
    previously had a weird coding style.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Currently we define aliases for the buffer flags in various
    namespaces, which only adds confusion. Remove all but the XBF_
    flags to clean this up a bit.

    Note that we still abuse XFS_B_ASYNC/XBF_ASYNC for some non-buffer
    uses, but I'll clean that up later.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

20 Jan, 2010

2 commits

  • To be consistent with the directory code, the attr code should use
    unsigned names. Convert the names from the vfs at the highest level
    to unsigned, and ensure they are consistently used as unsigned down
    to disk.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • xfs_buf_iomove() uses xfs_caddr_t as its parameter types, but it doesn't
    care about the signedness of the variables as it is just copying the
    data. Change the prototype to use void * so that we don't get sign
    warnings at call sites.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

16 Jan, 2010

13 commits

  • Move xfsbdstrat and xfs_bdstrat_cb from xfs_lrw.c and xfs_bioerror
    and xfs_bioerror_relse from xfs_rw.c into xfs_buf.c. This also
    means xfs_bioerror and xfs_bioerror_relse can be marked static now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Fold XFS_bwrite into its only caller, xfs_bwrite, and move it into
    xfs_buf.c instead of leaving it as a fairly large inline function.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Don't bother using XFS_bwrite as it doesn't provide much code for
    our use case. Instead opencode it and fold xlog_bdstrat_cb into the
    new xlog_bdstrat helper.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The filestreams cache flush is not needed in the sync code as it
    does not affect data writeback, and it is now not used by the growfs
    code, either, so kill it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Uninline xfs_perag_{get,put} so that tracepoints can be inserted
    into them to speed debugging of reference count problems.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • xfs_get_perag is really getting the perag that an inode belongs to
    based on its inode number. Convert the use of this function to just
    get the perag from a provided ag number. Use this new function to
    obtain the per-ag structure when traversing the per AG inode trees
    for sync and reclaim.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • The xfsbufd wakes every xfsbufd_centisecs (once per second by
    default) for each filesystem even when the filesystem is idle. If
    the xfsbufd has nothing to do, put it into a long term sleep and
    only wake it up when there is work pending (i.e. dirty buffers to
    flush soon). This will make laptop power misers happy.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Now that the AIL push algorithm is traversal safe, we don't need a
    watchdog function in the xfsaild to catch pushes that fail to make
    progress. Remove the watchdog timeout and make pushes purely driven
    by demand. This will remove the once-per-second wakeup that is seen
    when the filesystem is idle and make laptop power misers happy.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Just minor housekeeping, a lot more functions can be trivially made
    static; others could if we reordered things a bit...

    Signed-off-by: Eric Sandeen
    Signed-off-by: Alex Elder

    Eric Sandeen
     
  • To be able to diagnose whether the swap extents function is
    detecting compatible inode data fork configurations for swapping
    extents, add tracing points to the code to allow us to see the
    format of the inode forks before and after the swap.

    Signed-off-by: Dave Chinner
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • We cannot do direct inode reclaim without taking the flush lock to
    ensure that we do not reclaim an inode under IO. We check the inode
    is clean before doing direct reclaim, but this is not good enough
    because the inode flush code marks the inode clean once it has
    copied the in-core dirty state to the backing buffer.

    It is the flush lock that determines whether the inode is still
    under IO, even though it is marked clean, and the inode is still
    required at IO completion so we can't reclaim it even though it is
    clean in core. Hence the requirement that we need to take the flush
    lock even on clean inodes because this guarantees that the inode
    writeback IO has completed and it is safe to reclaim the inode.

    With delayed write inode flushing, we could end up waiting a long
    time on the flush lock even for a clean inode. The background
    reclaim already handles this efficiently, so avoid all the problems
    by killing the direct reclaim path altogether.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • The reclaim code will handle flushing of dirty inodes before reclaim
    occurs, so avoid them when determining whether an inode is a
    candidate for flushing to disk when walking the radix trees. This
    is based on a test patch from Christoph Hellwig.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Make the inode tree reclaim walk exclusive to avoid races with
    concurrent sync walkers and lookups. This is a version of a patch
    posted by Christoph Hellwig that avoids all the code duplication.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

11 Jan, 2010

3 commits

  • When we search for and find a busy extent during allocation we
    force the log out to ensure the extent free transaction is on
    disk before the allocation transaction. The current implementation
    has a subtle bug: it does not handle multiple overlapping
    ranges.

    That is, if we free lots of little extents into a single
    contiguous extent, then allocate the contiguous extent, the busy
    search code stops searching at the first extent it finds that
    overlaps the allocated range. It then uses the commit LSN of the
    transaction to force the log out to.

    Unfortunately, the other busy ranges might have more recent
    commit LSNs than the first busy extent that is found, and this
    results in xfs_alloc_search_busy() returning before all the
    extent free transactions are on disk for the range being
    allocated. This can lead to potential metadata corruption or
    stale data exposure after a crash because log replay won't replay
    all the extent free transactions that cover the allocation range.

    Modified-by: Alex Elder

    (Dropped the "found" argument from the xfs_alloc_busysearch trace
    event.)

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
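The spirit of the fix can be sketched as: scan every busy extent overlapping the allocation range and use the highest commit LSN, instead of stopping at the first overlap. This is a hedged userspace sketch; the struct, field, and function names are illustrative, not the real xfs_alloc_search_busy interface:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical busy-extent record: a freed disk range plus the commit
 * LSN of the transaction that freed it. */
struct busy_extent {
    long long start, len;
    long long commit_lsn;
};

static int overlaps(const struct busy_extent *be,
                    long long start, long long len)
{
    return be->start < start + len && start < be->start + be->len;
}

/* Return the highest commit LSN among ALL busy extents overlapping the
 * allocation range; the caller then forces the log out to that LSN.
 * Stopping at the first overlap (the old behaviour) could pick too low
 * an LSN and leave later extent-free transactions off disk. */
static long long busy_search_max_lsn(const struct busy_extent *tbl,
                                     size_t n,
                                     long long start, long long len)
{
    long long max_lsn = 0;
    for (size_t i = 0; i < n; i++)
        if (overlaps(&tbl[i], start, len) && tbl[i].commit_lsn > max_lsn)
            max_lsn = tbl[i].commit_lsn;
    return max_lsn;
}
```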
     
  • We currently have some rather odd code in xfs_setattr for
    updating the a/c/mtime timestamps:

    - first we do a non-transaction update if all three are updated
    together
    - second we implicitly update the ctime for various changes
    instead of relying on the ATTR_CTIME flag
    - third we set the timestamps to the current time instead of the
    arguments in the iattr structure in many cases.

    This patch makes sure we update it in a consistent way:

    - always transactional
    - ctime is only updated if ATTR_CTIME is set or we do a size
    update, which is a special case
    - always to the times passed in from the caller instead of the
    current time

    The only non-size caller of xfs_setattr that doesn't come from
    the VFS is updated to set ATTR_CTIME and pass in a valid ctime
    value.

    Reported-by: Eric Blake
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
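The consistent-update rule can be sketched in a few lines of C. This is a toy model: the flag values and struct names are illustrative stand-ins for the VFS iattr valid mask, not the real definitions.

```c
#include <assert.h>

/* Valid-mask bits modeled on the VFS iattr flags; values illustrative. */
enum { ATTR_ATIME = 1, ATTR_MTIME = 2, ATTR_CTIME = 4 };

struct iattr_sketch {
    unsigned valid;
    long long atime, mtime, ctime;
};

struct inode_times { long long atime, mtime, ctime; };

/* Apply exactly the times the caller passed in: never substitute the
 * current time, and touch ctime only when ATTR_CTIME is set. */
static void apply_times(struct inode_times *ip,
                        const struct iattr_sketch *ia)
{
    if (ia->valid & ATTR_ATIME)
        ip->atime = ia->atime;
    if (ia->valid & ATTR_MTIME)
        ip->mtime = ia->mtime;
    if (ia->valid & ATTR_CTIME)
        ip->ctime = ia->ctime;
}
```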
     
  • Using DECLARE_EVENT_CLASS allows us to share trace event code
    instead of duplicating it in the binary. This was not available
    before 2.6.33 so it had to be done as a separate step once the
    prerequisite was merged.

    This only requires changes to xfs_trace.h and the results are
    rather impressive:

    hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
    text data bss dec hex filename
    607732 41884 3616 653232 9f7b0 fs/xfs/xfs.o
    1026732 41884 3808 1072424 105d28 fs/xfs/xfs.o.old

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
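A userspace sketch shows why the binary shrinks: the shared "class" body exists exactly once, and each event becomes a one-line wrapper, so adding events no longer duplicates the formatting code. Macro and function names here are illustrative, not the real tracepoint macros:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The class body: emitted once, shared by every event in the class. */
static int trace_buf_class(const char *ev, long long bno, char *out, int n)
{
    return snprintf(out, n, "%s: bno 0x%llx", ev, bno);
}

/* Each event is just a thin wrapper around the shared class body,
 * analogous to DEFINE_EVENT against a DECLARE_EVENT_CLASS. */
#define DEFINE_BUF_EVENT(ev)                                      \
    static int trace_##ev(long long bno, char *out, int n)        \
    {                                                             \
        return trace_buf_class(#ev, bno, out, n);                 \
    }

DEFINE_BUF_EVENT(xfs_buf_read)
DEFINE_BUF_EVENT(xfs_buf_write)
```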
     

09 Jan, 2010

1 commit


18 Dec, 2009

1 commit

  • After I_SYNC was split from I_LOCK the leftover is always used together with
    I_NEW and thus superfluous.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

17 Dec, 2009

5 commits

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    XFS: Free buffer pages array unconditionally
    xfs: kill xfs_bmbt_rec_32/64 types
    xfs: improve metadata I/O merging in the elevator
    xfs: check for not fully initialized inodes in xfs_ireclaim

    Linus Torvalds
     
  • The code in xfs_buf_free() only attempts to free the b_pages array if the
    buffer is a page cache backed or page allocated buffer. The extra log buffer
    that is used when the log wraps uses pages that are allocated to a different
    log buffer, but it still has a b_pages array allocated when those pages
    are associated with the extra buffer in xfs_buf_associate_memory.

    Hence we need to always attempt to free the b_pages array when tearing
    down a buffer, not just on buffers that are explicitly marked as page bearing
    buffers. This fixes a leak detected by the kernel memory leak code.

    Signed-off-by: Dave Chinner
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Change all async metadata buffers to use [READ|WRITE]_META I/O types
    so that the I/O doesn't get issued immediately. This allows merging of
    adjacent metadata requests but still prioritises them over bulk data.
    This shows a 10-15% improvement in sequential create speed of small
    files.

    Don't include the log buffers in this classification - leave them as
    sync types so they are issued immediately.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Currently the locking in blockdev_direct_IO is a mess, we have three different
    locking types and very confusing checks for some of them. The most
    complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be
    used.

    This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case
    is unused anyway, and the write side is almost identical to DIO_NO_LOCKING.
    The difference is that DIO_NO_LOCKING always sets the create argument for
    the get_blocks callback to zero, but we can easily move that to the actual
    get_blocks callbacks. There are four users of the DIO_NO_LOCKING mode:
    gfs already ignores the create argument and thus is fine with the new
    version, ocfs2 only errors out if create were ever set, and we can remove
    this dead code now, the block device code only ever uses create for an
    error message if we are fully beyond the device which can never happen,
    and last but not least XFS will need the new behaviour for writes.

    Now we can replace the lock_type variable with a flags one, where no flag
    means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
    flag. Separate out the check for not allowing to fill holes into a separate
    flag, although for now both flags always get set at the same time.

    Also revamp the documentation of the locking scheme to actually make sense.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
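The flags scheme described above can be sketched as follows. DIO_LOCKING and DIO_SKIP_HOLES are the names the patch introduces; the bit values and helper below are illustrative assumptions:

```c
#include <assert.h>

/* Flags replacing the old lock_type variable: no flag set means the
 * former DIO_NO_LOCKING behaviour. */
enum {
    DIO_LOCKING    = 1 << 0,  /* take i_mutex around the IO */
    DIO_SKIP_HOLES = 1 << 1,  /* don't allow writes to fill holes */
};

/* For now both flags always travel together for locking filesystems,
 * as the commit message notes; hypothetical helper for illustration. */
static unsigned dio_flags(int wants_locking)
{
    return wants_locking ? (DIO_LOCKING | DIO_SKIP_HOLES) : 0;
}
```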
     
  • Add a flags argument to struct xattr_handler and pass it to all xattr
    handler methods. This allows using the same methods for multiple
    handlers, e.g. for the ACL methods which perform exactly the same action
    for the access and default ACLs, just using a different underlying
    attribute. With a little more groundwork it'll also allow sharing the
    methods for the regular user/trusted/secure handlers in extN, ocfs2 and
    jffs2 like it's already done for xfs in this patch.

    Also change the inode argument to the handlers to a dentry to allow
    using the handlers mechanism for filesystems that require it later,
    e.g. cifs.

    [with GFS2 bits updated by Steven Whitehouse]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: James Morris
    Acked-by: Joel Becker
    Signed-off-by: Al Viro

    Christoph Hellwig
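The ACL sharing described above can be sketched like this. The handler struct, attribute names, and type constants below are illustrative stand-ins for the real xattr_handler interface, showing only how one method can serve both ACL handlers via the flags argument:

```c
#include <assert.h>
#include <string.h>

enum { ACL_TYPE_ACCESS = 0, ACL_TYPE_DEFAULT = 1 };

/* Sketch of a handler table entry carrying the new flags field, which
 * is passed to every method. */
struct xattr_handler_sketch {
    const char *prefix;
    int flags;
    const char *(*attr_name)(int flags);
};

/* One shared method: the flags value selects the underlying attribute,
 * so access and default ACLs no longer need duplicate code. */
static const char *acl_attr_name(int flags)
{
    return flags == ACL_TYPE_DEFAULT ? "posix_acl_default"
                                     : "posix_acl_access";
}

static const struct xattr_handler_sketch acl_access_handler = {
    "system.posix_acl_access", ACL_TYPE_ACCESS, acl_attr_name,
};
static const struct xattr_handler_sketch acl_default_handler = {
    "system.posix_acl_default", ACL_TYPE_DEFAULT, acl_attr_name,
};
```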