25 May, 2011

1 commit

  • Now that we have reliable tracking of deleted extents in a
    transaction, we can easily implement "online" discard support
    which calls blkdev_issue_discard once a transaction commits.

    The actual discard is a two-stage operation as we first have
    to mark the busy extent as not available for reuse before we
    can start the actual discard. Note that we don't bother
    supporting discard for the non-delaylog mode.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

08 Apr, 2011

3 commits

  • Background inode reclaim needs to run more frequently than the XFS
    syncd work runs, as 30s is too long between optimal reclaim runs.
    Add a new periodic work item to the xfs syncd workqueue to run a
    fast, non-blocking inode reclaim scan.

    Background inode reclaim is kicked by the act of marking inodes for
    reclaim. When an AG is first marked as having reclaimable inodes,
    the background reclaim work is kicked. It will continue to run
    periodically until it detects that there are no more reclaimable
    inodes. It will be kicked again when the first inode is queued for
    reclaim.

    To ensure shrinker-based inode reclaim throttles to the inode
    cleaning and reclaim rate but still reclaims inodes efficiently,
    make it kick the background inode reclaim so that when we are low
    on memory we are trying to reclaim inodes as efficiently as
    possible. This kick should not be necessary, but it will protect
    against failures to kick the background reclaim when inodes are
    first dirtied.

    To provide the rate throttling, make the shrinker pass do
    synchronous inode reclaim so that it blocks on inodes under IO. This
    means that the shrinker will reclaim inodes rather than just
    skipping over them, but it does not adversely affect the rate of
    reclaim because most dirty inodes are already under IO due to the
    background reclaim work the shrinker kicked.
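
    A sketch of the self-rearming reclaim work this describes, using the
    standard delayed_work idiom; the helper and field names here are
    illustrative assumptions, not the exact symbols in the patch:

        /* Fast, non-blocking background reclaim pass that requeues
         * itself only while reclaimable inodes remain. */
        static void
        xfs_reclaim_worker(
                struct work_struct      *work)
        {
                struct xfs_mount *mp = container_of(to_delayed_work(work),
                                        struct xfs_mount, m_reclaim_work);

                xfs_reclaim_inodes(mp, 0);      /* non-blocking scan */

                if (xfs_mount_has_reclaimable_inodes(mp))  /* assumed helper */
                        queue_delayed_work(xfs_syncd_wq, &mp->m_reclaim_work,
                                /* several runs per 30s period; interval assumed */
                                msecs_to_jiffies(xfs_syncd_centisecs * 10 / 6));
        }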

    These two modifications solve one of the two OOM killer invocations
    Chris Mason reported recently when running a stress testing script.
    The particular workload trigger for the OOM killer invocation is
    where there are more threads than CPUs all unlinking files in an
    extremely memory constrained environment. Unlike other solutions,
    this one does not have a performance impact when memory is not
    constrained or the number of concurrent threads operating is at or
    below the number of CPUs.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • One of the problems with the current inode flush at ENOSPC is that
    we queue a flush per ENOSPC event, regardless of how many are
    already queued. This can result in hundreds of queued flushes, most
    of which simply burn CPU scanning and do no real work. This simply
    slows down allocation at ENOSPC.

    We really only need one active flush at a time, and we can easily
    implement that via the new xfs_syncd_wq. All we need to do is queue
    a flush if one is not already active, then block waiting for the
    currently active flush to complete. The result is that we only ever
    have a single ENOSPC inode flush active at a time and this greatly
    reduces the overhead of ENOSPC processing.
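
    The mechanism reduces to a few lines; a sketch under the assumption
    that the flush work item lives in the xfs_mount (field names are
    illustrative):

        /*
         * queue_work() is a no-op if the flush is already queued, so
         * concurrent ENOSPC hitters never stack up duplicate flushes;
         * everyone then blocks in flush_work() until the single active
         * flush completes.
         */
        static void
        xfs_flush_inodes(
                struct xfs_inode        *ip)
        {
                struct xfs_mount        *mp = ip->i_mount;

                queue_work(xfs_syncd_wq, &mp->m_flush_work);
                flush_work(&mp->m_flush_work);
        }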

    On my 2p test machine, this results in tests exercising ENOSPC
    conditions running significantly faster - 042 halves execution time,
    083 drops from 60s to 5s, etc - while not introducing test
    regressions.

    This allows us to remove the old xfssyncd threads and infrastructure
    as they are no longer used.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • All of the work xfssyncd does is background functionality. There is
    no need for a thread per filesystem to do this work - it can all be
    managed by a global workqueue now that workqueues manage concurrency
    effectively.

    Introduce a new global xfssyncd workqueue, and convert the periodic
    work to use this new functionality. To do this, use a delayed work
    construct to schedule the next running of the periodic sync work
    for the filesystem. When the sync work is complete, queue a new
    delayed work for the next running of the sync work.

    For laptop mode, we wait on completion of the sync work, so ensure
    that the sync work queuing interface can flush and wait for work to
    complete, to enable the work queue infrastructure to replace the
    current sequence number and wakeup mechanism.

    Because the sync work does non-trivial amounts of work, mark the
    new work queue as CPU intensive.
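
    Sketched out, the setup looks roughly like this - alloc_workqueue()
    and WQ_CPU_INTENSIVE are the real workqueue interfaces, while the
    mount field names and max_active value are assumptions:

        struct workqueue_struct *xfs_syncd_wq;  /* global syncd workqueue */

        int __init
        xfs_syncd_init(void)
        {
                xfs_syncd_wq = alloc_workqueue("xfssyncd", WQ_CPU_INTENSIVE, 8);
                if (!xfs_syncd_wq)
                        return -ENOMEM;
                return 0;
        }

        /* Periodic sync work queues its own next run. */
        static void
        xfs_sync_worker(
                struct work_struct      *work)
        {
                struct xfs_mount *mp = container_of(to_delayed_work(work),
                                        struct xfs_mount, m_sync_work);

                /* ... periodic sync work for this filesystem ... */

                queue_delayed_work(xfs_syncd_wq, &mp->m_sync_work,
                                msecs_to_jiffies(xfs_syncd_centisecs * 10));
        }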

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

04 Jan, 2011

1 commit

  • Currently the size of the speculative preallocation during delayed
    allocation is fixed by either the allocsize mount option or a
    default size. We are seeing a lot of cases where we need to
    recommend using the allocsize mount option to prevent fragmentation
    when buffered writes land in the same AG.

    Rather than using a fixed preallocation size by default (up to 64k),
    make it dynamic by basing it on the current inode size. That way the
    EOF preallocation will increase as the file size increases. Hence
    for streaming writes we are much more likely to get large
    preallocations exactly when we need them to reduce fragmentation.

    For default settings, the size of the initial extents is determined
    by the number of parallel writers and the amount of memory in the
    machine. For 4GB RAM and 4 concurrent 32GB file writes:

    EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
    0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
    1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
    2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
    3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
    4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
    5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
    6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
    7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088

    and for 16 concurrent 16GB file writes:

    EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
    0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
    1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
    2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
    3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
    4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
    5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
    6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
    7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208

    Because it is hard to take back speculative preallocation, cases
    where there are large slow growing log files on a nearly full
    filesystem may cause premature ENOSPC. Hence as the filesystem nears
    full, the maximum dynamic prealloc size is reduced according to this
    table (based on 4k block size):

    freespace   max prealloc size
      >5%       full extent (8GB)
      4-5%      2GB   (8GB >> 2)
      3-4%      1GB   (8GB >> 3)
      2-3%      512MB (8GB >> 4)
      1-2%      256MB (8GB >> 5)
      <1%       128MB (8GB >> 6)

    This should reduce the amount of space held in speculative
    preallocation for such cases.
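
    The throttling in the table reduces to a shift count; a simplified
    sketch, where the function name and the percentage arithmetic are
    illustrative rather than the exact thresholds the patch uses:

        /* Halve the maximum preallocation once per percentage point of
         * free space below 5%, per the table above. */
        static xfs_fsblock_t
        xfs_eof_prealloc_throttle(
                struct xfs_mount        *mp,
                xfs_fsblock_t           alloc_blocks)
        {
                uint64_t        free_pct;

                free_pct = div64_u64(mp->m_sb.sb_fdblocks * 100,
                                     mp->m_sb.sb_dblocks);
                if (free_pct >= 5)
                        return alloc_blocks;    /* full extent allowed */

                /* 4% -> >>2, 3% -> >>3, ... <1% -> >>6 */
                return alloc_blocks >> (6 - free_pct);
        }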

    The allocsize mount option turns off the dynamic behaviour and fixes
    the prealloc size to whatever the mount option specifies. i.e. the
    behaviour is unchanged.

    Signed-off-by: Dave Chinner

    Dave Chinner
     

27 Jul, 2010

2 commits

  • Since Linux 2.6.33 the kernel has support for real O_SYNC, which made
    the osyncisosync option a no-op. Warn the users about this and remove
    the mount flag for it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Dmapi support was never merged upstream, but we still have a lot of
    hooks bloating XFS for it, all over the fast paths of the filesystem.

    This patch drops over 700 lines of dmapi overhead. If we ever get
    HSM support in mainline, at least the namespace events can be done
    much more sanely in the VFS instead of in the individual
    filesystems, so it's not like this is much help for future work.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

20 Jul, 2010

1 commit

  • Now that the shrinker passes us a context, wire up a shrinker
    context per filesystem. This allows us to remove the global mount
    list and the locking problems it introduced. It also means that a
    shrinker call
    does not need to traverse clean filesystems before finding a
    filesystem with reclaimable inodes. This significantly reduces
    scanning overhead when lots of filesystems are present.
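
    With the context wired up, the callback can recover its filesystem
    with container_of() instead of a list walk. A sketch against the
    shrinker callback signature of that era (the helper and field names
    are assumptions):

        static int
        xfs_reclaim_inode_shrink(
                struct shrinker *shrink,
                int             nr_to_scan,
                gfp_t           gfp_mask)
        {
                struct xfs_mount *mp = container_of(shrink,
                                        struct xfs_mount, m_inode_shrink);

                if (nr_to_scan)
                        xfs_shrink_reclaim_inodes(mp, nr_to_scan); /* assumed */
                return xfs_reclaimable_inode_count(mp);            /* assumed */
        }

        /* at mount time: */
        mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
        mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
        register_shrinker(&mp->m_inode_shrink);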

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

24 May, 2010

1 commit

  • The delayed logging code only changes in-memory structures and as
    such can be enabled and disabled with a mount option. Add the mount
    option and emit a warning that this is an experimental feature that
    should not be used in production yet.

    We also need infrastructure to track committed items that have not
    yet been written to the log. This is what the Committed Item List
    (CIL) is for.

    The log item also needs to be extended to track the current log
    vector, the associated memory buffer and its location in the
    Committed Item List. Extend the log item and log vector structures
    to enable
    this tracking.

    To maintain the current log format for transactions with delayed
    logging, we need to introduce a checkpoint transaction and a context
    for tracking each checkpoint from initiation to transaction
    completion. This includes adding a log ticket for tracking the log
    space required/used by the checkpoint context.

    To track all the changes we need an io vector array per log item,
    rather than a single array for the entire transaction. Using the new
    log vector structure for this requires two passes - the first to
    allocate the log vector structures and chain them together, and the
    second to fill them out. This log vector chain can then be passed
    to the CIL for formatting, pinning and insertion into the CIL.

    Formatting of the log vector chain is relatively simple - it's just
    a loop over the iovecs on each log vector, but it is made slightly
    more complex because we re-write the iovec after the copy to point
    back at the memory buffer we just copied into.
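
    In sketch form, with the structure layout assumed from the
    description above:

        /* Walk the log vector chain; for each vector, copy every region
         * the item wants logged into the vector's buffer, then rewrite
         * the iovec to point at the stable copy. */
        static void
        xlog_cil_format_items(
                struct xfs_log_vec      *lv_chain)
        {
                struct xfs_log_vec      *lv;

                for (lv = lv_chain; lv; lv = lv->lv_next) {
                        char    *ptr = lv->lv_buf;
                        int     i;

                        /* second pass: item formats into the iovec array */
                        IOP_FORMAT(lv->lv_item, lv->lv_iovecp);

                        for (i = 0; i < lv->lv_niovecs; i++) {
                                struct xfs_log_iovec *vec = &lv->lv_iovecp[i];

                                memcpy(ptr, vec->i_addr, vec->i_len);
                                vec->i_addr = ptr;
                                ptr += vec->i_len;
                        }
                }
        }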

    This code also needs to pin log items. If the log item is not
    already tracked in this checkpoint context, then it needs to be
    pinned. Otherwise it is already pinned and we don't need to pin it
    again.

    The only other complexity is calculating the amount of new log space
    the formatting has consumed. This needs to be accounted to the
    transaction in progress, and the accounting is made more complex
    because we also need to steal space from it for log metadata in the
    checkpoint transaction. Calculate all this at insert time and update
    all the tickets, counters, etc correctly.

    Once we've formatted all the log items in the transaction, attach
    the busy extents to the checkpoint context so that the busy extents
    live until checkpoint completion and can be processed at that point.
    The transaction can then be freed.

    Now we need to issue checkpoints - we are tracking the amount of
    log space used by the items in the CIL, so we can trigger background
    checkpoints when the space usage gets to a certain threshold.
    Otherwise, checkpoints need to be triggered when a log
    synchronisation point is reached - a log force event.

    Because the log write code already handles chained log vectors,
    writing the transaction is trivial, too. Construct a transaction
    header, add it to the head of the chain and write it into the log,
    then issue a commit record write. Then we can release the checkpoint
    log ticket and attach the context to the log buffer so it can be
    called during IO completion to complete the checkpoint.

    We also need to allow for synchronising multiple in-flight
    checkpoints. This is needed for two things - the first is to ensure
    that checkpoint commit records appear in the log in the correct
    sequence order (so they are replayed in the correct order). The
    second is so that xfs_log_force_lsn() operates correctly and only
    flushes and/or waits for the specific sequence it was provided with.

    To do this we need a wait variable and a list tracking the
    checkpoint commits in progress. We can walk this list and wait for
    the checkpoints to change state or complete easily, and this
    provides the necessary synchronisation for correct operation in both
    cases.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

30 Apr, 2010

1 commit

  • On low memory boxes or those with highmem, the kernel can OOM before
    the background inode reclaim via xfssyncd runs. Add a shrinker to
    run inode reclaim so that inode reclaim is expedited when memory is
    low.

    This is more complex than it needs to be because the VM folk don't
    want a context added to the shrinker infrastructure. Hence we need
    to add a global list of XFS mount structures so the shrinker can
    traverse them.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

03 Mar, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: add __percpu sparse annotations to what's left
    percpu: add __percpu sparse annotations to fs
    percpu: add __percpu sparse annotations to core kernel subsystems
    local_t: Remove leftover local.h
    this_cpu: Remove pageset_notifier
    this_cpu: Page allocator conversion
    percpu, x86: Generic inc / dec percpu instructions
    local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
    module: Use this_cpu_xx to dynamically allocate counters
    local_t: Remove cpu_local_xx macros
    percpu: refactor the code in pcpu_[de]populate_chunk()
    percpu: remove compile warnings caused by __verify_pcpu_ptr()
    percpu: make accessors check for percpu pointer in sparse
    percpu: add __percpu for sparse.
    percpu: make access macros universal
    percpu: remove per_cpu__ prefix.

    Linus Torvalds
     

17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to fs.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.
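
    A one-line illustration of what the annotation buys (hypothetical
    variable, real accessors):

        static int __percpu *counter;   /* from alloc_percpu(int) */

        static void bump(void)
        {
                this_cpu_inc(*counter); /* accessor: fine under sparse */
                /* (*counter)++;           direct access: sparse warns  */
        }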

    Signed-off-by: Tejun Heo
    Cc: "Theodore Ts'o"
    Cc: Trond Myklebust
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: Alexander Viro

    Tejun Heo
     

09 Feb, 2010

1 commit

  • This mangles the reserved blocks counts a little more.

    1) add a helper function for the default reserved count
    2) add helper functions to save/restore counts on ro/rw
    3) save/restore reserved blocks on freeze/thaw
    4) disallow changing reserved count while readonly

    V2: changed field name to match Dave's changes

    Signed-off-by: Eric Sandeen
    Signed-off-by: Alex Elder

    Eric Sandeen
     

26 Jan, 2010

1 commit

  • If we hold onto reserved blocks when doing a remount,ro we end
    up writing the blocks used count to disk that includes the reserved
    blocks. Reserved blocks are not actually used, so this results in
    the values in the superblock being incorrect.

    Hence if we run xfs_check or xfs_repair -n while the filesystem is
    mounted remount,ro we end up with an inconsistent filesystem being
    reported. Also, running xfs_copy on the remount,ro filesystem will
    result in an inconsistent image being generated.

    To fix this, unreserve the blocks when doing the remount,ro, and
    reserve them again on remount,rw. This way a remount,ro filesystem
    will appear consistent on disk to all utilities.
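
    A sketch of the remount hooks, where xfs_reserve_blocks() and
    xfs_default_resblks() are the real helpers and the saved-count field
    is assumed:

        /* remount,ro: give the reserved pool back so the on-disk free
         * space counters are consistent. */
        __uint64_t      resblks = 0;

        mp->m_resblks_save = mp->m_resblks;     /* remember for rw */
        xfs_reserve_blocks(mp, &resblks, NULL);

        /* remount,rw: take the reservation again. */
        resblks = mp->m_resblks_save ? mp->m_resblks_save :
                                       xfs_default_resblks(mp);
        mp->m_resblks_save = 0;
        xfs_reserve_blocks(mp, &resblks, NULL);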

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

20 Jan, 2010

1 commit


16 Jan, 2010

4 commits

  • Uninline xfs_perag_{get,put} so that tracepoints can be inserted
    into them to speed debugging of reference count problems.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Reference count the per-ag structures to ensure that we keep get/put
    pairs balanced. Assert that the reference counts are zero at unmount
    time to catch leaks. In future, reference counts will enable us to
    safely remove perag structures by allowing us to detect when they
    are no longer in use.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • The use of an array for the per-ag structures requires reallocation
    of the array when growing the filesystem. This requires locking
    access to the array to avoid use after free situations, and the
    locking is difficult to get right. To avoid needing to reallocate an
    array, change the per-ag structures to an allocated object per ag
    and index them using a tree structure.

    The AGs are always densely indexed (hence the use of an array), but
    the number supported is 2^32 and lookups tend to be random and hence
    indexing needs to scale. A simple choice is a radix tree - it works
    well with this sort of index. This change also removes another
    large contiguous allocation from the mount/growfs path in XFS.

    The growing process now needs to change to only initialise the new
    AGs required for the extra space, and as such only needs to
    exclusively lock the tree for inserts. The rest of the code only
    needs to lock the tree while doing lookups, and hence this will
    remove all the deadlocks that currently occur on the m_perag_lock as
    it is now an innermost lock. The lock is also changed to a spinlock
    from a read/write lock as the hold time is now extremely short.

    To complete the picture, the per-ag structures will need to be
    reference counted to ensure that we don't free/modify them while
    they are still in use. This will be done in a subsequent patch.
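
    The resulting lookup path (sketched; radix_tree_lookup() and the
    atomic refcount are the real primitives, field names assumed from
    the description):

        static struct xfs_perag *
        xfs_perag_get(
                struct xfs_mount        *mp,
                xfs_agnumber_t          agno)
        {
                struct xfs_perag        *pag;

                spin_lock(&mp->m_perag_lock);
                pag = radix_tree_lookup(&mp->m_perag_tree, agno);
                if (pag)
                        atomic_inc(&pag->pag_ref); /* paired with _put */
                spin_unlock(&mp->m_perag_lock);
                return pag;
        }

        static void
        xfs_perag_put(
                struct xfs_perag        *pag)
        {
                atomic_dec(&pag->pag_ref);
        }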

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • xfs_get_perag is really getting the perag that an inode belongs to
    based on its inode number. Convert the use of this function to just
    get the perag from a provided ag number. Use this new function to
    obtain the per-ag structure when traversing the per AG inode trees
    for sync and reclaim.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

12 Dec, 2009

2 commits

  • Stop the flag saving as we never mangle those in the unmount path,
    and hide all the weird arguments to the dmapi code inside the
    XFS_SEND_PREUNMOUNT / XFS_SEND_UNMOUNT macros.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Remove our own STATIC_INLINE macro. For small functions inside
    implementation files just use STATIC and let gcc inline them, and for
    those in headers do the normal static inline - they are all small
    enough to be inlined for debug builds, too.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

08 Jun, 2009

1 commit

  • Kill the quota ops function vector and replace it with direct calls or
    stubs in the CONFIG_XFS_QUOTA=n case.
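
    The stub side of that is the usual kernel pattern; an illustrative
    sketch (xfs_qm_dqattach is one of the real entry points):

        #ifdef CONFIG_XFS_QUOTA
        extern int xfs_qm_dqattach(struct xfs_inode *, uint);
        extern void xfs_qm_dqdetach(struct xfs_inode *);
        #else
        static inline int xfs_qm_dqattach(struct xfs_inode *ip, uint flags)
        {
                return 0;       /* quotas configured out: nothing to do */
        }
        static inline void xfs_qm_dqdetach(struct xfs_inode *ip)
        {
        }
        #endif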

    Make sure we check XFS_IS_QUOTA_RUNNING in the right spots. We can
    reduce the number of those checks because the XFS_TRANS_DQ_DIRTY
    flag can't be set otherwise.

    This brings us back closer to the way this code worked in IRIX and earlier
    Linux versions, but we keep a lot of the more useful factoring of common
    code.

    Eventually we should also kill xfs_qm_bhv.c, but that's left for a later
    patch.

    Reduces the size of the source code by about 250 lines and the size
    of the XFS module by about 1.5 kilobytes with quotas enabled:

    text data bss dec hex filename
    615957 2960 3848 622765 980ad fs/xfs/xfs.o
    617231 3152 3848 624231 98667 fs/xfs/xfs.o.old

    Fallout:

    - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects
    the inode locked and xfs_qm_dqattach which does the locking around it,
    thus removing XFS_QMOPT_ILOCKED.
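
    That split follows the usual locked/unlocked wrapper idiom; a sketch
    of the wrapper side:

        int
        xfs_qm_dqattach(
                struct xfs_inode        *ip,
                uint                    flags)
        {
                int                     error;

                xfs_ilock(ip, XFS_ILOCK_EXCL);
                error = xfs_qm_dqattach_locked(ip, flags);
                xfs_iunlock(ip, XFS_ILOCK_EXCL);

                return error;
        }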

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen

    Christoph Hellwig
     

07 Apr, 2009

1 commit

  • Currently xfs_device_flush calls sync_blockdev() which is a no-op
    for XFS as all its metadata is held in a different address space to
    the one sync_blockdev() works on.

    Call xfs_sync_inodes() instead to flush all the delayed
    allocation blocks out. To do this as efficiently as possible,
    do it via two passes - one to do an async flush of all the
    dirty blocks and a second to wait for all the IO to complete.
    This requires some modification to the xfs_sync_inodes_ag()
    flush code to do this efficiently.
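
    A sketch of the two passes, using the sync flags of that era; the
    exact flag combination the patch uses is an assumption:

        /* pass 1: start IO on all delalloc blocks without blocking */
        xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_TRYLOCK);
        /* pass 2: wait for all of that IO to complete */
        xfs_sync_inodes(mp, SYNC_DELWRI | SYNC_TRYLOCK | SYNC_IOWAIT);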

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

30 Mar, 2009

1 commit

  • With the upcoming v3 inodes the default attroffset needs to be calculated
    for each specific inode, so we can't cache it in the superblock anymore.

    Also replace the assert for wrong inode sizes with a proper error check
    also included in non-debug builds. Note that the ENOSYS return for
    that might seem odd, but that error is returned by xfs_mount_validate_sb
    for all theoretically valid but not supported filesystem geometries.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Josef 'Jeff' Sipek

    Christoph Hellwig
     

29 Mar, 2009

3 commits

  • Signed-off-by: Malcolm Parsons
    Reviewed-by: Christoph Hellwig

    Malcolm Parsons
     
  • With the upcoming v3 inodes the inode data/attr area size needs to be
    calculated for each specific inode, so we can't cache it in the superblock
    anymore.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Reviewed-by: Felix Blyakher

    Christoph Hellwig
     
  • The ino64 mount option adds a fixed offset to 32bit inode numbers
    to bring them into the 64bit range. There's no need for this kind
    of debug tool given that it's easy to produce real 64bit inode numbers
    for testing.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Reviewed-by: Felix Blyakher

    Christoph Hellwig
     

19 Jan, 2009

1 commit

  • Currently the bad_features2 fixup and the alignment updates in the
    superblock are skipped if we mount a filesystem read-only. But for
    the root filesystem the typical case is to mount read-only first and
    only later remount writeable, so we'll never perform this update at
    all. It's not a big problem, but it means the logs of people needing
    the fixup get spammed at every boot because the fixups never make it
    to disk.

    Reported-by: Arkadiusz Miskiewicz
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
