12 Oct, 2011

4 commits

  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Instead of passing the block number and mount structure explicitly
    get them off the bp and make the argument order more natural.

    Also move it to xfs_buf.c and stop printing the device name, given
    that we already get the fs name as part of xfs_alert and we know
    what device it operates on because of the caller that gets printed.
    Finally, rename it to xfs_buf_ioerror_alert and pass __func__ as
    argument where it makes sense.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The code to flush buffers in the umount code is a bit iffy: we first
    flush all delwri buffers out, but then might be able to queue up a
    new one when logging the sb counts. On a normal shutdown that one
    would get flushed out when doing the synchronous superblock write in
    xfs_unmountfs_writesb, but we skip that one if the filesystem has
    been shut down.

    Fix this by moving the delwri list flushing until just before unmounting
    the log, and while we're at it also remove the superfluous delwri list
    and buffer lru flushing for the rt and log device that can never have
    cached or delwri buffers.

    Signed-off-by: Christoph Hellwig
    Reported-by: Amit Sahrawat
    Tested-by: Amit Sahrawat
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Unify the ways we add buffers to the delwri queue by always calling
    xfs_buf_delwri_queue directly. The xfs_bdwrite function is removed and
    opencoded in its callers, and the two places setting XBF_DELWRI while a
    buffer is locked and expecting xfs_buf_unlock to pick it up are converted
    to call xfs_buf_delwri_queue directly, too. Also replace the
    XFS_BUF_UNDELAYWRITE macro with direct calls to xfs_buf_delwri_dequeue
    to make the explicit queuing/dequeuing more obvious.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

08 Aug, 2011

1 commit


27 Jul, 2011

1 commit


26 Jul, 2011

2 commits


21 Jul, 2011

1 commit


13 Jul, 2011

1 commit

  • Remove the dead hash table test rid which has been rotting away under
    QUOTADEBUG, including some code that was compiled for normal debug
    builds, but not actually called without QUOTADEBUG, and enable a few
    cheap debug checks that were hidden under QUOTADEBUG for normal
    debug builds.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

11 Jul, 2011

1 commit

  • This reverts commit 7a249cf83da1813cfa71cfe1e265b40045eceb47.

    That commit created a situation that could lead to a filesystem
    hang. As Dave Chinner pointed out, xfs_trans_alloc() could hold a
    reference to m_active_trans (i.e., keep it non-zero) and then wait
    for SB_FREEZE_TRANS to complete. Meanwhile a filesystem freeze
    request could set SB_FREEZE_TRANS and then wait for m_active_trans
    to drop to zero. Nobody benefits from this sequence of events...

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Alex Elder
     

09 Jul, 2011

1 commit

  • Pavol pointed out that there is one silent error case in the mount
    path, and that others are rather uninformative.

    I've taken Pavol's suggested patch and extended it a bit to also:

    * fix a message which says "turned off" but actually errors out
    * consolidate the vaguely differentiated "SB sanity check [12]"
    messages, and hexdump the superblock for analysis

    Original-patch-by: Pavol Gono
    Signed-off-by: Eric Sandeen
    Signed-off-by: Alex Elder

    Eric Sandeen
     

08 Jul, 2011

2 commits

  • Rename xfs_buf_cond_lock and reverse its return value to fit most other
    trylock operations in the kernel and XFS (with the exception of down_trylock,
    after which xfs_buf_cond_lock was modelled), and replace xfs_buf_lock_val
    with an xfs_buf_islocked for use in asserts, or an opencoded variant in
    tracing. Remove the XFS_BUF_* wrappers for all the locking helpers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • As pointed out by Jan xfs_trans_alloc can race with a concurrent filesystem
    freeze when it sleeps during the memory allocation. Fix this by moving the
    wait_for_freeze call after the memory allocation. This means moving the
    freeze into the low-level _xfs_trans_alloc helper, which thus grows a new
    argument. Also fix up some comments in that area while at it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

29 Apr, 2011

1 commit

  • follow these guidelines:
    - leave initialization in the declaration block if it fits the line
    - move to the code where it's more suitable ('for' init block)

    The last chunk was modified from David's original to be a correct
    fix for what appeared to be a duplicate initialization.

    Signed-off-by: David Sterba
    Signed-off-by: Alex Elder
    Reviewed-by: Dave Chinner

    David Sterba
     

07 Mar, 2011

3 commits


04 Jan, 2011

1 commit

  • Currently the size of the speculative preallocation during delayed
    allocation is fixed by either the allocsize mount option or a
    default size. We are seeing a lot of cases where we need to
    recommend using the allocsize mount option to prevent fragmentation
    when buffered writes land in the same AG.

    Rather than using a fixed preallocation size by default (up to 64k),
    make it dynamic by basing it on the current inode size. That way the
    EOF preallocation will increase as the file size increases. Hence
    for streaming writes we are much more likely to get large
    preallocations exactly when we need them to reduce fragmentation.

    For default settings, the size of the initial extents is determined
    by the number of parallel writers and the amount of memory in the
    machine. For 4GB RAM and 4 concurrent 32GB file writes:

    EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
    0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
    1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
    2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
    3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
    4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
    5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
    6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
    7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088

    and for 16 concurrent 16GB file writes:

    EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
    0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
    1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
    2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
    3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
    4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
    5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
    6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
    7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208

    Because it is hard to take back speculative preallocation, cases
    where there are large slow growing log files on a nearly full
    filesystem may cause premature ENOSPC. Hence as the filesystem nears
    full, the maximum dynamic prealloc size is reduced according to this
    table (based on 4k block size):

    freespace    max prealloc size
    >5%          full extent (8GB)
    4-5%         2GB (8GB >> 2)
    3-4%         1GB (8GB >> 3)
    2-3%         512MB (8GB >> 4)
    1-2%         256MB (8GB >> 5)
    <1%          128MB (8GB >> 6)

    This should reduce the amount of space held in speculative
    preallocation for such cases.

    The allocsize mount option turns off the dynamic behaviour and fixes
    the prealloc size to whatever the mount option specifies. i.e. the
    behaviour is unchanged.
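
    As a rough illustration of the throttling table above, the shift
    could be derived from the free space percentage like this. It is a
    userspace sketch, not the patch itself: the demo_* names, the
    integer percentage maths and the treatment of exact boundaries
    (exactly 5% free counts as unthrottled here) are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Full-size preallocation cap from the table above (8GB). */
#define DEMO_MAX_PREALLOC (8ULL << 30)

/*
 * Return the right shift to apply to the maximum preallocation size.
 * Hypothetical helper: thresholds follow the table (5%, 4%, 3%, 2%, 1%).
 * free_blocks * 100 is assumed not to overflow 64 bits.
 */
static unsigned int demo_prealloc_shift(uint64_t free_blocks, uint64_t total_blocks)
{
    unsigned int shift = 0;
    uint64_t free_x100;

    assert(total_blocks > 0 && free_blocks <= total_blocks);
    free_x100 = free_blocks * 100;

    if (free_x100 < total_blocks * 5) {
        shift = 2;                      /* 4-5% free: 8GB >> 2 */
        if (free_x100 < total_blocks * 4)
            shift++;                    /* 3-4% free */
        if (free_x100 < total_blocks * 3)
            shift++;                    /* 2-3% free */
        if (free_x100 < total_blocks * 2)
            shift++;                    /* 1-2% free */
        if (free_x100 < total_blocks * 1)
            shift++;                    /* below 1% free */
    }
    return shift;
}

/* Maximum speculative preallocation in bytes for the given free space. */
static uint64_t demo_max_prealloc(uint64_t free_blocks, uint64_t total_blocks)
{
    return DEMO_MAX_PREALLOC >> demo_prealloc_shift(free_blocks, total_blocks);
}
```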

    Signed-off-by: Dave Chinner

    Dave Chinner
     

16 Dec, 2010

1 commit

  • Now that we are using RCU protection for the inode cache lookups,
    the lock is only needed on the modification side. Hence it is not
    necessary for the lock to be a rwlock as there are no read side
    holders anymore. Convert it to a spin lock to reflect its exclusive
    nature.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

11 Nov, 2010

1 commit


19 Oct, 2010

11 commits

  • Stop having two different names for many buffer functions and use
    the more descriptive xfs_buf_* names directly.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb
    and remove support for per-cpu counters from xfs_mod_incore_sb_batch
    to simplify it. An added benefit is that we don't have to take
    m_sb_lock for transactions that only modify per-cpu counters.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Export xfs_icsb_modify_counters and always use it for modifying
    the per-cpu counters. Remove support for per-cpu counters from
    xfs_mod_incore_sb to simplify it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Fail the mount if we can't allocate memory for the per-CPU counters.
    This is consistent with how we handle everything else in the mount
    path and makes the superblock counter modification a lot simpler.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The buffer cache hash is showing typical hash scalability problems.
    In large scale testing the number of cached items grows far larger
    than the hash can efficiently handle. Hence we need to move to a
    self-scaling cache indexing mechanism.

    I have selected rbtrees for indexing because they can have O(log n)
    search scalability, and insert and remove cost is not excessive,
    even on large trees. Hence we should be able to cache large numbers
    of buffers without incurring the excessive cache miss search
    penalties that the hash is imposing on us.

    To ensure we still have parallel access to the cache, we need
    multiple trees. Rather than hashing the buffers by disk address to
    select a tree, it seems more sensible to separate trees by typical
    access patterns. Most operations use buffers from within a single AG
    at a time, so rather than searching lots of different lists,
    separate the buffer indexes out into per-AG rbtrees. This means that
    searches during metadata operation have a much higher chance of
    hitting cache resident nodes, and that updates of the tree are less
    likely to disturb trees being accessed on other CPUs doing
    independent operations.
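
    The per-AG indexing idea can be sketched in userspace as below. This
    is an assumption-laden illustration, not the patch: all demo_* names
    and sizes are made up, and a plain unbalanced binary search tree
    stands in for the kernel rbtree, so it does not provide the O(log n)
    guarantee. The point is only that the AG number selects the tree and
    the block number is the key within it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_AG_COUNT  4        /* hypothetical number of allocation groups */
#define DEMO_AG_BLOCKS 1000u    /* hypothetical AG size in blocks */

struct demo_buf {
    uint64_t blkno;
    struct demo_buf *left, *right;
};

/* One tree per AG; a plain BST standing in for the kernel rbtree. */
struct demo_perag {
    struct demo_buf *root;
};

static struct demo_perag demo_ags[DEMO_AG_COUNT];

/* Pick the AG that owns a block, or NULL if the block is out of range. */
static struct demo_perag *demo_perag_for(uint64_t blkno)
{
    uint64_t agno = blkno / DEMO_AG_BLOCKS;

    return agno < DEMO_AG_COUNT ? &demo_ags[agno] : NULL;
}

/* Look up a cached buffer; only the owning AG's tree is searched. */
static struct demo_buf *demo_buf_find(uint64_t blkno)
{
    struct demo_perag *pag = demo_perag_for(blkno);
    struct demo_buf *bp = pag ? pag->root : NULL;

    while (bp && bp->blkno != blkno)
        bp = blkno < bp->blkno ? bp->left : bp->right;
    return bp;
}

/* Find a buffer or insert a new one into the owning AG's tree. */
static struct demo_buf *demo_buf_get(uint64_t blkno)
{
    struct demo_perag *pag = demo_perag_for(blkno);
    struct demo_buf **link;

    if (!pag)
        return NULL;
    link = &pag->root;
    while (*link) {
        if ((*link)->blkno == blkno)
            return *link;
        link = blkno < (*link)->blkno ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof(**link));
    if (*link)
        (*link)->blkno = blkno;
    return *link;
}
```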

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • Memory reclaim via shrinkers has a terrible habit of having N+M
    concurrent shrinker executions (N = num CPUs, M = num kswapds) all
    trying to shrink the same cache. When the cache they are all working
    on is protected by a single spinlock, massive contention and
    slowdowns occur.

    Wrap the per-ag inode caches with a reclaim mutex to serialise
    reclaim access to the AG. This will block concurrent reclaim in each
    AG but still allow reclaim to scan multiple AGs concurrently. Allow
    shrinkers to move on to the next AG if they can't get the lock, and if
    we can't get any AG, then start blocking on locks.

    To prevent reclaimers from continually scanning the same inodes in
    each AG, add a cursor that tracks where the last reclaim got up to
    and start from that point on the next reclaim. This should avoid
    only ever scanning a small number of inodes at the start of each AG
    and not making progress. If we have a non-shrinker based reclaim
    pass, ignore the cursor and reset it to zero once we are done.
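
    A single-threaded sketch of the skip-if-busy pass with a per-AG
    cursor might look like this. Everything named demo_* is
    hypothetical, an atomic flag stands in for the reclaim mutex, and
    the blocking fallback for the case where no AG could be locked is
    left out. Wrapping the cursor to zero at the end of an AG is an
    assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-AG reclaim state; not the real struct xfs_perag. */
struct demo_perag {
    atomic_bool reclaim_busy;   /* stands in for the per-AG reclaim mutex */
    unsigned long cursor;       /* where the previous shrinker pass stopped */
    unsigned long ninodes;      /* reclaimable inodes in this AG */
};

static bool demo_reclaim_trylock(struct demo_perag *pag)
{
    bool expected = false;

    return atomic_compare_exchange_strong(&pag->reclaim_busy, &expected, true);
}

static void demo_reclaim_unlock(struct demo_perag *pag)
{
    atomic_store(&pag->reclaim_busy, false);
}

/*
 * Scan up to nr_to_scan inodes across all AGs, skipping any AG whose
 * reclaim lock is already held, and resume each AG from its cursor.
 * Returns the number of inodes scanned.
 */
static unsigned long demo_shrink(struct demo_perag *ags, size_t nags,
                                 unsigned long nr_to_scan)
{
    unsigned long scanned = 0;

    assert(ags != NULL);
    for (size_t i = 0; i < nags && scanned < nr_to_scan; i++) {
        struct demo_perag *pag = &ags[i];

        if (!demo_reclaim_trylock(pag))
            continue;           /* someone else is reclaiming this AG */
        while (pag->cursor < pag->ninodes && scanned < nr_to_scan) {
            pag->cursor++;      /* "reclaim" one inode */
            scanned++;
        }
        if (pag->cursor >= pag->ninodes)
            pag->cursor = 0;    /* reached the end of the AG: wrap around */
        demo_reclaim_unlock(pag);
    }
    return scanned;
}
```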

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • The reclaim walk requires different locking and has a slightly
    different walk algorithm, so separate it out so that it can be
    optimised separately.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • When we are checking we can access the last block of each device, we
    do not need to use cached buffers as they will be tossed away
    immediately. Use uncached buffers for size checks so that all IO
    prior to full in-memory structure initialisation does not use the
    buffer cache.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • Filesystem level managed buffers are buffers that have their
    lifecycle controlled by the filesystem layer, not the buffer cache.
    We currently cache these buffers, which makes cleanup and cache
    walking somewhat troublesome. Convert the fs managed buffers to
    uncached buffers obtained via xfs_buf_get_uncached(), and remove
    the XBF_FS_MANAGED special cases from the buffer cache.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • When we start taking a reference to the per-ag for every cached
    buffer in the system, kernel lockstat profiling on an 8-way create
    workload shows the mp->m_perag_lock has higher acquisition rates
    than the inode lock and has significantly more contention. That is,
    it becomes the highest contended lock in the system.

    The perag lookup is trivial to convert to lock-less RCU lookups
    because perag structures never go away. Hence the only thing we need
    to protect against is tree structure changes during a grow. This can
    be done simply by replacing the locking in xfs_perag_get() with RCU
    read locking. This removes the mp->m_perag_lock completely from this
    path.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • When we start taking references per cached buffer to the perag
    it is cached on, it will blow the current debug maximum reference
    count assert out of the water. The assert has never caught a bug,
    and we have tracing to track changes if there ever is a problem,
    so just remove it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

27 Jul, 2010

2 commits

  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Dmapi support was never merged upstream, but we still have a lot of hooks
    bloating XFS for it, all over the fast paths of the filesystem.

    This patch drops over 700 lines of dmapi overhead. If we ever get HSM
    support in mainline at least the namespace events can be done much more sanely
    in the VFS instead of the individual filesystem, so it's not like this
    is much help for future work.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

24 Jun, 2010

1 commit

  • The block number comes from bulkstat based inode lookups to shortcut
    the mapping calculations. We are not able to trust anything from
    bulkstat, so drop the block number as well so that the correct
    lookups and mappings are always done.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

29 May, 2010

3 commits

  • If a filesystem is mounted without the inode64 mount option we
    should still be able to access inodes not fitting into 32 bits, just
    not create new ones. For this to work we need to make sure the
    inode cache radix tree is initialized for all allocation groups, not
    just those we plan to allocate inodes from. This patch makes sure
    we initialize the inode cache radix tree for all allocation groups,
    and also cleans xfs_initialize_perag up a bit to separate the
    inode32 logic from the general perag structure setup.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • The use of radix_tree_preload() only works if the radix tree was
    initialised without the __GFP_WAIT flag. The per-ag tree uses
    GFP_NOFS, so does not trigger allocation of new tree nodes from the
    preloaded array. Hence it enters the allocator with a spinlock held
    and triggers the might_sleep() warnings.

    Reported-by: Chris Mason
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Many places in the xfs code return E2BIG when they really mean
    EFBIG; trying to grow past 16T on a 32 bit machine, for example,
    says "Argument list too long" rather than "File too large" which is
    not particularly helpful.

    Some of these don't make perfect sense as EFBIG either, but still
    better than E2BIG IMHO.
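
    To make the distinction concrete, a size check that reports the
    intended error could look like the sketch below. The demo_* names
    and the 16 TiB limit are assumptions taken from the example above,
    and returning a negative errno value is just the convention chosen
    for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical limit matching the "16T on a 32 bit machine" example. */
#define DEMO_MAX_BYTES (16ULL << 40)

/*
 * Validate a requested filesystem size.
 * Returns 0 if it fits, or -EFBIG ("File too large") if it does not.
 * E2BIG ("Argument list too long") would be the misleading choice here.
 */
static int demo_check_grow(uint64_t new_size_bytes)
{
    if (new_size_bytes > DEMO_MAX_BYTES)
        return -EFBIG;
    return 0;
}
```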

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Eric Sandeen
     

19 May, 2010

1 commit


06 Mar, 2010

1 commit

  • The current default size of the reserved blocks pool is easy to deplete
    with certain workloads, in particular workloads that do lots of concurrent
    delayed allocation extent conversions. If enough transactions are running
    in parallel and the entire pool is consumed then subsequent calls to
    xfs_trans_reserve() will fail with ENOSPC. Also add a rate limited
    warning so we know if this starts happening again.

    This is an updated version of an old patch from Lachlan McIlroy.

    Signed-off-by: Dave Chinner
    Signed-off-by: Alex Elder

    Dave Chinner