16 Sep, 2020

1 commit


07 Jul, 2020

1 commit

  • Inode reclaim still throttles direct reclaim on the per-AG
    reclaim locks. This is no longer necessary, as reclaim can now run
    non-blocking. Hence we can remove these locks so that we don't
    arbitrarily block reclaimers just because there are more direct
    reclaimers than there are AGs.

    This can result in multiple reclaimers working on the same range of
    an AG, but this doesn't cause any apparent issues. Optimising the
    spread of concurrent reclaimers for best efficiency can be done in
    a future patchset. A sketch of the resulting concurrency pattern
    follows this entry.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Darrick J. Wong

    Dave Chinner
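
    A minimal userspace analogue of the resulting concurrency pattern,
    with invented names throughout (NR_AGS, reclaim_ag_work, the
    cursor); it illustrates the idea only and is not the XFS code:
    reclaimers rotate over the AGs via a shared atomic cursor and may
    share an AG instead of serialising on per-AG locks.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NR_AGS      4
    #define NR_WORKERS  8   /* more direct reclaimers than AGs */

    static atomic_uint ag_cursor;   /* rotates reclaim work over the AGs */

    /* stand-in for scanning one AG's reclaimable inodes */
    static void reclaim_ag_work(unsigned int agno)
    {
        printf("reclaimer scanning AG %u\n", agno);
    }

    /* Each reclaimer takes the next AG in rotation.  Two workers may
     * land on the same AG, which the message above says is acceptable;
     * nobody blocks just because every AG is already being scanned. */
    static void *reclaimer(void *arg)
    {
        unsigned int agno;

        (void)arg;
        agno = atomic_fetch_add(&ag_cursor, 1) % NR_AGS;
        reclaim_ag_work(agno);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NR_WORKERS];
        int i;

        for (i = 0; i < NR_WORKERS; i++)
            pthread_create(&tids[i], NULL, reclaimer, NULL);
        for (i = 0; i < NR_WORKERS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }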
     

27 May, 2020

3 commits

  • m_active_trans is a global atomic counter, and we are hitting it at
    a rate of half a million transactions a second, so it bounces the
    counter cacheline all over the place on large machines. We don't
    actually need it anymore: it used to be required because the VFS
    freeze code could not track/prevent filesystem transactions that
    were running, but that problem no longer exists.

    Hence to remove the counter, we simply have to ensure that nothing
    calls xfs_sync_sb() while we are trying to quiesce the filesystem.
    That only happens if the log worker is still running when we call
    xfs_quiesce_attr(). The log worker is cancelled at the end of
    xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
    early here and then we can remove the counter altogether.

    Concurrent create, 50 million inodes, identical 16p/16GB virtual
    machines on different physical hosts. Machine A has twice the CPU
    cores per socket of machine B:

                   unpatched     patched
    machine A:     3m16s         2m00s
    machine B:     4m04s         4m05s

    Create rates:
                   unpatched     patched
    machine A:     282k+/-31k    468k+/-21k
    machine B:     231k+/-8k     233k+/-11k

    Concurrent rm of same 50 million inodes:

                   unpatched     patched
    machine A:     6m42s         2m33s
    machine B:     4m47s         4m47s

    The transaction rate on the fast machine went from just under
    300k/sec to 700k/sec, which indicates just how much of a bottleneck
    this atomic counter was. A userspace sketch of the cacheline
    bouncing follows this entry.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
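
    A runnable userspace illustration of the cacheline bouncing
    described above; every name in it is invented and nothing here is
    kernel code. Sixteen threads hammer one shared atomic, then each
    thread updates its own cacheline-aligned counter; on a multi-socket
    box the first case is dramatically slower. Build with
    cc -O2 -pthread.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define NR_THREADS 16
    #define NR_OPS     (1 << 22)

    static atomic_long shared_counter;      /* one write-hot cacheline */

    struct padded { _Alignas(64) atomic_long v; };
    static struct padded local_counter[NR_THREADS];

    static void *bounce(void *arg)          /* every thread fights here */
    {
        long i;

        (void)arg;
        for (i = 0; i < NR_OPS; i++)
            atomic_fetch_add(&shared_counter, 1);
        return NULL;
    }

    static void *local(void *arg)           /* no line is ever shared */
    {
        struct padded *c = arg;
        long i;

        for (i = 0; i < NR_OPS; i++)
            atomic_fetch_add(&c->v, 1);
        return NULL;
    }

    static double run(void *(*fn)(void *), int percpu)
    {
        pthread_t t[NR_THREADS];
        struct timespec a, b;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (i = 0; i < NR_THREADS; i++)
            pthread_create(&t[i], NULL, fn,
                    percpu ? (void *)&local_counter[i] : NULL);
        for (i = 0; i < NR_THREADS; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("shared atomic counter: %.2fs\n", run(bounce, 0));
        printf("per-thread counters:   %.2fs\n", run(local, 1));
        return 0;
    }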
     
  • We are seeing massive CPU usage from xfs_agino_range() on one
    machine; instruction-level profiles look similar to another machine
    running the same workload, but one machine is consuming 10x as much
    CPU as the other and going much slower. The only real difference
    between the two machines is the core count per socket. Both are
    running identical 16p/16GB virtual machine configurations.

    Machine A:

    25.83% [k] xfs_agino_range
    12.68% [k] __xfs_dir3_data_check
    6.95% [k] xfs_verify_ino
    6.78% [k] xfs_dir2_data_entry_tag_p
    3.56% [k] xfs_buf_find
    2.31% [k] xfs_verify_dir_ino
    2.02% [k] xfs_dabuf_map.constprop.0
    1.65% [k] xfs_ag_block_count

    And takes around 13 minutes to remove 50 million inodes.

    Machine B:

    13.90% [k] __pv_queued_spin_lock_slowpath
    3.76% [k] do_raw_spin_lock
    2.83% [k] xfs_dir3_leaf_check_int
    2.75% [k] xfs_agino_range
    2.51% [k] __raw_callee_save___pv_queued_spin_unlock
    2.18% [k] __xfs_dir3_data_check
    2.02% [k] xfs_log_commit_cil

    And takes around 5m30s to remove 50 million inodes.

    The suspect is cacheline contention on m_sectbb_log, which is used
    in one of the macros in xfs_agino_range. This is a read-only
    variable, but it shares a cacheline with m_active_trans, a global
    atomic that gets bounced all around the machine.

    The workload is trying to run hundreds of thousands of transactions
    per second, so cacheline contention will be occurring on this
    atomic counter. Hence xfs_agino_range() is likely just an innocent
    bystander as the cache coherency protocol fights over the cacheline
    between CPU cores and sockets.

    On machine A, this rearrangement of the struct xfs_mount
    results in the profile changing to:

    9.77% [kernel] [k] xfs_agino_range
    6.27% [kernel] [k] __xfs_dir3_data_check
    5.31% [kernel] [k] __pv_queued_spin_lock_slowpath
    4.54% [kernel] [k] xfs_buf_find
    3.79% [kernel] [k] do_raw_spin_lock
    3.39% [kernel] [k] xfs_verify_ino
    2.73% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock

    Vastly less CPU usage in xfs_agino_range(), but still 3x that of
    machine B, and the machine still runs substantially slower than it
    should.

    Current rm -rf of 50 million files:

                   vanilla      patched
    machine A:     13m20s       6m42s
    machine B:     5m30s        5m02s

    It's an improvement, indicating that separating out and further
    optimising read-only global filesystem data is worthwhile, but it
    clearly isn't the underlying issue causing this specific
    performance degradation. A sketch of the layout change follows
    this entry.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
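
    A compilable sketch of the false sharing at issue, with invented
    field names standing in for the real struct xfs_mount layout: in
    the "bad" layout the read-only field shares a 64-byte line with the
    hot atomic, so pure readers pay for the writers' invalidations; the
    rearrangement pushes the hot counter onto its own line.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdio.h>

    /* before: read-only-after-mount data next to a counter written
     * ~500k times a second */
    struct mount_bad {
        unsigned char sectbb_log;       /* read-only (cf. m_sectbb_log) */
        atomic_long   active_trans;     /* write-hot (cf. m_active_trans) */
    };

    /* after: read-mostly fields grouped together, write-hot state
     * isolated on its own 64-byte cacheline */
    struct mount_good {
        unsigned char sectbb_log;               /* read-mostly section */
        _Alignas(64) atomic_long active_trans;  /* hot section, own line */
    };

    int main(void)
    {
        printf("bad:  sectbb_log @%zu, active_trans @%zu (same line)\n",
               offsetof(struct mount_bad, sectbb_log),
               offsetof(struct mount_bad, active_trans));
        printf("good: sectbb_log @%zu, active_trans @%zu (separate)\n",
               offsetof(struct mount_good, sectbb_log),
               offsetof(struct mount_good, active_trans));
        return 0;
    }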
     
  • Shaokun Zhang reported that XFS was using substantial CPU time in
    percpu_counter_sum(), called from xfs_mod_ifree(), when running a
    single-threaded benchmark on a high-CPU-count (128p) machine. The
    issue is that the filesystem is empty when the benchmark runs, so
    inode allocation is running with a very low inode free count.

    With the percpu counter batching, this means comparisons when the
    counter is less than 128 * 256 = 32768 use the slow path of adding
    up all the counters across the CPUs, and this is expensive on high
    CPU count machines. A sketch of this batching tradeoff follows
    this entry.

    The summing in xfs_mod_ifree() is only used to fire an assert if an
    underrun occurs. The error is ignored by the higher level code.
    Hence this is really just debug code and we don't need to run it
    on production kernels, nor do we need such debug checks to return
    error values just to trigger an assert.

    Finally, xfs_mod_icount/xfs_mod_ifree are only called from
    xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
    directly call the percpu_counter_add/percpu_counter_compare
    functions. The compare functions are now run only on debug builds as
    they are internal to ASSERT() checks and so only compiled in when
    ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).

    Reported-by: Shaokun Zhang
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
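
    A userspace sketch of the batching tradeoff, simplified from (and
    not identical to) the kernel's percpu_counter; every name below is
    invented. The fast path only works when the count is more than
    nr_cpus * batch away from the comparison target, so a nearly empty
    filesystem keeps comparing inside that window and sums all 128
    per-CPU deltas every time.

    #include <stdio.h>

    #define NR_CPUS 128
    #define BATCH   256

    struct pcp_counter {
        long count;             /* global value, updated in batches */
        long pcp[NR_CPUS];      /* per-CPU deltas, each |delta| < BATCH */
    };

    static long pcp_sum(struct pcp_counter *c)  /* the slow path */
    {
        long sum = c->count;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            sum += c->pcp[cpu];
        return sum;
    }

    /* returns <0, 0, >0, like the kernel's percpu_counter_compare() */
    static int pcp_compare(struct pcp_counter *c, long rhs)
    {
        long err = (long)NR_CPUS * BATCH;   /* max drift: 32768 */
        long sum;

        if (c->count > rhs + err)           /* fast path: no summing */
            return 1;
        if (c->count < rhs - err)
            return -1;
        sum = pcp_sum(c);                   /* too close to call */
        return (sum > rhs) - (sum < rhs);
    }

    int main(void)
    {
        struct pcp_counter ifree = { .count = 100 };    /* near-empty fs */

        /* 100 is well inside 128 * 256 of 0, so every comparison walks
         * all the per-CPU deltas: the reported hot spot */
        printf("compare vs 0 -> %d\n", pcp_compare(&ifree, 0));
        return 0;
    }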
     

07 May, 2020

1 commit

  • Both types control shutdown messaging and neither is used in the
    current codebase.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Allison Collins
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

05 May, 2020

2 commits


17 Apr, 2020

1 commit

  • Move the inode dirty data flushing to a workqueue so that multiple
    threads can take advantage of a single thread's flushing work. The
    ratelimiting technique used in bdd4ee4 was not successful, because
    threads that skipped the inode flush scan due to ratelimiting would
    return ENOSPC early, which caused occasional (but noticeable)
    changes in behavior and sporadic fstest regressions.

    Therefore, make all the writer threads wait on a single inode
    flush, which eliminates both the stampeding hordes of flushers and
    the small window in which a write could fail with ENOSPC because it
    lost the ratelimit race even after another thread had freed space.
    A sketch of the shared-flush pattern follows this entry.

    Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
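
    Roughly the resulting call pattern, sketched from the description
    above rather than quoted from the diff (the workqueue and work-item
    field names are assumptions): queue_work() is a no-op if the work
    is already queued, and flush_work() waits for the running instance
    to finish, so any number of writers share one flush.

    void xfs_flush_inodes(struct xfs_mount *mp)
    {
        /* hundreds of ENOSPC-hitting writers may call this at once;
         * only one flush runs and everyone waits on it */
        queue_work(mp->m_sync_workqueue, &mp->m_flush_inodes_work);
        flush_work(&mp->m_flush_inodes_work);
    }

    /* the work item does the sync_inodes_sb() call that every writer
     * previously made for itself */
    static void xfs_flush_inodes_worker(struct work_struct *work)
    {
        struct xfs_mount *mp = container_of(work, struct xfs_mount,
                                            m_flush_inodes_work);

        sync_inodes_sb(mp->m_super);
    }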
     

31 Mar, 2020

1 commit

  • A customer reported RCU stalls and softlockup warnings on a
    computer with many CPU cores and many, many more IO threads trying
    to write to a filesystem that is totally out of space. Subsequent
    analysis pointed to the many, many IO threads calling
    xfs_flush_inodes -> sync_inodes_sb, which causes a lot of
    wb_writeback_work to be queued. The writeback worker spends so much
    time trying to wake the many, many threads waiting for writeback
    completion that it trips the softlockup detector, and (in this
    case) the system automatically reboots.

    In addition, they complain that the lengthy xfs_flush_inodes scan traps
    all of those threads in uninterruptible sleep, which hampers their
    ability to kill the program or do anything else to escape the situation.

    If there are thousands of threads trying to write to files on a
    full filesystem, each of those threads will start a separate copy
    of the inode flush scan. This is kind of pointless since we only
    need one scan, so rate limit the inode flush; a sketch of the
    ratelimit pattern follows this entry.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
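
    A sketch of the ratelimit shape, with illustrative values and a
    file-scope state rather than whatever the patch actually used;
    DEFINE_RATELIMIT_STATE and __ratelimit() are the stock helpers from
    <linux/ratelimit.h>. The skip is precisely the behaviour the
    17 Apr, 2020 entry above later removes: a caller that skips the
    scan can return ENOSPC even though the scan would have freed space.

    #include <linux/ratelimit.h>

    /* allow one flush scan per 30-second window; excess callers skip */
    static DEFINE_RATELIMIT_STATE(flush_ratelimit, 30 * HZ, 1);

    void xfs_flush_inodes(struct xfs_mount *mp)
    {
        if (!__ratelimit(&flush_ratelimit))
            return;     /* a scan ran recently: don't pile on */

        sync_inodes_sb(mp->m_super);
    }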
     

14 Nov, 2019

3 commits


11 Nov, 2019

2 commits


06 Nov, 2019

2 commits


30 Oct, 2019

7 commits


27 Aug, 2019

1 commit

  • When trying to correlate XFS kernel allocations to memory reclaim
    behaviour, it is useful to know what allocations XFS is actually
    attempting. This information is not directly available from the
    generic memory allocation and reclaim tracepoints, so these new
    trace points provide a high level indication of what the XFS memory
    demand actually is.

    There is no per-filesystem context in this code, so we just trace
    the type of allocation, the size and the allocation constraints.
    The kmem code also doesn't include much of the common XFS headers,
    so there are a few definitions that need to be added to the trace
    headers and a couple of types that need to be made common to avoid
    needing to include the whole world in the kmem code. A sketch of
    the trace point shape follows this entry.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
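
    A hedged sketch of what one such trace point can look like; the
    event name and fields are invented for illustration, while
    TRACE_EVENT and the TP_* macros are the standard kernel tracing
    boilerplate used in trace headers.

    TRACE_EVENT(xfs_kmem_alloc_sketch,
        TP_PROTO(size_t size, int flags, unsigned long caller_ip),
        TP_ARGS(size, flags, caller_ip),
        TP_STRUCT__entry(
            __field(size_t, size)               /* allocation size */
            __field(int, flags)                 /* allocation constraints */
            __field(unsigned long, caller_ip)   /* who asked for it */
        ),
        TP_fast_assign(
            __entry->size = size;
            __entry->flags = flags;
            __entry->caller_ip = caller_ip;
        ),
        TP_printk("size %zu flags 0x%x caller %pS",
                  __entry->size, __entry->flags,
                  (void *)__entry->caller_ip)
    );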
     

29 Jun, 2019

1 commit


12 Jun, 2019

2 commits


27 Apr, 2019

1 commit

  • Add a percpu counter to track the number of blocks directly
    reserved for delayed allocations on the data device. This counter
    (in contrast to i_delayed_blks) does not track allocated CoW
    staging extents or anything going on with the realtime device. It
    will be used in the upcoming summary counter scrub function to
    check the free block counts without having to freeze the filesystem
    or walk all the inodes to find the delayed allocations. A sketch of
    the counter's lifecycle follows this entry.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
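
    A sketch of the counter's lifecycle using the stock
    <linux/percpu_counter.h> API; the m_delalloc_blks field name and
    the call sites are my reading of the patch and should be treated as
    illustrative.

    /* mount: set up the counter */
    error = percpu_counter_init(&mp->m_delalloc_blks, 0, GFP_KERNEL);

    /* buffered write paths: a delalloc reservation against the data
     * device is created or released */
    percpu_counter_add(&mp->m_delalloc_blks, alloc_blocks);
    percpu_counter_add(&mp->m_delalloc_blks, -freed_blocks);

    /* scrub: read an exact total without freezing the filesystem or
     * walking every inode's i_delayed_blks */
    delalloc = percpu_counter_sum(&mp->m_delalloc_blks);

    /* unmount: tear it down */
    percpu_counter_destroy(&mp->m_delalloc_blks);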
     

17 Apr, 2019

1 commit


15 Apr, 2019

2 commits


21 Feb, 2019

1 commit

  • Add a mode where XFS never overwrites existing blocks in place.
    This is to aid debugging our COW code, and also puts infrastructure
    in place for things like possible future support for zoned block
    devices, which can't support overwrites.

    This mode is enabled globally by doing a:

    echo 1 > /sys/fs/xfs/debug/always_cow

    Note that the parameter is global to allow running all tests in xfstests
    easily in this mode, which would not easily be possible with a per-fs
    sysfs file.

    In always_cow mode persistent preallocations are disabled, and
    fallocate will fail when called with a 0 mode (with or without
    FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for
    zeroed space when called with FALLOC_FL_ZERO_RANGE or
    FALLOC_FL_UNSHARE_RANGE.

    There are a few interesting xfstests failures when run in always_cow
    mode:

    - generic/392 fails because the bytes used in the file used to test
      hole punch recovery are lower after the log replay. This is
      because the blocks written and then punched out are only freed
      with a delay due to the logging mechanism.
    - xfs/170 will fail as the already fragile file streams mechanism
      doesn't seem to interact well with the COW allocator.
    - xfs/180, xfs/182, xfs/192, xfs/198, xfs/204 and xfs/208 will
      claim the file system is badly fragmented, but there is not much
      we can do to avoid that when always writing out of place.
    - xfs/205 fails because overwriting a file in always_cow mode will
      require new space allocation, and the assumptions in the test
      thus no longer hold.
    - xfs/326 fails to modify the file at all in always_cow mode after
      injecting the refcount error, leading to an unexpected md5sum
      after the remount, but that again is expected.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

15 Feb, 2019

1 commit


12 Feb, 2019

1 commit

  • Use a rhashtable to cache the unlinked list incore. This should speed
    up unlinked processing considerably when there are a lot of inodes on
    the unlinked list because iunlink_remove no longer has to traverse an
    entire bucket list to find which inode points to the one being removed.

    The incore list structure records "X.next_unlinked = Y" relations,
    with the rhashtable using Y to index the records. This makes
    finding the inode X that points to an inode Y very quick. If our
    cache fails to find anything we can always fall back on the old
    method. A sketch of the structure follows this entry.

    FWIW this drastically reduces the amount of time it takes to remove
    inodes from the unlinked list. I wrote a program to open a lot of
    O_TMPFILE files and then close them in the same order, which takes
    a very long time if we have to traverse the unlinked lists. With
    the patch, I see:

    + /d/t/tmpfile/tmpfile
    Opened 193531 files in 6.33s.
    Closed 193531 files in 5.86s

    real 0m12.192s
    user 0m0.064s
    sys 0m11.619s
    + cd /
    + umount /mnt

    real 0m0.050s
    user 0m0.004s
    sys 0m0.030s

    And without the patch:

    + /d/t/tmpfile/tmpfile
    Opened 193588 files in 6.35s.
    Closed 193588 files in 751.61s

    real 12m38.853s
    user 0m0.084s
    sys 12m34.470s
    + cd /
    + umount /mnt

    real 0m0.086s
    user 0m0.000s
    sys 0m0.060s

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
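
    A sketch of the incore backref structure; the names follow my
    reading of the patch and should be treated as illustrative rather
    than quoted. Each record encodes one "X.next_unlinked = Y" relation
    and is hashed on Y, so iunlink_remove can find X with a single
    lookup instead of walking a bucket list.

    /* one record per "X.next_unlinked = Y" relation */
    struct xfs_iunlink {
        struct rhash_head   iu_rhash_head;
        xfs_agino_t         iu_agino;           /* X */
        xfs_agino_t         iu_next_unlinked;   /* Y: the hash key */
    };

    static const struct rhashtable_params xfs_iunlink_hash_params = {
        .key_len     = sizeof(xfs_agino_t),
        .key_offset  = offsetof(struct xfs_iunlink, iu_next_unlinked),
        .head_offset = offsetof(struct xfs_iunlink, iu_rhash_head),
    };

    /* answer "whose next_unlinked points at agino?" for
     * iunlink_remove without walking the bucket list */
    static struct xfs_iunlink *
    xfs_iunlink_lookup_backref(struct xfs_perag *pag, xfs_agino_t agino)
    {
        return rhashtable_lookup_fast(&pag->pagi_unlinked_hash,
                                      &agino, xfs_iunlink_hash_params);
    }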
     

13 Dec, 2018

3 commits

  • The realtime summary is a two-dimensional array on disk, effectively:

    u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]

    rsum[log][bbno] is the number of extents of size 2**log which start in
    bitmap block bbno.

    xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
    rsum[log][bbno] != 0 for any log level. However, the summary array is
    stored in row-major order (i.e., like an array in C), so all of these
    entries are not adjacent, but rather spread across the entire summary
    file. In the worst case (a full bitmap block), xfs_rtany_summary() has
    to check every level.

    This means that on a moderately-used realtime device, an allocation
    will waste a lot of time finding, reading, and releasing buffers
    for the realtime summary. In particular, one of our storage
    services (which runs on servers with 8 very slow CPUs and fifteen
    8 TB XFS realtime filesystems) spends almost 5% of its CPU cycles
    in xfs_rtbuf_get() and xfs_trans_brelse() called from
    xfs_rtany_summary().

    One solution would be to also store the summary with the dimensions
    swapped. However, this would require a disk format change to a very old
    component of XFS.

    Instead, we can cache the minimum size which contains any extents.
    We do so lazily; rather than guaranteeing that the cache contains
    the precise minimum, it always contains a loose lower bound which
    we tighten when we read or update a summary block. This only uses a
    few kilobytes of memory and is already serialized via the realtime
    bitmap and summary inode locks, so the cost is minimal. With this
    change, the same workload only spends 0.2% of its CPU cycles in the
    realtime allocator. A sketch of the cache maintenance follows this
    entry.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Omar Sandoval
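
    A userspace sketch of the lazy lower-bound maintenance, with
    invented sizes and names. The invariant is that every summary count
    below rsum_cache[bbno] is zero, so a search can start at the cached
    level; the bound is tightened opportunistically whenever a summary
    count is read or written.

    #include <stdint.h>

    #define NR_BBLOCKS 1024     /* bitmap blocks, for illustration */

    /* invariant: rsum[log][bbno] == 0 for all log < rsum_cache[bbno];
     * starting at all zeroes makes the invariant trivially true */
    static uint8_t rsum_cache[NR_BBLOCKS];

    /* call whenever rsum[log][bbno] is observed or updated */
    static void rsum_cache_note(uint32_t bbno, uint8_t log, uint32_t count)
    {
        if (count == 0 && rsum_cache[bbno] == log)
            rsum_cache[bbno] = log + 1;     /* levels <= log all zero now */
        else if (count != 0 && rsum_cache[bbno] > log)
            rsum_cache[bbno] = log;         /* keep the invariant honest */
    }

    /* an xfs_rtany_summary()-style search starts here instead of at
     * level 0, skipping the spread-out summary blocks below the bound */
    static uint8_t rsum_search_start(uint32_t bbno)
    {
        return rsum_cache[bbno];
    }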
     
  • Store the inode cluster alignment information in units of inodes and
    blocks in the mount data so that we don't have to keep recalculating
    them.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • Store the number of inodes and blocks per inode cluster in the mount
    data so that we don't have to keep recalculating them.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

27 Jul, 2018

1 commit

  • The barrier mount options have been no-ops and deprecated since

    4cf4573 xfs: deprecate barrier/nobarrier mount option

    i.e. kernel 4.10 / December 2016, with a stated deprecation schedule
    after v4.15. Should be fair game to remove them now.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     

24 Jul, 2018

1 commit