20 Jun, 2017

1 commit

  • Check the inode cache for a particular inode number. If it's in the
    cache, check that it's not currently being reclaimed. If it's not being
    reclaimed, return zero if the inode is allocated. This function will be
    used by various scrubbers to decide if the cache is more up to date
    than the disk in terms of checking if an inode is allocated.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
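
A minimal user-space sketch of the lookup described in this commit, written in Python rather than kernel C. The names `CachedInode` and `icache_inode_is_allocated`, and the errno values chosen for the cache-miss, reclaim, and unallocated cases, are assumptions for illustration; the commit text only states that zero is returned when the inode is allocated.

```python
import errno
from dataclasses import dataclass


@dataclass
class CachedInode:
    ino: int
    mode: int = 0             # assumption: mode 0 models "not allocated"
    reclaiming: bool = False  # assumption: stands in for the reclaim state


def icache_inode_is_allocated(cache, ino):
    """Return 0 if `ino` is cached, not being reclaimed, and allocated."""
    ip = cache.get(ino)
    if ip is None:
        return -errno.ENOENT   # hypothetical: cache has no opinion
    if ip.reclaiming:
        return -errno.EAGAIN   # hypothetical: cached state is in flux
    if ip.mode == 0:
        return -errno.ESTALE   # hypothetical: cached but not allocated
    return 0
```

A scrubber modelled this way would trust the cache only on a zero return and fall back to the on-disk state otherwise.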
     

28 Apr, 2017

1 commit

  • The AG inode iterator currently skips new inodes as such inodes are
    inserted into the inode radix tree before they are fully
    constructed. Certain contexts require the ability to wait on the
    construction of new inodes, however. The fs-wide dquot release from
    the quotaoff sequence is an example of this.

    Update the AG inode iterator to support the ability to wait on
    inodes flagged with XFS_INEW upon request. Create a new
    xfs_inode_ag_iterator_flags() interface and support a set of
    iteration flags to modify the iteration behavior. When the
    XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
    radix tree inode lookup and wait on them before the callback is
    executed.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
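
The iteration behavior described in this commit can be sketched in user space as follows. This is a Python model, not the kernel implementation: the `Inode` class, the use of `threading.Event` to represent construction completing, and the numeric value of `AGITER_INEW_WAIT` are all assumptions.

```python
import threading

AGITER_INEW_WAIT = 0x1  # hypothetical value; only the flag name is from the commit


class Inode:
    def __init__(self, ino, new=False):
        self.ino = ino
        # Set once the inode is fully constructed (models clearing XFS_INEW).
        self.constructed = threading.Event()
        if not new:
            self.constructed.set()

    @property
    def is_new(self):
        return not self.constructed.is_set()


def inode_ag_iterator_flags(inodes, execute, flags=0):
    """Run `execute` on each inode; return the inode numbers visited."""
    visited = []
    for ip in inodes:
        if ip.is_new:
            if not (flags & AGITER_INEW_WAIT):
                continue           # default behavior: skip new inodes
            ip.constructed.wait()  # wait for construction before the callback
        execute(ip)
        visited.append(ip.ino)
    return visited
```

Without the flag, a half-constructed inode is simply not visited; with it, the walk blocks until construction finishes, which is the behavior a caller like the quotaoff dquot release would request.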
     

31 Jan, 2017

1 commit

  • The xfs_eofblocks.eof_scan_owner field is an internal field to
    facilitate invoking eofb scans from the kernel while under the iolock.
    This is necessary because the eofb scan acquires the iolock of each
    inode. Synchronous scans are invoked on certain buffered write failures
    while under iolock. In such cases, the scan owner indicates that the
    context for the scan already owns the particular iolock and prevents a
    double lock deadlock.

    eofblocks scans while under iolock are still livelock prone in the event
    of multiple parallel scans, however. If multiple buffered writes to
    different inodes fail and invoke eofblocks scans at the same time, each
    scan avoids a deadlock with its own inode by virtue of the
    eof_scan_owner field, but will never be able to acquire the iolock of
    the inode from the parallel scan. Because the low free space scans are
    invoked with SYNC_WAIT, the scan will not return until it has processed
    every tagged inode and thus both scans will spin indefinitely on the
    iolock being held across the opposite scan. This problem can be
    reproduced reliably by generic/224 on systems with higher CPU counts
    (x16).

    To avoid this problem, simplify the semantics of eofblocks scans to
    never invoke a scan while under iolock. This means that the buffered
    write context must drop the iolock before the scan. It must reacquire
    the lock before the write retry and also repeat the initial write
    checks, as the original state might no longer be valid once the iolock
    was dropped.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
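
The revised write path (drop the iolock, scan, reacquire, repeat the checks) can be sketched as below. This is a Python model under stated assumptions: `buffered_write` and its callback parameters are hypothetical names, a plain `threading.Lock` stands in for the iolock, and a single retry is assumed.

```python
import errno
import threading


def buffered_write(iolock, write_checks, do_write, run_eofblocks_scan):
    """Write with one ENOSPC retry; the scan never runs under the iolock."""
    scanned = False
    while True:
        iolock.acquire()
        try:
            write_checks()    # repeated after every reacquire of the lock
            ret = do_write()
        finally:
            iolock.release()  # always dropped before any scan
        if ret != -errno.ENOSPC or scanned:
            return ret
        run_eofblocks_scan()  # may take other inodes' iolocks safely now
        scanned = True
```

Because no writer holds its own iolock while scanning, two parallel scans can each acquire the other inode's lock, and the scan-owner bookkeeping is no longer needed.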
     

06 Oct, 2016

1 commit

  • Trim CoW reservations made on behalf of a cowextsz hint if they get too
    old or we run low on quota, so long as we don't have dirty data awaiting
    writeback or directio operations in progress.

    Garbage collection of the cowextsize extents is kept separate from
    prealloc extent reaping because setting the CoW prealloc lifetime to a
    (much) higher value than the regular prealloc extent lifetime has been
    useful for combatting CoW fragmentation on VM hosts where the VMs
    experience bursty write behaviors and we can keep the utilization ratios
    low enough that we don't start to run out of space. IOWs, it benefits
    us to keep the CoW fork reservations around for as long as we can unless
    we run out of blocks or hit inode reclaim.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
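
The trimming condition stated in the first paragraph of this commit reduces to a small predicate. The sketch below is a Python model; the `CowState` fields and the `cow_lifetime_s` parameter are hypothetical names, not kernel identifiers.

```python
from dataclasses import dataclass


@dataclass
class CowState:
    reservation_age_s: float  # how long the CoW reservation has existed
    low_on_quota: bool
    dirty_pages: bool         # dirty data awaiting writeback
    dio_in_progress: bool     # direct I/O operations in flight


def should_trim_cow_reservation(st, cow_lifetime_s):
    """True when a CoW reservation is eligible for garbage collection."""
    if st.dirty_pages or st.dio_in_progress:
        return False  # never trim while data could still land in the extent
    return st.reservation_age_s >= cow_lifetime_s or st.low_on_quota
```

Keeping `cow_lifetime_s` separate from the regular prealloc lifetime mirrors the commit's point that the two may need very different values.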
     

21 Jun, 2016

1 commit

  • The filesystem quiesce sequence performs the operations necessary to
    drain all background work, push pending transactions through the log
    infrastructure and wait on I/O resulting from the final AIL push. We
    have had reports of remount,ro hangs in xfs_log_quiesce() ->
    xfs_wait_buftarg(), however, and some instrumentation code to detect
    transaction commits at this point in the quiesce sequence has inculpated
    the eofblocks background scanner as a cause.

    While higher level remount code generally prevents user modifications by
    the time the filesystem has made it to xfs_log_quiesce(), the background
    scanner may still be alive and can perform pending work at any time. If
    this occurs between the xfs_log_force() and xfs_wait_buftarg() calls
    within xfs_log_quiesce(), this can lead to an indefinite lockup in
    xfs_wait_buftarg().

    To prevent this problem, cancel the background eofblocks scan worker
    during the remount read-only quiesce sequence. This suspends background
    trimming when a filesystem is remounted read-only. This is only done in
    the remount path because the freeze codepath has already locked out new
    transactions by the time the filesystem attempts to quiesce (and thus
    waiting on an active work item could deadlock). Kick the eofblocks
    worker to pick up where it left off once an fs is remounted back to
    read-write.

    Signed-off-by: Brian Foster
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Brian Foster
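
The ordering this commit relies on (cancel the worker, then quiesce, then kick the worker again on remount read-write) can be modelled in a few lines. This is a toy Python sketch; the `Mount` class, its attributes, and the event names are hypothetical.

```python
class Mount:
    """Toy mount that tracks only what the remount ordering needs."""

    def __init__(self):
        self.read_only = False
        self.eofblocks_work_queued = True  # background scanner is armed
        self.events = []

    def _quiesce(self):
        # Stands in for the log force and buffer wait; by this point no
        # background work may be left that could commit a transaction.
        assert not self.eofblocks_work_queued
        self.events.append("quiesce")

    def remount_ro(self):
        self.eofblocks_work_queued = False  # cancel the worker first
        self.events.append("cancel_eofblocks")
        self._quiesce()
        self.read_only = True

    def remount_rw(self):
        self.read_only = False
        self.eofblocks_work_queued = True   # kick the worker again
        self.events.append("kick_eofblocks")
```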
     

28 Nov, 2014

1 commit


24 Jul, 2014

2 commits

  • From: Brian Foster

    Speculative preallocation and the associated throttling metrics
    assume we're working with large files on large filesystems. Users have
    reported inefficiencies in these mechanisms when we happen to be dealing
    with large files on smaller filesystems. This can occur because while
    prealloc throttling is aggressive under low free space conditions, it is
    not active until we reach 5% free space or less.

    For example, a 40GB filesystem has enough space for several files large
    enough to have multi-GB preallocations at any given time. If those files
    are slow growing, they might reserve preallocation for long periods of
    time as well as avoid the background scanner due to frequent
    modification. If a new file is written under these conditions, said file
    has no access to this already reserved space and premature ENOSPC is
    imminent.

    To handle this scenario, modify the buffered write ENOSPC handling and
    retry sequence to invoke an eofblocks scan. In the smaller filesystem
    scenario, the eofblocks scan resets the usage of preallocation such that
    when the 5% free space threshold is met, throttling effectively takes
    over to provide fair and efficient preallocation until legitimate
    ENOSPC.

    The eofblocks scan is selective based on the nature of the failure. For
    example, an EDQUOT failure in a particular quota will use a filtered
    scan for that quota. Because we don't know which quota might have caused
    an allocation failure at any given time, we include each applicable
    quota determined to be under low free space conditions in the scan.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • From: Brian Foster

    The scan owner field represents an optional inode number that is
    responsible for the current scan. The purpose is to identify that an
    inode is under iolock and as such, the iolock shouldn't be attempted
    when trimming eofblocks. This is an internal only field.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
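
The selective scan described in the first commit of this group can be sketched as a function that maps a write failure to a list of scan filters. This is a Python model; `eofblocks_scan_filters`, the filter dictionaries, and the shape of the `quotas` argument are assumptions made for illustration.

```python
import errno


def eofblocks_scan_filters(error, quotas):
    """Map a write failure to the eofblocks scans to run before retrying.

    `quotas` maps a quota type to (id, under_low_free_space).
    An empty filter dict stands for an unfiltered, fs-wide scan.
    """
    if error == -errno.ENOSPC:
        return [{}]
    if error == -errno.EDQUOT:
        # The failing quota is unknown, so scan every applicable quota
        # that is currently under low free space conditions.
        return [{qtype: qid} for qtype, (qid, low) in quotas.items() if low]
    return []  # other errors do not trigger a scan
```

For example, with `quotas = {"uid": (1000, True), "gid": (100, False)}`, an EDQUOT failure yields one scan filtered on uid 1000, while ENOSPC yields a single unfiltered scan.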
     

25 Jun, 2014

1 commit

  • Convert all the errors in the core XFS code to negative error signs
    like the rest of the kernel and remove all the sign conversion we
    do in the interface layers.

    Errors for conversion (and comparison) found via searches like:

    $ git grep " E" fs/xfs
    $ git grep "return E" fs/xfs
    $ git grep " E[A-Z].*;$" fs/xfs

    Negation points found via searches like:

    $ git grep "= -[a-z,A-Z]" fs/xfs
    $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
    $ git grep " -[a-z].*;" fs/xfs

    [ with some bits I missed from Brian Foster ]

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
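
To see what the searches listed in this commit pick up, the patterns can be applied to a few sample lines. The sketch uses Python's `re.search` as a stand-in for `git grep` (these particular patterns behave the same in both syntaxes); the sample source lines in the usage note are hypothetical, not taken from fs/xfs.

```python
import re

# Patterns copied from the commit message.
POSITIVE_ERRNO = [r" E", r"return E", r" E[A-Z].*;$"]
NEGATION = [r"= -[a-z,A-Z]", r"return -[a-z,A-D,F-Z]", r" -[a-z].*;"]


def hits(patterns, line):
    """Return the patterns that match `line`, in list order."""
    return [p for p in patterns if re.search(p, line)]
```

A line such as `return EINVAL;` matches all three positive-errno patterns. A line such as `return -error;` matches the last two negation patterns, while `return -EINVAL;` matches none of them: the `A-D,F-Z` class skips `E`, so errno constants that are already negative are not reported as sign conversions.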
     

14 Apr, 2014

1 commit


13 Sep, 2013

1 commit

  • Pull xfs update #2 from Ben Myers:
    "Here we have defrag support for v5 superblock, a number of bugfixes
    and a cleanup or two.

    - defrag support for CRC filesystems
    - fix endian warning in xlog_recover_get_buf_lsn
    - fixes for sparse warnings
    - fix for assert in xfs_dir3_leaf_hdr_from_disk
    - fix for log recovery of remote symlinks
    - fix for log recovery of btree root splits
    - fixes for memory allocation failures with ACLs
    - fix for assert in xfs_buf_item_relse
    - fix for assert in xfs_inode_buf_verify
    - fix an assignment in an assert that should be a test in
    xfs_bmbt_change_owner
    - remove dead code in xlog_recover_inode_pass2"

    * tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs:
    xfs: remove dead code from xlog_recover_inode_pass2
    xfs: = vs == typo in ASSERT()
    xfs: don't assert fail on bad inode numbers
    xfs: aborted buf items can be in the AIL.
    xfs: factor all the kmalloc-or-vmalloc fallback allocations
    xfs: fix memory allocation failures with ACLs
    xfs: ensure we copy buffer type in da btree root splits
    xfs: set remote symlink buffer type for recovery
    xfs: recovery of swap extents operations for CRC filesystems
    xfs: swap extents operations for CRC filesystems
    xfs: check magic numbers in dir3 leaf verifier first
    xfs: fix some minor sparse warnings
    xfs: fix endian warning in xlog_recover_get_buf_lsn()

    Linus Torvalds
     

11 Sep, 2013

2 commits

  • Convert superblock shrinker to use the new count/scan API, and propagate
    the API changes through to the filesystem callouts. The filesystem
    callouts already use a count/scan API, so it's just changing counters to
    longs to match the VM API.

    This requires the dentry and inode shrinker callouts to be converted to
    the count/scan API. This is mainly a mechanical change.

    [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton

    Signed-off-by: Al Viro

    Dave Chinner
     
  • This is the recovery side of the btree block owner change operation
    performed by swapext on CRC enabled filesystems. We detect that an
    owner change is needed by the flag that has been placed on the inode
    log format flag field. Because the inode recovery is being replayed
    after the buffers that make up the BMBT in the given checkpoint, we
    can walk all the buffers and directly modify them when we see the
    flag set on an inode.

    Because the inode can be relogged and hence present in multiple
    checkpoints with the "change owner" flag set, we could do multiple
    passes across the inode to do this change. While this isn't optimal,
    we can't ignore the flag, as there may be multiple independent swap
    extent operations being replayed on the same inode in different
    checkpoints.

    Further, because the owner change operation uses ordered buffers, we
    might have buffers that are newer on disk than the current
    checkpoint and so already have the owner changed in them. Hence we
    cannot just peek at a buffer in the tree and check that it has the
    correct owner and assume that the change was completed.

    So, for the moment just brute force the owner change every time we
    see an inode with the flag set. Note that we have to be careful here
    because the owner of the buffers may point to either the old owner
    or the new owner. Currently the verifier can't verify the owner
    directly, so there is no failure case here right now. If we verify
    the owner exactly in future, then we'll have to take this into
    account.

    This was tested in terms of normal operation via xfstests - all of
    the fsr tests now pass without failure. However, we really need to
    modify xfs/227 to stress v3 inodes correctly to ensure we fully
    cover this case for v5 filesystems.

    In terms of recovery testing, I used a hacked version of xfs_fsr
    that held the temp inode open for a few seconds before exiting so
    that the filesystem could be shut down with an open owner change
    recovery flag set on at least the temp inode. fsr leaves the temp
    inode unlinked and in btree format, so this was necessary for the
    owner change to be reliably replayed.

    logprint confirmed the tmp inode in the log had the correct flag set:

    INO: cnt:3 total:3 a:0x69e9e0 len:56 a:0x69ea20 len:176 a:0x69eae0 len:88
    INODE: #regs:3 ino:0x44 flags:0x209 dsize:88
    ^^^^^

    0x200 is set, indicating a data fork owner change needed to be
    replayed on inode 0x44. A printk in the recovery code confirmed that
    the inode change was recovered:

    XFS (vdc): Mounting Filesystem
    XFS (vdc): Starting recovery (logdev: internal)
    recovering owner change ino 0x44
    XFS (vdc): Version 5 superblock detected. This kernel L support enabled!
    Use of these features in this kernel is at your own risk!
    XFS (vdc): Ending recovery (logdev: internal)

    The script used to test this was:

    $ cat ./recovery-fsr.sh
    #!/bin/bash

    dev=/dev/vdc
    mntpt=/mnt/scratch
    testfile=$mntpt/testfile

    umount $mntpt
    mkfs.xfs -f -m crc=1 $dev
    mount $dev $mntpt
    chmod 777 $mntpt

    for i in `seq 10000 -1 0`; do
    xfs_io -f -d -c "pwrite $(($i * 4096)) 4096" $testfile > /dev/null 2>&1
    done
    xfs_bmap -vp $testfile |head -20

    xfs_fsr -d -v $testfile &
    sleep 10
    /home/dave/src/xfstests-dev/src/godown -f $mntpt
    wait
    umount $mntpt

    xfs_logprint -t $dev |tail -20
    time mount $dev $mntpt
    xfs_bmap -vp $testfile
    umount $mntpt
    $

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
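
The brute-force replay described in the recovery commit above can be sketched as follows. This is a Python model: the constant name `ILOG_DOWNER` and the dictionary representation of BMBT buffers are assumptions; only the value 0x200 and the `flags:0x209` / `ino:0x44` figures come from the logprint output quoted in the commit.

```python
ILOG_DOWNER = 0x200  # data fork owner change bit (value from the commit text)


def replay_owner_change(log_flags, ino, bmbt_buffers):
    """Rewrite the owner of every BMBT buffer if the flag is set.

    Returns the number of buffers rewritten. Buffers are rewritten
    unconditionally because a buffer may already carry either the old
    or the new owner, so peeking at one buffer proves nothing.
    """
    if not (log_flags & ILOG_DOWNER):
        return 0
    for buf in bmbt_buffers:
        buf["owner"] = ino
    return len(bmbt_buffers)
```

Replaying the same inode from several checkpoints simply repeats the rewrite, which is wasteful but harmless since the operation is idempotent.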
     

16 Aug, 2013

1 commit


13 Aug, 2013

1 commit


27 Jun, 2013

1 commit


09 Nov, 2012

5 commits

  • Create a new mount workqueue and delayed_work to enable background
    scanning and freeing of eofblocks inodes. The scanner kicks in once
    speculative preallocation occurs and stops requeueing itself when
    no eofblocks inodes exist.

    The scan interval is based on the new
    'speculative_prealloc_lifetime' tunable (defaults to 5m). The
    background scanner performs unfiltered, best effort scans (which
    skips inodes under lock contention or with a dirty cache mapping).

    Signed-off-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Brian Foster
     
  • The XFS_IOC_FREE_EOFBLOCKS ioctl allows users to invoke an EOFBLOCKS
    scan. The xfs_eofblocks structure is defined to support the command
    parameters (scan mode).

    Signed-off-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Brian Foster
     
  • xfs_inodes_free_eofblocks() implements scanning functionality for
    EOFBLOCKS inodes. It uses the AG iterator to walk the tagged inodes
    and free post-EOF blocks via the xfs_inode_free_eofblocks() execute
    function. The scan can be invoked in best-effort mode or wait
    (force) mode.

    A best-effort scan (default) handles all inodes that do not have a
    dirty cache and we successfully acquire the io lock via trylock. In
    wait mode, we continue to cycle through an AG until all inodes are
    handled.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Brian Foster
     
  • Genericize xfs_inode_ag_walk() to support an optional radix tree tag
    and args argument for the execute function. Create a new wrapper
    called xfs_inode_ag_iterator_tag() that performs a tag based walk
    of perag's and inodes.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Brian Foster
     
  • Add the XFS_ICI_EOFBLOCKS_TAG inode tag to identify inodes with
    speculatively preallocated blocks beyond EOF. An inode is tagged
    when speculative preallocation occurs and untagged either via
    truncate down or when post-EOF blocks are freed via release or
    reclaim.

    The tag management is intentionally not aggressive to prefer
    simplicity over the complexity of handling all the corner cases
    under which post-EOF blocks could be freed (i.e., forward
    truncation, fallocate, write error conditions, etc.). This means
    that a tagged inode may or may not have post-EOF blocks after a
    period of time. The tag is eventually cleared when the inode is
    released or reclaimed.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Brian Foster
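
The five commits in this group combine into a tagged walk with two scan modes. The sketch below models that in Python under several assumptions: the `Inode` class and its fields are hypothetical, a `threading.Lock` stands in for the io lock, and the `max_passes` bound and the short sleep between passes exist only so the model terminates (the commit describes wait mode as cycling until all inodes are handled).

```python
import threading
import time


class Inode:
    def __init__(self, ino, eofblocks=False, dirty=False):
        self.ino = ino
        self.eofblocks_tag = eofblocks  # models XFS_ICI_EOFBLOCKS_TAG
        self.dirty = dirty              # models a dirty cache mapping
        self.iolock = threading.Lock()


def inodes_free_eofblocks(inodes, wait=False, max_passes=500):
    """Free post-EOF blocks on tagged inodes; return the inodes handled."""
    freed = []
    pending = [ip for ip in inodes if ip.eofblocks_tag]  # tag-based walk
    for _ in range(max_passes):
        skipped = []
        for ip in pending:
            if ip.dirty and not wait:
                continue                     # best effort: skip dirty inodes
            if not ip.iolock.acquire(blocking=False):
                skipped.append(ip)           # trylock failed
                continue
            try:
                ip.eofblocks_tag = False     # blocks freed, tag cleared
                freed.append(ip.ino)
            finally:
                ip.iolock.release()
        if not wait or not skipped:
            break                            # best effort stops after one pass
        pending = skipped                    # wait mode: cycle until handled
        time.sleep(0.01)
    return freed
```

A background scanner built on this would call it with `wait=False` on a timer and stop requeueing itself once no tagged inodes remain.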
     

18 Oct, 2012

2 commits

  • The inode cache functions remaining in xfs_iget.c can be moved to xfs_icache.c
    along with the other inode cache functions. This removes all functionality from
    xfs_iget.c, so the file can simply be removed.

    This move results in various functions now only having the scope of a single
    file (e.g. xfs_inode_free()), so clean up all the definitions and exported
    prototypes in xfs_icache.[ch] and xfs_inode.h appropriately.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • xfs_sync.c now only contains inode reclaim functions and inode cache
    iteration functions. It is not related to sync operations anymore.
    Rename to xfs_icache.c to reflect its contents and prepare for
    consolidation with the other inode cache file that exists
    (xfs_iget.c).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner