16 Sep, 2020

1 commit


07 Sep, 2020

1 commit

  • With the recent rework of the inode cluster flushing, we no longer
    ever wait on the inode flush "lock". It was never a lock in the
    first place, just a completion to allow callers to wait for inode IO
    to complete. We now never wait for flush completion as all inode
    flushing is non-blocking. Hence we can get rid of all the iflock
    infrastructure and instead just set and check a state flag.

    Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
    xfs_iflock_nowait() calls to test-and-set operations on that flag,
    and replace all the xfs_ifunlock() calls with clear operations.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
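
The shift from a waitable flush "lock" to a plain state flag, as described in the commit above, can be sketched in user-space C. This is only an illustration: the names `demo_inode`, `DEMO_IFLUSHING`, `demo_flush_begin()` and friends are hypothetical, and C11 atomics stand in for whatever flag-protection scheme the kernel code actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for XFS_IFLUSHING. */
#define DEMO_IFLUSHING (1u << 0)

/* Hypothetical stand-in for the in-memory inode; only the flags matter here. */
struct demo_inode {
    atomic_uint flags;
};

/*
 * Test-and-set: returns true only for the single caller that moved the
 * inode into the "flushing" state. Nobody ever blocks waiting on it.
 */
static bool demo_flush_begin(struct demo_inode *ip)
{
    unsigned int old = atomic_fetch_or(&ip->flags, DEMO_IFLUSHING);
    return !(old & DEMO_IFLUSHING);
}

/* Clear operation: the counterpart of the old "unlock" call. */
static void demo_flush_done(struct demo_inode *ip)
{
    atomic_fetch_and(&ip->flags, ~DEMO_IFLUSHING);
}

/* Plain state check, used where callers previously tested the "lock". */
static bool demo_is_flushing(struct demo_inode *ip)
{
    return (atomic_load(&ip->flags) & DEMO_IFLUSHING) != 0;
}
```

A caller that loses the test-and-set simply skips the inode instead of sleeping, which is why no completion or wait queue is needed in this model.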
     

07 Jul, 2020

3 commits

  • Having different io completion callbacks for different inode states
    makes things complex. We can detect if the inode is stale via the
    XFS_ISTALE flag in IO completion, so we don't need a special
    callback just for this.

    This means inodes only have a single iodone callback, and inode IO
    completion is entirely buffer centric at this point. Hence we no
    longer need to use a log item callback at all as we can just call
    xfs_iflush_done() directly from the buffer completions and walk the
    buffer log item list to complete all the inodes under IO.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • The inode log item is kind of special in that it can be aggregating
    new changes in memory at the same time existing changes are
    being written back to disk. This means there are fields in the log
    item that are accessed concurrently from contexts that don't share
    any locking at all.

    e.g. updating ili_last_fields occurs under the ILOCK_EXCL and
    flush lock at flush time and under the flush lock at IO
    completion time, and it is read under the ILOCK_EXCL when the inode is
    logged. Hence there is no actual serialisation between reading the
    field during logging of the inode in transactions vs clearing the
    field in IO completion.

    We currently get away with this by the fact that we are only
    clearing fields in IO completion, and nothing bad happens if we
    accidentally log more of the inode than we actually modify. Worst
    case is we consume a tiny bit more memory and log bandwidth.

    However, if we want to do more complex state manipulations on the
    log item that require updates at all three of these potential
    locations, we need to have some mechanism of serialising those
    operations. To do this, introduce a spinlock into the log item to
    serialise internal state.

    This could be done via the xfs_inode i_flags_lock, but this then
    leads to potential lock inversion issues where inode flag updates
    need to occur inside locks that best nest inside the inode log item
    locks (e.g. marking inodes stale during inode cluster freeing).
    Using a separate spinlock avoids these sorts of problems and
    simplifies future code.

    This does not touch the use of ili_fields in the item formatting
    code - that is entirely protected by the ILOCK_EXCL at this point in
    time, so it remains untouched.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
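
The serialisation problem described in the commit above can be modelled with a small user-space sketch. Everything here is an assumption made for illustration: the struct `demo_log_item`, its `fields`/`last_fields` members and the helper names are hypothetical, and a C11 `atomic_flag` spin loop stands in for a kernel spinlock.

```c
#include <stdatomic.h>

/* Hypothetical model of a log item with its own internal lock. */
struct demo_log_item {
    atomic_flag lock;           /* stands in for the per-item spinlock */
    unsigned int fields;        /* fields dirtied since the last flush */
    unsigned int last_fields;   /* fields currently being written back */
};

static void demo_item_lock(struct demo_log_item *it)
{
    while (atomic_flag_test_and_set_explicit(&it->lock, memory_order_acquire))
        ;   /* spin until the holder releases the flag */
}

static void demo_item_unlock(struct demo_log_item *it)
{
    atomic_flag_clear_explicit(&it->lock, memory_order_release);
}

/* Logging context: reads last_fields and accumulates newly dirtied fields. */
static void demo_item_log(struct demo_log_item *it, unsigned int flags)
{
    demo_item_lock(it);
    it->fields |= flags | it->last_fields;
    demo_item_unlock(it);
}

/* Flush context: moves the dirty fields into last_fields and returns them. */
static unsigned int demo_item_flush(struct demo_log_item *it)
{
    unsigned int flushing;

    demo_item_lock(it);
    flushing = it->fields;
    it->last_fields = flushing;
    it->fields = 0;
    demo_item_unlock(it);
    return flushing;
}

/* IO completion context: clears last_fields once writeback has finished. */
static void demo_item_io_done(struct demo_log_item *it)
{
    demo_item_lock(it);
    it->last_fields = 0;
    demo_item_unlock(it);
}
```

Because all three contexts take the same item-local lock, none of them needs the inode flags lock, which is the nesting concern the commit message raises.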
     
  • This was used to track if the item had logged fields being flushed
    to disk. We log everything in the inode these days, so this logic is
    no longer needed. Remove it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     

07 May, 2020

1 commit

  • The stale parameter was used to control the now unused shutdown
    parameter of xfs_trans_ail_remove().

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Allison Collins
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

05 May, 2020

1 commit


29 Jun, 2019

1 commit


07 Jun, 2018

1 commit

  • Remove the verbose license text from XFS files and replace them
    with SPDX tags. This does not change the license of any of the code,
    merely refers to the common, up-to-date license files in LICENSES/

    This change was mostly scripted. fs/xfs/Makefile and
    fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
    and modified by the following command:

    for f in `git grep -l "GNU General" fs/xfs/` ; do
        echo $f
        cat $f | awk -f hdr.awk > $f.new
        mv -f $f.new $f
    done

    And the hdr.awk script that did the modification (including
    detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
    is as follows:

    $ cat hdr.awk
    BEGIN {
        hdr = 1.0
        tag = "GPL-2.0"
        str = ""
    }

    /^ \* This program is free software/ {
        hdr = 2.0;
        next
    }

    /any later version./ {
        tag = "GPL-2.0+"
        next
    }

    /^ \*\// {
        if (hdr > 0.0) {
            print "// SPDX-License-Identifier: " tag
            print str
            print $0
            str=""
            hdr = 0.0
            next
        }
        print $0
        next
    }

    /^ \* / {
        if (hdr > 1.0)
            next
        if (hdr > 0.0) {
            if (str != "")
                str = str "\n"
            str = str $0
            next
        }
        print $0
        next
    }

    /^ \*/ {
        if (hdr > 0.0)
            next
        print $0
        next
    }

    // {
        if (hdr > 0.0) {
            if (str != "")
                str = str "\n"
            str = str $0
            next
        }
        print $0
    }

    END { }
    $

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     

02 Nov, 2017

1 commit


03 Nov, 2015

1 commit

  • xfs: timestamp updates cause excessive fdatasync log traffic

    Sage Weil reported that a ceph test workload was writing to the
    log on every fdatasync during an overwrite workload. Event tracing
    showed that the only metadata modification being made was the
    timestamp updates during the write(2) syscall, but fdatasync(2)
    is supposed to ignore them. The key observation was that the
    transactions in the log all looked like this:

    INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32

    And contained a flags field of 0x45 or 0x85, and had data and
    attribute forks following the inode core. This means that the
    timestamp updates were triggering dirty relogging of previously
    logged parts of the inode that hadn't yet been flushed back to
    disk.

    There are two parts to this problem. The first is that XFS relogs
    dirty regions in subsequent transactions, so it carries around the
    fields that have been dirtied since the last time the inode was
    written back to disk, not since the last time the inode was forced
    into the log.

    The second part is that on v5 filesystems, the inode change count
    update during inode dirtying also sets the XFS_ILOG_CORE flag, so
    on v5 filesystems this makes a timestamp update dirty the entire
    inode.

    As a result when fdatasync is run, it looks at the dirty fields in
    the inode, and sees more than just the timestamp flag, even though
    the only metadata change since the last fdatasync was just the
    timestamps. Hence we force the log on every subsequent fdatasync
    even though it is not needed.

    To fix this, add a new field to the inode log item that tracks
    changes since the last time fsync/fdatasync forced the log to flush
    the changes to the journal. This flag is updated when we dirty the
    inode, but we do it before updating the change count so it does not
    carry the "core dirty" flag from timestamp updates. The fields are
    zeroed when the inode is marked clean (due to writeback/freeing) or
    when an fsync/fdatasync forces the log. Hence if we only dirty the
    timestamps on the inode between fsync/fdatasync calls, the fdatasync
    will not trigger another log force.

    Over 100 runs of the test program:

    Ext4 baseline:
    runtime: 1.63s +/- 0.24s
    avg lat: 1.59ms +/- 0.24ms
    iops: ~2000

    XFS, vanilla kernel:
    runtime: 2.45s +/- 0.18s
    avg lat: 2.39ms +/- 0.18ms
    log forces: ~400/s
    iops: ~1000

    XFS, patched kernel:
    runtime: 1.49s +/- 0.26s
    avg lat: 1.46ms +/- 0.25ms
    log forces: ~30/s
    iops: ~1500

    Reported-by: Sage Weil
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
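
The fix described in the commit above can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel code: the struct `demo_fsync_item`, the `DEMO_ILOG_*` bit values and every function name are hypothetical, and the `v5` parameter only models "the change count update also sets the core flag".

```c
#include <stdbool.h>

/* Hypothetical bit values; the real flag values are not taken from the source. */
#define DEMO_ILOG_CORE      (1u << 0)
#define DEMO_ILOG_TIMESTAMP (1u << 1)

/* Hypothetical model of the two tracking fields in the inode log item. */
struct demo_fsync_item {
    unsigned int fields;        /* dirty since the inode was last written back */
    unsigned int fsync_fields;  /* dirty since fsync/fdatasync last forced the log */
};

/* Dirty the inode: record the fsync-visible flags before the change count bump. */
static void demo_log_inode(struct demo_fsync_item *it, unsigned int flags, bool v5)
{
    it->fsync_fields |= flags;      /* recorded without the "core dirty" side effect */
    if (v5)
        flags |= DEMO_ILOG_CORE;    /* change count update dirties the core */
    it->fields |= flags;
}

/* fdatasync only needs a log force if something other than timestamps changed. */
static bool demo_fdatasync_needs_force(const struct demo_fsync_item *it)
{
    return (it->fsync_fields & ~DEMO_ILOG_TIMESTAMP) != 0;
}

/* Called after fsync/fdatasync has forced the log. */
static void demo_log_forced(struct demo_fsync_item *it)
{
    it->fsync_fields = 0;
}

/* Called when the inode is marked clean by writeback or freeing. */
static void demo_mark_clean(struct demo_fsync_item *it)
{
    it->fields = 0;
    it->fsync_fields = 0;
}
```

In this model a timestamp-only update on a v5 filesystem still accumulates the core flag in `fields`, but `fsync_fields` holds only the timestamp bit, so the next fdatasync check reports that no log force is needed.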
     

13 Dec, 2013

2 commits


13 Aug, 2013

1 commit


18 Dec, 2012

1 commit


15 May, 2012

1 commit

  • xfs_trans_ail_delete_bulk() can be called from different contexts so
    if the item is not in the AIL we need a different shutdown method for
    each context. Pass in the shutdown method needed so the correct action
    can be taken.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

14 Mar, 2012

3 commits

  • Add an in-memory only flag to say we logged timestamps only, and use it to
    check if fdatasync can optimize away the log force.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Add a new ili_fields member to the inode log item to isolate the in-memory
    flags from the ones that actually go to the log. This will allow tracking
    timestamp-only updates for fdatasync and O_DSYNC in the next patch and
    prepares for divorcing the on-disk log format from the in-memory log item
    a little further down the road.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     
  • Timestamps on regular files are the last metadata that XFS does not update
    transactionally. Now that we use the delaylog mode exclusively and made
    the log code scale extremely well there is no need to bypass that code for
    timestamp updates. Logging all updates allows us to drop a lot of code, and
    will allow for further performance improvements later on.

    Note that this patch drops optimized handling of fdatasync - it will be
    added back in a separate commit.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

27 Jul, 2010

2 commits

  • Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
    joining it to a transaction via xfs_trans_ijoin.

    This patch instead makes xfs_trans_ijoin usable on its own by doing
    an implicit xfs_trans_ihold, which also allows us to drop the third
    argument. For the case where we want to hold a reference on the inode
    an xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
    the inode as needing an xfs_iput. In addition to the cleaner interface
    for the caller this also simplifies the implementation.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Stop the function pointer casting madness and give all the li_cb instances
    the correct prototype.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

02 Feb, 2010

1 commit

  • All buffers logged into the AIL are marked as delayed write.
    When the AIL needs to push the buffer out, it issues an async write of the
    buffer. This means that IO patterns are dependent on the order of
    buffers in the AIL.

    Instead of flushing the buffer, promote the buffer in the delayed
    write list so that the next time the xfsbufd is run the buffer will
    be flushed by the xfsbufd. Return the state to the xfsaild that the
    buffer was promoted so that the xfsaild knows that it needs to cause
    the xfsbufd to run to flush the buffers that were promoted.

    Using the xfsbufd for issuing the IO allows us to dispatch all
    buffer IO from the one queue. This means that we can make much more
    enlightened decisions on what order to flush buffers to disk as
    we don't have multiple places issuing IO. Optimisations to xfsbufd
    will be in a future patch.

    Version 2
    - kill XFS_ITEM_FLUSHING as it is now unused.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

17 Dec, 2009

1 commit

  • For a long time we've always stored bmap btree records in the 64bit format,
    so kill off the dead 32bit type, and make sure the 64bit type is named just
    xfs_bmbt_rec everywhere, without any size postfix.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

02 Sep, 2009

1 commit

  • xfs_trans_iget is a wrapper for xfs_iget that adds the inode to the
    transaction after it is read. Except when the inode already is in the
    inode cache, in which case it returns the existing locked inode with
    incremented lock recursion counts.

    Now, no one in the tree ever decrements these lock recursion counts,
    so any user of this gets a potential double unlock when both the original
    owner of the inode and the xfs_trans_iget caller unlock it. When looking
    back in a git bisect in the historic XFS tree there was only one place
    that decremented these counts, xfs_trans_iput. Introduced in commit
    ca25df7a840f426eb566d52667b6950b92bb84b5 by Adam Sweeney in 1993,
    and removed in commit 19f899a3ab155ff6a49c0c79b06f2f61059afaf3 by
    Steve Lord in 2003. And unless it slipped through git bisect's
    cracks, it was never actually used in that time frame.

    A quick audit of the callers of xfs_trans_iget shows that no caller
    really relies on this behaviour fortunately - xfs_ialloc allocates this
    inode from disk so it must not be there before, and all the RT allocator
    routines only ever add each RT bitmap inode once.

    In addition to removing lots of code and reducing the size of the inode
    item this patch also avoids the double inode cache lookup in each
    create/mkdir/mknod transaction.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Alex Elder
    Signed-off-by: Felix Blyakher

    Christoph Hellwig
     

22 Jan, 2009

1 commit


16 Jan, 2009

1 commit


30 Oct, 2008

1 commit


18 Apr, 2008

1 commit


28 Sep, 2006

1 commit


09 Jun, 2006

1 commit


02 Nov, 2005

2 commits


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds