03 Oct, 2016

1 commit


26 Sep, 2016

1 commit

  • When adding a new remote attribute, we write the attribute to the
    new extent before the allocation transaction is committed. This
    means we cannot reuse busy extents as that violates crash
    consistency semantics. Hence we currently treat remote attribute
    extent allocation like userdata because it has the same overwrite
    ordering constraints as userdata.

    Unfortunately, this also allows the allocator to incorrectly apply
    extent size hints to the remote attribute extent allocation. This
    results in interesting failures, such as transaction block
    reservation overruns and in-memory inode attribute fork corruption.

    To fix this, we need to separate the busy extent reuse configuration
    from the userdata configuration. This changes the definition of
    XFS_BMAPI_METADATA slightly - it now means that allocation is
    metadata and reuse of busy extents is acceptible due to the metadata
    ordering semantics of the journal. If this flag is not set, it
    means the allocation is that has unordered data writeback, and hence
    busy extent reuse is not allowed. It no longer implies the
    allocation is for user data, just that the data write will not be
    strictly ordered. This matches the semantics for both user data
    and remote attribute block allocation.

    As such, This patch changes the "userdata" field to a "datatype"
    field, and adds a "no busy reuse" flag to the field.
    When we detect an unordered data extent allocation, we immediately set
    the no reuse flag. We then set the "user data" flags based on the
    inode fork we are allocating the extent to. Hence we only set
    userdata flags on data fork allocations now and consider attribute
    fork remote extents to be an unordered metadata extent.

    The result is that remote attribute extents now have the expected
    allocation semantics, and the data fork allocation behaviour is
    completely unchanged.

    It should be noted that there may be other ways to fix this (e.g.
    use ordered metadata buffers for the remote attribute extent data
    write) but they are more invasive and difficult to validate both
    from a design and implementation POV. Hence this patch takes the
    simple, obvious route to fixing the problem...

    Reported-and-tested-by: Ross Zwisler
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

19 Sep, 2016

1 commit

  • One unfortunate quirk of the reference count and reverse mapping
    btrees -- they can expand in size when blocks are written to *other*
    allocation groups if, say, one large extent becomes a lot of tiny
    extents. Since we don't want to start throwing errors in the middle
    of CoWing, we need to reserve some blocks to handle future expansion.
    The transaction block reservation counters aren't sufficient here
    because we have to have a reserve of blocks in every AG, not just
    somewhere in the filesystem.

    Therefore, create two per-AG block reservation pools. One feeds the
    AGFL so that rmapbt expansion always succeeds, and the other feeds all
    other metadata so that refcountbt expansion never fails.

    Use the count of how many reserved blocks we need to have on hand to
    create a virtual reservation in the AG. Through selective clamping of
    the maximum length of allocation requests and of the length of the
    longest free extent, we can make it look like there's less free space
    in the AG unless the reservation owner is asking for blocks.

    In other words, play some accounting tricks in-core to make sure that
    we always have blocks available. On the plus side, there's nothing to
    clean up if we crash, which is contrast to the strategy that the rough
    draft used (actually removing extents from the freespace btrees).

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     

03 Aug, 2016

2 commits


09 Feb, 2016

1 commit

  • Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
    the xfs_icdinode byte another 2 bytes and collapsing another 2 byte hole
    in the structure.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

22 Jun, 2015

2 commits

  • We no longer calculate the minimum freelist size from the on-disk
    AGF, so we don't need the macros used for this. That means the
    nested macros can be cleaned up, and turn this into an actual
    function so the logic is clear and concise. This will make it much
    easier to add support for the rmap btree when the time comes.

    This also gets rid of the XFS_AG_MAXLEVELS macro used by these
    freelist macros as it is simply a wrapper around a single variable.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • At the moment, xfs_alloc_fix_freelist() uses a mix of per-ag based
    access and agf buffer based access to freelist and space usage
    information. However, once the AGF buffer is locked inside this
    function, it is guaranteed that both the in-memory and on-disk
    values are identical. xfs_alloc_fix_freelist() doesn't modify the
    values in the structures directly, so it is a read-only user of the
    infomration, and hence can use the per-ag structure exclusively for
    determining what it should do.

    This opens up an avenue for cleaning up a lot of duplicated logic
    whose only difference is the structure it gets the data from, and in
    doing so removes a lot of needless byte swapping overhead when
    fixing up the free list.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

16 Apr, 2015

1 commit


25 Mar, 2015

1 commit


28 Nov, 2014

3 commits


25 Jun, 2014

1 commit

  • Convert all the errors the core XFs code to negative error signs
    like the rest of the kernel and remove all the sign conversion we
    do in the interface layers.

    Errors for conversion (and comparison) found via searches like:

    $ git grep " E" fs/xfs
    $ git grep "return E" fs/xfs
    $ git grep " E[A-Z].*;$" fs/xfs

    Negation points found via searches like:

    $ git grep "= -[a-z,A-Z]" fs/xfs
    $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
    $ git grep " -[a-z].*;" fs/xfs

    [ with some bits I missed from Brian Foster ]

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

23 Apr, 2014

6 commits


24 Oct, 2013

2 commits

  • Currently the xfs_inode.h header has a dependency on the definition
    of the BMAP btree records as the inode fork includes an array of
    xfs_bmbt_rec_host_t objects in it's definition.

    Move all the btree format definitions from xfs_btree.h,
    xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
    xfs_format.h to continue the process of centralising the on-disk
    format definitions. With this done, the xfs inode definitions are no
    longer dependent on btree header files.

    The enables a massive culling of unnecessary includes, with close to
    200 #include directives removed from the XFS kernel code base.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • xfs_trans.h has a dependency on xfs_log.h for a couple of
    structures. Most code that does transactions doesn't need to know
    anything about the log, but this dependency means that they have to
    include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
    files and clean up the includes to be in dependency order.

    In doing this, remove the direct include of xfs_trans_reserve.h from
    xfs_trans.h so that we remove the dependency between xfs_trans.h and
    xfs_mount.h. Hence the xfs_trans.h include can be moved to the
    indicate the actual dependencies other header files have on it.

    Note that these are kernel only header files, so this does not
    translate to any userspace changes at all.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     

13 Aug, 2013

3 commits

  • There are a few small helper functions in xfs_util, all related to
    xfs_inode modifications. Move them all to xfs_inode.c so all
    xfs_inode operations are consiolidated in the one place.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • There is a bunch of code in xfs_bmap.c that is kernel specific and
    not shared with userspace. To minimise the difference between the
    kernel and userspace code, shift this unshared code to
    xfs_bmap_util.c, and the declarations to xfs_bmap_util.h.

    The biggest issue here is xfs_bmap_finish() - userspace has it's own
    definition of this function, and so we need to move it out of
    xfs_bmap.[ch]. This means several other files need to include
    xfs_bmap_util.h as well.

    It also introduces and interesting dance for the stack switching
    code in xfs_bmapi_allocate(). The stack switching/workqueue code is
    actually moved to xfs_bmap_util.c, so that userspace can simply use
    a #define in a header file to connect the dots without needing to
    know about the stack switch code at all.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The log item format definitions are shared with userspace. Split
    them out of header files that contain kernel only defintions to make
    it simple to shared them.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

12 Oct, 2011

2 commits


27 Jul, 2011

2 commits


11 Nov, 2010

1 commit

  • The filestreams code may take the iolock on the parent inode while
    holding it on a child. This is the only place in XFS where we take
    both the child and parent iolock, so just telling lockdep about it
    is enough. The lock flag required for that was already added as
    part of the ilock lockdep annotations and unused so far.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

27 Jul, 2010

3 commits

  • Move xfs_filestream_peek_ag, xxfs_filestream_get_ag and xfs_filestream_put_ag
    from xfs_filestream.h to xfs_filestream.c where it's only callers are, and
    remove the inline marker while we're at it to let the compiler decide on the
    inlining. Also don't return a value from xfs_filestream_put_ag because
    we don't need it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Dmapi support was never merged upstream, but we still have a lot of hooks
    bloating XFS for it, all over the fast pathes of the filesystem.

    This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
    support in mainline at least the namespace events can be done much saner
    in the VFS instead of the individual filesystem, so it's not like this
    is much help for future work.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     

16 Jan, 2010

3 commits

  • The filestreams cache flush is not needed in the sync code as it
    does not affect data writeback, and it is now not used by the growfs
    code, either, so kill it.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • The use of an array for the per-ag structures requires reallocation
    of the array when growing the filesystem. This requires locking
    access to the array to avoid use after free situations, and the
    locking is difficult to get right. To avoid needing to reallocate an
    array, change the per-ag structures to an allocated object per ag
    and index them using a tree structure.

    The AGs are always densely indexed (hence the use of an array), but
    the number supported is 2^32 and lookups tend to be random and hence
    indexing needs to scale. A simple choice is a radix tree - it works
    well with this sort of index. This change also removes another
    large contiguous allocation from the mount/growfs path in XFS.

    The growing process now needs to change to only initialise the new
    AGs required for the extra space, and as such only needs to
    exclusively lock the tree for inserts. The rest of the code only
    needs to lock the tree while doing lookups, and hence this will
    remove all the deadlocks that currently occur on the m_perag_lock as
    it is now an innermost lock. The lock is also changed to a spinlock
    from a read/write lock as the hold time is now extremely short.

    To complete the picture, the per-ag structures will need to be
    reference counted to ensure that we don't free/modify them while
    they are still in use. This will be done in subsequent patch.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • Use xfs_perag_get() and xfs_perag_put() in the filestreams code.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     

15 Dec, 2009

1 commit

  • Convert the old xfs tracing support that could only be used with the
    out of tree kdb and xfsidbg patches to use the generic event tracer.

    To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
    all xfs trace channels by:

    echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

    or alternatively enable single events by just doing the same in one
    event subdirectory, e.g.

    echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable

    or set more complex filters, etc. In Documentation/trace/events.txt
    all this is desctribed in more detail. To reads the events do a

    cat /sys/kernel/debug/tracing/trace

    Compared to the last posting this patch converts the tracing mostly to
    the one tracepoint per callsite model that other users of the new
    tracing facility also employ. This allows a very fine-grained control
    of the tracing, a cleaner output of the traces and also enables the
    perf tool to use each tracepoint as a virtual performance counter,
    allowing us to e.g. count how often certain workloads git various
    spots in XFS. Take a look at

    http://lwn.net/Articles/346470/

    for some examples.

    Also the btree tracing isn't included at all yet, as it will require
    additional core tracing features not in mainline yet, I plan to
    deliver it later.

    And the really nice thing about this patch is that it actually removes
    many lines of code while adding this nice functionality:

    fs/xfs/Makefile | 8
    fs/xfs/linux-2.6/xfs_acl.c | 1
    fs/xfs/linux-2.6/xfs_aops.c | 52 -
    fs/xfs/linux-2.6/xfs_aops.h | 2
    fs/xfs/linux-2.6/xfs_buf.c | 117 +--
    fs/xfs/linux-2.6/xfs_buf.h | 33
    fs/xfs/linux-2.6/xfs_fs_subr.c | 3
    fs/xfs/linux-2.6/xfs_ioctl.c | 1
    fs/xfs/linux-2.6/xfs_ioctl32.c | 1
    fs/xfs/linux-2.6/xfs_iops.c | 1
    fs/xfs/linux-2.6/xfs_linux.h | 1
    fs/xfs/linux-2.6/xfs_lrw.c | 87 --
    fs/xfs/linux-2.6/xfs_lrw.h | 45 -
    fs/xfs/linux-2.6/xfs_super.c | 104 ---
    fs/xfs/linux-2.6/xfs_super.h | 7
    fs/xfs/linux-2.6/xfs_sync.c | 1
    fs/xfs/linux-2.6/xfs_trace.c | 75 ++
    fs/xfs/linux-2.6/xfs_trace.h | 1369 +++++++++++++++++++++++++++++++++++++++++
    fs/xfs/linux-2.6/xfs_vnode.h | 4
    fs/xfs/quota/xfs_dquot.c | 110 ---
    fs/xfs/quota/xfs_dquot.h | 21
    fs/xfs/quota/xfs_qm.c | 40 -
    fs/xfs/quota/xfs_qm_syscalls.c | 4
    fs/xfs/support/ktrace.c | 323 ---------
    fs/xfs/support/ktrace.h | 85 --
    fs/xfs/xfs.h | 16
    fs/xfs/xfs_ag.h | 14
    fs/xfs/xfs_alloc.c | 230 +-----
    fs/xfs/xfs_alloc.h | 27
    fs/xfs/xfs_alloc_btree.c | 1
    fs/xfs/xfs_attr.c | 107 ---
    fs/xfs/xfs_attr.h | 10
    fs/xfs/xfs_attr_leaf.c | 14
    fs/xfs/xfs_attr_sf.h | 40 -
    fs/xfs/xfs_bmap.c | 507 +++------------
    fs/xfs/xfs_bmap.h | 49 -
    fs/xfs/xfs_bmap_btree.c | 6
    fs/xfs/xfs_btree.c | 5
    fs/xfs/xfs_btree_trace.h | 17
    fs/xfs/xfs_buf_item.c | 87 --
    fs/xfs/xfs_buf_item.h | 20
    fs/xfs/xfs_da_btree.c | 3
    fs/xfs/xfs_da_btree.h | 7
    fs/xfs/xfs_dfrag.c | 2
    fs/xfs/xfs_dir2.c | 8
    fs/xfs/xfs_dir2_block.c | 20
    fs/xfs/xfs_dir2_leaf.c | 21
    fs/xfs/xfs_dir2_node.c | 27
    fs/xfs/xfs_dir2_sf.c | 26
    fs/xfs/xfs_dir2_trace.c | 216 ------
    fs/xfs/xfs_dir2_trace.h | 72 --
    fs/xfs/xfs_filestream.c | 8
    fs/xfs/xfs_fsops.c | 2
    fs/xfs/xfs_iget.c | 111 ---
    fs/xfs/xfs_inode.c | 67 --
    fs/xfs/xfs_inode.h | 76 --
    fs/xfs/xfs_inode_item.c | 5
    fs/xfs/xfs_iomap.c | 85 --
    fs/xfs/xfs_iomap.h | 8
    fs/xfs/xfs_log.c | 181 +----
    fs/xfs/xfs_log_priv.h | 20
    fs/xfs/xfs_log_recover.c | 1
    fs/xfs/xfs_mount.c | 2
    fs/xfs/xfs_quota.h | 8
    fs/xfs/xfs_rename.c | 1
    fs/xfs/xfs_rtalloc.c | 1
    fs/xfs/xfs_rw.c | 3
    fs/xfs/xfs_trans.h | 47 +
    fs/xfs/xfs_trans_buf.c | 62 -
    fs/xfs/xfs_vnodeops.c | 8
    70 files changed, 2151 insertions(+), 2592 deletions(-)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig
     

08 Jun, 2009

1 commit

  • xfs_sync_inodes is used to write back either file data or inode metadata.
    In general we always do these separately, except for one fishy case in
    xfs_fs_put_super that does both. So separate xfs_sync_inodes into
    separate xfs_sync_data and xfs_sync_attr functions. In xfs_fs_put_super
    we first call the data sync and then the attr sync as that was the previous
    order. The moved log force in that path doesn't make a difference because
    we will force the log again as part of the real unmount process.

    The filesystem readonly checks are not performed by the new function but
    instead moved into the callers, given that most callers alredy have it
    further up in the stack. Also add debug checks that we do not pass in
    incorrect flags in the new xfs_sync_data and xfs_sync_attr function and
    fix the one place that did pass in a wrong flag.

    Also remove a comment mentioning xfs_sync_inodes that has been incorrect
    for a while because we always take either the iolock or ilock in the
    sync path these days.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen

    Christoph Hellwig
     

16 Mar, 2009

1 commit