18 Nov, 2013

1 commit

  • v5 filesystems use 512 byte inodes as a minimum, so read inodes in
    clusters that are effectively half the size of a v4 filesystem with
    256 byte inodes. For v5 fielsystems, scale the inode cluster size
    with the size of the inode so that we keep a constant 32 inodes per
    cluster ratio for all inode IO.

    This only works if mkfs.xfs sets the inode alignment appropriately
    for larger inode clusters, so this functionality is made conditional
    on mkfs doing the right thing. xfs_repair needs to know about
    the inode alignment changes, too.

    Wall time:
    create bulkstat find+stat ls -R unlink
    v4 237s 161s 173s 201s 299s
    v5 235s 163s 205s 31s 356s
    patched 234s 160s 182s 29s 317s

    System time:
    create bulkstat find+stat ls -R unlink
    v4 2601s 2490s 1653s 1656s 2960s
    v5 2637s 2497s 1681s 20s 3216s
    patched 2613s 2451s 1658s 20s 3007s

    So, wall time same or down across the board, system time same or
    down across the board, and cache hit rates all improve except for
    the ls -R case which is a pure cold cache directory read workload
    on v5 filesystems...

    So, this patch removes most of the performance and CPU usage
    differential between v4 and v5 filesystems on traversal related
    workloads.

    Note: while this patch is currently for v5 filesystems only, there
    is no reason it can't be ported back to v4 filesystems. This hasn't
    been done here because bringing the code back to v4 requires
    forwards and backwards kernel compatibility testing. i.e. to
    deterine if older kernels(*) do the right thing with larger inode
    alignments but still only using 8k inode cluster sizes. None of this
    testing and validation on v4 filesystems has been done, so for the
    moment larger inode clusters is limited to v5 superblocks.

    (*) a current default config v4 filesystem should mount just fine on
    2.6.23 (when lazy-count support was introduced), and so if we change
    the alignment emitted by mkfs without a feature bit then we have to
    make sure it works properly on all kernels since 2.6.23. And if we
    allow it to be changed when the lazy-count bit is not set, then it's
    all kernels since v2 logs were introduced that need to be tested for
    compatibility...

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Eric Sandeen
    Signed-off-by: Ben Myers

    Dave Chinner
     

24 Oct, 2013

4 commits

  • Currently the xfs_inode.h header has a dependency on the definition
    of the BMAP btree records as the inode fork includes an array of
    xfs_bmbt_rec_host_t objects in it's definition.

    Move all the btree format definitions from xfs_btree.h,
    xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
    xfs_format.h to continue the process of centralising the on-disk
    format definitions. With this done, the xfs inode definitions are no
    longer dependent on btree header files.

    The enables a massive culling of unnecessary includes, with close to
    200 #include directives removed from the XFS kernel code base.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • xfs_trans.h has a dependency on xfs_log.h for a couple of
    structures. Most code that does transactions doesn't need to know
    anything about the log, but this dependency means that they have to
    include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
    files and clean up the includes to be in dependency order.

    In doing this, remove the direct include of xfs_trans_reserve.h from
    xfs_trans.h so that we remove the dependency between xfs_trans.h and
    xfs_mount.h. Hence the xfs_trans.h include can be moved to the
    indicate the actual dependencies other header files have on it.

    Note that these are kernel only header files, so this does not
    translate to any userspace changes at all.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The on-disk format definitions for the directory and attribute
    structures are spread across 3 header files right now, only one of
    which is dedicated to defining on-disk structures and their
    manipulation (xfs_dir2_format.h). Pull all the format definitions
    into a single header file - xfs_da_format.h - and switch all the
    code over to point at that.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • All of the buffer operations structures are needed to be exported
    for xfs_db, so move them all to a common location rather than
    spreading them all over the place. They are verifying the on-disk
    format, so while xfs_format.h might be a good place, it is not part
    of the on disk format.

    Hence we need to create a new header file that we centralise these
    related definitions. Start by moving the bffer operations
    structures, and then also move all the other definitions that have
    crept into xfs_log_format.h and xfs_format.h as there was no other
    shared header file to put them in.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     

23 Aug, 2013

1 commit


21 Aug, 2013

3 commits


13 Aug, 2013

5 commits

  • With the new xfs_trans_res structure has been introduced, the log
    reservation size, log count as well as log flags are pre-initialized
    at mount time. So it's time to refine xfs_trans_reserve() interface
    to be more neat.

    Also, introduce a new helper M_RES() to return a pointer to the
    mp->m_resv structure to simplify the input.

    Signed-off-by: Jie Liu
    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Jie Liu
     
  • There are a few small helper functions in xfs_util, all related to
    xfs_inode modifications. Move them all to xfs_inode.c so all
    xfs_inode operations are consiolidated in the one place.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • xfs_mount.c is shared with userspace, but the only functions that
    are shared are to do with physical superblock manipulations. This
    means that less than 25% of the xfs_mount.c code is actually shared
    with userspace. Move all the superblock functions to xfs_sb.c and
    share that instead with libxfs.

    Note that this will leave all the in-core transaction related
    superblock counter modifications in xfs_mount.c as none of that is
    shared with userspace. With a few more small changes, xfs_mount.h
    won't need to be shared with userspace anymore, either.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Many of the definitions within xfs_dir2_priv.h are needed in
    userspace outside libxfs. Definitions within xfs_dir2_priv.h are
    wholly contained within libxfs, so we need to shuffle some of the
    definitions around to keep consistency across files shared between
    user and kernel space.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • The on disk format definitions of the on-disk dquot, log formats and
    quota off log formats are all intertwined with other definitions for
    quotas. Separate them out into their own header file so they can
    easily be shared with userspace.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     

23 Jul, 2013

2 commits

  • Start using pquotino and define a macro to check if the
    superblock has pquotino.

    Keep backward compatibilty by alowing mount of older superblock
    with no separate pquota inode.

    Signed-off-by: Chandra Seetharaman
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Chandra Seetharaman
     
  • mkfs doesn't initialize the quota inodes to NULLFSINO as it does for the
    other internal inodes. This leads to two in-core values (0 and NULLFSINO)
    to be checked against, to make sure if a quota inode is valid.

    Solve that problem by initializing the in-core values of all quotaino
    values to NULLFSINO if they are 0 in the disk.

    Note that these values are not written back to on-disk superblock unless
    some quota is enabled on the filesystem. Even in that case sb_pquotino is
    written to disk only if the on-disk superblock supports pquotino

    Signed-off-by: Chandra Seetharaman
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Chandra Seetharaman
     

29 Jun, 2013

1 commit

  • Remove all incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Instead,
    start using XFS_GQUOTA_.* XFS_PQUOTA_.* counterparts for GQUOTA and
    PQUOTA respectively.

    On-disk copy still uses XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD.

    Read and write of the superblock does the conversion from *OQUOTA*
    to *[PG]QUOTA*.

    Signed-off-by: Chandra Seetharaman
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Chandra Seetharaman
     

20 Jun, 2013

1 commit

  • XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if
    mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)"
    branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so
    we always be emitting a warning and returning an error.

    Hence, we can remove it and get rid of a couple of redundant
    check up against it at xfs_upate_alignment().

    Thanks Dave Chinner for the suggestions of simplify the code
    in xfs_parseargs().

    Signed-off-by: Jie Liu
    Cc: Dave Chinner
    Cc: Mark Tinguely
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Jie Liu
     

18 Jun, 2013

1 commit

  • As per the mount man page, sunit and swidth can be changed via
    mount options. For XFS, on the face of it, those options seems
    works if the specified alignments is properly, e.g.
    # mount -o sunit=4096,swidth=8192 /dev/sdb1 /mnt
    # mount | grep sdb1
    /dev/sdb1 on /mnt type xfs (rw,sunit=4096,swidth=8192)

    However, neither sunit nor swidth is shown from the xfs_info output.
    # xfs_info /mnt
    meta-data=/dev/sdb1 isize=256 agcount=4, agsize=262144 blks
    = sectsz=512 attr=2
    data = bsize=4096 blocks=1048576, imaxpct=25
    = sunit=0 swidth=0 blks
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
    naming =version 2 bsize=4096 ascii-ci=0
    log =internal bsize=4096 blocks=2560, version=2
    = sectsz=512 sunit=0 blks, lazy-count=1
    realtime =none extsz=4096 blocks=0, rtextents=0

    The reason is that the alignment can only be changed if the relevant
    super block is already configured with alignments, otherwise, the
    given value is silently ignored.

    With this fix, the attempt to mount a storage without strip alignment
    setup on a super block will get an error with a warning in syslog to
    indicate the true cause, e.g.
    # mount -o sunit=4096,swidth=8192 /dev/sdb1 /mnt
    mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
    missing codepage or helper program, or other error
    In some cases useful info is found in syslog - try
    dmesg | tail or so
    .......
    XFS (sdb1): cannot change alignment: superblock does not support data
    alignment

    Signed-off-by: Jie Liu
    Cc: Mark Tinguely
    Cc: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Jie Liu
     

31 May, 2013

1 commit

  • We write the superblock every 30s or so which results in the
    verifier being called. Right now that results in this output
    every 30s:

    XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
    Use of these features in this kernel is at your own risk!

    And spamming the logs.

    We don't need to check for whether we support v5 superblocks or
    whether there are feature bits we don't support set as these are
    only relevant when we first mount the filesytem. i.e. on superblock
    read. Hence for the write verification we can just skip all the
    checks (and hence verbose output) altogether.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Ben Myers

    Dave Chinner
     

28 Apr, 2013

2 commits

  • The version 5 superblock has extended feature masks for compatible,
    incompatible and read-only compatible feature sets. Implement the
    masking and mount-time checking for these feature masks.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • With the addition of CRCs, there is such a wide and varied change to
    the on disk format that it makes sense to bump the superblock
    version number rather than try to use feature bits for all the new
    functionality.

    This commit introduces all the new superblock fields needed for all
    the new functionality: feature masks similar to ext4, separate
    project quota inodes, a LSN field for recovery and the CRC field.

    This commit does not bump the superblock version number, however.
    That will be done as a separate commit at the end of the series
    after all the new functionality is present so we switch it all on in
    one commit. This means that we can slowly introduce the changes
    without them being active and hence maintain bisectability of the
    tree.

    This patch is based on a patch originally written by myself back
    from SGI days, which was subsequently modified by Christoph Hellwig.
    There is relatively little of that patch remaining, but the history
    of the patch still should be acknowledged here.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     

02 Feb, 2013

3 commits


17 Jan, 2013

1 commit


16 Nov, 2012

6 commits

  • To separate the verifiers from iodone functions and associate read
    and write verifiers at the same time, introduce a buffer verifier
    operations structure to the xfs_buf.

    This avoids the need for assigning the write verifier, clearing the
    iodone function and re-running ioend processing in the read
    verifier, and gets rid of the nasty "b_pre_io" name for the write
    verifier function pointer. If we ever need to, it will also be
    easier to add further content specific callbacks to a buffer with an
    ops structure in place.

    We also avoid needing to export verifier functions, instead we
    can simply export the ops structures for those that are needed
    outside the function they are defined in.

    This patch also fixes a directory block readahead verifier issue
    it exposed.

    This patch also adds ops callbacks to the inode/alloc btree blocks
    initialised by growfs. These will need more work before they will
    work with CRCs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Metadata buffers that are read from disk have write verifiers
    already attached to them, but newly allocated buffers do not. Add
    appropriate write verifiers to all new metadata buffers.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • These verifiers are essentially the same code as the read verifiers,
    but do not require ioend processing. Hence factor the read verifier
    functions and add a new write verifier wrapper that is used as the
    callback.

    This is done as one large patch for all verifiers rather than one
    patch per verifier as the change is largely mechanical. This
    includes hooking up the write verifier via the read verifier
    function.

    Hooking up the write verifier for buffers obtained via
    xfs_trans_get_buf() will be done in a separate patch as that touches
    code in many different places rather than just the verifier
    functions.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Add a superblock verify callback function and pass it into the
    buffer read functions. Remove the now redundant verification code
    that is currently in use.

    Adding verification shows that secondary superblocks never have
    their "sb_inprogress" flag cleared by mkfs.xfs, so when validating
    the secondary superblocks during a grow operation we have to avoid
    checking this field. Even if we fix mkfs, we will still have to
    ignore this field for verification purposes unless a version of mkfs
    that does not have this bug was used.

    Signed-off-by: Dave Chinner
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • With verification being done as an IO completion callback, different
    errors can be returned from a read. Uncached reads only return a
    buffer or NULL on failure, which means the verification error cannot
    be returned to the caller.

    Split the error handling for these reads into two - a failure to get
    a buffer will still return NULL, but a read error will return a
    referenced buffer with b_error set rather than NULL. The caller is
    responsible for checking the error state of the buffer returned.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Add a verifier function callback capability to the buffer read
    interfaces. This will be used by the callers to supply a function
    that verifies the contents of the buffer when it is read from disk.
    This patch does not provide callback functions, but simply modifies
    the interfaces to allow them to be called.

    The reason for adding this to the read interfaces is that it is very
    difficult to tell fom the outside is a buffer was just read from
    disk or whether we just pulled it out of cache. Supplying a callbck
    allows the buffer cache to use it's internal knowledge of the buffer
    to execute it only when the buffer is read from disk.

    It is intended that the verifier functions will mark the buffer with
    an EFSCORRUPTED error when verification fails. This allows the
    reading context to distinguish a verification error from an IO
    error, and potentially take further actions on the buffer (e.g.
    attempt repair) based on the error reported.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Phil White
    Signed-off-by: Ben Myers

    Dave Chinner
     

09 Nov, 2012

1 commit

  • Create a new mount workqueue and delayed_work to enable background
    scanning and freeing of eofblocks inodes. The scanner kicks in once
    speculative preallocation occurs and stops requeueing itself when
    no eofblocks inodes exist.

    The scan interval is based on the new
    'speculative_prealloc_lifetime' tunable (default to 5m). The
    background scanner performs unfiltered, best effort scans (which
    skips inodes under lock contention or with a dirty cache mapping).

    Signed-off-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Brian Foster
     

18 Oct, 2012

3 commits

  • xfs_sync.c now only contains inode reclaim functions and inode cache
    iteration functions. It is not related to sync operations anymore.
    Rename to xfs_icache.c to reflect it's contents and prepare for
    consolidation with the other inode cache file that exists
    (xfs_iget.c).

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • When unmounting the filesystem, there are lots of operations that
    need to be done in a specific order, and they are spread across
    across a couple of functions. We have to drain the AIL before we
    write the unmount record, and we have to shut down the background
    log work before we do either of them.

    But this is all split haphazardly across xfs_unmountfs() and
    xfs_log_unmount(). Move all the AIL flushing and log manipulations
    to xfs_log_unmount() so that the responisbilities of each function
    is clear and the operations they perform obvious.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Dave Chinner
     
  • Instead of starting and stopping background work on the xfs_mount_wq
    all at the same time, separate them to where they really are needed
    to start and stop.

    The xfs_sync_worker, only needs to be started after all the mount
    processing has completed successfully, while it needs to be stopped
    before the log is unmounted.

    The xfs_reclaim_worker is started on demand, and can be
    stopped before the unmount process does it's own inode reclaim pass.

    The xfs_flush_inodes work is run on demand, and so we really only
    need to ensure that it has stopped running before we start
    processing an unmount, freeze or remount,ro.

    Signed-off-by: Dave Chinner
    Reviewed-by: Mark Tinguely
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
     

27 Sep, 2012

1 commit

  • Add xfs_set_inode32() to be used to enable inode32 allocation mode. this
    will reduce the amount of duplicated code needed to mount/remount a
    filesystem with inode32 option. This patch also changes
    xfs_set_inode64() to return the maximum AG number that inodes can be
    allocated instead of set mp->m_maxagi by itself, so that the behaviour
    is the same as xfs_set_inode32(). This simplifies code that calls these
    functions and needs to know the maximum AG that inodes can be allocated
    in.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Carlos Maiolino
     

02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

31 Jul, 2012

1 commit

  • Generic code now blocks all writers from standard write paths. So we add
    blocking of all writers coming from ioctl (we get a protection of ioctl against
    racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
    non-racy freeze protection. We also keep freeze protection on transaction
    start to block internal filesystem writes such as removal of preallocated
    blocks.

    CC: Ben Myers
    CC: Alex Elder
    CC: xfs@oss.sgi.com
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

30 Jul, 2012

1 commit

  • v2: Add the xfs_buf_lock to xfs_quiesce_attr().
    Add explaination why xfs_buf_lock() is used to wait for write.

    xfs_wait_buftarg() does not wait for the completion of the write of the
    uncached superblock. This write can race with the shutdown of the log
    and causes a panic if the write does not win the race.

    During the log write, xfsaild_push() will lock the buffer and set the
    XBF_ASYNC flag. Because the XBF_FLAG is set, complete() is not performed
    on the buffer's iowait entry, we cannot call xfs_buf_iowait() to wait
    for the write to complete. The buffer's lock is held until the write is
    complete, so we can block on a xfs_buf_lock() request to be notified
    that the write is complete.

    Signed-off-by: Mark Tinguely
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Mark Tinguely