26 Sep, 2016

1 commit

  • Recently we've had a number of reports where log recovery on a v5
    filesystem has reported corruptions that looked to be caused by
    recovery being re-run over the top of already-recovered
    metadata. This has uncovered a bug in recovery (fixed elsewhere)
    but the vector that caused this was largely unknown.

    A kdump test started tripping over this problem - the system
    would be crashed, the kdump kernel and environment would boot and
    dump the kernel core image, and then the system would reboot. After
    reboot, the root filesystem was triggering log recovery and
    corruptions were being detected. The metadumps indicated the above
    log recovery issue.

    What is happening is that the kdump kernel and environment are
    mounting the root device read-only to find the binaries needed to do
    their work. The result of this is that it is running log recovery.
    However, because there were unlinked files and EFIs to be processed
    by recovery, the completion of phase 1 of log recovery could not
    mark the log clean. And because it's a read-only mount, the unmount
    process does not write records to the log to mark it clean, either.
    Hence on the next mount of the filesystem, log recovery was run
    again across all the metadata that had already been recovered and
    this is what triggered corruption warnings.

    To avoid this problem, we need to ensure that a read-only mount
    always updates the log when it completes the second phase of
    recovery. We already handle this sort of issue with rw->ro remount
    transitions, so the solution is as simple as quiescing the
    filesystem at the appropriate time during the mount process. This
    results in the log being marked clean so the mount behaviour
    recorded in the logs on repeated RO mounts will change (i.e. log
    recovery will no longer be run on every mount until a RW mount is
    done). This is a user visible change in behaviour, but it is
    harmless.

    Signed-off-by: Dave Chinner
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dave Chinner
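
The failure mode described above can be pictured with a small user-space model. This is only a sketch under stated assumptions: `toy_fs`, `toy_mount` and the `quiesce_ro` flag are hypothetical names invented for illustration, not kernel code, and the model assumes a read-write mount always leaves the log clean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_fs {
    bool log_dirty;      /* log has not been marked clean */
    int  recovery_runs;  /* how many times the log was replayed */
};

/*
 * Mount the toy filesystem once. When quiesce_ro is true, a read-only
 * mount marks the log clean after recovery, mimicking the fix. When it
 * is false, a read-only mount leaves the log dirty, mimicking the bug.
 */
static void toy_mount(struct toy_fs *fs, bool readonly, bool quiesce_ro)
{
    assert(fs != NULL);

    if (fs->log_dirty)
        fs->recovery_runs++;  /* replay the log over the metadata */

    /* Assumption: a read-write mount always ends with a clean log. */
    if (!readonly || quiesce_ro)
        fs->log_dirty = false;
}
```

In this model, a read-only mount followed by a read-write mount replays the log twice without the quiesce and only once with it, which matches the change in behaviour the commit message describes.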
     

01 Jun, 2016

1 commit

  • Al Viro noticed that xfs_lock_inodes should be static, and
    that led to ... a few more.

    These are just the easy ones, others require moving functions
    higher in source files, so that's not done here to keep
    this review simple.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

02 Mar, 2016

1 commit

  • inode32/inode64 allocator behavior with respect to mount, remount
    and growfs is a little tricky.

    The inode32 mount option should only enable the inode32 allocator
    heuristics if the filesystem is large enough for 64-bit inodes to
    exist. Today, it has this behavior on the initial mount, but a
    remount with inode32 unconditionally changes the allocation
    heuristics, even for a small fs.

    Also, an inode32 mounted small filesystem should transition to the
    inode32 allocator if the filesystem is subsequently grown to a
    sufficient size. Today that does not happen.

    This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a
    single new function, and moves the "is the maximum inode number big
    enough to matter" test into that function, so it doesn't rely on the
    caller to get it right - which remount did not do, previously.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
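
A rough sketch of the consolidation idea, in user-space C. Everything here is an assumption made for illustration: `toy_mount`, `toy_set_inode_alloc` and the dense inode numbering (`inodes_per_ag` inodes per AG, numbered consecutively) are hypothetical and much simpler than the real XFS inode number layout. The point is only that the "is the maximum inode number big enough to matter" test lives inside the one function, so mount, remount and growfs all get the same answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy structure, not the kernel's mount structure. */
struct toy_mount {
    bool     want_inode32;    /* inode32 mount option was requested */
    bool     inode32_active;  /* inode32 heuristics are in effect */
    uint64_t inodes_per_ag;   /* simplification: dense inode numbering */
};

/*
 * Decide the allocator mode for a filesystem with 'agcount' AGs and
 * return how many AGs may be used for inode allocation.
 */
static uint32_t toy_set_inode_alloc(struct toy_mount *mp, uint32_t agcount)
{
    assert(agcount > 0 && mp->inodes_per_ag > 0);

    uint64_t max_ino = (uint64_t)agcount * mp->inodes_per_ag - 1;

    /* inode32 only matters if some inode number needs more than 32 bits. */
    if (mp->want_inode32 && max_ino > UINT32_MAX) {
        uint64_t limit = ((uint64_t)UINT32_MAX + 1) / mp->inodes_per_ag;

        mp->inode32_active = true;
        return limit ? (uint32_t)limit : 1;  /* keep at least one AG */
    }

    mp->inode32_active = false;
    return agcount;
}
```

With `inodes_per_ag` set to 2^30, calling the function with 2 AGs leaves the inode32 heuristics off and returns 2, while calling it again with 8 AGs (as after a grow) turns them on and returns 4.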
     

23 Feb, 2015

1 commit

  • Now that the in-core superblock infrastructure has been replaced with
    generic per-cpu counters, we don't need it anymore. Nuke it from
    orbit so we are sure that it won't haunt us again...

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

30 Jul, 2014

1 commit

  • Trying to support tiny disks only and saving a bit of memory might have
    made sense on an SGI O2 15 years ago, but is pretty pointless today.

    Remove the rarely tested codepath that uses various smaller in-memory
    types to reduce our test matrix and make the codebase a little bit
    smaller and less complicated.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ben Myers
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

24 Jul, 2014

1 commit

  • Today, if we perform an xfs_growfs which adds allocation groups,
    mp->m_maxagi is not properly updated when the growfs is complete.

    Therefore inodes will continue to be allocated only in the
    AGs which existed prior to the growfs, and the new space
    won't be utilized.

    This is because of this path in xfs_growfs_data_private():

    xfs_growfs_data_private
      xfs_initialize_perag(mp, nagcount, &nagimax);
        if (mp->m_flags & XFS_MOUNT_32BITINODES)
          index = xfs_set_inode32(mp);
        else
          index = xfs_set_inode64(mp);

        if (maxagi)
          *maxagi = index;

    where xfs_set_inode* iterates over the (old) agcount in
    mp->m_sb.sb_agcount, which has not yet been updated
    in the growfs path. So "index" will be returned based on
    the old agcount, not the new one, and new AGs are not available
    for inode allocation.

    Fix this by explicitly passing the proper AG count (which
    xfs_initialize_perag() already has) down another level,
    so that xfs_set_inode* can make the proper decision about
    acceptable AGs for inode allocation in the potentially
    newly-added AGs.

    This has been broken since 3.7, when these two
    xfs_set_inode* functions were added in commit 2d2194f.
    Prior to that, we looped over "agcount" not sb_agcount
    in these calculations.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
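
The loop-bound problem can be shown in miniature. This is a hedged user-space sketch, not the kernel functions: `toy_mount`, `stale_agcount` and both helper names are hypothetical, and the per-AG eligibility check is reduced to counting AGs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy structure; stale_agcount stands in for an AG count
 * that has not yet been updated while a grow is in progress. */
struct toy_mount {
    uint32_t stale_agcount;
};

/* Buggy shape: the loop bound comes from the not-yet-updated count. */
static uint32_t toy_max_agi_stale(const struct toy_mount *mp)
{
    uint32_t index, usable = 0;

    assert(mp != NULL);
    for (index = 0; index < mp->stale_agcount; index++)
        usable++;  /* stand-in for "this AG may hold inodes" */
    return usable;
}

/* Fixed shape: the caller passes the AG count it already knows. */
static uint32_t toy_max_agi(const struct toy_mount *mp, uint32_t agcount)
{
    uint32_t index, usable = 0;

    assert(mp != NULL);
    for (index = 0; index < agcount; index++)
        usable++;
    return usable;
}
```

Growing a toy filesystem from 4 to 8 AGs, the first helper still reports 4 usable AGs while the second reports 8, which is the difference the fix is after.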
     

18 Oct, 2012

1 commit

  • We don't do any data writeback from XFS any more - the VFS is
    completely responsible for that, including for freeze. We can
    replace the remaining caller with a VFS level function that
    achieves the same thing, but without conflicting with current
    writeback work.

    This means we can remove the flush_work and xfs_flush_inodes() - the
    VFS functionality completely replaces the internal flush queue for
    doing this writeback work in a separate context to avoid stack
    overruns.

    This does have one complication - it cannot be called with page
    locks held. Hence move the flushing of delalloc space when ENOSPC
    occurs back up into xfs_file_aio_buffered_write when we don't hold
    any locks that will stall writeback.

    Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to
    trigger delalloc conversion fast enough to prevent spurious ENOSPC
    when there are hundreds of writers, thousands of small files and GBs
    of free RAM. Hence we need to use sync_sb_inodes() to block callers
    while we wait for writeback like the previous xfs_flush_inodes
    implementation did.

    That means we have to hold the s_umount lock here, but because this
    call can nest inside i_mutex (the parent directory in the create
    case, held by the VFS), we have to use down_read_trylock() to avoid
    potential deadlocks. In practice, this trylock will succeed on
    almost every attempt as unmount/remount type operations are
    exceedingly rare.

    Note: we always need to pass a count of zero to
    generic_file_buffered_write() as the previously written byte count.
    Before this patch we only did this by accident, by virtue of ret
    always being zero when there are no errors. Make this explicit
    rather than needing to specifically zero ret in the ENOSPC retry
    case.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Dave Chinner
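
The trylock pattern described above can be sketched in user space with C11 atomics. This is an analogue only: `toy_rwsem` and every `toy_*` function are hypothetical stand-ins written for this example, not the kernel's rwsem implementation, and the flush itself is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy reader/writer lock: -1 = writer holds it,
 * n >= 0 = number of readers currently holding it. */
struct toy_rwsem {
    atomic_int state;
};

/* Try to take the lock for reading without ever blocking. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    int cur = atomic_load(&sem->state);

    while (cur >= 0) {
        /* On failure, cur is reloaded and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&sem->state, &cur, cur + 1))
            return true;
    }
    return false;  /* a writer (e.g. unmount) holds the lock */
}

static void toy_up_read(struct toy_rwsem *sem)
{
    int prev = atomic_fetch_sub(&sem->state, 1);

    assert(prev > 0);  /* must have been held for reading */
    (void)prev;
}

/* Try to take the lock for writing; succeeds only if it is idle. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&sem->state, &expected, -1);
}

/*
 * Flush only if the umount lock can be taken without blocking, so a
 * caller already holding another lock cannot deadlock against unmount.
 */
static bool toy_flush_inodes(struct toy_rwsem *s_umount, int *flushes)
{
    if (!toy_down_read_trylock(s_umount))
        return false;  /* skip the flush rather than wait */
    (*flushes)++;      /* stand-in for the blocking writeback */
    toy_up_read(s_umount);
    return true;
}
```

The design choice is to skip the flush when the trylock fails rather than wait: the commit message argues that contention with unmount or remount is rare enough that this almost never happens in practice.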
     

27 Sep, 2012

1 commit

  • Add xfs_set_inode32() to be used to enable inode32 allocation mode. This
    will reduce the amount of duplicated code needed to mount/remount a
    filesystem with the inode32 option. This patch also changes
    xfs_set_inode64() to return the maximum AG number that inodes can be
    allocated in, instead of setting mp->m_maxagi by itself, so that the
    behaviour is the same as xfs_set_inode32(). This simplifies code that
    calls these functions and needs to know the maximum AG that inodes can
    be allocated in.

    Signed-off-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Mark Tinguely
    Signed-off-by: Ben Myers

    Carlos Maiolino
     

15 Mar, 2012

1 commit

  • If we initialize the slab caches for the quota code when XFS is loaded there
    is no need for a global and reference counted quota manager structure. Drop
    all this overhead and also fix the error handling during quota initialization.

    Reviewed-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

13 Aug, 2011

1 commit

  • Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
    annoying subdirectories in the XFS source code. Besides the large
    number of file renames, the only changes are to the Makefile, a few
    files including headers with the subdirectory prefix, and the binary
    sysctl compat code that includes a header under fs/xfs/ from
    kernel/.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Christoph Hellwig