06 Oct, 2016

7 commits

  • Implement swapext for filesystems that have reverse mapping. Back in
    the reflink patches, we augmented the bmap code with a 'REMAP' flag
    that updates only the bmbt and doesn't touch the allocator and
    implemented log redo items for those two operations. Now we can
    rewrite extent swapping as a (looong) series of remap operations.

    This is far less efficient than the fork swapping method implemented
    in the past, so we only switch this on for rmap.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Refactor the swapext function to pull out the fork swapping piece
    into a separate function. In the next patch we'll add in the bit
    we need to make it work with rmap filesystems.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Replace structure typedefs with struct expressions and fix some
    whitespace issues that result.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Trim CoW reservations made on behalf of a cowextsz hint if they get too
    old or we run low on quota, so long as we don't have dirty data awaiting
    writeback or directio operations in progress.

    Garbage collection of the cowextsize extents are kept separate from
    prealloc extent reaping because setting the CoW prealloc lifetime to a
    (much) higher value than the regular prealloc extent lifetime has been
    useful for combatting CoW fragmentation on VM hosts where the VMs
    experience bursty write behaviors and we can keep the utilization ratios
    low enough that we don't start to run out of space. IOWs, it benefits
    us to keep the CoW fork reservations around for as long as we can unless
    we run out of blocks or hit inode reclaim.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Create a per-inode extent size allocator hint for copy-on-write. This
    hint is separate from the existing extent size hint so that CoW can
    take advantage of the fragmentation-reducing properties of extent size
    hints without disabling delalloc for regular writes.

    The extent size hint that's fed to the allocator during a copy on
    write operation is the greater of the cowextsize and regular extsize
    hint.

    During reflink, if we're sharing the entire source file to the entire
    destination file and the destination file doesn't already have a
    cowextsize hint, propagate the source file's cowextsize hint to the
    destination file.

    Furthermore, zero the bulkstat buffer prior to setting the fields
    so that we don't copy kernel memory contents into userspace.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • When we're swapping the extents of two inodes, be sure to swap the
    reflink inode flag too.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Teach xfs_getbmapx how to report shared extents and CoW fork contents
    accurately in the bmap output by querying the refcount btree
    appropriately.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

26 Sep, 2016

1 commit

  • When adding a new remote attribute, we write the attribute to the
    new extent before the allocation transaction is committed. This
    means we cannot reuse busy extents as that violates crash
    consistency semantics. Hence we currently treat remote attribute
    extent allocation like userdata because it has the same overwrite
    ordering constraints as userdata.

    Unfortunately, this also allows the allocator to incorrectly apply
    extent size hints to the remote attribute extent allocation. This
    results in interesting failures, such as transaction block
    reservation overruns and in-memory inode attribute fork corruption.

    To fix this, we need to separate the busy extent reuse configuration
    from the userdata configuration. This changes the definition of
    XFS_BMAPI_METADATA slightly - it now means that allocation is
    metadata and reuse of busy extents is acceptible due to the metadata
    ordering semantics of the journal. If this flag is not set, it
    means the allocation is that has unordered data writeback, and hence
    busy extent reuse is not allowed. It no longer implies the
    allocation is for user data, just that the data write will not be
    strictly ordered. This matches the semantics for both user data
    and remote attribute block allocation.

    As such, This patch changes the "userdata" field to a "datatype"
    field, and adds a "no busy reuse" flag to the field.
    When we detect an unordered data extent allocation, we immediately set
    the no reuse flag. We then set the "user data" flags based on the
    inode fork we are allocating the extent to. Hence we only set
    userdata flags on data fork allocations now and consider attribute
    fork remote extents to be an unordered metadata extent.

    The result is that remote attribute extents now have the expected
    allocation semantics, and the data fork allocation behaviour is
    completely unchanged.

    It should be noted that there may be other ways to fix this (e.g.
    use ordered metadata buffers for the remote attribute extent data
    write) but they are more invasive and difficult to validate both
    from a design and implementation POV. Hence this patch takes the
    simple, obvious route to fixing the problem...

    Reported-and-tested-by: Ross Zwisler
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

03 Aug, 2016

6 commits

  • Swapping extents between two inodes requires the owner to be updated
    in the rmap tree for all the extents that are swapped. This code
    does not yet exist, so switch off the XFS_IOC_SWAPEXT ioctl until
    support has been implemented. This will need to be done before the
    rmap btree code can have the experimental tag removed.

    This functionality will be provided in a (much) later patch, using
    some of the reflink deferred block remapping functionality to
    accomlish extent swapping with rmap updates.

    Signed-off-by: Dave Chinner
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When we map, unmap, or convert an extent in a file's data or attr
    fork, schedule a respective update in the rmapbt. Previous versions
    of this patch required a 1:1 correspondence between bmap and rmap,
    but this is no longer true as we now have ability to make interval
    queries against the rmapbt.

    We use the deferred operations code to handle redo operations
    atomically and deadlock free. This plumbs in all five rmap actions
    (map, unmap, convert extent, alloc, free); we'll use the first three
    now for file data, and reflink will want the last two. We also add
    an error injection site to test log recovery.

    Finally, we need to fix the bmap shift extent code to adjust the
    rmaps correctly.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Mechanical change of flist/free_list to dfops, since they're now
    deferred ops, not just a freeing list.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Drop the compatibility shims that we were using to integrate the new
    deferred operation mechanism into the existing code. No new code.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Restructure everything that used xfs_bmap_free to use xfs_defer_ops
    instead. For now we'll just remove the old symbols and play some
    cpp magic to make it work; in the next patch we'll actually rename
    everything.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When we're deleting realtime extents, we need to lock the summary
    inode in case we need to update the summary info to prevent an assert
    on the rsumip inode lock on a debug kernel. While we're at it, fix
    the locking annotations so that we avoid triggering lockdep warnings.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     

21 Jun, 2016

5 commits


01 Jun, 2016

2 commits

  • Al Viro noticed that xfs_lock_inodes should be static, and
    that led to ... a few more.

    These are just the easy ones, others require moving functions
    higher in source files, so that's not done here to keep
    this review simple.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • The static checker reports that after commit 8d99fe92fed0 ("xfs: fix
    efi/efd error handling to avoid fs shutdown hangs"), the code has been
    reworked such that error == -EFSCORRUPTED is not possible in this
    codepath.

    Remove the spurious error check and just use SHUTDOWN_META_IO_ERROR
    unconditionally.

    Signed-off-by: Brian Foster
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     

27 May, 2016

2 commits

  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     
  • Pull xfs updates from Dave Chinner:
    "A pretty average collection of fixes, cleanups and improvements in
    this request.

    Summary:
    - fixes for mount line parsing, sparse warnings, read-only compat
    feature remount behaviour
    - allow fast path symlink lookups for inline symlinks.
    - attribute listing cleanups
    - writeback goes direct to bios rather than indirecting through
    bufferheads
    - transaction allocation cleanup
    - optimised kmem_realloc
    - added configurable error handling for metadata write errors,
    changed default error handling behaviour from "retry forever" to
    "retry until unmount then fail"
    - fixed several inode cluster writeback lookup vs reclaim race
    conditions
    - fixed inode cluster writeback checking wrong inode after lookup
    - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
    - cleaned up inode reclaim tagging"

    * tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
    xfs: fix warning in xfs_finish_page_writeback for non-debug builds
    xfs: move reclaim tagging functions
    xfs: simplify inode reclaim tagging interfaces
    xfs: rename variables in xfs_iflush_cluster for clarity
    xfs: xfs_iflush_cluster has range issues
    xfs: mark reclaimed inodes invalid earlier
    xfs: xfs_inode_free() isn't RCU safe
    xfs: optimise xfs_iext_destroy
    xfs: skip stale inodes in xfs_iflush_cluster
    xfs: fix inode validity check in xfs_iflush_cluster
    xfs: xfs_iflush_cluster fails to abort on error
    xfs: remove xfs_fs_evict_inode()
    xfs: add "fail at unmount" error handling configuration
    xfs: add configuration handlers for specific errors
    xfs: add configuration of error failure speed
    xfs: introduce table-based init for error behaviors
    xfs: add configurable error support to metadata buffers
    xfs: introduce metadata IO error class
    xfs: configurable error behavior via sysfs
    xfs: buffer ->bi_end_io function requires irq-safe lock
    ...

    Linus Torvalds
     

19 May, 2016

1 commit

  • dax_clear_sectors() cannot handle poisoned blocks. These must be
    zeroed using the BIO interface instead. Convert ext2 and XFS to use
    only sb_issue_zerout().

    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Matthew Wilcox
    [vishal: Also remove the dax_clear_sectors function entirely]
    Signed-off-by: Vishal Verma

    Matthew Wilcox
     

06 Apr, 2016

1 commit

  • Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
    that returns a transaction with all the required log and block reservations,
    and which allows passing transaction flags directly to avoid the cumbersome
    _xfs_trans_alloc interface.

    While we're at it we also get rid of the transaction type argument that has
    been superflous since we stopped supporting the non-CIL logging mode. The
    guts of it will be removed in another patch.

    [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CAHCE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Mar, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things
    all over the place, which add up to a lot of changes in the end.

    The major changes are that we've reduced the size of the struct
    xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
    >10%), the writepage code now only does a single set of mapping tree
    lockups so uses less CPU, delayed allocation reservations won't
    overrun under random write loads anymore, and we added compile time
    verification for on-disk structure sizes so we find out when a commit
    or platform/compiler change breaks the on disk structure as early as
    possible.

    Change summary:

    - error propagation for direct IO failures fixes for both XFS and
    ext4
    - new quota interfaces and XFS implementation for iterating all the
    quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
    roughly 100 bytes of memory per cached inode.
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
    mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
    xfs: always set rvalp in xfs_dir2_node_trim_free
    xfs: ensure committed is initialized in xfs_trans_roll
    xfs: borrow indirect blocks from freed extent when available
    xfs: refactor delalloc indlen reservation split into helper
    xfs: update freeblocks counter after extent deletion
    xfs: debug mode forced buffered write failure
    xfs: remove impossible condition
    xfs: check sizes of XFS on-disk structures at compile time
    xfs: ioends require logically contiguous file offsets
    xfs: use named array initializers for log item dumping
    xfs: fix computation of inode btree maxlevels
    xfs: reinitialise per-AG structures if geometry changes during recovery
    xfs: remove xfs_trans_get_block_res
    xfs: fix up inode32/64 (re)mount handling
    xfs: fix format specifier , should be %llx and not %llu
    xfs: sanitize remount options
    xfs: convert mount option parsing to tokens
    xfs: fix two memory leaks in xfs_attr_list.c error paths
    xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
    xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
    ...

    Linus Torvalds
     

07 Mar, 2016

1 commit


28 Feb, 2016

1 commit

  • dax_clear_blocks() needs a valid struct block_device and previously it
    was using inode->i_sb->s_bdev in all cases. This is correct for normal
    inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
    DAX raw block devices and for XFS real-time devices.

    Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
    its arguments to take a bdev and a sector instead of an inode and a
    block. This better reflects what the function does, and it allows the
    filesystem and raw block device code to pass in an appropriate struct
    block_device.

    Signed-off-by: Ross Zwisler
    Suggested-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Theodore Ts'o
    Cc: Al Viro
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

09 Feb, 2016

1 commit

  • Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
    the xfs_icdinode byte another 2 bytes and collapsing another 2 byte hole
    in the structure.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

08 Feb, 2016

1 commit

  • RT allocation can fail on a debug kernel with:

    XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 4039

    When modifying the summary inode during allocation. This occurs
    because the summary inode is never locked, and xfs_bmapi_*
    operations expect it to be locked. The summary inode is effectively
    protected byt he lock on the bitmap inode, so this really is only a
    debug kernel issue.

    Signed-off-by: Dave Chinner
    Tested-by: Ross Zwisler
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dave Chinner
     

11 Jan, 2016

1 commit

  • Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
    associated comments were replicated several times across
    the attribute code, all dealing with what to do if the
    transaction was or wasn't committed.

    And in that replicated code, an ASSERT() test of an
    uninitialized variable occurs in several locations:

    error = xfs_attr_thing(&args);
    if (!error) {
    error = xfs_bmap_finish(&args.trans, args.flist,
    &committed);
    }
    if (error) {
    ASSERT(committed);

    If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
    never set "committed", and then test it in the ASSERT.

    Fix this up by moving the committed state internal to xfs_bmap_finish,
    and add a new inode argument. If an inode is passed in, it is passed
    through to __xfs_trans_roll() and joined to the transaction there if
    the transaction was committed.

    xfs_qm_dqalloc() was a little unique in that it called bjoin rather
    than ijoin, but as Dave points out we can detect the committed state
    but checking whether (*tpp != tp).

    Addresses-Coverity-Id: 102360
    Addresses-Coverity-Id: 102361
    Addresses-Coverity-Id: 102363
    Addresses-Coverity-Id: 102364
    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

03 Nov, 2015

2 commits

  • Dave Chinner
     
  • To enable DAX to do atomic allocation of zeroed extents, we need to
    drive the block zeroing deep into the allocator. Because
    xfs_bmapi_write() can return merged extents on allocation that were
    only partially allocated (i.e. requested range spans allocated and
    hole regions, allocation into the hole was contiguous), we cannot
    zero the extent returned from xfs_bmapi_write() as that can
    overwrite existing data with zeros.

    Hence we have to drive the extent zeroing into the allocation code,
    prior to where we merge the extents into the BMBT and return the
    resultant map. This means we need to propagate this need down to
    the xfs_alloc_vextent() and issue the block zeroing at this point.

    While this functionality is being introduced for DAX, there is no
    reason why it is specific to DAX - we can per-zero blocks during the
    allocation transaction on any type of device. It's just slow (and
    usually slower than unwritten allocation and conversion) on
    traditional block devices so doesn't tend to get used. We can,
    however, hook hardware zeroing optimisations via sb_issue_zeroout()
    to this operation, so it may be useful in future and hence the
    "allocate zeroed blocks" API needs to be implementation neutral.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

12 Oct, 2015

1 commit

  • The total field from struct xfs_alloc_arg is a bit of an unknown
    commodity. It is documented as the total block requirement for the
    transaction and is used in this manner from most call sites by virtue of
    passing the total block reservation of the transaction associated with
    an allocation. Several xfs_bmapi_write() callers pass hardcoded values
    of 0 or 1 for the total block requirement, which is a historical oddity
    without any clear reasoning.

    The xfs_iomap_write_direct() caller, for example, passes 0 for the total
    block requirement. This has been determined to cause problems in the
    form of ABBA deadlocks of AGF buffers due to incorrect AG selection in
    the block allocator. Specifically, the xfs_alloc_space_available()
    function incorrectly selects an AG that doesn't actually have sufficient
    space for the allocation. This occurs because the args.total field is 0
    and thus the remaining free space check on the AG doesn't actually
    consider the size of the allocation request. This locks the AGF buffer,
    the allocation attempt proceeds and ultimately fails (in
    xfs_alloc_fix_minleft()), and xfs_alloc_vexent() moves on to the next
    AG. In turn, this can lead to incorrect AG locking order (if the
    allocator wraps around, attempting to lock AG 0 after acquiring AG N)
    and thus deadlock if racing with another operation. This problem has
    been reproduced via generic/299 on smallish (1GB) ramdisk test devices.

    To avoid this problem, replace the undocumented hardcoded total
    parameters from the iomap and utility callers to pass the block
    reservation used for the associated transaction. This is consistent with
    other xfs_bmapi_write() callers throughout XFS. The assumption is that
    the total field allows the selection of an AG that can handle the entire
    operation rather than simply the allocation/range being requested (e.g.,
    resulting btree splits, etc.). This addresses the aforementioned
    generic/299 hang by ensuring AG selection only occurs when the
    allocation can be satisfied by the AG.

    Reported-by: Ross Zwisler
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

19 Aug, 2015

3 commits

  • If a failure occurs after the bmap free list is populated and before
    xfs_bmap_finish() completes successfully (which returns a partial
    list on failure), the bmap free list must be cancelled. Otherwise,
    the extent items on the list are never freed and a memory leak
    occurs.

    Several random error paths throughout the code suffer this problem.
    Fix these up such that xfs_bmap_cancel() is always called on error.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Log recovery attempts to free extents with leftover EFIs in the AIL
    after initial processing. If the extent free fails (e.g., due to
    unrelated fs corruption), the transaction is cancelled, though it
    might not be dirtied at the time. If this is the case, the EFD does
    not abort and thus does not release the EFI. This can lead to hangs
    as the EFI pins the AIL.

    Update xlog_recover_process_efi() to log the EFD in the transaction
    before xfs_free_extent() errors are handled to ensure the
    transaction is dirty, aborts the EFD and releases the EFI on error.
    Since this is a requirement for EFD processing (and consistent with
    xfs_bmap_finish()), update the EFD logging helper to do the extent
    free and unconditionally log the EFD. This encodes the required EFD
    logging behavior into the helper and reduces the likelihood of
    errors down the road.

    [dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
    failure.]

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Freeing an extent in XFS involves logging an EFI (extent free
    intention), freeing the actual extent, and logging an EFD (extent
    free done). The EFI object is created with a reference count of 2:
    one for the current transaction and one for the subsequently created
    EFD. Under normal circumstances, the first reference is dropped when
    the EFI is unpinned and the second reference is dropped when the EFD
    is committed to the on-disk log.

    In event of errors or filesystem shutdown, there are various
    potential cleanup scenarios depending on the state of the EFI/EFD.
    The cleanup scenarios are confusing and racy, as demonstrated by the
    following test sequence:

    # mount $dev $mnt
    # fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \
    -f punch=1 -f creat=1 -f unlink=1 &
    # sleep 5
    # killall -9 fsstress; wait
    # godown -f $mnt
    # umount

    ... in which the final umount can hang due to the AIL being pinned
    indefinitely by one or more EFI items. This can occur due to several
    conditions. For example, if the shutdown occurs after the EFI is
    committed to the on-disk log and the EFD committed to the CIL, but
    before the EFD committed to the log, the EFD iop_committed() abort
    handler does not drop its reference to the EFI. Alternatively,
    manual error injection in the xfs_bmap_finish() codepath shows that
    if an error occurs after the EFI transaction is committed but before
    the EFD is constructed and logged, the EFI is never released from
    the AIL.

    Update the EFI/EFD item handling code to use a more straightforward
    and reliable approach to error handling. If an error occurs after
    the EFI transaction is committed and before the EFD is constructed,
    release the EFI explicitly from xfs_bmap_finish(). If the EFI
    transaction is cancelled, release the EFI in the unlock handler.

    Once the EFD is constructed, it is responsible for releasing the EFI
    under any circumstances (including whether the EFI item aborts due
    to log I/O error). Update the EFD item handlers to release the EFI
    if the transaction is cancelled or aborts due to log I/O error.
    Finally, update xfs_bmap_finish() to log at least one EFD extent to
    the transaction before xfs_free_extent() errors are handled to
    ensure the transaction is dirty and EFD item error handling is
    triggered.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

04 Jun, 2015

2 commits

  • Conflicts:
    fs/xfs/xfs_attr_inactive.c

    Dave Chinner
     
  • The flags argument to xfs_trans_commit is not useful for most callers, as
    a commit of a transaction without a permanent log reservation must pass
    0 here, and all callers for a transaction with a permanent log reservation
    except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove
    the flags argument from the public xfs_trans_commit interfaces, and
    introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
    that regrants a log reservation instead of releasing it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig