09 Sep, 2015

2 commits

  • Use DAX to provide support for huge pages.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
    return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
    defined in linux/mm.h. Given that we don't want to include
    in , the easiest solution is to move the DAX-related
    functions to a new header, . We could also have moved
    VM_FAULT_* definitions to a new header, or a different header that isn't
    quite such a boil-the-ocean header as , but this felt like
    the best option.

    Signed-off-by: Matthew Wilcox
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

08 Sep, 2015

1 commit

  • Pull xfs updates from Dave Chinner:
    "There isn't a whole lot to this update - it's mostly bug fixes and
    they are spread pretty much all over XFS. There are some corruption
    fixes, some fixes for log recovery, some fixes that prevent unount
    from hanging, a lockdep annotation rework for inode locking to prevent
    false positives and the usual random bunch of cleanups and minor
    improvements.

    Deatils:

    - large rework of EFI/EFD lifecycle handling to fix log recovery
    corruption issues, crashes and unmount hangs

    - separate metadata UUID on disk to enable changing boot label UUID
    for v5 filesystems

    - fixes for gcc miscompilation on certain platforms and optimisation
    levels

    - remote attribute allocation and recovery corruption fixes

    - inode lockdep annotation rework to fix bugs with too many
    subclasses

    - directory inode locking changes to prevent lockdep false positives

    - a handful of minor corruption fixes

    - various other small cleanups and bug fixes"

    * tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits)
    xfs: fix error gotos in xfs_setattr_nonsize
    xfs: add mssing inode cache attempts counter increment
    xfs: return errors from partial I/O failures to files
    libxfs: bad magic number should set da block buffer error
    xfs: fix non-debug build warnings
    xfs: collapse allocsize and biosize mount option handling
    xfs: Fix file type directory corruption for btree directories
    xfs: lockdep annotations throw warnings on non-debug builds
    xfs: Fix uninitialized return value in xfs_alloc_fix_freelist()
    xfs: inode lockdep annotations broke non-lockdep build
    xfs: flush entire file on dio read/write to cached file
    xfs: Fix xfs_attr_leafblock definition
    libxfs: readahead of dir3 data blocks should use the read verifier
    xfs: stop holding ILOCK over filldir callbacks
    xfs: clean up inode lockdep annotations
    xfs: swap leaf buffer into path struct atomically during path shift
    xfs: relocate sparse inode mount warning
    xfs: dquots should be stamped with sb_meta_uuid
    xfs: log recovery needs to validate against sb_meta_uuid
    xfs: growfs not aware of sb_meta_uuid
    ...

    Linus Torvalds
     

06 Sep, 2015

1 commit

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     

05 Sep, 2015

1 commit

  • Many file systems that implement the show_options hook fail to correctly
    escape their output which could lead to unescaped characters (e.g. new
    lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
    could lead to confusion, spoofed entries (resulting in things like
    systemd issuing false d-bus "mount" notifications), and who knows what
    else. This looks like it would only be the root user stepping on
    themselves, but it's possible weird things could happen in containers or
    in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

    $ BASE="ovl"
    $ MNT="$BASE/mnt"
    $ LOW="$BASE/lower"
    $ UP="$BASE/upper"
    $ WORK="$BASE/work/ 0 0
    none /proc fuse.pwn user_id=1000"
    $ mkdir -p "$LOW" "$UP" "$WORK"
    $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
    $ cat /proc/mounts
    none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
    none /proc fuse.pwn user_id=1000 0 0
    $ fusermount -u /proc
    $ cat /proc/mounts
    cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara
    Acked-by: Paul Moore
    Cc: J. R. Okajima
    Signed-off-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

03 Sep, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This first core part of the block IO changes contains:

    - Cleanup of the bio IO error signaling from Christoph. We used to
    rely on the uptodate bit and passing around of an error, now we
    store the error in the bio itself.

    - Improvement of the above from myself, by shrinking the bio size
    down again to fit in two cachelines on x86-64.

    - Revert of the max_hw_sectors cap removal from a revision again,
    from Jeff Moyer. This caused performance regressions in various
    tests. Reinstate the limit, bump it to a more reasonable size
    instead.

    - Make /sys/block//queue/discard_max_bytes writeable, by me.
    Most devices have huge trim limits, which can cause nasty latencies
    when deleting files. Enable the admin to configure the size down.
    We will look into having a more sane default instead of UINT_MAX
    sectors.

    - Improvement of the SGP gaps logic from Keith Busch.

    - Enable the block core to handle arbitrarily sized bios, which
    enables a nice simplification of bio_add_page() (which is an IO hot
    path). From Kent.

    - Improvements to the partition io stats accounting, making it
    faster. From Ming Lei.

    - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
    file in blk-mq, as well as a fix for a blk-mq timeout race
    condition.

    - Ming Lin has been carrying Kents above mentioned patches forward
    for a while, and testing them. Ming also did a few fixes around
    that.

    - Sasha Levin found and fixed a use-after-free problem introduced by
    the bio->bi_error changes from Christoph.

    - Small blk cgroup cleanup from Viresh Kumar"

    * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
    blk: Fix bio_io_vec index when checking bvec gaps
    block: Replace SG_GAPS with new queue limits mask
    block: bump BLK_DEF_MAX_SECTORS to 2560
    Revert "block: remove artifical max_hw_sectors cap"
    blk-mq: fix race between timeout and freeing request
    blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    Documentation: update notes in biovecs about arbitrarily sized bios
    block: remove bio_get_nr_vecs()
    fs: use helper bio_add_page() instead of open coding on bi_io_vec
    block: kill merge_bvec_fn() completely
    md/raid5: get rid of bio_fits_rdev()
    md/raid5: split bio for chunk_aligned_read
    block: remove split code in blkdev_issue_{discard,write_same}
    btrfs: remove bio splitting and merge_bvec_fn() calls
    bcache: remove driver private bio splitting code
    block: simplify bio_add_page()
    block: make generic_make_request handle arbitrarily sized bios
    blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
    block: don't access bio->bi_error after bio_put()
    block: shrink struct bio down to 2 cache lines again
    ...

    Linus Torvalds
     

01 Sep, 2015

1 commit


28 Aug, 2015

4 commits

  • As the code stands today, if xfs_trans_reserve() fails, we
    goto out_dqrele, which does not free the allocated transaction.

    Fix up the goto targets to undo everything properly.

    Addresses-Coverity-Id: 145571
    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • Increasing the inode cache attempt counter was apparently dropped while
    refactoring the cache code and so stayed at the initial 0 value. Add the
    increment back to make the runtime stats more useful.

    Signed-off-by: Lucas Stach
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Lucas Stach
     
  • There is an issue with xfs's error reporting in some cases of I/O partially
    failing and partially succeeding. Calls like fsync() can report success even
    though not all I/O was successful in partial-failure cases such as one disk of
    a RAID0 array being offline.

    The issue can occur when there are more than one bio per xfs_ioend struct.
    Each call to xfs_end_bio() for a bio completing will write a value to
    ioend->io_error. If a successful bio completes after any failed bio, no
    error is reported do to it writing 0 over the error code set by any failed bio.
    The I/O error information is now lost and when the ioend is completed
    only success is reported back up the filesystem stack.

    xfs_end_bio() should only set ioend->io_error in the case of BIO_UPTODATE
    being clear. ioend->io_error is initialized to 0 at allocation so only needs
    to be updated by a failed bio. Also check that ioend->io_error is 0 so that
    the first error reported will be the error code returned.

    Cc: stable@vger.kernel.org
    Signed-off-by: David Jeffery
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    David Jeffery
     
  • If xfs_da3_node_read_verify() doesn't recognize the magic number of a
    buffer it's just read, set the buffer error to -EFSCORRUPTED so that
    the error can be sent up to userspace. Without this patch we'll
    notice the bad magic eventually while trying to traverse or change
    the block, but we really ought to fail early in the verifier.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     

25 Aug, 2015

6 commits

  • Dave Chinner
     
  • There seem to be a couple of new set-but-unused build warnings
    that gcc 4.9.3 is now warning about. These are not regressions, just
    the compiler being more picky.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The allocsize and biosize mount options are handled identically,
    other than allocsize accepting suffixes. suffix_kstrtoint handles
    bare numbers just fine too, so these can be collapsed.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • Users have occasionally reported that file type for some directory
    entries is wrong. This mostly happened after updating libraries some
    libraries. After some debugging the problem was traced down to
    xfs_dir2_node_replace(). The function uses args->filetype as a file type
    to store in the replaced directory entry however it also calls
    xfs_da3_node_lookup_int() which will store file type of the current
    directory entry in args->filetype. Thus we fail to change file type of a
    directory entry to a proper type.

    Fix the problem by storing new file type in a local variable before
    calling xfs_da3_node_lookup_int().

    cc: # 3.16 - 4.x
    Reported-by: Giacomo Comes
    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Jan Kara
     
  • SO, now if we enable lockdep without enabling CONFIG_XFS_DEBUG,
    the lockdep annotations throw a warning because the assert that uses
    the lockdep define is not built in:

    fs/xfs/xfs_inode.c:367:1: warning: 'xfs_lockdep_subclass_ok' defined but not used [-Wunused-function]
    xfs_lockdep_subclass_ok(

    So now we need to create an ifdef mess to sort this all out, because
    we need to handle all the combinations of CONFIG_XFS_DEBUG=[y|n],
    CONFIG_XFS_WARNING=[y|n] and CONFIG_LOCKDEP=[y|n] appropriately.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • xfs_alloc_fix_freelist() can sometimes jump to out_agbp_relse
    without ever setting value of 'error' variable which is then
    returned. This can happen e.g. when pag->pagf_init is set but AG is
    for metadata and we want to allocate user data.

    Fix the problem by initializing 'error' to 0, which is the desired
    return value when we decide to skip this group.

    CC: xfs@oss.sgi.com
    Coverity-id: 1309714
    Signed-off-by: Jan Kara
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Jan Kara
     

20 Aug, 2015

2 commits


19 Aug, 2015

21 commits

  • Filesystems are responsible to manage file coherency between the page
    cache and direct I/O. The generic dio code flushes dirty pages over the
    range of a dio to ensure that the dio read or a future buffered read
    returns the correct data. XFS has generally followed this pattern,
    though traditionally has flushed and invalidated the range from the
    start of the I/O all the way to the end of the file. This changed after
    the following commit:

    7d4ea3ce xfs: use ranged writeback and invalidation for direct IO

    ... as the full file flush was no longer necessary to deal with the
    strange post-eof delalloc issues that were since fixed. Unfortunately,
    we have since received complaints about performance degradation due to
    the increased exclusive iolock cycles (which locks out parallel dio
    submission) that occur when a file has cached pages. This does not occur
    on filesystems that use the generic code as it also does not incorporate
    locking.

    The exclusive iolock is acquired any time the inode mapping has cached
    pages, regardless of whether they reside in the range of the I/O or not.
    If not, the flush/inval calls do no work and the lock was cycled for no
    reason.

    Under consideration of the cost of the exclusive iolock, update the dio
    read and write handlers to flush and invalidate the entire mapping when
    cached pages exist. In most cases, this increases the cost of the
    initial flush sequence but eliminates the need for further lock cycles
    and flushes so long as the workload does not actively mix direct and
    buffered I/O. This also more closely matches historical behavior and
    performance characteristics that users have come to expect.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • struct xfs_attr_leafblock contains 'entries' array which is declared
    with size 1 altough it can in fact contain much more entries. Since this
    array is followed by further struct members, gcc (at least in version
    4.8.3) thinks that the array has the fixed size of 1 element and thus
    may optimize away all accesses beyond the end of array resulting in
    non-working code. This problem was only observed with userspace code in
    xfsprogs, however it's better to be safe in kernel as well and have
    matching kernel and xfsprogs definitions.

    cc:
    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Jan Kara
     
  • In the dir3 data block readahead function, use the regular read
    verifier to check the block's CRC and spot-check the block contents
    instead of directly calling only the spot-checking routine. This
    prevents corrupted directory data blocks from being read into the
    kernel, which can lead to garbage ls output and directory loops (if
    say one of the entries contains slashes and other junk).

    cc: # 3.12 - 4.2
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • The recent change to the readdir locking made in 40194ec ("xfs:
    reinstate the ilock in xfs_readdir") for CXFS directory sanity was
    probably the wrong thing to do. Deep in the readdir code we
    can take page faults in the filldir callback, and so taking a page
    fault while holding an inode ilock creates a new set of locking
    issues that lockdep warns all over the place about.

    The locking order for regular inodes w.r.t. page faults is io_lock
    -> pagefault -> mmap_sem -> ilock. The directory readdir code now
    triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
    at this point, it inverts all the locking patterns that lockdep
    normally sees on XFS inodes, and so triggers lockdep. We worked
    around this with commit 93a8614 ("xfs: fix directory inode iolock
    lockdep false positive"), but that then just moved the lockdep
    warning to deeper in the page fault path and triggered on security
    inode locks. Fixing the shmem issue there just moved the lockdep
    reports somewhere else, and now we are getting false positives from
    filesystem freezing annotations getting confused.

    Further, if we enter memory reclaim in a readdir path, we now get
    lockdep warning about potential deadlocks because the ilock is held
    when we enter reclaim. This, again, is different to a regular file
    in that we never allow memory reclaim to run while holding the ilock
    for regular files. Hence lockdep now throws
    ilock->kmalloc->reclaim->ilock warnings.

    Basically, the problem is that the ilock is being used to protect
    the directory data and the inode metadata, whereas for a regular
    file the iolock protects the data and the ilock protects the
    metadata. From the VFS perspective, the i_mutex serialises all
    accesses to the directory data, and so not holding the ilock for
    readdir doesn't matter. The issue is that CXFS doesn't access
    directory data via the VFS, so it has no "data serialisaton"
    mechanism. Hence we need to hold the IOLOCK in the correct places to
    provide this low level directory data access serialisation.

    The ilock can then be used just when the extent list needs to be
    read, just like we do for regular files. The directory modification
    code can take the iolock exclusive when the ilock is also taken,
    and this then ensures that readdir is correct excluded while
    modifications are in progress.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Lockdep annotations are a maintenance nightmare. Locking has to be
    modified to suit the limitations of the annotations, and we're
    always having to fix the annotations because they are unable to
    express the complexity of locking heirarchies correctly.

    So, next up, we've got more issues with lockdep annotations for
    inode locking w.r.t. XFS_LOCK_PARENT:

    - lockdep classes are exclusive and can't be ORed together
    to form new classes.
    - IOLOCK needs multiple PARENT subclasses to express the
    changes needed for the readdir locking rework needed to
    stop the endless flow of lockdep false positives involving
    readdir calling filldir under the ILOCK.
    - there are only 8 unique lockdep subclasses available,
    so we can't create a generic solution.

    IOWs we need to treat the 3-bit space available to each lock type
    differently:

    - IOLOCK uses xfs_lock_two_inodes(), so needs:
    - at least 2 IOLOCK subclasses
    - at least 2 IOLOCK_PARENT subclasses
    - MMAPLOCK uses xfs_lock_two_inodes(), so needs:
    - at least 2 MMAPLOCK subclasses
    - ILOCK uses xfs_lock_inodes with up to 5 inodes, so needs:
    - at least 5 ILOCK subclasses
    - one ILOCK_PARENT subclass
    - one RTBITMAP subclass
    - one RTSUM subclass

    For the IOLOCK, split the space into two sets of subclasses.
    For the MMAPLOCK, just use half the space for the one subclass to
    match the non-parent lock classes of the IOLOCK.
    For the ILOCK, use 0-4 as the ILOCK subclasses, 5-7 for the
    remaining individual subclasses.

    Because they are now all different, modify xfs_lock_inumorder() to
    handle the nested subclasses, and to assert fail if passed an
    invalid subclass. Further, annotate xfs_lock_inodes() to assert fail
    if an invalid combination of lock primitives and inode counts are
    passed that would result in a lockdep subclass annotation overflow.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The node directory lookup code uses a state structure that tracks the
    path of buffers used to search for the hash of a filename through the
    leaf blocks. When the lookup encounters a block that ends with the
    requested hash, but the entry has not yet been found, it must shift over
    to the next block and continue looking for the entry (i.e., duplicate
    hashes could continue over into the next block). This shift mechanism
    involves walking back up and down the state structure, replacing buffers
    at the appropriate btree levels as necessary.

    When a buffer is replaced, the old buffer is released and the new buffer
    read into the active slot in the path structure. Because the buffer is
    read directly into the path slot, a buffer read failure can result in
    setting a NULL buffer pointer in an active slot. This throws off the
    state cleanup code in xfs_dir2_node_lookup(), which expects to release a
    buffer from each active slot. Instead, a BUG occurs due to a NULL
    pointer dereference:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
    IP: [] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
    ...
    RIP: 0010:[] [] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
    ...
    Call Trace:
    [] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
    [] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
    [] xfs_lookup+0x91/0x290 [xfs]
    [] xfs_vn_lookup+0x73/0xb0 [xfs]
    [] lookup_real+0x1d/0x50
    [] path_openat+0x91e/0x1490
    [] do_filp_open+0x89/0x100
    ...

    This has been reproduced via a parallel fsstress and filesystem shutdown
    workload in a loop. The shutdown triggers the read error in the
    aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().

    Update xfs_da3_path_shift() to update the active path slot atomically
    with respect to the caller when a buffer is replaced. This ensures that
    the caller always sees the old or new buffer in the slot and prevents
    the NULL pointer dereference.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • The sparse inodes feature is currently considered experimental. We warn
    at mount time from xfs_mount_validate_sb(). This function is part of the
    superblock verifier codepath, however, which means it could be invoked
    repeatedly on superblock reads or writes. This is currently only
    noticeable from userspace, where mkfs produces multiple warnings at
    format time.

    As mkfs warnings were not the intent of this change, relocate the mount
    time warning to xfs_fs_fill_super(), which is only invoked once and only
    in kernel space.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Once the sb_uuid is changed, the wrong uuid is stamped into new
    dquots on disk. Found by inspection, verified by generic/219.

    Signed-off-by: Dave Chinner
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Now that sb_uuid can be changed by the user, we cannot use this to
    validate the metadata blocks being recovered belong to this
    filesystem. We must check against the sb_meta_uuid as that will
    remain unchanged.

    There is a complication in this code - the superblock itself. We can
    not check the sb_meta_uuid unconditionally, as that may not be set
    on disk. Hence we must verify the superblock sb_uuid matches between
    the log record and the in-core superblock.

    Found by inspection after the previous two problems were found.

    Signed-off-by: Dave Chinner
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Adding this simple change to xfstests:common/rc::_scratch_mkfs_xfs:

    + if [ $mkfs_status -eq 0 ]; then
    + xfs_admin -U generate $SCRATCH_DEV > /dev/null
    + fi

    triggers all sorts of errors in xfstests. xfs/104 is an example,
    where growfs fails with a UUID mismatch corruption detected by
    xfs_agf_write_verify() when trying to write the first new AG
    headers.

    Fix this problem by making sure we copy the sb_meta_uuid into new
    metadata written by growfs.

    Signed-off-by: Dave Chinner
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • After changing the UUID on a v5 filesystem, xfstests fails
    immediately on a debug kernel with:

    XFS: Assertion failed: uuid_equal(&ip->i_d.di_uuid, &mp->m_sb.sb_uuid), file: fs/xfs/xfs_inode.c, line: 799

    This needs to check against the sb_meta_uuid, not the user visible
    UUID that was changed.

    Signed-off-by: Dave Chinner
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • It's entirely possible for userspace to ask for an xattr which
    does not exist.

    Normally, there is no problem whatsoever when we ask for such
    a thing, but when we look at an obfuscated metadump image
    on a debug kernel with selinux, we trip over this ASSERT in
    xfs_da3_path_shift():

    *result = -ENOENT; /* we're out of our tree */
    ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);

    It (more or less) only shows up in the above scenario, because
    xfs_metadump obfuscates attr names, but chooses names which
    keep the same hash value - and xfs_da3_node_lookup_int does:

    if (((retval == -ENOENT) || (retval == -ENOATTR)) &&
    (blk->hashval == args->hashval)) {
    error = xfs_da3_path_shift(state, &state->path, 1, 1,
    &retval);

    IOWS, we only get down to the xfs_da3_path_shift() ASSERT
    if we are looking for an xattr which doesn't exist, but we
    find xattrs on disk which have the same hash, and so might be
    a hash collision, so we try the path shift. When *that*
    fails to find what we're looking for, we hit the assert about
    XFS_DA_OP_OKNOENT.

    Simply setting XFS_DA_OP_OKNOENT in xfs_attr_get solves this
    rather corner-case problem with no ill side effects. It's
    fine for an attr name lookup to fail.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • Dave Chinner
     
  • If a failure occurs after the bmap free list is populated and before
    xfs_bmap_finish() completes successfully (which returns a partial
    list on failure), the bmap free list must be cancelled. Otherwise,
    the extent items on the list are never freed and a memory leak
    occurs.

    Several random error paths throughout the code suffer this problem.
    Fix these up such that xfs_bmap_cancel() is always called on error.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Several areas of code duplicate a pattern where we take the AIL lock,
    check whether an item is in the AIL and remove it if so. Create a new
    helper for this pattern and use it where appropriate.

    Signed-off-by: Brian Foster

    Brian Foster
     
  • The btree cursor cleanup function takes an error parameter that
    affects how buffers are released from the cursor. All buffers are
    released in the event of error. Several callers do not specify the
    XFS_BTREE_ERROR flag in the event of error, however. This can cause
    buffers to hang around locked or with an elevated hold count and
    thus lead to umount hangs in the event of errors.

    Fix up the xfs_btree_del_cursor() callers to pass XFS_BTREE_ERROR if
    the cursor is being torn down due to error.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • The root inode is read as part of the xfs_mountfs() sequence and the
    reference is dropped in the event of failure after we grab the
    inode. The reference drop doesn't necessarily free the inode,
    however. It marks it for reclaim and potentially kicks off the
    reclaim workqueue. The workqueue is destroyed further up the error
    path, which means we are subject to crash if the workqueue job runs
    after this point or a memory leak which is identified if the
    xfs_inode_zone is destroyed (e.g., on module removal). Both of these
    outcomes are reproducible via manual instrumentation of a mount
    error after the root inode xfs_iget() call in xfs_mountfs().

    Update the xfs_mountfs() error path to cancel any potential reclaim
    work items and to run a synchronous inode reclaim if the root inode
    is marked for reclaim. This ensures that no jobs remain on the queue
    before it is destroyed and that the root inode is freed before the
    reclaim mechanism is torn down.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • The first 4 bytes of every basic block in the physical log is stamped
    with the current lsn. To support this mechanism, the log record header
    (first block of each new log record) contains space for the original
    first byte of each log record block before it is replaced with the lsn.
    The log record header has space for 32k worth of blocks. The version 2
    log adds new extended record headers for each additional 32k worth of
    blocks beyond what is supported by the record header.

    The log record checksum incorporates the log record header, the extended
    headers and the record payload. xlog_cksum() checksums the extended
    headers based on log->l_iclog_heads, which specifies the number of
    extended headers in a log record based on the log buffer size mount
    option. The log buffer size is variable, however, and thus means the
    checksum can be calculated differently based on how a filesystem is
    mounted. This is problematic if a filesystem crashes and recovery occurs
    on a subsequent mount using a different log buffer size. For example,
    crash an active filesystem that is mounted with the default (32k)
    logbsize, attempt remount/recovery using '-o logbsize=64k' and the mount
    fails on or warns about log checksum failures.

    To avoid this problem, update xlog_cksum() to calculate the checksum
    based on the size of the log buffer according to the log record. The
    size is already included in the h_size field of the log record header
    and thus is available at log recovery time. Extended log record headers
    are also only written when the log record is large enough to require
    them. This makes checksum calculation of log records consistent with the
    extended record header mechanism as well as how on-disk records are
    checksummed with various log buffer size mount options.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Inode cluster buffers are invalidated and cancelled when inode chunks
    are freed to notify log recovery that previous logged updates to the
    metadata buffer should be skipped. This ensures that log recovery does
    not overwrite buffers that might have already been reused.

    On v4 filesystems, inode chunk allocation and inode updates are logged
    via the cluster buffers and thus cancellation is easily detected via
    buffer cancellation items. v5 filesystems use the new icreate
    transaction, which uses logical logging and ordered buffers to log a
    full inode chunk allocation at once. The resulting icreate item often
    spans multiple inode cluster buffers.

    Log recovery checks for cancelled buffers when processing icreate log
    items, but it has a couple problems. First, it uses the full length of
    the inode chunk rather than the cluster size. Second, it uses the length
    in FSB units rather than BB units. Either of these problems prevent
    icreate recovery from identifying cancelled buffers and thus inode
    initialization proceeds unconditionally.

    Update xlog_recover_do_icreate_pass2() to iterate the icreate range in
    cluster sized increments and check each increment for cancellation.
    Since icreate is currently only used for the minimum atomic inode chunk
    allocation, we expect that either all or none of the buffers will be
    cancelled. Cancel the icreate if at least one buffer is cancelled to
    avoid making a bad situation worse by initializing a partial inode
    chunk, but detect such anomalies and warn the user.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Various log items have recovery tracepoints to identify whether a
    particular log item is recovered or cancelled. Add the equivalent
    tracepoints for the icreate transaction.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Log recovery occurs in two phases at mount time. In the first phase,
    EFIs and EFDs are processed and potentially cancelled out. EFIs without
    EFD objects are inserted into the AIL for processing and recovery in the
    second phase. xfs_mountfs() runs various other operations between the
    phases and is thus subject to failure. If failure occurs after the first
    phase but before the second, pending EFIs sit on the AIL, pin it and
    cause the mount to hang.

    Update the mount sequence to ensure that pending EFIs are cancelled in
    the event of failure. Add a recovery cancellation mechanism to iterate
    the AIL and cancel all EFI items when requested. Plumb cancellation
    support through the log mount finish helper and update xfs_mountfs() to
    invoke cancellation in the event of failure after recovery has started.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster