19 Jan, 2016

1 commit


12 Jan, 2016

3 commits

  • Dave Chinner
     
  • When we do dquot readahead in log recovery, we do not use a verifier
    as the underlying buffer may not have dquots in it. e.g. the
    allocation operation hasn't yet been replayed. Hence we do not want
    to fail recovery because we detect an operation to be replayed has
    not been run yet. This problem was addressed for inodes in commit
    d891400 ("xfs: inode buffers may not be valid during recovery
    readahead") but the problem was not recognised to exist for dquots
    and their buffers as the dquot readahead did not have a verifier.

    The result of not using a verifier is that when the buffer is then
    next read to replay a dquot modification, the dquot buffer verifier
    will only be attached to the buffer if *readahead is not complete*.
    Hence we can read the buffer, replay the dquot changes and then add
    it to the delwri submission list without it having a verifier
    attached to it. This then generates warnings in xfs_buf_ioapply(),
    which catches and warns about this case.

    Fix this and make it handle the same readahead verifier error cases
    as for inode buffers by adding a new readahead verifier that has a
    write operation as well as a read operation that marks the buffer as
    not done if any corruption is detected. Also make sure we don't run
    readahead if the dquot buffer has been marked as cancelled by
    recovery.

    This will result in readahead either succeeding and the buffer
    having a valid write verifier, or readahead failing and the buffer
    state requiring the subsequent read to resubmit the IO with the new
    verifier. In either case, this will result in the buffer always
    ending up with a valid write verifier on it.

    Note: we also need to fix the inode buffer readahead error handling
    to mark the buffer with EIO. Brian noticed the code I copied from
    there wrong during review, so fix it at the same time. Add comments
    linking the two functions that handle readahead verifier errors
    together so we don't forget this behavioural link in future.

    cc: # 3.12 - current
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • When we do inode readahead in log recovery, we do can do the
    readahead before we've replayed the icreate transaction that stamps
    the buffer with inode cores. The inode readahead verifier catches
    this and marks the buffer as !done to indicate that it doesn't yet
    contain valid inodes.

    In adding buffer error notification (i.e. setting b_error = -EIO at
    the same time as as we clear the done flag) to such a readahead
    verifier failure, we can then get subsequent inode recovery failing
    with this error:

    XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32

    This occurs when readahead completion races with icreate item replay
    such as:

    inode readahead
    find buffer
    lock buffer
    submit RA io
    ....
    icreate recovery
    xfs_trans_get_buffer
    find buffer
    lock buffer

    .....

    fails verifier
    clear XBF_DONE
    set bp->b_error = -EIO
    release and unlock buffer

    icreate initialises buffer
    marks buffer as done
    adds buffer to delayed write queue
    releases buffer

    At this point, we have an initialised inode buffer that is up to
    date but has an -EIO state registered against it. When we finally
    get to recovering an inode in that buffer:

    inode item recovery
    xfs_trans_read_buffer
    find buffer
    lock buffer
    sees XBF_DONE is set, returns buffer
    sees bp->b_error is set
    fail log recovery!

    Essentially, we need xfs_trans_get_buf_map() to clear the error status of
    the buffer when doing a lookup. This function returns uninitialised
    buffers, so the buffer returned can not be in an error state and
    none of the code that uses this function expects b_error to be set
    on return. Indeed, there is an ASSERT(!bp->b_error); in the
    transaction case in xfs_trans_get_buf_map() that would have caught
    this if log recovery used transactions....

    This patch firstly changes the inode readahead failure to set -EIO
    on the buffer, and secondly changes xfs_buf_get_map() to never
    return a buffer with an error state set so this first change doesn't
    cause unexpected log recovery failures.

    cc: # 3.12 - current
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

11 Jan, 2016

1 commit

  • Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
    associated comments were replicated several times across
    the attribute code, all dealing with what to do if the
    transaction was or wasn't committed.

    And in that replicated code, an ASSERT() test of an
    uninitialized variable occurs in several locations:

    error = xfs_attr_thing(&args);
    if (!error) {
    error = xfs_bmap_finish(&args.trans, args.flist,
    &committed);
    }
    if (error) {
    ASSERT(committed);

    If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
    never set "committed", and then test it in the ASSERT.

    Fix this up by moving the committed state internal to xfs_bmap_finish,
    and add a new inode argument. If an inode is passed in, it is passed
    through to __xfs_trans_roll() and joined to the transaction there if
    the transaction was committed.

    xfs_qm_dqalloc() was a little unique in that it called bjoin rather
    than ijoin, but as Dave points out we can detect the committed state
    but checking whether (*tpp != tp).

    Addresses-Coverity-Id: 102360
    Addresses-Coverity-Id: 102361
    Addresses-Coverity-Id: 102363
    Addresses-Coverity-Id: 102364
    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

08 Jan, 2016

1 commit

  • For large sparse or fragmented files, checking every single entry in
    the bmapbt on every operation is prohibitively expensive. Especially
    as such checks rarely discover problems during normal operations on
    high extent coutn files. Our regression tests don't tend to exercise
    files with hundreds of thousands to millions of extents, so mostly
    this isn't noticed.

    However, trying to run things like xfs_mdrestore of large filesystem
    dumps on a debug kernel quickly becomes impossible as the CPU is
    completely burnt up repeatedly walking the sparse file bmapbt that
    is generated for every allocation that is made.

    Hence, if the file has more than 10,000 extents, just don't bother
    with walking the tree to check it exhaustively. The btree code has
    checks that ensure that the newly inserted/removed/modified record
    is correctly ordered, so the entrie tree walk in thses cases has
    limited additional value.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

05 Jan, 2016

1 commit


04 Jan, 2016

11 commits

  • Rather than just being able to turn DAX on and off via a mount
    option, some applications may only want to enable DAX for certain
    performance critical files in a filesystem.

    This patch introduces a new inode flag to enable DAX in the v3 inode
    di_flags2 field. It adds support for setting and clearing flags in
    the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the
    S_DAX inode flag appropriately when it is seen.

    When this flag is set on a directory, it acts as an "inherit flag".
    That is, inodes created in the directory will automatically inherit
    the on-disk inode DAX flag, enabling administrators to set up
    directory heirarchies that automatically use DAX. Setting this flag
    on an empty root directory will make the entire filesystem use DAX
    by default.

    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Now that the ioctls have been hoisted up to the VFS level, use
    the VFs definitions directly and remove the XFS specific definitions
    completely. Userspace is going to have to handle the change of this
    interface separately, so removing the definitions from xfs_fs.h is
    not an issue here at all.

    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Hoist the ioctl definitions for the XFS_IOC_FS[SG]SETXATTR API from
    fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls
    can be used by all filesystems, not just XFS. This enables
    (initially) ext4 to use the ioctl to set project IDs on inodes.

    Based-on-patch-from: Li Xi
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Create xfs_btree_sblock_verify() to verify short-format btree blocks
    (i.e. the per-AG btrees with 32-bit block pointers) instead of
    open-coding them.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
    inside it, gcc will quietly round the structure size up to the nearest
    64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE
    macro returning incorrect results for v5 filesystems on 64-bit
    machines (118 items instead of 119). As a result, a 32-bit xfs_repair
    will see garbage in AGFL item 119 and complain.

    Therefore, tell gcc not to pad the structure so that the AGFL size
    calculation is correct.

    cc: # 3.10 - 4.4
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Use a convenience variable instead of open-coding the inode fork.
    This isn't really needed for now, but will become important when we
    add the copy-on-write fork later.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the
    static designation. xfsprogs already has this; this simply brings
    the kernel up to date.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • In my earlier commit

    c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO

    I added some local mp variables with code which indicates that
    mp might be NULL. Coverity doesn't like this now, because the
    updated per-fs XFS_STATS macros dereference mp.

    I don't think this is actually a problem; from what I can tell,
    we cannot get to these functions with a null bma->tp, so my NULL
    check was probably pointless. Still, it's not super obvious.

    So switch this code to get mp from the inode on the xfs_bmalloca
    structure, with no conditional, because the functions are already
    using bmap->ip directly.

    Addresses-Coverity-Id: 1339552
    Addresses-Coverity-Id: 1339553
    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • This adds a name to each buf_ops structure, so that if
    a verifier fails we can print the type of verifier that
    failed it. Should be a slight debugging aid, I hope.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • If there is any non zero bit in a long bitmap, it can jump out of the
    loop and finish the function as soon as possible.

    Signed-off-by: Jia He
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Jia He
     
  • Log recovery torn write detection uses CRC verification over a range of
    the active log to identify torn writes. Since the generic log recovery
    pass code implements a superset of the functionality required for CRC
    verification, it can be easily modified to support a CRC verification
    only pass.

    Create a new CRC pass type and update the log record processing helper
    to skip everything beyond CRC verification when in this mode. This pass
    will be invoked in subsequent patches to implement torn write detection.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

03 Nov, 2015

4 commits

  • Dave Chinner
     
  • Dave Chinner
     
  • In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct,
    make sure that the length of the attributes is correct as well. Also, turn
    the aclp parameter into a const pointer.

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Andreas Gruenbacher
     
  • To enable DAX to do atomic allocation of zeroed extents, we need to
    drive the block zeroing deep into the allocator. Because
    xfs_bmapi_write() can return merged extents on allocation that were
    only partially allocated (i.e. requested range spans allocated and
    hole regions, allocation into the hole was contiguous), we cannot
    zero the extent returned from xfs_bmapi_write() as that can
    overwrite existing data with zeros.

    Hence we have to drive the extent zeroing into the allocation code,
    prior to where we merge the extents into the BMBT and return the
    resultant map. This means we need to propagate this need down to
    the xfs_alloc_vextent() and issue the block zeroing at this point.

    While this functionality is being introduced for DAX, there is no
    reason why it is specific to DAX - we can per-zero blocks during the
    allocation transaction on any type of device. It's just slow (and
    usually slower than unwritten allocation and conversion) on
    traditional block devices so doesn't tend to get used. We can,
    however, hook hardware zeroing optimisations via sb_issue_zeroout()
    to this operation, so it may be useful in future and hence the
    "allocate zeroed blocks" API needs to be implementation neutral.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

12 Oct, 2015

8 commits

  • Dave Chinner
     
  • Dave Chinner
     
  • This patch modifies the stats counting macros and the callers
    to those macros to properly increment, decrement, and add-to
    the xfs stats counts. The counts for global and per-fs stats
    are correctly advanced, and cleared by writing a "1" to the
    corresponding clear file.

    global counts: /sys/fs/xfs/stats/stats
    per-fs counts: /sys/fs/xfs/sda*/stats/stats

    global clear: /sys/fs/xfs/stats/stats_clear
    per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear

    [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
    stats structures and macros. ]

    Signed-off-by: Bill O'Donnell
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Bill O'Donnell
     
  • Currently, we depends on Linux XATTR value for on disk
    definition. Which causes trouble on other platforms and
    maybe also if this value was to change.

    Fix it by creating a custom definition independent from
    those in Linux (although with the same values), so it is OK
    with the be16 fields used for holding these attributes.

    This patch reflects a change in xfsprogs.

    Signed-off-by: Jan Tulak
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Jan Tulak
     
  • Remove a hard dependency of Linux XATTR_LIST_MAX value by using
    a prefixed version. This patch reflects the same change in xfsprogs.

    Signed-off-by: Jan Tulak
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Jan Tulak
     
  • Just fix two typos in code comments.

    Signed-off-by: Geliang Tang
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Geliang Tang
     
  • Since the onset of v5 superblocks, the LSN of the last modification has
    been included in a variety of on-disk data structures. This LSN is used
    to provide log recovery ordering guarantees (e.g., to ensure an older
    log recovery item is not replayed over a newer target data structure).

    While this works correctly from the point a filesystem is formatted and
    mounted, userspace tools have some problematic behaviors that defeat
    this mechanism. For example, xfs_repair historically zeroes out the log
    unconditionally (regardless of whether corruption is detected). If this
    occurs, the LSN of the filesystem is reset and the log is now in a
    problematic state with respect to on-disk metadata structures that might
    have a larger LSN. Until either the log catches up to the highest
    previously used metadata LSN or each affected data structure is modified
    and written out without incident (which resets the metadata LSN), log
    recovery is susceptible to filesystem corruption.

    This problem is ultimately addressed and repaired in the associated
    userspace tools. The kernel is still responsible to detect the problem
    and notify the user that something is wrong. Check the superblock LSN at
    mount time and fail the mount if it is invalid. From that point on,
    trigger verifier failure on any metadata I/O where an invalid LSN is
    detected. This results in a filesystem shutdown and guarantees that we
    do not log metadata changes with invalid LSNs on disk. Since this is a
    known issue with a known recovery path, present a warning to instruct
    the user how to recover.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • A local format symlink inode is converted to extent format when an
    extended attribute is set on an inode as part of the attribute fork
    creation. This means a block is allocated, the local symlink target name
    is copied to the block and the block is logged. Currently,
    xfs_bmap_local_to_extents() handles logging the remote block data based
    on the size of the data fork prior to the conversion. This is not
    correct on v5 superblock filesystems, which add an additional header to
    remote symlink blocks that is nonexistent in local format inodes.

    As a result, the full length of the remote symlink block content is not
    logged. This can lead to corruption should a crash occur and log
    recovery replay this transaction.

    Since a callout is already used to initialize the new remote symlink
    block, update the local-to-extents conversion mechanism to make the
    callout also responsible for logging the block. It is already required
    to set the log buffer type and format the block appropriately based on
    the superblock version. This ensures the remote symlink is always logged
    correctly. Note that xfs_bmap_local_to_extents() is only called for
    symlinks so there are no other callouts that require modification.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

01 Sep, 2015

1 commit


28 Aug, 2015

1 commit

  • If xfs_da3_node_read_verify() doesn't recognize the magic number of a
    buffer it's just read, set the buffer error to -EFSCORRUPTED so that
    the error can be sent up to userspace. Without this patch we'll
    notice the bad magic eventually while trying to traverse or change
    the block, but we really ought to fail early in the verifier.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     

25 Aug, 2015

3 commits

  • Dave Chinner
     
  • Users have occasionally reported that file type for some directory
    entries is wrong. This mostly happened after updating libraries some
    libraries. After some debugging the problem was traced down to
    xfs_dir2_node_replace(). The function uses args->filetype as a file type
    to store in the replaced directory entry however it also calls
    xfs_da3_node_lookup_int() which will store file type of the current
    directory entry in args->filetype. Thus we fail to change file type of a
    directory entry to a proper type.

    Fix the problem by storing new file type in a local variable before
    calling xfs_da3_node_lookup_int().

    cc: # 3.16 - 4.x
    Reported-by: Giacomo Comes
    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Jan Kara
     
  • xfs_alloc_fix_freelist() can sometimes jump to out_agbp_relse
    without ever setting value of 'error' variable which is then
    returned. This can happen e.g. when pag->pagf_init is set but AG is
    for metadata and we want to allocate user data.

    Fix the problem by initializing 'error' to 0, which is the desired
    return value when we decide to skip this group.

    CC: xfs@oss.sgi.com
    Coverity-id: 1309714
    Signed-off-by: Jan Kara
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Jan Kara
     

20 Aug, 2015

1 commit


19 Aug, 2015

4 commits

  • struct xfs_attr_leafblock contains 'entries' array which is declared
    with size 1 altough it can in fact contain much more entries. Since this
    array is followed by further struct members, gcc (at least in version
    4.8.3) thinks that the array has the fixed size of 1 element and thus
    may optimize away all accesses beyond the end of array resulting in
    non-working code. This problem was only observed with userspace code in
    xfsprogs, however it's better to be safe in kernel as well and have
    matching kernel and xfsprogs definitions.

    cc:
    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Jan Kara
     
  • In the dir3 data block readahead function, use the regular read
    verifier to check the block's CRC and spot-check the block contents
    instead of directly calling only the spot-checking routine. This
    prevents corrupted directory data blocks from being read into the
    kernel, which can lead to garbage ls output and directory loops (if
    say one of the entries contains slashes and other junk).

    cc: # 3.12 - 4.2
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • The recent change to the readdir locking made in 40194ec ("xfs:
    reinstate the ilock in xfs_readdir") for CXFS directory sanity was
    probably the wrong thing to do. Deep in the readdir code we
    can take page faults in the filldir callback, and so taking a page
    fault while holding an inode ilock creates a new set of locking
    issues that lockdep warns all over the place about.

    The locking order for regular inodes w.r.t. page faults is io_lock
    -> pagefault -> mmap_sem -> ilock. The directory readdir code now
    triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
    at this point, it inverts all the locking patterns that lockdep
    normally sees on XFS inodes, and so triggers lockdep. We worked
    around this with commit 93a8614 ("xfs: fix directory inode iolock
    lockdep false positive"), but that then just moved the lockdep
    warning to deeper in the page fault path and triggered on security
    inode locks. Fixing the shmem issue there just moved the lockdep
    reports somewhere else, and now we are getting false positives from
    filesystem freezing annotations getting confused.

    Further, if we enter memory reclaim in a readdir path, we now get
    lockdep warning about potential deadlocks because the ilock is held
    when we enter reclaim. This, again, is different to a regular file
    in that we never allow memory reclaim to run while holding the ilock
    for regular files. Hence lockdep now throws
    ilock->kmalloc->reclaim->ilock warnings.

    Basically, the problem is that the ilock is being used to protect
    the directory data and the inode metadata, whereas for a regular
    file the iolock protects the data and the ilock protects the
    metadata. From the VFS perspective, the i_mutex serialises all
    accesses to the directory data, and so not holding the ilock for
    readdir doesn't matter. The issue is that CXFS doesn't access
    directory data via the VFS, so it has no "data serialisaton"
    mechanism. Hence we need to hold the IOLOCK in the correct places to
    provide this low level directory data access serialisation.

    The ilock can then be used just when the extent list needs to be
    read, just like we do for regular files. The directory modification
    code can take the iolock exclusive when the ilock is also taken,
    and this then ensures that readdir is correct excluded while
    modifications are in progress.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The node directory lookup code uses a state structure that tracks the
    path of buffers used to search for the hash of a filename through the
    leaf blocks. When the lookup encounters a block that ends with the
    requested hash, but the entry has not yet been found, it must shift over
    to the next block and continue looking for the entry (i.e., duplicate
    hashes could continue over into the next block). This shift mechanism
    involves walking back up and down the state structure, replacing buffers
    at the appropriate btree levels as necessary.

    When a buffer is replaced, the old buffer is released and the new buffer
    read into the active slot in the path structure. Because the buffer is
    read directly into the path slot, a buffer read failure can result in
    setting a NULL buffer pointer in an active slot. This throws off the
    state cleanup code in xfs_dir2_node_lookup(), which expects to release a
    buffer from each active slot. Instead, a BUG occurs due to a NULL
    pointer dereference:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
    IP: [] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
    ...
    RIP: 0010:[] [] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
    ...
    Call Trace:
    [] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
    [] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
    [] xfs_lookup+0x91/0x290 [xfs]
    [] xfs_vn_lookup+0x73/0xb0 [xfs]
    [] lookup_real+0x1d/0x50
    [] path_openat+0x91e/0x1490
    [] do_filp_open+0x89/0x100
    ...

    This has been reproduced via a parallel fsstress and filesystem shutdown
    workload in a loop. The shutdown triggers the read error in the
    aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().

    Update xfs_da3_path_shift() to update the active path slot atomically
    with respect to the caller when a buffer is replaced. This ensures that
    the caller always sees the old or new buffer in the slot and prevents
    the NULL pointer dereference.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster