07 Oct, 2020

1 commit

  • A bug existed in the XFS reflink code between v5.1 and v5.5 in which
    the mapping for a COW IO was not trimmed to the mapping of the COW
    extent that was found. This resulted in a too-short copy, and
    corruption of other files which shared the original extent.

    (This happened only when extent size hints were set, which bypasses
    delalloc and led to this code path.)

    This was (inadvertently) fixed upstream with

    36adcbace24e "xfs: fill out the srcmap in iomap_begin"

    and related patches which moved lots of this functionality to
    the iomap subsystem.

    Hence, this is a -stable only patch, targeted to fix this
    corruption vector without other major code changes.

    Fixes: 78f0cc9d55cb ("xfs: don't use delalloc extents for COW on files with extsize hints")
    Cc: # 5.4.x
    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Eric Sandeen
     

01 Oct, 2020

8 commits

  • [ Upstream commit 27fb5a72f50aa770dd38b0478c07acacef97e3e7 ]

    I noticed that fsfreeze can take a very long time to freeze an XFS if
    there happens to be a GETFSMAP caller running in the background. I also
    happened to notice the following in dmesg:

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs]
    Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs]
    CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
    RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs]
    Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54
    RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff
    RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000
    R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8
    FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0
    Call Trace:
    xfs_fs_freeze+0x25/0x40 [xfs]
    freeze_super+0xc8/0x180
    do_vfs_ioctl+0x70b/0x750
    ? __fget_files+0x135/0x210
    ksys_ioctl+0x3a/0xb0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x50/0x1a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    These two things appear to be related. The assertion trips when another
    thread initiates a fsmap request (which uses an empty transaction) after
    the freezer waited for m_active_trans to hit zero but before the the
    freezer executes the WARN_ON just prior to calling xfs_log_quiesce.

    The lengthy delays in freezing happen because the freezer calls
    xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the
    GETFSMAP caller is continuing to grab and release buffers, which means
    that it can take a very long time for the buffer lru list to empty out.

    We fix both of these races by calling sb_start_write to obtain freeze
    protection while using empty transactions for GETFSMAP and for metadata
    scrubbing. The other two users occur during mount, during which time we
    cannot fs freeze.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 2e107cf869eecc770e3f630060bb4e5f547d0fd8 ]

    In xchk_dir_actor, we attempt to validate the directory hash structures
    by performing a directory entry lookup by (hashed) name. If the lookup
    returns ENOENT, that means that the hash information is corrupt. The
    _process_error functions don't catch this, so we have to add that
    explicitly.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 1cb5deb5bc095c070c09a4540c45f9c9ba24be43 ]

    If we decide that a directory free block is corrupt, we must take care
    not to leak a buffer pointer to the caller. After xfs_trans_brelse
    returns, the buffer can be freed or reused, which means that we have to
    set *bpp back to NULL.

    Callers are supposed to notice the nonzero return value and not use the
    buffer pointer, but we should code more defensively, even if all current
    callers handle this situation correctly.

    Fixes: de14c5f541e7 ("xfs: verify free block header fields")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit b1de6fc7520fe12949c070af0e8c0e4044cd3420 ]

    Omar Sandoval reported that a 4G fallocate on the realtime device causes
    filesystem shutdowns due to a log reservation overflow that happens when
    we log the rtbitmap updates. Factor rtbitmap/rtsummary updates into the
    the tr_write and tr_itruncate log reservation calculation.

    "The following reproducer results in a transaction log overrun warning
    for me:

    mkfs.xfs -f -r rtdev=/dev/vdc -d rtinherit=1 -m reflink=0 /dev/vdb
    mount -o rtdev=/dev/vdc /dev/vdb /mnt
    fallocate -l 4G /mnt/foo

    Reported-by: Omar Sandoval
    Tested-by: Omar Sandoval
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 0c4da70c83d41a8461fdf50a3f7b292ecb04e378 ]

    Realtime files in XFS allocate extents in rextsize units. However, the
    written/unwritten state of those extents is still tracked in blocksize
    units. Therefore, a realtime file can be split up into written and
    unwritten extents that are not necessarily aligned to the realtime
    extent size. __xfs_bunmapi() has some logic to handle these various
    corner cases. Consider how it handles the following case:

    1. The last extent is unwritten.
    2. The last extent is smaller than the realtime extent size.
    3. startblock of the last extent is not aligned to the realtime extent
    size, but startblock + blockcount is.

    In this case, __xfs_bunmapi() calls xfs_bmap_add_extent_unwritten_real()
    to set the second-to-last extent to unwritten. This should merge the
    last and second-to-last extents, so __xfs_bunmapi() moves on to the
    second-to-last extent.

    However, if the size of the last and second-to-last extents combined is
    greater than MAXEXTLEN, xfs_bmap_add_extent_unwritten_real() does not
    merge the two extents. When that happens, __xfs_bunmapi() skips past the
    last extent without unmapping it, thus leaking the space.

    Fix it by only unwriting the minimum amount needed to align the last
    extent to the realtime extent size, which is guaranteed to merge with
    the last extent.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Omar Sandoval
     
  • [ Upstream commit 2a2b5932db67586bacc560cc065d62faece5b996 ]

    The leaf format xattr addition helper xfs_attr3_leaf_add_work()
    adjusts the block freemap in a couple places. The first update drops
    the size of the freemap that the caller had already selected to
    place the xattr name/value data. Before the function returns, it
    also checks whether the entries array has encroached on a freemap
    range by virtue of the new entry addition. This is necessary because
    the entries array grows from the start of the block (but end of the
    block header) towards the end of the block while the name/value data
    grows from the end of the block in the opposite direction. If the
    associated freemap is already empty, however, size is zero and the
    subtraction underflows the field and causes corruption.

    This is reproduced rarely by generic/070. The observed behavior is
    that a smaller sized freemap is aligned to the end of the entries
    list, several subsequent xattr additions land in larger freemaps and
    the entries list expands into the smaller freemap until it is fully
    consumed and then underflows. Note that it is not otherwise a
    corruption for the entries array to consume an empty freemap because
    the nameval list (i.e. the firstused pointer in the xattr header)
    starts beyond the end of the corrupted freemap.

    Update the freemap size modification to account for the fact that
    the freemap entry can be empty and thus stale.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Brian Foster
     
  • [ Upstream commit 249bd9087a5264d2b8a974081870e2e27671b4dc ]

    AIO+DIO can extend the file size on IO completion, and it holds
    no inode locks while the IO is in flight. Therefore, a race
    condition exists in file size updates if we do something like this:

    aio-thread fallocate-thread

    lock inode
    submit IO beyond inode->i_size
    unlock inode
    .....
    lock inode
    break layouts
    if (off + len > inode->i_size)
    new_size = off + len
    .....
    inode_dio_wait()

    .....
    completes
    inode->i_size updated
    inode_dio_done()
    ....


    if (new_size)
    xfs_vn_setattr(inode, new_size)

    Yup, that attempt to extend the file size in the fallocate code
    turns into a truncate - it removes the whatever the aio write
    allocated and put to disk, and reduced the inode size back down to
    where the fallocate operation ends.

    Fundamentally, xfs_file_fallocate() not compatible with racing
    AIO+DIO completions, so we need to move the inode_dio_wait() call
    up to where the lock the inode and break the layouts.

    Secondly, storing the inode size and then using it unchecked without
    holding the ILOCK is not safe; we can only do such a thing if we've
    locked out and drained all IO and other modification operations,
    which we don't do initially in xfs_file_fallocate.

    It should be noted that some of the fallocate operations are
    compound operations - they are made up of multiple manipulations
    that may zero data, and so we may need to flush and invalidate the
    file multiple times during an operation. However, we only need to
    lock out IO and other space manipulation operations once, as that
    lockout is maintained until the entire fallocate operation has been
    completed.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Dave Chinner
     
  • [ Upstream commit 3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08 ]

    [commit message is verbose for discussion purposes - will trim it
    down later. Some questions about implementation details at the end.]

    Zorro Lang recently ran a new test to stress single inode extent
    counts now that they are no longer limited by memory allocation.
    The test was simply:

    # xfs_io -f -c "falloc 0 40t" /mnt/scratch/big-file
    # ~/src/xfstests-dev/punch-alternating /mnt/scratch/big-file

    This test uncovered a problem where the hole punching operation
    appeared to finish with no error, but apparently only created 268M
    extents instead of the 10 billion it was supposed to.

    Further, trying to punch out extents that should have been present
    resulted in success, but no change in the extent count. It looked
    like a silent failure.

    While running the test and observing the behaviour in real time,
    I observed the extent coutn growing at ~2M extents/minute, and saw
    this after about an hour:

    # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next ; \
    > sleep 60 ; \
    > xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
    fsxattr.nextents = 127657993
    fsxattr.nextents = 129683339
    #

    And a few minutes later this:

    # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
    fsxattr.nextents = 4177861124
    #

    Ah, what? Where did that 4 billion extra extents suddenly come from?

    Stop the workload, unmount, mount:

    # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
    fsxattr.nextents = 166044375
    #

    And it's back at the expected number. i.e. the extent count is
    correct on disk, but it's screwed up in memory. I loaded up the
    extent list, and immediately:

    # xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
    fsxattr.nextents = 4192576215
    #

    It's bad again. So, where does that number come from?
    xfs_fill_fsxattr():

    if (ip->i_df.if_flags & XFS_IFEXTENTS)
    fa->fsx_nextents = xfs_iext_count(&ip->i_df);
    else
    fa->fsx_nextents = ip->i_d.di_nextents;

    And that's the behaviour I just saw in a nutshell. The on disk count
    is correct, but once the tree is loaded into memory, it goes whacky.
    Clearly there's something wrong with xfs_iext_count():

    inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
    {
    return ifp->if_bytes / sizeof(struct xfs_iext_rec);
    }

    Simple enough, but 134M extents is 2**27, and that's right about
    where things went wrong. A struct xfs_iext_rec is 16 bytes in size,
    which means 2**27 * 2**4 = 2**31 and we're right on target for an
    integer overflow. And, sure enough:

    struct xfs_ifork {
    int if_bytes; /* bytes in if_u1 */
    ....

    Once we get 2**27 extents in a file, we overflow if_bytes and the
    in-core extent count goes wrong. And when we reach 2**28 extents,
    if_bytes wraps back to zero and things really start to go wrong
    there. This is where the silent failure comes from - only the first
    2**28 extents can be looked up directly due to the overflow, all the
    extents above this index wrap back to somewhere in the first 2**28
    extents. Hence with a regular pattern, trying to punch a hole in the
    range that didn't have holes mapped to a hole in the first 2**28
    extents and so "succeeded" without changing anything. Hence "silent
    failure"...

    Fix this by converting if_bytes to a int64_t and converting all the
    index variables and size calculations to use int64_t types to avoid
    overflows in future. Signed integers are still used to enable easy
    detection of extent count underflows. This enables scalability of
    extent counts to the limits of the on-disk format - MAXEXTNUM
    (2**31) extents.

    Current testing is at over 500M extents and still going:

    fsxattr.nextents = 517310478

    Reported-by: Zorro Lang
    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Dave Chinner
     

17 Sep, 2020

2 commits

  • [ Upstream commit 125eac243806e021f33a1fdea3687eccbb9f7636 ]

    Don't leak kernel memory contents into the shortform attr fork.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 657f101930bc6c5b41bd7d6c22565c4302a80d33 ]

    The inode chunk allocation transaction reserves inobt_maxlevels-1
    blocks to accommodate a full split of the inode btree. A full split
    requires an allocation for every existing level and a new root
    block, which means inobt_maxlevels is the worst case block
    requirement for a transaction that inserts to the inobt. This can
    lead to a transaction block reservation overrun when tmpfile
    creation allocates an inode chunk and expands the inobt to its
    maximum depth. This problem has been observed in conjunction with
    overlayfs, which makes frequent use of tmpfiles internally.

    The existing reservation code goes back as far as the Linux git repo
    history (v2.6.12). It was likely never observed as a problem because
    the traditional file/directory creation transactions also include
    worst case block reservation for directory modifications, which most
    likely is able to make up for a single block deficiency in the inode
    allocation portion of the calculation. tmpfile support is relatively
    more recent (v3.15), less heavily used, and only includes the inode
    allocation block reservation as tmpfiles aren't linked into the
    directory tree on creation.

    Fix up the inode alloc block reservation macro and a couple of the
    block allocator minleft parameters that enforce an allocation to
    leave enough free blocks in the AG for a full inobt split.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Brian Foster
     

10 Sep, 2020

3 commits

  • commit b17164e258e3888d376a7434415013175d637377 upstream.

    When running in a dax mode, if the user maps a page with MAP_PRIVATE and
    PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime
    when the user hits a COW fault.

    This breaks building of the Linux kernel. How to reproduce:

    1. extract the Linux kernel tree on dax-mounted xfs filesystem
    2. run make clean
    3. run make -j12
    4. run make -j12

    at step 4, make would incorrectly rebuild the whole kernel (although it
    was already built in step 3).

    The reason for the breakage is that almost all object files depend on
    objtool. When we run objtool, it takes COW page fault on its .data
    section, and these faults will incorrectly update the timestamp of the
    objtool binary. The updated timestamp causes make to rebuild the whole
    tree.

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • [ Upstream commit d0c20d38af135b2b4b90aa59df7878ef0c8fbef4 ]

    The realtime flag only applies to the data fork, so don't use the
    realtime block number checks on the attr fork of a realtime file.

    Fixes: 30b0984d9117 ("xfs: refactor bmap record validation")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Eric Sandeen
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit f4020438fab05364018c91f7e02ebdd192085933 ]

    The boundary test for the fixed-offset parts of xfs_attr_sf_entry in
    xfs_attr_shortform_verify is off by one, because the variable array
    at the end is defined as nameval[1] not nameval[].
    Hence we need to subtract 1 from the calculation.

    This can be shown by:

    # touch file
    # setfattr -n root.a file

    and verifications will fail when it's written to disk.

    This only matters for a last attribute which has a single-byte name
    and no value, otherwise the combination of namelen & valuelen will
    push endp further out and this test won't fail.

    Fixes: 1e1bbd8e7ee06 ("xfs: create structure verifier function for shortform xattrs")
    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Eric Sandeen
     

03 Sep, 2020

1 commit

  • [ Upstream commit 96355d5a1f0ee6dcc182c37db4894ec0c29f1692 ]

    In tracking down a problem in this patchset, I discovered we are
    reclaiming dirty stale inodes. This wasn't discovered until inodes
    were always attached to the cluster buffer and then the rcu callback
    that freed inodes was assert failing because the inode still had an
    active pointer to the cluster buffer after it had been reclaimed.

    Debugging the issue indicated that this was a pre-existing issue
    resulting from the way the inodes are handled in xfs_inactive_ifree.
    When we free a cluster buffer from xfs_ifree_cluster, all the inodes
    in cache are marked XFS_ISTALE. Those that are clean have nothing
    else done to them and so eventually get cleaned up by background
    reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
    XFS_ISTALE.

    On journal commit dirty stale inodes as are handled by both
    buffer and inode log items to run though xfs_istale_done() and
    removed from the AIL (buffer log item commit) or the log item will
    simply unpin it because the buffer log item will clean it. What happens
    to any specific inode is entirely dependent on which log item wins
    the commit race, but the result is the same - stale inodes are
    clean, not attached to the cluster buffer, and not in the AIL. Hence
    inode reclaim can just free these inodes without further care.

    However, if the stale inode is relogged, it gets dirtied again and
    relogged into the CIL. Most of the time this isn't an issue, because
    relogging simply changes the inode's location in the current
    checkpoint. Problems arise, however, when the CIL checkpoints
    between two transactions in the xfs_inactive_ifree() deferops
    processing. This results in the XFS_ISTALE inode being redirtied
    and inserted into the CIL without any of the other stale cluster
    buffer infrastructure being in place.

    Hence on journal commit, it simply gets unpinned, so it remains
    dirty in memory. Everything in inode writeback avoids XFS_ISTALE
    inodes so it can't be written back, and it is not tracked in the AIL
    so there's not even a trigger to attempt to clean the inode. Hence
    the inode just sits dirty in memory until inode reclaim comes along,
    sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
    of a dirty inode caused use after free, list corruptions and other
    nasty issues later in this patchset.

    Hence this patch addresses a violation of the "never log XFS_ISTALE
    inodes" caused by the deferops processing rolling a transaction
    and relogging a stale inode in xfs_inactive_free. It also adds a
    bunch of asserts to catch this problem in debug kernels so that
    we don't reintroduce this problem in future.

    Reproducer for this issue was generic/558 on a v4 filesystem.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Dave Chinner
     

26 Aug, 2020

2 commits

  • [ Upstream commit 96cf2a2c75567ff56195fe3126d497a2e7e4379f ]

    If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN
    shows the following warning:

    UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23
    member access within null pointer of type 'struct xfs_kobj'
    Call Trace:
    dump_stack+0x10e/0x195
    ubsan_type_mismatch_common+0x241/0x280
    __ubsan_handle_type_mismatch_v1+0x32/0x40
    init_xfs_fs+0x12b/0x28f
    do_one_initcall+0xdd/0x1d0
    do_initcall_level+0x151/0x1b6
    do_initcalls+0x50/0x8f
    do_basic_setup+0x29/0x2b
    kernel_init_freeable+0x19f/0x20b
    kernel_init+0x11/0x1e0
    ret_from_fork+0x22/0x30

    Fix it by checking parent_kobj before the code accesses its member.

    Signed-off-by: Eiichi Tsukata
    Reviewed-by: Darrick J. Wong
    [darrick: minor whitespace edits]
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Eiichi Tsukata
     
  • [ Upstream commit f959b5d037e71a4d69b5bf71faffa065d9269b4a ]

    xfs_trans_dqresv is the function that we use to make reservations
    against resource quotas. Each resource contains two counters: the
    q_core counter, which tracks resources allocated on disk; and the dquot
    reservation counter, which tracks how much of that resource has either
    been allocated or reserved by threads that are working on metadata
    updates.

    For disk blocks, we compare the proposed reservation counter against the
    hard and soft limits to decide if we're going to fail the operation.
    However, for inodes we inexplicably compare against the q_core counter,
    not the incore reservation count.

    Since the q_core counter is always lower than the reservation count and
    we unlock the dquot between reservation and transaction commit, this
    means that multiple threads can reserve the last inode count before we
    hit the hard limit, and when they commit, we'll be well over the hard
    limit.

    Fix this by checking against the incore inode reservation counter, since
    we would appear to maintain that correctly (and that's what we report in
    GETQUOTA).

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Allison Collins
    Reviewed-by: Chandan Babu R
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     

19 Aug, 2020

3 commits

  • [ Upstream commit b2a8864728683443f34a9fd33a2b78b860934cc1 ]

    The block reservation calculation for inode allocation is supposed
    to consist of the blocks required for the inode chunk plus
    (maxlevels-1) of the inode btree multiplied by the number of inode
    btrees in the fs (2 when finobt is enabled, 1 otherwise).

    Instead, the macro returns (ialloc_blocks + 2) due to a precedence
    error in the calculation logic. This leads to block reservation
    overruns via generic/531 on small block filesystems with finobt
    enabled. Add braces to fix the calculation and reserve the
    appropriate number of blocks.

    Fixes: 9d43b180af67 ("xfs: update inode allocation/free transaction reservations for finobt")
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Brian Foster
     
  • [ Upstream commit 83895227aba1ade33e81f586aa7b6b1e143096a5 ]

    Quota reservations are supposed to account for the blocks that might be
    allocated due to a bmap btree split. Reflink doesn't do this, so fix
    this to make the quota accounting more accurate before we start
    rearranging things.

    Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit eb0efe5063bb10bcb653e4f8e92a74719c03a347 ]

    The data fork scrubber calls filemap_write_and_wait to flush dirty pages
    and delalloc reservations out to disk prior to checking the data fork's
    extent mappings. Unfortunately, this means that scrub can consume the
    EIO/ENOSPC errors that would otherwise have stayed around in the address
    space until (we hope) the writer application calls fsync to persist data
    and collect errors. The end result is that programs that wrote to a
    file might never see the error code and proceed as if nothing were
    wrong.

    xfs_scrub is not in a position to notify file writers about the
    writeback failure, and it's only here to check metadata, not file
    contents. Therefore, if writeback fails, we should stuff the error code
    back into the address space so that an fsync by the writer application
    can pick that up.

    Fixes: 99d9d8d05da2 ("xfs: scrub inode block mappings")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     

01 Jul, 2020

1 commit

  • [ Upstream commit d0c7feaf87678371c2c09b3709400be416b2dc62 ]

    We recently used fuzz(hydra) to test XFS and automatically generate
    tmp.img(XFS v5 format, but some metadata is wrong)

    xfs_repair information(just one AG):
    agf_freeblks 0, counted 3224 in ag 0
    agf_longest 536874136, counted 3224 in ag 0
    sb_fdblocks 613, counted 3228

    Test as follows:
    mount tmp.img tmpdir
    cp file1M tmpdir
    sync

    In 4.19-stable, sync will stuck, the reason is:
    xfs_mountfs
    xfs_check_summary_counts
    if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
    XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
    !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
    return 0; -->just return, incore sb_fdblocks still be 613
    xfs_initialize_perag_data

    cp file1M tmpdir -->ok(write file to pagecache)
    sync -->stuck(write pagecache to disk)
    xfs_map_blocks
    xfs_iomap_write_allocate
    while (count_fsb != 0) {
    nimaps = 0;
    while (nimaps == 0) { --> endless loop
    nimaps = 1;
    xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again
    xfs_bmapi_write
    xfs_bmap_alloc
    xfs_bmap_btalloc
    xfs_alloc_vextent
    xfs_alloc_fix_freelist
    xfs_alloc_space_available -->fail(agf_freeblks is 0)

    In linux-next, sync not stuck, cause commit c2b3164320b5 ("xfs:
    use the latest extent at writeback delalloc conversion time") remove
    the above while, dmesg is as follows:
    [ 55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.

    Users do not know why this page is discard, the better soultion is:
    1. Like xfs_repair, make sure sb_fdblocks is equal to counted
    (xfs_initialize_perag_data did this, who is not called at this mount)
    2. Add agf verify, if fail, will tell users to repair

    This patch use the second soultion.

    Signed-off-by: Zheng Bin
    Signed-off-by: Ren Xudong
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Zheng Bin
     

22 Jun, 2020

3 commits

  • [ Upstream commit 629dcb38dc351947ed6a26a997d4b587f3bd5c7e ]

    The pre-flush dquot verification in xfs_qm_dqflush() duplicates the
    read verifier by checking the dquot in the on-disk buffer. Instead,
    verify the in-core variant before it is flushed to the buffer.

    Fixes: 7224fa482a6d ("xfs: add full xfs_dqblk verifier")
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Allison Collins
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Brian Foster
     
  • [ Upstream commit b6983e80b03bd4fd42de71993b3ac7403edac758 ]

    The buffer write failure flag is intended to control the internal
    write retry that XFS has historically implemented to help mitigate
    the severity of transient I/O errors. The flag is set when a buffer
    is resubmitted from the I/O completion path due to a previous
    failure. It is checked on subsequent I/O completions to skip the
    internal retry and fall through to the higher level configurable
    error handling mechanism. The flag is cleared in the synchronous and
    delwri submission paths and also checked in various places to log
    write failure messages.

    There are a couple minor problems with the current usage of this
    flag. One is that we issue an internal retry after every submission
    from xfsaild due to how delwri submission clears the flag. This
    results in double the expected or configured number of write
    attempts when under sustained failures. Another more subtle issue is
    that the flag is never cleared on successful I/O completion. This
    can cause xfs_wait_buftarg() to suggest that dirty buffers are being
    thrown away due to the existence of the flag, when the reality is
    that the flag might still be set because the write succeeded on the
    retry.

    Clear the write failure flag on successful I/O completion to address
    both of these problems. This means that the internal retry attempt
    occurs once since the last time a buffer write failed and that
    various other contexts only see the flag set when the immediately
    previous write attempt has failed.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Allison Collins
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Brian Foster
     
  • [ Upstream commit 8bc3b5e4b70d28f8edcafc3c9e4de515998eea9e ]

    Make sure we release resources properly if we cannot clean out the COW
    extents in preparation for an extent swap.

    Fixes: 96987eea537d6c ("xfs: cancel COW blocks before swapext")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     

02 May, 2020

3 commits

  • [ Upstream commit c142932c29e533ee892f87b44d8abc5719edceec ]

    In the reflink extent remap function, it turns out that uirec (the block
    mapping corresponding only to the part of the passed-in mapping that got
    unmapped) was not fully initialized. Specifically, br_state was not
    being copied from the passed-in struct to the uirec. This could lead to
    unpredictable results such as the reflinked mapping being marked
    unwritten in the destination file.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • commit 10a98cb16d80be3595fdb165fad898bb28b8b6d2 upstream.

    Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
    during do_exit(). That can confuse things. In particular, if BSD
    process accounting is enabled, then do_exit() writes data to an
    accounting file. If that file has FS_SYNC_FL set, then this write
    occurs synchronously and can misbehave if PF_MEMALLOC is set.

    For example, if the accounting file is located on an XFS filesystem,
    then a WARN_ON_ONCE() in iomap_do_writepage() is triggered and the data
    doesn't get written when it should. Or if the accounting file is
    located on an ext4 filesystem without a journal, then a WARN_ON_ONCE()
    in ext4_write_inode() is triggered and the inode doesn't get written.

    Fix this in xfsaild() by using the helper functions to save and restore
    PF_MEMALLOC.

    This can be reproduced as follows in the kvm-xfstests test appliance
    modified to add the 'acct' Debian package, and with kvm-xfstests's
    recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:

    mkfs.xfs -f /dev/vdb
    mount /vdb
    touch /vdb/file
    chattr +S /vdb/file
    accton /vdb/file
    mkfs.xfs -f /dev/vdc
    mount /vdc
    umount /vdc

    It causes:
    WARNING: CPU: 1 PID: 336 at fs/iomap/buffered-io.c:1534
    CPU: 1 PID: 336 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
    RIP: 0010:iomap_do_writepage+0x16b/0x1f0 fs/iomap/buffered-io.c:1534
    [...]
    Call Trace:
    write_cache_pages+0x189/0x4d0 mm/page-writeback.c:2238
    iomap_writepages+0x1c/0x33 fs/iomap/buffered-io.c:1642
    xfs_vm_writepages+0x65/0x90 fs/xfs/xfs_aops.c:578
    do_writepages+0x41/0xe0 mm/page-writeback.c:2344
    __filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:421
    file_write_and_wait_range+0x71/0xc0 mm/filemap.c:760
    xfs_file_fsync+0x7a/0x2b0 fs/xfs/xfs_file.c:114
    generic_write_sync include/linux/fs.h:2867 [inline]
    xfs_file_buffered_aio_write+0x379/0x3b0 fs/xfs/xfs_file.c:691
    call_write_iter include/linux/fs.h:1901 [inline]
    new_sync_write+0x130/0x1d0 fs/read_write.c:483
    __kernel_write+0x54/0xe0 fs/read_write.c:515
    do_acct_process+0x122/0x170 kernel/acct.c:522
    slow_acct_process kernel/acct.c:581 [inline]
    acct_process+0x1d4/0x27c kernel/acct.c:607
    do_exit+0x83d/0xbc0 kernel/exit.c:791
    kthread+0xf1/0x140 kernel/kthread.c:257
    ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352

    This bug was originally reported by syzbot at
    https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.

    Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
    Signed-off-by: Eric Biggers
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 4b674b9ac852937af1f8c62f730c325fb6eadcdb upstream.

    The filesystem freeze sequence in XFS waits on any background
    eofblocks or cowblocks scans to complete before the filesystem is
    quiesced. At this point, the freezer has already stopped the
    transaction subsystem, however, which means a truncate or cowblock
    cancellation in progress is likely blocked in transaction
    allocation. This results in a deadlock between freeze and the
    associated scanner.

    Fix this problem by holding superblock write protection across calls
    into the block reapers. Since protection for background scans is
    acquired from the workqueue task context, trylock to avoid a similar
    deadlock between freeze and blocking on the write lock.

    Fixes: d6b636ebb1c9f ("xfs: halt auto-reclamation activities while rebuilding rmap")
    Reported-by: Paul Furtado
    Signed-off-by: Brian Foster
    Reviewed-by: Chandan Rajendra
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Allison Collins
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     

05 Mar, 2020

1 commit

  • commit 953aa9d136f53e226448dbd801a905c28f8071bf upstream.

    Don't allow passing arbitrary flags as they change behavior including
    memory allocation that the call stack is not prepared for.

    Fixes: ddbca70cc45c ("xfs: allocate xattr buffer on demand")
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

26 Jan, 2020

1 commit

  • commit 3dd4d40b420846dd35869ccc8f8627feef2cff32 upstream.

    Flags passed to Q_XQUOTARM were not sanity checked for invalid values.
    Fix that.

    Fixes: 9da93f9b7cdf ("xfs: fix Q_XQUOTARM ioctl")
    Reported-by: Yang Xu
    Signed-off-by: Jan Kara
    Reviewed-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

09 Jan, 2020

2 commits

  • [ Upstream commit 5d1116d4c6af3e580f1ed0382ca5a94bd65a34cf ]

    Christoph Hellwig complained about the following soft lockup warning
    when running scrub after generic/175 when preemption is disabled and
    slub debugging is enabled:

    watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [xfs_scrub:161]
    Modules linked in:
    irq event stamp: 41692326
    hardirqs last enabled at (41692325): [] _raw_0
    hardirqs last disabled at (41692326): [] trace0
    softirqs last enabled at (41684994): [] __do_e
    softirqs last disabled at (41684987): [] irq_e0
    CPU: 3 PID: 16189 Comm: xfs_scrub Not tainted 5.4.0-rc3+ #30
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.124
    RIP: 0010:_raw_spin_unlock_irqrestore+0x39/0x40
    Code: 89 f3 be 01 00 00 00 e8 d5 3a e5 fe 48 89 ef e8 ed 87 e5 f2
    RSP: 0018:ffffc9000233f970 EFLAGS: 00000286 ORIG_RAX: ffffffffff3
    RAX: ffff88813b398040 RBX: 0000000000000286 RCX: 0000000000000006
    RDX: 0000000000000006 RSI: ffff88813b3988c0 RDI: ffff88813b398040
    RBP: ffff888137958640 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00042b0c00
    R13: 0000000000000001 R14: ffff88810ac32308 R15: ffff8881376fc040
    FS: 00007f6113dea700(0000) GS:ffff88813bb80000(0000) knlGS:00000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f6113de8ff8 CR3: 000000012f290000 CR4: 00000000000006e0
    Call Trace:
    free_debug_processing+0x1dd/0x240
    __slab_free+0x231/0x410
    kmem_cache_free+0x30e/0x360
    xchk_ag_btcur_free+0x76/0xb0
    xchk_ag_free+0x10/0x80
    xchk_bmap_iextent_xref.isra.14+0xd9/0x120
    xchk_bmap_iextent+0x187/0x210
    xchk_bmap+0x2e0/0x3b0
    xfs_scrub_metadata+0x2e7/0x500
    xfs_ioc_scrub_metadata+0x4a/0xa0
    xfs_file_ioctl+0x58a/0xcd0
    do_vfs_ioctl+0xa0/0x6f0
    ksys_ioctl+0x5b/0x90
    __x64_sys_ioctl+0x11/0x20
    do_syscall_64+0x4b/0x1a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    If preemption is disabled, all metadata buffers needed to perform the
    scrub are already in memory, and there are a lot of records to check,
    it's possible that the scrub thread will run for an extended period of
    time without sleeping for IO or any other reason. Then the watchdog
    timer or the RCU stall timeout can trigger, producing the backtrace
    above.

    To fix this problem, call cond_resched() from the scrub thread so that
    we back out to the scheduler whenever necessary.

    Reported-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • commit 69ffe5960df16938bccfe1b65382af0b3de51265 upstream.

    Commit 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") added
    a check in __xfs_bunmapi() to stop early if we would touch multiple AGs
    in the wrong order. However, this check isn't applicable for realtime
    files. In most cases, it just makes us do unnecessary commits. However,
    without the fix from the previous commit ("xfs: fix realtime file data
    space leak"), if the last and second-to-last extents also happen to have
    different "AG numbers", then the break actually causes __xfs_bunmapi()
    to return without making any progress, which sends
    xfs_itruncate_extents_flags() into an infinite loop.

    Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     

05 Jan, 2020

1 commit

  • commit 798a9cada4694ca8d970259f216cec47e675bfd5 upstream.

    syzbot (via KASAN) reports a use-after-free in the error path of
    xlog_alloc_log(). Specifically, the iclog freeing loop doesn't
    handle the case of a fully initialized ->l_iclog linked list.
    Instead, it assumes that the list is partially constructed and NULL
    terminated.

    This bug manifested because there was no possible error scenario
    after iclog list setup when the original code was added. Subsequent
    code and associated error conditions were added some time later,
    while the original error handling code was never updated. Fix up the
    error loop to terminate either on a NULL iclog or reaching the end
    of the list.

    Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     

15 Oct, 2019

1 commit

  • 64-bit time is a signed quantity in the kernel, so the bulkstat
    structure should reflect that. Note that the structure size stays
    the same and that we have not yet published userspace headers for this
    new ioctl so there are no users to break.

    Fixes: 7035f9724f84 ("xfs: introduce new v5 bulkstat structure")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

09 Oct, 2019

3 commits

  • The callers of xfs_bmap_local_to_extents_empty() log the inode
    external to the function, yet this function is where the on-disk
    format value is updated. Push the inode logging down into the
    function itself to help prevent future mistakes.

    Note that internal bmap callers track the inode logging flags
    independently and thus may log the inode core twice due to this
    change. This is harmless, so leave this code around for consistency
    with the other attr fork conversion functions.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
    together after a failed attempt to convert from shortform to leaf
    format. While this code reallocates and copies back the shortform
    attr fork data, it never resets the inode format field back to local
    format. Further, now that the inode is properly logged after the
    initial switch from local format, any error that triggers the
    recovery code will eventually abort the transaction and shutdown the
    fs. Therefore, remove the broken and unnecessary error handling
    code.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • When a directory changes from shortform (sf) to block format, the sf
    format is copied to a temporary buffer, the inode format is modified
    and the updated format filled with the dentries from the temporary
    buffer. If the inode format is modified and attempt to grow the
    inode fails (due to I/O error, for example), it is possible to
    return an error while leaving the directory in an inconsistent state
    and with an otherwise clean transaction. This results in corruption
    of the associated directory and leads to xfs_dabuf_map() errors as
    subsequent lookups cannot accurately determine the format of the
    directory. This problem is reproduced occasionally by generic/475.

    The fundamental problem is that xfs_dir2_sf_to_block() changes the
    on-disk inode format without logging the inode. The inode is
    eventually logged by the bmapi layer in the common case, but error
    checking introduces the possibility of failing the high level
    request before this happens.

    Update both of the dir2 and attr callers of
    xfs_bmap_local_to_extents_empty() to log the inode core as
    consistent with the bmap local to extent format change codepath.
    This ensures that any subsequent errors after the format has changed
    cause the transaction to abort.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

07 Oct, 2019

4 commits

  • Guarantee zeroed memory buffers for cases where potential memory
    leak to disk can occur. In these cases, kmem_alloc is used and
    doesn't zero the buffer, opening the possibility of information
    leakage to disk.

    Use existing infrastucture (xfs_buf_allocate_memory) to obtain
    the already zeroed buffer from kernel memory.

    This solution avoids the performance issue that would occur if a
    wholesale change to replace kmem_alloc with kmem_zalloc was done.

    Signed-off-by: Bill O'Donnell
    [darrick: fix bitwise complaint about kmflag_mask]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Bill O'Donnell
     
  • Removed unused error variable. Instead of using error variable,
    returned the value directly as it wasn't updated.

    Signed-off-by: Aliasgar Surti
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Aliasgar Surti
     
  • The flags arg is always passed as zero, so remove it.

    (xfs_buf_get_uncached takes flags to support XBF_NO_IOACCT for
    the sb, but that should never be relevant for xfs_get_aghdr_buf)

    Signed-off-by: Eric Sandeen
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     
  • To ensure that all blocks touched by the range [offset, offset + count)
    are allocated, we need to calculate the block count from the difference
    of the range end (rounded up) and the range start (rounded down).

    Before this patch, we just round up the byte count, which may lead to
    unaligned ranges not being fully allocated:

    $ touch test_file
    $ block_size=$(stat -fc '%S' test_file)
    $ fallocate -o $((block_size / 2)) -l $block_size test_file
    $ xfs_bmap test_file
    test_file:
    0: [0..7]: 1396264..1396271
    1: [8..15]: hole

    There should not be a hole there. Instead, the first two blocks should
    be fully allocated.

    With this patch applied, the result is something like this:

    $ touch test_file
    $ block_size=$(stat -fc '%S' test_file)
    $ fallocate -o $((block_size / 2)) -l $block_size test_file
    $ xfs_bmap test_file
    test_file:
    0: [0..15]: 11024..11039

    Signed-off-by: Max Reitz
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Max Reitz