05 Mar, 2020

1 commit

  • commit 953aa9d136f53e226448dbd801a905c28f8071bf upstream.

    Don't allow passing arbitrary flags as they change behavior including
    memory allocation that the call stack is not prepared for.

    Fixes: ddbca70cc45c ("xfs: allocate xattr buffer on demand")
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
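
    A minimal sketch of the kind of flags check described above (not the
    upstream diff; the mask and helper names here are illustrative
    assumptions):

    /* Hypothetical allow-list of attr flags accepted from userspace. */
    #define XFS_ATTR_FLAGS_ALLOWED  (ATTR_ROOT | ATTR_SECURE | \
                                     ATTR_CREATE | ATTR_REPLACE)

    static int validate_attr_flags(unsigned int flags)
    {
            /* Reject any bit the call stack is not prepared to handle. */
            if (flags & ~XFS_ATTR_FLAGS_ALLOWED)
                    return -EINVAL;
            return 0;
    }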
     

26 Jan, 2020

1 commit

  • commit 3dd4d40b420846dd35869ccc8f8627feef2cff32 upstream.

    Flags passed to Q_XQUOTARM were not sanity checked for invalid values.
    Fix that.

    Fixes: 9da93f9b7cdf ("xfs: fix Q_XQUOTARM ioctl")
    Reported-by: Yang Xu
    Signed-off-by: Jan Kara
    Reviewed-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

09 Jan, 2020

2 commits

  • [ Upstream commit 5d1116d4c6af3e580f1ed0382ca5a94bd65a34cf ]

    Christoph Hellwig complained about the following soft lockup warning
    when running scrub after generic/175 with preemption disabled and
    slub debugging enabled:

    watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [xfs_scrub:161]
    Modules linked in:
    irq event stamp: 41692326
    hardirqs last enabled at (41692325): [] _raw_0
    hardirqs last disabled at (41692326): [] trace0
    softirqs last enabled at (41684994): [] __do_e
    softirqs last disabled at (41684987): [] irq_e0
    CPU: 3 PID: 16189 Comm: xfs_scrub Not tainted 5.4.0-rc3+ #30
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.124
    RIP: 0010:_raw_spin_unlock_irqrestore+0x39/0x40
    Code: 89 f3 be 01 00 00 00 e8 d5 3a e5 fe 48 89 ef e8 ed 87 e5 f2
    RSP: 0018:ffffc9000233f970 EFLAGS: 00000286 ORIG_RAX: ffffffffff3
    RAX: ffff88813b398040 RBX: 0000000000000286 RCX: 0000000000000006
    RDX: 0000000000000006 RSI: ffff88813b3988c0 RDI: ffff88813b398040
    RBP: ffff888137958640 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00042b0c00
    R13: 0000000000000001 R14: ffff88810ac32308 R15: ffff8881376fc040
    FS: 00007f6113dea700(0000) GS:ffff88813bb80000(0000) knlGS:00000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f6113de8ff8 CR3: 000000012f290000 CR4: 00000000000006e0
    Call Trace:
    free_debug_processing+0x1dd/0x240
    __slab_free+0x231/0x410
    kmem_cache_free+0x30e/0x360
    xchk_ag_btcur_free+0x76/0xb0
    xchk_ag_free+0x10/0x80
    xchk_bmap_iextent_xref.isra.14+0xd9/0x120
    xchk_bmap_iextent+0x187/0x210
    xchk_bmap+0x2e0/0x3b0
    xfs_scrub_metadata+0x2e7/0x500
    xfs_ioc_scrub_metadata+0x4a/0xa0
    xfs_file_ioctl+0x58a/0xcd0
    do_vfs_ioctl+0xa0/0x6f0
    ksys_ioctl+0x5b/0x90
    __x64_sys_ioctl+0x11/0x20
    do_syscall_64+0x4b/0x1a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    If preemption is disabled, all metadata buffers needed to perform the
    scrub are already in memory, and there are a lot of records to check,
    then it's possible that the scrub thread will run for an extended period of
    time without sleeping for IO or any other reason. Then the watchdog
    timer or the RCU stall timeout can trigger, producing the backtrace
    above.

    To fix this problem, call cond_resched() from the scrub thread so that
    we back out to the scheduler whenever necessary.

    Reported-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Darrick J. Wong
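
    A minimal sketch of the pattern the fix applies, assuming a
    hypothetical record-checking helper; the point is simply to yield
    periodically inside a long, IO-free loop:

    static int check_all_records(struct scrub_rec *recs, int nr_recs)
    {
            int i, error = 0;

            for (i = 0; i < nr_recs; i++) {
                    error = check_one_record(&recs[i]);  /* hypothetical */
                    if (error)
                            break;
                    /* Yield so the soft-lockup watchdog and RCU stall
                     * detector never fire on non-preemptible kernels. */
                    cond_resched();
            }
            return error;
    }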
     
  • commit 69ffe5960df16938bccfe1b65382af0b3de51265 upstream.

    Commit 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") added
    a check in __xfs_bunmapi() to stop early if we would touch multiple AGs
    in the wrong order. However, this check isn't applicable for realtime
    files. In most cases, it just makes us do unnecessary commits. However,
    without the fix from the previous commit ("xfs: fix realtime file data
    space leak"), if the last and second-to-last extents also happen to have
    different "AG numbers", then the break actually causes __xfs_bunmapi()
    to return without making any progress, which sends
    xfs_itruncate_extents_flags() into an infinite loop.

    Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     

05 Jan, 2020

1 commit

  • commit 798a9cada4694ca8d970259f216cec47e675bfd5 upstream.

    syzbot (via KASAN) reports a use-after-free in the error path of
    xlog_alloc_log(). Specifically, the iclog freeing loop doesn't
    handle the case of a fully initialized ->l_iclog linked list.
    Instead, it assumes that the list is partially constructed and NULL
    terminated.

    This bug manifested because there was no possible error scenario
    after iclog list setup when the original code was added. Subsequent
    code and associated error conditions were added some time later,
    while the original error handling code was never updated. Fix up the
    error loop to terminate either on a NULL iclog or reaching the end
    of the list.

    Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
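
    A simplified sketch of an error-path free loop that handles both a
    partially built (NULL-terminated) and a fully built (circular) iclog
    list; details are condensed from the description above, not copied
    from the patch:

    iclog = log->l_iclog;
    for (i = 0; i < log->l_iclog_bufs; i++) {
            if (!iclog)                     /* partially constructed list */
                    break;
            prev_iclog = iclog->ic_next;
            kmem_free(iclog);
            iclog = prev_iclog;
            if (iclog == log->l_iclog)      /* fully built list wrapped */
                    break;
    }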
     

15 Oct, 2019

1 commit

  • 64-bit time is a signed quantity in the kernel, so the bulkstat
    structure should reflect that. Note that the structure size stays
    the same and that we have not yet published userspace headers for this
    new ioctl so there are no users to break.

    Fixes: 7035f9724f84 ("xfs: introduce new v5 bulkstat structure")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
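
    An abridged illustration of the point (hypothetical struct name, not
    the full v5 bulkstat layout): the seconds fields must be signed so
    that pre-1970 timestamps are representable.

    struct bulkstat_times_example {
            int64_t  bs_atime;              /* signed seconds */
            int64_t  bs_mtime;
            int64_t  bs_ctime;
            int64_t  bs_btime;              /* inode creation time */
            uint32_t bs_atime_nsec;         /* nanoseconds stay unsigned */
            uint32_t bs_mtime_nsec;
            uint32_t bs_ctime_nsec;
            uint32_t bs_btime_nsec;
    };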
     

09 Oct, 2019

3 commits

  • The callers of xfs_bmap_local_to_extents_empty() log the inode
    external to the function, yet this function is where the on-disk
    format value is updated. Push the inode logging down into the
    function itself to help prevent future mistakes.

    Note that internal bmap callers track the inode logging flags
    independently and thus may log the inode core twice due to this
    change. This is harmless, so leave this code around for consistency
    with the other attr fork conversion functions.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • xfs_attr_shortform_to_leaf() attempts to put the shortform fork back
    together after a failed attempt to convert from shortform to leaf
    format. While this code reallocates and copies back the shortform
    attr fork data, it never resets the inode format field back to local
    format. Further, now that the inode is properly logged after the
    initial switch from local format, any error that triggers the
    recovery code will eventually abort the transaction and shut down the
    fs. Therefore, remove the broken and unnecessary error handling
    code.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
    When a directory changes from shortform (sf) to block format, the sf
    format is copied to a temporary buffer, the inode format is modified,
    and the updated format is filled with the dentries from the temporary
    buffer. If the inode format is modified and the attempt to grow the
    inode fails (due to an I/O error, for example), it is possible to
    return an error while leaving the directory in an inconsistent state
    and with an otherwise clean transaction. This results in corruption
    of the associated directory and leads to xfs_dabuf_map() errors as
    subsequent lookups cannot accurately determine the format of the
    directory. This problem is reproduced occasionally by generic/475.

    The fundamental problem is that xfs_dir2_sf_to_block() changes the
    on-disk inode format without logging the inode. The inode is
    eventually logged by the bmapi layer in the common case, but error
    checking introduces the possibility of failing the high level
    request before this happens.

    Update both the dir2 and attr callers of
    xfs_bmap_local_to_extents_empty() to log the inode core, consistent
    with the bmap local-to-extent format change codepath.
    This ensures that any subsequent errors after the format has changed
    cause the transaction to abort.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

07 Oct, 2019

4 commits

  • Guarantee zeroed memory buffers for cases where potential memory
    leak to disk can occur. In these cases, kmem_alloc is used and
    doesn't zero the buffer, opening the possibility of information
    leakage to disk.

    Use existing infrastructure (xfs_buf_allocate_memory) to obtain
    an already zeroed buffer from kernel memory.

    This solution avoids the performance hit that a wholesale replacement
    of kmem_alloc with kmem_zalloc would incur.

    Signed-off-by: Bill O'Donnell
    [darrick: fix bitwise complaint about kmflag_mask]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Bill O'Donnell
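
    The difference being closed, sketched with the XFS kmem helpers of
    that era (the surrounding context is illustrative):

    /* kmem_alloc() returns uninitialized memory; if any byte is never
     * overwritten before the buffer is written out, stale kernel data
     * can leak to disk. */
    buf = kmem_alloc(size, KM_NOFS);

    /* A zeroed allocation closes the leak, at the cost of a memset on
     * every allocation if applied wholesale - hence the targeted fix. */
    buf = kmem_zalloc(size, KM_NOFS);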
     
    Remove the unused error variable and return the value directly,
    since the variable was never updated.

    Signed-off-by: Aliasgar Surti
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Aliasgar Surti
     
  • The flags arg is always passed as zero, so remove it.

    (xfs_buf_get_uncached takes flags to support XBF_NO_IOACCT for
    the sb, but that should never be relevant for xfs_get_aghdr_buf)

    Signed-off-by: Eric Sandeen
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     
  • To ensure that all blocks touched by the range [offset, offset + count)
    are allocated, we need to calculate the block count from the difference
    of the range end (rounded up) and the range start (rounded down).

    Before this patch, we just round up the byte count, which may lead to
    unaligned ranges not being fully allocated:

    $ touch test_file
    $ block_size=$(stat -fc '%S' test_file)
    $ fallocate -o $((block_size / 2)) -l $block_size test_file
    $ xfs_bmap test_file
    test_file:
    0: [0..7]: 1396264..1396271
    1: [8..15]: hole

    There should not be a hole there. Instead, the first two blocks should
    be fully allocated.

    With this patch applied, the result is something like this:

    $ touch test_file
    $ block_size=$(stat -fc '%S' test_file)
    $ fallocate -o $((block_size / 2)) -l $block_size test_file
    $ xfs_bmap test_file
    test_file:
    0: [0..15]: 11024..11039

    Signed-off-by: Max Reitz
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Max Reitz
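
    The arithmetic at the heart of the fix, as a standalone sketch (plain
    C rather than the XFS conversion macros): round the start down, round
    the end up, and allocate the difference.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t bsize = 4096;                /* filesystem block size */
            uint64_t offset = bsize / 2, count = bsize;

            /* Wrong: rounding up only the byte count covers one block. */
            uint64_t bad_blocks = (count + bsize - 1) / bsize;

            /* Right: round end up, round start down, take the difference. */
            uint64_t start_blk = offset / bsize;
            uint64_t end_blk = (offset + count + bsize - 1) / bsize;
            uint64_t good_blocks = end_blk - start_blk;

            printf("wrong=%llu right=%llu\n",
                   (unsigned long long)bad_blocks,
                   (unsigned long long)good_blocks);  /* wrong=1 right=2 */
            return 0;
    }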
     

27 Sep, 2019

3 commits

  • Pull xfs fixes from Darrick Wong:
    "There are a couple of bug fixes and some small code cleanups that came
    in recently:

    - Minor code cleanups

    - Fix a superblock logging error

    - Ensure that collapse range converts the data fork to extents format
    when necessary

    - Revert the ALLOC_USERDATA cleanup because it caused subtle behavior
    regressions"

    * tag 'xfs-5.4-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: avoid unused to_mp() function warning
    xfs: log proper length of superblock
    xfs: revert 1baa2800e62d ("xfs: remove the unused XFS_ALLOC_USERDATA flag")
    xfs: removed unneeded variable
    xfs: convert inode to extent format after extent merge due to shift

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:

    - almost all of the rest of -mm

    - various other subsystems

    Subsystems affected by this patch series:
    memcg, misc, core-kernel, lib, checkpatch, reiserfs, fat, fork,
    cpumask, kexec, uaccess, kconfig, kgdb, bug, ipc, lzo, kasan, madvise,
    cleanups, pagemap

    * emailed patches from Andrew Morton : (77 commits)
    arch/sparc/include/asm/pgtable_64.h: fix build
    mm: treewide: clarify pgtable_page_{ctor,dtor}() naming
    ntfs: remove (un)?likely() from IS_ERR() conditions
    IB/hfi1: remove unlikely() from IS_ERR*() condition
    xfs: remove unlikely() from WARN_ON() condition
    wimax/i2400m: remove unlikely() from WARN*() condition
    fs: remove unlikely() from WARN_ON() condition
    xen/events: remove unlikely() from WARN() condition
    checkpatch: check for nested (un)?likely() calls
    hexagon: drop empty and unused free_initrd_mem
    mm: factor out common parts between MADV_COLD and MADV_PAGEOUT
    mm: introduce MADV_PAGEOUT
    mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
    mm: introduce MADV_COLD
    mm: untag user pointers in mmap/munmap/mremap/brk
    vfio/type1: untag user pointers in vaddr_get_pfn
    tee/shm: untag user pointers in tee_shm_register
    media/v4l2-core: untag user pointers in videobuf_dma_contig_user_get
    drm/radeon: untag user pointers in radeon_gem_userptr_ioctl
    drm/amdgpu: untag user pointers
    ...

    Linus Torvalds
     
  • "unlikely(WARN_ON(x))" is excessive. WARN_ON() already uses unlikely()
    internally.

    Link: http://lkml.kernel.org/r/20190829165025.15750-7-efremov@linux.com
    Signed-off-by: Denis Efremov
    Reviewed-by: Darrick J. Wong
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Efremov
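
    The cleanup in a nutshell (condition and error code are placeholders);
    since WARN_ON() already wraps its argument in unlikely(), the outer
    annotation is redundant:

    /* before */
    if (unlikely(WARN_ON(cond)))
            return -EIO;

    /* after */
    if (WARN_ON(cond))
            return -EIO;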
     

26 Sep, 2019

1 commit

  • Pull iomap updates from Darrick Wong:
    "After last week's failed pull request attempt, I scuttled everything
    in the branch except for the directio endio api changes, which were
    trivial. Everything else will simply have to wait for the next cycle.

    Summary:

    - Report both io errors and short io results to the directio endio
    handler.

    - Allow directio callers to pass an ops structure to iomap_dio_rw"

    * tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    iomap: move the iomap_dio_rw ->end_io callback into a structure
    iomap: split size and error for iomap_dio_rw ->end_io

    Linus Torvalds
     

25 Sep, 2019

1 commit

  • to_mp() was first introduced with the following commit:
    'commit 801cc4e17a34c ("xfs: debug mode forced buffered write failure")'

    But the user of to_mp() was removed by the commit below:
    'commit f8c47250ba46e ("xfs: convert drop_writes to use the errortag
    mechanism")'

    So a kernel build with clang throws the following warning:

    fs/xfs/xfs_sysfs.c:72:1: warning: unused function 'to_mp' [-Wunused-function]
    to_mp(struct kobject *kobject)

    Hence to_mp() can safely be removed to get rid of the warning message.

    Signed-off-by: Austin Kim
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Austin Kim
     

24 Sep, 2019

4 commits

  • xfs_trans_log_buf takes first byte, last byte as args. In this
    case, it should be from 0 to sizeof() - 1.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
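
    The corrected call shape, with a placeholder structure name; the last
    two arguments are inclusive byte offsets, so a whole structure is
    logged as [0, sizeof() - 1]:

    /* Log the entire on-disk structure held in the buffer. */
    xfs_trans_log_buf(tp, bp, 0, sizeof(struct some_ondisk_hdr) - 1);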
     
  • Revert this commit, as it caused periodic regressions in xfs/173 w/
    1k blocks.

    [1] https://lore.kernel.org/lkml/20190919014602.GN15734@shao2-debian/

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
    Return the value directly instead of using a variable, as the
    variable was never updated.

    Signed-off-by: Aliasgar Surti
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Aliasgar Surti
     
  • The collapse range operation can merge extents if two newly adjacent
    extents are physically contiguous. If the extent count is reduced on
    a btree format inode, a change to extent format might be necessary.
    This format change currently occurs as a side effect of the file
    size update after extents have been shifted for the collapse. This
    codepath ultimately calls xfs_bunmapi(), which happens to check for
    and execute the format conversion even if there were no blocks
    removed from the mapping.

    While this ultimately puts the inode into the correct state, the
    fact that the format conversion occurs in a separate transaction from the
    change that called for it is a problem. If an extent shift
    transaction commits and the filesystem happens to crash before the
    format conversion, the inode fork is left in a corrupted state after
    log recovery. The inode fork verifier fails and xfs_repair
    ultimately nukes the inode. This problem was originally reproduced
    by generic/388.

    Similar to how the insert range extent split code handles extent to
    btree conversion, update the collapse range extent merge code to
    handle btree to extent format conversion in the same transaction
    that merges the extents. This ensures that the inode fork format
    remains consistent if the filesystem happens to crash in the middle
    of a collapse range operation that changes the inode fork format.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

20 Sep, 2019

3 commits

  • Add a new iomap_dio_ops structure that for now just contains the end_io
    handler. This avoids storing the function pointer in a mutable structure,
    which is a possible exploit vector for kernel code execution, and prepares
    for adding a submit_io handler that btrfs needs.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
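
    A rough sketch of the shape of such an ops structure and a const ops
    table (the exact field list and end_io signature are assumptions
    based on the description here and in the next entry, not a copy of
    the merged header):

    struct iomap_dio_ops {
            int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
                          unsigned int flags);
    };

    /* Callers pass a pointer to a const, static table rather than
     * storing a bare function pointer in a mutable structure. */
    static const struct iomap_dio_ops xfs_dio_write_ops = {
            .end_io = xfs_dio_write_end_io,
    };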
     
  • Modify the calling convention for the iomap_dio_rw ->end_io() callback.
    Rather than passing either dio->error or dio->size as the 'size' argument,
    instead pass both the dio->error and the dio->size value separately.

    In the instance that an error occurred during a write, we currently cannot
    determine whether any blocks have been allocated beyond the current EOF and
    data has subsequently been written to these blocks within the ->end_io()
    callback. As a result, we cannot judge whether we should take the truncate
    failed write path. Having both dio->error and dio->size will allow us to
    perform such checks within this callback.

    Signed-off-by: Matthew Bobrowski
    [hch: minor cleanups]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Matthew Wilcox (Oracle)

    Matthew Bobrowski
     
  • Pull y2038 vfs updates from Arnd Bergmann:
    "Add inode timestamp clamping.

    This series from Deepa Dinamani adds a per-superblock minimum/maximum
    timestamp limit for a file system, and clamps timestamps as they are
    written, to avoid random behavior from integer overflow as well as
    having different time stamps on disk vs in memory.

    At mount time, a warning is now printed for any file system that can
    represent current timestamps but not future timestamps more than 30
    years into the future, similar to the arbitrary 30 year limit that was
    added to settimeofday().

    This was picked as a compromise to warn users to migrate to other file
    systems (e.g. ext4 instead of ext3) when they need the file system to
    survive beyond 2038 (or similar limits in other file systems), but not
    get in the way of normal usage"

    * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
    ext4: Reduce ext4 timestamp warnings
    isofs: Initialize filesystem timestamp ranges
    pstore: fs superblock limits
    fs: omfs: Initialize filesystem timestamp ranges
    fs: hpfs: Initialize filesystem timestamp ranges
    fs: ceph: Initialize filesystem timestamp ranges
    fs: sysv: Initialize filesystem timestamp ranges
    fs: affs: Initialize filesystem timestamp ranges
    fs: fat: Initialize filesystem timestamp ranges
    fs: cifs: Initialize filesystem timestamp ranges
    fs: nfs: Initialize filesystem timestamp ranges
    ext4: Initialize timestamps limits
    9p: Fill min and max timestamps in sb
    fs: Fill in max and min timestamps in superblock
    utimes: Clamp the timestamps before update
    mount: Add mount warning for impending timestamp expiry
    timestamp_truncate: Replace users of timespec64_trunc
    vfs: Add timestamp_truncate() api
    vfs: Add file timestamp range support

    Linus Torvalds
     

19 Sep, 2019

2 commits

  • Pull xfs updates from Darrick Wong:
    "For this cycle we have the usual pile of cleanups and bug fixes, some
    performance improvements for online metadata scrubbing, massive
    speedups in the directory entry creation code, some performance
    improvement in the file ACL lookup code, a fix for a logging stall
    during mount, and fixes for concurrency problems.

    It has survived a couple of weeks of xfstests runs and merges cleanly.

    Summary:

    - Remove KM_SLEEP/KM_NOSLEEP.

    - Ensure that memory buffers for IO are properly sector-aligned to
    avoid problems that the block layer doesn't check.

    - Make the bmap scrubber more efficient in its record checking.

    - Don't crash xfs_db when superblock inode geometry is corrupt.

    - Fix btree key helper functions.

    - Remove unneeded error returns for things that can't fail.

    - Fix buffer logging bugs in repair.

    - Clean up iterator return values.

    - Speed up directory entry creation.

    - Enable allocation of xattr value memory buffer during lookup.

    - Fix readahead racing with truncate/punch hole.

    - Other minor cleanups.

    - Fix one AGI/AGF deadlock with RENAME_WHITEOUT.

    - More BUG -> WARN whackamole.

    - Fix various problems with the log failing to advance under certain
    circumstances, which results in stalls during mount"

    * tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits)
    xfs: push the grant head when the log head moves forward
    xfs: push iclog state cleaning into xlog_state_clean_log
    xfs: factor iclog state processing out of xlog_state_do_callback()
    xfs: factor callbacks out of xlog_state_do_callback()
    xfs: factor debug code out of xlog_state_do_callback()
    xfs: prevent CIL push holdoff in log recovery
    xfs: fix missed wakeup on l_flush_wait
    xfs: push the AIL in xlog_grant_head_wake
    xfs: Use WARN_ON_ONCE for bailout mount-operation
    xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT
    xfs: define a flags field for the AG geometry ioctl structure
    xfs: add a xfs_valid_startblock helper
    xfs: remove the unused XFS_ALLOC_USERDATA flag
    xfs: cleanup xfs_fsb_to_db
    xfs: fix the dax supported check in xfs_ioctl_setattr_dax_invalidate
    xfs: Fix stale data exposure when readahead races with hole punch
    fs: Export generic_fadvise()
    mm: Handle MADV_WILLNEED through vfs_fadvise()
    xfs: allocate xattr buffer on demand
    xfs: consolidate attribute value copying
    ...

    Linus Torvalds
     
  • Pull vfs namei updates from Al Viro:
    "Pathwalk-related stuff"

    [ Audit-related cleanups, misc simplifications, and easier to follow
    nd->root refcounts - Linus ]

    * 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    devpts_pty_kill(): don't bother with d_delete()
    infiniband: don't bother with d_delete()
    hypfs: don't bother with d_delete()
    fs/namei.c: keep track of nd->root refcount status
    fs/namei.c: new helper - legitimize_root()
    kill the last users of user_{path,lpath,path_dir}()
    namei.h: get the comments on LOOKUP_... in sync with reality
    kill LOOKUP_NO_EVAL, don't bother including namei.h from audit.h
    audit_inode(): switch to passing AUDIT_INODE_...
    filename_mountpoint(): make LOOKUP_NO_EVAL unconditional there
    filename_lookup(): audit_inode() argument is always 0

    Linus Torvalds
     

06 Sep, 2019

9 commits

  • When the log fills up, we can get into the state where the
    outstanding items in the CIL being committed and aggregated are
    larger than the range that the reservation grant head tail pushing
    will attempt to clean. This can result in the tail pushing range
    being trimmed back to the the log head (l_last_sync_lsn) and so
    may not actually move the push target at all.

    When the iclogs associated with the CIL commit finally land, the
    log head moves forward, and this removes the restriction on the AIL
    push target. However, if we already have transactions sleeping on
    the grant head, and there's nothing in the AIL still to flush from
    the current push target, then nothing will move the tail of the log
    and trigger a log reservation wakeup.

    Hence there is nothing that will trigger xlog_grant_push_ail()
    to recalculate the AIL push target and start pushing on the AIL
    again to write back the metadata objects that pin the tail of the
    log and hence free up space and allow the transaction reservations
    to be woken and make progress.

    Hence we need to push on the grant head when we move the log head
    forward, as this may be the only trigger we have that can move the
    AIL push target forwards in this situation.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • xlog_state_clean_log() is only called from one place, and it occurs
    when an iclog is transitioning back to ACTIVE. Prior to calling
    xlog_state_clean_log, the iclog we are processing has a hard coded
    state check to DIRTY so that xlog_state_clean_log() processes it
    correctly. We also have a hard-coded wakeup after
    xlog_state_clean_log() to ensure that log force waiters on that iclog
    are woken correctly.

    Both of these things are operations required to finish processing an
    iclog and return it to the ACTIVE state again, so they make little
    sense to be separated from the rest of the clean state transition
    code.

    Hence push these things inside xlog_state_clean_log(), document the
    behaviour and rename it xlog_state_clean_iclog() to indicate that
    it's being driven by an iclog state change and does the iclog state
    change work itself.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • The iclog IO completion state processing is somewhat complex, and
    because it's inside two nested loops it is highly indented and very
    hard to read. Factor it out, flatten the logic flow and clean up the
    comments so that it is much easier to see what the code is doing both
    in processing the individual iclogs and in the overall
    xlog_state_do_callback() operation.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Simplify the code flow by lifting the iclog callback work out of
    the main iclog iteration loop. This isolates the log juggling and
    callbacks from the iclog state change logic in the loop.

    Note that the loopdidcallbacks variable is not actually tracking
    whether callbacks are actually run - it is tracking whether the
    icloglock was dropped during the loop and so determines if we
    completed the entire iclog scan loop atomically. Hence we know for
    certain there are either no more ordered completions to run or
    that the next completion will run the remaining ordered iclog
    completions. Hence rename that variable appropriately for its
    function.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Start making this function readable by lifting the debug code into
    a conditional function.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
    generic/530 on a machine with enough RAM and a non-preemptible
    kernel can run the AGI processing phase of log recovery entirely out
    of cache. This means it never blocks on locks, never waits for IO
    and runs entirely through the unlinked lists until it either
    completes or blocks and hangs because it has run out of log space.

    It runs out of log space because the background CIL push is
    scheduled but never runs. queue_work() queues the CIL work on the
    current CPU that is busy, and the workqueue code will not run it on
    any other CPU. Hence if the unlinked list processing never yields
    the CPU voluntarily, the push work is delayed indefinitely. This
    results in the CIL aggregating changes until all the log space is
    consumed.

    When the log recovery processing eventually blocks, the CIL flushes,
    but because the last iclog isn't full it is never submitted for IO,
    so the CIL flush never completes and nothing ever moves the log
    head forwards, or indeed inserts anything into the tail of the log,
    and hence nothing is able to get the log moving again and recovery
    hangs.

    There are several problems here, but the two obvious ones from
    the trace are that:
    a) log recovery does not yield the CPU for over 4 seconds,
    b) binding CIL pushes to a single CPU is a really bad idea.

    This patch addresses just these two aspects of the problem, and is
    suitable for backporting to work around any issues in older kernels.
    The more fundamental problem of preventing the CIL from consuming
    more than 50% of the log without committing will take more invasive
    and complex work, so will be done as followup work.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
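
    One half of the fix described above is to stop binding the CIL push
    work to the submitting CPU. A hedged sketch of allocating the push
    workqueue unbound (the flags other than WQ_UNBOUND and the names are
    illustrative):

    /* An unbound workqueue lets the CIL push run on any CPU, so a
     * recovery thread that never sleeps cannot starve it. */
    cil_wq = alloc_workqueue("xfs-cil/%s",
                             WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND,
                             0, fsname);

    The other half, yielding the CPU periodically during unlinked list
    processing, follows the same cond_resched() pattern sketched earlier
    in this log.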
     
  • The code in xlog_wait uses the spinlock to make adding the task to
    the wait queue, and setting the task state to UNINTERRUPTIBLE atomic
    with respect to the waker.

    Doing the wakeup after releasing the spinlock opens up the following
    race condition:

    Task 1                                task 2
    add task to wait queue
                                          wake up task
    set task state to UNINTERRUPTIBLE

    This issue was found through code inspection as a result of kworkers
    being observed stuck in UNINTERRUPTIBLE state with an empty
    wait queue. It is rare and largely unreproducible.

    Simply moving the spin_unlock to after the wake_up_all results
    in the waker not being able to see a task on the waitqueue before
    it has set its state to UNINTERRUPTIBLE.

    This bug dates back to the conversion of this code to generic
    waitqueue infrastructure from a counting semaphore back in 2008
    which didn't place the wakeups consistently w.r.t. to the relevant
    spin locks.

    [dchinner: Also fix a similar issue in the shutdown path on
    xc_commit_wait. Update commit log with more details of the issue.]

    Fixes: d748c62367eb ("[XFS] Convert l_flushsema to a sv_t")
    Reported-by: Chris Mason
    Signed-off-by: Rik van Riel
    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Rik van Riel
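
    The pattern being fixed, sketched generically (lock and waitqueue
    names are illustrative, not the xlog code); the wakeup must happen
    before the lock protecting the waitqueue is dropped:

    /* racy: the wakeup can slip in between the sleeper adding itself to
     * the queue and setting TASK_UNINTERRUPTIBLE, so the wakeup is lost */
    spin_unlock(&log_lock);
    wake_up_all(&flush_wait);

    /* fixed: issue the wakeup while still holding the lock that the
     * sleeper holds across add_wait_queue() + __set_current_state() */
    wake_up_all(&flush_wait);
    spin_unlock(&log_lock);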
     
  • In the situation where the log is full and the CIL has not recently
    flushed, the AIL push threshold is throttled back to where the
    last write of the head of the log completed. This is stored in
    log->l_last_sync_lsn. Hence if the CIL holds > 25% of the log space
    pinned by flushes and/or aggregation in progress, we can get the
    situation where the head of the log lags a long way behind the
    reservation grant head.

    When this happens, the AIL push target is trimmed back from where
    the reservation grant head wants to push the log tail to, back to
    where the head of the log currently is. This means the push target
    doesn't reach far enough into the log to actually move the tail
    before the transaction reservation goes to sleep.

    When the CIL push completes, it moves the log head forward such that
    the AIL push target can now be moved, but that has no mechanism for
    pushing the log tail. Further, if the next tail movement of the log
    is not large enough to wake the waiter (i.e. still not enough space for
    it to have a reservation granted), we don't wake anything up, and
    hence we do not update the AIL push target to take into account the
    head of the log moving and allowing the push target to be moved
    forwards.

    To avoid this particular condition, if we fail to wake the first
    waiter on the grant head because we don't have enough space,
    push on the AIL again. This will pick up any movement of the log
    head and allow the push target to move forward due to completion of
    CIL pushing.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
    If CONFIG_BUG is enabled, BUG() is executed and the system crashes,
    so the bailout path for the mount never gets to run.

    Using WARN_ON_ONCE rather than BUG prevents this situation.

    Signed-off-by: Austin Kim
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Austin Kim
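
    The shape of the change (the condition and error code shown are
    placeholders): replace the fatal assertion with a warning so the
    mount error path can still run.

    /* before: on CONFIG_BUG kernels this crashes rather than failing
     * the mount */
    if (bad_condition)
            BUG();

    /* after: warn once and let the caller bail out of the mount cleanly */
    if (WARN_ON_ONCE(bad_condition))
            return -EFSCORRUPTED;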
     

04 Sep, 2019

2 commits

    When performing a rename operation with the RENAME_WHITEOUT flag, we
    first hold the AGF lock to allocate or free extents while manipulating
    the dirents, and then make the xfs_iunlink_remove() call last, holding
    the AGI lock to modify the tmpfile info, so we take the locks in the
    order AGF->AGI.

    The big problem here is that we have an ordering constraint on AGF
    and AGI locking - inode allocation locks the AGI, then can allocate
    a new extent for new inodes, locking the AGF after the AGI. Hence
    the ordering that is imposed by other parts of the code is AGI before
    AGF. So we get an ABBA deadlock between the AGI and AGF here.

    Process A:
    Call trace:
    ? __schedule+0x2bd/0x620
    schedule+0x33/0x90
    schedule_timeout+0x17d/0x290
    __down_common+0xef/0x125
    ? xfs_buf_find+0x215/0x6c0 [xfs]
    down+0x3b/0x50
    xfs_buf_lock+0x34/0xf0 [xfs]
    xfs_buf_find+0x215/0x6c0 [xfs]
    xfs_buf_get_map+0x37/0x230 [xfs]
    xfs_buf_read_map+0x29/0x190 [xfs]
    xfs_trans_read_buf_map+0x13d/0x520 [xfs]
    xfs_read_agf+0xa6/0x180 [xfs]
    ? schedule_timeout+0x17d/0x290
    xfs_alloc_read_agf+0x52/0x1f0 [xfs]
    xfs_alloc_fix_freelist+0x432/0x590 [xfs]
    ? down+0x3b/0x50
    ? xfs_buf_lock+0x34/0xf0 [xfs]
    ? xfs_buf_find+0x215/0x6c0 [xfs]
    xfs_alloc_vextent+0x301/0x6c0 [xfs]
    xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
    ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
    xfs_dialloc+0x116/0x290 [xfs]
    xfs_ialloc+0x6d/0x5e0 [xfs]
    ? xfs_log_reserve+0x165/0x280 [xfs]
    xfs_dir_ialloc+0x8c/0x240 [xfs]
    xfs_create+0x35a/0x610 [xfs]
    xfs_generic_create+0x1f1/0x2f0 [xfs]
    ...

    Process B:
    Call trace:
    ? __schedule+0x2bd/0x620
    ? xfs_bmapi_allocate+0x245/0x380 [xfs]
    schedule+0x33/0x90
    schedule_timeout+0x17d/0x290
    ? xfs_buf_find+0x1fd/0x6c0 [xfs]
    __down_common+0xef/0x125
    ? xfs_buf_get_map+0x37/0x230 [xfs]
    ? xfs_buf_find+0x215/0x6c0 [xfs]
    down+0x3b/0x50
    xfs_buf_lock+0x34/0xf0 [xfs]
    xfs_buf_find+0x215/0x6c0 [xfs]
    xfs_buf_get_map+0x37/0x230 [xfs]
    xfs_buf_read_map+0x29/0x190 [xfs]
    xfs_trans_read_buf_map+0x13d/0x520 [xfs]
    xfs_read_agi+0xa8/0x160 [xfs]
    xfs_iunlink_remove+0x6f/0x2a0 [xfs]
    ? current_time+0x46/0x80
    ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
    xfs_rename+0x57a/0xae0 [xfs]
    xfs_vn_rename+0xe4/0x150 [xfs]
    ...

    In this patch we move the xfs_iunlink_remove() call to
    before acquiring the AGF lock to preserve correct AGI/AGF locking
    order.

    Signed-off-by: kaixuxia
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    kaixuxia
     
  • Define a flags field for the AG geometry ioctl structure.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     

03 Sep, 2019

1 commit


31 Aug, 2019

1 commit