17 Apr, 2020

1 commit

  • Move the inode dirty data flushing to a workqueue so that multiple
    threads can take advantage of a single thread's flushing work. The
    ratelimiting technique used in bdd4ee4 was not successful, because
    threads that skipped the inode flush scan due to ratelimiting would
    ENOSPC early, which caused occasional (but noticeable) changes in
    behavior and sporadic fstest regressions.

    Therefore, make all the writer threads wait on a single inode flush,
    which eliminates both the stampeding hordes of flushers and the small
    window in which a write could fail with ENOSPC because it lost the
    ratelimit race even after another thread freed space.

    Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

13 Apr, 2020

2 commits

  • In the reflink extent remap function, it turns out that uirec (the block
    mapping corresponding only to the part of the passed-in mapping that got
    unmapped) was not fully initialized. Specifically, br_state was not
    being copied from the passed-in struct to the uirec. This could lead to
    unpredictable results such as the reflinked mapping being marked
    unwritten in the destination file.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
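
The class of bug described above is easy to illustrate outside the kernel. The following is a minimal sketch, assuming a simplified mapping record; `struct demo_irec`, `enum demo_state` and `trim_mapping()` are hypothetical stand-ins and not the kernel's own types or the actual fix. Starting from a full struct copy means no field can be left uninitialized when only part of the mapping is carved out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a block mapping record. */
enum demo_state { DEMO_NORM = 0, DEMO_UNWRITTEN = 1 };

struct demo_irec {
	uint64_t	startoff;	/* file offset, in blocks */
	uint64_t	startblock;	/* disk block */
	uint64_t	blockcount;	/* length, in blocks */
	enum demo_state	state;		/* written or unwritten */
};

/*
 * Build the record describing the first `len` blocks of `whole`.
 * Copying the whole struct first guarantees that `state` (and any
 * field added later) is carried over before the length is adjusted.
 */
struct demo_irec trim_mapping(const struct demo_irec *whole, uint64_t len)
{
	struct demo_irec part = *whole;	/* copies every field, including state */

	if (len < part.blockcount)
		part.blockcount = len;
	return part;
}
```

Calling `trim_mapping(&whole, 3)` on an eight-block unwritten mapping yields a three-block record that is still marked unwritten.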
     
  • The filesystem freeze sequence in XFS waits on any background
    eofblocks or cowblocks scans to complete before the filesystem is
    quiesced. At this point, the freezer has already stopped the
    transaction subsystem, however, which means a truncate or cowblock
    cancellation in progress is likely blocked in transaction
    allocation. This results in a deadlock between freeze and the
    associated scanner.

    Fix this problem by holding superblock write protection across calls
    into the block reapers. Since protection for background scans is
    acquired from the workqueue task context, trylock to avoid a similar
    deadlock between freeze and blocking on the write lock.

    Fixes: d6b636ebb1c9f ("xfs: halt auto-reclamation activities while rebuilding rmap")
    Reported-by: Paul Furtado
    Signed-off-by: Brian Foster
    Reviewed-by: Chandan Rajendra
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Allison Collins
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

06 Apr, 2020

2 commits

  • Reflink should force the log out to disk if the filesystem was mounted
    with wsync, the same as most other operations in xfs.

    [Note: XFS_MOUNT_WSYNC is set when the admin mounts the filesystem
    with either the 'wsync' or 'sync' mount options, which effectively means
    that we're classifying reflink/dedupe as IO operations and making them
    synchronous when required.]

    Fixes: 3fc9f5e409319 ("xfs: remove xfs_reflink_remap_range")
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    [darrick: add more to the changelog]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Create a new helper to force the log up to the last LSN touching an
    inode.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

02 Apr, 2020

1 commit

  • Qian Cai reports seemingly random buffer read verifier errors during
    filesystem writeback. This was isolated to a recent patch that
    factored out some inode cluster freeing code and happened to cast an
    unsigned inode number type to a signed value. If the inode number
    value overflows, we can skip marking in-core inodes associated with
    the underlying buffer stale at the time the physical inodes are
    freed. If such an inode happens to be dirty, xfsaild will eventually
    attempt to write it back over non-inode blocks. The invalidation of
    the underlying inode buffer causes writeback to read the buffer from
    disk. This fails the read verifier (preventing eventual corruption)
    if the buffer no longer looks like an inode cluster. Analysis by
    Dave Chinner.

    Fix up the helper to use the proper type for inode number values.

    Fixes: 5806165a6663 ("xfs: factor inode lookup from xfs_ifree_cluster")
    Reported-by: Qian Cai
    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
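
To see why the parameter type matters, here is a small userspace sketch. `demo_ino_t`, `key_via_int()` and `key_via_ino()` are hypothetical names, and the sketch assumes a platform where `int` is narrower than 64 bits; it is not the kernel helper itself. Converting an out-of-range value to `int` is implementation-defined in C, but whatever value results cannot equal the original 64-bit inode number, so a lookup keyed on it misses the in-core inode.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_ino_t;	/* hypothetical 64-bit unsigned inode number */

/* Buggy shape: the inode number is squeezed through a signed int parameter. */
demo_ino_t key_via_int(int inum)
{
	return (demo_ino_t)inum;
}

/* Fixed shape: the parameter keeps the full unsigned 64-bit type. */
demo_ino_t key_via_ino(demo_ino_t inum)
{
	return inum;
}
```

Small inode numbers survive either path, which is consistent with the errors looking random: only inode numbers too large for the signed type are affected.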
     

31 Mar, 2020

2 commits

  • The variables 'udqp' and 'gdqp' have been initialized, so remove
    redundant variable assignment in xfs_symlink().

    Signed-off-by: Kaixu Xia
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Dave Chinner
    Signed-off-by: Darrick J. Wong

    Kaixu Xia
     
  • A customer reported rcu stalls and softlockup warnings on a computer
    with many CPU cores and many many more IO threads trying to write to a
    filesystem that is totally out of space. Subsequent analysis pointed to
    the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
    which causes a lot of wb_writeback_work to be queued. The writeback
    worker spends so much time trying to wake the many many threads waiting
    for writeback completion that it trips the softlockup detector, and (in
    this case) the system automatically reboots.

    In addition, they complain that the lengthy xfs_flush_inodes scan traps
    all of those threads in uninterruptible sleep, which hampers their
    ability to kill the program or do anything else to escape the situation.

    If there are thousands of threads trying to write to files on a full
    filesystem, each of those threads will start separate copies of the
    inode flush scan. This is kind of pointless since we only need one
    scan, so rate limit the inode flush.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     

29 Mar, 2020

3 commits

  • If the inode buffer backing a particular inode is locked,
    xfs_iflush() returns -EAGAIN and xfs_inode_item_push() skips the
    inode. It still returns success to xfsaild, however, which bypasses
    the xfsaild backoff heuristic. Update xfs_inode_item_push() to
    return locked status if the inode buffer couldn't be locked.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
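
The status translation can be sketched as follows. The enum values and function names are hypothetical and only loosely modeled on the push results discussed above; the point is that -EAGAIN must surface as a "locked" status so the caller can count the item as stuck instead of treating the pass as fully successful.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical status codes for a single push attempt. */
enum demo_push_status {
	DEMO_ITEM_SUCCESS,
	DEMO_ITEM_LOCKED,
};

/*
 * Translate the result of a flush attempt into a push status.
 * Only the -EAGAIN case from the text is modeled; other error
 * handling is out of scope for this sketch.
 */
enum demo_push_status demo_item_push(int flush_error)
{
	if (flush_error == -EAGAIN)
		return DEMO_ITEM_LOCKED;
	return DEMO_ITEM_SUCCESS;
}

/* Count the items a backoff heuristic would consider stuck in one pass. */
int demo_count_stuck(const int *flush_errors, int n)
{
	int i, stuck = 0;

	for (i = 0; i < n; i++)
		if (demo_item_push(flush_errors[i]) == DEMO_ITEM_LOCKED)
			stuck++;
	return stuck;
}
```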
     
  • A dquot flush currently blocks on the buffer lock for the underlying
    dquot buffer. In turn, this causes xfsaild to block rather than
    continue processing other items in the meantime. Update
    xfs_qm_dqflush() to trylock the buffer, similar to how inode buffers
    are handled, and return -EAGAIN if the lock fails. Fix up any
    callers that don't currently handle the error properly.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
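
A minimal sketch of the trylock pattern, assuming a single-threaded model in which a plain boolean stands in for the buffer lock. `struct demo_buf`, `demo_buf_trylock()` and `demo_dqflush()` are hypothetical; a real lock would need atomic operations.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical buffer whose lock is modeled by a flag (not thread-safe). */
struct demo_buf {
	bool	locked;
};

/* Take the lock only if it is free; never block. */
bool demo_buf_trylock(struct demo_buf *bp)
{
	if (bp->locked)
		return false;
	bp->locked = true;
	return true;
}

/*
 * Flush one item backed by `bp`. If the buffer is busy, return -EAGAIN
 * so the caller can move on to other items and retry this one later.
 */
int demo_dqflush(struct demo_buf *bp, int *flushed)
{
	if (!demo_buf_trylock(bp))
		return -EAGAIN;

	(*flushed)++;		/* stand-in for the actual flush work */
	bp->locked = false;	/* unlock when done */
	return 0;
}
```

Callers then have to treat -EAGAIN as "try again later" rather than as a hard failure, which is the caller fix-up the commit message mentions.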
     
  • Since the "no-allocation" reservations for file creations have
    been removed, the resblks value should be larger than zero, so
    remove the unnecessary ternary conditional.

    Signed-off-by: Kaixu Xia
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    [darrick: s/judgment/ternary/]
    Signed-off-by: Darrick J. Wong

    Kaixu Xia
     

27 Mar, 2020

18 commits

  • In commit f467cad95f5e3, I added the ability to force a recalculation of
    the filesystem summary counters if they seemed incorrect. This was done
    (not entirely correctly) by tweaking the log code to write an unmount
    record without the UMOUNT_TRANS flag set. At next mount, the log
    recovery code will fail to find the unmount record and go into recovery,
    which triggers the recalculation.

    What actually gets written to the log is what ought to be an unmount
    record, but without any flags set to indicate what kind of record it
    actually is. This worked to trigger the recalculation, but we shouldn't
    write bogus log records when we could simply write nothing.

    Fixes: f467cad95f5e3 ("xfs: force summary counter recalc at next mount")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • There's a lot of indentation in this code, which makes it a bit hard to
    follow. We are also going to completely rework the inode lookup code
    as part of the inode reclaim rework, so factor out the inode lookup
    code from the inode cluster freeing code.

    Based on prototype code from Christoph Hellwig.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • We currently wake anything waiting on the log tail to move whenever
    the log item at the tail of the log is removed. Historically this
    was fine behaviour because there were very few items at any given
    LSN. But with delayed logging, there may be thousands of items at
    any given LSN, and we can't move the tail until they are all gone.

    Hence if we are removing them in near tail-first order, we might be
    waking up processes waiting on the tail LSN to change (e.g. log
    space waiters) repeatedly without them being able to make progress.
    This also occurs with the new sync push waiters, and can result in
    thousands of spurious wakeups every second when under heavy direct
    reclaim pressure.

    To fix this, check that the tail LSN has actually changed on the
    AIL before triggering wakeups. This will reduce the number of
    spurious wakeups when doing bulk AIL removal and make this code much
    more efficient.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Allison Collins
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
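
The check can be sketched in userspace as follows. This is an illustrative model only: `struct demo_ail` keeps item LSNs in a small sorted array and counts wakeups instead of waking anything, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_AIL_MAX 16

struct demo_ail {
	uint64_t	lsn[DEMO_AIL_MAX];	/* item LSNs, sorted ascending */
	size_t		count;			/* items currently in the list */
	unsigned int	wakeups;		/* wakeups we would have issued */
};

/* The tail LSN is the lowest LSN still in the list; 0 when empty. */
uint64_t demo_ail_tail(const struct demo_ail *ail)
{
	return ail->count ? ail->lsn[0] : 0;
}

/*
 * Remove the item at `idx` (caller guarantees idx < count) and wake
 * waiters only if the tail LSN actually moved as a result.
 */
void demo_ail_delete(struct demo_ail *ail, size_t idx)
{
	uint64_t old_tail = demo_ail_tail(ail);
	size_t i;

	for (i = idx; i + 1 < ail->count; i++)
		ail->lsn[i] = ail->lsn[i + 1];
	ail->count--;

	if (demo_ail_tail(ail) != old_tail)
		ail->wakeups++;
}
```

With three items at LSN 5 and one at LSN 9, removing the first two items at the tail issues no wakeup; only removing the last LSN 5 item moves the tail to 9 and triggers one.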
     
  • Factor the common AIL deletion code that does all the wakeups into a
    helper so we only have one copy of this somewhat tricky code to
    interface with all the wakeups necessary when the LSN of the log
    tail changes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Allison Collins
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • The XFS inode item slab is actually reclaimed by inode shrinker
    callbacks from the memory reclaim subsystem. These should be marked
    as reclaimable so the mm subsystem has the full picture of how much
    memory it can actually reclaim from the XFS slab caches.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Allison Collins
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • The buffer cache shrinker frees more than just the xfs_buf slab
    objects - it also frees the pages attached to the buffers. Make sure
    the memory reclaim code accounts for this memory being freed
    correctly, similar to how the inode shrinker accounts for pages
    freed from the page cache due to mapping invalidation.

    We also need to make sure that the mm subsystem knows these are
    reclaimable objects. We provide the memory reclaim subsystem with a
    shrinker to reclaim xfs_bufs, so we should really mark the slab
    that way.

    We also have a lot of xfs_bufs in a busy system, so spread them around
    like we do inodes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Allison Collins
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Running metadata intensive workloads, I've been seeing the AIL
    pushing getting stuck on pinned buffers and triggering log forces.
    The log force is taking a long time to run because the log IO is
    getting throttled by wbt_wait() - the block layer writeback
    throttle. It's being throttled because there is a huge amount of
    metadata writeback going on which is filling the request queue.

    IOWs, we have a priority inversion problem here.

    Mark the log IO bios with REQ_IDLE so they don't get throttled
    by the block layer writeback throttle. When we are forcing the CIL,
    we are likely to need to do tens of log IOs, and they are issued as
    fast as they can be built and IO completed. Hence REQ_IDLE is
    appropriate - it's an indication that more IO will follow shortly.

    And because we also set REQ_SYNC, the writeback throttle will now
    treat log IO the same way it treats direct IO writes - it will not
    throttle them at all. Hence we solve the priority inversion problem
    caused by the writeback throttle being unable to distinguish between
    high priority log IO and background metadata writeback.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Allison Collins
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
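
The flag combination described above can be modeled with a small predicate. This is a sketch of the decision as the commit message describes it, not the block layer's code: the `DEMO_REQ_*` bit values are made up, and `demo_should_throttle_write()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the names echo the flags discussed above. */
#define DEMO_REQ_SYNC	(1u << 0)
#define DEMO_REQ_IDLE	(1u << 1)

/*
 * Model of the throttling decision for a write: a write carrying both
 * flags is treated like a direct IO write and is not throttled, while
 * any other write remains subject to throttling.
 */
bool demo_should_throttle_write(unsigned int flags)
{
	unsigned int both = DEMO_REQ_SYNC | DEMO_REQ_IDLE;

	return (flags & both) != both;
}
```

In this model a log write that only sets the sync flag is still throttled alongside background metadata writeback; adding the idle flag is what exempts it.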
     
  • In certain situations the background CIL push can be indefinitely
    delayed. While we have workarounds for the obvious cases now, they
    don't solve the underlying issue. This issue is that there is no
    upper limit on the CIL where we will either force or wait for
    a background push to start, hence allowing the CIL to grow without
    bound until it consumes all log space.

    To fix this, add a new wait queue to the CIL which allows background
    pushes to wait for the CIL context to be switched out. This happens
    when the push starts, so it will allow us to block incoming
    transaction commit completion until the push has started. This will
    only affect processes that are running modifications, and only when
    the CIL threshold has been significantly overrun.

    This has no apparent impact on performance, and doesn't even trigger
    until over 45 million inodes had been created in a 16-way fsmark
    test on a 2GB log. That was limiting at 64MB of log space used, so
    the active CIL size is only about 3% of the total log in that case.
    The concurrent removal of those files did not trigger the background
    sleep at all.

    Signed-off-by: Dave Chinner
    Reviewed-by: Allison Collins
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • The current CIL size aggregation limit is 1/8th the log size. This
    means for large logs we might be aggregating at least 250MB of dirty objects
    in memory before the CIL is flushed to the journal. With CIL shadow
    buffers sitting around, this means the CIL is often consuming >500MB
    of temporary memory that is all allocated under GFP_NOFS conditions.

    Flushing the CIL can take some time to do if there is other IO
    ongoing, and can introduce substantial log force latency by itself.
    It also pins the memory until the objects are in the AIL and can be
    written back and reclaimed by shrinkers. Hence this threshold also
    tends to determine the minimum amount of memory XFS can operate in
    under heavy modification without triggering the OOM killer.

    Modify the CIL space limit to prevent such huge amounts of pinned
    metadata from aggregating. We can have 2MB of log IO in flight at
    once, so limit aggregation to 16x this size. This threshold was
    chosen as it has little impact on performance (on 16-way fsmark) or log
    traffic but pins a lot less memory on large logs especially under
    heavy memory pressure. An aggregation limit of 8x had 5-10%
    performance degradation and a 50% increase in log throughput for
    the same workload, so clearly that was too small for highly
    concurrent workloads on large logs.

    This was found via trace analysis of AIL behaviour. e.g. insertion
    from a single CIL flush:

    xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL

    $ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
    1721823
    $

    So there were 1.7 million objects inserted into the AIL from this
    CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which
    was the end of the trace (i.e. it hadn't finished). Clearly a major
    problem.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Allison Collins
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
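
The arithmetic behind the old and new limits can be worked through with a short sketch. The function names are hypothetical, and capping the new limit at the old 1/8th value for small logs is an assumption made here so that the limit never exceeds what the log could hold; the commit message itself only states the 1/8th and 16 x 2MB figures.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MiB		(1024ull * 1024ull)
#define DEMO_MAX_LOG_IO		(2 * DEMO_MiB)	/* log IO in flight, per the text */

/* Old aggregation limit: one eighth of the log size. */
uint64_t demo_cil_limit_old(uint64_t log_size)
{
	return log_size / 8;
}

/*
 * New aggregation limit: 16x the in-flight log IO size.
 * Assumption: still capped by the old 1/8th limit on small logs.
 */
uint64_t demo_cil_limit_new(uint64_t log_size)
{
	uint64_t cap = 16 * DEMO_MAX_LOG_IO;
	uint64_t eighth = log_size / 8;

	return eighth < cap ? eighth : cap;
}
```

For a 2GB log the old limit works out to 256MB, in line with the "at least 250MB" figure above, while the new limit is 32MB.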
     
  • Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Separate out the unmount record writing from the rest of the
    ticket and log state futzing necessary to make it work. This is
    a no-op, just makes the code cleaner and places the unmount record
    formatting and writing alongside the commit record formatting and
    writing code.

    We can also get rid of the ticket flag clearing before the
    xlog_write() call because it no longer cares about the state of
    XLOG_TIC_INITED.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • xlog_write_done() is just a thin wrapper around xlog_commit_record(), so
    they can be merged together easily.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Remove xlog_ticket_done and just call the renamed low-level helpers for
    ungranting or regranting log space directly. To make that a little
    easier, the reference put on the ticket and all tracing are moved into
    the actual helpers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • It is no longer used or checked by anything, so remove the last
    traces from the log ticket code.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • xfs_log_done() does two separate things. Firstly, it triggers commit
    records to be written for permanent transactions, and secondly it
    releases or regrants transaction reservation space.

    Since delayed logging was introduced, transactions no longer write
    directly to the log, hence they never have the XLOG_TIC_INITED flag
    cleared on them. Hence transactions never write commit records to
    the log and only need to modify reservation space.

    Split up xfs_log_done into two parts, and only call the parts of the
    operation needed for the context xfs_log_done() is currently being
    called from.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Commit and unmount records do not need start records to be
    written, so rearrange the logic in xlog_write() to remove the need
    to check for XLOG_TIC_INITED to determine if we should account for
    the space used by a start record.

    Signed-off-by: Dave Chinner
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • The xlog_write() function iterates over iclogs until it completes
    writing all the log vectors passed in. The ticket tracks whether
    a start record has been written or not, so only the first iclog gets
    a start record. We only ever pass single use tickets to
    xlog_write() so we only ever need to write a start record once per
    xlog_write() call.

    Hence we don't need to store whether we should write a start record
    in the ticket as the callers provide all the information we need to
    determine if a start record should be written. For the moment, we
    have to ensure that we clear the XLOG_TIC_INITED appropriately so
    the code in xfs_log_done() still works correctly for committing
    transactions.

    (darrick: Note the slight behavior change that we always deduct the
    size of the op header from the ticket, even for unmount records)

    Signed-off-by: Dave Chinner
    [hch: pass an explicit need_start_rec argument]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Validate the realtime geometry when we mount the
    filesystem, so that we don't abruptly shut down the filesystem later on.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     

26 Mar, 2020

5 commits

  • I noticed that fsfreeze can take a very long time to freeze an XFS if
    there happens to be a GETFSMAP caller running in the background. I also
    happened to notice the following in dmesg:

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs]
    Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs]
    CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
    RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs]
    Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54
    RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff
    RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000
    R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8
    FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0
    Call Trace:
    xfs_fs_freeze+0x25/0x40 [xfs]
    freeze_super+0xc8/0x180
    do_vfs_ioctl+0x70b/0x750
    ? __fget_files+0x135/0x210
    ksys_ioctl+0x3a/0xb0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x50/0x1a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    These two things appear to be related. The assertion trips when another
    thread initiates a fsmap request (which uses an empty transaction) after
    the freezer waited for m_active_trans to hit zero but before the
    freezer executes the WARN_ON just prior to calling xfs_log_quiesce.

    The lengthy delays in freezing happen because the freezer calls
    xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the
    GETFSMAP caller is continuing to grab and release buffers, which means
    that it can take a very long time for the buffer lru list to empty out.

    We fix both of these races by calling sb_start_write to obtain freeze
    protection while using empty transactions for GETFSMAP and for metadata
    scrubbing. The other two users occur during mount, during which time we
    cannot fs freeze.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     
  • If the bio_add_page() call fails, we proceed to write out a
    partially constructed log buffer. This corrupts the physical log
    such that log recovery is not possible. Worse, persistent
    occurrences of this error eventually lead to a BUG_ON() failure in
    bio_split() as iclogs wrap the end of the physical log, which
    triggers log recovery on subsequent mount.

    Rather than warn about writing out a corrupted log buffer, shutdown
    the fs as is done for any log I/O related error. This preserves the
    consistency of the physical log such that log recovery succeeds on a
    subsequent mount. Note that this was observed on a 64k page debug
    kernel without upstream commit 59bb47985c1d ("mm, sl[aou]b:
    guarantee natural alignment for kmalloc(power-of-two)"), which
    demonstrated frequent iclog bio overflows due to unaligned (slab
    allocated) iclog data buffers.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • When we're checking bestfree information in directory blocks, always
    drop the block buffer at the end of the function. We should always
    release resources when we're done using them.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
     
  • The dirattr btree checking code uses the altpath substructure of the
    dirattr state structure to check the sibling pointers of dir/attr tree
    blocks. At the end of sibling checks, xfs_da3_path_shift could have
    changed multiple levels of buffer pointers in the altpath structure.
    Although we release the leaf level buffer, this isn't enough -- we also
    need to release the node buffers that are unique to the altpath.

    Not releasing all of the altpath buffers leaves them locked to the
    transaction. This is suboptimal because we should release resources
    when we don't need them anymore. Fix the function to loop all levels of
    the altpath, and fix the return logic so that we always run the loop.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Darrick J. Wong
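
The release loop can be sketched as follows, assuming a simplified model in which each buffer carries a hold count. `struct demo_buf`, `struct demo_path` and `demo_release_altpath()` are hypothetical names and not the kernel's dir/attr state structures. The key point is that every level of the alternate path is visited, and a buffer is released only if the main path does not reference the same buffer at that level.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DEPTH 5

/* Hypothetical buffer handle with a hold count. */
struct demo_buf {
	int	holds;
};

struct demo_path {
	struct demo_buf	*bp[DEMO_MAX_DEPTH];	/* one buffer per tree level */
	int		active;			/* number of levels in use */
};

/*
 * Release every altpath buffer that is not shared with the main path.
 * Returns the number of buffers released.
 */
int demo_release_altpath(struct demo_path *alt, const struct demo_path *path)
{
	int level, released = 0;

	for (level = 0; level < alt->active; level++) {
		struct demo_buf *bp = alt->bp[level];

		/* Nothing held here, or the main path still owns it. */
		if (bp == NULL ||
		    (level < path->active && bp == path->bp[level]))
			continue;

		bp->holds--;
		alt->bp[level] = NULL;
		released++;
	}
	return released;
}
```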
     
  • When quotacheck runs, it zeroes all the timer fields in every dquot.
    Unfortunately, it also does this to the root dquot, which erases any
    preconfigured grace intervals and warning limits that the administrator
    may have set. Worse yet, the incore copies of those variables remain
    set. This cache coherence problem manifests itself as the grace
    interval mysteriously being reset back to the defaults at the /next/
    mount.

    Fix it by not resetting the root disk dquot's timer and warning fields.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
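
A minimal sketch of the fix, assuming a simplified dquot record. `struct demo_dquot` and `demo_quotacheck_reset()` are hypothetical, and the usage counter is included only to show that some fields are still reset for every dquot. For id 0 the timer and warning fields hold the administrator's default grace interval and warning limit, so they are left alone.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified on-disk dquot. */
struct demo_dquot {
	uint32_t	id;	/* 0 is the root dquot holding the defaults */
	uint64_t	bcount;	/* usage counter, recomputed by quotacheck */
	uint32_t	btimer;	/* grace timer, or default interval for id 0 */
	uint16_t	bwarns;	/* warning count, or default limit for id 0 */
};

/* Reset a dquot before recounting, preserving the root's defaults. */
void demo_quotacheck_reset(struct demo_dquot *dq)
{
	dq->bcount = 0;

	/* The root dquot's timer and warning fields are configuration. */
	if (dq->id == 0)
		return;

	dq->btimer = 0;
	dq->bwarns = 0;
}
```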
     

23 Mar, 2020

6 commits

  • Open code the xlog_state_want_sync logic in its two callers given that
    this function is a trivial wrapper around xlog_state_switch_iclogs.

    Move the lockdep assert into xlog_state_switch_iclogs to not lose this
    debugging aid, and improve the comment that documents
    xlog_state_switch_iclogs as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Use the shutdown flag in the log to bypass xlog_state_clean_iclog
    entirely in case of a shut down log.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Factor out a few self-contained helpers from xlog_state_clean_iclog, and
    update the documentation so it primarily documents why things happen
    instead of how.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • We can just check for a shut down log all the way down in
    xlog_cil_committed instead of passing the parameter. This means a
    slight behavior change in that we now also abort log items if the
    shutdown came in halfway into the I/O completion processing, which
    actually is the right thing to do.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • There is no need to check for the ioerror state before the lock, as
    the shutdown case is not a fast path. Also remove the call to force
    shutdown the file system, as it must have been shut down already
    for an iclog to be in the ioerror state. Also clean up the flow of
    the function a bit.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • The only caller of xfs_log_release_iclog doesn't care about the return
    value, so remove it. Also don't bother passing the mount pointer,
    given that we can trivially derive it from the iclog.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig