05 Nov, 2020

2 commits

  • In commit 7588cbeec6df, we tried to fix a race stemming from the lack of
    coordination between higher level code that wants to allocate and remap
    CoW fork extents into the data fork. Christoph cites as examples the
    always_cow mode, and a directio write completion racing with writeback.

    According to the comments before the goto retry, we want to restart the
    lookup to catch the extent in the data fork, but we don't actually reset
    whichfork or cow_fsb, which means the second try executes using stale
    information. Up until now I think we've gotten lucky that either
    there's something left in the CoW fork to cause cow_fsb to be reset, or
    the data/cow fork sequence numbers have advanced enough to force a
    fresh lookup from the data fork. However, if we reach the retry with an
    empty stable CoW fork and a stable data fork, neither of those things
    happens. The retry foolishly re-calls xfs_convert_blocks on the CoW
    fork, which fails again. This time, we toss the write.
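    The missing reset can be pictured with a small userspace sketch. All
    names here are illustrative stand-ins, not the kernel code: the point
    is only that the retry must clear the cached whichfork/cow_fsb state
    before looping back, or the second pass repeats the failed CoW-fork
    lookup.

```c
#include <assert.h>
#include <stdbool.h>

#define NULLFILEOFF (-1L)
enum fork { XFS_COW_FORK, XFS_DATA_FORK };

/* The two pieces of cached lookup state the fix resets. */
struct wb_ctx {
    enum fork whichfork;
    long cow_fsb;
};

/*
 * Illustrative retry loop (not the kernel code): the first pass may
 * target the CoW fork; if nothing is found there, the retry has to
 * search the data fork with fresh state instead of reusing the stale
 * whichfork/cow_fsb values. 'cow_has'/'data_has' stand in for the
 * real extent lookups.
 */
long map_blocks(struct wb_ctx *ctx, long offset, bool cow_has, bool data_has)
{
    bool retried = false;

retry:
    if (ctx->whichfork == XFS_COW_FORK && ctx->cow_fsb != NULLFILEOFF) {
        if (cow_has)
            return ctx->cow_fsb;       /* CoW conversion succeeded */
    } else if (data_has) {
        return offset;                 /* found in the data fork */
    }
    if (!retried) {
        retried = true;
        /* the fix: reset both fields before the second lookup */
        ctx->whichfork = XFS_DATA_FORK;
        ctx->cow_fsb = NULLFILEOFF;
        goto retry;
    }
    return NULLFILEOFF;                /* nothing to write back */
}
```

    Without the two reset lines, the retry re-enters the CoW-fork branch
    with the stale cow_fsb and fails the same way a second time.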

    I've recently been working on extending reflink to the realtime device.
    When the realtime extent size is larger than a single block, we have to
    force the page cache to CoW the entire rt extent if a write (or
    fallocate) is not aligned with the rt extent size. The strategy I've
    chosen to deal with this is derived from Dave's blocksize > pagesize
    series: dirtying around the write range, and ensuring that writeback
    always starts mapping on an rt extent boundary. This has brought this
    race front and center, since generic/522 blows up immediately.
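    The alignment step reduces to simple rounding arithmetic, sketched
    here with made-up helper names (the kernel has its own helpers for
    this): the dirtied range is rounded out so writeback always starts
    and ends on an rt extent boundary.

```c
/*
 * Illustrative helpers (names invented for this sketch): when the
 * realtime extent size spans several filesystem blocks, a CoW has to
 * cover the whole rt extent, so the range to dirty is rounded out to
 * rt-extent boundaries, in units of filesystem blocks.
 */
unsigned long long rtext_round_down(unsigned long long fsb,
                                    unsigned long long extsz)
{
    return fsb - (fsb % extsz);
}

unsigned long long rtext_round_up(unsigned long long fsb,
                                  unsigned long long extsz)
{
    return ((fsb + extsz - 1) / extsz) * extsz;
}
```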

    However, I'm pretty sure this is a bug outright, independent of that.

    Fixes: 7588cbeec6df ("xfs: retry COW fork delalloc conversion when no extent was found")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • iomap writeback mapping failure only calls into ->discard_page() if
    the current page has not been added to the ioend. Accordingly, the
    XFS callback assumes a full page discard and invalidation. This is
    problematic for sub-page block size filesystems where some portion
    of a page might have been mapped successfully before a failure to
    map a delalloc block occurs. ->discard_page() is not called in that
    error scenario and the bio is explicitly failed by iomap via the
    error return from ->prepare_ioend(). As a result, the filesystem
    leaks delalloc blocks and corrupts the filesystem block counters.

    Since XFS is the only user of ->discard_page(), tweak the semantics
    to invoke the callback unconditionally on mapping errors and provide
    the file offset that failed to map. Update xfs_discard_page() to
    discard the corresponding portion of the file and pass the range
    along to iomap_invalidatepage(). The latter already properly handles
    both full and sub-page scenarios by not changing any iomap or page
    state on sub-page invalidations.
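    The sub-page case is plain arithmetic, sketched below with an
    invented helper name (not the kernel functions): blocks before the
    failing offset were already added to the ioend and stay there; only
    the tail of the page is discarded.

```c
/*
 * Illustrative arithmetic (not the kernel helpers): with sub-page
 * blocks, a page holds psize/bsize blocks. If mapping fails at file
 * offset 'fail' within the page starting at 'pstart', the blocks
 * before 'fail' were already added to the ioend, and only the tail
 * [fail, pstart + psize) needs to be discarded.
 */
unsigned int blocks_to_discard(unsigned long long pstart,
                               unsigned int psize, unsigned int bsize,
                               unsigned long long fail)
{
    return (unsigned int)((pstart + psize - fail) / bsize);
}
```

    A failure at the very start of the page (fail == pstart) degenerates
    to the old full-page discard.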

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

21 Sep, 2020

1 commit

  • This helper is useful for both THPs and for supporting block size larger
    than page size. Convert all users that I could find (we have a few
    different ways of writing this idiom, and I may have missed some).

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Acked-by: Dave Kleikamp

    Matthew Wilcox (Oracle)
     

03 Jun, 2020

2 commits

  • Pull xfs updates from Darrick Wong:
    "Most of the changes this cycle are refactoring of existing code in
    preparation for things landing in the future.

    We also fixed various problems and deficiencies in the quota
    implementation, and (I hope) the last of the stale read vectors by
    forcing write allocations to go through the unwritten state until the
    write completes.

    Summary:

    - Various cleanups to remove dead code, unnecessary conditionals,
    asserts, etc.

    - Fix a linker warning caused by xfs stuffing '-g' into CFLAGS
    redundantly.

    - Tighten up our dmesg logging to ensure that everything is prefixed
    with 'XFS' for easier grepping.

    - Kill a bunch of typedefs.

    - Refactor the deferred ops code to reduce indirect function calls.

    - Increase type-safety with the deferred ops code.

    - Make the DAX mount options a tri-state.

    - Fix some error handling problems in the inode flush code and clean
    up other inode flush warts.

    - Refactor log recovery so that each log item recovery functions now
    live with the other log item processing code.

    - Fix some SPDX forms.

    - Fix quota counter corruption if the fs crashes after running
    quotacheck but before any dquots get logged.

    - Don't fail metadata verification on zero-entry attr leaf blocks,
    since they're just part of the disk format now due to a historic
    lack of log atomicity.

    - Don't allow SWAPEXT between files with different [ugp]id when
    quotas are enabled.

    - Refactor inode fork reading and verification to run directly from
    the inode-from-disk function. This means that we now actually
    guarantee that _iget'ted inodes are totally verified and ready to
    go.

    - Move the incore inode fork format and extent counts to the ifork
    structure.

    - Scalability improvements by reducing cacheline pingponging in
    struct xfs_mount.

    - More scalability improvements by removing m_active_trans from the
    hot path.

    - Fix inode counter update sanity checking to run /only/ on debug
    kernels.

    - Fix longstanding inconsistency in what error code we return when a
    program hits project quota limits (ENOSPC).

    - Fix group quota returning the wrong error code when a program hits
    group quota limits.

    - Fix per-type quota limits and grace periods for group and project
    quotas so that they actually work.

    - Allow extension of individual grace periods.

    - Refactor the non-reclaim inode radix tree walking code to remove a
    bunch of stupid little functions and straighten out the
    inconsistent naming schemes.

    - Fix a bug in speculative preallocation where we measured a new
    allocation based on the last extent mapping in the file instead of
    looking farther for the last contiguous space allocation.

    - Force delalloc writes to unwritten extents. This closes a stale
    disk contents exposure vector if the system goes down before the
    write completes.

    - More lockdep whackamole"

    * tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (129 commits)
    xfs: more lockdep whackamole with kmem_alloc*
    xfs: force writes to delalloc regions to unwritten
    xfs: refactor xfs_iomap_prealloc_size
    xfs: measure all contiguous previous extents for prealloc size
    xfs: don't fail unwritten extent conversion on writeback due to edquot
    xfs: rearrange xfs_inode_walk_ag parameters
    xfs: straighten out all the naming around incore inode tree walks
    xfs: move xfs_inode_ag_iterator to be closer to the perag walking code
    xfs: use bool for done in xfs_inode_ag_walk
    xfs: fix inode ag walk predicate function return values
    xfs: refactor eofb matching into a single helper
    xfs: remove __xfs_icache_free_eofblocks
    xfs: remove flags argument from xfs_inode_ag_walk
    xfs: remove xfs_inode_ag_iterator_flags
    xfs: remove unused xfs_inode_ag_iterator function
    xfs: replace open-coded XFS_ICI_NO_TAG
    xfs: move eofblocks conversion function to xfs_ioctl.c
    xfs: allow individual quota grace period extension
    xfs: per-type quota timers and warn limits
    xfs: switch xfs_get_defquota to take explicit type
    ...

    Linus Torvalds
     
  • Use the new readahead operation in iomap. Convert XFS and ZoneFS to use
    it.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: John Hubbard
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

03 Mar, 2020

1 commit

  • Use printk_ratelimit() to limit the number of messages printed from
    xfs_discard_page. Without that, a failing device causes a flood of
    error messages that don't really help with debugging the underlying
    issue.
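    The idea behind printk_ratelimit() can be sketched in userspace
    (illustrative struct and field names, not the kernel implementation):
    allow a burst of messages per interval and suppress the rest until
    the interval elapses.

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative rate limiter: at most 'burst' messages per 'interval'. */
struct ratelimit {
    time_t interval;   /* seconds */
    int burst;         /* messages allowed per interval */
    time_t begin;      /* start of the current window */
    int printed;       /* messages emitted in this window */
};

bool ratelimit_ok(struct ratelimit *rl, time_t now)
{
    if (now - rl->begin >= rl->interval) {
        rl->begin = now;       /* window expired: start a new one */
        rl->printed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;           /* caller may print */
    }
    return false;              /* caller skips the log message */
}
```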

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

04 Jan, 2020

1 commit

  • As of now dax_writeback_mapping_range() takes "struct block_device" as a
    parameter and dax_dev is searched from bdev name. This also involves taking
    a fresh reference on dax_dev and putting that reference at the end of
    the function.

    We are developing a new filesystem virtio-fs and using dax to access host
    page cache directly. But there is no block device. IOW, we want to make
    use of dax but want to get rid of this assumption that there is always
    a block device associated with dax_dev.

    So pass in "struct dax_device" as parameter instead of bdev.

    ext2/ext4/xfs are current users and they already have a reference on
    dax_device. So there is no need to take and drop a reference to the
    dax_device on each call of this function.

    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Vivek Goyal
    Link: https://lore.kernel.org/r/20200103183307.GB13350@redhat.com
    Signed-off-by: Dan Williams

    Vivek Goyal
     

28 Oct, 2019

1 commit

  • Add a new xfs_inode_buftarg helper that gets the data I/O buftarg for a
    given inode. Replace the existing xfs_find_bdev_for_inode and
    xfs_find_daxdev_for_inode helpers with this new general one and clean
    up some of the callers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

21 Oct, 2019

6 commits

  • Take the xfs writeback code and move it to fs/iomap. A new structure
    with three methods is added as the abstraction from the generic writeback
    code to the file system. These methods are used to map blocks, submit an
    ioend, and cancel a page that encountered an error before it was added to
    an ioend.

    Signed-off-by: Christoph Hellwig
    [darrick: rename ->submit_ioend to ->prepare_ioend to clarify what it
    does]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner

    Christoph Hellwig
     
  • Lift the xfs code for tracing address space operations to the iomap
    layer.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • In preparation for moving the writeback code to iomap.c, replace the
    XFS-specific COW fork concept with the iomap IOMAP_F_SHARED flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • In preparation for moving the ioend structure to common code we need
    to get rid of the xfs-specific xfs_trans type. Just make it a file
    system private void pointer instead.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Introduce two nicely abstracted helpers, which can be moved to the iomap
    code later. Also use list_first_entry_or_null to simplify the code a
    bit.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • In preparation for moving the XFS writeback code to fs/iomap.c, switch
    it to use struct iomap instead of the XFS-specific struct xfs_bmbt_irec.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

16 Jul, 2019

1 commit

  • Pull more block updates from Jens Axboe:
    "A later pull request with some followup items. I had some vacation
    coming up to the merge window, so certain items were delayed a
    bit. This pull request also contains fixes that came in within the
    last few days of the merge window, which I didn't want to push right
    before sending you a pull request.

    This contains:

    - NVMe pull request, mostly fixes, but also a few minor items on the
    feature side that were timing constrained (Christoph et al)

    - Report zones fixes (Damien)

    - Removal of dead code (Damien)

    - Turn on cgroup psi memstall (Josef)

    - block cgroup MAINTAINERS entry (Konstantin)

    - Flush init fix (Josef)

    - blk-throttle low iops timing fix (Konstantin)

    - nbd resize fixes (Mike)

    - nbd 0 blocksize crash fix (Xiubo)

    - block integrity error leak fix (Wenwen)

    - blk-cgroup writeback and priority inheritance fixes (Tejun)"

    * tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
    MAINTAINERS: add entry for block io cgroup
    null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
    block: Limit zone array allocation size
    sd_zbc: Fix report zones buffer allocation
    block: Kill gfp_t argument of blkdev_report_zones()
    block: Allow mapping of vmalloc-ed buffers
    block/bio-integrity: fix a memory leak bug
    nvme: fix NULL deref for fabrics options
    nbd: add netlink reconfigure resize support
    nbd: fix crash when the blksize is zero
    block: Disable write plugging for zoned block devices
    block: Fix elevator name declaration
    block: Remove unused definitions
    nvme: fix regression upon hot device removal and insertion
    blk-throttle: fix zero wait time for iops throttled group
    block: Fix potential overflow in blk_report_zones()
    blkcg: implement REQ_CGROUP_PUNT
    blkcg, writeback: Implement wbc_blkcg_css()
    blkcg, writeback: Add wbc->no_cgroup_owner
    blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
    ...

    Linus Torvalds
     

13 Jul, 2019

1 commit

  • Pull xfs updates from Darrick Wong:
    "In this release there is a significant amount of consolidation and
    cleanup in the log code; restructuring of the log to issue struct
    bios directly; new bulkstat ioctls to return v5 fs inode information
    (and fix all the padding problems of the old ioctl); the beginnings of
    multithreaded inode walks (e.g. quotacheck); and a reduction in memory
    usage in the online scrub code leading to reduced runtimes.

    - Refactor inode geometry calculation into a single structure instead
    of open-coding pieces everywhere.

    - Add online repair to build options.

    - Remove unnecessary function call flags and functions.

    - Claim maintainership of various loose xfs documentation and header
    files.

    - Use struct bio directly for log buffer IOs instead of struct
    xfs_buf.

    - Reduce log item boilerplate code requirements.

    - Merge log item code spread across too many files.

    - Further distinguish between log item commits and cancellations.

    - Various small cleanups to the ag small allocator.

    - Support cgroup-aware writeback

    - libxfs refactoring for mkfs cleanup

    - Remove unneeded #includes

    - Fix a memory allocation miscalculation in the new log bio code

    - Fix bisection problems

    - Fix a crash in ioend processing caused by tripping over freeing of
    preallocated transactions

    - Split out a generic inode walk mechanism from the bulkstat code,
    hook up all the internal users to use the walking code, then clean
    up bulkstat to serve only the bulkstat ioctls.

    - Add a multithreaded iwalk implementation to speed up quotacheck on
    fast storage with many CPUs.

    - Remove unnecessary return values in logging teardown functions.

    - Supplement the bstat and inogrp structures with new bulkstat and
    inumbers structures that have all the fields we need for v5
    filesystem features and none of the padding problems of their
    predecessors.

    - Wire up new ioctls that use the new structures with a much simpler
    bulk_ireq structure at the head instead of the pointerhappy mess we
    had before.

    - Enable userspace to constrain bulkstat returns to a single AG or a
    single special inode so that we can phase out a lot of geometry
    guesswork in userspace.

    - Reduce memory consumption and zeroing overhead in extended
    attribute scrub code.

    - Fix some behavioral regressions in the new bulkstat backend code.

    - Fix some behavioral regressions in the new log bio code"

    * tag 'xfs-5.3-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (100 commits)
    xfs: chain bios the right way around in xfs_rw_bdev
    xfs: bump INUMBERS cursor correctly in xfs_inumbers_walk
    xfs: don't update lastino for FSBULKSTAT_SINGLE
    xfs: online scrub needn't bother zeroing its temporary buffer
    xfs: only allocate memory for scrubbing attributes when we need it
    xfs: refactor attr scrub memory allocation function
    xfs: refactor extended attribute buffer pointer functions
    xfs: attribute scrub should use seen_enough to pass error values
    xfs: allow single bulkstat of special inodes
    xfs: specify AG in bulk req
    xfs: wire up the v5 inumbers ioctl
    xfs: wire up new v5 bulkstat ioctls
    xfs: introduce v5 inode group structure
    xfs: introduce new v5 bulkstat structure
    xfs: rename bulkstat functions
    xfs: remove various bulk request typedef usage
    fs: xfs: xfs_log: Change return type from int to void
    xfs: poll waiting for quotacheck
    xfs: multithreaded iwalk implementation
    xfs: refactor INUMBERS to use iwalk functions
    ...

    Linus Torvalds
     

01 Jul, 2019

5 commits

  • 'bio->bi_iter.bi_size' is 'unsigned int', which can hold at most
    4G - 1 bytes.

    Before 07173c3ec276 ("block: enable multipage bvecs"), one bio could
    include only a limited number of pages, usually at most 256, so the
    fs bio size wasn't bigger than 1M bytes most of the time.

    Since we support multi-page bvecs, in theory more than 1M pages can
    be added to one fs bio, especially in the case of hugepages or big
    writeback with many dirty pages. Then there is a chance that
    .bi_size overflows.

    Fix this issue by using bio_full() to check whether the added
    segment may overflow .bi_size.
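    The size half of that check is just unsigned arithmetic, shown here
    as a standalone sketch (bio_full() in the kernel also checks the
    bvec slot count, which this doesn't model):

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Sketch of the overflow test: bi_size is an unsigned int, so before
 * adding a segment of 'len' bytes we must verify the sum still fits.
 * Comparing against UINT_MAX - bi_size avoids the wraparound that a
 * naive 'bi_size + len > UINT_MAX' would hit.
 */
bool would_overflow_bi_size(unsigned int bi_size, unsigned int len)
{
    return len > UINT_MAX - bi_size;
}
```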

    Cc: Liu Yiding
    Cc: kernel test robot
    Cc: "Darrick J. Wong"
    Cc: linux-xfs@vger.kernel.org
    Cc: linux-fsdevel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: 07173c3ec276 ("block: enable multipage bvecs")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Instead of a magic flag for xfs_trans_alloc, just ensure all callers
    that can't reclaim through the file system use memalloc_nofs_save to
    set the per-task nofs flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Compare the block layer status directly instead of converting it to
    an errno first.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • There is no real problem merging ioends that go beyond i_size into an
    ioend that doesn't. We just need to move the append transaction to the
    base ioend. Also use the opportunity to use a real error code instead
    of the magic 1 to cancel the transactions, and write a comment
    explaining the scheme.
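    The ownership hand-off can be illustrated with a minimal sketch
    (struct and function names are invented; the real ioend carries much
    more state): when the ioend that extends beyond i_size is merged
    into a base ioend, its size-update transaction moves to the base so
    it still runs at completion.

```c
#include <stddef.h>

struct trans { int id; };                 /* stand-in for xfs_trans */
struct ioend_sketch { struct trans *append_trans; };

/*
 * Illustrative merge step: at most one of the two ioends carries an
 * append transaction; if the merged-in one does, hand it to the base
 * so the i_size update is not lost when the merged ioend goes away.
 */
void merge_append_trans(struct ioend_sketch *base, struct ioend_sketch *merged)
{
    if (merged->append_trans) {
        base->append_trans = merged->append_trans;
        merged->append_trans = NULL;
    }
}
```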

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • The fail argument is long gone, update the comment.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

29 Jun, 2019

3 commits

  • There are many, many xfs header files which are included but
    unneeded (or included twice) in the xfs code, so remove them.

    nb: xfs_linux.h includes about 9 headers for everyone, so those
    explicit includes get removed by this. I'm not sure what the
    preference is, but if we wanted explicit includes everywhere,
    a followup patch could remove those xfs_*.h includes from
    xfs_linux.h and move them into the files that need them.
    Or it could be left as-is.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     
  • Link every newly allocated writeback bio to cgroup pointed to by the
    writeback control structure, and charge every byte written back to it.

    Tested-by: Stefan Priebe - Profihost AG
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Move setting up operation and write hint to xfs_alloc_ioend, and
    then just copy over all needed information from the previous bio
    in xfs_chain_bio and stop passing various parameters to it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

17 Jun, 2019

1 commit

  • We currently have an input same_page parameter to __bio_try_merge_page
    to prohibit merging in the same page. The rationale for that is that
    some callers need to account for every page added to a bio. Instead of
    letting these callers call twice into the merge code to account for the
    new vs existing page cases, just turn the parameter into an output one
    that returns whether a merge in the same page occurred, and let them
    act accordingly.
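    The shape of the reworked interface can be sketched like this (a toy
    segment type with invented names, not the real __bio_try_merge_page
    signature): the merge itself decides, and reports the same-page case
    back through an output parameter for the caller's accounting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a bio vector entry. */
struct seg { const void *page; size_t off, len; };

/*
 * Illustrative merge: extend the last segment when the new range is
 * contiguous within the same page, and tell the caller via *same_page
 * that no new page was added, so it can account for bytes rather than
 * pages. Previously the caller had to pass a flag forbidding this case.
 */
bool try_merge_page(struct seg *last, const void *page, size_t off,
                    size_t len, bool *same_page)
{
    *same_page = false;
    if (last->page == page && off == last->off + last->len) {
        last->len += len;
        *same_page = true;   /* merged into an existing same-page segment */
        return true;
    }
    return false;
}
```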

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 May, 2019

1 commit

  • Pull block updates from Jens Axboe:
    "Nothing major in this series, just fixes and improvements all over the
    map. This contains:

    - Series of fixes for sed-opal (David, Jonas)

    - Fixes and performance tweaks for BFQ (via Paolo)

    - Set of fixes for bcache (via Coly)

    - Set of fixes for md (via Song)

    - Enabling multi-page for passthrough requests (Ming)

    - Queue release fix series (Ming)

    - Device notification improvements (Martin)

    - Propagate underlying device rotational status in loop (Holger)

    - Removal of mtip32xx trim support, which has been disabled for years
    (Christoph)

    - Improvement and cleanup of nvme command handling (Christoph)

    - Add block SPDX tags (Christoph)

    - Cleanup/hardening of bio/bvec iteration (Christoph)

    - A few NVMe pull requests (Christoph)

    - Removal of CONFIG_LBDAF (Christoph)

    - Various little fixes here and there"

    * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
    block: fix mismerge in bvec_advance
    block: don't drain in-progress dispatch in blk_cleanup_queue()
    blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
    blk-mq: always free hctx after request queue is freed
    blk-mq: split blk_mq_alloc_and_init_hctx into two parts
    blk-mq: free hw queue's resource in hctx's release handler
    blk-mq: move cancel of requeue_work into blk_mq_release
    blk-mq: grab .q_usage_counter when queuing request from plug code path
    block: fix function name in comment
    nvmet: protect discovery change log event list iteration
    nvme: mark nvme_core_init and nvme_core_exit static
    nvme: move command size checks to the core
    nvme-fabrics: check more command sizes
    nvme-pci: check more command sizes
    nvme-pci: remove an unneeded variable initialization
    nvme-pci: unquiesce admin queue on shutdown
    nvme-pci: shutdown on timeout during deletion
    nvme-pci: fix psdt field for single segment sgls
    nvme-multipath: don't print ANA group state by default
    nvme-multipath: split bios with the ns_head bio_set before submitting
    ...

    Linus Torvalds
     

17 Apr, 2019

2 commits

  • It's possible for pagecache writeback to split up a large amount of work
    into smaller pieces for throttling purposes or to reduce the amount of
    time a writeback operation is pending. Whatever the reason, XFS can end
    up with a bunch of IO completions that call for the same operation to be
    performed on a contiguous extent mapping. Since mappings are extent
    based in XFS, we'd prefer to run fewer transactions when we can.

    When we're processing an ioend on the list of io completions, check to
    see if the next items on the list are both adjacent and of the same
    type. If so, we can merge the completions to reduce transaction
    overhead.

    On fast storage this doesn't seem to make much of a difference in
    performance, though the number of transactions for an overnight xfstests
    run seems to drop by ~5%.
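    The merge predicate boils down to two checks, sketched here with a
    minimal stand-in for the ioend (field and function names are
    illustrative): the completions must be the same kind of work and
    must cover adjacent file ranges.

```c
#include <stdbool.h>

/* Minimal stand-in for the fields that matter when merging. */
struct ioend_info {
    unsigned int type;      /* unwritten conversion, COW remap, ... */
    long long offset;       /* file offset of the first byte */
    long long size;         /* bytes covered */
};

/*
 * Two completions can share one transaction only if they perform the
 * same operation and the first range ends exactly where the second
 * begins.
 */
bool ioend_can_merge(const struct ioend_info *a, const struct ioend_info *b)
{
    return a->type == b->type && a->offset + a->size == b->offset;
}
```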

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     
  • When scheduling writeback of dirty file data in the page cache, XFS uses
    IO completion workqueue items to ensure that filesystem metadata only
    updates after the write completes successfully. This is essential for
    converting unwritten extents to real extents at the right time and
    performing COW remappings.

    Unfortunately, XFS queues each IO completion work item to an unbounded
    workqueue, which means that the kernel can spawn dozens of threads to
    try to handle the items quickly. These threads need to take the ILOCK
    to update file metadata, which results in heavy ILOCK contention if a
    large number of the work items target a single file, which is
    inefficient.

    Worse yet, the writeback completion threads get stuck waiting for the
    ILOCK while holding transaction reservations, which can use up all
    available log reservation space. When that happens, metadata updates to
    other parts of the filesystem grind to a halt, even if the filesystem
    could otherwise have handled it.

    Even worse, if one of the things grinding to a halt happens to be a
    thread in the middle of a defer-ops finish holding the same ILOCK and
    trying to obtain more log reservation having exhausted the permanent
    reservation, we now have an ABBA deadlock - writeback completion has a
    transaction reserved and wants the ILOCK, and someone else has the ILOCK
    and wants a transaction reservation.

    Therefore, we create a per-inode writeback io completion queue + work
    item. When writeback finishes, it can add the ioend to the per-inode
    queue and let the single worker item process that queue. This
    dramatically cuts down on the number of kworkers and ILOCK contention in
    the system, and seems to have eliminated an occasional deadlock I was
    seeing while running generic/476.

    Testing with a program that simulates a heavy random-write workload to a
    single file demonstrates that the number of kworkers drops from
    approximately 120 threads per file to 1, without dramatically changing
    write bandwidth or pagecache access latency.

    Note that we leave the xfs-conv workqueue's max_active alone because we
    still want to be able to run ioend processing for as many inodes as the
    system can handle.
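    The queue-plus-single-worker shape can be sketched as a minimal
    splice-and-drain list (names invented; the kernel version uses a
    locked list and a workqueue item): completions append to a per-inode
    list, and one worker drains it, so at most one thread per inode ever
    contends for the ILOCK.

```c
#include <stddef.h>

struct ioend_item { struct ioend_item *next; };

/* Per-inode completion state; locking omitted for brevity. */
struct inode_wb {
    struct ioend_item *pending;   /* per-inode completion list */
    int processed;                /* work done by the single worker */
};

/* Writeback completion: just queue the ioend for the worker. */
void ioend_queue(struct inode_wb *ip, struct ioend_item *io)
{
    io->next = ip->pending;
    ip->pending = io;
}

/* The one work item: splice off and drain everything queued so far. */
void ioend_worker(struct inode_wb *ip)
{
    struct ioend_item *list = ip->pending;

    ip->pending = NULL;
    while (list) {
        ip->processed++;          /* stand-in for ILOCK + conversion */
        list = list->next;
    }
}
```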

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster

    Darrick J. Wong
     

09 Mar, 2019

1 commit

  • Pull block layer updates from Jens Axboe:
    "Not a huge amount of changes in this round, the biggest one is that we
    finally have Ming's multi-page bvec support merged. Apart from that,
    this pull request contains:

    - Small series that avoids quiescing the queue for sysfs changes that
    match what we currently have (Aleksei)

    - Series of bcache fixes (via Coly)

    - Series of lightnvm fixes (via Mathias)

    - NVMe pull request from Christoph. Nothing major, just SPDX/license
    cleanups, RR mp policy (Hannes), and little fixes (Bart,
    Chaitanya).

    - BFQ series (Paolo)

    - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
    for the fast path (Jianchao)

    - fops->iopoll() added for async IO polling, this is a feature that
    the upcoming io_uring interface will use (Christoph, me)

    - Partition scan loop fixes (Dongli)

    - mtip32xx conversion from managed resource API (Christoph)

    - cdrom registration race fix (Guenter)

    - MD pull from Song, two minor fixes.

    - Various documentation fixes (Marcos)

    - Multi-page bvec feature. This brings a lot of nice improvements
    with it, like more efficient splitting, larger IOs can be supported
    without growing the bvec table size, and so on. (Ming)

    - Various little fixes to core and drivers"

    * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
    block: fix updating bio's front segment size
    block: Replace function name in string with __func__
    nbd: propagate genlmsg_reply return code
    floppy: remove set but not used variable 'q'
    null_blk: fix checking for REQ_FUA
    block: fix NULL pointer dereference in register_disk
    fs: fix guard_bio_eod to check for real EOD errors
    blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
    block: optimize bvec iteration in bvec_iter_advance
    block: introduce mp_bvec_for_each_page() for iterating over page
    block: optimize blk_bio_segment_split for single-page bvec
    block: optimize __blk_segment_map_sg() for single-page bvec
    block: introduce bvec_nth_page()
    iomap: wire up the iopoll method
    block: add bio_set_polled() helper
    block: wire up block device iopoll method
    fs: add an iopoll method to struct file_operations
    loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
    loop: do not print warn message if partition scan is successful
    block: bounce: make sure that bvec table is updated
    ...

    Linus Torvalds
     

21 Feb, 2019

2 commits

  • Add a mode where XFS never overwrites existing blocks in place. This
    is to aid debugging our COW code, and also put infrastructure in place
    for things like possible future support for zoned block devices, which
    can't support overwrites.

    This mode is enabled globally by doing a:

    echo 1 > /sys/fs/xfs/debug/always_cow

    Note that the parameter is global to allow running all tests in xfstests
    easily in this mode, which would not easily be possible with a per-fs
    sysfs file.

    In always_cow mode persistent preallocations are disabled, and fallocate
    will fail when called with a 0 mode (with or without
    FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for zeroed space
    when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.

    There are a few interesting xfstests failures when run in always_cow
    mode:

    - generic/392 fails because the number of bytes used in the file
    used to test hole punch recovery is lower after the log replay. This
    is
    because the blocks written and then punched out are only freed
    with a delay due to the logging mechanism.
    - xfs/170 will fail as the already fragile file streams mechanism
    doesn't seem to interact well with the COW allocator
    - xfs/180, xfs/182, xfs/192, xfs/198, xfs/204 and xfs/208 will claim
    the file system is badly fragmented, but there is not much we
    can do to avoid that when always writing out of place
    - xfs/205 fails because overwriting a file in always_cow mode
    will require new space allocation, and the assumptions in the
    test thus no longer hold.
    - xfs/326 fails to modify the file at all in always_cow mode after
    injecting the refcount error, leading to an unexpected md5sum
    after the remount, but that again is expected

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • This only matters if we want to write data through the COW fork that is
    not actually an overwrite of existing data. Reasons for that are
    speculative COW fork allocations using the cowextsize, or a mode where
    we always write through the COW fork. Currently neither can actually
    happen, but I plan to enable them.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

18 Feb, 2019

5 commits

  • While we can only truncate a block under the page lock for the current
    page, there is no high-level synchronization for moving extents from the
    COW to the data fork. This means that, for example, another thread
    doing a direct I/O completion that moves extents from the COW to the
    data fork can race with writeback. While this race is very hard to
    hit, the always_cow mode seems to reproduce it reasonably well, and it
    also exists without that mode. Because of this race there is a chance
    that a delalloc conversion for the COW fork might not find any extents
    to convert. In that case we should retry the whole block lookup, which
    will then find the blocks in the data fork.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Now that we properly handle the race with truncate in the delalloc
    allocator there is no need to short cut this exceptional case earlier
    on.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • This function is a small wrapper only used by the writeback code, so
    move it together with the writeback code and simplify it down to the
    glorified do { } while loop that it now is.

    A few bits intentionally got lost here: no need to call xfs_qm_dqattach
    because quotas are always attached when we create the delalloc
    reservation, and no need for the imap->br_startblock == 0 check given
    that xfs_bmapi_convert_delalloc already has a WARN_ON_ONCE for exactly
    that condition.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • We already ensure all data fits into s_maxbytes in the write / fault
    path. The only reason we have these checks here is that they were
    copied and pasted from xfs_bmapi_read when we stopped using that
    function.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • The io_type field contains what is basically a summary of information
    from the inode fork and the imap. But we can just as easily use that
    information directly, simplifying a few bits here and there and
    improving the trace points.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

15 Feb, 2019

1 commit

  • Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
    This is needed since io_uring is now based on the block branch,
    to avoid a conflict between the multi-page bvecs and the bits
    of io_uring that touch the core block parts.

    * tag 'v5.0-rc6': (525 commits)
    Linux 5.0-rc6
    x86/mm: Make set_pmd_at() paravirt aware
    MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
    blk-mq: remove duplicated definition of blk_mq_freeze_queue
    Blk-iolatency: warn on negative inflight IO counter
    blk-iolatency: fix IO hang due to negative inflight counter
    MAINTAINERS: unify reference to xen-devel list
    x86/mm/cpa: Fix set_mce_nospec()
    futex: Handle early deadlock return correctly
    futex: Fix barrier comment
    net: dsa: b53: Fix for failure when irq is not defined in dt
    blktrace: Show requests without sector
    mips: cm: reprime error cause
    mips: loongson64: remove unreachable(), fix loongson_poweroff().
    sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
    geneve: should not call rt6_lookup() when ipv6 was disabled
    KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
    KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
    kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
    signal: Better detection of synchronous signals
    ...

    Jens Axboe