05 Nov, 2020

2 commits

  • The iomap writepage error handling logic is a mash of old and
    slightly broken XFS writepage logic. When keepwrite writeback state
    tracking was introduced in XFS in commit 0d085a529b42 ("xfs: ensure
    WB_SYNC_ALL writeback handles partial pages correctly"), XFS had an
    additional cluster writeback context that scanned ahead of
    ->writepage() to process dirty pages over the current ->writepage()
    extent mapping. This context expected a dirty page and required
    retention of the TOWRITE tag on partial page processing so the
    higher level writeback context would revisit the page (in contrast
    to ->writepage(), which passes a page with the dirty bit already
    cleared).

    The cluster writeback mechanism was eventually removed and some of
    the error handling logic folded into the primary writeback path in
    commit 150d5be09ce4 ("xfs: remove xfs_cancel_ioend"). This patch
    accidentally conflated the two contexts by using the keepwrite logic
    in ->writepage() without accounting for the fact that the page is
    not dirty. Further, the keepwrite logic has no practical effect on
    the core ->writepage() caller (write_cache_pages()) because it never
    revisits a page in the current function invocation.

    Technically, the page should be redirtied for the keepwrite logic to
    have any effect. Otherwise, write_cache_pages() may find the tagged
    page but will skip it since it is clean. Even if the page was
    redirtied, however, there is still no practical effect to keepwrite
    since write_cache_pages() does not wrap around within a single
    invocation of the function. Therefore, the dirty page would simply
    end up retagged on the next writeback sequence over the associated
    range.

    All that being said, none of this really matters because redirtying
    a partially processed page introduces a potential infinite redirty
    -> writeback failure loop that deviates from the current design
    principle of clearing the dirty state on writepage failure to avoid
    building up too much dirty, unreclaimable memory on the system.
    Therefore, drop the spurious keepwrite usage and dirty state
    clearing logic from iomap_writepage_map(), treat the partially
    processed page the same as a fully processed page, and let the
    imminent ioend failure clean up the writeback state.

    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • iomap writeback mapping failure only calls into ->discard_page() if
    the current page has not been added to the ioend. Accordingly, the
    XFS callback assumes a full page discard and invalidation. This is
    problematic for sub-page block size filesystems where some portion
    of a page might have been mapped successfully before a failure to
    map a delalloc block occurs. ->discard_page() is not called in that
    error scenario and the bio is explicitly failed by iomap via the
    error return from ->prepare_ioend(). As a result, the filesystem
    leaks delalloc blocks and corrupts the filesystem block counters.

    Since XFS is the only user of ->discard_page(), tweak the semantics
    to invoke the callback unconditionally on mapping errors and provide
    the file offset that failed to map. Update xfs_discard_page() to
    discard the corresponding portion of the file and pass the range
    along to iomap_invalidatepage(). The latter already properly handles
    both full and sub-page scenarios by not changing any iomap or page
    state on sub-page invalidations.
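
    A sketch of the resulting error path in iomap_writepage_map(), based
    on the iomap code of that era (illustrative rather than the exact
    patch):

      if (unlikely(error)) {
              /*
               * Let the filesystem know what portion of the current page
               * failed to map. If the page hasn't been added to an ioend,
               * it won't be affected by I/O completion, so discard and
               * unmap it manually.
               */
              if (wpc->ops->discard_page)
                      wpc->ops->discard_page(page, file_offset);
              if (!count) {
                      ClearPageUptodate(page);
                      unlock_page(page);
                      goto done;
              }
      }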

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     

28 Sep, 2020

3 commits

  • The iomap completion routine can deadlock with btrfs_fallocate()
    because of the call to generic_write_sync():

    P0                                P1
    inode_lock()                      fallocate(FALLOC_FL_ZERO_RANGE)
    __iomap_dio_rw()                    inode_lock()      <blocked>

    inode_unlock()                      <acquires lock>
                                        inode_dio_wait()  <blocked>
    iomap_dio_complete()
      generic_write_sync()
        btrfs_file_fsync()
          inode_lock()    <deadlock>

    inode_dio_end() is used to notify the end of DIO data in order
    to synchronize with truncate. Call inode_dio_end() before calling
    generic_write_sync(), so filesystems can lock i_rwsem during a sync.

    This matches the way it is done in fs/direct-io.c:dio_complete().
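
    A minimal sketch of the reordering inside iomap_dio_complete()
    (assuming the IOMAP_DIO_NEED_SYNC flag and the surrounding structure
    of the function at the time):

      /* end the DIO first so ->fsync can take i_rwsem without deadlocking */
      inode_dio_end(file_inode(iocb->ki_filp));

      /* only now push a DSYNC write out to stable storage */
      if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
              ret = generic_write_sync(iocb, ret);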

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Josef Bacik
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Goldwyn Rodrigues
     
  • This is to avoid the deadlock caused in btrfs because of O_DIRECT |
    O_DSYNC.

    Filesystems such as btrfs require i_rwsem while performing a sync on
    a file. iomap_dio_rw() is called with i_rwsem held. This leads to a
    deadlock because of:

    iomap_dio_complete()
      generic_write_sync()
        btrfs_sync_file()

    Separate out iomap_dio_complete() from iomap_dio_rw(), so filesystems
    can call iomap_dio_complete() after unlocking i_rwsem.
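
    A filesystem can then structure its direct write roughly as follows
    (a sketch; the exact __iomap_dio_rw() parameter list is elided and
    the iomap/dio ops names are placeholders):

      struct iomap_dio *dio;

      dio = __iomap_dio_rw(iocb, iter, &fs_iomap_ops, &fs_dio_ops, ...);
      inode_unlock(inode);                 /* drop i_rwsem first */
      if (IS_ERR_OR_NULL(dio))
              return PTR_ERR_OR_ZERO(dio);
      return iomap_dio_complete(dio);      /* may call generic_write_sync() */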

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Josef Bacik
    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • For filesystems with block size < page size, we need to set all the
    per-block uptodate bits if the page was already uptodate at the time
    we create the per-block metadata. This can happen if the page is
    invalidated (eg by a write to drop_caches) but ultimately not removed
    from the page cache.

    This is a data corruption issue as page writeback skips blocks which
    are marked !uptodate.
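
    The fix amounts to filling the whole per-block bitmap when the
    metadata is created for an already-uptodate page, along these lines
    (a sketch of iomap_page_create()):

      if (PageUptodate(page))
              bitmap_fill(iop->uptodate, nr_blocks);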

    Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reported-by: Qian Cai
    Cc: Brian Foster
    Reviewed-by: Gao Xiang
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Matthew Wilcox (Oracle)
     

21 Sep, 2020

10 commits

  • Pass the full length to iomap_zero() and dax_iomap_zero(), and have
    them return how many bytes they actually handled. This is preparatory
    work for handling THP, although it looks like DAX could actually take
    advantage of it if there's a larger contiguous area.
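
    Under the new convention, the zeroing loop looks roughly like this
    (a sketch; both helpers now return a byte count on success or a
    negative errno):

      do {
              s64 bytes;

              if (IS_DAX(inode))
                      bytes = dax_iomap_zero(pos, length, iomap);
              else
                      bytes = iomap_zero(inode, pos, length, iomap, srcmap);
              if (bytes < 0)
                      return bytes;

              pos += bytes;
              length -= bytes;
      } while (length > 0);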

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Matthew Wilcox (Oracle)
     
  • iomap_write_end cannot return an error, so switch it to return
    size_t instead of int and remove the error checking from the callers.
    Also convert the arguments to size_t from unsigned int, in case anyone
    ever wants to support a page size larger than 2GB.
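
    The resulting prototype, approximately:

      static size_t iomap_write_end(struct inode *inode, loff_t pos,
                      size_t len, size_t copied, struct page *page,
                      struct iomap *iomap, struct iomap *srcmap);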

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Matthew Wilcox (Oracle)
     
  • Instead of counting bio segments, count the number of bytes submitted.
    This insulates us from the block layer's definition of what a 'same page'
    is, which is not necessarily clear once THPs are involved.
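
    For the read path, the pattern looks roughly like this (a sketch;
    the byte counter's field name is assumed):

      /* submission: account the bytes added to the bio */
      if (iop)
              atomic_add(plen, &iop->read_bytes_pending);

      /* completion: the page is done once all pending bytes complete */
      if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
              unlock_page(page);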

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Matthew Wilcox (Oracle)
     
  • Instead of counting bio segments, count the number of bytes submitted.
    This insulates us from the block layer's definition of what a 'same page'
    is, which is not necessarily clear once THPs are involved.
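
    The write-side counterpart, at ioend completion (again a sketch with
    an assumed field name):

      if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
              end_page_writeback(page);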

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Matthew Wilcox (Oracle)
     
  • Size the uptodate array dynamically to support larger pages in the
    page cache. With a 64kB page, we're only saving 8 bytes per page today,
    but with a 2MB maximum page size, we'd have to allocate more than 4kB
    per page. Add a few debugging assertions.
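
    The struct then ends in a flexible array sized by the number of
    blocks in the page, something like this sketch:

      struct iomap_page {
              atomic_t        read_bytes_pending;
              atomic_t        write_bytes_pending;
              spinlock_t      uptodate_lock;
              unsigned long   uptodate[];     /* sized at allocation time */
      };

      iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
                    GFP_NOFS | __GFP_NOFAIL);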

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Matthew Wilcox (Oracle)
     
  • Now that the bitmap is protected by a spinlock, we can use the
    more efficient bitmap ops instead of individual test/set bit ops.
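
    For example, marking a range of blocks uptodate becomes a single
    bitmap operation under the lock (sketch):

      spin_lock_irqsave(&iop->uptodate_lock, flags);
      bitmap_set(iop->uptodate, first_blk, nr_blks);
      if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
              SetPageUptodate(page);
      spin_unlock_irqrestore(&iop->uptodate_lock, flags);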

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Matthew Wilcox (Oracle)
     
  • We can skip most of the initialisation, although spinlocks still
    need explicit initialisation as architectures may use a non-zero
    value to indicate unlocked. The comment is no longer useful as
    attach_page_private() handles the refcount now.
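
    The create path then reduces to roughly (sketch):

      iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
                    GFP_NOFS | __GFP_NOFAIL); /* zeroes counters and bitmap */
      spin_lock_init(&iop->uptodate_lock);    /* zero may not mean unlocked */
      attach_page_private(page, iop);         /* takes the page reference */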

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Matthew Wilcox (Oracle)
     
  • This helper is useful for both THPs and for supporting block size larger
    than page size. Convert all users that I could find (we have a few
    different ways of writing this idiom, and I may have missed some).
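
    The helper in question is presumably i_blocks_per_page(); its
    definition is essentially one line (sketch):

      static inline unsigned int i_blocks_per_page(struct inode *inode,
                      struct page *page)
      {
              /* thp_size() evaluates to PAGE_SIZE for non-THP pages */
              return thp_size(page) >> inode->i_blkbits;
      }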

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Acked-by: Dave Kleikamp

    Matthew Wilcox (Oracle)
     
  • If iomap_unshare_actor() unshares to an inline iomap, the page was
    not being flushed. block_write_end() and __iomap_write_end() already
    contain flushes, so adding it to iomap_write_end_inline() seems like
    the best place. That means we can remove it from iomap_write_actor().
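
    A sketch of iomap_write_end_inline() with the flush added, based on
    the code of that era:

      static size_t iomap_write_end_inline(struct inode *inode,
                      struct page *page, struct iomap *iomap, loff_t pos,
                      size_t copied)
      {
              void *addr;

              flush_dcache_page(page);  /* moved from iomap_write_actor() */
              addr = kmap_atomic(page);
              memcpy(iomap->inline_data + pos, addr + pos, copied);
              kunmap_atomic(addr);

              mark_inode_dirty(inode);
              return copied;
      }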

    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Matthew Wilcox (Oracle)
     
  • Signed-off-by: Nikolay Borisov
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Nikolay Borisov
     

10 Sep, 2020

4 commits

  • When bringing (portions of) a page uptodate, we were marking blocks that
    were zeroed as being uptodate, but not blocks that were read from storage.

    Like the previous commit, this problem was found with generic/127 and
    a kernel which failed readahead I/Os. This bug causes writes to be
    silently lost when working with flaky storage.
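
    The fix moves the uptodate marking so it covers blocks read from
    storage as well as zeroed ones, roughly as in this sketch of the
    __iomap_write_begin() loop body:

      if (iomap_block_needs_zeroing(inode, srcmap, block_start)) {
              zero_user_segments(page, poff, from, to, poff + plen);
      } else {
              int status = iomap_read_page_sync(block_start, page,
                              poff, plen, srcmap);
              if (status)
                      return status;
      }
      iomap_set_range_uptodate(page, poff, plen);  /* now for reads too */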

    Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Matthew Wilcox (Oracle)
     
  • If we find a page in write_begin which is !Uptodate, we need
    to clear any error on the page before starting to read data
    into it. This matches how filemap_fault(), do_read_cache_page()
    and generic_file_buffered_read() handle PageError on !Uptodate pages.
    When calling iomap_set_range_uptodate() in __iomap_write_begin(), blocks
    were not being marked as uptodate.

    This was found with generic/127 and a specially modified kernel which
    would fail (some) readahead I/Os. The test read some bytes in a prior
    page which caused readahead to extend into page 0x34. There was
    a subsequent write to page 0x34, followed by a read to page 0x34.
    Because the blocks were still marked as !Uptodate, the read caused all
    blocks to be re-read, overwriting the write. With this change, and the
    next one, the bytes which were written are marked as being Uptodate, so
    even though the page is still marked as !Uptodate, the blocks containing
    the written data are not re-read from storage.
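
    A sketch of the added clearing in iomap_write_begin(), after the
    page is grabbed:

      page = grab_cache_page_write_begin(inode->i_mapping,
                      pos >> PAGE_SHIFT, AOP_FLAG_NOFS);
      if (!page)
              return -ENOMEM;

      /* drop any stale error so the upcoming read isn't falsely failed */
      if (!PageUptodate(page))
              ClearPageError(page);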

    Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Matthew Wilcox (Oracle)
     
  • When a direct I/O write falls back to buffered I/O entirely, dio->size
    will be 0 in iomap_dio_complete. Function invalidate_inode_pages2_range
    will try to invalidate the rest of the address space. If there are any
    dirty pages in that range, the write will fail and a "Page cache
    invalidation failure on direct I/O" error will be logged.

    On gfs2, this can be reproduced as follows:

    xfs_io \
    -c "open -ft foo" -c "pwrite 4k 4k" -c "close" \
    -c "open -d foo" -c "pwrite 0 4k"

    Fix this by recognizing 0-length writes.
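
    The fix is essentially one added condition in iomap_dio_complete()
    (sketch):

      /* skip invalidation entirely for 0-byte (fully fallen-back) writes */
      if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE) &&
          inode->i_mapping->nrpages) {
              err = invalidate_inode_pages2_range(inode->i_mapping,
                              offset >> PAGE_SHIFT,
                              (offset + dio->size - 1) >> PAGE_SHIFT);
              if (err)
                      dio_warn_stale_pagecache(iocb->ki_filp);
      }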

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Andreas Gruenbacher
     
  • It is trivial for unprivileged users to trigger a WARN_ON_ONCE(1) in
    iomap_dio_actor(), which would taint the kernel, or worse, panic if
    panic_on_warn or panic_on_taint is set. Hence, just convert it to
    pr_warn_ratelimited() to let users know their workloads are racing.
    Thanks to Dave Chinner for the initial analysis of the racing
    reproducers.
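
    The conversion is a one-liner in spirit (sketch; the exact message
    text may differ from what was merged):

      /* was: WARN_ON_ONCE(1); */
      pr_warn_ratelimited("Direct I/O collision with buffered writes!\n");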

    Signed-off-by: Qian Cai
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Qian Cai
     

24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants
    with the new pseudo-keyword macro fallthrough [1]. Also, remove
    fall-through markings where they are unnecessary.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
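
    For example (a hypothetical switch; fallthrough expands to a
    compiler attribute where one is available):

      switch (state) {
      case STATE_INIT:
              setup();
              fallthrough;    /* was a "fall through" comment */
      case STATE_RUN:
              run();
              break;
      }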

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

06 Aug, 2020

2 commits

  • Failing to invalidate the page cache means data is incoherent, which
    is a very bad state for the system. Always fall back to buffered I/O
    through the page cache if we can't invalidate mappings.
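
    A sketch of the write-side submission path after this change, per
    the commit's description:

      /*
       * Try to invalidate cache pages for the range we are writing.
       * If this invalidation fails, let the caller fall back to
       * buffered I/O.
       */
      if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
                      end >> PAGE_SHIFT)) {
              ret = -ENOTBLK;   /* callers treat this as "retry buffered" */
              goto out_free_dio;
      }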

    Signed-off-by: Christoph Hellwig
    Acked-by: Dave Chinner
    Reviewed-by: Goldwyn Rodrigues
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Acked-by: Bob Peterson
    Acked-by: Damien Le Moal
    Reviewed-by: Theodore Ts'o # for ext4
    Reviewed-by: Andreas Gruenbacher # for gfs2
    Reviewed-by: Ritesh Harjani

    Christoph Hellwig
     
  • The historic requirement for XFS to invalidate cached pages on
    direct IO reads has been lost in the twisty pages of history - it was
    inherited from Irix, which implemented page cache invalidation on
    read as a method of working around problems synchronising page
    cache state with uncached IO.

    XFS has carried this ever since. In the initial linux ports it was
    necessary to get mmap and DIO to play "ok" together and not
    immediately corrupt data. This was the state of play until the linux
    kernel had infrastructure to track unwritten extents and synchronise
    page faults with allocations and unwritten extent conversions
    (->page_mkwrite infrastructure). IOWs, the page cache invalidation
    on DIO read was necessary to prevent trivial data corruptions. This
    didn't solve all the problems, though.

    There were performance problems if we didn't invalidate the entire
    page cache over the file on read - we couldn't easily determine if
    the cached pages were over the range of the IO, and invalidation
    required taking a serialising lock (i_mutex) on the inode. This
    serialising lock was an issue for XFS, as it was the only exclusive
    lock in the direct IO read path.

    Hence if there were any cached pages, we'd just invalidate the
    entire file in one go so that subsequent IOs didn't need to take the
    serialising lock. This was a problem that prevented ranged
    invalidation from being particularly useful for avoiding the
    remaining coherency issues. This was solved with the conversion of
    i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
    use i_rwsem. Hence we could now just do ranged invalidation and the
    performance problem went away.

    However, page cache invalidation was still needed to serialise
    sub-page/sub-block zeroing via direct IO against buffered IO because
    bufferhead state attached to the cached page could get out of whack
    when direct IOs were issued. We've removed bufferheads from the
    XFS code, and we don't carry any extent state on the cached pages
    anymore, and so this problem has gone away, too.

    IOWs, it would appear that we don't have any good reason to be
    invalidating the page cache on DIO reads anymore. Hence remove the
    invalidation on read: it is unnecessary overhead, it is no longer
    needed to maintain coherency between mmap/buffered access and direct
    IO, and removing it prevents anyone from using direct IO reads to
    intentionally invalidate the page cache of a file.

    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Matthew Wilcox (Oracle)
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     

07 Jul, 2020

1 commit

  • Make sure iomap_end is always called when iomap_begin succeeds.

    Without this fix, iomap_end won't be called when a filesystem's
    iomap_begin operation returns an invalid mapping, bypassing any
    unlocking done in iomap_end. With this fix, the unlocking will still
    happen.
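
    A condensed sketch of the fixed control flow in iomap_apply():

      ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
      if (ret)
              return ret;
      if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0)) {
              written = -EIO;
              goto out;  /* previously returned here, skipping iomap_end */
      }

      written = actor(inode, pos, length, data, &iomap, &srcmap);

      out:
      if (ops->iomap_end)
              ret = ops->iomap_end(inode, pos, length,
                                   written > 0 ? written : 0, flags, &iomap);
      return written ? written : ret;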

    This bug was found by Bob Peterson during code review. It's unlikely
    that such iomap_begin bugs will survive to affect users, so backporting
    this fix seems unnecessary.

    Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure")
    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     

06 Jun, 2020

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "A lot of bug fixes and cleanups for ext4, including:

    - Fix performance problems found in dioread_nolock now that it is the
    default, caused by transaction leaks.

    - Clean up fiemap handling in ext4

    - Clean up and refactor multiple block allocator (mballoc) code

    - Fix a problem with mballoc on smaller file systems running out of
    blocks because they couldn't properly use blocks that had been
    reserved by inode preallocation.

    - Fix a race in ext4_sync_parent() versus rename()

    - Simplify the error handling in the extent manipulation code

    - Make sure all metadata I/O errors are reflected to
    ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.

    - Avoid passing an error pointer to brelse in ext4_xattr_set()

    - Fix a race which could result in freeing an inode on the dirty
    list in data=journal mode.

    - Fix refcount handling if ext4_iget() fails

    - Fix a crash in generic/019 caused by a corrupted extent node"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
    ext4: avoid unnecessary transaction starts during writeback
    ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
    ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
    fs: remove the access_ok() check in ioctl_fiemap
    fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
    fs: move fiemap range validation into the file systems instances
    iomap: fix the iomap_fiemap prototype
    fs: move the fiemap definitions out of fs.h
    fs: mark __generic_block_fiemap static
    ext4: remove the call to fiemap_check_flags in ext4_fiemap
    ext4: split _ext4_fiemap
    ext4: fix fiemap size checks for bitmap files
    ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
    add comment for ext4_dir_entry_2 file_type member
    jbd2: avoid leaking transaction credits when unreserving handle
    ext4: drop ext4_journal_free_reserved()
    ext4: mballoc: use lock for checking free blocks while retrying
    ext4: mballoc: refactor ext4_mb_good_group()
    ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
    ext4: mballoc: refactor ext4_mb_discard_preallocations()
    ...

    Linus Torvalds
     

04 Jun, 2020

4 commits

  • By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is
    handled once instead of duplicated, but can still be done under fs locks,
    like xfs/iomap intended with its duplicate handling. Also make sure the
    error value of filemap_write_and_wait is propagated to user space.
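
    A condensed sketch of fiemap_prep(), per its description here:

      int fiemap_prep(struct inode *inode, struct fiemap_extent_info *fieinfo,
                      u64 start, u64 *len, u32 supported_flags)
      {
              int ret = 0;

              /* ... validate and truncate the range against s_maxbytes ... */

              if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC)
                      ret = filemap_write_and_wait(inode->i_mapping);
              return ret;     /* any writeback error reaches user space */
      }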

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de
    Signed-off-by: Theodore Ts'o

    Christoph Hellwig
     
  • Replace fiemap_check_flags with a fiemap_prep helper that also takes the
    inode and mapped range, and performs the sanity check and truncation
    previously done in fiemap_check_range. This way the validation is inside
    the file system itself and thus properly works for the stacked overlayfs
    case as well.
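
    A filesystem's ->fiemap then starts with something like (sketch):

      ret = fiemap_prep(inode, fieinfo, start, &len, FIEMAP_FLAG_XATTR);
      if (ret)
              return ret;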

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Amir Goldstein
    Reviewed-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de
    Signed-off-by: Theodore Ts'o

    Christoph Hellwig
     
  • iomap_fiemap should take u64 start and len arguments, just like the
    ->fiemap prototype.
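
    The fixed prototype, approximately:

      int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
                      u64 start, u64 len, const struct iomap_ops *ops);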

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ritesh Harjani
    Reviewed-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de
    Signed-off-by: Theodore Ts'o

    Christoph Hellwig
     
  • No need to pull the fiemap definitions into almost every file in the
    kernel build.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ritesh Harjani
    Reviewed-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
    Signed-off-by: Theodore Ts'o

    Christoph Hellwig
     

03 Jun, 2020

5 commits

  • Pull btrfs updates from David Sterba:
    "Highlights:

    - speed up dead root detection during orphan cleanup, e.g. when
    there are many deleted subvolumes waiting to be cleaned, the trees
    are now looked up in a radix tree instead of with an O(N^2) search

    - snapshot creation with inherited qgroup will mark the qgroup
    inconsistent, requires a rescan

    - send will emit file capabilities after chown, this produces a
    stream that does not need postprocessing to set the capabilities
    again

    - direct io ported to iomap infrastructure, cleaned up and simplified
    code, notably removing last use of struct buffer_head in btrfs code

    Core changes:

    - factor out backreference iteration, to be used by ordinary
    backreferences and relocation code

    - improved global block reserve utilization
    * better logic to serialize requests
    * increased maximum available for unlink
    * improved handling on large pages (64K)

    - direct io cleanups and fixes
    * simplify layering, where cloned bios were unnecessarily created
    for some cases
    * error handling fixes (submit, endio)
    * remove repair worker thread, used to avoid deadlocks during
    repair

    - refactored block group reading code, preparatory work for new type
    of block group storage that should improve mount time on large
    filesystems

    Cleanups:

    - cleaned up (and slightly sped up) set/get helpers for metadata data
    structure members

    - root bit REF_COWS got renamed to SHAREABLE to reflect that the
    blocks of the tree get shared either among subvolumes or with the
    relocation trees

    Fixes:

    - when subvolume deletion fails due to ENOSPC, the filesystem is not
    turned read-only

    - device scan deals with devices from other filesystems that changed
    ownership due to overwrite (mkfs)

    - fix a race between scrub and block group removal/allocation

    - fix long standing bug of a runaway balance operation, printing the
    same line to the syslog, caused by a stale status bit on a reloc
    tree that prevented progress

    - fix corrupt log due to concurrent fsync of inodes with shared
    extents

    - fix space underflow for NODATACOW and buffered writes when it for
    some reason needs to fall back to COW mode

    * tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
    btrfs: fix space_info bytes_may_use underflow during space cache writeout
    btrfs: fix space_info bytes_may_use underflow after nocow buffered write
    btrfs: fix wrong file range cleanup after an error filling dealloc range
    btrfs: remove redundant local variable in read_block_for_search
    btrfs: open code key_search
    btrfs: split btrfs_direct_IO to read and write part
    btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
    fs: remove dio_end_io()
    btrfs: switch to iomap_dio_rw() for dio
    iomap: remove lockdep_assert_held()
    iomap: add a filesystem hook for direct I/O bio submission
    fs: export generic_file_buffered_read()
    btrfs: turn space cache writeout failure messages into debug messages
    btrfs: include error on messages about failure to write space/inode caches
    btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
    btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
    btrfs: make checksum item extension more efficient
    btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
    btrfs: unexport btrfs_compress_set_level()
    btrfs: simplify iget helpers
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "Core block changes that have been queued up for this release:

    - Remove dead blk-throttle and blk-wbt code (Guoqing)

    - Include pid in blktrace note traces (Jan)

    - Don't spew I/O errors on wouldblock termination (me)

    - Zone append addition (Johannes, Keith, Damien)

    - IO accounting improvements (Konstantin, Christoph)

    - blk-mq hardware map update improvements (Ming)

    - Scheduler dispatch improvement (Salman)

    - Inline block encryption support (Satya)

    - Request map fixes and improvements (Weiping)

    - blk-iocost tweaks (Tejun)

    - Fix for timeout failing with error injection (Keith)

    - Queue re-run fixes (Douglas)

    - CPU hotplug improvements (Christoph)

    - Queue entry/exit improvements (Christoph)

    - Move DMA drain handling to the few drivers that use it (Christoph)

    - Partition handling cleanups (Christoph)"

    * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
    block: mark bio_wouldblock_error() bio with BIO_QUIET
    blk-wbt: rename __wbt_update_limits to wbt_update_limits
    blk-wbt: remove wbt_update_limits
    blk-throttle: remove tg_drain_bios
    blk-throttle: remove blk_throtl_drain
    null_blk: force complete for timeout request
    blk-mq: drain I/O when all CPUs in a hctx are offline
    blk-mq: add blk_mq_all_tag_iter
    blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
    blk-mq: use BLK_MQ_NO_TAG in more places
    blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
    blk-mq: move more request initialization to blk_mq_rq_ctx_init
    blk-mq: simplify the blk_mq_get_request calling convention
    blk-mq: remove the bio argument to ->prepare_request
    nvme: force complete cancelled requests
    blk-mq: blk-mq: provide forced completion method
    block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
    block: blk-crypto-fallback: remove redundant initialization of variable err
    block: reduce part_stat_lock() scope
    block: use __this_cpu_add() instead of access by smp_processor_id()
    ...

    Linus Torvalds
     
  • Since the new helper pair attach_page_private()/detach_page_private()
    has been introduced, we can call them to clean up the code in iomap.
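
    Usage is symmetric (sketch):

      /* attach: sets PagePrivate, takes a page ref, stores the pointer */
      attach_page_private(page, iop);

      /* detach: clears PagePrivate, drops the ref, returns the pointer */
      iop = detach_page_private(page);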

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Darrick J. Wong
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Link: http://lkml.kernel.org/r/20200517214718.468-7-guoqing.jiang@cloud.ionos.com
    Signed-off-by: Linus Torvalds

    Guoqing Jiang
     
  • Use the new readahead operation in iomap. Convert XFS and ZoneFS to use
    it.
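
    The conversion pairs the new address_space operation with the iomap
    entry point, roughly (a sketch using the XFS names):

      /* in the aops table */
      .readahead = xfs_vm_readahead,

      static void
      xfs_vm_readahead(struct readahead_control *rac)
      {
              iomap_readahead(rac, &xfs_read_iomap_ops);
      }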

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Dave Chinner
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: John Hubbard
    Cc: Joseph Qi
    Cc: Junxiao Bi
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Implement the new readahead aop and convert all callers (block_dev,
    exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
    reiserfs & udf).

    The callers are all trivial except for GFS2 & OCFS2.
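
    For the trivial block-based filesystems the conversion follows one
    pattern (sketch with hypothetical names):

      static void myfs_readahead(struct readahead_control *rac)
      {
              mpage_readahead(rac, myfs_get_block);
      }

      /* in myfs_aops: */
      .readahead = myfs_readahead,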

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Junxiao Bi # ocfs2
    Reviewed-by: Joseph Qi # ocfs2
    Reviewed-by: Dave Chinner
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Reviewed-by: William Kucharski
    Cc: Chao Yu
    Cc: Cong Wang
    Cc: Darrick J. Wong
    Cc: Eric Biggers
    Cc: Gao Xiang
    Cc: Jaegeuk Kim
    Cc: Michal Hocko
    Cc: Zi Yan
    Cc: Johannes Thumshirn
    Cc: Miklos Szeredi
    Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

25 May, 2020

2 commits

  • Filesystems such as btrfs can perform direct I/O without holding
    inode->i_rwsem in some cases, such as writing within i_size. So,
    remove the lockdep_assert_held() check in iomap_dio_rw().

    Reviewed-by: Darrick J. Wong
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     
  • This helps filesystems perform tasks on the bio while submitting it
    for I/O. These could be post-write operations such as data CRC or
    data replication for fs-handled RAID.
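
    The hook lives in struct iomap_dio_ops, roughly as follows (a
    sketch; the parameter lists are approximate):

      struct iomap_dio_ops {
              int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
                            unsigned flags);
              blk_qc_t (*submit_io)(struct inode *inode, struct iomap *iomap,
                            struct bio *bio, loff_t file_offset);
      };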

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: David Sterba

    Goldwyn Rodrigues
     

13 May, 2020

1 commit

  • A sync dio could be big, or may take a long time in discard or in
    case of IO failure.

    We have prevented task hung warnings in submit_bio_wait() and
    blk_execute_rq(), so apply the same trick to prevent them from
    happening in sync dio.

    Add a blk_io_schedule() helper that uses io_schedule_timeout() to
    prevent the task hung warning.
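
    The helper is essentially the following (sketch):

      static inline void blk_io_schedule(void)
      {
              /* bound the sleep so the hung task detector stays quiet */
              unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;

              if (timeout)
                      io_schedule_timeout(timeout);
              else
                      io_schedule();
      }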

    Signed-off-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Cc: Salman Qazi
    Cc: Jesse Barnes
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     

30 Apr, 2020

1 commit

  • We had better warn the fibmap user and not return a truncated, and
    therefore incorrect, block map address if the block address returned
    by bmap() is greater than INT_MAX (since the user supplies an
    integer pointer).

    It's better to pr_warn() for all users of ioctl_fibmap() and return
    a proper error code rather than silently letting an FS corruption
    happen if the user tries to fiddle around with the returned block
    map address.

    We fix this by returning an error code of -ERANGE and returning 0 as
    the block mapping address when it is greater than INT_MAX.

    iomap_bmap() can be called from either of two paths: by a user
    calling the ioctl_fibmap() interface to get the block mapping
    address, or by some filesystem via the internal bmap() kernel API.
    The bmap() kernel API is well equipped to handle u64 addresses.

    The WARN condition in iomap_bmap_actor() was mainly added to warn
    all fibmap users. Now that we warn all fibmap users directly and
    make sure to return 0 as the block map address when it is greater
    than INT_MAX, we can remove this logic from iomap_bmap_actor().
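
    A sketch of the resulting check in ioctl_fibmap() (the warning text
    is abbreviated):

      error = bmap(inode, &block);
      if (error)
              ur_block = 0;
      else if (block > INT_MAX) {
              error = -ERANGE;
              ur_block = 0;
              pr_warn_ratelimited("fibmap: would truncate block address\n");
      } else {
              ur_block = block;
      }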

    Signed-off-by: Ritesh Harjani
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Ritesh Harjani
     

09 Apr, 2020

1 commit