14 Mar, 2019

2 commits

  • [ Upstream commit 4ea899ead2786a30aaa8181fefa81a3df4ad28f6 ]

    Introduce a local wait_for_completion variable to avoid an access to the
    potentially freed dio structure after dropping the last reference count.

    Also take the chance to document the completion behavior, making the
    refcounting clear to readers of the code.
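
    A minimal sketch of the pattern (struct and function names here are
    hypothetical and heavily simplified relative to the real iomap code):

        struct iomap_dio_sketch {
                atomic_t ref;
                bool wait_for_completion;
                struct completion done;
        };

        static ssize_t iomap_dio_drop_ref(struct iomap_dio_sketch *dio)
        {
                /* Latch the flag before dropping our reference: once the
                 * count can reach zero, the completion side may free dio. */
                bool need_wait = dio->wait_for_completion;

                if (!atomic_dec_and_test(&dio->ref)) {
                        if (!need_wait)
                                return -EIOCBQUEUED;  /* async: never touch dio again */
                        /* sync: the waiter still owns dio after completion */
                        wait_for_completion(&dio->done);
                }
                return 0;
        }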

    Fixes: ff6a9292e6 ("iomap: implement direct I/O")
    Reported-by: Chandan Rajendra
    Reported-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig
    Tested-by: Chandan Rajendra
    Tested-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     
  • [ Upstream commit 8e47a457321ca1a74ad194ab5dcbca764bc70731 ]

    migrate_page_move_mapping() expects pages with private data set to have
    a page_count elevated by 1. This is what used to happen for xfs through
    the buffer_heads code before the switch to iomap in commit 82cb14175e7d
    ("xfs: add support for sub-pagesize writeback without buffer_heads").
    Not having the count elevated causes move_pages() to fail on memory
    mapped files coming from xfs.

    Make iomap compatible with the migrate_page_move_mapping() assumption by
    elevating the page count as part of iomap_page_create() and lowering it
    in iomap_page_release().

    It causes the move_pages() syscall to misbehave on memory mapped files
    from xfs. It does not move any pages, which I suppose is "just" a
    perf issue, but it also ends up returning a positive number which is out
    of spec for the syscall. Talking to Michal Hocko, it sounds like
    returning positive numbers might be a necessary update to move_pages()
    anyway though.
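
    A hedged sketch of the create/release pairing described above
    (simplified; the real functions take more arguments, track per-block
    state, and handle allocation failure differently):

        struct iomap_page_sketch {
                atomic_t read_count;
                atomic_t write_count;
        };

        static struct iomap_page_sketch *iomap_page_create_sketch(struct page *page)
        {
                struct iomap_page_sketch *iop = kzalloc(sizeof(*iop), GFP_NOFS);

                if (!iop)
                        return NULL;
                /* migrate_page_move_mapping() expects pages carrying
                 * private data to hold one extra reference. */
                get_page(page);
                set_page_private(page, (unsigned long)iop);
                SetPagePrivate(page);
                return iop;
        }

        static void iomap_page_release_sketch(struct page *page)
        {
                struct iomap_page_sketch *iop =
                        (struct iomap_page_sketch *)page_private(page);

                ClearPagePrivate(page);
                set_page_private(page, 0);
                put_page(page);         /* drop the reference taken in create */
                kfree(iop);
        }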

    Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads")
    Signed-off-by: Piotr Jaroszynski
    [hch: actually get/put the page in iomap_migrate_page() to make it work
    properly]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Piotr Jaroszynski
     

26 Jan, 2019

1 commit

  • [ Upstream commit 3cc31fa65d85610574c0f6a474e89f4c419923d5 ]

    iomap_is_partially_uptodate() is intended to check whether blocks within
    the selected range of a not-uptodate page are uptodate; if the range we
    care about is up to date, it's an optimization.

    However, the iomap implementation continues to check all blocks up to
    from+count, which is beyond the page, and can even be well beyond the
    iop->uptodate bitmap.

    I think the worst that will happen is that we may eventually find a zero
    bit and return "not partially uptodate" when it would have otherwise
    returned true, and skip the optimization. Still, it's clearly an invalid
    memory access that must be fixed.

    So: fix this by limiting the search to within the page as is done in the
    non-iomap variant, block_is_partially_uptodate().
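
    A rough fragment of the bounded walk (variable names approximate; this
    would live inside iomap_is_partially_uptodate(), with iop->uptodate
    being the per-block bitmap):

        /* Limit the byte range to this page before converting it to
         * block indices, as block_is_partially_uptodate() does. */
        unsigned int len = min_t(unsigned int, PAGE_SIZE - from, count);
        unsigned int first = from >> inode->i_blkbits;
        unsigned int last = (from + len - 1) >> inode->i_blkbits;
        unsigned int i;

        for (i = first; i <= last; i++)
                if (!test_bit(i, iop->uptodate))
                        return 0;       /* a block in the range is not uptodate */
        return 1;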

    Zorro noticed this when KASAN went off for 512 byte blocks on a 64k
    page system:

    BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0
    Read of size 8 at addr ffff800120c3a318 by task fsstress/22337

    Reported-by: Zorro Lang
    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Eric Sandeen
     

29 Dec, 2018

1 commit

  • [ Upstream commit a837eca2412051628c0529768c9bc4f3580b040e ]

    This reverts commit 61c6de667263184125d5ca75e894fcad632b0dd3.

    The reverted commit added page reference counting to iomap page
    structures that are used to track block size < page size state. This
    was supposed to align the code with page migration page accounting
    assumptions, but what it has done instead is break XFS filesystems.
    Every fstests run I've done on sub-page block size XFS filesystems
    since picking up this commit 2 days ago has failed with bad page
    state errors such as:

    # ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
    ....
    SECTION -- xfs
    FSTYP -- xfs (debug)
    PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+
    MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
    MOUNT_OPTIONS -- /dev/sdc /mnt/scratch

    generic/038 454s ...
    run fstests generic/038 at 2018-12-20 18:43:05
    XFS (sdc): Unmounting Filesystem
    XFS (sdc): Mounting V5 Filesystem
    XFS (sdc): Ending clean mount
    BUG: Bad page state in process kswapd0 pfn:3a7fa
    page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
    flags: 0xfffffc0000000()
    raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
    raw: 0000000000000001 0000000000000000 00000000ffffffff
    page dumped because: non-NULL mapping
    CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
    Call Trace:
    dump_stack+0x67/0x90
    bad_page.cold.116+0x8a/0xbd
    free_pcppages_bulk+0x4bf/0x6a0
    free_unref_page_list+0x10f/0x1f0
    shrink_page_list+0x49d/0xf50
    shrink_inactive_list+0x19d/0x3b0
    shrink_node_memcg.constprop.77+0x398/0x690
    ? shrink_slab.constprop.81+0x278/0x3f0
    shrink_node+0x7a/0x2f0
    kswapd+0x34b/0x6d0
    ? node_reclaim+0x240/0x240
    kthread+0x11f/0x140
    ? __kthread_bind_mask+0x60/0x60
    ret_from_fork+0x24/0x30
    Disabling lock debugging due to kernel taint
    ....

    The failures come from anywhere that frees pages and empties the
    per-cpu page magazines, so it's not a predictable failure or an easy
    to debug failure.

    generic/038 is a reliable reproducer of this problem - it has a 9 in
    10 failure rate on one of my test machines. Failures on other
    machines have been at random points in fstests runs but every run
    has ended up tripping this problem. Hence generic/038 was used to
    bisect the failure because it was the most reliable failure.

    It is too close to the 4.20 release (not to mention holidays) to
    try to diagnose, fix and test the underlying cause of the problem,
    so reverting the commit is the only option we have right now. The
    revert has been tested against a current top-of-tree 4.20-rc7+ kernel
    across multiple machines running sub-page block size XFS filesystems and
    none of the bad page state failures have been seen.

    Signed-off-by: Dave Chinner
    Cc: Piotr Jaroszynski
    Cc: Christoph Hellwig
    Cc: William Kucharski
    Cc: Darrick J. Wong
    Cc: Brian Foster
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Dave Chinner
     

20 Dec, 2018

1 commit

  • commit 61c6de667263184125d5ca75e894fcad632b0dd3 upstream.

    migrate_page_move_mapping() expects pages with private data set to have
    a page_count elevated by 1. This is what used to happen for xfs through
    the buffer_heads code before the switch to iomap in commit 82cb14175e7d
    ("xfs: add support for sub-pagesize writeback without buffer_heads").
    Not having the count elevated causes move_pages() to fail on memory
    mapped files coming from xfs.

    Make iomap compatible with the migrate_page_move_mapping() assumption by
    elevating the page count as part of iomap_page_create() and lowering it
    in iomap_page_release().

    It causes the move_pages() syscall to misbehave on memory mapped files
    from xfs. It does not move any pages, which I suppose is "just" a
    perf issue, but it also ends up returning a positive number which is out
    of spec for the syscall. Talking to Michal Hocko, it sounds like
    returning positive numbers might be a necessary update to move_pages()
    anyway though
    (https://lkml.kernel.org/r/20181116114955.GJ14706@dhcp22.suse.cz).

    I only hit this in tests that verify that move_pages() actually moved
    the pages. The test also got confused by the positive return from
    move_pages() (it got treated as a success as positive numbers were not
    expected and not handled) making it a bit harder to track down what's
    going on.

    Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com
    Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads")
    Signed-off-by: Piotr Jaroszynski
    Reviewed-by: Christoph Hellwig
    Cc: William Kucharski
    Cc: Darrick J. Wong
    Cc: Brian Foster
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Piotr Jaroszynski
     

29 Sep, 2018

1 commit

  • The iomap page fault mechanism currently dirties the associated page
    after the full block range of the page has been allocated. This
    leaves the page susceptible to delayed allocations without ever
    being set dirty on sub-page block sized filesystems.

    For example, consider a page fault on a page with one preexisting
    real (non-delalloc) block allocated in the middle of the page. The
    first iomap_apply() iteration performs delayed allocation on the
    range up to the preexisting block, the next iteration finds the
    preexisting block, and the last iteration attempts to perform
    delayed allocation on the range after the preexisting block to the
    end of the page. If the first allocation succeeds and the final
    allocation fails with -ENOSPC, iomap_apply() returns the error and
    iomap_page_mkwrite() fails to dirty the page having already
    performed partial delayed allocation. This eventually results in the
    page being invalidated without ever converting the delayed
    allocation to real blocks.

    This problem is reliably reproduced by generic/083 on XFS on ppc64
    systems (64k page size, 4k block size). It results in leaked
    delalloc blocks on inode reclaim, which triggers an assert failure
    in xfs_fs_destroy_inode() and filesystem accounting inconsistency.

    Move the set_page_dirty() call from iomap_page_mkwrite() to the
    actor callback, similar to how the buffer head implementation works.
    The actor callback is called iff ->iomap_begin() returns success, so it
    ensures the page is dirtied as soon as possible after an allocation.
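
    A stripped-down sketch of the actor shape this describes (the real
    callback also handles the legacy IOMAP_F_BUFFER_HEAD case; names and
    error handling simplified):

        static loff_t
        page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
                           void *data, struct iomap *iomap)
        {
                struct page *page = data;

                /* Runs once per mapped extent, and only after
                 * ->iomap_begin() succeeded, so blocks just reserved for
                 * this sub-page range are dirtied before a later
                 * iteration has a chance to fail. */
                set_page_dirty(page);
                return length;
        }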

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

22 Aug, 2018

1 commit

  • Pull xfs fixes from Darrick Wong:

    - Fix an uninitialized variable

    - Don't use obviously garbage AG header counters to calculate
    transaction reservations

    - Trigger icount recalculation on bad icount when mounting

    * tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    iomap: fix WARN_ON_ONCE on uninitialized variable
    xfs: sanity check ag header values in xrep_calc_ag_resblks
    xfs: recalculate summary counters at mount time if icount is bad

    Linus Torvalds
     

14 Aug, 2018

3 commits

  • Pull xfs updates from Darrick Wong:
    "This is the second part of the XFS changes for 4.19.

    The biggest changes are the removal of buffer heads from XFS, a massive
    reworking of the deferred transaction operations handling code, the
    removal of the long defunct barrier/nobarrier mount options, and the
    addition of a few more online repair functions.

    Summary:

    - Use extent maps to track pagecache page status instead of
    bufferhead state.

    - Refactor pagecache read and write paths to use the new iomap
    library functions, which enable us to drop the old bufferhead code
    for pagesize == blocksize filesystems.

    - Set up parallel per-block-per-page metadata to track subpage
    information that was tracked by buffer heads, which enables us to
    drop the old bufferhead code for pagesize > blocksize filesystems.

    - Tie a deferred ops control structure to a transaction so that we
    can take advantage of an upper-level dfops without having to plumb
    pointer passing through the code.

    - Refactor the deferred ops code to track deferred ops as part of the
    transaction structure (instead of as a separate data structure) so
    that we can simplify the scoping rules around defer_ops.

    - Refactor twisty delwri buffer submission code to avoid deadlocks.

    - Shorten and fix indenting problems in the scrub code.

    - Detect obviously bad summary counts at mount and fix them.

    - Directly associate deferred ops control structure with a
    transaction so that callers no longer have to manage it themselves.

    - Remove a couple of IRIX-era inode macros.

    - Remove the long-deprecated 'barrier' and 'nobarrier' mount options.

    - Clean up the inode fork structure a bit.

    - Check for bad fs summary counter values in the superblock.

    - Reduce COW fork lookups during writeback.

    - Refactor the deferred ops control structures into the transaction
    structure, thereby eliminating the need for transaction users to
    handle the deferred ops as a separate data structure.

    - Add the ability to repair AG headers online.

    - Fix a crash due to insufficient return value checking.

    - Various fixes and cleanups"

    * tag 'xfs-4.19-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (155 commits)
    xfs: fix a null pointer dereference in xfs_bmap_extents_to_btree
    xfs: remove b_last_holder & associated macros
    iomap: Switch to offset_in_page for clarity
    xfs: Close race between direct IO and xfs_break_layouts()
    xfs: repair the AGI
    xfs: repair the AGFL
    xfs: repair the AGF
    xfs: remove dead error handling code in xfs_dquot_disk_alloc()
    xfs: use WRITE_ONCE to update if_seq
    xfs: fix a comment in xfs_log_reserve
    xfs: only validate summary counts on primary superblock
    xfs: substitute spaces with tabs
    xfs: fold dfops into the transaction
    xfs: always defer agfl block frees
    xfs: pass transaction to xfs_defer_add()
    xfs: replace xfs_defer_ops ->dop_pending with on-stack list
    xfs: cancel dfops on xfs_defer_finish() error
    xfs: clean out superfluous dfops dop params/vars
    xfs: drop dop param from xfs_defer_op_type ->finish_item() callback
    xfs: automatic dfops inode relogging
    ...

    Linus Torvalds
     
  • In commit 9dc55f1389f9569 ("iomap: add support for sub-pagesize buffered
    I/O without buffer heads") we moved the initialization of poff (it's
    computed from pos) into a separate helper function. Inline data only
    ever deals with pos == 0, hence the WARN_ON_ONCE, but now we're testing
    an uninitialized variable.

    Therefore, change the test to check the parameter directly.
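
    Condensed view of the change described above (reconstructed from the
    description, not the verbatim diff):

        -       WARN_ON_ONCE(poff);     /* poff is no longer initialized here */
        +       WARN_ON_ONCE(pos);      /* inline data only ever has pos == 0 */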

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Allison Henderson
    Reviewed-by: Carlos Maiolino

    Darrick J. Wong
     
  • Pull fs iomap refactoring from Darrick Wong:
    "This is the first part of the XFS changes for 4.19.

    Christoph and Andreas coordinated some refactoring work on the iomap
    code in preparation for removing buffer heads from XFS and porting
    gfs2 to iomap. I'm sending this small pull request ahead of the main
    XFS merge to avoid holding up gfs2 unnecessarily"

    * 'iomap-4.19-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    iomap: add inline data support to iomap_readpage_actor
    iomap: support direct I/O to inline data
    iomap: refactor iomap_dio_actor
    iomap: add initial support for writes without buffer heads
    iomap: add an iomap-based readpage and readpages implementation
    iomap: add private pointer to struct iomap
    iomap: add a page_done callback
    iomap: generic inline data handling
    iomap: complete partial direct I/O writes synchronously
    iomap: mark newly allocated buffer heads as new
    fs: factor out a __generic_write_end helper

    Linus Torvalds
     


03 Aug, 2018

1 commit

  • The position calculation in iomap_bmap() shifts bno the wrong way,
    so we don't progress properly and end up re-mapping block zero
    over and over, yielding an unchanging physical block range as the
    logical block advances:

    # filefrag -Be file
     ext:   logical_offset:   physical_offset:  length:  expected:  flags:
       0:        0..       0:       21..    21:       1:             merged
       1:        1..       1:       21..    21:       1:        22:  merged
     Discontinuity: Block 1 is at 21 (was 22)
       2:        2..       2:       21..    21:       1:        22:  merged
     Discontinuity: Block 2 is at 21 (was 22)
       3:        3..       3:       21..    21:       1:        22:  merged

    This breaks the FIBMAP interface for anyone using it (XFS), which
    in turn breaks LILO, zipl, etc.
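
    The essence of the one-line fix, reconstructed from the description
    (not the verbatim diff):

        -       loff_t pos = bno >> inode->i_blkbits;   /* shifts the wrong way */
        +       loff_t pos = bno << inode->i_blkbits;   /* logical block -> byte offset */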

    Bug-actually-spotted-by: Darrick J. Wong
    Fixes: 89eb1906a953 ("iomap: add an iomap-based bmap implementation")
    Cc: stable@vger.kernel.org
    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eric Sandeen
     

12 Jul, 2018

1 commit

  • After already supporting a simple implementation of buffered writes for
    the blocksize == PAGE_SIZE case in the last commit this adds full support
    even for smaller block sizes. There are three bits of per-block
    information in the buffer_head structure that really matter for the iomap
    read and write path:

    - uptodate status (BH_uptodate)
    - marked as currently under read I/O (BH_Async_Read)
    - marked as currently under write I/O (BH_Async_Write)

    Instead of having new per-block structures this now adds a per-page
    structure called struct iomap_page to track this information in a slightly
    different form:

    - a bitmap for the per-block uptodate status. For the worst case of a 64k
    page size system this bitmap needs to contain 128 bits. For the
    typical 4k page size case it only needs 8 bits, although we still
    need a full unsigned long due to the way the atomic bitmap API works.
    - two atomic_t counters are used to track the outstanding read and write
    counts

    There is quite a bit of boilerplate code as the buffered I/O path uses
    various helper methods, but the actual code is very straightforward.
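
    A sketch of the structure described above (field names follow the
    description; the per-block granularity is assumed to be 512-byte
    sectors, which gives the 128 bits mentioned for a 64k page):

        struct iomap_page {
                atomic_t        read_count;     /* outstanding read I/O  */
                atomic_t        write_count;    /* outstanding write I/O */
                DECLARE_BITMAP(uptodate, PAGE_SIZE / 512); /* per-block bits */
        };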

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

04 Jul, 2018

3 commits

  • Just copy the inline data into the page using the existing helper.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     
  • Add support for reading from and writing to inline data to iomap_dio_rw.
    This saves filesystems from having to implement fallback code for this
    case.

    The inline data is actually cached in the inode, so the I/O is only
    direct in the sense that it doesn't go through the page cache.

    Signed-off-by: Andreas Gruenbacher
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     
  • Split the function up into two helpers for the bio based I/O and hole
    case, and a small helper to call the two. This separates the code a
    little better in preparation for supporting I/O to inline data.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andreas Gruenbacher
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

21 Jun, 2018

1 commit

  • For now just limited to blocksize == PAGE_SIZE, where we can simply read
    in the full page in write begin, and just set the whole page dirty after
    copying data into it. This code is enabled by default and XFS will now
    be fed pages without buffer heads in ->writepage and ->writepages.

    If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap, the old
    path will still be used; this both helps the transition in XFS and
    prepares for the gfs2 migration to the iomap infrastructure.
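
    A hedged sketch of that opt-out in the write_begin path (surrounding
    code simplified; __block_write_begin_int() is the existing bufferhead
    helper):

        /* A filesystem that set IOMAP_F_BUFFER_HEAD in ->iomap_begin()
         * is routed back to the legacy bufferhead code. */
        if (iomap->flags & IOMAP_F_BUFFER_HEAD)
                return __block_write_begin_int(page, pos, len, NULL, iomap);
        /* ...otherwise continue on the new bufferhead-free path... */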

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

20 Jun, 2018

4 commits

    Simply use iomap_apply to iterate over the file and submit a bio for
    each non-uptodate but mapped region, zeroing everything else. Note that
    as-is this cannot be used for file systems with a blocksize smaller than
    the page size, but that support will be added later.
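
    A condensed sketch of the per-extent decision the actor makes (bio
    construction and error handling elided; names approximate):

        static loff_t
        readpage_actor(struct inode *inode, loff_t pos, loff_t length,
                       void *data, struct iomap *iomap)
        {
                struct page *page = data;

                if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN) {
                        /* nothing to read from disk: zero the range */
                        zero_user(page, offset_in_page(pos), length);
                } else {
                        /* mapped but not uptodate: queue a bio for it */
                        /* bio setup and submit_bio() elided */
                }
                return length;
        }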

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • This will be used by gfs2 to attach data to transactions for the journaled
    data mode. But the concept is generic enough that we might be able to
    use it for other purposes like encryption/integrity post-processing in the
    future.

    Based on a patch from Andreas Gruenbacher.
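
    The shape of the hook (argument list approximate, reconstructed from
    the description; it would be invoked from iomap_write_end() once the
    copy into the page has finished):

        struct iomap_ops {
                /* ->iomap_begin and ->iomap_end as before ... */
                void (*page_done)(struct inode *inode, loff_t pos,
                                  unsigned copied, struct page *page,
                                  struct iomap *iomap);
        };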

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • Add generic inline data handling by adding a pointer to the inline data
    region to struct iomap. When handling a buffered IOMAP_INLINE write,
    iomap_write_begin will copy the current inline data from the inline data
    region into the page cache, and iomap_write_end will copy the changes in
    the page cache back to the inline data region.

    This doesn't cover inline data reads and direct I/O yet because so far,
    we have no users.
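
    The gist of the two copies, as a hedged fragment (kmap and error
    handling simplified; size stands for the inline data length):

        /* write_begin: seed the page from the inline data region */
        void *addr = kmap_atomic(page);
        memcpy(addr, iomap->inline_data, size);
        memset(addr + size, 0, PAGE_SIZE - size);
        kunmap_atomic(addr);
        SetPageUptodate(page);

        /* write_end: copy the page changes back to the inline region */
        addr = kmap_atomic(page);
        memcpy(iomap->inline_data + pos, addr + pos, copied);
        kunmap_atomic(addr);
        mark_inode_dirty(inode);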

    Signed-off-by: Andreas Gruenbacher
    [hch: small cleanups to better fit in with other iomap work]
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     
  • According to xfstest generic/240, applications seem to expect direct I/O
    writes to either complete as a whole or to fail; short direct I/O writes
    are apparently not appreciated. This means that when only part of an
    asynchronous direct I/O write succeeds, we can either fail the entire
    write, or we can wait for the partial write to complete and retry the
    remaining write as buffered I/O. The old __blockdev_direct_IO helper
    has code for waiting for partial writes to complete; the new
    iomap_dio_rw iomap helper does not.

    The above-mentioned fallback mode is needed for gfs2, which doesn't
    allow block allocations under direct I/O to avoid taking cluster-wide
    exclusive locks. As a consequence, an asynchronous direct I/O write to
    a file range that contains a hole will result in a short write. In that
    case, wait for the short write to complete to allow gfs2 to recover.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Andreas Gruenbacher
     

13 Jun, 2018

1 commit

  • Pull more xfs updates from Darrick Wong:
    "Here's the second round of patches for XFS for 4.18. Most of the
    commits are small cleanups, bug fixes, and continued strengthening of
    metadata verifiers; the bulk of the diff is the conversion of the
    fs/xfs/ tree to use SPDX tags.

    This series has been run through a full xfstests run over the weekend
    and through a quick xfstests run against this morning's master, with
    no major failures reported.

    Summary:

    - Strengthen metadata checking to avoid ASSERTing on bad disk
    contents

    - Validate btree records that are being retrieved for clients

    - Strengthen root inode verification

    - Convert license blurbs to SPDX tags

    - Enable changing DAX flag on directories

    - Fix some writeback deadlocks in reflink

    - Refactor out some old xfs helpers

    - Move type verifiers to a separate file

    - Fix some fuzzer crashes

    - Various other bug fixes"

    * tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits)
    xfs: update incore per-AG inode count
    xfs: replace do_mod with native operations
    xfs: don't call xfs_da_shrink_inode with NULL bp
    xfs: clean up MIN/MAX
    xfs: move various type verifiers to common file
    xfs: xfs_reflink_convert_cow() memory allocation deadlock
    xfs: setup VFS i_rwsem lockdep state correctly
    xfs: fix string handling in label get/set functions
    xfs: convert to SPDX license tags
    xfs: validate btree records on retrieval
    xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode()
    xfs: verify root inode more thoroughly
    xfs: verify COW extent size hint is valid in inode verifier
    xfs: verify extent size hint is valid in inode verifier
    xfs: catch bad stripe alignment configurations
    iomap: fsync swap files before iterating mappings
    xfs: use xfs_trans_getsb in xfs_sync_sb_buf
    xfs: don't assert on corrupted unlinked inode list
    xfs: explicitly pass buffer size to xfs_corruption_error
    xfs: don't assert when on-disk btree pointers are garbage
    ...

    Linus Torvalds
     

09 Jun, 2018

1 commit

  • Pull aio iopriority support from Al Viro:
    "The rest of aio stuff for this cycle - Adam's aio ioprio series"

    * 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: aio ioprio use ioprio_check_cap ret val
    fs: aio ioprio add explicit block layer dependence
    fs: iomap dio set bio prio from kiocb prio
    fs: blkdev set bio prio from kiocb prio
    fs: Add aio iopriority support
    fs: Convert kiocb rw_hint from enum to u16
    block: add ioprio_check_cap function

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • Swap files require that all the file mapping metadata be stable on disk.
    It is insufficient to flush dirty pages in the page cache because that
    won't necessarily result in filesystems pushing all their metadata out
    to disk. Therefore, call fsync from iomap_swapfile_activate.
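
    A plausible shape of the change inside iomap_swapfile_activate() (an
    assumption based on the description, not the verbatim diff):

        /* Push dirty data and the file's mapping metadata to disk
         * before walking the extent mappings below. */
        ret = vfs_fsync(swap_file, 1);
        if (ret)
                return ret;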

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara

    Darrick J. Wong
     


17 May, 2018

2 commits

  • generic_swapfile_activate() doesn't allow holes, so we should be
    consistent here. This is also a bit safer: if the user creates a
    swapfile with, say, truncate -s $SIZE followed by mkswap, they should
    really get an error and not much less swap space than they expected.
    swapon(8) will error out before calling swapon(2) if the file has holes,
    anyway.
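
    An approximate form of the check in the swapfile activation actor
    (reconstructed from the description):

        if (iomap->type == IOMAP_HOLE) {
                pr_err("swapon: file has holes\n");
                return -EINVAL;         /* match generic_swapfile_activate() */
        }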

    Fixes: 9d93388b0afe ("iomap: add a swapfile activation function")
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Darrick J. Wong

    Omar Sandoval
     
  • Currently, for an invalid swap file, we print the same error message
    regardless of the reason. This isn't very useful for an admin, who will
    likely want to know why exactly they can't use their swap file. So,
    let's add specific error messages for each reason, and also move the
    bdev check after the flags checks, since the latter are more
    fundamental.

    Reviewed-by: Darrick J. Wong
    Signed-off-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Darrick J. Wong

    Omar Sandoval
     

16 May, 2018

1 commit

  • Add a new iomap_swapfile_activate function so that filesystems can
    activate swap files without having to use the obsolete and slow bmap
    function. This enables XFS to support fallocate'd swap files and
    swap files on realtime devices.
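
    How a filesystem might wire this up (the myfs_* names are
    hypothetical):

        static int myfs_swap_activate(struct swap_info_struct *sis,
                                      struct file *swap_file, sector_t *span)
        {
                /* iomap walks the file's extents through the fs's own
                 * iomap_ops instead of per-block ->bmap() calls. */
                return iomap_swapfile_activate(sis, swap_file, span,
                                               &myfs_iomap_ops);
        }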

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara

    Darrick J. Wong
     

10 May, 2018

2 commits

    If we are doing direct IO writes with datasync semantics, we often
    have to flush metadata changes along with the data write. However,
    if we are overwriting existing data, there are no metadata changes
    that we need to flush. In this case, optimising the IO by using a
    FUA write makes sense.

    We know from the IOMAP_F_DIRTY flag whether a specific inode
    requires a metadata flush - this is currently used by DAX to ensure
    extent modifications are stable in page fault operations. For direct
    IO writes, we can use it to determine if we need to flush metadata
    or not once the data is on disk.

    Hence if we have been returned a mapped extent that is not new and
    the IO mapping is not dirty, then we can use a FUA write to provide
    datasync semantics. This allows us to short-cut the
    generic_write_sync() call in IO completion and hence avoid
    unnecessary operations. This makes pure direct IO data write
    behaviour identical to the way block devices use REQ_FUA to provide
    datasync semantics.
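
    The gist of the decision, as a hedged sketch (the IOMAP_F_* flags are
    the real iomap ones; use_fua and the fallback flag are simplified
    stand-ins for the surrounding code):

        /* Overwriting stable, already-mapped blocks: FUA makes the data
         * durable with no separate cache flush, giving datasync
         * semantics for free. */
        if (!(iomap->flags & (IOMAP_F_NEW | IOMAP_F_DIRTY)) &&
            blk_queue_fua(bdev_get_queue(iomap->bdev)))
                use_fua = true;                 /* tag the bios with REQ_FUA */
        else
                need_write_sync = true;         /* fall back to generic_write_sync() */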

    On a FUA enabled device, a synchronous direct IO write workload
    (sequential 4k overwrites in 32MB file) had the following results:

    # xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo

    kernel        time    write()s    write iops    write b/w
    ------        ----    --------    ----------    ---------
    (no dsync)      4s      2173/s          2173      8.5MB/s
    vanilla        22s       370/s           750      1.4MB/s
    patched        19s       420/s           420      1.6MB/s

    The patched code clearly doesn't send cache flushes anymore, but
    instead uses FUA (confirmed via blktrace), and performance improves
    a bit as a result. However, the benefits will be higher on workloads
    that mix O_DSYNC overwrites with other write IO as we won't be
    flushing the entire device cache on every DSYNC overwrite IO
    anymore.

    Signed-Off-By: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     
  • Currently iomap_dio_rw() only handles (data)sync write completions
    for AIO. This means we can't optimise non-AIO IO to minimise device
    flushes as we can't tell the caller whether a flush is required or
    not.

    To solve this problem and enable further optimisations, make
    iomap_dio_rw responsible for data sync behaviour for all IO, not
    just AIO.

    In doing so, the sync operation is now accounted as part of the DIO
    IO by inode_dio_end(), hence post-IO data stability updates will no
    longer race against operations that serialise via inode_dio_wait()
    such as truncate or hole punch.

    Signed-Off-By: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     


09 Jan, 2018

1 commit

  • If two programs simultaneously try to write to the same part of a file
    via direct IO and buffered IO, there's a chance that the post-diowrite
    pagecache invalidation will fail on the dirty page. When this happens,
    the dio write succeeded, which means that the page cache is no longer
    coherent with the disk!

    Programs are not supposed to mix IO types and this is a clear case of
    data corruption, so store an EIO which will be reflected to userspace
    during the next fsync. Replace the WARN_ON with a ratelimited pr_crit
    so that the developers have /some/ kind of breadcrumb to track down the
    offending program(s) and file(s) involved.
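
    A hedged sketch of the handling (helper usage simplified; the message
    text is illustrative):

        if (invalidate_inode_pages2_range(mapping,
                        offset >> PAGE_SHIFT, end >> PAGE_SHIFT)) {
                /* record -EIO so the next fsync() reports the corruption */
                mapping_set_error(mapping, -EIO);
                pr_crit_ratelimited("page cache invalidation failed on direct I/O; possible data corruption from mixed buffered and direct writes\n");
        }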

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Liu Bo

    Darrick J. Wong
     

18 Nov, 2017

1 commit

  • Pull iov_iter updates from Al Viro:

    - bio_{map,copy}_user_iov() series; those are cleanups - fixes from the
    same pile went into mainline (and stable) in late September.

    - fs/iomap.c iov_iter-related fixes

    - new primitive - iov_iter_for_each_range(), which applies a function
    to kernel-mapped segments of an iov_iter.

    Usable for kvec and bvec ones, the latter does kmap()/kunmap() around
    the callback. _Not_ usable for iovec- or pipe-backed iov_iter; the
    latter is not hard to fix if the need ever appears, the former is by
    design.

    Another related primitive will have to wait for the next cycle - it
    passes page + offset + size instead of pointer + size, and that one
    will be usable for everything _except_ kvec. Unfortunately, that one
    didn't get exposure in -next yet, so...

    - a bit more lustre iov_iter work, including a use case for
    iov_iter_for_each_range() (checksum calculation)

    - vhost/scsi leak fix in failure exit

    - misc cleanups and detritectomy...

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
    iomap_dio_actor(): fix iov_iter bugs
    switch ksocknal_lib_recv_...() to use of iov_iter_for_each_range()
    lustre: switch struct ksock_conn to iov_iter
    vhost/scsi: switch to iov_iter_get_pages()
    fix a page leak in vhost_scsi_iov_to_sgl() error recovery
    new primitive: iov_iter_for_each_range()
    lnet_return_rx_credits_locked: don't abuse list_entry
    xen: don't open-code iov_iter_kvec()
    orangefs: remove detritus from struct orangefs_kiocb_s
    kill iov_shorten()
    bio_alloc_map_data(): do bmd->iter setup right there
    bio_copy_user_iov(): saner bio size calculation
    bio_map_user_iov(): get rid of copying iov_iter
    bio_copy_from_iter(): get rid of copying iov_iter
    move more stuff down into bio_copy_user_iov()
    blk_rq_map_user_iov(): move iov_iter_advance() down
    bio_map_user_iov(): get rid of the iov_for_each()
    bio_map_user_iov(): move alignment check into the main loop
    don't rely upon subsequent bio_add_pc_page() calls failing
    ... and with iov_iter_get_pages_alloc() it becomes even simpler
    ...

    Linus Torvalds