14 Mar, 2019
2 commits
-
[ Upstream commit 4ea899ead2786a30aaa8181fefa81a3df4ad28f6 ]
Introduce a local wait_for_completion variable to avoid an access to the
potentially freed dio struture after dropping the last reference count.Also use the chance to document the completion behavior to make the
refcounting clear to the reader of the code.Fixes: ff6a9292e6 ("iomap: implement direct I/O")
Reported-by: Chandan Rajendra
Reported-by: Darrick J. Wong
Signed-off-by: Christoph Hellwig
Tested-by: Chandan Rajendra
Tested-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Signed-off-by: Sasha Levin -
[ Upstream commit 8e47a457321ca1a74ad194ab5dcbca764bc70731 ]
migrate_page_move_mapping() expects pages with private data set to have
a page_count elevated by 1. This is what used to happen for xfs through
the buffer_heads code before the switch to iomap in commit 82cb14175e7d
("xfs: add support for sub-pagesize writeback without buffer_heads").
Not having the count elevated causes move_pages() to fail on memory
mapped files coming from xfs.Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it
in iomap_page_release().It causes the move_pages() syscall to misbehave on memory mapped files
from xfs. It does not not move any pages, which I suppose is "just" a
perf issue, but it also ends up returning a positive number which is out
of spec for the syscall. Talking to Michal Hocko, it sounds like
returning positive numbers might be a necessary update to move_pages()
anyway though.Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads")
Signed-off-by: Piotr Jaroszynski
[hch: actually get/put the page iomap_migrate_page() to make it work
properly]
Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Signed-off-by: Sasha Levin
26 Jan, 2019
1 commit
-
[ Upstream commit 3cc31fa65d85610574c0f6a474e89f4c419923d5 ]
iomap_is_partially_uptodate() is intended to check wither blocks within
the selected range of a not-uptodate page are uptodate; if the range we
care about is up to date, it's an optimization.However, the iomap implementation continues to check all blocks up to
from+count, which is beyond the page, and can even be well beyond the
iop->uptodate bitmap.I think the worst that will happen is that we may eventually find a zero
bit and return "not partially uptodate" when it would have otherwise
returned true, and skip the optimization. Still, it's clearly an invalid
memory access that must be fixed.So: fix this by limiting the search to within the page as is done in the
non-iomap variant, block_is_partially_uptodate().Zorro noticed thiswhen KASAN went off for 512 byte blocks on a 64k
page system:BUG: KASAN: slab-out-of-bounds in iomap_is_partially_uptodate+0x1a0/0x1e0
Read of size 8 at addr ffff800120c3a318 by task fsstress/22337Reported-by: Zorro Lang
Signed-off-by: Eric Sandeen
Signed-off-by: Eric Sandeen
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Signed-off-by: Sasha Levin
29 Dec, 2018
1 commit
-
[ Upstream commit a837eca2412051628c0529768c9bc4f3580b040e ]
This reverts commit 61c6de667263184125d5ca75e894fcad632b0dd3.
The reverted commit added page reference counting to iomap page
structures that are used to track block size < page size state. This
was supposed to align the code with page migration page accounting
assumptions, but what it has done instead is break XFS filesystems.
Every fstests run I've done on sub-page block size XFS filesystems
has since picking up this commit 2 days ago has failed with bad page
state errors such as:# ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
....
SECTION -- xfs
FSTYP -- xfs (debug)
PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+
MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /mnt/scratchgeneric/038 454s ...
run fstests generic/038 at 2018-12-20 18:43:05
XFS (sdc): Unmounting Filesystem
XFS (sdc): Mounting V5 Filesystem
XFS (sdc): Ending clean mount
BUG: Bad page state in process kswapd0 pfn:3a7fa
page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
flags: 0xfffffc0000000()
raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
raw: 0000000000000001 0000000000000000 00000000ffffffff
page dumped because: non-NULL mapping
CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
Call Trace:
dump_stack+0x67/0x90
bad_page.cold.116+0x8a/0xbd
free_pcppages_bulk+0x4bf/0x6a0
free_unref_page_list+0x10f/0x1f0
shrink_page_list+0x49d/0xf50
shrink_inactive_list+0x19d/0x3b0
shrink_node_memcg.constprop.77+0x398/0x690
? shrink_slab.constprop.81+0x278/0x3f0
shrink_node+0x7a/0x2f0
kswapd+0x34b/0x6d0
? node_reclaim+0x240/0x240
kthread+0x11f/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x24/0x30
Disabling lock debugging due to kernel taint
....The failures are from anyway that frees pages and empties the
per-cpu page magazines, so it's not a predictable failure or an easy
to debug failure.generic/038 is a reliable reproducer of this problem - it has a 9 in
10 failure rate on one of my test machines. Failure on other
machines have been at random points in fstests runs but every run
has ended up tripping this problem. Hence generic/038 was used to
bisect the failure because it was the most reliable failure.It is too close to the 4.20 release (not to mention holidays) to
try to diagnose, fix and test the underlying cause of the problem,
so reverting the commit is the only option we have right now. The
revert has been tested against a current tot 4.20-rc7+ kernel across
multiple machines running sub-page block size XFs filesystems and
none of the bad page state failures have been seen.Signed-off-by: Dave Chinner
Cc: Piotr Jaroszynski
Cc: Christoph Hellwig
Cc: William Kucharski
Cc: Darrick J. Wong
Cc: Brian Foster
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
20 Dec, 2018
1 commit
-
commit 61c6de667263184125d5ca75e894fcad632b0dd3 upstream.
migrate_page_move_mapping() expects pages with private data set to have
a page_count elevated by 1. This is what used to happen for xfs through
the buffer_heads code before the switch to iomap in commit 82cb14175e7d
("xfs: add support for sub-pagesize writeback without buffer_heads").
Not having the count elevated causes move_pages() to fail on memory
mapped files coming from xfs.Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it
in iomap_page_release().It causes the move_pages() syscall to misbehave on memory mapped files
from xfs. It does not not move any pages, which I suppose is "just" a
perf issue, but it also ends up returning a positive number which is out
of spec for the syscall. Talking to Michal Hocko, it sounds like
returning positive numbers might be a necessary update to move_pages()
anyway though
(https://lkml.kernel.org/r/20181116114955.GJ14706@dhcp22.suse.cz).I only hit this in tests that verify that move_pages() actually moved
the pages. The test also got confused by the positive return from
move_pages() (it got treated as a success as positive numbers were not
expected and not handled) making it a bit harder to track down what's
going on.Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com
Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads")
Signed-off-by: Piotr Jaroszynski
Reviewed-by: Christoph Hellwig
Cc: William Kucharski
Cc: Darrick J. Wong
Cc: Brian Foster
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
29 Sep, 2018
1 commit
-
The iomap page fault mechanism currently dirties the associated page
after the full block range of the page has been allocated. This
leaves the page susceptible to delayed allocations without ever
being set dirty on sub-page block sized filesystems.For example, consider a page fault on a page with one preexisting
real (non-delalloc) block allocated in the middle of the page. The
first iomap_apply() iteration performs delayed allocation on the
range up to the preexisting block, the next iteration finds the
preexisting block, and the last iteration attempts to perform
delayed allocation on the range after the prexisting block to the
end of the page. If the first allocation succeeds and the final
allocation fails with -ENOSPC, iomap_apply() returns the error and
iomap_page_mkwrite() fails to dirty the page having already
performed partial delayed allocation. This eventually results in the
page being invalidated without ever converting the delayed
allocation to real blocks.This problem is reliably reproduced by generic/083 on XFS on ppc64
systems (64k page size, 4k block size). It results in leaked
delalloc blocks on inode reclaim, which triggers an assert failure
in xfs_fs_destroy_inode() and filesystem accounting inconsistency.Move the set_page_dirty() call from iomap_page_mkwrite() to the
actor callback, similar to how the buffer head implementation works.
The actor callback is called iff ->iomap_begin() returns success, so
ensures the page is dirtied as soon as possible after an allocation.Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
22 Aug, 2018
1 commit
-
Pull xfs fixes from Darrick Wong:
- Fix an uninitialized variable
- Don't use obviously garbage AG header counters to calculate
transaction reservations- Trigger icount recalculation on bad icount when mounting
* tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix WARN_ON_ONCE on uninitialized variable
xfs: sanity check ag header values in xrep_calc_ag_resblks
xfs: recalculate summary counters at mount time if icount is bad
14 Aug, 2018
3 commits
-
Pull xfs updates from Darrick Wong:
"This is the second part of the XFS changes for 4.19.The biggest changes are the removal of buffer heads frm XFS, a massive
reworking of the deferred transaction operations handling code, the
removal of the long defunct barrier/nobarrier mount options, and the
addition of a few more online repair functions.Summary:
- Use extent maps to track pagecache page status instead of
bufferhead state.- Refactor pagecache read and write paths to use the new iomap
library functions, which enable us to drop the old bufferhead code
for pagesize == blocksize filesystems.- Set up parallel per-block-per-page metadata to track subpage
information that was tracked by buffer heads, which enables us to
drop the old bufferhead code for pagesize > blocksize filesystems.- Tie a deferred ops control structure to a transaction so that we
can take advantage of an upper-level dfops without having to plumb
pointer passing through the code.- Refactor the deferred ops code to track deferred ops as part of the
transaction structure (instead of as a separate data structure) so
that we can simplify the scoping rules around defer_ops.- Refactor twisty delwri buffer submission code to avoid deadlocks.
- Shorten and fix indenting problems in the scrub code.
- Detect obviously bad summary counts at mount and fix them.
- Directly associate deferred ops control structure with a
transaction so that callers no longer have to manage it themselves.- Remove a couple of IRIX-era inode macros.
- Remove the long-deprecated 'barrier' and 'nobarrier' mount options.
- Clean up the inode fork structure a bit.
- Check for bad fs summary counter values in the superblock.
- Reduce COW fork lookups during writeback.
- Refactor the deferred ops control structures into the transaction
structure, thereby eliminating the need for transaction users to
handle the deferred ops as a separate data structure.- Add the ability to repair AG headers online.
- Fix a crash due to insufficient return value checking.
- Various fixes and cleanups"
* tag 'xfs-4.19-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (155 commits)
xfs: fix a null pointer dereference in xfs_bmap_extents_to_btree
xfs: remove b_last_holder & associated macros
iomap: Switch to offset_in_page for clarity
xfs: Close race between direct IO and xfs_break_layouts()
xfs: repair the AGI
xfs: repair the AGFL
xfs: repair the AGF
xfs: remove dead error handling code in xfs_dquot_disk_alloc()
xfs: use WRITE_ONCE to update if_seq
xfs: fix a comment in xfs_log_reserve
xfs: only validate summary counts on primary superblock
xfs: substitute spaces with tabs
xfs: fold dfops into the transaction
xfs: always defer agfl block frees
xfs: pass transaction to xfs_defer_add()
xfs: replace xfs_defer_ops ->dop_pending with on-stack list
xfs: cancel dfops on xfs_defer_finish() error
xfs: clean out superfluous dfops dop params/vars
xfs: drop dop param from xfs_defer_op_type ->finish_item() callback
xfs: automatic dfops inode relogging
... -
In commit 9dc55f1389f9569 ("iomap: add support for sub-pagesize buffered
I/O without buffer heads") we moved the initialization of poff (it's
computed from pos) into a separate helper function. Inline data only
ever deals with pos == 0, hence the WARN_ON_ONCE, but now we're testing
an uninitialized variable.Therefore, change the test to check the parameter directly.
Signed-off-by: Darrick J. Wong
Reviewed-by: Allison Henderson
Reviewed-by: Carlos Maiolino -
Pull fs iomap refactoring from Darrick Wong:
"This is the first part of the XFS changes for 4.19.Christoph and Andreas coordinated some refactoring work on the iomap
code in preparation for removing buffer heads from XFS and porting
gfs2 to iomap. I'm sending this small pull request ahead of the main
XFS merge to avoid holding up gfs2 unnecessarily"* 'iomap-4.19-merge' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: add inline data support to iomap_readpage_actor
iomap: support direct I/O to inline data
iomap: refactor iomap_dio_actor
iomap: add initial support for writes without buffer heads
iomap: add an iomap-based readpage and readpages implementation
iomap: add private pointer to struct iomap
iomap: add a page_done callback
iomap: generic inline data handling
iomap: complete partial direct I/O writes synchronously
iomap: mark newly allocated buffer heads as new
fs: factor out a __generic_write_end helper
12 Aug, 2018
1 commit
-
Instead of open-coding pos & (PAGE_SIZE - 1) and pos & ~PAGE_MASK, use
the offset_in_page macro.Signed-off-by: Andreas Gruenbacher
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
03 Aug, 2018
1 commit
-
The position calculation in iomap_bmap() shifts bno the wrong way,
so we don't progress properly and end up re-mapping block zero
over and over, yielding an unchanging physical block range as the
logical block advances:# filefrag -Be file
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 21.. 21: 1: merged
1: 1.. 1: 21.. 21: 1: 22: merged
Discontinuity: Block 1 is at 21 (was 22)
2: 2.. 2: 21.. 21: 1: 22: merged
Discontinuity: Block 2 is at 21 (was 22)
3: 3.. 3: 21.. 21: 1: 22: mergedThis breaks the FIBMAP interface for anyone using it (XFS), which
in turn breaks LILO, zipl, etc.Bug-actually-spotted-by: Darrick J. Wong
Fixes: 89eb1906a953 ("iomap: add an iomap-based bmap implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
12 Jul, 2018
1 commit
-
After already supporting a simple implementation of buffered writes for
the blocksize == PAGE_SIZE case in the last commit this adds full support
even for smaller block sizes. There are three bits of per-block
information in the buffer_head structure that really matter for the iomap
read and write path:- uptodate status (BH_uptodate)
- marked as currently under read I/O (BH_Async_Read)
- marked as currently under write I/O (BH_Async_Write)Instead of having new per-block structures this now adds a per-page
structure called struct iomap_page to track this information in a slightly
different form:- a bitmap for the per-block uptodate status. For worst case of a 64k
page size system this bitmap needs to contain 128 bits. For the
typical 4k page size case it only needs 8 bits, although we still
need a full unsigned long due to the way the atomic bitmap API works.
- two atomic_t counters are used to track the outstanding read and write
countsThere is quite a bit of boilerplate code as the buffered I/O path uses
various helper methods, but the actual code is very straight forward.Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
04 Jul, 2018
3 commits
-
Just copy the inline data into the page using the existing helper.
Signed-off-by: Andreas Gruenbacher
Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Add support for reading from and writing to inline data to iomap_dio_rw.
This saves filesystems from having to implement fallback code for this
case.The inline data is actually cached in the inode, so the I/O is only
direct in the sense that it doesn't go through the page cache.Signed-off-by: Andreas Gruenbacher
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Split the function up into two helpers for the bio based I/O and hole
case, and a small helper to call the two. This separates the code a
little better in preparation for supporting I/O to inline data.Signed-off-by: Christoph Hellwig
Reviewed-by: Andreas Gruenbacher
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
21 Jun, 2018
1 commit
-
For now just limited to blocksize == PAGE_SIZE, where we can simply read
in the full page in write begin, and just set the whole page dirty after
copying data into it. This code is enabled by default and XFS will now
be feed pages without buffer heads in ->writepage and ->writepages.If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
path will still be used, this both helps the transition in XFS and
prepares for the gfs2 migration to the iomap infrastructure.Signed-off-by: Christoph Hellwig
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
20 Jun, 2018
4 commits
-
Simply use iomap_apply to iterate over the file and a submit a bio for
each non-uptodate but mapped region and zero everything else. Note that
as-is this can not be used for file systems with a blocksize smaller than
the page size, but that support will be added later.Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
This will be used by gfs2 to attach data to transactions for the journaled
data mode. But the concept is generic enough that we might be able to
use it for other purposes like encryption/integrity post-processing in the
future.Based on a patch from Andreas Gruenbacher.
Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Add generic inline data handling by adding a pointer to the inline data
region to struct iomap. When handling a buffered IOMAP_INLINE write,
iomap_write_begin will copy the current inline data from the inline data
region into the page cache, and iomap_write_end will copy the changes in
the page cache back to the inline data region.This doesn't cover inline data reads and direct I/O yet because so far,
we have no users.Signed-off-by: Andreas Gruenbacher
[hch: small cleanups to better fit in with other iomap work]
Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
According to xfstest generic/240, applications seem to expect direct I/O
writes to either complete as a whole or to fail; short direct I/O writes
are apparently not appreciated. This means that when only part of an
asynchronous direct I/O write succeeds, we can either fail the entire
write, or we can wait for the partial write to complete and retry the
remaining write as buffered I/O. The old __blockdev_direct_IO helper
has code for waiting for partial writes to complete; the new
iomap_dio_rw iomap helper does not.The above mentioned fallback mode is needed for gfs2, which doesn't
allow block allocations under direct I/O to avoid taking cluster-wide
exclusive locks. As a consequence, an asynchronous direct I/O write to
a file range that contains a hole will result in a short write. In that
case, wait for the short write to complete to allow gfs2 to recover.Signed-off-by: Andreas Gruenbacher
Signed-off-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
13 Jun, 2018
1 commit
-
Pull more xfs updates from Darrick Wong:
"Here's the second round of patches for XFS for 4.18. Most of the
commits are small cleanups, bug fixes, and continued strengthening of
metadata verifiers; the bulk of the diff is the conversion of the
fs/xfs/ tree to use SPDX tags.This series has been run through a full xfstests run over the weekend
and through a quick xfstests run against this morning's master, with
no major failures reported.Summary:
- Strengthen metadata checking to avoid ASSERTing on bad disk
contents- Validate btree records that are being retrieved for clients
- Strengthen root inode verification
- Convert license blurbs to SPDX tags
- Enable changing DAX flag on directories
- Fix some writeback deadlocks in reflink
- Refactor out some old xfs helpers
- Move type verifiers to a separate file
- Fix some fuzzer crashes
- Various other bug fixes"
* tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits)
xfs: update incore per-AG inode count
xfs: replace do_mod with native operations
xfs: don't call xfs_da_shrink_inode with NULL bp
xfs: clean up MIN/MAX
xfs: move various type verifiers to common file
xfs: xfs_reflink_convert_cow() memory allocation deadlock
xfs: setup VFS i_rwsem lockdep state correctly
xfs: fix string handling in label get/set functions
xfs: convert to SPDX license tags
xfs: validate btree records on retrieval
xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode()
xfs: verify root inode more thoroughly
xfs: verify COW extent size hint is valid in inode verifier
xfs: verify extent size hint is valid in inode verifier
xfs: catch bad stripe alignment configurations
iomap: fsync swap files before iterating mappings
xfs: use xfs_trans_getsb in xfs_sync_sb_buf
xfs: don't assert on corrupted unlinked inode list
xfs: explicitly pass buffer size to xfs_corruption_error
xfs: don't assert when on-disk btree pointers are garbage
...
09 Jun, 2018
1 commit
-
Pull aio iopriority support from Al Viro:
"The rest of aio stuff for this cycle - Adam's aio ioprio series"* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: aio ioprio use ioprio_check_cap ret val
fs: aio ioprio add explicit block layer dependence
fs: iomap dio set bio prio from kiocb prio
fs: blkdev set bio prio from kiocb prio
fs: Add aio iopriority support
fs: Convert kiocb rw_hint from enum to u16
block: add ioprio_check_cap function
06 Jun, 2018
1 commit
-
Swap files require that all the file mapping metadata be stable on disk.
It is insufficient to flush dirty pages in the page cache because that
won't necessarily result in filesystems pushing all their metadata out
to disk. Therefore, call fsync from iomap_swapfile_activate.Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
Reviewed-by: Jan Kara
02 Jun, 2018
7 commits
-
This way the implementation doesn't depend on buffer_head internals.
Signed-off-by: Christoph Hellwig
Reviewed-by: Andreas Gruenbacher
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
We only call into this function through the iomap iterators, so we already
know the buffer is unwritten. In addition to that we always require the
uptodate flag that is ORed with the result anyway.Signed-off-by: Christoph Hellwig
Reviewed-by: Andreas Gruenbacher
Reviewed-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Darrick J. Wong -
This function is only used by the iomap code, depends on being called
from it, and will soon stop poking into buffer head internals.Signed-off-by: Christoph Hellwig
Reviewed-by: Andreas Gruenbacher
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
This adds a simple iomap-based implementation of the legacy ->bmap
interface. Note that we can't easily add checks for rt or reflink
files, so these will have to remain in the callers. This interface
just needs to die..Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Factor the repeated calculation of the on-disk sector for a given logical
block into a littler helper.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
We don't need any merging logic, and this also replaces a BUG_ON with a
WARN_ON_ONCE inside __bio_add_page for the impossible overflow condition.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Inline data is fundamentally different from our normal mapped case in that
it doesn't even have a block address. So instead of having a flag for it
it should be an entirely separate iomap range type.Signed-off-by: Christoph Hellwig
Reviewed-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
31 May, 2018
1 commit
-
Now that kiocb has an ioprio field copy this over to the bio when it is
created from the kiocb during direct IO.Signed-off-by: Adam Manzanares
Reviewed-by: Jeff Moyer
Reviewed-by: Christoph Hellwig
Signed-off-by: Al Viro
17 May, 2018
2 commits
-
generic_swapfile_activate() doesn't allow holes, so we should be
consistent here. This is also a bit safer: if the user creates a
swapfile with, say, truncate -s $SIZE followed by mkswap, they should
really get an error and not much less swap space than they expected.
swapon(8) will error out before calling swapon(2) if the file has holes,
anyways.Fixes: 9d93388b0afe ("iomap: add a swapfile activation function")
Reviewed-by: Darrick J. Wong
Signed-off-by: Omar Sandoval
Reviewed-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Darrick J. Wong -
Currently, for an invalid swap file, we print the same error message
regardless of the reason. This isn't very useful for an admin, who will
likely want to know why exactly they can't use their swap file. So,
let's add specific error messages for each reason, and also move the
bdev check after the flags checks, since the latter are more
fundamental.Reviewed-by: Darrick J. Wong
Signed-off-by: Omar Sandoval
Reviewed-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Darrick J. Wong
16 May, 2018
1 commit
-
Add a new iomap_swapfile_activate function so that filesystems can
activate swap files without having to use the obsolete and slow bmap
function. This enables XFS to support fallocate'd swap files and
swap files on realtime devices.Signed-off-by: Darrick J. Wong
Reviewed-by: Jan Kara
10 May, 2018
2 commits
-
If we are doing direct IO writes with datasync semantics, we often
have to flush metadata changes along with the data write. However,
if we are overwriting existing data, there are no metadata changes
that we need to flush. In this case, optimising the IO by using
FUA write makes sense.We know from the IOMAP_F_DIRTY flag as to whether a specific inode
requires a metadata flush - this is currently used by DAX to ensure
extent modification as stable in page fault operations. For direct
IO writes, we can use it to determine if we need to flush metadata
or not once the data is on disk.Hence if we have been returned a mapped extent that is not new and
the IO mapping is not dirty, then we can use a FUA write to provide
datasync semantics. This allows us to short-cut the
generic_write_sync() call in IO completion and hence avoid
unnecessary operations. This makes pure direct IO data write
behaviour identical to the way block devices use REQ_FUA to provide
datasync semantics.On a FUA enabled device, a synchronous direct IO write workload
(sequential 4k overwrites in 32MB file) had the following results:# xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo
kernel time write()s write iops Write b/w
------ ---- -------- ---------- ---------
(no dsync) 4s 2173/s 2173 8.5MB/s
vanilla 22s 370/s 750 1.4MB/s
patched 19s 420/s 420 1.6MB/sThe patched code clearly doesn't send cache flushes anymore, but
instead uses FUA (confirmed via blktrace), and performance improves
a bit as a result. However, the benefits will be higher on workloads
that mix O_DSYNC overwrites with other write IO as we won't be
flushing the entire device cache on every DSYNC overwrite IO
anymore.Signed-Off-By: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong -
Currently iomap_dio_rw() only handles (data)sync write completions
for AIO. This means we can't optimised non-AIO IO to minimise device
flushes as we can't tell the caller whether a flush is required or
not.To solve this problem and enable further optimisations, make
iomap_dio_rw responsible for data sync behaviour for all IO, not
just AIO.In doing so, the sync operation is now accounted as part of the DIO
IO by inode_dio_end(), hence post-IO data stability updates will no
long race against operations that serialise via inode_dio_wait()
such as truncate or hole punch.Signed-Off-By: Dave Chinner
Reviewed-by: Christoph Hellwig
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
29 Jan, 2018
1 commit
-
Don't let the iomap callback get away with feeding us a garbage zero
length mapping -- there was a bug in xfs that resulted in those leaking
out to hilarious effect.Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
09 Jan, 2018
1 commit
-
If two programs simultaneously try to write to the same part of a file
via direct IO and buffered IO, there's a chance that the post-diowrite
pagecache invalidation will fail on the dirty page. When this happens,
the dio write succeeded, which means that the page cache is no longer
coherent with the disk!Programs are not supposed to mix IO types and this is a clear case of
data corruption, so store an EIO which will be reflected to userspace
during the next fsync. Replace the WARN_ON with a ratelimited pr_crit
so that the developers have /some/ kind of breadcrumb to track down the
offending program(s) and file(s) involved.Signed-off-by: Darrick J. Wong
Reviewed-by: Liu Bo
18 Nov, 2017
1 commit
-
Pull iov_iter updates from Al Viro:
- bio_{map,copy}_user_iov() series; those are cleanups - fixes from the
same pile went into mainline (and stable) in late September.- fs/iomap.c iov_iter-related fixes
- new primitive - iov_iter_for_each_range(), which applies a function
to kernel-mapped segments of an iov_iter.Usable for kvec and bvec ones, the latter does kmap()/kunmap() around
the callback. _Not_ usable for iovec- or pipe-backed iov_iter; the
latter is not hard to fix if the need ever appears, the former is by
design.Another related primitive will have to wait for the next cycle - it
passes page + offset + size instead of pointer + size, and that one
will be usable for everything _except_ kvec. Unfortunately, that one
didn't get exposure in -next yet, so...- a bit more lustre iov_iter work, including a use case for
iov_iter_for_each_range() (checksum calculation)- vhost/scsi leak fix in failure exit
- misc cleanups and detritectomy...
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
iomap_dio_actor(): fix iov_iter bugs
switch ksocknal_lib_recv_...() to use of iov_iter_for_each_range()
lustre: switch struct ksock_conn to iov_iter
vhost/scsi: switch to iov_iter_get_pages()
fix a page leak in vhost_scsi_iov_to_sgl() error recovery
new primitive: iov_iter_for_each_range()
lnet_return_rx_credits_locked: don't abuse list_entry
xen: don't open-code iov_iter_kvec()
orangefs: remove detritus from struct orangefs_kiocb_s
kill iov_shorten()
bio_alloc_map_data(): do bmd->iter setup right there
bio_copy_user_iov(): saner bio size calculation
bio_map_user_iov(): get rid of copying iov_iter
bio_copy_from_iter(): get rid of copying iov_iter
move more stuff down into bio_copy_user_iov()
blk_rq_map_user_iov(): move iov_iter_advance() down
bio_map_user_iov(): get rid of the iov_for_each()
bio_map_user_iov(): move alignment check into the main loop
don't rely upon subsequent bio_add_pc_page() calls failing
... and with iov_iter_get_pages_alloc() it becomes even simpler
...