29 Oct, 2018

1 commit


04 Oct, 2018

1 commit

  • [ Upstream commit ebf00be37de35788cad72f4f20b4a39e30c0be4a ]

    According to xfstest generic/240, applications seem to expect direct I/O
    writes to either complete as a whole or to fail; short direct I/O writes
    are apparently not appreciated. This means that when only part of an
    asynchronous direct I/O write succeeds, we can either fail the entire
    write, or we can wait for the partial write to complete and retry the
    remaining write as buffered I/O. The old __blockdev_direct_IO helper
    has code for waiting for partial writes to complete; the new
    iomap_dio_rw iomap helper does not.

    The above mentioned fallback mode is needed for gfs2, which doesn't
    allow block allocations under direct I/O to avoid taking cluster-wide
    exclusive locks. As a consequence, an asynchronous direct I/O write to
    a file range that contains a hole will result in a short write. In that
    case, wait for the short write to complete to allow gfs2 to recover.
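
The fallback described above can be sketched as a small userspace model. Everything here is hypothetical (the `model_*` names, the 4-byte block size, the in-memory "file"); it only illustrates the control flow "direct write stops at a hole, the remainder is retried as buffered I/O" and is not the gfs2 or iomap code.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BLOCK 4u

/* Hypothetical file model: 4 blocks, each either allocated or a hole. */
struct model_file {
    char data[16];
    int allocated[4];
};

/* Direct write: cannot allocate, so it stops short at the first hole. */
static size_t model_direct_write(struct model_file *f, size_t pos,
                                 const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len && f->allocated[(pos + done) / MODEL_BLOCK]) {
        f->data[pos + done] = buf[done];
        done++;
    }
    return done;
}

/* Buffered write: may allocate blocks, so it always completes here. */
static size_t model_buffered_write(struct model_file *f, size_t pos,
                                   const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        f->allocated[(pos + i) / MODEL_BLOCK] = 1;
        f->data[pos + i] = buf[i];
    }
    return len;
}

/* Finish the (synchronous, in this model) short direct write first,
 * then retry the remaining bytes as buffered I/O. */
static size_t model_write(struct model_file *f, size_t pos,
                          const char *buf, size_t len)
{
    size_t done;

    assert(pos + len <= sizeof(f->data));
    done = model_direct_write(f, pos, buf, len);
    if (done < len)
        done += model_buffered_write(f, pos + done, buf + done, len - done);
    return done;
}
```

In this model a 12-byte write over two allocated blocks followed by a hole returns 8 from the direct path alone, while `model_write()` returns the full 12.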

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     

17 Oct, 2017

1 commit

  • Commit 332391a9935d ("fs: Fix page cache inconsistency when mixing
    buffered and AIO DIO") moved page cache invalidation from
    iomap_dio_rw() to iomap_dio_complete() for the iomap based direct write
    path, but before the dio->end_io() call, and it re-introduced the bug
    fixed by commit c771c14baa33 ("iomap: invalidate page caches should
    be after iomap_dio_complete() in direct write").

    I found this because fstests generic/418 started failing on XFS with
    v4.14-rc3 kernel, which is the regression test for this specific
    bug.

    So similarly, fix it by moving dio->end_io() (which does the
    unwritten extent conversion) before page cache invalidation, to make
    sure the next buffered read sees the final real allocations, not
    unwritten extents. I also added some comments about why end_io()
    should go first, in case we get it wrong again in the future.

    Note that there's no such problem in the non-iomap based direct
    write path, because we didn't remove the page cache invalidation
    after the ->direct_IO() call in generic_file_direct_write(), but I
    decided to fix dio_complete() too so we don't leave a landmine
    there, and to stay consistent with iomap_dio_complete().

    Fixes: 332391a9935d ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
    Signed-off-by: Eryu Guan
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Reviewed-by: Lukas Czerner

    Eryu Guan
     

29 Sep, 2017

1 commit

  • Pull xfs fixes from Darrick Wong:

    - fix various problems with the copy-on-write extent maps getting freed
    at the wrong time

    - fix printk format specifier problems

    - report zeroing operation outcomes instead of dropping them on the
    floor

    - fix some crashes when dio operations partially fail

    - fix a race condition between unwritten extent conversion & dio read

    - fix some incorrect tests in the inode log item processing

    - correct the delayed allocation space reservations on rmap filesystems

    - fix some problems checking for dax support

    * tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: revert "xfs: factor rmap btree size into the indlen calculations"
    xfs: Capture state of the right inode in xfs_iflush_done
    xfs: perag initialization should only touch m_ag_max_usable for AG 0
    xfs: update i_size after unwritten conversion in dio completion
    iomap_dio_rw: Allocate AIO completion queue before submitting dio
    xfs: validate bdev support for DAX inode flag
    xfs: remove redundant re-initialization of total_nr_pages
    xfs: Output warning message when discard option was enabled even though the device does not support discard
    xfs: report zeroed or not correctly in xfs_zero_range()
    xfs: kill meaningless variable 'zero'
    fs/xfs: Use %pS printk format for direct addresses
    xfs: evict CoW fork extents when performing finsert/fcollapse
    xfs: don't unconditionally clear the reflink flag on zero-block files

    Linus Torvalds
     

27 Sep, 2017

1 commit

  • Executing the xfs/104 test in a loop on a Linux v4.13 kernel on a ppc64
    machine can cause the following NULL pointer dereference:

    .queue_work_on+0x4c/0x80
    .iomap_dio_bio_end_io+0xbc/0x1f0
    .bio_endio+0x118/0x1f0
    .blk_update_request+0xd0/0x470
    .blk_mq_end_request+0x24/0xc0
    .lo_complete_rq+0x40/0xe0
    .__blk_mq_complete_request_remote+0x28/0x40
    .flush_smp_call_function_queue+0xc4/0x1e0
    .smp_ipi_demux_relaxed+0x8c/0x100
    .icp_hv_ipi_action+0x54/0xa0
    .__handle_irq_event_percpu+0x84/0x2c0
    .handle_irq_event_percpu+0x28/0x80
    .handle_percpu_irq+0x78/0xc0
    .generic_handle_irq+0x40/0x70
    .__do_irq+0x88/0x200
    .call_do_irq+0x14/0x24
    .do_IRQ+0x84/0x130

    This occurs due to the following sequence of events,

    1. Allocate dio for Direct I/O write.
    2. Invoke iomap_apply() until iov_iter_count() bytes have been submitted.
       - Assume that we have submitted at least one bio. Hence the
         iomap_dio->ref value will be >= 2.
       - If, during the second iteration, iomap_apply() ends up returning
         -ENOSPC, we break out of the loop, and since the 'ret' value is
         negative we end up not allocating memory for
         super_block->s_dio_done_wq.
    3. Meanwhile, iomap_dio_bio_end_io() is invoked for the bios that have
       been submitted, and here the code ends up dereferencing the NULL
       pointer stored at super_block->s_dio_done_wq.

    This commit fixes the bug by allocating memory for
    super_block->s_dio_done_wq before iomap_apply() is invoked.
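
The ordering problem can be reproduced in a small userspace model. This is only a sketch: `model_sb`, `model_dio_write()` and the `alloc_first` switch are hypothetical stand-ins for the super block, the submission loop and the two orderings, and completions run synchronously after submission instead of from interrupt context.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct model_sb { int *done_wq; };      /* stands in for s_dio_done_wq */

/* Completion side: queues work, so the workqueue must already exist. */
static int model_bio_end_io(struct model_sb *sb)
{
    if (!sb->done_wq)
        return -1;                      /* the kernel would oops here */
    (*sb->done_wq)++;
    return 0;
}

/*
 * Submission side. 'fail_at' is the iteration that fails with -ENOSPC;
 * earlier iterations have already submitted bios. Returns -1 if a
 * completion found a NULL workqueue, otherwise the submission result.
 */
static int model_dio_write(struct model_sb *sb, int iterations, int fail_at,
                           int alloc_first)
{
    int submitted = 0, ret = 0;

    assert(sb != NULL);
    if (alloc_first && !sb->done_wq)    /* fixed ordering */
        sb->done_wq = calloc(1, sizeof(*sb->done_wq));

    for (int i = 0; i < iterations; i++) {
        if (i == fail_at) {
            ret = -ENOSPC;
            break;
        }
        submitted++;
    }

    /* Old ordering: allocate only after an error-free submission loop. */
    if (!alloc_first && ret == 0 && !sb->done_wq)
        sb->done_wq = calloc(1, sizeof(*sb->done_wq));

    /* Bios submitted before the error still complete. */
    for (int i = 0; i < submitted; i++)
        if (model_bio_end_io(sb))
            return -1;
    return ret;
}
```

With the old ordering, one submitted bio followed by an -ENOSPC iteration reaches the completion path with a NULL workqueue; allocating up front lets the same sequence return -ENOSPC cleanly.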

    Reported-by: Eryu Guan
    Reviewed-by: Christoph Hellwig
    Tested-by: Eryu Guan
    Signed-off-by: Chandan Rajendra
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Chandan Rajendra
     

25 Sep, 2017

1 commit

  • Currently, when mixing buffered reads and asynchronous direct writes, it
    is possible to end up with stale data in the page cache while the new
    data is already written to disk. The stale data persists until the
    affected pages are flushed away. Although mixing buffered and direct
    I/O is ill-advised, this poses a threat to data integrity, is
    unexpected, and should be fixed.

    Fix this by deferring completion of asynchronous direct writes to a
    process context in the case that there are mapped pages to be found in
    the inode. Later, before the completion in dio_complete(), invalidate
    the pages in question. This ensures that after the completion the pages
    in the written area are either unmapped or populated with up-to-date
    data. Also do the same for the iomap case, which uses
    iomap_dio_complete() instead.

    This has the side effect of deferring the completion to a process
    context for every AIO DIO that happens on an inode that has pages
    mapped. However, since the consensus is that this is an ill-advised
    practice, the performance implication should not be a problem.
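
A minimal sketch of the decision described above, assuming a hypothetical `model_dio` structure rather than the real `struct dio`: completion is deferred when it was already requested or when a write targets an inode that has cached pages, and those pages are invalidated at completion time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dio {
    bool is_write;
    bool defer_requested;       /* deferral already needed for other reasons */
    unsigned long nrpages;      /* pages currently cached for the inode */
};

/* Should completion run in process context instead of interrupt context? */
static bool model_defer_completion(const struct model_dio *dio)
{
    return dio->defer_requested || (dio->is_write && dio->nrpages > 0);
}

/* Completion: a write drops any cached pages so later reads go to disk.
 * Returns the number of cached pages left afterwards. */
static unsigned long model_complete(struct model_dio *dio)
{
    assert(dio != NULL);
    if (dio->is_write && dio->nrpages > 0)
        dio->nrpages = 0;       /* models invalidating the written range */
    return dio->nrpages;
}
```

Reads, and writes to inodes without cached pages, keep completing without the extra deferral in this model.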

    This was based on a proposal from Jeff Moyer, thanks!

    Reviewed-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Jeff Moyer
    Signed-off-by: Lukas Czerner
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

02 Sep, 2017

1 commit


24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.
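
As a rough illustration of "gendisk plus partition index", the sketch below models an I/O descriptor that carries a disk pointer and a partition number, and a remap step that turns a partition-relative sector into a whole-disk sector. All `model_*` types and the four-entry partition table are assumptions made for the example; they are not the kernel structures.

```c
#include <assert.h>
#include <stdint.h>

struct model_part { uint64_t start_sect, nr_sects; };
struct model_disk { struct model_part part[4]; };   /* part[0]: whole disk */

struct model_bio {
    struct model_disk *disk;    /* carried instead of a block_device */
    uint8_t partno;             /* partition index used for remapping */
    uint64_t sector;
};

/* Translate a partition-relative sector into a whole-disk sector. */
static int model_remap(struct model_bio *bio)
{
    const struct model_part *p;

    assert(bio->partno < 4);
    p = &bio->disk->part[bio->partno];
    if (bio->sector >= p->nr_sects)
        return -1;              /* beyond the end of the partition */
    bio->sector += p->start_sect;
    bio->partno = 0;            /* now addressed relative to the disk */
    return 0;
}
```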

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Aug, 2017

1 commit

  • Fix the min_t calls in the zeroing and dirtying helpers to perform the
    comparisons on 64-bit types, which prevents the lengths from being
    incorrectly truncated and larger zeroing operations from getting
    stuck in a never-ending loop.
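
The truncation can be shown with a small standalone sketch. The helper names and the fixed 4096-byte page size are assumptions for the example; the point is only that comparing on a 32-bit type turns a length of exactly 4 GiB into 0, so a loop that subtracts the chunk size never makes progress.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u

/* Comparison on a 32-bit type: 'length' is truncated before comparing. */
static uint64_t model_chunk_32(uint64_t offset_in_page, uint64_t length)
{
    uint32_t a = (uint32_t)(MODEL_PAGE_SIZE - offset_in_page);
    uint32_t b = (uint32_t)length;

    return a < b ? a : b;
}

/* Comparison on a 64-bit type: no truncation. */
static uint64_t model_chunk_64(uint64_t offset_in_page, uint64_t length)
{
    uint64_t a = MODEL_PAGE_SIZE - offset_in_page;

    assert(offset_in_page < MODEL_PAGE_SIZE);
    return a < length ? a : length;
}

/* Count the steps needed to process 'length' bytes page by page.
 * Returns -1 if more than 'max_steps' iterations would be needed. */
static long model_zero_steps(uint64_t length, int use_64bit, long max_steps)
{
    long steps = 0;

    while (length > 0) {
        uint64_t n = use_64bit ? model_chunk_64(0, length)
                               : model_chunk_32(0, length);
        if (steps++ >= max_steps)
            return -1;          /* no progress: the never-ending loop */
        length -= n;
    }
    return steps;
}
```

For a length of `1ULL << 32`, `model_chunk_32()` yields 0 while `model_chunk_64()` yields 4096, which is why only the 64-bit variant terminates.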

    Special thanks to Markus Stockhausen for spotting the bug.

    Reported-by: Paul Menzel
    Tested-by: Paul Menzel
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

15 Jul, 2017

1 commit

  • Pull XFS fixes from Darrick Wong:
    "Largely debugging and regression fixes.

    - Add some locking assertions for the _ilock helpers.

    - Revert the XFS_QMOPT_NOLOCK patch; after discussion with hch the
    online fsck patch that would have needed it has been redesigned and
    no longer needs it.

    - Fix behavioral regression of SEEK_HOLE/DATA with negative offsets
    to match 4.12-era XFS behavior"

    * tag 'xfs-4.13-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: in iomap seek_{hole,data}, return -ENXIO for negative offsets
    Revert "xfs: grab dquots without taking the ilock"
    xfs: assert locking precondition in xfs_readlink_bmap_ilocked
    xfs: assert locking precondition in xfs_attr_list_int_ilocked
    xfs: fixup xfs_attr_get_ilocked

    Linus Torvalds
     

14 Jul, 2017

1 commit


11 Jul, 2017

1 commit

  • Pull XFS updates from Darrick Wong:
    "Here are some changes for you for 4.13. For the most part it's fixes
    for bugs and deadlock problems, and preparation for online fsck in
    some future merge window.

    - Avoid quotacheck deadlocks

    - Fix transaction overflows when bunmapping fragmented files

    - Refactor directory readahead

    - Allow admin to configure if ASSERT is fatal

    - Improve transaction usage detail logging during overflows

    - Minor cleanups

    - Don't leak log items when the log shuts down

    - Remove double-underscore typedefs

    - Various preparation for online scrubbing

    - Introduce new error injection configuration sysfs knobs

    - Refactor dq_get_next to use extent map directly

    - Fix problems with iterating the page cache for unwritten data

    - Implement SEEK_{HOLE,DATA} via iomap

    - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

    - Don't use MAXPATHLEN to check on-disk symlink target lengths"

    * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: don't crash on unexpected holes in dir/attr btrees
    xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
    xfs: fix contiguous dquot chunk iteration livelock
    xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
    vfs: Add iomap_seek_hole and iomap_seek_data helpers
    vfs: Add page_cache_seek_hole_data helper
    xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
    xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
    xfs: Check for m_errortag initialization in xfs_errortag_test
    xfs: grab dquots without taking the ilock
    xfs: fix semicolon.cocci warnings
    xfs: Don't clear SGID when inheriting ACLs
    xfs: free cowblocks and retry on buffered write ENOSPC
    xfs: replace log_badcrc_factor knob with error injection tag
    xfs: convert drop_writes to use the errortag mechanism
    xfs: remove unneeded parameter from XFS_TEST_ERROR
    xfs: expose errortag knobs via sysfs
    xfs: make errortag a per-mountpoint structure
    xfs: free uncommitted transactions during log recovery
    xfs: don't allow bmap on rt files
    ...

    Linus Torvalds
     

03 Jul, 2017

1 commit


28 Jun, 2017

1 commit


20 Jun, 2017

1 commit


09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.
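
The idea of a dedicated status type can be sketched as a pair of conversion helpers between negative errno values and a small status code. The `model_*` names, the status values and the table contents are invented for this example and do not reflect the kernel's actual blk_status_t values; collapsing unknown errors to a generic I/O error is likewise a choice made by this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef unsigned char model_status_t;

enum { MODEL_STS_OK = 0, MODEL_STS_NOSPC, MODEL_STS_IOERR };

static const struct {
    int err;
    model_status_t sts;
} model_map[] = {
    { 0,       MODEL_STS_OK },
    { -ENOSPC, MODEL_STS_NOSPC },
    { -EIO,    MODEL_STS_IOERR },
};

#define MODEL_MAP_LEN (sizeof(model_map) / sizeof(model_map[0]))

/* Convert a negative errno (or 0) into a status code. */
static model_status_t model_errno_to_status(int err)
{
    assert(err <= 0);
    for (size_t i = 0; i < MODEL_MAP_LEN; i++)
        if (model_map[i].err == err)
            return model_map[i].sts;
    return MODEL_STS_IOERR;     /* unknown errors become a generic I/O error */
}

/* Convert a status code back into a negative errno (or 0). */
static int model_status_to_errno(model_status_t sts)
{
    for (size_t i = 0; i < MODEL_MAP_LEN; i++)
        if (model_map[i].sts == sts)
            return model_map[i].err;
    return -EIO;
}
```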

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 May, 2017

1 commit

  • Commit afddba49d18f ("fs: introduce write_begin, write_end, and
    perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
    checked in pagecache_write_begin(), but that check was removed by
    4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

    Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
    new aops.") added a check in cifs_write_begin(), but that check was soon
    removed by commit a98ee8c1c707 ("[CIFS] fix regression in
    cifs_write_begin/cifs_write_end").

    Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's
    remove this flag. This patch has no functionality changes.

    Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

07 May, 2017

1 commit

  • Pull xfs updates from Darrick Wong:
    "Here are the XFS changes for 4.12. The big new feature for this
    release is the new space mapping ioctl that we've been discussing
    since LSF2016, but other than that most of the patches are larger bug
    fixes, memory corruption prevention, and other cleanups.

    Summary:
    - various code cleanups
    - introduce GETFSMAP ioctl
    - various refactoring
    - avoid dio reads past eof
    - fix memory corruption and other errors with fragmented directory blocks
    - fix accidental userspace memory corruptions
    - publish fs uuid in superblock
    - make fstrim terminatable
    - fix race between quotaoff and in-core inode creation
    - avoid use-after-free when finishing up w/ buffer heads
    - reserve enough space to handle bmap tree resizing during cow remap"

    * tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
    xfs: fix use-after-free in xfs_finish_page_writeback
    xfs: reserve enough blocks to handle btree splits when remapping
    xfs: wait on new inodes during quotaoff dquot release
    xfs: update ag iterator to support wait on new inodes
    xfs: support ability to wait on new inodes
    xfs: publish UUID in struct super_block
    xfs: Allow user to kill fstrim process
    xfs: better log intent item refcount checking
    xfs: fix up quotacheck buffer list error handling
    xfs: remove xfs_trans_ail_delete_bulk
    xfs: don't use bool values in trace buffers
    xfs: fix getfsmap userspace memory corruption while setting OF_LAST
    xfs: fix __user annotations for xfs_ioc_getfsmap
    xfs: corruption needs to respect endianess too!
    xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
    xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
    xfs: simplify validation of the unwritten extent bit
    xfs: remove unused values from xfs_exntst_t
    xfs: remove the unused XFS_MAXLINK_1 define
    xfs: more do_div cleanups
    ...

    Linus Torvalds
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

04 May, 2017

1 commit

  • Patch series "Properly invalidate data in the cleancache", v2.

    We've noticed that after a direct IO write, a buffered read sometimes
    gets stale data which is coming from the cleancache. The reason for
    this is that some direct write hooks call
    invalidate_inode_pages2[_range]() conditionally iff mapping->nrpages
    is not zero, so we may not invalidate data in the cleancache.

    Another odd thing is that we check only for ->nrpages and don't check
    for ->nrexceptional, but invalidate_inode_pages2[_range] invalidates
    exceptional entries as well. So we invalidate exceptional entries
    only if ->nrpages != 0? This doesn't feel right.

    - Patch 1 fixes direct IO writes by removing ->nrpages check.
    - Patch 2 fixes similar case in invalidate_bdev().
    Note: I only fixed the conditional cleancache_invalidate_inode() here.
    Do we also need to add a ->nrexceptional check into invalidate_bdev()?

    - Patches 3-4: some optimizations.

    This patch (of 4):

    Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
    conditionally iff mapping->nrpages is not zero. This can't be right,
    because invalidate_inode_pages2[_range]() also invalidates data in the
    cleancache via the cleancache_invalidate_inode() call. So if the page
    cache is empty but there is some data in the cleancache, a buffered
    read after a direct IO write would get stale data from the cleancache.

    Also it doesn't feel right to check only for ->nrpages because
    invalidate_inode_pages2[_range] invalidates exceptional entries as well.

    Fix this by calling invalidate_inode_pages2[_range]() regardless of
    nrpages state.

    Note: nfs, cifs and 9p don't need a similar fix because they never call
    cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
    they are not affected by this bug.
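
The difference between the conditional and unconditional invalidation can be modelled in a few lines. `model_mapping` and its two fields are hypothetical simplifications: one counter for page cache pages and one flag for data held in the cleancache.

```c
#include <assert.h>
#include <stddef.h>

struct model_mapping {
    unsigned long nrpages;      /* pages in the page cache */
    int cleancache_has_data;    /* copies held in the cleancache */
};

/* Models invalidate_inode_pages2[_range](): drops both kinds of copies. */
static void model_invalidate(struct model_mapping *m)
{
    assert(m != NULL);
    m->nrpages = 0;
    m->cleancache_has_data = 0;
}

/* Old behaviour: skip invalidation when the page cache is empty. */
static void model_dio_write_old(struct model_mapping *m)
{
    if (m->nrpages)
        model_invalidate(m);
}

/* Fixed behaviour: invalidate regardless of nrpages. */
static void model_dio_write_new(struct model_mapping *m)
{
    model_invalidate(m);
}

/* A later buffered read is at risk if any old copy survived the write. */
static int model_read_may_be_stale(const struct model_mapping *m)
{
    return m->nrpages > 0 || m->cleancache_has_data;
}
```

Starting from an empty page cache with data in the cleancache, only the unconditional variant leaves nothing stale behind.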

    Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
    Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

26 Apr, 2017

2 commits

  • Now that a dax_device is plumbed through all dax-capable drivers we can
    switch from block_device_operations to dax_operations for invoking
    ->direct_access.

    This also lets us kill off some usages of struct blk_dax_ctl on the way
    to its eventual removal.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • On a ppc64 machine, executing overlayfs/019 with xfs as the lower and
    upper filesystem causes the following call trace:

    WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
    Modules linked in:
    CPU: 2 PID: 8034 Comm: fsstress Tainted: G L 4.11.0-rc5-next-20170405 #100
    task: c000000631314880 task.stack: c0000003915d4000
    NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
    REGS: c0000003915d7570 TRAP: 0700 Tainted: G L (4.11.0-rc5-next-20170405)
    MSR: 800000000282b032
    CR: 24004284 XER: 00000000
    CFAR: c0000000006f7190 SOFTE: 1
    GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
    GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
    GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
    GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
    GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
    GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
    GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
    GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
    NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
    LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
    Call Trace:
    [c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
    [c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
    [c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
    [c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
    [c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
    [c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
    [c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
    [c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
    [c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
    Instruction dump:
    78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
    2f890004 419e00a0 2f890001 419e02a8 3b80fffb 38210100 7f83e378

    The above problem can also be recreated on a regular xfs filesystem
    using the command,

    $ fsstress -d /mnt -l 1000 -n 1000 -p 1000

    The reason for the call trace is,
    1. When 'reserving' blocks for delayed allocation, XFS reserves more
    blocks (i.e. past file's current EOF) than required. This is done
    because XFS assumes that userspace might write more data and hence
    'reserving' more blocks might lead to the file's new data being
    stored contiguously on disk.
    2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
    then cover the prealloc-ed EOF blocks in addition to the regular blocks.
    3. When flushing the dirty blocks to disk, we only flush data till the
    file's EOF. But before writing out the dirty data, we allocate blocks
    on the disk for holding the file's new data. This allocation includes
    the blocks that are part of the 'prealloc EOF blocks'.
    4. Later, when the last reference to the inode is being closed, XFS frees the
    unused 'prealloc EOF blocks' in xfs_inactive().

    In step 3 above, when allocating space on disk for the delayed allocation
    range, the space allocator might sometimes allocate fewer blocks than
    required. If such an allocation ends right at the current EOF of the
    file, we will not be able to clear the "delayed allocation" flag for the
    'prealloc EOF blocks', since we won't have dirty buffer heads associated
    with that range of the file.

    In such a situation if a Direct I/O read operation is performed on file
    range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
    range [X, Y] and invalidate page cache for that range (Refer to
    iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains
    the extent items (which are still cached in memory) for the file
    range. When doing so we are not supposed to get an extent item with
    IOMAP_DELALLOC flag set, since the previous "flush" operation should
    have converted any delayed allocation data in the range [X, Y]. Hence we
    end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().

    This commit fixes the bug by preventing the read operation from going
    beyond iomap_dio->i_size.
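
A minimal sketch of "do not read past i_size", using a hypothetical helper that walks a read request extent by extent. The function name, the fixed extent length and the per-chunk clamp are assumptions of this model; the commit message only states that the read is prevented from going beyond iomap_dio->i_size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns how many bytes of a read request [pos, pos + count) would be
 * mapped when mapping stops at the file size 'i_size'. 'extent_len' is
 * the largest chunk mapped per iteration.
 */
static uint64_t model_dio_read_span(uint64_t pos, uint64_t count,
                                    uint64_t i_size, uint64_t extent_len)
{
    uint64_t done = 0;

    assert(extent_len > 0);
    while (done < count) {
        uint64_t cur = pos + done;
        uint64_t n = count - done;

        if (cur >= i_size)
            break;              /* never ask for mappings beyond EOF */
        if (n > extent_len)
            n = extent_len;
        if (n > i_size - cur)
            n = i_size - cur;   /* clamp the last chunk to EOF */
        done += n;
    }
    return done;
}
```

A read of 100 bytes from offset 0 of a 60-byte file maps 60 bytes in this model, so the speculative blocks beyond EOF are never looked up.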

    Reported-by: Santhosh G
    Signed-off-by: Chandan Rajendra
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Chandan Rajendra
     

07 Mar, 2017

1 commit

  • After XFS switched to iomap based DIO (commit acdda3aae146 ("xfs:
    use iomap_dio_rw")), I started to notice dio29/dio30 test failures
    from LTP runs on ppc64 hosts, and they can be reproduced on x86_64
    hosts with 512B/1k block size XFS too.

    dio29 diotest3 -b 65536 -n 100 -i 1000 -o 1024000
    dio30 diotest6 -b 65536 -n 100 -i 1000 -o 1024000

    The failure message is like:
    bufcmp: offset 0: Expected: 0x62, got 0x0
    diotest03 1 TPASS : Read with Direct IO, Write without
    diotest03 2 TFAIL : diotest3.c:142: comparsion failed; child=98 offset=1425408
    diotest03 3 TFAIL : diotest3.c:194: Write Direct-child 98 failed

    The direct write wrote 0x62 but the buffered read got zero. This is
    because, when doing a direct write to a hole or preallocated file, we
    invalidate the page caches before converting the extent from
    unwritten state to normal state, which is done by
    iomap_dio_complete(), thus leaving a window for another buffered
    reader to cache the unwritten state extent.

    Consider this case, with sub-page blocksize XFS, two processes are
    direct writing to different blocksize-aligned regions (say 512B) of
    the same preallocated file, and reading the region back via buffered
    I/O to compare contents.

    process A, region [0,512]           process B, region [512,1024]
    xfs_file_write_iter
     xfs_file_aio_dio_write
      iomap_dio_rw
       iomap_apply
       invalidate_inode_pages2_range
                                        xfs_file_write_iter
                                         xfs_file_aio_dio_write
                                          iomap_dio_rw
                                           iomap_apply
                                           invalidate_inode_pages2_range
                                           iomap_dio_complete
                                        xfs_file_read_iter
                                         xfs_file_buffered_aio_read
                                          generic_file_read_iter
                                           do_generic_file_read
       iomap_dio_complete
    xfs_file_read_iter

    Process A first invalidates page caches; at this point the
    underlying extent is still in unwritten state (iomap_dio_complete
    not called yet). Process B then finishes its direct write and
    populates page caches via readahead, which caches zeros in the page
    for region A. Process A then reads zeros from the page cache instead
    of the actual data.

    Fix it by invalidating the page cache after converting the unwritten
    extent, to make sure we read content from disk after the extent
    state has changed, as we did before switching to iomap-based dio.

    Also introduce a new 'start' variable to save the original write
    offset (iomap_dio_complete() updates iocb->ki_pos), and an 'err'
    variable for the result of invalidating the caches, because we can't
    reuse 'ret' anymore.
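
The ordering issue can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct toy_file and every toy_* function are hypothetical stand-ins, a single 512-byte block plays the role of the extent, and the "concurrent" buffered reader is simulated by calling it at fixed points. It is not the kernel code touched by this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE 512

/* Hypothetical model: one block on "disk" plus one cached copy of it. */
struct toy_file {
    unsigned char disk[TOY_BLOCK_SIZE];   /* data placed by the direct write */
    bool unwritten;                       /* extent still in unwritten state */
    unsigned char cache[TOY_BLOCK_SIZE];  /* page cache copy */
    bool cache_valid;
};

/* Direct write: data reaches disk, but the extent stays unwritten. */
static void toy_dio_submit(struct toy_file *f, unsigned char byte)
{
    memset(f->disk, byte, TOY_BLOCK_SIZE);
}

static void toy_convert_unwritten(struct toy_file *f) { f->unwritten = false; }
static void toy_invalidate_cache(struct toy_file *f) { f->cache_valid = false; }

/* Buffered read: fill the cache on a miss; unwritten extents read as zeros. */
static unsigned char toy_buffered_read(struct toy_file *f)
{
    if (!f->cache_valid) {
        if (f->unwritten)
            memset(f->cache, 0, TOY_BLOCK_SIZE);
        else
            memcpy(f->cache, f->disk, TOY_BLOCK_SIZE);
        f->cache_valid = true;
    }
    return f->cache[0];
}

/*
 * Run one direct write of 0x62 with a simulated concurrent reader, then
 * return what the writer reads back through the cache.
 * invalidate_first == true models the buggy order described above.
 */
unsigned char toy_write_then_read(bool invalidate_first)
{
    struct toy_file f;

    memset(&f, 0, sizeof(f));
    f.unwritten = true;
    toy_dio_submit(&f, 0x62);
    (void)toy_buffered_read(&f);          /* reader before completion */

    if (invalidate_first) {
        toy_invalidate_cache(&f);
        (void)toy_buffered_read(&f);      /* reader caches zeros in the window */
        toy_convert_unwritten(&f);
    } else {
        toy_convert_unwritten(&f);
        (void)toy_buffered_read(&f);      /* may still see stale cached zeros */
        toy_invalidate_cache(&f);         /* stale copy dropped last */
    }
    return toy_buffered_read(&f);
}
```

In this model, toy_write_then_read(true) returns zero (the stale cached copy), while toy_write_then_read(false) returns 0x62, because the invalidation happens only after the extent has been converted.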

    Signed-off-by: Eryu Guan
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eryu Guan
     

02 Mar, 2017

1 commit

  • Instead of including the full , we are going to include the
    types-only header in , to further
    decouple the scheduler header from the signal headers.

    This means that various files which relied on the full need
    to be updated to gain an explicit dependency on it.

    Update the code that relies on sched.h's inclusion of the header.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in the
    fs branch with i_blocksize().

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting more appropriate function instead
    of macro.
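
A minimal user-space sketch of what such a helper can look like. struct toy_inode and toy_i_blocksize() are hypothetical stand-ins that keep only the field used here; they are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for struct inode; only the field used here. */
struct toy_inode {
    unsigned char i_blkbits;   /* log2 of the block size in bytes */
};

/* Block size in bytes, derived from the stored shift. */
static inline unsigned int toy_i_blocksize(const struct toy_inode *inode)
{
    assert(inode->i_blkbits < 32);   /* keep the shift well-defined */
    return 1U << inode->i_blkbits;
}
```

A function gives the open-coded shift a name and a type-checked argument, which is the kind of advantage over a macro that the acknowledgement above alludes to. With i_blkbits set to 12, the sketch returns 4096.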

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

25 Feb, 2017

1 commit

  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
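
The shape of the change can be sketched with toy types. Everything below (struct toy_vma, struct toy_vm_fault, struct toy_vm_ops, toy_fault_pgoff, and the 4096-byte page size) is a hypothetical illustration, not the kernel's definitions; the only point is that a handler taking just the fault descriptor can still reach the vma through it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL   /* assumed page size for the example */

/* Hypothetical stand-ins for the vma and the fault descriptor. */
struct toy_vma {
    unsigned long vm_start;        /* first address of the mapping */
};

struct toy_vm_fault {
    struct toy_vma *vma;           /* the vma already lives in the descriptor */
    unsigned long address;         /* faulting address */
};

/* New-style operations table: handlers take only the fault descriptor. */
struct toy_vm_ops {
    unsigned long (*fault)(const struct toy_vm_fault *vmf);
};

/* Example handler: page index of the fault within its mapping. */
unsigned long toy_fault_pgoff(const struct toy_vm_fault *vmf)
{
    assert(vmf->vma != NULL && vmf->address >= vmf->vma->vm_start);
    return (vmf->address - vmf->vma->vm_start) / TOY_PAGE_SIZE;
}
```

For a mapping starting at 0x10000 and a fault at 0x13000, the handler returns page index 3 without ever being passed the vma separately.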

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

1 commit

  • Pull xfs updates from Darrick Wong:
    "Here are the XFS changes for 4.11. We aren't introducing any major
    features in this release cycle except for this being the first merge
    window I've managed on my own. :)

    Changes since last update:

    - Various cleanups

    - Livelock fixes for eofblocks scanning

    - Improved input verification for on-disk metadata

    - Fix races in the copy on write remap mechanism

    - Fix buffer io error timeout controls

    - Streamlining of directio copy on write

    - Asynchronous discard support

    - Fix asserts when splitting delalloc reservations

    - Don't bloat bmbt when right shifting extents

    - Inode alignment fixes for 32k block sizes"

    * tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (39 commits)
    xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG
    xfs: simplify xfs_rtallocate_extent
    xfs: tune down agno asserts in the bmap code
    xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment
    xfs: don't reserve blocks for right shift transactions
    xfs: fix len comparison in xfs_extent_busy_trim
    xfs: fix uninitialized variable in _reflink_convert_cow
    xfs: split indlen reservations fairly when under reserved
    xfs: handle indlen shortage on delalloc extent merge
    xfs: resurrect debug mode drop buffered writes mechanism
    xfs: clear delalloc and cache on buffered write failure
    xfs: don't block the log commit handler for discards
    xfs: improve busy extent sorting
    xfs: improve handling of busy extents in the low-level allocator
    xfs: don't fail xfs_extent_busy allocation
    xfs: correct null checks and error processing in xfs_initialize_perag
    xfs: update ctime and mtime on clone destinatation inodes
    xfs: allocate direct I/O COW blocks in iomap_begin
    xfs: go straight to real allocations for direct I/O COW writes
    xfs: return the converted extent in __xfs_reflink_convert_cow
    ...

    Linus Torvalds
     

04 Feb, 2017

1 commit

  • Tetsuo has noticed that an OOM stress test which performs large write
    requests can completely deplete the memory reserves. He has tracked
    this down to the following path:

    __alloc_pages_nodemask+0x436/0x4d0
    alloc_pages_current+0x97/0x1b0
    __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728
    pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331
    grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773
    iomap_write_begin+0x50/0xd0 fs/iomap.c:118
    iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190
    ? iomap_write_end+0x80/0x80 fs/iomap.c:150
    iomap_apply+0xb3/0x130 fs/iomap.c:79
    iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243
    ? iomap_write_end+0x80/0x80
    xfs_file_buffered_aio_write+0x132/0x390 [xfs]
    ? remove_wait_queue+0x59/0x60
    xfs_file_write_iter+0x90/0x130 [xfs]
    __vfs_write+0xe5/0x140
    vfs_write+0xc7/0x1f0
    ? syscall_trace_enter+0x1d0/0x380
    SyS_write+0x58/0xc0
    do_syscall_64+0x6c/0x200
    entry_SYSCALL64_slow_path+0x25/0x25

    The OOM victim has access to all memory reserves so that it can make
    forward progress and exit more easily. But iomap_file_buffered_write
    and other callers of iomap_apply loop to complete the full request.
    We need to check for fatal signals and back off with a short write
    instead.

    As iomap_apply delegates all the work down to the actors, we have to
    hook into those. All callers that work with the page cache call
    iomap_write_begin, so we check for signals there. dax_iomap_actor
    has to handle the situation explicitly because it copies data to
    userspace directly. Other callers either work on a single page, like
    iomap_page_mkwrite, or, like iomap_fiemap_actor, do not allocate
    memory based on the given len.
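
The back-off behaviour can be sketched in user space as follows. This is a toy model under stated assumptions: struct toy_task, toy_fatal_signal_pending(), toy_write_begin() and toy_buffered_write() are hypothetical names, the signal is simulated by a byte threshold, and no data is actually copied. It only shows the control flow of "check before each chunk, return a short write if something was already written".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CHUNK 4096   /* assumed per-iteration chunk size */

/* Hypothetical task: a fatal signal "arrives" once kill_after bytes are written. */
struct toy_task {
    size_t kill_after;
};

static bool toy_fatal_signal_pending(const struct toy_task *t, size_t written)
{
    return written >= t->kill_after;
}

/* Plays the role of the per-chunk begin step: refuse to start new work. */
static long toy_write_begin(const struct toy_task *t, size_t written)
{
    if (toy_fatal_signal_pending(t, written))
        return -EINTR;
    return 0;
}

/*
 * Loop over the request chunk by chunk. On a pending fatal signal, return
 * the bytes written so far (a short write), or -EINTR if nothing was written.
 */
long toy_buffered_write(const struct toy_task *t, size_t len)
{
    size_t written = 0;

    assert(t != NULL);
    while (written < len) {
        size_t left = len - written;
        size_t chunk = left < TOY_CHUNK ? left : TOY_CHUNK;
        long err = toy_write_begin(t, written);

        if (err)
            return written ? (long)written : err;
        written += chunk;   /* stand-in for copying one chunk */
    }
    return (long)written;
}
```

With a 16384-byte request, a task that is killed after 8192 bytes gets a short write of 8192, and a task that is already dying when the call starts gets -EINTR.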

    Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
    Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Reviewed-by: Christoph Hellwig
    Cc: Al Viro
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

31 Jan, 2017

1 commit


15 Dec, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There is quite a varied bunch of stuff in this update, and some of it
    you will have already merged through the ext4 tree which imported the
    dax-4.10-iomap-pmd topic branch from the XFS tree.

    There is also a new direct IO implementation that uses the iomap
    infrastructure. It's much simpler, faster, and has lower IO latency
    than the existing direct IO infrastructure.

    Summary:
    - DAX PMD faults via iomap infrastructure
    - Direct-io support in iomap infrastructure
    - removal of now-redundant XFS inode iolock, replaced with VFS
    i_rwsem
    - synchronisation with fixes and changes in userspace libxfs code
    - extent tree lookup helpers
    - lots of little corruption detection improvements to verifiers
    - optimised CRC calculations
    - faster buffer cache lookups
    - deprecation of barrier/nobarrier mount options - we always use
    REQ_FUA/REQ_FLUSH where appropriate for data integrity now
    - cleanups to speculative preallocation
    - miscellaneous minor bug fixes and cleanups"

    * tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
    xfs: nuke unused tracepoint definitions
    xfs: use GPF_NOFS when allocating btree cursors
    xfs: use xfs_vn_setattr_size to check on new size
    xfs: deprecate barrier/nobarrier mount option
    xfs: Always flush caches when integrity is required
    xfs: ignore leaf attr ichdr.count in verifier during log replay
    xfs: use rhashtable to track buffer cache
    xfs: optimise CRC updates
    xfs: make xfs btree stats less huge
    xfs: don't cap maximum dedupe request length
    xfs: don't allow di_size with high bit set
    xfs: error out if trying to add attrs and anextents > 0
    xfs: don't crash if reading a directory results in an unexpected hole
    xfs: complain if we don't get nextents bmap records
    xfs: check for bogus values in btree block headers
    xfs: forbid AG btrees with level == 0
    xfs: several xattr functions can be void
    xfs: handle cow fork in xfs_bmap_trace_exlist
    xfs: pass state not whichfork to trace_xfs_extlist
    xfs: Move AGI buffer type setting to xfs_read_agi
    ...

    Linus Torvalds
     

30 Nov, 2016

2 commits

  • This adds a full-fledged direct I/O implementation using the iomap
    interface. Full-fledged in this case means all features are supported:
    AIO, vectored I/O, any iov_iter type including kernel pointers, bvecs
    and pipes, support for hole filling and async appending writes. It does
    not mean supporting all the warts of the old generic code. We expect
    i_rwsem to be held over the duration of the call, and we expect to
    maintain i_dio_count ourselves, and we pass on any kinds of mapping
    to the file system for now.

    The algorithm used is very simple: We use iomap_apply to iterate over
    the range of the I/O, and then we use the new bio_iov_iter_get_pages
    helper to lock down the user range for the size of the extent.
    bio_iov_iter_get_pages can currently lock down twice as many pages as
    the old direct I/O code did, which means that we will have a better
    batch factor for everything but overwrites of badly fragmented files.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Kent Overstreet
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Dave Chinner
     

14 Nov, 2016

1 commit


10 Nov, 2016

1 commit

  • Introduce a flag telling iomap operations whether they are handling a
    fault or other IO. That may influence behavior wrt inode size and
    similar things.

    Signed-off-by: Jan Kara
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Jan Kara
     

24 Oct, 2016

1 commit

  • iomap_page_mkwrite_actor() calls __block_write_begin_int() with position
    masked as pos & ~PAGE_MASK which is equivalent to pos & (PAGE_SIZE-1).
    Thus it masks off high bits of file position. However
    __block_write_begin_int() expects full file position on input. This does
    not cause any visible issues because all __block_write_begin_int()
    really cares about are low file position bits but still it is a bug
    waiting to happen.
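
The mask arithmetic can be checked with a short user-space sketch. The TOY_* macros and both helper functions are hypothetical names, and a 4096-byte page is an assumption made for the example; the mask is defined as the complement of (page size - 1), matching the equivalence stated above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                              /* assumed 4 KiB pages */
#define TOY_PAGE_SIZE  ((uint64_t)1 << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))          /* keeps the high bits */

/* Offset within the page: the high bits of the file position are dropped. */
uint64_t toy_offset_in_page(uint64_t pos)
{
    return pos & ~TOY_PAGE_MASK;
}

/* Start of the page containing pos: the low bits are dropped instead. */
uint64_t toy_page_start(uint64_t pos)
{
    return pos & TOY_PAGE_MASK;
}
```

For pos = 0x12345, the in-page offset is 0x345 and the page start is 0x12000. A callee that expects the full file position would therefore receive 0x345 instead of 0x12345, which is the latent bug described above.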

    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Jan Kara
     

20 Oct, 2016

1 commit

  • This allows the file system to tell a FIEMAP from a read operation, and thus
    avoids the need to report flags that aren't actually used in the read path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

03 Oct, 2016

1 commit


19 Sep, 2016

2 commits