04 Feb, 2018

1 commit

  • [ Upstream commit 22a6c83777ac7c17d6c63891beeeac24cf5da450 ]

    Fix some complaints from the UBSAN about signed integer addition overflows.
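
    As an illustration only (the commit's actual changes are not shown in
    this log), a signed addition can be checked for overflow before it is
    performed, which is the class of problem UBSAN reports. The helper name
    checked_add_s64 is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the commit: add two signed 64-bit
 * values and report overflow instead of invoking undefined behaviour.
 * Returns true and stores the sum on success, false if it would overflow.
 */
static bool checked_add_s64(int64_t a, int64_t b, int64_t *sum)
{
	/* Compare against the limits first so the addition itself is safe. */
	if ((b > 0 && a > INT64_MAX - b) ||
	    (b < 0 && a < INT64_MIN - b))
		return false;

	*sum = a + b;
	return true;
}
```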

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
     

17 Oct, 2017

2 commits

  • The writeback rework in commit fbcc02561359 ("xfs: Introduce
    writeback context for writepages") introduced a subtle change in
    behavior with regard to the block mapping used across the
    ->writepages() sequence. The previous xfs_cluster_write() code would
    only flush pages up to EOF at the time of the writepage, thus
    ensuring that any pages due to file-extending writes would be
    handled on a separate cycle and with a new, updated block mapping.

    The updated code establishes a block mapping in xfs_writepage_map()
    that could extend beyond EOF if the file has post-eof preallocation.
    Because we now use the generic writeback infrastructure and pass the
    cached mapping to each writepage call, there is no implicit EOF
    limit in place. If eofblocks trimming occurs during ->writepages(),
    any post-eof portion of the cached mapping becomes invalid. The
    eofblocks code has no means to serialize against writeback because
    there are no pages associated with post-eof blocks. Therefore if an
    eofblocks trim occurs and is followed by a file-extending buffered
    write, not only has the mapping become invalid, but we could end up
    writing a page to disk based on the invalid mapping.

    Consider the following sequence of events:

    - A buffered write creates a delalloc extent and post-eof
    speculative preallocation.
    - Writeback starts and on the first writepage cycle, the delalloc
    extent is converted to real blocks (including the post-eof blocks)
    and the mapping is cached.
    - The file is closed and xfs_release() trims post-eof blocks. The
    cached writeback mapping is now invalid.
    - Another buffered write appends to the file with a delalloc extent.
    - The concurrent writeback cycle picks up the just written page
    because the writeback range end is LLONG_MAX. xfs_writepage_map()
    attributes it to the (now invalid) cached mapping and writes the
    data to an incorrect location on disk (and where the file offset is
    still backed by a delalloc extent).

    This problem is reproduced by xfstests test generic/464, which
    triggers racing writes, appends, open/closes and writeback requests.

    To address this problem, trim the mapping used during writeback to
    within EOF when the mapping is validated. This ensures the mapping
    is revalidated for any pages encountered beyond EOF as of the time
    the current mapping was cached or last validated.
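
    The trimming step described above can be sketched roughly as follows.
    This is a simplified model, not the XFS code: struct mapping,
    trim_mapping_to_eof() and the block-based units are assumptions made
    for illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a cached block mapping. */
struct mapping {
	uint64_t start_off;   /* first file offset covered, in blocks */
	uint64_t block_count; /* number of blocks covered */
};

/*
 * Clamp the mapping so that it covers nothing at or beyond eof_block
 * (the first block past EOF). A mapping that starts at or beyond EOF
 * becomes empty, which forces a fresh lookup for any later page.
 */
static void trim_mapping_to_eof(struct mapping *map, uint64_t eof_block)
{
	if (map->start_off >= eof_block)
		map->block_count = 0;
	else if (map->block_count > eof_block - map->start_off)
		map->block_count = eof_block - map->start_off;
}
```

    With a mapping trimmed this way, a page beyond the old EOF no longer
    falls inside the cached range, so it cannot be attributed to blocks
    that an eofblocks trim may already have freed.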

    Reported-by: Eryu Guan
    Diagnosed-by: Eryu Guan
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Brian Foster
     
  • Recently we've had warnings arise from the vm handing us pages
    without bufferheads attached to them. This should not ever occur
    in XFS, but we don't defend against it properly if it does. The only
    place where we remove bufferheads from a page is in
    xfs_vm_releasepage(), but we can't tell the difference here between
    "page is dirty so don't release" and "page is dirty but is being
    invalidated so release it".

    Some places that invalidate pages ask for pages to be released and
    then, after calling ->releasepage, check whether the page was dirty
    and abort the invalidation if it was. This
    is a possible vector for releasing buffers from a page but then
    leaving it in the mapping, so we really do need to avoid dirty pages
    in xfs_vm_releasepage().

    To differentiate between invalidated pages and normal pages, we need
    to clear the page dirty flag when invalidating the pages. This can
    be done through xfs_vm_invalidatepage(), and will result in
    xfs_vm_releasepage() seeing the page as clean which matches the
    bufferhead state on the page after calling block_invalidatepage().

    Hence we can re-add the page dirty check in xfs_vm_releasepage to
    catch the case where we might be releasing a page that is actually
    dirty and so should not have the bufferheads on it removed. This
    will remove one possible vector of "dirty page with no bufferheads"
    and so help narrow down the search for the root cause of that
    problem.

    Signed-Off-By: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Dave Chinner
     

27 Sep, 2017

1 commit

  • Since commit d531d91d6990 ("xfs: always use unwritten extents for
    direct I/O writes"), we start allocating unwritten extents for all
    direct writes to allow appending aio in XFS.

    But for dio writes that could extend file size we update the in-core
    inode size first, then convert the unwritten extents to real
    allocations at dio completion time in xfs_dio_write_end_io(). Thus a
    racing direct read could see the new i_size and find the unwritten
    extents first and read zeros instead of actual data, if the direct
    writer also takes a shared iolock.

    Fix it by updating the in-core inode size after the unwritten extent
    conversion. To do this, introduce a new boolean argument to
    xfs_iomap_write_unwritten() to tell if we want to update in-core
    i_size or not.

    Suggested-by: Brian Foster
    Reviewed-by: Brian Foster
    Signed-off-by: Eryu Guan
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eryu Guan
     

12 Sep, 2017

1 commit

  • Pull libnvdimm from Dan Williams:
    "A rework of media error handling in the BTT driver and other updates.
    It has appeared in a few -next releases and collected some late-
    breaking build-error and warning fixups as a result.

    Summary:

    - Media error handling support in the Block Translation Table (BTT)
    driver is reworked to address sleeping-while-atomic locking and
    memory-allocation-context conflicts.

    - The dax_device lookup overhead for xfs and ext4 is moved out of the
    iomap hot-path to a mount-time lookup.

    - A new 'ecc_unit_size' sysfs attribute is added to advertise the
    read-modify-write boundary property of a persistent memory range.

    - Preparatory fix-ups for arm and powerpc pmem support are included
    along with other miscellaneous fixes"

    * tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, btt: fix format string warnings
    libnvdimm, btt: clean up warning and error messages
    ext4: fix null pointer dereference on sbi
    libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
    dax: fix FS_DAX=n BLOCK=y compilation
    libnvdimm: fix integer overflow static analysis warning
    libnvdimm, nd_blk: remove mmio_flush_range()
    libnvdimm, btt: rework error clearing
    libnvdimm: fix potential deadlock while clearing errors
    libnvdimm, btt: cache sector_size in arena_info
    libnvdimm, btt: ensure that flags were also unchanged during a map_read
    libnvdimm, btt: refactor map entry operations with macros
    libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
    libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
    ext4: perform dax_device lookup at mount
    ext2: perform dax_device lookup at mount
    xfs: perform dax_device lookup at mount
    dax: introduce a fs_dax_get_by_bdev() helper
    libnvdimm, btt: check memory allocation failure
    libnvdimm, label: fix index block size calculation
    ...

    Linus Torvalds
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

04 Sep, 2017

1 commit

  • Our loop in xfs_finish_page_writeback iterates over all buffer
    heads in a page and calls end_buffer_async_write for each, which in
    turn iterates over all buffers in the page to check whether any I/O is
    still in flight. That is not only inefficient but also potentially
    dangerous, as end_buffer_async_write can cause the page and all
    buffers to be freed.

    Replace it with a single loop that does the work of end_buffer_async_write
    on a per-page basis.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

01 Sep, 2017

1 commit

  • The ->iomap_begin() operation is a hot path, so cache the
    fs_dax_get_by_host() result at mount time to avoid incurring the
    hash lookup overhead on a per-i/o basis.

    Reported-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dan Williams

    Dan Williams
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.
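
    A rough model of the remapping mentioned above, assuming simplified
    structures: struct partition, struct disk, struct io and
    remap_to_disk() are hypothetical names, and the convention that index
    0 denotes the whole disk is an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical, simplified types; the kernel structures differ. */
struct partition {
	uint64_t start_sect; /* first sector of the partition on the disk */
	uint64_t nr_sects;   /* size of the partition in sectors */
};

struct disk {
	const struct partition *parts; /* parts[0] models the whole disk */
	unsigned int nr_parts;
};

struct io {
	const struct disk *disk;
	unsigned int partno; /* partition index carried with the I/O */
	uint64_t sector;     /* sector relative to that partition */
};

/*
 * Translate a partition-relative sector into a whole-disk sector.
 * Returns 0 on success, -1 if the index or sector is out of range.
 */
static int remap_to_disk(struct io *io)
{
	const struct partition *p;

	if (io->partno >= io->disk->nr_parts)
		return -1;
	p = &io->disk->parts[io->partno];
	if (io->sector >= p->nr_sects)
		return -1;

	io->sector += p->start_sect;
	io->partno = 0; /* the I/O now addresses the whole disk */
	return 0;
}
```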

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Jul, 2017

1 commit

  • Pull XFS updates from Darrick Wong:
    "Here are some changes for you for 4.13. For the most part it's fixes
    for bugs and deadlock problems, and preparation for online fsck in
    some future merge window.

    - Avoid quotacheck deadlocks

    - Fix transaction overflows when bunmapping fragmented files

    - Refactor directory readahead

    - Allow admin to configure if ASSERT is fatal

    - Improve transaction usage detail logging during overflows

    - Minor cleanups

    - Don't leak log items when the log shuts down

    - Remove double-underscore typedefs

    - Various preparation for online scrubbing

    - Introduce new error injection configuration sysfs knobs

    - Refactor dq_get_next to use extent map directly

    - Fix problems with iterating the page cache for unwritten data

    - Implement SEEK_{HOLE,DATA} via iomap

    - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

    - Don't use MAXPATHLEN to check on-disk symlink target lengths"

    * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: don't crash on unexpected holes in dir/attr btrees
    xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
    xfs: fix contiguous dquot chunk iteration livelock
    xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
    vfs: Add iomap_seek_hole and iomap_seek_data helpers
    vfs: Add page_cache_seek_hole_data helper
    xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
    xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
    xfs: Check for m_errortag initialization in xfs_errortag_test
    xfs: grab dquots without taking the ilock
    xfs: fix semicolon.cocci warnings
    xfs: Don't clear SGID when inheriting ACLs
    xfs: free cowblocks and retry on buffered write ENOSPC
    xfs: replace log_badcrc_factor knob with error injection tag
    xfs: convert drop_writes to use the errortag mechanism
    xfs: remove unneeded parameter from XFS_TEST_ERROR
    xfs: expose errortag knobs via sysfs
    xfs: make errortag a per-mountpoint structure
    xfs: free uncommitted transactions during log recovery
    xfs: don't allow bmap on rt files
    ...

    Linus Torvalds
     

04 Jul, 2017

1 commit

  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particularly around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     

28 Jun, 2017

1 commit


22 Jun, 2017

1 commit

  • bmap returns a dumb LBA address but not the block device that goes with
    that LBA. Swapfiles don't care about this and will blindly assume that
    the data volume is the correct blockdev, which is totally bogus for
    files on the rt subvolume. This results in the swap code doing IOs to
    arbitrary locations on the data device(!) if the passed in mapping is a
    realtime file, so just turn off bmap for rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

21 Jun, 2017

1 commit

  • bmap returns a dumb LBA address but not the block device that goes with
    that LBA. Swapfiles don't care about this and will blindly assume that
    the data volume is the correct blockdev, which is totally bogus for
    files on the rt subvolume. This results in the swap code doing IOs to
    arbitrary locations on the data device(!) if the passed in mapping is a
    realtime file, so just turn off bmap for rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

20 Jun, 2017

1 commit

  • This is a purely mechanical patch that removes the private
    __{u,}int{8,16,32,64}_t typedefs in favor of using the system
    {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
    the transformation and fix the resulting whitespace and indentation
    errors:

    s/typedef\t__uint8_t/typedef __uint8_t\t/g
    s/typedef\t__uint/typedef __uint/g
    s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
    s/__uint8_t\t/__uint8_t\t\t/g
    s/__uint/uint/g
    s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
    s/__int/int/g
    /^typedef.*int[0-9]*_t;$/d

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 May, 2017

1 commit

  • Pull xfs updates from Darrick Wong:
    "Here are the XFS changes for 4.12. The big new feature for this
    release is the new space mapping ioctl that we've been discussing
    since LSF2016, but other than that most of the patches are larger bug
    fixes, memory corruption prevention, and other cleanups.

    Summary:
    - various code cleanups
    - introduce GETFSMAP ioctl
    - various refactoring
    - avoid dio reads past eof
    - fix memory corruption and other errors with fragmented directory blocks
    - fix accidental userspace memory corruptions
    - publish fs uuid in superblock
    - make fstrim terminatable
    - fix race between quotaoff and in-core inode creation
    - avoid use-after-free when finishing up w/ buffer heads
    - reserve enough space to handle bmap tree resizing during cow remap"

    * tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
    xfs: fix use-after-free in xfs_finish_page_writeback
    xfs: reserve enough blocks to handle btree splits when remapping
    xfs: wait on new inodes during quotaoff dquot release
    xfs: update ag iterator to support wait on new inodes
    xfs: support ability to wait on new inodes
    xfs: publish UUID in struct super_block
    xfs: Allow user to kill fstrim process
    xfs: better log intent item refcount checking
    xfs: fix up quotacheck buffer list error handling
    xfs: remove xfs_trans_ail_delete_bulk
    xfs: don't use bool values in trace buffers
    xfs: fix getfsmap userspace memory corruption while setting OF_LAST
    xfs: fix __user annotations for xfs_ioc_getfsmap
    xfs: corruption needs to respect endianess too!
    xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
    xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
    xfs: simplify validation of the unwritten extent bit
    xfs: remove unused values from xfs_exntst_t
    xfs: remove the unused XFS_MAXLINK_1 define
    xfs: more do_div cleanups
    ...

    Linus Torvalds
     

06 May, 2017

1 commit

  • Commit 28b783e47ad7 ("xfs: bufferhead chains are invalid after
    end_page_writeback") fixed one use-after-free issue by
    pre-calculating the loop conditionals before calling bh->b_end_io()
    in the end_io processing loop, but it assigned 'next' pointer before
    checking end offset boundary & breaking the loop, at which point the
    bh might be freed already, and caused use-after-free.

    This is caught by KASAN when running fstests generic/127 on sub-page
    block size XFS.

    [ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50
    [ 2747.868840] ==================================================================
    [ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698
    ...
    [ 2747.918245] Call Trace:
    [ 2747.920975] dump_stack+0x63/0x84
    [ 2747.924673] kasan_object_err+0x21/0x70
    [ 2747.928950] kasan_report+0x271/0x530
    [ 2747.933064] ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
    [ 2747.938409] ? end_page_writeback+0xce/0x110
    [ 2747.943171] __asan_report_load8_noabort+0x19/0x20
    [ 2747.948545] xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
    [ 2747.953724] xfs_end_io+0x1af/0x2b0 [xfs]
    [ 2747.958197] process_one_work+0x5ff/0x1000
    [ 2747.962766] worker_thread+0xe4/0x10e0
    [ 2747.966946] kthread+0x2d3/0x3d0
    [ 2747.970546] ? process_one_work+0x1000/0x1000
    [ 2747.975405] ? kthread_create_on_node+0xc0/0xc0
    [ 2747.980457] ? syscall_return_slowpath+0xe6/0x140
    [ 2747.985706] ? do_page_fault+0x30/0x80
    [ 2747.989887] ret_from_fork+0x2c/0x40
    [ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104
    [ 2748.001155] Allocated:
    [ 2748.003782] PID = 8327
    [ 2748.006411] save_stack_trace+0x1b/0x20
    [ 2748.010688] save_stack+0x46/0xd0
    [ 2748.014383] kasan_kmalloc+0xad/0xe0
    [ 2748.018370] kasan_slab_alloc+0x12/0x20
    [ 2748.022648] kmem_cache_alloc+0xb8/0x1b0
    [ 2748.027024] alloc_buffer_head+0x22/0xc0
    [ 2748.031399] alloc_page_buffers+0xd1/0x250
    [ 2748.035968] create_empty_buffers+0x30/0x410
    [ 2748.040730] create_page_buffers+0x120/0x1b0
    [ 2748.045493] __block_write_begin_int+0x17a/0x1800
    [ 2748.050740] iomap_write_begin+0x100/0x2f0
    [ 2748.055308] iomap_zero_range_actor+0x253/0x5c0
    [ 2748.060362] iomap_apply+0x157/0x270
    [ 2748.064347] iomap_zero_range+0x5a/0x80
    [ 2748.068624] iomap_truncate_page+0x6b/0xa0
    [ 2748.073227] xfs_setattr_size+0x1f7/0xa10 [xfs]
    [ 2748.078312] xfs_vn_setattr_size+0x68/0x140 [xfs]
    [ 2748.083589] xfs_file_fallocate+0x4ac/0x820 [xfs]
    [ 2748.088838] vfs_fallocate+0x2cf/0x780
    [ 2748.093021] SyS_fallocate+0x48/0x80
    [ 2748.097006] do_syscall_64+0x18a/0x430
    [ 2748.101186] return_from_SYSCALL_64+0x0/0x6a
    [ 2748.105948] Freed:
    [ 2748.108189] PID = 8327
    [ 2748.110816] save_stack_trace+0x1b/0x20
    [ 2748.115093] save_stack+0x46/0xd0
    [ 2748.118788] kasan_slab_free+0x73/0xc0
    [ 2748.122969] kmem_cache_free+0x7a/0x200
    [ 2748.127247] free_buffer_head+0x41/0x80
    [ 2748.131524] try_to_free_buffers+0x178/0x250
    [ 2748.136316] xfs_vm_releasepage+0x2e9/0x3d0 [xfs]
    [ 2748.141563] try_to_release_page+0x100/0x180
    [ 2748.146325] invalidate_inode_pages2_range+0x7da/0xcf0
    [ 2748.152087] xfs_shift_file_space+0x37d/0x6e0 [xfs]
    [ 2748.157557] xfs_collapse_file_space+0x49/0x120 [xfs]
    [ 2748.163223] xfs_file_fallocate+0x2a7/0x820 [xfs]
    [ 2748.168462] vfs_fallocate+0x2cf/0x780
    [ 2748.172642] SyS_fallocate+0x48/0x80
    [ 2748.176629] do_syscall_64+0x18a/0x430
    [ 2748.180810] return_from_SYSCALL_64+0x0/0x6a

    Fix it by checking the offset against end and breaking out first, and
    dereferencing bh only if there are still bufferheads to process.
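
    The ordering matters because completing the last in-range buffer can
    allow the whole chain to be freed. A minimal model of the safe
    ordering, assuming a hypothetical struct buf and complete_range()
    rather than the real buffer_head code:

```c
#include <stddef.h>

/* Hypothetical stand-in for a buffer head on a page's circular list. */
struct buf {
	struct buf *next; /* next buffer on the same page */
	size_t size;      /* bytes covered by this buffer */
	int completed;    /* set by the completion step */
};

/*
 * Complete the buffers covering [0, end) of a page and return how many
 * were completed. The bounds check runs before bh is dereferenced, and
 * 'next' is read before the completion step, after which bh must be
 * treated as possibly freed.
 */
static int complete_range(struct buf *head, size_t end)
{
	struct buf *bh = head;
	size_t off = 0;
	int done = 0;

	while (off < end) {
		struct buf *next = bh->next; /* bh is still valid here */

		off += bh->size;
		bh->completed = 1;           /* bh may be freed after this */
		done++;

		bh = next;
		if (bh == head)              /* pointer compare only, no dereference */
			break;
	}
	return done;
}
```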

    Signed-off-by: Eryu Guan
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eryu Guan
     

04 May, 2017

1 commit

  • xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
    some time ago. We would like to make this concept more generic and use
    it for other filesystems as well. Let's start by giving the flag a more
    generic name PF_MEMALLOC_NOFS which is in line with an existing
    PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
    contexts. Replace all PF_FSTRANS usage from the xfs code in the first
    step before we introduce a full API for it as xfs uses the flag directly
    anyway.

    This patch doesn't introduce any functional change.
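
    For readers unfamiliar with scoped flags, the idea can be modelled as a
    save/restore pair that nests correctly. Everything below is a
    hypothetical sketch (TASK_FLAG_NOFS, struct task, nofs_save(),
    nofs_restore(), may_enter_fs()), not the kernel API.

```c
/* Hypothetical flag value and task structure, for illustration only. */
#define TASK_FLAG_NOFS 0x1u

struct task {
	unsigned int flags;
};

/* Enter a NOFS scope; return the previous flag state so scopes nest. */
static unsigned int nofs_save(struct task *t)
{
	unsigned int old = t->flags & TASK_FLAG_NOFS;

	t->flags |= TASK_FLAG_NOFS;
	return old;
}

/* Leave a NOFS scope, restoring whatever state the matching save saw. */
static void nofs_restore(struct task *t, unsigned int old)
{
	t->flags = (t->flags & ~TASK_FLAG_NOFS) | old;
}

/* An allocation may recurse into the filesystem only outside any scope. */
static int may_enter_fs(const struct task *t)
{
	return !(t->flags & TASK_FLAG_NOFS);
}
```

    Returning the old state from the save step is what lets an inner scope
    end without clearing the flag for an outer scope that is still active.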

    Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Acked-by: Vlastimil Babka
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Jan Kara
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Apr, 2017

2 commits


08 Mar, 2017

2 commits

  • There are two different cases of buffered I/O errors:

    - first we can have an already shutdown fs. In that case we should skip
    any on-disk operations and just clean up the append transaction if
    present and destroy the ioend
    - a real I/O error. In that case we should cleanup any lingering COW
    blocks. This gets skipped in the current code and is fixed by this
    patch.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • We only want to reclaim preallocations from our periodic work item.
    Currently this is achieved by looking for a dirty inode, but that check
    is rather fragile. Instead add a flag to xfs_reflink_cancel_cow_* so
    that the caller can ask for just cancelling unwritten extents in the COW
    fork.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    [darrick: fix typos in commit message]
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

28 Feb, 2017

1 commit

  • Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
    branch.

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting more appropriate function instead
    of macro.
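
    The helper being introduced replaces the open-coded shift with a
    function call. A sketch of its shape, using a minimal stand-in for
    struct inode that carries only the one field needed here (the real
    structure has many more members):

```c
/* Minimal stand-in for illustration; not the kernel's struct inode. */
struct inode {
	unsigned char i_blkbits; /* log2 of the block size in bytes */
};

/* Block size in bytes, replacing open-coded 1 << inode->i_blkbits. */
static inline unsigned int i_blocksize(const struct inode *node)
{
	return 1U << node->i_blkbits;
}
```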

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

03 Feb, 2017

1 commit

  • Christoph Hellwig pointed out that there's a potentially nasty race when
    performing simultaneous nearby directio cow writes:

    "Thread 1 writes a range from B to C

    " B --------- C
    p

    "a little later thread 2 writes from A to B

    " A --------- B
    p

    [editor's note: the 'p' marks denote cowextsize boundaries, which I added to
    make this more clear]

    "but the code preallocates beyond B into the range where thread
    "1 has just written, but ->end_io hasn't been called yet.
    "But once ->end_io is called thread 2 has already allocated
    "up to the extent size hint into the write range of thread 1,
    "so the end_io handler will splice the uninitialized blocks from
    "that preallocation back into the file right after B."

    We can avoid this race by ensuring that thread 1 cannot accidentally
    remap the blocks that thread 2 allocated (as part of speculative
    preallocation) as part of t2's write preparation in t1's end_io handler.
    The way we make this happen is by taking advantage of the unwritten
    extent flag as an intermediate step.

    Recall that when we begin the process of writing data to shared blocks,
    we create a delayed allocation extent in the CoW fork:

    D: --RRRRRRSSSRRRRRRRR---
    C: ------DDDDDDD---------

    When a thread prepares to CoW some dirty data out to disk, it will now
    convert the delalloc reservation into an /unwritten/ allocated extent in
    the cow fork. The da conversion code tries to opportunistically
    allocate as much of a (speculatively prealloc'd) extent as possible, so
    we may end up allocating a larger extent than we're actually writing
    out:

    D: --RRRRRRSSSRRRRRRRR---
    U: ------UUUUUUU---------

    Next, we convert only the part of the extent that we're actively
    planning to write to normal (i.e. not unwritten) status:

    D: --RRRRRRSSSRRRRRRRR---
    U: ------UURRUUU---------

    If the write succeeds, the end_cow function will now scan the relevant
    range of the CoW fork for real extents and remap only the real extents
    into the data fork:

    D: --RRRRRRRRSRRRRRRRR---
    U: ------UU--UUU---------

    This ensures that we never obliterate valid data fork extents with
    unwritten blocks from the CoW fork.
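
    The final remapping step can be modelled on the diagrams themselves,
    one character per block. This is a toy sketch, not the XFS
    implementation; end_cow_remap() and the string representation are
    assumptions made for illustration.

```c
#include <string.h>

/*
 * data and cow are equal-length strings, one character per block, using
 * the notation of the diagrams: '-' hole, 'R' real, 'S' shared,
 * 'U' unwritten. Only real ('R') CoW fork blocks are moved into the
 * data fork; unwritten ones stay behind in the CoW fork.
 * Returns the number of blocks remapped, or -1 if the lengths differ.
 */
static int end_cow_remap(char *data, char *cow)
{
	size_t n = strlen(data);
	int moved = 0;

	if (strlen(cow) != n)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (cow[i] != 'R')
			continue;    /* skip holes and unwritten preallocation */
		data[i] = 'R';       /* data fork now points at the new block */
		cow[i] = '-';        /* and the CoW fork no longer holds it */
		moved++;
	}
	return moved;
}
```

    Run on the third pair of diagrams above, this moves exactly the two
    blocks that were written and produces the fourth pair.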

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

12 Jan, 2017

1 commit

  • Commit 99579ccec4e2 "xfs: skip dirty pages in ->releasepage()" started
    to skip dirty pages in xfs_vm_releasepage() which also has the effect
    that if a dirty page is truncated, it does not get freed by
    block_invalidatepage() and is lingering in LRU list waiting for reclaim.
    So a simple loop like:

    while true; do
    dd if=/dev/zero of=file bs=1M count=100
    rm file
    done

    will keep using more and more memory until we hit low watermarks and
    start pagecache reclaim which will eventually also reclaim the truncated
    pages. Keeping these truncated (and thus never usable) pages in memory
    is just a waste of memory, is unnecessarily stressing page cache
    reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
    prematurely.

    So instead of just skipping dirty pages in xfs_vm_releasepage(), return
    to the old behavior of skipping them only if they have delalloc or
    unwritten buffers, and fix the spurious warnings by warning only if the
    page is clean.
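
    The resulting policy can be sketched as a small decision function. The
    function name and the (release, warn) return shape are hypothetical and
    only mirror the description above, not the actual kernel code.

```python
# Toy decision model of the releasepage policy described above.
# Names are hypothetical; this is not the kernel implementation.

def release_page_decision(dirty, delalloc, unwritten):
    """Return (release, warn) for a page with the given buffer state."""
    if delalloc or unwritten:
        # Such buffers still need writeback, so the page is skipped.
        # A clean page in this state is unexpected, hence the warning.
        return False, not dirty
    # No delalloc or unwritten buffers: the page may be freed even if it
    # is dirty, which covers the truncated-page case.
    return True, False

# A dirty, truncated page without delalloc/unwritten buffers is released.
print(release_page_decision(dirty=True, delalloc=False, unwritten=False))
```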

    CC: stable@vger.kernel.org
    CC: Brian Foster
    CC: Vlastimil Babka
    Reported-by: Petr Tůma
    Fixes: 99579ccec4e271c3d4d4e7c946058766812afdab
    Signed-off-by: Jan Kara
    Reviewed-by: Brian Foster
    Signed-off-by: Darrick J. Wong

    Jan Kara
     

15 Dec, 2016

2 commits

  • Pull xfs updates from Dave Chinner:
    "There is quite a varied bunch of stuff in this update, and some of it
    you will have already merged through the ext4 tree which imported the
    dax-4.10-iomap-pmd topic branch from the XFS tree.

    There is also a new direct IO implementation that uses the iomap
    infrastructure. It's much simpler, faster, and has lower IO latency
    than the existing direct IO infrastructure.

    Summary:
    - DAX PMD faults via iomap infrastructure
    - Direct-io support in iomap infrastructure
    - removal of now-redundant XFS inode iolock, replaced with VFS
    i_rwsem
    - synchronisation with fixes and changes in userspace libxfs code
    - extent tree lookup helpers
    - lots of little corruption detection improvements to verifiers
    - optimised CRC calculations
    - faster buffer cache lookups
    - deprecation of barrier/nobarrier mount options - we always use
    REQ_FUA/REQ_FLUSH where appropriate for data integrity now
    - cleanups to speculative preallocation
    - miscellaneous minor bug fixes and cleanups"

    * tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
    xfs: nuke unused tracepoint definitions
    xfs: use GPF_NOFS when allocating btree cursors
    xfs: use xfs_vn_setattr_size to check on new size
    xfs: deprecate barrier/nobarrier mount option
    xfs: Always flush caches when integrity is required
    xfs: ignore leaf attr ichdr.count in verifier during log replay
    xfs: use rhashtable to track buffer cache
    xfs: optimise CRC updates
    xfs: make xfs btree stats less huge
    xfs: don't cap maximum dedupe request length
    xfs: don't allow di_size with high bit set
    xfs: error out if trying to add attrs and anextents > 0
    xfs: don't crash if reading a directory results in an unexpected hole
    xfs: complain if we don't get nextents bmap records
    xfs: check for bogus values in btree block headers
    xfs: forbid AG btrees with level == 0
    xfs: several xattr functions can be void
    xfs: handle cow fork in xfs_bmap_trace_exlist
    xfs: pass state not whichfork to trace_xfs_extlist
    xfs: Move AGI buffer type setting to xfs_read_agi
    ...

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "This merge request includes the dax-4.10-iomap-pmd branch which is
    needed for both ext4 and xfs dax changes to use iomap for DAX. It also
    includes the fscrypt branch which is needed for ubifs encryption work
    as well as ext4 encryption and fscrypt cleanups.

    Lots of cleanups and bug fixes, especially making sure ext4 is robust
    against maliciously corrupted file systems --- especially maliciously
    corrupted xattr blocks and a maliciously corrupted superblock. Also
    fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
    mbcache so we don't miss some common xattr blocks that can be merged"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    dax: Fix sleep in atomic contex in grab_mapping_entry()
    fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
    fscrypt: Delay bounce page pool allocation until needed
    fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
    fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
    fscrypt: Never allocate fscrypt_ctx on in-place encryption
    fscrypt: Use correct index in decrypt path.
    fscrypt: move the policy flags and encryption mode definitions to uapi header
    fscrypt: move non-public structures and constants to fscrypt_private.h
    fscrypt: unexport fscrypt_initialize()
    fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
    fscrypto: move ioctl processing more fully into common code
    fscrypto: remove unneeded Kconfig dependencies
    MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
    ext4: do not perform data journaling when data is encrypted
    ext4: return -ENOMEM instead of success
    ext4: reject inodes with negative size
    ext4: remove another test in ext4_alloc_file_blocks()
    Documentation: fix description of ext4's block_validity mount option
    ext4: fix checks for data=ordered and journal_async_commit options
    ...

    Linus Torvalds
     

30 Nov, 2016

3 commits

  • Straight switch over to using iomap for direct I/O - we already have the
    non-COW dio path in write_begin for DAX and files with extent size hints,
    so nothing to add there. The COW path is ported over from the old
    get_blocks version and is a bit of a mess, but I have some work in
    progress to make it look more like the buffered I/O COW path.

    This gets rid of xfs_get_blocks_direct and the last caller of
    xfs_get_blocks with the create flag set, so all that code can be removed.

    Last but not least, I've entirely removed a comment in xfs_filemap_fault
    that refers to xfs_get_blocks instead of updating it - while the
    reference is correct, the whole DAX fault path looks different than
    the non-DAX one, so it seems rather pointless.

    Signed-off-by: Christoph Hellwig
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • This patch drops XFS's own i_iolock and instead uses the VFS i_rwsem,
    which recently replaced i_mutex. This means we only have to take
    one lock instead of two in many fast path operations, and we can
    also shrink the xfs_inode structure. Thanks to the xfs_ilock family
    there is very little churn; the only thing of note is that we need
    to switch to the lock_two_nondirectories helper for taking the i_rwsem
    on two inodes in a few places to make sure our lock order matches
    the one used in the VFS.

    Signed-off-by: Christoph Hellwig
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Dave Chinner
     

24 Nov, 2016

1 commit


08 Nov, 2016

2 commits

  • We've had reports of generic/095 causing XFS to BUG() in
    __xfs_get_blocks() due to the existence of delalloc blocks on a
    direct I/O read. generic/095 issues a mix of various types of I/O,
    including direct and memory mapped I/O to a single file. This is
    clearly not supported behavior and is known to lead to such
    problems. E.g., the lack of exclusion between the direct I/O and
    write fault paths means that a write fault can allocate delalloc
    blocks in a region of a file that was previously a hole after the
    direct read has attempted to flush/inval the file range, but before
    it actually reads the block mapping. In turn, the direct read
    discovers a delalloc extent and cannot proceed.

    While the appropriate solution here is to not mix direct and memory
    mapped I/O to the same regions of the same file, the current
    BUG_ON() behavior is probably overkill as it can crash the entire
    system. Instead, localize the failure to the I/O in question by
    returning an error for a direct I/O that cannot be handled safely
    due to delalloc blocks. Be careful to allow the case of a direct
    write to post-eof delalloc blocks. This can occur due to speculative
    preallocation and is safe as post-eof blocks are not accompanied by
    dirty pages in pagecache (conversely, preallocation within eof must
    have been zeroed, and thus dirtied, before the inode size could have
    been increased beyond said blocks).

    Finally, provide an additional warning if a direct I/O write occurs
    while the file is memory mapped. This may not catch all problematic
    scenarios, but provides a hint that some known-to-be-problematic I/O
    methods are in use.
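
    As a rough sketch of the policy described above: the function names are
    hypothetical, and the use of EIO as the error code is an assumption,
    since the message only says that an error is returned.

```python
import errno

# Toy model of the direct I/O policy described above. Function names are
# hypothetical and EIO is an assumed error code.

def check_direct_io_delalloc(is_write, offset, isize):
    """Called when direct I/O finds a delalloc extent at `offset`.

    Returns 0 if the I/O may proceed, or a negative errno otherwise.
    """
    if is_write and offset >= isize:
        # Post-EOF delalloc comes from speculative preallocation and has
        # no dirty pagecache pages behind it, so a direct write is safe.
        return 0
    # Fail only this I/O instead of crashing the system with BUG_ON().
    return -errno.EIO

def should_warn_mixed_io(is_write, file_is_mmapped):
    """Hint that direct writes and memory mapped I/O are being mixed."""
    return is_write and file_is_mmapped

print(check_direct_io_delalloc(is_write=True, offset=8192, isize=4096))
```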

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
    improved dax_iomap_pmd_fault(). Also, now that it has no more users,
    remove xfs_get_blocks_dax_fault().

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Dave Chinner

    Ross Zwisler
     

03 Nov, 2016

1 commit

  • Add wbc_to_write_flags(), which returns the write modifier flags to use,
    based on a struct writeback_control. No functional changes in this
    patch, but it prepares us for factoring other wbc fields for write type.
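
    A minimal sketch of what such a helper might look like, assuming the
    flags depend only on the sync mode of the writeback_control. The
    constant values and the SYNC_FLAG name are placeholders, not the
    kernel's definitions.

```python
from dataclasses import dataclass

# Placeholder constants; the real kernel values and flag names may differ.
WB_SYNC_NONE = 0
WB_SYNC_ALL = 1
SYNC_FLAG = 1 << 0  # hypothetical stand-in for a "synchronous write" flag

@dataclass
class WritebackControl:
    """Toy stand-in for struct writeback_control (one field only)."""
    sync_mode: int

def wbc_to_write_flags(wbc):
    """Derive write modifier flags from the writeback control."""
    # Assumption: data-integrity writeback maps to a synchronous write.
    if wbc.sync_mode == WB_SYNC_ALL:
        return SYNC_FLAG
    return 0

print(wbc_to_write_flags(WritebackControl(sync_mode=WB_SYNC_ALL)))
```

    Centralising the mapping in one helper means callers no longer open-code
    the sync_mode check, which is what makes later extension possible.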

    Signed-off-by: Jens Axboe
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     

01 Nov, 2016

1 commit


11 Oct, 2016

1 commit

  • We need to splice COW blocks we've completed in xfs_end_io_direct_write
    into the data fork before converting unwritten extents. Otherwise
    xfs_bmapi_write might first allocate blocks for any holes in the data
    fork, which is not only unnecessary but also harmful as it might cause
    reserved block underruns in the transaction.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

06 Oct, 2016

3 commits

  • For O_DIRECT writes to shared blocks, we have to CoW them just like
    we would with buffered writes. For writes that are not block-aligned,
    just bounce them to the page cache.

    For block-aligned writes, however, we can do better than that. Use
    the same mechanisms that we employ for buffered CoW to set up a
    delalloc reservation, allocate all the blocks at once, issue the
    writes against the new blocks and use the same ioend functions to
    remap the blocks after the write. This should be fairly performant.
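
    The path selection described in the two paragraphs above can be
    sketched as follows. The function names, the returned labels, and the
    4096-byte default block size are illustrative assumptions.

```python
# Toy model of choosing the write path for O_DIRECT writes to shared
# blocks. Names and the default block size are hypothetical.

def is_block_aligned(offset, length, block_size):
    """True if both the start and the length fall on block boundaries."""
    return offset % block_size == 0 and length % block_size == 0

def choose_cow_write_path(offset, length, block_size=4096):
    """Pick how a direct write to shared blocks is carried out."""
    if is_block_aligned(offset, length, block_size):
        # Reserve and allocate new blocks, write to them directly, then
        # remap them at I/O completion.
        return "direct-cow"
    # Unaligned writes are bounced to the page cache.
    return "buffered-fallback"

print(choose_cow_write_path(8192, 4096))
print(choose_cow_write_path(100, 4096))
```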

    Christoph discovered that xfs_reflink_allocate_cow_range may stumble
    over invalid entries in the extent array given that it drops the ilock
    but still expects the index to be stable. Simply fixing it to do a new
    lookup for every iteration still isn't correct given that
    xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
    there is nothing preventing an xfs_bunmapi_cow call from removing
    extents once we have dropped the ilock either.

    This patch duplicates the inner loop of xfs_bmapi_allocate into a
    helper for xfs_reflink_allocate_cow_range so that it can be done under
    the same ilock critical section as our CoW fork delayed allocation.
    The directio CoW warts will be revisited in a later patch.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Report shared extents through the iomap interface so that FIEMAP flags
    shared blocks accurately. Have xfs_vm_bmap return zero for reflinked
    files because the bmap-based swap code requires static block mappings,
    which is incompatible with copy on write.

    NOTE: Existing userspace bmap users such as lilo will have the same
    problem with reflink files.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     
  • After the write component of a copy-on-write operation finishes, clean up
    the bookkeeping left behind. On error, we simply free the new blocks
    and pass the error up. If we succeed, however, then we must remove
    the old data fork mapping and move the cow fork mapping to the data
    fork.
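
    A toy model of this completion step, with each fork represented as a
    dict from file offset to block number. The function name and return
    shape are hypothetical, and what happens to the old data fork block is
    deliberately left out.

```python
# Toy model of CoW completion. Forks are dicts of file offset -> block.
# Names are hypothetical; this is not the kernel implementation.

def finish_cow(data_fork, cow_fork, offsets, error=0):
    """Clean up after the write part of a CoW operation.

    Returns (error, freed_blocks).
    """
    freed = []
    for off in offsets:
        if off not in cow_fork:
            continue
        new_block = cow_fork.pop(off)
        if error:
            # The write failed: free the new block, keep the old mapping.
            freed.append(new_block)
        else:
            # Success: the old data fork mapping is replaced by the
            # mapping moved over from the CoW fork.
            data_fork[off] = new_block
    return error, freed

data_fork = {8: 100, 9: 101}
cow_fork = {8: 200, 9: 201}
print(finish_cow(data_fork, cow_fork, [8, 9]))
print(data_fork, cow_fork)
```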

    Signed-off-by: Darrick J. Wong
    [hch: Call the CoW failure function during xfs_cancel_ioend]
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong