11 Jul, 2017

1 commit

  • Pull XFS updates from Darrick Wong:
    "Here are some changes for you for 4.13. For the most part it's fixes
    for bugs and deadlock problems, and preparation for online fsck in
    some future merge window.

    - Avoid quotacheck deadlocks

    - Fix transaction overflows when bunmapping fragmented files

    - Refactor directory readahead

    - Allow admin to configure if ASSERT is fatal

    - Improve transaction usage detail logging during overflows

    - Minor cleanups

    - Don't leak log items when the log shuts down

    - Remove double-underscore typedefs

    - Various preparation for online scrubbing

    - Introduce new error injection configuration sysfs knobs

    - Refactor dq_get_next to use extent map directly

    - Fix problems with iterating the page cache for unwritten data

    - Implement SEEK_{HOLE,DATA} via iomap

    - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

    - Don't use MAXPATHLEN to check on-disk symlink target lengths"

    * tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
    xfs: don't crash on unexpected holes in dir/attr btrees
    xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
    xfs: fix contiguous dquot chunk iteration livelock
    xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
    vfs: Add iomap_seek_hole and iomap_seek_data helpers
    vfs: Add page_cache_seek_hole_data helper
    xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
    xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
    xfs: Check for m_errortag initialization in xfs_errortag_test
    xfs: grab dquots without taking the ilock
    xfs: fix semicolon.cocci warnings
    xfs: Don't clear SGID when inheriting ACLs
    xfs: free cowblocks and retry on buffered write ENOSPC
    xfs: replace log_badcrc_factor knob with error injection tag
    xfs: convert drop_writes to use the errortag mechanism
    xfs: remove unneeded parameter from XFS_TEST_ERROR
    xfs: expose errortag knobs via sysfs
    xfs: make errortag a per-mountpoint structure
    xfs: free uncommitted transactions during log recovery
    xfs: don't allow bmap on rt files
    ...

    Linus Torvalds
     

04 Jul, 2017

1 commit

  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particularly around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     

28 Jun, 2017

1 commit


22 Jun, 2017

1 commit

  • bmap returns a dumb LBA address but not the block device that goes with
    that LBA. Swapfiles don't care about this and will blindly assume that
    the data volume is the correct blockdev, which is totally bogus for
    files on the rt subvolume. This results in the swap code doing IOs to
    arbitrary locations on the data device(!) if the passed in mapping is a
    realtime file, so just turn off bmap for rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

21 Jun, 2017

1 commit

  • bmap returns a dumb LBA address but not the block device that goes with
    that LBA. Swapfiles don't care about this and will blindly assume that
    the data volume is the correct blockdev, which is totally bogus for
    files on the rt subvolume. This results in the swap code doing IOs to
    arbitrary locations on the data device(!) if the passed in mapping is a
    realtime file, so just turn off bmap for rt files.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

20 Jun, 2017

1 commit

  • This is a purely mechanical patch that removes the private
    __{u,}int{8,16,32,64}_t typedefs in favor of using the system
    {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
    the transformation and fix the resulting whitespace and indentation
    errors:

    s/typedef\t__uint8_t/typedef __uint8_t\t/g
    s/typedef\t__uint/typedef __uint/g
    s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
    s/__uint8_t\t/__uint8_t\t\t/g
    s/__uint/uint/g
    s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
    s/__int/int/g
    /^typedef.*int[0-9]*_t;$/d

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

09 Jun, 2017

1 commit

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 May, 2017

1 commit

  • Pull xfs updates from Darrick Wong:
    "Here are the XFS changes for 4.12. The big new feature for this
    release is the new space mapping ioctl that we've been discussing
    since LSF2016, but other than that most of the patches are larger bug
    fixes, memory corruption prevention, and other cleanups.

    Summary:
    - various code cleanups
    - introduce GETFSMAP ioctl
    - various refactoring
    - avoid dio reads past eof
    - fix memory corruption and other errors with fragmented directory blocks
    - fix accidental userspace memory corruptions
    - publish fs uuid in superblock
    - make fstrim terminatable
    - fix race between quotaoff and in-core inode creation
    - avoid use-after-free when finishing up w/ buffer heads
    - reserve enough space to handle bmap tree resizing during cow remap"

    * tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
    xfs: fix use-after-free in xfs_finish_page_writeback
    xfs: reserve enough blocks to handle btree splits when remapping
    xfs: wait on new inodes during quotaoff dquot release
    xfs: update ag iterator to support wait on new inodes
    xfs: support ability to wait on new inodes
    xfs: publish UUID in struct super_block
    xfs: Allow user to kill fstrim process
    xfs: better log intent item refcount checking
    xfs: fix up quotacheck buffer list error handling
    xfs: remove xfs_trans_ail_delete_bulk
    xfs: don't use bool values in trace buffers
    xfs: fix getfsmap userspace memory corruption while setting OF_LAST
    xfs: fix __user annotations for xfs_ioc_getfsmap
    xfs: corruption needs to respect endianess too!
    xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
    xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
    xfs: simplify validation of the unwritten extent bit
    xfs: remove unused values from xfs_exntst_t
    xfs: remove the unused XFS_MAXLINK_1 define
    xfs: more do_div cleanups
    ...

    Linus Torvalds
     

06 May, 2017

1 commit

  • Commit 28b783e47ad7 ("xfs: bufferhead chains are invalid after
    end_page_writeback") fixed one use-after-free issue by
    pre-calculating the loop conditionals before calling bh->b_end_io()
    in the end_io processing loop, but it assigned the 'next' pointer before
    checking the end offset boundary and breaking out of the loop, at which
    point the bh might already have been freed, causing a use-after-free.

    This is caught by KASAN when running fstests generic/127 on sub-page
    block size XFS.

    [ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50
    [ 2747.868840] ==================================================================
    [ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698
    ...
    [ 2747.918245] Call Trace:
    [ 2747.920975] dump_stack+0x63/0x84
    [ 2747.924673] kasan_object_err+0x21/0x70
    [ 2747.928950] kasan_report+0x271/0x530
    [ 2747.933064] ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
    [ 2747.938409] ? end_page_writeback+0xce/0x110
    [ 2747.943171] __asan_report_load8_noabort+0x19/0x20
    [ 2747.948545] xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
    [ 2747.953724] xfs_end_io+0x1af/0x2b0 [xfs]
    [ 2747.958197] process_one_work+0x5ff/0x1000
    [ 2747.962766] worker_thread+0xe4/0x10e0
    [ 2747.966946] kthread+0x2d3/0x3d0
    [ 2747.970546] ? process_one_work+0x1000/0x1000
    [ 2747.975405] ? kthread_create_on_node+0xc0/0xc0
    [ 2747.980457] ? syscall_return_slowpath+0xe6/0x140
    [ 2747.985706] ? do_page_fault+0x30/0x80
    [ 2747.989887] ret_from_fork+0x2c/0x40
    [ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104
    [ 2748.001155] Allocated:
    [ 2748.003782] PID = 8327
    [ 2748.006411] save_stack_trace+0x1b/0x20
    [ 2748.010688] save_stack+0x46/0xd0
    [ 2748.014383] kasan_kmalloc+0xad/0xe0
    [ 2748.018370] kasan_slab_alloc+0x12/0x20
    [ 2748.022648] kmem_cache_alloc+0xb8/0x1b0
    [ 2748.027024] alloc_buffer_head+0x22/0xc0
    [ 2748.031399] alloc_page_buffers+0xd1/0x250
    [ 2748.035968] create_empty_buffers+0x30/0x410
    [ 2748.040730] create_page_buffers+0x120/0x1b0
    [ 2748.045493] __block_write_begin_int+0x17a/0x1800
    [ 2748.050740] iomap_write_begin+0x100/0x2f0
    [ 2748.055308] iomap_zero_range_actor+0x253/0x5c0
    [ 2748.060362] iomap_apply+0x157/0x270
    [ 2748.064347] iomap_zero_range+0x5a/0x80
    [ 2748.068624] iomap_truncate_page+0x6b/0xa0
    [ 2748.073227] xfs_setattr_size+0x1f7/0xa10 [xfs]
    [ 2748.078312] xfs_vn_setattr_size+0x68/0x140 [xfs]
    [ 2748.083589] xfs_file_fallocate+0x4ac/0x820 [xfs]
    [ 2748.088838] vfs_fallocate+0x2cf/0x780
    [ 2748.093021] SyS_fallocate+0x48/0x80
    [ 2748.097006] do_syscall_64+0x18a/0x430
    [ 2748.101186] return_from_SYSCALL_64+0x0/0x6a
    [ 2748.105948] Freed:
    [ 2748.108189] PID = 8327
    [ 2748.110816] save_stack_trace+0x1b/0x20
    [ 2748.115093] save_stack+0x46/0xd0
    [ 2748.118788] kasan_slab_free+0x73/0xc0
    [ 2748.122969] kmem_cache_free+0x7a/0x200
    [ 2748.127247] free_buffer_head+0x41/0x80
    [ 2748.131524] try_to_free_buffers+0x178/0x250
    [ 2748.136316] xfs_vm_releasepage+0x2e9/0x3d0 [xfs]
    [ 2748.141563] try_to_release_page+0x100/0x180
    [ 2748.146325] invalidate_inode_pages2_range+0x7da/0xcf0
    [ 2748.152087] xfs_shift_file_space+0x37d/0x6e0 [xfs]
    [ 2748.157557] xfs_collapse_file_space+0x49/0x120 [xfs]
    [ 2748.163223] xfs_file_fallocate+0x2a7/0x820 [xfs]
    [ 2748.168462] vfs_fallocate+0x2cf/0x780
    [ 2748.172642] SyS_fallocate+0x48/0x80
    [ 2748.176629] do_syscall_64+0x18a/0x430
    [ 2748.180810] return_from_SYSCALL_64+0x0/0x6a

    Fix it by checking the offset against end and breaking out first, and
    dereferencing bh only if there are still bufferheads to process.
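
    The ordering described above can be sketched in userspace as follows.
    This is a minimal model under stated assumptions, not the XFS code:
    struct buf, ring_init(), buf_next(), buf_end_io() and finish_range() are
    hypothetical names, and a 'freed' flag stands in for the slab free so
    that a stale dereference trips an assert instead of corrupting memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUF 4

/* Hypothetical stand-in for a buffer head on a circular per-page list. */
struct buf {
    struct buf *next;
    bool completed;
    bool freed;     /* models the free that follows releasing the page */
};

/* Link bufs[0..NBUF-1] into a ring and return its head. */
static struct buf *ring_init(struct buf bufs[NBUF])
{
    for (int i = 0; i < NBUF; i++) {
        bufs[i].next = &bufs[(i + 1) % NBUF];
        bufs[i].completed = false;
        bufs[i].freed = false;
    }
    return &bufs[0];
}

/* Read the ring pointer; touching a freed buffer is the bug being modelled. */
static struct buf *buf_next(struct buf *b)
{
    assert(!b->freed);
    return b->next;
}

/* Complete one buffer; completing the last one in range frees the whole ring. */
static void buf_end_io(struct buf *b, struct buf *head, bool last)
{
    b->completed = true;
    if (last) {
        struct buf *p = head;
        do {
            p->freed = true;
            p = p->next;
        } while (p != head);
    }
}

/*
 * Complete every buffer whose start offset is <= end (end is inclusive).
 * The boundary check runs before the current buffer is dereferenced, so
 * the loop never reads a buffer that the previous completion released.
 * Returns the number of buffers completed.
 */
static int finish_range(struct buf *head, size_t bsize, size_t end)
{
    struct buf *b = head, *next;
    size_t off = 0;
    int done = 0;

    do {
        if (off > end)          /* check the boundary first ... */
            break;
        next = buf_next(b);     /* ... and only then touch the buffer */
        buf_end_io(b, head, off + bsize > end);
        done++;
        off += bsize;
    } while ((b = next) != head);
    return done;
}
```

    Swapping the boundary check and the buf_next() call reproduces the
    reported pattern: the iteration after the final completion would read
    a buffer that has already been freed.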

    Signed-off-by: Eryu Guan
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Eryu Guan
     

04 May, 2017

1 commit

  • xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
    some time ago. We would like to make this concept more generic and use
    it for other filesystems as well. Let's start by giving the flag a more
    generic name PF_MEMALLOC_NOFS which is in line with an existing
    PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
    contexts. Replace all PF_FSTRANS usage from the xfs code in the first
    step before we introduce a full API for it as xfs uses the flag directly
    anyway.

    This patch doesn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Acked-by: Vlastimil Babka
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Jan Kara
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Apr, 2017

2 commits


08 Mar, 2017

2 commits

  • There are two different cases of buffered I/O errors:

    - first we can have an already shutdown fs. In that case we should skip
    any on-disk operations and just clean up the append transaction if
    present and destroy the ioend
    - a real I/O error. In that case we should cleanup any lingering COW
    blocks. This gets skipped in the current code and is fixed by this
    patch.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     
  • We only want to reclaim preallocations from our periodic work item.
    Currently this is achieved by looking for a dirty inode, but that check
    is rather fragile. Instead add a flag to xfs_reflink_cancel_cow_* so
    that the caller can ask for just cancelling unwritten extents in the COW
    fork.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    [darrick: fix typos in commit message]
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
     

28 Feb, 2017

1 commit

  • Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in the fs
    branch with i_blocksize().

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting a more appropriate function instead
    of a macro.
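
    A minimal sketch of such a helper, assuming a cut-down struct inode
    that carries only i_blkbits (the real structure has many more fields).
    The assert is a userspace-only sanity check added for this
    illustration.

```c
#include <assert.h>

/* Hypothetical, cut-down stand-in for the kernel's struct inode. */
struct inode {
    unsigned char i_blkbits;    /* log2 of the block size in bytes */
};

/* Block size in bytes, replacing open-coded "1 << inode->i_blkbits". */
static inline unsigned int i_blocksize(const struct inode *node)
{
    assert(node->i_blkbits < 32);   /* keep the shift well defined */
    return 1U << node->i_blkbits;
}
```

    A function gives callers a typed argument and an 'unsigned int'
    result, which a bare macro would not enforce.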

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

03 Feb, 2017

1 commit

  • Christoph Hellwig pointed out that there's a potentially nasty race when
    performing simultaneous nearby directio cow writes:

    "Thread 1 writes a range from B to C

    " B --------- C
    p

    "a little later thread 2 writes from A to B

    " A --------- B
    p

    [editor's note: the 'p' denote cowextsize boundaries, which I added to
    make this more clear]

    "but the code preallocates beyond B into the range where thread
    "1 has just written, but ->end_io hasn't been called yet.
    "But once ->end_io is called thread 2 has already allocated
    "up to the extent size hint into the write range of thread 1,
    "so the end_io handler will splice the uninitialized blocks from
    "that preallocation back into the file right after B."

    We can avoid this race by ensuring that thread 1 cannot accidentally
    remap the blocks that thread 2 allocated (as part of speculative
    preallocation) as part of t2's write preparation in t1's end_io handler.
    The way we make this happen is by taking advantage of the unwritten
    extent flag as an intermediate step.

    Recall that when we begin the process of writing data to shared blocks,
    we create a delayed allocation extent in the CoW fork:

    D: --RRRRRRSSSRRRRRRRR---
    C: ------DDDDDDD---------

    When a thread prepares to CoW some dirty data out to disk, it will now
    convert the delalloc reservation into an /unwritten/ allocated extent in
    the cow fork. The da conversion code tries to opportunistically
    allocate as much of a (speculatively prealloc'd) extent as possible, so
    we may end up allocating a larger extent than we're actually writing
    out:

    D: --RRRRRRSSSRRRRRRRR---
    U: ------UUUUUUU---------

    Next, we convert only the part of the extent that we're actively
    planning to write to normal (i.e. not unwritten) status:

    D: --RRRRRRSSSRRRRRRRR---
    U: ------UURRUUU---------

    If the write succeeds, the end_cow function will now scan the relevant
    range of the CoW fork for real extents and remap only the real extents
    into the data fork:

    D: --RRRRRRRRSRRRRRRRR---
    U: ------UU--UUU---------

    This ensures that we never obliterate valid data fork extents with
    unwritten blocks from the CoW fork.
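
    The final remap step in the diagrams above can be modelled as a toy in
    userspace. This is only a sketch: each character stands for one block,
    end_cow_remap() is a hypothetical name, and it scans the whole fork
    rather than just the range covered by the write.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy model of the remap step. In the CoW fork string, 'R' is a real
 * (written) block, 'U' is unwritten preallocation and '-' is a hole.
 * Only real blocks are moved into the data fork; unwritten blocks stay
 * behind in the CoW fork.
 */
static void end_cow_remap(char *data, char *cow)
{
    size_t len = strlen(data);

    assert(len == strlen(cow));
    for (size_t i = 0; i < len; i++) {
        if (cow[i] != 'R')
            continue;       /* skip holes and unwritten blocks */
        data[i] = 'R';      /* remap the written block into the data fork */
        cow[i] = '-';       /* and remove it from the CoW fork */
    }
}
```

    Feeding it the third pair of diagram rows yields the fourth pair: the
    two written blocks replace shared blocks in the data fork and the
    unwritten preallocation around them is left untouched.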

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

12 Jan, 2017

1 commit

  • Commit 99579ccec4e2 "xfs: skip dirty pages in ->releasepage()" started
    to skip dirty pages in xfs_vm_releasepage() which also has the effect
    that if a dirty page is truncated, it does not get freed by
    block_invalidatepage() and is lingering in LRU list waiting for reclaim.
    So a simple loop like:

    while true; do
        dd if=/dev/zero of=file bs=1M count=100
        rm file
    done

    will keep using more and more memory until we hit low watermarks and
    start pagecache reclaim, which will eventually also reclaim the truncated
    pages. Keeping these truncated (and thus never usable) pages in memory
    is just a waste of memory, is unnecessarily stressing page cache
    reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
    prematurely.

    So instead of just skipping dirty pages in xfs_vm_releasepage(), return
    to old behavior of skipping them only if they have delalloc or unwritten
    buffers and fix the spurious warnings by warning only if the page is
    clean.

    CC: stable@vger.kernel.org
    CC: Brian Foster
    CC: Vlastimil Babka
    Reported-by: Petr Tůma
    Fixes: 99579ccec4e271c3d4d4e7c946058766812afdab
    Signed-off-by: Jan Kara
    Reviewed-by: Brian Foster
    Signed-off-by: Darrick J. Wong

    Jan Kara
     

15 Dec, 2016

2 commits

  • Pull xfs updates from Dave Chinner:
    "There is quite a varied bunch of stuff in this update, and some of it
    you will have already merged through the ext4 tree which imported the
    dax-4.10-iomap-pmd topic branch from the XFS tree.

    There is also a new direct IO implementation that uses the iomap
    infrastructure. It's much simpler, faster, and has lower IO latency
    than the existing direct IO infrastructure.

    Summary:
    - DAX PMD faults via iomap infrastructure
    - Direct-io support in iomap infrastructure
    - removal of now-redundant XFS inode iolock, replaced with VFS
    i_rwsem
    - synchronisation with fixes and changes in userspace libxfs code
    - extent tree lookup helpers
    - lots of little corruption detection improvements to verifiers
    - optimised CRC calculations
    - faster buffer cache lookups
    - deprecation of barrier/nobarrier mount options - we always use
    REQ_FUA/REQ_FLUSH where appropriate for data integrity now
    - cleanups to speculative preallocation
    - miscellaneous minor bug fixes and cleanups"

    * tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
    xfs: nuke unused tracepoint definitions
    xfs: use GPF_NOFS when allocating btree cursors
    xfs: use xfs_vn_setattr_size to check on new size
    xfs: deprecate barrier/nobarrier mount option
    xfs: Always flush caches when integrity is required
    xfs: ignore leaf attr ichdr.count in verifier during log replay
    xfs: use rhashtable to track buffer cache
    xfs: optimise CRC updates
    xfs: make xfs btree stats less huge
    xfs: don't cap maximum dedupe request length
    xfs: don't allow di_size with high bit set
    xfs: error out if trying to add attrs and anextents > 0
    xfs: don't crash if reading a directory results in an unexpected hole
    xfs: complain if we don't get nextents bmap records
    xfs: check for bogus values in btree block headers
    xfs: forbid AG btrees with level == 0
    xfs: several xattr functions can be void
    xfs: handle cow fork in xfs_bmap_trace_exlist
    xfs: pass state not whichfork to trace_xfs_extlist
    xfs: Move AGI buffer type setting to xfs_read_agi
    ...

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "This merge request includes the dax-4.10-iomap-pmd branch which is
    needed for both ext4 and xfs dax changes to use iomap for DAX. It also
    includes the fscrypt branch which is needed for ubifs encryption work
    as well as ext4 encryption and fscrypt cleanups.

    Lots of cleanups and bug fixes, especially making sure ext4 is robust
    against maliciously corrupted file systems --- especially maliciously
    corrupted xattr blocks and a maliciously corrupted superblock. Also
    fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
    mbcache so we don't miss some common xattr blocks that can be merged"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    dax: Fix sleep in atomic contex in grab_mapping_entry()
    fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
    fscrypt: Delay bounce page pool allocation until needed
    fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
    fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
    fscrypt: Never allocate fscrypt_ctx on in-place encryption
    fscrypt: Use correct index in decrypt path.
    fscrypt: move the policy flags and encryption mode definitions to uapi header
    fscrypt: move non-public structures and constants to fscrypt_private.h
    fscrypt: unexport fscrypt_initialize()
    fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
    fscrypto: move ioctl processing more fully into common code
    fscrypto: remove unneeded Kconfig dependencies
    MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
    ext4: do not perform data journaling when data is encrypted
    ext4: return -ENOMEM instead of success
    ext4: reject inodes with negative size
    ext4: remove another test in ext4_alloc_file_blocks()
    Documentation: fix description of ext4's block_validity mount option
    ext4: fix checks for data=ordered and journal_async_commit options
    ...

    Linus Torvalds
     

30 Nov, 2016

3 commits

  • Straight switch over to using iomap for direct I/O - we already have the
    non-COW dio path in write_begin for DAX and files with extent size hints,
    so nothing to add there. The COW path is ported over from the old
    get_blocks version and a bit of a mess, but I have some work in progress
    to make it look more like the buffered I/O COW path.

    This gets rid of xfs_get_blocks_direct and the last caller of
    xfs_get_blocks with the create flag set, so all that code can be removed.

    Last but not least I've removed a comment in xfs_filemap_fault that
    refers to xfs_get_blocks entirely instead of updating it - while the
    reference is correct, the whole DAX fault path looks different than
    the non-DAX one, so it seems rather pointless.

    Signed-off-by: Christoph Hellwig
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • This patch drops the XFS-own i_iolock and uses the VFS i_rwsem which
    recently replaced i_mutex instead. This means we only have to take
    one lock instead of two in many fast path operations, and we can
    also shrink the xfs_inode structure. Thanks to the xfs_ilock family
    there is very little churn, the only thing of note is that we need
    to switch to use the lock_two_directory helper for taking the i_rwsem
    on two inodes in a few places to make sure our lock order matches
    the one used in the VFS.

    Signed-off-by: Christoph Hellwig
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Dave Chinner
     

24 Nov, 2016

1 commit


08 Nov, 2016

2 commits

  • We've had reports of generic/095 causing XFS to BUG() in
    __xfs_get_blocks() due to the existence of delalloc blocks on a
    direct I/O read. generic/095 issues a mix of various types of I/O,
    including direct and memory mapped I/O to a single file. This is
    clearly not supported behavior and is known to lead to such
    problems. E.g., the lack of exclusion between the direct I/O and
    write fault paths means that a write fault can allocate delalloc
    blocks in a region of a file that was previously a hole after the
    direct read has attempted to flush/inval the file range, but before
    it actually reads the block mapping. In turn, the direct read
    discovers a delalloc extent and cannot proceed.

    While the appropriate solution here is to not mix direct and memory
    mapped I/O to the same regions of the same file, the current
    BUG_ON() behavior is probably overkill as it can crash the entire
    system. Instead, localize the failure to the I/O in question by
    returning an error for a direct I/O that cannot be handled safely
    due to delalloc blocks. Be careful to allow the case of a direct
    write to post-eof delalloc blocks. This can occur due to speculative
    preallocation and is safe as post-eof blocks are not accompanied by
    dirty pages in pagecache (conversely, preallocation within eof must
    have been zeroed, and thus dirtied, before the inode size could have
    been increased beyond said blocks).

    Finally, provide an additional warning if a direct I/O write occurs
    while the file is memory mapped. This may not catch all problematic
    scenarios, but provides a hint that some known-to-be-problematic I/O
    methods are in use.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     
  • Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
    improved dax_iomap_pmd_fault(). Also, now that it has no more users,
    remove xfs_get_blocks_dax_fault().

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Dave Chinner

    Ross Zwisler
     

03 Nov, 2016

1 commit

  • Add wbc_to_write_flags(), which returns the write modifier flags to use,
    based on a struct writeback_control. No functional changes in this
    patch, but it prepares us for factoring other wbc fields for write type.
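
    A sketch of what such a helper might look like. The types and the flag
    below are hypothetical stand-ins defined locally, and the mapping from
    sync mode to a "sync" write flag is an assumption made for this
    illustration; the message above only says that the flags are derived
    from the writeback_control.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the values do not match any kernel header. */
enum writeback_sync_modes { WB_SYNC_NONE, WB_SYNC_ALL };

struct writeback_control {
    enum writeback_sync_modes sync_mode;
};

#define REQ_SYNC_SKETCH (1u << 0)   /* illustrative flag bit only */

/* Derive the write modifier flags from the writeback control. */
static inline unsigned int
wbc_to_write_flags(const struct writeback_control *wbc)
{
    assert(wbc != NULL);
    if (wbc->sync_mode == WB_SYNC_ALL)  /* assumed: data integrity writeback */
        return REQ_SYNC_SKETCH;
    return 0;
}
```

    Keeping the decision in one helper means a later change that looks at
    other wbc fields only has to touch this function, not every caller.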

    Signed-off-by: Jens Axboe
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     

01 Nov, 2016

1 commit


11 Oct, 2016

1 commit

  • We need to splice COW blocks we've completed in xfs_end_io_direct_write
    into the data fork before converting unwritten extents. Otherwise
    xfs_bmapi_write might first allocate blocks for any holes in the data
    fork, which isn't only not needed but also harmful as it might cause
    reserved block underruns in the transaction.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

06 Oct, 2016

3 commits

  • For O_DIRECT writes to shared blocks, we have to CoW them just like
    we would with buffered writes. For writes that are not block-aligned,
    just bounce them to the page cache.

    For block-aligned writes, however, we can do better than that. Use
    the same mechanisms that we employ for buffered CoW to set up a
    delalloc reservation, allocate all the blocks at once, issue the
    writes against the new blocks and use the same ioend functions to
    remap the blocks after the write. This should be fairly performant.
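
    The split between the two cases comes down to an alignment test on the
    write. A minimal sketch, assuming a power-of-two block size;
    dio_is_block_aligned() is a hypothetical name, not an XFS function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if both the file offset and the byte count of a write are
 * multiples of the block size. Aligned writes can take the direct CoW
 * path; anything else would be bounced to the page cache.
 */
static bool dio_is_block_aligned(uint64_t offset, uint64_t count,
                                 uint32_t blocksize)
{
    /* The mask trick below only works for power-of-two block sizes. */
    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
    return ((offset | count) & (blocksize - 1)) == 0;
}
```

    OR-ing the offset and the count lets one mask test cover both values.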

    Christoph discovered that xfs_reflink_allocate_cow_range may stumble
    over invalid entries in the extent array given that it drops the ilock
    but still expects the index to be stable. Simply fixing it to do a new
    lookup for every iteration still isn't correct given that
    xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
    there is nothing preventing a xfs_bunmapi_cow call removing extents
    once we dropped the ilock either.

    This patch duplicates the inner loop of xfs_bmapi_allocate into a
    helper for xfs_reflink_allocate_cow_range so that it can be done under
    the same ilock critical section as our CoW fork delayed allocation.
    The directio CoW warts will be revisited in a later patch.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Report shared extents through the iomap interface so that FIEMAP flags
    shared blocks accurately. Have xfs_vm_bmap return zero for reflinked
    files because the bmap-based swap code requires static block mappings,
    which is incompatible with copy on write.

    NOTE: Existing userspace bmap users such as lilo will have the same
    problem with reflink files.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     
  • After the write component of a copy-on-write operation finishes, clean up
    the bookkeeping left behind. On error, we simply free the new blocks
    and pass the error up. If we succeed, however, then we must remove
    the old data fork mapping and move the cow fork mapping to the data
    fork.

    Signed-off-by: Darrick J. Wong
    [hch: Call the CoW failure function during xfs_cancel_ioend]
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     

05 Oct, 2016

2 commits

  • Modify the writepage handler to find and convert pending delalloc
    extents to real allocations. Furthermore, when we're doing non-cow
    writes to a part of a file that already has a CoW reservation (the
    cowextsz hint that we set up in a subsequent patch facilitates this),
    promote the write to copy-on-write so that the entire extent can get
    written out as a single extent on disk, thereby reducing post-CoW
    fragmentation.
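
The promotion rule can be sketched as a small decision function. This is an illustrative model only: `struct model_resv`, `model_classify_write()`, and the enum are hypothetical names, and a reservation is assumed to be a single block range in the CoW fork.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CoW fork reservation covering [start, start + len). */
struct model_resv {
	unsigned long start;
	unsigned long len;
};

enum model_write_kind { MODEL_WRITE_IN_PLACE, MODEL_WRITE_COW };

/* True if the reservation covers the given file block. */
static bool model_resv_covers(const struct model_resv *r, unsigned long block)
{
	return r->len > 0 && block >= r->start && block - r->start < r->len;
}

/*
 * Decide how to write one block. Shared blocks must always be CoWed.
 * An unshared block that falls inside an existing CoW reservation is
 * promoted to CoW as well, so the whole reservation can land on disk
 * as one extent.
 */
static enum model_write_kind
model_classify_write(bool block_is_shared, const struct model_resv *cow_resv,
		     unsigned long block)
{
	if (block_is_shared)
		return MODEL_WRITE_COW;
	if (cow_resv != NULL && model_resv_covers(cow_resv, block))
		return MODEL_WRITE_COW;
	return MODEL_WRITE_IN_PLACE;
}
```

Without the second check, an unshared block in the middle of a reservation would be written in place and split the new extent in two.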

    Christoph moved the CoW support code in _map_blocks to a separate helper
    function, refactored other functions, and reduced the number of CoW fork
    lookups, so I merged those changes here to reduce churn.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Christoph Hellwig

    Darrick J. Wong
     
  • Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
    allocation extents in the CoW fork to real allocations, and wire this
    up all the way back to xfs_iomap_write_allocate(). In a subsequent
    patch, we'll modify the writepage handler to call this.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

19 Sep, 2016

1 commit

  • Rename the current function to __xfs_setfilesize and add a non-static
    wrapper that also takes care of creating the transaction. This new
    helper will be used by the new iomap-based DAX path.
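
The helper-plus-wrapper split looks roughly like this in a userspace sketch. All names here (`struct model_tx`, `struct model_inode`, `model_setfilesize_tx()`, `model_setfilesize()`) are hypothetical, and the model assumes the helper only ever grows the recorded size. Portable userspace C reserves identifiers with a leading double underscore, so the inner helper carries a suffix instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the XFS structures. */
struct model_tx    { int committed; };
struct model_inode { long disk_size; int tx_commits; };

/* Inner helper: the caller supplies an open transaction. */
static int model_setfilesize_tx(struct model_inode *ip, struct model_tx *tp,
				long offset, long size)
{
	long new_size = offset + size;

	assert(tp != NULL && !tp->committed);
	if (new_size > ip->disk_size)
		ip->disk_size = new_size;  /* assumption: size only grows */
	tp->committed = 1;
	ip->tx_commits++;
	return 0;
}

/* Non-static wrapper: creates the transaction, then delegates. */
int model_setfilesize(struct model_inode *ip, long offset, long size)
{
	struct model_tx tx = { 0 };

	return model_setfilesize_tx(ip, &tx, offset, size);
}
```

Callers that already hold a transaction use the inner helper directly; a new caller with no transaction of its own goes through the wrapper.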

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

28 Jul, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "The major addition is the new iomap based block mapping
    infrastructure. We've been kicking this about locally for years, but
    there are other filesystems that want to use it too (e.g. gfs2). Now it
    is fully working, reviewed and ready to be merged and used by other
    filesystems.

    There are a lot of other fixes and cleanups in the tree, but those are
    XFS internal things and none are of the scale or visibility of the
    iomap changes. See below for details.

    I am likely to send another pull request next week - we're just about
    ready to merge some new functionality (on disk block->owner reverse
    mapping infrastructure), but that's a huge chunk of code (74 files
    changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
    separate to all the "normal" pull request changes so they don't get
    lost in the noise.

    Summary of changes in this update:
    - generic iomap based IO path infrastructure
    - generic iomap based fiemap implementation
    - xfs iomap based IO path implementation
    - buffer error handling fixes
    - tracking of in flight buffer IO for unmount serialisation
    - direct IO and DAX io path separation and simplification
    - shortform directory format definition changes for wider platform
      compatibility
    - various buffer cache fixes
    - cleanups in preparation for rmap merge
    - error injection cleanups and fixes
    - log item format buffer memory allocation restructuring to prevent
      rare OOM reclaim deadlocks
    - sparse inode chunks are now fully supported"

    * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
    xfs: remove EXPERIMENTAL tag from sparse inode feature
    xfs: bufferhead chains are invalid after end_page_writeback
    xfs: allocate log vector buffers outside CIL context lock
    libxfs: directory node splitting does not have an extra block
    xfs: remove dax code from object file when disabled
    xfs: skip dirty pages in ->releasepage()
    xfs: remove __arch_pack
    xfs: kill xfs_dir2_inou_t
    xfs: kill xfs_dir2_sf_off_t
    xfs: split direct I/O and DAX path
    xfs: direct calls in the direct I/O path
    xfs: stop using generic_file_read_iter for direct I/O
    xfs: split xfs_file_read_iter into buffered and direct I/O helpers
    xfs: remove s_maxbytes enforcement in xfs_file_read_iter
    xfs: kill ioflags
    xfs: don't pass ioflags around in the ioctl path
    xfs: track and serialize in-flight async buffers against unmount
    xfs: exclude never-released buffers from buftarg I/O accounting
    xfs: don't reset b_retries to 0 on every failure
    xfs: remove extraneous buffer flag changes
    ...

    Linus Torvalds
     

22 Jul, 2016

3 commits

  • Dave Chinner
     
  • In xfs_finish_page_writeback(), we have a loop that looks like this:

    do {
            if (off < bvec->bv_offset)
                    goto next_bh;
            if (off > end)
                    break;
            bh->b_end_io(bh, !error);
    next_bh:
            off += bh->b_size;
    } while ((bh = bh->b_this_page) != head);

    The b_end_io function is end_buffer_async_write(), which will call
    end_page_writeback() once all the buffers have been marked as no
    longer under IO. The issue here is that the only thing currently
    protecting both the bufferhead chain and the page from being
    reclaimed is the PageWriteback state held on the page.

    While we attempt to limit the loop to just the buffers covered by
    the IO, we still read from the buffer size and follow the next
    pointer in the bufferhead chain. There is no guarantee that either
    of these is valid after the PageWriteback flag has been cleared.
    Hence, loops like this are completely unsafe, and result in
    use-after-free issues. One such problem was caught by Calvin Owens
    with KASAN:

    .....
    INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
    free_buffer_head+0x41/0x90
    __slab_free+0x1ed/0x340
    kmem_cache_free+0x270/0x300
    free_buffer_head+0x41/0x90
    try_to_free_buffers+0x171/0x240
    xfs_vm_releasepage+0xcb/0x3b0
    try_to_release_page+0x106/0x190
    shrink_page_list+0x118e/0x1a10
    shrink_inactive_list+0x42c/0xdf0
    shrink_zone_memcg+0xa09/0xfa0
    shrink_zone+0x2c3/0xbc0
    .....
    Call Trace:
    dump_stack+0x68/0x94
    print_trailer+0x115/0x1a0
    object_err+0x34/0x40
    kasan_report_error+0x217/0x530
    __asan_report_load8_noabort+0x43/0x50
    xfs_destroy_ioend+0x3bf/0x4c0
    xfs_end_bio+0x154/0x220
    bio_endio+0x158/0x1b0
    blk_update_request+0x18b/0xb80
    scsi_end_request+0x97/0x5a0
    scsi_io_completion+0x438/0x1690
    scsi_finish_command+0x375/0x4e0
    scsi_softirq_done+0x280/0x340

    Where the access is occurring during IO completion after the buffer
    had been freed from direct memory reclaim.

    Prevent use-after-free accidents in this end_io processing loop by
    pre-calculating the loop conditionals before calling bh->b_end_io().
    The loop is already limited to just the bufferheads covered by the
    IO in progress, so the offset checks are sufficient to prevent
    accessing buffers in the chain after end_page_writeback() has been
    called by the bh->b_end_io() callout.
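
The pre-calculation idea can be demonstrated with a userspace model. This is a sketch under stated assumptions, not the patch itself: `struct model_bh`, `model_end_io()`, and `finish_range()` are hypothetical names, the chain is a fixed array of four buffers, and "freed" memory is modelled by poisoning every buffer once the last one completes. The loop samples the buffer size and the next pointer before the callback runs, and it checks the offset before touching the buffer at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head; not the kernel type. */
struct model_bh {
	size_t size;            /* models b_size */
	struct model_bh *next;  /* models b_this_page, a circular list */
	bool under_io;
	bool poisoned;          /* set once the chain may have been reclaimed */
};

#define CHAIN_LEN 4
static struct model_bh chain[CHAIN_LEN];

/* Build a circular chain of CHAIN_LEN buffers, all under IO. */
static struct model_bh *chain_init(size_t bsize)
{
	for (int i = 0; i < CHAIN_LEN; i++) {
		chain[i].size = bsize;
		chain[i].next = &chain[(i + 1) % CHAIN_LEN];
		chain[i].under_io = true;
		chain[i].poisoned = false;
	}
	return &chain[0];
}

/*
 * Models the completion callback: once no buffer is under IO, writeback
 * ends and the whole chain may be reclaimed. Reclaim is modelled by
 * poisoning every buffer and clearing its links.
 */
static void model_end_io(struct model_bh *bh)
{
	bh->under_io = false;
	for (int i = 0; i < CHAIN_LEN; i++)
		if (chain[i].under_io)
			return;
	for (int i = 0; i < CHAIN_LEN; i++) {
		chain[i].poisoned = true;
		chain[i].next = NULL;
		chain[i].size = 0;
	}
}

/* Complete the buffers at offsets in [start, end]; return how many. */
static int finish_range(struct model_bh *head, size_t start, size_t end)
{
	struct model_bh *bh = head, *next;
	size_t off = 0;
	size_t bsize = bh->size;  /* sampled once, before any callback */
	int done = 0;

	do {
		if (off > end)
			break;              /* past the IO: do not touch bh */
		assert(!bh->poisoned);  /* would fire on a use-after-free */
		next = bh->next;        /* sampled before the callback runs */
		if (off >= start) {
			model_end_io(bh);
			done++;
		}
		off += bsize;
	} while ((bh = next) != head);

	return done;
}
```

After the callback for the last covered buffer, the loop only compares the saved pointer against `head` or breaks on the offset check, so it never reads from a buffer that the model has poisoned.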

    Yet another example of why Bufferheads Must Die.

    cc: # 4.7
    Signed-off-by: Dave Chinner
    Reported-and-Tested-by: Calvin Owens
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • XFS has had scattered reports of delalloc blocks present at
    ->releasepage() time. This results in a warning with a stack trace
    similar to the following:

    ...
    Call Trace:
    dump_stack+0x63/0x84
    warn_slowpath_common+0x97/0xe0
    warn_slowpath_null+0x1a/0x20
    xfs_vm_releasepage+0x10f/0x140
    ? page_mkclean_one+0xd0/0xd0
    ? anon_vma_prepare+0x150/0x150
    try_to_release_page+0x32/0x50
    shrink_active_list+0x3ce/0x3e0
    shrink_lruvec+0x687/0x7d0
    shrink_zone+0xdc/0x2c0
    kswapd+0x4f9/0x970
    ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
    kthread+0xc9/0xe0
    ? kthread_stop+0x100/0x100
    ret_from_fork+0x3f/0x70
    ? kthread_stop+0x100/0x100

    This occurs because it is possible for shrink_active_list() to send
    pages marked dirty to ->releasepage() when certain buffer_head threshold
    conditions are met. shrink_active_list() doesn't check the page dirty
    state, apparently to handle an old ext3 corner case in which clean
    pages would not have the dirty bit cleared, so it is up to the
    filesystem to determine how to handle the page.

    XFS currently handles the delalloc case properly, but this behavior
    makes the warning spurious. Update the XFS ->releasepage() handler to
    explicitly skip dirty pages. Retain the existing delalloc/unwritten
    checks so we continue to warn if such buffers exist on clean pages when
    they shouldn't.
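
The resulting check order can be sketched as below. This is a userspace model with hypothetical names (`struct model_page`, `model_releasepage()`, and the `warnings` counter standing in for the kernel warning); it assumes the page state has already been reduced to three flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page model; the field names are illustrative only. */
struct model_page {
	bool dirty;
	bool has_delalloc;
	bool has_unwritten;
};

static int warnings;  /* counts how often the warning path was taken */

/* Returns true if the page's buffers may be released. */
static bool model_releasepage(const struct model_page *page)
{
	/* Dirty pages can legitimately carry delalloc or unwritten buffers. */
	if (page->dirty)
		return false;

	/* On a clean page these states indicate a bug, so still warn. */
	if (page->has_delalloc || page->has_unwritten) {
		warnings++;
		return false;
	}

	return true;
}
```

Testing the dirty flag first is what silences the spurious warning: the delalloc and unwritten checks are only reached for clean pages, where such buffers should not exist.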

    Diagnosed-by: Dave Chinner
    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

20 Jul, 2016

2 commits