02 Sep, 2014

3 commits

  • Now that we are not doing silly things with dirtying buffers beyond EOF
    and are using invalidation correctly, we can finally reduce the ranges of
    writeback and invalidation used by direct IO to match that of the IO
    being issued.

    Bring the writeback and invalidation ranges back to match the
    generic direct IO code - this will greatly reduce the perturbation
    of cached data when direct IO and buffered IO are mixed, but still
    provide the same buffered vs direct IO coherency behaviour we
    currently have.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
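
As a rough illustration of "matching the range of the IO being issued", the sketch below computes the inclusive page-index range covered by a direct IO at byte offset `pos` of length `count`. It is a userspace model, not the XFS code: `SKETCH_PAGE_SHIFT`, `struct page_range` and `dio_page_range()` are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Inclusive range of page indexes touched by one IO (hypothetical type). */
struct page_range {
    uint64_t first;
    uint64_t last;
};

/* Map a byte range [pos, pos + count) onto the pages it overlaps. */
static struct page_range dio_page_range(uint64_t pos, uint64_t count)
{
    struct page_range r;

    assert(count > 0);
    r.first = pos >> SKETCH_PAGE_SHIFT;
    r.last = (pos + count - 1) >> SKETCH_PAGE_SHIFT;
    return r;
}
```

Restricting writeback and invalidation to `first..last` leaves cached pages outside the IO untouched, which is the reduced perturbation the commit message describes.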
     
  • Similar to direct IO reads, direct IO writes are using
    truncate_pagecache_range to invalidate the page cache. This is
    incorrect due to the sub-block zeroing in the page cache that
    truncate_pagecache_range() triggers.

    This patch fixes things by using invalidate_inode_pages2_range
    instead. It preserves the page cache invalidation, but won't zero
    any pages.

    cc: stable@vger.kernel.org
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • xfs is using truncate_pagecache_range to invalidate the page cache
    during DIO reads. This is different from the other filesystems, which
    only invalidate pages during DIO writes.

    truncate_pagecache_range is meant to be used when we are freeing the
    underlying data structs from disk, so it will zero any partial
    ranges in the page. This means a DIO read can zero out part of the
    page cache page, and it is possible the page will stay in cache.

    Buffered reads will then find an up-to-date page with zeros instead of
    the data actually on disk.

    This patch fixes things by using invalidate_inode_pages2_range
    instead. It preserves the page cache invalidation, but won't zero
    any pages.

    [dchinner: catch error and warn if it fails. Comment.]

    cc: stable@vger.kernel.org
    Signed-off-by: Chris Mason
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Chris Mason
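
The difference between the two approaches can be pictured with a toy userspace model. This is only a sketch under stated assumptions: `struct toy_page` and the `toy_*` functions are hypothetical and do not reproduce the kernel's page cache. The truncate-style helper zeroes a partial range and leaves the page cached, while the invalidate-style helper drops the page without touching its contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* One cached page of file data (hypothetical model). */
struct toy_page {
    bool cached;
    unsigned char data[TOY_PAGE_SIZE];
};

/* Truncate-style: a partial range is zeroed and the page stays cached. */
static void toy_truncate_range(struct toy_page *pg, size_t first, size_t last)
{
    assert(first <= last && last < TOY_PAGE_SIZE);
    if (first == 0 && last == TOY_PAGE_SIZE - 1) {
        pg->cached = false; /* whole page covered: just drop it */
        return;
    }
    memset(pg->data + first, 0, last - first + 1);
}

/* Invalidate-style: the page is dropped, its contents are never modified. */
static void toy_invalidate_page(struct toy_page *pg)
{
    pg->cached = false;
}

/* A buffered read is served from the cache if the page is still there. */
static unsigned char toy_buffered_read(const struct toy_page *pg,
                                       const unsigned char *disk, size_t off)
{
    assert(off < TOY_PAGE_SIZE);
    return pg->cached ? pg->data[off] : disk[off];
}
```

In this model, a read after `toy_truncate_range()` returns zeros for the zeroed bytes even though the disk still holds data, whereas a read after `toy_invalidate_page()` goes back to the disk contents.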
     

04 Aug, 2014

2 commits

  • Dave Chinner
     
  • Move the IO flag definitions to xfs_inode.h and kill the header file
    as it is now empty.

    Removing the xfs_vnode.h file showed up an implicit header include
    path:
    xfs_linux.h -> xfs_vnode.h -> xfs_fs.h

    And so every xfs header file has been implicitly including
    xfs_fs.h whether it is needed or not. Hence the removal of xfs_vnode.h
    causes all sorts of build issues because BBTOB() and friends are no
    longer automatically included in the build. This also gets fixed.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

24 Jul, 2014

1 commit

  • From: Brian Foster

    Speculative preallocation and the associated throttling metrics
    assume we're working with large files on large filesystems. Users have
    reported inefficiencies in these mechanisms when we happen to be dealing
    with large files on smaller filesystems. This can occur because while
    prealloc throttling is aggressive under low free space conditions, it is
    not active until we reach 5% free space or less.

    For example, a 40GB filesystem has enough space for several files large
    enough to have multi-GB preallocations at any given time. If those files
    are slow growing, they might reserve preallocation for long periods of
    time as well as avoid the background scanner due to frequent
    modification. If a new file is written under these conditions, said file
    has no access to this already reserved space and premature ENOSPC is
    imminent.

    To handle this scenario, modify the buffered write ENOSPC handling and
    retry sequence to invoke an eofblocks scan. In the smaller filesystem
    scenario, the eofblocks scan resets the usage of preallocation such that
    when the 5% free space threshold is met, throttling effectively takes
    over to provide fair and efficient preallocation until legitimate
    ENOSPC.

    The eofblocks scan is selective based on the nature of the failure. For
    example, an EDQUOT failure in a particular quota will use a filtered
    scan for that quota. Because we don't know which quota might have caused
    an allocation failure at any given time, we include each applicable
    quota determined to be under low free space conditions in the scan.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
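
The retry sequence described above can be sketched as a small userspace model. Everything here is hypothetical (`struct toy_fs`, `toy_write()`, `toy_eofblocks_scan()`, `toy_buffered_write()`), and the scan is simplified to "return all speculative preallocation to the free pool"; the quota-filtered scans mentioned in the message are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy filesystem accounting (hypothetical model). */
struct toy_fs {
    long free_blocks;     /* blocks available to new writes */
    long prealloc_blocks; /* speculative preallocation held by other files */
    int scans_run;        /* how many scans were triggered */
};

/* Allocation attempt: fails with -ENOSPC when free space is short. */
static int toy_write(struct toy_fs *fs, long blocks)
{
    if (blocks > fs->free_blocks)
        return -ENOSPC;
    fs->free_blocks -= blocks;
    return 0;
}

/* Simplified scan: trim all post-EOF preallocation back to free space. */
static void toy_eofblocks_scan(struct toy_fs *fs)
{
    fs->free_blocks += fs->prealloc_blocks;
    fs->prealloc_blocks = 0;
    fs->scans_run++;
}

/* Buffered write with a single scan-and-retry on ENOSPC. */
static int toy_buffered_write(struct toy_fs *fs, long blocks)
{
    bool scanned = false;
    int error;

    assert(blocks >= 0);
retry:
    error = toy_write(fs, blocks);
    if (error == -ENOSPC && !scanned) {
        toy_eofblocks_scan(fs);
        scanned = true;
        goto retry;
    }
    return error;
}
```

The scan runs at most once per write, so a write that still does not fit after reclaiming preallocation reports a legitimate ENOSPC instead of looping.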
     

25 Jun, 2014

1 commit

  • Convert all the errors in the core XFS code to negative error signs
    like the rest of the kernel and remove all the sign conversion we
    do in the interface layers.

    Errors for conversion (and comparison) found via searches like:

    $ git grep " E" fs/xfs
    $ git grep "return E" fs/xfs
    $ git grep " E[A-Z].*;$" fs/xfs

    Negation points found via searches like:

    $ git grep "= -[a-z,A-Z]" fs/xfs
    $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
    $ git grep " -[a-z].*;" fs/xfs

    [ with some bits I missed from Brian Foster ]

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
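
A minimal sketch of the convention after the conversion, assuming hypothetical function names (`toy_core_op()` and `toy_vfs_entry()` are illustrations, not XFS functions): the core code returns a negative errno directly and the interface layer passes it through unchanged.

```c
#include <assert.h>
#include <errno.h>

/* Core code: reports failure as a negative errno, success as 0. */
static int toy_core_op(int fail)
{
    return fail ? -EIO : 0;
}

/* Interface layer: no sign conversion is needed any more. */
static int toy_vfs_entry(int fail)
{
    int error = toy_core_op(fail);

    assert(error <= 0); /* positive values would be a convention violation */
    return error;
}
```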
     

22 Jun, 2014

1 commit

  • XFS_ERROR was designed long ago to trap return values, but it's not
    runtime configurable, it's not consistently used, and we can do
    similar error trapping with ftrace scripts and triggers from
    userspace.

    Just nuke XFS_ERROR and associated bits.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This is the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

12 Jun, 2014

2 commits


15 May, 2014

4 commits


07 May, 2014

7 commits


21 Apr, 2014

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "These are regression and bug fixes for ext4.

    We had a number of new features in ext4 during this merge window
    (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
    there were many more regression and bug fixes this time around. It
    didn't help that xfstests hadn't been fully updated to fully stress
    test COLLAPSE_RANGE until after -rc1"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
    ext4: disable COLLAPSE_RANGE for bigalloc
    ext4: fix COLLAPSE_RANGE failure with 1KB block size
    ext4: use EINVAL if not a regular file in ext4_collapse_range()
    ext4: enforce we are operating on a regular file in ext4_zero_range()
    ext4: fix extent merging in ext4_ext_shift_path_extents()
    ext4: discard preallocations after removing space
    ext4: no need to truncate pagecache twice in collapse range
    ext4: fix removing status extents in ext4_collapse_range()
    ext4: use filemap_write_and_wait_range() correctly in collapse range
    ext4: use truncate_pagecache() in collapse range
    ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE
    ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
    ext4: always check ext4_ext_find_extent result
    ext4: fix error handling in ext4_ext_shift_extents
    ext4: silence sparse check warning for function ext4_trim_extent
    ext4: COLLAPSE_RANGE only works on extent-based files
    ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
    ext4: use i_size_read in ext4_unaligned_aio()
    fs: disallow all fallocate operation on active swapfile
    fs: move falloc collapse range check into the filesystem methods
    ...

    Linus Torvalds
     

17 Apr, 2014

1 commit

  • We negate the error value being returned from a generic function
    incorrectly. The code path that it is running in returns negative
    errors, so there is no need to negate it to get the correct error
    signs here.

    This was uncovered by generic/019.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
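
The class of bug can be shown with a toy example. The function names are hypothetical and `ENOSPC` is chosen only for illustration; the source does not say which error was involved.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a generic helper that already returns a negative errno. */
static int toy_generic_helper(void)
{
    return -ENOSPC;
}

/* Buggy caller: negating again produces a positive value. */
static int toy_buggy_caller(void)
{
    return -toy_generic_helper();
}

/* Fixed caller: the error is passed through with its sign intact. */
static int toy_fixed_caller(void)
{
    int error = toy_generic_helper();

    assert(error <= 0);
    return error;
}
```

A caller that tests for `error < 0` would treat the positive result of the buggy variant as success.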
     

14 Apr, 2014

1 commit


13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

12 Apr, 2014

1 commit

  • Currently in do_fallocate in collapse range case we're checking
    whether offset + len is not bigger than i_size. However there is
    nothing which would prevent i_size from changing so the check is
    pointless. It should be done in the file system itself and the file
    system needs to make sure that i_size is not going to change. The
    i_size checks for the other fallocate modes are also done in the
    filesystems.

    As it is now we can easily crash the kernel by having two processes
    doing truncate and fallocate collapse range at the same time. This
    can be reproduced on ext4 and it is theoretically possible on xfs even
    though I was not able to trigger it with this simple test.

    This commit removes the check from do_fallocate and adds it to the
    file system.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Acked-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Lukas Czerner
     

08 Apr, 2014

1 commit

  • filemap_map_pages() is a generic implementation of ->map_pages() for
    filesystems that use the page cache.

    It should be safe to use filemap_map_pages() for ->map_pages() if
    the filesystem uses filemap_fault() for ->fault().

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Apr, 2014

1 commit

  • Pull xfs update from Dave Chinner:
    "There are a couple of new fallocate features in this request - it was
    decided that it was easiest to push them through the XFS tree using
    topic branches and have the ext4 support be based on those branches.
    Hence you may see some overlap with the ext4 tree merge depending on
    how they include those topic branches in their tree. Other than
    that, there is O_TMPFILE support, some cleanups and bug fixes.

    The main changes in the XFS tree for 3.15-rc1 are:

    - O_TMPFILE support
    - allowing AIO+DIO writes beyond EOF
    - FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
    implementation
    - FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
    implementation
    - IO verifier cleanup and rework
    - stack usage reduction changes
    - vm_map_ram NOIO context fixes to remove lockdep warnings
    - various bug fixes and cleanups"

    * tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
    xfs: fix directory hash ordering bug
    xfs: extra semi-colon breaks a condition
    xfs: Add support for FALLOC_FL_ZERO_RANGE
    fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    xfs: inode log reservations are still too small
    xfs: xfs_check_page_type buffer checks need help
    xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
    xfs: use NOIO contexts for vm_map_ram
    xfs: don't leak EFSBADCRC to userspace
    xfs: fix directory inode iolock lockdep false positive
    xfs: allocate xfs_da_args to reduce stack footprint
    xfs: always do log forces via the workqueue
    xfs: modify verifiers to differentiate CRC from other errors
    xfs: print useful caller information in xfs_error_report
    xfs: add xfs_verifier_error()
    xfs: add helper for updating checksums on xfs_bufs
    xfs: add helper for verifying checksums on xfs_bufs
    xfs: Use defines for CRC offsets in all cases
    xfs: skip pointless CRC updates after verifier failures
    xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
    ...

    Linus Torvalds
     

02 Apr, 2014

3 commits


13 Mar, 2014

1 commit

  • Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
    functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

    We can also preallocate blocks past EOF in the same way as with
    fallocate. Flag FALLOC_FL_KEEP_SIZE will cause the inode size to remain
    the same even if we preallocate blocks past EOF.

    It uses the same code to zero the range as is used by the
    XFS_IOC_ZERO_RANGE ioctl.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Lukas Czerner
     

24 Feb, 2014

1 commit

  • This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for XFS.

    The semantics of this flag are as follows:
    1) It collapses the range lying between offset and length by removing any data
    blocks which are present in this range and then updates all the logical
    offsets of extents beyond "offset + len" to nullify the hole created by
    removing blocks. In short, it does not leave a hole.
    2) It should be used exclusively. No other fallocate flag in combination.
    3) Offset and length supplied to fallocate should be fs block size aligned
    in case of xfs and ext4.
    4) Collapse range does not work beyond i_size.

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Namjae Jeon
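
The semantics listed above can be modelled on an in-memory buffer. This is a sketch only: `struct toy_file`, `TOY_BLOCK_SIZE` and `toy_collapse_range()` are hypothetical, the block size is tiny to keep the example readable, and rule 4 is read here as "the range must lie within the current file size", which is an assumption about the exact boundary.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 4 /* assumption: tiny block size for illustration */

/* A file modelled as a flat byte buffer (hypothetical). */
struct toy_file {
    unsigned char data[64];
    size_t size;
};

/* Remove [offset, offset + len) and shift the tail down, leaving no hole. */
static int toy_collapse_range(struct toy_file *f, size_t offset, size_t len)
{
    assert(f->size <= sizeof(f->data));

    /* Rule 3: offset and length must be block aligned. */
    if (len == 0 || offset % TOY_BLOCK_SIZE || len % TOY_BLOCK_SIZE)
        return -EINVAL;
    /* Rule 4 (as assumed here): the range may not extend past the size. */
    if (offset + len > f->size)
        return -EINVAL;

    /* Rule 1: data beyond offset + len moves down by len bytes. */
    memmove(f->data + offset, f->data + offset + len,
            f->size - offset - len);
    f->size -= len;
    return 0;
}
```

Collapsing bytes 4..7 of `AAAABBBBCCCC` in this model leaves `AAAACCCC` with the size reduced by 4.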
     

10 Feb, 2014

1 commit

  • It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
    when sync_page_range() had been introduced; generic_file_write{,v}() correctly
    synced
    pos_after_write - written .. pos_after_write - 1
    but generic_file_aio_write() synced
    pos_before_write .. pos_before_write + written - 1
    instead. Which is not the same thing with O_APPEND, obviously.
    A couple of years later correct variant had been killed off when
    everything switched to use of generic_file_aio_write().

    All users of generic_file_aio_write() are affected, and the same bug
    has been copied into other instances of ->aio_write().

    The fix is trivial; the only subtle point is that generic_write_sync()
    ought to be inlined to avoid calculations useless for the majority of
    calls.

    Signed-off-by: Al Viro

    Al Viro
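
The two ranges quoted in the message can be compared with a short calculation. The helper names and `struct sync_range` are hypothetical; the formulas are the ones given above.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive byte range to sync (hypothetical type). */
struct sync_range {
    int64_t start;
    int64_t end;
};

/* Correct: derive the range from the position after the write. */
static struct sync_range sync_range_correct(int64_t pos_after_write,
                                            int64_t written)
{
    struct sync_range r;

    assert(written > 0);
    r.start = pos_after_write - written;
    r.end = pos_after_write - 1;
    return r;
}

/* Buggy: derive the range from the position before the write. */
static struct sync_range sync_range_buggy(int64_t pos_before_write,
                                          int64_t written)
{
    struct sync_range r;

    assert(written > 0);
    r.start = pos_before_write;
    r.end = pos_before_write + written - 1;
    return r;
}
```

For example, with O_APPEND on a 1000-byte file, a 100-byte write issued at position 0 lands at bytes 1000..1099 and leaves the position at 1100. The correct variant syncs 1000..1099, while the buggy variant syncs 0..99 and misses the new data.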
     

25 Jan, 2014

2 commits

  • Some time ago, mkfs.xfs started picking the storage physical
    sector size as the default filesystem "sector size" in order
    to avoid RMW costs incurred by doing IOs at logical sector
    size alignments.

    However, this means that for a filesystem made with e.g.
    a 4k sector size on an "advanced format" 4k/512 disk,
    512-byte direct IOs are no longer allowed. This means
    that XFS has essentially turned this AF drive into a hard
    4K device, from the filesystem on up.

    XFS's mkfs-specified "sector size" is really just controlling
    the minimum size & alignment of filesystem metadata.

    There is no real need to tightly couple XFS's minimal
    metadata size to the minimum allowed direct IO size;
    XFS can continue doing metadata in optimal sizes, but
    still allow smaller DIOs for apps which issue them,
    for whatever reason.

    This patch adds a new field to the xfs_buftarg, so that
    we now track 2 sizes:

    1) The metadata sector size, which is the minimum unit and
    alignment of IO which will be performed by metadata operations.
    2) The device logical sector size

    The first is used internally by the file system for metadata
    alignment and IOs.
    The second is used for the minimum allowed direct IO alignment.

    This has passed xfstests on filesystems made with 4k sectors,
    including when run under the patch I sent to ignore
    XFS_IOC_DIOINFO, and issue 512 DIOs anyway. I also directly
    tested end of block behavior on preallocated, sparse, and
    existing files when we do a 512 IO into a 4k file on a
    4k-sector filesystem, to be sure there were no unexpected
    behaviors.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Eric Sandeen
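
A sketch of the decoupling, under stated assumptions: `struct toy_buftarg`, its field names and `toy_dio_check()` are hypothetical stand-ins, not the actual xfs_buftarg layout. The direct IO alignment check uses the device logical sector size, while the metadata sector size is tracked separately.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* The two sizes tracked per target (hypothetical field names). */
struct toy_buftarg {
    uint32_t meta_sectorsize;    /* minimum unit/alignment of metadata IO */
    uint32_t logical_sectorsize; /* device logical sector size */
};

/* Reject direct IO whose offset or length is not logical-sector aligned. */
static int toy_dio_check(const struct toy_buftarg *bt,
                         uint64_t pos, uint64_t count)
{
    uint64_t mask;

    /* The mask trick requires a non-zero power-of-two sector size. */
    assert(bt->logical_sectorsize != 0);
    assert((bt->logical_sectorsize & (bt->logical_sectorsize - 1)) == 0);

    mask = (uint64_t)bt->logical_sectorsize - 1;
    if ((pos | count) & mask)
        return -EINVAL;
    return 0;
}
```

With a 4k metadata sector size on a 4k/512 drive, a 512-byte aligned direct IO passes this check; checking against the metadata sector size instead would reject it.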
     
  • In preparation for adding new members to the structure,
    give these old ones more descriptive names:

    bt_ssize -> bt_meta_sectorsize
    bt_smask -> bt_meta_sectormask

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Eric Sandeen
     

19 Dec, 2013

2 commits


24 Oct, 2013

1 commit

  • Currently the xfs_inode.h header has a dependency on the definition
    of the BMAP btree records as the inode fork includes an array of
    xfs_bmbt_rec_host_t objects in its definition.

    Move all the btree format definitions from xfs_btree.h,
    xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
    xfs_format.h to continue the process of centralising the on-disk
    format definitions. With this done, the xfs inode definitions are no
    longer dependent on btree header files.

    This enables a massive culling of unnecessary includes, with close to
    200 #include directives removed from the XFS kernel code base.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner