25 Dec, 2016

1 commit


23 Dec, 2016

1 commit


18 Dec, 2016

2 commits

  • …/linux/kernel/git/mszeredi/vfs

    Pull partial readlink cleanups from Miklos Szeredi.

    This is the uncontroversial part of the readlink cleanup patch-set that
    simplifies the default readlink handling.

    Miklos and Al are still discussing the rest of the series.

    * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    vfs: make generic_readlink() static
    vfs: remove ".readlink = generic_readlink" assignments
    vfs: default to generic_readlink()
    vfs: replace calling i_op->readlink with vfs_readlink()
    proc/self: use generic_readlink
    ecryptfs: use vfs_get_link()
    bad_inode: add missing i_op initializers

    Linus Torvalds
     
  • Pull more vfs updates from Al Viro:
    "In this pile:

    - autofs-namespace series
    - dedupe stuff
    - more struct path constification"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
    ocfs2: charge quota for reflinked blocks
    ocfs2: fix bad pointer cast
    ocfs2: always unlock when completing dio writes
    ocfs2: don't eat io errors during _dio_end_io_write
    ocfs2: budget for extent tree splits when adding refcount flag
    ocfs2: prohibit refcounted swapfiles
    ocfs2: add newlines to some error messages
    ocfs2: convert inode refcount test to a helper
    simple_write_end(): don't zero in short copy into uptodate
    exofs: don't mess with simple_write_{begin,end}
    9p: saner ->write_end() on failing copy into non-uptodate page
    fix gfs2_stuffed_write_end() on short copies
    fix ceph_write_end()
    nfs_write_end(): fix handling of short copies
    vfs: refactor clone/dedupe_file_range common functions
    fs: try to clone files first in vfs_copy_file_range
    vfs: misc struct path constification
    namespace.c: constify struct path passed to a bunch of primitives
    quota: constify struct path in quota_on
    ...

    Linus Torvalds
     

15 Dec, 2016

2 commits

  • Pull xfs updates from Dave Chinner:
    "There is quite a varied bunch of stuff in this update, and some of it
    you will have already merged through the ext4 tree which imported the
    dax-4.10-iomap-pmd topic branch from the XFS tree.

    There is also a new direct IO implementation that uses the iomap
    infrastructure. It's much simpler, faster, and has lower IO latency
    than the existing direct IO infrastructure.

    Summary:
    - DAX PMD faults via iomap infrastructure
    - Direct-io support in iomap infrastructure
    - removal of now-redundant XFS inode iolock, replaced with VFS
    i_rwsem
    - synchronisation with fixes and changes in userspace libxfs code
    - extent tree lookup helpers
    - lots of little corruption detection improvements to verifiers
    - optimised CRC calculations
    - faster buffer cache lookups
    - deprecation of barrier/nobarrier mount options - we always use
    REQ_FUA/REQ_FLUSH where appropriate for data integrity now
    - cleanups to speculative preallocation
    - miscellaneous minor bug fixes and cleanups"

    * tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
    xfs: nuke unused tracepoint definitions
    xfs: use GPF_NOFS when allocating btree cursors
    xfs: use xfs_vn_setattr_size to check on new size
    xfs: deprecate barrier/nobarrier mount option
    xfs: Always flush caches when integrity is required
    xfs: ignore leaf attr ichdr.count in verifier during log replay
    xfs: use rhashtable to track buffer cache
    xfs: optimise CRC updates
    xfs: make xfs btree stats less huge
    xfs: don't cap maximum dedupe request length
    xfs: don't allow di_size with high bit set
    xfs: error out if trying to add attrs and anextents > 0
    xfs: don't crash if reading a directory results in an unexpected hole
    xfs: complain if we don't get nextents bmap records
    xfs: check for bogus values in btree block headers
    xfs: forbid AG btrees with level == 0
    xfs: several xattr functions can be void
    xfs: handle cow fork in xfs_bmap_trace_exlist
    xfs: pass state not whichfork to trace_xfs_extlist
    xfs: Move AGI buffer type setting to xfs_read_agi
    ...

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "This merge request includes the dax-4.0-iomap-pmd branch which is
    needed for both ext4 and xfs dax changes to use iomap for DAX. It also
    includes the fscrypt branch which is needed for ubifs encryption work
    as well as ext4 encryption and fscrypt cleanups.

    Lots of cleanups and bug fixes, especially making sure ext4 is robust
    against maliciously corrupted file systems --- especially maliciously
    corrupted xattr blocks and a maliciously corrupted superblock. Also
    fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
    mbcache so we don't miss some common xattr blocks that can be merged"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    dax: Fix sleep in atomic contex in grab_mapping_entry()
    fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
    fscrypt: Delay bounce page pool allocation until needed
    fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
    fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
    fscrypt: Never allocate fscrypt_ctx on in-place encryption
    fscrypt: Use correct index in decrypt path.
    fscrypt: move the policy flags and encryption mode definitions to uapi header
    fscrypt: move non-public structures and constants to fscrypt_private.h
    fscrypt: unexport fscrypt_initialize()
    fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
    fscrypto: move ioctl processing more fully into common code
    fscrypto: remove unneeded Kconfig dependencies
    MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
    ext4: do not perform data journaling when data is encrypted
    ext4: return -ENOMEM instead of success
    ext4: reject inodes with negative size
    ext4: remove another test in ext4_alloc_file_blocks()
    Documentation: fix description of ext4's block_validity mount option
    ext4: fix checks for data=ordered and journal_async_commit options
    ...

    Linus Torvalds
     

14 Dec, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request this series. Contrary to previous
    release, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request is:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support from SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Supoort for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

10 Dec, 2016

2 commits

  • Hoist both the XFS reflink inode state and preparation code and the XFS
    file blocks compare functions into the VFS so that ocfs2 can take
    advantage of it for reflink and dedupe.

    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     
  • A clone is a perfectly fine implementation of a file copy, so most
    file systems just implement the copy that way. Instead of duplicating
    this logic move it to the VFS. Currently btrfs and XFS implement copies
    the same way as clones and there is no behavior change for them, cifs
    only implements clones and grow support for copy_file_range with this
    patch. NFS implements both, so this will allow copy_file_range to work
    on servers that only implement CLONE and be lot more efficient on servers
    that implements CLONE and COPY.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

09 Dec, 2016

9 commits

  • If .readlink == NULL implies generic_readlink().

    Generated by:

    to_del="\.readlink.*=.*generic_readlink"
    for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Also check d_is_symlink() in callers instead of inode->i_op->readlink
    because following patches will allow NULL ->readlink for symlinks.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Dave Chinner
     
  • This is all unused code, so remove it.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • Use NOFS for allocating btree cursors, since they can be called
    under the ilock.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • Commit 6552321831dc ("xfs: remove i_iolock and use i_rwsem in the
    VFS inode instead") introduced a regression that truncate(2) doesn't
    check on new size, so it succeeds even if the new size exceeds the
    current resource limit. Because xfs_setattr_size() was used instead
    of xfs_vn_setattr_size(), and the latter calls xfs_vn_change_ok()
    first to do sanity check on permission and new size.

    This is found by truncate03 test from ltp, and the following is a
    simplified reproducer:

    #!/bin/bash
    dev=/dev/sda5
    mnt=/mnt/xfs

    mkfs -t xfs -f $dev
    mount $dev $mnt

    # set max file size to 16k
    ulimit -f 16
    truncate -s $((16 * 1024 + 1)) /mnt/xfs/testfile
    [ $? -eq 0 ] && echo "FAIL: truncate exceeded max file size"
    ulimit -f unlimited
    umount $mnt

    Signed-off-by: Eryu Guan
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eryu Guan
     
  • We always perform integrity operations now, so these mount options
    don't do anything. Deprecate them and mark them for removal in
    in a year.

    Signed-Off-By: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • There is no reason anymore for not issuing device integrity
    operations when teh filesystem requires ordering or data integrity
    guarantees. We should always issue cache flushes and FUA writes
    where necessary and let the underlying storage optimise them as
    necessary for correct integrity operation.

    Signed-Off-By: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • When we create a new attribute, we first create a shortform
    attribute, and try to fit the new attribute into it.
    If that fails, we copy the (empty) attribute into a leaf attribute,
    and do the copy again. Thus there can be a transient state where
    we have an empty leaf attribute.

    If we encounter this during log replay, the verifier will fail.
    So add a test to ignore this part of the leaf attr verification
    during log replay.

    Thanks as usual to dchinner for spotting the problem.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

07 Dec, 2016

2 commits

  • Dave Chinner
     
  • On filesystems with a lot of metadata and in metadata intensive workloads
    xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
    the CPU time is spent on CPU cache misses while traversing the rbtree.

    As the buffer cache does not need any kind of ordering, but fast lookups
    a hashtable is the natural data structure to use. The rhashtable
    infrastructure provides a self-scaling hashtable implementation and
    allows lookups to proceed while the table is going through a resize
    operation.

    This reduces the CPU-time spent for the lookups to 1/3 even for small
    filesystems with a relatively small number of cached buffers, with
    possibly much larger gains on higher loaded filesystems.

    [dchinner: reduce minimum hash size to an acceptable size for large
    filesystems with many AGs with no active use.]
    [dchinner: remove stale rbtree asserts.]
    [dchinner: use xfs_buf_map for compare function argument.]
    [dchinner: make functions static.]
    [dchinner: remove redundant comments.]

    Signed-off-by: Lucas Stach
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Lucas Stach
     

05 Dec, 2016

14 commits

  • Nick Piggin reported that the CRC overhead in an fsync heavy
    workload was higher than expected on a Power8 machine. Part of this
    was to do with the fact that the power8 CRC implementation is not
    efficient for CRC lengths of less than 512 bytes, and so the way we
    split the CRCs over the CRC field means a lot of the CRCs are
    reduced to being less than than optimal size.

    To optimise this, change the CRC update mechanism to zero the CRC
    field first, and then compute the CRC in one pass over the buffer
    and write the result back into the buffer. We can do this safely
    because anything writing a CRC has exclusive access to the buffer
    the CRC is being calculated over.

    We leave the CRC verify code the same - it still splits the CRC
    calculation - because we do not want read-only operations modifying
    the underlying buffer. This is because read-only operations may not
    have an exclusive access to the buffer guaranteed, and so temporary
    modifications could leak out to to other processes accessing the
    buffer concurrently.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Embedding a switch statement in every btree stats inc/add adds a lot
    of code overhead to the core btree infrastructure paths. Stats are
    supposed to be small and lightweight, but the btree stats have
    become big and bloated as we've added more btrees. It needs fixing
    because the reflink code will just add more overhead again.

    Convert the v2 btree stats to arrays instead of independent
    variables, and instead use the type to index the specific btree
    array via an enum. This allows us to use array based indexing
    to update the stats, rather than having to derefence variables
    specific to the btree type.

    If we then wrap the xfsstats structure in a union and place uint32_t
    array beside it, and calculate the correct btree stats array base
    array index when creating a btree cursor, we can easily access
    entries in the stats structure without having to switch names based
    on the btree type.

    We then replace with the switch statement with a simple set of stats
    wrapper macros, resulting in a significant simplification of the
    btree stats code, and:

    text data bss dec hex filename
    48905 144 8 49057 bfa1 fs/xfs/libxfs/xfs_btree.o.old
    36793 144 8 36945 9051 fs/xfs/libxfs/xfs_btree.o

    it reduces the core btree infrastructure code size by close to 25%!

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • After various discussions on linux-fsdevel, it has been decided that it
    is not necessary to cap the length of a dedupe request, and that
    correctly-written userspace client programs will be able to absorb the
    change. Therefore, remove the length clamping behavior.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • The on-disk field di_size is used to set i_size, which is a signed
    integer of loff_t. If the high bit of di_size is set, we'll end up with
    a negative i_size, which will cause all sorts of problems. Since the
    VFS won't let us create a file with such length, we should catch them
    here in the verifier too.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • We shouldn't assert if somehow we end up trying to add an attr fork to
    an inode that apparently already has attr extents because this is an
    indication of on-disk corruption. Instead, return an error code to
    userspace.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • In xfs_dir3_data_read, we can encounter the situation where err == 0 and
    *bpp == NULL if the given bno offset happens to be a hole; this leads to
    a crash if we try to set the buffer type after the _da_read_buf call.
    Holes can happen due to corrupt or malicious entries in the bmbt data,
    so be a little more careful when we're handling buffers.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When reading into memory all extents of a btree-format inode fork,
    complain if the number of extents we find is not the same as the number
    of extents reported in the inode core. This is needed to stop an IO
    action from accessing the garbage areas of the in-core fork.

    [dchinner: removed redundant assert]

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • When we're reading a btree block, make sure that what we retrieved
    matches the owner and level; and has a plausible number of records.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • There is no such thing as a zero-level AG btree since even a single-node
    zero-records btree has one level. Btree cursor constructors read
    cur_nlevels straight from disk and then access things like
    cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero!
    Therefore, strengthen the verifiers to prevent this possibility.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Darrick J. Wong
     
  • There are a handful of xattr functions which now return
    nothing but zero. They can be made void, chased through calling
    functions, and error handling etc can be removed.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • By inspection, xfs_bmap_trace_exlist isn't handling cow forks,
    and will trace the data fork instead.

    Fix this by setting state appropriately if whichfork
    == XFS_COW_FORK.

    ()___()
    < @ @ >
    | |
    {o_o}
    (|)

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • When xfs_bmap_trace_exlist called trace_xfs_extlist,
    it sent in the "whichfork" var instead of the bmap "state"
    as expected (even though state was already set up for this
    purpose).

    As a result, the xfs_bmap_class in tracing code used
    "whichfork" not state in xfs_iext_state_to_fork(), and got
    the wrong ifork pointer. It all goes downhill from
    there, including an ASSERT when ifp_bytes is empty
    by the time it reaches xfs_iext_get_ext():

    XFS: Assertion failed: idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t)

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • We've missed properly setting the buffer type for
    an AGI transaction in 3 spots now, so just move it
    into xfs_read_agi() and set it if we are in a transaction
    to avoid the problem in the future.

    This is similar to how it is done in i.e. the dir3
    and attr3 read functions.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • xlog_recover_clear_agi_bucket didn't set the
    type to XFS_BLFT_AGI_BUF, so we got a warning during log
    replay (or an ASSERT on a debug build).

    XFS (md0): Unknown buffer type 0!
    XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1

    Fix this, as was done in f19b872b for 2 other locations
    with the same problem.

    cc: # 3.10 to current
    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

30 Nov, 2016

4 commits

  • Dave Chinner
     
  • Straight switch over to using iomap for direct I/O - we already have the
    non-COW dio path in write_begin for DAX and files with extent size hints,
    so nothing to add there. The COW path is ported over from the old
    get_blocks version and a bit of a mess, but I have some work in progress
    to make it look more like the buffered I/O COW path.

    This gets rid of xfs_get_blocks_direct and the last caller of
    xfs_get_blocks with the create flag set, so all that code can be removed.

    Last but not least I've removed a comment in xfs_filemap_fault that
    refers to xfs_get_blocks entirely instead of updating it - while the
    reference is correct, the whole DAX fault path looks different than
    the non-DAX one, so it seems rather pointless.

    Signed-off-by: Christoph Hellwig
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • This patch drops the XFS-own i_iolock and uses the VFS i_rwsem which
    recently replaced i_mutex instead. This means we only have to take
    one lock instead of two in many fast path operations, and we can
    also shrink the xfs_inode structure. Thanks to the xfs_ilock family
    there is very little churn, the only thing of note is that we need
    to switch to use the lock_two_directory helper for taking the i_rwsem
    on two inodes in a few places to make sure our lock order matches
    the one used in the VFS.

    Signed-off-by: Christoph Hellwig
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     
  • Dave Chinner
     

28 Nov, 2016

2 commits