06 Oct, 2016

2 commits

  • If the admin doesn't set a CoW extent size or a regular extent size
    hint, default to creating CoW reservations 32 blocks long to reduce
    fragmentation.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Create a per-inode extent size allocator hint for copy-on-write. This
    hint is separate from the existing extent size hint so that CoW can
    take advantage of the fragmentation-reducing properties of extent size
    hints without disabling delalloc for regular writes.

    The extent size hint that's fed to the allocator during a copy on
    write operation is the greater of the cowextsize and regular extsize
    hint.
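
    A minimal sketch of the hint selection described above, assuming a
    hint value of 0 means "not set" and using the 32-block default from
    the follow-up commit. The function and macro names are illustrative,
    not the kernel's:

```c
#include <stdint.h>

/* Default CoW reservation length, in blocks, when no hint is set. */
#define DEMO_DEFAULT_COWEXTSZ_HINT 32

/*
 * Pick the extent size hint (in blocks) fed to the allocator for a CoW
 * operation: the larger of the two hints, or the default if neither is
 * set. A hint of 0 is treated as "not set" (an assumption of this sketch).
 */
static uint32_t demo_cow_extsize_hint(uint32_t extsize, uint32_t cowextsize)
{
    uint32_t hint = cowextsize > extsize ? cowextsize : extsize;

    return hint ? hint : DEMO_DEFAULT_COWEXTSZ_HINT;
}
```

    For example, demo_cow_extsize_hint(16, 64) yields 64, and
    demo_cow_extsize_hint(0, 0) falls back to 32.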

    During reflink, if we're sharing the entire source file to the entire
    destination file and the destination file doesn't already have a
    cowextsize hint, propagate the source file's cowextsize hint to the
    destination file.
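
    The propagation rule can be sketched roughly as follows. The struct
    and function are hypothetical simplifications, and the whole-file
    test shown here is an assumption about how "entire file" is checked:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified inode carrying only the fields used here. */
struct demo_reflink_inode {
    uint64_t size;        /* file size in bytes */
    uint32_t cowextsize;  /* CoW extent size hint in blocks, 0 = unset */
};

/*
 * After sharing len bytes of src (from src_off) into dest (at dest_off),
 * copy the CoW hint only for a whole-file to whole-file share where dest
 * has no hint of its own. Returns true if the hint was propagated.
 */
static bool demo_propagate_cowextsize(const struct demo_reflink_inode *src,
                                      uint64_t src_off,
                                      struct demo_reflink_inode *dest,
                                      uint64_t dest_off, uint64_t len)
{
    bool whole_file = src_off == 0 && dest_off == 0 &&
                      len >= src->size && len >= dest->size;

    if (!whole_file || dest->cowextsize != 0)
        return false;
    dest->cowextsize = src->cowextsize;
    return true;
}
```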

    Furthermore, zero the bulkstat buffer prior to setting the fields
    so that we don't copy kernel memory contents into userspace.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

05 Oct, 2016

2 commits

  • Introduce a new in-core fork for storing copy-on-write delalloc
    reservations and allocated extents that are in the process of being
    written out.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Log recovery will iget an inode to replay BUI items and iput the inode
    when it's done. Unfortunately, if the inode was unlinked, the iput
    will see that i_nlink == 0 and decide to truncate & free the inode,
    which prevents us from replaying subsequent BUIs. We can't skip the
    BUIs because we have to replay all the redo items to ensure that
    atomic operations complete.

    Since unlinked inode recovery will reap the inode anyway, we can
    safely introduce a new inode flag to indicate that an inode is in this
    'unlinked recovery' state and should not be auto-reaped in the
    drop_inode path.
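
    As a rough illustration of the drop_inode decision, assuming a
    hypothetical flag bit and a stripped-down inode (none of these names
    come from the patch itself):

```c
#include <stdbool.h>

#define DEMO_IRECOVERY (1u << 0)  /* hypothetical "unlinked recovery" flag */

/* Hypothetical inode reduced to the two fields the decision needs. */
struct demo_recov_inode {
    unsigned int nlink;
    unsigned int flags;
};

/*
 * Decide whether the final iput should truncate and free the inode.
 * An unlinked inode is normally reaped at this point, unless log
 * recovery has marked it so that later BUI replay can still use it.
 */
static bool demo_drop_inode(const struct demo_recov_inode *ip)
{
    if (ip->flags & DEMO_IRECOVERY)
        return false;
    return ip->nlink == 0;
}
```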

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

04 Oct, 2016

1 commit


19 Sep, 2016

1 commit


03 Aug, 2016

1 commit


20 Jul, 2016

2 commits


21 Jun, 2016

2 commits


01 Jun, 2016

1 commit

  • Al Viro noticed that xfs_lock_inodes should be static, and
    that led to ... a few more.

    These are just the easy ones, others require moving functions
    higher in source files, so that's not done here to keep
    this review simple.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

06 Apr, 2016

1 commit

  • In the next patch we'll set up different inode operations for inline vs
    out of line symlinks. For that we need to make sure the flags are already
    set up properly.

    [dchinner: added xfs_setup_iops() call to xfs_rename_alloc_whiteout()]

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

07 Mar, 2016

1 commit


09 Feb, 2016

3 commits

  • Move the di_mode value from the xfs_icdinode to the VFS inode, reducing
    the xfs_icdinode by another 2 bytes and collapsing another 2 byte hole
    in the structure.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The VFS tracks the inode nlink just like the xfs_icdinode. We can
    remove the variable from the icdinode and use the VFS inode variable
    everywhere, reducing the size of the xfs_icdinode by a further 4
    bytes.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • We currently carry around and log an entire inode core in the
    struct xfs_inode. A lot of the information in the inode core is
    duplicated in the VFS inode, but we cannot remove this duplication
    of information because the inode core is logged directly in
    xfs_inode_item_format().

    Add a new function xfs_inode_item_format_core() that copies the
    inode core data into a struct xfs_icdinode that is pulled directly
    from the log vector buffer. This means we no longer directly
    copy the inode core, but copy the structures one member at a time.
    This will be slightly less efficient than copying, but will allow us
    to remove duplicate and unnecessary items from the struct xfs_inode.

    To enable us to do this, call the new structure a xfs_log_dinode,
    so that we know it's different to the physical xfs_dinode and the
    in-core xfs_icdinode.
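
    To illustrate why member-by-member formatting helps, here is a
    simplified sketch in which some fields already live in the VFS inode.
    All struct layouts and names are hypothetical and far smaller than
    the real ones:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sources: some fields in the VFS inode, some in the core. */
struct demo_vfs_inode { uint16_t mode; uint32_t nlink; };
struct demo_icdinode  { uint64_t size; uint32_t extsize; };

/* Hypothetical log-format inode core, laid out independently of both. */
struct demo_log_dinode {
    uint16_t mode;
    uint32_t nlink;
    uint64_t size;
    uint32_t extsize;
};

/*
 * Fill the log structure one member at a time instead of with a single
 * memcpy() of the in-core structure, so the sources are free to differ
 * from the logged layout.
 */
static void demo_format_core(const struct demo_vfs_inode *vi,
                             const struct demo_icdinode *ic,
                             struct demo_log_dinode *to)
{
    memset(to, 0, sizeof(*to));  /* leave no stale bytes in padding */
    to->mode = vi->mode;
    to->nlink = vi->nlink;
    to->size = ic->size;
    to->extsize = ic->extsize;
}
```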

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

08 Feb, 2016

1 commit

  • Factor xfs_seek_hole_data into an unlocked helper which takes
    an xfs inode rather than a file for internal use.

    Also allow specification of "end" - the vfs lseek interface is
    defined such that any offset past eof/i_size shall return -ENXIO,
    but we will use this for quota code which does not maintain i_size,
    and we want to be able to SEEK_DATA past i_size as well. So the
    lseek path can send in i_size, and the quota code can determine
    its own ending offset.
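
    A small user-space sketch of the "end" parameter, assuming data is
    described by a sorted array of extents. The types and the function
    are illustrative only and do not mirror the kernel helper:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_extent { int64_t start; int64_t len; };  /* one run of data */

/*
 * Return the first offset >= start that holds data, searching only up
 * to (but not including) end. extents[] must be sorted and must not
 * overlap. Returns -ENXIO if start is at or past end, or if no data
 * lies in [start, end).
 */
static int64_t demo_seek_data(const struct demo_extent *extents, size_t n,
                              int64_t start, int64_t end)
{
    if (start >= end)
        return -ENXIO;
    for (size_t i = 0; i < n; i++) {
        int64_t ext_end = extents[i].start + extents[i].len;

        if (ext_end <= start)
            continue;            /* extent lies entirely before start */
        if (extents[i].start >= end)
            break;               /* extent lies entirely past end */
        return extents[i].start > start ? extents[i].start : start;
    }
    return -ENXIO;
}
```

    With data at offset 16384 and an i_size of 4096, passing end = 4096
    (the lseek case) returns -ENXIO for a start of 8192, while a caller
    that supplies a larger end of its own finds the data at 16384.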

    Signed-off-by: Eric Sandeen
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Eric Sandeen
     

19 Aug, 2015

1 commit

  • Lockdep annotations are a maintenance nightmare. Locking has to be
    modified to suit the limitations of the annotations, and we're
    always having to fix the annotations because they are unable to
    express the complexity of locking hierarchies correctly.

    So, next up, we've got more issues with lockdep annotations for
    inode locking w.r.t. XFS_LOCK_PARENT:

    - lockdep classes are exclusive and can't be ORed together
      to form new classes.
    - IOLOCK needs multiple PARENT subclasses to express the
      changes needed for the readdir locking rework needed to
      stop the endless flow of lockdep false positives involving
      readdir calling filldir under the ILOCK.
    - there are only 8 unique lockdep subclasses available,
      so we can't create a generic solution.

    IOWs we need to treat the 3-bit space available to each lock type
    differently:

    - IOLOCK uses xfs_lock_two_inodes(), so needs:
        - at least 2 IOLOCK subclasses
        - at least 2 IOLOCK_PARENT subclasses
    - MMAPLOCK uses xfs_lock_two_inodes(), so needs:
        - at least 2 MMAPLOCK subclasses
    - ILOCK uses xfs_lock_inodes with up to 5 inodes, so needs:
        - at least 5 ILOCK subclasses
        - one ILOCK_PARENT subclass
        - one RTBITMAP subclass
        - one RTSUM subclass

    For the IOLOCK, split the space into two sets of subclasses.
    For the MMAPLOCK, just use half the space for the one subclass to
    match the non-parent lock classes of the IOLOCK.
    For the ILOCK, use 0-4 as the ILOCK subclasses, 5-7 for the
    remaining individual subclasses.

    Because they are now all different, modify xfs_lock_inumorder() to
    handle the nested subclasses, and to assert fail if passed an
    invalid subclass. Further, annotate xfs_lock_inodes() to assert fail
    if an invalid combination of lock primitives and inode counts are
    passed that would result in a lockdep subclass annotation overflow.
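
    The partitioning of the 3-bit subclass space can be sketched as
    below. The macro names, the exact values chosen for the parent and
    realtime subclasses within 5-7, and the helper itself are assumptions
    made for illustration:

```c
#include <assert.h>

enum demo_lock_type { DEMO_IOLOCK, DEMO_MMAPLOCK, DEMO_ILOCK };

/* Each lock type has a 3-bit lockdep subclass space: values 0..7. */
#define DEMO_IOLOCK_MAX_SUBCLASS   3  /* 0-3 plain, 4-7 parent */
#define DEMO_IOLOCK_PARENT_VAL     4
#define DEMO_MMAPLOCK_MAX_SUBCLASS 3  /* 0-3 only, no parent class */
#define DEMO_ILOCK_MAX_SUBCLASS    4  /* 0-4, for up to 5 inodes */
#define DEMO_ILOCK_PARENT_VAL      5  /* assumed position within 5-7 */
#define DEMO_ILOCK_RTBITMAP_VAL    6  /* assumed position within 5-7 */
#define DEMO_ILOCK_RTSUM_VAL       7  /* assumed position within 5-7 */

/*
 * Map an inode's position in a multi-inode lock call to a lockdep
 * subclass, asserting on combinations that would overflow the space.
 */
static int demo_lock_subclass(enum demo_lock_type type, int order, int parent)
{
    switch (type) {
    case DEMO_IOLOCK:
        assert(order >= 0 && order <= DEMO_IOLOCK_MAX_SUBCLASS);
        return order + (parent ? DEMO_IOLOCK_PARENT_VAL : 0);
    case DEMO_MMAPLOCK:
        assert(!parent);
        assert(order >= 0 && order <= DEMO_MMAPLOCK_MAX_SUBCLASS);
        return order;
    case DEMO_ILOCK:
        if (parent) {
            assert(order == 0);  /* only one ILOCK_PARENT subclass */
            return DEMO_ILOCK_PARENT_VAL;
        }
        assert(order >= 0 && order <= DEMO_ILOCK_MAX_SUBCLASS);
        return order;
    }
    assert(0);  /* unknown lock type */
    return -1;
}
```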

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

24 Feb, 2015

2 commits


23 Feb, 2015

3 commits

  • Al Viro noticed a generic set of issues to do with filehandle lookup
    racing with dentry cache setup. They involve a filehandle lookup
    occurring while an inode is being created and the filehandle lookup
    racing with the dentry creation for the real file. This can lead to
    multiple dentries for the one path being instantiated. There are a
    host of other issues around this same set of paths.

    The underlying cause is that file handle lookup only waits on inode
    cache instantiation rather than full dentry cache instantiation. XFS
    is mostly immune to the problems discovered due to its own internal
    inode cache, but there are a couple of corner cases where races can
    happen.

    We currently clear the XFS_INEW flag when the inode is fully set up
    after insertion into the cache. Newly allocated inodes are inserted
    locked and so aren't usable until the allocation transaction
    commits. This, however, occurs before the dentry and security
    information is fully initialised and hence the inode is unlocked and
    available for lookups to find too early.

    To solve the problem, only clear the XFS_INEW flag for newly created
    inodes once the dentry is fully instantiated. This means lookups
    will retry until the XFS_INEW flag is removed from the inode and
    hence avoids the race conditions in question.
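
    The retry behaviour can be pictured with a sketch like this one,
    where the flag bit, the struct and both helpers are hypothetical
    stand-ins rather than the kernel's own code:

```c
#include <errno.h>

#define DEMO_INEW (1u << 0)  /* inode not yet fully instantiated */

struct demo_cached_inode { unsigned int flags; };

/* Cache-hit path: refuse inodes still being set up so the caller retries. */
static int demo_cache_hit(const struct demo_cached_inode *ip)
{
    return (ip->flags & DEMO_INEW) ? -EAGAIN : 0;
}

/* Called by the create path only once the dentry is fully instantiated. */
static void demo_finish_inode_setup(struct demo_cached_inode *ip)
{
    ip->flags &= ~DEMO_INEW;
}
```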

    This also means that xfs_create(), xfs_create_tmpfile() and
    xfs_symlink() need to finish the setup of the inode in their error
    paths if we had allocated the inode but failed later in the creation
    process. xfs_symlink(), in particular, needed a lot of help to make
    its error handling match that of xfs_create().

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • A new fsync vs power fail test in xfstests indicated that XFS can
    have unreliable data consistency when doing extending truncates that
    require block zeroing. The blocks beyond EOF get zeroed in memory,
    but we never force those changes to disk before we run the
    transaction that extends the file size and exposes those blocks to
    userspace. This can result in the blocks not being correctly zeroed
    after a crash.

    Because in-memory behaviour is correct, tools like fsx don't pick up
    any coherency problems - it's not until the filesystem is shutdown
    or the system crashes after writing the truncate transaction to the
    journal but before the zeroed data in the page cache is flushed that
    the issue is exposed.

    Fix this by also flushing the dirty in-memory data in the region between
    the old size and new size when we've found blocks that need zeroing
    in the truncate process.

    Reported-by: Liu Bo
    cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Right now we cannot serialise mmap against truncate or hole punch
    sanely. ->page_mkwrite is not able to take locks that the read IO
    path normally takes (i.e. the inode iolock) because that could
    result in lock inversions (read - iolock - page fault - page_mkwrite
    - iolock) and so we cannot use an IO path lock to serialise page
    write faults against truncate operations.

    Instead, introduce a new lock that is used *only* in the
    ->page_mkwrite path that is the equivalent of the iolock. The lock
    ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
    and so in truncate we can i_iolock -> i_mmaplock and so lock out
    new write faults during the process of truncation.

    Because i_mmaplock is outside the page lock, we can hold it across
    all the same operations we hold the i_iolock for. The only
    difference is that we never hold the i_mmaplock in the normal IO
    path and so do not ever have the possibility that we can page fault
    inside it. Hence there are no recursion issues on the i_mmaplock
    and so we can use it to serialise page fault IO against inode
    modification operations that affect the IO path.

    This patch introduces the i_mmaplock infrastructure, lockdep
    annotations and initialisation/destruction code. Use of the new lock
    will be in subsequent patches.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

02 Feb, 2015

2 commits


24 Dec, 2014

1 commit


28 Nov, 2014

1 commit

  • More consolidation for the on-disk format definitions. Note that the
    XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related
    to the on disk format, but depends on a CONFIG_ option.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

02 Oct, 2014

1 commit

  • If we write to the maximum file offset (2^63-2), XFS fails to log the
    inode size update when the page is flushed. For example:

    $ xfs_io -fc "pwrite `echo "2^63-1-1" | bc` 1" /mnt/file
    wrote 1/1 bytes at offset 9223372036854775806
    1.000000 bytes, 1 ops; 0.0000 sec (22.711 KiB/sec and 23255.8140 ops/sec)
    $ stat -c %s /mnt/file
    9223372036854775807
    $ umount /mnt ; mount /mnt/
    $ stat -c %s /mnt/file
    0

    This occurs because XFS calculates the new file size as io_offset +
    io_size, I/O occurs in block sized requests, and the maximum supported
    file size is not block aligned. Therefore, a write to the max allowable
    offset on a 4k blocksize fs results in a write of size 4k to offset
    2^63-4096 (e.g., equivalent to round_down(2^63-1, 4096), or IOW the
    offset of the block that contains the max file size). The offset plus
    size calculation (2^63 - 4096 + 4096 == 2^63) overflows the signed
    64-bit variable which goes negative and causes the > comparison to the
    on-disk inode size to fail. This returns 0 from xfs_new_eof() and
    results in no change to the inode on-disk.

    Update xfs_new_eof() to explicitly detect overflow of the local
    calculation and use the VFS inode size in this scenario. The VFS inode
    size is capped to the maximum and thus XFS writes the correct inode size
    to disk.
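
    The overflow case can be reproduced in a stand-alone sketch. This
    version checks for overflow before adding (signed overflow is
    undefined in ISO C), which is an assumption of the sketch and may
    differ from how the kernel function detects it. The function name
    and parameters are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the new on-disk size after an I/O completes, or 0 if the
 * on-disk size does not need to change. If io_offset + io_size would
 * overflow a signed 64-bit value, or would exceed the VFS inode size,
 * fall back to the VFS inode size, which is capped at the maximum.
 */
static int64_t demo_new_eof(int64_t io_offset, int64_t io_size,
                            int64_t vfs_size, int64_t disk_size)
{
    int64_t new_size;

    assert(io_offset >= 0 && io_size >= 0);
    if (io_size > INT64_MAX - io_offset) {
        new_size = vfs_size;           /* addition would overflow */
    } else {
        new_size = io_offset + io_size;
        if (new_size > vfs_size)
            new_size = vfs_size;
    }
    return new_size > disk_size ? new_size : 0;
}
```

    For the 4k block case above, demo_new_eof(INT64_MAX - 4095, 4096,
    INT64_MAX, 0) takes the overflow branch and returns INT64_MAX, i.e.
    2^63-1, rather than 0.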

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
     

04 Aug, 2014

1 commit

  • Move the IO flag definitions to xfs_inode.h and kill the header file
    as it is now empty.

    Removing the xfs_vnode.h file showed up an implicit header include
    path:
    xfs_linux.h -> xfs_vnode.h -> xfs_fs.h

    And so every xfs header file has been implicitly including
    xfs_fs.h whether it is needed or not. Hence the removal of xfs_vnode.h
    causes all sorts of build issues because BBTOB() and friends are no
    longer automatically included in the build. This also gets fixed.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

20 May, 2014

2 commits

  • Conflicts:
    fs/xfs/xfs_inode.c

    Dave Chinner
     
  • mkfs has turned on the XFS_SB_VERSION_NLINKBIT feature bit by
    default since November 2007. It's about time we simply made the
    kernel code turn it on by default and so always convert v1 inodes to
    v2 inodes when reading them in from disk or allocating them. This
    removes needless version checks and modification when bumping
    link counts on inodes, and will take code out of a few common code
    paths.
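
    A sketch of the unconditional upgrade, using a hypothetical
    three-field inode in which the v1 link count is 16 bits wide and the
    v2 link count is 32 bits wide:

```c
#include <stdint.h>

/* Hypothetical subset of the inode fields involved in the conversion. */
struct demo_dinode {
    uint8_t  version;  /* 1 or 2 */
    uint16_t onlink;   /* v1: 16-bit link count */
    uint32_t nlink;    /* v2: 32-bit link count */
};

/* Upgrade a v1 inode in memory as it is read in; v2 inodes pass through. */
static void demo_upgrade_v1(struct demo_dinode *dip)
{
    if (dip->version != 1)
        return;
    dip->nlink = dip->onlink;
    dip->onlink = 0;
    dip->version = 2;
}
```

    Once every in-memory inode is v2, code that bumps the link count no
    longer needs to check the version first.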

       text    data     bss     dec     hex filename
     783251  100867     616  884734   d7ffe fs/xfs/xfs.o.orig
     782664  100867     616  884147   d7db3 fs/xfs/xfs.o.patched

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

15 May, 2014

1 commit


23 Apr, 2014

1 commit

  • We never test the flag except in xfs_inode_is_filestream, but that
    function already tests the on-disk flag or filesystem wide flags,
    and is used to decide if we want to set XFS_IFILESTREAM in the
    first place.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig
     

17 Apr, 2014

1 commit

  • xfstests generic/004 reproduces an ilock deadlock using the tmpfile
    interface when selinux is enabled. This occurs because
    xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
    latter eventually calls into xfs_xattr_get() which attempts to get the
    lock again. E.g.:

    xfs_io D ffffffff81c134c0 4096 3561 3560 0x00000080
    ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
    00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
    ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
    Call Trace:
    [] schedule+0x29/0x70
    [] rwsem_down_read_failed+0xc5/0x120
    [] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
    [] call_rwsem_down_read_failed+0x14/0x30
    [] ? down_read_nested+0x89/0xa0
    [] ? xfs_ilock+0x122/0x250 [xfs]
    [] xfs_ilock+0x122/0x250 [xfs]
    [] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
    [] xfs_attr_get+0x90/0xe0 [xfs]
    [] xfs_xattr_get+0x37/0x50 [xfs]
    [] generic_getxattr+0x4f/0x70
    [] inode_doinit_with_dentry+0x1ae/0x650
    [] selinux_d_instantiate+0x1c/0x20
    [] security_d_instantiate+0x1b/0x30
    [] d_instantiate+0x50/0x70
    [] d_tmpfile+0xb5/0xc0
    [] xfs_create_tmpfile+0x362/0x410 [xfs]
    [] xfs_vn_tmpfile+0x18/0x20 [xfs]
    [] path_openat+0x228/0x6a0
    [] ? sched_clock+0x9/0x10
    [] ? kvm_clock_read+0x27/0x40
    [] ? __alloc_fd+0xaf/0x1f0
    [] do_filp_open+0x3a/0x90
    [] ? _raw_spin_unlock+0x27/0x40
    [] ? __alloc_fd+0xaf/0x1f0
    [] do_sys_open+0x12e/0x210
    [] SyS_open+0x1e/0x20
    [] system_call_fastpath+0x16/0x1b

    xfs_vn_tmpfile() also fails to initialize security on the newly created
    inode.

    Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
    has been committed and the inode unlocked. Also, initialize security on
    the inode based on the parent directory provided via the tmpfile call.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
     

13 Mar, 2014

1 commit


07 Jan, 2014

2 commits

  • Add two functions xfs_create_tmpfile() and xfs_vn_tmpfile()
    to support O_TMPFILE file creation.

    Unlike xfs_create(), xfs_create_tmpfile() uses a different log
    reservation from regular file creation because there is no directory
    modification, and it does not check whether an entry can be added to
    the directory. The quota reservation is still required, and finally
    the inode is added to the unlinked list.

    xfs_vn_tmpfile() adds the O_TMPFILE method to the VFS interface and
    directly invokes xfs_create_tmpfile().

    Signed-off-by: Zhi Yong Wu
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers

    Zhi Yong Wu
     
  • It will be reused by the O_TMPFILE creation function.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Zhi Yong Wu
    Signed-off-by: Ben Myers

    Zhi Yong Wu
     

19 Dec, 2013

2 commits