17 Apr, 2014

6 commits

  • xfstests generic/004 reproduces an ilock deadlock using the tmpfile
    interface when selinux is enabled. This occurs because
    xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
    latter eventually calls into xfs_xattr_get() which attempts to get the
    lock again. E.g.:

    xfs_io D ffffffff81c134c0 4096 3561 3560 0x00000080
    ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
    00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
    ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
    Call Trace:
    [] schedule+0x29/0x70
    [] rwsem_down_read_failed+0xc5/0x120
    [] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
    [] call_rwsem_down_read_failed+0x14/0x30
    [] ? down_read_nested+0x89/0xa0
    [] ? xfs_ilock+0x122/0x250 [xfs]
    [] xfs_ilock+0x122/0x250 [xfs]
    [] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
    [] xfs_attr_get+0x90/0xe0 [xfs]
    [] xfs_xattr_get+0x37/0x50 [xfs]
    [] generic_getxattr+0x4f/0x70
    [] inode_doinit_with_dentry+0x1ae/0x650
    [] selinux_d_instantiate+0x1c/0x20
    [] security_d_instantiate+0x1b/0x30
    [] d_instantiate+0x50/0x70
    [] d_tmpfile+0xb5/0xc0
    [] xfs_create_tmpfile+0x362/0x410 [xfs]
    [] xfs_vn_tmpfile+0x18/0x20 [xfs]
    [] path_openat+0x228/0x6a0
    [] ? sched_clock+0x9/0x10
    [] ? kvm_clock_read+0x27/0x40
    [] ? __alloc_fd+0xaf/0x1f0
    [] do_filp_open+0x3a/0x90
    [] ? _raw_spin_unlock+0x27/0x40
    [] ? __alloc_fd+0xaf/0x1f0
    [] do_sys_open+0x12e/0x210
    [] SyS_open+0x1e/0x20
    [] system_call_fastpath+0x16/0x1b

    xfs_vn_tmpfile() also fails to initialize security on the newly created
    inode.

    Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
    has been committed and the inode unlocked. Also, initialize security on
    the inode based on the parent directory provided via the tmpfile call.

    Signed-off-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Brian Foster
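
Illustrative sketch (not part of the commit): the lock-ordering problem described above can be modelled with a plain non-recursive lock. The names `ilock`, `xattr_get`, `create_tmpfile_old` and `create_tmpfile_new` are hypothetical stand-ins, and a timeout replaces the real hang so that the example terminates.

```python
import threading

# Hypothetical stand-in for the XFS ilock: a non-recursive lock.
ilock = threading.Lock()

def xattr_get():
    """Models the xattr read: needs the ilock; reports failure instead of hanging."""
    if not ilock.acquire(timeout=0.1):
        return "would deadlock"
    try:
        return "ok"
    finally:
        ilock.release()

def d_tmpfile():
    """Models d_tmpfile(): the security hook ends up reading an xattr."""
    return xattr_get()

def create_tmpfile_old():
    # Old ordering: d_tmpfile() is called with the ilock still held.
    with ilock:
        return d_tmpfile()

def create_tmpfile_new():
    # New ordering: finish the locked work first, then call d_tmpfile().
    with ilock:
        pass  # the transaction commit would happen here
    return d_tmpfile()

print(create_tmpfile_old())  # would deadlock
print(create_tmpfile_new())  # ok
```

Moving the call outside the locked region is the same shape of fix the message describes: the callee is free to take the lock itself.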
     
  • When testing exhaustion of dm snapshots, the following appeared
    with CONFIG_DEBUG_OBJECTS_FREE enabled:

    ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs]

    indicating that we'd freed a buffer which still had a pending reference,
    down this path:

    [ 190.867975] [] debug_check_no_obj_freed+0x22b/0x270
    [ 190.880820] [] kmem_cache_free+0xd0/0x370
    [ 190.892615] [] xfs_buf_free+0xe4/0x210 [xfs]
    [ 190.905629] [] xfs_buf_rele+0xe7/0x270 [xfs]
    [ 190.911770] [] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs]

    At issue is the fact that if IO fails in xfs_buf_iorequest,
    we'll queue completion unconditionally, and then call
    xfs_buf_rele; but if IO failed, there are no IOs remaining,
    and xfs_buf_rele will free the bp while work is still queued.

    Fix this by not scheduling completion if the buffer has
    an error on it; run it immediately. The rest is only comment
    changes.

    Thanks to dchinner for spotting the root cause.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
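
Illustrative sketch (not part of the commit): a much simplified model of the reference-count race described above. `Buf`, `buf_rele` and `read_buf` are hypothetical names, and the "workqueue" is just a list that is drained after the caller has dropped its reference.

```python
from collections import deque

EIO = 5

class Buf:
    def __init__(self):
        self.refs = 1        # the caller's reference
        self.error = 0
        self.freed = False

def buf_rele(bp):
    bp.refs -= 1
    if bp.refs == 0:
        bp.freed = True      # stands in for freeing the buffer

def read_buf(io_fails, run_now_on_error):
    """Return what the completion handler observed: 'ok' or 'use after free'."""
    workqueue = deque()
    seen = []

    def ioend(b):
        seen.append("use after free" if b.freed else "ok")

    bp = Buf()
    bp.refs += 1                         # hold across submission
    if io_fails:
        bp.error = EIO                   # no I/O was issued
    if bp.error and run_now_on_error:
        ioend(bp)                        # fixed: complete immediately
    else:
        workqueue.append(bp)             # completion is deferred
    buf_rele(bp)                         # drop the submission hold
    if bp.error:
        buf_rele(bp)                     # caller gives up on the failed buffer
    while workqueue:
        ioend(workqueue.popleft())       # deferred work runs later
    return seen[0]

print(read_buf(io_fails=True, run_now_on_error=False))  # use after free
print(read_buf(io_fails=True, run_now_on_error=True))   # ok
print(read_buf(io_fails=False, run_now_on_error=True))  # ok
```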
     
  • We negate the error value being returned from a generic function
    incorrectly. The code path that it is running in returns negative
    errors, so there is no need to negate it to get the correct error
    signs here.

    This was uncovered by generic/019.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
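
Illustrative sketch (not part of the commit): the sign problem in miniature. `generic_helper`, `caller_old` and `caller_fixed` are hypothetical names; the only assumption is the convention stated above, that the helper already returns a negative errno.

```python
import errno

def generic_helper():
    """Hypothetical generic function: returns a negative errno on failure."""
    return -errno.EIO

def caller_old():
    # Buggy: negates a value that is already negative.
    return -generic_helper()

def caller_fixed():
    # Fixed: pass the negative error through unchanged.
    return generic_helper()

print(caller_old() > 0)    # True: the error now looks like a positive value
print(caller_fixed() < 0)  # True: the error keeps the expected sign
```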
     
  • An interesting situation can occur if a log IO error occurs during
    the unmount of a filesystem. The cases reported have the same
    signature - the update of the superblock counters fails due to a log
    write IO error:

    XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa08a44a1
    XFS (dm-16): Log I/O Error Detected. Shutting down filesystem
    XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.
    XFS (dm-16): xfs_log_force: error 5 returned.
    XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)

    It can be seen that the last line of output contains a corrupt
    device name - this is because the log and xfs_mount structures have
    already been freed by the time this message is printed. A kernel
    oops closely follows.

    The issue is that the shutdown is occurring in a separate IO
    completion thread to the unmount. Once the shutdown processing has
    started and all the iclogs are marked with XLOG_STATE_IOERROR, the
    log shutdown code wakes anyone waiting on a log force so they can
    process the shutdown error. This wakes up the unmount code that
    is doing a synchronous transaction to update the superblock
    counters.

    The unmount path now sees all the iclogs are marked with
    XLOG_STATE_IOERROR and so never waits on them again, knowing that if
    it does, there will not be a wakeup trigger for it and we will hang
    the unmount if we do. Hence the unmount runs through all the
    remaining code and frees all the filesystem structures while the
    xlog_iodone() is still processing the shutdown. When the log
    shutdown processing completes, xfs_do_force_shutdown() emits the
    "Please umount the filesystem and rectify the problem(s)" message,
    and xlog_iodone() then aborts all the objects attached to the iclog.
    An iclog that has already been freed....

    The real issue here is that there is no serialisation point between
    the log IO and the unmount. We have serialisation points for log
    writes, log forces, reservations, etc, but we don't actually have
    any code that waits for log IO to fully complete. We do that for all
    other types of object, so why not iclogbufs?

    Well, it turns out that we can easily do this. We've got xfs_buf
    handles, and that's what everyone else uses for IO serialisation.
    i.e. bp->b_sema. So, let's hold iclogbufs locked over IO, and only
    release the lock in xlog_iodone() when we are finished with the
    buffer. That way before we tear down the iclog, we can lock and
    unlock the buffer to ensure IO completion has finished completely
    before we tear it down.

    Signed-off-by: Dave Chinner
    Tested-by: Mike Snitzer
    Tested-by: Bob Mastors
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
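
Illustrative sketch (not part of the commit): the "lock and unlock to wait for IO completion" idea, modelled with a semaphore and a thread. `b_sema`, `iodone`, `write_iclog` and `teardown` are stand-in names for this example only.

```python
import threading
import time

events = []
b_sema = threading.Semaphore(1)   # stand-in for the buffer lock

def iodone():
    time.sleep(0.05)              # pretend the I/O takes a while
    events.append("iodone finished")
    b_sema.release()              # unlock only when done with the buffer

def write_iclog():
    b_sema.acquire()              # hold the buffer locked over the I/O
    t = threading.Thread(target=iodone)
    t.start()
    return t

def teardown():
    b_sema.acquire()              # blocks until iodone() has released
    b_sema.release()
    events.append("iclog freed")

t = write_iclog()
teardown()
t.join()
print(events)  # ['iodone finished', 'iclog freed']
```

Because the teardown path cannot take the lock until the completion path releases it, the free can never be ordered before the end of completion processing.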
     
  • FSX has been detecting data corruption after collapse range
    calls. The key observation is that the offset of the last extent in
    the file was not being shifted, and hence when the file size was
    adjusted it was truncating away data because the extents had not
    been correctly shifted.

    Tracing indicated that before the collapse, the extent list looked
    like:

    ....
    ino 0x5788 state idx 6 offset 26 block 195904 count 10 flag 0
    ino 0x5788 state idx 7 offset 39 block 195917 count 35 flag 0
    ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0

    and after the shift of 2 blocks:

    ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0
    ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0
    ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0

    Note that the last extent did not change offset. After the changing
    of the file size:

    ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0
    ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0
    ino 0x5788 state idx 8 offset 86 block 195964 count 30 flag 0

    You can see that the last extent had its length truncated,
    indicating that we've lost data.

    The reason for this is that the xfs_bmap_shift_extents() loop uses
    XFS_IFORK_NEXTENTS() to determine how many extents are in the inode.
    This, unfortunately, doesn't take into account delayed allocation
    extents - it's a count of physically allocated extents - and hence
    when the file being collapsed has a delalloc extent like this one
    does prior to the range being collapsed:

    ....
    ino 0x5788 state idx 4 offset 11 block 4503599627239429 count 1 flag 0
    ....

    it gets the count wrong and terminates the shift loop early.

    Fix it by using the in-memory extent array size that includes
    delayed allocation extents to determine the number of extents on the
    inode.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
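
Illustrative sketch (not part of the commit): how an extent count that skips delalloc extents ends the shift loop one entry early. Indexes 4 and 6 to 8 mirror the trace above; the other entries, the start index and the helper names are made up for the example.

```python
DELALLOC = None  # marker for an extent with no physical block yet

def make_extents():
    # [file offset, start block, length]
    return [
        [0, 195870, 4], [5, 195880, 2], [8, 195890, 1], [10, 195895, 1],
        [11, DELALLOC, 1],                      # idx 4: delayed allocation
        [13, 195900, 3],
        [26, 195904, 10], [39, 195917, 35], [86, 195964, 32],
    ]

def allocated_count(extents):
    """Counts only physically allocated extents (the buggy loop bound)."""
    return sum(1 for e in extents if e[1] is not DELALLOC)

def shift(extents, start_idx, nblocks, count):
    """Shift offsets down for indexes start_idx..count-1; return offsets of idx 6-8."""
    for idx in range(start_idx, count):
        extents[idx][0] -= nblocks
    return [e[0] for e in extents[6:]]

ext = make_extents()
buggy = shift(ext, 6, 2, allocated_count(ext))
ext = make_extents()
fixed = shift(ext, 6, 2, len(ext))
print(buggy)  # [24, 37, 86]
print(fixed)  # [24, 37, 84]
```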
     
  • Al Viro tracked down the problem that has caused generic/263 to fail
    on XFS since the test was introduced. It is caused by
    xfs_get_blocks() mapping a single extent that spans EOF without
    marking it as buffer_new() so that the direct IO code does not zero
    the tail of the block at the new EOF. This is a long standing bug
    that has been around for many, many years.

    Because xfs_get_blocks() starts the map before EOF, it can't set
    buffer_new(), because that causes the direct IO code to also zero
    unaligned sectors at the head of the IO. This would overwrite valid
    data with zeros, and hence we cannot validly return a single extent
    that spans EOF to direct IO.

    Fix this by detecting a mapping that spans EOF and truncating it down
    to EOF. This results in the direct IO code doing the right thing
    for unaligned data blocks before EOF, and then returning to get
    another mapping for the region beyond EOF which XFS treats correctly
    by setting buffer_new() on it. This makes direct IO behave correctly
    w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.

    Again, thanks to Al Viro for finding what I couldn't.

    [ dchinner: Fix for __divdi3 build error:

    Reported-by: Paul Gortmaker
    Tested-by: Paul Gortmaker
    Signed-off-by: Mark Tinguely
    Reviewed-by: Eric Sandeen
    ]

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
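
Illustrative sketch (not part of the commit): trimming a mapping at EOF so that the caller comes back for a second mapping that can be flagged as new. `get_blocks` and `map_request` are hypothetical helpers working in byte offsets; they only model the splitting, not the zeroing.

```python
def get_blocks(offset, length, eof):
    """Return (offset, length, new) for the first mapping of a request.

    A mapping that starts before EOF and would cross it is trimmed at EOF,
    so only a mapping that starts at or beyond EOF is flagged as new.
    """
    if offset < eof:
        return (offset, min(length, eof - offset), False)
    return (offset, length, True)

def map_request(offset, length, eof):
    """Loop the way a caller would, asking again for the unmapped remainder."""
    mappings = []
    end = offset + length
    while offset < end:
        m = get_blocks(offset, end - offset, eof)
        mappings.append(m)
        offset += m[1]
    return mappings

print(map_request(3000, 4096, eof=5000))
# [(3000, 2000, False), (5000, 2096, True)]
```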
     

14 Apr, 2014

4 commits

  • When we are zeroing space and it is covered by a delalloc range, we
    need to punch the delalloc range out before we truncate the page
    cache. Failing to do so leaves an inconsistency between the page
    cache and the extent tree, which we later trip over when doing
    direct IO over the same range.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Similar to the write_begin problem, xfs_vm_write_end will truncate
    back to the old EOF, potentially removing page cache from over the
    top of delalloc blocks with valid data in them. Fix this by
    truncating back to just the start of the failed write.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • If we fail a write beyond EOF and have to handle it in
    xfs_vm_write_begin(), we truncate the inode back to the current inode
    size. This doesn't take into account the fact that we may have
    already made successful writes to the same page (in the case of block
    size < page size) and hence we can truncate the page cache away from
    blocks with valid data in them. If these blocks are delayed
    allocation blocks, we now have a mismatch between the page cache and
    the extent tree, and this will trigger - at minimum - a delayed
    block count mismatch assert when the inode is evicted from the cache.
    We can also trip over it when block mapping for direct IO - this is
    the most common symptom seen from fsx and fsstress when run from
    xfstests.

    Fix it by only truncating away the exact range we are updating state
    for in this write_begin call.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • When a write fails, if we don't clear the delalloc flags from the
    buffers over the failed range, they can persist beyond EOF and cause
    problems. writeback will see the pages in the page cache, see they
    are dirty and continually retry the write, assuming that the page
    beyond EOF is just racing with a truncate. The page will eventually
    be released due to some other operation (e.g. direct IO), and it
    will not pass through invalidation because it is dirty. Hence it
    will be released with buffer_delay set on it, and trigger warnings
    in xfs_vm_releasepage() and assert fail in xfs_file_aio_write_direct
    because invalidation failed and we didn't write the correct amount.

    This causes failures on block size < page size filesystems in fsx
    and fsstress workloads run by xfstests.

    Fix it by completely trashing any state on the buffer that could be
    used to imply that it contains valid data when the delalloc range
    over the buffer is punched out during the failed write handling.

    Signed-off-by: Dave Chinner
    Tested-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     

13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

08 Apr, 2014

1 commit

  • filemap_map_pages() is a generic implementation of ->map_pages() for
    filesystems that use the page cache.

    It should be safe to use filemap_map_pages() for ->map_pages() if
    the filesystem uses filemap_fault() for ->fault().

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Apr, 2014

2 commits

  • Pull xfs update from Dave Chinner:
    "There are a couple of new fallocate features in this request - it was
    decided that it was easiest to push them through the XFS tree using
    topic branches and have the ext4 support be based on those branches.
    Hence you may see some overlap with the ext4 tree merge depending on
    how they include those topic branches into their tree. Other than
    that, there is O_TMPFILE support, some cleanups and bug fixes.

    The main changes in the XFS tree for 3.15-rc1 are:

    - O_TMPFILE support
    - allowing AIO+DIO writes beyond EOF
    - FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
    implementation
    - FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
    implementation
    - IO verifier cleanup and rework
    - stack usage reduction changes
    - vm_map_ram NOIO context fixes to remove lockdep warnings
    - various bug fixes and cleanups"

    * tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
    xfs: fix directory hash ordering bug
    xfs: extra semi-colon breaks a condition
    xfs: Add support for FALLOC_FL_ZERO_RANGE
    fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    xfs: inode log reservations are still too small
    xfs: xfs_check_page_type buffer checks need help
    xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
    xfs: use NOIO contexts for vm_map_ram
    xfs: don't leak EFSBADCRC to userspace
    xfs: fix directory inode iolock lockdep false positive
    xfs: allocate xfs_da_args to reduce stack footprint
    xfs: always do log forces via the workqueue
    xfs: modify verifiers to differentiate CRC from other errors
    xfs: print useful caller information in xfs_error_report
    xfs: add xfs_verifier_error()
    xfs: add helper for updating checksums on xfs_bufs
    xfs: add helper for verifying checksums on xfs_bufs
    xfs: Use defines for CRC offsets in all cases
    xfs: skip pointless CRC updates after verifier failures
    xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
    ...

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "Major changes for 3.14 include support for the newly added ZERO_RANGE
    and COLLAPSE_RANGE fallocate operations, and scalability improvements
    in the jbd2 layer and in xattr handling when the extended attributes
    spill over into an external block.

    Other than that, the usual clean ups and minor bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
    ext4: fix premature freeing of partial clusters split across leaf blocks
    ext4: remove unneeded test of ret variable
    ext4: fix comment typo
    ext4: make ext4_block_zero_page_range static
    ext4: atomically set inode->i_flags in ext4_set_inode_flags()
    ext4: optimize Hurd tests when reading/writing inodes
    ext4: kill i_version support for Hurd-castrated file systems
    ext4: each filesystem creates and uses its own mb_cache
    fs/mbcache.c: doucple the locking of local from global data
    fs/mbcache.c: change block and index hash chain to hlist_bl_node
    ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    ext4: refactor ext4_fallocate code
    ext4: Update inode i_size after the preallocation
    ext4: fix partial cluster handling for bigalloc file systems
    ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
    ext4: only call sync_filesystm() when remounting read-only
    fs: push sync_filesystem() down to the file system's remount_fs()
    jbd2: improve error messages for inconsistent journal heads
    jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
    jbd2: minimize region locked by j_list_lock in journal_get_create_access()
    ...

    Linus Torvalds
     

04 Apr, 2014

4 commits

  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Dave Chinner
     
  • Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
    in 3.10 incorrectly converted the btree hash index array pointer in
    xfs_da3_fixhashpath(). It resulted in the current hash always
    being compared against the first entry in the btree rather than the
    current block index into the btree block's hash entry array. As a
    result, it was comparing the wrong hashes, and so could misorder the
    entries in the btree.

    For most cases, this doesn't cause any problems as it requires hash
    collisions to expose the ordering problem. However, when there are
    hash collisions within a directory there is a very good probability
    that the entries will be ordered incorrectly and that actually
    matters when duplicate hashes are placed into or removed from the
    btree block hash entry array.

    This bug results in an on-disk directory corruption and that results
    in directory verifier functions throwing corruption warnings into
    the logs. While no data or directory entries are lost, access to
    them may be compromised, and attempts to remove entries from a
    directory that has suffered from this corruption may result in a
    filesystem shutdown. xfs_repair will fix the directory hash
    ordering without data loss occurring.

    [dchinner: wrote a useful commit message]

    cc:
    Reported-by: Hannes Frederic Sowa
    Signed-off-by: Mark Tinguely
    Reviewed-by: Ben Myers
    Signed-off-by: Dave Chinner

    Mark Tinguely
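
Illustrative sketch (not part of the commit): comparing against the first entry instead of the entry at the current index, and why a hash collision is needed to expose it. The `fixhashpath` helper and the hash values are hypothetical and far simpler than the real btree code.

```python
def fixhashpath(hashvals, index, lasthash, buggy=False):
    """Update the parent's hash entry for child `index`; return True if it changed."""
    current = hashvals[0] if buggy else hashvals[index]
    if current == lasthash:
        return False          # looks up to date, nothing to propagate
    hashvals[index] = lasthash
    return True

# The child at index 1 now ends at hash 0x20, which collides with entry 0.
stale = [0x20, 0x25, 0x30]
print(fixhashpath(stale, 1, 0x20, buggy=True), stale)   # False [32, 37, 48]

good = [0x20, 0x25, 0x30]
print(fixhashpath(good, 1, 0x20), good)                 # True [32, 32, 48]
```

Without the collision, the buggy comparison against entry 0 would not match and the update would still happen, which fits the observation that most directories never hit the problem.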
     
  • There were some extra semi-colons here which mean that we return true
    unintentionally.

    Fixes: a49935f200e2 ('xfs: xfs_check_page_type buffer checks need help')
    Signed-off-by: Dan Carpenter
    Reviewed-by: Brian Foster
    Reviewed-by: Eric Sandeen
    Signed-off-by: Dave Chinner

    Dan Carpenter
     

02 Apr, 2014

4 commits


13 Mar, 2014

7 commits

  • Previously, the no-op "mount -o remount /dev/xxx" operation when the
    file system is already mounted read-write causes an implied,
    unconditional syncfs(). This seems pretty stupid, and it's certainly
    not documented or guaranteed to do this, nor is it particularly useful,
    except in the case where the file system was mounted rw and is getting
    remounted read-only.

    However, it's possible that there might be some file systems that are
    actually depending on this behavior. In most file systems, it's
    probably fine to only call sync_filesystem() when transitioning from
    read-write to read-only, and there are some file systems where this is
    not needed at all (for example, for a pseudo-filesystem or something
    like romfs).

    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Artem Bityutskiy
    Cc: Adrian Hunter
    Cc: Evgeniy Dushistov
    Cc: Jan Kara
    Cc: OGAWA Hirofumi
    Cc: Anders Larsen
    Cc: Phillip Lougher
    Cc: Kees Cook
    Cc: Mikulas Patocka
    Cc: Petr Vandrovec
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-cifs@vger.kernel.org
    Cc: samba-technical@lists.samba.org
    Cc: codalist@coda.cs.cmu.edu
    Cc: linux-ext4@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: fuse-devel@lists.sourceforge.net
    Cc: cluster-devel@redhat.com
    Cc: linux-mtd@lists.infradead.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-nilfs@vger.kernel.org
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: reiserfs-devel@vger.kernel.org

    Theodore Ts'o
     
  • Conflicts:
    fs/xfs/xfs_trans_resv.c
    - fix for XFS_INODE_CLUSTER_SIZE macro removal

    Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Dave Chinner
     
  • Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
    functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

    We can also preallocate blocks past EOF in the same way as with
    fallocate. Flag FALLOC_FL_KEEP_SIZE will cause the inode size to remain
    the same even if we preallocate blocks past EOF.

    It uses the same code to zero range as it is used by the
    XFS_IOC_ZERO_RANGE ioctl.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Lukas Czerner
     

07 Mar, 2014

5 commits

  • Back in commit 23956703 ("xfs: inode log reservations are too
    small"), the reservation size was increased to take into account the
    difference in size between the in-memory BMBT block headers and the
    on-disk BMDR headers. This solved a transaction overrun when logging
    the inode size.

    Recently, however, we've seen a number of these same overruns on
    kernels with the above fix in it. All of them have been by 4 bytes,
    so we must still not be accounting for something correctly.

    Through inspection it turns out the above commit didn't take into
    account everything it should have. That is, it only accounts for a
    single log op_hdr structure, when it can actually require up to four
    op_hdrs - one for each region (log iovec) that is formatted. These
    regions are the inode log format header, the inode core, and the two
    forks that can be held in the literal area of the inode.

    This means we are not accounting for 36 bytes of log space that the
    transaction can use, and hence when we get inodes in certain formats
    with particular fragmentation patterns we can overrun the
    transaction. Fix this by adding the correct accounting for log
    op_headers in the transaction.

    Tested-by: Brian Foster
    Signed-off-by: Dave Chinner
    Reviewed-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
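
Illustrative arithmetic (not part of the commit): where the 36 bytes come from, assuming each log op header is 12 bytes. That size is an assumption of this sketch; it is consistent with three missing headers adding up to the 36 bytes quoted above.

```python
OP_HDR_SIZE = 12   # assumed size in bytes of one log op header

# One op header is needed per formatted region (log iovec).
REGIONS = ["inode log format header", "inode core", "data fork", "attr fork"]

accounted = 1 * OP_HDR_SIZE              # what the earlier fix reserved
required = len(REGIONS) * OP_HDR_SIZE    # what can actually be needed
missing = required - accounted

print(missing)  # 36
```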
     
  • xfs_aops_discard_page() was introduced in the following commit:

    xfs: truncate delalloc extents when IO fails in writeback

    ... to clean up left over delalloc ranges after I/O failure in
    ->writepage(). generic/224 tests for this scenario and occasionally
    reproduces panics on sub-4k blocksize filesystems.

    The cause of this is failure to clean up the delalloc range on a
    page where the first buffer does not match one of the expected
    states of xfs_check_page_type(). If a buffer is not unwritten,
    delayed or dirty&mapped, xfs_check_page_type() stops and
    immediately returns 0.

    The stress test of generic/224 creates a scenario where the first
    several buffers of a page with delayed buffers are mapped & uptodate
    and some subsequent buffer is delayed. If the ->writepage() happens
    to fail for this page, xfs_aops_discard_page() incorrectly skips
    the entire page.

    This then causes later failures either when direct IO maps the range
    and finds the stale delayed buffer, or we evict the inode and find
    that the inode still has a delayed block reservation accounted to
    it.

    We can easily fix this xfs_aops_discard_page() failure by making
    xfs_check_page_type() check all buffers, but this breaks
    xfs_convert_page() more than it is already broken. Indeed,
    xfs_convert_page() wants xfs_check_page_type() to tell it if the
    first buffers on the pages are of a type that can be aggregated into
    the contiguous IO that is already being built.

    xfs_convert_page() should not be writing random buffers out of a
    page, but the current behaviour will cause it to do so if there are
    buffers that don't match the current specification on the page.
    Hence for xfs_convert_page() we need to:

    a) return "not ok" if the first buffer on the page does not
    match the specification provided so we don't write anything;
    and
    b) abort its buffer-add-to-io loop the moment we come
    across a buffer that does not match the specification.

    Hence we need to fix both xfs_check_page_type() and
    xfs_convert_page() to work correctly with pages that have mixed
    buffer types, whilst allowing xfs_aops_discard_page() to scan all
    buffers on the page for a type match.

    Reported-by: Brian Foster
    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
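
Illustrative sketch (not part of the commit): the two behaviours asked for above, on a page with mixed buffer types. The buffer states are plain strings, and `matches`, `check_page_type` and `buffers_to_add` are hypothetical names for this model.

```python
def matches(state, io_type):
    """Hypothetical predicate: does this buffer state belong to the I/O type?"""
    return state == io_type

def check_page_type(buffers, io_type, check_all):
    for state in buffers:
        if matches(state, io_type):
            return True
        if not check_all:
            break      # this caller only cares about the first buffer
    return False

def buffers_to_add(buffers, io_type):
    """Models the add-to-io loop: stop at the first buffer that does not match."""
    added = []
    for state in buffers:
        if not matches(state, io_type):
            break
        added.append(state)
    return added

page = ["mapped", "mapped", "delalloc", "mapped"]
print(check_page_type(page, "delalloc", check_all=False))  # False
print(check_page_type(page, "delalloc", check_all=True))   # True
print(buffers_to_add(page, "mapped"))                      # ['mapped', 'mapped']
```

The first call is the case that left the delalloc range behind: only the leading buffer is inspected, so the page is skipped. Scanning every buffer, as the discard path needs, finds it.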
     
  • The inode chunk allocation path can lead to deadlock conditions if
    a transaction is dirtied with an AGF (to fix up the freelist) for
    an AG that cannot satisfy the actual allocation request. This code
    path is written to try and avoid this scenario, but it can be
    reproduced by running xfstests generic/270 in a loop on a 512b fs.

    An example situation is:
    - process A attempts an inode allocation on AG 3, modifies
    the freelist, fails the allocation and ultimately moves on to
    AG 0 with the AG 3 AGF held
    - process B is doing a free space operation (i.e., truncate) and
    acquires the AG 0 AGF, waits on the AG 3 AGF
    - process A acquires the AG 0 AGI, waits on the AG 0 AGF (deadlock)

    The problem here is that process A acquired the AG 3 AGF while
    moving on to AG 0 (and releasing the AG 3 AGI with the AG 3 AGF
    held). xfs_dialloc() makes one pass through each of the AGs when
    attempting to allocate an inode chunk. The expectation is a clean
    transaction if a particular AG cannot satisfy the allocation
    request. xfs_ialloc_ag_alloc() is written to support this through
    use of the minalignslop allocation args field.

    When using the agi->agi_newino optimization, we attempt an exact
    bno allocation request based on the location of the previously
    allocated chunk. minalignslop is set to inform the allocator that
    we will require alignment on this chunk, and thus to not allow the
    request for this AG if the extra space is not available. Suppose
    that the AG in question has just enough space for this request, but
    not at the requested bno. xfs_alloc_fix_freelist() will proceed as
    normal as it determines the request should succeed, and thus it is
    allowed to modify the agf. xfs_alloc_ag_vextent() ultimately fails
    because the requested bno is not available. In response, the caller
    moves on to a NEAR_BNO allocation request for the same AG. The
    alignment is set, but the minalignslop field is never reset. This
    increases the overall requirement of the request from the first
    attempt. If this delta is the difference between allocation success
    and failure for the AG, xfs_alloc_fix_freelist() rejects this
    request outright the second time around and causes the allocation
    request to unnecessarily fail for this AG.

    To address this situation, reset the minalignslop field immediately
    after use and prevent it from leaking into subsequent requests.

    Signed-off-by: Brian Foster
    Reviewed-by: Mark Tinguely
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Brian Foster
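
Illustrative sketch (not part of the commit): how a stale minalignslop can turn a request that fits into one that is rejected. The `space_needed` formula, the `AllocArgs` class and the numbers are simplifying assumptions for this example, not the allocator's real accounting.

```python
from dataclasses import dataclass

@dataclass
class AllocArgs:
    minlen: int
    alignment: int = 1
    minalignslop: int = 0

def space_needed(args):
    """Simplified stand-in for the free-space check made before an AG is used."""
    return args.minlen + args.alignment + args.minalignslop - 1

def try_ag(longest_free, chunk_len, cluster_align, reset_slop):
    args = AllocArgs(minlen=chunk_len)
    # Attempt 1: exact block next to the previous chunk, with slop reserved
    # so that a later aligned attempt in this AG could still succeed.
    args.alignment = 1
    args.minalignslop = cluster_align - 1
    first_ok = space_needed(args) <= longest_free
    if reset_slop:
        args.minalignslop = 0            # the fix: do not leak the slop
    # Attempt 2 (the exact block was unavailable): aligned, near the same spot.
    args.alignment = cluster_align
    second_ok = space_needed(args) <= longest_free
    return first_ok, second_ok

print(try_ag(71, 64, 8, reset_slop=False))  # (True, False)
print(try_ag(71, 64, 8, reset_slop=True))   # (True, True)
```

In the first call the AG passes the check for the exact request and is then rejected for the follow-up request, which is the situation that leaves the transaction dirty with no allocation to show for it.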
     
  • When we map pages in the buffer cache, we can do so in GFP_NOFS
    contexts. However, the vmap interfaces do not provide any method of
    communicating this information to memory reclaim, and hence we get
    lockdep complaining about it regularly and occasionally see hangs
    that may be vmap related reclaim deadlocks. We can also see these
    same problems from anywhere where we use vmalloc for a large buffer
    (e.g. attribute code) inside a transaction context.

    A typical lockdep report shows up as a reclaim state warning like so:

    [14046.101458] =================================
    [14046.102850] [ INFO: inconsistent lock state ]
    [14046.102850] 3.14.0-rc4+ #2 Not tainted
    [14046.102850] ---------------------------------
    [14046.102850] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    [14046.102850] kswapd0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
    [14046.102850] (&xfs_dir_ilock_class){++++?+}, at: [] xfs_ilock+0xff/0x16a
    [14046.102850] {RECLAIM_FS-ON-W} state was registered at:
    [14046.102850] [] mark_held_locks+0x81/0xe7
    [14046.102850] [] lockdep_trace_alloc+0x5c/0xb4
    [14046.102850] [] kmem_cache_alloc_trace+0x2b/0x11e
    [14046.102850] [] vm_map_ram+0x119/0x3e6
    [14046.102850] [] _xfs_buf_map_pages+0x5b/0xcf
    [14046.102850] [] xfs_buf_get_map+0x67/0x13f
    [14046.102850] [] xfs_attr_rmtval_set+0x396/0x4d5
    [14046.102850] [] xfs_attr_leaf_addname+0x18f/0x37d
    [14046.102850] [] xfs_attr_set_int+0x2f5/0x3e8
    [14046.102850] [] xfs_attr_set+0x6b/0x74
    [14046.102850] [] xfs_xattr_set+0x61/0x81
    [14046.102850] [] generic_setxattr+0x59/0x68
    [14046.102850] [] __vfs_setxattr_noperm+0x58/0xce
    [14046.102850] [] vfs_setxattr+0x8e/0x92
    [14046.102850] [] setxattr+0xcf/0x159
    [14046.102850] [] SyS_lsetxattr+0x88/0xbb
    [14046.102850] [] sysenter_do_call+0x12/0x36

    Now, we can't completely remove these traces - mainly because
    vm_map_ram() will do GFP_KERNEL allocation and that generates the
    above warning before we get into the reclaim code, but we can turn
    them all into false positive warnings.

    To do that, use the method that DM and other IO context code uses to
    avoid this problem: there is a process flag to tell memory reclaim
    not to do IO that we can set appropriately. That prevents GFP_KERNEL
    context reclaim being done from deep inside the vmalloc code in
    places we can't directly pass a GFP_NOFS context to. That interface
    has a pair of wrapper functions: memalloc_noio_save() and
    memalloc_noio_restore().

    Adding them around vm_map_ram() and the vzalloc() call in
    kmem_alloc_large() will prevent deadlocks and most lockdep reports
    for this issue. Also, convert the vzalloc() call in
    kmem_alloc_large() to use __vmalloc() so that we can pass the
    correct gfp context to the data page allocation routine inside
    __vmalloc(), making it clear that GFP_NOFS context is important
    to this vmalloc call.
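
    The save/restore pairing can be illustrated with a userspace model.
    This is only a sketch: the model_* names and the flag value are
    hypothetical, a plain global stands in for the task flags, and the
    mapping call is a stub that reports whether reclaim triggered from
    inside it would have been allowed to do IO.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    #define MODEL_PF_MEMALLOC_NOIO 0x1u   /* hypothetical flag bit */

    static unsigned int model_task_flags; /* stand-in for the task's flags */

    /* Set the no-IO flag and return its previous state for later restore. */
    static unsigned int model_noio_save(void)
    {
        unsigned int old = model_task_flags & MODEL_PF_MEMALLOC_NOIO;

        model_task_flags |= MODEL_PF_MEMALLOC_NOIO;
        return old;
    }

    /* Put the no-IO flag back to the state captured by model_noio_save(). */
    static void model_noio_restore(unsigned int old)
    {
        model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOIO) | old;
    }

    /* What reclaim would consult before issuing IO for this task. */
    static bool model_reclaim_may_do_io(void)
    {
        return !(model_task_flags & MODEL_PF_MEMALLOC_NOIO);
    }

    /* Stub for a mapping call that cannot be handed a GFP context. */
    static bool model_map_pages(void)
    {
        return model_reclaim_may_do_io();
    }

    /* The same call wrapped in the save/restore pair. */
    static bool model_map_pages_noio(void)
    {
        unsigned int saved = model_noio_save();
        bool io_allowed = model_map_pages();

        model_noio_restore(saved);
        return io_allowed;
    }
    ```

    Because the restore only reinstates the saved state, the wrapper
    nests safely: an inner restore does not clear a flag that an outer
    caller had already set.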

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • While the verifier routines may return EFSBADCRC when a buffer has
    a bad CRC, we need to translate that to EFSCORRUPTED so that the
    higher layers treat the error appropriately and we return a
    consistent error to userspace. This fixes an xfs/005 regression.
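
    A minimal sketch of the translation, assuming it happens at the
    point where a verifier error is handed to the higher layers. The
    function name and the numeric error values are hypothetical and
    chosen only for illustration.

    ```c
    #include <assert.h>

    /* Hypothetical error codes; the numeric values are arbitrary. */
    enum {
        MODEL_EFSBADCRC = 74,
        MODEL_EFSCORRUPTED = 117
    };

    /* Fold a CRC failure into the generic corruption error and pass
     * every other error (including success) through unchanged. */
    static int model_translate_verifier_error(int error)
    {
        if (error == MODEL_EFSBADCRC)
            return MODEL_EFSCORRUPTED;
        return error;
    }
    ```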

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     

28 Feb, 2014

1 commit

  • Pull filesystem fixes from Jan Kara:
    "Notification, writeback, udf, quota fixes

    The notification patches are (with one exception) a fallout of my
    fsnotify rework which went into -rc1 (I've extended LTP to cover these
    corner cases to avoid similar breakage in future).

    The UDF patch is a nasty data corruption Al has recently reported,
    the revert of the writeback patch is due to possibility of violating
    sync(2) guarantees, and a quota bug can lead to corruption of quota
    files in ocfs2"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fsnotify: Allocate overflow events with proper type
    fanotify: Handle overflow in case of permission events
    fsnotify: Fix detection whether overflow event is queued
    Revert "writeback: do not sync data dirtied after sync start"
    quota: Fix race between dqput() and dquot_scan_active()
    udf: Fix data corruption on file type conversion
    inotify: Fix reporting of cookies for inotify events

    Linus Torvalds
     

27 Feb, 2014

5 commits

  • The change to add the IO lock to protect the directory extent map
    during readdir operations has caused lockdep to have a heart attack
    as it now sees a different locking order on inodes w.r.t. the
    mmap_sem because readdir has a different ordering to write().

    Add a new lockdep class for directory inodes to avoid this false
    positive.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • The struct xfs_da_args used to pass directory/attribute operation
    information to the lower layers is 128 bytes in size and is
    allocated on the stack. Dynamically allocate these structures
    instead to reduce the stack footprint of directory operations.
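
    The general pattern looks roughly like the userspace sketch below.
    The structure layout, the model_* names and the use of calloc() and
    free() in place of the kernel allocator are all assumptions made for
    illustration. Errors are returned as positive errno values, which
    is a choice of this sketch.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-in for a large per-operation argument block. */
    struct model_da_args {
        const char *name;
        size_t namelen;
        unsigned char other_fields[96];
    };

    /* Keep only a pointer on the stack; the block itself lives on the heap
     * and is released on every exit path. */
    static int model_dir_lookup(const char *name)
    {
        struct model_da_args *args;
        int error = 0;

        args = calloc(1, sizeof(*args));
        if (!args)
            return ENOMEM;

        args->name = name;
        args->namelen = strlen(name);
        if (args->namelen == 0)
            error = EINVAL;

        free(args);
        return error;
    }
    ```

    The trade-off is a new failure mode (the allocation can fail) in
    exchange for a much smaller stack frame in each caller.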

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Log forces can occur deep in the call chain when we have relatively
    little stack free. Log forces can also happen close to the call
    chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
    places where we really don't want to add more stack overhead.

    This stack overhead occurs because log forces do foreground CIL
    pushes (xlog_cil_push_foreground()) rather than waking the
    background push wq and waiting for the push to complete.
    This foreground push was done to avoid confusing the CFQ IO
    scheduler when fsync()s were issued, as it has trouble dealing with
    dependent IOs being issued from different process contexts.

    Avoiding blowing the stack is much more critical than performance
    optimisations for CFQ, especially as we've been recommending against
    the use of CFQ for XFS since 3.2 kernels were released because of
    its problems with multi-threaded IO workloads.

    Hence convert xlog_cil_push_foreground() to move the push work
    to the CIL workqueue. We already do the waiting for the push to
    complete in xlog_cil_force_lsn(), so there's nothing else we need to
    modify to make this work.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Dave Chinner
     
  • Modify all read & write verifiers to differentiate
    between CRC errors and other inconsistencies.

    This sets the appropriate error number on bp->b_error,
    and then calls xfs_verifier_error() if something went
    wrong. That function will issue the appropriate message
    to the user.
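
    The shape of a converted read verifier can be sketched in userspace
    as follows. The buffer structure, the model_* names and the error
    values are hypothetical, and the two boolean fields stand in for the
    real checksum and structure checks.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical error codes; the numeric values are arbitrary. */
    enum {
        MODEL_EFSBADCRC = 74,
        MODEL_EFSCORRUPTED = 117
    };

    /* Hypothetical buffer: check results are precomputed for the model. */
    struct model_buf {
        bool crc_ok;        /* checksum matched */
        bool structure_ok;  /* magic number, bounds and so on are sane */
        int b_error;        /* error recorded by the verifier */
    };

    static int model_verifier_reports; /* counts messages issued to the user */

    /* Stand-in for the common reporting function. */
    static void model_verifier_error(const struct model_buf *bp)
    {
        (void)bp;
        model_verifier_reports++;
    }

    /* Record a distinct error for CRC failures and for other
     * inconsistencies, then report once if anything went wrong. */
    static void model_read_verify(struct model_buf *bp)
    {
        if (!bp->crc_ok)
            bp->b_error = MODEL_EFSBADCRC;
        else if (!bp->structure_ok)
            bp->b_error = MODEL_EFSCORRUPTED;

        if (bp->b_error)
            model_verifier_error(bp);
    }
    ```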

    Signed-off-by: Eric Sandeen
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner

    Eric Sandeen
     
  • xfs_error_report used to just print the hex address of the caller;
    %pF will give us something more human-readable.

    Signed-off-by: Eric Sandeen
    Reviewed-by: Jie Liu
    Signed-off-by: Dave Chinner

    Eric Sandeen