Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

17 Apr, 2014

6 commits

330033d69 xfs: fix tmpfile/selinux deadlock and initialize security ...

xfstests generic/004 reproduces an ilock deadlock using the tmpfile
interface when selinux is enabled. This occurs because
xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
latter eventually calls into xfs_xattr_get() which attempts to get the
lock again. E.g.:

xfs_io D ffffffff81c134c0 4096 3561 3560 0x00000080
ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
Call Trace:
[] schedule+0x29/0x70
[] rwsem_down_read_failed+0xc5/0x120
[] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
[] call_rwsem_down_read_failed+0x14/0x30
[] ? down_read_nested+0x89/0xa0
[] ? xfs_ilock+0x122/0x250 [xfs]
[] xfs_ilock+0x122/0x250 [xfs]
[] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
[] xfs_attr_get+0x90/0xe0 [xfs]
[] xfs_xattr_get+0x37/0x50 [xfs]
[] generic_getxattr+0x4f/0x70
[] inode_doinit_with_dentry+0x1ae/0x650
[] selinux_d_instantiate+0x1c/0x20
[] security_d_instantiate+0x1b/0x30
[] d_instantiate+0x50/0x70
[] d_tmpfile+0xb5/0xc0
[] xfs_create_tmpfile+0x362/0x410 [xfs]
[] xfs_vn_tmpfile+0x18/0x20 [xfs]
[] path_openat+0x228/0x6a0
[] ? sched_clock+0x9/0x10
[] ? kvm_clock_read+0x27/0x40
[] ? __alloc_fd+0xaf/0x1f0
[] do_filp_open+0x3a/0x90
[] ? _raw_spin_unlock+0x27/0x40
[] ? __alloc_fd+0xaf/0x1f0
[] do_sys_open+0x12e/0x210
[] SyS_open+0x1e/0x20
[] system_call_fastpath+0x16/0x1b

xfs_vn_tmpfile() also fails to initialize security on the newly created
inode.

Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
has been committed and the inode unlocked. Also, initialize security on
the inode based on the parent directory provided via the tmpfile call.

Signed-off-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Brian Foster
2014-04-17 06:15:30 +0800
8d6c12101 xfs: fix buffer use after free on IO error ...

When testing exhaustion of dm snapshots, the following appeared
with CONFIG_DEBUG_OBJECTS_FREE enabled:

ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs]

indicating that we'd freed a buffer which still had a pending reference,
down this path:

[ 190.867975] [] debug_check_no_obj_freed+0x22b/0x270
[ 190.880820] [] kmem_cache_free+0xd0/0x370
[ 190.892615] [] xfs_buf_free+0xe4/0x210 [xfs]
[ 190.905629] [] xfs_buf_rele+0xe7/0x270 [xfs]
[ 190.911770] [] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs]

At issue is the fact that if IO fails in xfs_buf_iorequest,
we'll queue completion unconditionally, and then call
xfs_buf_rele; but if IO failed, there are no IOs remaining,
and xfs_buf_rele will free the bp while work is still queued.

Fix this by not scheduling completion if the buffer has
an error on it; run it immediately. The rest is only comment
changes.

Thanks to dchinner for spotting the root cause.

Signed-off-by: Eric Sandeen
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Eric Sandeen
2014-04-17 06:15:28 +0800
07d5035a2 xfs: wrong error sign conversion during failed DIO writes ...

We negate the error value being returned from a generic function
incorrectly. The code path that it is running in returned negative
errors, so there is no need to negate it to get the correct error
signs here.

This was uncovered by generic/019.
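The sign bug can be illustrated with a minimal sketch. The commit message does not name the generic function, so `generic_write_helper()` and both callers below are hypothetical stand-ins, not the XFS code:

```c
#include <errno.h>

/* Hypothetical stand-in for a generic helper that already returns
 * negative errno values on failure. */
static int generic_write_helper(int fail)
{
	return fail ? -EIO : 0;
}

/* Buggy caller: negates a value that is already negative, so the
 * failure is reported as a positive number. */
static int dio_write_buggy(int fail)
{
	int error = generic_write_helper(fail);

	return -error;
}

/* Fixed caller: passes the negative error through unchanged. */
static int dio_write_fixed(int fail)
{
	int error = generic_write_helper(fail);

	return error;
}
```

With a failing helper, the buggy caller returns `EIO` (positive) while the fixed caller returns `-EIO`.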

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-04-17 06:15:27 +0800
9c23eccc1 xfs: unmount does not wait for shutdown during unmount ...

An interesting situation can occur if a log IO error occurs during
the unmount of a filesystem. The cases reported have the same
signature - the update of the superblock counters fails due to a log
write IO error:

XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa08a44a1
XFS (dm-16): Log I/O Error Detected. Shutting down filesystem
XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.
XFS (dm-16): xfs_log_force: error 5 returned.
XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)

It can be seen that the last line of output contains a corrupt
device name - this is because the log and xfs_mount structures have
already been freed by the time this message is printed. A kernel
oops closely follows.

The issue is that the shutdown is occurring in a separate IO
completion thread to the unmount. Once the shutdown processing has
started and all the iclogs are marked with XLOG_STATE_IOERROR, the
log shutdown code wakes anyone waiting on a log force so they can
process the shutdown error. This wakes up the unmount code that
is doing a synchronous transaction to update the superblock
counters.

The unmount path now sees all the iclogs are marked with
XLOG_STATE_IOERROR and so never waits on them again, knowing that if
it does, there will not be a wakeup trigger for it and we will hang
the unmount if we do. Hence the unmount runs through all the
remaining code and frees all the filesystem structures while the
xlog_iodone() is still processing the shutdown. When the log
shutdown processing completes, xfs_do_force_shutdown() emits the
"Please umount the filesystem and rectify the problem(s)" message,
and xlog_iodone() then aborts all the objects attached to the iclog.
An iclog that has already been freed....

The real issue here is that there is no serialisation point between
the log IO and the unmount. We have serialisation points for log
writes, log forces, reservations, etc, but we don't actually have
any code that waits for log IO to fully complete. We do that for all
other types of object, so why not iclogbufs?

Well, it turns out that we can easily do this. We've got xfs_buf
handles, and that's what everyone else uses for IO serialisation.
i.e. bp->b_sema. So, let's hold iclogbufs locked over IO, and only
release the lock in xlog_iodone() when we are finished with the
buffer. That way before we tear down the iclog, we can lock and
unlock the buffer to ensure IO completion has finished completely
before we tear it down.

Signed-off-by: Dave Chinner
Tested-by: Mike Snitzer
Tested-by: Bob Mastors
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2014-04-17 06:15:26 +0800
d39a2ced0 xfs: collapse range is delalloc challenged ...

FSX has been detecting data corruption after collapse range
calls. The key observation is that the offset of the last extent in
the file was not being shifted, and hence when the file size was
adjusted it was truncating away data because the extents hadn't
been correctly shifted.

Tracing indicated that before the collapse, the extent list looked
like:

....
ino 0x5788 state idx 6 offset 26 block 195904 count 10 flag 0
ino 0x5788 state idx 7 offset 39 block 195917 count 35 flag 0
ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0

and after the shift of 2 blocks:

ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0
ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0
ino 0x5788 state idx 8 offset 86 block 195964 count 32 flag 0

Note that the last extent did not change offset. After the changing
of the file size:

ino 0x5788 state idx 6 offset 24 block 195904 count 10 flag 0
ino 0x5788 state idx 7 offset 37 block 195917 count 35 flag 0
ino 0x5788 state idx 8 offset 86 block 195964 count 30 flag 0

You can see that the last extent had its length truncated,
indicating that we've lost data.

The reason for this is that the xfs_bmap_shift_extents() loop uses
XFS_IFORK_NEXTENTS() to determine how many extents are in the inode.
This, unfortunately, doesn't take into account delayed allocation
extents - it's a count of physically allocated extents - and hence
when the file being collapsed has a delalloc extent like this one
does prior to the range being collapsed:

....
ino 0x5788 state idx 4 offset 11 block 4503599627239429 count 1 flag 0
....

it gets the count wrong and terminates the shift loop early.

Fix it by using the in-memory extent array size that includes
delayed allocation extents to determine the number of extents on the
inode.
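The early loop termination can be sketched with a simplified model. The `struct extent` layout and helper names below are hypothetical and only illustrate the counting problem, not the actual xfs_bmap_shift_extents() code:

```c
#include <stddef.h>

/* Hypothetical, simplified in-memory extent record. */
struct extent {
	long offset;	/* file offset in blocks */
	long count;	/* length in blocks */
	int delalloc;	/* 1 if delayed allocation (no physical blocks yet) */
};

/* Counts only physically allocated extents, which is how the commit
 * message describes the count used by the buggy loop. */
static size_t count_allocated(const struct extent *ext, size_t total)
{
	size_t n = 0;

	for (size_t i = 0; i < total; i++)
		if (!ext[i].delalloc)
			n++;
	return n;
}

/* Shift extents with index in [start, limit) down by `shift` blocks. */
static void shift_extents(struct extent *ext, size_t start, size_t limit,
			  long shift)
{
	for (size_t i = start; i < limit; i++)
		ext[i].offset -= shift;
}
```

With four in-memory extents of which one is delalloc, using `count_allocated()` as the loop limit stops one index short and leaves the last extent unshifted; using the full array size shifts it.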

Signed-off-by: Dave Chinner
Tested-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-04-17 06:15:25 +0800
0e1f789d0 xfs: don't map ranges that span EOF for direct IO ...

Al Viro tracked down the problem that has caused generic/263 to fail
on XFS since the test was introduced. It is caused by
xfs_get_blocks() mapping a single extent that spans EOF without
marking it as buffer_new() so that the direct IO code does not zero
the tail of the block at the new EOF. This is a long standing bug
that has been around for many, many years.

Because xfs_get_blocks() starts the map before EOF, it can't set
buffer_new(), because that causes the direct IO code to also zero
unaligned sectors at the head of the IO. This would overwrite valid
data with zeros, and hence we cannot validly return a single extent
that spans EOF to direct IO.

Fix this by detecting a mapping that spans EOF and truncate it down
to EOF. This results in the direct IO code doing the right thing
for unaligned data blocks before EOF, and then returning to get
another mapping for the region beyond EOF which XFS treats correctly
by setting buffer_new() on it. This makes direct IO behave correctly
w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.
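A minimal sketch of trimming a mapping that spans EOF, assuming a byte-granular `struct mapping` that is purely hypothetical (the real code works on XFS extent mappings and block units):

```c
/* Hypothetical simplified mapping covering [start, start + len) bytes. */
struct mapping {
	long long start;
	long long len;
};

/* If the mapping starts before EOF and extends past it, trim it so it
 * ends at EOF. Mappings entirely before or entirely beyond EOF are
 * returned unchanged, so the caller comes back for a second mapping
 * that starts at EOF. */
static struct mapping trim_to_eof(struct mapping m, long long eof)
{
	if (m.start < eof && m.start + m.len > eof)
		m.len = eof - m.start;
	return m;
}
```

For example, a mapping of 8192 bytes from offset 0 with EOF at 5000 is trimmed to 5000 bytes, while a mapping starting at 8192 is left alone.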

Again, thanks to Al Viro for finding what I couldn't.

[ dchinner: Fix for __divdi3 build error:

Reported-by: Paul Gortmaker
Tested-by: Paul Gortmaker
Signed-off-by: Mark Tinguely
Reviewed-by: Eric Sandeen
]

Signed-off-by: Dave Chinner
Tested-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-04-17 06:15:19 +0800

14 Apr, 2014

4 commits

897b73b6a xfs: zeroing space needs to punch delalloc blocks ...

When we are zeroing space and it is covered by a delalloc range, we
need to punch the delalloc range out before we truncate the page
cache. Failing to do so leaves an inconsistency between the page
cache and the extent tree, which we later trip over when doing
direct IO over the same range.

Signed-off-by: Dave Chinner
Tested-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-04-14 16:15:11 +0800
aad3f3755 xfs: xfs_vm_write_end truncates too much on failure ...

Similar to the write_begin problem, xfs_vm_write_end will truncate
back to the old EOF, potentially removing page cache from over the
top of delalloc blocks with valid data in them. Fix this by
truncating back to just the start of the failed write.
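The difference between the two truncation points can be sketched as follows. The `struct range` type and function names are hypothetical, and clamping the start so it never falls below the old EOF is an assumption of this sketch rather than something the commit message states:

```c
/* Byte range [start, end) of page cache to discard after a failed write. */
struct range {
	long long start;
	long long end;
};

/* Old behaviour as described above: discard everything from the old
 * EOF to the end of the failed write. */
static struct range discard_old(long long old_eof, long long pos,
				long long len)
{
	struct range r = { old_eof, pos + len };

	return r;
}

/* Fixed behaviour: discard only from the start of the failed write
 * (assumed here to be clamped so it is never below the old EOF). */
static struct range discard_fixed(long long old_eof, long long pos,
				  long long len)
{
	struct range r = { pos > old_eof ? pos : old_eof, pos + len };

	return r;
}
```

With an old EOF of 1024 and a failed 512-byte write at offset 3072, the old range starts at 1024 and covers any earlier delalloc data in between, while the fixed range starts at 3072.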

Signed-off-by: Dave Chinner
Tested-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-04-14 16:14:11 +0800
72ab70a19 xfs: write failure beyond EOF truncates too much data ...

If we fail a write beyond EOF and have to handle it in
xfs_vm_write_begin(), we truncate the inode back to the current inode
size. This doesn't take into account the fact that we may have
already made successful writes to the same page (in the case of block
size < page size) and hence we can truncate the page cache away from
blocks with valid data in them. If these blocks are delayed
allocation blocks, we now have a mismatch between the page cache and
the extent tree, and this will trigger - at minimum - a delayed
block count mismatch assert when the inode is evicted from the cache.
We can also trip over it when block mapping for direct IO - this is
the most common symptom seen from fsx and fsstress when run from
xfstests.

Fix it by only truncating away the exact range we are updating state
for in this write_begin call.

Signed-off-by: Dave Chinner
Tested-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-04-14 16:13:29 +0800
4ab9ed578 xfs: kill buffers over failed write ranges properly ...

When a write fails, if we don't clear the delalloc flags from the
buffers over the failed range, they can persist beyond EOF and cause
problems. Writeback will see the pages in the page cache, see they
are dirty and continually retry the write, assuming that the page
beyond EOF is just racing with a truncate. The page will eventually
be released due to some other operation (e.g. direct IO), and it
will not pass through invalidation because it is dirty. Hence it
will be released with buffer_delay set on it, and trigger warnings
in xfs_vm_releasepage() and assert fail in xfs_file_aio_write_direct
because invalidation failed and we didn't write the correct amount.

This causes failures on block size < page size filesystems in fsx
and fsstress workloads run by xfstests.

Fix it by completely trashing any state on the buffer that could be
used to imply that it contains valid data when the delalloc range
over the buffer is punched out during the failed write handling.

Signed-off-by: Dave Chinner
Tested-by: Brian Foster
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-04-14 16:11:58 +0800

13 Apr, 2014

1 commit

5166701b3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ...

Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.

Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencies between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.

This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...

Linus Torvalds
2014-04-13 05:49:50 +0800

08 Apr, 2014

1 commit

f1820361f mm: implement ->map_pages for page cache ...

filemap_map_pages() is a generic implementation of ->map_pages() for
filesystems that use the page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if the
filesystem uses filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov
Acked-by: Linus Torvalds
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Andi Kleen
Cc: Matthew Wilcox
Cc: Dave Hansen
Cc: Alexander Viro
Cc: Dave Chinner
Cc: Ning Qu
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2014-04-08 07:35:53 +0800

05 Apr, 2014

2 commits

d15e03104 Merge tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs ...

Pull xfs update from Dave Chinner:
"There are a couple of new fallocate features in this request - it was
decided that it was easiest to push them through the XFS tree using
topic branches and have the ext4 support be based on those branches.
Hence you may see some overlap with the ext4 tree merge depending on
how they include those topic branches into their tree. Other than
that, there is O_TMPFILE support, some cleanups and bug fixes.

The main changes in the XFS tree for 3.15-rc1 are:

- O_TMPFILE support
- allowing AIO+DIO writes beyond EOF
- FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
implementation
- FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
implementation
- IO verifier cleanup and rework
- stack usage reduction changes
- vm_map_ram NOIO context fixes to remove lockdep warnings
- various bug fixes and cleanups"

* tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
xfs: fix directory hash ordering bug
xfs: extra semi-colon breaks a condition
xfs: Add support for FALLOC_FL_ZERO_RANGE
fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
xfs: inode log reservations are still too small
xfs: xfs_check_page_type buffer checks need help
xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
xfs: use NOIO contexts for vm_map_ram
xfs: don't leak EFSBADCRC to userspace
xfs: fix directory inode iolock lockdep false positive
xfs: allocate xfs_da_args to reduce stack footprint
xfs: always do log forces via the workqueue
xfs: modify verifiers to differentiate CRC from other errors
xfs: print useful caller information in xfs_error_report
xfs: add xfs_verifier_error()
xfs: add helper for updating checksums on xfs_bufs
xfs: add helper for verifying checksums on xfs_bufs
xfs: Use defines for CRC offsets in all cases
xfs: skip pointless CRC updates after verifier failures
xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
...

Linus Torvalds
2014-04-05 06:50:08 +0800
24e7ea3be Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ...

Pull ext4 updates from Ted Ts'o:
"Major changes for 3.15 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.

Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
ext4: fix premature freeing of partial clusters split across leaf blocks
ext4: remove unneeded test of ret variable
ext4: fix comment typo
ext4: make ext4_block_zero_page_range static
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
ext4: optimize Hurd tests when reading/writing inodes
ext4: kill i_version support for Hurd-castrated file systems
ext4: each filesystem creates and uses its own mb_cache
fs/mbcache.c: doucple the locking of local from global data
fs/mbcache.c: change block and index hash chain to hlist_bl_node
ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
ext4: refactor ext4_fallocate code
ext4: Update inode i_size after the preallocation
ext4: fix partial cluster handling for bigalloc file systems
ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
ext4: only call sync_filesystm() when remounting read-only
fs: push sync_filesystem() down to the file system's remount_fs()
jbd2: improve error messages for inconsistent journal heads
jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
jbd2: minimize region locked by j_list_lock in journal_get_create_access()
...

Linus Torvalds
2014-04-05 06:39:39 +0800

04 Apr, 2014

4 commits

91b0abe36 mm + fs: store shadow entries in page cache ...

Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Bob Liu
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Greg Thelen
Cc: Hugh Dickins
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Luigi Semenzato
Cc: Mel Gorman
Cc: Metin Doslu
Cc: Michel Lespinasse
Cc: Ozgun Erdogan
Cc: Peter Zijlstra
Cc: Roman Gushchin
Cc: Ryan Mallon
Cc: Tejun Heo
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2014-04-04 07:21:01 +0800
a6cf33bc5 Merge branch 'xfs-bug-fixes-for-3.15-3' into for-next

Dave Chinner
2014-04-04 05:07:35 +0800
c88547a81 xfs: fix directory hash ordering bug ...

Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
in 3.10 incorrectly converted the btree hash index array pointer in
xfs_da3_fixhashpath(). It resulted in the current hash always
being compared against the first entry in the btree rather than the
current block index into the btree block's hash entry array. As a
result, it was comparing the wrong hashes, and so could misorder the
entries in the btree.
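The indexing mistake can be sketched in isolation. The `struct node_entry` type and both functions are hypothetical simplifications and do not reproduce xfs_da3_fixhashpath():

```c
#include <stdint.h>

/* Hypothetical simplified btree hash entry. */
struct node_entry {
	uint32_t hashval;
};

/* Buggy: dereferences the array pointer directly, so it always
 * compares against the first entry regardless of `index`. */
static int hash_matches_buggy(const struct node_entry *btree, int index,
			      uint32_t lasthash)
{
	(void)index;
	return btree->hashval == lasthash;
}

/* Fixed: compares against the entry at the current block index. */
static int hash_matches_fixed(const struct node_entry *btree, int index,
			      uint32_t lasthash)
{
	return btree[index].hashval == lasthash;
}
```

For an array with hashes 10, 20, 20, 30 and index 2, the buggy version compares 20 against entry 0 (hash 10) and reports a mismatch, while the fixed version compares against entry 2 and reports a match.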

For most cases, this doesn't cause any problems as it requires hash
collisions to expose the ordering problem. However, when there are
hash collisions within a directory there is a very good probability
that the entries will be ordered incorrectly and that actually
matters when duplicate hashes are placed into or removed from the
btree block hash entry array.

This bug results in an on-disk directory corruption and that results
in directory verifier functions throwing corruption warnings into
the logs. While no data or directory entries are lost, access to
them may be compromised, and attempts to remove entries from a
directory that has suffered from this corruption may result in a
filesystem shutdown. xfs_repair will fix the directory hash
ordering without data loss occurring.

[dchinner: wrote a useful commit message]

cc:
Reported-by: Hannes Frederic Sowa
Signed-off-by: Mark Tinguely
Reviewed-by: Ben Myers
Signed-off-by: Dave Chinner

Mark Tinguely
2014-04-04 04:10:49 +0800
805eeb8e0 xfs: extra semi-colon breaks a condition ...

There were some extra semi-colons here which mean that we return true
unintentionally.
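The effect of a stray semicolon after an `if` condition can be shown with a small sketch. The function names and the integer `type` comparison are hypothetical and only mirror the shape of the bug:

```c
/* Buggy: the semicolon ends the if statement with an empty body, so
 * the following return runs unconditionally. */
static int type_matches_buggy(int type, int wanted)
{
	if (type == wanted);	/* stray semicolon: empty if body */
		return 1;	/* always executed */
	return 0;		/* never reached */
}

/* Fixed: the return is the body of the if statement. */
static int type_matches_fixed(int type, int wanted)
{
	if (type == wanted)
		return 1;
	return 0;
}
```

The buggy version returns 1 even when `type` and `wanted` differ. Some compilers can flag the empty body with a warning, depending on the warning options in use.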

Fixes: a49935f200e2 ('xfs: xfs_check_page_type buffer checks need help')
Signed-off-by: Dan Carpenter
Reviewed-by: Brian Foster
Reviewed-by: Eric Sandeen
Signed-off-by: Dave Chinner

Dan Carpenter
2014-04-04 03:56:30 +0800

02 Apr, 2014

4 commits

0a64bc2c0 xfs_file_buffered_aio_write(): switch to generic_perform_write() ...

Signed-off-by: Al Viro

Al Viro
2014-04-02 11:19:36 +0800
5cb6c6c7e generic_file_direct_write(): get rid of ppos argument ...

always equal to &iocb->ki_pos.

Signed-off-by: Al Viro

Al Viro
2014-04-02 11:19:35 +0800
fcacafd26 kill the 5th argument of generic_file_buffered_write() ...

same story - it's &iocb->ki_pos in all cases

Signed-off-by: Al Viro

Al Viro
2014-04-02 11:19:34 +0800
5d826c847 new helper: readlink_copy() ...

Signed-off-by: Al Viro

Al Viro
2014-04-02 11:19:15 +0800

13 Mar, 2014

7 commits

02b9984d6 fs: push sync_filesystem() down to the file system's remount_fs() ...

Previously, the no-op "mount -o remount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs(). This seems pretty stupid, and it's certainly
not documented or guaranteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior. In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o"
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig
Cc: Artem Bityutskiy
Cc: Adrian Hunter
Cc: Evgeniy Dushistov
Cc: Jan Kara
Cc: OGAWA Hirofumi
Cc: Anders Larsen
Cc: Phillip Lougher
Cc: Kees Cook
Cc: Mikulas Patocka
Cc: Petr Vandrovec
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org

Theodore Ts'o
2014-03-13 22:14:33 +0800
fe986f9d8 Merge branch 'xfs-O_TMPFILE-support' into for-next ...

Conflicts:
fs/xfs/xfs_trans_resv.c
- fix for XFS_INODE_CLUSTER_SIZE macro removal

Dave Chinner
2014-03-13 16:14:43 +0800
5f44e4c18 Merge branch 'xfs-bug-fixes-for-3.15-2' into for-next

Dave Chinner
2014-03-13 16:13:05 +0800
49ae4b97d Merge branch 'xfs-verifier-cleanup' into for-next

Dave Chinner
2014-03-13 16:12:33 +0800
730357a5c Merge branch 'xfs-stack-fixes' into for-next

Dave Chinner
2014-03-13 16:12:13 +0800
b6db0551f Merge branch 'xfs-collapse-range' into for-next

Dave Chinner
2014-03-13 16:11:06 +0800
376ba3131 xfs: Add support for FALLOC_FL_ZERO_RANGE ...

Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

We can also preallocate blocks past EOF in the same way as with
fallocate. Flag FALLOC_FL_KEEP_SIZE will cause the inode size to remain
the same even if we preallocate blocks past EOF.
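From userspace, the flag combination described above might be used as in the sketch below. The wrapper name `zero_range_keep_size()` is hypothetical; `fallocate()`, `FALLOC_FL_ZERO_RANGE` and `FALLOC_FL_KEEP_SIZE` are the Linux interfaces, and whether the call succeeds depends on kernel and filesystem support:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/* Hypothetical wrapper: zero the byte range [offset, offset + len) of
 * an open file without changing the reported file size. Returns 0 on
 * success or a negative errno value on failure (for example
 * -EOPNOTSUPP when the filesystem does not support the operation). */
static int zero_range_keep_size(int fd, off_t offset, off_t len)
{
	int mode = FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE;

	if (fallocate(fd, mode, offset, len) == -1)
		return -errno;
	return 0;
}
```

Dropping `FALLOC_FL_KEEP_SIZE` from `mode` would allow the file size to grow when the range extends past EOF.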

It uses the same code to zero the range as is used by the
XFS_IOC_ZERO_RANGE ioctl.

Signed-off-by: Lukas Czerner
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Lukas Czerner
2014-03-13 16:07:58 +0800

07 Mar, 2014

5 commits

fe4c224aa xfs: inode log reservations are still too small ...

Back in commit 23956703 ("xfs: inode log reservations are too
small"), the reservation size was increased to take into account the
difference in size between the in-memory BMBT block headers and the
on-disk BMDR headers. This solved a transaction overrun when logging
the inode size.

Recently, however, we've seen a number of these same overruns on
kernels with the above fix in it. All of them have been by 4 bytes,
so we must still not be accounting for something correctly.

Through inspection it turns out the above commit didn't take into
account everything it should have. That is, it only accounts for a
single log op_hdr structure, when it can actually require up to four
op_hdrs - one for each region (log iovec) that is formatted. These
regions are the inode log format header, the inode core, and the two
forks that can be held in the literal area of the inode.

This means we are not accounting for 36 bytes of log space that the
transaction can use, and hence when we get inodes in certain formats
with particular fragmentation patterns we can overrun the
transaction. Fix this by adding the correct accounting for log
op_headers in the transaction.
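The 36-byte figure can be checked with simple arithmetic. The sketch below assumes a 12-byte log op header, a value chosen because it is consistent with the numbers quoted above (three unaccounted headers totalling 36 bytes); the macro and function names are hypothetical:

```c
/* Assumed size in bytes of one log operation header. */
#define OP_HDR_SIZE		12u

/* Regions listed above: inode log format header, inode core, and the
 * two forks held in the literal area of the inode. */
#define INODE_LOG_REGIONS	4u

/* Header space needed when `nregions` regions are formatted. */
static unsigned int op_hdr_bytes(unsigned int nregions)
{
	return nregions * OP_HDR_SIZE;
}

/* Space that goes unaccounted when only one header is reserved. */
static unsigned int unaccounted_bytes(void)
{
	return op_hdr_bytes(INODE_LOG_REGIONS) - op_hdr_bytes(1);
}
```

Under that assumption, four headers need 48 bytes, one header accounts for 12, and the difference is the 36 bytes mentioned in the commit message.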

Tested-by: Brian Foster
Signed-off-by: Dave Chinner
Reviewed-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-03-07 13:19:14 +0800
a49935f20 xfs: xfs_check_page_type buffer checks need help ...

xfs_aops_discard_page() was introduced in the following commit:

xfs: truncate delalloc extents when IO fails in writeback

... to clean up left over delalloc ranges after I/O failure in
->writepage(). generic/224 tests for this scenario and occasionally
reproduces panics on sub-4k blocksize filesystems.

The cause of this is failure to clean up the delalloc range on a
page where the first buffer does not match one of the expected
states of xfs_check_page_type(). If a buffer is not unwritten,
delayed or dirty&mapped, xfs_check_page_type() stops and
immediately returns 0.

The stress test of generic/224 creates a scenario where the first
several buffers of a page with delayed buffers are mapped & uptodate
and some subsequent buffer is delayed. If the ->writepage() happens
to fail for this page, xfs_aops_discard_page() incorrectly skips
the entire page.

This then causes later failures either when direct IO maps the range
and finds the stale delayed buffer, or we evict the inode and find
that the inode still has a delayed block reservation accounted to
it.

We can easily fix this xfs_aops_discard_page() failure by making
xfs_check_page_type() check all buffers, but this breaks
xfs_convert_page() more than it is already broken. Indeed,
xfs_convert_page() wants xfs_check_page_type() to tell it if the
first buffers on the pages are of a type that can be aggregated into
the contiguous IO that is already being built.

xfs_convert_page() should not be writing random buffers out of a
page, but the current behaviour will cause it to do so if there are
buffers that don't match the current specification on the page.
Hence for xfs_convert_page() we need to:

a) return "not ok" if the first buffer on the page does not
match the specification provided so we don't write anything;
and
b) abort its buffer-add-to-io loop the moment we come
across a buffer that does not match the specification.

Hence we need to fix both xfs_check_page_type() and
xfs_convert_page() to work correctly with pages that have mixed
buffer types, whilst allowing xfs_aops_discard_page() to scan all
buffers on the page for a type match.

Reported-by: Brian Foster
Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-03-07 13:19:14 +0800
e480a7239 xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation ...

The inode chunk allocation path can lead to deadlock conditions if
a transaction is dirtied with an AGF (to fix up the freelist) for
an AG that cannot satisfy the actual allocation request. This code
path is written to try and avoid this scenario, but it can be
reproduced by running xfstests generic/270 in a loop on a 512b fs.

An example situation is:
- process A attempts an inode allocation on AG 3, modifies
the freelist, fails the allocation and ultimately moves on to
AG 0 with the AG 3 AGF held
- process B is doing a free space operation (i.e., truncate) and
acquires the AG 0 AGF, waits on the AG 3 AGF
- process A acquires the AG 0 AGI, waits on the AG 0 AGF (deadlock)

The problem here is that process A acquired the AG 3 AGF while
moving on to AG 0 (and releasing the AG 3 AGI with the AG 3 AGF
held). xfs_dialloc() makes one pass through each of the AGs when
attempting to allocate an inode chunk. The expectation is a clean
transaction if a particular AG cannot satisfy the allocation
request. xfs_ialloc_ag_alloc() is written to support this through
use of the minalignslop allocation args field.

When using the agi->agi_newino optimization, we attempt an exact
bno allocation request based on the location of the previously
allocated chunk. minalignslop is set to inform the allocator that
we will require alignment on this chunk, and thus to not allow the
request for this AG if the extra space is not available. Suppose
that the AG in question has just enough space for this request, but
not at the requested bno. xfs_alloc_fix_freelist() will proceed as
normal as it determines the request should succeed, and thus it is
allowed to modify the agf. xfs_alloc_ag_vextent() ultimately fails
because the requested bno is not available. In response, the caller
moves on to a NEAR_BNO allocation request for the same AG. The
alignment is set, but the minalignslop field is never reset. This
increases the overall requirement of the request from the first
attempt. If this delta is the difference between allocation success
and failure for the AG, xfs_alloc_fix_freelist() rejects this
request outright the second time around and causes the allocation
request to unnecessarily fail for this AG.

To address this situation, reset the minalignslop field immediately
after use and prevent it from leaking into subsequent requests.
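
The effect of the leaked field can be illustrated with a toy model. This is a sketch under stated assumptions, not the XFS implementation: `struct toy_alloc_args`, `toy_fix_freelist_ok()` and `toy_retry_allowed()` are hypothetical names, and the free space check is reduced to `minlen + alignment + minalignslop <= free blocks`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the allocation args. */
struct toy_alloc_args {
    unsigned minlen;       /* blocks the request needs */
    unsigned alignment;    /* required alignment, in blocks */
    unsigned minalignslop; /* extra blocks reserved for a later aligned retry */
};

/* Toy free space check: the worst-case requirement must fit in the AG. */
bool toy_fix_freelist_ok(const struct toy_alloc_args *args, unsigned ag_free)
{
    unsigned need = args->minlen + args->alignment + args->minalignslop;
    return need <= ag_free;
}

/* Returns whether the NEAR_BNO retry passes the free space check. */
bool toy_retry_allowed(unsigned ag_free, unsigned minlen,
                       unsigned cluster_align, bool reset_slop)
{
    assert(cluster_align >= 1);
    struct toy_alloc_args args = { .minlen = minlen };

    /* Attempt 1: exact bno, unaligned, with slop reserved so that an
     * aligned retry is known to fit before the AGF is modified. */
    args.alignment = 1;
    args.minalignslop = cluster_align - 1;
    if (!toy_fix_freelist_ok(&args, ag_free))
        return false; /* AG rejected before anything is dirtied */

    /* Assume the exact bno is unavailable, so attempt 1 fails here. */
    if (reset_slop)
        args.minalignslop = 0; /* the fix: do not let the slop leak */

    /* Attempt 2: NEAR_BNO with the real alignment. */
    args.alignment = cluster_align;
    return toy_fix_freelist_ok(&args, ag_free);
}
```

With illustrative numbers of 72 free blocks, a 64 block request and an alignment of 8, the first attempt needs 64 + 1 + 7 = 72 blocks and passes. Without the reset the retry needs 64 + 8 + 7 = 79 and is rejected; with the reset it needs 64 + 8 = 72 and passes.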

Signed-off-by: Brian Foster
Reviewed-by: Mark Tinguely
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner

Brian Foster
2014-03-07 13:19:14 +0800
ae687e58b xfs: use NOIO contexts for vm_map_ram

When we map pages in the buffer cache, we can do so in GFP_NOFS
contexts. However, the vmap interfaces do not provide any method of
communicating this information to memory reclaim, and hence we get
lockdep complaining about it regularly and occasionally see hangs
that may be vmap related reclaim deadlocks. We can also see these
same problems from anywhere where we use vmalloc for a large buffer
(e.g. attribute code) inside a transaction context.

A typical lockdep report shows up as a reclaim state warning like so:

[14046.101458] =================================
[14046.102850] [ INFO: inconsistent lock state ]
[14046.102850] 3.14.0-rc4+ #2 Not tainted
[14046.102850] ---------------------------------
[14046.102850] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[14046.102850] kswapd0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
[14046.102850] (&xfs_dir_ilock_class){++++?+}, at: [] xfs_ilock+0xff/0x16a
[14046.102850] {RECLAIM_FS-ON-W} state was registered at:
[14046.102850] [] mark_held_locks+0x81/0xe7
[14046.102850] [] lockdep_trace_alloc+0x5c/0xb4
[14046.102850] [] kmem_cache_alloc_trace+0x2b/0x11e
[14046.102850] [] vm_map_ram+0x119/0x3e6
[14046.102850] [] _xfs_buf_map_pages+0x5b/0xcf
[14046.102850] [] xfs_buf_get_map+0x67/0x13f
[14046.102850] [] xfs_attr_rmtval_set+0x396/0x4d5
[14046.102850] [] xfs_attr_leaf_addname+0x18f/0x37d
[14046.102850] [] xfs_attr_set_int+0x2f5/0x3e8
[14046.102850] [] xfs_attr_set+0x6b/0x74
[14046.102850] [] xfs_xattr_set+0x61/0x81
[14046.102850] [] generic_setxattr+0x59/0x68
[14046.102850] [] __vfs_setxattr_noperm+0x58/0xce
[14046.102850] [] vfs_setxattr+0x8e/0x92
[14046.102850] [] setxattr+0xcf/0x159
[14046.102850] [] SyS_lsetxattr+0x88/0xbb
[14046.102850] [] sysenter_do_call+0x12/0x36

Now, we can't completely remove these traces - mainly because
vm_map_ram() will do GFP_KERNEL allocation and that generates the
above warning before we get into the reclaim code, but we can turn
them all into false positive warnings.

To do that, use the method that DM and other IO context code uses to
avoid this problem: there is a process flag to tell memory reclaim
not to do IO that we can set appropriately. That prevents GFP_KERNEL
context reclaim being done from deep inside the vmalloc code in
places we can't directly pass a GFP_NOFS context to. That interface
has a pair of wrapper functions: memalloc_noio_save() and
memalloc_noio_restore().

Adding them around vm_map_ram and the vzalloc call in
kmem_alloc_large() will prevent deadlocks and most lockdep reports
for this issue. Also, convert the vzalloc() call in
kmem_alloc_large() to use __vmalloc() so that we can pass the
correct gfp context to the data page allocation routine inside
__vmalloc(). This makes it clear that GFP_NOFS context is important
to this vmalloc call.
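
The save/restore pairing can be sketched in user space. This is a mock under stated assumptions, not kernel code: `toy_task_flags` stands in for the task's flags word, `TOY_PF_MEMALLOC_NOIO` for the process flag, and `toy_map_ram()` for a vmap-style call that cannot be handed a gfp context. Only the names memalloc_noio_save() and memalloc_noio_restore() come from the commit message; the `toy_` functions imitate their behaviour as assumed here.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PF_MEMALLOC_NOIO 0x1u /* stand-in for the kernel process flag */

unsigned int toy_task_flags;      /* stand-in for the current task's flags */

/* Set the NOIO flag and return its previous state so nesting works. */
unsigned int toy_noio_save(void)
{
    unsigned int old = toy_task_flags & TOY_PF_MEMALLOC_NOIO;
    toy_task_flags |= TOY_PF_MEMALLOC_NOIO;
    return old;
}

/* Put the NOIO flag back to the state returned by the matching save. */
void toy_noio_restore(unsigned int old)
{
    toy_task_flags = (toy_task_flags & ~TOY_PF_MEMALLOC_NOIO) | old;
}

/* Stand-in for a mapping call with no gfp argument. It records whether
 * the allocation happened while the NOIO flag was set. */
void *toy_map_ram(size_t size, int *saw_noio)
{
    assert(saw_noio != NULL);
    *saw_noio = (toy_task_flags & TOY_PF_MEMALLOC_NOIO) != 0;
    return malloc(size);
}

/* The pattern from the commit message: wrap the call in save/restore. */
void *toy_buf_map_pages(size_t size, int *saw_noio)
{
    unsigned int noio_flag = toy_noio_save();
    void *addr = toy_map_ram(size, saw_noio);
    toy_noio_restore(noio_flag);
    return addr;
}
```

Returning the old flag state from the save call, rather than clearing the flag unconditionally on restore, keeps an outer NOIO section intact when the wrapped function is called from inside one.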

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-03-07 13:19:14 +0800
ac75a1f7a xfs: don't leak EFSBADCRC to userspace

While the verifier routines may return EFSBADCRC when a buffer has
a bad CRC, we need to translate that to EFSCORRUPTED so that the
higher layers treat the error appropriately and we return a
consistent error to userspace. This fixes an xfs/005 regression.
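
A minimal sketch of that translation step, assuming EFSCORRUPTED and EFSBADCRC map to EUCLEAN and EBADMSG and that errors are positive numbers at this layer. The `TOY_` macros and `toy_read_error_for_caller()` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <errno.h>

/* Fallbacks for hosts whose <errno.h> lacks these; values are illustrative. */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif
#ifndef EBADMSG
#define EBADMSG 74
#endif

#define TOY_EFSCORRUPTED EUCLEAN /* assumed mapping */
#define TOY_EFSBADCRC    EBADMSG /* assumed mapping */

/* Translate a verifier result into the error the higher layers see:
 * a bad CRC is reported as corruption, anything else passes through. */
int toy_read_error_for_caller(int verifier_error)
{
    assert(verifier_error >= 0);
    if (verifier_error == TOY_EFSBADCRC)
        return TOY_EFSCORRUPTED;
    return verifier_error;
}
```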

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2014-03-07 13:19:14 +0800

28 Feb, 2014

1 commit

8d7531825 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull filesystem fixes from Jan Kara:
"Notification, writeback, udf, quota fixes

The notification patches are (with one exception) a fallout of my
fsnotify rework which went into -rc1 (I've extended LTP to cover these
corner cases to avoid similar breakage in future).

The UDF patch is a nasty data corruption Al has recently reported,
the revert of the writeback patch is due to possibility of violating
sync(2) guarantees, and a quota bug can lead to corruption of quota
files in ocfs2"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: Allocate overflow events with proper type
fanotify: Handle overflow in case of permission events
fsnotify: Fix detection whether overflow event is queued
Revert "writeback: do not sync data dirtied after sync start"
quota: Fix race between dqput() and dquot_scan_active()
udf: Fix data corruption on file type conversion
inotify: Fix reporting of cookies for inotify events

Linus Torvalds
2014-02-28 02:37:22 +0800

27 Feb, 2014

5 commits

93a8614e3 xfs: fix directory inode iolock lockdep false positive

The change to add the IO lock to protect the directory extent map
during readdir operations has caused lockdep to have a heart attack
as it now sees a different locking order on inodes w.r.t. the
mmap_sem because readdir has a different ordering to write().

Add a new lockdep class for directory inodes to avoid this false
positive.

Signed-off-by: Dave Chinner
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner

Dave Chinner
2014-02-27 13:51:39 +0800
a1358aa3d xfs: allocate xfs_da_args to reduce stack footprint

The struct xfs_da_args used to pass directory/attribute operation
information to the lower layers is 128 bytes in size and is
allocated on the stack. Dynamically allocate it to reduce the
stack footprint of directory operations.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2014-02-27 13:51:26 +0800
f876e4460 xfs: always do log forces via the workqueue

Log forces can occur deep in the call chain when we have relatively
little stack free. Log forces can also happen at close to the call
chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
places where we really don't want to add more stack overhead.

This stack overhead occurs because log forces do foreground CIL
pushes (xlog_cil_push_foreground()) rather than waking the
background push wq and waiting for the push to complete.
This foreground push was done to avoid confusing the CFQ IO
scheduler when fsync()s were issued, as it has trouble dealing with
dependent IOs being issued from different process contexts.

Avoiding blowing the stack is much more critical than performance
optimisations for CFQ, especially as we've been recommending against
the use of CFQ for XFS since 3.2 kernels were released because of
its problems with multi-threaded IO workloads.

Hence convert xlog_cil_push_foreground() to move the push work
to the CIL workqueue. We already do the waiting for the push to
complete in xlog_cil_force_lsn(), so there's nothing else we need to
modify to make this work.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Dave Chinner
2014-02-27 13:40:42 +0800
ce5028cfe xfs: modify verifiers to differentiate CRC from other errors

Modify all read & write verifiers to differentiate
between CRC errors and other inconsistencies.

This sets the appropriate error number on bp->b_error,
and then calls xfs_verifier_error() if something went
wrong. That function will issue the appropriate message
to the user.
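
The verifier shape described above can be sketched in user space as follows. Everything here is a hypothetical stand-in: `struct toy_buf` for the buffer, `toy_checksum()` for the real CRC (it is a simple multiplicative sum, not a CRC), the `TOY_` error values for the XFS error numbers, and `toy_verifier_error()` for xfs_verifier_error().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_EFSBADCRC    74   /* illustrative values only */
#define TOY_EFSCORRUPTED 117
#define TOY_MAGIC        0x58u

struct toy_buf {
    uint8_t  data[16];   /* data[0] holds the magic byte */
    uint32_t stored_sum; /* checksum recorded when the buffer was written */
    int      b_error;    /* 0 when the buffer verified cleanly */
};

/* Stand-in for the real CRC: a simple position-sensitive checksum. */
uint32_t toy_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + p[i];
    return sum;
}

/* Report which kind of failure was recorded on the buffer. */
void toy_verifier_error(const struct toy_buf *bp)
{
    fprintf(stderr, "metadata %s detected\n",
            bp->b_error == TOY_EFSBADCRC ? "CRC error" : "corruption");
}

/* Check the checksum first, then the structure; record distinct errors. */
void toy_read_verify(struct toy_buf *bp)
{
    assert(bp != NULL);
    if (toy_checksum(bp->data, sizeof(bp->data)) != bp->stored_sum)
        bp->b_error = TOY_EFSBADCRC;
    else if (bp->data[0] != TOY_MAGIC)
        bp->b_error = TOY_EFSCORRUPTED;

    if (bp->b_error)
        toy_verifier_error(bp);
}
```

Keeping the error on the buffer and reporting from one helper means each verifier only has to decide which error number applies.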

Signed-off-by: Eric Sandeen
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner

Eric Sandeen
2014-02-27 12:23:10 +0800
db9355c29 xfs: print useful caller information in xfs_error_report

xfs_error_report used to just print the hex address of the caller;
%pF will give us something more human-readable.

Signed-off-by: Eric Sandeen
Reviewed-by: Jie Liu
Signed-off-by: Dave Chinner

Eric Sandeen
2014-02-27 12:21:37 +0800