14 Nov, 2014
1 commit
-
fsnotify() needs to merge inode and mount marks lists when notifying
groups about events so that ignore masks from inode marks are reflected
in mount mark notifications and groups are notified in proper order
(according to priorities).Currently the sorting of the lists done by fsnotify_add_inode_mark() /
fsnotify_add_vfsmount_mark() and fsnotify() differed which resulted
ignore masks not being used in some cases.Fix the problem by always using the same comparison function when
sorting / merging the mark lists.Thanks to Heinrich Schuchardt for improvements of my patch.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721
Signed-off-by: Jan Kara
Reported-by: Heinrich Schuchardt
Tested-by: Heinrich Schuchardt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Nov, 2014
1 commit
-
Pull btrfs fix from Chris Mason:
"It's a one liner for an error cleanup path that leads to crashes"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
08 Nov, 2014
1 commit
-
Pull xfs fixes from Dave Chinner:
"This update fixes a warning in the new pagecache_isize_extended() and
updates some related comments, another fix for zero-range
misbehaviour, and an unforntuately large set of fixes for regressions
in the bulkstat code.The bulkstat fixes are large but necessary. I wouldn't normally push
such a rework for a -rcX update, but right now xfsdump can silently
create incomplete dumps on 3.17 and it's possible that even xfsrestore
won't notice that the dumps were incomplete. Hence we need to get
this update into 3.17-stable kernels ASAP.In more detail, the refactoring work I committed in 3.17 has exposed a
major hole in our QA coverage. With both xfsdump (the major user of
bulkstat) and xfsrestore silently ignoring missing files in the
dump/restore process, incomplete dumps were going unnoticed if they
were being triggered. Many of the dump/restore filesets were so small
that they didn't evenhave a chance of triggering the loop iteration
bugs we introduced in 3.17, so we didn't exercise the code
sufficiently, either.We have already taken steps to improve QA coverage in xfstests to
avoid this happening again, and I've done a lot of manual verification
of dump/restore on very large data sets (tens of millions of inodes)
of the past week to verify this patch set results in bulkstat behaving
the same way as it does on 3.16.Unfortunately, the fixes are not exactly simple - in tracking down the
problem historic API warts were discovered (e.g xfsdump has been
working around a 20 year old bug in the bulkstat API for the past 10
years) and so that complicated the process of diagnosing and fixing
the problems. i.e. we had to fix bugs in the code as well as
discover and re-introduce the userspace visible API bugs that we
unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.Summary:
- incorrect warnings about i_mutex locking in pagecache_isize_extended()
and updates comments to match expected locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regression in the bulkstat code introduced in
3.17"* tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: track bulkstat progress by agino
xfs: bulkstat error handling is broken
xfs: bulkstat main loop logic is a mess
xfs: bulkstat chunk-formatter has issues
xfs: bulkstat chunk formatting cursor is broken
xfs: bulkstat btree walk doesn't terminate
mm: Fix comment before truncate_setsize()
xfs: rework zero range to prevent invalid i_size updates
mm: Remove false WARN_ON from pagecache_isize_extended()
xfs: Check error during inode btree iteration in xfs_bulkstat()
xfs: bulkstat doesn't release AGI buffer on error
07 Nov, 2014
6 commits
-
The bulkstat main loop progress is tracked by the "lastino"
variable, which is a full 64 bit inode. However, the loop actually
works on agno/agino pairs, and so there's a significant disconnect
between the rest of the loop and the main cursor. Convert this to
use the agino, and pass the agino into the chunk formatting function
and convert it too.This gets rid of the inconsistency in the loop processing, and
finally makes it simple for us to skip inodes at any point in the
loop simply by incrementing the agino cursor.cc: # 3.17
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner -
The error propagation is a horror - xfs_bulkstat() returns
a rval variable which is only set if there are formatter errors. Any
sort of btree walk error or corruption will cause the bulkstat walk
to terminate but will not pass an error back to userspace. Worse
is the fact that formatter errors will also be ignored if any inodes
were correctly formatted into the user buffer.Hence bulkstat can fail badly yet still report success to userspace.
This causes significant issues with xfsdump not dumping everything
in the filesystem yet reporting success. It's not until a restore
fails that there is any indication that the dump was bad and tha
bulkstat failed. This patch now triggers xfsdump to fail with
bulkstat errors rather than silently missing files in the dump.This now causes bulkstat to fail when the lastino cookie does not
fall inside an existing inode chunk. The pre-3.17 code tolerated
that error by allowing the code to move to the next inode chunk
as the agino target is guaranteed to fall into the next btree
record.With the fixes up to this point in the series, xfsdump now passes on
the troublesome filesystem image that exposes all these bugs.cc:
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster -
There are a bunch of variables tha tare more wildy scoped than they
need to be, obfuscated user buffer checks and tortured "next inode"
tracking. This all needs cleaning up to expose the real issues that
need fixing.cc: # 3.17
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner -
The loop construct has issues:
- clustidx is completely unused, so remove it.
- the loop tries to be smart by terminating when the
"freecount" tells it that all inodes are free. Just drop
it as in most cases we have to scan all inodes in the
chunk anyway.
- move the "user buffer left" condition check to the only
point where we consume space int eh user buffer.
- move the initialisation of agino out of the loop, leaving
just a simple loop control logic using the clusteridx.Also, double handling of the user buffer variables leads to problems
tracking the current state - use the cursor variables directly
rather than keeping local copies and then having to update the
cursor before returning.cc: # 3.17
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner -
The xfs_bulkstat_agichunk formatting cursor takes buffer values from
the main loop and passes them via the structure to the chunk
formatter, and the writes the changed values back into the main loop
local variables. Unfortunately, this complex dance is full of corner
cases that aren't handled correctly.The biggest problem is that it is double handling the information in
both the main loop and the chunk formatting function, leading to
inconsistent updates and endless loops where progress is not made.To fix this, push the struct xfs_bulkstat_agichunk outwards to be
the primary holder of user buffer information. this removes the
double handling in the main loop.Also, pass the last inode processed by the chunk formatter as a
separate parameter as it purely an output variable and is not
related to the user buffer consumption cursor.Finally, the chunk formatting code is not shared by anyone, so make
it local to xfs_itable.c.cc: # 3.17
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner -
The bulkstat code has several different ways of detecting the end of
an AG when doing a walk. They are not consistently detected, and the
code that checks for the end of AG conditions is not consistently
coded. Hence the are conditions where the walk code can get stuck in
an endless loop making no progress and not triggering any
termination conditions.Convert all the "tmp/i" status return codes from btree operations
to a common name (stat) and apply end-of-ag detection to these
operations consistently.cc: # 3.17
Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Dave Chinner
06 Nov, 2014
1 commit
-
uninitialized msghdr. Broken in "ocfs2: don't open-code kernel_recvmsg()"
by me ;-/Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Al Viro
05 Nov, 2014
1 commit
-
ovl_cache_put() can be called from ovl_dir_reset() if the cache needs to be
rebuilt. We did list_del() on the cursor, which results in an Oops on the
poisoned pointer in ovl_seek_cursor().Reported-by: Jordi Pujol Palomer
Signed-off-by: Miklos Szeredi
Tested-by: Jordi Pujol Palomer
Signed-off-by: Al Viro
04 Nov, 2014
1 commit
-
If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
the csums we allocate and free them. But the code was using list_entry
incorrectly, and ended up trying to free the on-stack list_head instead.This bug came from commit 0678b6185
btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()
Signed-off-by: Chris Mason
Reported-by: Erik Berg
cc: stable@vger.kernel.org # 3.3 or newer
03 Nov, 2014
1 commit
-
Pull VFS fixes from Al Viro:
"A bunch of assorted fixes, most of them followups to overlayfs merge"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ovl: initialize ->is_cursor
Return short read or 0 at end of a raw device, not EIO
isofs: don't bother with ->d_op for normal case
isofs_cmp(): we'll never see a dentry for . or ..
overlayfs: fix lockdep misannotation
ovl: fix check for cursor
overlayfs: barriers for opening upper-layer directory
rcu: Provide counterpart to rcu_dereference() for non-RCU situations
staging: android: logger: Fix log corruption regression
02 Nov, 2014
1 commit
-
Pull btrfs fixes from Chris Mason:
"Filipe is nailing down some problems with our skinny extent variation,
and Dave's patch fixes endian problems in the new super block checks"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
Btrfs: properly clean up btrfs_end_io_wq_cache
Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
btrfs: use macro accessors in superblock validation checks
01 Nov, 2014
3 commits
-
Pull ext4 bugfixes from Ted Ts'o:
"A set of miscellaneous ext4 bug fixes for 3.18"* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: make ext4_ext_convert_to_initialized() return proper number of blocks
ext4: bail early when clearing inode journal flag fails
ext4: bail out from make_indexed_dir() on first error
jbd2: use a better hash function for the revoke table
ext4: prevent bugon on race between write/fcntl
ext4: remove extent status procfs files if journal load fails
ext4: disallow changing journal_csum option during remount
ext4: enable journal checksum when metadata checksum feature enabled
ext4: fix oops when loading block bitmap failed
ext4: fix overflow when updating superblock backups after resize -
Pull quota and ext3 fixes from Jan Kara.
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs, jbd: use a more generic hash function
quota: Properly return errors from dquot_writeback_dquots()
ext3: Don't check quota format when there are no quota files -
Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro
31 Oct, 2014
3 commits
-
Author: David Jeffery
Changes to the basic direct I/O code have broken the raw driver when reading
to the end of a raw device. Instead of returning a short read for a read that
extends partially beyond the device's end or 0 when at the end of the device,
these reads now return EIO.The raw driver needs the same end of device handling as was added for normal
block devices. Using blkdev_read_iter, which has the needed size checks,
prevents the EIO conditions at the end of the device.Signed-off-by: David Jeffery
Signed-off-by: Al Viro -
we only need it for joliet and case-insensitive mounts
Signed-off-by: Al Viro
-
The man page for open(2) indicates that when O_CREAT is specified, the
'mode' argument applies only to future accesses to the file:Note that this mode applies only to future accesses of the newly
created file; the open() call that creates a read-only file
may well return a read/write file descriptor.The man page for open(2) implies that 'mode' is treated identically by
O_CREAT and O_TMPFILE.O_TMPFILE, however, behaves differently:
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
assert(fd == -1);
assert(errno == EACCES);int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
assert(fd > 0);For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:
if (*opened & FILE_CREATED) {
/* Don't check for write permission, don't truncate */
open_flag &= ~O_TRUNC;
will_truncate = false;
acc_mode = MAY_OPEN;
path_to_nameidata(path, nd);
goto finish_open_created;
}But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
may_open().This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
inode is created, may_open() is called with acc_mode = MAY_OPEN, in
do_tmpfile().A different, but related glibc bug revealed the discrepancy:
https://sourceware.org/bugzilla/show_bug.cgi?id=17523The glibc lazily loads the 'mode' argument of open() and openat() using
va_arg() only if O_CREAT is present in 'flags' (to support both the 2
argument and the 3 argument forms of open; same idea for openat()).
However, the glibc ignores the 'mode' argument if O_TMPFILE is in
'flags'.On x86_64, for open(), it magically works anyway, as 'mode' is in
RDX when entering open(), and is still in RDX on SYSCALL, which is where
the kernel looks for the 3rd argument of a syscall.But openat() is not quite so lucky: 'mode' is in RCX when entering the
glibc wrapper for openat(), while the kernel looks for the 4th argument
of a syscall in R10. Indeed, the syscall calling convention differs from
the regular calling convention in this respect on x86_64. So the kernel
sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
fails with EACCES.Signed-off-by: Eric Rannaud
Acked-by: Andy Lutomirski
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds
30 Oct, 2014
16 commits
-
ext4_ext_convert_to_initialized() can return more blocks than are
actually allocated from map->m_lblk in case where initial part of the
on-disk extent is zeroed out. Luckily this doesn't have serious
consequences because the caller currently uses the return value
only to unmap metadata buffers. Anyway this is a data
corruption/exposure problem waiting to happen so fix it.Coverity-id: 1226848
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o -
When clearing inode journal flag, we call jbd2_journal_flush() to force
all the journalled data to their final locations. Currently we ignore
when this fails and continue clearing inode journal flag. This isn't a
big problem because when jbd2_journal_flush() fails, journal is likely
aborted anyway. But it can still lead to somewhat confusing results so
rather bail out early.Coverity-id: 989044
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o -
When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
fail, there's really something wrong with the fs and there's no point in
continuing further. Just return error from make_indexed_dir() in that
case. Also initialize frames array so that if we return early due to
error, dx_release() doesn't try to dereference uninitialized memory
(which could happen also due to error in do_split()).Coverity-id: 741300
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org -
The old hash function didn't work well for 64-bit block numbers, and
used undefined (negative) shift right behavior. Use the generic
64-bit hash function instead.Signed-off-by: Theodore Ts'o
Reported-by: Andrey Ryabinin -
O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked
twice inside ext4_file_write_iter() and __generic_file_write() which
result in BUG_ON inside ext4_direct_IO.Let's initialize iocb->private unconditionally.
TESTCASE: xfstest:generic/036 https://patchwork.ozlabs.org/patch/402445/
#TYPICAL STACK TRACE:
kernel BUG at fs/ext4/inode.c:2960!
invalid opcode: 0000 [#1] SMP
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000
RIP: 0010:[] [] ext4_direct_IO+0x162/0x3d0
RSP: 0018:ffff88080f90bb58 EFLAGS: 00010246
RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818
RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001
RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80
R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28
FS: 00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0
Stack:
ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30
0000000000000200 0000000000000200 0000000000000001 0000000000000200
ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200
Call Trace:
[] generic_file_direct_write+0xed/0x180
[] __generic_file_write_iter+0x222/0x370
[] ext4_file_write_iter+0x34b/0x400
[] ? aio_run_iocb+0x239/0x410
[] ? aio_run_iocb+0x239/0x410
[] ? local_clock+0x25/0x30
[] ? __lock_acquire+0x274/0x700
[] ? ext4_unwritten_wait+0xb0/0xb0
[] aio_run_iocb+0x286/0x410
[] ? local_clock+0x25/0x30
[] ? lock_release_holdtime+0x29/0x190
[] ? lookup_ioctx+0x4b/0xf0
[] do_io_submit+0x55b/0x740
[] ? do_io_submit+0x3ca/0x740
[] SyS_io_submit+0x10/0x20
[] system_call_fastpath+0x16/0x1b
Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c
24 18 00 75 04 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89
RIP [] ext4_direct_IO+0x162/0x3d0
RSPReported-by: Sasha Levin
Signed-off-by: Theodore Ts'o
Signed-off-by: Dmitry Monakhov
Cc: stable@vger.kernel.org -
If we can't load the journal, remove the procfs files for the extent
status information file to avoid leaking resources.Signed-off-by: Darrick J. Wong
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org -
ext4 does not permit changing the metadata or journal checksum feature
flag while mounted. Until we decide to support that, don't allow a
remount to change the journal_csum flag (right now we silently fail to
change anything).Signed-off-by: Darrick J. Wong
Signed-off-by: Theodore Ts'o -
If metadata checksumming is turned on for the FS, we need to tell the
journal to use checksumming too.Signed-off-by: Darrick J. Wong
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org -
When we fail to load block bitmap in __ext4_new_inode() we will
dereference NULL pointer in ext4_journal_get_write_access(). So check
for error from ext4_read_block_bitmap().Coverity-id: 989065
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o -
When there are no meta block groups update_backups() will compute the
backup block in 32-bit arithmetics thus possibly overflowing the block
number and corrupting the filesystem. OTOH filesystems without meta
block groups larger than 16 TB should be rare. Fix the problem by doing
the counting in 64-bit arithmetics.Coverity-id: 741252
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o
Reviewed-by: Lukas Czerner -
Merge misc fixes from Andrew Morton:
"21 fixes"* emailed patches from Andrew Morton : (21 commits)
mm/balloon_compaction: fix deflation when compaction is disabled
sh: fix sh770x SCIF memory regions
zram: avoid NULL pointer access in concurrent situation
mm/slab_common: don't check for duplicate cache names
ocfs2: fix d_splice_alias() return code checking
mm: rmap: split out page_remove_file_rmap()
mm: memcontrol: fix missed end-writeback page accounting
mm: page-writeback: inline account_page_dirtied() into single caller
lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
drivers/rtc/rtc-bq32k.c: fix register value
memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node()
drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock
kernel/kmod: fix use-after-free of the sub_info structure
drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc
mm, thp: fix collapsing of hugepages on madvise
drivers: of: add return value to of_reserved_mem_device_init()
mm: free compound page with correct order
gcov: add ARM64 to GCOV_PROFILE_ALL
fsnotify: next_i is freed during fsnotify_unmount_inodes.
mm/compaction.c: avoid premature range skip in isolate_migratepages_range
... -
The zero range operation is analogous to fallocate with the exception of
converting the range to zeroes. E.g., it attempts to allocate zeroed
blocks over the range specified by the caller. The XFS implementation
kills all delalloc blocks currently over the aligned range, converts the
range to allocated zero blocks (unwritten extents) and handles the
partial pages at the ends of the range by sending writes through the
pagecache.The current implementation suffers from several problems associated with
inode size. If the aligned range covers an extending I/O, said I/O is
discarded and an inode size update from a previous write never makes it
to disk. Further, if an unaligned zero range extends beyond eof, the
page write induced for the partial end page can itself increase the
inode size, even if the zero range request is not supposed to update
i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).The latter behavior not only incorrectly increases the inode size, but
can lead to stray delalloc blocks on the inode. Typically, post-eof
preallocation blocks are either truncated on release or inode eviction
or explicitly written to by xfs_zero_eof() on natural file size
extension. If the inode size increases due to zero range, however,
associated blocks leak into the address space having never been
converted or mapped to pagecache pages. A direct I/O to such an
uncovered range cannot convert the extent via writeback and will BUG().
For example:$ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321"
...
$ xfs_io -d -c "pread 128k 128k"If the entire delalloc extent happens to not have page coverage
whatsoever (e.g., delalloc conversion couldn't find a large enough free
space extent), even a full file writeback won't convert what's left of
the extent and we'll assert on inode eviction.Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
Use the existing hole punch and prealloc mechanisms as primitives for
zero range. This implementation is not efficient nor ideal as we
writeback dirty data over the range and remove existing extents rather
than convert to unwrittern. The former writeback, however, is currently
the only mechanism available to ensure consistency between pagecache and
extent state. Even a pagecache truncate/delalloc punch prior to hole
punch has lead to inconsistencies due to racing with writeback.This provides a consistent, correct implementation of zero range that
survives fsstress/fsx testing without assert failures. The
implementation can be optimized from this point forward once the
fundamental issue of pagecache and delalloc extent state consistency is
addressed.Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner -
xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In
case of specific fs corruption that could result in xfs_bulkstat()
entering an infinite loop because we would be looping over the same
chunk over and over again. Fix the problem by checking the return value
and terminating the loop properly.Coverity-id: 1231338
cc:
Signed-off-by: Jan Kara
Reviewed-by: Jie Liu
Signed-off-by: Dave Chinner -
d_splice_alias() can return a valid dentry, NULL or an ERR_PTR.
Currently the code checks not for ERR_PTR and will cuase an oops in
ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL().Signed-off-by: Richard Weinberger
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
During file system stress testing on 3.10 and 3.12 based kernels, the
umount command occasionally hung in fsnotify_unmount_inodes in the
section of code:spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
continue;
}As this section of code holds the global inode_sb_list_lock, eventually
the system hangs trying to acquire the lock.Multiple crash dumps showed:
The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point
back at itself. As this is not the value of list upon entry to the
function, the kernel never exits the loop.To help narrow down problem, the call to list_del_init in
inode_sb_list_del was changed to list_del. This poisons the pointers in
the i_sb_list and causes a kernel to panic if it transverse a freed
inode.Subsequent stress testing paniced in fsnotify_unmount_inodes at the
bottom of the list_for_each_entry_safe loop showing next_i had become
free.We believe the root cause of the problem is that next_i is being freed
during the window of time that the list_for_each_entry_safe loop
temporarily releases inode_sb_list_lock to call fsnotify and
fsnotify_inode_delete.The code in fsnotify_unmount_inodes attempts to prevent the freeing of
inode and next_i by calling __iget. However, the code doesn't do the
__iget call on next_iif i_count == 0 or
if i_state & (I_FREEING | I_WILL_FREE)The patch addresses this issue by advancing next_i in the above two cases
until we either find a next_i which we can __iget or we reach the end of
the list. This makes the handling of next_i more closely match the
handling of the variable "inode."The time to reproduce the hang is highly variable (from hours to days.) We
ran the stress test on a 3.10 kernel with the proposed patch for a week
without failure.During list_for_each_entry_safe, next_i is becoming free causing
the loop to never terminate. Advance next_i in those cases where
__iget is not done.Signed-off-by: Jerry Hoemann
Cc: Jeff Kirsher
Cc: Ken Helias
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current kernel. This contains:- Two error handling fixes from Jan Kara. One for null_blk on
failure to add a device, and the other for the block/scsi_ioctl
SCSI_IOCTL_SEND_COMMAND fixing up the error jump point.- A commit added in the merge window for the bio integrity bits
unfortunately disabled merging for all requests if
CONFIG_BLK_DEV_INTEGRITY wasn't set. Reverse the logic, so that
integrity checking wont disallow merges when not enabled.- A fix from Ming Lei for merging and generating too many segments.
This caused a BUG in virtio_blk.- Two error handling printk() fixups from Robert Elliott, improving
the information given when we rate limit.- Error handling fixup on elevator_init() failure from Sudip
Mukherjee.- A fix from Tony Battersby, fixing up a memory leak in the
scatterlist handling with scsi-mq"* 'for-linus' of git://git.kernel.dk/linux-block:
block: Fix merge logic when CONFIG_BLK_DEV_INTEGRITY is not defined
lib/scatterlist: fix memory leak with scsi-mq
block: fix wrong error return in elevator_init()
scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
null_blk: Cleanup error recovery in null_add_dev()
blk-merge: recaculate segment if it isn't less than max segments
fs: clarify rate limit suppressed buffer I/O errors
fs: merge I/O error prints into one line
29 Oct, 2014
4 commits
-
Signed-off-by: Al Viro
-
In an overlay directory that shadows an empty lower directory, say
/mnt/a/empty102, do:touch /mnt/a/empty102/x
unlink /mnt/a/empty102/x
rmdir /mnt/a/empty102It's actually harmless, but needs another level of nesting between
I_MUTEX_CHILD and I_MUTEX_NORMAL.Signed-off-by: Miklos Szeredi
Tested-by: David Howells
Signed-off-by: Al Viro -
ovl_cache_entry.name is now an array not a pointer, so it makes no sense
test for it being NULL.Detected by coverity.
From: Miklos Szeredi
Fixes: 68bf8611076a ("overlayfs: make ovl_cache_entry->name an array instead of
+pointer")
Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro -
make sure that
a) all stores done by opening struct file don't leak past storing
the reference in od->upperfile
b) the lockless side has read dependency barrierSigned-off-by: Al Viro