Eric Lee / smarc-fsl-linux-kernel

28 May, 2016

2 commits

d102a56ed Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:

- update docs

- restore killability of the places that used to take ->i_mutex
killably now that we have down_write_killable() merged

- Additionally, it turns out that I missed a prerequisite for
security_d_instantiate() stuff - ->getxattr() wasn't the only thing
that could be called before dentry is attached to inode; with smack
we needed the same treatment applied to ->setxattr() as well"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->setxattr() to passing dentry and inode separately
switch xattr_handler->set() to passing dentry and inode separately
restore killability of old mutex_lock_killable(&inode->i_mutex) users
add down_write_killable_nested()
update D/f/directory-locking

Linus Torvalds
2016-05-28 08:14:05 +0800
593012268 switch xattr_handler->set() to passing dentry and inode separately

preparation for similar switch in ->setxattr() (see the next commit for
rationale).

Signed-off-by: Al Viro

Al Viro
2016-05-28 03:39:43 +0800

27 May, 2016

1 commit

315227f6d Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull misc DAX updates from Vishal Verma:
"DAX error handling for 4.7

- Until now, dax has been disabled if media errors were found on any
device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.

- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.

Other misc changes:

- When mounting DAX filesystems, check to make sure the partition is
page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent
reads/writes would fail.

- Misc/cleanup fixes from Jan that remove unused code from DAX
related to zeroing, writeback, and some size checks"

* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: fix a comment in dax_zero_page_range and dax_truncate_page
dax: for truncate/hole-punch, do zeroing through the driver if possible
dax: export a low-level __dax_zero_page_range helper
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: enable dax in the presence of known media errors (badblocks)
dax: fallback from pmd to pte on error
block: Update blkdev_dax_capable() for consistency
xfs: Add alignment check for DAX mount
ext2: Add alignment check for DAX mount
ext4: Add alignment check for DAX mount
block: Add bdev_dax_supported() for dax mount checks
block: Add vfs_msg() interface
dax: Remove redundant inode size checks
dax: Remove pointless writeback from dax_do_io()
dax: Remove zeroing from dax_io()
dax: Remove dead zeroing code from fault handlers
ext2: Avoid DAX zeroing to corrupt data
ext2: Fix block zeroing in ext2_get_blocks() for DAX
dax: Remove complete_unwritten argument
DAX: move RADIX_DAX_ definitions to dax.c

Linus Torvalds
2016-05-27 10:34:26 +0800

25 May, 2016

1 commit

0e01df100 Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
"Fix a number of bugs, most notably a potential stale data exposure
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a crash on a maliciously corrupted file system.

In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.

Finally the usual set of miscellaneous cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: pre-zero allocated blocks for DAX IO
ext4: refactor direct IO code
ext4: fix race in transient ENOSPC detection
ext4: handle transient ENOSPC properly for DAX
dax: call get_blocks() with create == 1 for write faults to unwritten extents
ext4: remove unmeetable inconsisteny check from ext4_find_extent()
jbd2: remove excess descriptions for handle_s
ext4: remove unnecessary bio get/put
ext4: silence UBSAN in ext4_mb_init()
ext4: address UBSAN warning in mb_find_order_for_block()
ext4: fix oops on corrupted filesystem
ext4: fix check of dqget() return value in ext4_ioctl_setproject()
ext4: clean up error handling when orphan list is corrupted
ext4: fix hang when processing corrupted orphaned inode list
ext4: remove trailing \n from ext4_warning/ext4_error calls
ext4: fix races between changing inode journal mode and ext4_writepages
ext4: handle unwritten or delalloc buffers before enabling data journaling
ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
ext4: do not ask jbd2 to write data for delalloc buffers
jbd2: add support for avoiding data writes during transaction commits
...

Linus Torvalds
2016-05-25 03:55:26 +0800

21 May, 2016

1 commit

8da4b8c48 lib/uuid.c: move generate_random_uuid() to uuid.c

Let's gather the UUID related functions under one hood.

Signed-off-by: Andy Shevchenko
Reviewed-by: Matt Fleming
Cc: Dmitry Kasatkin
Cc: Mimi Zohar
Cc: Rasmus Villemoes
Cc: Arnd Bergmann
Cc: "Theodore Ts'o"
Cc: Al Viro
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2016-05-21 08:58:30 +0800

18 May, 2016

1 commit

c2e7b2070 Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs cleanups from Al Viro:
"More cleanups from Christoph"

* 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd: use RWF_SYNC
fs: add RWF_DSYNC aand RWF_SYNC
ceph: use generic_write_sync
fs: simplify the generic_write_sync prototype
fs: add IOCB_SYNC and IOCB_DSYNC
direct-io: remove the offset argument to dio_complete
direct-io: eliminate the offset argument to ->direct_IO
xfs: eliminate the pos variable in xfs_file_dio_aio_write
filemap: remove the pos argument to generic_file_direct_write
filemap: remove pos variables in generic_file_read_iter

Linus Torvalds
2016-05-18 06:05:23 +0800

17 May, 2016

3 commits

87eefeb4e ext4: Add alignment check for DAX mount

When a partition is not aligned to 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata updates.

Call bdev_dax_supported() to perform proper precondition checks,
which include this partition alignment check.
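
The alignment precondition can be illustrated with a small userspace sketch. This is not the body of bdev_dax_supported(); the helper name, the byte-offset parameter and the fixed 4096-byte page size are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: true when the partition's starting byte offset
 * on the device is a multiple of the page size.
 */
bool sketch_dax_start_aligned(uint64_t start_bytes)
{
	return start_bytes % SKETCH_PAGE_SIZE == 0;
}
```

A partition starting at sector 63 with 512-byte sectors begins at byte 32256, which is not a multiple of 4096, so a check of this kind would reject it at mount time instead of letting later reads and writes fail.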

Reported-by: Micah Parrish
Signed-off-by: Toshi Kani
Reviewed-by: Jan Kara
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: Jan Kara
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Christoph Hellwig
Cc: Boaz Harrosh
Signed-off-by: Vishal Verma

Toshi Kani
2016-05-17 14:44:11 +0800
0e0162bb8 Merge branch 'ovl-fixes' into for-linus

Backmerge to resolve a conflict in ovl_lookup_real();
"ovl_lookup_real(): use lookup_one_len_unlocked()" instead,
but it was too late in the cycle to rebase.

Al Viro
2016-05-17 14:17:59 +0800
02fbd1397 dax: Remove complete_unwritten argument

Fault handlers currently take a complete_unwritten argument to convert
unwritten extents after PTEs are updated. However, no filesystem uses
this anymore as the code is racy. Remove the unused argument.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

Jan Kara
2016-05-17 08:11:51 +0800

13 May, 2016

5 commits

12735f881 ext4: pre-zero allocated blocks for DAX IO

Currently ext4 treats DAX IO the same way as direct IO, i.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However, this way DAX IO can race with a page fault on
the same area:

ext4_ext_direct_IO()                         dax_fault()
  dax_io()
    get_block() - allocates unwritten extent
    copy_from_iter_pmem()
                                               get_block() - converts
                                                 unwritten block to
                                                 written and zeroes it
                                                 out
  ext4_convert_unwritten_extents()

So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.

Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:51:15 +0800
914f82a32 ext4: refactor direct IO code

Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
and ext4_ind_direct_IO(). However the extent based function calls into
the indirect based one for some cases and for example it is not able to
handle file extending. Previously it also did not properly handle
retries in case of ENOSPC errors. With DAX things would get even more
contrived, so just refactor the direct IO code and, instead of an indirect /
extent split, split it into reads vs writes.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:44:16 +0800
dbc427ce4 ext4: fix race in transient ENOSPC detection

When there are blocks to free in the running transaction, block
allocator can return ENOSPC although the filesystem has some blocks to
free. We use ext4_should_retry_alloc() to force commit of the current
transaction and return whether anything was committed so that it makes
sense to retry the allocation. However the transaction may get committed
after block allocation fails but before we call
ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
because there is nothing to commit and we wrongly return ENOSPC.

Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
when we tried to commit a transaction. This should not add any
unnecessary retries since we had a transaction running a while ago when
trying to allocate blocks and we want to retry the allocation once that
transaction has committed anyway.
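
The race and the fix can be modelled in a few lines of userspace C. This is a simplified sketch, not the ext4 code: struct sketch_fs, the sketch_* helpers and the retry bound are hypothetical, and the real ext4_should_retry_alloc() performs further checks that are left out here.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_RETRIES 3	/* assumed retry bound for this sketch */

/* Hypothetical stand-in for the filesystem's block accounting. */
struct sketch_fs {
	int free_blocks;	/* blocks that can be allocated right now */
	int pending_free;	/* blocks released once the running transaction commits */
};

/* Allocate one block, or fail with -ENOSPC when none is free right now. */
int sketch_alloc_block(struct sketch_fs *fs)
{
	if (fs->free_blocks == 0)
		return -ENOSPC;
	fs->free_blocks--;
	return 0;
}

/* Commit the running transaction, if any; pending frees become usable. */
void sketch_commit(struct sketch_fs *fs)
{
	fs->free_blocks += fs->pending_free;
	fs->pending_free = 0;
}

/*
 * Retry decision after the fix: once a commit has been attempted, answer
 * "retry" even if another thread already committed the transaction and
 * nothing was left for us to commit.
 */
bool sketch_should_retry_alloc(struct sketch_fs *fs, int *retries)
{
	if ((*retries)++ >= SKETCH_MAX_RETRIES)
		return false;
	sketch_commit(fs);
	return true;
}
```

If the transaction commits between the failed allocation and the retry check, a version that returned "was anything committed?" would answer no and the caller would report ENOSPC although blocks are now free. Returning true unconditionally, bounded by the retry counter, avoids that.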

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:42:40 +0800
7cb476f83 ext4: handle transient ENOSPC properly for DAX

ext4_dax_get_blocks() was accidentally omitted when fixing the get_blocks
handlers to properly handle transient ENOSPC errors. Fix it now to use the
ext4_get_blocks_trans() helper, which takes care of these errors.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:38:16 +0800
ae05327a0 ext4: switch to ->iterate_shared()

Note that we need relax_dir() equivalent for directories
locked shared.

Signed-off-by: Al Viro

Al Viro
2016-05-13 08:36:01 +0800

06 May, 2016

4 commits

816cd71b0 ext4: remove unmeetable inconsisteny check from ext4_find_extent()

ext4_find_extent(), stripped down to the parts relevant to this patch,
reads as

  ppos = 0;
  i = depth;
  while (i) {
    --i;
    ++ppos;
    if (unlikely(ppos > depth)) {
      ...
      ret = -EFSCORRUPTED;
      goto err;
    }
  }

Due to the loop's bounds, the condition ppos > depth can never be met.

Remove this dead code.
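
The claim can be checked with a standalone version of the loop. This is a userspace sketch with a hypothetical function name; it records the largest value ppos reaches, assuming a non-negative depth.

```c
#include <assert.h>

/*
 * Run the loop skeleton quoted above and return the largest value that
 * ppos takes. The loop body executes exactly `depth` times, so ppos
 * ends at depth and never exceeds it.
 */
int sketch_max_ppos(int depth)
{
	int ppos = 0;
	int i = depth;
	int max_ppos = 0;

	assert(depth >= 0);	/* precondition assumed by this sketch */
	while (i) {
		--i;
		++ppos;
		if (ppos > max_ppos)
			max_ppos = ppos;
	}
	return max_ppos;
}
```

Since i counts down from depth while ppos counts up from zero, the two always sum to depth inside the loop, which is why the removed check could never fire.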

Signed-off-by: Nicolai Stange
Signed-off-by: Theodore Ts'o

Nicolai Stange
2016-05-06 10:43:04 +0800
32157de29 ext4: remove unnecessary bio get/put

ext4_io_submit() used to check for EOPNOTSUPP after bio submission,
which is why it had to get an extra reference to the bio before
submitting it. But since we no longer touch the bio after submission,
get rid of the redundant get/put of the bio. If we do get the extra
reference, we enter the slower path of having to flag this bio as now
having external references.

Signed-off-by: Jens Axboe
Signed-off-by: Theodore Ts'o

Jens Axboe
2016-05-06 10:09:49 +0800
935244cd5 ext4: silence UBSAN in ext4_mb_init()

Currently, in ext4_mb_init(), there's a loop like the following:

  do {
    ...
    offset += 1 << (sb->s_blocksize_bits - i);
    i++;
  } while (i <= sb->s_blocksize_bits + 1);

Note that the updated offset is used in the loop's next iteration only.

However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports

UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
shift exponent 4294967295 is too large for 32-bit type 'int'
[...]
Call Trace:
[] dump_stack+0xbc/0x117
[] ? _atomic_dec_and_lock+0x169/0x169
[] ubsan_epilogue+0xd/0x4e
[] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[] ? __ubsan_handle_load_invalid_value+0x158/0x158
[] ? kmem_cache_alloc+0x101/0x390
[] ? ext4_mb_init+0x13b/0xfd0
[] ? create_cache+0x57/0x1f0
[] ? create_cache+0x11a/0x1f0
[] ? mutex_lock+0x38/0x60
[] ? mutex_unlock+0x1b/0x50
[] ? put_online_mems+0x5b/0xc0
[] ? kmem_cache_create+0x117/0x2c0
[] ext4_mb_init+0xc49/0xfd0
[...]

Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
value of offset calculated this way is never used again.

Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.
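
The approach can be sketched in userspace as follows. This mirrors the idea described above rather than the exact ext4_mb_init() code: the function name, the array parameter and SKETCH_MAX_BITS are assumptions made for illustration.

```c
#include <assert.h>

#define SKETCH_MAX_BITS 16	/* assumed upper bound for this sketch */

/*
 * Fill offsets[1..bits+1] with the running offset. The next increment
 * lives in its own variable and is halved on every pass, so no shift
 * with a negative count is ever evaluated.
 */
void sketch_fill_offsets(unsigned bits, unsigned offsets[SKETCH_MAX_BITS + 2])
{
	unsigned i = 1;
	unsigned offset = 0;
	unsigned offset_incr;

	assert(bits >= 1 && bits <= SKETCH_MAX_BITS);
	offset_incr = 1u << (bits - 1);
	do {
		offsets[i] = offset;
		offset += offset_incr;
		offset_incr >>= 1;	/* becomes 0 before the final pass */
		i++;
	} while (i <= bits + 1);
}
```

With bits == 3 the stored offsets are 0, 4, 6 and 7. The final pass adds an increment of zero, where the original expression would have shifted by -1.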

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange
Signed-off-by: Theodore Ts'o

Nicolai Stange
2016-05-06 07:46:19 +0800
b5cb316cd ext4: address UBSAN warning in mb_find_order_for_block()

Currently, in mb_find_order_for_block(), there's a loop like the following:

  while (order <= e4b->bd_blkbits + 1) {
    ...
    bb += 1 << (e4b->bd_blkbits - order);
  }

Note that the updated bb is used in the loop's next iteration only.

However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
shift exponent -1 is negative
[...]
Call Trace:
[] dump_stack+0xbc/0x117
[] ? _atomic_dec_and_lock+0x169/0x169
[] ubsan_epilogue+0xd/0x4e
[] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[] ? __ubsan_handle_load_invalid_value+0x158/0x158
[] ? ext4_mb_generate_from_pa+0x590/0x590
[] ? ext4_read_block_bitmap_nowait+0x598/0xe80
[] mb_find_order_for_block+0x1ce/0x240
[...]

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
value of bb calculated this way is never used again.

Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange
Signed-off-by: Theodore Ts'o

Nicolai Stange
2016-05-06 05:38:03 +0800

05 May, 2016

2 commits

74177f55b ext4: fix oops on corrupted filesystem

When the filesystem is corrupted in the right way, it can happen that
ext4_mark_iloc_dirty() in ext4_orphan_add() returns an error and we
subsequently remove the inode from the in-memory orphan list. However, this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave the i_orphan list_head with stale content. Later we can look at this
content, causing list corruption, an oops, or other issues. The reported
trace looked like:

WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
(0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
60462947 62219960 602ede24 62219960
602ede24 603ca293 622198f0 602f02eb
62219950 6002c12c 62219900 601b4d6b
Call Trace:
[] ? vprintk_emit+0x2dc/0x5c0
[] ? printk+0x0/0x94
[] show_stack+0xdc/0x1a0
[] ? printk+0x0/0x94
[] ? printk+0x0/0x94
[] dump_stack+0x2a/0x2c
[] warn_slowpath_common+0x9c/0xf0
[] ? __list_del_entry+0x6b/0x100
[] warn_slowpath_fmt+0x94/0xa0
[] ? __mutex_lock_slowpath+0x239/0x3a0
[] ? warn_slowpath_fmt+0x0/0xa0
[] ? set_signals+0x3f/0x50
[] ? kmem_cache_free+0x10a/0x180
[] ? mutex_lock+0x18/0x30
[] __list_del_entry+0x6b/0x100
[] ext4_orphan_del+0x22c/0x2f0
[] ? __ext4_journal_start_sb+0x2c/0xa0
[] ? ext4_truncate+0x383/0x390
[] ext4_write_begin+0x30b/0x4b0
[] ? copy_from_user+0x0/0xb0
[] ? iov_iter_fault_in_readable+0xa0/0xc0
[] generic_perform_write+0xaf/0x1e0
[] ? file_update_time+0x46/0x110
[] __generic_file_write_iter+0x18f/0x1b0
[] ext4_file_write_iter+0x15f/0x470
[] ? unlink_file_vma+0x0/0x70
[] ? unlink_anon_vmas+0x0/0x260
[] ? free_pgtables+0xb9/0x100
[] __vfs_write+0xb0/0x130
[] vfs_write+0xa5/0x170
[] SyS_write+0x56/0xe0
[] ? __libc_waitpid+0x0/0xa0
[] handle_syscall+0x68/0x90
[] userspace+0x4fd/0x600
[] ? save_registers+0x1f/0x40
[] ? arch_prctl+0x177/0x1b0
[] fork_handler+0x85/0x90

Fix the problem by using list_del_init() as we always should with
i_orphan list.
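
The difference between the two deletion helpers can be shown with a minimal userspace list. This is a sketch modelled on the kernel's circular doubly linked list, not the kernel implementation; all sketch_* names are hypothetical and NULL stands in for the LIST_POISON values.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_list {
	struct sketch_list *next, *prev;
};

/* Make a node point at itself: an empty list / unlinked entry. */
void sketch_list_init(struct sketch_list *n)
{
	n->next = n;
	n->prev = n;
}

bool sketch_list_empty(const struct sketch_list *n)
{
	return n->next == n;
}

/* Insert n right after head. */
void sketch_list_add(struct sketch_list *n, struct sketch_list *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n and poison its pointers: n must not be inspected afterwards. */
void sketch_list_del(struct sketch_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = NULL;	/* stand-in for LIST_POISON1 */
	n->prev = NULL;	/* stand-in for LIST_POISON2 */
}

/* Unlink n and reinitialise it, so later empty checks and deletes are safe. */
void sketch_list_del_init(struct sketch_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	sketch_list_init(n);
}
```

After sketch_list_del_init() the node reports itself as empty and can be deleted again without harm. After sketch_list_del() any later emptiness check or second delete would operate on poisoned pointers, which corresponds to the corruption reported in the trace above.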

CC: stable@vger.kernel.org
Reported-by: Vegard Nossum
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-05 23:10:15 +0800
ff0bc0845 ext4: fix check of dqget() return value in ext4_ioctl_setproject()

A failed call to dqget() returns an ERR_PTR() and not null. Fix
the check in ext4_ioctl_setproject() to handle this correctly.
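
Why a NULL test misses this kind of failure can be seen in a userspace imitation of the error-pointer convention. The sketch_* helpers below are hypothetical stand-ins written for illustration, not the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() definitions, and they rely on the usual implementation-defined integer-to-pointer conversion.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095	/* assumed size of the reserved error range */

/* Encode a negative errno value as a pointer near the top of the address space. */
void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True when p falls in the range reserved for encoded errors. */
bool sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

A pointer produced by sketch_err_ptr(-ESRCH) is not NULL, so a caller that only tests for NULL would go on to dereference it. Testing with sketch_is_err() and propagating sketch_ptr_err() is the pattern the fix switches to.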

Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
Cc: stable@vger.kernel.org # v4.5
Signed-off-by: Seth Forshee
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara

Seth Forshee
2016-05-05 22:52:38 +0800

03 May, 2016

1 commit

84695ffee Merge getxattr prototype change into work.lookups

The rest of work.xattr stuff isn't needed for this branch

Al Viro
2016-05-03 07:45:47 +0800

02 May, 2016

3 commits

e25922176 fs: simplify the generic_write_sync prototype

The kiocb already has the new position, so use that. The only interesting
case is AIO, where we currently don't bother updating ki_pos. We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2016-05-02 07:58:39 +0800
dde0c2e79 fs: add IOCB_SYNC and IOCB_DSYNC

This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro

Christoph Hellwig
2016-05-02 07:58:39 +0800
c8b8e32d7 direct-io: eliminate the offset argument to ->direct_IO

Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superfluous argument.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2016-05-02 07:58:39 +0800

30 Apr, 2016

2 commits

7827a7f6e ext4: clean up error handling when orphan list is corrupted

Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system corrupted. If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2016-04-30 12:49:54 +0800
c9eb13a91 ext4: fix hang when processing corrupted orphaned inode list

If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly). Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.

This can be reproduced via:

mke2fs -t ext4 /tmp/foo.img 100
debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
mount -o loop /tmp/foo.img /mnt

(But don't do this if you are using an unpatched kernel if you care
about the system staying functional. :-)

This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1]. (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)

[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf

Cc: stable@vger.kernel.org
Reported-by: Vegard Nossum
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2016-04-30 12:48:54 +0800

27 Apr, 2016

1 commit

8d2ae1cbe ext4: remove trailing \n from ext4_warning/ext4_error calls

Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these functions add the newlines themselves.

Signed-off-by: Jakub Wilk

Jakub Wilk
2016-04-27 13:11:21 +0800

26 Apr, 2016

3 commits

c8585c6fc ext4: fix races between changing inode journal mode and ext4_writepages

In ext4, there is a race condition between changing inode journal mode
and ext4_writepages(). While ext4_writepages() is executed on a
non-journalled mode inode, the inode's journal mode could be enabled
by ioctl(), and then some pages dirtied after switching the journal
mode will still be exposed to ext4_writepages() in non-journalled mode.
To resolve this problem, we use an fs-wide per-cpu rw semaphore, at Jan
Kara's suggestion, because we don't want to waste ext4_inode_info's
space for this extremely rare case.

Signed-off-by: Daeho Jeong
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara

Daeho Jeong
2016-04-26 11:22:35 +0800
4c5465926 ext4: handle unwritten or delalloc buffers before enabling data journaling

We already allocate delalloc blocks before changing the inode mode into
"per-file data journal" mode to prevent delalloc blocks from remaining
not allocated, but another issue concerned with "BH_Unwritten" status
still exists. For example, by fallocate(), several buffers' status
change into "BH_Unwritten", but these buffers cannot be processed by
ext4_alloc_da_blocks(). So, they still remain in unwritten status after
per-file data journaling is enabled and they cannot be changed into
written status any more and, if they are journaled and eventually
checkpointed, these unwritten buffers will cause a kernel panic via the
BUG_ON() in submit_bh_wbc() shown below when they are submitted
during checkpointing.

static int submit_bh_wbc(int rw, struct buffer_head *bh,...
{
...
BUG_ON(buffer_unwritten(bh));

Moreover, when "dioread_nolock" option is enabled, the status of a
buffer is changed into "BH_Unwritten" after write_begin() completes and
the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
if a buffer's status is changed into unwritten but the buffer's I/O is
not submitted and completed, it can cause the same problem after
enabling per-file data journaling. You can easily reproduce this bug by
executing the following command.

./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269

To resolve these problems and define a boundary between the previous
mode and per-file data journaling mode, we need to flush and wait for all
the I/O of buffers of a file before enabling per-file data journaling
of the file.

Signed-off-by: Daeho Jeong
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara

Daeho Jeong
2016-04-26 11:21:00 +0800
7b8081912 ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()

The function jbd2_journal_extend() takes as its argument the number of
new credits to be added to the handle. We weren't taking into account
the currently unused handle credits; worse, we would try to extend the
handle by N credits when it had N credits available.

In the case where jbd2_journal_extend() fails because the transaction
is too large, when jbd2_journal_restart() gets called, the N credits
owned by the handle get returned to the transaction, and the
transaction commit is asynchronously requested, and then
start_this_handle() will be able to successfully attach the handle to
the current transaction since the required credits are now available.

This is mostly harmless, but since ext4_ext_truncate_extend_restart()
returns EAGAIN, the truncate machinery will once again try to call
ext4_ext_truncate_extend_restart(), which will do the above sequence
over and over again until the transaction has committed.

This was found while I was debugging a lockup caused by running
xfstests generic/074 in the data=journal case. I'm still not sure why
we ended up looping forever, which suggests there may still be another
bug hiding in the transaction accounting machinery, but this commit
prevents us from looping in the first place.
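
The accounting point in the first paragraph, that the extension should cover only the shortfall, can be written as a small helper. This is an illustrative sketch with a hypothetical name and plain integers, not the ext4 or jbd2 code.

```c
/*
 * Hypothetical helper: how many additional credits to request when the
 * handle needs `needed` credits in total and still has `available`
 * unused ones. Nothing is requested when the handle already has enough.
 */
int sketch_credits_to_extend(int needed, int available)
{
	return needed > available ? needed - available : 0;
}
```

A handle that needs N credits and already holds N asks for zero more, which avoids the failing extend and the restart loop described above.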

Signed-off-by: Theodore Ts'o

Theodore Ts'o
2016-04-26 11:13:17 +0800

24 Apr, 2016

5 commits

ee0876bc6 ext4: do not ask jbd2 to write data for delalloc buffers

Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().

Tested-by: "HUANG Weller (CM/ESW12-CN)"
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-04-24 12:56:08 +0800
41617e1a8 jbd2: add support for avoiding data writes during transaction commits

Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.

Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-04-24 12:56:07 +0800
3957ef53a ext4: remove EXT4_STATE_ORDERED_MODE

This flag is just duplicating what ext4_should_order_data() tells you
and is used in a single place. Furthermore it doesn't reflect changes to
inode data journalling flag so it may be misleading. Just
remove it.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-04-24 12:56:05 +0800
06bd3c36a ext4: fix data exposure after a crash

Huang has reported that in his powerfail testing he is seeing stale
block contents in some of the recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule is extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)"
Tested-by: "HUANG Weller (CM/ESW12-CN)"
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-04-24 12:56:03 +0800
1f60fbe72 ext4: allow readdir()'s of large empty directories to be interrupted

If a directory has a large number of empty blocks, iterating over all
of them can take a long time, leading to scheduler warnings and users
getting irritated when they can't kill a process in the middle of one
of these long-running readdir operations. Fix this by adding checks to
ext4_readdir() and ext4_htree_fill_tree().

This was reverted earlier due to a typo in the original commit where I
experimented with using signal_pending() instead of
fatal_signal_pending(). The test was in the wrong place if we were
going to return signal_pending() since we would end up returning
duplicate entries. See 9f2394c9be47 for a more detailed explanation.

Added fix as suggested by Linus to check for signal_pending() in
the filldir() functions.

Reported-by: Benjamin LaHaise
Google-Bug-Id: 27880676
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2016-04-24 10:50:07 +0800

13 Apr, 2016

1 commit

03a8bb0e5 ext4/fscrypto: avoid RCU lookup in d_revalidate

As Al pointed out, d_revalidate should bail out of RCU lookup before using d_inode.
This was originally introduced by:
commit 34286d666230 ("fs: rcu-walk aware d_revalidate method").

Reported-by: Al Viro
Signed-off-by: Jaegeuk Kim
Cc: Theodore Ts'o
Cc: stable

Jaegeuk Kim
2016-04-13 11:01:35 +0800

11 Apr, 2016

3 commits

b296821a7 xattr_handler: pass dentry and inode as separate arguments of ->get()

... and do not assume they are already attached to each other

Signed-off-by: Al Viro

Al Viro
2016-04-11 08:48:24 +0800
9f2394c9b Revert "ext4: allow readdir()'s of large empty directories to be interrupted"

This reverts commit 1028b55bafb7611dda1d8fed2aeca16a436b7dff.

It's broken: it makes ext4 return an error at an invalid point, causing
the readdir wrappers to write the position of the last successful
directory entry into the position field, which means that the next
readdir will now return that last successful entry _again_.

You can only return fatal errors (that terminate the readdir directory
walk) from within the filesystem readdir functions, the "normal" errors
(that happen when the readdir buffer fills up, for example) happen in
the iterator where we know the position of the actual failing entry.

I do have a very different patch that does the "signal_pending()"
handling inside the iterator function where it is allowable, but while
that one passes all the sanity checks, I screwed up something like four
times while emailing it out, so I'm not going to commit it today.

So my track record is not good enough, and the stars will have to align
better before that one gets committed. And it would be good to get some
review too, of course, since celestial alignments are always an iffy
debugging model.

IOW, let's just revert the commit that caused the problem for now.

Reported-by: Greg Thelen
Cc: Theodore Ts'o
Signed-off-by: Linus Torvalds

Linus Torvalds
2016-04-11 07:52:24 +0800
fc64005c9 don't bother with ->d_inode->i_sb - it's always equal to ->d_sb

... and neither can ever be NULL

Signed-off-by: Al Viro

Al Viro
2016-04-11 05:11:51 +0800

08 Apr, 2016

1 commit

93061f390 Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
"These changes contain a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems. These have been
reviewed by Al and the respective file system maintainers and are going
through the ext4 tree for convenience.

This also has a few ext4 encryption bug fixes that were discovered in
Android testing (yes, we will need to get these sync'ed up with the
fs/crypto code; I'll take care of that). It also has some bug fixes
and a change to ignore the legacy quota options to allow for xfstests
regression testing of ext4's internal quota feature and to be more
consistent with how xfs handles this case"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: ignore quota mount options if the quota feature is enabled
ext4 crypto: fix some error handling
ext4: avoid calling dquot_get_next_id() if quota is not enabled
ext4: retry block allocation for failed DIO and DAX writes
ext4: add lockdep annotations for i_data_sem
ext4: allow readdir()'s of large empty directories to be interrupted
btrfs: fix crash/invalid memory access on fsync when using overlayfs
ext4 crypto: use dget_parent() in ext4_d_revalidate()
ext4: use file_dentry()
ext4: use dget_parent() in ext4_file_open()
nfs: use file_dentry()
fs: add file_dentry()
ext4 crypto: don't let data integrity writebacks fail with ENOMEM
ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()

Linus Torvalds
2016-04-08 08:22:20 +0800