Eric Lee / smarc-fsl-linux-kernel

20 May, 2016

6 commits

bc2466e42 dax: Use radix tree entry lock to protect cow faults ... Browse Code »

When doing cow faults, we cannot directly fill in PTE as we do for other
faults as we rely on generic code to do proper accounting of the cowed page.
We also have no page to lock to protect against races with truncate as
other faults have and we need the protection to extend until the moment
generic code inserts cowed page into PTE thus at that point we have no
protection of fs-specific i_mmap_sem. So far we relied on using
i_mmap_lock for the protection however that is completely special to cow
faults. To make fault locking more uniform use DAX entry lock instead.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Ross Zwisler

Jan Kara
2016-05-20 05:27:49 +0800
ac401cc78 dax: New fault locking ... Browse Code »

Currently DAX page fault locking is racy.

CPU0 (write fault) CPU1 (read fault)

__dax_fault() __dax_fault()
get_block(inode, block, &bh, 0) -> not mapped
get_block(inode, block, &bh, 0)
-> not mapped
if (!buffer_mapped(&bh))
if (vmf->flags & FAULT_FLAG_WRITE)
get_block(inode, block, &bh, 1) -> allocates blocks
if (page) -> no
if (!buffer_mapped(&bh))
if (vmf->flags & FAULT_FLAG_WRITE) {
} else {
dax_load_hole();
}
dax_insert_mapping()

And we are in a situation where we fail in dax_radix_entry() with -EIO.

Another problem with the current DAX page fault locking is that there is
no race-free way to clear dirty tag in the radix tree. We can always
end up with clean radix tree and dirty data in CPU cache.

We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings acting very similarly to page lock and thus
synchronizing properly faults against the same mapping index. The same
lock can later be used to avoid races when clearing radix tree dirty
tag.

Reviewed-by: NeilBrown
Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Ross Zwisler

Jan Kara
2016-05-20 05:20:54 +0800
4f622938a dax: Allow DAX code to replace exceptional entries ... Browse Code »

Currently we forbid page_cache_tree_insert() to replace exceptional radix
tree entries for DAX inodes. However to make DAX faults race free we will
lock radix tree entries and when hole is created, we need to replace
such locked radix tree entry with a hole page. So modify
page_cache_tree_insert() to allow that.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Ross Zwisler

Jan Kara
2016-05-20 05:18:30 +0800
e804315dd dax: Define DAX lock bit for radix tree exceptional entry ... Browse Code »

We will use lowest available bit in the radix tree exceptional entry for
locking of the entry. Define it. Also clean up definitions of DAX entry
type bits in DAX exceptional entries to use defined constants instead of
hardcoding numbers and cleanup checking of these bits to not rely on how
other bits in the entry are set.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Ross Zwisler

Jan Kara
2016-05-20 05:14:55 +0800
348e967ab dax: Make huge page handling depend of CONFIG_BROKEN ... Browse Code »

Currently the handling of huge pages for DAX is racy. For example the
following can happen:

CPU0 (THP write fault) CPU1 (normal read fault)

__dax_pmd_fault() __dax_fault()
get_block(inode, block, &bh, 0) -> not mapped
get_block(inode, block, &bh, 0)
-> not mapped
if (!buffer_mapped(&bh) && write)
get_block(inode, block, &bh, 1) -> allocates blocks
truncate_pagecache_range(inode, lstart, lend);
dax_load_hole();

This results in data corruption since process on CPU1 won't see changes
into the file done by CPU0.

The race can happen even if two normal faults race however with THP the
situation is even worse because the two faults don't operate on the same
entries in the radix tree and we want to use these entries for
serialization. So make THP support in DAX code depend on CONFIG_BROKEN
for now.

Signed-off-by: Jan Kara
Signed-off-by: Ross Zwisler

Jan Kara
2016-05-20 05:13:17 +0800
b9953536c dax: Fix condition for filling of PMD holes ... Browse Code »

Currently dax_pmd_fault() decides to fill a PMD-sized hole only if
returned buffer has BH_Uptodate set. However that doesn't get set for
any mapping buffer so that branch is actually a dead code. The
BH_Uptodate check doesn't make any sense so just remove it.

Signed-off-by: Jan Kara
Signed-off-by: Ross Zwisler

Jan Kara
2016-05-20 05:13:00 +0800

19 May, 2016

5 commits

40543f62c dax: fix a comment in dax_zero_page_range and dax_truncate_page ... Browse Code »

The distinction between PAGE_SIZE and PAGE_CACHE_SIZE was removed in

09cbfea mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release}
macros

The comments for the above functions described a distinction between
those, that is now redundant, so remove those paragraphs

Cc: Kirill A. Shutemov
Reviewed-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Vishal Verma

Vishal Verma
2016-05-19 02:16:58 +0800
4b0228fa1 dax: for truncate/hole-punch, do zeroing through the driver if possible ... Browse Code »

In the truncate or hole-punch path in dax, we clear out sub-page ranges.
If these sub-page ranges are sector aligned and sized, we can do the
zeroing through the driver instead so that error-clearing is handled
automatically.

For sub-sector ranges, we still have to rely on clear_pmem and have the
possibility of tripping over errors.

Cc: Dan Williams
Cc: Ross Zwisler
Cc: Jeff Moyer
Cc: Christoph Hellwig
Cc: Dave Chinner
Cc: Jan Kara
Reviewed-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Vishal Verma

Vishal Verma
2016-05-19 02:16:57 +0800
679c8bd3b dax: export a low-level __dax_zero_page_range helper ... Browse Code »

This allows XFS to perform zeroing using the iomap infrastructure and
avoid buffer heads.

Reviewed-by: Jan Kara
Signed-off-by: Christoph Hellwig
[vishal: fix conflicts with dax-error-handling]
Signed-off-by: Vishal Verma

Christoph Hellwig
2016-05-19 02:16:56 +0800
3dc291610 dax: use sb_issue_zerout instead of calling dax_clear_sectors ... Browse Code »

dax_clear_sectors() cannot handle poisoned blocks. These must be
zeroed using the BIO interface instead. Convert ext2 and XFS to use
only sb_issue_zerout().

Reviewed-by: Jeff Moyer
Reviewed-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Matthew Wilcox
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: Vishal Verma

Matthew Wilcox
2016-05-19 02:16:56 +0800
0a70bd430 dax: enable dax in the presence of known media errors (badblocks) ... Browse Code »

1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.

Reviewed-by: Jeff Moyer
Reviewed-by: Christoph Hellwig
Signed-off-by: Dan Williams
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma

Dan Williams
2016-05-19 02:16:56 +0800

17 May, 2016

15 commits

8b3db9798 dax: fallback from pmd to pte on error ... Browse Code »

In preparation for consulting a badblocks list in pmem_direct_access(),
teach dax_pmd_fault() to fallback rather than fail immediately upon
encountering an error. The thought being that reducing the span of the
dax request may avoid the error region.

Reviewed-by: Jeff Moyer
Reviewed-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Dan Williams
Signed-off-by: Vishal Verma

Dan Williams
2016-05-17 14:44:13 +0800
a8078b1fc block: Update blkdev_dax_capable() for consistency ... Browse Code »

blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
to remain as a separate interface for checking dax capability of
a raw block device.

Rename and relocate blkdev_dax_capable() to keep them maintained
consistently, and call bdev_direct_access() for the dax capability
check.

There is no change in the behavior.

Link: https://lkml.org/lkml/2016/5/9/950
Signed-off-by: Toshi Kani
Reviewed-by: Jan Kara
Cc: Alexander Viro
Cc: Jens Axboe
Cc: Andreas Dilger
Cc: Jan Kara
Cc: Dave Chinner
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Christoph Hellwig
Cc: Boaz Harrosh
Signed-off-by: Vishal Verma

Toshi Kani
2016-05-17 14:44:13 +0800
1e937cddd xfs: Add alignment check for DAX mount ... Browse Code »

When a partition is not aligned by 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata update.

Call bdev_dax_supported() to perform proper precondition checks
which includes this partition alignment check.

Signed-off-by: Toshi Kani
Reviewed-by: Christoph Hellwig
Cc: Dave Chinner
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Christoph Hellwig
Cc: Boaz Harrosh
Signed-off-by: Vishal Verma

Toshi Kani
2016-05-17 14:44:12 +0800
284854be2 ext2: Add alignment check for DAX mount ... Browse Code »

When a partition is not aligned by 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata update.

Call bdev_dax_supported() to perform proper precondition checks
which includes this partition alignment check.

Signed-off-by: Toshi Kani
Reviewed-by: Jan Kara
Cc: Jan Kara
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Christoph Hellwig
Cc: Boaz Harrosh
Signed-off-by: Vishal Verma

Toshi Kani
2016-05-17 14:44:12 +0800
87eefeb4e ext4: Add alignment check for DAX mount ... Browse Code »

When a partition is not aligned by 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata update.

Call bdev_dax_supported() to perform proper precondition checks
which includes this partition alignment check.

Reported-by: Micah Parrish
Signed-off-by: Toshi Kani
Reviewed-by: Jan Kara
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: Jan Kara
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Christoph Hellwig
Cc: Boaz Harrosh
Signed-off-by: Vishal Verma

Toshi Kani
2016-05-17 14:44:11 +0800
2d96afc8f block: Add bdev_dax_supported() for dax mount checks ... Browse Code »

DAX imposes additional requirements to a device. Add
bdev_dax_supported() which performs all the precondition checks
necessary for filesystem to mount the device with dax option.

Also add a new check to verify if a partition is aligned by 4KB.
When a partition is unaligned, any dax read/write access fails,
except for metadata update.

Signed-off-by: Toshi Kani
Reviewed-by: Christoph Hellwig
Cc: Alexander Viro
Cc: Jens Axboe
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: Jan Kara
Cc: Dave Chinner
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Christoph Hellwig
Cc: Boaz Harrosh
Signed-off-by: Vishal Verma

Toshi Kani
2016-05-17 14:44:11 +0800
2af3a8159 block: Add vfs_msg() interface ... Browse Code »

In preparation of moving DAX capability checks to the block layer
from filesystem code, add a VFS message interface that aligns with
filesystem's message format.

For instance, a vfs_msg() message followed by XFS messages in case
of a dax mount error may look like:

VFS (pmem0p1): error: unaligned partition for dax
XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
XFS (pmem0p1): Mounting V5 Filesystem
:

vfs_msg() is largely based on ext4_msg().

Signed-off-by: Toshi Kani
Reviewed-by: Christoph Hellwig
Cc: Alexander Viro
Cc: Jens Axboe
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: Jan Kara
Cc: Dave Chinner
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Christoph Hellwig
Cc: Boaz Harrosh
Signed-off-by: Vishal Verma

Toshi Kani
2016-05-17 14:44:10 +0800
7795bec89 dax: Remove redundant inode size checks ... Browse Code »

Callers of dax fault handlers must make sure these calls cannot race
with truncate. Thus it is enough to check inode size when entering the
function and we don't have to recheck it again later in the handler.
Note that inode size itself can be decreased while the fault handler
runs but filesystem locking prevents against any radix tree or block
mapping information changes resulting from the truncate and that is what
we really care about.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

Jan Kara
2016-05-17 14:44:10 +0800
c3d98e39d dax: Remove pointless writeback from dax_do_io() ... Browse Code »

dax_do_io() is calling filemap_write_and_wait() if DIO_LOCKING flags is
set. Presumably this was copied over from direct IO code. However DAX
inodes have no pagecache pages to write so the call is pointless. Remove
it.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

Jan Kara
2016-05-17 14:44:09 +0800
069c77bc9 dax: Remove zeroing from dax_io() ... Browse Code »

All the filesystems are now zeroing blocks themselves for DAX IO to avoid
races between dax_io() and dax_fault(). Remove the zeroing code from
dax_io() and add warning to catch the case when somebody unexpectedly
returns new or unwritten buffer.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

Jan Kara
2016-05-17 14:44:09 +0800
2b10945c5 dax: Remove dead zeroing code from fault handlers ... Browse Code »

Now that all filesystems zero out blocks allocated for a fault handler,
we can just remove the zeroing from the handler itself. Also add checks
that no filesystem returns to us unwritten or new buffer.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

Jan Kara
2016-05-17 14:44:08 +0800
86b0624e4 ext2: Avoid DAX zeroing to corrupt data ... Browse Code »

Currently ext2 zeroes any data blocks allocated for DAX inode however it
still returns them as BH_New. Thus DAX code zeroes them again in
dax_insert_mapping() which can possibly overwrite the data that has been
already stored to those blocks by a racing dax_io(). Avoid marking
pre-zeroed buffers as new.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

Jan Kara
2016-05-17 14:44:07 +0800
9b6cd5f76 ext2: Fix block zeroing in ext2_get_blocks() for DAX ... Browse Code »

When zeroing allocated blocks for DAX, we accidentally zeroed only the
first allocated block instead of all of them. So far this problem is
hidden by the fact that page faults always need only a single block and
DAX write code zeroes blocks again. But the zeroing in DAX code is racy
and needs to be removed so fix the zeroing in ext2 to zero all allocated
blocks.

Reported-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

Jan Kara
2016-05-17 14:41:24 +0800
02fbd1397 dax: Remove complete_unwritten argument ... Browse Code »

Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.

Reviewed-by: Ross Zwisler
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

Jan Kara
2016-05-17 08:11:51 +0800
e4b274915 DAX: move RADIX_DAX_ definitions to dax.c ... Browse Code »

These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
Let's try to maintain the idea that radix-tree simply implements an
abstract data type.

Acked-by: Ross Zwisler
Reviewed-by: Matthew Wilcox
Signed-off-by: NeilBrown
Signed-off-by: Jan Kara
Signed-off-by: Vishal Verma

NeilBrown
2016-05-17 08:11:51 +0800

13 May, 2016

5 commits

12735f881 ext4: pre-zero allocated blocks for DAX IO ... Browse Code »

Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:

ext4_ext_direct_IO() dax_fault()
dax_io()
get_block() - allocates unwritten extent
copy_from_iter_pmem()
get_block() - converts
unwritten block to
written and zeroes it
out
ext4_convert_unwritten_extents()

So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.

Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:51:15 +0800
914f82a32 ext4: refactor direct IO code ... Browse Code »

Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
and ext4_ind_direct_IO(). However the extent based function calls into
the indirect based one for some cases and for example it is not able to
handle file extending. Previously it was not also properly handling
retries in case of ENOSPC errors. With DAX things would get even more
contrieved so just refactor the direct IO code and instead of indirect /
extent split do the split to read vs writes.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:44:16 +0800
dbc427ce4 ext4: fix race in transient ENOSPC detection ... Browse Code »

When there are blocks to free in the running transaction, block
allocator can return ENOSPC although the filesystem has some blocks to
free. We use ext4_should_retry_alloc() to force commit of the current
transaction and return whether anything was committed so that it makes
sense to retry the allocation. However the transaction may get committed
after block allocation fails but before we call
ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false
because there is nothing to commit and we wrongly return ENOSPC.

Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
when we tried to commit a transaction. This should not add any
unnecessary retries since we had a transaction running a while ago when
trying to allocate blocks and we want to retry the allocation once that
transaction has committed anyway.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:42:40 +0800
7cb476f83 ext4: handle transient ENOSPC properly for DAX ... Browse Code »

ext4_dax_get_blocks() was accidentally omitted fixing get blocks
handlers to properly handle transient ENOSPC errors. Fix it now to use
ext4_get_blocks_trans() helper which takes care of these errors.

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:38:16 +0800
aef39ab15 dax: call get_blocks() with create == 1 for write faults to unwritten extents ... Browse Code »

Currently, __dax_fault() does not call get_blocks() callback with create
argument set, when we got back unwritten extent from the initial
get_blocks() call during a write fault. This is because originally
filesystems were supposed to convert unwritten extents to written ones
using complete_unwritten() callback. Later this was abandoned in favor of
using pre-zeroed blocks however the condition whether get_blocks() needs
to be called with create == 1 remained.

Fix the condition so that filesystems are not forced to zero-out and
convert unwritten extents when get_blocks() is called with create == 0
(which introduces unnecessary overhead for read faults and can be
problematic as the filesystem may possibly be read-only).

Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-13 12:38:15 +0800

06 May, 2016

5 commits

816cd71b0 ext4: remove unmeetable inconsisteny check from ext4_find_extent() ... Browse Code »

ext4_find_extent(), stripped down to the parts relevant to this patch,
reads as

ppos = 0;
i = depth;
while (i) {
--i;
++ppos;
if (unlikely(ppos > depth)) {
...
ret = -EFSCORRUPTED;
goto err;
}
}

Due to the loop's bounds, the condition ppos > depth can never be met.

Remove this dead code.

Signed-off-by: Nicolai Stange
Signed-off-by: Theodore Ts'o

Nicolai Stange
2016-05-06 10:43:04 +0800
466c3fb61 jbd2: remove excess descriptions for handle_s ... Browse Code »

Commit bf6993276f74 ("jbd2: Use tracepoints for history file")
removed the members j_history, j_history_max and j_history_cur from struct
handle_s but the descriptions stayed lingering. Removing them.

Signed-off-by: Luis de Bethencourt
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara

Luis de Bethencourt
2016-05-06 10:35:54 +0800
32157de29 ext4: remove unnecessary bio get/put ... Browse Code »

ext4_io_submit() used to check for EOPNOTSUPP after bio submission,
which is why it had to get an extra reference to the bio before
submitting it. But since we no longer touch the bio after submission,
get rid of the redundant get/put of the bio. If we do get the extra
reference, we enter the slower path of having to flag this bio as now
having external references.

Signed-off-by: Jens Axboe
Signed-off-by: Theodore Ts'o

Jens Axboe
2016-05-06 10:09:49 +0800
935244cd5 ext4: silence UBSAN in ext4_mb_init() ... Browse Code »

Currently, in ext4_mb_init(), there's a loop like the following:

do {
...
offset += 1 << (sb->s_blocksize_bits - i);
i++;
} while (i s_blocksize_bits + 1);

Note that the updated offset is used in the loop's next iteration only.

However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports

UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
shift exponent 4294967295 is too large for 32-bit type 'int'
[...]
Call Trace:
[] dump_stack+0xbc/0x117
[] ? _atomic_dec_and_lock+0x169/0x169
[] ubsan_epilogue+0xd/0x4e
[] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[] ? __ubsan_handle_load_invalid_value+0x158/0x158
[] ? kmem_cache_alloc+0x101/0x390
[] ? ext4_mb_init+0x13b/0xfd0
[] ? create_cache+0x57/0x1f0
[] ? create_cache+0x11a/0x1f0
[] ? mutex_lock+0x38/0x60
[] ? mutex_unlock+0x1b/0x50
[] ? put_online_mems+0x5b/0xc0
[] ? kmem_cache_create+0x117/0x2c0
[] ext4_mb_init+0xc49/0xfd0
[...]

Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of offset is never used again.

Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange
Signed-off-by: Theodore Ts'o

Nicolai Stange
2016-05-06 07:46:19 +0800
b5cb316cd ext4: address UBSAN warning in mb_find_order_for_block() ... Browse Code »

Currently, in mb_find_order_for_block(), there's a loop like the following:

while (order bd_blkbits + 1) {
...
bb += 1 << (e4b->bd_blkbits - order);
}

Note that the updated bb is used in the loop's next iteration only.

However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
shift exponent -1 is negative
[...]
Call Trace:
[] dump_stack+0xbc/0x117
[] ? _atomic_dec_and_lock+0x169/0x169
[] ubsan_epilogue+0xd/0x4e
[] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
[] ? __ubsan_handle_load_invalid_value+0x158/0x158
[] ? ext4_mb_generate_from_pa+0x590/0x590
[] ? ext4_read_block_bitmap_nowait+0x598/0xe80
[] mb_find_order_for_block+0x1ce/0x240
[...]

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of bb is never used again.

Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange
Signed-off-by: Theodore Ts'o

Nicolai Stange
2016-05-06 05:38:03 +0800

05 May, 2016

2 commits

74177f55b ext4: fix oops on corrupted filesystem ... Browse Code »

When filesystem is corrupted in the right way, it can happen
ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
subsequently remove inode from the in-memory orphan list. However this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave i_orphan list_head with a stale content. Later we can look at this
content causing list corruption, oops, or other issues. The reported
trace looked like:

WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
60462947 62219960 602ede24 62219960
602ede24 603ca293 622198f0 602f02eb
62219950 6002c12c 62219900 601b4d6b
Call Trace:
[] ? vprintk_emit+0x2dc/0x5c0
[] ? printk+0x0/0x94
[] show_stack+0xdc/0x1a0
[] ? printk+0x0/0x94
[] ? printk+0x0/0x94
[] dump_stack+0x2a/0x2c
[] warn_slowpath_common+0x9c/0xf0
[] ? __list_del_entry+0x6b/0x100
[] warn_slowpath_fmt+0x94/0xa0
[] ? __mutex_lock_slowpath+0x239/0x3a0
[] ? warn_slowpath_fmt+0x0/0xa0
[] ? set_signals+0x3f/0x50
[] ? kmem_cache_free+0x10a/0x180
[] ? mutex_lock+0x18/0x30
[] __list_del_entry+0x6b/0x100
[] ext4_orphan_del+0x22c/0x2f0
[] ? __ext4_journal_start_sb+0x2c/0xa0
[] ? ext4_truncate+0x383/0x390
[] ext4_write_begin+0x30b/0x4b0
[] ? copy_from_user+0x0/0xb0
[] ? iov_iter_fault_in_readable+0xa0/0xc0
[] generic_perform_write+0xaf/0x1e0
[] ? file_update_time+0x46/0x110
[] __generic_file_write_iter+0x18f/0x1b0
[] ext4_file_write_iter+0x15f/0x470
[] ? unlink_file_vma+0x0/0x70
[] ? unlink_anon_vmas+0x0/0x260
[] ? free_pgtables+0xb9/0x100
[] __vfs_write+0xb0/0x130
[] vfs_write+0xa5/0x170
[] SyS_write+0x56/0xe0
[] ? __libc_waitpid+0x0/0xa0
[] handle_syscall+0x68/0x90
[] userspace+0x4fd/0x600
[] ? save_registers+0x1f/0x40
[] ? arch_prctl+0x177/0x1b0
[] fork_handler+0x85/0x90

Fix the problem by using list_del_init() as we always should with
i_orphan list.

CC: stable@vger.kernel.org
Reported-by: Vegard Nossum
Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o

Jan Kara
2016-05-05 23:10:15 +0800
ff0bc0845 ext4: fix check of dqget() return value in ext4_ioctl_setproject() ... Browse Code »

A failed call to dqget() returns an ERR_PTR() and not null. Fix
the check in ext4_ioctl_setproject() to handle this correctly.

Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
Cc: stable@vger.kernel.org # v4.5
Signed-off-by: Seth Forshee
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara

Seth Forshee
2016-05-05 22:52:38 +0800

30 Apr, 2016

2 commits

7827a7f6e ext4: clean up error handling when orphan list is corrupted ... Browse Code »

Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted. If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2016-04-30 12:49:54 +0800
c9eb13a91 ext4: fix hang when processing corrupted orphaned inode list ... Browse Code »

If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly). Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.

This can be reproduced via:

mke2fs -t ext4 /tmp/foo.img 100
debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
mount -o loop /tmp/foo.img /mnt

(But don't do this if you are using an unpatched kernel if you care
about the system staying functional. :-)

This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1]. (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)

[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf

Cc: stable@vger.kernel.org
Reported by: Vegard Nossum
Signed-off-by: Theodore Ts'o

Theodore Ts'o
2016-04-30 12:48:54 +0800