15 Dec, 2015
3 commits
-
commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 upstream.
If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option. But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.Task A Task B
ext4_handle_error()
-> jbd2_journal_abort()
-> __journal_abort_soft()
-> __jbd2_journal_abort_hard()
| -> journal->j_flags |= JBD2_ABORT;
|
| __ext4_abort()
| -> jbd2_journal_abort()
| | -> __journal_abort_soft()
| | -> if (journal->j_flags & JBD2_ABORT)
| | return;
| -> panic()
|
-> jbd2_journal_update_sb_errno()Tested-by: Hobin Woo
Signed-off-by: Daeho Jeong
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit 6934da9238da947628be83635e365df41064b09b upstream.
There is a use-after-free possibility in __ext4_journal_stop() in the
case that we free the handle in the first jbd2_journal_stop() because
we're referencing handle->h_err afterwards. This was introduced in
9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by
storing the handle->h_err value beforehand and avoid referencing
potentially freed handle.Fixes: 9705acd63b125dee8b15c705216d7186daea4625
Signed-off-by: Lukas Czerner
Reviewed-by: Andreas Dilger
Signed-off-by: Greg Kroah-Hartman -
commit 937d7b84dca58f2565715f2c8e52f14c3d65fb22 upstream.
There are times when ext4_bio_write_page() is called even though we
don't actually need to do any I/O. This happens when ext4_writepage()
gets called by the jbd2 commit path when an inode needs to force its
pages written out in order to provide data=ordered guarantees --- and
a page is backed by an unwritten (e.g., uninitialized) block on disk,
or if delayed allocation means the page's backing store hasn't been
allocated yet. In that case, we need to skip the call to
ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
bounce page and an ext4 crypto context getting leaked.Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman
30 Sep, 2015
2 commits
-
commit bdfe0cbd746aa9b2509c2f6d6be17193cf7facd7 upstream.
This reverts commit 08439fec266c3cc5702953b4f54bdf5649357de0.
Unfortunately we still need to test for bdi->dev to avoid a crash when a
USB stick is yanked out while a file system is mounted:usb 2-2: USB disconnect, device number 2
Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write
JBD2: Error -5 detected when updating journal superblock for sdb1-8.
BUG: unable to handle kernel paging request at 34beb000
IP: [] __percpu_counter_add+0x18/0xc0
*pdpt = 0000000023db9001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
CPU: 0 PID: 4083 Comm: umount Tainted: G U OE 4.1.1-040101-generic #201507011435
Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
task: ebf06b50 ti: ebebc000 task.ti: ebebc000
EIP: 0060:[] EFLAGS: 00010082 CPU: 0
EIP is at __percpu_counter_add+0x18/0xc0
EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001
ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0
Stack:
c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160
ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160
f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000
Call Trace:
[] account_page_dirtied+0x74/0x110
[] __set_page_dirty+0x3f/0xb0
[] mark_buffer_dirty+0x53/0xc0
[] ext4_commit_super+0x17b/0x250
[] ext4_put_super+0xc1/0x320
[] ? fsnotify_unmount_inodes+0x1aa/0x1c0
[] ? evict_inodes+0xca/0xe0
[] generic_shutdown_super+0x6a/0xe0
[] ? prepare_to_wait_event+0xd0/0xd0
[] ? unregister_shrinker+0x40/0x50
[] kill_block_super+0x26/0x70
[] deactivate_locked_super+0x45/0x80
[] deactivate_super+0x47/0x60
[] cleanup_mnt+0x39/0x80
[] __cleanup_mnt+0x10/0x20
[] task_work_run+0x91/0xd0
[] do_notify_resume+0x7c/0x90
[] work_notify
Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89
EIP: [] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40
CR2: 0000000034beb000
---[ end trace dd564a7bea834ecd ]---Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit c642dc9e1aaed953597e7092d7df329e6234096e upstream.
At some point along this sequence of changes:
f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
bb04457 ext4: support freezing ext2 (nojournal) file systems
9ca9238 ext4: Use separate super_operations structure for no_journal filesystemsext4 started setting needs_recovery on filesystems without journals
when they are unfrozen. This makes no sense, and in fact confuses
blkid to the point where it doesn't recognize the filesystem at all.(freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
see needs_recovery set on fs w/ no journal).To fix this, don't manipulate the INCOMPAT_RECOVER feature on
filesystems without journals.Reported-by: Stu Mark
Reviewed-by: Jan Kara
Signed-off-by: Eric Sandeen
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman
22 Sep, 2015
1 commit
-
commit a068acf2ee77693e0bf39d6e07139ba704f461c3 upstream.
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g. new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else. This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
of "sudo" is something more sneaky:$ BASE="ovl"
$ MNT="$BASE/mnt"
$ LOW="$BASE/lower"
$ UP="$BASE/upper"
$ WORK="$BASE/work/ 0 0
none /proc fuse.pwn user_id=1000"
$ mkdir -p "$LOW" "$UP" "$WORK"
$ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
$ cat /proc/mounts
none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
none /proc fuse.pwn user_id=1000 0 0
$ fusermount -u /proc
$ cat /proc/mounts
cat: /proc/mounts: No such file or directoryThis fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed. Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook
Acked-by: Serge Hallyn
Acked-by: Jan Kara
Acked-by: Paul Moore
Cc: J. R. Okajima
Signed-off-by: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
04 Aug, 2015
10 commits
-
commit 7444a072c387a93ebee7066e8aee776954ab0e41 upstream.
ext4_free_blocks is looping around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open coded loop and replace it with __GFP_NOFAIL. Without the
flag the allocator has no way to find out never-fail requirement and
cannot help in any way.Signed-off-by: Michal Hocko
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit 8974fec7d72e3e02752fe0f27b4c3719c78d9a15 upstream.
Currently ext4_ind_migrate() doesn't correctly handle a file which
contains a hole at the beginning of the file. This caused the migration
to be done incorrectly, and then if there is a subsequent following
delayed allocation write to the "hole", this would reclaim the same data
blocks again and results in fs corruption.# assmuing 4k block size ext4, with delalloc enabled
# skip the first block and write to the second block
xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile# converting to indirect-mapped file, which would move the data blocks
# to the beginning of the file, but extent status cache still marks
# that region as a hole
chattr -e /mnt/ext4/testfile# delayed allocation writes to the "hole", reclaim the same data block
# again, results in i_blocks corruption
xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
umount /mnt/ext4
e2fsck -nf /dev/sda6
...
Inode 53, i_blocks is 16, should be 8. Fix? no
...Signed-off-by: Eryu Guan
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit d6f123a9297496ad0b6335fe881504c4b5b2a5e5 upstream.
Currently the check in ext4_ind_migrate() is not enough before doing the
real conversion:a) delayed allocated extents could bypass the check on eh->eh_entries
and eh->eh_depthThis can be demonstrated by this script
xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
chattr -e /mnt/ext4/testfilewhere testfile has two extents but still be converted to non-extent
based file format.b) only extent length is checked but not the offset, which would result
in data lose (delalloc) or fs corruption (nodelalloc), because
non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
blocksThis can be demostrated by
xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
chattr -e /mnt/ext4/testfile
syncIf delalloc is enabled, dmesg prints
EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
EXT4-fs (dm-4): This should not happen!! Data will be lostIf delalloc is disabled, e2fsck -nf shows corruption
Inode 53, i_size is 5497558142976, should be 4096. Fix? noFix the two issues by
a) forcing all delayed allocation blocks to be allocated before checking
eh->eh_depth and eh->eh_entries
b) limiting the last logical block of the extent is within direct mapSigned-off-by: Eryu Guan
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit 9705acd63b125dee8b15c705216d7186daea4625 upstream.
On delalloc enabled file system on invalidatepage operation
in ext4_da_page_release_reservation() we want to clear the delayed
buffer and remove the extent covering the delayed buffer from the extent
status tree.However currently there is a bug where on the systems with page size >
block size we will always remove extents from the start of the page
regardless where the actual delayed buffers are positioned in the page.
This leads to the errors like this:EXT4-fs warning (device loop0): ext4_da_release_space:1225:
ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
blocksThis however can cause data loss on writeback time if the file system is
in ENOSPC condition because we're releasing reservation for someones
else delayed buffer.Fix this by only removing extents that corresponds to the part of the
page we want to invalidate.This problem is reproducible by the following fio receipt (however I was
only able to reproduce it with fio-2.1 or older.[global]
bs=8k
iodepth=1024
iodepth_batch=60
randrepeat=1
size=1m
directory=/mnt/test
numjobs=20
[job1]
ioengine=sync
bs=1k
direct=1
rw=randread
filename=file1:file2
[job2]
ioengine=libaio
rw=randwrite
direct=1
filename=file1:file2
[job3]
bs=1k
ioengine=posixaio
rw=randwrite
direct=1
filename=file1:file2
[job5]
bs=1k
ioengine=sync
rw=randread
filename=file1:file2
[job7]
ioengine=libaio
rw=randwrite
filename=file1:file2
[job8]
ioengine=posixaio
rw=randwrite
filename=file1:file2
[job10]
ioengine=mmap
rw=randwrite
bs=1k
filename=file1:file2
[job11]
ioengine=mmap
rw=randwrite
direct=1
filename=file1:file2Signed-off-by: Lukas Czerner
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman -
commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 upstream.
Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible
deadlocks in the page writeback path.Signed-off-by: Nikolay Borisov
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit 0f0ff9a9f3fa2ec6f427603fd521d5f3a0b076d1 upstream.
Commit 8f4d8558391: "ext4: fix lazytime optimization" was not a
complete fix. In the case where the inode number is a multiple of 16,
and we could still end up updating an inode with dirty timestamps
written to the wrong inode on disk. Oops.This can be easily reproduced by using generic/005 with a file system
with metadata_csum and lazytime enabled.Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit a2fd66d069d86d793e9d39d4079b96f46d13f237 upstream.
Newer versions of mount parse the lazytime feature and pass it to the
mount system call via the flags field in the mount system call,
removing the lazytime string from the mount options list. So we need
to check for the presence of MS_LAZYTIME and set it in sb->s_flags in
order for this flag to be set on a remount.Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit 292db1bc6c105d86111e858859456bcb11f90f91 upstream.
ext4 isn't willing to map clusters to a non-extent file. Don't signal
this with an out of space error, since the FS will retry the
allocation (which didn't fail) forever. Instead, return EUCLEAN so
that the operation will fail immediately all the way back to userspace.(The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.)
Signed-off-by: Darrick J. Wong
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit 89d96a6f8e6491f24fc8f99fd6ae66820e85c6c1 upstream.
Normally all of the buffers will have been forced out to disk before
we call invalidate_bdev(), but there will be some cases, where a file
system operation was aborted due to an ext4_error(), where there may
still be some dirty buffers in the buffer cache for the device. So
try to force them out to memory before calling invalidate_bdev().This fixes a warning triggered by generic/081:
WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman -
commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream.
The commit cf108bca465d: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock. However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if page had gotten truncated out from under
us.This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0- and -
kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
...
Call Trace:
[] ? __ext4_journalled_invalidatepage+0x117/0x117
[] __ext4_journalled_invalidatepage+0x10f/0x117
[] ? __ext4_journalled_invalidatepage+0x117/0x117
[] ? lock_buffer+0x36/0x36
[] ext4_journalled_invalidatepage+0xd/0x22
[] do_invalidatepage+0x22/0x26
[] truncate_inode_page+0x5b/0x85
[] truncate_inode_pages_range+0x156/0x38c
[] truncate_inode_pages+0x11/0x15
[] truncate_pagecache+0x55/0x71
[] ext4_setattr+0x4a9/0x560
[] ? current_kernel_time+0x10/0x44
[] notify_change+0x1c7/0x2be
[] do_truncate+0x65/0x85
[] ? file_ra_state_init+0x12/0x29- and -
WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
...
Call Trace:
[] ? console_unlock+0x3a1/0x3ce
[] dump_stack+0x48/0x60
[] warn_slowpath_common+0x89/0xa0
[] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
[] warn_slowpath_null+0x14/0x18
[] jbd2_journal_dirty_metadata+0x14a/0x1ae
[] __ext4_handle_dirty_metadata+0xd4/0x19d
[] write_end_fn+0x40/0x53
[] ext4_walk_page_buffers+0x4e/0x6a
[] ext4_writepage+0x354/0x3b8
[] ? mpage_release_unused_pages+0xd4/0xd4
[] ? wait_on_buffer+0x2c/0x2c
[] ? ext4_writepage+0x3b8/0x3b8
[] __writepage+0x10/0x2e
[] write_cache_pages+0x22d/0x32c
[] ? ext4_writepage+0x3b8/0x3b8
[] ext4_writepages+0x102/0x607
[] ? sched_clock_local+0x10/0x10e
[] ? __lock_is_held+0x2e/0x44
[] ? lock_is_held+0x43/0x51
[] do_writepages+0x1c/0x29
[] __writeback_single_inode+0xc3/0x545
[] writeback_sb_inodes+0x21f/0x36d
...Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman
15 May, 2015
6 commits
-
The xfstests test suite assumes that an attempt to collapse range on
the range (0, 1) will return EOPNOTSUPP if the file system does not
support collapse range. Commit 280227a75b56: "ext4: move check under
lock scope to close a race" broke this, and this caused xfstests to
fail when run when testing file systems that did not have the extents
feature enabled.Reported-by: Eric Whitney
Signed-off-by: Theodore Ts'o -
The following commit introduced a bug when checking for zero length extent
5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries()
Zero length extent could pass the check if lblock is zero.
Adding the explicit check for zero length back.
Signed-off-by: Eryu Guan
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org -
Currently when journal restart fails, we'll have the h_transaction of
the handle set to NULL to indicate that the handle has been effectively
aborted. We handle this situation quietly in the jbd2_journal_stop() and just
free the handle and exit because everything else has been done before we
attempted (and failed) to restart the journal.Unfortunately there are a number of problems with that approach
introduced with commit41a5b913197c "jbd2: invalidate handle if jbd2_journal_restart()
fails"First of all in ext4 jbd2_journal_stop() will be called through
__ext4_journal_stop() where we would try to get a hold of the superblock
by dereferencing h_transaction which in this case would lead to NULL
pointer dereference and crash.In addition we're going to free the handle regardless of the refcount
which is bad as well, because others up the call chain will still
reference the handle so we might potentially reference already freed
memory.Moreover it's expected that we'll get aborted handle as well as detached
handle in some of the journalling function as the error propagates up
the stack, so it's unnecessary to call WARN_ON every time we get
detached handle.And finally we might leak some memory by forgetting to free reserved
handle in jbd2_journal_stop() in the case where handle was detached from
the transaction (h_transaction is NULL).Fix the NULL pointer dereference in __ext4_journal_stop() by just
calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
the potential memory leak in jbd2_journal_stop() and use proper
handle refcounting before we attempt to free it to avoid use-after-free
issues.And finally remove all WARN_ON(!transaction) from the code so that we do
not get random traces when something goes wrong because when journal
restart fails we will get to some of those functions.Cc: stable@vger.kernel.org
Signed-off-by: Lukas Czerner
Signed-off-by: Theodore Ts'o
Reviewed-by: Jan Kara -
The ext4_extent_tree_init() function hasn't been in the ext4 code for
a long time ago, except in an unused function prototype in ext4.hGoogle-Bug-Id: 4530137
Signed-off-by: Theodore Ts'o -
Google-Bug-Id: 20939131
Signed-off-by: Theodore Ts'o -
We had a fencepost error in the lazytime optimization which means that
timestamp would get written to the wrong inode.Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o
04 May, 2015
1 commit
-
Pull ext4 fixes from Ted Ts'o:
"Some miscellaneous bug fixes and some final on-disk and ABI changes
for ext4 encryption which provide better security and performance"* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix growing of tiny filesystems
ext4: move check under lock scope to close a race.
ext4: fix data corruption caused by unwritten and delayed extents
ext4 crypto: remove duplicated encryption mode definitions
ext4 crypto: do not select from EXT4_FS_ENCRYPTION
ext4 crypto: add padding to filenames before encrypting
ext4 crypto: simplify and speed up filename encryption
03 May, 2015
3 commits
-
The estimate of necessary transaction credits in ext4_flex_group_add()
is too pessimistic. It reserves credit for sb, resize inode, and resize
inode dindirect block for each group added in a flex group although they
are always the same block and thus it is enough to account them only
once. Also the number of modified GDT block is overestimated since we
fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.Make the estimation more precise. That reduces number of requested
credits enough that we can grow 20 MB filesystem (which has 1 MB
journal, 79 reserved GDT blocks, and flex group size 16 by default).Signed-off-by: Jan Kara
Signed-off-by: Theodore Ts'o
Reviewed-by: Eric Sandeen -
fallocate() checks that the file is extent-based and returns
EOPNOTSUPP in case is not. Other tasks can convert from and to
indirect and extent so it's safe to check only after grabbing
the inode mutex.Signed-off-by: Davide Italiano
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org -
Currently it is possible to lose whole file system block worth of data
when we hit the specific interaction with unwritten and delayed extents
in status extent tree.The problem is that when we insert delayed extent into extent status
tree the only way to get rid of it is when we write out delayed buffer.
However there is a limitation in the extent status tree implementation
so that when inserting unwritten extent should there be even a single
delayed block the whole unwritten extent would be marked as delayed.At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when a we write
into said unwritten extent we will convert it to written, but it still
remains delayed.When we try to write into that block later ext4_da_map_blocks() will set
the buffer new and delayed and map it to invalid block which causes
the rest of the block to be zeroed loosing already written data.For now we can fix this by simply not allowing to set delayed status on
written extent in the extent status tree. Also add WARN_ON() to make
sure that we notice if this happens in the future.This problem can be easily reproduced by running the following xfs_io.
xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
-c "falloc 0 131072" \
-c "pwrite -S 0xbb 65536 2048" \
-c "fsync" /mnt/test/fffecho 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fffThis can be theoretically also reproduced by at random by running fsx,
but it's not very reliable, though on machines with bigger page size
(like ppc) this can be seen more often (especially xfstest generic/127)Signed-off-by: Lukas Czerner
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
02 May, 2015
4 commits
-
This patch removes duplicated encryption modes which were already in
ext4.h. They were duplicated from commit 3edc18d and commit f542fb.Cc: Theodore Ts'o
Cc: Michael Halcrow
Cc: Andreas Dilger
Signed-off-by: Chanho Park
Signed-off-by: Theodore Ts'o -
This patch adds a tristate EXT4_ENCRYPTION to do the selections
for EXT4_FS_ENCRYPTION because selecting from a bool causes all
the selected options to be built-in, even if EXT4 itself is a
module.Signed-off-by: Herbert Xu
Signed-off-by: Theodore Ts'o -
This obscures the length of the filenames, to decrease the amount of
information leakage. By default, we pad the filenames to the next 4
byte boundaries. This costs nothing, since the directory entries are
aligned to 4 byte boundaries anyway. Filenames can also be padded to
8, 16, or 32 bytes, which will consume more directory space.Change-Id: Ibb7a0fb76d2c48e2061240a709358ff40b14f322
Signed-off-by: Theodore Ts'o -
Avoid using SHA-1 when calculating the user-visible filename when the
encryption key is available, and avoid decrypting lots of filenames
when searching for a directory entry in a directory block.Change-Id: If4655f144784978ba0305b597bfa1c8d7bb69e63
Signed-off-by: Theodore Ts'o
27 Apr, 2015
1 commit
-
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
25 Apr, 2015
1 commit
-
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]After:
clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]In other setups, Robert Elliott reported seeing good performance
improvements:https://lkml.org/lkml/2015/4/3/557
The more applications accessing the device, the worse it gets.
Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Theodore Ts'o
Cc: Elliott, Robert (Server Storage)
Cc: Al Viro
Signed-off-by: Jens Axboe
Signed-off-by: Al Viro
20 Apr, 2015
1 commit
-
Pull ext4 updates from Ted Ts'o:
"A few bug fixes and add support for file-system level encryption in
ext4"* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
ext4 crypto: enable encryption feature flag
ext4 crypto: add symlink encryption
ext4 crypto: enable filename encryption
ext4 crypto: filename encryption modifications
ext4 crypto: partial update to namei.c for fname crypto
ext4 crypto: insert encrypted filenames into a leaf directory block
ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames
ext4 crypto: filename encryption facilities
ext4 crypto: implement the ext4 decryption read path
ext4 crypto: implement the ext4 encryption write path
ext4 crypto: inherit encryption policies on inode and directory create
ext4 crypto: enforce context consistency
ext4 crypto: add encryption key management facilities
ext4 crypto: add ext4 encryption facilities
ext4 crypto: add encryption policy and password salt support
ext4 crypto: add encryption xattr support
ext4 crypto: export ext4_empty_dir()
ext4 crypto: add ext4 encryption Kconfig
ext4 crypto: reserve codepoints used by the ext4 encryption feature
ext4 crypto: add ext4_mpage_readpages()
...
17 Apr, 2015
2 commits
-
Pull third hunk of vfs changes from Al Viro:
"This contains the ->direct_IO() changes from Omar + saner
generic_write_checks() + dealing with fcntl()/{read,write}() races
(mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
repeatedly looking at ->f_flags, which can be changed by fcntl(2),
check ->ki_flags - which cannot) + infrastructure bits for dhowells'
d_inode annotations + Christophs switch of /dev/loop to
vfs_iter_write()"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
block: loop: switch to VFS ITER_BVEC
configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
NFS: Don't use d_inode as a variable name
VFS: Impose ordering on accesses of d_inode and d_flags
VFS: Add owner-filesystem positive/negative dentry checks
nfs: generic_write_checks() shouldn't be done on swapout...
ocfs2: use __generic_file_write_iter()
mirror O_APPEND and O_DIRECT into iocb->ki_flags
switch generic_write_checks() to iocb and iter
ocfs2: move generic_write_checks() before the alignment checks
ocfs2_file_write_iter: stop messing with ppos
udf_file_write_iter: reorder and simplify
fuse: ->direct_IO() doesn't need generic_write_checks()
ext4_file_write_iter: move generic_write_checks() up
xfs_file_aio_write_checks: switch to iocb/iov_iter
generic_write_checks(): drop isblk argument
blkdev_write_iter: expand generic_file_checks() call in there
... -
Pull quota and udf updates from Jan Kara:
"The pull contains quota changes which complete unification of XFS and
VFS quota interfaces (so tools can use either interface to manipulate
any filesystem). There's also a patch to support project quotas in
VFS quota subsystem from Li Xi.Finally there's a bunch of UDF fixes and cleanups and tiny cleanup in
reiserfs & ext3"* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
udf: Update ctime and mtime when directory is modified
udf: return correct errno for udf_update_inode()
ext3: Remove useless condition in if statement.
vfs: Add general support to enforce project quota limits
reiserfs: fix __RASSERT format string
udf: use int for allocated blocks instead of sector_t
udf: remove redundant buffer_head.h includes
udf: remove else after return in __load_block_bitmap()
udf: remove unused variable in udf_table_free_blocks()
quota: Fix maximum quota limit settings
quota: reorder flags in quota state
quota: paranoia: check quota tree root
quota: optimize i_dquot access
quota: Hook up Q_XSETQLIM for id 0 to ->set_info
xfs: Add support for Q_SETINFO
quota: Make ->set_info use structure with neccesary info to VFS and XFS
quota: Remove ->get_xstate and ->get_xstatev callbacks
gfs2: Convert to using ->get_state callback
xfs: Convert to using ->get_state callback
quota: Wire up Q_GETXSTATE and Q_GETXSTATV calls to work with ->get_state
...
16 Apr, 2015
5 commits
-
Also add the test dummy encryption mode flag so we can more easily
test the encryption patches using xfstests.Signed-off-by: Michael Halcrow
Signed-off-by: Theodore Ts'o -
Signed-off-by: Uday Savagaonkar
Signed-off-by: Theodore Ts'o -
Merge second patchbomb from Andrew Morton:
- the rest of MM
- various misc bits
- add ability to run /sbin/reboot at reboot time
- printk/vsprintf changes
- fiddle with seq_printf() return value
* akpm: (114 commits)
parisc: remove use of seq_printf return value
lru_cache: remove use of seq_printf return value
tracing: remove use of seq_printf return value
cgroup: remove use of seq_printf return value
proc: remove use of seq_printf return value
s390: remove use of seq_printf return value
cris fasttimer: remove use of seq_printf return value
cris: remove use of seq_printf return value
openrisc: remove use of seq_printf return value
ARM: plat-pxa: remove use of seq_printf return value
nios2: cpuinfo: remove use of seq_printf return value
microblaze: mb: remove use of seq_printf return value
ipc: remove use of seq_printf return value
rtc: remove use of seq_printf return value
power: wakeup: remove use of seq_printf return value
x86: mtrr: if: remove use of seq_printf return value
linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
MAINTAINERS: CREDITS: remove Stefano Brivio from B43
.mailmap: add Ricardo Ribalda
CREDITS: add Ricardo Ribalda Delgado
... -
The original dax patchset split the ext2/4_file_operations because of the
two NULL splice_read/splice_write in the dax case.In the vfs if splice_read/splice_write are NULL we then call
default_splice_read/write.What we do here is make generic_file_splice_read aware of IS_DAX() so the
original ext2/4_file_operations can be used as is.For write it appears that iter_file_splice_write is just fine. It uses
the regular f_op->write(file,..) or new_sync_write(file, ...).Signed-off-by: Boaz Harrosh
Reviewed-by: Jan Kara
Cc: Dave Chinner
Cc: Matthew Wilcox
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
From: Yigal Korman
[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.A new xfstest is submitted for testing this (generic/080)
[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.Signed-off-by: Yigal Korman
Signed-off-by: Boaz Harrosh
Reviewed-by: Jan Kara
Cc: Matthew Wilcox
Cc: Dave Chinner
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Kirill A. Shutemov
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds