Eric Lee / smarc-fsl-linux-kernel

23 Oct, 2012

2 commits

e589db7a6 Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ... Browse Code »

Pull ext4 fixes from Ted Ts'o:
"Various bug fixes for ext4. The most serious of them fixes a security
bug (CVE-2012-4508) which leads to stale data exposure when we have
fallocate racing against writes to files undergoing delayed
allocation. We also have two fixes for the metadata checksum feature,
the most serious of which can cause the superblock to have a invalid
checksum after a power failure."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Avoid underflow in ext4_trim_fs()
ext4: Checksum the block bitmap properly with bigalloc enabled
ext4: fix undefined bit shift result in ext4_fill_flex_info
ext4: fix metadata checksum calculation for the superblock
ext4: race-condition protection for ext4_convert_unwritten_extents_endio
ext4: serialize fallocate with ext4_convert_unwritten_extents

Linus Torvalds
2012-10-23 13:48:26 +0800
5de35e8d5 ext4: Avoid underflow in ext4_trim_fs() ... Browse Code »

Currently if len argument in ext4_trim_fs() is smaller than one block,
the 'end' variable underflow. Avoid that by returning EINVAL if len is
smaller than file system block.

Also remove useless unlikely().

Signed-off-by: Lukas Czerner
Signed-off-by: "Theodore Ts'o"
Cc: stable@vger.kernel.org

Lukas Czerner
2012-10-23 06:01:19 +0800

22 Oct, 2012

1 commit

79f1ba495 ext4: Checksum the block bitmap properly with bigalloc enabled ... Browse Code »

In mke2fs, we only checksum the whole bitmap block and it is right.
While in the kernel, we use EXT4_BLOCKS_PER_GROUP to indicate the
size of the checksumed bitmap which is wrong when we enable bigalloc.
The right size should be EXT4_CLUSTERS_PER_GROUP and this patch fixes
it.

Also as every caller of ext4_block_bitmap_csum_set and
ext4_block_bitmap_csum_verify pass in EXT4_BLOCKS_PER_GROUP(sb)/8,
we'd better removes this parameter and sets it in the function itself.

Signed-off-by: Tao Ma
Signed-off-by: "Theodore Ts'o"
Reviewed-by: Lukas Czerner
Cc: stable@vger.kernel.org

Tao Ma
2012-10-22 12:34:32 +0800

16 Oct, 2012

1 commit

76495ec1d ext4: fix undefined bit shift result in ext4_fill_flex_info ... Browse Code »

The result of the bit shift expression in
'1 << sbi->s_log_groups_per_flex' can be undefined in the case that
s_log_groups_per_flex is 31 because the result of the shift is bigger
than INT_MAX. In reality this probably should not cause much problems
since we'll end up with INT_MIN which will then be converted into
'unsigned int' type, but nevertheless according to the ISO C99 the
result is actually undefined.

Fix this by changing the left operand to 'unsigned int' type.

Note that the commit d50f2ab6f050311dbf7b8f5501b25f0bf64a439b already
tried to fix the undefined behaviour, but this was missed.

Thanks to Laszlo Ersek for pointing this out and suggesting the fix.

Signed-off-by: Lukas Czerner
Signed-off-by: "Theodore Ts'o"
Reviewed-by: Carlos Maiolino
Reported-by: Laszlo Ersek

Lukas Czerner
2012-10-16 00:56:49 +0800

10 Oct, 2012

2 commits

06db49e68 ext4: fix metadata checksum calculation for the superblock ... Browse Code »

The function ext4_handle_dirty_super() was calculating the superblock
on the wrong block data. As a result, when the superblock is modified
while it is mounted (most commonly, when inodes are added or removed
from the orphan list), the superblock checksum would be wrong. We
didn't notice because the superblock *was* being correctly calculated
in ext4_commit_super(), and this would get called when the file system
was unmounted. So the problem only became obvious if the system
crashed while the file system was mounted.

Fix this by removing the poorly designed function signature for
ext4_superblock_csum_set(); if it only took a single argument, the
pointer to a struct superblock, the ambiguity which caused this
mistake would have been impossible.

Reported-by: George Spelvin
Signed-off-by: "Theodore Ts'o"
Cc: stable@vger.kernel.org

Theodore Ts'o
2012-10-10 13:06:58 +0800
dee1f973c ext4: race-condition protection for ext4_convert_unwritten_extents_endio ... Browse Code »

We assumed that at the time we call ext4_convert_unwritten_extents_endio()
extent in question is fully inside [map.m_lblk, map->m_len] because
it was already split during submission. But this may not be true due to
a race between writeback vs fallocate.

If extent in question is larger than requested we will split it again.
Special precautions should being done if zeroout required because
[map.m_lblk, map->m_len] already contains valid data.

Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"
Cc: stable@vger.kernel.org

Dmitry Monakhov
2012-10-10 13:04:58 +0800

09 Oct, 2012

1 commit

0b173bc4d mm: kill vma flag VM_CAN_NONLINEAR ... Browse Code »

Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().

Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.

Now device drivers can implement this method and obtain nonlinear vma support.

Signed-off-by: Konstantin Khlebnikov
Cc: Alexander Viro
Cc: Carsten Otte
Cc: Chris Metcalf #arch/tile
Cc: Cyrill Gorcunov
Cc: Eric Paris
Cc: H. Peter Anvin
Cc: Hugh Dickins
Cc: Ingo Molnar
Cc: James Morris
Cc: Jason Baron
Cc: Kentaro Takeda
Cc: Matt Helsley
Cc: Nick Piggin
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Robert Richter
Cc: Suresh Siddha
Cc: Tetsuo Handa
Cc: Venkatesh Pallipadi
Acked-by: Linus Torvalds
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-10-09 15:22:17 +0800

08 Oct, 2012

1 commit

6432f2128 Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ... Browse Code »

Pull ext4 updates from Ted Ts'o:
"The big new feature added this time is supporting online resizing
using the meta_bg feature. This allows us to resize file systems
which are greater than 16TB. In addition, the speed of online
resizing has been improved in general.

We also fix a number of races, some of which could lead to deadlocks,
in ext4's Asynchronous I/O and online defrag support, thanks to good
work by Dmitry Monakhov.

There are also a large number of more minor bug fixes and cleanups
from a number of other ext4 contributors, quite of few of which have
submitted fixes for the first time."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
ext4: fix ext4_flush_completed_IO wait semantics
ext4: fix mtime update in nodelalloc mode
ext4: fix ext_remove_space for punch_hole case
ext4: punch_hole should wait for DIO writers
ext4: serialize truncate with owerwrite DIO workers
ext4: endless truncate due to nonlocked dio readers
ext4: serialize unlocked dio reads with truncate
ext4: serialize dio nonlocked reads with defrag workers
ext4: completed_io locking cleanup
ext4: fix unwritten counter leakage
ext4: give i_aiodio_unwritten a more appropriate name
ext4: ext4_inode_info diet
ext4: convert to use leXX_add_cpu()
ext4: ext4_bread usage audit
fs: reserve fallocate flag codepoint
ext4: remove redundant offset check in mext_check_arguments()
ext4: don't clear orphan list on ro mount with errors
jbd2: fix assertion failure in commit code due to lacking transaction credits
ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
ext4: enable FITRIM ioctl on bigalloc file system
...

Linus Torvalds
2012-10-08 05:36:39 +0800

05 Oct, 2012

2 commits

60d4616f3 ext4: serialize fallocate with ext4_convert_unwritten_extents ... Browse Code »

Fallocate should wait for pended ext4_convert_unwritten_extents()
otherwise following race may happen:

ftruncate( ,12288);
fallocate( ,0, 4096)
io_sibmit( ,0, 4096); /* Write to fallocated area, split extent if needed */
fallocate( ,0, 8192); /* Grow extent and broke assumption about extent */

Later kwork completion will do:
->ext4_convert_unwritten_extents (0, 4096)
->ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CONVERT_EXT);
->ext4_ext_map_blocks() /* Will find new extent: ex = [0,2] !!!!!! */
->ext4_ext_handle_uninitialized_extents()
->ext4_convert_unwritten_extents_endio()
/* convert [0,2] extent to initialized, but only[0,1] was written */

Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-10-05 23:32:02 +0800
c278531d3 ext4: fix ext4_flush_completed_IO wait semantics ... Browse Code »

BUG #1) All places where we call ext4_flush_completed_IO are broken
because buffered io and DIO/AIO goes through three stages
1) submitted io,
2) completed io (in i_completed_io_list) conversion pended
3) finished io (conversion done)
And by calling ext4_flush_completed_IO we will flush only
requests which were in (2) stage, which is wrong because:
1) punch_hole and truncate _must_ wait for all outstanding unwritten io
regardless to it's state.
2) fsync and nolock_dio_read should also wait because there is
a time window between end_page_writeback() and ext4_add_complete_io()
As result integrity fsync is broken in case of buffered write
to fallocated region:
fsync blkdev_completion
->filemap_write_and_wait_range
->ext4_end_bio
->end_page_writeback
ext4_flush_completed_IO
sees empty i_completed_io_list but pended
conversion still exist
->ext4_add_complete_io

BUG #2) Race window becomes wider due to the 'ext4: completed_io
locking cleanup V4' patch series

This patch make following changes:
1) ext4_flush_completed_io() now first try to flush completed io and when
wait for any outstanding unwritten io via ext4_unwritten_wait()
2) Rename function to more appropriate name.
3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
prevent endless wait

Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"
Reviewed-by: Jan Kara

Dmitry Monakhov
2012-10-05 23:31:55 +0800

03 Oct, 2012

3 commits

aab174f0d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs update from Al Viro:

- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c

(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).

A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.

- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).

- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.

- Alex's fs/coredump.c spiltoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.

- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...

Linus Torvalds
2012-10-03 11:25:04 +0800
8c0a85377 fs: push rcu_barrier() from deactivate_locked_super() to filesystems ... Browse Code »

There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Kirill A. Shutemov
2012-10-03 09:35:55 +0800
437589a74 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.

The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.

The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.

Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.

Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.

While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.

Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.

After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...

Linus Torvalds
2012-10-03 02:11:09 +0800

02 Oct, 2012

1 commit

99dbb1632 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial ... Browse Code »

Pull the trivial tree from Jiri Kosina:
"Tiny usual fixes all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
doc: fix old config name of kprobetrace
fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
btrfs: fix the commment for the action flags in delayed-ref.h
btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
vfs: fix kerneldoc for generic_fh_to_parent()
treewide: fix comment/printk/variable typos
ipr: fix small coding style issues
doc: fix broken utf8 encoding
nfs: comment fix
platform/x86: fix asus_laptop.wled_type module parameter
mfd: printk/comment fixes
doc: getdelays.c: remember to close() socket on error in create_nl_socket()
doc: aliasing-test: close fd on write error
mmc: fix comment typos
dma: fix comments
spi: fix comment/printk typos in spi
Coccinelle: fix typo in memdup_user.cocci
tmiofb: missing NULL pointer checks
tools: perf: Fix typo in tools/perf
tools/testing: fix comment / output typos
...

Linus Torvalds
2012-10-02 00:06:36 +0800

01 Oct, 2012

3 commits

041bbb6d3 ext4: fix mtime update in nodelalloc mode ... Browse Code »

Commits 5e8830dc85d0 and 41c4d25f78c0 introduced a regression into
v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
take place for files modified via mmap if the page was already in the
page cache. This would also affect ext3 file systems mounted using
the ext4 file system driver.

The problem was that ext4_page_mkwrite() had a shortcut which would
avoid calling __block_page_mkwrite() under some circumstances, and the
above two commit transferred the responsibility of calling
file_update_time() to __block_page_mkwrite --- which woudln't get
called in some circumstances.

Since __block_page_mkwrite() only has three callers,
block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
best way to solve this is to move the responsibility for calling
file_update_time() to its caller.

This problem was found via xfstests #215 with a file system mounted
with -o nodelalloc.

Signed-off-by: "Theodore Ts'o"
Reviewed-by: Jan Kara
Cc: KONISHI Ryusuke
Cc: stable@vger.kernel.org

Theodore Ts'o
2012-10-01 11:04:56 +0800
6f2080e64 ext4: fix ext_remove_space for punch_hole case ... Browse Code »

Inode is allowed to have empty leaf only if it this is blockless inode.

Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-10-01 11:03:50 +0800
02d262dff ext4: punch_hole should wait for DIO writers ... Browse Code »

punch_hole is the place where we have to wait for all existing writers
(writeback, aio, dio), but currently we simply flush pended end_io request
which is not sufficient. Other issue is that punch_hole performed w/o i_mutex
held which obviously result in dangerous data corruption due to
write-after-free.

This patch performs following changes:
- Guard punch_hole with i_mutex
- Recheck inode flags under i_mutex
- Block all new dio readers in order to prevent information leak caused by
read-after-free pattern.
- punch_hole now wait for all writers in flight
NOTE: XXX write-after-free race is still possible because new dirty pages
may appear due to mmap(), and currently there is no easy way to stop
writeback while punch_hole is in progress.

[ Fixed error return from ext4_ext_punch_hole() to make sure that we
release i_mutex before returning EPERM or ETXTBUSY -- Ted ]

Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-10-01 11:03:42 +0800

29 Sep, 2012

8 commits

1f555cfa2 ext4: serialize truncate with owerwrite DIO workers ... Browse Code »

Jan Kara have spotted interesting issue:
There are potential data corruption issue with direct IO overwrites
racing with truncate:
Like:
dio write truncate_task
->ext4_ext_direct_IO
->overwrite == 1
->down_read(&EXT4_I(inode)->i_data_sem);
->mutex_unlock(&inode->i_mutex);
->ext4_setattr()
->inode_dio_wait()
->truncate_setsize()
->ext4_truncate()
->down_write(&EXT4_I(inode)->i_data_sem);
->__blockdev_direct_IO
->ext4_get_block
->submit_io()
->up_read(&EXT4_I(inode)->i_data_sem);
# truncate data blocks, allocate them to
# other inode - bad stuff happens because
# dio is still in flight.

In order to serialize with truncate dio worker should grab extra i_dio_count
reference before drop i_mutex.

Reviewed-by: Jan Kara
Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-29 12:58:26 +0800
1b65007e9 ext4: endless truncate due to nonlocked dio readers ... Browse Code »

If we have enough aggressive DIO readers, truncate and other dio
waiters will wait forever inside inode_dio_wait(). It is reasonable
to disable nonlock DIO read optimization during truncate.

Reviewed-by: Jan Kara
Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-29 12:56:15 +0800
1c9114f9c ext4: serialize unlocked dio reads with truncate ... Browse Code »

Current serialization will works only for DIO which holds
i_mutex, but nonlocked DIO following race is possible:

dio_nolock_read_task truncate_task
->ext4_setattr()
->inode_dio_wait()
->ext4_ext_direct_IO
->ext4_ind_direct_IO
->__blockdev_direct_IO
->ext4_get_block
->truncate_setsize()
->ext4_truncate()
#alloc truncated blocks
#to other inode
->submit_io()
#INFORMATION LEAK

In order to serialize with unlocked DIO reads we have to
rearrange wait sequence
1) update i_size first
2) if i_size about to be reduced wait for outstanding DIO requests
3) and only after that truncate inode blocks

Reviewed-by: Jan Kara
Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-29 12:55:23 +0800
17335dcc4 ext4: serialize dio nonlocked reads with defrag workers ... Browse Code »

Inode's block defrag and ext4_change_inode_journal_flag() may
affect nonlocked DIO reads result, so proper synchronization
required.

- Add missed inode_dio_wait() calls where appropriate
- Check inode state under extra i_dio_count reference.

Reviewed-by: Jan Kara
Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-29 12:41:21 +0800
28a535f9a ext4: completed_io locking cleanup ... Browse Code »

Current unwritten extent conversion state-machine is very fuzzy.
- For unknown reason it performs conversion under i_mutex. What for?
My diagnosis:
We already protect extent tree with i_data_sem, truncate and punch_hole
should wait for DIO, so the only data we have to protect is end_io->flags
modification, but only flush_completed_IO and end_io_work modified this
flags and we can serialize them via i_completed_io_lock.

Currently all these games with mutex_trylock result in the following deadlock
truncate: kworker:
ext4_setattr ext4_end_io_work
mutex_lock(i_mutex)
inode_dio_wait(inode) ->BLOCK
DEADLOCKflags modification
is protected by ei->ext4_complete_io_lock

Full list of changes:
- Move all completion end_io related routines to page-io.c in order to improve
logic locality
- Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
- remove EXT4_IO_END_FSYNC
- Improve SMP scalability by removing useless i_mutex which does not
protect io->flags anymore.
- Reduce lock contention on i_completed_io_lock by optimizing list walk.
- Rename ext4_end_io_nolock to end4_end_io and make it static
- Check flush completion status to ext4_ext_punch_hole(). Because it is
not good idea to punch blocks from corrupted inode.

Changes since V3 (in request to Jan's comments):
Fall back to active flush_completed_IO() approach in order to prevent
performance issues with nolocked DIO reads.
Changes since V2:
Fix use-after-free caused by race truncate vs end_io_work

Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-29 12:14:55 +0800
82e542291 ext4: fix unwritten counter leakage ... Browse Code »

ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it back
on error path.

- add missed error checks to prevent counter leakage
- ext4_end_io_nolock() will clear EXT4_END_IO_UNWRITTEN flag to signal
that conversion finished.
- add BUG_ON to ext4_free_end_io() to prevent similar leakage in future.

Visible effect of this bug is that unaligned aio_stress may deadlock

Reviewed-by: Jan Kara
Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-29 11:36:25 +0800
e27f41e1b ext4: give i_aiodio_unwritten a more appropriate name ... Browse Code »

AIO/DIO prefix is wrong because it account unwritten extents which
also may be scheduled from buffered write endio

Reviewed-by: Jan Kara
Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-29 11:24:52 +0800
f45ee3a1e ext4: ext4_inode_info diet ... Browse Code »

Generic inode has unused i_private pointer which may be used as cur_aio_dio
storage.

TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
to have concurent AIO_DIO requests.

Reviewed-by: Zheng Liu
Reviewed-by: Jan Kara
Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-29 11:21:09 +0800

27 Sep, 2012

15 commits

ba39ebb61 ext4: convert to use leXX_add_cpu() ... Browse Code »

Convert cpu_to_leXX(leXX_to_cpu(E1) + E2) to use leXX_add_cpu().

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun
Signed-off-by: "Theodore Ts'o"

Wei Yongjun
2012-09-27 21:37:53 +0800
6d1ab10e6 ext4: ext4_bread usage audit ... Browse Code »

When ext4_bread() returns NULL and err is set to zero, this means
there is no phyical block mapped to the specified logical block
number. (Previous to commit 90b0a97323, err was uninitialized in this
case, which caused other problems.)

The directory handling routines use ext4_bread() in many places, the
fact that ext4_bread() now returns NULL with err set to zero could
cause problems since a number of these functions will simply return
the value of err if the result of ext4_bread() was the NULL pointer,
causing the caller of the function to think that the function was
successful.

Since directories should never contain holes, this case can only
happen if the file system is corrupted. This commit audits all of the
callers of ext4_bread(), and makes sure they do the right thing if a
hole in a directory is found by ext4_bread().

Some ext4_bread() callers did not need any changes either because they
already had its own hole detector paths.

Signed-off-by: Carlos Maiolino
Signed-off-by: "Theodore Ts'o"

Carlos Maiolino
2012-09-27 21:31:33 +0800
cbb4ee830 ext4: remove redundant offset check in mext_check_arguments() ... Browse Code »

In the check code above, if orig_start != donor_start, we would
return -EINVAL. So here, orig_start should be equal with donor_start.
Remove the redundant check here.

Signed-off-by: Wang Sheng-Hui
Signed-off-by: "Theodore Ts'o"

Wang Sheng-Hui
2012-09-27 20:00:01 +0800
c25f9bc61 ext4: don't clear orphan list on ro mount with errors ... Browse Code »

If the file system contains errors and it is being mounted read-only,
don't clear the orphan list. We should minimize changes to the file
system if it is mounted read-only.

Signed-off-by: Eric Sandeen
Signed-off-by: "Theodore Ts'o"

Eric Sandeen
2012-09-27 11:30:12 +0800
9b6873327 ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails ... Browse Code »

When the EXT4_IOC_MOVE_EXT ioctl() fails on bigalloc file systems, we
should jump to the 'mext_out' label to release the donor file reference.

Signed-off-by: Djalal Harouni
Signed-off-by: "Theodore Ts'o"

Djalal Harouni
2012-09-27 10:58:50 +0800
aaf7d73e5 ext4: enable FITRIM ioctl on bigalloc file system ... Browse Code »

With a minor tweaks regarding minimum extent size to discard and
discarded bytes reporting the FITRIM can be enabled on bigalloc file
system and it works without any problem.

This patch fixes minlen handling and discarded bytes reporting to
take into consideration bigalloc enabled file systems and finally
removes the restriction and allow FITRIM to be used on file system with
bigalloc feature enabled.

Reviewed-by: Carlos Maiolino
Signed-off-by: Lukas Czerner
Signed-off-by: "Theodore Ts'o"

Lukas Czerner
2012-09-27 10:21:21 +0800
2903ff019 switch simple cases of fget_light to fdget ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-09-27 10:20:08 +0800
b71fc079b ext4: fix fdatasync() for files with only i_size changes ... Browse Code »

Code tracking when transaction needs to be committed on fdatasync(2) forgets
to handle a situation when only inode's i_size is changed. Thus in such
situations fdatasync(2) doesn't force transaction with new i_size to disk
and that can result in wrong i_size after a crash.

Fix the issue by updating inode's i_datasync_tid whenever its size is
updated.

CC: # >= 2.6.32
Reported-by: Kristian Nielsen
Signed-off-by: Jan Kara

Jan Kara
2012-09-27 09:52:20 +0800
6a08f447f ext4: always set i_op in ext4_mknod() ... Browse Code »

ext4_special_inode_operations have their own ifdef CONFIG_EXT4_FS_XATTR
to mask those methods. And ext4_iget also always sets it, so there is
an inconsistency.

Signed-off-by: Bernd Schubert
Signed-off-by: "Theodore Ts'o"
Cc: stable@vger.kernel.org

Bernd Schubert
2012-09-27 09:24:57 +0800
6bdf29540 switch EXT4_IOC_MOVE_EXT to fget_light() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-09-27 09:10:06 +0800
399c9b862 ext4: close struct file leak on EXT4_IOC_MOVE_EXT ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-09-27 09:10:06 +0800
63fedaf1c ext4: remove unused function ext4_ext_check_cache ... Browse Code »

Remove unused function ext4_ext_check_cache() and merge the code back to
the ext4_ext_in_cache().

Reviewed-by: Carlos Maiolino
Signed-off-by: Lukas Czerner
Signed-off-by: "Theodore Ts'o"

Lukas Czerner
2012-09-27 09:09:06 +0800
85556c9a5 ext4: use kmem_cache_zalloc instead of kmem_cache_alloc/memset ... Browse Code »

Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun
Signed-off-by: "Theodore Ts'o"

Wei Yongjun
2012-09-27 08:43:37 +0800
8c8544739 ext4: reimplement uninit extent optimization for move_extent_per_page() ... Browse Code »

Uninitialized extent may became initialized(parallel writeback task)
at any moment after we drop i_data_sem, so we have to recheck extent's
state after we hold page's lock and i_data_sem.

If we about to change page's mapping we must hold page's lock in order to
serialize other users.

Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-27 00:54:52 +0800
bb5574880 ext4: clean up online defrag bugs in move_extent_per_page() ... Browse Code »

Non-full list of bugs:
1) uninitialized extent optimization does not hold page's lock,
and simply replace brunches after that writeback code goes
crazy because block mapping changed under it's feets
kernel BUG at fs/ext4/inode.c:1434! ( 288'th xfstress)

2) uninitialized extent may became initialized right after we
drop i_data_sem, so extent state must be rechecked

3) Locked pages goes uptodate via following sequence:
->readpage(page); lock_page(page); use_that_page(page)
But after readpage() one may invalidate it because it is
uptodate and unlocked (reclaimer does that)
As result kernel bug at include/linux/buffer_head.c:133!

4) We call write_begin() with already opened stansaction which
result in following deadlock:
->move_extent_per_page()
->ext4_journal_start()-> hold journal transaction
->write_begin()
->ext4_da_write_begin()
->ext4_nonda_switch()
->writeback_inodes_sb_if_idle() --> will wait for journal_stop()

5) try_to_release_page() may fail and it does fail if one of page's bh was
pinned by journal

6) If we about to change page's mapping we MUST hold it's lock during entire
remapping procedure, this is true for both pages(original and donor one)

Fixes:

- Avoid (1) and (2) simply by temproraly drop uninitialized extent handling
optimization, this will be reimplemented later.

- Fix (3) by manually forcing page to uptodate state w/o dropping it's lock

- Fix (4) by rearranging existing locking:
from: journal_start(); ->write_begin
to: write_begin(); journal_extend()
- Fix (5) simply by checking retvalue
- Fix (6) by locking both (original and donor one) pages during extent swap
with help of mext_page_double_lock()

Signed-off-by: Dmitry Monakhov
Signed-off-by: "Theodore Ts'o"

Dmitry Monakhov
2012-09-27 00:52:07 +0800