05 Nov, 2013

1 commit


18 Oct, 2013

4 commits

  • commit 118b23022512eb2f41ce42db70dc0568d00be4ba upstream.

    dynamic_dname() is both too much and too little for those - the
    output may be well in excess of the 64 bytes dynamic_dname() assumes
    to be enough (thanks to ashmem feeding really long names to
    shmem_file_setup()) and vsnprintf() is overkill for those
    guys.

    Signed-off-by: Al Viro
    Cc: Colin Cross
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 6e4ea8e33b2057b85d75175dd89b93f5e26de3bc upstream.

    If we take the 2nd retry path in ext4_expand_extra_isize_ea, we
    potentially return from the function without having freed these
    allocations. If we don't do the return, we overwrite the previous
    allocation pointers, so we leak either way. (A minimal sketch of the
    fix follows this entry.)

    Spotted with Coverity.

    [ Fixed by tytso to set is and bs to NULL after freeing these
    pointers, in case in the retry loop we later end up triggering an
    error causing a jump to cleanup, at which point we could have a double
    free bug. -- Ted ]

    Signed-off-by: Dave Jones
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Eric Sandeen
    Signed-off-by: Greg Kroah-Hartman

    Dave Jones
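
    A minimal sketch of the retry-path cleanup described above, assuming the
    local pointer names is, bs and bh used by ext4_expand_extra_isize_ea() in
    fs/ext4/xattr.c of that era:

        brelse(bh);
        kfree(is);
        kfree(bs);
        is = NULL;      /* reset so a later jump to cleanup cannot free them again */
        bs = NULL;
        goto retry;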
     
  • commit 4871c1588f92c6c13f4713a7009f25f217055807 upstream.

    btrfs_rename was using the root of the old dir instead of the root of the
    new dir when checking for a hash collision, so if you tried to move a file
    into a subvol it would freak out because it would see the file you are
    trying to move in its current root. This fixes the bug where the following
    sequence would fail:

    btrfs subvol create test1
    btrfs subvol create test2
    mv test1 test2.

    Thanks to Chris Murphy for catching this.

    Reported-by: Chris Murphy
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 9d05746e7b16d8565dddbe3200faa1e669d23bbf upstream.

    Olga reported that file descriptors opened with O_PATH do not work with
    fstatfs(), found during further development of ksh93's thread support.

    There is no reason not to allow O_PATH file descriptors here (fstatfs is
    very much a path operation), so use "fdget_raw()"; a minimal sketch follows
    this entry. See commit 55815f70147d ("vfs: make O_PATH file descriptors
    usable for 'fstat()'") for a very similar issue reported for fstat() by the
    same team.

    Reported-and-tested-by: ольга крыжановская
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
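
    A minimal sketch of the change described above, assuming the fd_statfs()
    helper in fs/statfs.c of that era:

        int fd_statfs(int fd, struct kstatfs *st)
        {
                struct fd f = fdget_raw(fd);    /* was fdget(); the raw variant also accepts O_PATH fds */
                int error = -EBADF;

                if (f.file) {
                        error = vfs_statfs(&f.file->f_path, st);
                        fdput(f);
                }
                return error;
        }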
     

14 Oct, 2013

9 commits

  • commit b8d0c69b9469ffd33df30fee3e990f2d4aa68a09 upstream.

    A user was reporting weird warnings from btrfs_put_delayed_ref() and I
    noticed that we were doing this list_del_init() on our head ref outside of
    delayed_refs->lock. This is a problem: if there are still entries on the
    list, we could end up modifying stale pointers and such. Fix this by
    removing us from the list before we do our run_delayed_ref on our head ref
    (a minimal sketch follows this entry). Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
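
    A minimal sketch of the ordering described above; the field name 'cluster'
    for the head ref's list linkage follows fs/btrfs/extent-tree.c of that era:

        spin_lock(&delayed_refs->lock);
        list_del_init(&locked_ref->cluster);    /* unhook the head ref while the lock is still held */
        spin_unlock(&delayed_refs->lock);
        /* only now run the delayed ref for the head, with no list linkage left to corrupt */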
     
  • commit a05254143cd183b18002cbba7759a1e4629aa762 upstream.

    We have logic to see if we've already created a parent directory by
    checking whether an inode inside of that directory has a lower inode
    number than the one we are currently processing. The logic is that if
    there is a lower inode number, then we would already have had to make sure
    the directory was created at that previous point. The problem is that a
    subvol's inode numbers count from the lowest objectid in the root tree,
    which may be less than our current progress. So just skip if our dir item
    key is a root item. This fixes the original test and the xfstest version I
    made that added an extra subvol create. Thanks,

    Reported-by: Emil Karlson
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit b6c60c8018c4e9beb2f83fc82c09f9d033766571 upstream.

    Previously we only added blocks to the list to have their backrefs checked
    if the level of the block is right above the one we are searching for.
    This is because we want to make sure we don't add the entire path up to
    the root to the lists, so that we process things one at a time. This
    assumes that if any blocks in the path to the root are not going to be
    checked (shared, in other words) then they will be in the level right
    above the current block on up. This isn't quite right though, since we can
    have blocks higher up the path that are shared because they are attached
    to a reloc root. But we won't add this block to be checked, and then later
    on we will BUG_ON(!upper->checked). So instead keep track of whether or
    not we've queued a block to be checked in this current search, and if we
    haven't, go ahead and queue it to be checked. This patch fixes the panic I
    was seeing where we BUG_ON(!upper->checked). Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 997def25e4b9cee3b01609e18a52f926bca8bd2b upstream.

    Commit f5ea1100 cleans up the disk to host conversions for
    node directory entries, but because a variable is reused in
    xfs_node_toosmall() the next node is not correctly found.
    If the original node is small enough, this triggers an assertion failure
    (< BBTOB(bp->b_length),
    file: /root/newest/xfs/fs/xfs/xfs_trans_buf.c, line: 569).

    Keep the original node header to get the correct forward node.

    (When a node is considered for a merge with a sibling, it overwrites the
    sibling pointers of the original incore nodehdr with the sibling's
    pointers. This leads to the loop considering the original node as a merge
    candidate with itself in the second pass, and so it incorrectly
    determines that a merge should occur.)

    [v3: added Dave Chinner's (slightly modified) suggestion to the commit header,
    cleaned up whitespace. -bpm]

    Signed-off-by: Mark Tinguely
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers
    Signed-off-by: Greg Kroah-Hartman

    Mark Tinguely
     
  • commit 52b26a3e1bb3e065c32b3febdac1e1f117d88e15 upstream.

    - Fix an Oops when nfs4_ds_connect() returns an error.
    - Always check the device status after waiting for a connect to complete.

    Reported-by: Andy Adamson
    Reported-by: Jeff Layton
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 7f42ec3941560f0902fe3671e36f2c20ffd3af0a upstream.

    Many NILFS2 users have reported strange file system corruption, for
    example:

    NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
    NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)

    But such error messages are a consequence of a file system issue that
    takes place much earlier. Fortunately, Jerome Poulin and Anton Eliasson
    have also reported another issue. These reports describe a crash of the
    segctor thread:

    BUG: unable to handle kernel paging request at 0000000000004c83
    IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]

    Call Trace:
    nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
    nilfs_segctor_construct+0x17b/0x290 [nilfs2]
    nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
    kthread+0xc0/0xd0
    ret_from_fork+0x7c/0xb0

    These two issues share one root cause, and that root cause can give rise
    to a third issue as well: the segctor thread hangs while eating 100% CPU.

    REPRODUCING PATH:

    One possible way of reproducing the issue was described by
    Jerome Poulin:

    1. init S to get to single user mode.
    2. sysrq+E to make sure only my shell is running
    3. start network-manager to get my wifi connection up
    4. login as root and launch "screen"
    5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
    6. lscp | xz -9e > lscp.txt.xz
    7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
    8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
    9. start a screen and launch strace -f -o find-cat.log -t find
    /mnt/nilfs -type f -exec cat {} > /dev/null \;
    10. start a screen and launch strace -f -o apt-get.log -t apt-get update
    11. launch the last command again as it did not crash the first time
    12. apt-get crashes
    13. ps aux > ps-aux-crashed.log
    14. sysrq+W
    15. sysrq+E wait for everything to terminate
    16. sysrq+SUSB

    A simpler way of reproducing the issue is to start a kernel compilation
    and "apt-get update" in parallel.

    REPRODUCIBILITY:

    The issue does not reproduce reliably [60% - 80%]. It is very important to
    have a proper environment for reproducing the issue. The critical
    conditions for successful reproduction:

    (1) There should be a big file modified via mmap().

    (2) From time to time during processing, this file's count of dirty blocks
    should be greater than several segments in size (for example, two or
    three).

    (3) There should be intensive background file-modification activity in
    another thread.

    INVESTIGATION:

    First of all, it is possible to see that the reason for the crash is an
    invalid page address:

    NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
    NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783

    Moreover, the value of b_page (0x1a82) is 6786 in decimal. This value
    looks like a segment number, and the b_blocknr and b_size values look like
    block numbers. So the buffer_head's pointer holds an improper address
    value.

    Detailed investigation of the issue revealed the following picture:

    [-----------------------------SEGMENT 6783-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783

    [-----------------------------SEGMENT 6784-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
    NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
    NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
    [----------] ditto
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15

    [-----------------------------SEGMENT 6785-------------------------------]
    NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
    NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
    NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
    NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
    NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
    NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
    NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
    NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
    NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
    NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
    [----------] ditto
    NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12

    NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
    NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785

    NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82

    BUG: unable to handle kernel paging request at 0000000000001a82
    IP: [] nilfs_end_page_io+0x12/0xd0 [nilfs2]

    Usually, for every segment we collect dirty files in a list. Then, dirty
    blocks are gathered for every dirty file, prepared for write and submitted
    by means of a nilfs_segbuf_submit_bh() call. Finally, the complete-write
    phase takes place after nilfs_end_bio_write() is called on the block
    layer. In this final phase buffers/pages are marked as not dirty and the
    processed files are removed from the list of dirty files.

    It is possible to see that we had three prepare_write and submit_bio
    phases before the segbuf_wait and complete_write phase. Moreover, segments
    compete with each other for dirty blocks, because on every iteration of
    segment processing dirty buffer_heads are added to several lists of
    payload_buffers:

    [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
    [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8

    The next pointer is the same but the prev pointer has changed. It means
    that the buffer_head has its next pointer from one list but its prev
    pointer from another. Such a modification can be made several times and
    can finally result in various issues: (1) segctor hanging, (2) segctor
    crashing, (3) file system metadata corruption.

    FIX:
    This patch adds (a minimal sketch follows this entry):

    (1) setting of the BH_Async_Write flag in nilfs_segctor_prepare_write()
    for every processed dirty block;

    (2) checking of the BH_Async_Write flag in
    nilfs_lookup_dirty_data_buffers() and
    nilfs_lookup_dirty_node_buffers();

    (3) clearing of the BH_Async_Write flag in nilfs_segctor_complete_write(),
    nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().

    Reported-by: Jerome Poulin
    Reported-by: Anton Eliasson
    Cc: Paul Fertser
    Cc: ARAI Shun-ichi
    Cc: Piotr Szymaniak
    Cc: Juan Barry Manuel Canham
    Cc: Zahid Chowdhury
    Cc: Elmer Zhang
    Cc: Kenneth Langga
    Signed-off-by: Vyacheslav Dubeyko
    Acked-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vyacheslav Dubeyko
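
    A minimal sketch of the three points above, using the standard
    BH_Async_Write buffer-flag helpers from <linux/buffer_head.h>:

        /* (1) nilfs_segctor_prepare_write(): claim the block for this segment */
        set_buffer_async_write(bh);

        /* (2) nilfs_lookup_dirty_data_buffers() / nilfs_lookup_dirty_node_buffers():
         *     skip blocks already claimed by an in-flight segment construction */
        if (unlikely(buffer_async_write(bh)))
                continue;

        /* (3) nilfs_segctor_complete_write(), nilfs_abort_logs(),
         *     nilfs_forget_buffer(), nilfs_clear_dirty_page(): drop the claim */
        clear_buffer_async_write(bh);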
     
  • commit 0ab08f576b9e6a6b689fc6b4e632079b978e619b upstream.

    An earlier patch introducing the FUSE_I_SIZE_UNSTABLE flag provided a
    detailed description of races between ftruncate and anyone who can extend
    i_size:

    > 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size
    > changed (due to our own fuse_do_setattr()) and is going to call
    > truncate_pagecache() for some 'new_size' it believes valid right now. But by
    > the time that particular truncate_pagecache() is called ...
    > 2. fuse_do_setattr() returns (either having called truncate_pagecache() or
    > not -- it doesn't matter).
    > 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
    > 4. mmap-ed write makes a page in the extended region dirty.

    This patch adds the necessary bits to fuse_file_fallocate() to protect
    against that race (a minimal sketch follows this entry).

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Maxim Patlasov
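
    A minimal sketch of the protection described above, assuming the flag lives
    in the fuse_inode state bitmap introduced by the earlier FUSE_I_SIZE_UNSTABLE
    patch and is only needed when the operation may change i_size:

        struct fuse_inode *fi = get_fuse_inode(inode);

        if (!(mode & FALLOC_FL_KEEP_SIZE))
                set_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);  /* warn racing attribute updates */

        /* ... send the FUSE_FALLOCATE request and update i_size ... */

        if (!(mode & FALLOC_FL_KEEP_SIZE))
                clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);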
     
  • commit bde52788bdb755b9e4b75db6c434f30e32a0ca0b upstream.

    The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):

    1) A user makes a page dirty via mmap-ed write.
    2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
    and covering the page.
    3) Before the truncate_pagecache_range call from fuse_file_fallocate,
    the page goes to writeback. The page is fully processed by fuse_writepage
    (including end_page_writeback on the page), but fuse_flush_writepages did
    nothing because fi->writectr < 0.
    4) truncate_pagecache_range is called and fuse_file_fallocate finishes
    by calling fuse_release_nowrite. The latter triggers processing of the
    queued write-back request, which will soon write stale data to the hole.

    Changed in v2 (thanks to Brian for the suggestion):
    - Do not truncate the page cache until FUSE_FALLOCATE has succeeded.
    Otherwise, we can end up returning -ENOTSUPP while user data has already
    been punched from the page cache. Use filemap_write_and_wait_range()
    instead.
    Changed in v3 (thanks to Miklos for the suggestion):
    - fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite()
    instead. Since we only need a dirty-page barrier, fuse_sync_writes()
    should be enough.
    - rebased to the for-linus branch of fuse.git

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Maxim Patlasov
     
  • commit 72023656961b8c81a168a7a6762d589339d0d7ec upstream.

    A high setting of max_map_count, and a process core-dumping with a large
    enough vm_map_count, could result in an NT_FILE note not being written and
    the kernel crashing immediately afterwards because it had assumed
    otherwise.

    Reproduction of the oops-causing bug described here:

    https://lkml.org/lkml/2013/8/30/50

    The issue originated in commit 2aa362c49c31 ("coredump: extend core dump
    note section to contain file names of mapped files") from Oct 4, 2012.

    This patch makes that section optional in that case. fill_files_note()
    should signify the error, and the info struct in elf_core_dump() is
    zero-initialized so that we can check for the optionally written note.

    [akpm@linux-foundation.org: avoid abusing E2BIG, remove a couple of not-really-needed local variables]
    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Dan Aloni
    Cc: Al Viro
    Cc: Denys Vlasenko
    Reported-by: Martin MOKREJS
    Tested-by: Martin MOKREJS
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Aloni
     

05 Oct, 2013

2 commits

  • commit 49475555848d396a0c78fb2f8ecceb3f3f263ef1 upstream.

    The superblock lock replaced lock_super()/unlock_super() when those were
    removed, but was left uninitialized for the Seventh Edition UNIX
    filesystem in the following commit (3.7):
    c07cb01 sysv: drop lock/unlock super
    A minimal sketch of the fix follows this entry.

    Signed-off-by: Lubomir Rintel
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Lubomir Rintel
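
    A minimal sketch of the fix, assuming the lock in question is the s_lock
    mutex in the sysv in-core superblock info and that it is set up in
    v7_fill_super():

        mutex_init(&sbi->s_lock);   /* the lock that replaced lock_super(); V7 forgot to initialize it */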
     
  • commit 2f6cf0de0281d210061ce976f2d42d246adc75bb upstream.

    The memcpy() in bio_copy_data() was using the wrong offset vars, leading
    to data corruption in weird, unusual setups (an illustrative sketch
    follows this entry).

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kent Overstreet
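
    An illustrative sketch of the fix's intent (the variable names here are
    illustrative, not the exact upstream ones): each side of the copy must be
    advanced by its own running offset.

        memcpy(dst_p + dst_offset,      /* destination advances by the destination offset */
               src_p + src_offset,      /* source advances by the source offset */
               bytes);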
     

02 Oct, 2013

4 commits

  • commit adbe6991efd36104ac9eaf751993d35eaa7f493a upstream.

    This fixes a copy and paste error introduced by 9f060e2231
    ("block: Convert integrity to bvec_alloc_bs()").

    Found by Coverity (CID 1020654).

    Signed-off-by: Bjorn Helgaas
    Acked-by: Kent Overstreet
    Signed-off-by: Jens Axboe
    Cc: Jonghwan Choi
    Signed-off-by: Greg Kroah-Hartman

    Bjorn Helgaas
     
  • commit e729eac6f65e11c5f03b09adcc84bd5bcb230467 upstream.

    Refuse RW mount of a udf filesystem. So far we just silently changed it to
    an RO mount, but when the media is writeable, the block layer won't notice
    this change and thus will think the device is used RW and will block the
    eject button of the drive. That is unexpected by users because for
    non-writeable media the eject button works just fine.

    The userspace mount(8) command handles this just fine and retries mounting
    with MS_RDONLY set, so userspace shouldn't see any regression. Plus any
    tool mounting udf is likely confronted with the case of read-only media,
    where the block layer already refuses to mount the filesystem without
    MS_RDONLY set, so our behavior shouldn't be anything new for it.

    Reported-by: Hui Wang
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit d759bfa4e7919b89357de50a2e23817079889195 upstream.

    Change all functions used in filesystem discovery during mount to use
    standard kernel return values - -errno on error, 0 on success - instead of
    1 on failure and 0 on success. This allows us to pass an error number (not
    just failure / success) so we can abort device scanning earlier in case of
    errors like EIO or ENOMEM. It also lets us return EROFS in case a
    writeable mount is requested but writing isn't supported.

    Signed-off-by: Jan Kara
    Cc: Hui Wang
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit dfb1d61b0e9f9e2c542e9adc8d970689f4114ff6 upstream.

    If an error occurs after having called finish_open() then fput() needs to
    be called on the already opened file.

    Signed-off-by: Miklos Szeredi
    Cc: Steve French
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     

27 Sep, 2013

13 commits

  • commit efeb9e60d48f7778fdcad4a0f3ad9ea9b19e5dfd upstream.

    Userspace can add names containing a slash character to the directory
    listing. Don't allow this, as it could cause all sorts of trouble (a
    minimal sketch of the check follows this entry).

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
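
    A minimal sketch of the check, assuming it is applied where the readdir
    reply from userspace is parsed and that an offending name is treated as a
    protocol error:

        if (memchr(dirent->name, '/', dirent->namelen) != NULL)
                return -EIO;    /* a name supplied by the userspace server must not contain '/' */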
     
  • commit 06a7c3c2781409af95000c60a5df743fd4e2f8b4 upstream.

    The way fuse calls truncate_pagecache() from fuse_change_attributes() is
    completely wrong, because without i_mutex held we are never sure whether
    'oldsize' and 'attr->size' are still valid by the time
    truncate_pagecache(inode, oldsize, attr->size) executes. In fact, as soon
    as we release fc->lock in the middle of fuse_change_attributes(), we
    completely lose control of the actions which may happen to the given inode
    until we reach truncate_pagecache. The list of potentially dangerous
    actions includes mmap-ed reads and writes, ftruncate(2) and write(2)
    extending the file size.

    The typical outcome of doing truncate_pagecache() with outdated arguments
    is data corruption from the user's point of view. This is (in some sense)
    acceptable in cases when the issue is triggered by a change of the file on
    the server (i.e. externally with respect to the fuse operation), but it is
    absolutely intolerable in scenarios when a single fuse client modifies a
    file without any external intervention. A real-life case I discovered with
    fsx-linux looked like this:

    1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
    FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
    2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
    to the server synchronously, then calls fuse_change_attributes(). The
    latter updates i_size, releases fc->lock, but before comparing oldsize vs
    attr->size..
    3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
    updating attributes and i_size, but now oldsize is equal to
    outarg.attr.size because i_size has just been updated (step 2). Hence,
    fuse_do_setattr() returns w/o calling truncate_pagecache().
    4. As soon as ftruncate(2) completes, the user extends file size by
    write(2) making a hole in the middle of file, then reads data from the hole
    either by read(2) or mmap-ed read. The user expects to get zero data from
    the hole, but gets stale data because truncate_pagecache() is not executed
    yet.

    The scenario above illustrates one side of the problem: not truncating the
    page cache even though we should. The other side corresponds to truncating
    the page cache too late, when the state of the inode has changed
    significantly.
    Theoretically, the following is possible:

    1. As in the previous scenario fuse_dentry_revalidate() discovered that
    i_size changed (due to our own fuse_do_setattr()) and is going to call
    truncate_pagecache() for some 'new_size' it believes valid right now. But
    by the time that particular truncate_pagecache() is called ...
    2. fuse_do_setattr() returns (either having called truncate_pagecache() or
    not -- it doesn't matter).
    3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
    4. mmap-ed write makes a page in the extended region dirty.

    The result will be the loss of the data the user wrote in the fourth step.

    The patch is a hotfix resolving the issue in a simplistic way: skip the
    dangerous i_size update and truncate_pagecache if an operation changing
    the file size is in progress (a minimal sketch follows this entry). This
    simplistic approach looks correct for the cases without external changes.
    To handle those properly, more sophisticated and intrusive techniques
    (e.g. an NFS-like one) would be required. I'd like to postpone that until
    the issue has been well discussed on the mailing list(s).

    Changed in v2:
    - improved patch description to cover both sides of the issue.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Maxim Patlasov
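
    A minimal, illustrative sketch of the skip described above, assuming the
    in-progress state is tracked via the FUSE_I_SIZE_UNSTABLE bit and using the
    three-argument truncate_pagecache() of kernels of that era (locking and the
    matching i_size update are elided):

        if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state) && oldsize != attr->size)
                truncate_pagecache(inode, oldsize, attr->size);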
     
  • commit d331a415aef98717393dda0be69b7947da08eba3 upstream.

    Calls like setxattr and removexattr result in an update of ctime.
    Therefore invalidate the inode attributes to force a refresh (a minimal
    sketch follows this entry).

    Signed-off-by: Anand Avati
    Reviewed-by: Brian Foster
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Anand Avati
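
    A minimal sketch, assuming the invalidation is done right after a
    successful setxattr/removexattr request completes:

        if (!err)
                fuse_invalidate_attr(inode);    /* ctime changed on the server; drop cached attributes */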
     
  • commit 4a4ac4eba1010ef9a804569058ab29e3450c0315 upstream.

    The patch fixes a race between ftruncate(2), mmap-ed write and write(2):

    1) A user makes a page dirty via mmap-ed write.
    2) The user performs a shrinking truncate(2) intended to purge the page.
    3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
    writeback. fuse_writepage_locked fills a FUSE_WRITE request and releases
    the original page by end_page_writeback.
    4) fuse_do_setattr() completes and successfully returns. From now on,
    i_mutex is free.
    5) An ordinary write(2) extends i_size back to cover the page. Note that
    fuse_send_write_pages does wait for fuse writeback, but for another
    page->index.
    6) fuse_writepage_locked proceeds by queueing the FUSE_WRITE request.
    fuse_send_writepage is supposed to crop inarg->size of the request,
    but it doesn't because i_size has already been extended back.

    Moving end_page_writeback to the end of fuse_writepage_locked fixes the
    race because now the fact that truncate_pagecache has successfully
    returned implies that fuse_writepage_locked has already called
    end_page_writeback. And this, in turn, implies that fuse_flush_writepages
    has already called fuse_send_writepage, and the latter used the valid
    (shrunk) i_size. write(2) could not have extended it because of the
    i_mutex held by ftruncate(2).

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Maxim Patlasov
     
  • commit 494ddd11be3e2621096bb425eed2886f8e8446d4 upstream.

    Signed-off-by: Jianpeng Ma
    Reviewed-by: Sage Weil
    Signed-off-by: Greg Kroah-Hartman

    majianpeng
     
  • commit 17b7f7cf58926844e1dd40f5eb5348d481deca6a upstream.

    Refuse RW mount of an isofs filesystem. So far we just silently changed it
    to an RO mount, but when the media is writeable, the block layer won't
    notice this change and thus will think the device is used RW and will
    block the eject button of the drive. That is unexpected by users because
    for non-writeable media the eject button works just fine.

    The userspace mount(8) command handles this just fine and retries mounting
    with MS_RDONLY set, so userspace shouldn't see any regression. Plus any
    tool mounting isofs is likely confronted with the case of read-only media,
    where the block layer already refuses to mount the filesystem without
    MS_RDONLY set, so our behavior shouldn't be anything new for it.

    Reported-by: Hui Wang
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit aee1c13dd0f6c2fc56e0e492b349ee8ac655880f upstream.

    Don't allow mounting the proc filesystem unless the caller has
    CAP_SYS_ADMIN rights over the pid namespace (a minimal sketch of the check
    follows this entry). The principle here is that if you created it, or have
    capabilities over it, you can mount it; otherwise you get to live with
    what other people have mounted.

    Andy pointed out that this is needed to prevent users in a user
    namespace from remounting proc and specifying different hidepid and gid
    options on already existing proc mounts.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
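
    A minimal sketch of the check, assuming it sits in proc's mount path once
    the pid namespace 'ns' has been resolved:

        if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
                return ERR_PTR(-EPERM);     /* only the namespace's admin may (re)mount its proc */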
     
  • commit 28e8be31803b19d0d8f76216cb11b480b8a98bec upstream.

    Calling the fiemap ioctl(2) with a given start offset as well as a desired
    mapping range should show extents if possible. However, we somehow figure
    out the end offset of the mapping via 'mapping_end -= cpos' before
    iterating the extent records, which causes problems if the given fiemap
    length is smaller than a cluster size, e.g.:

    Cluster size 4096:
    debugfs.ocfs2 1.6.3
    Block Size Bits: 12 Cluster Size Bits: 12

    The extended fiemap test utility from David:
    https://gist.github.com/anonymous/6172331

    # dd if=/dev/urandom of=/ocfs2/test_file bs=1M count=1000
    # ./fiemap /ocfs2/test_file 4096 10
    start: 4096, length: 10
    File /ocfs2/test_file has 0 extents:
    # Logical Physical Length Flags
    ^^^^^
    Reported-by: David Weber
    Tested-by: David Weber
    Cc: Sunil Mushran
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jie Liu
     
  • commit bbb651e469d99f0088e286fdeb54acca7bb4ad4e upstream.

    If you start the replace procedure on a read-only filesystem, at
    the end the procedure fails to write the updated dev_items to the
    chunk tree. The problem is that this error is not indicated except
    for a WARN_ON(). If the user now thinks that everything was done
    as expected and destroys the source device (with mkfs or with a
    hammer), the next mount fails with "failed to read chunk root" and
    the filesystem is gone.

    This commit adds code to fail the attempt to start the replace
    procedure if the filesystem is mounted read-only (a minimal sketch
    follows this entry).

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason
    Signed-off-by: Greg Kroah-Hartman

    Stefan Behrens
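
    A minimal sketch of the added check at the start of the replace procedure:

        if (fs_info->sb->s_flags & MS_RDONLY)
                return -EROFS;      /* dev_items cannot be committed on a read-only filesystem */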
     
  • commit 5208386c501276df18fee464e21d3c58d2d79517 upstream.

    Merge the conditions in ext4_setattr() handling inode size changes, and
    move the ext4_begin_ordered_truncate() call somewhat earlier because it
    simplifies error recovery in case of failure. Also add error handling in
    case the i_disksize update fails.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 933d4b36576c951d0371bbfed05ec0135d516a6e upstream.

    If a server sends a lease break to a connection that doesn't have any
    opens with the lease key specified in the server response, we can't find
    an open file to send an ack. Fix this by walking through all the
    connections we have.

    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     
  • commit 1a05096de82f3cd672c76389f63964952678506f upstream.

    This happens when we receive a lease break from a server, then find an
    appropriate lease key in opened files and schedule the oplock_break slow
    work. The lw pointer isn't freed in this case.

    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     
  • commit 73e216a8a42c0ef3d08071705c946c38fdbe12b0 upstream.

    Oleksii reported that he had seen an oops similar to this:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
    IP: [] sock_sendmsg+0x93/0xd0
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: ipt_MASQUERADE xt_REDIRECT xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables carl9170 ath usb_storage f2fs nfnetlink_log nfnetlink md4 cifs dns_resolver hid_generic usbhid hid af_packet uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev rfcomm btusb bnep bluetooth qmi_wwan qcserial cdc_wdm usb_wwan usbnet usbserial mii snd_hda_codec_hdmi snd_hda_codec_realtek iwldvm mac80211 coretemp intel_powerclamp kvm_intel kvm iwlwifi snd_hda_intel cfg80211 snd_hda_codec xhci_hcd e1000e ehci_pci snd_hwdep sdhci_pci snd_pcm ehci_hcd microcode psmouse sdhci thinkpad_acpi mmc_core i2c_i801 pcspkr usbcore hwmon snd_timer snd_page_alloc snd ptp rfkill pps_core soundcore evdev usb_common vboxnetflt(O) vboxdrv(O)Oops#2 Part8
    loop tun binfmt_misc fuse msr acpi_call(O) ipv6 autofs4
    CPU: 0 PID: 21612 Comm: kworker/0:1 Tainted: G W O 3.10.1SIGN #28
    Hardware name: LENOVO 2306CTO/2306CTO, BIOS G2ET92WW (2.52 ) 02/22/2013
    Workqueue: cifsiod cifs_echo_request [cifs]
    task: ffff8801e1f416f0 ti: ffff880148744000 task.ti: ffff880148744000
    RIP: 0010:[] [] sock_sendmsg+0x93/0xd0
    RSP: 0000:ffff880148745b00 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff880148745b78 RCX: 0000000000000048
    RDX: ffff880148745c90 RSI: ffff880181864a00 RDI: ffff880148745b78
    RBP: ffff880148745c48 R08: 0000000000000048 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff880181864a00
    R13: ffff880148745c90 R14: 0000000000000048 R15: 0000000000000048
    FS: 0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000088 CR3: 000000020c42c000 CR4: 00000000001407b0
    Oops#2 Part7
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Stack:
    ffff880148745b30 ffffffff810c4af9 0000004848745b30 ffff880181864a00
    ffffffff81ffbc40 0000000000000000 ffff880148745c90 ffffffff810a5aab
    ffff880148745bc0 ffffffff81ffbc40 ffff880148745b60 ffffffff815a9fb8
    Call Trace:
    [] ? finish_task_switch+0x49/0xe0
    [] ? lock_timer_base.isra.36+0x2b/0x50
    [] ? _raw_spin_unlock_irqrestore+0x18/0x40
    [] ? try_to_del_timer_sync+0x4f/0x70
    [] ? _raw_spin_unlock_bh+0x1f/0x30
    [] kernel_sendmsg+0x37/0x50
    [] smb_send_kvec+0xd0/0x1d0 [cifs]
    [] smb_send_rqst+0x83/0x1f0 [cifs]
    [] cifs_call_async+0xec/0x1b0 [cifs]
    [] ? free_rsp_buf+0x40/0x40 [cifs]
    Oops#2 Part6
    [] SMB2_echo+0x8e/0xb0 [cifs]
    [] cifs_echo_request+0x79/0xa0 [cifs]
    [] process_one_work+0x173/0x4a0
    [] worker_thread+0x121/0x3a0
    [] ? manage_workers.isra.27+0x2b0/0x2b0
    [] kthread+0xc0/0xd0
    [] ? kthread_create_on_node+0x120/0x120
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_node+0x120/0x120
    Code: 84 24 b8 00 00 00 4c 89 f1 4c 89 ea 4c 89 e6 48 89 df 4c 89 60 18 48 c7 40 28 00 00 00 00 4c 89 68 30 44 89 70 14 49 8b 44 24 28 90 88 00 00 00 3d ef fd ff ff 74 10 48 8d 65 e0 5b 41 5c 41
    RIP [] sock_sendmsg+0x93/0xd0
    RSP
    CR2: 0000000000000088

    The client was in the middle of trying to send a frame when the
    server->ssocket pointer got zeroed out. In most places where we access
    that pointer, the srv_mutex is held. There's only one spot I see where the
    server->ssocket pointer gets set without the srv_mutex held. This patch
    corrects that (a minimal sketch follows this entry).

    The upstream bug report was here:

    https://bugzilla.kernel.org/show_bug.cgi?id=60557

    Reported-by: Oleksii Shevchuk
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
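
    A minimal sketch of the fix's intent in the reconnect path, assuming the
    socket is (re)established by generic_ip_connect() as in fs/cifs/connect.c
    of that era:

        mutex_lock(&server->srv_mutex);
        rc = generic_ip_connect(server);    /* server->ssocket is now only assigned under srv_mutex */
        mutex_unlock(&server->srv_mutex);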
     

08 Sep, 2013

1 commit

  • commit 44512449c0ab368889dd13ae0031fba74ee7e1d2 upstream.

    NFSv4 reserves readdir cookie values 0-2 for special entries (. and ..),
    but jfs allows a value of 2 for a non-special entry. This incompatibility
    can result in the nfs client reporting a readdir loop.

    This patch doesn't change the value stored internally, but adds one to
    the value exposed to the iterate method.

    Signed-off-by: Dave Kleikamp
    [bwh: Backported to 3.2:
    - Adjust context
    - s/ctx->pos/filp->f_pos/]
    Tested-by: Christian Kujau
    Signed-off-by: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Dave Kleikamp
     

30 Aug, 2013

4 commits

  • commit 35dc248383bbab0a7203fca4d722875bc81ef091 upstream.

    There is a nasty bug in the SCSI SG_IO ioctl that in some circumstances
    leads to one process writing data into the address space of some other
    random unrelated process if the ioctl is interrupted by a signal.
    What happens is the following:

    - A process issues an SG_IO ioctl with direction DXFER_FROM_DEV (ie the
    underlying SCSI command will transfer data from the SCSI device to
    the buffer provided in the ioctl)

    - Before the command finishes, a signal is sent to the process waiting
    in the ioctl. This will end up waking up the sg_ioctl() code:

    result = wait_event_interruptible(sfp->read_wait,
    (srp_done(sfp, srp) || sdp->detached));

    but neither srp_done() nor sdp->detached is true, so we end up just
    setting srp->orphan and returning to userspace:

    srp->orphan = 1;
    write_unlock_irq(&sfp->rq_list_lock);
    return result; /* -ERESTARTSYS because signal hit process */

    At this point the original process is done with the ioctl and
    blithely goes ahead handling the signal, reissuing the ioctl, etc.

    - Eventually, the SCSI command issued by the first ioctl finishes and
    ends up in sg_rq_end_io(). At the end of that function, we run through:

    write_lock_irqsave(&sfp->rq_list_lock, iflags);
    if (unlikely(srp->orphan)) {
    if (sfp->keep_orphan)
    srp->sg_io_owned = 0;
    else
    done = 0;
    }
    srp->done = done;
    write_unlock_irqrestore(&sfp->rq_list_lock, iflags);

    if (likely(done)) {
    /* Now wake up any sg_read() that is waiting for this
    * packet.
    */
    wake_up_interruptible(&sfp->read_wait);
    kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN);
    kref_put(&sfp->f_ref, sg_remove_sfp);
    } else {
    INIT_WORK(&srp->ew.work, sg_rq_end_io_usercontext);
    schedule_work(&srp->ew.work);
    }

    Since srp->orphan *is* set, we set done to 0 (assuming the
    userspace app has not set keep_orphan via an SG_SET_KEEP_ORPHAN
    ioctl), and therefore we end up scheduling sg_rq_end_io_usercontext()
    to run in a workqueue.

    - In workqueue context we go through sg_rq_end_io_usercontext() ->
    sg_finish_rem_req() -> blk_rq_unmap_user() -> ... ->
    bio_uncopy_user() -> __bio_copy_iov() -> copy_to_user().

    The key point here is that we are doing copy_to_user() on a
    workqueue -- that is, we're on a kernel thread with current->mm
    equal to whatever random previous user process was scheduled before
    this kernel thread. So we end up copying whatever data the SCSI
    command returned to the virtual address of the buffer passed into
    the original ioctl, but it's quite likely we do this copying into a
    different address space!

    As suggested by James Bottomley, add a check for current->mm (which is
    NULL if we're on a kernel thread without a real userspace address space)
    in bio_uncopy_user(), and skip the copy if we're on a kernel thread (a
    minimal sketch follows this entry).

    There's no reason that I can think of for any caller of bio_uncopy_user()
    to want to do copying on a kernel thread with a random active userspace
    address space.

    Huge thanks to Costa Sapuntzakis for the
    original pointer to this bug in the sg code.

    Signed-off-by: Roland Dreier
    Tested-by: David Milburn
    Cc: Jens Axboe
    Signed-off-by: James Bottomley
    Signed-off-by: Greg Kroah-Hartman

    Roland Dreier
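
    A minimal sketch of the added guard; the surrounding bio_uncopy_user()
    logic and the copy-back call are elided, and the label name is
    illustrative:

        if (!current->mm) {
                /* kernel thread (e.g. a workqueue): any active mm belongs to an
                 * unrelated process, so do not copy the bounce buffer back */
                goto out_skip_copy;
        }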
     
  • commit 4bf93b50fd04118ac7f33a3c2b8a0a1f9fa80bc9 upstream.

    Fix the issue with improperly counting the number of in-flight bio
    requests for the BIO_EOPNOTSUPP error detection case.

    sb_nbio must be incremented exactly the same number of times as the
    complete() function was called (or will be called), because
    nilfs_segbuf_wait() will call wait_for_completion() the number of times
    set in sb_nbio:

    do {
    wait_for_completion(&segbuf->sb_bio_event);
    } while (--segbuf->sb_nbio > 0);

    Two functions complete() and wait_for_completion() must be called the
    same number of times for the same sb_bio_event. Otherwise,
    wait_for_completion() will hang or leak.

    Signed-off-by: Vyacheslav Dubeyko
    Cc: Dan Carpenter
    Acked-by: Ryusuke Konishi
    Tested-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vyacheslav Dubeyko
     
  • commit 2df37a19c686c2d7c4e9b4ce1505b5141e3e5552 upstream.

    Remove the double call of bio_put() in nilfs_end_bio_write() for the case
    of BIO_EOPNOTSUPP error detection. The issue was found by Dan Carpenter,
    who also suggested the first version of the fix.

    Signed-off-by: Vyacheslav Dubeyko
    Reported-by: Dan Carpenter
    Acked-by: Ryusuke Konishi
    Tested-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vyacheslav Dubeyko
     
  • commit 52e220d357a38cb29fa2e29f34ed94c1d66357f4 upstream.

    This should actually be returning an ERR_PTR on error instead of NULL.
    That is how it was designed, and all the callers expect it (a minimal
    sketch follows this entry).

    [AV: actually, that's what "VFS: Make clone_mnt()/copy_tree()/collect_mounts()
    return errors" missed - originally collect_mounts() was expected to return
    NULL on failure]

    Signed-off-by: Dan Carpenter
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
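
    A minimal sketch of the corrected return path in collect_mounts(), assuming
    the namespace locking helpers and copy_tree() flags of kernels of that era:

        struct mount *tree;

        namespace_lock();
        tree = copy_tree(real_mount(path->mnt), path->dentry,
                         CL_COPY_ALL | CL_PRIVATE);
        namespace_unlock();
        if (IS_ERR(tree))
                return ERR_CAST(tree);  /* was: return NULL, which callers never check for */
        return &tree->mnt;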
     

20 Aug, 2013

2 commits

  • commit 91aa11fae1cf8c2fd67be0609692ea9741cdcc43 upstream.

    When jbd2_journal_dirty_metadata() returns an error,
    __ext4_handle_dirty_metadata() stops the handle. However, callers of this
    function do not account for that fact and still happily use the now-freed
    handle. This use-after-free can result in various issues, but very likely
    we oops soon.

    The motivation for adding __ext4_journal_stop() to
    __ext4_handle_dirty_metadata() in commit 9ea7a0df seems to have been only
    to improve error reporting. So replace __ext4_journal_stop() with
    ext4_journal_abort_handle(), which was there before that commit, and add
    WARN_ON_ONCE() to dump the stack and provide useful information.

    Reported-by: Sage Weil
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 2b047252d087be7f2ba088b4933cd904f92e6fce upstream.

    Ben Tebulin reported:

    "Since v3.7.2 on two independent machines a very specific Git
    repository fails in 9/10 cases on git-fsck due to an SHA1/memory
    failures. This only occurs on a very specific repository and can be
    reproduced stably on two independent laptops. Git mailing list ran
    out of ideas and for me this looks like some very exotic kernel issue"

    and bisected the failure to the backport of commit 53a59fc67f97 ("mm:
    limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

    That commit itself is not actually buggy, but what it does is to make it
    much more likely to hit the partial TLB invalidation case, since it
    introduces a new case in tlb_next_batch() that previously only ever
    happened when running out of memory.

    The real bug is that the TLB gather virtual memory range setup is subtly
    buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather:
    enable tlb flush range in generic mmu_gather"), and the range handling
    was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB
    range flushed when __tlb_remove_page() runs out of slots"), but that fix
    was not complete.

    The problem with the TLB gather virtual address range is that it isn't
    set up by the initial tlb_gather_mmu() initialization (which didn't get
    the TLB range information), but it is set up ad-hoc later by the
    functions that actually flush the TLB. And so any such case that forgot
    to update the TLB range entries would potentially miss TLB invalidates.

    Rather than try to figure out exactly which particular ad-hoc range
    setup was missing (I personally suspect it's the hugetlb case in
    zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
    did), this patch just gets rid of the problem at the source: make the
    TLB range information available to tlb_gather_mmu(), and initialize it
    when initializing all the other tlb gather fields (a minimal sketch of the
    changed interface follows this entry).

    This makes the patch larger, but conceptually much simpler. And the end
    result is much more understandable; even if you want to play games with
    partial ranges when invalidating the TLB contents in chunks, now the
    range information is always there, and anybody who doesn't want to
    bother with it won't introduce subtle bugs.

    Ben verified that this fixes his problem.

    Reported-bisected-and-tested-by: Ben Tebulin
    Build-testing-by: Stephen Rothwell
    Build-testing-by: Richard Weinberger
    Reviewed-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
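
    A minimal sketch of the changed interface for one caller, assuming the
    common unmap path of mm/memory.c in kernels of that era:

        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, mm, start, end);   /* the VA range is supplied at init time now, */
        unmap_vmas(&tlb, vma, start, end);      /* so a forgotten ad-hoc range update can no   */
        tlb_finish_mmu(&tlb, start, end);       /* longer shrink what finally gets flushed     */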