Doug / smarc-fsl-linux-kernel | Embedian Git Server

18 Dec, 2007

1 commit

b47b6f38e ext3, ext4: avoid divide by zero

As it turns out, the kernel divides by EXT3_INODES_PER_GROUP(s) when
mounting an ext3 filesystem. If that number is zero, a crash follows.
A patch follows.
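
A minimal userspace sketch of the kind of guard this describes, assuming the fix amounts to rejecting a zero inodes-per-group value before dividing by it. The function name and the round-up calculation are hypothetical and are not the kernel's mount code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of block groups needed to hold all
 * inodes, or -1 when the on-disk inodes_per_group value (untrusted
 * input) would make the division trap.
 */
int64_t groups_for_inodes(uint32_t inodes_count, uint32_t inodes_per_group)
{
    if (inodes_per_group == 0)
        return -1;  /* refuse the mount instead of dividing by zero */

    /* Widen before adding so the round-up cannot overflow 32 bits. */
    return (int64_t)(((uint64_t)inodes_count + inodes_per_group - 1)
                     / inodes_per_group);
}
```

A caller would treat the negative return as "corrupt superblock" and fail the mount.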

This crash was reported by Joeri de Ruiter, Carst Tankink and Pim Vullers.

Acked-by: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andries E. Brouwer
2007-12-18 11:28:16 +0800

15 Nov, 2007

2 commits

7c06a8dc6 Fix 64KB blocksize in ext3 directories

With 64KB blocksize, a directory entry can have size 64KB which does not
fit into 16 bits we have for entry length. So we store 0xffff instead and
convert value when read from / written to disk. The patch also converts
some places to use ext3_next_entry() when we are changing them anyway.
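
The conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: the helper and macro names are hypothetical, byte-order conversion is left out, and it relies on 0xffff never being a legitimate entry length (directory entry lengths are, as far as I know, multiples of 4).

```c
#include <assert.h>
#include <stdint.h>

#define REC_LEN_64K     (1u << 16)  /* in-memory length of a 64KB entry */
#define REC_LEN_ON_DISK 0xffffu     /* 16-bit stand-in stored on disk */

/* Encode an in-memory entry length into the 16-bit on-disk field. */
uint16_t rec_len_to_disk(uint32_t len)
{
    assert(len <= REC_LEN_64K);
    return len == REC_LEN_64K ? REC_LEN_ON_DISK : (uint16_t)len;
}

/* Decode the on-disk field back into the in-memory length. */
uint32_t rec_len_from_disk(uint16_t dlen)
{
    return dlen == REC_LEN_ON_DISK ? REC_LEN_64K : dlen;
}
```

Every reader and writer of the field has to go through the pair, which is why the conversion happens at the read/write boundary.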

[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2007-11-15 10:45:43 +0800

e47776a0a Forbid user to change file flags on quota files

Forbid user from changing file flags on quota files. User has no business
playing with these flags when quota is on. Furthermore there is a
remote possibility of deadlock due to a lock inversion between quota file's
i_mutex and transaction's start (i_mutex for quota file is locked only when
transaction is started in quota operations) in ext3 and ext4.

Signed-off-by: Jan Kara
Cc: LIOU Payphone
Acked-by: Dave Kleikamp
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2007-11-15 10:45:38 +0800

14 Nov, 2007

1 commit

0b832a4b9 Revert "ext2/ext3/ext4: add block bitmap validation"

This reverts commit 7c9e69faa28027913ee059c285a5ea8382e24b5d, fixing up
conflicts in fs/ext4/balloc.c manually.

The cost of doing the bitmap validation on each lookup - even when the
bitmap is cached - is absolutely prohibitive. We could, and probably
should, do it only when adding the bitmap to the buffer cache. However,
right now we are better off just reverting it.

Peter Zijlstra measured the cost of this extra validation as an 85%
decrease in cached iozone, and while I had a patch that took it down to
just 17% by not being _quite_ so stupid in the validation, it was still
a big slowdown that could have been avoided by just doing it right.

Cc: Peter Zijlstra
Cc: Andrew Morton
Cc: Aneesh Kumar
Cc: Andreas Dilger
Cc: Mingming Cao
Signed-off-by: Linus Torvalds

Linus Torvalds
2007-11-14 00:09:11 +0800

22 Oct, 2007

2 commits

396551644 exportfs: make struct export_operations const

Now that nfsd has stopped writing to the find_exported_dentry member we can
mark the export_operations const.

Signed-off-by: Christoph Hellwig
Cc: Neil Brown
Cc: "J. Bruce Fields"
Cc: Dave Kleikamp
Cc: Anton Altaparmakov
Cc: David Chinner
Cc: Timothy Shimmin
Cc: OGAWA Hirofumi
Cc: Hugh Dickins
Cc: Chris Mason
Cc: Jeff Mahoney
Cc: "Vladimir V. Saveliev"
Cc: Steven Whitehouse
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2007-10-22 23:13:21 +0800

74af0baad ext3: new export ops

Trivial switch over to the new generic helpers.

Signed-off-by: Christoph Hellwig
Cc: Neil Brown
Cc: "J. Bruce Fields"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2007-10-22 23:13:19 +0800

20 Oct, 2007

2 commits

9ad163ae0 JBD: Fix JBD warnings when compiling with CONFIG_JBD_DEBUG

Note from Mingming's JBD2 fix:

Noticed that all the warnings occur when the debug level is 0. Then found the
"jbd2: Move jbd2-debug file to debugfs" patch
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0f49d5d019afa4e94253bfc92f0daca3badb990b

changed the jbd2_journal_enable_debug from int type to u8, which makes the
jbd_debug comparison always true when the debugging level is 0. Thus
the compile warning occurs.

Thought about changing the jbd2_journal_enable_debug data type back to int,
but can't, because the jbd2-debug is moved to debug fs, where calling
debugfs_create_u8() to create the debugfs entry needs the value to be u8
type.

Even if we changed the data type back to int, the code is still buggy,
kernel should not print jbd2 debug message if the jbd2_journal_enable_debug
is set to 0. But this is not the case.

The fix is to change the level of debugging to 1. The same should be fixed
in ext3/JBD, but currently ext3 jbd-debug via /proc fs is broken, so we
probably should fix it all together.
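
A small userspace sketch of why level 0 is the problem, assuming the debug macro boils down to a "level <= knob" test against an unsigned 8-bit knob. The function name is hypothetical; the knob parameter stands in for jbd2_journal_enable_debug.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mirrors the assumed "level <= knob" test. With an unsigned knob,
 * level 0 can never be filtered out, so level-0 messages print even
 * when debugging is nominally disabled (knob == 0).
 */
int debug_would_print(int level, uint8_t knob)
{
    assert(level >= 0);
    return level <= knob;
}
```

Moving the call sites from level 0 to level 1 means a knob value of 0 suppresses them, and the compiler no longer sees a comparison that is always true.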

Signed-off-by: Jose R. Santos
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jose R. Santos
2007-10-20 02:53:35 +0800

8c3478a52 JBD/ext3 cleanups: convert to kzalloc

Convert kmalloc to kzalloc() and get rid of the memset().
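
The same cleanup can be illustrated in userspace, with malloc()+memset() standing in for kmalloc()+memset() and calloc() standing in for kzalloc(). This is only an analogy; the struct and function names below are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure, used only to have something to allocate. */
struct journal_like {
    int   flags;
    void *data;
};

/* Before: allocate, then clear by hand. */
struct journal_like *alloc_with_memset(void)
{
    struct journal_like *j = malloc(sizeof(*j));

    if (!j)
        return NULL;
    memset(j, 0, sizeof(*j));
    return j;
}

/* After: a single call that returns zeroed memory. */
struct journal_like *alloc_with_calloc(void)
{
    return calloc(1, sizeof(struct journal_like));
}
```

Both variants hand back an all-zero object; the second has no window in which the memory is allocated but not yet cleared.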

Signed-off-by: Mingming Cao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mingming Cao
2007-10-20 02:53:34 +0800

19 Oct, 2007

3 commits

c80544dc0 sparse pointer use of zero as null

Get rid of sparse related warnings from places that use integer as NULL
pointer.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Hemminger
Cc: Andi Kleen
Cc: Jeff Garzik
Cc: Matt Mackall
Cc: Ian Kent
Cc: Arnd Bergmann
Cc: Davide Libenzi
Cc: Stephen Smalley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stephen Hemminger
2007-10-19 05:37:31 +0800

42a2b6ad7 ext3: fix setup_new_group_blocks locking

setup_new_group_blocks() manipulates the group descriptor block bh under
the block_bitmap bh's lock. It shouldn't matter since nobody but resize
should be touching these blocks, but it's worth fixing up.

Signed-off-by: Eric Sandeen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Sandeen
2007-10-19 05:37:29 +0800

0f0a89ebe ext3: support large blocksize up to PAGESIZE

This patch set supports large block size(>4k,
Signed-off-by: Mingming Cao
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Takashi Sato
2007-10-19 05:37:29 +0800

17 Oct, 2007

14 commits

1ad6ecf91 ext3: lighten up resize transaction requirements

When resizing online, setup_new_group_blocks attempts to reserve a
potentially very large transaction, depending on the current filesystem
geometry. For some journal sizes, there may not be enough room for this
transaction, and the online resize will fail.

The patch below resizes & restarts the transaction as necessary while
setting up the new group, and should work with even the smallest journal.

Tested with something like:

[root@newbox ~]# dd if=/dev/zero of=fsfile bs=1024 count=32768
[root@newbox ~]# mkfs.ext3 -b 1024 fsfile 16384
[root@newbox ~]# mount -o loop fsfile mnt/
[root@newbox ~]# resize2fs /dev/loop0
resize2fs 1.40.2 (12-Jul-2007)
Filesystem at /dev/loop0 is mounted on /root/mnt; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/loop0 to 32768 (1k) blocks.
resize2fs: No space left on device While trying to add group #2
[root@newbox ~]# dmesg | tail -n 1
JBD: resize2fs wants too many credits (258 > 256)
[root@newbox ~]#

With the below change, it works.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Eric Sandeen
Acked-by: Andreas Dilger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Sandeen
2007-10-17 23:43:01 +0800

059590f49 ext3: remove #ifdef CONFIG_EXT3_INDEX

CONFIG_EXT3_INDEX is not an exposed config option in the kernel, and it is
unconditionally defined in ext3_fs.h. tune2fs is already able to turn off
dir indexing, so at this point it's just cluttering up the code. Remove
it.

Signed-off-by: Eric Sandeen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Sandeen
2007-10-17 23:43:01 +0800

2b47c3611 Fix f_version type: should be u64 instead of unsigned long

Fix f_version type: should be u64 instead of long

There is a type inconsistency between struct inode i_version and struct file
f_version.

fs.h:

struct inode
u64 i_version;

and

struct file
unsigned long f_version;

Users do:

fs/ext3/dir.c:

if (filp->f_version != inode->i_version) {

So why isn't f_version a u64? It becomes a problem if versions get
higher than 2^32 and we are on an architecture where longs are 32 bits.

This patch changes the f_version type to u64, and updates the users accordingly.

It applies to 2.6.23-rc2-mm2.
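
The mismatch can be reproduced in userspace by using uint32_t to stand in for a 32-bit unsigned long. The function names are hypothetical; the sketch only shows the truncating store followed by the f_version != i_version style comparison.

```c
#include <stdint.h>

/* 32-bit f_version: the store truncates, so the later compare can fail. */
int cached_copy_matches_32(uint64_t i_version)
{
    uint32_t f_version = (uint32_t)i_version;  /* high bits are lost here */

    return f_version == i_version;
}

/* 64-bit f_version: the copy always compares equal to its source. */
int cached_copy_matches_64(uint64_t i_version)
{
    uint64_t f_version = i_version;

    return f_version == i_version;
}
```

Once the version passes 2^32, the 32-bit copy never matches, so a caller doing this check would treat its cached state as stale on every call.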

Signed-off-by: Mathieu Desnoyers
Cc: Martin Bligh
Cc: "Randy.Dunlap"
Cc: Al Viro
Cc: Mark Fasheh
Cc: Christoph Hellwig
Cc: "J. Bruce Fields"
Cc: Trond Myklebust
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mathieu Desnoyers
2007-10-17 23:42:53 +0800

7c9e69faa ext2/ext3/ext4: add block bitmap validation

When a new block bitmap is read from disk in read_block_bitmap() there are
a few bits that should ALWAYS be set. In particular, the blocks given by
ext4_blk_bitmap, ext4_inode_bitmap and ext4_inode_table. Validate the
block bitmap against these blocks.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: Andreas Dilger
Acked-by: Mingming Cao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aneesh Kumar K.V
2007-10-17 23:42:52 +0800

ef2fb6798 remove unused bh in calls to ext234_get_group_desc

ext[234]_get_group_desc never tests the bh argument, and only sets it if it
is passed in; it is perfectly happy with a NULL bh argument. But, many
callers send one in and never use it. May as well call with NULL like
other callers who don't use the bh.

Signed-off-by: Eric Sandeen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Sandeen
2007-10-17 23:42:49 +0800

571beed8d ext3: show all mount options

Signed-off-by: Miklos Szeredi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2007-10-17 23:42:48 +0800

febfcf911 fs: mark nibblemap const

Signed-off-by: Philippe De Muyter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Philippe De Muyter
2007-10-17 23:42:47 +0800

4ba9b9d0b Slab API: remove useless ctor parameter and reorder parameters

Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.

Convert

ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

ctor(struct kmem_cache *s, void *object)

throughout the kernel

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2007-10-17 23:42:45 +0800

833f4077b lib: percpu_counter_init error handling

alloc_percpu can fail, propagate that error.

Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2007-10-17 23:42:44 +0800

52d9f3b40 lib: percpu_counter_sum_positive

s/percpu_counter_sum/&_positive/

Because it's consistent with percpu_counter_read*

Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2007-10-17 23:42:44 +0800

3cb4f9fa0 lib: percpu_counter_sub

Hugh spotted that some code does:
percpu_counter_add(&counter, -unsignedlong)

which, when the amount argument is of type s32, sort-of works thanks to
two's-complement. However when we'd change the type to s64 this breaks on 32bit
machines, because the promotion rules zero extend the unsigned number.

Provide percpu_counter_sub() to hide the s64 cast. That is:
percpu_counter_sub(&counter, foo)
is equal to:
percpu_counter_add(&counter, -(s64)foo);
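
The promotion problem can be shown in userspace, with uint32_t standing in for a 32-bit unsigned long and a plain int64_t standing in for the counter. The function names are hypothetical and this is not the percpu_counter API; it assumes a platform where int is 32 bits wide.

```c
#include <stdint.h>

/* Add a signed 64-bit amount to a counter value. */
int64_t counter_add(int64_t counter, int64_t amount)
{
    return counter + amount;
}

/* Buggy pattern: negate while still unsigned, then widen to s64. */
int64_t sub_by_negating_unsigned(int64_t counter, uint32_t amount)
{
    /* -amount wraps to 2^32 - amount and is then zero-extended. */
    return counter_add(counter, -amount);
}

/* Pattern from the message: widen to signed 64 bits, then negate. */
int64_t counter_sub(int64_t counter, uint32_t amount)
{
    return counter_add(counter, -(int64_t)amount);
}
```

With a starting value of 10 and an amount of 3, the first variant adds 4294967293 while the second subtracts 3 as intended.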

Signed-off-by: Peter Zijlstra
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2007-10-17 23:42:44 +0800

aa0dff2d0 lib: percpu_counter_add

s/percpu_counter_mod/percpu_counter_add/

Because it's a better name; _mod implies modulo.

Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2007-10-17 23:42:44 +0800

f4fc66a89 ext3: convert to new aops

Various fixes and improvements

Signed-off-by: Badari Pulavarty
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800

f4e6b498d readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

Combine the file_ra_state members
unsigned long prev_index
unsigned int prev_offset
into
loff_t prev_pos

It is more consistent and better supports huge files.
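
One way to picture the combined field, assuming 4KB pages and that prev_pos is simply the byte position formed from the page index and the in-page offset. The names and the page shift are assumptions made for this sketch, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumed 4KB pages */
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Pack a page index and an in-page offset into one 64-bit position. */
int64_t make_prev_pos(uint32_t index, uint32_t offset)
{
    assert(offset < SKETCH_PAGE_SIZE);
    /* Widen before shifting; a 32-bit shift would drop the high bits. */
    return ((int64_t)index << SKETCH_PAGE_SHIFT) | offset;
}

/* Recover the page index from the packed position. */
uint32_t prev_pos_index(int64_t pos)
{
    return (uint32_t)(pos >> SKETCH_PAGE_SHIFT);
}

/* Recover the in-page offset from the packed position. */
uint32_t prev_pos_offset(int64_t pos)
{
    return (uint32_t)(pos & (SKETCH_PAGE_SIZE - 1));
}
```

The cast before the shift matters on 32-bit builds, where shifting the index in its native width would overflow for files past 4GB.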

Thanks to Peter for the nice proposal!

[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra
Signed-off-by: Fengguang Wu
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fengguang Wu
2007-10-17 00:42:52 +0800

20 Sep, 2007

2 commits

ef2b02d3e ext34: ensure do_split leaves enough free space in both blocks

The do_split() function for htree dir blocks is intended to split a leaf
block to make room for a new entry. It sorts the entries in the original
block by hash value, then moves the last half of the entries to the new
block - without accounting for how much space this actually moves. (IOW,
it moves half of the entry *count* not half of the entry *space*). If by
chance we have both large & small entries, and we move only the smallest
entries, and we have a large new entry to insert, we may not have created
enough space for it.

The patch below stores each record size when calculating the dx_map, and
then walks the hash-sorted dx_map, calculating how many entries must be
moved to more evenly split the existing entries between the old block and
the new block, guaranteeing enough space for the new entry.

The dx_map "offs" member is reduced to u16 so that the overall map size
does not change - it is temporarily stored at the end of the new block, and
if it grows too large it may be overwritten. By making offs and size both
u16, we won't grow the map size.
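
The size-aware walk can be sketched as below. It is a simplified illustration, assuming the entries are already sorted by hash and that only their sizes matter; the function name and the exact stopping rule are assumptions and may differ from the patch.

```c
#include <stdint.h>

/*
 * sizes[] holds the byte size of each entry, already sorted by hash.
 * Walk from the high-hash end and count how many entries to move so
 * that roughly half of the block's space, rather than half of the
 * entry count, ends up in the new block.
 */
int entries_to_move(const uint16_t *sizes, int count, unsigned blocksize)
{
    unsigned moved_bytes = 0;
    int move = 0;

    for (int i = count - 1; i >= 0; i--) {
        /* Stop once more than half of this entry would cross the midpoint. */
        if (moved_bytes + sizes[i] / 2 > blocksize / 2)
            break;
        moved_bytes += sizes[i];
        move++;
    }
    return move;
}
```

For entry sizes {200, 200, 200, 8, 8, 8} in a 1024-byte block, a count-based split moves the last 3 entries (24 bytes) and leaves 600 bytes behind, while this walk moves 5 entries (424 bytes).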

Also add a few comments to the functions involved.

This fixes the testcase reported by hooanon05@yahoo.co.jp on the
linux-ext4 list, "ext3 dir_index causes an error"

Thanks to Andreas Dilger for discussing the problem & solution with me.

Signed-off-by: Eric Sandeen
Signed-off-by: Andreas Dilger
Tested-by: Junjiro Okajima
Cc: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Sandeen
2007-09-20 02:24:18 +0800

3d82abae9 dir_index: error out instead of BUG on corrupt dx dirs

Convert asserts (BUGs) in dx_probe from bad on-disk data to recoverable
errors with helpful warnings. With help catching other asserts from Duane
Griffin.

Signed-off-by: Eric Sandeen
Acked-by: Duane Griffin
Acked-by: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Sandeen
2007-09-20 02:24:18 +0800

12 Sep, 2007

1 commit

9c3013e9b quota: fix infinite loop

If we fail to start a transaction when releasing dquot, we have to call
dquot_release() anyway to mark dquot structure as inactive. Otherwise we
end in an infinite loop inside dqput().

Signed-off-by: Jan Kara
Cc: xb
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2007-09-12 08:21:19 +0800

27 Jul, 2007

1 commit

780dcdb21 fix inode_table test in ext234_check_descriptors

ext[234]_check_descriptors sanity checks block group descriptor geometry at
mount time, testing whether the block bitmap, inode bitmap, and inode table
reside wholly within the blockgroup. However, the inode table test is off
by one so that if the last block in the inode table resides on the last
block of the block group, the test incorrectly fails. This is because it
tests the last block as (start + length) rather than (start + length - 1).
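
The off-by-one can be sketched as two range checks. The function names and parameters are hypothetical; first and last are the group's first and last block numbers, start is the first inode table block, and len is the table length in blocks (assumed to be at least 1).

```c
#include <stdint.h>

/* Pre-fix test: compares the block after the table against the group end. */
int table_in_group_old(uint32_t first, uint32_t last,
                       uint32_t start, uint32_t len)
{
    return start >= first && start + len <= last;
}

/* Fixed test: the table's last block is start + len - 1. */
int table_in_group_fixed(uint32_t first, uint32_t last,
                         uint32_t start, uint32_t len)
{
    return start >= first && start + len - 1 <= last;
}
```

For a group spanning blocks 1 to 256 with a 7-block table starting at block 250, the table ends exactly on block 256: the old test rejects it and the fixed test accepts it.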

This can be seen by trying to mount a filesystem made such as:

mkfs.ext2 -F -b 1024 -m 0 -g 256 -N 3744 fsfile 1024

which yields:

EXT2-fs error (device loop0): ext2_check_descriptors: Inode table for group 0 not in group (block 101)!
EXT2-fs: group descriptors corrupted!

There is a similar bug in e2fsprogs, patch already sent for that.

(I wonder if inside(), outside(), and/or in_range() should someday be
used in this and other tests throughout the ext filesystems...)

Signed-off-by: Eric Sandeen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Sandeen
2007-07-27 02:35:17 +0800

20 Jul, 2007

3 commits

20c2df83d mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt

Paul Mundt
2007-07-20 09:11:58 +0800

cf914a7d6 readahead: split ondemand readahead interface into two functions

Split ondemand readahead interface into two functions. I think this makes it
a little clearer for non-readahead experts (like Rusty).

Internally they both call ondemand_readahead(), but the page argument is
changed to an obvious boolean flag.

Signed-off-by: Rusty Russell
Signed-off-by: Fengguang Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Rusty Russell
2007-07-20 01:04:44 +0800

dc7868fcb readahead: convert ext3/ext4 invocations

Convert ext3/ext4 dir reads to use on-demand readahead.

Readahead for dirs operates _not_ on file level, but on blockdev level. This
makes a difference when the data blocks are not continuous. And the read
routine is somehow opaque: there's no handy info about the status of current
page. So a simplified call scheme is employed: to call into readahead
whenever the current page falls out of readahead windows.

Signed-off-by: Fengguang Wu
Cc: Steven Pratt
Cc: Ram Pai
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fengguang Wu
2007-07-20 01:04:44 +0800

18 Jul, 2007

2 commits

3bd858ab1 Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid check

Introduce is_owner_or_cap() macro in fs.h, and convert over relevant
users to it. This is done because we want to avoid bugs in the future
where we check for only effective fsuid of the current task against a
file's owning uid, without simultaneously checking for CAP_FOWNER as
well, thus violating its semantics.
[ XFS uses special macros and structures, and in general looked ...
untouchable, so we leave it alone -- but it has been looked over. ]

The (current->fsuid != inode->i_uid) check in generic_permission() and
exec_permission_lite() is left alone, because those operations are
covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH. Similarly operations
falling under the purview of CAP_CHOWN and CAP_LEASE are also left alone.

Signed-off-by: Satyam Sharma
Cc: Al Viro
Acked-by: Serge E. Hallyn
Signed-off-by: Linus Torvalds

Satyam Sharma
2007-07-18 03:00:03 +0800

a56942551 knfsd: exportfs: add exportfs.h header

Currently the export_operation structure and helpers related to it are in
fs.h. fs.h is already far too large and there are very few places needing the
export bits, so split them off into a separate header.

[akpm@linux-foundation.org: fix cifs build]
Signed-off-by: Christoph Hellwig
Signed-off-by: Neil Brown
Cc: Steven French
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2007-07-18 01:23:06 +0800

17 Jul, 2007

6 commits

a71ce8c6c ext3: statfs speed up

This is a patch that speeds up statfs. It is very simple - the "overhead"
calculation, which takes a huge amount of time for large filesystems, never
changes unless the size of the filesystem itself changes. That means we can
store it in memory and only recalculate if the filesystem has been resized
(almost never).
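
The caching idea can be sketched as follows. The struct, field names, and the placeholder overhead formula are all hypothetical; only the "recompute when the block count changes" logic reflects the description above.

```c
#include <stdint.h>

struct fs_state {
    uint64_t blocks_count;   /* current filesystem size in blocks */
    uint64_t overhead_last;  /* cached overhead */
    uint64_t blocks_last;    /* size the cached value was computed for */
    unsigned recomputes;     /* instrumentation for this sketch only */
};

/* Placeholder for the expensive per-group walk; the formula is made up. */
static uint64_t compute_overhead(const struct fs_state *fs)
{
    return fs->blocks_count / 64 + 1;
}

/* Return the overhead, recomputing only after the filesystem was resized. */
uint64_t statfs_overhead(struct fs_state *fs)
{
    if (fs->blocks_last != fs->blocks_count) {
        fs->overhead_last = compute_overhead(fs);
        fs->blocks_last = fs->blocks_count;
        fs->recomputes++;
    }
    return fs->overhead_last;
}
```

Repeated calls on an unchanged filesystem hit the cached value; changing blocks_count, as a resize would, triggers exactly one recomputation.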

It also fixes a minor problem that we never update the on-disk superblock free
blocks/inodes counts until the filesystem is unmounted. While not fatal, we
may as well update that on disk when we have the information, and it makes
things like debugfs and dumpe2fs report a bit more accurate info.

Signed-off-by: Badari Pulavarty
Signed-off-by: Andreas Dilger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Badari Pulavarty
2007-07-17 00:05:52 +0800

952d9de11 ext3: fix error handling in ext3_create_journal()

Fix error handling in ext3_create_journal according to kernel conventions.

Signed-off-by: Borislav Petkov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Borislav Petkov
2007-07-17 00:05:51 +0800

3fc74269c is_power_of_2: ext3/super.c

Replace (n & (n-1)) in the context of power of 2 checks with is_power_of_2()
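
For reference, a userspace sketch of the two forms. is_pow2() here is a hypothetical stand-in that, to my understanding, mirrors what the kernel's is_power_of_2() helper does; the bare bit test is shown next to it to highlight how the two treat zero.

```c
#include <stdbool.h>

/* Open-coded test: true for powers of two, but also for n == 0. */
bool naive_pow2_test(unsigned long n)
{
    return (n & (n - 1)) == 0;
}

/* Helper-style test: additionally rejects zero. */
bool is_pow2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```

Besides being easier to read at the call site, the helper form closes the zero case that an open-coded check has to remember to handle separately.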

Signed-off-by: vignesh babu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

vignesh babu
2007-07-17 00:05:48 +0800

e3a68e30d ext3: remove extra IS_RDONLY() check

ext3_change_inode_journal_flag() is only called from one location:
ext3_ioctl(EXT3_IOC_SETFLAGS). That ioctl case already has an IS_RDONLY()
call in it so this one is superfluous.

Signed-off-by: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Hansen
2007-07-17 00:05:48 +0800

030703e49 ext3: fix deadlock in ext3_remount() and orphan list handling

ext3_orphan_add() and ext3_orphan_del() functions lock sb->s_lock with a
transaction started, while ext3_mark_recovery_complete() waits for a
transaction while holding sb->s_lock, thus leading to a possible deadlock. At the moment we
call ext3_mark_recovery_complete() from ext3_remount() we have done all the
work needed for remounting and thus we are safe to drop sb->s_lock before we
wait for transactions to commit. Note that at this moment we are still
guarded by s_umount lock against other remounts/umounts.

Signed-off-by: Jan Kara
Cc: Eric Sandeen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2007-07-17 00:05:47 +0800

a6c15c2b0 ext3/ext4: orphan list corruption due bad inode

After ext3 orphan list check has been added into ext3_destroy_inode()
(please see my previous patch) the following situation has been detected:

EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0
Inode 00000101a15b7840: orphan list check failed!
00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f
...
Call Trace: [] ext3_destroy_inode+0x79/0x90
[] sys_unlink+0x126/0x1a0
[] error_exit+0x0/0x81
[] system_call+0x7e/0x83

The first message says that the unlinked inode has i_nlink=0; ext3_unlink()
then adds this inode to the orphan list.

The second message means that this inode has not been removed from the orphan
list. The inode dump showed that i_fop = &bad_file_ops, which can be set in
make_bad_inode() only. Then I found that ext3_read_inode() can call
make_bad_inode() without any error/warning messages, for example in the
following case:

...
    if (inode->i_nlink == 0) {
        if (inode->i_mode == 0 ||
            !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
            /* this inode is deleted */
            brelse (bh);
            goto bad_inode;
...

A bad inode can live for some time, and ext3_unlink() can add it to the orphan
list, but ext3_delete_inode() does not delete this inode from the orphan list.
As a result we can have orphan list corruption detected in ext3_destroy_inode().

However it is not clear to me how to fix this issue correctly.

As far as I can see, is_bad_inode() is called after iget() in all places
except ext3_lookup() and ext3_get_parent(). I believe it makes sense to
add a bad inode check to these functions too and call iput() if a bad inode
is detected.

Signed-off-by: Vasily Averin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vasily Averin
2007-07-17 00:05:46 +0800