09 Jan, 2009
3 commits
-
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
ext4: Remove "extents" mount option
block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
ext4: Make printk's consistently prefixed with "EXT4-fs: "
ext4: Add sanity checks for the superblock before mounting the filesystem
ext4: Add mount option to set kjournald's I/O priority
jbd2: Submit writes to the journal using WRITE_SYNC
jbd2: Add pid and journal device name to the "kjournald2 starting" message
ext4: Add markers for better debuggability
ext4: Remove code to create the journal inode
ext4: provide function to release metadata pages under memory pressure
ext3: provide function to release metadata pages under memory pressure
add releasepage hooks to block devices which can be used by file systems
ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
ext4: Init the complete page while building buddy cache
ext4: Don't allow new groups to be added during block allocation
ext4: mark the blocks/inode bitmap beyond end of group as used
ext4: Use new buffer_head flag to check uninit group bitmaps initialization
ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
ext4: code cleanup
...
-
While reviewing ocfs2 code, I found 2 occurrences of the typo "successfull". After
running grep "successfull " on the kernel tree, 22 such typos were found in total -- great
minds always think alike :)
This patch fixes all the similar typos. Thanks for Randy's ack and comments.
Signed-off-by: Coly Li
Acked-by: Randy Dunlap
Acked-by: Roland Dreier
Cc: Jeremy Kerr
Cc: Jeff Garzik
Cc: Heiko Carstens
Cc: Martin Schwidefsky
Cc: Theodore Ts'o
Cc: Mark Fasheh
Cc: Vlad Yasevich
Cc: Sridhar Samudrala
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use the new generic implementation.
Signed-off-by: Wu Fengguang
Cc: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Jan, 2009
4 commits
-
For NR_CPUS >= 16 values, FBC_BATCH is 2*NR_CPUS.
Considering more and more distros are using high NR_CPUS values, it makes
sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS.
A sensible value is 2*num_online_cpus(), with a minimum value of 32 (this
minimum value helps branch prediction in __percpu_counter_add()).
We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically.
We rename FBC_BATCH to percpu_counter_batch since it's not a constant
anymore.
Signed-off-by: Eric Dumazet
Acked-by: David S. Miller
Acked-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
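The sizing rule from the percpu_counter_batch commit above can be sketched in plain C. This is a userspace illustration under stated assumptions, not the kernel patch: `demo_compute_batch()` is a hypothetical name, and the online CPU count is passed in as a parameter instead of being read from num_online_cpus().

```c
#include <assert.h>

/* Hypothetical helper (not a kernel function): batch size as described
 * in the commit message, 2 * (online CPUs) with a floor of 32. */
int demo_compute_batch(int online_cpus)
{
	int batch;

	assert(online_cpus > 0);
	batch = 2 * online_cpus;
	return batch < 32 ? 32 : batch;
}
```

In the commit's design the value would be recomputed from the hotcpu notifier whenever a CPU comes online or goes offline; the sketch only covers the arithmetic.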
This mount option is largely superfluous, and in fact the way it was
implemented was buggy; if a filesystem which did not have the extents
feature flag was mounted -o extents, the filesystem would attempt to
create and use extents-based files even though the extents feature flag
was not enabled. The simplest thing to do is to nuke the mount option
entirely. It's not all that useful to force the non-creation of new
extent-based files if the filesystem can support it.
Signed-off-by: "Theodore Ts'o"
-
This avoids insane superblock configurations that could lead to kernel
oops due to null pointer dereferences.
http://bugzilla.kernel.org/show_bug.cgi?id=12371
Thanks to David Maciejak at Fortinet's FortiGuard Global Security
Research Team who discovered this bug independently (but at
approximately the same time) as Thiemo Nagel, who submitted the patch.
Signed-off-by: Thiemo Nagel
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
This code has been obsolete for quite some time, since the supported
method for adding a journal inode is to use tune2fs (or to create a
new filesystem with a journal via mke2fs or mkfs.ext4).
Signed-off-by: "Theodore Ts'o"
06 Jan, 2009
17 commits
-
Previously, some were "ext4: ", and some were "EXT4: "; change them to
be consistent with most ext4 printk's, which is to use "EXT4-fs: ".
Signed-off-by: "Theodore Ts'o"
-
Signed-off-by: "Theodore Ts'o"
Cc: Jens Axboe
-
Signed-off-by: Jan Kara
Signed-off-by: Mark Fasheh
-
Signed-off-by: Jan Kara
Signed-off-by: Mark Fasheh
-
Pages in the page cache belonging to ext4 data files are released via
the ext4_releasepage() function specified in the ext4 inode's
address_space_ops. However, metadata blocks (such as indirect blocks,
directory blocks, etc.) are managed via the block device
address_space_ops, and they cannot be released by
try_to_free_buffers() if they have a journal head attached to them.
To address this, we supply a release_metadata function which calls
the jbd2_journal_try_to_free_buffers() function to free the metadata, and
which is called by the block device's blkdev_releasepage() function.
Signed-off-by: Toshiyuki Okajima
Signed-off-by: "Theodore Ts'o"
Cc: linux-fsdevel@vger.kernel.org
-
With the nodelalloc option we need to update the dirty block counter on
block allocation failure. This is needed because we increment the
dirty block counter early in the block allocation phase. Without
the patch s_dirty_blocks_counter goes wrong, so that the filesystem's
free block count decreases incorrectly.
Tested-by: Akira Fujita
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
We need to init the complete page during buddy cache init
by setting the contents to '1'. Otherwise we can see the
following errors after doing an online resize of the
filesystem:
EXT4-fs error (device sdb1): ext4_mb_mark_diskspace_used:
Allocating block 1040385 in system zone of 127 group
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
After we mark the blocks in the buddy cache as allocated,
we need to ensure that we don't reinit the buddy cache until
the block bitmap is updated. This commit achieves this by holding
the group_info alloc_semaphore till ext4_mb_release_context.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
We need to mark the block/inode bitmap beyond the end of the group
with '1'.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
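The idea of marking the bitmap tail as used, from the commit above, can be illustrated with a small userspace sketch. The function name `demo_mark_bitmap_tail()` and the bit layout (bit i stored in byte i / 8 at position i % 8) are assumptions made for this example; the sketch is not the ext4 implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: set every bit in [start_bit, end_bit) to 1, so that
 * an allocator scanning for zero bits never picks a bit past the last
 * valid block or inode of the group. */
void demo_mark_bitmap_tail(unsigned char *bitmap, size_t start_bit, size_t end_bit)
{
	assert(bitmap != NULL);
	for (size_t i = start_bit; i < end_bit; i++)
		bitmap[i / 8] |= (unsigned char)(1u << (i % 8));
}
```

For example, a 16-bit bitmap describing a group with only 10 valid entries would have bits 10 through 15 set, leaving the first byte untouched.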
For uninit block group, the on-disk bitmap is not initialized. That
implies we cannot depend on the uptodate flag on the bitmap
buffer_head to find bitmap validity. Use a new buffer_head flag which
would be set after we properly initialize the bitmap. This also
prevents (re-)initializing the uninit group bitmap every time we call
ext4_read_block_bitmap().
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
We need to make sure we update the inode bitmap and clear
EXT4_BG_INODE_UNINIT flag with sb_bgl_lock held, since
ext4_read_inode_bitmap() looks at EXT4_BG_INODE_UNINIT to decide
whether to initialize the inode bitmap each time it is called.
(introduced by commit c806e68f.)
ext4_read_inode_bitmap does:
	spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
		ext4_init_inode_bitmap(sb, bh, block_group, desc);
and ext4_new_inode does
	if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, group),
			ino, inode_bitmap_bh->b_data))
	......
	...
	spin_lock(sb_bgl_lock(sbi, group));
	gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
i.e., on allocation we update the bitmap, then we take the sb_bgl_lock
and clear the EXT4_BG_INODE_UNINIT flag. What can happen is that a
parallel ext4_read_inode_bitmap can zero out the bitmap in between
the above ext4_set_bit_atomic and spin_lock(sb_bgl_lock..)
The race results in the user-visible errors below:
EXT4-fs error (device sdb1): ext4_free_inode: bit already cleared for inode 168449
EXT4-fs warning (device sdb1): ext4_unlink: Deleting nonexistent file ...
EXT4-fs warning (device sdb1): ext4_rmdir: empty directory has too many links ...
# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71
ls: /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71: Stale NFS file handle
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
Rename the lower bits with suffix _lo and add helper
to access the values. Also rename bg_itable_unused_hi
to bg_pad as in e2fsprogs.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
-
We need to make sure we update the block bitmap and clear
EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held, since
ext4_read_block_bitmap() looks at EXT4_BG_BLOCK_UNINIT to decide
whether to initialize the block bitmap each time it is called
(introduced by commit c806e68f), and this can race with block
allocations in ext4_mb_mark_diskspace_used().
ext4_read_block_bitmap does:
	spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
	if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
		ext4_init_block_bitmap(sb, bh, block_group, desc);
Now on the block allocation side we do
	mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
			ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
	....
	spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
i.e., on allocation we update the bitmap, then we take the sb_bgl_lock
and clear the EXT4_BG_BLOCK_UNINIT flag. What can happen is that a
parallel ext4_read_block_bitmap can zero out the bitmap in between
the above mb_set_bits and spin_lock(sb_bgl_lock..)
The race results in the user-visible errors below:
EXT4-fs error (device sdb1): ext4_mb_release_inode_pa: free 100, pa_free 105
EXT4-fs error (device sdb1): mb_free_blocks: double-free of inode 0's block ..
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
The mballoc code likes to call ext4_error while it is holding locked
block groups. This can cause a scheduling-in-atomic-context BUG. We
can't just unlock the block group and relock it after/if ext4_error
returns since that might result in race conditions in the case where
the filesystem is set to continue after finding errors.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
-
When we generate the buddy cache (especially during resize) we need to
make sure we don't use the blocks freed but not yet committed. This
makes sure we have the right value of free blocks count in the group
info and also in the bitmap. This also ensures the ordered mode
consistency.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
The new groups added during resize are flagged as
need_init groups. Make sure we properly initialize these
groups. When we have block size < page size and we are adding
new groups, the page may still be marked uptodate even though
we haven't initialized the group. While forcing the init
of the buddy cache we need to make sure other groups that are part of the
same buddy cache page are not using the cache.
group_info->alloc_sem is added to ensure this.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
-
With this change new blocks added during resize
are marked as free in the block bitmap and the
group is flagged with the EXT4_GROUP_INFO_NEED_INIT_BIT
flag. This makes sure that when mballoc tries to allocate
blocks from the new group we reload the
buddy information using the bitmap present on disk.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
Cc: stable@kernel.org
05 Jan, 2009
2 commits
-
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in its write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (e.g. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin
Reviewed-by: KOSAKI Motohiro
Cc: [2.6.28.x]
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds
-
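How a single "no filesystem recursion" flag can drive the allocation mask, as the AOP_FLAG_NOFS commit above describes, can be sketched in userspace C. The constants `DEMO_GFP_FS` and `DEMO_AOP_FLAG_NOFS` and the function `demo_write_begin_gfp()` are hypothetical stand-ins with made-up bit values; they are not the kernel's definitions.

```c
#include <assert.h>

/* Illustrative bit values for this sketch only; the kernel's actual
 * definitions of the corresponding constants are not reproduced here. */
#define DEMO_GFP_FS        0x80u  /* allocation may call into the filesystem */
#define DEMO_AOP_FLAG_NOFS 0x02u  /* caller must not recurse into the filesystem */

/* Derive the mask for a page cache allocation made on behalf of
 * write_begin: when the NOFS flag is passed, strip the "may enter the
 * filesystem" bit and leave every other bit alone. */
unsigned int demo_write_begin_gfp(unsigned int mapping_gfp, unsigned int aop_flags)
{
	unsigned int gfp = mapping_gfp;

	if (aop_flags & DEMO_AOP_FLAG_NOFS)
		gfp &= ~DEMO_GFP_FS;
	return gfp;
}
```

Passing the flags through untouched, as the note from Linus in the commit suggests, keeps the decision in one place: callers only state their context, and the allocation site decides which bits to drop.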
As suggested by Andreas Dilger, introduce a bgl_lock_ptr() helper in
and add separate sb_bgl_lock() helpers to
filesystem specific header files to break the hidden dependency to
struct ext[234]_sb_info.
Also, while at it, convert the macros to static inlines to try make up
for all the times I broke Andrew Morton's tree.
Acked-by: Andreas Dilger
Signed-off-by: Pekka Enberg
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
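The split described in the bgl_lock_ptr() commit above, a generic helper plus a thin per-filesystem wrapper, might look roughly like the following userspace sketch. All `demo_` names, the table size of 128, and the mask-based slot selection are assumptions for illustration; `struct demo_lock` merely stands in for a spinlock.

```c
#include <assert.h>

#define DEMO_NR_BG_LOCKS 128  /* assumed power-of-two table size */

struct demo_lock { int held; };  /* stand-in for a spinlock */

struct demo_blockgroup_lock {
	struct demo_lock locks[DEMO_NR_BG_LOCKS];
};

/* Generic helper: maps a block group number to a lock slot and knows
 * nothing about any filesystem's private superblock structure. */
static inline struct demo_lock *
demo_bgl_lock_ptr(struct demo_blockgroup_lock *bgl, unsigned int block_group)
{
	return &bgl->locks[block_group & (DEMO_NR_BG_LOCKS - 1)];
}

/* Hypothetical filesystem-private superblock info. */
struct demo_sb_info {
	struct demo_blockgroup_lock blockgroup_lock;
};

/* Filesystem-specific wrapper, meant to live in that filesystem's own
 * header so the generic header needs no knowledge of demo_sb_info. */
static inline struct demo_lock *
demo_sb_bgl_lock(struct demo_sb_info *sbi, unsigned int block_group)
{
	return demo_bgl_lock_ptr(&sbi->blockgroup_lock, block_group);
}
```

Using static inline functions instead of macros gives the compiler real argument types to check, which is the conversion the commit message mentions.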
04 Jan, 2009
2 commits
-
Signed-off-by: "Theodore Ts'o"
-
Rename some variables. We also unlock locks in the reverse order we
acquired as a part of cleanup.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
01 Jan, 2009
2 commits
-
Signed-off-by: Al Viro
-
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Cc: Andrew Morton
Cc: Theodore Ts'o
Cc: adilger@sun.com
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Duane Griffin
Signed-off-by: Al Viro
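A defensive termination step of the kind the fast symlink commit above describes can be sketched as follows. `demo_terminate_link()` is a hypothetical userspace helper, and the interpretation of its parameters (a length read from possibly corrupted on-disk data, clamped to the buffer capacity) is an assumption of this example rather than the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write a NUL at min(len, maxlen), where 'len' is
 * an untrusted length and 'maxlen' is the largest safe index. The buffer
 * must hold at least maxlen + 1 bytes. */
void demo_terminate_link(char *buf, size_t len, size_t maxlen)
{
	assert(buf != NULL);
	buf[len < maxlen ? len : maxlen] = '\0';
}
```

Clamping the index means a bogus length can truncate the target string but can never cause a write outside the buffer or leave the string unterminated.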
29 Dec, 2008
1 commit
-
We have two separate config entries for large devices/files. One
is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
that handles large files. This doesn't make a lot of sense; you typically
want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
to indicate that it covers both.
Acked-by: Jean Delvare
Signed-off-by: Jens Axboe
25 Dec, 2008
1 commit
11 Dec, 2008
2 commits
-
Revert
commit e8ced39d5e8911c662d4d69a342b9d053eaaac4e
Author: Mingming Cao
Date: Fri Jul 11 19:27:31 2008 -0400
percpu_counter: new function percpu_counter_sum_and_set
As described in
revert "percpu counter: clean up percpu_counter_sum_and_set()"
the new percpu_counter_sum_and_set() is racy against updates to the
cpu-local accumulators on other CPUs. Revert that change.
This means that ext4 will be slow again. But correct.
Reported-by: Eric Dumazet
Cc: "David S. Miller"
Cc: Peter Zijlstra
Cc: Mingming Cao
Cc:
Cc: [2.6.27.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Revert
commit 1f7c14c62ce63805f9574664a6c6de3633d4a354
Author: Mingming Cao
Date: Thu Oct 9 12:50:59 2008 -0400
percpu counter: clean up percpu_counter_sum_and_set()
Before this patch we had the following:
percpu_counter_sum(): return the percpu_counter's value
percpu_counter_sum_and_set(): return the percpu_counter's value, copying
that value into the central value and zeroing the per-cpu counters before
returning.
After this patch, percpu_counter_sum_and_set() has gone, and
percpu_counter_sum() gets the old percpu_counter_sum_and_set()
functionality.
Problem is, as Eric points out, the old percpu_counter_sum_and_set()
functionality was racy and wrong. It zeroes out counters on "other" cpus,
without holding any locks which will prevent races against updates from
those other CPUs.
This patch reverts 1f7c14c62ce63805f9574664a6c6de3633d4a354. This means
that percpu_counter_sum_and_set() still has the race, but
percpu_counter_sum() does not.
Note that this is not a simple revert - ext4 has since started using
percpu_counter_sum() for its dirty_blocks counter as well.
Note that this revert patch changes percpu_counter_sum() semantics.
Before the patch, a call to percpu_counter_sum() will bring the counter's
central counter mostly up-to-date, so a following percpu_counter_read()
will return a close value.
After this patch, a call to percpu_counter_sum() will leave the counter's
central accumulator unaltered, so a subsequent call to
percpu_counter_read() can now return a significantly inaccurate result.
If there is any code in the tree which was introduced after
e8ced39d5e8911c662d4d69a342b9d053eaaac4e was merged, and which depends
upon the new percpu_counter_sum() semantics, that code will break.
Reported-by: Eric Dumazet
Cc: "David S. Miller"
Cc: Peter Zijlstra
Cc: Mingming Cao
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
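The semantic difference the revert above describes, a sum that reads the per-CPU deltas without folding them into the central value, can be shown with a single-threaded userspace model. The `demo_` types and functions are hypothetical, the CPU count is fixed at 4 for illustration, and all locking is deliberately omitted, so this is a sketch of the semantics only.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4  /* illustrative CPU count */

struct demo_percpu_counter {
	long count;                 /* central, approximate value */
	long percpu[DEMO_NR_CPUS];  /* per-CPU deltas not yet folded in */
};

/* Cheap read: returns only the central value, which may lag behind. */
long demo_counter_read(const struct demo_percpu_counter *c)
{
	return c->count;
}

/* Accurate sum with the post-revert semantics: adds up the per-CPU
 * deltas but leaves them, and the central value, unmodified. */
long demo_counter_sum(const struct demo_percpu_counter *c)
{
	long sum = c->count;

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		sum += c->percpu[cpu];
	return sum;
}
```

After a call to `demo_counter_sum()`, `demo_counter_read()` still returns the stale central value, which mirrors the warning in the commit message that a subsequent read can be significantly inaccurate.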
26 Nov, 2008
1 commit
-
Move some of the forward declaration of the static functions
to mballoc.c where they are used. This enables us to include
mballoc.h in other .c files. Also correct the buddy cache
documentation.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
24 Nov, 2008
1 commit
-
In ext4_mb_init_group(), if the filesystem block size is less than
PAGE_SIZE/2, the code tries to grab alloc_sem for multiple block
groups in a loop. We need to allow for this by using
down_write_nested() and passing in the loop index as a lock subclass
number. This works because no other code path needs to take multiple
alloc_sem's. Note that lockdep will fail for a filesystem blocksize
smaller than PAGE_SIZE/16 (e.g., a 1k filesystem blocksize with
a 32k page size, or a 2k filesystem blocksize with a 64k page size,
etc.).
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"
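The block size limit mentioned in the commit above follows from simple arithmetic, sketched here in userspace C. Two assumptions are built into the sketch and are not stated in the commit message itself: that each block group occupies two blocks of a buddy cache page, and that lockdep allows 8 subclasses per lock class. The `demo_` names are hypothetical.

```c
#include <assert.h>

#define DEMO_MAX_SUBCLASSES 8  /* assumed lockdep subclass limit */

/* Number of block groups sharing one buddy cache page, assuming each
 * group takes two blocks of the cache (bitmap plus buddy data). */
unsigned int demo_groups_per_page(unsigned int page_size, unsigned int block_size)
{
	unsigned int groups;

	assert(block_size > 0 && block_size <= page_size);
	groups = (page_size / block_size) / 2;
	return groups ? groups : 1;
}

/* Nonzero if the loop index can serve as a valid lock subclass for
 * every group that shares the page. */
int demo_subclasses_fit(unsigned int page_size, unsigned int block_size)
{
	return demo_groups_per_page(page_size, block_size) <= DEMO_MAX_SUBCLASSES;
}
```

Under these assumptions a 4k block size on a 64k page yields 8 groups and still fits, while the two examples from the commit message (1k on 32k, 2k on 64k) each yield 16 groups and do not.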
14 Nov, 2008
2 commits
-
Conflicts:
security/keys/internal.h
security/keys/process_keys.c
security/keys/request_key.c
Fixed conflicts above by using the non 'tsk' versions.
Signed-off-by: James Morris
-
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells
Reviewed-by: James Morris
Acked-by: Serge Hallyn
Cc: Stephen Tweedie
Cc: Andrew Morton
Cc: adilger@sun.com
Cc: linux-ext4@vger.kernel.org
Signed-off-by: James Morris
07 Nov, 2008
2 commits
-
When initializing an uninitialized block group in ext4_new_inode(),
its block group checksum must be re-calculated. This fixes a race
when several threads try to allocate a new inode in an UNINIT'd group.
There is some question whether we need to be initializing the block
bitmap in ext4_new_inode() at all, but for now, if we are going to
init the block group, let's eliminate the race.
Signed-off-by: Frederic Bohe
Signed-off-by: "Theodore Ts'o"
-
We need to make sure we mark the buffer_heads as dirty and uptodate
so that block_write_full_page writes them correctly.
This fixes mmap corruptions that can occur in low memory situations.
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: "Theodore Ts'o"