09 Jan, 2009

3 commits

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
    jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
    ext4: Remove "extents" mount option
    block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
    ext4: Make printk's consistently prefixed with "EXT4-fs: "
    ext4: Add sanity checks for the superblock before mounting the filesystem
    ext4: Add mount option to set kjournald's I/O priority
    jbd2: Submit writes to the journal using WRITE_SYNC
    jbd2: Add pid and journal device name to the "kjournald2 starting" message
    ext4: Add markers for better debuggability
    ext4: Remove code to create the journal inode
    ext4: provide function to release metadata pages under memory pressure
    ext3: provide function to release metadata pages under memory pressure
    add releasepage hooks to block devices which can be used by file systems
    ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
    ext4: Init the complete page while building buddy cache
    ext4: Don't allow new groups to be added during block allocation
    ext4: mark the blocks/inode bitmap beyond end of group as used
    ext4: Use new buffer_head flag to check uninit group bitmaps initialization
    ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
    ext4: code cleanup
    ...

    Linus Torvalds
     
  • While reviewing ocfs2 code, I found 2 instances of the typo "successfull".
    A grep for "successfull " across the kernel tree turned up 22 in total --
    great minds always think alike :)

    This patch fixes all of them. Thanks to Randy for his ack and comments.

    Signed-off-by: Coly Li
    Acked-by: Randy Dunlap
    Acked-by: Roland Dreier
    Cc: Jeremy Kerr
    Cc: Jeff Garzik
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Theodore Ts'o
    Cc: Mark Fasheh
    Cc: Vlad Yasevich
    Cc: Sridhar Samudrala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Coly Li
     
  • Use the new generic implementation.

    Signed-off-by: Wu Fengguang
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

07 Jan, 2009

4 commits

  • For NR_CPUS >= 16 values, FBC_BATCH is 2*NR_CPUS

    Considering more and more distros are using high NR_CPUS values, it makes
    sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS.

    A sensible value is 2*num_online_cpus(), with a minimum value of 32 (This
    minimum value helps branch prediction in __percpu_counter_add())

    We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically.

    We rename FBC_BATCH to percpu_counter_batch since it's not a constant
    anymore.

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
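
The sizing rule described in this commit message can be sketched in userspace C. This is only an illustration of the arithmetic (2 * online CPUs, floor of 32), not the kernel code: compute_batch() and on_cpu_hotplug() are hypothetical names, and the online CPU count is passed in as a plain argument rather than read from num_online_cpus().

```c
#include <assert.h>

/* Userspace stand-in for the kernel variable named in the commit message. */
static int percpu_counter_batch = 32;

/* Batch size rule from the message: 2 * online CPUs, never below 32. */
static int compute_batch(int nr_online_cpus)
{
    assert(nr_online_cpus > 0);
    int batch = 2 * nr_online_cpus;
    return batch < 32 ? 32 : batch;
}

/* Hypothetical hook: in this sketch it is what a hotplug notifier would call. */
static void on_cpu_hotplug(int nr_online_cpus)
{
    percpu_counter_batch = compute_batch(nr_online_cpus);
}
```

With 4 or 16 CPUs online the batch stays at the floor of 32; with 64 CPUs it grows to 128.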
     
  • This mount option is largely superfluous, and in fact the way it was
    implemented was buggy; if a filesystem which did not have the extents
    feature flag was mounted -o extents, the filesystem would attempt to
    create and use extents-based files even though the extents feature flag
    was not enabled. The simplest thing to do is to nuke the mount option
    entirely. It's not all that useful to force the non-creation of new
    extent-based files if the filesystem can support it.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • This avoids insane superblock configurations that could lead to a kernel
    oops due to null pointer dereferences.

    http://bugzilla.kernel.org/show_bug.cgi?id=12371

    Thanks to David Maciejak at Fortinet's FortiGuard Global Security
    Research Team who discovered this bug independently (but at
    approximately the same time) as Thiemo Nagel, who submitted the patch.

    Signed-off-by: Thiemo Nagel
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Theodore Ts'o
     
  • This code has been obsolete for quite some time, since the supported
    method for adding a journal inode is to use tune2fs (or to create a
    new filesystem with a journal via mke2fs or mkfs.ext4).

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

06 Jan, 2009

17 commits

  • Previously, some were "ext4: ", and some were "EXT4: "; change them to
    be consistent with most ext4 printk's, which is to use "EXT4-fs: ".

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • Signed-off-by: "Theodore Ts'o"
    Cc: Jens Axboe

    Theodore Ts'o
     
  • Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh

    Jan Kara
     
  • Pages in the page cache belonging to ext4 data files are released via
    the ext4_releasepage() function specified in the ext4 inode's
    address_space_ops. However, metadata blocks (such as indirect blocks,
    directory blocks, etc) are managed via the block device
    address_space_ops, and they can not be released by
    try_to_free_buffers() if they have a journal head attached to them.

    To address this, we supply a release_metadata function which calls
    jbd2_journal_try_to_free_buffers() function to free the metadata, and
    which is called by the block device's blkdev_releasepage() function.

    Signed-off-by: Toshiyuki Okajima
    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org

    Toshiyuki Okajima
     
  • With the nodelalloc option, we need to update the dirty block counter on
    block allocation failure. This is needed because we increment the
    dirty block counter early in the block allocation phase. Without
    the patch, s_dirty_blocks_counter becomes wrong, so the filesystem's
    free block count decreases incorrectly.

    Tested-by: Akira Fujita
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
     
  • We need to init the complete page during buddy cache init
    by setting the contents to '1'. Otherwise we can see the
    following errors after doing an online resize of the
    filesystem:

    EXT4-fs error (device sdb1): ext4_mb_mark_diskspace_used:
    Allocating block 1040385 in system zone of 127 group

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
     
  • After we mark the blocks in the buddy cache as allocated,
    we need to ensure that we don't reinit the buddy cache until
    the block bitmap is updated. This commit achieves this by holding
    the group_info alloc_semaphore till ext4_mb_release_context

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
     
  • We need to mark the block/inode bitmap beyond the end of the group
    with '1'.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
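
A minimal sketch of what "marking the bitmap beyond the end of the group with '1'" means, assuming a byte array with bit 0 of each byte as the lowest-numbered bit. The helper names mark_bitmap_tail() and bitmap_test_bit() are hypothetical and are not the ext4 functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Set every bit in [start_bit, end_bit) so that the allocator treats the
 * padding past the last valid block/inode of the group as "in use".
 */
static void mark_bitmap_tail(unsigned char *bitmap, size_t start_bit,
                             size_t end_bit)
{
    assert(start_bit <= end_bit);
    for (size_t i = start_bit; i < end_bit; i++)
        bitmap[i / 8] |= (unsigned char)(1u << (i % 8));
}

/* Return 1 if the given bit is set, 0 otherwise. */
static int bitmap_test_bit(const unsigned char *bitmap, size_t bit)
{
    return (bitmap[bit / 8] >> (bit % 8)) & 1;
}
```

For a group with 20 valid entries stored in a 32-bit bitmap, calling mark_bitmap_tail(bitmap, 20, 32) leaves bits 0..19 untouched and sets bits 20..31.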
     
  • For an uninit block group, the on-disk bitmap is not initialized. That
    implies we cannot depend on the uptodate flag on the bitmap
    buffer_head to find bitmap validity. Use a new buffer_head flag which
    would be set after we properly initialize the bitmap. This also
    prevents (re-)initializing the uninit group bitmap every time we call
    ext4_read_block_bitmap().

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
     
  • We need to make sure we update the inode bitmap and clear
    EXT4_BG_INODE_UNINIT flag with sb_bgl_lock held, since
    ext4_read_inode_bitmap() looks at EXT4_BG_INODE_UNINIT to decide
    whether to initialize the inode bitmap each time it is called.
    (introduced by commit c806e68f.)

    ext4_read_inode_bitmap does:

        spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
        if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
                ext4_init_inode_bitmap(sb, bh, block_group, desc);

    and ext4_new_inode does:

        if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, group),
                                 ino, inode_bitmap_bh->b_data))
        ......
        ...
        spin_lock(sb_bgl_lock(sbi, group));

        gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);

    i.e., on allocation we update the bitmap, then take the sb_bgl_lock
    and clear the EXT4_BG_INODE_UNINIT flag. A parallel
    ext4_read_inode_bitmap can therefore zero out the bitmap between
    the ext4_set_bit_atomic and the spin_lock(sb_bgl_lock..) above.

    The race results in the following user-visible errors:
    EXT4-fs error (device sdb1): ext4_free_inode: bit already cleared for inode 168449
    EXT4-fs warning (device sdb1): ext4_unlink: Deleting nonexistent file ...
    EXT4-fs warning (device sdb1): ext4_rmdir: empty directory has too many links ...
    # ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71
    ls: /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71: Stale NFS file handle

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
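
The ordering that this commit message asks for (bitmap update and flag clear inside one critical section) can be modelled in a single-threaded userspace sketch. Everything here is a stand-in: struct group, the toy integer lock, and BG_INODE_UNINIT are hypothetical, and the real code uses the sb_bgl_lock() spinlock.

```c
#include <assert.h>
#include <string.h>

#define BG_INODE_UNINIT 0x1u  /* stand-in for EXT4_BG_INODE_UNINIT */

struct group {
    int locked;               /* toy lock; the kernel uses a spinlock */
    unsigned flags;
    unsigned char bitmap[8];
};

static void group_lock(struct group *g)   { assert(!g->locked); g->locked = 1; }
static void group_unlock(struct group *g) { assert(g->locked);  g->locked = 0; }

/* Reader: zero the bitmap only while the group is still flagged UNINIT. */
static void read_inode_bitmap(struct group *g)
{
    group_lock(g);
    if (g->flags & BG_INODE_UNINIT)
        memset(g->bitmap, 0, sizeof(g->bitmap));
    group_unlock(g);
}

/*
 * Allocator: set the bit and clear UNINIT under the same lock, so a reader
 * can never see "bit set, flag still UNINIT" and wipe the allocation.
 * Returns 1 if the inode was free and is now allocated, 0 otherwise.
 */
static int alloc_inode(struct group *g, unsigned ino)
{
    int was_free;

    assert(ino < 8 * sizeof(g->bitmap));
    group_lock(g);
    was_free = !((g->bitmap[ino / 8] >> (ino % 8)) & 1);
    if (was_free) {
        g->bitmap[ino / 8] |= (unsigned char)(1u << (ino % 8));
        g->flags &= ~BG_INODE_UNINIT;
    }
    group_unlock(g);
    return was_free;
}
```

Because the flag is already clear by the time the lock is dropped, a later read_inode_bitmap() leaves the allocated bit in place.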
     
  • Rename the lower bits with suffix _lo and add helper
    to access the values. Also rename bg_itable_unused_hi
    to bg_pad as in e2fsprogs.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
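
A rough sketch of the "_lo suffix plus accessor helper" pattern, under stated assumptions: the struct, its field names, and the helper names below are hypothetical, the fields are assumed to be 16-bit halves, and byte-order conversion is ignored entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical trimmed-down descriptor with one value split in two halves. */
struct sk_group_desc {
    uint16_t free_count_lo;   /* low 16 bits, always present */
    uint16_t free_count_hi;   /* high 16 bits, only valid in "wide" layouts */
};

/* Read the combined value; callers never touch the _lo/_hi fields directly. */
static uint32_t sk_free_count(const struct sk_group_desc *gd, int wide)
{
    assert(gd);
    uint32_t v = gd->free_count_lo;
    if (wide)
        v |= (uint32_t)gd->free_count_hi << 16;
    return v;
}

/* Store a value, splitting it across the two halves when the layout is wide. */
static void sk_free_count_set(struct sk_group_desc *gd, uint32_t n, int wide)
{
    assert(gd);
    gd->free_count_lo = (uint16_t)n;
    if (wide)
        gd->free_count_hi = (uint16_t)(n >> 16);
}
```

Routing every access through helpers like these keeps the split representation in one place, so callers do not need to know whether the high half is in use.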
     
  • We need to make sure we update the block bitmap and clear
    EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held, since
    ext4_read_block_bitmap() looks at EXT4_BG_BLOCK_UNINIT to decide
    whether to initialize the block bitmap each time it is called
    (introduced by commit c806e68f), and this can race with block
    allocations in ext4_mb_mark_diskspace_used().

    ext4_read_block_bitmap does:

        spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
        if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
                ext4_init_block_bitmap(sb, bh, block_group, desc);

    Now on the block allocation side we do:

        mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
                    ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
        ....
        spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
        if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
                gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);

    i.e., on allocation we update the bitmap, then take the sb_bgl_lock
    and clear the EXT4_BG_BLOCK_UNINIT flag. A parallel
    ext4_read_block_bitmap can therefore zero out the bitmap between
    the mb_set_bits and the spin_lock(sb_bgl_lock..) above.

    The race results in the following user-visible errors:
    EXT4-fs error (device sdb1): ext4_mb_release_inode_pa: free 100, pa_free 105
    EXT4-fs error (device sdb1): mb_free_blocks: double-free of inode 0's block ..

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
     
  • The mballoc code likes to call ext4_error while it is holding locked
    block groups. This can cause a "scheduling in atomic context" BUG. We
    can't just unlock the block group and relock it after/if ext4_error
    returns since that might result in race conditions in the case where
    the filesystem is set to continue after finding errors.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • When we generate the buddy cache (especially during resize) we need to
    make sure we don't use blocks that were freed but not yet committed.
    This makes sure we have the right free block count in the group
    info and also in the bitmap. This also ensures ordered mode
    consistency.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
     
  • The new groups added during resize are flagged as
    need_init groups. Make sure we properly initialize these
    groups. When the block size is smaller than the page size and we
    are adding new groups, the page may still be marked uptodate even
    though we haven't initialized the group. While forcing the init
    of the buddy cache, we need to make sure that other groups sharing
    the same buddy cache page are not using the cache.
    group_info->alloc_sem is added to ensure this.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
     
  • With this change, new blocks added during resize
    are marked as free in the block bitmap and the
    group is flagged with the EXT4_GROUP_INFO_NEED_INIT_BIT
    flag. This makes sure that when mballoc tries to allocate
    blocks from the new group, we reload the
    buddy information using the bitmap present on disk.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Aneesh Kumar K.V
     

05 Jan, 2009

2 commits

  • With the write_begin/write_end aops, page_symlink was broken because it
    could no longer pass a GFP_NOFS type mask into the point where the
    allocations happened. They are done in write_begin, which would always
    assume that the filesystem can be entered from reclaim. This bug could
    cause filesystem deadlocks.

    The funny thing with having a gfp_t mask there is that it doesn't really
    allow the caller to arbitrarily tinker with the context in which it can be
    called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
    take the page lock. The only thing any callers care about is __GFP_FS
    anyway, so turn that into a single flag.

    Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
    this flag in their write_begin function. Change __grab_cache_page to
    accept a nofs argument as well, to honour that flag (while we're there,
    change the name to grab_cache_page_write_begin which is more instructive
    and does away with random leading underscores).

    This is really a more flexible way to go in the end anyway -- if a
    filesystem happens to want any extra allocations aside from the pagecache
    ones in its write_begin function, it may now use GFP_KERNEL (rather than
    GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
    random example).

    [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
    [kosaki.motohiro@jp.fujitsu.com: fix fuse]
    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: [2.6.28.x]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    [ Cleaned up the calling convention: just pass in the AOP flags
    untouched to the grab_cache_page_write_begin() function. That
    just simplifies everybody, and may even allow future expansion of the
    logic. - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
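
How a write_begin path might act on the new flag can be sketched as a small mask computation. This is a userspace model under loud assumptions: the SK_* constants are made-up bit values standing in for __GFP_IO, __GFP_FS, GFP_KERNEL and AOP_FLAG_NOFS, and write_begin_gfp() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Made-up bit values; the real kernel constants differ. */
#define SK_GFP_IO        0x1u
#define SK_GFP_FS        0x2u
#define SK_GFP_KERNEL    (SK_GFP_IO | SK_GFP_FS)
#define SK_AOP_FLAG_NOFS 0x4u

/*
 * Derive the allocation mask for the pagecache allocation in write_begin:
 * start from the mapping's mask and drop the "may enter the filesystem"
 * bit when the caller passed the NOFS flag.
 */
static unsigned write_begin_gfp(unsigned mapping_gfp, unsigned aop_flags)
{
    /* This sketch only models the two bits defined above. */
    assert((mapping_gfp & ~SK_GFP_KERNEL) == 0);

    unsigned gfp = mapping_gfp;
    if (aop_flags & SK_AOP_FLAG_NOFS)
        gfp &= ~SK_GFP_FS;
    return gfp;
}
```

A caller such as page_symlink would pass the NOFS flag, while ordinary writes pass no flag and keep the full mask.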
     
  • As suggested by Andreas Dilger, introduce a bgl_lock_ptr() helper in
    and add separate sb_bgl_lock() helpers to
    filesystem specific header files to break the hidden dependency to
    struct ext[234]_sb_info.

    Also, while at it, convert the macros to static inlines to try to make up
    for all the times I broke Andrew Morton's tree.

    Acked-by: Andreas Dilger
    Signed-off-by: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
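
The layering described here (one generic helper plus a thin per-filesystem wrapper, both as static inlines) could look roughly like the sketch below. It is a userspace model: the lock is a plain int, the table size and the masking scheme are assumptions, and struct toyfs_sb_info and toyfs_bgl_lock() are hypothetical names.

```c
#include <assert.h>

#define NR_BG_LOCKS 128       /* assumed table size; must be a power of two */

struct bgl_lock { int lock; };                  /* toy stand-in for a spinlock */
struct blockgroup_lock { struct bgl_lock locks[NR_BG_LOCKS]; };

/* Generic helper: knows nothing about any filesystem's superblock info. */
static inline int *bgl_lock_ptr(struct blockgroup_lock *bgl,
                                unsigned int block_group)
{
    assert(bgl);
    return &bgl->locks[block_group & (NR_BG_LOCKS - 1)].lock;
}

/* Hypothetical filesystem-private superblock info. */
struct toyfs_sb_info { struct blockgroup_lock blockgroup_lock; };

/* Per-filesystem wrapper, living in that filesystem's own header. */
static inline int *toyfs_bgl_lock(struct toyfs_sb_info *sbi,
                                  unsigned int block_group)
{
    return bgl_lock_ptr(&sbi->blockgroup_lock, block_group);
}
```

Since the generic helper takes the lock table directly, the shared header no longer needs to know the layout of any filesystem's sb_info structure.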
     

04 Jan, 2009

2 commits


01 Jan, 2009

2 commits


29 Dec, 2008

1 commit

  • We have two separate config entries for large devices/files. One
    is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
    that handles large files. This doesn't make a lot of sense; you typically
    want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
    to indicate that it covers both.

    Acked-by: Jean Delvare
    Signed-off-by: Jens Axboe

    Jens Axboe
     

25 Dec, 2008

1 commit


11 Dec, 2008

2 commits

  • Revert

    commit e8ced39d5e8911c662d4d69a342b9d053eaaac4e
    Author: Mingming Cao
    Date: Fri Jul 11 19:27:31 2008 -0400

    percpu_counter: new function percpu_counter_sum_and_set

    As described in

    revert "percpu counter: clean up percpu_counter_sum_and_set()"

    the new percpu_counter_sum_and_set() is racy against updates to the
    cpu-local accumulators on other CPUs. Revert that change.

    This means that ext4 will be slow again. But correct.

    Reported-by: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Peter Zijlstra
    Cc: Mingming Cao
    Cc:
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert

    commit 1f7c14c62ce63805f9574664a6c6de3633d4a354
    Author: Mingming Cao
    Date: Thu Oct 9 12:50:59 2008 -0400

    percpu counter: clean up percpu_counter_sum_and_set()

    Before this patch we had the following:

    percpu_counter_sum(): return the percpu_counter's value

    percpu_counter_sum_and_set(): return the percpu_counter's value, copying
    that value into the central value and zeroing the per-cpu counters before
    returning.

    After this patch, percpu_counter_sum_and_set() has gone, and
    percpu_counter_sum() gets the old percpu_counter_sum_and_set()
    functionality.

    Problem is, as Eric points out, the old percpu_counter_sum_and_set()
    functionality was racy and wrong. It zeroes out counters on "other" cpus,
    without holding any locks which will prevent races against updates from
    those other CPUs.

    This patch reverts 1f7c14c62ce63805f9574664a6c6de3633d4a354. This means
    that percpu_counter_sum_and_set() still has the race, but
    percpu_counter_sum() does not.

    Note that this is not a simple revert - ext4 has since started using
    percpu_counter_sum() for its dirty_blocks counter as well.

    Note that this revert patch changes percpu_counter_sum() semantics.

    Before the patch, a call to percpu_counter_sum() will bring the counter's
    central counter mostly up-to-date, so a following percpu_counter_read()
    will return a close value.

    After this patch, a call to percpu_counter_sum() will leave the counter's
    central accumulator unaltered, so a subsequent call to
    percpu_counter_read() can now return a significantly inaccurate result.

    If there is any code in the tree which was introduced after
    e8ced39d5e8911c662d4d69a342b9d053eaaac4e was merged, and which depends
    upon the new percpu_counter_sum() semantics, that code will break.

    Reported-by: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Peter Zijlstra
    Cc: Mingming Cao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
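
The semantics after the revert can be illustrated with a toy counter: the sum walks the per-CPU deltas without modifying anything, so a later cheap read still sees only the central value. This is a single-threaded sketch with no locking, and struct pcpu_counter, counter_sum() and counter_read() are hypothetical names rather than the kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SLOTS 4            /* stand-in for the number of CPUs */

struct pcpu_counter {
    int64_t count;            /* central accumulator */
    int32_t local[NR_SLOTS];  /* per-CPU deltas not yet folded in */
};

/* Accurate sum: read-only, never zeroes another CPU's delta. */
static int64_t counter_sum(const struct pcpu_counter *c)
{
    assert(c);
    int64_t sum = c->count;
    for (int i = 0; i < NR_SLOTS; i++)
        sum += c->local[i];
    return sum;
}

/* Cheap read: central value only, so it can lag behind the true sum. */
static int64_t counter_read(const struct pcpu_counter *c)
{
    assert(c);
    return c->count;
}
```

With a central value of 100 and deltas of 1, 2, 3 and 4, the sum is 110 while the cheap read stays at 100, which is the inaccuracy the commit message warns about.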
     

26 Nov, 2008

1 commit

  • Move some of the forward declaration of the static functions
    to mballoc.c where they are used. This enables us to include
    mballoc.h in other .c files. Also correct the buddy cache
    documentation.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     

24 Nov, 2008

1 commit

  • In ext4_mb_init_group(), if the filesystem block size is less than
    PAGE_SIZE/2, the code tries to grab alloc_sem for multiple block
    groups in a loop. We need to allow for this by using
    down_write_nested() and passing in the loop index as a lock subclass
    number. This works because no other code path needs to take multiple
    alloc_sem's. Note that lockdep will fail for a filesystem blocksize
    smaller than PAGE_SIZE/16 (e.g., a 1k filesystem blocksize with
    a 32k page size, or a 2k filesystem blocksize with a 64k page size,
    etc.).

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
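
The block size limit mentioned in this message can be checked with a little arithmetic. The sketch assumes that each group occupies two blocks in the buddy cache page (so groups per page is blocks per page divided by two) and that lockdep allows 8 subclasses; both figures are assumptions not stated in the message, and the helper names are hypothetical.

```c
#include <assert.h>

#define SK_MAX_LOCKDEP_SUBCLASSES 8   /* assumed lockdep subclass limit */

/* Number of block groups whose alloc_sem is taken for one buddy cache page. */
static unsigned groups_per_page(unsigned page_size, unsigned block_size)
{
    assert(block_size > 0 && page_size % block_size == 0);
    unsigned blocks_per_page = page_size / block_size;
    unsigned groups = blocks_per_page / 2;  /* assumed: two blocks per group */
    return groups ? groups : 1;
}

/* Return 1 if every loop index used as a subclass stays within the limit. */
static int subclasses_fit(unsigned page_size, unsigned block_size)
{
    return groups_per_page(page_size, block_size) <= SK_MAX_LOCKDEP_SUBCLASSES;
}
```

Under these assumptions a 1k block size with 4k pages needs 2 subclasses and fits, while 1k blocks with 32k pages or 2k blocks with 64k pages need 16 and do not.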
     

14 Nov, 2008

2 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed conflicts above by using the non 'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Stephen Tweedie
    Cc: Andrew Morton
    Cc: adilger@sun.com
    Cc: linux-ext4@vger.kernel.org
    Signed-off-by: James Morris

    David Howells
     

07 Nov, 2008

2 commits