26 Oct, 2010

1 commit


21 Jul, 2010

1 commit


05 Mar, 2010

2 commits

  • Get rid of the initialize dquot operation - it is now always called from
    the filesystem and if a filesystem really needs its own (which none
    currently does) it can just call into its own routine directly.

    Rename the now static low-level dquot_initialize helper to __dquot_initialize
    and vfs_dq_init to dquot_initialize to have a consistent namespace.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     
  • Currently various places in the VFS call vfs_dq_init directly. This means
    we tie the quota code into the VFS. Get rid of that and make the
    filesystem responsible for the initialization. For most metadata operations
    this is a straightforward move into the methods, but for truncate and
    open it's a bit more complicated.

    For truncate we currently only call vfs_dq_init for the sys_truncate case
    because open already takes care of it for ftruncate and open(O_TRUNC) - the
    new code causes an additional vfs_dq_init for those which is harmless.

    For open the initialization is moved from do_filp_open into the open method,
    which means it happens slightly earlier now, and only for regular files.
    The latter is fine because we don't need to initialize it for operations
    on special files, and we already do it as part of the namespace operations
    for directories.

    Add a dquot_file_open helper that filesystems that support generic quotas
    can use to fill in ->open.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

23 Dec, 2009

2 commits


09 Sep, 2009

1 commit


04 Apr, 2009

1 commit


03 Apr, 2009

3 commits

  • In data=writeback mode, start an asynchronous flush when renaming a
    file on top of an already-existing file. This lowers the probability
    of data loss in the case of applications that attempt to replace a
    file using rename().

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • ext3_iget() returns -ESTALE if invoked on a deleted inode, in order to
    report errors to NFS properly. However, in ext[234]_lookup(), this
    -ESTALE can be propagated to userspace if the filesystem is corrupted such
    that a directory entry references a deleted inode. This leads to a
    misleading error message - "Stale NFS file handle" - and confusion on the
    part of the admin.

    The bug can be easily reproduced by creating a new filesystem, making a
    link to an unused inode using debugfs, then mounting and attempting to ls
    -l said link.

    This patch thus changes ext3_lookup to return -EIO if it receives -ESTALE
    from ext3_iget(), as ext3 does for other filesystem metadata corruption;
    and also invokes the appropriate ext*_error functions when this case is
    detected.

    Signed-off-by: Bryan Donlan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bryan Donlan
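
The error translation described in this commit can be modeled in userspace roughly as follows. This is a minimal sketch, not the kernel code: `iget_fn`, `lookup_errno`, `iget_ok` and `iget_deleted` are hypothetical names introduced for illustration, and the error-reporting call that the real patch makes is only noted in a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode getter: returns 0 on success or a
 * negative errno, e.g. -ESTALE when the inode has been deleted. */
typedef int (*iget_fn)(unsigned long ino);

/* Lookup-side translation: -ESTALE here means a directory entry points at
 * a deleted inode, which is on-disk corruption, so report -EIO instead of
 * leaking an NFS-flavoured error to userspace. */
static int lookup_errno(iget_fn iget, unsigned long ino)
{
    int err;

    assert(iget != NULL);
    err = iget(ino);
    if (err == -ESTALE) {
        /* The real patch also invokes the filesystem error function here. */
        return -EIO;
    }
    return err;
}

/* Toy getters used only to exercise the sketch. */
static int iget_ok(unsigned long ino)      { (void)ino; return 0; }
static int iget_deleted(unsigned long ino) { (void)ino; return -ESTALE; }
```

Other error codes pass through unchanged; only the -ESTALE case is remapped.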
     
  • Use unsigned instead of int for the parameter which carries a blocksize.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Wei Yongjun
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yongjun
     

26 Mar, 2009

1 commit


17 Jan, 2009

1 commit

  • Make sure the rec_len field in the '..' entry is sane, lest we overrun
    the directory block and cause a kernel oops on a purposefully
    corrupted filesystem.

    This fixes a bug related to one originally reported by Sami Liedes
    for ext4 at:

    http://bugzilla.kernel.org/show_bug.cgi?id=12430

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Theodore Ts'o
     

09 Jan, 2009

2 commits

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
    jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
    ext4: Remove "extents" mount option
    block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
    ext4: Make printk's consistently prefixed with "EXT4-fs: "
    ext4: Add sanity checks for the superblock before mounting the filesystem
    ext4: Add mount option to set kjournald's I/O priority
    jbd2: Submit writes to the journal using WRITE_SYNC
    jbd2: Add pid and journal device name to the "kjournald2 starting" message
    ext4: Add markers for better debuggability
    ext4: Remove code to create the journal inode
    ext4: provide function to release metadata pages under memory pressure
    ext3: provide function to release metadata pages under memory pressure
    add releasepage hooks to block devices which can be used by file systems
    ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
    ext4: Init the complete page while building buddy cache
    ext4: Don't allow new groups to be added during block allocation
    ext4: mark the blocks/inode bitmap beyond end of group as used
    ext4: Use new buffer_head flag to check uninit group bitmaps initialization
    ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
    ext4: code cleanup
    ...

    Linus Torvalds
     
  • Use the new generic implementation.

    Signed-off-by: Wu Fengguang
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

05 Jan, 2009

1 commit

  • With the write_begin/write_end aops, page_symlink was broken because it
    could no longer pass a GFP_NOFS type mask into the point where the
    allocations happened. They are done in write_begin, which would always
    assume that the filesystem can be entered from reclaim. This bug could
    cause filesystem deadlocks.

    The funny thing with having a gfp_t mask there is that it doesn't really
    allow the caller to arbitrarily tinker with the context in which it can be
    called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
    take the page lock. The only thing any callers care about is __GFP_FS
    anyway, so turn that into a single flag.

    Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
    this flag in their write_begin function. Change __grab_cache_page to
    accept a nofs argument as well, to honour that flag (while we're there,
    change the name to grab_cache_page_write_begin which is more instructive
    and does away with random leading underscores).

    This is really a more flexible way to go in the end anyway -- if a
    filesystem happens to want any extra allocations aside from the pagecache
    ones in its write_begin function, it may now use GFP_KERNEL (rather than
    GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
    random example).

    [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
    [kosaki.motohiro@jp.fujitsu.com: fix fuse]
    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: [2.6.28.x]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    [ Cleaned up the calling convention: just pass in the AOP flags
    untouched to the grab_cache_page_write_begin() function. That
    just simplifies everybody, and may even allow future expansion of the
    logic. - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
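
The idea of reducing the gfp_t argument to a single "no filesystem re-entry" flag can be sketched as below. All constants and the function name are hypothetical stand-ins with made-up values (the kernel's real gfp bits and AOP flag values are not reproduced here); the sketch only shows the masking step.

```c
#include <assert.h>

/* Hypothetical allocation-context bits; values are illustrative only. */
#define TOY_GFP_IO      0x1u
#define TOY_GFP_FS      0x2u   /* allocator may re-enter the filesystem */
#define TOY_GFP_WAIT    0x4u
#define TOY_GFP_KERNEL  (TOY_GFP_IO | TOY_GFP_FS | TOY_GFP_WAIT)

/* Hypothetical write_begin flag; value is illustrative only. */
#define TOY_AOP_FLAG_NOFS 0x4u

static_assert((TOY_GFP_KERNEL & TOY_GFP_FS) != 0,
              "the default mask is assumed to allow filesystem re-entry");

/* Derive the page-cache allocation mask from the caller's AOP flags:
 * if NOFS is requested, strip only the FS bit and keep everything else. */
static unsigned toy_alloc_mask(unsigned base_mask, unsigned aop_flags)
{
    unsigned notmask = (aop_flags & TOY_AOP_FLAG_NOFS) ? TOY_GFP_FS : 0u;

    return base_mask & ~notmask;
}
```

A caller such as a symlink helper would pass the NOFS flag, while ordinary writes pass no flag and keep the full mask.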
     

01 Jan, 2009

1 commit


07 Dec, 2008

1 commit

  • This fixes a gcc warning but it doesn't appear able to result in a
    failure, since the primary way the loop is exited is the first
    conditional in the for loop, and at least for a consistent filesystem,
    the signed/unsigned should in practice never be exposed.

    Signed-off-by: Roel Kluin
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

29 Oct, 2008

1 commit

  • The original ext3 hash algorithms assumed that variables of type char
    were signed, as God and K&R intended. Unfortunately, this assumption
    is not true on some architectures. Userspace support for marking
    filesystems with non-native signed/unsigned chars was added two years
    ago, but the kernel-side support was never added (until now).

    Signed-off-by: "Theodore Ts'o"
    Cc: akpm@linux-foundation.org
    Cc: linux-kernel@vger.kernel.org

    Theodore Ts'o
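
To see why char signedness matters for a name hash, consider the toy hash below. It is not the ext3 hash: `toy_hash` is a hypothetical multiply-and-add function used only to show that bytes at or above 0x80 contribute different values depending on whether they are treated as signed or unsigned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy hash (hypothetical, not the ext3 algorithm). The chars_are_signed
 * argument selects how each name byte is widened before mixing. */
static uint32_t toy_hash(const char *name, size_t len, int chars_are_signed)
{
    uint32_t h = 0;

    assert(name != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char u = (unsigned char)name[i];
        int c = (int)u;

        /* Emulate a signed 8-bit char: 0x80..0xff become -128..-1. */
        if (chars_are_signed && u >= 0x80)
            c -= 256;

        h = h * 31u + (uint32_t)c;
    }
    return h;
}
```

Pure ASCII names hash identically either way, so the mismatch only shows up for names containing high-bit bytes, which is why a filesystem created on one architecture can fail lookups on another.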
     

23 Oct, 2008

2 commits


26 Jul, 2008

2 commits

  • ext3_dx_find_entry uses ext3_next_entry without verifying that the entry
    is valid. If its rec_len == 0 this causes an infinite loop. Refactor the
    loop to check the validity of entries before checking whether they match
    and moving onto the next one.

    There are other uses of ext3_next_entry in this file which also look
    problematic. They should be reviewed and fixed if/when we have a
    test-case that triggers them.

    This patch fixes the first case (image hdb.25.softlockup.gz) reported in
    http://bugzilla.kernel.org/show_bug.cgi?id=10882.

    Signed-off-by: Duane Griffin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin
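
The "validate before advancing" loop shape described above can be sketched in userspace as follows. The entry layout and the names `toy_dirent` and `walk_block` are simplified, hypothetical stand-ins (host byte order, no name matching); the specific checks are assumptions chosen to illustrate the idea, not a copy of the kernel's checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified entry header (host byte order). */
struct toy_dirent {
    uint32_t inode;
    uint16_t rec_len;   /* bytes from this entry to the next one */
    uint8_t  name_len;
    uint8_t  file_type;
};

/* Walk one directory block; return the entry count, or -1 on corruption. */
static int walk_block(const unsigned char *block, size_t blocksize)
{
    size_t off = 0;
    int count = 0;

    assert(block != NULL);
    while (off < blocksize) {
        struct toy_dirent de;

        if (blocksize - off < sizeof(de))
            return -1;              /* header would overrun the block */
        memcpy(&de, block + off, sizeof(de));

        /* Validate first: rec_len == 0 would otherwise never advance. */
        if (de.rec_len < sizeof(de) + de.name_len ||
            de.rec_len % 4 != 0 ||
            de.rec_len > blocksize - off)
            return -1;

        count++;                    /* a real caller would compare names here */
        off += de.rec_len;
    }
    return count;
}
```

Because every entry is checked before `off` is advanced, a zero or oversized rec_len ends the walk with an error instead of spinning or running past the block.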
     
  • dx_root_limit() will never return 20, and I can't figure out what 20
    stands for. This function has never changed since htree directory
    indexing was merged.

    Similar for dx_node_limit() and the magic 22.

    Signed-off-by: Li Zefan
    Acked-by: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

28 Apr, 2008

3 commits


08 Feb, 2008

1 commit

  • Stop the EXT3 filesystem from using iget() and read_inode(). Replace
    ext3_read_inode() with ext3_iget(), and call that instead of iget().
    ext3_iget() then uses iget_locked() directly and returns a proper error code
    instead of an inode in the event of an error.

    ext3_fill_super() returns any error incurred when getting the root inode
    instead of EINVAL.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: David Howells
    Acked-by: "Theodore Ts'o"
    Acked-by: Jan Kara
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

07 Feb, 2008

1 commit


15 Nov, 2007

1 commit

  • With a 64KB blocksize, a directory entry can have a size of 64KB, which does
    not fit into the 16 bits we have for the entry length. So we store 0xffff
    instead and convert the value when it is read from / written to disk. The
    patch also converts some places to use ext3_next_entry() when we are
    changing them anyway.

    [akpm@linux-foundation.org: coding-style cleanups]
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
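
The 0xffff convention can be illustrated with a pair of conversion helpers. This is a sketch under assumptions: the helper and macro names are hypothetical, and the little-endian byte swapping that on-disk fields need is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the on-disk marker meaning "64KB". */
#define TOY_MAX_REC_LEN 0xffffu

/* On-disk 16-bit value -> in-memory length in bytes. */
static uint32_t toy_rec_len_from_disk(uint16_t dlen)
{
    return dlen == TOY_MAX_REC_LEN ? (1u << 16) : dlen;
}

/* In-memory length in bytes -> on-disk 16-bit value. */
static uint16_t toy_rec_len_to_disk(uint32_t len)
{
    assert(len <= (1u << 16));      /* an entry cannot exceed one 64KB block */
    return len == (1u << 16) ? (uint16_t)TOY_MAX_REC_LEN : (uint16_t)len;
}
```

Reusing 0xffff is safe under the assumption that a genuine record length is always a multiple of 4, so 0xffff can never occur as a real value.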
     

17 Oct, 2007

1 commit

  • CONFIG_EXT3_INDEX is not an exposed config option in the kernel, and it is
    unconditionally defined in ext3_fs.h. tune2fs is already able to turn off
    dir indexing, so at this point it's just cluttering up the code. Remove
    it.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     

20 Sep, 2007

2 commits

  • The do_split() function for htree dir blocks is intended to split a leaf
    block to make room for a new entry. It sorts the entries in the original
    block by hash value, then moves the last half of the entries to the new
    block - without accounting for how much space this actually moves. (IOW,
    it moves half of the entry *count* not half of the entry *space*). If by
    chance we have both large & small entries, and we move only the smallest
    entries, and we have a large new entry to insert, we may not have created
    enough space for it.

    The patch below stores each record size when calculating the dx_map, and
    then walks the hash-sorted dx_map, calculating how many entries must be
    moved to more evenly split the existing entries between the old block and
    the new block, guaranteeing enough space for the new entry.

    The dx_map "offs" member is reduced to u16 so that the overall map size
    does not change - it is temporarily stored at the end of the new block, and
    if it grows too large it may be overwritten. By making offs and size both
    u16, we won't grow the map size.

    Also add a few comments to the functions involved.

    This fixes the testcase reported by hooanon05@yahoo.co.jp on the
    linux-ext4 list, "ext3 dir_index causes an error"

    Thanks to Andreas Dilger for discussing the problem & solution with me.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andreas Dilger
    Tested-by: Junjiro Okajima
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
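
Splitting by space rather than by count can be sketched as a walk over the hash-sorted map from the high-hash end. The struct and function names are hypothetical, and the exact stopping rule (stop once more than half of the next entry would land past the midpoint) is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical map entry: hash, offset and size of one directory record.
 * offs and size are both 16 bits, mirroring the commit's description. */
struct toy_map_entry {
    uint32_t hash;
    uint16_t offs;
    uint16_t size;
};

/* Given a map sorted by ascending hash, return how many entries to move
 * from the tail so that roughly half of the used space moves. */
static unsigned toy_entries_to_move(const struct toy_map_entry *map,
                                    unsigned count, unsigned blocksize)
{
    unsigned size = 0, move = 0;

    assert(map != NULL || count == 0);
    for (unsigned i = count; i-- > 0; ) {
        /* Would more than half of this entry fall past the midpoint? */
        if (size + map[i].size / 2u > blocksize / 2u)
            break;
        size += map[i].size;
        move++;
    }
    return move;
}
```

With two 400-byte entries followed by four 16-byte entries in a 1024-byte block, a count-based split would move three small entries (48 bytes), while this walk moves five entries and frees 464 bytes.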
     
  • Convert asserts (BUGs) in dx_probe triggered by bad on-disk data into
    recoverable errors with helpful warnings. With help catching other asserts
    from Duane Griffin.

    Signed-off-by: Eric Sandeen
    Acked-by: Duane Griffin
    Acked-by: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     

17 Jul, 2007

1 commit

  • After the ext3 orphan list check was added to ext3_destroy_inode()
    (please see my previous patch), the following situation was detected:

    EXT3-fs warning (device sda6): ext3_unlink: Deleting nonexistent file (37901290), 0
    Inode 00000101a15b7840: orphan list check failed!
    00000773 6f665f00 74616d72 00000573 65725f00 06737270 66000000 616d726f
    ...
    Call Trace: [] ext3_destroy_inode+0x79/0x90
    [] sys_unlink+0x126/0x1a0
    [] error_exit+0x0/0x81
    [] system_call+0x7e/0x83

    The first message says that the unlinked inode has i_nlink=0; ext3_unlink()
    then adds this inode to the orphan list.

    The second message means that this inode has not been removed from the
    orphan list. The inode dump showed that i_fop = &bad_file_ops, which can be
    set only in make_bad_inode(). Then I found that ext3_read_inode() can call
    make_bad_inode() without any error/warning messages, for example in the
    following case:

    ...
    if (inode->i_nlink == 0) {
            if (inode->i_mode == 0 ||
                !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ORPHAN_FS)) {
                    /* this inode is deleted */
                    brelse (bh);
                    goto bad_inode;
    ...

    A bad inode can live for some time, and ext3_unlink() can add it to the
    orphan list, but ext3_delete_inode() does not delete this inode from the
    orphan list. As a result, we can have orphan list corruption detected in
    ext3_destroy_inode().

    However, it is not clear to me how to fix this issue correctly.

    As far as I can see, is_bad_inode() is called after iget() in all places
    except ext3_lookup() and ext3_get_parent(). I believe it makes sense to
    add a bad inode check to these functions too and call iput() if a bad inode
    is detected.

    Signed-off-by: Vasily Averin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

09 May, 2007

2 commits

  • Remove includes of where it is not used/needed.
    Suggested by Al Viro.

    Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
    sparc64, and arm (all 59 defconfigs).

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • - ext3_dx_find_entry() exits without setting a proper error pointer

    - do_split() exits without setting a proper error pointer.
    This is really painful because many callers contain the following code:

        de = do_split(handle, dir, &bh, frame, &hinfo, &retval);
        if (!(de))
                return retval;
        <<< WOW retval wasn't changed by do_split(), so caller failed
        <<< but return SUCCESS :)

    - Rearrange the do_split() error path. The current error path is really
    ugly; all this up and down jump stuff doesn't make the code easy to
    understand.

    [dmonakhov@sw.ru: fix annoying fake error messages]
    Signed-off-by: Monakhov Dmitriy
    Cc: Andreas Dilger
    Cc: Theodore Ts'o
    Signed-off-by: Monakhov Dmitriy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitriy Monakhov
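
The failure mode in the caller snippet above comes down to an out-parameter that is not written on every error path. A minimal userspace sketch of the corrected contract follows; `toy_split`, `toy_add_entry` and the -ENOSPC choice are hypothetical and stand in for the real functions and error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_entry { int unused; };

/* Returns an entry on success or NULL on failure. The contract the fix
 * enforces: *error is written on EVERY path, so callers can trust it. */
static struct toy_entry *toy_split(int simulate_failure, int *error)
{
    static struct toy_entry slot;

    assert(error != NULL);
    if (simulate_failure) {
        *error = -ENOSPC;   /* hypothetical error code for the sketch */
        return NULL;
    }
    *error = 0;
    return &slot;
}

/* Caller in the style quoted in the commit message. */
static int toy_add_entry(int simulate_failure)
{
    int retval = 0;         /* would be returned as "success" if never set */
    struct toy_entry *de = toy_split(simulate_failure, &retval);

    if (!de)
        return retval;
    return 0;
}
```

If `toy_split` returned NULL without writing `*error`, `toy_add_entry` would return 0 for a failed operation, which is exactly the bug the commit describes.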
     

13 Feb, 2007

1 commit

  • Many struct inode_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

12 Feb, 2007

2 commits

  • - Naming is confusing, ext3_inc_count manipulates i_nlink not i_count
    - handle argument passed in is not used
    - ext3 and ext4 already call inc_nlink and dec_nlink directly in other places

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • Return -ENOENT from ext[34]_link if we've raced with unlink and i_nlink is
    0. Doing otherwise has the potential to corrupt the orphan inode list,
    because we'd wind up with an inode with a non-zero link count on the list,
    and it will never get properly cleaned up & removed from the orphan list
    before it is freed.

    [akpm@osdl.org: build fix]
    Signed-off-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
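
The guard described above can be modeled in a few lines. `toy_inode` and `toy_link` are hypothetical userspace stand-ins; locking and the directory update that a real link operation performs are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical inode with only the field this sketch needs. */
struct toy_inode {
    unsigned int i_nlink;
};

/* Refuse to link an inode whose link count already dropped to zero:
 * it may be on the orphan list, and raising the count again would leave
 * a live inode on that list. */
static int toy_link(struct toy_inode *inode)
{
    assert(inode != NULL);
    if (inode->i_nlink == 0)
        return -ENOENT;     /* lost the race with unlink */

    inode->i_nlink++;
    return 0;
}
```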
     

09 Dec, 2006

1 commit


08 Dec, 2006

1 commit

  • I've been using Steve Grubb's purely evil "fsfuzzer" tool, at
    http://people.redhat.com/sgrubb/files/fsfuzzer-0.4.tar.gz

    Basically it makes a filesystem, splats some random bits over it, then
    tries to mount it and do some simple filesystem actions.

    At best, the filesystem catches the corruption gracefully. At worst,
    things spin out of control.

    As you might guess, we found a couple places in ext3 where things spin out
    of control :)

    First, we had a corrupted directory that was never checked for
    consistency... it was corrupt, and pointed to another bad "entry" of
    length 0. The for() loop looped forever, since the length of
    ext3_next_entry(de) was 0, and we kept looking at the same pointer over and
    over and over and over... I modeled this check and subsequent action on
    what is done for other directory types in ext3_readdir...

    (adding this check adds some computational expense; I am testing a followup
    patch to reduce the number of times we check and re-check these directory
    entries, in all cases. Thanks for the idea, Andreas).

    Next we had a root directory inode which had a corrupted size, claimed to
    be > 200M on a 4M filesystem. There was only really 1 block in the
    directory, but because the size was so large, readdir kept coming back for
    more, spewing thousands of printk's along the way.

    Per Andreas' suggestion, if we're in this read error condition and we're
    trying to read an offset which is greater than i_blocks worth of bytes,
    stop trying, and break out of the loop.

    With these two changes, the fsfuzz test survives quite well on ext3.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
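
The second fix, bailing out of readdir when a read fails beyond the space the directory actually occupies, can be sketched as a single predicate. The function name is hypothetical; the sketch assumes i_blocks counts 512-byte units, so it is shifted left by 9 to get bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Decide whether readdir should give up after a failed block read.
 * pos is the current byte offset in the directory; i_blocks is the
 * number of 512-byte units allocated to the inode. */
static int toy_should_stop_readdir(uint64_t pos, uint64_t i_blocks,
                                   int read_failed)
{
    assert(i_blocks <= (UINT64_MAX >> 9));  /* the shift below must not overflow */

    /* Only stop on a read error AND when past the allocated space;
     * a corrupted i_size alone is not trusted as the bound. */
    return read_failed && pos > (i_blocks << 9);
}
```

For a directory with one 4KB block (i_blocks of 8) and a bogus 200MB size, a failed read at offset 8192 stops the loop instead of retrying across the whole claimed size.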