04 Aug, 2008

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: remove write-only variables from ext4_ordered_write_end
    ext4: unexport jbd2_journal_update_superblock
    ext4: Cleanup whitespace and other miscellaneous style issues
    ext4: improve ext4_fill_flex_info() a bit
    ext4: Cleanup the block reservation code path
    ext4: don't assume extents can't cross block groups when truncating
    ext4: Fix lack of credits BUG() when deleting a badly fragmented inode
    ext4: Fix ext4_ext_journal_restart()
    ext4: fix ext4_da_write_begin error path
    jbd2: don't abort if flushing file data failed
    ext4: don't read inode block if the buffer has a write error
    ext4: Don't allow lg prealloc list to be grow large.
    ext4: Convert the usage of NR_CPUS to nr_cpu_ids.
    ext4: Improve error handling in mballoc
    ext4: lock block groups when initializing
    ext4: sync up block and inode bitmap reading functions
    ext4: Allow read/only mounts with corrupted block group checksums
    ext4: Fix data corruption when writing to prealloc area

    Linus Torvalds
     

03 Aug, 2008

6 commits

  • The variables 'from' and 'to' are not used anywhere.

    Signed-off-by: Eric Sandeen
    Acked-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • The extents codepath for ext4_truncate() requests journal transaction
    credits in very small chunks, requesting only what is needed. This
    means there may not be enough credits left on the transaction handle
    after ext4_truncate() returns, and then when ext4_delete_inode() tries
    to finish up its work, it may not have enough transaction credits,
    causing a BUG() oops in the jbd2 core.

    Also, reserve an extra 2 blocks when starting an ext4_delete_inode()
    since we need to update the inode bitmap, as well as update the
    orphaned inode linked list.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • ext4_da_write_begin needs to call journal_stop before returning,
    if the page allocation fails.

    Signed-off-by: Eric Sandeen
    Acked-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • I noticed when filling a 1T filesystem with 4 threads using the
    fs_mark benchmark:

    fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0

    that I occasionally got checksum mismatch errors:

    EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935

    etc. I'd reliably get 4-5 of them during the run.

    It appears that the problem is likely a race to init the bg's
    when the uninit_bg feature is enabled.

    With the patch below, which adds sb_bgl_locking around initialization,
    I was able to complete several runs with no errors or warnings.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Theodore Ts'o

    Eric Sandeen
     
  • ext4_read_block_bitmap and read_inode_bitmap do essentially
    the same thing, and yet they are structured quite differently.
    I came across this difference while looking at doing bg locking
    during bg initialization.

    This patch:

    * removes unnecessary casts in the error messages
    * renames read_inode_bitmap to ext4_read_inode_bitmap
    * and more substantially, restructures the inode bitmap
    reading function to be more like the block bitmap counterpart.

    The change to the inode bitmap reader simplifies the locking
    to be applied in the next patch.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Theodore Ts'o

    Eric Sandeen
     
  • Inserting an extent can cause a new entry in the already existing index
    block. That doesn't increase the depth of the tree. Instead it adds a
    new leaf block. Now with the new leaf block the path information
    corresponding to the logical block should be fetched from the new block.
    The old path will be pointing to the old leaf block.

    We need to recalculate the path information on extent insert
    even if the depth doesn't change. Without this change, the extent merge
    after converting an unwritten extent to an initialized extent takes the
    wrong extent and causes data corruption.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     

02 Aug, 2008

2 commits

  • With the FLEX_BG layout, there is no reason why extents can't cross
    block groups, so make the truncate code reserve enough credits so we
    don't BUG if we come across such an extent.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • The ext4_ext_journal_restart() is a convenience function which checks
    to see if the requested number of credits is present, and if so it
    closes the current transaction and attaches the current handle to the
    new transaction. Unfortunately, it wasn't properly checking the
    return value from ext4_journal_extend(), so it was starting a new
    transaction when one was not necessary, and returning an error when
    all that was necessary was to restart the handle with a new
    transaction.
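
    The intended control flow can be sketched in userspace C. Everything
    here is hypothetical: struct sketch_handle and the sketch_* functions
    are stand-ins, not the jbd2 API, and the three-way return convention of
    sketch_extend() (negative for an error, zero for success, positive for
    "transaction full") is an assumption made for the illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for a journal handle; not the jbd2 handle_t. */
struct sketch_handle {
    int credits;   /* credits still available on this handle */
    int txn_room;  /* credits the running transaction can still hand out */
    int aborted;   /* non-zero once the journal has failed */
    int restarts;  /* number of times a new transaction was started */
};

/* Try to grow the handle in place.
 * Returns <0 on error, 0 on success, >0 if the transaction is full. */
int sketch_extend(struct sketch_handle *h, int needed)
{
    if (h->aborted)
        return -1;
    if (needed > h->txn_room)
        return 1;
    h->txn_room -= needed;
    h->credits += needed;
    return 0;
}

/* Attach the handle to a fresh transaction with `needed` credits. */
int sketch_restart(struct sketch_handle *h, int needed)
{
    h->restarts++;
    h->credits = needed;
    return 0;
}

/* Make sure the handle has more than `needed` credits available. */
int sketch_journal_restart(struct sketch_handle *h, int needed)
{
    int err;

    assert(needed >= 0);
    if (h->credits > needed)
        return 0;                     /* enough credits already */
    err = sketch_extend(h, needed);
    if (err <= 0)
        return err;                   /* extended in place, or a real error */
    return sketch_restart(h, needed); /* transaction full: restart */
}
```

    The point of the sketch is the middle branch: a zero return from the
    extend step must not trigger a restart, and only a positive return
    (transaction full) should.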

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

01 Aug, 2008

1 commit

  • * new helper: vfs_quota_on_path(); equivalent of vfs_quota_on() sans the
    pathname resolution.
    * callers of vfs_quota_on() that do their own pathname resolution and
    checks based on it are switched to vfs_quota_on_path(); that way we
    avoid the races.
    * reiserfs leaked dentry/vfsmount references on several failure exits.

    Signed-off-by: Al Viro

    Al Viro
     

29 Jul, 2008

1 commit

  • When we read part of a file through the pagecache and a page exists at
    the corresponding index but is not uptodate, read IO is issued to
    bring the page uptodate.

    I think this is good for a pagesize == blocksize environment but there
    is room for improvement in a pagesize != blocksize environment, because
    in this case a page can have multiple buffers and even if a page is not
    uptodate, some buffers can be uptodate.

    So I suggest that when all buffers which correspond to the part of a
    file that we want to read are uptodate, we use this page and copy data
    from it to the user buffer even if the page is not uptodate. This can
    reduce read IO and improve system throughput.
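
    The check described above can be sketched in userspace C. This is an
    illustration only: struct sketch_page and range_is_uptodate() are
    hypothetical names, not the kernel's page or buffer_head interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BUFFERS 8

/* Hypothetical stand-in for a page backed by several block-sized buffers. */
struct sketch_page {
    bool page_uptodate;                       /* whole-page flag */
    size_t blocksize;                         /* bytes per buffer */
    size_t nr_buffers;                        /* buffers in this page */
    bool buffer_uptodate[SKETCH_MAX_BUFFERS]; /* per-buffer flags */
};

/* Return true if bytes [from, from + len) of the page can be copied to the
 * user buffer without issuing read IO. */
bool range_is_uptodate(const struct sketch_page *p, size_t from, size_t len)
{
    size_t first, last, i;

    assert(p->nr_buffers <= SKETCH_MAX_BUFFERS);
    if (p->page_uptodate)
        return true;
    if (len == 0 || p->blocksize == 0)
        return false;
    first = from / p->blocksize;
    last = (from + len - 1) / p->blocksize;
    if (last >= p->nr_buffers)
        return false;
    /* Every buffer overlapping the requested range must be uptodate. */
    for (i = first; i <= last; i++)
        if (!p->buffer_uptodate[i])
            return false;
    return true;
}
```

    With a 4096-byte page made of 1024-byte buffers, a read of the third
    and fourth buffers can be served from the page when only those two
    buffers are uptodate, even though the page flag is clear.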

    I wrote a benchmark program and measured the result with it.

    The benchmark does the following:

    1: mount and open a test file.

    2: create a 512MB file.

    3: close a file and umount.

    4: mount and again open a test file.

    5: pwrite randomly 300000 times on the test file. Offsets are aligned
    to the IO size (1024 bytes).

    6: measure the time taken to pread randomly 100000 times on the test file.

    The result was:
    2.6.26
    330 sec

    2.6.26-patched
    226 sec

    Arch:i386
    Filesystem:ext3
    Blocksize:1024 bytes
    Memory: 1GB

    On ext3/4, a file is written through buffer/block. So random read/write
    mixed workloads or random read after random write workloads are optimized
    with this patch under pagesize != blocksize environment. This test result
    showed this.

    The benchmark program is as follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mount.h>

    #define LEN 1024
    #define LOOP (1024 * 512) /* 512MB */

    int main(void)
    {
        unsigned long i, offset, filesize;
        int fd;
        char buf[LEN];
        time_t t1, t2;

        /* Steps 1-3: mount, create the 512MB test file, unmount. */
        if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
            perror("cannot mount");
            exit(1);
        }
        memset(buf, 0, LEN);
        /* O_CREAT requires a mode argument. */
        fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC, 0644);
        if (fd < 0) {
            perror("cannot open file");
            exit(1);
        }
        for (i = 0; i < LOOP; i++)
            write(fd, buf, LEN);
        close(fd);
        if (umount("/root/test1/") < 0) {
            perror("cannot umount");
            exit(1);
        }

        /* Step 4: mount again and reopen the test file. */
        if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
            perror("cannot mount");
            exit(1);
        }
        fd = open("/root/test1/testfile", O_RDWR);
        if (fd < 0) {
            perror("cannot open file");
            exit(1);
        }

        /* Step 5: random writes at offsets aligned to LEN. */
        filesize = LEN * LOOP;
        for (i = 0; i < 300000; i++) {
            offset = (random() % filesize) & (~(LEN - 1));
            pwrite(fd, buf, LEN, offset);
        }

        /* Step 6: time the random reads. */
        printf("start test\n");
        time(&t1);
        for (i = 0; i < 100000; i++) {
            offset = (random() % filesize) & (~(LEN - 1));
            pread(fd, buf, LEN, offset);
        }
        time(&t2);
        printf("%ld sec\n", (long)(t2 - t1));
        close(fd);
        if (umount("/root/test1/") < 0) {
            perror("cannot umount");
            exit(1);
        }
        return 0;
    }

    Signed-off-by: Hisashi Hifumi
    Cc: Nick Piggin
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

27 Jul, 2008

5 commits

  • * kill nameidata * argument; map the 3 bits in ->flags anybody cares
    about to new MAY_... ones and pass with the mask.
    * kill redundant gfs2_iop_permission()
    * sanitize ecryptfs_permission()
    * fix remaining places where ->permission() instances might barf on new
    MAY_... found in mask.

    The obvious next target in that direction is permission(9)

    folded fix for nfs_permission() breakage from Miklos Szeredi

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • Kmem cache passed to constructor is only needed for constructors that are
    themselves multiplexers. Nobody uses this "feature", nor does anybody use
    the passed kmem cache in a non-trivial way, so pass only a pointer to the
    object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • A transient I/O error can corrupt inode data. Here is the scenario:

    (1) update inode_A at the block_B
    (2) pdflush writes out new inode_A to the filesystem, but it results
    in write I/O error, at this point, BH_Uptodate flag of the buffer
    for block_B is cleared and BH_Write_EIO is set
    (3) create new inode_C which is located at block_B, and
    __ext4_get_inode_loc() tries to read on-disk block_B because the
    buffer is not uptodate
    (4) if it can read on-disk block_B successfully, inode_A is
    overwritten by old data

    This patch makes __ext4_get_inode_loc() not read the inode block if the
    buffer has the BH_Write_EIO flag set. In this case, the buffer should
    have the latest information, so the patch sets the uptodate flag on the
    buffer (this avoids a WARN_ON_ONCE() in mark_buffer_dirty()).
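
    The decision described above can be sketched in userspace C. The flag
    names SK_UPTODATE and SK_WRITE_EIO, struct sketch_buffer and
    sketch_need_read() are hypothetical stand-ins for the buffer_head state
    bits and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits modelled on BH_Uptodate and BH_Write_EIO. */
enum {
    SK_UPTODATE  = 1u << 0,
    SK_WRITE_EIO = 1u << 1
};

struct sketch_buffer {
    unsigned int flags;
};

/* Return true if the caller must read the block from disk. */
bool sketch_need_read(struct sketch_buffer *bh)
{
    assert(bh != NULL);
    if (bh->flags & SK_UPTODATE)
        return false;
    if (bh->flags & SK_WRITE_EIO) {
        /* An earlier write failed, so the in-memory copy is newer than
         * the on-disk block: keep it and mark it uptodate again. */
        bh->flags |= SK_UPTODATE;
        return false;
    }
    return true;
}
```

    Only a buffer with neither flag set is read from disk, which is what
    prevents stale on-disk data from overwriting inode_A in step (4).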

    With this change, error checking would need to test the BH_Write_EIO
    flag. Currently nobody checks write I/O errors on metadata
    buffers, but it will be done in other patches I'm working on.

    Signed-off-by: Hidehiro Kawai
    Cc: sugita
    Cc: Satoshi OSHIMA
    Cc: Nick Piggin
    Cc: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Theodore Ts'o

    Hidehiro Kawai
     
  • If the block group checksums are corrupted, still allow the mount to
    succeed, so e2fsck can have a chance to try to fix things up. Add
    code in the remount r/w path to make sure the block group checksums
    are valid before allowing the filesystem to be remounted read/write.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

25 Jul, 2008

1 commit


24 Jul, 2008

3 commits

  • Currently, the locality group prealloc list is freed only when there
    is a block allocation failure. This can result in a large number of
    entries in the preallocation list, making ext4_mb_use_preallocated()
    expensive.

    To fix this, we convert the locality group prealloc list to a hash
    list. The hash index is the order of number of blocks in the prealloc
    space with a max order of 9. When adding prealloc space to the list we
    make sure total entries for each order does not exceed 8. If it is
    more than 8 we discard a few entries and make sure that we have only

    Signed-off-by: Theodore Ts'o
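
    The hash index described above can be sketched in userspace C. The
    sketch assumes that "order" means the floor of the base-2 logarithm of
    the number of blocks; that reading, and the names SKETCH_MAX_ORDER,
    SKETCH_MAX_PER_ORDER, sketch_prealloc_order() and
    sketch_list_too_long(), are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_ORDER     9  /* highest hash index */
#define SKETCH_MAX_PER_ORDER 8  /* entries allowed per index before trimming */

/* Map a prealloc space of `nblocks` blocks to its hash index. */
unsigned int sketch_prealloc_order(unsigned int nblocks)
{
    unsigned int order = 0;

    assert(nblocks > 0);
    while (nblocks >>= 1)   /* count how many times we can halve */
        order++;
    return order > SKETCH_MAX_ORDER ? SKETCH_MAX_ORDER : order;
}

/* True when a list for one order has grown past the limit. */
bool sketch_list_too_long(unsigned int entries)
{
    return entries > SKETCH_MAX_PER_ORDER;
}
```

    Under this reading, a 512-block space and a 4096-block space both land
    in index 9, while an 8-block space lands in index 3.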

    Aneesh Kumar K.V
     
  • NR_CPUS can be really large. We should be using nr_cpu_ids instead.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Theodore Ts'o

    Aneesh Kumar K.V
     
  • Don't call BUG_ON on file system failures. Instead use ext4_error and
    also handle the continue case properly.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Theodore Ts'o

    Aneesh Kumar K.V
     

18 Jul, 2008

1 commit

  • The truncate path should not use the i_allocated_meta_blocks
    value. So add separate functions to be used in the truncate
    and alloc path. We also need to release the meta-data block
    that we reserved for the blocks that we are truncating.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Theodore Ts'o

    Aneesh Kumar K.V
     

15 Jul, 2008

1 commit

  • This patch does block reservation for delayed
    allocation, to avoid ENOSPC later at page flush time.

    Blocks (data and metadata) are reserved at da_write_begin()
    time, the freeblocks counter is updated at that point, and the number
    of reserved blocks is stored in a per-inode counter.

    At the writepage time, the unused reserved meta blocks are returned
    back. At unlink/truncate time, reserved blocks are properly released.

    Updated fix from Aneesh Kumar K.V
    to fix the oldallocator block reservation accounting with delalloc, added
    lock to guard the counters and also fix the reservation for meta blocks.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: Theodore Ts'o

    Mingming Cao
     

12 Jul, 2008

18 commits

  • We've talked for a while about getting rid of any feature-
    setting from the kernel; this gets rid of the code which would
    set the INCOMPAT_EXTENTS flag on the first file write when mounted
    as ext4[dev].

    With this patch, if the extents feature is not already set on disk,
    then mounting as ext4 will fall back to noextents with a warning,
    and if -o extents is explicitly requested, the mount will fail,
    also with a warning.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • The block mapped inode format can address only blocks within 2**32. This
    causes a number of issues, the biggest of which is that the block
    allocator needs to be taught that certain inodes can not utilize block
    numbers > 2**32. So until this is fixed, it is simplest to fail
    mounting of file systems with more than 2**32 blocks if the -o noextents
    option is given.
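
    The mount-time check can be sketched in userspace C. The function
    sketch_mount_allowed() is hypothetical, and the threshold takes the
    text above literally ("more than 2**32 blocks"); the exact comparison
    used by the kernel is not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest block count the sketch accepts for block-mapped (noextents) use. */
#define SKETCH_BLOCK_LIMIT (UINT64_C(1) << 32)

/* Return true if a filesystem with `blocks_count` blocks may be mounted. */
bool sketch_mount_allowed(uint64_t blocks_count, bool noextents)
{
    assert(blocks_count > 0);
    if (noextents && blocks_count > SKETCH_BLOCK_LIMIT)
        return false;   /* block-mapped inodes cannot address these blocks */
    return true;
}
```

    A large filesystem is still mountable in the sketch as long as the
    noextents option is not requested.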

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • Enable delalloc by default to ensure it gets sufficient testing and
    because it makes the filesystem much more efficient. Add a nodelalloc
    option to disable delayed allocation, and update ext4_show_options to
    show delayed allocation off if it is disabled.

    If the data=journal mount option is used, disable delayed allocation
    since the delalloc code doesn't support data=journal yet.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Mingming Cao

    Aneesh Kumar K.V
     
  • Right now i_blocks is not getting updated until the blocks are actually
    allocated on disk. This means with delayed allocation, right after files
    are copied, "ls -sF" shows the file as taking 0 blocks on disk. "du"
    also shows the files taking zero space, which is highly confusing to the
    user.

    Since delayed allocation already keeps track of the per-inode total
    number of blocks that are subject to delayed allocation, this patch
    fixes this by using that count to adjust the value returned by stat(2).
    When real block allocation is done, i_blocks is updated. Since the
    count of blocks reserved for delayed allocation decreases at the same
    time, the value returned by stat(2) stays consistent.
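
    The adjustment can be sketched as a small calculation in userspace C.
    The function sketch_stat_blocks() and its parameters are hypothetical;
    the sketch assumes the reported value is in 512-byte units, as stat(2)
    reports st_blocks, and that reserved blocks are filesystem blocks of
    size 1 << blocksize_bits.

```c
#include <assert.h>
#include <stdint.h>

/* Blocks to report to stat(2), in 512-byte units.
 *   i_blocks        - already-allocated space, in 512-byte units
 *   reserved_blocks - delayed-allocation blocks, in filesystem blocks
 *   blocksize_bits  - log2 of the filesystem block size */
uint64_t sketch_stat_blocks(uint64_t i_blocks, uint64_t reserved_blocks,
                            unsigned int blocksize_bits)
{
    assert(blocksize_bits >= 9 && blocksize_bits < 32);
    return i_blocks + ((reserved_blocks << blocksize_bits) >> 9);
}
```

    For a 4096-byte block size, five reserved blocks and nothing allocated
    yet give the same answer as forty allocated 512-byte units and nothing
    reserved, so the value does not jump when allocation happens.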

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao
     
  • ext4_da_write_end() used walk_page_buffers() with a callback function of
    ext4_bh_unmapped_or_delay() to check if it extended the file size
    without allocating any blocks (since in this case i_disksize needs to be
    updated). However, this didn't work properly because the buffer head
    has not been marked dirty yet --- this is done later in
    block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to
    always return false.

    In addition, walk_page_buffers() checks all of the buffer heads covering
    the page, and the only buffer_head that should be checked is the one
    covering the end of the write. Otherwise, given a 1k blocksize
    filesystem and a 4k page size, the buffer head covering the first 1k
    stripe of the file could be unmapped (because it was a sparse file), and
    the second or third buffer_head covering that page could be mapped, and
    using walk_page_buffers() would fail in this case since it would stop at
    the first unmapped buffer_head and return true.

    The core problem is that walk_page_buffers() was intended to do work in
    a callback function, and a non-zero return value indicated a failure,
    which terminated the walk of the buffer heads covering the page. It was
    not intended to be used with a boolean function, such as
    ext4_bh_unmapped_or_delay().
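
    The mismatch can be sketched in userspace C. The types and functions
    below (struct sketch_bh, sketch_walk(), sketch_unmapped_or_delay()) are
    simplified, hypothetical stand-ins; they model only the behaviour
    described above, where the walk ends at the first non-zero return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal buffer state. */
struct sketch_bh {
    bool mapped;
    bool delay;
};

typedef int (*sketch_bh_fn)(const struct sketch_bh *bh);

/* Call fn on each buffer; stop at the first non-zero return, which the
 * walker treats as a failure code. */
int sketch_walk(const struct sketch_bh *bhs, size_t n, sketch_bh_fn fn)
{
    int ret = 0;
    size_t i;

    assert(bhs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        ret = fn(&bhs[i]);
        if (ret)
            break;
    }
    return ret;
}

/* Boolean predicate: 1 if the buffer is unmapped or delayed. */
int sketch_unmapped_or_delay(const struct sketch_bh *bh)
{
    return !bh->mapped || bh->delay;
}
```

    With a page whose first buffer is unmapped and whose later buffers are
    mapped, the walk reports 1 even though the buffer covering the end of
    the write, checked on its own, reports 0.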

    Add an additional fix from Aneesh to protect the i_disksize update from
    a race with truncate.

    Signed-off-by: Mingming Cao
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao
     
  • It can happen that buffers are removed from the page before it gets
    marked dirty and then is passed to writepage(). In writepage() we just
    initialize the buffers and check whether they are mapped and non
    delay. If they are mapped and non delay we write the page. Otherwise we
    mark them dirty. With this change we don't do block allocation at all
    in ext4_*_write_page.

    writepage() can get called under many conditions and with a locking order
    of journal_start -> lock_page, we should not try to allocate blocks in
    writepage(), which gets called after taking the page lock. writepage() can
    get called via shrink_page_list even with a journal handle which was
    created for doing inode update. For example when doing
    ext4_da_write_begin we create a journal handle with credit 1 expecting a
    i_disksize update for the inode. But ext4_da_write_begin can cause
    shrink_page_list via _grab_page_cache. So having a valid handle via
    ext4_journal_current_handle is not a guarantee that we can use the
    handle for block allocation in writepage, since we shouldn't be using
    credits that had been reserved for other updates. Doing so could result
    in us running out of credits when we update inodes.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • This provides a new ordered mode implementation which gets rid of using
    buffer heads to enforce the ordering between a metadata change and the
    related data change. Instead, in the new ordering mode, it keeps track
    of all of the inodes touched by each transaction on a list, and when
    that transaction is committed, it flushes all of the dirty pages for
    those inodes. In addition, the new ordered mode reverses the lock
    ordering of the page lock and transaction lock, which provides easier
    support for delayed allocation.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • With the reverse locking, we need to start a transaction before taking
    the page lock, so in ext4_da_writepages() we need to break the write-out
    into chunks, and restart the journal for each chunk to ensure the
    write-out fits in a single transaction.
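
    The chunking loop can be sketched in userspace C. The function
    sketch_write_in_chunks() and the pages_per_txn limit are hypothetical;
    the sketch only counts how many transactions a write-out would need and
    performs no I/O.

```c
#include <assert.h>

/* Write out nr_pages pages, using one transaction per chunk of at most
 * pages_per_txn pages. Returns the number of transactions started. */
unsigned long sketch_write_in_chunks(unsigned long nr_pages,
                                     unsigned long pages_per_txn)
{
    unsigned long started = 0;

    assert(pages_per_txn > 0);
    while (nr_pages > 0) {
        unsigned long n = nr_pages < pages_per_txn ? nr_pages : pages_per_txn;

        /* Order matters: the transaction starts first, then the pages in
         * this chunk are locked and written, then the transaction stops. */
        started++;
        nr_pages -= n;
    }
    return started;
}
```

    Ten pages with a limit of four pages per transaction take three
    transactions in this model.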

    Updated patch from Aneesh Kumar K.V
    which fixes the delalloc sync hang with journal lock inversion, and
    addresses the performance regression issue.

    Signed-off-by: Mingming Cao
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao
     
  • Delayed allocation needs to check free blocks at every write.
    percpu_counter_read_positive() is not quite accurate. Delayed
    allocation needs more accurate accounting, but calling
    percpu_counter_sum_positive() frequently is quite expensive.

    This patch adds a new function that updates the central counter while
    summing the per-cpu counters, which makes the next
    percpu_counter_read() more accurate and requires fewer calls to the
    expensive percpu_counter_sum().
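
    The trade-off can be sketched with a single-threaded model in userspace
    C. struct sketch_counter and the sketch_counter_* functions are
    hypothetical and ignore locking and batching; they are not the kernel's
    percpu_counter implementation.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4

/* Hypothetical model: one central count plus a pending delta per CPU. */
struct sketch_counter {
    long long count;
    long long delta[SKETCH_NR_CPUS];
};

/* Cheap update: touch only the local CPU's delta. */
void sketch_counter_add(struct sketch_counter *c, unsigned int cpu,
                        long long amount)
{
    assert(cpu < SKETCH_NR_CPUS);
    c->delta[cpu] += amount;
}

/* Cheap but approximate read: ignores pending per-CPU deltas. */
long long sketch_counter_read(const struct sketch_counter *c)
{
    return c->count;
}

/* Expensive exact sum that also folds the deltas into the central count,
 * so that later cheap reads start from an accurate value. */
long long sketch_counter_sum_and_set(struct sketch_counter *c)
{
    long long sum = c->count;
    unsigned int cpu;

    for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
        sum += c->delta[cpu];
        c->delta[cpu] = 0;
    }
    c->count = sum;
    return sum;
}
```

    After one exact sum, the cheap read returns the same value until new
    per-CPU updates accumulate again.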

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao
     
  • Updated with fixes from Mingming Cao to unlock and
    release the page from page cache if the delalloc write_begin failed, and
    properly handle preallocated blocks. Also added a fix to clear
    buffer_delay in block_write_full_page() after allocating a delayed
    buffer.

    Updated with fixes from Aneesh Kumar K.V
    to update i_disksize properly and to add bmap support for delayed
    allocation.

    Updated with a fix from Valerie Clement to
    avoid filesystem corruption when the filesystem is mounted with the
    delalloc option and blocksize < pagesize.

    Signed-off-by: Alex Tomas
    Signed-off-by: Mingming Cao
    Signed-off-by: Dave Kleikamp
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Aneesh Kumar K.V

    Alex Tomas
     
  • This patch makes ext4 use inode-based implementation of data=ordered mode
    in JBD2. It allows us to unify some data=ordered and data=writeback paths
    (especially writepage since we don't have to start a transaction anymore)
    and remove some buffer walking.

    Updated fix from Aneesh Kumar K.V
    to fix file system hang due to corrupt jinode values.

    Signed-off-by: Jan Kara
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • We cannot call ext4_orphan_add() from under i_data_sem because that
    causes a lock ordering violation between i_data_sem and the
    superblock lock.

    Updated with Aneesh's locking order fix

    Signed-off-by: Jan Kara
    Signed-off-by: Mingming Cao
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • These changes are needed to support data=ordered mode handling via
    inodes. This enables us to get rid of the journal heads and buffer
    heads for data buffers in the ordered mode. With the changes, during
    transaction commit we write out the inode pages using
    writepages()/writepage(). That implies we take the page lock during
    transaction commit. This can cause a deadlock with the locking order
    page_lock -> jbd2_journal_start, since jbd2_journal_start can wait
    for the journal_commit to happen and the journal_commit now needs to
    take the page lock. To avoid this deadlock, reverse the locking order.

    Signed-off-by: Jan Kara
    Signed-off-by: Mingming Cao
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • We would like to get notified when we are doing a write on an mmap section.
    This is needed with respect to preallocated areas. We split the preallocated
    area into an initialized extent and an uninitialized extent in the callback.
    This lets us handle ENOSPC better. Otherwise we get ENOSPC in writepage and
    that would result in data loss. The changes are also needed to handle ENOSPC
    when writing to an mmap section of files with holes.

    Acked-by: Jan Kara
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • Update group infos when updating a group's descriptor.
    Add group infos when adding a group's descriptor.
    Refresh cache pages used by mb_alloc when changes occur.
    This will probably need modifications when META_BG resizing is allowed.

    Signed-off-by: Frederic Bohe
    Signed-off-by: Mingming Cao

    Frederic Bohe
     
  • Use the BUFFER_FNS functions (set_buffer_foo) to set buffer
    head state atomically instead of nonatomic __set_bit().
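
    The difference can be illustrated in userspace with C11 atomics. The
    struct sketch_bh type and the sketch_* helpers are hypothetical; they
    are not the kernel's BUFFER_FNS macros, only a model of an atomic
    read-modify-write on a shared state word.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical buffer head with a shared state word. */
struct sketch_bh {
    atomic_ulong state;
};

/* Set one state bit with an atomic read-modify-write, so concurrent
 * updates to other bits in the same word are not lost. */
void sketch_set_bit_atomic(struct sketch_bh *bh, unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_or(&bh->state, 1UL << bit);
}

/* Return 1 if the bit is set, 0 otherwise. */
int sketch_test_bit(struct sketch_bh *bh, unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    return (int)((atomic_load(&bh->state) >> bit) & 1UL);
}
```

    A non-atomic version would load the word, OR in the bit and store it
    back as separate steps, and a concurrent update to another bit between
    the load and the store would be overwritten.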

    Signed-off-by: Eric Sandeen
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     
  • Set sbi->s_journal to NULL after we call journal_destroy(). This
    will be later needed because after journal_destroy() is called,
    ext4_clear_inode() can still be called for some inodes (e.g. root
    inode) and we'll need to detect there that the journal doesn't exist
    anymore.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • mballoc allocation missed the check for blocks reserved for root users.
    Add an ext4_has_free_blocks() check before allocation. Also modify
    ext4_has_free_blocks() to support multiple-block allocation requests.
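
    A check of this kind can be sketched in userspace C. The function
    sketch_has_free_blocks() and its parameters are hypothetical; in
    particular, the single `privileged` flag stands in for whatever rule
    decides who may use the reserved blocks, which this message does not
    spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if a request for `nblocks` blocks may proceed.
 *   free_blocks   - blocks currently free
 *   root_reserved - blocks held back for privileged users
 *   privileged    - whether the caller may dip into the reserve */
bool sketch_has_free_blocks(uint64_t free_blocks, uint64_t root_reserved,
                            uint64_t nblocks, bool privileged)
{
    uint64_t reserve = privileged ? 0 : root_reserved;

    assert(nblocks > 0);
    if (nblocks > free_blocks)
        return false;   /* not enough space for anyone */
    /* Whatever remains after the allocation must still cover the reserve. */
    return free_blocks - nblocks >= reserve;
}
```

    Taking the request size as a parameter is what lets one call cover a
    multiple-block allocation instead of checking one block at a time.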

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao