01 May, 2011

1 commit

  • In the bio completion routine, we should not be setting
    PageUptodate at all -- it's set at sys_write() time, and is
    unaffected by success/failure of the write to disk.

    This can cause a page corruption bug when the file system's
    block size is less than the architecture's VM page size.

    if we have only written a single block -- we might end up
    setting the page's PageUptodate flag, indicating that page
    is completely read into memory, which may not be true.
    This could cause subsequent reads to get bad data.

    This commit also takes the opportunity to clean up error
    handling in ext4_end_bio(), and remove some extraneous code:

    - fixes ext4_end_bio() to set AS_EIO in the
    page->mapping->flags on error, which was left out by
    mistake. This is needed so that fsync() will
    return an error if there was an I/O error.
    - remove the clear_buffer_dirty() call on unmapped
    buffers for each page.
    - consolidate page/buffer error handling in a single
    section.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"
    Reported-by: Jim Meyering
    Reported-by: Hugh Dickins
    Cc: Mingming Cao

    Curt Wohlgemuth
     

26 Mar, 2011

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits)
    ext4: fix a BUG in mb_mark_used during trim.
    ext4: unused variables cleanup in fs/ext4/extents.c
    ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep()
    ext4: add more tracepoints and use dev_t in the trace buffer
    ext4: don't kfree uninitialized s_group_info members
    ext4: add missing space in printk's in __ext4_grp_locked_error()
    ext4: add FITRIM to compat_ioctl.
    ext4: handle errors in ext4_clear_blocks()
    ext4: unify the ext4_handle_release_buffer() api
    ext4: handle errors in ext4_rename
    jbd2: add COW fields to struct jbd2_journal_handle
    jbd2: add the b_cow_tid field to journal_head struct
    ext4: Initialize fsync transaction ids in ext4_new_inode()
    ext4: Use single thread to perform DIO unwritten convertion
    ext4: optimize ext4_bio_write_page() when no extent conversion is needed
    ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
    ext4: use the nblocks arg to ext4_truncate_restart_trans()
    ext4: fix missing iput of root inode for some mount error paths
    ext4: make FIEMAP and delayed allocation play well together
    ext4: suppress verbose debugging information if malloc-debug is off
    ...

    Fi up conflicts in fs/ext4/super.c due to workqueue changes

    Linus Torvalds
     

10 Mar, 2011

1 commit

  • With the plugging now being explicitly controlled by the
    submitter, callers need not pass down unplugging hints
    to the block layer. If they want to unplug, it's because they
    manually plugged on their own - in which case, they should just
    unplug at will.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Mar, 2011

1 commit

  • If no extent conversion is required, wake up any processes waiting for
    the page's writeback to be complete and free the ext4_io_end structure
    directly in ext4_end_bio() instead of dropping it on the linked list
    (which requires taking a spinlock to queue and dequeue the io_end
    structure), and waiting for the workqueue to do this work.

    This removes an extra scheduling delay before process waiting for an
    fsync() to complete gets woken up, and it also reduces the CPU
    overhead for a random write workload.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

28 Feb, 2011

1 commit

  • In ext4_bio_write_page(), if the memory allocation for the struct
    ext4_io_page fails, it returns with the page's PageWriteback flag set.
    This will end up causing the page not to skip writeback in
    WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or
    umount) the writeback daemon will get stuck forever on the
    wait_on_page_writeback() function in write_cache_pages_da().

    Or, if journalling is enabled and the file gets deleted, it the
    journal thread can get stuck in journal_finish_inode_data_buffers()
    call to filemap_fdatawait().

    Another place where things can get hung up is in
    truncate_inode_pages(), called out of ext4_evict_inode().

    Fix this by not setting PageWriteback until after we have successfully
    allocated the struct ext4_io_page.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

22 Feb, 2011

1 commit


12 Feb, 2011

1 commit

  • ext4 has a data corruption case when doing non-block-aligned
    asynchronous direct IO into a sparse file, as demonstrated
    by xfstest 240.

    The root cause is that while ext4 preallocates space in the
    hole, mappings of that space still look "new" and
    dio_zero_block() will zero out the unwritten portions. When
    more than one AIO thread is going, they both find this "new"
    block and race to zero out their portion; this is uncoordinated
    and causes data corruption.

    Dave Chinner fixed this for xfs by simply serializing all
    unaligned asynchronous direct IO. I've done the same here.
    The difference is that we only wait on conversions, not all IO.
    This is a very big hammer, and I'm not very pleased with
    stuffing this into ext4_file_write(). But since ext4 is
    DIO_LOCKING, we need to serialize it at this high level.

    I tried to move this into ext4_ext_direct_IO, but by then
    we have the i_mutex already, and we will wait on the
    work queue to do conversions - which must also take the
    i_mutex. So that won't work.

    This was originally exposed by qemu-kvm installing to
    a raw disk image with a normal sector-63 alignment. I've
    tested a backport of this patch with qemu, and it does
    avoid the corruption. It is also quite a lot slower
    (14 min for package installs, vs. 8 min for well-aligned)
    but I'll take slow correctness over fast corruption any day.

    Mingming suggested that we can track outstanding
    conversions, and wait on those so that non-sparse
    files won't be affected, and I've implemented that here;
    unaligned AIO to nonsparse files won't take a perf hit.

    [tytso@mit.edu: Keep the mutex as a hashed array instead
    of bloating the ext4 inode]

    [tytso@mit.edu: Fix up namespace issues so that global
    variables are protected with an "ext4_" prefix.]

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Eric Sandeen
     

08 Feb, 2011

1 commit

  • This fixes a corruption problem with the multi-block
    writepages submittal change for ext4, from commit
    bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc ("ext4: use bio
    layer instead of buffer layer in mpage_da_submit_io").

    (Note that this corruption is not present in 2.6.37 on
    ext4, because the corruption was detected after the
    feature was merged in 2.6.37-rc1, and so it was turned
    off by adding a non-default mount option,
    mblk_io_submit. With this commit, which hopefully
    fixes the last of the bugs with this feature, we'll be
    able to turn on this performance feature by default in
    2.6.38, and remove the mblk_io_submit option.)

    The ext4 code path to bundle multiple pages for
    writeback in ext4_bio_write_page() had a bug: we should
    be clearing buffer head dirty flags *before* we submit
    the bio, not in the completion routine.

    The patch below was tested on 2.6.37 under KVM with the
    postgresql script which was submitted by Jon Nelson as
    documented in commit 1449032be1.

    Without the patch, I'd hit the corruption problem about
    50-70% of the time. With the patch, I executed the
    script > 100 times with no corruption seen.

    I also fixed a bug to make sure ext4_end_bio() doesn't
    dereference the bio after the bio_put() call.

    Reported-by: Jon Nelson
    Reported-by: Matthias Bayer
    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Curt Wohlgemuth
     

11 Jan, 2011

1 commit


20 Dec, 2010

1 commit

  • Use advantage of kmem_cache_zalloc() to remove a memset() call in
    ext4_init_io_end() and save a few bytes.

    Before:
    [jj@dragon linux-2.6]$ size fs/ext4/page-io.o
    text data bss dec hex filename
    3016 0 624 3640 e38 fs/ext4/page-io.o
    After:
    [jj@dragon linux-2.6]$ size fs/ext4/page-io.o
    text data bss dec hex filename
    3000 0 624 3624 e28 fs/ext4/page-io.o

    Signed-off-by: Jesper Juhl
    Signed-off-by: "Theodore Ts'o"

    Jesper Juhl
     

18 Nov, 2010

1 commit

  • ext4_end_bio calls put_page and kmem_cache_free before calling
    SetPageUpdate(). This can result in setting the PageUptodate bit on
    random pages and causes the following BUG:

    BUG: Bad page state in process rm pfn:52e54
    page:ffffea0001222260 count:0 mapcount:0 mapping: (null) index:0x0
    arch kernel: page flags: 0x4000000000000008(uptodate)

    Fix the problem by moving put_io_page() after the SetPageUpdate() call.

    Thanks to Hugh Dickins for analyzing this problem.

    Reported-by: Markus Trippelsdorf
    Tested-by: Markus Trippelsdorf
    Signed-off-by: Markus Trippelsdorf
    Signed-off-by: "Theodore Ts'o"

    Markus Trippelsdorf
     

09 Nov, 2010

2 commits

  • Use an atomic_t and make sure we don't free the structure while we
    might still be submitting I/O for that page.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • The following BUG can occur when an inode which is getting freed when
    it still has dirty pages outstanding, and it gets deleted (in this
    because it was the target of a rename). In ordered mode, we need to
    make sure the data pages are written just in case we crash before the
    rename (or unlink) is committed. If the inode is being freed then
    when we try to igrab the inode, we end up tripping the BUG_ON at
    fs/ext4/page-io.c:146.

    To solve this problem, we need to keep track of the number of io
    callbacks which are pending, and avoid destroying the inode until they
    have all been completed. That way we don't have to bump the inode
    count to keep the inode from being destroyed; an approach which
    doesn't work because the count could have already been dropped down to
    zero before the inode writeback has started (at which point we're not
    allowed to bump the count back up to 1, since it's already started
    getting freed).

    Thanks to Dave Chinner for suggesting this approach, which is also
    used by XFS.

    kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
    Call Trace:
    [] ext4_bio_write_page+0x172/0x307
    [] mpage_da_submit_io+0x2f9/0x37b
    [] mpage_da_map_and_submit+0x2cc/0x2e2
    [] mpage_add_bh_to_extent+0xc6/0xd5
    [] write_cache_pages_da+0x2a4/0x3ac
    [] ext4_da_writepages+0x2d6/0x44d
    [] do_writepages+0x1c/0x25
    [] __filemap_fdatawrite_range+0x4b/0x4d
    [] filemap_fdatawrite_range+0xe/0x10
    [] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
    [] ext4_evict_inode+0x57/0x24c
    [] evict+0x22/0x92
    [] iput+0x212/0x249
    [] dentry_iput+0xa1/0xb9
    [] d_kill+0x3d/0x5d
    [] dput+0x13a/0x147
    [] sys_renameat+0x1b5/0x258
    [] ? _atomic_dec_and_lock+0x2d/0x4c
    [] ? cp_new_stat+0xde/0xea
    [] ? sys_newlstat+0x2d/0x38
    [] sys_rename+0x16/0x18
    [] system_call_fastpath+0x16/0x1b

    Reported-by: Nick Bowler
    Signed-off-by: "Theodore Ts'o"
    Tested-by: Nick Bowler

    Theodore Ts'o
     

28 Oct, 2010

2 commits