14 Dec, 2011

1 commit


13 Dec, 2011

1 commit

  • Commit 1939dd84b3 ("ext4: cleanup ext4_ext_grow_indepth code") added a
    reference to ext4_extent_header.eh_depth, but forget to pass the value
    read through le16_to_cpu. The result is a crash on big-endian
    machines, such as this crash on a POWER7 server:

    attempt to access beyond end of device
    sda8: rw=0, want=776392648163376, limit=168558560
    Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6bcb
    Faulting instruction address: 0xc0000000001f5f38
    cpu 0x14: Vector: 300 (Data Access) at [c000001bd1aaecf0]
    pc: c0000000001f5f38: .__brelse+0x18/0x60
    lr: c0000000002e07a4: .ext4_ext_drop_refs+0x44/0x80
    sp: c000001bd1aaef70
    msr: 9000000000009032
    dar: 6b6b6b6b6b6b6bcb
    dsisr: 40000000
    current = 0xc000001bd15b8010
    paca = 0xc00000000ffe4600
    pid = 19911, comm = flush-8:0
    enter ? for help
    [c000001bd1aaeff0] c0000000002e07a4 .ext4_ext_drop_refs+0x44/0x80
    [c000001bd1aaf090] c0000000002e0c58 .ext4_ext_find_extent+0x408/0x4c0
    [c000001bd1aaf180] c0000000002e145c .ext4_ext_insert_extent+0x2bc/0x14c0
    [c000001bd1aaf2c0] c0000000002e3fb8 .ext4_ext_map_blocks+0x628/0x1710
    [c000001bd1aaf420] c0000000002b2974 .ext4_map_blocks+0x224/0x310
    [c000001bd1aaf4d0] c0000000002b7f2c .mpage_da_map_and_submit+0xbc/0x490
    [c000001bd1aaf5a0] c0000000002b8688 .write_cache_pages_da+0x2c8/0x430
    [c000001bd1aaf720] c0000000002b8b28 .ext4_da_writepages+0x338/0x670
    [c000001bd1aaf8d0] c000000000157280 .do_writepages+0x40/0x90
    [c000001bd1aaf940] c0000000001ea830 .writeback_single_inode+0xe0/0x530
    [c000001bd1aafa00] c0000000001eb680 .writeback_sb_inodes+0x210/0x300
    [c000001bd1aafb20] c0000000001ebc84 .__writeback_inodes_wb+0xd4/0x140
    [c000001bd1aafbe0] c0000000001ebfec .wb_writeback+0x2fc/0x3e0
    [c000001bd1aafce0] c0000000001ed770 .wb_do_writeback+0x2f0/0x300
    [c000001bd1aafdf0] c0000000001ed848 .bdi_writeback_thread+0xc8/0x340
    [c000001bd1aafed0] c0000000000c5494 .kthread+0xb4/0xc0
    [c000001bd1aaff90] c000000000021f48 .kernel_thread+0x54/0x70

    This is due to getting ext_depth(inode) == 0x101 and therefore running
    off the end of the path array in ext4_ext_drop_refs into following
    unallocated structures.

    This fixes it by adding the necessary le16_to_cpu.

    Signed-off-by: Paul Mackerras
    Signed-off-by: "Theodore Ts'o"

    Paul Mackerras
     

02 Nov, 2011

2 commits


01 Nov, 2011

2 commits

  • If an fallocate request fits in EXT_UNINIT_MAX_LEN, then set the
    EXT4_GET_BLOCKS_NO_NORMALIZE flag. For larger fallocate requests,
    let mballoc.c normalize the request.

    This fixes a problem where large requests were being split into
    non-contiguous extents due to commit 556b27abf73: ext4: do not
    normalize block requests from fallocate.

    Testing:
    *) Checked that 8.x MB falloc'ed files are still laid down next to
    each other (contiguously).
    *) Checked that the maximum size extent (127.9MB) is allocated as 1
    extent.
    *) Checked that a 1GB file is somewhat contiguous (often 5-6
    non-contiguous extents now).
    *) Checked that a 120MB file can still be falloc'ed even if there are
    no single extents large enough to hold it.

    Signed-off-by: Greg Harm
    Signed-off-by: "Theodore Ts'o"

    Greg Harm
     
  • EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten
    should be done simultaneously since ext4_end_io_nolock always clear
    the flag and decrease the counter in the same time.

    We have found some bugs that the flag is set while leaving
    i_aiodio_unwritten unchanged(commit 32c80b32c053d). So this patch just tries
    to create a helper function to wrap them to avoid any future bug.
    The idea is inspired by Eric.

    Cc: Eric Sandeen
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"

    Tao Ma
     

29 Oct, 2011

3 commits


27 Oct, 2011

2 commits

  • ext4_ext_insert_extent() (respectively ext4_ext_insert_index())
    was using EXT_MAX_EXTENT() (resp. EXT_MAX_INDEX()) to determine
    how many entries needed to be moved beyond the insertion point.
    In practice this means that (320 - I) * 24 bytes were memmove()'d
    when I is the insertion point, rather than (#entries - I) * 24 bytes.

    This patch uses EXT_LAST_EXTENT() (resp. EXT_LAST_INDEX()) instead
    to only move existing entries. The code flow is also simplified
    slightly to highlight similarities and reduce code duplication in
    the insertion logic.

    This patch reduces system CPU consumption by over 25% on a 4kB
    synchronous append DIO write workload when used with the
    pre-2.6.39 x86_64 memmove() implementation. With the much faster
    2.6.39 memmove() implementation we still see a decrease in
    system CPU usage between 2% and 7%.

    Note that the ext_debug() output changes with this patch, splitting
    some log information between entries. Users of the ext_debug() output
    should note that the "move %d" units changed from reporting the number
    of bytes moved to reporting the number of entries moved.

    Signed-off-by: Eric Gouriou
    Signed-off-by: "Theodore Ts'o"

    Eric Gouriou
     
  • This patch introduces a fast path in ext4_ext_convert_to_initialized()
    for the case when the conversion can be performed by transferring
    the newly initialized blocks from the uninitialized extent into
    an adjacent initialized extent. Doing so removes the expensive
    invocations of memmove() which occur during extent insertion and
    the subsequent merge.

    In practice this should be the common case for clients performing
    append writes into files pre-allocated via
    fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
    direct IO and when using a suboptimal implementation of memmove()
    (x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
    consumption by 32%.

    Two new trace points are added to ext4_ext_convert_to_initialized()
    to offer visibility into its operations. No exit trace point has
    been added due to the multiplicity of return points. This can be
    revisited once the upstream cleanup is backported.

    Signed-off-by: Eric Gouriou
    Signed-off-by: "Theodore Ts'o"

    Eric Gouriou
     

26 Oct, 2011

3 commits

  • When we want to convert the unitialized extent in direct write, we can
    either do it in ext4_end_io_nolock(AIO case) or in
    ext4_ext_direct_IO(non AIO case) and EXT4_I(inode)->cur_aio_dio is a
    guard for ext4_ext_map_blocks to find the right case. In e9e3bcecf,
    we mistakenly change it by:

    - if (io)
    + if (io && !(io->flag & EXT4_IO_END_UNWRITTEN)) {
    io->flag = EXT4_IO_END_UNWRITTEN;
    - else
    + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
    + } else
    ext4_set_inode_state(inode,
    EXT4_STATE_DIO_UNWRITTEN);

    So now if we map 2 blocks, and the first one set the
    EXT_IO_END_UNWRITTEN, the 2nd mapping will set inode state because of
    the check for the flag. This is wrong.

    Cc: Eric Sandeen
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"

    Tao Ma
     
  • In ext4_ext_next_allocated_block(), the path[depth] might
    have a p_ext that is NULL -- see ext4_ext_binsearch(). In
    such a case, dereferencing it will crash the machine.

    This patch checks for p_ext == NULL in
    ext4_ext_next_allocated_block() before dereferencinging it.

    Tested using a hand-crafted an inode with eh_entries == 0 in
    an extent block, verified that running FIEMAP on it crashes
    without this patch, works fine with it.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"

    Curt Wohlgemuth
     
  • When allocated is unsigned it breaks the error handling at the end
    of the function when we call:
    allocated = ext4_split_extent(...);
    if (allocated < 0)
    err = allocated;

    I've made it a signed int instead of unsigned.

    Signed-off-by: Dan Carpenter
    Signed-off-by: "Theodore Ts'o"

    Dan Carpenter
     

25 Oct, 2011

2 commits

  • EOFBLOCK_FL should be updated if called w/o FALLOCATE_FL_KEEP_SIZE
    Currently it happens only if new extent was allocated.

    TESTCASE:
    fallocate test_file -n -l4096
    fallocate test_file -l4096
    Last fallocate cmd has updated size, but keept EOFBLOCK_FL set. And
    fsck will complain about that.

    Also remove ping pong in ext4_fallocate() in case of new extents,
    where ext4_ext_map_blocks() clear EOFBLOCKS bit, and later
    ext4_falloc_update_inode() restore it again.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • - Both callers(truncate and punch_hole) already aligned left end point
    so we no longer need split logic here.
    - Remove dead duplicated code.
    - Call ext4_ext_dirty only after we have updated eh_entries, otherwise
    we'll loose entries update. Regression caused by d583fb87a3ff0
    266'th testcase in xfstests (http://patchwork.ozlabs.org/patch/120872)

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

22 Oct, 2011

1 commit

  • Currently code make an impression what grow procedure is very complicated
    and some mythical paths, blocks are involved. But in fact grow in depth
    it relatively simple procedure:
    1) Just create new meta block and copy root data to that block.
    2) Convert root from extent to index if old depth == 0
    3) Update root block pointer

    This patch does:
    - Reorganize code to make it more self explanatory
    - Do not pass path parameter to new_meta_block() in order to
    provoke allocation from inode's group because top-level block
    should site closer to it's inode, but not to leaf data block.

    [ This happens anyway, due to logic in mballoc; we should drop
    the path parameter from new_meta_block() entirely. -- tytso ]

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

18 Oct, 2011

1 commit


17 Oct, 2011

1 commit


09 Oct, 2011

2 commits


10 Sep, 2011

5 commits

  • Currently, there exists a race between delayed allocated writes and
    the writeback when bigalloc feature is in use. The race was because we
    wanted to determine what blocks in a cluster are under delayed
    allocation and we were using buffer_delayed(bh) check for it. But, the
    writeback codepath clears this bit without any synchronization which
    resulted in a race and an ext4 warning similar to:

    EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0
    reserved data blocks

    The race existed in two places.
    (1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from
    writeback code path.
    (2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where
    buffer_delayed(bh) is set.

    To fix (1), this patch introduces a new buffer_head state bit -
    BH_Da_Mapped. This bit is set under the protection of
    EXT4_I(inode)->i_data_sem when we have actually mapped the delayed
    allocated blocks during the writeout time. We can now reliably check
    for this bit inside ext4_find_delalloc_range() to determine whether
    the reservation for the blocks have already been claimed or not.

    To fix (2), it was necessary to set buffer_delay(bh) under the
    protection of i_data_sem. So, I extracted the very beginning of
    ext4_map_blocks into a new function - ext4_da_map_blocks() - and
    performed the required setting of bh_delay bit and the quota
    reservation under the protection of i_data_sem. These two fixes makes
    the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent,
    thus removing the race.

    Tested: I was able to reproduce the problem by running 'dd' and
    'fsync' in parallel. Also, xfstests sometimes used to reproduce this
    race. After the fix both my test and xfstests were successful and no
    race (warning message) was observed.

    Google-Bug-Id: 4997027

    Signed-off-by: Aditya Kali
    Signed-off-by: "Theodore Ts'o"

    Aditya Kali
     
  • This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
    ext4/inode.c.

    Tested: Built and ran the kernel and verified that these tracepoints work.
    Also ran xfstests.

    Signed-off-by: Aditya Kali
    Signed-off-by: "Theodore Ts'o"

    Aditya Kali
     
  • With bigalloc changes, the i_blocks value was not correctly set (it was still
    set to number of blocks being used, but in case of bigalloc, we want i_blocks
    to represent the number of clusters being used). Since the quota subsystem sets
    the i_blocks value, this patch fixes the quota accounting and makes sure that
    the i_blocks value is set correctly.

    Signed-off-by: Aditya Kali
    Signed-off-by: "Theodore Ts'o"

    Aditya Kali
     
  • When we are truncating (as opposed unlinking) a file, we need to worry
    about partial truncates of a file, especially in the light of sparse
    files. The changes here make sure that arbitrary truncates of sparse
    files works correctly. Yeah, it's messy.

    Note that these functions will need to be revisted when the punch
    ioctl is integrated --- in fact this commit will probably have merge
    conflicts with the punch changes which Allison Henders and the IBM LTC
    have been working on. I will need to fix this up when either patch
    hits mainline.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • If we need to allocate a new block in ext4_ext_map_blocks(), the
    function needs to see if the cluster has already been allocated.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

07 Sep, 2011

1 commit

  • While running extended fsx tests to verify the first
    two patches, a similar bug was also found in the
    truncate operation.

    This bug happens because the truncate routine only zeros
    the unblock aligned portion of the last page. This means
    that the block aligned portions of the page appearing after
    i_size are left unzeroed, and the buffer heads still mapped.

    This bug is corrected by using ext4_discard_partial_page_buffers
    in the truncate routine to zero the partial page and unmap
    the buffer headers.

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"

    Allison Henderson
     

04 Sep, 2011

1 commit

  • Add debugging information in case jbd2_journal_dirty_metadata() is
    called with a buffer_head which didn't have
    jbd2_journal_get_write_access() called on it, or if the journal_head
    has the wrong transaction in it. In addition, return an error code.
    This won't change anything for ocfs2, which will BUG_ON() the non-zero
    exit code.

    For ext4, the caller of this function is ext4_handle_dirty_metadata(),
    and on seeing a non-zero return code, will call __ext4_journal_stop(),
    which will print the function and line number of the (buggy) calling
    function and abort the journal. This will allow us to recover instead
    of bug halting, which is better from a robustness and reliability
    point of view.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

03 Sep, 2011

2 commits

  • This patch fixes a second punch hole bug found by xfstests 127.

    This bug happens because punch hole needs to flush the pages
    of the hole to avoid race conditions. But if the end of the
    hole is in the same page as i_size, the buffer heads beyond
    i_size need to be unmapped and the page needs to be zeroed
    after it is flushed.

    To correct this, the new ext4_discard_partial_page_buffers
    routine is used to zero and unmap the partial page
    beyond i_size if the end of the hole appears in the same
    page as i_size.

    The code has also been optimized to set the end of the hole
    to the page after i_size if the specified hole exceeds i_size,
    and the code that flushes the pages has been simplified.

    Signed-off-by: Allison Henderson

    Allison Henderson
     
  • This patch addresses a bug found by xfstests 75, 112, 127
    when blocksize = 1k

    This bug happens because the punch hole code only zeros
    out non block aligned regions of the page. This means that if the
    blocks are smaller than a page, then the block aligned regions of
    the page inside the hole are left un-zeroed, and their buffer heads
    are still mapped. This bug is corrected by using
    ext4_discard_partial_page_buffers to properly zero the partial page
    at the head and tail of the hole, and unmap the corresponding buffer
    heads

    This patch also addresses a bug reported by Lukas while working on a
    new patch to add discard support for loop devices using punch hole.
    The bug happened because of the first and last block number
    needed to be cast to a larger data type before calculating the
    byte offset, but since now we only need the byte offsets of the
    pages, we no longer even need to be calculating the byte offsets
    of the blocks. The code to do the block offset calculations is
    removed in this patch.

    Signed-off-by: Allison Henderson

    Allison Henderson
     

28 Jul, 2011

2 commits

  • The logical block number in map.l_blk is a __u32, and so before we
    shift it left, by the block size, we neeed cast it to a 64-bit size.

    Otherwise i_size can be corrupted on an ENOSPC.

    # df -T /mnt/mp1
    Filesystem Type 1K-blocks Used Available Use% Mounted on
    /dev/sda6 ext4 9843276 153056 9190200 2% /mnt/mp1
    # fallocate -o 0 -l 2199023251456 /mnt/mp1/testfile
    fallocate: /mnt/mp1/testfile: fallocate failed: No space left on device
    # stat /mnt/mp1/testfile
    File: `/mnt/mp1/testfile'
    Size: 4293656576 Blocks: 19380440 IO Block: 4096 regular file
    Device: 806h/2054d Inode: 12 Links: 1
    Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
    Access: 2011-07-25 13:01:31.414490496 +0900
    Modify: 2011-07-25 13:01:31.414490496 +0900
    Change: 2011-07-25 13:01:31.454490495 +0900

    Signed-off-by: Utako Kusaka
    Signed-off-by: "Theodore Ts'o"
    --
    fs/ext4/extents.c | 2 +-
    1 files changed, 1 insertions(+), 1 deletions(-)

    Utako Kusaka
     
  • The old function ext4_ext_rm_idx is used only for truncate case
    because it just remove last index in extent-index-block. When punching
    hole, it usually needed to remove "middle" index, therefore we must
    move indexes which after it forward.

    (I create a file with 1 depth extent tree and punch hole in the middle
    of it, the last index in index-block strangly gone, so I find out this
    bug)

    Signed-off-by: Robin Dong
    Signed-off-by: "Theodore Ts'o"

    Robin Dong
     

24 Jul, 2011

3 commits


18 Jul, 2011

4 commits

  • If eh_entries is equal to (or greater than) eh_max, the operation of
    inserting new extent_idx will make number of entries overflow.
    So check eh_entries before inserting the new extent_idx.

    Although there is no bug case according the code (function
    ext4_ext_insert_index is called by ext4_ext_split and ext4_ext_split
    is called only if the index block has free space), the right logic
    should be "lookup the capacity before insertion".

    Signed-off-by: Robin Dong
    Signed-off-by: "Theodore Ts'o"

    Robin Dong
     
  • This patch avoids an extraneous lookup of the extent cache
    in ext4_ext_map_blocks() when the flag
    EXT4_GET_BLOCKS_PUNCH_OUT_EXT is absent.

    The existing logic was performing the lookup but not making
    use of the result. The patch simply reverses the order of evaluation
    in the condition.

    Since ext4_ext_in_cache() does not initialize newex on misses, bypassing
    its invocation does not introduce any new issue in this regard.

    Signed-off-by: Robin Dong
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Lukas Czerner
    Reviewed-by: Eric Gouriou

    Robin Dong
     
  • This patch removes the extra parameter in ext4_ext_remove_space()
    which is no longer needed.

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"

    Allison Henderson
     
  • This patch optimizes the punch hole operation by skipping the
    tree walking code that is used by truncate. Since punch hole
    is done through map blocks, the path to the extent is already
    known in this function, so we do not need to look it up again.

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"

    Allison Henderson
     

12 Jul, 2011

1 commit