06 Jun, 2011

4 commits

  • While creating fixed tracepoints for ext3, basically by porting them
    from ext4, I found a lot of useless retyping, wrong type usage, useless
    variable passing and other inconsistencies in the ext4 fixed tracepoint
    code.

    This patch cleans up the fixed tracepoint code for ext4 and also
    simplifies some of the tracepoints (a generic tracepoint skeleton is
    sketched after this entry for reference).

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
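
    For readers unfamiliar with fixed tracepoints, below is a minimal
    TRACE_EVENT skeleton of the kind this cleanup touches (such
    definitions normally live in a header under include/trace/events/).
    The event name and fields are illustrative assumptions, not taken
    from the patch; the "type usage" the commit refers to is the field
    typing in TP_STRUCT__entry and the casts in TP_printk.

    #include <linux/fs.h>
    #include <linux/kdev_t.h>
    #include <linux/tracepoint.h>

    /* Illustrative only: a small, correctly typed fixed tracepoint. */
    TRACE_EVENT(ext4_example_event,
            TP_PROTO(struct inode *inode, unsigned long block),
            TP_ARGS(inode, block),

            TP_STRUCT__entry(
                    __field(dev_t, dev)             /* keep dev_t, do not retype */
                    __field(ino_t, ino)             /* keep ino_t                */
                    __field(unsigned long, block)
            ),

            TP_fast_assign(
                    __entry->dev   = inode->i_sb->s_dev;
                    __entry->ino   = inode->i_ino;
                    __entry->block = block;
            ),

            TP_printk("dev %d,%d ino %lu block %lu",
                      MAJOR(__entry->dev), MINOR(__entry->dev),
                      (unsigned long) __entry->ino, __entry->block)
    );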
     
  • Currently we do not mark the extent as the last one
    (FIEMAP_EXTENT_LAST) if there is a hole at the end of the file,
    because we simply do not check for it and continue searching for the
    next extent. But by the time we hit the hole at the end of the file,
    it is too late.

    This commit adds a check for the allocated block in the subsequent
    extent; if there are no more extents (block == EXT_MAX_BLOCKS), the
    current extent is flagged as the last one.

    This behaviour was spotted unintentionally by xfstest 252, which
    hangs because of a wrong loop condition (see the FIEMAP walker
    sketched after this entry); on other filesystems (such as xfs) the
    test exits anyway, because it notices the last-extent flag.

    With this patch xfstest 252 no longer hangs. The ext4 fiemap
    implementation still reports a bad extent type in some cases, but
    that appears to be a different issue.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
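
    To illustrate the flag from user space, here is a small FIEMAP walker
    (an illustration of mine, not part of the patch): it keeps querying
    extents until it sees FIEMAP_EXTENT_LAST, which is exactly the loop
    pattern that spins forever when the filesystem never sets the flag on
    the final extent.

    /* fiemap_walk.c - walk extents, stopping only at FIEMAP_EXTENT_LAST */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
            struct fiemap *fm;
            __u64 start = 0;
            int fd, last = 0;

            if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                    return 1;

            /* room for 32 extents per FIEMAP call */
            fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
            if (!fm)
                    return 1;

            while (!last) {
                    fm->fm_start = start;
                    fm->fm_length = ~0ULL;
                    fm->fm_flags = FIEMAP_FLAG_SYNC;
                    fm->fm_extent_count = 32;

                    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0 || !fm->fm_mapped_extents)
                            break;

                    for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                            struct fiemap_extent *fe = &fm->fm_extents[i];

                            printf("logical %llu len %llu flags 0x%x\n",
                                   (unsigned long long) fe->fe_logical,
                                   (unsigned long long) fe->fe_length,
                                   fe->fe_flags);
                            /* without FIEMAP_EXTENT_LAST this never terminates */
                            if (fe->fe_flags & FIEMAP_EXTENT_LAST)
                                    last = 1;
                            start = fe->fe_logical + fe->fe_length;
                    }
            }
            free(fm);
            close(fd);
            return 0;
    }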
     
  • Kazuya Mio reported that he was able to hit BUG_ON(next == lblock)
    in ext4_ext_put_gap_in_cache() while creating a sparse file in extent
    format and filling the tail of the file up to its end. We hit the
    BUG_ON when we write the last block (2^32-1) into the sparse file.

    The root cause of the problem is that we specifically set s_maxbytes
    so that the block at s_maxbytes fits into the on-disk extent format,
    which is 32 bits long. However, we do not store start and end block
    numbers, but rather a start block number and a length in blocks. This
    means that in order to cover an extent from 0 to EXT_MAX_BLOCK we
    need EXT_MAX_BLOCK+1 to fit into the length (because block 0 is
    counted as well) - and it does not (see the illustration after this
    entry).

    The only way to fix this without changing the meaning of the struct
    ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes
    by one filesystem block, so that we can cover the whole extent that
    the on-disk extent format is able to express.

    Also, in many places EXT_MAX_BLOCK is used as a length rather than as
    the maximum logical block number the name suggests, which is all a
    bit messy. So this commit renames it to EXT_MAX_BLOCKS and changes
    its usage in some places to actually mean the maximum number of
    blocks in the extent.

    The bug which this commit fixes can be reproduced as follows:

    dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-2))
    sync
    dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-1))

    Reported-by: Kazuya Mio
    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
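
    The arithmetic problem can be shown in isolation; the constant below
    mirrors the 32-bit logical block space described above (a standalone
    illustration, not the kernel code):

    #include <stdint.h>
    #include <stdio.h>

    #define EXT_MAX_BLOCKS 0xffffffffU      /* largest 32-bit logical block */

    int main(void)
    {
            uint32_t first = 0;
            uint32_t last = EXT_MAX_BLOCKS;
            uint32_t len = last - first + 1;        /* needs 2^32, wraps to 0 */

            printf("blocks covered: %u\n", len);    /* prints 0, not 4294967296 */
            return 0;
    }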
     
  • "metadata" is no longer a parameter of ext4_free_blocks().

    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     

27 May, 2011

3 commits

  • Tell the filesystem whether we just updated the timestamp
    (I_DIRTY_SYNC) or something else, so that the filesystem can track
    internally whether it needs to push out a transaction for fdatasync
    or not (a sketch of a flag-aware ->dirty_inode follows this entry).

    This is just the prototype change with no user for it yet. I plan
    to push large XFS changes for the next merge window, and getting
    this trivial infrastructure in this window would help a lot to avoid
    tree interdependencies.

    Also remove the incorrect comments saying that ->dirty_inode can't
    block. That changed a long time ago, and many implementations rely
    on being able to block.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
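
    A minimal sketch of how a filesystem might use the new flags
    argument; the filesystem name and its journalling helper below are
    hypothetical, and only the ->dirty_inode prototype and I_DIRTY_SYNC
    come from the change described above.

    #include <linux/fs.h>

    /* hypothetical helper of a hypothetical "examplefs" */
    extern void examplefs_journal_dirty_inode(struct inode *inode);

    static void examplefs_dirty_inode(struct inode *inode, int flags)
    {
            /* Only a timestamp update: no transaction needed for fdatasync. */
            if (flags == I_DIRTY_SYNC)
                    return;

            examplefs_journal_dirty_inode(inode);
    }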
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
    xen: cleancache shim to Xen Transcendent Memory
    ocfs2: add cleancache support
    ext4: add cleancache support
    btrfs: add cleancache support
    ext3: add cleancache support
    mm/fs: add hooks to support cleancache
    mm: cleancache core ops functions and config
    fs: add field to superblock to support cleancache
    mm/fs: cleancache documentation

    Fix up trivial conflict in fs/btrfs/extent_io.c due to includes

    Linus Torvalds
     
  • This seventh patch of eight in the cleancache series opts ext4 in to
    cleancache. Filesystems must explicitly enable cleancache by calling
    cleancache_init_fs whenever an instance of the filesystem is mounted
    (as sketched after this entry). For ext4, all other cleancache hooks
    are in the VFS layer, including the matching cleancache_flush_fs
    hook, which must be called on unmount.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    [v6-v8: no changes]
    [v5: jeremy@goop.org: simplify init hook and any future fs init changes]
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Andreas Dilger
    Cc: Ted Ts'o
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Chris Mason
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Nitin Gupta

    Dan Magenheimer
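
    A minimal sketch of what the opt-in looks like, assuming the hook is
    called somewhere on the ext4 mount path (the exact location in
    fs/ext4/super.c is not shown in this summary; example_fill_super is a
    stand-in):

    #include <linux/fs.h>
    #include <linux/cleancache.h>

    static int example_fill_super(struct super_block *sb)
    {
            /* ... the usual mount-time setup ... */

            /* Opt this superblock in to cleancache; effectively a no-op
             * when no cleancache backend (e.g. Xen tmem) is registered. */
            cleancache_init_fs(sb);
            return 0;
    }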
     

26 May, 2011

2 commits


25 May, 2011

12 commits

  • Currently, an fallocate request of a size slightly larger than a
    power of 2 is turned into two block requests, each a power of 2, with
    the extra blocks pre-allocated for future use. When an application
    calls fallocate, it already has an idea of how large the file may
    grow (see the example after this entry), so there is usually little
    benefit in reserving extra blocks on the preallocation list. Dropping
    the extra preallocation reduces disk fragmentation.

    Tested: fsstress. Also verified manually that fallocat'ed files are
    contiguously laid out with this change (whereas without it they begin at
    power-of-2 boundaries, leaving blocks in between). CPU usage of
    fallocate is not appreciably higher. In a tight fallocate loop, CPU
    usage hovers between 5%-8% with this change, and 5%-7% without it.

    Using a simulated file system aging program that fills the file
    system to 70%, the percentage of free extents larger than 8MB (as
    measured by e2freefrag) increased from 38.8% without this change to
    69.4% with it.

    Signed-off-by: Vivek Haldar
    Signed-off-by: "Theodore Ts'o"

    Vivek Haldar
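
    For context, this is the kind of request the allocator reacts to: the
    caller already states the size it expects up front (an illustrative
    user-space snippet, not part of the patch).

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Reserve 5 MiB (just over a power of two) for a file whose size the
     * caller already knows, so rounding the request up buys little. */
    int reserve_space(const char *path)
    {
            int fd = open(path, O_WRONLY | O_CREAT, 0644);

            if (fd < 0)
                    return -1;
            if (fallocate(fd, 0, 0, 5 * 1024 * 1024) < 0) {
                    close(fd);
                    return -1;
            }
            return fd;
    }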
     
  • This patch adds the new routines ext4_punch_hole,
    ext4_ext_punch_hole and ext4_ext_check_cache.

    fallocate has been modified to call ext4_punch_hole when the punch hole
    flag is passed. At the moment, we only support punching holes in
    extents, so this routine is pretty much a wrapper for the ext4_ext_punch_hole
    routine.

    The ext4_ext_punch_hole routine first completes all outstanding
    writes to the associated pages and then releases them. The
    non-block-aligned data at the edges of the hole is zeroed, and all
    whole blocks in between are punched out (a user-space view of the
    request is sketched after this entry).

    The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache
    except that it accepts an ext4_ext_cache parameter instead of an
    ext4_extent parameter. This routine is used by ext4_ext_punch_hole to
    check whether a block in the hole has been cached. The ext4_ext_cache
    parameter is necessary because the members of the ext4_extent
    structure are not large enough to hold a 32-bit value. The existing
    ext4_ext_in_cache routine has become a wrapper around this new
    function.

    [ext4 punch hole patch series 5/5 v7]

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Mingming Cao

    Allison Henderson
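
    From user space, the request that ends up in these routines looks
    like the following (an illustrative snippet; FALLOC_FL_PUNCH_HOLE
    must be combined with FALLOC_FL_KEEP_SIZE, so i_size is left
    unchanged):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>

    /* Deallocate the byte range [offset, offset + len) of an open file. */
    int punch_hole(int fd, off_t offset, off_t len)
    {
            return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             offset, len);
    }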
     
  • This patch adds a new flag to ext4_map_blocks() that specifies the
    given range of blocks should be punched out. Extents are first
    converted to uninitialized extents before they are punched
    out. Because punching a hole may require that the extent be split, it
    is possible that the split needs more blocks than are available. To
    deal with this, the use of reserved blocks is enabled to allow the
    split to proceed.

    The routine then returns the number of blocks successfully
    punched out.

    [ext4 punch hole patch series 4/5 v7]

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Mingming Cao

    Allison Henderson
     
  • This patch modifies the truncate routines to support hole punching.
    Below is a brief summary of the patch's changes:

    - Added an end parameter to ext4_ext_rm_leaf
    This function has been modified to accept an end parameter,
    which enables it to punch holes in leaves instead of just
    truncating them.

    - Implemented the "remove head" case in the ext4_remove_blocks routine
    This routine is used by ext4_ext_rm_leaf to remove the tail
    of an extent during a truncate. The new ext4_ext_rm_leaf
    routine will now also use it to remove the head of an extent in the
    case that the hole covers a region of blocks at the beginning
    of an extent.

    - Added "end" param to ext4_ext_remove_space routine
    This function has been modified to accept a stop parameter, which
    is passed through to ext4_ext_rm_leaf.

    [ext4 punch hole patch series 3/5 v6]

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"

    Allison Henderson
     
  • This patch adds a new length parameter to the existing
    ext4_block_truncate_page() function, which is used by the truncate
    code path to zero out block-unaligned data, and renames it to
    ext4_block_zero_page_range(). This function can now be used to zero
    out the head of a block, the tail of a block, or the middle of a
    block (the resulting wrapper relationship is sketched after this
    entry).

    The ext4_block_truncate_page() function is now a wrapper to
    ext4_block_zero_page_range().

    [ext4 punch hole patch series 2/5 v7]

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Mingming Cao

    Allison Henderson
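
    A sketch of the wrapper relationship described above; the argument
    names and the length computation are reconstructed for illustration,
    not copied from the patch.

    /* Zero from "from" to the end of the containing fs block; the old
     * entry point is now just a special case of the range version. */
    int ext4_block_truncate_page(handle_t *handle,
                                 struct address_space *mapping, loff_t from)
    {
            struct inode *inode = mapping->host;
            unsigned int blocksize = inode->i_sb->s_blocksize;
            unsigned int length = blocksize - (from & (blocksize - 1));

            return ext4_block_zero_page_range(handle, mapping, from, length);
    }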
     
  • This patch adds an allocation request flag to the ext4_has_free_blocks
    function which enables the use of reserved blocks. This will allow a
    punch hole to proceed even if the disk is full. Punching a hole may
    require additional blocks to first split the extents.

    Because ext4_has_free_blocks is a low level function, the flag needs
    to be passed down through several functions listed below:

    ext4_ext_insert_extent
    ext4_ext_create_new_leaf
    ext4_ext_grow_indepth
    ext4_ext_split
    ext4_ext_new_meta_block
    ext4_mb_new_blocks
    ext4_claim_free_blocks
    ext4_has_free_blocks

    [ext4 punch hole patch series 1/5 v7]

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Mingming Cao

    Allison Henderson
     
  • I am working on a patch to add quota as a built-in feature of the
    ext4 filesystem. The implementation is based on the design given at
    https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4.
    This patch reserves the inode numbers 3 and 4 for quota purposes and
    also reserves EXT4_FEATURE_RO_COMPAT_QUOTA feature code.

    Signed-off-by: Aditya Kali
    Signed-off-by: "Theodore Ts'o"

    Aditya Kali
     
  • Prevent an ext4 filesystem from being mounted multiple times.
    A sequence number is stored on disk and is periodically updated (every 5
    seconds by default) by a mounted filesystem.
    At mount time, we now wait for s_mmp_update_interval seconds to make sure
    that the MMP sequence does not change.
    In case of failure, the nodename, bdevname and the time at which the
    MMP block was last updated are displayed.

    Signed-off-by: Andreas Dilger
    Signed-off-by: Johann Lombardi
    Signed-off-by: "Theodore Ts'o"

    Johann Lombardi
     
  • I found an issue where the number of free blocks went negative:
    # stat -f /mnt/mp1/
    File: "/mnt/mp1/"
    ID: e175ccb83a872efe Namelen: 255 Type: ext2/ext3
    Block size: 4096 Fundamental block size: 4096
    Blocks: Total: 258022 Free: -15 Available: -13122
    Inodes: Total: 65536 Free: 63029

    f_bfree in struct statfs can go negative when the filesystem has few
    free blocks, because the number of dirty blocks exceeds the number of
    free blocks in the following two cases.

    CASE 1:
    ext4_da_writepages()
      mpage_da_map_and_submit()
        ext4_map_blocks()
          ext4_ext_map_blocks()
            ext4_mb_new_blocks()
              ext4_mb_diskspace_used()
                percpu_counter_sub(&sbi->s_freeblocks_counter,
                                   ac->ac_b_ex.fe_len);
          ext4_da_update_reserve_space()
            percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                               used + ei->i_allocated_meta_blocks);

    CASE 2:
    ext4_write_begin()
      __block_write_begin()
        ext4_map_blocks()
          ext4_ext_map_blocks()
            ext4_mb_new_blocks()
              ext4_mb_diskspace_used()
                percpu_counter_sub(&sbi->s_freeblocks_counter,
                                   ac->ac_b_ex.fe_len);
              percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);

    To avoid the issue, this patch ensures that f_bfree is non-negative
    (a sketch of the clamping follows this entry).

    Signed-off-by: Kazuya Mio

    Kazuya Mio
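
    A sketch of the clamping idea in ext4_statfs(), reconstructed from
    the counters named in the call chains above (not a verbatim copy of
    the patch):

    static void example_fill_bfree(struct ext4_sb_info *sbi,
                                   struct kstatfs *buf)
    {
            s64 bfree;

            bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
                    percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);

            /* Dirty (delalloc-reserved) blocks may momentarily exceed free
             * blocks; clamp at zero so f_bfree never goes negative. */
            buf->f_bfree = max_t(s64, bfree, 0);
    }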
     
  • We should protect reading bd_info->bb_first_free with the group lock,
    because otherwise we might miss some free blocks. This is not a big
    deal at all, but the change to do the right thing is really simple,
    so let's do it.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • Currently we load the buddy via ext4_mb_load_buddy() for every block
    group we walk through in ext4_trim_fs(), in many cases just to find
    out that there is not enough free space to be bothered with. As Amir
    Goldstein suggested, we can use the bb_free information directly from
    ext4_group_info.

    This commit removes ext4_mb_load_buddy() from ext4_trim_fs(); instead
    we get the ext4_group_info via ext4_get_group_info() and use the
    bb_free information directly from it. This avoids an unnecessary
    buddy load when the group does not have enough free space to trim.
    Loading the buddy is now done in ext4_trim_all_free().

    Tested by me with xfstests 251.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • jbd2_log_start_commit() returns 1 only when we really start a
    transaction. But we also need to wait for a transaction when the
    commit is already running. Fix this problem by waiting for the
    transaction commit unconditionally (which is just a quick check if
    the transaction has already committed).

    Also, we have to be more careful about sending a barrier: when a
    transaction is being committed in parallel with ext4_sync_file(), we
    cannot be sure that the barrier the journalling code sends happens
    after we wrote all the data for the fsync (note that not every data
    writeout needs to trigger metadata changes, so a commit of some
    metadata changes can be running while other data is still being
    written out). So use the jbd2_will_send_data_barrier() helper to
    detect the common cases when we can be sure the barrier will be
    issued by the commit code, and issue the barrier ourselves in the
    remaining cases.

    Reported-by: Edward Goggin
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

24 May, 2011

2 commits

  • To get delayed-extent information, ext4_ext_fiemap_cb() looks up the
    page cache, so it collects information starting from a page's head
    block.

    If blocksize < pagesize, the beginning blocks of a page may lie
    before the requested range. ext4_ext_fiemap_cb() should therefore
    skip them, because they have been handled before. If no mapped buffer
    in the range is found in the first page, we need to look up the
    second page, otherwise delayed extents after a hole will be ignored.

    Without this patch, xfstest 225 hangs on ext4 with a 1K block size.

    Reported-by: Amir Goldstein
    Signed-off-by: Yongqiang Yang
    Signed-off-by: "Theodore Ts'o"

    Yongqiang Yang
     
  • In commit c8d46e41 (ext4: Add flag to files with blocks intentionally
    past EOF), if the EOFBLOCKS_FL flag is set, we call ext4_truncate()
    before calling vmtruncate(). This caused any allocated but unwritten
    blocks created by calling fallocate() with the FALLOC_FL_KEEP_SIZE
    flag to be dropped. This was done to make sure that
    EOFBLOCKS_FL would not be cleared while still leaving blocks past
    i_size allocated. This was not necessary, since ext4_truncate()
    guarantees that blocks past i_size will be dropped, even in the case
    where truncate() has increased i_size before calling ext4_truncate().

    So fix this by removing the EOFBLOCKS_FL special case treatment in
    ext4_setattr(). In addition, use truncate_setsize() followed by a
    call to ext4_truncate() instead of using vmtruncate() (see the sketch
    after this entry). This is more efficient since it skips the call to
    inode_newsize_ok(), which has been checked already by
    inode_change_ok(). This is also a win in the case where EOFBLOCKS_FL
    is set, since it avoids calling ext4_truncate() twice.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
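
    A condensed sketch of the size-change path described above (orphan
    handling, journalling and error paths omitted; the point is only the
    truncate_setsize() plus ext4_truncate() pairing):

    static void example_setattr_size(struct inode *inode, struct iattr *attr)
    {
            if (!(attr->ia_valid & ATTR_SIZE))
                    return;

            /* inode_newsize_ok() was already done by inode_change_ok(),
             * so just update i_size and truncate the page cache ...     */
            truncate_setsize(inode, attr->ia_size);

            /* ... and drop every block past the new i_size, including
             * fallocated blocks beyond EOF.                             */
            ext4_truncate(inode);
    }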
     

23 May, 2011

4 commits


21 May, 2011

4 commits

  • We need to take a reference to the s_li_request after we take the
    mutex, because it might otherwise be freed in the meantime, resulting
    in access to already-freed memory. Also, we should protect the whole
    ext4_remove_li_request() because ext4_li_info might be in the process
    of being freed in ext4_lazyinit_thread().

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Eric Sandeen

    Lukas Czerner
     
  • For some reason, when we set the mount option init_itable=0 it
    behaves as if we had set init_itable=20, which is not right at all.
    Setting it to zero means telling the lazyinit thread not to wait
    between zeroing the inode tables (apart from cond_resched()), so this
    commit fixes that and removes the unnecessary condition. The 'n'
    value should also be properly used on remount.

    When 'n' is not set at all, the default multiplier
    EXT4_DEF_LI_WAIT_MULT is used instead.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reported-by: Eric Sandeen

    Lukas Czerner
     
  • For some reason we have been waiting for the lazyinit thread to start
    in ext4_run_lazyinit_thread(), but this is not needed and was just
    unnecessary complexity, so get rid of it. We can also remove li_task
    and li_wait_task since they are not used anymore.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Eric Sandeen

    Lukas Czerner
     
  • In order to make lazyinit eat at most approximately 10% of the I/O
    bandwidth, we sleep between zeroing each single inode table. For that
    purpose we were using a timer which wakes up the thread when it
    expires. It was set via add_timer(), and this may cause trouble if
    the thread has been woken up earlier and in the next iteration we
    call add_timer() on a still-running timer, hence hitting the BUG_ON
    in add_timer(). We could fix that by using mod_timer() instead;
    however, we can use schedule_timeout_interruptible() for the waiting
    and hence simplify things a lot.

    This commit exchanges the old "waiting mechanism" for a simple
    schedule_timeout_interruptible() call with the time to sleep (see the
    sketch after this entry). Hence we no longer need the li_wait_daemon
    wait queue and related machinery, so get rid of them.

    Addresses-Red-Hat-Bugzilla: #699708

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Eric Sandeen

    Lukas Czerner
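
    A sketch of the simplified wait (the helper and its argument are
    illustrative; the real thread loop computes the next wakeup time from
    the pending lazyinit requests):

    #include <linux/jiffies.h>
    #include <linux/sched.h>

    static void example_lazyinit_wait(unsigned long next_wakeup)
    {
            unsigned long cur = jiffies;

            /* No timer and no wait queue: just sleep until the next request
             * is due; schedule_timeout_interruptible() sets the task state. */
            if (time_after(next_wakeup, cur))
                    schedule_timeout_interruptible(next_wakeup - cur);
    }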
     

19 May, 2011

3 commits


16 May, 2011

2 commits

  • This patch addresses bugs found while testing punch hole
    with the fsx test. The patch corrects the number of blocks
    that are zeroed out while splitting an extent, and also corrects
    the return value to return the number of blocks split out, instead
    of the number of blocks zeroed out.

    This patch has been tested in addition to the following patches:
    [Ext4 punch hole v7]
    [XFS Tests Punch Hole 1/1 v2] Add Punch Hole Testing to FSX

    The test ran successfully for 24 hours.

    Signed-off-by: Allison Henderson
    Signed-off-by: "Theodore Ts'o"

    Allison Henderson
     
  • If quota is not enabled when ext4_quota_off() is called, we must not
    dereference quota file inode since it is NULL. Check properly for
    this.

    This fixes a bug in commit 21f976975cbe (ext4: remove unnecessary
    [cm]time update of quota file), which was merged for 2.6.39-rc3.

    Reported-by: Amir Goldstein
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Amir Goldstein
     

15 May, 2011

1 commit


10 May, 2011

3 commits

  • After taking care of all group init races, all that remains is to
    remove alloc_semp from ext4_allocation_context and ext4_buddy structs.

    Signed-off-by: Amir Goldstein
    Signed-off-by: "Theodore Ts'o"

    Amir Goldstein
     
  • After online resize which adds new groups, some of the groups
    in a buddy page may be initialized and uptodate, while other
    (new ones) may be uninitialized.

    The indication for init of new block groups is when ext4_mb_init_cache()
    is called with an uptodate buddy page. In this case, initialized groups
    on that buddy page must be skipped when initializing the buddy cache.

    Signed-off-by: Amir Goldstein
    Signed-off-by: "Theodore Ts'o"

    Amir Goldstein
     
  • The old routines ext4_mb_[get|put]_buddy_cache_lock(), which used
    to take grp->alloc_sem for all groups on the buddy page have been
    replaced with the routines ext4_mb_[get|put]_buddy_page_lock().

    The new routines take both buddy and bitmap page locks to protect
    against concurrent init of groups on the same buddy page.

    The GROUP_NEED_INIT flag is tested again under page lock to check
    if the group was initialized by another caller.

    Signed-off-by: Amir Goldstein
    Signed-off-by: "Theodore Ts'o"

    Amir Goldstein