12 Mar, 2013

1 commit

  • jbd2_journal_dirty_metadata() didn't get a reference to journal_head it
    was working with. This is OK in most of the cases since the journal head
    should be attached to a transaction but in rare occasions when we are
    journalling data, __ext4_journalled_writepage() can race with
    jbd2_journal_invalidatepage() stripping buffers from a page and thus
    journal head can be freed under hands of jbd2_journal_dirty_metadata().

    Fix the problem by getting own journal head reference in
    jbd2_journal_dirty_metadata() (and also in jbd2_journal_set_triggers()
    which can possibly have the same issue).

    Reported-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Jan Kara
     

11 Mar, 2013

10 commits

  • Currently we only reserve space (data+metadata) in delayed allocation if
    we're allocating from new cluster (which is always in non-bigalloc file
    system) which is ok for data blocks, because we reserve the whole cluster.

    However we have to reserve metadata for every delayed block we're going
    to write because every block could potentially require metedata block
    when we need to grow the extent tree.

    Signed-off-by: Lukas Czerner

    Lukas Czerner
     
  • Currently in ext4_ext_map_blocks() in delayed allocation writeback
    we would update the reservation and after that check whether we claimed
    cluster outside of the range of the allocation and if so, we'll give the
    block back to the reservation pool.

    However this also means that if the number of reserved data block
    dropped to zero before the correction, we would release all the metadata
    reservation as well, however we might still need it because the we're
    not done with the delayed allocation and there might be more blocks to
    come. This will result in error messages such as:

    EXT4-fs warning (device sdb): ext4_da_update_reserve_space:361: ino 12,
    allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks
    with reserved 1 data blocks)

    This will only happen on bigalloc file system and it can be easily
    reproduced using fiemap-tester from xfstests like this:

    ./src/fiemap-tester -m DHDHDHDHD -S -p0 /mnt/test/file

    Or using xfstests such as 225.

    Fix this by doing the correction first and updating the reservation
    after that so that we do not accidentally decrease
    i_reserved_data_blocks to zero.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • Using yield() is strongly discouraged (see sched/core.c) especially
    since we can just use cond_resched().

    Replace all use of yield() with cond_resched().

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • Remove unused variable 'freed' in ext4_free_blocks().

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • ext4_releasepage() warns when it is passed a page with PageChecked set.
    However this can correctly happen when invalidate_inode_pages2_range()
    invalidates pages - and we should fail the release in that case. Since
    the page was dirty anyway, it won't be discarded and no harm has
    happened but it's good to be safe. Also remove bogus page_has_buffers()
    check - we are guaranteed page has buffers in this function.

    Reported-by: Zheng Liu
    Tested-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Jan Kara

    Jan Kara
     
  • This commit fixes a wrong return value of the number of the allocated
    blocks in ext4_split_extent. When the length of blocks we want to
    allocate is greater than the length of the current extent, we return a
    wrong number. Let's see what happens in the following case when we
    call ext4_split_extent().

    map: [48, 72]
    ex: [32, 64, u]

    'ex' will be split into two parts:
    ex1: [32, 47, u]
    ex2: [48, 64, w]

    'map->m_len' is returned from this function, and the value is 24. But
    the real length is 16. So it should be fixed.

    Meanwhile in this commit we use right length of the allocated blocks
    when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents
    is called.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Dmitry Monakhov
    Cc: stable@vger.kernel.org

    Zheng Liu
     
  • When we try to split an extent, this extent could be zeroed out and mark
    as initialized. But we don't know this in ext4_map_blocks because it
    only returns a length of allocated extent. Meanwhile we will mark this
    extent as uninitialized because we only check m_flags.

    This commit update extent status tree when we try to split an unwritten
    extent. We don't need to worry about the status of this extent because
    we always mark it as initialized.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Dmitry Monakhov

    Zheng Liu
     
  • The ext4_ext_handle_uninitialized_extents() function was assuming the
    return value of ext4_ext_map_blocks() is equal to map->m_len. This
    incorrect assumption was harmless until we started use status tree as
    a extent cache because we need to update status tree according to
    'm_len' value.

    Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent
    conversion. It shouldn't cause a bug because we update status tree
    according to checking EXT4_MAP_UNWRITTEN flag. But it should be fixed.

    After applied this commit, the following error message from self-testing
    infrastructure disappears.

    ...
    kernel: ES len assertation failed for inode: 230 retval 1 !=
    map->m_len 3 in ext4_map_blocks (allocation)
    ...

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Dmitry Monakhov

    Zheng Liu
     
  • This commit adds a self-testing infrastructure like extent tree does to
    do a sanity check for extent status tree. After status tree is as a
    extent cache, we'd better to make sure that it caches right result.

    After applied this commit, we will get a lot of messages when we run
    xfstests as below.

    ...
    kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len
    3 in ext4_map_blocks (allocation)
    ...
    kernel: ES cache assertation failed for inode: 230 es_cached ex
    [974/2/4781/20] != found ex [974/1/4781/1000]
    ...
    kernel: ES insert assertation failed for inode: 635 ex_status
    [0/45/21388/w] != es_status [44/1/21432/u]
    ...

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • Check the length of an extent to avoid a potential overflow in
    ext4_es_can_be_merged().

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Dmitry Monakhov

    Zheng Liu
     

04 Mar, 2013

5 commits

  • mext_replace_branches() will change inode's extents layout so
    we have to drop corresponding cache.

    TESTCASE: 301'th xfstest was not yet accepted to official xfstest's branch
    and can be found here: https://github.com/dmonakhov/xfstests/commit/7b7efeee30a41109201e2040034e71db9b66ddc0

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Dmitry Monakhov
     
  • Now that we don't merge uninitialized extents anymore,
    ext4_fallocate() is free to operate on the inode while there are still
    some extent conversions pending - it won't disturb them in any way.

    Reviewed-by: Zheng Liu
    Reviewed-by: Dmitry Monakhov
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Splitting extents inside endio is a bad thing, but unfortunately it is
    still possible. In fact we are pretty close to the moment when all
    related issues will be fixed. Let's warn developer if it still the
    case.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Dmitry Monakhov
     
  • Derived from Jan's patch:http://permalink.gmane.org/gmane.comp.file-systems.ext4/36470

    Merging of uninitialized extents creates all sorts of interesting race
    possibilities when writeback / DIO races with fallocate. Thus
    ext4_convert_unwritten_extents_endio() has to deal with a case where
    extent to be converted needs to be split out first. That isn't nice
    for two reasons:

    1) It may need allocation of extent tree block so ENOSPC is possible.
    2) It complicates end_io handling code

    So we disable merging of uninitialized extents which allows us to simplify
    the code. Extents will get merged after they are converted to initialized
    ones.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Dmitry Monakhov
     
  • When ext4_split_extent_at() ends up doing zeroout & conversion to
    initialized instead of split & conversion, ext4_split_extent() gets
    confused and can wrongly mark the extent back as uninitialized
    resulting in end IO code getting confused from large unwritten extents
    and may result in data loss.

    The example of problematic behavior is:
    lblk len lblk len
    ext4_split_extent() (ex=[1000,30,uninit], map=[1010,10])
    ext4_split_extent_at() (split [1000,30,uninit] at 1020)
    ext4_ext_insert_extent() -> ENOSPC
    ext4_ext_zeroout()
    -> extent [1000,30] is now initialized
    ext4_split_extent_at() (split [1000,30,init] at 1010,
    MARK_UNINIT1 | MARK_UNINIT2)
    -> extent is split and parts marked as uninitialized

    Fix the problem by rechecking extent type after the first
    ext4_split_extent_at() returns. None of split_flags can not be applied
    to initialized extent so this patch also add BUG_ON to prevent similar
    issues in future.

    TESTCASE: https://github.com/dmonakhov/xfstests/commit/b8a55eb5ce28c6ff29e620ab090902fcd5833597

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Dmitry Monakhov
     

03 Mar, 2013

6 commits

  • When using quota feature we need to enable quotas before orphan cleanup
    so that changes happening during it are properly reflected in quota
    accounting.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • So far we silently ignored when quota mount options were set while quota
    feature was enabled. But this can create confusion in userspace when
    mount options are set but silently ignored and also creates opportunities
    for bugs when we don't properly test all quota types. Actually
    ext4_mark_dquot_dirty() forgets to test for quota feature so it was
    dependent on journaled quota options being set. OTOH ext4_orphan_cleanup()
    tries to enable journaled quota when quota options are specified which is
    wrong when quota feature is enabled.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • ext4_dir_llseek is only used as a callback function, and no one calls
    it directly. So make it as a static function in order to remove a
    warning message from sparse check.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"

    Zheng Liu
     
  • We're using macro EXT4_B2C() to convert number of blocks to number of
    clusters for bigalloc file systems. However, we should be using
    EXT4_NUM_B2C().

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Lukas Czerner
     
  • 'orig_data' is malloced in ext4_remount() and should be freed
    before leaving from the error handling cases, otherwise it will
    cause memory leak.

    Signed-off-by: Wei Yongjun
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Lukas Czerner
    Cc: stable@vger.kernel.org

    Wei Yongjun
     
  • If start_this_handle() failed handle will be initialized
    to ERR_PTR() and can not be dereferenced.

    paging request at fffffffffffffff6
    IP: [] jbd2__journal_start+0x18f/0x290
    PGD 200e067 PUD 200f067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
    CPU 0 journal commit I/O error

    Pid: 2694, comm: fio Not tainted 3.8.0-rc3+ #79 /DQ67SW
    RIP: 0010:[] [] jbd2__journal_start+0x18f/0x290
    RSP: 0018:ffff880233b8ba58 EFLAGS: 00010292
    RAX: 00000000ffffffe2 RBX: ffffffffffffffe2 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff82128f48
    RBP: ffff880233b8ba98 R08: 0000000000000000 R09: ffff88021440a6e0

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

02 Mar, 2013

1 commit

  • Use a percpu counter rather than atomic types for shrinker accounting.
    There's no need for ultimate accuracy in the shrinker, so this
    should come a little more cheaply. The percpu struct is somewhat
    large, but there was a big gap before the cache-aligned
    s_es_lru_lock anyway, and it fits nicely in there.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

01 Mar, 2013

1 commit

  • When the system is under memory pressure, ext4_es_srhink() will get
    called very often. So optimize returning the number of items in the
    file system's extent status cache by keeping a per-filesystem count,
    instead of calculating it each time by scanning all of the inodes in
    the extent status cache.

    Also rename the slab used for the extent status cache to be
    "ext4_extent_status" so it's obviousl the slab in question is created
    by ext4.

    Signed-off-by: "Theodore Ts'o"
    Cc: Zheng Liu

    Theodore Ts'o
     

28 Feb, 2013

1 commit

  • This fixes a regression introduced by commit f7fec032aa782. The
    problem was that the extents status flags caused us to mask out block
    numbers smaller than 2**28 blocks. Since we didn't test with file
    systems smaller than 512GB, we didn't notice this during the
    development cycle.

    A typical failure looks like this:

    EXT4-fs error (device sdb1): htree_dirblock_to_tree:919: inode #172235804: block
    152052301: comm ls: bad entry in directory: rec_len is smaller than minimal -
    offset=0(0), inode=0, rec_len=0, name_len=0

    ... where 'debugfs -R "stat " /dev/sdb1' reports that the
    inode has block number 688923213. When viewed in hex, block number
    152052301 (from the syslog) is 0x910224D, while block number 688923213
    is 0x2910224D. Note the missing "0x20000000" in the block number.

    Reported-by: Markus Trippelsdorf
    Verified-by: Markus Trippelsdorf
    Reported-by: Dave Jones
    Verified-by: Dave Jones
    Cc: Zheng Liu
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

23 Feb, 2013

2 commits

  • ext4_has_free_clusters() should tell us whether there is enough free
    clusters to allocate, however number of free clusters in the file system
    is converted to blocks using EXT4_C2B() which is not only wrong use of
    the macro (we should have used EXT4_NUM_B2C) but it's also completely
    wrong concept since everything else is in cluster units.

    Moreover when calculating number of root clusters we should be using
    macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
    off by one. However r_blocks_count should always be a multiple of the
    cluster ratio so doing a plain bit shift should be enough here. We
    avoid using EXT4_B2C() because it's confusing.

    As a result of the first problem number of free clusters is much bigger
    than it should have been and ext4_has_free_clusters() would return 1 even
    if there is really not enough free clusters available.

    Fix this by removing the EXT4_C2B() conversion of free clusters and
    using bit shift when calculating number of root clusters. This bug
    affects number of xfstests tests covering file system ENOSPC situation
    handling. With this patch most of the ENOSPC problems with bigalloc file
    system disappear, especially the errors caused by delayed allocation not
    having enough space when the actual allocation is finally requested.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Lukas Czerner
     
  • len is 0 means no extent needs to be removed, so return immediately.
    Otherwise it could trigger the following BUG_ON() in
    ext4_es_remove_extent()

    end = lblk + len - 1;
    BUG_ON(end < lblk);

    This could be reproduced by a simple truncate(1) command by an
    unprivileged user

    truncate -s $(($((2**32 - 1)) * 4096)) /mnt/ext4/testfile

    The same is true for __es_insert_extent().

    Patched kernel passed xfstests regression test.

    Signed-off-by: Eryu Guan
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Zheng Liu

    Eryu Guan
     

19 Feb, 2013

1 commit

  • Currently when new xattr block is created or released we we would call
    dquot_free_block() or dquot_alloc_block() respectively, among the else
    decrementing or incrementing the number of blocks assigned to the
    inode by one block.

    This however does not work for bigalloc file system because we always
    allocate/free the whole cluster so we have to count with that in
    dquot_free_block() and dquot_alloc_block() as well.

    Use the clusters-to-blocks conversion EXT4_C2B() when passing number of
    blocks to the dquot_alloc/free functions to fix the problem.

    The problem has been revealed by xfstests #117 (and possibly others).

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Eric Sandeen
    Cc: stable@vger.kernel.org

    Lukas Czerner
     

18 Feb, 2013

9 commits

  • Although extent status is loaded on-demand, we also need to reclaim
    extent from the tree when we are under a heavy memory pressure because
    in some cases fragmented extent tree causes status tree costs too much
    memory.

    Here we maintain a lru list in super_block. When the extent status of
    an inode is accessed and changed, this inode will be move to the tail
    of the list. The inode will be dropped from this list when it is
    cleared. In the inode, a counter is added to count the number of
    cached objects in extent status tree. Here only written/unwritten/hole
    extent is counted because delayed extent doesn't be reclaimed due to
    fiemap, bigalloc and seek_data/hole need it. The counter will be
    increased as a new extent is allocated, and it will be decreased as a
    extent is freed.

    In this commit we use normal shrinker framework to reclaim memory from
    the status tree. ext4_es_reclaim_extents_count() traverses the lru list
    to count the number of reclaimable extents. ext4_es_shrink() tries to
    reclaim written/unwritten/hole extents from extent status tree. The
    inode that has been shrunk is moved to the tail of lru list.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan kara

    Zheng Liu
     
  • This commit changes some interfaces in extent status tree because we
    need to use inode to count the cached objects in a extent status tree.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan kara

    Zheng Liu
     
  • Single extent cache could be removed because we have extent status tree
    as a extent cache, and it would be better.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan kara

    Zheng Liu
     
  • After tracking all extent status, we already have a extent cache in
    memory. Every time we want to lookup a block mapping, we can first
    try to lookup it in extent status tree to avoid a potential disk I/O.

    A new function called ext4_es_lookup_extent is defined to finish this
    work. When we try to lookup a block mapping, we always call
    ext4_map_blocks and/or ext4_da_map_blocks. So in these functions we
    first try to lookup a block mapping in extent status tree.

    A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
    in order not to put a hole into extent status tree because this hole
    will be converted to delayed extent in the tree immediately.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan kara

    Zheng Liu
     
  • By recording the phycisal block and status, extent status tree is able
    to track the status of every extents. When we call _map_blocks
    functions to lookup an extent or create a new written/unwritten/delayed
    extent, this extent will be inserted into extent status tree.

    We don't load all extents from disk in alloc_inode() because it costs
    too much memory, and if a file is opened and closed frequently it will
    takes too much time to load all extent information. So currently when
    we create/lookup an extent, this extent will be inserted into extent
    status tree. Hence, the extent status tree may not comprehensively
    contain all of the extents found in the file.

    Here a condition we need to take care is that an extent might contains
    unwritten and delayed status simultaneously because an extent is delayed
    allocated and could be allocated by fallocate. At this time we need to
    keep delayed status because later we need to update delayed reservation
    space using it.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan kara

    Zheng Liu
     
  • This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
    because in later commit ext4_map_blocks needs to use this flag to
    determine the extent status.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Zheng Liu
     
  • This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
    and improve this function. First, we split input and output parameter.
    Second, this function never return the first block of the next delayed
    extent after 'es'.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Cc: Jan kara

    Zheng Liu
     
  • This commit adds two members in extent_status structure to let it record
    physical block and extent status. Here es_pblk is used to record both
    of them because physical block only has 48 bits. So extent status could
    be stashed into it so that we can save some memory. Now written,
    unwritten, delayed and hole are defined as status.

    Due to new member is added into extent status tree, all interfaces need
    to be adjusted.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Zheng Liu
     
  • This commit refines the extent status tree code.

    1) A prefix 'es_' is added to to the extent status tree structure
    members.

    2) Refactored es_remove_extent() so that __es_remove_extent() can be
    used by es_insert_extent() to remove the old extent entry(-ies) before
    inserting a new one.

    3) Rename extent_status_end() to ext4_es_end()

    4) ext4_es_can_be_merged() is define to check whether two extents can
    be merged or not.

    5) Update and clarified comments.

    Signed-off-by: Zheng Liu
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Zheng Liu
     

15 Feb, 2013

3 commits

  • Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate
    pointer to an integer for the error code, as a code cleanup.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • The code to read in directory blocks and verify their metadata
    checksums was replicated in ten different places across
    fs/ext4/namei.c, and the code was buggy in subtle ways in a number of
    those replicated sites. In some cases, ext4_error() was called with a
    training newline. In others, in particularly in empty_dir(), it was
    possible to call ext4_dirent_csum_verify() on an index block, which
    would trigger false warnings requesting the system adminsitrator to
    run e2fsck.

    By refactoring the code, we make the code more readable, as well as
    shrinking the compiled object file by over 700 bytes and 50 lines of
    code.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • Print some additional debugging context to hopefully help to debug a
    warning which is getting triggered by xfstests #74.

    Also remove extraneous newlines from when printk's were converted to
    ext4_warning() and ext4_msg().

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o