30 Aug, 2016

1 commit

  • Online defragging of encrypted files is not currently implemented.
    However, the move extent ioctl can still return successfully when
    called. For example, this occurs when xfstest ext4/020 is run on an
    encrypted file system, resulting in a corrupted test file and a
    corresponding test failure.

    Until the proper functionality is implemented, fail the move extent
    ioctl if either the original or donor file is encrypted.

    Cc: stable@vger.kernel.org
    Signed-off-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Eric Whitney
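
As a rough userspace illustration of the ioctl this commit guards, the sketch below issues EXT4_IOC_MOVE_EXT and reports failure through errno. The struct layout and ioctl number follow fs/ext4/ext4.h; the helper name try_move_extent() is hypothetical, and the exact errno returned for encrypted files (EOPNOTSUPP is assumed in the comment) is not stated in the commit message.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Mirrors struct move_extent from fs/ext4/ext4.h. */
struct move_extent {
    uint32_t reserved;    /* should be zero */
    uint32_t donor_fd;    /* donor file descriptor */
    uint64_t orig_start;  /* logical start offset in blocks, original file */
    uint64_t donor_start; /* logical start offset in blocks, donor file */
    uint64_t len;         /* number of blocks to move */
    uint64_t moved_len;   /* number of blocks actually moved (output) */
};

#ifndef EXT4_IOC_MOVE_EXT
#define EXT4_IOC_MOVE_EXT _IOWR('f', 15, struct move_extent)
#endif

/*
 * Hypothetical helper: ask ext4 to exchange `len` blocks starting at block
 * `start` between orig_fd and donor_fd. Returns 0 on success, or -1 with
 * errno set by ioctl() on failure.
 */
static int try_move_extent(int orig_fd, int donor_fd,
                           uint64_t start, uint64_t len, uint64_t *moved)
{
    struct move_extent me = {
        .donor_fd = (uint32_t)donor_fd,
        .orig_start = start,
        .donor_start = start,
        .len = len,
    };

    if (ioctl(orig_fd, EXT4_IOC_MOVE_EXT, &me) < 0)
        return -1; /* assumed: EOPNOTSUPP when either file is encrypted */

    if (moved)
        *moved = me.moved_len;
    return 0;
}
```

A caller such as a defrag tool would treat a -1 return as "leave this file alone" rather than assuming the data was moved.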
     

24 Apr, 2016

1 commit

  • Currently we ask jbd2 to write all dirty allocated buffers before
    committing a transaction when doing writeback of delay allocated blocks.
    However this is unnecessary since we move all pages to writeback state
    before dropping a transaction handle and then submit all the necessary
    IO. We still need the transaction commit to wait for all the outstanding
    writeback before flushing disk caches during transaction commit to avoid
    data exposure issues though. Use the new jbd2 capability and ask it to
    only wait for outstanding writeback during transaction commit when
    writing back data in ext4_writepages().

    Tested-by: "HUANG Weller (CM/ESW12-CN)"
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

08 Apr, 2016

1 commit

  • Pull ext4 bugfixes from Ted Ts'o:
    "These changes contain a fix for overlayfs interacting with some
    (badly behaved) dentry code in various file systems. These have been
    reviewed by Al and the respective file system maintainers and are going
    through the ext4 tree for convenience.

    This also has a few ext4 encryption bug fixes that were discovered in
    Android testing (yes, we will need to get these sync'ed up with the
    fs/crypto code; I'll take care of that). It also has some bug fixes
    and a change to ignore the legacy quota options to allow for xfstests
    regression testing of ext4's internal quota feature and to be more
    consistent with how xfs handles this case"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: ignore quota mount options if the quota feature is enabled
    ext4 crypto: fix some error handling
    ext4: avoid calling dquot_get_next_id() if quota is not enabled
    ext4: retry block allocation for failed DIO and DAX writes
    ext4: add lockdep annotations for i_data_sem
    ext4: allow readdir()'s of large empty directories to be interrupted
    btrfs: fix crash/invalid memory access on fsync when using overlayfs
    ext4 crypto: use dget_parent() in ext4_d_revalidate()
    ext4: use file_dentry()
    ext4: use dget_parent() in ext4_file_open()
    nfs: use file_dentry()
    fs: add file_dentry()
    ext4 crypto: don't let data integrity writebacks fail with ENOMEM
    ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()

    Linus Torvalds
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    There are many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it is a constant source of confusion whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
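
To see why the shift rules in the script above are behavior-preserving, here is a minimal sketch. It assumes 4 KiB pages and defines PAGE_CACHE_SHIFT as an alias of PAGE_SHIFT, which is the equality the commit message says the code relied on; the helper names are hypothetical.

```c
/* Assumption for illustration only: 4 KiB pages. */
#define PAGE_SHIFT 12
/* The page-cache variant was simply an alias of the page variant. */
#define PAGE_CACHE_SHIFT PAGE_SHIFT

/* Old style: convert a page-cache index to a page index. */
static unsigned long index_to_page_old(unsigned long index)
{
    return index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); /* shift by zero */
}

/* New style after the semantic patch: the shift disappears. */
static unsigned long index_to_page_new(unsigned long index)
{
    return index;
}

/* Old and new ways of turning a byte offset into a page-cache index. */
static unsigned long pos_to_index_old(unsigned long long pos)
{
    return (unsigned long)(pos >> PAGE_CACHE_SHIFT);
}

static unsigned long pos_to_index_new(unsigned long long pos)
{
    return (unsigned long)(pos >> PAGE_SHIFT);
}
```

Because the two shift constants are equal, each old/new pair returns the same value for every input, so the conversion removes a distinction without changing behavior.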
     

01 Apr, 2016

1 commit

  • With the internal Quota feature, mke2fs creates empty quota inodes and
    quota usage tracking is enabled as soon as the file system is mounted.
    Since quotacheck is no longer preallocating all of the blocks in the
    quota inode that are likely needed to be written to, we are now seeing
    a lockdep false positive caused by needing to allocate a quota block
    from inside ext4_map_blocks(), while holding i_data_sem for a data
    inode. This results in this complaint:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&ei->i_data_sem);
                                   lock(&s->s_dquot.dqio_mutex);
                                   lock(&ei->i_data_sem);
      lock(&s->s_dquot.dqio_mutex);

    Google-Bug-Id: 27907753

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

10 Mar, 2016

1 commit

  • In commit bcff24887d00 ("ext4: don't read blocks from disk after extents
    being swapped") bh is not updated correctly in the for loop and wrong
    data has been written to disk. generic/324 catches this on sub-page
    block size ext4.

    Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extents being swapped")
    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o

    Eryu Guan
     

12 Feb, 2016

1 commit

  • I noticed that ext4/307 fails occasionally on a ppc64 host, reporting an
    md5 checksum mismatch after moving data from the original file to the
    donor file.

    The reason is that move_extent_per_page() calls __block_write_begin()
    and block_commit_write() to write saved data from original inode blocks
    to donor inode blocks, but __block_write_begin() not only maps buffer
    heads but also reads block content from disk if the size is not block
    size aligned. At this time the physical block number in the mapped buffer
    head points to the donor file, not the original file, so the wrong data
    is read into the page and then written to disk by the following
    block_commit_write() call.

    This also can be reproduced by the following script on 1k block size ext4
    on x86_64 host:

    mnt=/mnt/ext4
    donorfile=$mnt/donor
    testfile=$mnt/testfile
    e4compact=~/xfstests/src/e4compact

    rm -f $donorfile $testfile

    # reserve space for donor file, written by 0xaa and sync to disk to
    # avoid EBUSY on EXT4_IOC_MOVE_EXT
    xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile

    # create test file written by 0xbb
    xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile

    # compute initial md5sum
    md5sum $testfile | tee md5sum.txt
    # drop cache, force e4compact to read data from disk
    echo 3 > /proc/sys/vm/drop_caches

    # test defrag
    echo "$testfile" | $e4compact -i -v -f $donorfile
    # check md5sum
    md5sum -c md5sum.txt

    Fix it by creating & mapping buffer heads only but not reading blocks
    from disk, because all the data in page is guaranteed to be up-to-date
    in mext_page_mkuptodate().

    Cc: stable@vger.kernel.org
    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o

    Eryu Guan
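
The "not block size aligned" condition can be modeled in a few lines. This is a simplified sketch rather than the kernel's __block_write_begin() logic: it ignores delayed, unwritten, and newly allocated buffer states, and the function name block_needs_read() is hypothetical. With the reproducer's 1 KiB block and 1023-byte file, the write covers only part of block 0, so a read would be issued.

```c
#include <stdbool.h>

/*
 * Simplified model: a block covering bytes [block_start, block_end) has to
 * be read from disk before a write to [from, to) when the block is not
 * already up to date and the write covers only part of it.
 */
static bool block_needs_read(unsigned int block_start, unsigned int block_end,
                             unsigned int from, unsigned int to,
                             bool uptodate)
{
    if (block_end <= from || block_start >= to)
        return false; /* block lies entirely outside the write range */
    if (uptodate)
        return false; /* contents already valid in memory */
    return block_start < from || block_end > to; /* partial overwrite */
}
```

In the failing case the read targets the physical block recorded in the buffer head, which by then belongs to the donor file, and that is where the wrong data comes from.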
     

22 Jun, 2015

1 commit

  • Make the error reporting behavior resulting from the unsupported use
    of online defrag on files with data journaling enabled consistent with
    that implemented for bigalloc file systems. Difference found with
    ext4/308.

    Signed-off-by: Eric Whitney
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Darrick J. Wong

    Eric Whitney
     

13 Jun, 2015

1 commit


17 Dec, 2014

1 commit


06 Nov, 2014

1 commit

  • Xiaoguang Wang has reported sporadic EBUSY failures of ext4/302.
    Unfortunately, there is nothing we can do if some other task holds a
    reference to the BH, so we must return EBUSY in this case. But we can try
    kicking the journal to see if the other task releases the bh reference
    after the commit is complete. Also decrease false positives by
    properly checking for ENOSPC and retrying the allocation after kicking
    the journal --- which is done by ext4_should_retry_alloc().

    [ Modified by tytso to properly check for ENOSPC. ]

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Theodore Ts'o

    Dmitry Monakhov
     

12 Oct, 2014

1 commit

  • In patch 'ext4: refactor ext4_move_extents code base', Dmitry Monakhov
    refactored the implementation of ext4_move_extents but forgot to update
    the corresponding comments. This patch deletes some of the now-useless
    comments.

    Reviewed-by: Dmitry Monakhov
    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Theodore Ts'o

    Xiaoguang Wang
     

02 Sep, 2014

4 commits

  • Make the function name less redundant.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Reuse the path object in ext4_move_extents() so we don't unnecessarily
    free and reallocate it.

    Also clean up the get_ext_path() wrapper so that it has the same
    semantics of freeing the path object on error as ext4_ext_find_extent().

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Teach ext4_ext_drop_refs() to accept a NULL argument, much like
    kfree(). This allows us to drop a lot of checks to make sure path is
    non-NULL before calling ext4_ext_drop_refs().

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • Right now, there are places where it is all too easy to leak memory
    on an error path, via a usage like this:

    struct ext4_ext_path *path = NULL;

    while (...) {
        ...
        path = ext4_ext_find_extent(inode, block, path, 0);
        if (IS_ERR(path)) {
            /* oops, if path was non-NULL before the call to
               ext4_ext_find_extent, we've leaked it! :-( */
            ...
            return PTR_ERR(path);
        }
        ...
    }

    Unfortunately, there are some code paths where we are doing the
    following instead:

    path = ext4_ext_find_extent(inode, block, orig_path, 0);

    and where it's important that we _not_ free orig_path in the case
    where ext4_ext_find_extent() returns an error.

    So change the function signature of ext4_ext_find_extent() so that it
    takes a struct ext4_ext_path ** for its third argument, and by
    default, on an error, it will free the struct ext4_ext_path, and then
    zero out the struct ext4_ext_path * pointer. In order to avoid
    causing problems, we add a flag EXT4_EX_NOFREE_ON_ERR which causes
    ext4_ext_find_extent() to use the original behavior of forcing the
    caller to deal with freeing the original path pointer on the error
    case.

    The goal is to get rid of EXT4_EX_NOFREE_ON_ERR entirely, but this
    allows for a gentle transition and makes the patches easier to verify.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
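
The ownership rule described in this commit can be sketched in userspace as follows. This is a simplified model, not the kernel function: it returns an int instead of an ERR_PTR-encoded pointer, struct ext_path and find_extent_model() are hypothetical stand-ins, the `fail` parameter only simulates a lookup error, and the numeric value given to EXT4_EX_NOFREE_ON_ERR is illustrative.

```c
#include <errno.h>
#include <stdlib.h>

/* Flag name taken from the commit message; the value is illustrative. */
#define EXT4_EX_NOFREE_ON_ERR 0x10

/* Stand-in for struct ext4_ext_path. */
struct ext_path {
    int depth;
};

/*
 * Model of the new convention: the caller passes the path by reference.
 * On error the callee frees *ppath and sets it to NULL, unless the caller
 * asked to keep ownership with EXT4_EX_NOFREE_ON_ERR.
 */
static int find_extent_model(struct ext_path **ppath, int flags, int fail)
{
    if (*ppath == NULL) {
        *ppath = calloc(1, sizeof(**ppath));
        if (*ppath == NULL)
            return -ENOMEM;
    }

    if (fail) {
        if (!(flags & EXT4_EX_NOFREE_ON_ERR)) {
            free(*ppath);
            *ppath = NULL; /* caller cannot leak or reuse a stale pointer */
        }
        return -EIO;
    }
    return 0;
}
```

With this shape, a loop that reuses the same path variable across iterations can return immediately on error without a separate cleanup step, which is the leak the commit message describes.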
     

31 Aug, 2014

2 commits

  • ext4_move_extents is too complex to review. It duplicates almost
    every function available in the rest of the codebase, and it has a
    useless artificial restriction, orig_offset == donor_offset. But in
    fact the logic of ext4_move_extents is very simple:

    Iterate extents one by one (similar to ext4_fill_fiemap_extents)
    ->Iterate over each page covered by the extent (similar to generic_perform_write)
    ->Swap the extents covered by the page (can be shared with IOC_MOVE_DATA)

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Theodore Ts'o

    Dmitry Monakhov
     
  • This allows us to make mext_next_extent static and potentially get rid
    of it.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Theodore Ts'o

    Dmitry Monakhov
     

28 Jul, 2014

1 commit


13 May, 2014

1 commit


21 Apr, 2014

1 commit

  • Currently in ext4 there is quite a mess when it comes to naming
    unwritten extents. Sometimes we call it uninitialized and sometimes we
    refer to it as unwritten.

    The right name for the extent which has been allocated but does not
    contain any written data is _unwritten_. Other file systems are
    using this name consistently, even the buffer head state refers to it as
    unwritten. We need to fix this confusion in ext4.

    This commit changes every reference to an uninitialized extent (meaning
    allocated but unwritten) to unwritten extent. This includes comments,
    function names and variable names. It even covers abbreviation of the
    word uninitialized (such as uninit) and some misspellings.

    This commit does not change any of the code paths at all. This has been
    confirmed by comparing md5sums of the assembly code of each object file
    after all the function names were stripped from it.

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     

24 Feb, 2014

1 commit

  • This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for Ext4.

    The semantics of this flag are as follows:
    1) It collapses the range of len bytes starting at offset by removing any
    data blocks which are present in this range and then updates all the
    logical offsets of extents beyond "offset + len" to nullify the hole
    created by removing blocks. In short, it does not leave a hole.
    2) It should be used exclusively. No other fallocate flag in combination.
    3) Offset and length supplied to fallocate should be fs block size aligned
    in case of xfs and ext4.
    4) Collapse range does not work beyond i_size.

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Tested-by: Dongsu Park
    Signed-off-by: "Theodore Ts'o"

    Namjae Jeon
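
The argument rules listed in this commit can be expressed as a small pre-check. This is a sketch under stated assumptions: the function name collapse_range_args_ok() is hypothetical, rule 4 is modeled as "the range must end before i_size", and arithmetic overflow of offset + len is not handled.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check the documented preconditions for collapsing [offset, offset + len):
 * both values must be multiples of the file system block size, and the
 * range must not reach or pass the end of the file.
 */
static bool collapse_range_args_ok(uint64_t offset, uint64_t len,
                                   uint64_t blksize, uint64_t i_size)
{
    if (blksize == 0 || len == 0)
        return false;
    if (offset % blksize != 0 || len % blksize != 0)
        return false; /* rule 3: block size alignment */
    if (offset + len >= i_size)
        return false; /* rule 4 (assumed form): must end before i_size */
    return true;
}
```

A program that passes this check would then call fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len) with no other mode flags, per rule 2.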
     

18 Feb, 2014

1 commit


09 Nov, 2013

1 commit


17 Aug, 2013

1 commit

  • When we read in an extent tree leaf block from disk, arrange to have
    all of its entries cached. In nearly all cases the in-memory
    representation will be more compact than the on-disk representation in
    the buffer cache, and it allows us to get the information without
    having to traverse the extent tree for successive extents.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Zheng Liu

    Theodore Ts'o
     

17 Jun, 2013

1 commit


20 Apr, 2013

1 commit


12 Apr, 2013

1 commit

  • - grab_cache_page_write_begin() may not wait on page's writeback since
    (1d1d1a767206). But it is still reasonable to wait on page's writeback
    here in order to be on the safe side.

    - Fix a typo: pass 'length' instead of 'end' to __block_write_begin()
    https://bugzilla.kernel.org/show_bug.cgi?id=56241

    TESTCASE: git://oss.sgi.com/xfs/cmds/xfstests.git
    MKFS_OPTIONS="-b1024" ; ./check ext4/304

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Akira Fujita

    Dmitry Monakhov
     

10 Apr, 2013

1 commit


09 Apr, 2013

1 commit

  • Add a new ioctl, EXT4_IOC_SWAP_BOOT which swaps i_blocks and
    associated attributes (like i_blocks, i_size, i_flags, ...) from the
    specified inode with inode EXT4_BOOT_LOADER_INO (#5). This is
    typically used to store a boot loader in a secure part of the
    filesystem, where it can't be changed by a normal user by accident.
    The data blocks of the previous boot loader will be associated with
    the given inode.

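    This userspace program is a simple example of the usage:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ioctl.h>

    /* Defined in fs/ext4/ext4.h; repeated here for userspace builds. */
    #ifndef EXT4_IOC_SWAP_BOOT
    #define EXT4_IOC_SWAP_BOOT _IO('f', 17)
    #endif

    int main(int argc, char *argv[])
    {
        int fd;
        int err;

        if (argc != 2) {
            printf("usage: ext4-swap-boot-inode FILE-TO-SWAP\n");
            exit(1);
        }

        /* The file whose data is swapped with the boot loader inode. */
        fd = open(argv[1], O_WRONLY);
        if (fd < 0) {
            perror("open");
            exit(1);
        }

        /* The ioctl takes no argument beyond the file descriptor. */
        err = ioctl(fd, EXT4_IOC_SWAP_BOOT);
        if (err < 0) {
            perror("ioctl");
            exit(1);
        }

        close(fd);
        exit(0);
    }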

    [ Modified by Theodore Ts'o to fix a number of bugs in the original code.]

    Signed-off-by: Dr. Tilmann Bubeck
    Signed-off-by: "Theodore Ts'o"

    Dr. Tilmann Bubeck
     

22 Mar, 2013

1 commit

  • Pull ext4 fixes from Ted Ts'o:
    "Fix a number of regression and other bugs in ext4, most of which were
    relatively obscure cornercases or races that were found using
    regression tests."

    * tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
    ext4: fix data=journal fast mount/umount hang
    ext4: fix ext4_evict_inode() racing against workqueue processing code
    ext4: fix memory leakage in mext_check_coverage
    ext4: use s_extent_max_zeroout_kb value as number of kb
    ext4: use atomic64_t for the per-flexbg free_clusters count
    jbd2: fix use after free in jbd2_journal_dirty_metadata()
    ext4: reserve metadata block for every delayed write
    ext4: update reserved space after the 'correction'
    ext4: do not use yield()
    ext4: remove unused variable in ext4_free_blocks()
    ext4: fix WARN_ON from ext4_releasepage()
    ext4: fix the wrong number of the allocated blocks in ext4_split_extent()
    ext4: update extent status tree after an extent is zeroed out
    ext4: fix wrong m_len value after unwritten extent conversion
    ext4: add self-testing infrastructure to do a sanity check
    ext4: avoid a potential overflow in ext4_es_can_be_merged()
    ext4: invalidate extent status tree during extent migration
    ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
    ext4: add warning to ext4_convert_unwritten_extents_endio
    ext4: disable merging of uninitialized extents
    ...

    Linus Torvalds
     

18 Mar, 2013

1 commit

  • Regression was introduced by the following commit: 8c854473
    TESTCASE (git://oss.sgi.com/xfs/cmds/xfstests.git):
    #while true;do ./check 301 || break ;done

    Also fix a potential memory leak in get_ext_path() when
    ext4_ext_find_extent() has failed.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

04 Mar, 2013

1 commit

  • mext_replace_branches() changes the inode's extent layout, so
    we have to drop the corresponding cache.

    TESTCASE: xfstest 301 has not yet been accepted into the official xfstests branch
    and can be found here: https://github.com/dmonakhov/xfstests/commit/7b7efeee30a41109201e2040034e71db9b66ddc0

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara

    Dmitry Monakhov
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


18 Feb, 2013

1 commit


09 Feb, 2013

1 commit

  • So we can better understand what bits of ext4 are responsible for
    long-running jbd2 handles, use jbd2__journal_start() so we can pass
    context information for logging purposes.

    The recommended way for finding the longer-running handles is:

    T=/sys/kernel/debug/tracing
    EVENT=$T/events/jbd2/jbd2_handle_stats
    echo "interval > 5" > $EVENT/filter
    echo 1 > $EVENT/enable

    ./run-my-fs-benchmark

    cat $T/trace > /tmp/problem-handles

    This will list handles that were active for longer than 20ms. Having
    longer-running handles is bad, because a commit started at the wrong
    time could stall for those 20+ milliseconds, which could delay an
    fsync() or an O_SYNC operation. Here is an example line from the
    trace file describing a handle which lived on for 311 jiffies, or over
    1.2 seconds:

    postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
    tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
    dirtied_blocks 0

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
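
The figures in this message line up if the kernel tick rate is 250 Hz; that value is inferred from the numbers quoted (5 jiffies as 20ms, 311 jiffies as over 1.2 seconds) and is not stated in the commit. A minimal sketch of the conversion, with a hypothetical helper name:

```c
/*
 * Convert a jbd2_handle_stats "interval" value, reported in jiffies, to
 * milliseconds for a given tick rate `hz` (ticks per second).
 */
static unsigned long jiffies_to_ms(unsigned long jiffies, unsigned long hz)
{
    return jiffies * 1000UL / hz;
}
```

On a kernel built with a different HZ, the "interval > 5" filter would correspond to a different wall-clock threshold, so the filter value may need adjusting.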
     

02 Feb, 2013

1 commit

  • Commit 2147b1a6a48 resulted in a new smatch warning:

    > fs/ext4/move_extent.c:693 mext_replace_branches()
    > warn: variable dereferenced before check 'dext' (see line 683)

    Fix this by adding a check to make sure dext is non-NULL before we
    dereference it.

    Signed-off-by: Akira Fujita
    [ modified by tytso to make sure an ext4_error is called ]
    Signed-off-by: "Theodore Ts'o"

    Akira Fujita
     

29 Nov, 2012

1 commit

  • Previously, ext4_extents.h was being included at the end of ext4.h,
    which was bad for a number of reasons: (a) it was not being included
    in the expected place, and (b) it caused the header to be included
    multiple times. There were #ifdef's to prevent this from causing any
    problems, but it still was unnecessary.

    By moving the function declarations that were in ext4_extents.h to
    ext4.h, which is standard practice for where the function declarations
    for the rest of ext4.h can be found, we can remove ext4_extents.h from
    being included in ext4.h at all, and then we can only include
    ext4_extents.h where it is needed in ext4's source files.

    It should be possible to move a few more things into ext4.h, and
    further reduce the number of source files that need to #include
    ext4_extents.h, but that's a cleanup for another day.

    Reported-by: Sachin Kamat
    Reported-by: Wei Yongjun
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

29 Sep, 2012

1 commit

  • Inode block defrag and ext4_change_inode_journal_flag() may affect
    the result of nonlocked DIO reads, so proper synchronization is
    required.

    - Add missing inode_dio_wait() calls where appropriate
    - Check inode state under extra i_dio_count reference.

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov